Data | 0 articles | Tech News, Tutorials & Expert Insights

article-image-big-data-analysis-using-googles-pagerank

14 Dec 2017

8 min read

Getting started with big data analysis using Google's PageRank algorithm

14 Dec 2017

[box type="note" align="" class="" width=""]The article given below is a book excerpt from Java Data Analysis written by John R. Hubbard. Data analysis is a process of inspecting, cleansing, transforming, and modeling data with the aim of discovering useful information. Java is one of the most popular languages to perform your data analysis tasks. This book will help you learn the tools and techniques in Java to conduct data analysis without any hassle. [/box] This post aims to help you learn how to analyse big data using Google’s PageRank algorithm. The term big data generally refers to algorithms for the storage, retrieval, and analysis of massive datasets that are too large to be managed by a single file server. Commercially, these algorithms were pioneered by Google, Google’s PageRank being one of them is considered in this article. Google PageRank algorithm Within a few years of the birth of the web in 1990, there were over a dozen search engines that users could use to search for information. Shortly after it was introduced in 1995, AltaVista became the most popular among them. These search engines would categorize web pages according to the topics that the pages themselves specified. But the problem with these early search engines was that unscrupulous web page writers used deceptive techniques to attract traffic to their pages. For example, a local rug-cleaning service might list "pizza" as a topic in their web page header, just to attract people looking to order a pizza for dinner. These and other tricks rendered early search engines nearly useless. To overcome the problem, various page ranking systems were attempted. The objective was to rank a page based upon its popularity among users who really did want to view its contents. One way to estimate that is to count how many other pages have a link to that page. For example, there might be 100,000 links to https://en.wikipedia.org/wiki/Renaissance, but only 100 to https://en.wikipedia.org/wiki/Ernest_Renan, so the former would be given a much higher rank than the latter. But simply counting the links to a page will not work either. For example, the rug-cleaning service could simply create 100 bogus web pages, each containing a link to the page they want users to view. In 1996, Larry Page and Sergey Brin, while students at Stanford University, invented their PageRank algorithm. It simulates the web itself, represented by a very large directed graph, in which each web page is represented by a node in the graph, and each page link is represented by a directed edge in the graph. The directed graph shown in the figure below could represent a very small network with the same properties: This has four nodes, representing four web pages, A, B, C, and D. The arrows connecting them represent page links. So, for example, page A has a link to each of the other three pages, but page B has a link only to A. To analyze this tiny network, we first identify its transition matrix, M : This square has 16 entries, mij, for 1 ≤ i ≤ 4 and 1 ≤ j ≤ 4. If we assume that a web crawler always picks a link at random to move from one page to another, then mij, equals the probability that it will move to node i from node j, (numbering the nodes A, B, C, and D as 1, 2, 3, and 4). So m12 = 1 means that if it's at node B, there's a 100% chance that it will move next to A. Similarly, m13 = m43 = ½ means that if it's at node C, there's a 50% chance of it moving to A and a 50% chance of it moving to D. Suppose a web crawler picks one of those four pages at random, and then moves to another page, once a minute, picking each link at random. After several hours, what percentage of the time will it have spent at each of the four pages? Here is a similar question. Suppose there are 1,000 web crawlers who obey that transition matrix as we've just described, and that 250 of them start at each of the four pages. After several hours, how many will be on each of the four pages? This process is called a Markov chain. It is a mathematical model that has many applications in physics, chemistry, computer science, queueing theory, economics, and even finance. The diagram in the above figure is called the state diagram for the process, and the nodes of the graph are called the states of the process. Once the state diagram is given, the meaning of the nodes (web pages, in this case) becomes irrelevant. Only the structure of the diagram defines the transition matrix M, and from that we can answer the question. A more general Markov chain would also specify transition probabilities between the nodes, instead of assuming that all transition choices are made at random. In that case, those transition probabilities become the non-zero entries of the M. A Markov chain is called irreducible if it is possible to get to any state from any other state. According to the mathematical theory of Markov chains, if the chain is irreducible, then we can compute the answer to the preceding question using the transition matrix. What we want is the steady state solution; that is, a distribution of crawlers that doesn't change. The crawlers themselves will change, but the number at each node will remain the same. To calculate the steady state solution mathematically, we first have to realize how to apply the transition matrix M. The fact is that if x = (x1 , x2 , x3 , x4 ) is the distribution of crawlers at one minute, and the next minute the distribution is y = (y1 , y2 , y3 , y4 ), then y = Mx , using matrix multiplication. So now, if x is the steady state solution for the Markov chain, then Mx = x. This vector equation gives us four scalar equations in four unknowns: One of these equations is redundant (linearly dependent). But we also know that x1 + x2 + x3 + x4 = 1, since x is a probability vector. So, we're back to four equations in four unknowns. The solution is: The point of that example is to show that we can compute the steady state solution to a static Markov chain by solving an n × n matrix equation, where n is the number of states. By static here, we mean that the transition probabilities mij do not change. Of course, that does not mean that we can mathematically compute the web. In the first place, n > 30,000,000,000,000 nodes! And in the second place, the web is certainly not static. Nevertheless, this analysis does give some insight about the web; and it clearly influenced the thinking of Larry Page and Sergey Brin when they invented the PageRank algorithm. The purpose of the PageRank algorithm is to rank the web pages according to some criteria that would resemble their importance, or at least their frequency of access. The original simple (pre-PageRank) idea was to count the number of links to each page and use something proportional to that count for the rank. Following that line of thought, we can imagine that, if x = (x1 , x2 ,..., xn )T is the page rank for the web (that is, if xj is the relative rank of page j and ∑xj = 1), then Mx = x, at least approximately. Another way to put that is that repeated applications of M to x should nudge x closer and closer to that (unattainable) steady state. That brings us (finally) to the PageRank formula: where ε is a very small positive constant, z is the vector of all 1s, and n is the number of nodes. The vector expression on the right defines the transformation function f which replaces a page rank estimate x with an improved page rank estimate. Repeated applications of this function gradually converge to the unknown steady state. Note that in the formula, f is a function of more than just x. There are really four inputs: x, M, ε , and n. Of course, x is being updated, so it changes with each iteration. But M, ε , and n change too. M is the transition matrix, n is the number of nodes, and ε is a coefficient that determines how much influence the z/n vector has. For example, if we set ε to 0.00005, then the formula becomes: This is how Google's PageRank algorithm can be utilized for the analysis of very large datasets. To learn how to implement this algorithm and various other machine learning algorithms for big data, data visualization, and more using Java, check out this book Java Data Analysis.

0
0
32934

article-image-key-skills-for-data-professionals-to-learn-in-2020

Richard Gall

20 Dec 2019

6 min read

Key skills for data professionals to learn in 2020

Richard Gall

20 Dec 2019

6 min read

It’s easy to fall into the trap of thinking about your next job, or even the job after that. It’s far more useful, however, to think more about the skills you want and need to learn now. This will focus your mind and ensure that you don’t waste time learning things that simply aren’t helpful. It also means you can make use of the things you’re learning almost immediately. This will make you more productive and effective - and who knows, maybe it will make the pathway to your future that little bit clearer. So, to help you focus, here are some of the things you should focus on learning as a data professional. Reinforcement learning Reinforcement learning is one of the most exciting and cutting-edge areas of machine learning. Although the area itself is relatively broad, the concept itself is fundamentally about getting systems to ‘learn’ through a process of reward. Because reinforcement learning focuses on making the best possible decision at a given moment, it naturally finds many applications where decision making is important. This includes things like robotics, digital ad-bidding, configuring software systems, and even something as prosaic as traffic light control. Of course, the list of potential applications for reinforcement learning could be endless. To a certain extent, the real challenge with it is finding new use cases that are relevant to you. But to do that, you need to learn and master it - so make 2020 the year you do just that. Get to grips with reinforcement learning with Reinforcement Learning Algorithms with Python. Learn neural networks Neural networks are closely related to reinforcement learning - they’re essentially another element within machine learning. However, neural networks are even more closely aligned with what we think of as typical artificial intelligence. Indeed, even the name itself hints at the fact that these systems are supposed to in some way mimic the human brain. Like reinforcement learning, there are a number of different applications for neural networks. These include image and language processing, as well as forecasting. The complexity of relationships that can be figured inside neural networks systems is useful for handling data with many different variables and intricacies that would otherwise be difficult to capture. If you want to find out how artificial intelligence really works under the hood, make sure you learn neural networks in 2020. Learn how to build real-world neural networks projects with Neural Network Projects with Python. Meta-learning Metalearning is another area of machine learning. It’s designed to help engineers and analysts to use the right machine learning algorithms for specific problems - it’s particularly important in automatic machine learning, where removing human agency from the analytical process can lead to the wrong systems being used on data. Meta learning does this by being applied to metadata about machine learning projects. This metadata will include information about the data, such as algorithm features, performance measures, and patterns identified previously. Once meta learning algorithms have ‘learned’ from this data, they should, in theory, be well optimized to run on other sets of data. It has been said that meta learning is important in the move towards generalized artificial intelligence, or AGI (intelligence that is more akin to human intelligence). This is because getting machines to learn about learning allow systems to move between different problems - something that is incredibly difficult with even the most sophisticated neural networks. Whether it will actually get us any closer to AGI is certainly open to debate, but if you want to be a part of the cutting edge of AI development, getting stuck into meta learning is a good place to begin in 2020. Find out how meta learning works in Hands-on Meta Learning with Python. Learn a new programming language Python is now the undisputed language of data. But that’s far from the end of the story - R still remains relevant in the field, and there are even reasons to use other languages for machine learning. It might not be immediately obvious - especially if you’re content to use R or Python for analytics and algorithmic projects - but because machine learning is shifting into many different fields, from mobile development to cybersecurity, learning how other programming languages can be used to build machine learning algorithms could be incredibly valuable. From the perspective of your skill set, it gives you a level of flexibility that will not only help you to solve a wider range of problems, but also stand out from the crowd when it comes to the job market. The most obvious non-obvious languages to learn for machine learning practitioners and other data professionals are Java and Julia. But even new and emerging languages are finding their way into machine learning - Go and Swift, for example, could be interesting routes to explore, particularly if you’re thinking about machine learning in production software and systems. Find out how to use Go for machine learning with Go Machine Learning Projects. Learn new frameworks For data professionals there are probably few things more important than learning new frameworks. While it’s useful to become a polyglot, it’s nevertheless true that learning new frameworks and ecosystem tools are going to have a more immediate impact on your work. PyTorch and TensorFlow should almost certainly be on your list for 2020. But we’ve mentioned them a lot recently, so it’s probably worth highlighting other frameworks worth your focus: Pandas, for data wrangling and manipulation, Apache Kafka, for stream-processing, scikit-learn for machine learning, and Matplotlib for data visualization. The list could be much, much longer: however, the best way to approach learning a new framework is to start with your immediate problems. What’s causing issues? What would you like to be able to do but can’t? What would you like to be able to do faster? Explore TensorFlow eBooks and videos on the Packt store. Learn how to develop and communicate a strategy It’s easy to just roll your eyes when someone talks about how important ‘soft skills’ are for data professionals. Except it’s true - being able to strategize, communicate, and influence, are what mark you out as a great data pro rather than a merely competent one. The phrase ‘soft skills’ is often what puts people off - ironically, despite the name they’re often even more difficult to master than technical skill. This is because, of course, soft skills involve working with humans in all their complexity. However, while learning these sorts of skills can be tough, it doesn’t mean it's impossible. To a certain extent it largely just requires a level of self-awareness and reflexivity, as well as a sensitivity to wider business and organizational problems. A good way of doing this is to step back and think of how problems are defined, and how they relate to other parts of the business. Find out how to deliver impactful data science projects with Managing Data Science. If you can master these skills, you’ll undoubtedly be in a great place to push your career forward as the year continues.

0
0
32793

article-image-amazons-partnership-with-nhs-to-make-alexa-offer-medical-advice-raises-privacy-concerns-and-public-backlash

Bhagyashree R

12 Jul 2019

6 min read

Amazon’s partnership with NHS to make Alexa offer medical advice raises privacy concerns and public backlash

Bhagyashree R

12 Jul 2019

6 min read

Virtual assistants like Alexa and smart speakers are being increasingly used in today’s time because of the convenience they come packaged with. It is good to have someone play a song or restock your groceries just on your one command, or probably more than one command. You get the point! But, how comfortable will you be if these assistants can provide you some medical advice? Amazon has teamed up with UK’s National Health Service (NHS) to make Alexa your new medical consultant. The voice-enabled digital assistant will now answer your health-related queries by looking through the NHS website vetted by professional doctors. https://twitter.com/NHSX/status/1148890337504583680 The NHSX initiative to drive digital innovation in healthcare Voice search definitely gives us the most “humanized” way of finding information from the web. One of the striking advantages of voice-enabled digital assistants is that the elderly, the blind and those who are unable to access the internet in other ways can also benefit from them. UK’s health secretary, Matt Hancock, believes that “embracing” such technologies will not only reduce the pressure General Practitioners (GPs) and pharmacists face but will also encourage people to take better control of their health care. He adds, "We want to empower every patient to take better control of their healthcare." Partnering with Amazon is just one of many steps by NHS to adopt technology for healthcare. The NHS launched a full-fledged unit named NHSX (where X stands for User Experience) last week. Its mission is to provide staff and citizens “the technology they need” with an annual investment of more than $1 billion a year. This partnership was announced last year and NHS plans to partner with other companies such as Microsoft in the future to achieve its goal of “modernizing health services.” Can we consider Alexa’s advice safe Voice assistants are very fun and convenient to use, but only when they are actually working. Many a time it happens that the assistant fails to understand something and we have to yell the command again and again, which makes the experience outright frustrating. Furthermore, the track record of consulting the web to diagnose our symptoms has not been the most accurate one. Many Twitter users trolled this decision saying that Alexa is not yet capable of doing simple tasks like playing a song accurately and the NHS budget could have been instead used on additional NHS staff, lowering drug prices, and many other facilities. The public was also left sore because the government has given Amazon a new means to make a profit, instead of forcing them to pay taxes. Others also talked about the times when Google (mis)-diagnosed their symptoms. https://twitter.com/NHSMillion/status/1148883285952610304 https://twitter.com/doctor_oxford/status/1148857265946079232 https://twitter.com/TechnicallyRon/status/1148862592254906370 https://twitter.com/withorpe/status/1148886063290540032 AI ethicists and experts raise data privacy issues Amazon has been involved in several controversies around privacy concerns regarding Alexa. Earlier this month, it admitted that a few voice recordings made by Alexa are never deleted from the company's server, even when the user manually deletes them. Another news in April this year revealed that when you speak to an Echo smart speaker, not only does Alexa but potentially Amazon employees also listen to your requests. Last month, two lawsuits were filed in Seattle stating that Amazon is recording voiceprints of children using its Alexa devices without their consent. Last year, an Amazon Echo user in Portland, Oregon was shocked when she learned that her Echo device recorded a conversation with her husband and sent the audio file to one of his employees in Seattle. Amazon confirmed that this was an error because of which the device’s microphone misheard a series of words. Another creepy, yet funny incident was when Alexa users started hearing an unprompted laugh from their smart speaker devices. Alexa laughed randomly when the device was not even being used. https://twitter.com/CaptHandlebar/status/966838302224666624 Big tech including Amazon, Google, and Facebook constantly try to reassure their users that their data is safe and they have appropriate privacy measures in place. But, these promises are hard to believe when there is so many news of data breaches involving these companies. Last year, a German computer magazine c’t reported that a user received 1,700 Alexa voice recordings from Amazon when he asked for copies of the personal data Amazon has about him. Many experts also raised their concerns about using Alexa for giving medical advice. A Berlin-based tech expert Manthana Stender calls this move a “corporate capture of public institutions”. https://twitter.com/StenderWorld/status/1148893625914404864 Dr. David Wrigley, a British medical doctor who works as a general practitioner also asked how the voice recordings of people asking for health advice will be handled. https://twitter.com/DavidGWrigley/status/1148884541144219648 Director of Big Brother Watch, Silkie Carlo told BBC, "Any public money spent on this awful plan rather than frontline services would be a breathtaking waste. Healthcare is made inaccessible when trust and privacy is stripped away, and that's what this terrible plan would do. It's a data protection disaster waiting to happen." Prof Helen Stokes-Lampard, of the Royal College of GPs, believes that the move has "potential", especially for minor ailments. She added that it is important individuals do independent research to ensure the advice given is safe or it could "prevent people from seeking proper medical help and create even more pressure". She further said that not everyone is comfortable using such technology or could afford it. Amazon promises that the data will be kept confidential and will not be used to build a profile on customers. A spokesman shared with The Times, "All data was encrypted and kept confidential. Customers are in control of their voice history and can review or delete recordings." Amazon is being sued for recording children’s voices through Alexa without consent Amazon Alexa is HIPAA-compliant: bigger leap in the health care sector Amazon is supporting research into conversational AI with Alexa fellowships

0
0
32786

article-image-clustering-and-other-unsupervised-learning-methods

Packt

09 Jul 2015

19 min read

Clustering and Other Unsupervised Learning Methods

Packt

09 Jul 2015

19 min read

0
0
32729

article-image-building-a-microsoft-power-bi-data-model

Amarabha Banerjee

14 May 2018

11 min read

Building a Microsoft Power BI Data Model

Amarabha Banerjee

14 May 2018

11 min read

"The data model is what feeds and what powers Power BI." - Kasper de Jonge, Senior Program Manager, Microsoft Data models developed in Power BI Desktop are at the center of Power BI projects, as they expose the interface in support of data exploration and drive the analytical queries visualized in reports and dashboards. Well-designed data models leverage the data connectivity and transformation capabilities to provide an integrated view of distinct business processes and entities. Additionally, data models contain predefined calculations, hierarchies groupings, and metadata to greatly enhance both the analytical power of the dataset and its ease of use. The combination of, Building a Power BI data model, querying and modeling, serves as the foundation for the BI and analytical capabilities of Power BI. In this article, we explore how to design and develop robust data models. Common challenges in dimensional modeling are mapped to corresponding features and approaches in Power BI Desktop, including multiple grains and many-to-many relationships. Examples are also provided to embed business logic and definitions, develop analytical calculations with the DAX language, and configure metadata settings to increase the value and sustainability of models. [box type="note" align="" class="" width=""]Our article is an excerpt from the book Microsoft Power BI Cookbook, written by Brett Powell. This book contains powerful tutorials and techniques to help you with Data Analytics and visualization with Microsoft Power BI.[/box] Designing a multi fact data model Power BI Desktop lends itself to rapid, agile development in which significant value can be obtained quickly despite both imperfect data sources and an incomplete understanding of business requirements and use cases. However, rushing through the design phase can undermine the sustainability of the solution as future needs cannot be met without structural revisions to the model or complex workarounds. A balanced design phase in which fundamental decisions such as DirectQuery versus in-memory are analyzed while a limited prototype model is used to generate visualizations and business feedback can address both short- and long-term needs. This recipe describes a process for designing a multiple fact table data model and identifies some of the primary questions and factors to consider. Setting business expectations Everyone has seen impressive Power BI demonstrations and many business analysts have effectively used Power BI Desktop independently. These experiences may create an impression that integration, rich analytics, and collaboration can be delivered across many distinct systems and stakeholders very quickly or easily. It's important to reign in any unrealistic expectations and confirm feasibility. For example, Power BI Desktop is not an enterprise BI tool like SSIS or SSAS in terms of scalability, version control, features, and configurations. Power BI datasets cannot be incrementally refreshed like partitions in SSAS, and the current 1 GB file limit (after compression) places a hard limit on the amount of data a single model can store. Additionally, if multiple data sources are needed within the model, then DirectQuery models are not an option. Finally, it's critical to distinguish the data model as a platform supporting robust analysis of business processes, not an individual report or dashboard itself. Identify the top pain points and unanswered business questions in the current state. Contrast this input with an assessment of feasibility and complexity (for example, data quality and analytical needs) and Target realistic and sustainable deliverables. How to do it Dimensional modeling best practices and star schema designs are directly applicable to Power BI data models. Short, collaborative modeling sessions can be scheduled with subject matter experts and main stakeholders. With the design of the model in place, an informed decision of the model's data mode (Import or DirectQuery) can be made prior to Development. Four-step dimensional design process Choose the business process The number and nature of processes to include depends on the scale of the sources and scope of the project In this example, the chosen processes are Internet Sales, Reseller Sales and General Ledger Declare the granularity For each business process (or fact) to be modeled from step 1, define the meaning of each row: These should be clear, concise business definitions--each fact table should only contain one grain Consider scalability limitations with Power BI Desktop and balance the needs between detail and history (for example, greater history but lower granularity) Example: One Row per Sales Order Line, One Row per GL Account Balance per fiscal period Separate business processes, such as plan and sales should never be integrated into the same table. Likewise, a single fact table should not contain distinct processes such as shipping and receiving. Fact tables can be related to common dimensions but should never be related to each other in the data model (for example, PO Header and Line level). Identify the dimensions These entities should have a natural relationship with the business process or event at the given granularity Compare the dimension with any existing dimensions and hierarchies in the organization (for example, Store) If so, determine if there's a conflict or if additional columns are required Be aware of the query performance implications with large, high cardinality dimensions such as customer tables with over 2 million rows. It may be necessary to optimize this relationship in the model or the measures and queries that use this relationship. See Chapter 11, Enhancing and Optimizing Existing Power BI Solutions, for more details. Identify the facts These should align with the business processes being modeled: For example, the sum of a quantity or a unique count of a dimension Document the business and technical definition of the primary facts and compare this with any existing reports or metadata repository (for example, Net Sales = Extended Amount - Discounts). Given steps 1-3, you should be able to walk through top business questions and check whether the planned data model will support it. Example: "What was the variance between Sales and Plan for last month in Bikes?" Any clear gaps require modifying the earlier steps, removing the question from the scope of the data model, or a plan to address the issue with additional logic in the model (M or DAX). Focus only on the primary facts at this stage such as the individual source columns that comprise the cost facts. If the business definition or logic for core fact has multiple steps and conditions, check if the data model will naturally simplify it or if the logic can be developed in the data retrieval to avoid complex measures. Data warehouse and implementation bus matrix The Power BI model should preferably align with a corporate data architecture framework of standard facts and dimensions that can be shared across models. Though consumed into Power BI Desktop, existing data definitions and governance should be observed. Any new facts, dimensions, and measures developed with Power BI should supplement this architecture. Create a data warehouse bus matrix: A matrix of business processes (facts) and standard dimensions is a primary tool for designing and managing data models and communicating the overall BI architecture. In this example, the business processes selected for the model are Internet Sales, Reseller Sales, and General Ledger. Create an implementation bus matrix: An outcome of the model design process should include a more detailed implementation bus matrix. Clarity and approval of the grain of the fact tables, the definitions of the primary measures, and all dimensions gives confidence when entering the development phase. Power BI queries (M) and analysis logic (DAX) should not be considered a long-term substitute for issues with data quality, master data management, and the data warehouse. If it is necessary to move forward, document the "technical debts" incurred and consider long-term solutions such as Master Data Services (MDS). Choose the dataset storage mode - Import or DirectQuery With the logical design of a model in place, one of the top design questions is whether to implement this model with DirectQuery mode or with the default imported In-Memory mode. In-Memory mode The default in-memory mode is highly optimized for query performance and supports additional modeling and development flexibility with DAX functions. With compression, columnar storage, parallel query plans, and other techniques an import mode model is able to support a large amount of data (for example, 50M rows) and still perform well with complex analysis expressions. Multiple data sources can be accessed and integrated in a single data model and all DAX functions are supported for measures, columns, and role security. However, the import or refresh process must be scheduled and this is currently limited to eight refreshes per day for datasets in shared capacity (48X per day in premium capacity). As an alternative to scheduled refreshes in the Power BI service, REST APIs can be used to trigger a data refresh of a published dataset. For example, an HTTP request to a Power BI REST API calling for the refresh of a dataset can be added to the end of a nightly update or ETL process script such that published Power BI content remains aligned with the source systems. More importantly, it's not currently possible to perform an incremental refresh such as the Current Year rows of a table (for example, a table partition) or only the source rows that have changed. In-Memory mode models must maintain a file size smaller than the current limits (1 GB compressed currently, 10GB expected for Premium capacities by October 2017) and must also manage refresh schedules in the Power BI Service. Both incremental data refresh and larger dataset sizes are identified as planned capabilities of the Microsoft Power BI Premium Whitepaper (May 2017). DirectQuery mode A DirectQuery mode model provides the same semantic layer interface for users and contains the same metadata that drives model behaviors as In-Memory models. The performance of DirectQuery models, however, is dependent on the source system and how this data is presented to the model. By eliminating the import or refresh process, DirectQuery provides a means to expose reports and dashboards to source data as it changes. This also avoids the file size limit of import mode models. However, there are several limitations and restrictions to be aware of with DirectQuery: Only a single database from a single, supported data source can be used in a DirectQuery model. When deployed for widespread use, a high level of network traffic can be generated thus impacting performance. Power BI visualizations will need to query the source system, potentially via an on-premises data gateway. Some DAX functions cannot be used in calculated columns or with role security. Additionally, several common DAX functions are not optimized for DirectQuery performance. Many M query transformation functions cannot be used with DirectQuery. MDX client applications such as Excel are supported but less metadata (for example, hierarchies) is exposed. Given these limitations and the importance of a "speed of thought" user experience with Power BI, DirectQuery should generally only be used on centralized and smaller projects in which visibility to updates of the source data is essential. If a supported DirectQuery system (for example, Teradata or Oracle) is available, the performance of core measures and queries should be tested. Confirm referential integrity in the source database and use the Assume Referential Integrity relationship setting in DirectQuery mode models. This will generate more efficient inner join SQL queries against the source Database. How it works DAX formula and storage engine Power BI Datasets and SQL Server Analysis Services (SSAS) share the same database engine and architecture. Both tools support both Import and DirectQuery data models and both DAX and MDX client applications such as Power BI (DAX) and Excel (MDX). The DAX Query Engine is comprised of a formula and a storage engine for both Import and DirectQuery models. The formula engine produces query plans, requests data from the storage engine, and performs any remaining complex logic not supported by the storage engine against this data such as IF and SWITCH functions In DirectQuery models, the data source database is the storage engine--it receives SQL queries from the formula engine and returns the results to the formula engine. For In- Memory models, the imported and compressed columnar memory cache is the storage engine. We discussed about building data models using Microsoft power BI. If you liked our post, be sure to check out Microsoft Power BI Cookbook to gain more information on using Microsoft power BI for data analysis and visualization. Unlocking the secrets of Microsoft Power BI Microsoft spring updates for PowerBI and PowerApps How to build a live interactive visual dashboard in Power BI with Azure Stream

0
0
32451

article-image-6-popular-regression-techniques-must-know

Amey Varangaonkar

15 Feb 2018

8 min read

6 Popular Regression Techniques you must know

Amey Varangaonkar

15 Feb 2018

8 min read

[box type="note" align="" class="" width=""]The following excerpt is taken from the book Statistics for Data Science, authored by IBM expert James D. Miller. This book gives a statistical view of building smart data models to help you get unique insights from the data.[/box] In this article, we introduce you to the concept of regression analysis, one of the most popular machine learning algorithms. -You will learn what is regression analysis, the different types of regression, and how to choose the right regression technique to build your data model. What is Regression Analysis? For starters, regression analysis or statistical regression is a process for estimating the relationships among variables. This process encompasses numerous techniques for modeling and analyzing variables, focusing on the relationship between a dependent variable and one (or more) independent variable (or predictors). Regression analysis is the work done to identify and understand how the (best representative) value of a dependent variable (a variable that depends on other factors) changes when any one of the independent variables (a variable that stands alone and isn't changed by the other variables) is changed while the other independent variables stay the same. A simple example might be how the total dollars spent on marketing (an independent variable example) impacts the total sales dollars (a dependent variable example) over a period of time (is it really as simple as more marketing equates to higher sales?), or perhaps there is a correlation between the total marketing dollars spent (independent variable), discounting a products price (another independent variable), and the amount of sales (a dependent variable)? [box type="info" align="" class="" width=""]Keep in mind this key point that regression analysis is used to understand which among the independent variables are related to the dependent variable(s), not just the relationship of these variables. Also, the inference of causal relationships (between the independent and dependent variables) is an important objective. However, this can lead to illusions or false relationships, so caution is recommended![/box] Overall, regression analysis can be thought of as estimating the conditional expectations of the value of the dependent variable, given the independent variables being observed, that is, endeavoring to predict the average value of the dependent variable when the independent variables are set to certain values. I call this the lever affect—meaning when one increases or decreases a value of one component, it directly affects the value at least one other (variable). An alternate objective of the process of regression analysis is the establishment of location parameters or the quantile of a distribution. In other words, this idea is to determine values that may be a cutoff, dividing a range of a probability distribution values. You'll find that regression analysis can be a great tool for prediction and forecasting (not just complex machine learning applications). We'll explore some real-world examples later, but for now, let's us look at some techniques for the process. Popular regression techniques and approaches You'll find that various techniques for carrying out regression analysis have been developed and accepted.These are: Linear Logistic Polynomial Stepwise Ridge Lasso Linear regression Linear regression is the most basic type of regression and is commonly used for predictive analysis projects. In fact, when you are working with a single predictor (variable), we call it simple linear regression, and if there are multiple predictor variables, we call it multiple linear regression. Simply put, linear regression uses linear predictor functions whose values are estimated from the data in the model. Logistic regression Logistic regression is a regression model where the dependent variable is a categorical variable. This means that the variable only has two possible values, for example, pass/fail, win/lose, alive/dead, or healthy/sick. If the dependent variable has more than two possible values, one can use various modified logistic regression techniques, such as multinomial logistic regression, ordinal logistic regression, and so on. Polynomial regression When we speak of polynomial regression, the focus of this technique is on modeling the relationship between the independent variable and the dependent variable as an nth degree polynomial. Polynomial regression is considered to be a special case of multiple linear regressions. The predictors resulting from the polynomial expansion of the baseline predictors are known as interactive features. Stepwise regression Stepwise regression is a technique that uses some kind of automated procedure to continually execute a step of logic, that is, during each step, a variable is considered for addition to or subtraction from the set of independent variables based on some prespecified criterion. Ridge regression Often predictor variables are identified as being interrelated. When this occurs, the regression coefficient of any one variable depends on which other predictor variables are included in the model and which ones are left out. Ridge regression is a technique where a small bias factor is added to the selected variables in order to improve this situation. Therefore, ridge regression is actually considered a remedial measure to alleviate multicollinearity amongst predictor variables. Lasso regression Lasso (Least Absolute Shrinkage Selector Operator) regression is a technique where both predictor variable selection and regularization are performed in order to improve the prediction accuracy and interpretability of the result it produces. Which technique should I choose? In addition to the aforementioned regression techniques, there are numerous others to consider with, most likely, more to come. With so many options, it's important to choose the technique that is right for your data and your project. Rather than selecting the right regression approach, it is more about selecting the most effective regression approach. Typically, you use the data to identify the regression approach you'll use. You start by establishing statistics or a profile for your data. With this effort, you need to identify and understand the importance of the different variables, their relationships, coefficient signs, and their effect. Overall, here's some generally good advice for choosing the right regression approach from your project: Copy what others have done and had success with. Do the research. Incorporate the results of other projects into yours. Don't reinvent the wheel. Also, even if an observed approach doesn't quite fit as it was used, perhaps some simple adjustments would make it a good choice. Keep your approach as simple as possible. Many studies show that simpler models generally produce better predictions. Start simple, and only make the model more complex as needed. The more complex you make your model, the more likely it is that you are tailoring the model to your dataset specifically, and generalizability suffers. Check your work. As you evaluate methods, check the residual plots (more on this in the next section of this chapter) because they can help you avoid inadequate models and adjust your model for better results. Use your subject matter expertise. No statistical method can understand the underlying process or subject area the way you do. Your knowledge is a crucial part and, most likely, the most reliable way of determining the best regression approach for your project. Does it fit? After selecting a model that you feel is appropriate for use with your data (also known as determining that the approach is the best fit), you need to validate your selection, that is, determine its fit. A well-fitting regression model results in predicted values close to the observed data values. The mean model (which uses the mean for every predicted value) would generally be used if there were no informative predictor variables. The fit of a proposed regression model should, therefore, be better than the fit of the mean model. As a data scientist, you will need to scrutinize the coefficients of determination, measure the standard error of estimate, analyze the significance of regression parameters and confidence intervals. [box type="info" align="" class="" width=""]Remember that the better the fit of a regression model, most likely the better the precision in, or just better, the results.[/box] Finally, it has been proven that simple models produce more accurate results! Keep this in mind always when selecting an approach or a technique, and even when the problem might be complex, it is not always obligatory to adopt a complex regression approach. Choosing the right technique, though, goes a long way in developing an accurate model. If you found this excerpt useful, make sure to check out this book Statistics for Data Science for tips on building effective data models by leveraging the power of the statistical tools and techniques.

0
0
32430

article-image-sizing-configuring-hadoop-cluster

Oli Huggins

16 Feb 2014

10 min read

Sizing and Configuring your Hadoop Cluster

Oli Huggins

16 Feb 2014

10 min read

This article, written by Khaled Tannir, the author of Optimizing Hadoop for MapReduce, discusses two of the most important aspects to consider while optimizing Hadoop for MapReduce: sizing and configuring the Hadoop cluster correctly. Sizing your Hadoop cluster Hadoop's performance depends on multiple factors based on well-configured software layers and well-dimensioned hardware resources that utilize its CPU, Memory, hard drive (storage I/O) and network bandwidth efficiently. Planning the Hadoop cluster remains a complex task that requires a minimum knowledge of the Hadoop architecture and may be out the scope of this book. This is what we are trying to make clearer in this section by providing explanations and formulas in order to help you to best estimate your needs. We will introduce a basic guideline that will help you to make your decision while sizing your cluster and answer some How to plan questions about cluster's needs such as the following: How to plan my storage? How to plan my CPU? How to plan my memory? How to plan the network bandwidth? While sizing your Hadoop cluster, you should also consider the data volume that the final users will process on the cluster. The answer to this question will lead you to determine how many machines (nodes) you need in your cluster to process the input data efficiently and determine the disk/memory capacity of each one. Hadoop is a Master/Slave architecture and needs a lot of memory and CPU bound. It has two main components: JobTracker: This is the critical component in this architecture and monitors jobs that are running on the cluster TaskTracker: This runs tasks on each node of the cluster To work efficiently, HDFS must have high throughput hard drives with an underlying filesystem that supports the HDFS read and write pattern (large block). This pattern defines one big read (or write) at a time with a block size of 64 MB, 128 MB, up to 256 MB. Also, the network layer should be fast enough to cope with intermediate data transfer and block. HDFS is itself based on a Master/Slave architecture with two main components: the NameNode / Secondary NameNode and DataNode components. These are critical components and need a lot of memory to store the file's meta information such as attributes and file localization, directory structure, names, and to process data. The NameNode component ensures that data blocks are properly replicated in the cluster. The second component, the DataNode component, manages the state of an HDFS node and interacts with its data blocks. It requires a lot of I/O for processing and data transfer. Typically, the MapReduce layer has two main prerequisites: input datasets must be large enough to fill a data block and split in smaller and independent data chunks (for example, a 10 GB text file can be split into 40,960 blocks of 256 MB each, and each line of text in any data block can be processed independently). The second prerequisite is that it should consider the data locality, which means that the MapReduce code is moved where the data lies, not the opposite (it is more efficient to move a few megabytes of code to be close to the data to be processed, than moving many data blocks over the network or the disk). This involves having a distributed storage system that exposes data locality and allows the execution of code on any storage node. Concerning the network bandwidth, it is used at two instances: during the replication process and following a file write, and during the balancing of the replication factor when a node fails. The most common practice to size a Hadoop cluster is sizing the cluster based on the amount of storage required. The more data into the system, the more will be the machines required. Each time you add a new node to the cluster, you get more computing resources in addition to the new storage capacity. Let's consider an example cluster growth plan based on storage and learn how to determine the storage needed, the amount of memory, and the number of DataNodes in the cluster. Daily data input 100 GB Storage space used by daily data input = daily data input * replication factor = 300 GB HDFS replication factor 3 Monthly growth 5% Monthly volume = (300 * 30) + 5% = 9450 GB After one year = 9450 * (1 + 0.05)^12 = 16971 GB Intermediate MapReduce data 25% Dedicated space = HDD size * (1 - Non HDFS reserved space per disk / 100 + Intermediate MapReduce data / 100) = 4 * (1 - (0.25 + 0.30)) = 1.8 TB (which is the node capacity) Non HDFS reserved space per disk 30% Size of a hard drive disk 4 TB Number of DataNodes needed to process: Whole first month data = 9.450 / 1800 ~= 6 nodes The 12th month data = 16.971/ 1800 ~= 10 nodes Whole year data = 157.938 / 1800 ~= 88 nodes Do not use RAID array disks on a DataNode. HDFS provides its own replication mechanism. It is also important to note that for every disk, 30 percent of its capacity should be reserved to non-HDFS use. It is easy to determine the memory needed for both NameNode and Secondary NameNode. The memory needed by NameNode to manage the HDFS cluster metadata in memory and the memory needed for the OS must be added together. Typically, the memory needed by Secondary NameNode should be identical to NameNode. Then you can apply the following formulas to determine the memory amount: NameNode memory 2 GB - 4 GB Memory amount = HDFS cluster management memory + NameNode memory + OS memory Secondary NameNode memory 2 GB - 4 GB OS memory 4 GB - 8 GB HDFS memory 2 GB - 8 GB At least NameNode (Secondary NameNode) memory = 2 + 2 + 4 = 8 GB It is also easy to determine the DataNode memory amount. But this time, the memory amount depends on the physical CPU's core number installed on each DataNode. DataNode process memory 4 GB - 8 GB Memory amount = Memory per CPU core * number of CPU's core + DataNode process memory + DataNode TaskTracker memory + OS memory DataNode TaskTracker memory 4 GB - 8 GB OS memory 4 GB - 8 GB CPU's core number 4+ Memory per CPU core 4 GB - 8 GB At least DataNode memory = 4*4 + 4 + 4 + 4 = 28 GB Regarding how to determine the CPU and the network bandwidth, we suggest using the now-a-days multicore CPUs with at least four physical cores per CPU. The more physical CPU's cores you have, the more you will be able to enhance your job's performance (according to all rules discussed to avoid underutilization or overutilization). For the network switches, we recommend to use equipment having a high throughput (such as 10 GB) Ethernet intra rack with N x 10 GB Ethernet inter rack. Configuring your cluster correctly To run Hadoop and get a maximum performance, it needs to be configured correctly. But the question is how to do that. Well, based on our experiences, we can say that there is not one single answer to this question. The experiences gave us a clear indication that the Hadoop framework should be adapted for the cluster it is running on and sometimes also to the job. In order to configure your cluster correctly, we recommend running a Hadoop job(s) the first time with its default configuration to get a baseline. Then, you will check the resource's weakness (if it exists) by analyzing the job history logfiles and report the results (measured time it took to run the jobs). After that, iteratively, you will tune your Hadoop configuration and re-run the job until you get the configuration that fits your business needs. The number of mappers and reducer tasks that a job should use is important. Picking the right amount of tasks for a job can have a huge impact on Hadoop's performance. The number of reducer tasks should be less than the number of mapper tasks. Google reports one reducer for 20 mappers; the others give different guidelines. This is because mapper tasks often process a lot of data, and the result of those tasks are passed to the reducer tasks. Often, a reducer task is just an aggregate function that processes a minor portion of the data compared to the mapper tasks. Also, the correct number of reducers must also be considered. The number of mappers and reducers is related to the number of physical cores on the DataNode, which determines the maximum number of jobs that can run in parallel on DataNode. In a Hadoop cluster, master nodes typically consist of machines where one machine is designed as a NameNode, and another as a JobTracker, while all other machines in the cluster are slave nodes that act as DataNodes and TaskTrackers. When starting the cluster, you begin starting the HDFS daemons on the master node and DataNode daemons on all data nodes machines. Then, you start the MapReduce daemons: JobTracker on the master node and the TaskTracker daemons on all slave nodes. The following diagram shows the Hadoop daemon's pseudo formula: When configuring your cluster, you need to consider the CPU cores and memory resources that need to be allocated to these daemons. In a huge data context, it is recommended to reserve 2 CPU cores on each DataNode for the HDFS and MapReduce daemons. While in a small and medium data context, you can reserve only one CPU core on each DataNode. Once you have determined the maximum mapper's slot numbers, you need to determine the reducer's maximum slot numbers. Based on our experience, there is a distribution between the Map and Reduce tasks on DataNodes that give good performance result to define the reducer's slot numbers the same as the mapper's slot numbers or at least equal to two-third mapper slots. Let's learn to correctly configure the number of mappers and reducers and assume the following cluster examples: Cluster machine Nb Medium data size Large data size DataNode CPU cores 8 Reserve 1 CPU core Reserve 2 CPU cores DataNode TaskTracker daemon 1 1 1 DataNode HDFS daemon 1 1 1 Data block size 128 MB 256 MB DataNode CPU % utilization 95% to 120% 95% to 150% Cluster nodes 20 40 Replication factor 2 3 We want to use the CPU resources at least 95 percent, and due to Hyper-Threading, one CPU core might process more than one job at a time, so we can set the Hyper-Threading factor range between 120 percent and 170 percent. Maximum mapper's slot numbers on one node in a large data context = number of physical cores - reserved core * (0.95 -> 1.5) Reserved core = 1 for TaskTracker + 1 for HDFS Let's say the CPU on the node will use up to 120% (with Hyper-Threading) Maximum number of mapper slots = (8 - 2) * 1.2 = 7.2 rounded down to 7 Let's apply the 2/3 mappers/reducers technique: Maximum number of reducers slots = 7 * 2/3 = 5 Let's define the number of slots for the cluster: Mapper's slots: = 7 * 40 = 280 Reducer's slots: = 5 * 40 = 200 The block size is also used to enhance performance. The default Hadoop configuration uses 64 MB blocks, while we suggest using 128 MB in your configuration for a medium data context as well and 256 MB for a very large data context. This means that a mapper task can process one data block (for example, 128 MB) by only opening one block. In the default Hadoop configuration (set to 2 by default), two mapper tasks are needed to process the same amount of data. This may be considered as a drawback because initializing one more mapper task and opening one more file takes more time. Summary In this article, we learned about sizing and configuring the Hadoop cluster for optimizing it for MapReduce. Resources for Article: Further resources on this subject: Hadoop Tech Page Hadoop and HDInsight in a Heartbeat Securing the Hadoop Ecosystem Advanced Hadoop MapReduce Administration

0
3
32350

article-image-katie-bouman-unveils-the-first-ever-black-hole-image-with-her-brilliant-algorithm

Amrata Joshi

11 Apr 2019

11 min read

Katie Bouman unveils the first ever black hole image with her brilliant algorithm

Amrata Joshi

11 Apr 2019

11 min read

Remember how we got to see the supermassive black hole in the movie Interstellar? Well, that wasn’t for real. We know that black holes end up sucking everything that’s too close to it, even light for that matter. Black hole’s event horizon cast a shadow and that shadow is enough for answering a lot of questions attached to black hole theory. And scientists and researchers have been working towards it since years to get that one image to give an angle to their research. And finally comes the biggest news that a team of astronomers, engineers, researchers and scientists have managed to capture the first ever image of a black hole, which is located in a distant galaxy. It is three million times the size of the Earth and it measures 40 billion Km across. The team describes it as "a monster" and was photographed by a network of eight telescopes across the world. In this article, we give you a glimpse of how did the image of the black hole got captured? Katie Bouman, a PhD student at MIT appeared at TED Talks and discussed the efforts taken by the team of researchers, engineers, astronomers and scientists to capture the first ever image of the black hole. Katie is a part of an international team of astronomers who worked for creating the world’s largest telescope, Event Horizon Telescope to click the first ever picture of the black hole. She led the development of a computer programme that made this impossible, possible! She started working on the algorithm three years ago while she was a graduate student. https://twitter.com/jenzhuscott/status/1115987618464968705 Katie wrote in the caption to one of the Facebook post, "Watching in disbelief as the first image I ever made of a black hole was in the process of being reconstructed." https://twitter.com/MIT_CSAIL/status/1116035007406116864 Further, she explains how the stars we see in the sky basically orbit an invisible object. And according to the astronomers, the only thing that can cause this motion of the stars is a supermassive black hole. Zooming in at radio wavelengths to see a ring of light “Well, it turns out that if we were to zoom in at radio wavelengths, we'd expect to see a ring of light caused by the gravitational lensing of hot plasma zipping around the black hole. Is it possible to see something that, by definition, is impossible to see? ” -Katie Bouman If we closely look at it, we can see that the black hole casts a shadow on the backdrop of bright material that carves out a sphere of darkness. It is a bright ring that reveals the black hole's event horizon, where the gravitational pull becomes so powerful that even light can’t escape. Einstein's equations have predicted the size and shape of this ring and taking a picture of it would help to verify that these equations hold in the extreme conditions around the black hole. Capturing black hole needs a telescope the size of the Earth “So how big of a telescope do we need in order to see an orange on the surface of the moon and, by extension, our black hole? Well, it turns out that by crunching the numbers, you can easily calculate that we would need a telescope the size of the entire Earth.” -Katie Bouman Bouman further explains that black hole is so far away from Earth that this ring appears incredibly small, as small as an orange on the surface of the moon. And this makes it difficult to capture the photo of the black hole. There are fundamental limits to the smallest objects that we can see because of diffraction. So the astronomers realized that they need to make their telescope bigger and bigger. Even the most powerful optical telescopes couldn’t get close to the resolution necessary to image on the surface of the moon. She showed one of the highest resolution images ever taken of the moon from Earth to the audience which contained around 13,000 pixels, and each pixel contained over 1.5 million oranges. Capturing the black hole turned into reality by connecting telescopes “And so, my role in helping to take the first image of a black hole is to design algorithms that find the most reasonable image that also fits the telescope measurements.” -Katie Bouman According to Bouman, we would require a telescope as big as earth’s size to see an orange on the surface of the moon. Capturing a black hole seemed to be imaginary back then as it was nearly impossible to have a powerful telescope. Bouman highlighted the famous words of Mick Jagger, "You can't always get what you want, but if you try sometimes, you just might find you get what you need." Capturing the black hole turned into a reality by connecting telescopes from around the world. Event Horizon Telescope, an international collaboration created a computational telescope the size of the Earth which was capable of resolving structure on the scale of a black hole's event horizon. The setup was such that each telescope in the worldwide network worked together. The researcher teams at each of the sites collected thousands of terabytes of data. This data then processed in a lab in Massachusetts. Let’s understand this in depth by assuming that we can build an Earth sized telescope! Further imagining that Earth is a spinning disco ball and each of the mirror of the ball can collect light that can be combined together to form a picture. If most of those mirrors are removed then a few will remain. In this case, it is still possible to combine this information together, but now there will be a lot of holes. The remaining mirrors represent the locations where these telescopes have been setup. Though this seems like a small number of measurements to make a picture from but it is effective. The light gets collected at a few telescope locations but as the Earth rotates, other new measurements also get explored. So, as the disco ball spins, the mirrors change locations and the astronomers get to observe different parts of the image. The imaging algorithms developed by the experts, scientists and researchers fill in the missing gaps of the disco ball in order to reconstruct the underlying black hole image. Katie Bouman said, “If we had telescopes located everywhere on the globe -- in other words, the entire disco ball -- this would be trivial. However, we only see a few samples, and for that reason, there are an infinite number of possible images that are perfectly consistent with our telescope measurements.” According to Bouman, not all the images are created equal. So some of those images look more like what the astronomers, scientists and researchers think of as images as compared to others. Bouman’s role in helping to take the first image of the black hole was to design the algorithms that find the most relevant or reasonable image that fits the telescope measurements. The imaging algorithms developed by Katie used the limited telescope data to guide the astronomers to a picture. With the help of these algorithms, it was possible to bring together the pieces of pictures from the sparse and noisy data. How was the algorithm used in creation of the black hole image “I'd like to encourage all of you to go out and help push the boundaries of science, even if it may at first seem as mysterious to you as a black hole.” -Katie Bouman There is an infinite number of possible images that perfectly explain the telescope measurements and the astronomers and researchers have to choose between them. This is possible by ranking the images based upon how likely they are to be the black hole image and further selecting the one that's most likely. Bouman explained it with the help of an example, “Let's say we were trying to make a model that told us how likely an image were to appear on Facebook. We'd probably want the model to say it's pretty unlikely that someone would post this noise image on the left, and pretty likely that someone would post a selfie like this one on the right. The image in the middle is blurry, so even though it's more likely we'd see it on Facebook compared to the noise image, it's probably less likely we'd see it compared to the selfie.” While talking about the images from the black hole, according to Katie it gets confusing for the astronomers and researchers as they have never seen a black hole before. She further explained how difficult it is to rely on any of the previous theories for these images. It is even difficult to completely rely on the images of the simulations for comparison. She said, “What is a likely black hole image, and what should we assume about the structure of black holes? We could try to use images from simulations we've done, like the image of the black hole from "Interstellar," but if we did this, it could cause some serious problems. What would happen if Einstein's theories didn't hold? We'd still want to reconstruct an accurate picture of what was going on. If we bake Einstein's equations too much into our algorithms, we'll just end up seeing what we expect to see. In other words, we want to leave the option open for there being a giant elephant at the center of our galaxy.” According to Bouman, different types of images have distinct features, so it is quite possible to identify the difference between black hole simulation images and images captured by the team. So the researchers need to let the algorithms know what images look like without imposing one type of image features. And this can be done by imposing the features of different kinds of images and then looking at how the image type we assumed affects the reconstruction of the final image. The researchers and astronomers become more confident about their image assumptions if the images' types produce a very similar-looking image. She said, “This is a little bit like giving the same description to three different sketch artists from all around the world. If they all produce a very similar-looking face, then we can start to become confident that they're not imposing their own cultural biases on the drawings.” It is possible to impose different image features by using pieces of existing images. So the astronomers and researchers took a large collection of images and broke them down into little image patches. And then they treated each image patch like piece of a puzzle. They use commonly seen puzzle pieces to piece together an image that also fits in their telescope measurements. She said, “Let's first start with black hole image simulation puzzle pieces. OK, this looks reasonable. This looks like what we expect a black hole to look like. But did we just get it because we just fed it little pieces of black hole simulation images?” If we take a set of puzzle pieces from everyday images, like the ones we take with our own personal camera then we get the same image from all different sets of puzzle pieces. And we then become more confident that the image assumptions made by us aren't biasing the final image. According to Bouman, another thing that can be done is take the same set of puzzle pieces like the ones derived from everyday images and then use them to reconstruct different kinds of source images. Bouman said, “So in our simulations, we pretend a black hole looks like astronomical non-black hole objects, as well as everyday images like the elephant in the center of our galaxy.” And when the results of the algorithms look very similar to the simulated image then researchers and astronomers become more confident about their algorithms. She emphasized that all of these pictures were created by piecing together little pieces of everyday photographs, like the ones we take with own personal camera. So an image of a black hole which we have never seen before can be created by piecing together pictures we see regularly like images of people, buildings, trees, cats and dogs. She concluded by appreciating the efforts taken by her team, “But of course, getting imaging ideas like this working would never have been possible without the amazing team of researchers that I have the privilege to work with. It still amazes me that although I began this project with no background in astrophysics. But big projects like the Event Horizon Telescope are successful due to all the interdisciplinary expertise different people bring to the table.” This project will surely encourage many researchers, engineers, astronomers and students who are under dark and not confident of themselves but have the potential to make the impossible, possible. https://twitter.com/fchollet/status/1116294486856851459 Is the YouTube algorithm’s promoting of #AlternativeFacts like Flat Earth having a real-world impact? YouTube disables all comments on videos featuring children in an attempt to curb predatory behavior and appease advertisers Using Genetic Algorithms for optimizing your models [Tutorial]

0
0
32218

article-image-tensorflow-2-0-released-tighter-keras-integration-eager-execution-enabled-by-default

Bhagyashree R

03 Oct 2019

5 min read

TensorFlow 2.0 released with tighter Keras integration, eager execution enabled by default, and more!

Bhagyashree R

03 Oct 2019

5 min read

After releasing the beta version of TensorFlow 2.0 in June, Google announced its final release on Monday. This release comes with tighter integration with Keras, eager execution enabled by default, promises three times faster training performance, a cleaned-up API, and more. Key updates in TensorFlow 2.0 Tighter Keras integration for better developer productivity One of the important updates in TensorFlow 2.0 is its tighter integration with Keras, a popular high-level API used for easy and fast prototyping, building, and training deep learning models. This will enable developers to easily leverage its various model-building APIs including Sequential, Functional, and Subclassing. Explaining the motivation behind this change, the TensorFlow team wrote, “By establishing Keras as the high-level API for TensorFlow, we are making it easier for developers new to machine learning to get started with TensorFlow. A single high-level API reduces confusion and enables us to focus on providing advanced capabilities for researchers.” Eager execution enabled by default In TensorFlow 1.x, developers were required to define an abstract data structure named Graph and to run this graph they needed an encapsulation called Session. TensorFlow 2.0 has eager execution enabled by default to “eagerly” run code, similar to normal Python code. Eager execution enables fast iteration and intuitive debugging without building a graph. It also makes creating and experimenting with models using TensorFlow much easier. It can be especially useful when using the tf.keras model subclassing API. Also Read: Keras 2.3.0, the first release of multi-backend Keras with TensorFlow 2.0 support is now out Distribution Strategy API The Distribution Strategy API in TensorFlow 2.0 allows machine learning researchers to distribute training across a wide variety of compute configurations. This will allow them to “attain great out-of-the-box performance” with minimal code changes. This release also allows distributed training with Keras’ model.fit and custom training loops. Performance improvements on GPUs TensorFlow 2.0 includes multi-GPU support and experimental support for multi worker and Cloud TPUs. This release also has a number of performance improvements on GPUs. It promises three times faster training performance when using mixed precision on NVIDIA’s Volta and Turing GPUs. It includes tight integration with NVIDIA TensorRT, a platform for high-performance deep learning inference. The standardized SavedModel file format The SavedModel API allows you to save your trained ML model into a language-neutral format. With TensorFlow 2.0, all TensorFlow ecosystem projects including TensorFlow Lite, TensorFlow JS, TensorFlow Serving, and TensorFlow Hub, support SavedModels. Standardizing the SavedModel file format will enable developers to run their models on a variety of runtimes including the cloud, web, browser, Node.js, mobile, and embedded systems. “This allows you to run your models with TensorFlow, deploy them with TensorFlow Serving, use them on mobile and embedded systems with TensorFlow Lite, and train and run in the browser or Node.js with TensorFlow.js,” the team writes. API simplification TensorFlow 2.0 includes a number of API updates. Many API symbols are removed or renamed for better consistency and clarity. Also, the tf.app, tf.flags, and tf.logging API are removed in favor of abseil-py. Because of the huge number of API changes, developers in a discussion on Hacker News expressed that transitioning from TensorFlow 1.X to TensorFlow 2.0 is quite complicated. Some also mentioned switching to PyTorch instead. A user commented, “As someone who uses TensorFlow a lot, I predict an enormous clusterfuck of a transition. Tensorflow has turned into a multiheaded monster, supporting many things and approaches but none of them very well...In my opinion, there are some architectural problems with TF, which have not been addressed in this update...If you need to transition from TF1 to TF2, consider doing the TF1 to PyTorch transition instead.” While some others were happy with the recommended Keras API and eager execution. “I don't know if I'm the only one, but I actually love the changes they've made since v1. Eager execution and tf.function are fantastic, and the built-in Keras is even better than the standalone version. A big improvement compared to TF from last year,” a user commented on Reddit. Another user added, “The most important change in terms of usability, IMO, is the use of tf.keras as the recommended interface to TensorFlow. There hasn't been a case yet where I've needed to dip outside of Keras into raw TensorFlow, but the option is there and is easy to do. That said, TF 2.0 changes a lot. Many repos might break, so expect to see lots of tensorflow==1.14 in requirement.txt files from now on.” These were some of the updates in TensorFlow 2.0. Check out the official announcement and release notes to know more in detail. Transformers 2.0: NLP library with deep interoperability between TensorFlow 2.0 and PyTorch, and 32+ pretrained models in 100+ languages TensorFlow 2.0 to be released soon with eager execution, removal of redundant APIs, tf function and more Introducing TensorFlow Graphics packed with TensorBoard 3D, object transformations, and much more Train a convolutional neural network in Keras and improve it with data augmentation [Tutorial] Train a convolutional neural network in Keras and improve it with data augmentation [Tutorial]

0
0
32074

article-image-data-science-and-machine-learning-what-to-learn-in-2020

Richard Gall

19 Dec 2019

5 min read

Data science and machine learning: what to learn in 2020

Richard Gall

19 Dec 2019

5 min read

It’s hard to keep up with the pace of change in the data science and machine learning fields. And when you’re under pressure to deliver projects, learning new skills and technologies might be the last thing on your mind. But if you don’t have at least one eye on what you need to learn next you run the risk of falling behind. In turn this means you miss out on new solutions and new opportunities to drive change: you might miss the chance to do things differently. That’s why we want to make it easy for you with this quick list of what you need to watch out for and learn in 2020. The growing TensorFlow ecosystem TensorFlow remains the most popular deep learning framework in the world. With TensorFlow 2.0 the Google-based development team behind it have attempted to rectify a number of issues and improve overall performance. Most notably, some of the problems around usability have been addressed, which should help the project’s continued growth and perhaps even lower the barrier to entry. Relatedly TensorFlow.js is proving that the wider TensorFlow ecosystem is incredibly healthy. It will be interesting to see what projects emerge in 2020 - it might even bring JavaScript web developers into the machine learning fold. Explore Packt's huge range of TensorFlow eBooks and videos on the store. PyTorch PyTorch hasn’t quite managed to topple TensorFlow from its perch, but it’s nevertheless growing quickly. Easier to use and more accessible than TensorFlow, if you want to start building deep learning systems quickly your best bet is probably to get started on PyTorch. Search PyTorch eBooks and videos on the Packt store. End-to-end data analysis on the cloud When it comes to data analysis, one of the most pressing issues is to speed up pipelines. This is, of course, notoriously difficult - even in organizations that do their best to be agile and fast, it’s not uncommon to find that their data is fragmented and diffuse, with little alignment across teams. One of the opportunities for changing this is cloud. When used effectively cloud platforms can dramatically speed up analytics pipelines and make it much easier for data scientists and analysts to deliver insights quickly. This might mean that we need increased collaboration between data professionals, engineers, and architects, but if we’re to really deliver on the data at our disposal, then this shift could be massive. Learn how to perform analytics on the cloud with Cloud Analytics with Microsoft Azure. Data science strategy and leadership While cloud might help to smooth some of the friction that exists in our organizations when it comes to data analytics, there’s no substitute for strong and clear leadership. The split between the engineering side of data and the more scientific or interpretive aspect has been noted, which means that there is going to be a real demand for people that have a strong understanding of what data can do, what it shows, and what it means in terms of action. Indeed, the article just linked to also mentions that there is likely to be an increasing need for executive level understanding. That means data scientists have the opportunity to take a more senior role inside their organizations, by either working closely with execs or even moving up to that level. Learn how to build and manage a data science team and initiative that delivers with Managing Data Science. Going back to the algorithms In the excitement about the opportunities of machine learning and artificial intelligence, it’s possible that we’ve lost sight of some of the fundamentals: the algorithms. Indeed, given the conversation around algorithmic bias, and unintended consequences it certainly makes sense to place renewed attention on the algorithms that lie right at the center of our work. Even if you’re not an experienced data analyst or data scientist, if you’re a beginner it’s just as important to dive deep into algorithms. This will give you a robust foundation for everything else you do. And while statistics and mathematics will feel a long way from the supposed sexiness of data science, carefully considering what role they play will ensure that the models you build are accurate and perform as they should. Get stuck into algorithms with Data Science Algorithms in a Week. Computer vision and natural language processing Computer vision and Natural Language Processing are two of the most exciting aspects of modern machine learning and artificial intelligence. Both can be used for analytics projects, but they also have applications in real world digital products. Indeed, with augmented reality and conversational UI becoming more and more common, businesses need to be thinking very carefully about whether this could give them an edge in how they interact with customers. These sorts of innovations can be driven from many different departments - but technologists and data professionals should be seizing the opportunity to lead the way on how innovation can transform customer relationships. For more technology eBooks and videos to help you prepare for 2020, head to the Packt store.

0
0
31900

article-image-crud-create-read-update-delete-operations-elasticsearch

Pravin Dhandre

19 Feb 2018

5 min read

CRUD (Create Read, Update and Delete) Operations with Elasticsearch

Pravin Dhandre

19 Feb 2018

5 min read

[box type="note" align="" class="" width=""]This article is an excerpt from a book written by Pranav Shukla and Sharath Kumar M N titled Learning Elastic Stack 6.0. This book is for beginners who want to start performing distributed search analytics and visualization using core functionalities of Elasticsearch, Kibana and Logstash.[/box] In this tutorial, we will look at how to perform basic CRUD operations using Elasticsearch. Elasticsearch has a very well designed REST API, and the CRUD operations are targeted at documents. To understand how to perform CRUD operations, we will cover the following APIs. These APIs fall under the category of Document APIs that deal with documents: Index API Get API Update API Delete API Index API In Elasticsearch terminology, adding (or creating) a document into a type within an index of Elasticsearch is called an indexing operation. Essentially, it involves adding the document to the index by parsing all fields within the document and building the inverted index. This is why this operation is known as an indexing operation. There are two ways we can index a document: Indexing a document by providing an ID Indexing a document without providing an ID Indexing a document by providing an ID We have already seen this version of the indexing operation. The user can provide the ID of the document using the PUT method. The format of this request is PUT /<index>/<type>/<id>, with the JSON document as the body of the request: PUT /catalog/product/1 { "sku": "SP000001", "title": "Elasticsearch for Hadoop", "description": "Elasticsearch for Hadoop", "author": "Vishal Shukla", "ISBN": "1785288997", "price": 26.99 } Indexing a document without providing an ID If you don't want to control the ID generation for the documents, you can use the POST method. The format of this request is POST /<index>/<type>, with the JSON document as the body of the request: POST /catalog/product { "sku": "SP000003", "title": "Mastering Elasticsearch", "description": "Mastering Elasticsearch", "author": "Bharvi Dixit", "price": 54.99 } The ID in this case will be generated by Elasticsearch. It is a hash string, as highlighted in the response: { "_index": "catalog", "_type": "product", "_id": "AVrASKqgaBGmnAMj1SBe", "_version": 1, "result": "created", "_shards": { "total": 2, "successful": 1, "failed": 0 }, "created": true } As per pure REST conventions, POST is used for creating a new resource and PUT is used for updating an existing resource. Here, the usage of PUT is equivalent to saying I know the ID that I want to assign, so use this ID while indexing this document. Get API The Get API is useful for retrieving a document when you already know the ID of the document. It is essentially a get by primary key operation: GET /catalog/product/AVrASKqgaBGmnAMj1SBe The format of this request is GET /<index>/<type>/<id>. The response would be as Expected: { "_index": "catalog", "_type": "product", "_id": "AVrASKqgaBGmnAMj1SBe", "_version": 1, "found": true, "_source": { "sku": "SP000003", "title": "Mastering Elasticsearch", "description": "Mastering Elasticsearch", "author": "Bharvi Dixit", "price": 54.99 } } Update API The Update API is useful for updating the existing document by ID. The format of an update request is POST <index>/<type>/<id>/_update with a JSON request as the body: POST /catalog/product/1/_update { "doc": { "price": "28.99" } } The properties specified under the "doc" element are merged into the existing document. The previous version of this document with ID 1 had price of 26.99. This update operation just updates the price and leaves the other fields of the document unchanged. This type of update means "doc" is specified and used as a partial document to merge with an existing document; there are other types of updates supported. The response of the update request is as follows: { "_index": "catalog", "_type": "product", "_id": "1", "_version": 2, "result": "updated", "_shards": { "total": 2, "successful": 1, "failed": 0 } } Internally, Elasticsearch maintains the version of each document. Whenever a document is updated, the version number is incremented. The partial update that we have seen above will work only if the document existed beforehand. If the document with the given id did not exist, Elasticsearch will return an error saying that document is missing. Let us understand how do we do an upsert operation using the Update API. The term upsert loosely means update or insert, i.e. update the document if it exists otherwise insert new document. The parameter doc_as_upsert checks if the document with the given id already exists and merges the provided doc with the existing document. If the document with the given id doesn't exist, it inserts a new document with the given document contents. The following example uses doc_as_upsert to merge into the document with id 3 or insert a new document if it doesn't exist. POST /catalog/product/3/_update { "doc": { "author": "Albert Paro", "title": "Elasticsearch 5.0 Cookbook", "description": "Elasticsearch 5.0 Cookbook Third Edition", "price": "54.99" }, "doc_as_upsert": true } We can update the value of a field based on the existing value of that field or another field in the document. The following update uses an inline script to increase the price by two for a specific product: POST /catalog/product/AVrASKqgaBGmnAMj1SBe/_update { "script": { "inline": "ctx._source.price += params.increment", "lang": "painless", "params": { "increment": 2 } } } Scripting support allows for the reading of the existing value, incrementing the value by a variable, and storing it back in a single operation. The inline script used here is Elasticsearch's own painless scripting language. The syntax for incrementing an existing variable is similar to most other programming languages. Delete API The Delete API lets you delete a document by ID: DELETE /catalog/product/AVrASKqgaBGmnAMj1SBe The response of the delete operations is as follows: { "found": true, "_index": "catalog", "_type": "product", "_id": "AVrASKqgaBGmnAMj1SBe", "_version": 4, "result": "deleted", "_shards": { "total": 2, "successful": 1, "failed": 0 } } This is how basic CRUD operations are performed with Elasticsearch using simple document APIs from any data source in any format securely and reliably. If you found this tutorial useful, do check out the book Learning Elastic Stack 6.0 and start building end-to-end real-time data processing solutions for your enterprise analytics applications.

0
0
31543

article-image-elastic-marks-its-entry-in-security-analytics-market-with-elastic-siem-and-endgame-acquisition

Bhagyashree R

13 Dec 2019

6 min read

Elastic marks its entry in security analytics market with Elastic SIEM and Endgame acquisition

Bhagyashree R

13 Dec 2019

6 min read

For many years, Elastic Stack has served as an open-source, simple yet powerful interface for security analysts to detect and mitigate malicious behavior. However, Elastic marked its official entry into the security analytics market with Elastic SIEM in June this year. Since its initial release, Elastic SIEM has seen a number of enhancements including machine learning-based anomaly detection, maps integration, and more. To further expand its presence in the security field, Elastic in early October, completed the acquisition of Endgame, a security company focused on endpoint prevention, detection, and response. Following this acquisition, Elastic introduced the Elastic Endpoint Security solution in October to help organizations “automatically and flexibly respond to threats in real-time.” The company has also eliminated per-endpoint pricing. In this article, we will look at what is Elastic SIEM, how it fits into the Elastic Stack, its components, and how a security operations team leverages Elastic SIEM to defend its data and infrastructure against attacks. [box type="shadow" align="" class="" width=""] Further learning This is a quick overview of the Elastic Stack. To learn more check out our book, Learning Elastic Stack 7.0 - Second Edition by Pranav Shukla and Sharath Kumar M N. This book will give you a fundamental understanding of what the stack is all about, and help you use it efficiently to build powerful real-time data processing. [/box] Introducing Elastic SIEM Elastic SIEM is not a standalone product but rather builds on the existing Elastic Stack capabilities used for security analytics including search, visualizations, dashboards, alerting, machine learning features, and more. The following diagram shows how Elastic SIEM fits into the Elastic Stack: Source: Elastic The beta version of Elastic SIEM was released in June this year with Elastic Stack 7.2. It includes a new set of data integrations for security use cases and a dedicated app in Kibana. It enables users to analyze host-related and network-related security events as part of alert investigations, threat hunting, initial investigations, and triaging of events. You can access Elastic SIEM through the Elastic Cloud or by downloading its default distribution. Elastic SIEM supports the recently introduced Elastic Common Schema (ECS), a uniform way to represent data across different sources. ECS defines a common set of fields and objects to ingest data into Elasticsearch enabling users to centrally analyze information like logs, flows, and contextual data from across environments. Features of Elastic SIEM Host-related security event analysis The Hosts view shows key metrics regarding host-related security events and a set of data tables that enable interaction with the Timeline Event Viewer. For further investigation, you can drag-and-drop items of interest from the Hosts view tables to Timeline. This gives you deeper insight into hosts, unique IPs, user authentications, uncommon processes, and events. We can filter the host view with the search bar at the top. To help you search faster, SIEM provides a search experience that combines traditional text-based search with the visual query builder that’s deeply integrated with drag-and-drop throughout the SIEM app and powered by the Elastic common schema. Network-related security event analysis The Network view provides analysts the key network activity metrics and event tables. You can drag-and-drop these tables to Timeline for further investigation to get deeper insight into the source and destination IP, top DNS domains, users, transport layer security certs, and more. Starting with Elastic Stack 7.4, you have Elastic Maps integrated right into Elastic SIEM. The interactive map is created based on live data that analysts can search, filter, and explore in real-time. The map gives analysts an overview of the network traffic. They can simply hover over source and destination points to uncover more details such as hostnames and IP addresses. They can also click a hostname to go to the SIEM Host view or an IP address to open the relevant network details. This integration lets Elastic SIEM leverage geospatial analytics and search capabilities of Elastic Maps. It also uses the new point-to-point line feature to easily visualize the connections in your data. Timeline Event Viewer The Timeline Event Viewer enables security analysts to gather and store evidence of an attack. They can pin and annotate relevant events, comment on and share their findings, and do everything within Kibana. It is a collaborative workspace for investigations or threat hunting where analysts can easily drag objects of interest from Network and Hosts view for further investigation. Anomaly detection with machine learning integration Cyber attacks today have become so sophisticated that it is hard to maintain an effective defense with just a set of static rules. Looking at the importance of automated analysis and detection, Elastic integrated machine learning capabilities right into the SIEM app in 7.3. This allowed security analysts to enable and run a set of machine learning anomaly detection jobs designed to detect specific cyber attack behaviors. The detected anomalies are then displayed on the Hosts and Network views in the SIEM app. However, in Elastic SIEM 7.3, there were only three built-in anomaly detection jobs. In the latest release (7.4), Elastic has added thirteen more anomaly detection jobs some of which are anomalous network activity, anomalous process, anomalous path activity, anomalous Powershell script, and more. This machine learning integration is extensible allowing users to add their own jobs to the SIEM job group. These were some of the key features in Elastic SIEM. Check out the Elastic SIEM 7.4 release announcement to know more. Also, to get a better understanding of how Elastic SIEM works, see the webinar Hands-on with Elastic SIEM: Defending your organization with the Elastic Stack by Elastic. To get started with Elastic Stack you can check out our book Learning Elastic Stack 7.0 - Second Edition. This book will help you learn how to use Elasticsearch for distributed searching and analytics, Logstash for logging, and Kibana for data visualization. As you work through the book, you will discover the technique of creating custom plugins using Kibana and Beats. The book also touches upon Elastic X-Pack, a useful extension for effective security and monitoring. You’ll also find helpful tips on how to use Elastic Cloud and deploy Elastic Stack in production environments. How to push Docker images to AWS’ Elastic Container Registry(ECR) [Tutorial] Core security features of Elastic Stack are now free! Elastic Stack 6.7 releases with Elastic Maps, Elastic Update and much more!

0
0
31498

article-image-openai-five-bots-destroyed-human-dota-2-players-this-weekend

Richard Gall

23 Apr 2019

3 min read

OpenAI Five bots destroyed human Dota 2 players this weekend

Richard Gall

23 Apr 2019

3 min read

0
0
31403

article-image-the-best-business-intelligence-tools-2019-when-to-use-them-and-how-much-they-cost

Richard Gall

19 Sep 2019

11 min read

The best business intelligence tools 2019: when to use them and how much they cost

Richard Gall

19 Sep 2019

11 min read

0
0
31335

article-image-apache-druid-hadoop-data-visualizations-tutorial

Sunith Shetty

27 Jul 2018

9 min read

Setting up Apache Druid in Hadoop for Data visualizations [Tutorial]

Sunith Shetty

27 Jul 2018

9 min read

0
0
31223

How-To Tutorials - Data

Getting started with big data analysis using Google's PageRank algorithm

Key skills for data professionals to learn in 2020

Amazon’s partnership with NHS to make Alexa offer medical advice raises privacy concerns and public backlash

Clustering and Other Unsupervised Learning Methods

Building a Microsoft Power BI Data Model

6 Popular Regression Techniques you must know

Sizing and Configuring your Hadoop Cluster

Katie Bouman unveils the first ever black hole image with her brilliant algorithm

TensorFlow 2.0 released with tighter Keras integration, eager execution enabled by default, and more!

Data science and machine learning: what to learn in 2020

Trending Topics

CRUD (Create Read, Update and Delete) Operations with Elasticsearch

Elastic marks its entry in security analytics market with Elastic SIEM and Endgame acquisition

OpenAI Five bots destroyed human Dota 2 players this weekend

The best business intelligence tools 2019: when to use them and how much they cost

Setting up Apache Druid in Hadoop for Data visualizations [Tutorial]

Create a Free Account To Continue Reading

SignIn Free Account To Continue Reading