
How-To Tutorials - Data

1210 Articles

Implementing the k-nearest neighbors algorithm in Python

Aaron Lazar
17 Nov 2017
7 min read
[box type="note" align="" class="" width=""]The following is an excerpt from Dávid Natingga's Data Science Algorithms in a Week. [/box] The nearest neighbor algorithm classifies a data instance based on its neighbors. The class of a data instance determined by the k-nearest neighbor algorithm is the class with the highest representation among the k-closest neighbors. In this short tutorial, we will cover the basics of the k-NN algorithm - understanding it and its implementation with a simple example: Mary and her temperature preferences. So let’s get right to it, shall we? Mary and her temperature preferences As an example, if we know that our friend Mary feels cold when it is 10 degrees Celsius, but warm when it is 25 degrees Celsius, then in a room where it is 22 degrees Celsius, the nearest neighbor algorithm would guess that our friend would feel warm, because 22 is closer to 25 than to 10. Suppose we would like to know when Mary feels warm and when she feels cold, as in the previous example, but in addition, wind speed data is also available when Mary was asked if she felt warm or cold: We could represent the data in a graph, as follows: Now, suppose we would like to find out how Mary feels at the temperature 16 degrees Celsius with a wind speed of 3km/h using the 1-NN algorithm: For simplicity, we will use a Manhattan metric to measure the distance between the neighbors on the grid. The Manhattan distance dMan of a neighbor N1=(x1,y1) from the neighbor N2=(x2,y2) is defined to be dMan=|x1- x2|+|y1- y2|. Let us label the grid with distances around the neighbors to see which neighbor with a known class is closest to the point we would like to classify: We can see that the closest neighbor with a known class is the one with the temperature 15 (blue) degrees Celsius and the wind speed 5km/h. Its distance from the questioned point is three units. Its class is blue (cold). The closest red (warm) neighbor is away four units from the questioned point. Since we are using the 1-nearest neighbor algorithm, we just look at the closest neighbor and, therefore, the class of the questioned point should be blue (cold). By applying this procedure to every data point, we can complete the graph as follows: Note that sometimes a data point can be distanced from two known classes with the same distance: for example, 20 degrees Celsius and 6km/h. In such situations, we would prefer one class over the other or ignore these boundary cases. The actual result depends on the specific implementation of an algorithm. Implementation of k-nearest neighbors algorithm in Python We’ll implement the k-NN algorithm in Python to find Mary's temperature preference: # source_code/1/mary_and_temperature_preferences/knn_to_data.py # Applies the knn algorithm to the input data. # The input text file is assumed to be of the format with one line per # every data entry consisting of the temperature in degrees Celsius, # wind speed and then the classification cold/warm. import sys sys.path.append('..') sys.path.append('../../common') import knn # noqa import common # noqa # Program start # E.g. "mary_and_temperature_preferences.data" input_file = sys.argv[1] # E.g. 
"mary_and_temperature_preferences_completed.data" output_file = sys.argv[2] k = int(sys.argv[3]) x_from = int(sys.argv[4]) x_to = int(sys.argv[5]) y_from = int(sys.argv[6]) y_to = int(sys.argv[7]) data = common.load_3row_data_to_dic(input_file) new_data = knn.knn_to_2d_data(data, x_from, x_to, y_from, y_to, k) common.save_3row_data_from_dic(output_file, new_data) # source_code/common/common.py # ***Library with common routines and functions*** def dic_inc(dic, key): if key is None: Pass if dic.get(key, None) is None: dic[key] = 1 Else: dic[key] = dic[key] + 1 # source_code/1/knn.py # ***Library implementing knn algorihtm*** def info_reset(info): info['nbhd_count'] = 0 info['class_count'] = {} # Find the class of a neighbor with the coordinates x,y. # If the class is known count that neighbor. def info_add(info, data, x, y): group = data.get((x, y), None) common.dic_inc(info['class_count'], group) info['nbhd_count'] += int(group is not None) # Apply knn algorithm to the 2d data using the k-nearest neighbors with # the Manhattan distance. # The dictionary data comes in the form with keys being 2d coordinates # and the values being the class. # x,y are integer coordinates for the 2d data with the range # [x_from,x_to] x [y_from,y_to]. def knn_to_2d_data(data, x_from, x_to, y_from, y_to, k): new_data = {} info = {} # Go through every point in an integer coordinate system. for y in range(y_from, y_to + 1): for x in range(x_from, x_to + 1): info_reset(info) # Count the number of neighbors for each class group for # every distance dist starting at 0 until at least k # neighbors with known classes are found. for dist in range(0, x_to - x_from + y_to - y_from): # Count all neighbors that are distanced dist from # the point [x,y]. if dist == 0: info_add(info, data, x, y) Else: for i in range(0, dist + 1): info_add(info, data, x - i, y + dist - i) info_add(info, data, x + dist - i, y - i) for i in range(1, dist): info_add(info, data, x + i, y + dist - i) info_add(info, data, x - dist + i, y - i) # There could be more than k-closest neighbors if the # distance of more of them is the same from the point # [x,y]. But immediately when we have at least k of # them, we break from the loop. if info['nbhd_count'] >= k: Break class_max_count = None # Choose the class with the highest count of the neighbors # from among the k-closest neighbors. for group, count in info['class_count'].items(): if group is not None and (class_max_count is None or count > info['class_count'][class_max_count]): class_max_count = group new_data[x, y] = class_max_count return new_data Input: The program above will use the file below as the source of the input data. The file contains the table with the known data about Mary's temperature preferences: # source_code/1/mary_and_temperature_preferences/ marry_and_temperature_preferences.data 10 0 cold 25 0 warm 15 5 cold 20 3 warm 18 7 cold 20 10 cold 22 5 warm 24 6 warm Output: We run the implementation above on the input file mary_and_temperature_preferences.data using the k-NN algorithm for k=1 neighbors. The algorithm classifies all the points with the integer coordinates in the rectangle with a size of (30-5=25) by (10-0=10), so with the a of (25+1) * (10+1) = 286 integer points (adding one to count points on boundaries). Using the wc command, we find out that the output file contains exactly 286 lines - one data item per point. Using the head command, we display the first 10 lines from the output file. 
We visualize all the data from the output file in the next section:

$ python knn_to_data.py mary_and_temperature_preferences.data mary_and_temperature_preferences_completed.data 1 5 30 0 10
$ wc -l mary_and_temperature_preferences_completed.data
286 mary_and_temperature_preferences_completed.data
$ head -10 mary_and_temperature_preferences_completed.data
7 3 cold
6 9 cold
12 1 cold
16 6 cold
16 9 cold
14 4 cold
13 4 cold
19 4 warm
18 4 cold
15 1 cold

So, there you have it! The k-nearest neighbors algorithm explained and implemented in Python. I hope you enjoyed this tutorial and found it interesting. If you want more, go ahead and purchase Dávid Natingga's Data Science Algorithms in a Week, from which this tutorial has been extracted.


Revolutionize Power BI Queries with OpenAI

Gus Frazer
11 Dec 2024
10 min read
This article is an excerpt from the book Data Cleaning with Power BI, by Gus Frazer. Unlock the full potential of your data by mastering the art of cleaning, preparing, and transforming data with Power BI for smarter insights and data visualizations.

Introduction

Discover the transformative potential of leveraging Azure OpenAI, integrated with ChatGPT functionality, to enhance Power BI's M query capabilities. In this article, we delve into how this powerful combination offers expert guidance, efficient solutions, and insightful recommendations for optimizing data transformation tasks. From generating M queries to streamlining complex transformations, explore how Azure OpenAI with ChatGPT empowers users to boost productivity and efficiency in Power BI.

Using OpenAI for M queries

Azure OpenAI, with ChatGPT functionality within it, can be a helpful tool for generating M queries in Power BI by providing suggestions, helping with syntax, and offering insights into data transformation tasks. In the following example, you will learn how you can leverage the chat playground within OpenAI to improve your productivity and efficiency when writing M queries. We will do this by asking a series of questions directly within Azure OpenAI. Complete the next steps to follow along with the example in your own environment:

1. Click on Deployment on the left-hand side and then select Create new deployment to get started.

2. Select a model from the base models, in this case gpt-35-turbo, and then name your deployment. In this example, name it CleaningDataOpenAI.

3. Select Chat playground from the Azure OpenAI Studio home screen or from the Playground tab on the left of your screen. This should open the following view:

Figure 15.4 – Screenshot of the chat playground from within Azure OpenAI
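Although this walkthrough uses the chat playground, the same deployment can also be called from Python. The following is a minimal, hypothetical sketch using the openai package's Azure client (v1 style); the endpoint, key, and prompt are illustrative placeholders rather than values from the book:

# Hypothetical sketch: querying the CleaningDataOpenAI deployment from
# Python instead of the chat playground. Endpoint, key, and prompt are
# placeholders, not values from the book.
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://YOUR-RESOURCE.openai.azure.com",  # placeholder
    api_key="YOUR-API-KEY",                                   # placeholder
    api_version="2024-02-01",
)

response = client.chat.completions.create(
    model="CleaningDataOpenAI",  # the deployment name from step 2
    messages=[
        {"role": "system",
         "content": "You are an assistant that writes Power Query M code."},
        {"role": "user",
         "content": "Write an M query that groups a sales table by "
                    "ProductCategory and sums the Sales column."},
    ],
)

print(response.choices[0].message.content)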
Now that you have set up an instance of OpenAI, here's an example conversation taken from OpenAI, where the chat functionality has been used to assist with writing M queries in Power BI:

User: Hi, I'm working on a Power BI project, and I need to transform some data using M queries. I have a table with sales data, and I want to calculate the total sales for each product category. Can you help me with the M query for that?

ChatGPT: Of course! To calculate the total sales for each product category, you can use the Group By transformation in Power BI's Power Query Editor. Here's a sample M snippet for your scenario:

let
    Source = YourDataSource,
    GroupedTable = Table.Group(Source, {"ProductCategory"},
        {{"TotalSales", each List.Sum([Sales]), type number}})
in
    GroupedTable

Replace YourDataSource with the actual name of your data source. This code groups the data by the ProductCategory column and calculates the total sales for each group.

User: Thanks! That's a great starting point. What if I want to filter out products with sales less than $1,000 before calculating the total sales?

ChatGPT: You can add a filtering step before the Group By transformation. Here's an updated version of the M query:

let
    Source = YourDataSource,
    FilteredTable = Table.SelectRows(Source, each [Sales] >= 1000),
    GroupedTable = Table.Group(FilteredTable, {"ProductCategory"},
        {{"TotalSales", each List.Sum([Sales]), type number}})
in
    GroupedTable

This code filters out rows where sales are less than $1,000 before grouping the data by product category and calculating the total sales.

User: Perfect! This is really helpful. One more thing - what if I want to sort the results in descending order based on total sales?

ChatGPT: You can add a sorting step using the Table.Sort function. Here's the modified M query:

let
    Source = YourDataSource,
    FilteredTable = Table.SelectRows(Source, each [Sales] >= 1000),
    GroupedTable = Table.Group(FilteredTable, {"ProductCategory"},
        {{"TotalSales", each List.Sum([Sales]), type number}}),
    SortedTable = Table.Sort(GroupedTable, {{"TotalSales", Order.Descending}})
in
    SortedTable

This code sorts the grouped table in descending order based on the "TotalSales" column.

This interaction represents a simple example of how those who can leverage the power of tools such as OpenAI and ChatGPT will be able to quickly upskill in areas such as coding. It has to be said, though, that while this is still in its infancy, it's important to always test and validate the answers provided before implementing them in production. Also, ensure that you take precautions when using the publicly available ChatGPT model to avoid sharing sensitive data publicly. If you would like to use sensitive data, or you want to ensure that requests are made within a secure, governed environment, make sure to use the ChatGPT model within your own Azure OpenAI instance.

In more complex examples, optimizing Power Query transformations could involve efficient interaction with Azure OpenAI. This includes streamlining API calls, managing large datasets, and incorporating caching mechanisms for repetitive queries, ensuring a seamless and performant data cleaning process. As we begin to explore the use cases where this technology can be most effective, there are a number of clear early winners:

Optimizing query plans: ChatGPT's natural language understanding can assist in formulating more efficient Power Query plans. By describing the desired transformations in natural language, users can interact with ChatGPT to generate optimized query plans. This involves selecting the most suitable Power Query functions and structuring transformations for performance gains.

Caching strategies for repetitive queries: ChatGPT can guide users in devising effective caching strategies. By understanding the context of data transformations, it can recommend where to implement caching mechanisms to store and reuse intermediate results, minimizing redundant API calls and computations. The following is an example of just this, where I have asked Azure OpenAI to verify and optimize my query from the Power Query Advanced Editor. The model suggested I use the Table.Buffer function to help cache the table in memory and optimize the query.

Figure – An example request to OpenAI to help optimize my query for Power Query

Figure – An example response from OpenAI to help optimize my query for Power Query

Now, as we highlighted in Chapter 11, M Query Optimization, Table.Buffer can indeed improve the performance of your queries and refreshes, but this really depends on the data you are working with. In the previous example, the model doesn't take the characteristics, size, or complexity of your data into consideration, as it isn't plugged into your data at this stage. Also, linking back to the example you walked through in Chapter 11, the placement of where you add Table.Buffer can really impact how your query performs.
In the previous example, if you were connecting to a small dataset, you would likely cause it to run slower by adding the Table.Buffer function as the second variable in the query.

Lastly, it's worth mentioning that how you prompt these models is crucially important. In the previous example, we didn't specify what type of data source we were using in our query. As such, the model hasn't pointed out that using Table.Buffer on a data source that supports query folding will cause it to break the fold. Again, this is not so much of a problem if Table.Buffer is placed at the end of your query for smaller datasets, but it is a problem if you add it nearer to the beginning of the query, as in the previous example.

Handling large datasets: Dealing with large datasets often poses a challenge in Power Query. OpenAI models, including ChatGPT, can provide insights into dividing and conquering large datasets. This includes strategies for parallel processing, filtering data early in the transformation pipeline, and using aggregations to reduce computational load.

Dynamic query adjustments: ChatGPT's interactive nature allows users to dynamically adjust queries based on evolving requirements. It can assist in crafting queries that adapt to changing data scenarios, ensuring that Power Query transformations remain flexible and responsive to varied datasets.

Guidance on complex transformations: Power Query often involves intricate transformations. ChatGPT can act as a virtual assistant, guiding users through the process of complex transformations. It can suggest optimal function compositions, advise on conditional logic placement, and assist in structuring transformations to enhance efficiency. The best example of this can be seen in the following two screenshots of an active use case seen in many businesses. The example begins with a user asking the model for a description of what the query is doing. OpenAI then provides a breakdown of what the query is doing in each step to help the user interpret the code. It helps to break down the barriers to coding, and also helps to decipher code that has not been documented well by previous employees.

Figure – An example request to OpenAI to help translate my query

Figure – An example response from OpenAI to help describe my query

Error handling strategies: Optimizing Power Query also entails robust error handling. ChatGPT can provide recommendations for anticipating and handling errors gracefully within a query. This includes strategies for logging errors, implementing fallback mechanisms, and ensuring the stability of the overall data preparation process.

In this section, you learned how to optimize Power Query transformations with Azure OpenAI efficiently. Key takeaways include using ChatGPT for natural-language-based query planning and effective caching strategies. Insights include handling large datasets through parallel processing, early filtering, and aggregations. This knowledge equips you to streamline and enhance your Power Query processes effectively. In the next section, you will learn about Microsoft Copilot, how to set up a Power BI instance with Copilot activated, and also how you can use this new AI technology to help clean and prepare your data.

Conclusion

In conclusion, Azure OpenAI with ChatGPT presents a game-changing solution for maximizing Power BI's potential. From query optimization to error-handling strategies, this integration streamlines processes and enhances productivity.
As users navigate complex data transformations, the guidance provided fosters efficient decision-making and empowers users to tackle challenges with confidence. With Azure OpenAI and ChatGPT, the possibilities for revolutionizing Power BI workflows are endless, offering a glimpse into the future of data transformation and analytics.

Author Bio

Gus Frazer is a seasoned analytics consultant focused on business intelligence solutions. With over 7 years of experience working for the two market-leading platforms, Power BI and Tableau, he has amassed a wealth of knowledge and expertise. Gus has helped hundreds of customers to drive their digital and data transformations, scope data requirements, drive actionable insights, and, most important of all, cleanse data ready for analysis. Most recently, he helped to set up, organize, and run the Power BI UK community at Microsoft. He holds six Azure and Power BI certifications, including PL-300 and DP-500. In this book, Gus offers readers invaluable guidance on ingesting, preparing, and cleansing data for analysis in Power BI.


How to perform exception handling in Python with ‘try, catch and finally’

Guest Contributor
10 Dec 2019
9 min read
An integral part of using Python involves the art of handling exceptions. There are primarily two types of exceptions: built-in exceptions and user-defined exceptions. When an error interrupts the normal program flow, the error-handling resolution is to save the state of execution at the moment of the error and pass control to a special function or block of code called an exception handler. There are many types of errors, such as 'division by zero' and 'file open error', where an error handler needs to fix the issue so that the program can continue based on the previously saved state.

Just like Java, Python handles exceptions with code embedded in a try block. Where Java uses catch clauses to catch exceptions, Python uses a similar clause that begins with except. Custom exceptions are also possible in Python by using the raise statement, which forces a specified exception to take place.

Reason to use exceptions

Errors are always expected while writing a program in Python, which requires a backup mechanism. Such a mechanism is set to handle any encountered errors; not doing so may crash the program completely. The reason to equip a Python program with an exception mechanism is to set and define a backup plan in case any possible error situation erupts while executing it.

Catch exceptions in Python

The try statement is used for handling exceptions in Python. A try clause will contain the critical operation that can raise an exception, and the code for handling the exception is written within the except clause. The choice of what to do after catching the exception is up to the programmer. The program described below loops until the user enters an integer value having a valid reciprocal. The part of the code that can trigger an exception is contained inside the try block. If no exception is raised, the normal flow of execution continues and the except block is skipped; if an exception is raised, it is caught by the except block. Check out the example below.
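The article's original listing was an image; here is a minimal reconstruction (my own) matching the description: it loops until the user enters an integer with a valid reciprocal, naming the raised exception via sys.exc_info().

# Reconstructed sketch: loop until the user enters an integer whose
# reciprocal is defined, reporting the exception type on each failure.
import sys

while True:
    try:
        num = int(input("Enter an integer: "))
        reciprocal = 1 / num
        break
    except:
        print("Oops!", sys.exc_info()[0], "occurred. Please try again.")

print("The reciprocal of", num, "is", reciprocal)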
Naming the exception is possible by using the exc_info() function inside the sys module; the program can then ask the user to make another attempt. Any unexpected value like 'a' or '1.3' will trigger a ValueError, and a value of '0' leads to a ZeroDivisionError.

Exception handling in Python: try, except and finally

Suspicious code that may raise an exception is placed inside a try statement block, while the code dedicated to handling raised exceptions is placed within the except block:

try:
    # operational/suspicious code
except SomeException:
    # code to handle the exception

How they work in Python: the statements in the try block are executed first. If an exception occurs and matches the name given in the except clause ('SomeException' above), the except block containing the exception-handling statements is executed, and the program is able to continue. If there is no corresponding handler for the raised exception, program execution is halted along with an error describing it.

Defining except without an exception

Naming an exception in the except clause isn't always required. A try-except clause with a bare except is capable of handling all possible types of exceptions, although it will keep you ignorant about which exception was actually raised. The except statement can be used without the exceptions field, optionally together with an else block:

try:
    # do your operations here
except:
    # if there is an exception raised, execute these statements
else:
    # if there is no exception, execute these statements

Here is an example where the intent is to catch an exception within file handling. This is useful when the intention is to read a file that may not exist:

try:
    fp = open('example.txt', 'r')
except:
    print('File is not found')
else:
    fp.close()

This example deals with opening example.txt. When the file is not found or does not exist, the code executes the except block, printing the message 'File is not found'; otherwise, the file is closed in the else block.

Defining the except clause for multiple exceptions

It is possible to deal with multiple exceptions in a single try statement by specifying different exception handlers; it is also good programming practice to name the particular exceptions the code expects. One way is to define a tuple that handles multiple predefined exceptions within a single except clause, with a bare except at the end for everything else:

try:
    # do something
except (Exception1, Exception2, ExceptionN):
    # handle any exception from the given list
    pass
except:
    # handle all other exceptions

If the interpreter finds a matching exception in the tuple, the code written under that except clause is executed. A short runnable example follows.
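Here is a minimal runnable sketch (not from the original article) of catching two specific exception types with one tuple-based except clause:

# One except clause handling two specific exception types.
for value in ['10', '0', 'abc']:
    try:
        print('Reciprocal of', value, 'is', 1 / int(value))
    except (ValueError, ZeroDivisionError) as err:
        print('Cannot compute reciprocal of', value, '-', type(err).__name__)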
Exception handling in Python using the try-finally clause

Apart from combining the try and except blocks, it is also a good idea to put together try and finally blocks. Here, the finally block carries all the necessary statements that must be executed regardless of whether an exception is raised in the try block. One benefit of this method is that it helps in releasing external resources and clearing up cache memory, keeping the program lean. Here is the pseudocode for the try-finally clause:

try:
    # perform operations
finally:
    # these statements are always executed

Defining exceptions in the try-finally block

The example given below closes a file once all the operations are completed:

try:
    fp = open("example.txt", 'r')
    # file operations
finally:
    fp.close()

When using the try statement in Python, remember that it comes with an optional clause: finally. The finally block is executed under any circumstances, which is why it is usually used for releasing additional external resources. It is not new for developers to be connected to a remote data center over a network, or to work with a file or a graphical user interface; such situations push developers to clean up the used resources. Even when the resources in use yield successful results, such post-execution steps are always considered good practice. Actions like shutting down the GUI, closing a file, or disconnecting from a network, when written in the finally block, are assured of execution. The file operations example below illustrates this very well:

try:
    f = open("test.txt", encoding='utf-8')
    # perform file operations
finally:
    f.close()

Or, in simpler terms:

try:
    # You do your operations here;
    # due to any exception, this may be skipped.
finally:
    # This would always be executed.

Constructing such a block is a better way to ensure the file is closed even if an exception has taken place. Make a note that it is not possible to use the else clause along with this try-finally form.

Understanding user-defined exceptions

Python users can create their own exceptions by deriving classes from the built-in standard exceptions. There are instances where displaying specific information to users is crucial, especially upon catching an exception. In such cases, it is best to create a class subclassed from RuntimeError. Here, the try block raises a user-defined exception, which is caught in the except block with an instance of the Networkerror class bound to the variable e. Below is the class definition:

class Networkerror(RuntimeError):
    def __init__(self, arg):
        self.args = arg

Once the class is defined, the exception can be raised and caught as follows:

try:
    raise Networkerror("Bad hostname")
except Networkerror as e:
    print(e.args)
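Note that assigning a string directly to self.args stores it as a tuple of individual characters. A slightly more idiomatic sketch (my own, not from the article) defers to the base class instead:

# A more idiomatic user-defined exception: pass the message to the base
# class rather than assigning self.args directly.
class NetworkError(RuntimeError):
    def __init__(self, message):
        super().__init__(message)

try:
    raise NetworkError("Bad hostname")
except NetworkError as e:
    print(e.args)   # ('Bad hostname',)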
Key points to remember

An exception is an error that occurs while executing a program; such events occur relatively infrequently. As mentioned in the examples above, the most common exceptions are 'division by zero', 'attempting to access a non-existent file', and 'adding two non-compatible types'. Wrap a try statement around code where you are not sure whether an exception will occur, and specify an else block alongside the try-except statement to run when no exception is raised in the try block.

Author bio

Shahid Mansuri co-founded Peerbits, one of the leading software development companies in the USA, in 2011; the company provides Python development services. Under his leadership, Peerbits used Python on a project to embed reports and research on a platform, giving every user access to a freely available dashboard as well as to an exclusively available one. His visionary leadership and flamboyant management style have yielded fruitful results for the company. He believes in sharing his strong knowledge base with a learned concentration on entrepreneurship and business.

Read next:

- Introducing Spleeter, a TensorFlow-based Python library that extracts voice and sound from any music track
- Fake Python libraries removed from PyPI when caught stealing SSH and GPG keys, reports ZDNet
- There's more to learning programming than just writing code


Implementing gradient descent algorithm to solve optimization problems

Sunith Shetty
22 Feb 2018
7 min read
[box type="note" align="" class="" width=""]This article is an excerpt from a book written by Rajdeep Dua and Manpreet Singh Ghotra titled Neural Network Programming with Tensorflow. In this book, you will learn to leverage the power of TensorFlow to train neural networks of varying complexities, without any hassle.[/box] Today we will focus on the gradient descent algorithm and its different variants. We will take a simple example of linear regression to solve the optimization problem. Gradient descent is the most successful optimization algorithm. As mentioned earlier, it is used to do weights updates in a neural network so that we minimize the loss function. Let's now talk about an important neural network method called backpropagation, in which we firstly propagate forward and calculate the dot product of inputs with their corresponding weights, and then apply an activation function to the sum of products which transforms the input to an output and adds non linearities to the model, which enables the model to learn almost any arbitrary functional mappings. Later, we back propagate in the neural network, carrying error terms and updating weights values using gradient descent, as shown in the following graph: Different variants of gradient descent Standard gradient descent, also known as batch gradient descent, will calculate the gradient of the whole dataset but will perform only one update. Therefore, it can be quite slow and tough to control for datasets which are extremely large and don't fit in the memory. Let's now look at algorithms that can solve this problem. Stochastic gradient descent (SGD) performs parameter updates on each training example, whereas mini batch performs an update with n number of training examples in each batch. The issue with SGD is that, due to the frequent updates and fluctuations, it eventually complicates the convergence to the accurate minimum and will keep exceeding due to regular fluctuations. Mini-batch gradient descent comes to the rescue here, which reduces the variance in the parameter update, leading to a much better and stable convergence. SGD and mini-batch are used interchangeably. Overall problems with gradient descent include choosing a proper learning rate so that we avoid slow convergence at small values, or divergence at larger values and applying the same learning rate to all parameter updates wherein if the data is sparse we might not want to update all of them to the same extent. Lastly, is dealing with saddle points. Algorithms to optimize gradient descent We will now be looking at various methods for optimizing gradient descent in order to calculate different learning rates for each parameter, calculate momentum, and prevent decaying learning rates. To solve the problem of high variance oscillation of the SGD, a method called momentum was discovered; this accelerates the SGD by navigating along the appropriate direction and softening the oscillations in irrelevant directions. Basically, it adds a fraction of the update vector of the past step to the current update vector. Momentum value is usually set to .9. Momentum leads to a faster and stable convergence with reduced oscillations. Nesterov accelerated gradient explains that as we reach the minima, that is, the lowest point on the curve, momentum is quite high and it doesn't know to slow down at that point due to the large momentum which could cause it to miss the minima entirely and continue moving up. 
Nesterov accelerated gradient (NAG) addresses a weakness of momentum: as we reach the minimum, that is, the lowest point on the curve, the momentum is quite high, and it doesn't know to slow down at that point, which can cause it to miss the minimum entirely and continue moving up. Nesterov proposed that we first make a long jump based on the previous momentum, then calculate the gradient, and then make a correction, which results in a parameter update. This update prevents us from going too fast and missing the minimum, and makes the method more responsive to changes.

Adagrad allows the learning rate to adapt based on the parameters. It performs large updates for infrequent parameters and small updates for frequent parameters, and is therefore very well suited to dealing with sparse data. Its main flaw is that its learning rate is always decreasing and decaying.

AdaDelta solves the problem of the decreasing learning rate in AdaGrad. In AdaGrad, the learning rate is computed as one divided by the sum of square roots. At each stage, we add another square root to the sum, which causes the denominator to decrease constantly. Instead of summing all prior square roots, AdaDelta uses a sliding window, which allows the sum to decrease.

Adaptive Moment Estimation (Adam) computes adaptive learning rates for each parameter. Like AdaDelta, Adam not only stores the decaying average of past squared gradients but additionally stores the momentum change for each parameter. Adam works well in practice and is one of the most used optimization methods today.

The following two images (image credit: Alec Radford) show the optimization behavior of the algorithms described earlier, as seen on the contours of a loss surface over time. Adagrad, RMSprop, and Adadelta almost immediately head off in the right direction and converge fast, whereas momentum and NAG are headed off-track. NAG is soon able to correct its course due to its improved responsiveness, by looking ahead and moving toward the minimum. The second image displays the behavior of the algorithms at a saddle point: SGD, momentum, and NAG find it challenging to break symmetry, but slowly they manage to escape the saddle point, whereas Adagrad, Adadelta, and RMSprop head down the negative slope.

Which optimizer to choose

In cases where the input data is sparse, or when we want fast convergence while training complex neural networks, we get the best results using adaptive learning rate methods. We also don't need to tune the learning rate. For most cases, Adam is usually a good choice.

Optimization with an example

Let's take an example of linear regression, where we try to find the best fit for a straight line through a number of data points by minimizing the squares of the distances from the line to each data point. This is why we call it least squares regression. Essentially, we are formulating the problem as an optimization problem, where we are trying to minimize a loss function.
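Concretely, for n data points (x_i, y_i) and a line parameterized by a weight W and bias b, the loss being minimized is the mean squared residual:

loss = (1/n) * sum over i of (y_i - (W * x_i + b))^2

This is exactly what the loss expression in the code below computes.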
Let's set up the input data and look at the scatter plot:

# imports assumed from earlier in the book
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt

# input data
xData = np.arange(100, step=.1)
yData = xData + 20 * np.sin(xData/10)

Define the data size and batch size:

# define the data size and batch size
nSamples = 1000
batchSize = 100

We will need to resize the data to meet the TensorFlow input format, as follows:

# resize input for tensorflow
xData = np.reshape(xData, (nSamples, 1))
yData = np.reshape(yData, (nSamples, 1))

The following scope initializes the weights and bias, and describes the linear model and loss function:

# placeholders for mini-batches; X and y are assumed to be defined
# earlier in the book, for example as follows
X = tf.placeholder(tf.float32, shape=(None, 1))
y = tf.placeholder(tf.float32, shape=(None, 1))

with tf.variable_scope("linear-regression-pipeline"):
    W = tf.get_variable("weights", (1, 1),
                        initializer=tf.random_normal_initializer())
    b = tf.get_variable("bias", (1,),
                        initializer=tf.constant_initializer(0.0))

    # model
    yPred = tf.matmul(X, W) + b
    # loss function
    loss = tf.reduce_sum((y - yPred)**2 / nSamples)

We then set optimizers for minimizing the loss:

# set the optimizer
#optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.001).minimize(loss)
#optimizer = tf.train.AdamOptimizer(learning_rate=.001).minimize(loss)
#optimizer = tf.train.AdadeltaOptimizer(learning_rate=.001).minimize(loss)
#optimizer = tf.train.AdagradOptimizer(learning_rate=.001).minimize(loss)
#optimizer = tf.train.MomentumOptimizer(learning_rate=.001, momentum=0.9).minimize(loss)
#optimizer = tf.train.FtrlOptimizer(learning_rate=.001).minimize(loss)
optimizer = tf.train.RMSPropOptimizer(learning_rate=.001).minimize(loss)

We then select the mini-batches and run the optimizer:

errors = []
with tf.Session() as sess:
    # init variables
    sess.run(tf.global_variables_initializer())
    for _ in range(1000):
        # select mini batch
        indices = np.random.choice(nSamples, batchSize)
        xBatch, yBatch = xData[indices], yData[indices]
        # run optimizer
        _, lossVal = sess.run([optimizer, loss],
                              feed_dict={X: xBatch, y: yBatch})
        errors.append(lossVal)

plt.plot([np.mean(errors[i-50:i]) for i in range(len(errors))])
plt.savefig("errors.png")
plt.show()

The output of the preceding code is a plot of the errors, smoothed with a sliding-window average. We learned that optimization is a complicated subject, and a lot depends on the nature and size of our data. Optimization also depends on weight matrices. A lot of these optimizers are trained and tuned for tasks like image classification or prediction; however, for custom or new use cases, we need to perform trial and error to determine the best solution. To know more about how to build and optimize neural networks using TensorFlow, do check out the book Neural Network Programming with TensorFlow.


Implementing face detection using the Haar Cascades and AdaBoost algorithm

Sugandha Lahoti
20 Feb 2018
7 min read
[box type="note" align="" class="" width=""]This article is an excerpt from a book written by Ankit Dixit titled Ensemble Machine Learning. This book serves as an effective guide to using ensemble techniques to enhance machine learning models.[/box] In today’s tutorial, we will learn how to apply the AdaBoost classifier in face detection using Haar cascades. Face detection using Haar cascades Object detection using Haar feature-based cascade classifiers is an effective object detection method proposed by Paul Viola and Michael Jones in their paper Rapid Object Detection using a Boosted Cascade of Simple Features in 2001. It is a machine-learning-based approach where a cascade function is trained from a lot of positive and negative images. It is then used to detect objects in other images. Here, we will work with face detection. Initially, the algorithm needs a lot of positive images (images of faces) and negative images (images without faces) to train the classifier. Then we need to extract features from it. Features are nothing but numerical information extracted from the images that can be used to distinguish one image from another; for example, a histogram (distribution of intensity values) is one of the features that can be used to define several characteristics of an image even without looking at the image, such as dark or bright image, the intensity range of the image, contrast, and so on. We will use Haar features to detect faces in an image. Here is a figure showing different Haar features: These features are just like the convolution kernel; to know about convolution, you need to wait for the following chapters. For a basic understanding, convolutions can be described as in the following figure: So we can summarize convolution with these steps: Pick a pixel location from the image. Now crop a sub-image with the selected pixel as the center from the source image with the same size as the convolution kernel. Calculate an element-wise product between the values of the kernel and sub- image. Add the result of the product. Put the resultant value into the new image at the same place where you picked up the pixel location. Now we are going to do a similar kind of procedure, but with a slight difference for our images. Each feature of ours is a single value obtained by subtracting the sum of the pixels under the white rectangle from the sum of the pixels under the black rectangle. Now, all possible sizes and locations of each kernel are used to calculate plenty of features. (Just imagine how much computation it needs. Even a 24x24 window results in over 160,000 features.) For each feature calculation, we need to find the sum of the pixels under the white and black rectangles. To solve this, we will use the concept of integral image; we will discuss this concept very briefly here, as it's not a part of our context. Integral image Integral images are those images in which the pixel value at any (x,y) location is the sum of the all pixel values present before the current pixel. Its use can be understood by the following example: Image on the left and the integral image on the right. Let's see how this concept can help reduce computation time; let us assume a matrix A of size 5x5 representing an image, as shown here: Now, let's say we want to calculate the average intensity over the area highlighted: Region for addition Normally, you'd do the following: 9 + 1 + 2 + 6 + 0 + 5 + 3 + 6 + 5 = 37 37 / 9 = 4.11 This requires a total of 9 operations. 
Now we are going to do a similar kind of procedure for our images, but with a slight difference. Each Haar feature is a single value obtained by subtracting the sum of the pixels under the white rectangle from the sum of the pixels under the black rectangle. All possible sizes and locations of each kernel are used to calculate plenty of features. (Just imagine how much computation it needs: even a 24x24 window results in over 160,000 features.) For each feature calculation, we need to find the sum of the pixels under the white and black rectangles. To solve this, we will use the concept of the integral image; we will discuss this concept only briefly here, as it's not part of our context.

Integral image

An integral image is one in which the pixel value at any (x, y) location is the sum of all the pixel values before the current pixel. Let's see how this concept can help reduce computation time. Assume a 5x5 matrix A representing an image, and say we want to calculate the average intensity over a highlighted 3x3 region containing the values 9, 1, 2, 6, 0, 5, 3, 6, 5. Normally, you'd do the following:

9 + 1 + 2 + 6 + 0 + 5 + 3 + 6 + 5 = 37
37 / 9 = 4.11

This requires a total of 9 operations, and doing the same for 100 such regions would require 100 * 9 = 900 operations. Now, let us first make an integral image of the preceding image; making this image requires a total of 56 operations. Focusing on the same highlighted portion of the integral image, all you have to do to compute the sum is:

(76 - 20) - (24 - 5) = 37
37 / 9 = 4.11

This requires a total of 4 operations. To do this for 100 such regions, we would require 56 + 100 * 4 = 456 operations. For just a hundred operations over a 5x5 matrix, using an integral image requires about 50% fewer computations. Imagine the difference it makes for large images and other such operations. Creating an integral image reduces sum-difference operations to almost O(1) time complexity, thereby decreasing the number of calculations. It simplifies the calculation of the sum of pixels - no matter how large the number of pixels - to an operation involving just four pixels. Nice, isn't it? It makes things superfast.
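To make the arithmetic concrete, here is a minimal NumPy sketch (not from the book; the matrix values outside the highlighted 3x3 block are made up) that builds an integral image and recovers the same sum of 37 from just four lookups:

# A zero-padded integral image lets us sum any rectangle in four lookups.
# The central 3x3 block matches the 37 example above; the surrounding
# values are made-up padding.
import numpy as np

img = np.array([[1, 2, 3, 4, 5],
                [2, 9, 1, 2, 6],
                [3, 6, 0, 5, 7],
                [4, 3, 6, 5, 8],
                [5, 4, 7, 8, 9]])

# integral[i, j] holds the sum of img[:i, :j]; the extra zero row and
# column remove the need for boundary checks.
integral = np.zeros((6, 6), dtype=int)
integral[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)

def region_sum(r1, c1, r2, c2):
    # Sum of img[r1:r2, c1:c2] using four integral-image lookups.
    return (integral[r2, c2] - integral[r1, c2]
            - integral[r2, c1] + integral[r1, c1])

print(region_sum(1, 1, 4, 4))      # 37, same as summing the 9 pixels
print(region_sum(1, 1, 4, 4) / 9)  # average intensity, about 4.11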
However, most of the features we calculate are irrelevant. For example, consider the image where the top row shows two good features: the first selected feature seems to focus on the property that the region of the eyes is often darker than the region of the nose and cheeks, and the second relies on the property that the eyes are darker than the bridge of the nose. But the same windows applied to the cheeks or any other part of the face are irrelevant. So how do we select the best features out of 160,000+? This is achieved by AdaBoost.

To do this, we apply each and every feature to all the training images. For each feature, AdaBoost finds the best threshold that will classify the faces as positive and negative. Obviously, there will be errors or misclassifications. We select the features with the minimum error rate, which means they are the features that best separate the face and non-face images.

Note: The process is not as simple as this. Each image is given an equal weight in the beginning. After each classification, the weights of misclassified images are increased, and the same process is repeated: new error rates are calculated with the new weights. This continues until the required accuracy or error rate is achieved, or the required number of features is found.

The final classifier is a weighted sum of these weak classifiers. They are called weak because each alone can't classify the image, but together they form a strong classifier. The paper says that even 200 features provide detection with 95% accuracy. The authors' final setup had around 6,000 features. (Imagine a reduction from 160,000+ features to 6,000. That is a big gain.)

Face detection framework using the Haar cascade and AdaBoost algorithm

So now, you take an image, take each 24x24 window, apply the 6,000 features to it, and check whether it is a face or not. Wow! Isn't this a little inefficient and time-consuming? Yes, it is, and the authors of the algorithm have a good solution for that. In an image, most of the image region is non-face, so it is a better idea to have a simple method to verify that a window is not a face region. If it is not, discard it in a single shot and don't process it again; instead, focus on the regions where there can be a face. This way, we spend more time checking possible face regions.

For this, they introduced the concept of a cascade of classifiers. Instead of applying all 6,000 features to a window, we group the features into different stages of classifiers and apply them one by one (normally, the first few stages contain very few features). If a window fails the first stage, discard it; we don't consider the remaining features on it. If it passes, apply the second stage of features and continue the process. A window that passes all stages is a face region. How cool is the plan!

The authors' detector had 6,000+ features in 38 stages, with 1, 10, 25, 25, and 50 features in the first five stages (the two features in the preceding image were actually obtained as the best two features from AdaBoost). According to the authors, on average, 10 features out of the 6,000+ are evaluated per sub-window.

So this is a simple, intuitive explanation of how Viola-Jones face detection works. Read the paper for more details. If you found this post useful, do check out the book Ensemble Machine Learning to learn more about different machine learning aspects such as bagging, boosting, and stacking.


5 polarizing quotes from Professor Stephen Hawking on artificial intelligence

Richard Gall
15 Mar 2018
3 min read
Professor Stephen Hawking died today (March 14, 2018), aged 76, at his home in Cambridge, UK. Best known for his theory of cosmology that unified quantum mechanics with Einstein's General Theory of Relativity, and for his book A Brief History of Time that brought his concepts to a wider general audience, Professor Hawking is quite possibly one of the most important and well-known voices in the scientific world.

Among many things, Professor Hawking had a lot to say about artificial intelligence - its dangers, its opportunities, and what we should be thinking about, not just as scientists and technologists, but as humans. Over the years, Hawking remained cautious and consistent in his views on the topic, constantly urging AI researchers and machine learning developers to consider the wider implications of their work on society and the human race itself. The machine learning community is quite divided on all the issues Hawking raised and will probably continue to be so as the field grows faster than it can be fathomed.

Here are 5 widely debated things Stephen Hawking said about AI, arranged in chronological order - and if you're going to listen to anyone, you've surely got to listen to him.

On artificial intelligence ending the human race

"The development of full artificial intelligence could spell the end of the human race… It would take off on its own, and re-design itself at an ever-increasing rate. Humans, who are limited by slow biological evolution, couldn't compete and would be superseded."

From an interview with the BBC, December 2014

On the future of AI research

"The establishment of shared theoretical frameworks, combined with the availability of data and processing power, has yielded remarkable successes in various component tasks such as speech recognition, image classification, autonomous vehicles, machine translation, legged locomotion, and question-answering systems. As capabilities in these areas and others cross the threshold from laboratory research to economically valuable technologies, a virtuous cycle takes hold whereby even small improvements in performance are worth large sums of money, prompting greater investments in research. There is now a broad consensus that AI research is progressing steadily, and that its impact on society is likely to increase... Because of the great potential of AI, it is important to research how to reap its benefits while avoiding potential pitfalls."

From Research Priorities for Robust and Beneficial Artificial Intelligence, an open letter co-signed by Hawking, January 2015

On AI emulating human intelligence

"I believe there is no deep difference between what can be achieved by a biological brain and what can be achieved by a computer. It, therefore, follows that computers can, in theory, emulate human intelligence - and exceed it."

From a speech given by Hawking at the opening of the Leverhulme Centre for the Future of Intelligence, Cambridge, UK, October 2016

On making artificial intelligence benefit humanity

"Perhaps we should all stop for a moment and focus not only on making our AI better and more successful but also on the benefit of humanity."

From a speech given by Hawking at Web Summit in Lisbon, November 2017

On AI replacing humans

"The genie is out of the bottle. We need to move forward on artificial intelligence development but we also need to be mindful of its very real dangers. I fear that AI may replace humans altogether. If people design computer viruses, someone will design AI that replicates itself.
This will be a new form of life that will outperform humans."

From an interview with Wired, November 2017

Implementing matrix operations using SciPy and NumPy

Pravin Dhandre
07 Mar 2018
5 min read
[box type="note" align="" class="" width=""]This article is an excerpt from a book co-authored by L. Felipe Martins, Ruben Oliva Ramos and V Kishore Ayyadevara titled SciPy Recipes. This book includes hands-on recipes for using different components of the SciPy Stack such as NumPy, SciPy, matplotlib, pandas, etc.[/box] In this article, we will discuss how to leverage the power of SciPy and NumPy to perform numerous matrix operations and solve common challenges faced while proceeding with statistical analysis. Matrix operations and functions on two-dimensional arrays Basic matrix operations form the backbone of quite a few statistical analyses—for example, neural networks. In this section, we will be covering some of the most used operations and functions on 2D arrays: Addition Multiplication by scalar Matrix arithmetic Matrix-matrix multiplication Matrix inversion Matrix transposition In the following sections, we will look into the methods of implementing each of them in Python using SciPy/NumPy. How to do it… Let's look at the different methods. Matrix addition In order to understand how matrix addition is done, we will first initialize two arrays: # Initializing an array x = np.array([[1, 1], [2, 2]]) y = np.array([[10, 10], [20, 20]]) Similar to what we saw in a previous chapter, we initialize a 2 x 2 array by using the np.array function. There are two methods by which we can add two arrays. Method 1 A simple addition of the two arrays x and y can be performed as follows: x+y Note that x evaluates to: [[1 1] [2 2]] y evaluates to: [[10 10] [20 20]] The result of x+y would be equal to: [[1+10 1+10] [2+20 2+20]] Finally, this gets evaluated to: [[11 11] [22 22]] Method 2 The same preceding operation can also be performed by using the add function in the numpy package as follows: np.add(x,y) Multiplication by a scalar Matrix multiplication by a scalar can be performed by multiplying the vector with a number. We will perform the same using the following two steps: Initialize a two-dimensional array. Multiply the two-dimensional array with a scalar. We perform the steps, as follows: To initialize a two-dimensional array: x = np.array([[1, 1], [2, 2]]) To multiply the two-dimensional array with the k scalar: k*x For example, if the scalar value k = 2, then the value of k*x translates to: 2*x array([[2, 2], [4, 4]]) Matrix arithmetic Standard arithmetic operators can be performed on top of NumPy arrays too. The operations used most often are: Addition Subtraction Multiplication Division Exponentials The other major arithmetic operations are similar to the addition operation we performed on two matrices in the Matrix addition section earlier: # subtraction x-y array([[ -9, -9], [-18, -18]]) # multiplication x*y array([[10, 10], [40, 40]]) While performing multiplication here, there is an element to element multiplication between the two matrices and not a matrix multiplication (more on matrix multiplication in the next section): # division x/y array([[ 0.1, 0.1], [ 0.1, 0.1]]) # exponential x**y array([[ 1, 1], [1048576, 1048576]], dtype=int32) Matrix-matrix multiplication Matrix to matrix multiplication works in the following way: We have a set of two matrices with the following shape: Matrix A has n rows and m columns and matrix B has m rows and p columns. 
Matrix-matrix multiplication

Matrix-to-matrix multiplication works in the following way: suppose we have two matrices, where matrix A has n rows and m columns and matrix B has m rows and p columns. The matrix product C = AB then has n rows and p columns, with each entry calculated as C[i, j] = A[i, 1]*B[1, j] + A[i, 2]*B[2, j] + ... + A[i, m]*B[m, j]. The matrix operation is performed by using the built-in dot function available in NumPy, as follows:

1. Initialize the arrays:

x = np.array([[1, 1], [2, 2]])
y = np.array([[10, 10], [20, 20]])

2. Perform the matrix multiplication using the dot function in the numpy package:

np.dot(x, y)
array([[30, 30],
       [60, 60]])

The np.dot function does the multiplication in the following way:

array([[1*10 + 1*20, 1*10 + 1*20],
       [2*10 + 2*20, 2*10 + 2*20]])

Whenever matrix multiplication happens, the number of columns in the first matrix should be equal to the number of rows in the second matrix.

Matrix transposition

Matrix transposition is performed by using the transpose function available in the numpy package. The process to generate the transpose of a matrix is as follows:

1. Initialize a matrix:

A = np.array([[1, 2], [3, 4]])

2. Calculate the transpose of the matrix:

A.transpose()
array([[1, 3],
       [2, 4]])

The transpose of a matrix with m rows and n columns is a matrix with n rows and m columns.

Matrix inversion

While we performed most of the basic arithmetic operations on top of matrices earlier, we have not performed any of the specialist functions used within scientific computing and analysis - for example, matrix inversion, ranking of a matrix, and so on. This is where the other functions available within the scipy package shine, in scenarios where more data manipulation is required beyond the standard operations. Matrix inversion can be performed by using the function available in scipy.linalg. The process to perform matrix inversion and its implementation in Python is as follows:

1. Import the relevant function from within the package:

from scipy import linalg

2. Initialize a matrix:

A = np.array([[1, 2], [3, 4]])

3. Pass the initialized matrix through the inverse function of the package:

linalg.inv(A)
array([[-2. ,  1. ],
       [ 1.5, -0.5]])

We saw how to easily implement all the basic matrix operations with Python's scientific library, SciPy. You may check out the book SciPy Recipes to perform advanced computing tasks like discrete Fourier transforms and K-means with the SciPy stack.

MongoDB is going to acquire Realm, the mobile database management system, for $39 million

Richard Gall
25 Apr 2019
3 min read
MongoDB, the open source NoSQL database, is going to acquire mobile database platform Realm. The purchase is certainly one with clear technological and strategic benefits for both companies - and with MongoDB paying $39 million for a company that has up to now raised $40 million since its launch in 2011, it's clear that this is a move that isn't about short-term commercial gains.

It's important to note that the acquisition is not yet complete. It's expected to close in the second quarter of MongoDB's fiscal year 2020 (a fiscal year that ends in January 2020). Further details about the acquisition, and what it means for both products, will be revealed at MongoDB World in June.

Why is MongoDB acquiring Realm?

In the materials that announce the acquisition there's a lot of talk about the alignment between the two projects. "The best thing in the world is when someone just gets you, and you get them," MongoDB CTO Eliot Horowitz wrote in a blog post accompanying the announcement, "because when you share a vision of the world like that, you can do incredible things together. That's exactly the case with MongoDB and Realm."

At a more fundamental level, the acquisition allows MongoDB to do a number of things. It can reach a new community of developers working primarily in mobile development (according to the press release, Realm has 100,000 active users), but it also allows MongoDB to strengthen its capabilities as cloud evolves to become the dominant way that applications are built and hosted. According to Dev Ittycheria, MongoDB President and CEO, Realm "is a natural fit for our global cloud database, MongoDB Atlas, as well as a complement to Stitch, our serverless platform."

Serverless might well be a nascent trend at the moment, but the level of conversation and interest around it indicates that it's going to play a big part in application developers' lives in the months and years to come.

What's in it for Realm?

For Realm, the acquisition will give the project access to a new pool of users. Backing from MongoDB also provides robust foundations for the project to extend its roadmap and move even faster than it previously could have. Realm CEO David Ratner wrote yesterday (April 24) that: "The combination of MongoDB and Realm will establish the modern standard for mobile application development and data synchronization for a new generation of connected applications and services. MongoDB and Realm are fully committed to investing in the Realm Database and the future of data synchronization, and taking both to the next phase of their evolution. We believe that MongoDB will help accelerate Realm's product roadmap, go-to-market execution, and support our customers' use cases at a whole new level of global operational scale."

A new chapter for MongoDB?

2019 hasn't been the best year for MongoDB so far. The project withdrew its submission for its controversial Server Side Public License last month, following news that Red Hat was dropping MongoDB from Enterprise Linux and Fedora. This brought an initiative that the leadership viewed as strategically important in defending MongoDB's interests to a dramatic halt. However, the Realm acquisition sets up a new chapter and could go some way in helping MongoDB bolster itself for a future it has felt uncertain about.

Vertex AI Workbench: Your Complete Guide to Scaling Machine Learning with Google Cloud

Jasmeet Bhatia, Kartik Chaudhary
04 Nov 2024
15 min read
This article is an excerpt from the book, "The Definitive Guide to Google Vertex AI", by Jasmeet Bhatia and Kartik Chaudhary. The Definitive Guide to Google Vertex AI is for ML practitioners who want to learn Google best practices, MLOps tooling, and turnkey AI solutions for solving large-scale real-world AI/ML problems. This book takes a hands-on approach to help you become an ML rockstar on Google Cloud Platform in no time.

Introduction

While working on an ML project, if we are running a Jupyter Notebook in a local environment, or using a web-based Colab- or Kaggle-like kernel, we can perform some quick experiments and get some initial accuracy or results from ML algorithms very fast. But we hit a wall when it comes to performing large-scale experiments, launching long-running jobs, hosting a model, and also in the case of model monitoring. Additionally, if the data related to a project requires more granular permissions on security and privacy (fine-grained control over who can view/access the data), that's not feasible in local or Colab-like environments. All these challenges can be solved just by moving to the cloud.

Vertex AI Workbench within Google Cloud is a JupyterLab-based environment that can be leveraged for all kinds of development needs of a typical data science project. The JupyterLab environment is very similar to the Jupyter Notebook environment, and thus we will be using these terms interchangeably throughout the book. Vertex AI Workbench has options for creating managed notebook instances as well as user-managed notebook instances. User-managed notebook instances give more control to the user, while managed notebooks come with some key extra features. We will discuss these in more detail later in this section.

Some key features of the Vertex AI Workbench notebook suite include the following:

Fully managed – Vertex AI Workbench provides a Jupyter Notebook-based, fully managed environment that provides enterprise-level scale without managing infrastructure, security, and user-management capabilities.
Interactive experience – Data exploration and model experiments are easier, as managed notebooks can easily interact with other Google Cloud services such as storage systems, big data solutions, and so on.
Prototype to production AI – Vertex AI notebooks can easily interact with other Vertex AI tools and Google Cloud services, and thus provide an environment to run end-to-end ML projects from development to deployment with minimal transition.
Multi-kernel support – Workbench provides multi-kernel support in a single managed notebook instance, including kernels for tools such as TensorFlow, PyTorch, Spark, and R. Each of these kernels comes with useful ML libraries pre-installed and lets us install additional libraries as required.
Scheduling notebooks – Vertex AI Workbench lets us schedule notebook runs on an ad hoc and recurring basis. This functionality is quite useful for setting up and running large-scale experiments quickly. This feature is available through managed notebook instances. More information will be provided on this in the coming sections.

With this background, we can now start working with Jupyter Notebooks on Vertex AI Workbench. The next section provides basic guidelines for getting started with notebooks on Vertex AI.

Getting started with Vertex AI Workbench

Go to the Google Cloud console and open Vertex AI from the products menu on the left pane, or by using the search bar at the top.
Inside Vertex AI, click on Workbench, and it will open a page very similar to the one shown in Figure 4.3. More information on this is available in the official documentation (https://cloud.google.com/vertex-ai/docs/workbench/introduction).

Figure 4.3 – Vertex AI Workbench UI within the Google Cloud console

As we can see, Vertex AI Workbench is basically Jupyter Notebook as a service with the flexibility of working with managed as well as user-managed notebooks. User-managed notebooks are suitable for use cases where we need a more customized environment with relatively higher control. Another good thing about user-managed notebooks is that we can choose a suitable Docker container based on our development needs; these notebooks also let us change the type/size of the instance later on with a restart.

To choose the best Jupyter Notebook option for a particular project, it's important to know the common differences between the two solutions. Table 4.1 describes some common differences between fully managed and user-managed notebooks:

Table 4.1 – Differences between managed and user-managed notebook instances

Let's create one user-managed notebook to check the available options:

Figure 4.4 – Jupyter Notebook kernel configurations

As we can see in the preceding screenshot, user-managed notebook instances come with several customized image options to choose from. Along with support for tools such as TensorFlow Enterprise, PyTorch, JAX, and so on, it also lets us decide whether we want to work with GPUs (which can, of course, be changed later as per our needs). These customized images come with all the useful libraries for the desired framework pre-installed, plus the flexibility to install any third-party packages within the instance.

After choosing the appropriate image, we get more options to customize things such as the notebook name, notebook region, operating system, environment, machine type, accelerators, and so on (see the following screenshot):

Figure 4.5 – Configuring a new user-managed Jupyter Notebook

Once we click on the CREATE button, it can take a couple of minutes to create a notebook instance. Once it is ready, we can launch the Jupyter instance in a browser tab using the link provided inside Workbench (see Figure 4.6). We also get the option to stop the notebook when we are not using it (to reduce cost):

Figure 4.6 – A running Jupyter Notebook instance

This Jupyter instance can be accessed by all team members having access to Workbench, which helps in collaborating and sharing progress with other teammates. Once we click on OPEN JUPYTERLAB, it opens a familiar Jupyter environment in a new tab (see Figure 4.7):

Figure 4.7 – A user-managed JupyterLab instance in Vertex AI Workbench

A Google-managed JupyterLab instance also looks very similar (see Figure 4.8):

Figure 4.8 – A Google-managed JupyterLab instance in Vertex AI Workbench

Now that we can access the notebook instance in the browser, we can launch a new Jupyter Notebook or terminal and get started on the project. After providing sufficient permissions to the service account, many useful Google Cloud services such as BigQuery, GCS, Dataflow, and so on can be accessed from the Jupyter Notebook itself using SDKs. This makes Vertex AI Workbench a one-stop tool for every ML development need.

Note: We should stop Vertex AI Workbench instances when we are not using them or don't plan to use them for a long period of time.
This will help prevent us from incurring the costs of running them unnecessarily for a long period of time.

In the next sections, we will learn how to create notebooks using custom containers and how to schedule notebooks with Vertex AI Workbench.

Custom containers for Vertex AI Workbench

Vertex AI Workbench gives us the flexibility of creating notebook instances based on a custom container as well. The main advantage of a custom container-based notebook is that it lets us customize the notebook environment based on our specific needs. Suppose we want to work with a new TensorFlow version (or any other library) that is currently not available as a predefined kernel. We can create a custom Docker container with the required version and launch a Workbench instance using this container. Custom containers are supported by both managed and user-managed notebooks.

Here is how to launch a user-managed notebook instance using a custom container:

1. The first step is to create a custom container based on the requirements. Most of the time, a derivative container (a container based on an existing DL container image) is easy to set up. See the following example Dockerfile; here, we first pull an existing TensorFlow GPU image and then upgrade TensorFlow with pip:

FROM gcr.io/deeplearning-platform-release/tf-gpu:latest
RUN pip install --upgrade tensorflow

2. Next, build and push the container image to Container Registry, such that it is accessible to the Google Compute Engine (GCE) service account. See the following commands to build and push the container image:

export PROJECT=$(gcloud config list project --format "value(core.project)")
docker build . -f Dockerfile.example -t "gcr.io/${PROJECT}/tf-custom:latest"
docker push "gcr.io/${PROJECT}/tf-custom:latest"

Note that the service account should be provided with sufficient permissions to build and push the image to the container registry, and the respective APIs should be enabled.

3. Go to the User-managed notebooks page, click on the New Notebook button, and then select Customize. Provide a notebook name and select appropriate Region and Zone values.

4. In the Environment field, select Custom Container.

5. In the Docker Container Image field, enter the address of the custom image; in our case, it would look like this: gcr.io/${PROJECT}/tf-custom:latest

6. Make the remaining appropriate selections and click the Create button.

We are all set now. While launching the notebook, we can select the custom container as a kernel and start working in the custom environment.

Conclusion

Vertex AI Workbench stands out as a powerful, cloud-based environment that streamlines machine learning development and deployment. By leveraging its managed and user-managed notebook options, teams can overcome local development limitations, ensuring better scalability, enhanced security, and integrated access to Google Cloud services. This guide has explored the foundational aspects of working with Vertex AI Workbench, including its customizable environments, scheduling features, and the use of custom containers. With Vertex AI Workbench, data scientists and ML practitioners can focus on innovation and productivity, confidently handling projects from inception to production.

Author Bio

Jasmeet Bhatia is a machine learning solution architect with over 18 years of industry experience, with the last 10 years focused on global-scale data analytics and machine learning solutions.
In his current role at Google, he works closely with key GCP enterprise customers to provide them guidance on how to best use Google's cutting-edge machine learning products. At Google, he has also worked as part of the Area 120 incubator on building innovative data products such as Demand Signals, and he has been involved in the launch of Google products such as Time Series Insights. Before Google, he worked in similar roles at Microsoft and Deloitte.

When not immersed in technology, he loves spending time with his wife and two daughters, reading books, watching movies, and exploring the scenic trails of southern California.

He holds a bachelor's degree in electronics engineering from Jamia Millia Islamia University in India and an MBA from the University of California Los Angeles (UCLA) Anderson School of Management.

Kartik Chaudhary is an AI enthusiast, educator, and ML professional with 6+ years of industry experience. He currently works as a senior AI engineer with Google to design and architect ML solutions for Google's strategic customers, leveraging core Google products, frameworks, and AI tools. He previously worked with UHG as a data scientist, and helped in making the healthcare system work better for everyone. Kartik has filed nine patents at the intersection of AI and healthcare.

Kartik loves sharing knowledge and runs his own blog on AI, titled Drops of AI.

Away from work, he loves watching anime and movies and capturing the beauty of sunsets.

How greedy algorithms work

Richard Gall
10 Apr 2018
2 min read
What is a greedy algorithm?

Greedy algorithms are useful for optimization problems. They make the optimal choice at a localized and immediate level, with the aim that this leads you to the overall optimal solution. It's important to note that they don't always find you the best solution for the data science problem you're trying to solve - so apply them wisely. In the video tutorial below, from Fundamental Algorithms in Scala, you'll learn when and how to apply a simple greedy algorithm, with examples of both an iterative algorithm and a recursive algorithm in action.

The advantages and disadvantages of greedy algorithms

Greedy algorithms have a number of advantages and disadvantages. While on the one hand it's relatively easy to come up with them, it is actually pretty challenging to identify the issues around the 'correctness' of your algorithm. That means that ultimately the optimization problem you're trying to solve by using greedy algorithms isn't really a technical issue as such. Instead, it's more of an issue with the scope and definition of your data analysis project. It's a human problem, not a mechanical one.

Different ways to apply greedy algorithms

There are a number of areas where greedy algorithms can most successfully be applied. In fact, it's worth exploring some of these problems if you want to get to know them in more detail. They should give you a clearer indication of how greedy algorithms work, what makes them useful, and their potential drawbacks. Some of the best examples are:

Huffman coding
Dijkstra's algorithm
The continuous knapsack problem

A minimal sketch of a greedy solution to the last of these appears after the reading lists below.

To learn more about other algorithms check out these articles:

4 popular algorithms for Distance-based outlier detection
10 machine learning algorithms every engineer needs to know
4 Clustering Algorithms every Data Scientist should know
Backpropagation Algorithm

To learn to implement specific algorithms, use these tutorials:

Creating a reference generator for a job portal using Breadth First Search (BFS) algorithm
Implementing gradient descent algorithm to solve optimization problems
Implementing face detection using the Haar Cascades and AdaBoost algorithm
Getting started with big data analysis using Google's PageRank algorithm
Implementing the k-nearest neighbors' algorithm in Python
Machine Learning Algorithms: Implementing Naive Bayes with Spark MLlib
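As promised above, here is a minimal, self-contained Python sketch of the greedy approach to the continuous (fractional) knapsack problem; the item values and weights are made up purely for illustration. The greedy criterion is to repeatedly take as much as possible of the item with the best value-to-weight ratio, and for the fractional variant this choice is provably optimal:

def fractional_knapsack(items, capacity):
    # items: list of (value, weight) pairs; capacity: maximum total weight
    total_value = 0.0
    # Sort by value-to-weight ratio, best first - the greedy choice.
    for value, weight in sorted(items, key=lambda vw: vw[0] / vw[1], reverse=True):
        if capacity <= 0:
            break
        take = min(weight, capacity)          # take all of it, or whatever still fits
        total_value += value * (take / weight)
        capacity -= take
    return total_value

# Hypothetical items as (value, weight) pairs
items = [(60, 10), (100, 20), (120, 30)]
print(fractional_knapsack(items, capacity=50))  # prints 240.0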

Installing NumPy, SciPy, matplotlib, and IPython

Packt Editorial Staff
12 Oct 2014
7 min read
This article, written by Ivan Idris, author of the book Python Data Analysis, will guide you through installing NumPy, SciPy, matplotlib, and IPython. We can find a mind map describing software that can be used for data analysis at https://www.xmind.net/m/WvfC/. Obviously, we can't install all of this software in this article. We will install NumPy, SciPy, matplotlib, and IPython on different operating systems.

[box type="info" align="" class="" width=""]Packt has the following books that are focused on NumPy:

NumPy Beginner's Guide Second Edition, Ivan Idris
NumPy Cookbook, Ivan Idris
Learning NumPy Array, Ivan Idris
[/box]

SciPy is a scientific Python library, which supplements and slightly overlaps NumPy. NumPy and SciPy historically shared their codebase but were later separated. matplotlib is a plotting library based on NumPy. IPython provides an architecture for interactive computing. The most notable part of this project is the IPython shell.

Software used

The software used in this article is based on Python, so it is required to have Python installed. On some operating systems, Python is already installed. You, however, need to check whether the Python version is compatible with the software version you want to install. There are many implementations of Python, including commercial implementations and distributions.

[box type="note" align="" class="" width=""]You can download Python from https://www.python.org/download/. On this website, we can find installers for Windows and Mac OS X, as well as source archives for Linux, Unix, and Mac OS X.[/box]

The software we will install has binary installers for Windows, various Linux distributions, and Mac OS X. There are also source distributions, if you prefer that. You need to have Python 2.4.x or above installed on your system. Python 2.7.x is currently the best Python version to have because most scientific Python libraries support it. Python 2.7 will be supported and maintained until 2020. After that, we will have to switch to Python 3.

Installing software and setup on Windows

Installing on Windows is, fortunately, a straightforward task that we will cover in detail. You only need to download an installer, and a wizard will guide you through the installation steps. We will give the steps to install NumPy here. The steps to install the other libraries are similar.

The actions we will take are as follows:

Download installers for Windows from the SourceForge website (refer to the following table). The latest release versions may change, so just choose the one that fits your setup best.

Library      URL                                                  Latest version
NumPy        http://sourceforge.net/projects/numpy/files/         1.8.1
SciPy        http://sourceforge.net/projects/scipy/files/         0.14.0
matplotlib   http://sourceforge.net/projects/matplotlib/files/    1.3.1
IPython      http://archive.ipython.org/release/                  2.0.0

Choose the appropriate version. In this example, we chose numpy-1.8.1-win32-superpack-python2.7.exe.

Open the EXE installer by double-clicking on it.

Now, we can see a description of NumPy and its features. Click on the Next button. If you have Python installed, it should automatically be detected. If it is not detected, maybe your path settings are wrong. Click on the Next button if Python is found; otherwise, click on the Cancel button and install Python (NumPy cannot be installed without Python).

Click on the Next button. This is the point of no return. Well, kind of, but it is best to make sure that you are installing to the proper directory, and so on. Now the real installation starts.
This may take a while.

[box type="note" align="" class="" width=""]The situation around installers is rapidly evolving. Other alternatives exist in various stages of maturity (see https://www.scipy.org/install.html). It might be necessary to put the msvcp71.dll file in your C:\Windows\system32 directory. You can get it from http://www.dll-files.com/dllindex/dll-files.shtml?msvcp71.[/box]

Installing software and setup on Linux

Installing the recommended software on Linux depends on the distribution you have. We will discuss how you would install NumPy from the command line, although you could probably use graphical installers; it depends on your distribution (distro). The commands to install matplotlib, SciPy, and IPython are the same - only the package names are different. Installing matplotlib, SciPy, and IPython is recommended, but optional.

Most Linux distributions have NumPy packages. We will go through the necessary steps for some of the popular Linux distros:

Run the following instruction from the command line to install NumPy on Red Hat:

$ yum install python-numpy

To install NumPy on Mandriva, run the following command-line instruction:

$ urpmi python-numpy

To install NumPy on Gentoo, run the following command-line instruction:

$ sudo emerge numpy

To install NumPy on Debian or Ubuntu, we need to type the following:

$ sudo apt-get install python-numpy

The following table gives an overview of the Linux distributions and corresponding package names for NumPy, SciPy, matplotlib, and IPython.

Linux distribution   NumPy                              SciPy          matplotlib          IPython
Arch Linux           python-numpy                       python-scipy   python-matplotlib   ipython
Debian               python-numpy                       python-scipy   python-matplotlib   ipython
Fedora               numpy                              python-scipy   python-matplotlib   ipython
Gentoo               dev-python/numpy                   scipy          matplotlib          ipython
OpenSUSE             python-numpy, python-numpy-devel   python-scipy   python-matplotlib   ipython
Slackware            numpy                              scipy          matplotlib          ipython

Installing software and setup on Mac OS X

You can install NumPy, matplotlib, and SciPy on the Mac with a graphical installer or from the command line with a port manager such as MacPorts, depending on your preference. A prerequisite is to install Xcode, as it is not part of OS X releases.

We will install NumPy with a GUI installer using the following steps:

We can get a NumPy installer from the SourceForge website http://sourceforge.net/projects/numpy/files/. Similar files exist for matplotlib and SciPy. Just change numpy in the previous URL to scipy or matplotlib. IPython didn't have a GUI installer at the time of writing.

Download the appropriate DMG file; usually the latest one is the best. Another alternative is the SciPy Superpack (https://github.com/fonnesbeck/ScipySuperpack). Whichever option you choose, it is important to make sure that updates which impact the system Python library don't negatively influence already installed software, by not building against the Python library provided by Apple.

Open the DMG file (in this example, numpy-1.8.1-py2.7-python.org-macosx10.6.dmg).

Double-click on the icon of the opened box, the one having a subscript that ends with .mpkg. We will be presented with the welcome screen of the installer.

Click on the Continue button to go to the Read Me screen, where we will be presented with a short description of NumPy.

Click on the Continue button to go to the License screen.

Read the license, click on the Continue button, and then click on the Accept button when prompted to accept the license. Continue through the next screens and click on the Finish button at the end.
Alternatively, we can install NumPy, SciPy, matplotlib, and IPython through the MacPorts route, or with Fink or Homebrew. The following installation step installs all of these packages.

[box type="info" align="" class="" width=""]For installing with MacPorts, type the following command:

sudo port install py-numpy py-scipy py-matplotlib py-ipython
[/box]

Installing with setuptools

If you have pip, you can install NumPy, SciPy, matplotlib, and IPython with the following commands:

pip install numpy
pip install scipy
pip install matplotlib
pip install ipython

It may be necessary to prepend sudo to these commands if your current user doesn't have sufficient rights on your system.

Summary

In this article, we installed NumPy, SciPy, matplotlib, and IPython on Windows, Mac OS X, and Linux; a short sanity-check script follows at the end of this article.

Resources for Article:

Further resources on this subject:

Plotting Charts with Images and Maps
Importing Dynamic Data
Python 3: Designing a Tasklist Application
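As mentioned in the summary, once the packages are installed (by whichever route), a quick sanity check is to confirm that each one imports cleanly and report its version. This is a minimal sketch; the versions printed will, of course, depend on what you installed:

# Sanity check: confirm each package imports and report its version.
import numpy
import scipy
import matplotlib
import IPython

print("NumPy:", numpy.__version__)
print("SciPy:", scipy.__version__)
print("matplotlib:", matplotlib.__version__)
print("IPython:", IPython.__version__)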

Different types of NoSQL databases and when to use them

Richard Gall
10 Sep 2019
8 min read
Why NoSQL databases?

The popularity of NoSQL databases over the last decade or so has been driven by an explosion of data. Before what's commonly described as 'the big data revolution', relational databases were the norm - these are databases that contain structured data. Structured data can only be structured if it is based on an existing schema that defines the relationships (hence relational) between the data inside the database. However, with the vast quantities of data that are now available to just about every business with an internet connection, relational databases simply aren't equipped to handle the complexity and scale of large datasets.

Why not SQL databases?

This is for a couple of reasons. Not only will the defined schemas that are a necessary component of every relational database undermine the richness and integrity of the data you're working with; relational databases are also hard to scale. Relational databases can only scale vertically, not horizontally. That's fine to a certain extent, but when you start getting high volumes of data - such as when millions of people use a web application, for example - things get really slow and you need more processing power. You can do this by upgrading your hardware, but that isn't really sustainable. By scaling out, as you can with NoSQL databases, you can use a distributed network of computers to handle data. That gives you more speed and more flexibility.

This isn't to say that relational and SQL databases have had their day. They still fulfil many use cases. The difference is that NoSQL offers far greater power and control for data-intensive use cases. Indeed, using a NoSQL database when SQL will do is only going to add more complexity to something that just doesn't need it.

Seven NoSQL Databases in a Week

Different types of NoSQL databases and when to use them

So, now that we've looked at why NoSQL databases have grown in popularity in recent years, let's dig into some of the different options available. There are a huge number of NoSQL databases out there - some of them open source, some premium products - many of them built for very different purposes. Broadly speaking, there are four different models of NoSQL databases:

Key-value pair-based databases
Column-based databases
Document-oriented databases
Graph databases

Let's take a look at these four models, how they're different from one another, and some examples of the product options in each.

Key-value pair-based NoSQL database management systems

Key-value pair-based NoSQL databases store data in, as you might expect, pairs of keys and values. Data is stored with a matching key - keys have no relation or structure (so, keys could be height, age, or hair color, for example). A short sketch at the end of this article shows the model in action.

When should you use a key-value pair-based NoSQL DBMS?

Key-value pair-based NoSQL databases are the most basic type of NoSQL database. They're useful for storing fairly basic information, like details about a customer.

Which key-value pair-based DBMS should you use?

There are a number of different key-value pair databases. The most popular is Redis. Redis is incredibly fast and very flexible in terms of the languages and tools it can be used with. It can be used for a wide variety of purposes - one of the reasons high-profile organizations, including Verizon, Atlassian, and Samsung, use it. It's also open source, with enterprise options available for users with significant requirements.

Redis 4.x Cookbook

Other than Redis, other options include Memcached and Ehcache.
As well as those, there are a number of other multi-model options (which will crop up later, no doubt) such as Amazon DynamoDB, Microsoft's Cosmos DB, and OrientDB.

Hands-On Amazon DynamoDB for Developers [Video]
RDS PostgreSQL and DynamoDB CRUD: AWS with Python and Boto3 [Video]

Column-based NoSQL database management systems

Column-based databases separate data into discrete columns. Instead of using rows - whereby the row ID is the main key - column-based database systems flip things around to make the data the main key. By using columns you can gain much greater speed when querying data. Although it's true that querying a whole row of data would take longer in a column-based DBMS, the use cases for column-based databases mean you probably won't be doing this. Instead you'll be querying a specific part of the data rather than the whole row.

When should you use a column-based NoSQL DBMS?

Column-based systems are most appropriate for big data and instances where data is relatively simple and consistent (they don't handle volatility particularly well).

Which column-based NoSQL DBMS should you use?

The most popular column-based DBMS is Cassandra. The software prides itself on its performance, boasting 100% availability thanks to the absence of a single point of failure, and offering impressive scalability at a good price. Cassandra's popularity speaks for itself - Cassandra is used by 40% of the Fortune 100.

Mastering Apache Cassandra 3.x - Third Edition
Learn Apache Cassandra in Just 2 Hours [Video]

There are other options available, such as HBase and Cosmos DB.

HBase High Performance Cookbook

Document-oriented NoSQL database management systems

Document-oriented NoSQL systems are very similar to key-value pair database management systems. The only difference is that the value that is paired with a key is stored as a document. Each document is self-contained, which means no schema is required - giving a significant degree of flexibility over the data you have. For software developers, this is essential - it's for this reason that document-oriented databases such as MongoDB and CouchDB are useful components of the full-stack development toolchain.

Some search platforms, such as Elasticsearch, use mechanisms similar to standard document-oriented systems - so they could be considered part of the same family of database management systems.

When should you use a document-oriented DBMS?

Document-oriented databases can help power many different types of websites and applications - from stores to content systems. However, the flexibility of document-oriented systems means they are not built for complex queries.

Which document-oriented DBMS should you use?

The leader in this space is MongoDB. With an amazing 40 million downloads (and apparently 30,000 more every single day), it's clear that MongoDB is a cornerstone of the NoSQL database revolution.

MongoDB 4 Quick Start Guide
MongoDB Administrator's Guide
MongoDB Cookbook - Second Edition

There are other options as well as MongoDB - these include CouchDB, Couchbase, DynamoDB, and Cosmos DB.

Learning Azure Cosmos DB
Guide to NoSQL with Azure Cosmos DB

Graph-based NoSQL database management systems

The final type of NoSQL database is graph-based. The notable distinction of graph-based NoSQL databases is that they contain the relationships between different data. Subsequently, graph databases look quite different to any of the other databases above - they store data as nodes, with the 'edges' of the nodes describing their relationship to other nodes.
Graph databases, compared to relational databases, are multidimensional in nature. They display not just basic relationships between tables and data, but more complex and multifaceted ones.

When should you use a graph database?

Because graph databases contain the relationships between a set of data (customers, products, price, and so on), they can be used to build and model networks. This makes graph databases extremely useful for applications ranging from fraud detection to smart homes to search.

Which graph database should you use?

The world's most popular graph database is Neo4j. It's purpose-built for datasets that contain strong relationships and connections. Widely used in industry at companies such as eBay and Walmart, it has established its reputation as one of the world's best NoSQL database products. Back in 2015, Packt's Data Scientist demonstrated how he used Neo4j to build a graph application.

Learning Neo4j 3.x [Video]
Exploring Graph Algorithms with Neo4j [Video]

NoSQL databases are the future - but know when to use the right one for the job

Although NoSQL databases will remain a fixture in the engineering world, SQL databases will always be around. This is an important point - when it comes to databases, using the right tool for the job is essential. It's a valuable exercise to explore a range of options and get to know how they work - sometimes the difference might just be a personal preference about usability. And that's fine - you need to be productive, after all. But what's ultimately most essential is having a clear sense of what you're trying to accomplish, and choosing the database based on your fundamental needs.
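To make the key-value model from earlier in this article concrete, here is a minimal sketch using the redis-py client. It assumes a Redis server is running locally on the default port, and the keys and values are purely illustrative:

# Minimal key-value example with redis-py (pip install redis).
# Assumes a local Redis server on the default port 6379.
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# Store a few simple attributes about a customer as key-value pairs.
r.set("customer:1:name", "Mary")
r.set("customer:1:hair_color", "brown")

print(r.get("customer:1:name"))        # Mary
print(r.get("customer:1:hair_color"))  # brown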

Top 10 deep learning frameworks

Amey Varangaonkar
25 May 2017
9 min read
Deep learning frameworks are powering the artificial intelligence revolution. Without them, it would be almost impossible for data scientists to deliver the level of sophistication in their deep learning algorithms that advances in computing and processing power have made possible.

Put simply, deep learning frameworks make it easier to build deep learning algorithms of considerable complexity. This follows a wider trend that you can see in other fields of programming and software engineering; open source communities are continually developing new tools that simplify difficult tasks and minimize arduous ones. The deep learning framework you choose to use is ultimately down to what you're trying to do and how you work already. But to get you started, here is a list of 10 of the best and most popular deep learning frameworks being used today.

What are the best deep learning frameworks?

TensorFlow

One of the most popular deep learning libraries out there, TensorFlow was developed by the Google Brain team and open-sourced in 2015. Positioned as a 'second-generation machine learning system', TensorFlow is a Python-based library capable of running on multiple CPUs and GPUs. It is available on all platforms, desktop and mobile. It also has support for other languages such as C++ and R, and can be used directly to create deep learning models, or by using wrapper libraries (for example, Keras) on top of it.

In November 2017, TensorFlow announced a developer preview of TensorFlow Lite, a lightweight machine learning solution for mobile and embedded devices. The machine learning paradigm is continuously evolving - and the focus is now slowly shifting towards developing machine learning models that run on mobile and portable devices, in order to make applications smarter and more intelligent. Learn how to build a neural network with TensorFlow.

If you're just starting out with deep learning, TensorFlow is THE go-to framework. It's Python-based, backed by Google, has very good documentation, and there are tons of tutorials and videos available on the internet to guide you. You can check out Packt's TensorFlow catalog here.

Keras

Although TensorFlow is a very good deep learning library, creating models using only TensorFlow can be a challenge, as it is a pretty low-level library and can be quite complex for a beginner to use. To tackle this challenge, Keras was built as a simplified interface for building efficient neural networks in just a few lines of code, and it can be configured to work on top of TensorFlow. Written in Python, Keras is very lightweight, easy to use, and pretty straightforward to learn. Because of this, TensorFlow has incorporated Keras as part of its core API. (For a sense of how little code a Keras model takes, see the short sketch at the end of this article.) Despite being a relatively new library, Keras has very good documentation in place. If you want to know more about how Keras solves your deep learning problems, this interview with our best-selling author Sujit Pal should help you.

Read now: Why you should use Keras for deep learning

[box type="success" align="" class="" width=""]If you have some knowledge of Python programming and want to get started with deep learning, this is one library you definitely want to check out![/box]

Caffe

Built with expression, speed, and modularity in mind, Caffe is one of the first deep learning libraries, developed mainly by the Berkeley Vision and Learning Center (BVLC). It is a C++ library which also has a Python interface, and finds its primary application in modeling convolutional neural networks.
One of the major benefits of using this library is that you can get a number of pre-trained networks directly from the Caffe Model Zoo, available for immediate use. If you're interested in modeling CNNs or solving image processing problems, you might want to consider this library.

Following in the footsteps of Caffe, Facebook also recently open-sourced Caffe2, a new lightweight, modular deep learning framework which offers greater flexibility for building high-performance deep learning models.

Torch

Torch is a Lua-based deep learning framework and has been used and developed by big players such as Facebook, Twitter, and Google. It makes use of C/C++ libraries as well as CUDA for GPU processing. Torch was built with an aim to achieve maximum flexibility and make the process of building your models extremely simple. More recently, the Python implementation of Torch, called PyTorch, has found popularity and is gaining rapid adoption.

PyTorch

PyTorch is a Python package for building deep neural networks and performing complex tensor computations. While Torch uses Lua, PyTorch leverages the rising popularity of Python to allow anyone with some basic Python programming knowledge to get started with deep learning. PyTorch improves upon Torch's architectural style and does not have any support for containers - which makes the entire deep modeling process easier and more transparent to you. Still wondering how PyTorch and Torch are different from each other? Make sure you check out this interesting post on Quora.

Deeplearning4j

Deeplearning4j (or DL4J) is a popular deep learning framework developed in Java, and it supports other JVM languages as well. It is very slick and is widely used as a commercial, industry-focused distributed deep learning platform. The advantage of using DL4J is that you can bring together the power of the whole Java ecosystem to perform efficient deep learning, as it can be implemented on top of popular big data tools such as Apache Hadoop and Apache Spark.

[box type="success" align="" class="" width=""]If Java is your programming language of choice, then you should definitely check out this framework. It is clean, enterprise-ready, and highly effective. If you're planning to deploy your deep learning models to production, this tool can certainly be of great worth![/box]

MXNet

MXNet supports a wider range of languages than most deep learning frameworks, with support for languages such as R, Python, C++, and Julia. This is helpful because if you know any of these languages, you won't need to step out of your comfort zone at all to train your deep learning models. Its backend is written in C++ and CUDA, and it is able to manage its own memory like Theano. MXNet is also popular because it scales very well and is able to work with multiple GPUs and computers, which makes it very useful for enterprises. This is also one of the reasons why Amazon made MXNet its reference library for deep learning. In November, AWS announced the availability of ONNX-MXNet, an open source Python package to import ONNX (Open Neural Network Exchange) deep learning models into Apache MXNet.

Read why MXNet is a versatile deep learning framework here.

Microsoft Cognitive Toolkit

Microsoft Cognitive Toolkit, previously known by its acronym CNTK, is an open-source deep learning toolkit to train deep learning models. It is highly optimized and has support for languages such as Python and C++.
Known for its efficient resource utilization, the Cognitive Toolkit lets you easily implement efficient reinforcement learning models or generative adversarial networks (GANs). It is designed to achieve high scalability and performance, and is known to provide performance gains compared to toolkits like Theano and TensorFlow when running on multiple machines. Here is a fun comparison of TensorFlow versus CNTK, if you would like to know more.

deeplearn.js

Gone are the days when you required serious hardware to run your complex machine learning models. With deeplearn.js, you can now train neural network models right in your browser! Originally developed by the Google Brain team, deeplearn.js is an open-source, JavaScript-based deep learning library which runs on both WebGL 1.0 and WebGL 2.0. deeplearn.js is being used today for a variety of purposes - from education and research to training high-performance deep learning models. You can also run your pre-trained models in the browser using this library.

BigDL

BigDL is a distributed deep learning library for Apache Spark, and it is designed to scale very well. With the help of BigDL, you can run your deep learning applications directly on Spark or Hadoop clusters by writing them as Spark programs. It has rich deep learning support and uses Intel's Math Kernel Library (MKL) to ensure high performance. Using BigDL, you can also load your pre-trained Torch or Caffe models into Spark. If you want to add deep learning functionality to a massive set of data stored on your cluster, this is a very good library to use.

[box type="shadow" align="" class="" width=""]Editor's Note: We have removed Theano and Lasagne from the original list due to the Theano retirement announcement.

RIP Theano!

Before TensorFlow, Caffe, or PyTorch came to be, Theano was the most widely used library for deep learning. While it was a low-level library supporting CPU as well as GPU computations, you could wrap it with libraries like Keras to simplify the deep learning process. With the release of version 1.0, it was announced that future development and support for Theano would be stopped. There would be minimal maintenance to keep it working for the next one year, after which even the support activities on the library would be suspended completely. "Supporting Theano is no longer the best way we can enable the emergence and application of novel research ideas", said Prof. Yoshua Bengio, one of the main developers of Theano. Thank you Theano, you will be missed!

Goodbye Lasagne

Lasagne is a high-level deep learning library that runs on top of Theano. It has been around for quite some time now and was developed with the aim of abstracting the complexities of Theano and providing a more friendly interface to users for building and training neural networks. It requires Python and has many similarities to Keras, which we just saw above. However, if we are to find differences between the two, Keras is faster and has better documentation in place.[/box]

There are many other deep learning libraries and frameworks available for use today - DSSTNE, Apache Singa, and Veles are just a few worth an honorable mention. Which deep learning framework will best suit your needs? Ultimately, it depends on a number of factors. If you want to get started with deep learning, your safest bet would be to use a Python-based framework like TensorFlow, which is quite popular.
For seasoned professionals, the efficiency of the trained model, ease of use, speed, and resource utilization are all important considerations when choosing the best deep learning framework.
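As a postscript to the Keras section above, here is a minimal sketch of just how few lines a working Keras model definition takes; the layer sizes and input dimension are illustrative, not prescriptive:

# A tiny fully connected network in Keras (bundled with TensorFlow).
from tensorflow import keras

model = keras.Sequential([
    keras.layers.Dense(32, activation="relu", input_shape=(16,)),  # hidden layer
    keras.layers.Dense(10, activation="softmax"),                  # 10-class output
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.summary()  # the model is now ready for model.fit(features, labels)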

Airflow Ops Best Practices: Observation and Monitoring

Dylan Intorf, Kendrick van Doorn, Dylan Storey
12 Nov 2024
15 min read
This article is an excerpt from the book, "Apache Airflow Best Practices", by Dylan Intorf, Kendrick van Doorn, and Dylan Storey. With a practical approach and detailed examples, this book covers the newest features of Apache Airflow 2.x and its potential for workflow orchestration, operational best practices, and data engineering.

Introduction

In this article, we will continue to explore the application of modern "ops" practices within Apache Airflow, focusing on the observation and monitoring of your systems and DAGs after they've been deployed.

We'll divide this observation into two segments - the core Airflow system and individual DAGs. Each segment will cover specific metrics and measurements you should be monitoring for alerting and potential intervention.

When we discuss monitoring in this section, we will consider two types of monitoring - active and suppressive.

In an active monitoring scenario, a process will actively check a service's health state, recording its state and potentially taking action directly on the return value.

In a suppressive monitoring scenario, the absence of a state (or state change) is usually meaningful. In these scenarios, the monitored application sends an active signal to a process to inform it that it is OK, usually suppressing an action (such as an alert) from occurring.

This chapter covers the following topics:

Monitoring core Airflow components
Monitoring your DAGs

Technical requirements

By now, we expect you to have a good understanding of Airflow and its core components, along with functional knowledge in the deployment and operation of Airflow and Airflow DAGs.

We will not be covering specific observability aggregators or telemetry tools; instead, we will focus on the activities you should be keeping an eye on. We strongly recommend that you work closely with your ops teams to understand what tools exist in your stack and how to configure them for capturing and alerting on your deployments.

Monitoring core Airflow components

All of the components we will discuss here are critical to ensuring a functioning Airflow deployment. Generally, all of them should be monitored with a bare minimum check of "Is it on?", and if a component is not, an alert should surface to your team for investigation. The easiest way to check this is to query the REST API on the web server at `/health/`; this will return a JSON object that can be parsed to determine whether components are healthy and, if not, when they were last seen.

Scheduler

This component needs to be running and working effectively in order for tasks to be scheduled for execution.

When the scheduler service is started, it also starts a `/health` endpoint that can be checked by an external process with an active monitoring approach.

The returned signal does not always indicate that the scheduler is working properly, as its state is simply indicative that the service is up and running. There are many scenarios where the scheduler may be operating but unable to schedule jobs; as a result, many deployments will include a canary DAG with a single task, acting to suppress an external alert from going off.

Important metrics that Airflow exposes for you include the following:

scheduler.scheduler_loop_duration: This should be monitored to ensure that your scheduler is able to loop and schedule tasks for execution.
As this metric increases, you will see tasks beginning to schedule more slowly, to the point where you may begin missing SLAs because tasks fail to reach a schedulable state.

scheduler.tasks.starving: This indicates how many tasks cannot be scheduled because there are no slots available. Pools are a mechanism that Airflow uses to balance large numbers of submitted task executions against a finite amount of execution throughput. It is likely that this number will not be zero, but if it stays high for extended periods of time, it may point to an issue in how DAGs are being written to schedule work.

scheduler.tasks.executable: This indicates how many tasks are ready for execution (i.e., queued). This number will sometimes not be zero, and that is OK, but if the number increases and stays high for extended periods of time, it indicates that you may need additional compute resources to handle the load. Look at your executor to increase the number of workers it can run.

Metadata database

The metadata database is used to store and track all of the metadata for your Airflow deployment's previous DAG/task executions, along with information about your environment's roles and permissions. Losing data from this database can interrupt normal operations and cause unintended consequences, with DAG runs being repeated.

While critical, because it is architecturally ubiquitous, the database is also the component least likely to encounter issues - and if it does, those issues are absolutely catastrophic in nature.

We generally suggest you utilize a managed service for provisioning and operating your backing database, ensuring that a disaster recovery plan for your metadata database is in place at all times.

Some active areas to monitor on your database include the following:

Connection pool size/usage: Monitor both the connection pool size and usage over time to ensure appropriate configuration, and identify potential bottlenecks or resource contention arising from Airflow components' concurrent connections.

Query performance: Measure query latency to detect inefficient queries or performance issues, while monitoring query throughput to ensure effective workload handling by the database.

Storage metrics: Monitor the disk space utilization of the metadata database to ensure that it has sufficient storage capacity. Set up alerts for low disk space conditions to prevent database outages due to storage constraints.

Backup status: Monitor the status of database backups to ensure that they are performed regularly and successfully. Verify backup integrity and retention policies to mitigate the risk of data loss if there is a database failure.

Triggerer

The triggerer instance manages all of the asynchronous operations of deferrable operators in a deferred state. As such, major operational concerns generally relate to ensuring that individual deferred operators don't make major blocking calls to the event loop. If this occurs, your deferrable tasks will not be able to check their state changes as frequently, and this will impact scheduling performance.

Important metrics that Airflow exposes for you include the following:

triggers.blocked_main_thread: The number of triggers that have blocked the main thread. This is a counter and should monotonically increase over time; pay attention to large differences between recordings (or quick acceleration) in counts, as that is indicative of a larger problem.

triggers.running: The number of triggers currently on a triggerer instance.
This metric should be monitored to determine whether you need to increase the number of triggerer instances you are running. While the official documentation claims that up to tens of thousands of triggers can run on an instance, the common operational number is much lower. Tune at your discretion, but depending on the complexity of your triggers, you may need to add a new instance for every few hundred consistent triggers you run.

Executors/workers

Depending on the executor you use, you will need to monitor your executors and workers a bit differently.

The Kubernetes executor utilizes the Kubernetes API to schedule tasks for execution; as such, you should utilize the Kubernetes events and metrics servers to gather logs and metrics for your task instances. Common metrics to collect for an individual task are CPU and memory usage. This is crucial for tuning requests or mutating individual task resource requests to ensure that they execute safely.

The Celery worker has additional components and long-lived processes that you need to metricize. You should monitor an individual Celery worker's memory and CPU utilization to ensure that it is not over- or under-provisioned, tuning allocated resources accordingly. You also need to monitor the message broker (usually Redis or RabbitMQ) to ensure that it is appropriately sized. Finally, it is critical to measure the queue length of your message broker and ensure that too much "back pressure" isn't being created in the system. If you find that your tasks are sitting in a queued state for a long period of time and the queue length is consistently growing, it's a sign that you should start an additional Celery worker to execute the scheduled tasks. You should also investigate the native Celery monitoring tool Flower (https://flower.readthedocs.io/en/latest/) for additional, more nuanced methods of monitoring.

Web server

The Airflow web server is the UI for not just your Airflow deployment but also the RESTful interface. Especially if you happen to be controlling Airflow scheduling behavior with API calls, you should keep an eye on the following metrics:

Response time: Measure the time taken for the API to respond to requests. This metric indicates the overall performance of the API and can help identify potential bottlenecks.

Error rate: Monitor the rate of errors returned by the API, such as 4xx and 5xx HTTP status codes. High error rates may indicate issues with the API implementation or underlying systems.

Request rate: Track the rate of incoming requests to the API over time. Sudden spikes or drops in request rates can impact performance and indicate changes in usage patterns.

System resource utilization: Monitor resource utilization metrics such as CPU, memory, disk I/O, and network bandwidth on the servers hosting the API. High resource utilization can indicate potential performance bottlenecks or capacity limits.

Throughput: Measure the number of successful requests processed by the API per unit of time. Throughput metrics provide insights into the API's capacity to handle incoming traffic.

Now that you have some basic metrics to collect from your core architectural components and can monitor the overall health of the application, we need to monitor the actual DAGs themselves to ensure that they function as intended.

Monitoring your DAGs

There are multiple aspects to monitoring your DAGs, and while they're all valuable, they may not all be necessary.
Now that you have some basic metrics to collect from your core architectural components and can monitor the overall health of the application, we need to monitor the actual DAGs themselves to ensure that they function as intended.

Monitoring your DAGs
There are multiple aspects to monitoring your DAGs, and while they are all valuable, they may not all be necessary. Take care to ensure that your monitoring and alerting stack matches your organizational needs with regard to operational parameters for resiliency and, if there is a failure, recovery times. No matter how much or how little you choose to implement, knowing that your DAGs work, and if and how they fail, is the first step in fixing the problems that will arise.

Logging
Airflow writes logs for tasks in a hierarchical structure that allows you to see each task's logs in the Airflow UI. The community also provides a number of providers for using other services as backing log storage and retrieval. A complete list of supported providers is available at https://airflow.apache.org/docs/apache-airflow-providers/core-extensions/logging.html.
Airflow uses the standard Python logging framework to write logs. If you're writing custom operators or executing Python functions with a PythonOperator, just make sure that you instantiate a Python logger instance, and the associated methods will handle everything for you.
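For instance, here is a minimal sketch of a function run by a PythonOperator that logs through the standard framework; the task logic and record list are purely illustrative:

# A sketch of task code run via PythonOperator, logging through the
# standard Python logging framework so messages land in the task's log.
import logging

logger = logging.getLogger(__name__)

def transform_records(**context):
    # Hypothetical work; the record list is a stand-in for real data.
    records = ["a", "b", "c"]
    logger.info("Transforming %d records", len(records))
    logger.warning("Warnings appear in the same task log in the UI")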
Alerting
Airflow provides mechanisms for alerting on operational aspects of your executing workloads that can be configured within your DAG; a short sketch combining the pieces follows the list below:
- Email notifications: Emails can be sent when a task enters a failed or retry state by setting the email_on_failure or email_on_retry argument, respectively. These arguments can be provided to all tasks in a DAG through the default_args keyword of the DAG, or to individual tasks by setting the keyword argument on each one.
- Callbacks: Callbacks are special actions that are executed when a specific state change occurs. Generally, these callbacks should be thoughtfully leveraged to send alerts that are critical operationally:
  - on_success_callback: This callback is executed at both the task and DAG levels when entering a successful state. Unless it is critical that you know whether something succeeds, we generally suggest not using this for alerting.
  - on_failure_callback: This callback is invoked when a task enters a failed state. Generally, this callback should always be set and, in critical scenarios, should alert on failures that require intervention and support.
  - on_execute_callback: This is invoked right before a task executes and only exists at the task level. Use it sparingly for alerting, as it can quickly become a noisy alert when overused.
  - on_retry_callback: This is invoked when a task is placed in a retry state. This is another callback to be cautious about as an alert, as it can become noisy and cause false alarms.
  - sla_miss_callback: This is invoked when a DAG misses its defined SLA. This callback is only executed at the end of a DAG's execution cycle, so it tends to be a very reactive notification that something has gone wrong.
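Here is that minimal sketch; the DAG ID, task, and on-call address are invented for illustration, and email delivery additionally requires SMTP to be configured for your deployment:

# A sketch of failure alerting via default_args; dag_id, task, and the
# email address are illustrative assumptions.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

def notify_failure(context):
    # The context includes the failing task instance and the exception.
    ti = context["task_instance"]
    print(f"Task {ti.task_id} in DAG {ti.dag_id} failed; paging on-call.")

default_args = {
    "email": ["oncall@example.com"],  # assumed address
    "email_on_failure": True,
    "email_on_retry": False,
    "on_failure_callback": notify_failure,
}

with DAG(
    dag_id="example_alerting_dag",  # hypothetical
    start_date=datetime(2024, 1, 1),
    schedule=None,  # Airflow 2.4+ spelling; use schedule_interval on older versions
    default_args=default_args,
) as dag:
    BashOperator(task_id="always_fails", bash_command="exit 1")  # triggers the callback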
SLA monitoring
As awesome a tool as Airflow is, it is well known in the community that SLAs, while largely functional, have some unfortunate implementation details that can make them problematic at best; they are generally regarded as a broken feature in Airflow. If you require SLA monitoring for your workflows, we suggest deploying a cron job monitoring tool such as healthchecks (https://github.com/healthchecks/healthchecks), which allows you to create suppressive alerts for your services through its REST API. By pairing this third-party service with either HTTP operators or simple requests from callbacks, you can ensure that your most critical workflows achieve dynamic and resilient SLA alerting.
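A minimal sketch of that pattern follows, assuming a healthchecks check has already been created; the ping URL is a placeholder, and the function is intended to be attached as an on_success_callback so that a missed ping raises the alert:

# A sketch of dead-man's-switch SLA alerting: ping healthchecks on success,
# and let a missing ping trigger the alert. The UUID is a placeholder.
import requests

HEALTHCHECK_URL = "https://hc-ping.com/<your-check-uuid>"

def ping_healthcheck(context):
    try:
        # Signal that the run completed; healthchecks alerts if this
        # ping does not arrive within the check's grace period.
        requests.get(HEALTHCHECK_URL, timeout=10)
    except requests.RequestException:
        # Never fail the workflow because the monitoring ping failed.
        pass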
Performance profiling
The Airflow UI is a great tool for profiling the performance of individual DAGs:
- The Gantt chart view: This is a great visualization for understanding the amount of time spent on individual tasks and the relative order of execution. If you're worried about bottlenecks in your workflow, start here.
- Task duration: This allows you to profile the run characteristics of tasks within your DAG over a historical period. This tool is great at helping you understand temporal patterns in execution time and find outliers in execution. Especially if you find that a DAG slows down over time, this view can help you understand whether it is a systemic issue and which tasks might need additional development.
- Landing times: This shows the delta between task completion and the start of the DAG run. This is an unintuitive but powerful metric: increases in it, when paired with stable task durations in upstream tasks, can help identify whether a scheduler is under heavy load and may need tuning.
Additional metrics that have proven to be useful (but may need to be calculated) include the following:
- Task startup time: This is an especially useful metric when operating with a Kubernetes executor. To calculate it, take the difference between start_date and execution_date on each task instance (a query sketch follows this list). This metric will especially help you identify bottlenecks outside of Airflow that may impact task run times.
- Task failure and retry counts: Monitoring the frequency of task failures and retries can reveal information about the stability and robustness of your environment. Especially if these failures can be linked back to patterns in time or execution, this can help debug interactions with other services.
- DAG parsing time: Monitoring how long a DAG takes to parse is very important for understanding scheduler load and bottlenecks. If an individual DAG takes a long time to load (either due to heavy imports or long blocking calls executed during parsing), it can have a material impact on the timeliness of scheduling tasks.
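A rough sketch of that startup-time calculation against the metadata database follows, assuming Airflow 2.x models; the dag_id is hypothetical, and the query is illustrative rather than production-ready:

# A sketch of computing task startup latency (start_date - execution_date)
# from the metadata database; the dag_id is a hypothetical example.
from airflow.models import DagRun, TaskInstance
from airflow.utils.session import provide_session

@provide_session
def task_startup_latencies(dag_id="example_alerting_dag", session=None):
    rows = (
        session.query(TaskInstance, DagRun)
        .join(
            DagRun,
            (TaskInstance.dag_id == DagRun.dag_id)
            & (TaskInstance.run_id == DagRun.run_id),
        )
        .filter(TaskInstance.dag_id == dag_id, TaskInstance.start_date.isnot(None))
        .all()
    )
    # Larger deltas point to scheduling or pod-startup bottlenecks.
    return [(ti.task_id, ti.start_date - dr.execution_date) for ti, dr in rows]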
Conclusion
In this article, we covered some essential strategies to effectively monitor both the core Airflow system and individual DAGs post-deployment. We highlighted the importance of active and suppressive monitoring techniques and provided insights into the critical metrics to track for each component, including the scheduler, metadata database, triggerer, executors/workers, and web server. Additionally, we discussed logging, alerting mechanisms, SLA monitoring, and performance profiling techniques to ensure the reliability, scalability, and efficiency of Airflow workflows. By implementing these monitoring practices and leveraging the insights gained, operators can proactively manage and optimize their Airflow deployments for optimal performance and reliability.

Author Bio
Dylan Intorf is a solutions architect and data engineer with a BS in Computer Science from Arizona State University. He has 10+ years of experience in the software and data engineering space, delivering custom-tailored solutions to the tech, financial, and insurance industries.
Kendrick van Doorn is an engineering and business leader with a background in software development and over 10 years of developing tech and data strategies at Fortune 100 companies. In his spare time, he enjoys taking classes at different universities and is currently an MBA candidate at Columbia University.
Dylan Storey has a B.Sc. and M.Sc. from California State University, Fresno in Biology and a Ph.D. from the University of Tennessee, Knoxville in Life Sciences, where he leveraged computational methods to study a variety of biological systems. He has over 15 years of experience in building, growing, and leading teams, and in solving problems while developing and operating data products at a variety of scales and industries.

article-image-data-cleaning-worst-part-of-data-analysis
Amey Varangaonkar
04 Jun 2018
5 min read

Data cleaning is the worst part of data analysis, say data scientists

The year was 2012. Harvard Business Review had famously declared the role of data scientist the 'sexiest job of the 21st century'. Companies were slowly working with more data than ever before, and the real, actionable value of that data for commercial purposes was slowly beginning to surface. Someone who could derive these actionable insights from the data was needed, and the demand for data scientists was higher than ever. Fast forward to 2018: more data has been collected in the last two years than ever before. Data scientists are still in high demand, and the need for insights is higher than ever. There has been one significant change, though: the process of deriving insights has become more complex. If you ask data scientists, the initial phase of this process, which involves data cleansing, has become a lot more cumbersome. So much so that it is no longer a myth that data scientists spend almost 80% of their time cleaning and readying the data for analysis.

Why data cleaning is a nightmare
In the recently conducted Packt Skill-Up survey, we asked data professionals what the worst part of the data analysis process was, and a staggering 50% responded with data cleaning.
[Chart] Source: Packt Skill-Up Survey
We dived deep into this and tried to understand why many data science professionals share this dislike of data cleaning, or scrubbing, as many call it. Read the Skill Up report in full: sign up to our weekly newsletter and download the PDF for free.

There is no consistent data format
Organizations these days work with a lot of data. Some of it is in a structured, readily understandable format; this kind of data is usually quite easy to clean, parse, and analyze. However, some of the data is really messy and cannot be used as is for analysis. This includes missing data, irregularly formatted data, and irrelevant data that is not worth analyzing at all. There is also the problem of working with unstructured data, which needs to be pre-processed to extract the data worth analyzing. Audio and video files, email messages, presentations, XML documents, and web pages are some classic examples.

There's too much data to be cleaned
The volume of data that businesses deal with on a day-to-day basis is on the scale of terabytes or even petabytes. Making sense of all this data, coming from a variety of sources and in different formats, is undoubtedly a huge task. A whole host of tools is designed to ease this process today, but it remains an incredibly tricky challenge to sift through large volumes of data and prepare it for analysis.

Data cleaning is tricky and time-consuming
Data cleansing can be quite an exhaustive and time-consuming task, especially for data scientists. Cleaning the data requires removing duplicates, removing or replacing missing entries, correcting misfielded values, ensuring consistent formatting, and a host of other tasks that take a considerable amount of time. Once the data is cleaned, it needs to be placed in a secure location, and a log of the entire process needs to be kept to ensure the right data goes through the right steps. All of this requires data scientists to create a well-designed data-scrubbing framework to avoid the risk of repetition. It is largely grunt work and requires a lot of manual effort. Sadly, there are no tools on the market that can effectively automate this process end to end.
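To make those individual steps concrete, here is a small sketch in pandas of the kinds of operations described above; the column names and fill/format rules are invented for illustration:

# A sketch of routine cleaning steps in pandas; the column names and
# the chosen fill/format rules are illustrative assumptions.
import pandas as pd

df = pd.DataFrame({
    "customer": ["Ann", "Ann", "bob", None],
    "signup_date": ["2018-01-03", "2018-01-03", "03/01/2018", "2018-02-10"],
    "spend": [120.0, 120.0, None, 75.5],
})

df = df.drop_duplicates()                               # remove duplicate rows
df = df.dropna(subset=["customer"])                     # drop rows missing a key field
df["spend"] = df["spend"].fillna(df["spend"].median())  # replace missing values
df["customer"] = df["customer"].str.title()             # enforce consistent formatting
# Coerce dates; entries that don't parse become NaT for manual review.
df["signup_date"] = pd.to_datetime(df["signup_date"], errors="coerce")
print(df)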
Outsourcing the process is expensive
Given that data cleaning is a rather tedious job, many businesses consider outsourcing the task to third-party vendors. While this saves a lot of time and effort on the company's end, it definitely increases the cost of the overall process. Many small and medium-scale businesses may not be able to afford this and are thus heavily reliant on their data scientists to do the job for them.

You can hate it, but you cannot ignore it
It is quite obvious that data scientists need clean, ready-to-analyze data if they are to extract actionable business insights from it. Some data scientists equate data cleaning to donkey work, suggesting there is not a lot of innovation involved in the process. Others believe data cleaning is rather important and pay special attention to it, since once it is done right, most of the problems in data analysis are solved. It is very difficult to take advantage of the intrinsic value offered by a dataset if it does not adhere to the quality standards set by the business, making data cleaning a crucial component of the data analysis process.
Now that you know why data cleaning is essential, why not dive deeper into the technicalities? Check out our book Practical Data Wrangling for expert tips on turning your noisy data into relevant, insight-ready information using R and Python.

Read more:
- Cleaning Data in PDF Files
- 30 common data science terms explained
- How to create a strong data science project portfolio that lands you a job