How-To Tutorials

article-image-4-ways-implement-feature-selection-python-machine-learning

16 Feb 2018

13 min read

4 ways to implement feature selection in Python for machine learning

16 Feb 2018

[box type="note" align="" class="" width=""]This article is an excerpt from Ensemble Machine Learning. This book serves as a beginner's guide to combining powerful machine learning algorithms to build optimized models.[/box] In this article, we will look at different methods to select features from the dataset; and discuss types of feature selection algorithms with their implementation in Python using the Scikit-learn (sklearn) library: Univariate selection Recursive Feature Elimination (RFE) Principle Component Analysis (PCA) Choosing important features (feature importance) We have explained first three algorithms and their implementation in short. Further we will discuss Choosing important features (feature importance) part in detail as it is widely used technique in the data science community. Univariate selection Statistical tests can be used to select those features that have the strongest relationships with the output variable. The scikit-learn library provides the SelectKBest class, which can be used with a suite of different statistical tests to select a specific number of features. The following example uses the chi squared (chi^2) statistical test for non-negative features to select four of the best features from the Pima Indians onset of diabetes dataset: #Feature Extraction with Univariate Statistical Tests (Chi-squared for classification) #Import the required packages #Import pandas to read csv import pandas #Import numpy for array related operations import numpy #Import sklearn's feature selection algorithm from sklearn.feature_selection import SelectKBest #Import chi2 for performing chi square test from sklearn.feature_selection import chi2 #URL for loading the dataset url ="https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians diabetes/pima-indians-diabetes.data" #Define the attribute names names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class'] #Create pandas data frame by loading the data from URL dataframe = pandas.read_csv(url, names=names) #Create array from data values array = dataframe.values #Split the data into input and target X = array[:,0:8] Y = array[:,8] #We will select the features using chi square test = SelectKBest(score_func=chi2, k=4) #Fit the function for ranking the features by score fit = test.fit(X, Y) #Summarize scores numpy.set_printoptions(precision=3) print(fit.scores_) #Apply the transformation on to dataset features = fit.transform(X) #Summarize selected features print(features[0:5,:]) You can see the scores for each attribute and the four attributes chosen (those with the highest scores): plas, test, mass, and age. Scores for each feature: [111.52 1411.887 17.605 53.108 2175.565 127.669 5.393 181.304] Selected Features: [[148. 0. 33.6 50. ] [85. 0. 26.6 31. ] [183. 0. 23.3 32. ] [89. 94. 28.1 21. ] [137. 168. 43.1 33. ]] Recursive Feature Elimination RFE works by recursively removing attributes and building a model on attributes that remain. It uses model accuracy to identify which attributes (and combinations of attributes) contribute the most to predicting the target attribute. You can learn more about the RFE class in the scikit-learn documentation. The following example uses RFE with the logistic regression algorithm to select the top three features. The choice of algorithm does not matter too much as long as it is skillful and consistent: #Import the required packages #Import pandas to read csv import pandas #Import numpy for array related operations import numpy #Import sklearn's feature selection algorithm from sklearn.feature_selection import RFE #Import LogisticRegression for performing chi square test from sklearn.linear_model import LogisticRegression #URL for loading the dataset url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-dia betes/pima-indians-diabetes.data" #Define the attribute names names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class'] #Create pandas data frame by loading the data from URL dataframe = pandas.read_csv(url, names=names) #Create array from data values array = dataframe.values #Split the data into input and target X = array[:,0:8] Y = array[:,8] #Feature extraction model = LogisticRegression() rfe = RFE(model, 3) fit = rfe.fit(X, Y) print("Num Features: %d"% fit.n_features_) print("Selected Features: %s"% fit.support_) print("Feature Ranking: %s"% fit.ranking_) After execution, we will get: Num Features: 3 Selected Features: [ True False False False False True True False] Feature Ranking: [1 2 3 5 6 1 1 4] You can see that RFE chose the the top three features as preg, mass, and pedi. These are marked True in the support_ array and marked with a choice 1 in the ranking_ array. Principle Component Analysis PCA uses linear algebra to transform the dataset into a compressed form. Generally, it is considered a data reduction technique. A property of PCA is that you can choose the number of dimensions or principal components in the transformed result. In the following example, we use PCA and select three principal components: #Import the required packages #Import pandas to read csv import pandas #Import numpy for array related operations import numpy #Import sklearn's PCA algorithm from sklearn.decomposition import PCA #URL for loading the dataset url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians diabetes/pima-indians-diabetes.data" #Define the attribute names names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class'] dataframe = pandas.read_csv(url, names=names) #Create array from data values array = dataframe.values #Split the data into input and target X = array[:,0:8] Y = array[:,8] #Feature extraction pca = PCA(n_components=3) fit = pca.fit(X) #Summarize components print("Explained Variance: %s") % fit.explained_variance_ratio_ print(fit.components_) You can see that the transformed dataset (three principal components) bears little resemblance to the source data: Explained Variance: [ 0.88854663 0.06159078 0.02579012] [[ -2.02176587e-03 9.78115765e-02 1.60930503e-02 6.07566861e-02 9.93110844e-01 1.40108085e-02 5.37167919e-04 -3.56474430e-03] [ -2.26488861e-02 -9.72210040e-01 -1.41909330e-01 5.78614699e-02 9.46266913e-02 -4.69729766e-02 -8.16804621e-04 -1.40168181e-01 [ -2.24649003e-02 1.43428710e-01 -9.22467192e-01 -3.07013055e-01 2.09773019e-02 -1.32444542e-01 -6.39983017e-04 -1.25454310e-01]] Choosing important features (feature importance) Feature importance is the technique used to select features using a trained supervised classifier. When we train a classifier such as a decision tree, we evaluate each attribute to create splits; we can use this measure as a feature selector. Let's understand it in detail. Random forests are among the most popular machine learning methods thanks to their relatively good accuracy, robustness, and ease of use. They also provide two straightforward methods for feature selection—mean decrease impurity and mean decrease accuracy. A random forest consists of a number of decision trees. Every node in a decision tree is a condition on a single feature, designed to split the dataset into two so that similar response values end up in the same set. The measure based on which the (locally) optimal condition is chosen is known as impurity. For classification, it is typically either the Gini impurity or information gain/entropy, and for regression trees, it is the variance. Thus when training a tree, it can be computed by how much each feature decreases the weighted impurity in a tree. For a forest, the impurity decrease from each feature can be averaged and the features are ranked according to this measure. Let's see how to do feature selection using a random forest classifier and evaluate the accuracy of the classifier before and after feature selection. We will use the Otto dataset. This dataset is available for free from kaggle (you will need to sign up to kaggle to be able to download this dataset). You can download training dataset, train.csv.zip, from the https://www.kaggle.com/c/otto-group-product-classification-challenge/data and place the unzipped train.csv file in your working directory. This dataset describes 93 obfuscated details of more than 61,000 products grouped into 10 product categories (for example, fashion, electronics, and so on). Input attributes are the counts of different events of some kind. The goal is to make predictions for new products as an array of probabilities for each of the 10 categories, and models are evaluated using multiclass logarithmic loss (also called cross entropy). We will start with importing all of the libraries: #Import the supporting libraries #Import pandas to load the dataset from csv file from pandas import read_csv #Import numpy for array based operations and calculations import numpy as np #Import Random Forest classifier class from sklearn from sklearn.ensemble import RandomForestClassifier #Import feature selector class select model of sklearn from sklearn.feature_selection import SelectFromModel np.random.seed(1) Let's define a method to split our dataset into training and testing data; we will train our dataset on the training part and the testing part will be used for evaluation of the trained model: #Function to create Train and Test set from the original dataset def getTrainTestData(dataset,split): np.random.seed(0) training = [] testing = [] np.random.shuffle(dataset) shape = np.shape(dataset) trainlength = np.uint16(np.floor(split*shape[0])) for i in range(trainlength): training.append(dataset[i]) for i in range(trainlength,shape[0]): testing.append(dataset[i]) training = np.array(training) testing = np.array(testing) return training,testing We also need to add a function to evaluate the accuracy of the model; it will take the predicted and actual output as input to calculate the percentage accuracy: #Function to evaluate model performance def getAccuracy(pre,ytest): count = 0 for i in range(len(ytest)): if ytest[i]==pre[i]: count+=1 acc = float(count)/len(ytest) return acc This is the time to load the dataset. We will load the train.csv file; this file contains more than 61,000 training instances. We will use 50000 instances for our example, in which we will use 35,000 instances to train the classifier and 15,000 instances to test the performance of the classifier: #Load dataset as pandas data frame data = read_csv('train.csv') #Extract attribute names from the data frame feat = data.keys() feat_labels = feat.get_values() #Extract data values from the data frame dataset = data.values #Shuffle the dataset np.random.shuffle(dataset) #We will select 50000 instances to train the classifier inst = 50000 #Extract 50000 instances from the dataset dataset = dataset[0:inst,:] #Create Training and Testing data for performance evaluation train,test = getTrainTestData(dataset, 0.7) #Split data into input and output variable with selected features Xtrain = train[:,0:94] ytrain = train[:,94] shape = np.shape(Xtrain) print("Shape of the dataset ",shape) #Print the size of Data in MBs print("Size of Data set before feature selection: %.2f MB"%(Xtrain.nbytes/1e6)) Let's take note of the data size here; as our dataset contains about 35000 training instances with 94 attributes; the size of our dataset is quite large. Let's see: Shape of the dataset (35000, 94) Size of Data set before feature selection: 26.32 MB As you can see, we are having 35000 rows and 94 columns in our dataset, which is more than 26 MB data. In the next code block, we will configure our random forest classifier; we will use 250 trees with a maximum depth of 30 and the number of random features will be 7. Other hyperparameters will be the default of sklearn: #Lets select the test data for model evaluation purpose Xtest = test[:,0:94] ytest = test[:,94] #Create a random forest classifier with the following Parameters trees = 250 max_feat = 7 max_depth = 30 min_sample = 2 clf = RandomForestClassifier(n_estimators=trees, max_features=max_feat, max_depth=max_depth, min_samples_split= min_sample, random_state=0, n_jobs=-1) #Train the classifier and calculate the training time import time start = time.time() clf.fit(Xtrain, ytrain) end = time.time() #Lets Note down the model training time print("Execution time for building the Tree is: %f"%(float(end)- float(start))) pre = clf.predict(Xtest) Let's see how much time is required to train the model on the training dataset: Execution time for building the Tree is: 2.913641 #Evaluate the model performance for the test data acc = getAccuracy(pre, ytest) print("Accuracy of model before feature selection is %.2f"%(100*acc)) The accuracy of our model is: Accuracy of model before feature selection is 98.82 As you can see, we are getting very good accuracy as we are classifying almost 99% of the test data into the correct categories. This means we are classifying about 14,823 instances out of 15,000 in correct classes. So, now my question is: should we go for further improvement? Well, why not? We should definitely go for more improvements if we can; here, we will use feature importance to select features. As you know, in the tree building process, we use impurity measurement for node selection. The attribute value that has the lowest impurity is chosen as the node in the tree. We can use similar criteria for feature selection. We can give more importance to features that have less impurity, and this can be done using the feature_importances_ function of the sklearn library. Let's find out the importance of each feature: #Once we have trained the model we will rank all the features for feature in zip(feat_labels, clf.feature_importances_): print(feature) ('id', 0.33346650420175183) ('feat_1', 0.0036186958628801214) ('feat_2', 0.0037243050888530957) ('feat_3', 0.011579217472062748) ('feat_4', 0.010297382675187445) ('feat_5', 0.0010359139416194116) ('feat_6', 0.00038171336038056165) ('feat_7', 0.0024867672489765021) ('feat_8', 0.0096689721610546085) ('feat_9', 0.007906150362995093) ('feat_10', 0.0022342480802130366) As you can see here, each feature has a different importance based on its contribution to the final prediction. We will use these importance scores to rank our features; in the following part, we will select those features that have feature importance more than 0.01 for model training: #Select features which have higher contribution in the final prediction sfm = SelectFromModel(clf, threshold=0.01) sfm.fit(Xtrain,ytrain) Here, we will transform the input dataset according to the selected feature attributes. In the next code block, we will transform the dataset. Then, we will check the size and shape of the new dataset: #Transform input dataset Xtrain_1 = sfm.transform(Xtrain) Xtest_1 = sfm.transform(Xtest) #Let's see the size and shape of new dataset print("Size of Data set before feature selection: %.2f MB"%(Xtrain_1.nbytes/1e6)) shape = np.shape(Xtrain_1) print("Shape of the dataset ",shape) Size of Data set before feature selection: 5.60 MB Shape of the dataset (35000, 20) Do you see the shape of the dataset? We are left with only 20 features after the feature selection process, which reduces the size of the database from 26 MB to 5.60 MB. That's about 80% reduction from the original dataset. In the next code block, we will train a new random forest classifier with the same hyperparameters as earlier and test it on the testing dataset. Let's see what accuracy we get after modifying the training set: #Model training time start = time.time() clf.fit(Xtrain_1, ytrain) end = time.time() print("Execution time for building the Tree is: %f"%(float(end)- float(start))) #Let's evaluate the model on test data pre = clf.predict(Xtest_1) count = 0 acc2 = getAccuracy(pre, ytest) print("Accuracy after feature selection %.2f"%(100*acc2)) Execution time for building the Tree is: 1.711518 Accuracy after feature selection 99.97 Can you see that!! We have got 99.97 percent accuracy with the modified dataset, which means we are classifying 14,996 instances in correct classes, while previously we were classifying only 14,823 instances correctly. This is a huge improvement we have got with the feature selection process; we can summarize all the results in the following table: Evaluation criteria Before feature selection After feature selection Number of features 94 20 Size of dataset 26.32 MB 5.60 MB Training time 2.91 seconds 1.71 seconds Accuracy 98.82 percent 99.97 percent The preceding table shows the practical advantages of feature selection. You can see that we have reduced the number of features significantly, which reduces the model complexity and dimensions of the dataset. We are getting less training time after the reduction in dimensions, and at the end, we have overcome the overfitting issue, getting higher accuracy than before. To summarize the article, we explored 4 ways of feature selection in machine learning. If you found this post is useful, do check out the book Ensemble Machine Learning to know more about stacking generalization among other techniques.

0
4
99736

Packt

11 Aug 2015

17 min read

Divide and Conquer – Classification Using Decision Trees and Rules

Packt

11 Aug 2015

17 min read

In this article by Brett Lantz, author of the book Machine Learning with R, Second Edition, we will get a basic understanding about decision trees and rule learners, including the C5.0 decision tree algorithm. This algorithm will cover mechanisms such as choosing the best split and pruning the decision tree. While deciding between several job offers with various levels of pay and benefits, many people begin by making lists of pros and cons, and eliminate options based on simple rules. For instance, ''if I have to commute for more than an hour, I will be unhappy.'' Or, ''if I make less than $50k, I won't be able to support my family.'' In this way, the complex and difficult decision of predicting one's future happiness can be reduced to a series of simple decisions. This article covers decision trees and rule learners—two machine learning methods that also make complex decisions from sets of simple choices. These methods then present their knowledge in the form of logical structures that can be understood with no statistical knowledge. This aspect makes these models particularly useful for business strategy and process improvement. By the end of this article, you will learn: How trees and rules "greedily" partition data into interesting segments The most common decision tree and classification rule learners, including the C5.0, 1R, and RIPPER algorithms We will begin by examining decision trees, followed by a look at classification rules. (For more resources related to this topic, see here.) Understanding decision trees Decision tree learners are powerful classifiers, which utilize a tree structure to model the relationships among the features and the potential outcomes. As illustrated in the following figure, this structure earned its name due to the fact that it mirrors how a literal tree begins at a wide trunk, which if followed upward, splits into narrower and narrower branches. In much the same way, a decision tree classifier uses a structure of branching decisions, which channel examples into a final predicted class value. To better understand how this works in practice, let's consider the following tree, which predicts whether a job offer should be accepted. A job offer to be considered begins at the root node, where it is then passed through decision nodes that require choices to be made based on the attributes of the job. These choices split the data across branches that indicate potential outcomes of a decision, depicted here as yes or no outcomes, though in some cases there may be more than two possibilities. In the case a final decision can be made, the tree is terminated by leaf nodes (also known as terminal nodes) that denote the action to be taken as the result of the series of decisions. In the case of a predictive model, the leaf nodes provide the expected result given the series of events in the tree. A great benefit of decision tree algorithms is that the flowchart-like tree structure is not necessarily exclusively for the learner's internal use. After the model is created, many decision tree algorithms output the resulting structure in a human-readable format. This provides tremendous insight into how and why the model works or doesn't work well for a particular task. This also makes decision trees particularly appropriate for applications in which the classification mechanism needs to be transparent for legal reasons, or in case the results need to be shared with others in order to inform future business practices. With this in mind, some potential uses include: Credit scoring models in which the criteria that causes an applicant to be rejected need to be clearly documented and free from bias Marketing studies of customer behavior such as satisfaction or churn, which will be shared with management or advertising agencies Diagnosis of medical conditions based on laboratory measurements, symptoms, or the rate of disease progression Although the previous applications illustrate the value of trees in informing decision processes, this is not to suggest that their utility ends here. In fact, decision trees are perhaps the single most widely used machine learning technique, and can be applied to model almost any type of data—often with excellent out-of-the-box applications. This said, in spite of their wide applicability, it is worth noting some scenarios where trees may not be an ideal fit. One such case might be a task where the data has a large number of nominal features with many levels or it has a large number of numeric features. These cases may result in a very large number of decisions and an overly complex tree. They may also contribute to the tendency of decision trees to overfit data, though as we will soon see, even this weakness can be overcome by adjusting some simple parameters. Divide and conquer Decision trees are built using a heuristic called recursive partitioning. This approach is also commonly known as divide and conquer because it splits the data into subsets, which are then split repeatedly into even smaller subsets, and so on and so forth until the process stops when the algorithm determines the data within the subsets are sufficiently homogenous, or another stopping criterion has been met. To see how splitting a dataset can create a decision tree, imagine a bare root node that will grow into a mature tree. At first, the root node represents the entire dataset, since no splitting has transpired. Next, the decision tree algorithm must choose a feature to split upon; ideally, it chooses the feature most predictive of the target class. The examples are then partitioned into groups according to the distinct values of this feature, and the first set of tree branches are formed. Working down each branch, the algorithm continues to divide and conquer the data, choosing the best candidate feature each time to create another decision node, until a stopping criterion is reached. Divide and conquer might stop at a node in a case that: All (or nearly all) of the examples at the node have the same class There are no remaining features to distinguish among the examples The tree has grown to a predefined size limit To illustrate the tree building process, let's consider a simple example. Imagine that you work for a Hollywood studio, where your role is to decide whether the studio should move forward with producing the screenplays pitched by promising new authors. After returning from a vacation, your desk is piled high with proposals. Without the time to read each proposal cover-to-cover, you decide to develop a decision tree algorithm to predict whether a potential movie would fall into one of three categories: Critical Success, Mainstream Hit, or Box Office Bust. To build the decision tree, you turn to the studio archives to examine the factors leading to the success and failure of the company's 30 most recent releases. You quickly notice a relationship between the film's estimated shooting budget, the number of A-list celebrities lined up for starring roles, and the level of success. Excited about this finding, you produce a scatterplot to illustrate the pattern: Using the divide and conquer strategy, we can build a simple decision tree from this data. First, to create the tree's root node, we split the feature indicating the number of celebrities, partitioning the movies into groups with and without a significant number of A-list stars: Next, among the group of movies with a larger number of celebrities, we can make another split between movies with and without a high budget: At this point, we have partitioned the data into three groups. The group at the top-left corner of the diagram is composed entirely of critically acclaimed films. This group is distinguished by a high number of celebrities and a relatively low budget. At the top-right corner, majority of movies are box office hits with high budgets and a large number of celebrities. The final group, which has little star power but budgets ranging from small to large, contains the flops. If we wanted, we could continue to divide and conquer the data by splitting it based on the increasingly specific ranges of budget and celebrity count, until each of the currently misclassified values resides in its own tiny partition, and is correctly classified. However, it is not advisable to overfit a decision tree in this way. Though there is nothing to stop us from splitting the data indefinitely, overly specific decisions do not always generalize more broadly. We'll avoid the problem of overfitting by stopping the algorithm here, since more than 80 percent of the examples in each group are from a single class. This forms the basis of our stopping criterion. You might have noticed that diagonal lines might have split the data even more cleanly. This is one limitation of the decision tree's knowledge representation, which uses axis-parallel splits. The fact that each split considers one feature at a time prevents the decision tree from forming more complex decision boundaries. For example, a diagonal line could be created by a decision that asks, "is the number of celebrities is greater than the estimated budget?" If so, then "it will be a critical success." Our model for predicting the future success of movies can be represented in a simple tree, as shown in the following diagram. To evaluate a script, follow the branches through each decision until the script's success or failure has been predicted. In no time, you will be able to identify the most promising options among the backlog of scripts and get back to more important work, such as writing an Academy Awards acceptance speech. Since real-world data contains more than two features, decision trees quickly become far more complex than this, with many more nodes, branches, and leaves. In the next section, you will learn about a popular algorithm to build decision tree models automatically. The C5.0 decision tree algorithm There are numerous implementations of decision trees, but one of the most well-known implementations is the C5.0 algorithm. This algorithm was developed by computer scientist J. Ross Quinlan as an improved version of his prior algorithm, C4.5, which itself is an improvement over his Iterative Dichotomiser 3 (ID3) algorithm. Although Quinlan markets C5.0 to commercial clients (see http://www.rulequest.com/ for details), the source code for a single-threaded version of the algorithm was made publically available, and it has therefore been incorporated into programs such as R. To further confuse matters, a popular Java-based open source alternative to C4.5, titled J48, is included in R's RWeka package. Because the differences among C5.0, C4.5, and J48 are minor, the principles in this article will apply to any of these three methods, and the algorithms should be considered synonymous. The C5.0 algorithm has become the industry standard to produce decision trees, because it does well for most types of problems directly out of the box. Compared to other advanced machine learning models, the decision trees built by C5.0 generally perform nearly as well, but are much easier to understand and deploy. Additionally, as shown in the following table, the algorithm's weaknesses are relatively minor and can be largely avoided: Strengths Weaknesses An all-purpose classifier that does well on most problems Highly automatic learning process, which can handle numeric or nominal features, as well as missing data Excludes unimportant features Can be used on both small and large datasets Results in a model that can be interpreted without a mathematical background (for relatively small trees) More efficient than other complex models Decision tree models are often biased toward splits on features having a large number of levels It is easy to overfit or underfit the model Can have trouble modeling some relationships due to reliance on axis-parallel splits Small changes in the training data can result in large changes to decision logic Large trees can be difficult to interpret and the decisions they make may seem counterintuitive To keep things simple, our earlier decision tree example ignored the mathematics involved in how a machine would employ a divide and conquer strategy. Let's explore this in more detail to examine how this heuristic works in practice. Choosing the best split The first challenge that a decision tree will face is to identify which feature to split upon. In the previous example, we looked for a way to split the data such that the resulting partitions contained examples primarily of a single class. The degree to which a subset of examples contains only a single class is known as purity, and any subset composed of only a single class is called pure. There are various measurements of purity that can be used to identify the best decision tree splitting candidate. C5.0 uses entropy, a concept borrowed from information theory that quantifies the randomness, or disorder, within a set of class values. Sets with high entropy are very diverse and provide little information about other items that may also belong in the set, as there is no apparent commonality. The decision tree hopes to find splits that reduce entropy, ultimately increasing homogeneity within the groups. Typically, entropy is measured in bits. If there are only two possible classes, entropy values can range from 0 to 1. For n classes, entropy ranges from 0 to log2(n). In each case, the minimum value indicates that the sample is completely homogenous, while the maximum value indicates that the data are as diverse as possible, and no group has even a small plurality. In the mathematical notion, entropy is specified as follows: In this formula, for a given segment of data (S), the term c refers to the number of class levels and pi refers to the proportion of values falling into class level i. For example, suppose we have a partition of data with two classes: red (60 percent) and white (40 percent). We can calculate the entropy as follows: > -0.60 * log2(0.60) - 0.40 * log2(0.40) [1] 0.9709506 We can examine the entropy for all the possible two-class arrangements. If we know that the proportion of examples in one class is x, then the proportion in the other class is (1 – x). Using the curve() function, we can then plot the entropy for all the possible values of x: > curve(-x * log2(x) - (1 - x) * log2(1 - x), col = "red", xlab = "x", ylab = "Entropy", lwd = 4) This results in the following figure: As illustrated by the peak in entropy at x = 0.50, a 50-50 split results in maximum entropy. As one class increasingly dominates the other, the entropy reduces to zero. To use entropy to determine the optimal feature to split upon, the algorithm calculates the change in homogeneity that would result from a split on each possible feature, which is a measure known as information gain. The information gain for a feature F is calculated as the difference between the entropy in the segment before the split (S1) and the partitions resulting from the split (S2): One complication is that after a split, the data is divided into more than one partition. Therefore, the function to calculate Entropy(S2) needs to consider the total entropy across all of the partitions. It does this by weighing each partition's entropy by the proportion of records falling into the partition. This can be stated in a formula as: In simple terms, the total entropy resulting from a split is the sum of the entropy of each of the n partitions weighted by the proportion of examples falling in the partition (wi). The higher the information gain, the better a feature is at creating homogeneous groups after a split on this feature. If the information gain is zero, there is no reduction in entropy for splitting on this feature. On the other hand, the maximum information gain is equal to the entropy prior to the split. This would imply that the entropy after the split is zero, which means that the split results in completely homogeneous groups. The previous formulae assume nominal features, but decision trees use information gain for splitting on numeric features as well. To do so, a common practice is to test various splits that divide the values into groups greater than or less than a numeric threshold. This reduces the numeric feature into a two-level categorical feature that allows information gain to be calculated as usual. The numeric cut point yielding the largest information gain is chosen for the split. Though it is used by C5.0, information gain is not the only splitting criterion that can be used to build decision trees. Other commonly used criteria are Gini index, Chi-Squared statistic, and gain ratio. For a review of these (and many more) criteria, refer to Mingers J. An Empirical Comparison of Selection Measures for Decision-Tree Induction. Machine Learning. 1989; 3:319-342. Pruning the decision tree A decision tree can continue to grow indefinitely, choosing splitting features and dividing the data into smaller and smaller partitions until each example is perfectly classified or the algorithm runs out of features to split on. However, if the tree grows overly large, many of the decisions it makes will be overly specific and the model will be overfitted to the training data. The process of pruning a decision tree involves reducing its size such that it generalizes better to unseen data. One solution to this problem is to stop the tree from growing once it reaches a certain number of decisions or when the decision nodes contain only a small number of examples. This is called early stopping or pre-pruning the decision tree. As the tree avoids doing needless work, this is an appealing strategy. However, one downside to this approach is that there is no way to know whether the tree will miss subtle, but important patterns that it would have learned had it grown to a larger size. An alternative, called post-pruning, involves growing a tree that is intentionally too large and pruning leaf nodes to reduce the size of the tree to a more appropriate level. This is often a more effective approach than pre-pruning, because it is quite difficult to determine the optimal depth of a decision tree without growing it first. Pruning the tree later on allows the algorithm to be certain that all the important data structures were discovered. The implementation details of pruning operations are very technical and beyond the scope of this article. For a comparison of some of the available methods, see Esposito F, Malerba D, Semeraro G. A Comparative Analysis of Methods for Pruning Decision Trees. IEEE Transactions on Pattern Analysis and Machine Intelligence. 1997;19: 476-491. One of the benefits of the C5.0 algorithm is that it is opinionated about pruning—it takes care of many decisions automatically using fairly reasonable defaults. Its overall strategy is to post-prune the tree. It first grows a large tree that overfits the training data. Later, the nodes and branches that have little effect on the classification errors are removed. In some cases, entire branches are moved further up the tree or replaced by simpler decisions. These processes of grafting branches are known as subtree raising and subtree replacement, respectively. Balancing overfitting and underfitting a decision tree is a bit of an art, but if model accuracy is vital, it may be worth investing some time with various pruning options to see if it improves the performance on test data. As you will soon see, one of the strengths of the C5.0 algorithm is that it is very easy to adjust the training options. Summary This article covered two classification methods that use so-called "greedy" algorithms to partition the data according to feature values. Decision trees use a divide and conquer strategy to create flowchart-like structures, while rule learners separate and conquer data to identify logical if-else rules. Both methods produce models that can be interpreted without a statistical background. One popular and highly configurable decision tree algorithm is C5.0. We used the C5.0 algorithm to create a tree to predict whether a loan applicant will default. This article merely scratched the surface of how trees and rules can be used. Resources for Article: Further resources on this subject: Introduction to S4 Classes [article] First steps with R [article] Supervised learning [article]

0
0
99630

article-image-creating-2d-3d-plots-using-matplotlib

Pravin Dhandre

22 Mar 2018

10 min read

Creating 2D and 3D plots using Matplotlib

Pravin Dhandre

22 Mar 2018

10 min read

0
0
91754

article-image-cross-validation-strategies-for-time-series-forecasting-tutorial

Packt Editorial Staff

06 May 2019

12 min read

Cross-Validation strategies for Time Series forecasting [Tutorial]

Packt Editorial Staff

06 May 2019

12 min read

Time series modeling and forecasting are tricky and challenging. The i.i.d (identically distributed independence) assumption does not hold well to time series data. There is an implicit dependence on previous observations and at the same time, a data leakage from response variables to lag variables is more likely to occur in addition to inherent non-stationarity in the data space. By non-stationarity, we mean flickering changes of observed statistics such as mean and variance. It even gets trickier when taking inherent nonlinearity into consideration. Cross-validation is a well-established methodology for choosing the best model by tuning hyper-parameters or performing feature selection. There are a plethora of strategies for implementing optimal cross-validation. K-fold cross-validation is a time-proven example of such techniques. However, it is not robust in handling time series forecasting issues due to the nature of the data as explained above. In this tutorial, we shall explore two more techniques for performing cross-validation; time series split cross-validation and blocked cross-validation, which is carefully adapted to solve issues encountered in time series forecasting. We shall use Python 3.5, SciKit Learn, Matplotlib, Numpy, and Pandas. By the end of this tutorial you will have explored the following topics: Time Series Split Cross-Validation Blocked Cross-Validation Grid Search Cross-Validation Loss Function Elastic Net Regression Cross-Validation Image Source: scikit-learn.org First, the data set is split into a training and testing set. The testing set is preserved for evaluating the best model optimized by cross-validation. In k-fold cross-validation, the training set is further split into k folds aka partitions. During each iteration of the cross-validation, one fold is held as a validation set and the remaining k - 1 folds are used for training. This allows us to make the best use of the data available without annihilation. It also allows us to avoid biasing the model towards patterns that may be overly represented in a given fold. Then the error obtained on all folds is averaged and the standard deviation is calculated. One usually performs cross-validation to find out which settings give the minimum error before training a final model using these elected settings on the complete training set. Flavors of k-fold cross-validations exist, for example, leave-one-out and nested cross-validation. However, these may be the topic of another tutorial. Grid Search Cross-Validation One idea to fine-tune the hyper-parameters is to randomly guess the values for model parameters and apply cross-validation to see if they work. This is infeasible as there may be exponential combinations of such parameters. This approach is also called Random Search in the literature. Grid search works by exhaustively searching the possible combinations of the model’s parameters, but it makes use of the loss function to guide the selection of the values to be tried at each iteration. That is solving a minimization optimization problem. However, in SciKit Learn it explicitly tries all the possible combination which makes it computationally expensive. When cross-validation is used in the inner loop of the grid search, it is called grid search cross-validation. Hence, the optimization objective becomes minimizing the average loss obtained on the k folds. R2 Loss Function Choosing the loss function has a very high impact on model performance and convergence. In this tutorial, I would like to introduce to you a loss function, most commonly used in regression tasks. R2 loss works by calculating correlation coefficients between the ground truth target values and the response output from the model. The formula is, however, slightly modified so that the range of the function is in the open interval [+1, -∞]. Hence, +1 indicates maximum positive correlation and negative values indicate the opposite. Thus, all the errors obtained in this tutorial should be interpreted as desirable if their value is close to +1. It is worth mentioning that we could have chosen a different loss function such as L1-norm or L2-norm. I would encourage you to try the ideas discussed in this tutorial using other loss functions and observe the difference. Elastic Net Regression This also goes in the literature by the name elastic net regularization. Regularization is a very robust technique to avoid overfitting by penalizing large weights or in other words it alters the objective function by emphasizing the errors caused by memorizing the training set. Vanilla linear regression can be tricked into learning the parameters that perform very well on the training set, but yet fail to generalize for unseen new samples. Both L1-regularization and L2-regularization were incorporated to resolve overfitting and are known in the literature as Lasso and Ridge regression respectively. Due to the critique of both Lasso and Ridge regression, Elastic Net regression was introduced to mix the two models. As a result, some variables’ coefficients are set to zero as per L1-norm and some others are penalized or shrank as per the L2-norm. This model combines the best from both worlds and the result is a stable, robust, and a sparse model. As a consequence, there are more parameters to be fine-tuned. That’s why this is a good example to demonstrate the power of cross-validation. Crypto Data Set I have obtained ETHereum/USD exchange prices for the year 2019 from cryptodatadownload.com which you can get for free from the website or by running the following command: $ wget http://www.cryptodatadownload.com/cdd/Gemini_ETHUSD_d.csv Now that you have the CSV file you can import it to Python using Pandas. The daily close price is used as both regressor and response variables. In this setup, I have used a lag of 64 days for regressors and a target of 8 days for responses. That is, given the past 64 days closing prices forecast the next 8 days. Then the resulting nan rows at the tail are dropped as a way to handle missing values. df = pd.read_csv('./Gemini_ETHUSD_d.csv', skiprows=1) for i in range(1, STEPS): col_name = 'd{}'.format(i) df[col_name] = df['d0'].shift(periods=-1 * i) df = df.dropna() Next, we split the data frame into two one for the regressors and the other for the responses. And then split both into two one for training and the other for testing. X = df.iloc[:, :TRAIN_STEPS] y = df.iloc[:, TRAIN_STEPS:] X_train = X.iloc[:SPLIT_IDX, :] y_train = y.iloc[:SPLIT_IDX, :] X_test = X.iloc[SPLIT_IDX:, :] y_test = y.iloc[SPLIT_IDX:, :] Model Design Let’s define a method that creates an elastic net model from sci-kit learn and since we are going to forecast more than one future time step, let’s use a multi-output regressor wrapper that trains a separate model for each target time step. However, this introduces more demand for computation resources. def build_model(_alpha, _l1_ratio): estimator = ElasticNet( alpha=_alpha, l1_ratio=_l1_ratio, fit_intercept=True, normalize=False, precompute=False, max_iter=16, copy_X=True, tol=0.1, warm_start=False, positive=False, random_state=None, selection='random' ) return MultiOutputRegressor(estimator, n_jobs=4) Blocked and Time Series Splits Cross-Validation The best way to grasp the intuition behind blocked and time series splits is by visualizing them. The three split methods are depicted in the above diagram. The horizontal axis is the training set size while the vertical axis represents the cross-validation iterations. The folds used for training are depicted in blue and the folds used for validation are depicted in orange. You can intuitively interpret the horizontal axis as time progression line since we haven’t shuffled the dataset and maintained the chronological order. The idea for time series splits is to divide the training set into two folds at each iteration on condition that the validation set is always ahead of the training split. At the first iteration, one trains the candidate model on the closing prices from January to March and validates on April’s data, and for the next iteration, train on data from January to April, and validate on May’s data, and so on to the end of the training set. This way dependence is respected. However, this may introduce leakage from future data to the model. The model will observe future patterns to forecast and try to memorize them. That’s why blocked cross-validation was introduced. It works by adding margins at two positions. The first is between the training and validation folds in order to prevent the model from observing lag values which are used twice, once as a regressor and another as a response. The second is between the folds used at each iteration in order to prevent the model from memorizing patterns from an iteration to the next. Implementing k-fold cross-validation using sci-kit learn is pretty straightforward, but in the following lines of code, we pass the k-fold splitter explicitly as we will develop the idea further in order to implement other kinds of cross-validation. model = build_model(_alpha=1.0, _l1_ratio=0.3) kfcv = KFold(n_splits=5) scores = cross_val_score(model, X_train, y_train, cv=kfcv, scoring=r2) print("Loss: {0:.3f} (+/- {1:.3f})".format(scores.mean(), scores.std())) This outputs: Loss: -103.076 (+/- 205.979) The same applies to time series splitter as follows: model = build_model(_alpha=1.0, _l1_ratio=0.3) tscv = TimeSeriesSplit(n_splits=5) scores = cross_val_score(model, X_train, y_train, cv=tscv, scoring=r2) print("Loss: {0:.3f} (+/- {1:.3f})".format(scores.mean(), scores.std())) This outputs: Loss: -9.799 (+/- 19.292) Sci-kit learn gives us the luxury to define any new types of splitters as long as we abide by its splitter API and inherit from the base splitter. class BlockingTimeSeriesSplit(): def __init__(self, n_splits): self.n_splits = n_splits def get_n_splits(self, X, y, groups): return self.n_splits def split(self, X, y=None, groups=None): n_samples = len(X) k_fold_size = n_samples // self.n_splits indices = np.arange(n_samples) margin = 0 for i in range(self.n_splits): start = i * k_fold_size stop = start + k_fold_size mid = int(0.8 * (stop - start)) + start yield indices[start: mid], indices[mid + margin: stop] Then we can use it exactly the same way like before. model = build_model(_alpha=1.0, _l1_ratio=0.3) btscv = BlockingTimeSeriesSplit(n_splits=5) scores = cross_val_score(model, X_train, y_train, cv=btscv, scoring=r2) print("Loss: {0:.3f} (+/- {1:.3f})".format(scores.mean(), scores.std())) This outputs: Loss: -15.527 (+/- 27.488) Please notice how the loss is different among the different types of splitters. In order to interpret the results correctly, let’s put it to test by using grid search cross-validation to find the optimal values for both regularization parameter alpha and -ratio that controls how much -norm contributes to the regularization. It follows that -norm contributes 1 - . params = { 'estimator__alpha':(0.1, 0.3, 0.5, 0.7, 0.9), 'estimator__l1_ratio':(0.1, 0.3, 0.5, 0.7, 0.9) } for i in range(100): model = build_model(_alpha=1.0, _l1_ratio=0.3) finder = GridSearchCV( estimator=model, param_grid=params, scoring=r2, fit_params=None, n_jobs=None, iid=False, refit=False, cv=kfcv, # change this to the splitter subject to test verbose=1, pre_dispatch=8, error_score=-999, return_train_score=True ) finder.fit(X_train, y_train) best_params = finder.best_params_ Experimental Results K-Fold Cross-Validation Optimal Parameters Grid-search cross-validation was run 100 times in order to objectively measure the consistency of the results obtained using each splitter. This way we can evaluate the effectiveness and robustness of the cross-validation method on time series forecasting. As for the k-fold cross-validation, the parameters suggested were almost uniform. That is, it did not really help us in discriminating the optimal parameters since all were equally good or bad. Time Series Split Cross-Validation Optimal Parameters Blocked Cross-Validation Optimal Parameters However, in both the cases of time series split cross-validation and blocked cross-validation, we have obtained a clear indication of the optimal values for both parameters. In case of blocked cross-validation, the results were even more discriminative as the blue bar indicates the dominance of -ratio optimal value of 0.1. Ground Truth vs Forecasting After having obtained the optimal values for our model parameters, we can train the model and evaluate it on the testing set. The results, as depicted in the plot above, indicate smooth capture of the trend and minimum error rate. # optimal model model = build_model(_alpha=0.1, _l1_ratio=0.1) # train model model.fit(X_train, y_train) # test score y_predicted = model.predict(X_test) score = r2_score(y_test, y_predicted, multioutput='uniform_average') print("Test Loss: {0:.3f}".format(score)) The output is: Test Loss: 0.925 Ideas for the Curious In this tutorial, we have demonstrated the power of using the right cross-validation strategy for time-series forecasting. The beauty of machine learning is endless. Here you’re a few ideas to try out and experiment on your own: Try using a different more volatile data set Try using different lag and target length instead of 64 and 8 days each. Try different regression models Try different loss functions Try RNN models using Keras Try increasing or decreasing the blocked splits margins Try a different value for k in cross-validation References Jeff Racine,Consistent cross-validatory model-selection for dependent data: hv-block cross-validation,Journal of Econometrics,Volume 99, Issue 1,2000,Pages 39-61,ISSN 0304-4076. Dabbs, Beau & Junker, Brian. (2016). Comparison of Cross-Validation Methods for Stochastic Block Models. Marcos Lopez de Prado, 2018, Advances in Financial Machine Learning (1st ed.), Wiley Publishing. Doctor, Grado DE et al. “New approaches in time series forecasting: methods, software, and evaluation procedures.” (2013). Learn More Seize the chance to learn more about time series forecasting techniques, machine learning, trading strategies, and algorithmic trading on my step by step online video course: Hands-on Machine Learning for Algorithmic Trading Bots with Python on PacktPub. Author Bio Mustafa Qamar-ud-Din is a machine learning engineer with over 10 years of experience in the software development industry engaged with startups on solving problems in various domains; e-commerce applications, recommender systems, biometric identity control, and event management. Time series modeling: What is it, Why it matters and How it’s used Implementing a simple Time Series Data Analysis in R Training RNNs for Time Series Forecasting

0
0
90763

How-To Tutorials

article-image-25-datasets-deep-learning-iot

Sugandha Lahoti

20 Mar 2018

8 min read

25 Datasets for Deep Learning in IoT

Sugandha Lahoti

20 Mar 2018

8 min read

Deep Learning is one of the major players for facilitating the analytics and learning in the IoT domain. A really good roundup of the state of deep learning advances for big data and IoT is described in the paper Deep Learning for IoT Big Data and Streaming Analytics: A Survey by Mehdi Mohammadi, Ala Al-Fuqaha, Sameh Sorour, and Mohsen Guizani. In this article, we have attempted to draw inspiration from this research paper to establish the importance of IoT datasets for deep learning applications. The paper also provides a handy list of commonly used datasets suitable for building deep learning applications in IoT, which we have added at the end of the article. IoT and Big Data: The relationship IoT and Big data have a two-way relationship. IoT is the main producer of big data, and as such an important target for big data analytics to improve the processes and services of IoT. However, there is a difference between the two. Large-Scale Streaming data: IoT data is a large-scale streaming data. This is because a large number of IoT devices generate streams of data continuously. Big data, on the other hand, lack real-time processing. Heterogeneity: IoT data is heterogeneous as various IoT data acquisition devices gather different information. Big data devices are generally homogeneous in nature. Time and space correlation: IoT sensor devices are also attached to a specific location, and thus have a location and time-stamp for each of the data items. Big data sensors lack time-stamp resolution. High noise data: IoT data is highly noisy, owing to the tiny pieces of data in IoT applications, which are prone to errors and noise during acquisition and transmission. Big data, in contrast, is generally less noisy. Big data, on the other hand, is classified according to conventional 3V’s, Volume, Velocity, and Variety. As such techniques used for Big data analytics are not sufficient to analyze the kind of data, that is being generated by IoT devices. For instance, autonomous cars need to make fast decisions on driving actions such as lane or speed change. These decisions should be supported by fast analytics with data streaming from multiple sources (e.g., cameras, radars, left/right signals, traffic light etc.). This changes the definition of IoT big data classification to 6V’s. Volume: The quantity of generated data using IoT devices is much more than before and clearly fits this feature. Velocity: Advanced tools and technologies for analytics are needed to efficiently operate the high rate of data production. Variety: Big data may be structured, semi-structured, and unstructured data. The data types produced by IoT include text, audio, video, sensory data and so on. Veracity: Veracity refers to the quality, consistency, and trustworthiness of the data, which in turn leads to accurate analytics. Variability: This property refers to the different rates of data flow. Value: Value is the transformation of big data to useful information and insights that bring competitive advantage to organizations. Despite the recent advancement in DL for big data, there are still significant challenges that need to be addressed to mature this technology. Every 6 characteristics of IoT big data imposes a challenge for DL techniques. One common denominator for all is the lack of availability of IoT big data datasets. IoT datasets and why are they needed Deep learning methods have been promising with state-of-the-art results in several areas, such as signal processing, natural language processing, and image recognition. The trend is going up in IoT verticals as well. IoT datasets play a major role in improving the IoT analytics. Real-world IoT datasets generate more data which in turn improve the accuracy of DL algorithms. However, the lack of availability of large real-world datasets for IoT applications is a major hurdle for incorporating DL models in IoT. The shortage of these datasets acts as a barrier to deployment and acceptance of IoT analytics based on DL since the empirical validation and evaluation of the system should be shown promising in the natural world. The lack of availability is mainly because: Most IoT datasets are available with large organizations who are unwilling to share it so easily. Access to the copyrighted datasets or privacy considerations. These are more common in domains with human data such as healthcare and education. While there is a lot of ground to be covered in terms of making datasets for IoT available, here is a list of commonly used datasets suitable for building deep learning applications in IoT. Dataset Name Domain Provider Notes Address/Link CGIAR dataset Agriculture, Climate CCAFS High-resolution climate datasets for a variety of fields including agricultural http://www.ccafs-climate.org/ Educational Process Mining Education University of Genova Recordings of 115 subjects’ activities through a logging application while learning with an educational simulator http://archive.ics.uci.edu/ml/datasets/Educational+Process+Mining+%28EPM%29%3A+A+Learning+Analytics+Data+Set Commercial Building Energy Dataset Energy, Smart Building IIITD Energy related data set from a commercial building where data is sampled more than once a minute. http://combed.github.io/ Individual household electric power consumption Energy, Smart home EDF R&D, Clamart, France One-minute sampling rate over a period of almost 4 years http://archive.ics.uci.edu/ml/datasets/Individual+household+electric+power+consumption AMPds dataset Energy, Smart home S. Makonin AMPds contains electricity, water, and natural gas measurements at one minute intervals for 2 years of monitoring http://ampds.org/ UK Domestic Appliance-Level Electricity Energy, Smart Home Kelly and Knottenbelt Power demand from five houses. In each house both the whole-house mains power demand as well as power demand from individual appliances are recorded. http://www.doc.ic.ac.uk/∼dk3810/data/ PhysioBank databases Healthcare PhysioNet Archive of over 80 physiological datasets. https://physionet.org/physiobank/database/ Saarbruecken Voice Database Healthcare Universitat¨ des Saarlandes A collection of voice recordings from more than 2000 persons for pathological voice detection. http://www.stimmdatebank.coli.uni-saarland.de/help_en.php4 T-LESS Industry CMP at Czech Technical University An RGB-D dataset and evaluation methodology for detection and 6D pose estimation of texture-less objects http://cmp.felk.cvut.cz/t-less/ CityPulse Dataset Collection Smart City CityPulse EU FP7 project Road Traffic Data, Pollution Data, Weather, Parking http://iot.ee.surrey.ac.uk:8080/datasets.html Open Data Institute - node Trento Smart City Telecom Italia Weather, Air quality, Electricity, Telecommunication http://theodi.fbk.eu/openbigdata/ Malaga datasets Smart City City of Malaga A broad range of categories such as energy, ITS, weather, Industry, Sport, etc. http://datosabiertos.malaga.eu/dataset Gas sensors for home activity monitoring Smart home Univ. of California San Diego Recordings of 8 gas sensors under three conditions including background, wine and banana presentations. http://archive.ics.uci.edu/ml/datasets/Gas+sensors+for+home+activity+monitoring CASAS datasets for activities of daily living Smart home Washington State University Several public datasets related to Activities of Daily Living (ADL) performance in a two story home, an apartment, and an office settings. http://ailab.wsu.edu/casas/datasets.html ARAS Human Activity Dataset Smart home Bogazici University Human activity recognition datasets collected from two real houses with multiple residents during two months. https://www.cmpe.boun.edu.tr/aras/ MERLSense Data Smart home, building Mitsubishi Electric Research Labs Motion sensor data of residual traces from a network of over 200 sensors for two years, containing over 50 million records. http://www.merl.com/wmd SportVU Sport Stats LLC Video of basketball and soccer games captured from 6 cameras. http://go.stats.com/sportvu RealDisp Sport O. Banos Includes a wide range of physical activities (warm up, cool down and fitness exercises). http://orestibanos.com/datasets.htm Taxi Service Trajectory Transportation Prediction Challenge, ECML PKDD 2015 Trajectories performed by all the 442 taxis running in the city of Porto, in Portugal. http://www.geolink.pt/ecmlpkdd2015-challenge/dataset.html GeoLife GPS Trajectories Transportation Microsoft A GPS trajectory by a sequence of time-stamped points https://www.microsoft.com/en-us/download/details.aspx?id=52367 T-Drive trajectory data Transportation Microsoft Contains a one-week trajectories of 10,357 taxis https://www.microsoft.com/en-us/research/publication/t-drive-trajectory-data-sample/ Chicago Bus Traces data Transportation M. Doering Bus traces from the Chicago Transport Authority for 18 days with a rate between 20 and 40 seconds. http://www.ibr.cs.tu-bs.de/users/mdoering/bustraces/ Uber trip data Transportation FiveThirtyEight About 20 million Uber pickups in New York City during 12 months. https://github.com/fivethirtyeight/uber-tlc-foil-response Traffic Sign Recognition Transportation K. Lim Three datasets: Korean daytime, Korean nighttime, and German daytime traffic signs based on Vienna traffic rules. https://figshare.com/articles/Traffic_Sign_Recognition_Testsets/4597795 DDD17 Transportation J. Binas End-To-End DAVIS Driving Dataset. http://sensors.ini.uzh.ch/databases.html

0
2
88157

article-image-how-to-customize-lines-and-markers-in-matplotlib-2-0

Sugandha Lahoti

13 Dec 2017

6 min read

How to Customize lines and markers in Matplotlib 2.0

Sugandha Lahoti

13 Dec 2017

6 min read

[box type="note" align="" class="" width=""]This article is an excerpt from a book by Allen Chi Shing Yu, Claire Yik Lok Chung, and Aldrin Kay Yuen Yim, titled Matplotlib 2.x By Example. The book illustrates methods and applications of various plot types through real world examples.[/box] In this post we demonstrate how you can manipulate Lines and Markers in Matplotlib 2.0. It covers steps to plot, customize, and adjust line graphs and markers. What are Lines and Markers Lines and markers are key components found among various plots. Many times, we may want to customize their appearance to better distinguish different datasets or for better or more consistent styling. Whereas markers are mainly used to show data, such as line plots and scatter plots, lines are involved in various components, such as grids, axes, and box outlines. Like text properties, we can easily apply similar settings for different line or marker objects with the same method. Lines Most lines in Matplotlib are drawn with the lines class, including the ones that display the data and those setting area boundaries. Their style can be adjusted by altering parameters in lines.Line2D. We usually set color, linestyle, and linewidth as keyword arguments. These can be written in shorthand as c, ls, and lw respectively. In the case of simple line graphs, these parameters can be parsed to the plt.plot() function: import numpy as np import matplotlib.pyplot as plt # Prepare a curve of square numbers x = np.linspace(0,200,100) # Prepare 100 evenly spaced numbers from # 0 to 200 y = x**2 # Prepare an array of y equals to x squared # Plot a curve of square numbers plt.plot(x,y,label = '$x^2$',c='burlywood',ls=('dashed'),lw=2) plt.legend() plt.show() With the preceding keyword arguments for line color, style, and weight, you get a woody dashed curve: Choosing dash patterns Whether a line will appear solid or with dashes is set by the keyword argument linestyle. There are a few simple patterns that can be set by the linestyle name or the corresponding shorthand. We can also define our own dash pattern: 'solid' or '-': Simple solid line (default) 'dashed' or '--': Dash strokes with equal spacing 'dashdot' or '-.': Alternate dashes and dots 'None', ' ', or '': No lines (offset, on-off-dash-seq): Customized dashes; we will demonstrate in the following advanced example Setting capstyle of dashes The cap of dashes can be rounded by setting the dash_capstyle parameter if we want to create a softer image such as in promotion: import numpy as np import matplotlib.pyplot as plt # Prepare 6 lines x = np.linspace(0,200,100) y1 = x*0.5 y2 = x y3 = x*2 y4 = x*3 y5 = x*4 y6 = x*5 # Plot lines with different dash cap styles plt.plot(x,y1,label = '0.5x', lw=5, ls=':',dash_capstyle='butt') plt.plot(x,y2,label = 'x', lw=5, ls='--',dash_capstyle='butt') plt.plot(x,y3,label = '2x', lw=5, ls=':',dash_capstyle='projecting') plt.plot(x,y4,label = '3x', lw=5, ls='--',dash_capstyle='projecting') plt.plot(x,y5,label = '4x', lw=5, ls=':',dash_capstyle='round') plt.plot(x,y6,label = '5x', lw=5, ls='--',dash_capstyle='round') plt.show() Looking closely, you can see that the top two lines are made up of rounded dashes. The middle two lines with projecting capstyle have closer spaced dashes than the lower two with butt one, given the same default spacing: Markers A marker is another type of important component for illustrating data, for example, in scatter plots, swarm plots, and time series plots. Choosing markers There are two groups of markers, unfilled markers and filled_markers. The full set of available markers can be found by calling Line2D.markers, which will output a dictionary of symbols and their corresponding marker style names. A subset of filled markers that gives more visual weight is under Line2D.filled_markers. Here are some of the most typical markers: 'o' : Circle 'x' : Cross '+' : Plus sign 'P' : Filled plus sign 'D' : Filled diamond 'S' : Square '^' : Triangle Here is a scatter plot of random numbers to illustrate the various marker types: import numpy as np import matplotlib.pyplot as plt from matplotlib.lines import Line2D # Prepare 100 random numbers to plot x = np.random.rand(100) y = np.random.rand(100) # Prepare 100 random numbers within the range of the number of # available markers as index # Each random number will serve as the choice of marker of the # corresponding coordinates markerindex = np.random.randint(0, len(Line2D.markers), 100) # Plot all kinds of available markers at random coordinates # for each type of marker, plot a point at the above generated # random coordinates with the marker type for k, m in enumerate(Line2D.markers): i = (markerindex == k) plt.scatter(x[i], y[i], marker=m) plt.show() The different markers suit different densities of data for better distinction of each point: Adjusting marker sizes We often want to change the marker sizes so as to make them clearer to read from a slideshow. Sometimes we need to adjust the markers to have a different numerical value of marker size to: import numpy as np import matplotlib.pyplot as plt import matplotlib.ticker as ticker # Prepare 5 lines x = np.linspace(0,20,10) y1 = x y2 = x*2 y3 = x*3 y4 = x*4 y5 = x*5 # Plot lines with different marker sizes plt.plot(x,y1,label = 'x', lw=2, marker='s', ms=10) # square size 10 plt.plot(x,y2,label = '2x', lw=2, marker='^', ms=12) # triangle size 12 plt.plot(x,y3,label = '3x', lw=2, marker='o', ms=10) # circle size 10 plt.plot(x,y4,label = '4x', lw=2, marker='D', ms=8) # diamond size 8 plt.plot(x,y5,label = '5x', lw=2, marker='P', ms=12) # filled plus sign # size 12 # get current axes and store it to ax ax = plt.gca() plt.show() After tuning the marker sizes, the different series look quite balanced: If all markers are set to have the same markersize value, the diamonds and squares may look heavier: Thus, we learned how to customize lines and markers in a Matplotlib plot for better visualization and styling. To know more about how to create and customize plots in Matplotlib, check out this book Matplotlib 2.x By Example.

0
0
87978

article-image-mastering-the-api-life-cycle-a-comprehensive-guide-to-design-implementation-release-and-maintenance

Bruno Pedro

06 Nov 2024

15 min read

Mastering the API Life Cycle: A Comprehensive Guide to Design, Implementation, Release, and Maintenance

Bruno Pedro

06 Nov 2024

15 min read

1
0
87683

article-image-how-does-elasticsearch-work-tutorial

Savia Lobo

30 Jul 2018

12 min read

How does Elasticsearch work? [Tutorial]

Savia Lobo

30 Jul 2018

12 min read

0
2
86390

article-image-which-python-framework-is-best-for-building-restful-apis-django-or-flask

Vincy Davis

07 May 2019

9 min read

Which Python framework is best for building RESTful APIs? Django or Flask?

Vincy Davis

07 May 2019

9 min read

Python is one of the top-rated programming languages. It's also known for its less-complex syntax, and its high-level, object-oriented, robust, and general-purpose programming. Python is the top choice for any first-time programmer. Since its release in 1991, Python has evolved and powered by several frameworks for web application development, scientific and mathematical computing, and graphical user interfaces to the latest REST API frameworks. This article is an excerpt taken from the book, 'Hands-On RESTful API Design Patterns and Best Practices' written by Harihara Subramanian and Pethura Raj. This book covers design strategy, essential and advanced Restful API Patterns, Legacy Modernization to Microservices centric apps. In this article, we'll explore two comprehensive frameworks, Django and Flask, so that you can choose the best one for developing your RESTful API. Django Django is a web framework also available as open source with the BSD license, designed to help developers create their web app very quickly as it takes care of additional web-development needs. It includes several packages (also known as applications) to handle typical web-development tasks, such as authentication, content administration, scaffolding, templates, caching, and syndication. Let's use the Django REST Framework (DRF) built with Python, and use it for REST API development and deployment. Django Rest Framework DRF is an open source, well-matured Python and Django library intended to help APP developers build sophisticated web APIs. DRF's modular, flexible, and customizable architecture makes the development of both simple, turnkey API endpoints and complicated REST constructs possible. The goal of DRF is to divide a model, generalize the wire representation, such as JSON or XML, and customize a set of class-based views to satisfy the specific API endpoint using a serializer that describes the mapping between views and API endpoints. Core features Django has many distinct features including: Web-browsable API This feature enhances the REST API developed with DRF. It has a rich interface, and the web-browsable API supports multiple media types too. The browsable API does mean that the APIs we build will be self-describing and the API endpoints that we create as part of the REST services and return JSON or HTML representations. The interesting fact about the web-browsable API is that we can interact with it fully through the browser, and any endpoint that we interact with using a programmatic client will also be capable of responding with a browser-friendly view onto the web-browsable API. Authentication One of the main attractive features of Django is authentication; it supports broad categories of authentication schemes, from basic authentication, token authentication, session authentication, remote user authentication, to OAuth Authentication. It also supports custom authentication schemes if we wish to implement one. DRF runs the authentication scheme at the start of the view, that is, before any other code is allowed to proceed. DRF determines the privileges of the incoming request from the permission and throttling policies and then decides whether the incoming request can be allowed or disallowed with the matched credentials. Serialization and deserialization Serialization is the process of converting complex data, such as querysets and model instances, into native Python datatypes. Converting facilitates the rendering of native data types, such as JSON or XML. DRF supports serialization through serializers classes. The serializers of DRF are similar to Django's Form and ModelForm classes. It provides a serializer class, which helps to control the output of responses. The DRF ModelSerializer classes provide a simple mechanism with which we can create serializers that deal with model instances and querysets. Serializers also do deserialization, that is, serializers allow parsed data that needs to be converted back into complex types. Also, deserialization happens only after validating the incoming data. Other noteworthy features Here are some other noteworthy features of the DRF: Routers: The DRF supports automatic URL routing to Django and provides a consistent and straightforward way to wire the view logic to a set of URLs Class-based views: A dominant pattern that enables the reusability of common functionalities Hyperlinking APIs: The DRF supports various styles (using primary keys, hyperlinking between entities, and so on) to represent the relationship between entities Generic views: Allows us to build API views that map to the database models DRF has many other features such as caching, throttling, testing, etc. Benefits of the DRF Here are some of the benefits of the DRF: Web-browsable API Authentication policies Powerful serialization Extensive documentation and excellent community support Simple yet powerful Test coverage of source code Secure and scalable Customizable Drawbacks of the DRF Here are some facts that may disappoint some Python app developers who intend to use the DRF: Monolithic and components get deployed together Based on Django ORM Steep learning curve Slow response time Flask Flask is a microframework for Python developers based on Werkzeug (WSGI toolkit) and Jinja 2 (template engine). It comes under BSD licensing. Flask is very easy to set up and simple to use. Like other frameworks, it comes with several out-of-the-box capabilities, such as a built-in development server, debugger, unit test support, templating, secure cookies, and RESTful request dispatching. The powerful Flask RESTful API framework is discussed below. Flask-RESTful Flask-RESTful is an extension for Flask that provides additional support for building REST APIs. You will never be disappointed with the time it takes to develop an API. Flask-Restful is a lightweight abstraction that works with the existing ORM/libraries. Flask-RESTful encourages best practices with minimal setup. Core features of Flask-RESTful Flask-RESTful comes with several built-in features. Django and Flask have many common RESTful frameworks, because they have almost the same supporting core features. The unique RESTful features of Flask is mentioned below. Resourceful routing The design goal of Flask-RESTful is to provide resources built on top of Flask pluggable views. The pluggable views provide a simple way to access the HTTP methods. Consider the following example code: class Todo(Resource): def get(self, user_id): .... def delete(self, user_id): .... def put(self, user_id): args = parser.parse_args() .... Restful request parsing Request parsing refers to an interface, modeled after the Python parser interface for command-line arguments, called argparser. The RESTful request parser is designed to provide uniform and straightforward access to any variable that comes within the (flask.request) request object. Output fields In most cases, app developers prefer to control rendering response data, and Flask-RESTful provides a mechanism where you can use ORM models or even custom classes as an object to render. Another interesting fact about this framework is that app developers don't need to worry about exposing any internal data structures as its let one format and filter the response objects. So, when we look at the code, it'll be evident which data would go for rendering and how it'll be formatted. Other noteworthy features Here are some other noteworthy features of Flask-RESTful: API: This is the main entry point for the restful API, which we'll initialize with the Flask application. ReqParse: This enables us to add and parse multiple arguments in the context of the single request. Input: A useful functionality, it parses the input string and returns true or false depending on the Input. If the input is from the JSON body, the type is already native Boolean and passed through without further parsing. Benefits of the Flask framework Here are some of the benefits of Flask framework: Built-in development server and debugger Out-of-the-box RESTful request dispatching Support for secure cookies Integrated unit-test support Lightweight Very minimal setup Faster (performance) Easy NoSQL integration Extensive documentation Drawbacks of Flask Here are some of Flask and Flask-RESTful's disadvantages: Version management (managed by developers) No brownie points as it doesn't have browsable APIs May incur a steep learning curve Frameworks – a table of reference The following table provides a quick reference of a few other prominent micro-frameworks, their features, and supported programming languages: Language Framework Short description Prominent features Java Blade Fast and elegant MVC framework for Java8 Lightweight High performance Based on the MVC pattern RESTful-style router interface Built-in security Java/Scala Play Framework High-velocity Reactive web framework for Java and Scala Lightweight, stateless, and web-friendly architecture Built on Akka Supports predictable and minimal resource-consumption for highly-scalable applications Developer-friendly Java Ninja Web Framework Full-stack web framework Fast Developer-friendly Rapid prototyping Plain vanilla Java, dependency injection, first-class IDE integration Simple and fast to test (mocked tests/integration tests) Excellent build and CI support Clean codebase – easy to extend Java RESTEASY JBoss-based implementation that integrates several frameworks to help to build RESTful Web and Java applications Fast and reliable Large community Enterprise-ready Security support Java RESTLET A lightweight and comprehensive framework based on Java, suitable for both server and client applications. Lightweight Large community Native REST support Connectors set JavaScript Express.js Minimal and flexible Node.js-based JavaScript framework for mobile and web applications HTTP utility methods Security updates Templating engine PHP Laravel An open source web-app builder based on PHP and the MVC architecture pattern Intuitive interface Blade template engine Eloquent ORM as default Elixir Phoenix (Elixir) Powered with the Elixir functional language, a reliable and faster micro-framework MVC-based High application performance Erlong virtual machine enables better use of resources Python Pyramid Python-based micro-framework Lightweight Function decorators Events and subscribers support Easy implementations and high productivity Summary It's evident that Python has two excellent frameworks. Depending on the choice of programming language you are intending to use and the required features, you can choose your type of framework to work on. If you are interested in learning more about the design strategy, guidelines and best practices of Restful API Patterns, you can refer to our book 'Hands-On RESTful API Design Patterns and Best Practices' here. Stack Overflow survey data further confirms Python’s popularity as it moves above Java in the most used programming language list. Svelte 3 releases with reactivity through language instead of an API Microsoft introduces Pyright, a static type checker for the Python language written in TypeScript

0
0
85386

How-To Tutorials

article-image-image-filtering-techniques-opencv

Vijin Boricha

12 Apr 2018

15 min read

Image filtering techniques in OpenCV

Vijin Boricha

12 Apr 2018

15 min read

In the world of computer vision, image filtering is used to modify images. These modifications essentially allow you to clarify an image in order to get the information you want. This could involve anything from extracting edges from an image, blurring it, or removing unwanted objects. There are, of course, lots of reasons why you might want to use image filtering to modify an image. For example, taking a picture in sunlight or darkness will impact an images clarity - you can use image filters to modify the image to get what you want from it. Similarly, you might have a blurred or 'noisy' image that needs clarification and focus. Let's use an example to see how to do image filtering in OpenCV. This image filtering tutorial is an extract from Practical Computer Vision. Here's an example with considerable salt and pepper noise. This occurs when there is a disturbance in the quality of the signal that's used to generate the image. The image above can be easily generated using OpenCV as follows: # initialize noise image with zeros noise = np.zeros((400, 600)) # fill the image with random numbers in given range cv2.randu(noise, 0, 256) Let's add weighted noise to a grayscale image (on the left) so the resulting image will look like the one on the right: The code for this is as follows: # add noise to existing image noisy_gray = gray + np.array(0.2*noise, dtype=np.int) Here, 0.2 is used as parameter, increase or decrease the value to create different intensity noise. In several applications, noise plays an important role in improving a system's capabilities. This is particularly true when you're using deep learning models. The noise becomes a way of testing the precision of the deep learning application, and building it into the computer vision algorithm. Linear image filtering The simplest filter is a point operator. Each pixel value is multiplied by a scalar value. This operation can be written as follows: Here: The input image is F and the value of pixel at (i,j) is denoted as f(i,j) The output image is G and the value of pixel at (i,j) is denoted as g(i,j) K is scalar constant This type of operation on an image is what is known as a linear filter. In addition to multiplication by a scalar value, each pixel can also be increased or decreased by a constant value. So overall point operation can be written like this: This operation can be applied both to grayscale images and RGB images. For RGB images, each channel will be modified with this operation separately. The following is the result of varying both K and L. The first image is input on the left. In the second image, K=0.5 and L=0.0, while in the third image, K is set to 1.0 and L is 10. For the final image on the right, K=0.7 and L=25. As you can see, varying K changes the brightness of the image and varying L changes the contrast of the image: This image can be generated with the following code: import numpy as np import matplotlib.pyplot as plt import cv2 def point_operation(img, K, L): """ Applies point operation to given grayscale image """ img = np.asarray(img, dtype=np.float) img = img*K + L # clip pixel values img[img > 255] = 255 img[img < 0] = 0 return np.asarray(img, dtype = np.int) def main(): # read an image img = cv2.imread('../figures/flower.png') gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY) # k = 0.5, l = 0 out1 = point_operation(gray, 0.5, 0) # k = 1., l = 10 out2 = point_operation(gray, 1., 10) # k = 0.8, l = 15 out3 = point_operation(gray, 0.7, 25) res = np.hstack([gray,out1, out2, out3]) plt.imshow(res, cmap='gray') plt.axis('off') plt.show() if __name__ == '__main__': main() 2D linear image filtering While the preceding filter is a point-based filter, image pixels have information around the pixel as well. In the previous image of the flower, the pixel values in the petal are all yellow. If we choose a pixel of the petal and move around, the values will be quite close. This gives some more information about the image. To extract this information in filtering, there are several neighborhood filters. In neighborhood filters, there is a kernel matrix which captures local region information around a pixel. To explain these filters, let's start with an input image, as follows: This is a simple binary image of the number 2. To get certain information from this image, we can directly use all the pixel values. But instead, to simplify, we can apply filters on this. We define a matrix smaller than the given image which operates in the neighborhood of a target pixel. This matrix is termed kernel; an example is given as follows: The operation is defined first by superimposing the kernel matrix on the original image, then taking the product of the corresponding pixels and returning a summation of all the products. In the following figure, the lower 3 x 3 area in the original image is superimposed with the given kernel matrix and the corresponding pixel values from the kernel and image are multiplied. The resulting image is shown on the right and is the summation of all the previous pixel products: This operation is repeated by sliding the kernel along image rows and then image columns. This can be implemented as in following code. We will see the effects of applying this on an image in coming sections. # design a kernel matrix, here is uniform 5x5 kernel = np.ones((5,5),np.float32)/25 # apply on the input image, here grayscale input dst = cv2.filter2D(gray,-1,kernel) However, as you can see previously, the corner pixel will have a drastic impact and results in a smaller image because the kernel, while overlapping, will be outside the image region. This causes a black region, or holes, along with the boundary of an image. To rectify this, there are some common techniques used: Padding the corners with constant values maybe 0 or 255, by default OpenCV will use this. Mirroring the pixel along the edge to the external area Creating a pattern of pixels around the image The choice of these will depend on the task at hand. In common cases, padding will be able to generate satisfactory results. The effect of the kernel is most crucial as changing these values changes the output significantly. We will first see simple kernel-based filters and also see their effects on the output when changing the size. Box filtering This filter averages out the pixel value as the kernel matrix is denoted as follows: Applying this filter results in blurring the image. The results are as shown as follows: In frequency domain analysis of the image, this filter is a low pass filter. The frequency domain analysis is done using Fourier transformation of the image, which is beyond the scope of this introduction. We can see on changing the kernel size, the image gets more and more blurred: As we increase the size of the kernel, you can see that the resulting image gets more blurred. This is due to averaging out of peak values in small neighbourhood where the kernel is applied. The result for applying kernel of size 20x20 can be seen in the following image. However, if we use a very small filter of size (3,3) there is negligible effect on the output, due to the fact that the kernel size is quite small compared to the photo size. In most applications, kernel size is heuristically set according to image size: The complete code to generate box filtered photos is as follows: def plot_cv_img(input_image, output_image): """ Converts an image from BGR to RGB and plots """ fig, ax = plt.subplots(nrows=1, ncols=2) ax[0].imshow(cv2.cvtColor(input_image, cv2.COLOR_BGR2RGB)) ax[0].set_title('Input Image') ax[0].axis('off') ax[1].imshow(cv2.cvtColor(output_image, cv2.COLOR_BGR2RGB)) ax[1].set_title('Box Filter (5,5)') ax[1].axis('off') plt.show() def main(): # read an image img = cv2.imread('../figures/flower.png') # To try different kernel, change size here. kernel_size = (5,5) # opencv has implementation for kernel based box blurring blur = cv2.blur(img,kernel_size) # Do plot plot_cv_img(img, blur) if __name__ == '__main__': main() Properties of linear filters Several computer vision applications are composed of step by step transformations of an input photo to output. This is easily done due to several properties associated with a common type of filters, that is, linear filters: The linear filters are commutative such that we can perform multiplication operations on filters in any order and the result still remains the same: a * b = b * a They are associative in nature, which means the order of applying the filter does not affect the outcome: (a * b) * c = a * (b * c) Even in cases of summing two filters, we can perform the first summation and then apply the filter, or we can also individually apply the filter and then sum the results. The overall outcome still remains the same: Applying a scaling factor to one filter and multiplying to another filter is equivalent to first multiplying both filters and then applying scaling factor These properties play a significant role in other computer vision tasks such as object detection and segmentation. A suitable combination of these filters enhances the quality of information extraction and as a result, improves the accuracy. Non-linear image filtering While in many cases linear filters are sufficient to get the required results, in several other use cases performance can be significantly increased by using non-linear image filtering. Mon-linear image filtering is more complex, than linear filtering. This complexity can, however, give you more control and better results in your computer vision tasks. Let's take a look at how non-linear image filtering works when applied to different images. Smoothing a photo Applying a box filter with hard edges doesn't result in a smooth blur on the output photo. To improve this, the filter can be made smoother around the edges. One of the popular such filters is a Gaussian filter. This is a non-linear filter which enhances the effect of the center pixel and gradually reduces the effects as the pixel gets farther from the center. Mathematically, a Gaussian function is given as: where μ is mean and σ is variance. An example kernel matrix for this kind of filter in 2D discrete domain is given as follows: This 2D array is used in normalized form and effect of this filter also depends on its width by changing the kernel width has varying effects on the output as discussed in further section. Applying gaussian kernel as filter removes high-frequency components which results in removing strong edges and hence a blurred photo: While this filter performs better blurring than a box filter, the implementation is also quite simple with OpenCV: def plot_cv_img(input_image, output_image): """ Converts an image from BGR to RGB and plots """ fig, ax = plt.subplots(nrows=1, ncols=2) ax[0].imshow(cv2.cvtColor(input_image, cv2.COLOR_BGR2RGB)) ax[0].set_title('Input Image') ax[0].axis('off') ax[1].imshow(cv2.cvtColor(output_image, cv2.COLOR_BGR2RGB)) ax[1].set_title('Gaussian Blurred') ax[1].axis('off') plt.show() def main(): # read an image img = cv2.imread('../figures/flower.png') # apply gaussian blur, # kernel of size 5x5, # change here for other sizes kernel_size = (5,5) # sigma values are same in both direction blur = cv2.GaussianBlur(img,(5,5),0) plot_cv_img(img, blur) if __name__ == '__main__': main() The histogram equalization technique The basic point operations, to change the brightness and contrast, help in improving photo quality but require manual tuning. Using histogram equalization technique, these can be found algorithmically and create a better-looking photo. Intuitively, this method tries to set the brightest pixels to white and the darker pixels to black. The remaining pixel values are similarly rescaled. This rescaling is performed by transforming original intensity distribution to capture all intensity distribution. An example of this equalization is as following: The preceding image is an example of histogram equalization. On the right is the output and, as you can see, the contrast is increased significantly. The input histogram is shown in the bottom figure on the left and it can be observed that not all the colors are observed in the image. After applying equalization, resulting histogram plot is as shown on the right bottom figure. To visualize the results of equalization in the image , the input and results are stacked together in following figure. Code for the preceding photos is as follows: def plot_gray(input_image, output_image): """ Converts an image from BGR to RGB and plots """ # change color channels order for matplotlib fig, ax = plt.subplots(nrows=1, ncols=2) ax[0].imshow(input_image, cmap='gray') ax[0].set_title('Input Image') ax[0].axis('off') ax[1].imshow(output_image, cmap='gray') ax[1].set_title('Histogram Equalized ') ax[1].axis('off') plt.savefig('../figures/03_histogram_equalized.png') plt.show() def main(): # read an image img = cv2.imread('../figures/flower.png') # grayscale image is used for equalization gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY) # following function performs equalization on input image equ = cv2.equalizeHist(gray) # for visualizing input and output side by side plot_gray(gray, equ) if __name__ == '__main__': main() Median image filtering Median image filtering a similar technique as neighborhood filtering. The key technique here, of course, is the use of a median value. As such, the filter is non-linear. It is quite useful in removing sharp noise such as salt and pepper. Instead of using a product or sum of neighborhood pixel values, this filter computes a median value of the region. This results in the removal of random peak values in the region, which can be due to noise like salt and pepper noise. This is further shown in the following figure with different kernel size used to create output. In this image first input is added with channel wise random noise as: # read the image flower = cv2.imread('../figures/flower.png') # initialize noise image with zeros noise = np.zeros(flower.shape[:2]) # fill the image with random numbers in given range cv2.randu(noise, 0, 256) # add noise to existing image, apply channel wise noise_factor = 0.1 noisy_flower = np.zeros(flower.shape) for i in range(flower.shape[2]): noisy_flower[:,:,i] = flower[:,:,i] + np.array(noise_factor*noise, dtype=np.int) # convert data type for use noisy_flower = np.asarray(noisy_flower, dtype=np.uint8) The created noisy image is used for median image filtering as: # apply median filter of kernel size 5 kernel_5 = 5 median_5 = cv2.medianBlur(noisy_flower,kernel_5) # apply median filter of kernel size 3 kernel_3 = 3 median_3 = cv2.medianBlur(noisy_flower,kernel_3) In the following photo, you can see the resulting photo after varying the kernel size (indicated in brackets). The rightmost photo is the smoothest of them all: The most common application for median blur is in smartphone application which filters input image and adds an additional artifacts to add artistic effects. The code to generate the preceding photograph is as follows: def plot_cv_img(input_image, output_image1, output_image2, output_image3): """ Converts an image from BGR to RGB and plots """ fig, ax = plt.subplots(nrows=1, ncols=4) ax[0].imshow(cv2.cvtColor(input_image, cv2.COLOR_BGR2RGB)) ax[0].set_title('Input Image') ax[0].axis('off') ax[1].imshow(cv2.cvtColor(output_image1, cv2.COLOR_BGR2RGB)) ax[1].set_title('Median Filter (3,3)') ax[1].axis('off') ax[2].imshow(cv2.cvtColor(output_image2, cv2.COLOR_BGR2RGB)) ax[2].set_title('Median Filter (5,5)') ax[2].axis('off') ax[3].imshow(cv2.cvtColor(output_image3, cv2.COLOR_BGR2RGB)) ax[3].set_title('Median Filter (7,7)') ax[3].axis('off') plt.show() def main(): # read an image img = cv2.imread('../figures/flower.png') # compute median filtered image varying kernel size median1 = cv2.medianBlur(img,3) median2 = cv2.medianBlur(img,5) median3 = cv2.medianBlur(img,7) # Do plot plot_cv_img(img, median1, median2, median3) if __name__ == '__main__': main() Image filtering and image gradients These are more edge detectors or sharp changes in a photograph. Image gradients widely used in object detection and segmentation tasks. In this section, we will look at how to compute image gradients. First, the image derivative is applying the kernel matrix which computes the change in a direction. The Sobel filter is one such filter and kernel in the x-direction is given as follows: Here, in the y-direction: This is applied in a similar fashion to the linear box filter by computing values on a superimposed kernel with the photo. The filter is then shifted along the image to compute all values. Following is some example results, where X and Y denote the direction of the Sobel kernel: This is also termed as an image derivative with respect to given direction(here X or Y). The lighter resulting photographs (middle and right) are positive gradients, while the darker regions denote negative and gray is zero. While Sobel filters correspond to first order derivatives of a photo, the Laplacian filter gives a second-order derivative of a photo. The Laplacian filter is also applied in a similar way to Sobel: The code to get Sobel and Laplacian filters is as follows: # sobel x_sobel = cv2.Sobel(img,cv2.CV_64F,1,0,ksize=5) y_sobel = cv2.Sobel(img,cv2.CV_64F,0,1,ksize=5) # laplacian lapl = cv2.Laplacian(img,cv2.CV_64F, ksize=5) # gaussian blur blur = cv2.GaussianBlur(img,(5,5),0) # laplacian of gaussian log = cv2.Laplacian(blur,cv2.CV_64F, ksize=5) We learnt about types of filters and how to perform image filtering in OpenCV. To know more about image transformation and 3D computer vision check out this book Practical Computer Vision. Check out for more: Fingerprint detection using OpenCV 3 3 ways to deploy a QT and OpenCV application OpenCV 4.0 is on schedule for July release

0
1
85106

article-image-implementing-3-naive-bayes-classifiers-in-scikit-learn

Packt Editorial Staff

07 May 2018

13 min read

Implementing 3 Naive Bayes classifiers in scikit-learn

Packt Editorial Staff

07 May 2018

13 min read

Scikit-learn provide three naive Bayes implementations: Bernoulli, multinomial and Gaussian. The only difference is about the probability distribution adopted. The first one is a binary algorithm particularly useful when a feature can be present or not. Multinomial naive Bayes assumes to have feature vector where each element represents the number of times it appears (or, very often, its frequency). This technique is very efficient in natural language processing or whenever the samples are composed starting from a common dictionary. The Gaussian Naive Bayes, instead, is based on a continuous distribution and it's suitable for more generic classification tasks. Ok, now that we have established naive Bayes variants are a handy set of algorithms to have in our machine learning arsenal and that Scikit-learn is a good tool to implement them, let’s rewind a bit. What is Naive Bayes? Naive Bayes are a family of powerful and easy-to-train classifiers, which determine the probability of an outcome, given a set of conditions using the Bayes' theorem. In other words, the conditional probabilities are inverted so that the query can be expressed as a function of measurable quantities. The approach is simple and the adjective naive has been attributed not because these algorithms are limited or less efficient, but because of a fundamental assumption about the causal factors that we will discuss. Naive Bayes are multi-purpose classifiers and it's easy to find their application in many different contexts. However, the performance is particularly good in all those situations when the probability of a class is determined by the probabilities of some causal factors. A good example is given by natural language processing, where a text can be considered as a particular instance of a dictionary and the relative frequencies of all terms provide enough information to infer a belonging class. Our examples may be generic, so to let you understand the application of naive Bayes in various context. The Bayes' theorem Let's consider two probabilistic events A and B. We can correlate the marginal probabilities P(A) and P(B) with the conditional probabilities P(A|B) and P(B|A) using the product rule: Considering that the intersection is commutative, the first members are equal, so we can derive the Bayes' theorem: This formula has very deep philosophical implications and it's a fundamental element of statistical learning. First of all, let's consider the marginal probability P(A): this is normally a value that determines how probable a target event is, like P(Spam) or P(Rain). As there are no other elements, this kind of probability is called Apriori, because it's often determined by mathematical considerations or simply by a frequency count. For example, imagine we want to implement a very simple spam filter and we've collected 100 emails. We know that 30 are spam and 70 are regular. So we can say that P(Spam) = 0.3. However, we'd like to evaluate using some criteria (for simplicity, let's consider a single one), for example, e-mail text is shorter than 50 characters. Therefore, our query becomes: The first term is similar to P(Spam) because it's the probability of spam given a certain condition. For this reason, it's called a posteriori (in other words, it's a probability that can estimate after knowing some additional elements). On the right side, we need to calculate the missing values, but it's simple. Let's suppose that 35 emails have a text shorter than 50 characters, P(Text < 50 chars) = 0.35 and, looking only into our spam folder, we discover that only 25 spam emails have a short text, so that P(Text < 50 chars|Spam) = 25/30 = 0.83. The result is: So, after receiving a very short email, there is 71% probability that it's a spam. Now we can understand the role of P(Text < 50 chars|Spam): as we have actual data, we can measure how probable is our hypothesis given the query, in other words, we have defined a likelihood (compare this with logistic regression) which is a weight between the Apriori probability and the a posteriori one (the term on the denominator is less important because it works as normalizing factor): The normalization factor is often represented by the Greek letter alpha, so the formula becomes: The last step is considering the case when there are more concurrent conditions (that is more realistic in real-life problems): A common assumption is called conditional independence (in other words, the eﬀects produced by every cause are independent among each other) and allows us to write a simpliﬁed expression: Naive Bayes classifiers A naive Bayes classifier is called in this way because it's based on a naive condition, which implies the conditional independence of causes. This can seem very difficult to accept in many contexts where the probability of a particular feature is strictly correlated to another one. For example, in spam filtering, a text shorter than 50 characters can increase the probability of the presence of an image, or if the domain has been already blacklisted for sending the same spam emails to million users, it's likely to find particular keywords. In other words, the presence of a cause isn't normally independent from the presence of other ones. However, in Zhang H., The Optimality of Naive Bayes, AAAI 1, no. 2 (2004): 3, the author showed that under particular conditions (not so rare to happen), different dependencies clears one another, and a naive Bayes classifier succeeds in achieving very high performances even if its naiveness is violated. Let's consider a dataset: Every feature vector, for simplicity, will be represented as: We need also a target dataset: where each y can belong to one of P different classes. Considering the Bayes' theorem under conditional independence, we can write: The values of the marginal Apriori probability P(y) and of the conditional probabilities P(xi|y) is obtained through a frequency count, therefore, given an input vector x, the predicted class is the one which a posteriori probability is maximum. Naive Bayes in scikit-learn scikit-learn implements three naive Bayes variants based on the same number of different probabilistic distributions: Bernoulli, multinomial, and Gaussian. The first one is a binary distribution useful when a feature can be present or absent. The second one is a discrete distribution used whenever a feature must be represented by a whole number (for example, in natural language processing, it can be the frequency of a term), while the latter is a continuous distribution characterized by its mean and variance. Bernoulli naive Bayes If X is random variable Bernoulli-distributed, it can assume only two values (for simplicity, let's call them 0 and 1) and their probability is: To try this algorithm with scikit-learn, we're going to generate a dummy dataset. Bernoulli naive Bayes expects binary feature vectors, however, the class BernoulliNB has a binarize parameter which allows specifying a threshold that will be used internally to transform the features: from sklearn.datasets import make_classification >>> nb_samples = 300 >>> X, Y = make_classification(n_samples=nb_samples, n_features=2, n_informative=2, n_redundant=0) We have a generated the bidimensional dataset shown in the following figure: We have decided to use 0.0 as a binary threshold, so each point can be characterized by the quadrant where it's located. Of course, this is a rational choice for our dataset, but Bernoulli naive Bayes is thought for binary feature vectors or continuous values which can be precisely split with a predeﬁned threshold. from sklearn.naive_bayes import BernoulliNB from sklearn.model_selection import train_test_split >>> X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25) >>> bnb = BernoulliNB(binarize=0.0) >>> bnb.fit(X_train, Y_train) >>> bnb.score(X_test, Y_test) 0.85333333333333339 The score in rather good, but if we want to understand how the binary classifier worked, it's useful to see how the data have been internally binarized: Now, checking the naive Bayes predictions we obtain: >>> data = np.array([[0, 0], [0, 1], [1, 0], [1, 1]]) >>> bnb.predict(data) array([0, 0, 1, 1]) Which is exactly what we expected. Multinomial naive Bayes A multinomial distribution is useful to model feature vectors where each value represents, for example, the number of occurrences of a term or its relative frequency. If the feature vectors have n elements and each of them can assume k different values with probability pk, then: The conditional probabilities P(xi|y) are computed with a frequency count (which corresponds to applying a maximum likelihood approach), but in this case, it's important to consider the alpha parameter (called Laplace smoothing factor) which default value is 1.0 and prevents the model from setting null probabilities when the frequency is zero. It's possible to assign all non-negative values, however, larger values will assign higher probabilities to the missing features and this choice could alter the stability of the model. In our example, we're going to consider the default value of 1.0. For our purposes, we're going to use the DictVectorizer. There are automatic instruments to compute the frequencies of terms, but we're going to discuss them later. Let's consider only two records: the first one representing a city, while the second one countryside. Our dictionary contains hypothetical frequencies, like if the terms were extracted from a text description: from sklearn.feature_extraction import DictVectorizer >>> data = [ {'house': 100, 'street': 50, 'shop': 25, 'car': 100, 'tree': 20}, {'house': 5, 'street': 5, 'shop': 0, 'car': 10, 'tree': 500, 'river': 1} ] >>> dv = DictVectorizer(sparse=False) >>> X = dv.fit_transform(data) >>> Y = np.array([1, 0]) >>> X array([[ 100., 100., 0., 25., 50., 20.], [ 10., 5., 1., 0., 5., 500.]]) Note that the term 'river' is missing from the first set, so it's useful to keep alpha equal to 1.0 to give it a small probability. The output classes are 1 for city and 0 for the countryside. Now we can train a MultinomialNB instance: from sklearn.naive_bayes import MultinomialNB >>> mnb = MultinomialNB() >>> mnb.fit(X, Y) MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True) To test the model, we create a dummy city with a river and a dummy country place without any river. >>> test_data = data = [ {'house': 80, 'street': 20, 'shop': 15, 'car': 70, 'tree': 10, 'river': 1}, ] {'house': 10, 'street': 5, 'shop': 1, 'car': 8, 'tree': 300, 'river': 0} >>> mnb.predict(dv.fit_transform(test_data)) array([1, 0]) As expected the prediction is correct. Later on, when discussing some elements of natural language processing, we're going to use multinomial naive Bayes for text classification with larger corpora. Even if the multinomial distribution is based on the number of occurrences, it can be successfully used with frequencies or more complex functions. Gaussian Naive Bayes Gaussian Naive Bayes is useful when working with continuous values which probabilities can be modeled using a Gaussian distribution: The conditional probabilities P(xi|y) are also Gaussian distributed and, therefore, it's necessary to estimate mean and variance of each of them using the maximum likelihood approach. This quite easy, in fact, considering the property of a Gaussian, we get: Where the k index refers to the samples in our dataset and P(xi|y) is a Gaussian itself. By minimizing the inverse of this expression (in Russel S., Norvig P., Artificial Intelligence: A Modern Approach, Pearson there's a complete analytical explanation), we get mean and variance for each Gaussian associated to P(xi|y) and the model is hence trained. As an example, we compare Gaussian Naive Bayes with logistic regression using the ROC curves. The dataset has 300 samples with two features. Each sample belongs to a single class: from sklearn.datasets import make_classification >>> nb_samples = 300 >>> X, Y = make_classification(n_samples=nb_samples, n_features=2, n_informative=2, n_redundant=0) A plot of the dataset is shown in the following figure: Now we can train both models and generate the ROC curves (the Y scores for naive Bayes are obtained through the predict_proba method): from sklearn.naive_bayes import GaussianNB from sklearn.linear_model import LogisticRegression from sklearn.metrics import roc_curve, auc from sklearn.model_selection import train_test_split >>> X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25) >>> gnb = GaussianNB() >>> gnb.fit(X_train, Y_train) >>> Y_gnb_score = gnb.predict_proba(X_test) >>> lr = LogisticRegression() >>> lr.fit(X_train, Y_train) >>> Y_lr_score = lr.decision_function(X_test) >>> fpr_gnb, tpr_gnb, thresholds_gnb = roc_curve(Y_test, Y_gnb_score[:, 1]) >>> fpr_lr, tpr_lr, thresholds_lr = roc_curve(Y_test, Y_lr_score) The resulting ROC curves are shown in the following figure: Naive Bayes performances are slightly better than logistic regression, however, the two classifiers have similar accuracy and Area Under the Curve (AUC). It's interesting to compare the performances of Gaussian and multinomial naive Bayes with the MNIST digit dataset. Each sample (belonging to 10 classes) is an 8x8 image encoded as an unsigned integer (0 - 255), therefore, even if each feature doesn't represent an actual count, it can be considered like a sort of magnitude or frequency. from sklearn.datasets import load_digits from sklearn.model_selection import cross_val_score >>> digits = load_digits() >>> gnb = GaussianNB() >>> mnb = MultinomialNB() >>> cross_val_score(gnb, digits.data, digits.target, scoring='accuracy', cv=10).mean() 0.81035375835678214 >>> cross_val_score(mnb, digits.data, digits.target, scoring='accuracy', cv=10).mean() 0.88193962163008377 The multinomial naive Bayes performs better than the Gaussian variant and the result is not really surprising. In fact, each sample can be thought as a feature vector derived from a dictionary of 64 symbols. The value can be the count of each occurrence, so a multinomial distribution can better fit the data, while a Gaussian is slightly more limited by its mean and variance. We've exposed the generic naive Bayes approach starting from the Bayes' theorem and its intrinsic philosophy. The naiveness of such algorithm is due to the choice to assume all the causes to be conditional independent. It means that each contribution is the same in every combination and the presence of a specific cause cannot alter the probability of the other ones. This is not so often realistic, however, under some assumptions; it's possible to show that internal dependencies clear each other so that the resulting probability appears unaffected by their relations. [box type="note" align="" class="" width=""]You read an excerpt from the book, Machine Learning Algorithms, written by Giuseppe Bonaccorso. This book will help you build strong foundation to enter the world of machine learning and data science. You will learn to build a data model and see how it behaves using different ML algorithms, explore support vector machines, recommendation systems, and even create a machine learning architecture from scratch. Grab your copy today![/box] What is Naïve Bayes classifier? Machine Learning Algorithms: Implementing Naive Bayes with Spark MLlib Implementing Apache Spark MLlib Naive Bayes to classify digital breath test data for drunk driving

0
0
85097

article-image-building-an-llm-powered-app-using-snowflake-and-streamlit

Ryan Goodman

30 Jan 2024

11 min read

Building an LLM-powered App using Snowflake and Streamlit

Ryan Goodman

30 Jan 2024

11 min read

Dive deeper into the world of AI innovation and stay ahead of the AI curve! Subscribe to our AI_Distilled newsletter for the latest insights. Don't miss out – sign up today!IntroductionFor years, self-service analytics apps have enabled both information consumers (business users) and information workers (analysts) to meet their need for data assets that aid analysis and problem-solving. These data assets can include ready-made insights and analysis in the form of statistics, visual stories, or formatted data for further discovery. Historically, for an enterprise to embark on creating analytics apps, it required a specialized skillset, technology tools, and a steep learning curve to deliver value.Three significant trends have shifted how we view analytics apps today:● No-code and low-code data acquisition, along with cloud data/warehouse platforms, have helped democratize the data platform.● Data platforms like Snowflake are designed to bring analytics computing into a single platform where data no longer needs to be copied and moved.● The democratization of machine learning and the widespread availability of powerful generative AI models have changed the entire user experience and expectations for information discovery and natural language exploration.The result of these trends has accelerated technology cycles and the rate of innovation in unprecedented ways. Prudent technology and business leaders are strained with more requests and fewer resources to use data to build information-focused businesses.Currently, we have AI app and analytics waves breaking at the same time with different use cases in mind but the same objective. For this article, we wanted to explore the basics of building a simple analytics app inside of Snowflake, allowing an OpenAI interface to execute code without ever accessing any of the resulting data.Modern Data Cloud and Analytics Technology ToolsLet us explore the process and benefits of building an LLM-powered application using a cloud-based data warehousing platform like Snowflake and an open-source Python library for creating web applications like Streamlit. Ref: https://www.snowflake.com/blog/building-python-data-apps-streamlit/Understanding Snowflake Data Warehousing Snowflake is a leading cloud data platform offering secure and scalable solutions for processing and storing data. The architecture of Snowflake allows easy integration with programming languages. It eventually works on data-intensive applications. To work with Snowflake, one must create a Snowflake account to set up the database for data storage.LLM Powered Inputs and TranslationEvery large language model, including GPT-4, is capable of understanding and generating human-like texts based on prompts and inputs it receives. These models are trained on vast datasets, enabling them to comprehend large and complex language patterns and generate contextually relevant responses. An incredible aspect of large language models, particularly GPT-4, is their ability to effectively translate natural language into code, including SQL and Python.Large language models are not designed for computational procedures like statistics and analytics, but with the right prompting and, most importantly, context, you can streamline many common tasks.Integration of Snowflake with Python and Streamlit SnowparkIn data analysis and machine learning (ML), Python is the most versatile programming language. Snowflake offers a Python connector that enables seamless communication between Snowflake databases and Python scripts. In this article, we are not using Snowpark.Storyboarding our AppThe difference between a good app and a great app lies in the value you create for your user. The secret to building a great app is empowering users to solve problems that would otherwise be painful or impossible due to a lack of skills. The app we are building here demonstrates how to fit technology components together.Minimum Viable Product Storyboard:● End user: Analytics app developer● Intent: Demonstrate core tech components● Outcome: Have● Value: Quickly understand a functional code example without having to researchWe will build a native Streamlit app inside of Snowflake:● The app will feature a chat interface powered by ChatGPT.● The chat history will be written on a Snowflake table.● The GPT model will read the results of a simple query, interpret the results, and summarize them in plain English.Bringing Technology Components TogetherFor this article, we decided to build a simple end-to-end demonstration of how a native Snowflake app built with Python and Streamlit can utilize a chatbot interface that uses ChatGPT-4 to generate SQL code that can be executed natively in Snowflake with the context of the schema.Snowflake Integration of ChatGPT Large Language Model APITo receive responses with the help of a large language model, leverage the OpenAI Documentation and Playground. Obtain the OpenAI GPT Key, and then use the following code to interact with a large language model.-- Step 1 - Create a Secret for open ai key . CREATE OR REPLACE SECRET open_ai_api_key TYPE = GENERIC_STRING SECRET_STRING = '<OPEN_AI_KEY>'; -- Step 2 - Create a Network rule on Snowflake CREATE OR REPLACE NETWORK RULE openai_network_rule MODE = EGRESS TYPE = HOST_PORT VALUE_LIST = ('api.openai.com'); -- Step 3 Create a EXTERNAL ACCESS INTEGRATION in Snowflake CREATE OR REPLACE EXTERNAL ACCESS INTEGRATION external_access_int ALLOWED_NETWORK_RULES = (openai_network_rule) ALLOWED_AUTHENTICATION_SECRETS = (open_ai_api_key) ENABLED = true; -- Step 4 Create a UDF using openai packages . Here we are using "gpt-3.5-turbo" Model CREATE OR REPLACE FUNCTION CHATGPTv1(query varchar) RETURNS STRING LANGUAGE PYTHON RUNTIME_VERSION = 3.9 HANDLER = 'runner' EXTERNAL_ACCESS_INTEGRATIONS = (external_access_int) SECRETS = ('openai_key' = open_ai_api_key) PACKAGES = ('openai') AS $$ import _snowflake import openai def runner(QUERY): openai.api_key = _snowflake.get_generic_secret_string('openai_key') messages = [{"role": "user", "content": QUERY}] model="gpt-3.5-turbo" response = openai.ChatCompletion.create(model=model,messages=messages,temperature=0,) return response.choices[0].message["content"] $$; -- Test your UDF SELECT CHATGPTv1('Hi')Creation of Streamlit User Experience InterfaceTo create the Streamlit user experience the following code was utilized to build a very basic functional prototype with GPT3.5 Turbo.1. Installation:pip install Streamlit2. Creation:from snowflake.snowpark.context import get_active_session st.set_page_config(layout="wide") st.title("OPEN AI IN SIS - GPT-3.5-turbo(MODEL)") st.write("##") st.write("##") # Get the current credentials session = get_active_session() if 'request_response' not in st.session_state: st.session_state['request_response'] = {} if st.session_state['request_response']: for itr in st.session_state['request_response'].keys(): request_col , request_col1 = st.columns(2) response_col1 , response_col = st.columns(2) with request_col: st.write(f":bust_in_silhouette: :blue[{itr}]") st.write("##") with response_col: st.write(f":speech_balloon: :red[{st.session_state['request_response'][itr][0]}]") col1 ,col2 = st.columns(2) with col1: search_text= st.text_input("Send a message") search_button = st.button("Send") if search_text and search_button: search_result = session.sql(f"SELECT CHATGPTv1('{search_text}')").collect() if search_result: st.session_state['request_response'][search_text] = [search_result[0][0]] st.experimental_rerun()3. Run:Streamlit run app.pyMoving from MVP to Real-World ApplicationReal-world analytics apps are designed with a narrow scope, outcome, and value in mind. Let's expand on the same technology components and formulate a real-world use case that will be more impactful to an enterprise. When evaluating real-world business cases to apply Streamlit and OpenAI, focus on use cases that deliver value frequently, to many (or important) people in your organization, and are tied to high-impact business processes.Data Tape Co-pilot Tool:● End user: Financial Analysts, Business Analysts, Data Analysts.● Intent: Deliver a data tape with the ability to constrain data to business needs and provide a basic summary.● Outcome: End users can download the data tape and receive a plain English summary of key stats (record count, distinct key, constraints in the query contained in the WHERE clause).● Value: Provide natural language access to a single, widely used data tape with a clear, plain English explanation of the dataset.Streamlit Analytics Improves User Adoption and Success with Snowflake With a better understanding of Streamlit as a driver for the adoption of Snowflake and the increasing adoption of data assets, let's dig deeper into Streamlit as the conduit for adoption. While Snowflake may be a known entity within your enterprise, few business-facing professionals will ever know they are interfacing with Snowflake, and that is okay. Without more technology tools and platforms, Streamlit opens the doors to Snowflake but most importantly eliminates other tools, platforms, and an additional layer of services to manage. Instead, you can leverage the skills already on hand within most data and analytics teams. Here are some additional features that make Streamlit quite compelling:● Simplicity and Ease of Use: Streamlit provides an intuitive API that allows developers to create interactive UI elements with minimal code. Its straightforward syntax enables both beginners and experienced developers to quickly prototype and deploy applications without a steep learning curve.● Rapid Prototyping: Streamlit excels at rapid prototyping, enabling developers to iterate quickly on their ideas. With its live reloading feature, developers can see changes in real time as they modify the code. This development speed is crucial for experimenting with different UI layouts and functionalities.● Data Exploration and Visualization: Streamlit integrates seamlessly with popular data science libraries . Some of these are Pandas, Matplotlib, and Plotly. This integration allows developers to create dynamic and interactive charts, graphs, and dashboards with minimal effort. Data scientists and analysts can effectively showcase their findings, making it an excellent choice for data exploration and visualization tasks.● Customization and Theming: While Streamlit provides a simple interface, it also offers customization options for developers who want to create visually appealing applications. Developers can customize the appearance of their apps, including layout, colors, and themes, to match their brand or specific design preferences.● Seamless Integration with Machine Learning and AI Models: Streamlit makes integrating machine learning models, natural language processing tools, and other AI technologies into applications easy. Developers can create interactive interfaces for AI-powered applications, enabling users to interact with complex algorithms and models without understanding the underlying complexities.● Sharing and Deployment: Streamlit apps can be easily shared and deployed on various platforms. Whether it's sharing within a team, showcasing a prototype to stakeholders, or deploying a full-fledged application for public use, Streamlit simplifies the process. Streamlit sharing, Streamlit's deployment platform, allows developers to deploy apps with minimal configuration, making them accessible to a broader audience.● Active Community and Documentation: Streamlit has a vibrant and active community of developers. The availability of numerous examples, tutorials, and community-contributed components enhances the development experience. Streamlit's comprehensive documentation provides detailed guidance on various aspects of building interactive applications, making it easier for developers to find solutions to their queries.● Flexibility and Extensibility: While Streamlit is easy for beginners, it also offers flexibility and extensibility for advanced users. Developers can create custom components and integrate JavaScript functionality when needed, allowing them to extend Streamlit's capabilities based on their requirements.ConclusionThe integration of Snowflake and Streamlit offers a powerful combination for building analytics and data delivery apps. A single, blended data warehousing solution with intuitive application development can democratize data access, enabling users across an organization to transform complex datasets into palatable, prepared information assets. Though the Snowflake modern data cloud app store is in its infancy, you can jump in today and seize a great opportunity to build powerful data apps. While this article explained a simple GPT API interface, the recent introduction of GPT Assistants API expands the possibilities for even more intelligent, contextual agents running securely running right where you work. I look forward to expanding on this basic prototype to a more intelligent co-pilot experience soon.Author BioRyan Goodman has dedicated 20 years to the business of data and analytics, working as a practitioner, executive, and entrepreneur. He recently founded DataTools Pro after 4 years at Reliant Funding, where he served as the VP of Analytics and BI. There, he implemented a modern data stack, utilized data sciences, integrated cloud analytics, and established a governance structure. Drawing from his experiences as a customer, Ryan is now collaborating with his team to develop rapid deployment industry solutions. These solutions utilize machine learning, LLMs, and modern data platforms to significantly reduce the time to value for data and analytics teams.

0
0
85058

article-image-implementing-memory-management-with-golang-garbage-collector

Packt Editorial Staff

03 Sep 2019

10 min read

Implementing memory management with Golang's garbage collector

Packt Editorial Staff

03 Sep 2019

10 min read

Did you ever think of how bulk messages are pushed in real-time that fast? How is it possible? Low latency garbage collector (GC) plays an important role in this. In this article, we present ways to look at certain parameters to implement memory management with the Golang GC. Garbage collection is the process of freeing up memory space that is not being used. In other words, the GC sees which objects are out of scope and cannot be referenced anymore and frees the memory space they consume. This process happens in a concurrent way while a Go program is running and not before or after the execution of the program. This article is an excerpt from the book Mastering Go - Third Edition by Mihalis Tsoukalos. Mihalis runs through the nuances of Go, with deep guides to types and structures, packages, concurrency, network programming, compiler design, optimization, and more. Implementing the Golang GC The Go standard library offers functions that allow you to study the operation of the GC and learn more about what the GC does secretly. These functions are illustrated in the gColl.go utility. The source code of gColl.go is presented here in chunks. Package main import ( "fmt" "runtime" "time" ) You need the runtime package because it allows you to obtain information about the Go runtime system, which, among other things, includes the operation of the GC. func printStats(mem runtime.MemStats) { runtime.ReadMemStats(&mem) fmt.Println("mem.Alloc:", mem.Alloc) fmt.Println("mem.TotalAlloc:", mem.TotalAlloc) fmt.Println("mem.HeapAlloc:", mem.HeapAlloc) fmt.Println("mem.NumGC:", mem.NumGC, "\n") } The purpose of the printStats() function is to avoid writing the same Go code all the time. The runtime.ReadMemStats() call gets the latest garbage collection statistics for you. func main() { var mem runtime.MemStats printStats(mem) for i := 0; i < 10; i++ { // Allocating 50,000,000 bytes s := make([]byte, 50000000) if s == nil { fmt.Println("Operation failed!") } } printStats(mem) In this part, we have a for loop that creates 10-byte slices with 50,000,000 bytes each. The reason for this is that by allocating large amounts of memory, we can trigger the GC. for i := 0; i < 10; i++ { // Allocating 100,000,000 bytes s := make([]byte, 100000000) if s == nil { fmt.Println("Operation failed!") } time.Sleep(5 * time.Second) } printStats(mem) } The last part of the program makes even bigger memory allocations – this time, each byte slice has 100,000,000 bytes. Running gColl.go on a macOS Big Sur machine with 24 GB of RAM produces the following kind of output: $ go run gColl.go mem.Alloc: 124616 mem.TotalAlloc: 124616 mem.HeapAlloc: 124616 mem.NumGC: 0 mem.Alloc: 50124368 mem.TotalAlloc: 500175120 mem.HeapAlloc: 50124368 mem.NumGC: 9 mem.Alloc: 122536 mem.TotalAlloc: 1500257968 mem.HeapAlloc: 122536 mem.NumGC: 19 The value of mem.Alloc is the bytes of allocated heap objects — allocated are all the objects that the GC has not yet freed. mem.TotalAlloc shows the cumulative bytes allocated for heap objects—this number does not decrease when objects are freed, which means that it keeps increasing. Therefore, it shows the total number of bytes allocated for heap objects during program execution. mem.HeapAlloc is the same as mem.Alloc. Last, mem.NumGC shows the total number of completed garbage collection cycles. The bigger that value is, the more you have to consider how you allocate memory in your code and if there is a way to optimize that. If you want even more verbose output regarding the operation of the GC, you can combine go run gColl.go with GODEBUG=gctrace=1. Apart from the regular program output, you get some extra metrics—this is illustrated in the following output: $ GODEBUG=gctrace=1 go run gColl.go gc 1 @0.021s 0%: 0.020+0.32+0.015 ms clock, 0.16+0.17/0.33/0.22+0.12 ms cpu, 4->4->0 MB, 5 MB goal, 8 P gc 2 @0.041s 0%: 0.074+0.32+0.003 ms clock, 0.59+0.087/0.37/0.45+0.030 ms cpu, 4->4->0 MB, 5 MB goal, 8 P . . . gc 18 @40.152s 0%: 0.065+0.14+0.013 ms clock, 0.52+0/0.12/0.042+0.10 ms cpu, 95->95->0 MB, 96 MB goal, 8 P gc 19 @45.160s 0%: 0.028+0.12+0.003 ms clock, 0.22+0/0.13/0.081+0.028 ms cpu, 95->95->0 MB, 96 MB goal, 8 P mem.Alloc: 120672 mem.TotalAlloc: 1500256376 mem.HeapAlloc: 120672 mem.NumGC: 19 Now, let us explain the 95->95->0 MB triplet in the previous line of output. The first value (95) is the heap size when the GC is about to run. The second value (95) is the heap size when the GC ends its operation. The last value is the size of the live heap (0). Go garbage collection is based on the tricolor algorithm The operation of the Go GC is based on the tricolor algorithm, which is the subject of this subsection. Note that the tricolor algorithm is not unique to Go and can be used in other programming languages as well. Strictly speaking, the official name for the algorithm used in Go is the tricolor mark-and-sweep algorithm. It can work concurrently with the program and uses a write barrier. This means that when a Go program runs, the Go scheduler is responsible for the scheduling of the application and the GC. This is as if the Go scheduler has to deal with a regular application with multiple goroutines! The core idea behind this algorithm came from Edsger W. Dijkstra, Leslie Lamport, A. J. Martin, C. S. Scholten, and E. F. M. Steffens and was first illustrated in a paper named On-the-Fly Garbage Collection: An Exercise in Cooperation. The primary principle behind the tricolor mark-and-sweep algorithm is that it divides the objects of the heap into three different sets according to their color, which is assigned by the algorithm. It is now time to talk about the meaning of each color set. The objects of the black set are guaranteed to have no pointers to any object of the white set. However, an object of the white set can have a pointer to an object of the black set because this has no effect on the operation of the GC. The objects of the gray set might have pointers to some objects of the white set. Finally, the objects of the white set are the candidates for garbage collection. So, when the garbage collection begins, all objects are white, and the GC visits all the root objects and colors them gray. The roots are the objects that can be directly accessed by the application, which includes global variables and other things on the stack. These objects mostly depend on the Go code of a program. After that, the GC picks a gray object, makes it black, and starts looking at whether that object has pointers to other objects of the white set or not. Therefore, when an object of the gray set is scanned for pointers to other objects, it is colored black. If that scan discovers that this particular object has one or more pointers to a white object, it puts that white object in the gray set. This process keeps going for as long as objects exist in the gray set. After that, the objects in the white set are unreachable and their memory space can be reused. Therefore, at this point, the elements of the white set are said to be garbage collected. Please note that no object can go directly from the black set to the white set, which allows the algorithm to operate and be able to clear the objects on the white set. As mentioned before, no object of the black set can directly point to an object of the white set. Additionally, if an object of the gray set becomes unreachable at some point in a garbage collection cycle, it will not be collected at this garbage collection cycle but in the next one! Although this is not an optimal situation, it is not that bad. During this process, the running application is called the mutator. The mutator runs a small function named write barrier that is executed each time a pointer in the heap is modified. If the pointer of an object in the heap is modified, which means that this object is now reachable, the write barrier colors it gray and puts it in the gray set. The mutator is responsible for the invariant that no element of the black set has a pointer to an element of the white set. This is accomplished with the help of the write barrier function. Failing to accomplish this invariant will ruin the garbage collection process and will most likely crash your program in a pretty bad and undesirable way! So, there are three different colors: black, white, and gray. When the algorithm begins, all objects are colored white. As the algorithm keeps going, white objects are moved into one of the other two sets. The objects that are left in the white set are the ones that are going to be cleared at some point. The next figure displays the three color sets with objects in them. Figure 1: The Go GC represents the heap of a program as a graph In the presented graph, you can see that while object E, which is in the white set, can access object F, it cannot be accessed by any other object because no other object points to object E, which makes it a perfect candidate for garbage collection! Additionally, objects A, B, and C are root objects and are always reachable; therefore, they cannot be garbage collected. Graph comprehended Can you guess what will happen next in that graph? Well, it is not that difficult to realize that the algorithm will have to process the remaining elements of the gray set, which means that both objects A and F will go to the black set. Object A will go to the black set because it is a root element and F will go to the black set because it does not point to any other object while it is in the gray set. After object A is garbage collected, object F will become unreachable and will be garbage collected in the next cycle of the GC because an unreachable object cannot magically become reachable in the next iteration of the garbage collection cycle. Note: The Go garbage collection can also be applied to variables such as channels. When the GC finds out that a channel is unreachable, that is when the channel variable cannot be accessed anymore, it will free its resources even if the channel has not been closed. Go allows you to manually initiate a garbage collection by putting a runtime.GC() statement in your Go code. However, have in mind that runtime.GC() will block the caller and it might block the entire program, especially if you are running a very busy Go program with many objects. This mainly happens because you cannot perform garbage collections while everything else is rapidly changing, as this will not give the GC the opportunity to clearly identify the members of the white, black, and gray sets. This garbage collection status is also called garbage collection safe-point. You can find the long and relatively advanced Go code of the GC at https://github.com/golang/go/blob/master/src/runtime/mgc.go, which you can study if you want to learn even more information about the garbage collection operation. You can even make changes to that code if you are brave enough! Understanding Go Internals: defer, panic() and recover() functions [Tutorial] Implementing hashing algorithms in Golang [Tutorial] Is Golang truly community driven and does it really matter?

0
0
84250

article-image-using-lambda-expressions-in-java-11-tutorial

Prasad Ramesh

22 Feb 2019

9 min read

Using lambda expressions in Java 11 [Tutorial]

Prasad Ramesh

22 Feb 2019

9 min read

0
0
83283

How-To Tutorials

article-image-implementing-k-nearest-neighbors-algorithm-python

Aaron Lazar

17 Nov 2017

7 min read

Implementing the k-nearest neighbors algorithm in Python

Aaron Lazar

17 Nov 2017

7 min read

0
0
82509

4 ways to implement feature selection in Python for machine learning

Divide and Conquer – Classification Using Decision Trees and Rules

Creating 2D and 3D plots using Matplotlib

Cross-Validation strategies for Time Series forecasting [Tutorial]

25 Datasets for Deep Learning in IoT

How to Customize lines and markers in Matplotlib 2.0

Mastering the API Life Cycle: A Comprehensive Guide to Design, Implementation, Release, and Maintenance

How does Elasticsearch work? [Tutorial]

Which Python framework is best for building RESTful APIs? Django or Flask?

Image filtering techniques in OpenCV

Trending Topics

Implementing 3 Naive Bayes classifiers in scikit-learn

Building an LLM-powered App using Snowflake and Streamlit

Implementing memory management with Golang's garbage collector

Using lambda expressions in Java 11 [Tutorial]

Implementing the k-nearest neighbors algorithm in Python

Create a Free Account To Continue Reading

Sign in to activate your 7-day free access