
How-To Tutorials - Data

1210 Articles

Selecting Statistical-based Features in Machine Learning application

Pravin Dhandre
14 Mar 2018
16 min read
In today’s tutorial, we will work on one of the methods of executing feature selection: the statistical-based method for interpreting both quantitative and qualitative datasets. Feature selection attempts to reduce the size of the original dataset by subsetting the original features and shortlisting the ones with the highest predictive power. We may try to intelligently choose which feature selection method should work best for us, but in practice a very valid way of working in this domain is to work through examples of each method and measure the performance of the resulting pipeline. To begin, let's take a look at the subclass of feature selection modules that rely on statistical tests to select viable features from a dataset.

Statistical-based feature selections

Statistics provides us with relatively quick and easy methods of interpreting both quantitative and qualitative data. We have used some statistical measures in previous chapters to obtain new knowledge and perspective around our data, specifically in that we recognized mean and standard deviation as metrics that enabled us to calculate z-scores and scale our data. In this tutorial, we will rely on two new concepts to help us with our feature selection:

- Pearson correlations
- hypothesis testing

Both of these methods are known as univariate methods of feature selection, meaning that they are quick and handy when the problem is to select single features at a time in order to create a better dataset for our machine learning pipeline.

Using Pearson correlation to select features

We have actually looked at correlations in this book already, but not in the context of feature selection. We already know that we can invoke a correlation calculation in pandas by calling the following method:

    credit_card_default.corr()

The preceding code produces a correlation matrix for the entire dataset. The Pearson correlation coefficient (which is the default for pandas) measures the linear relationship between columns. The value of the coefficient varies between -1 and +1, where 0 implies no correlation between them. Correlations closer to -1 or +1 imply an extremely strong linear relationship.

It is worth noting that Pearson’s correlation generally requires that each column be normally distributed (which we are not assuming). We can also largely ignore this requirement because our dataset is large (over 500 rows is the threshold).

The pandas .corr() method calculates a Pearson correlation coefficient for every column versus every other column. This 24 column by 24 row matrix is very unruly, and in the past, we used heatmaps to try and make the information more digestible:

    # using seaborn to generate heatmaps
    import seaborn as sns
    import matplotlib.style as style

    # Use a clean stylization for our charts and graphs
    style.use('fivethirtyeight')

    sns.heatmap(credit_card_default.corr())

Note that the heatmap function automatically chose the most correlated features to show us. That being said, we are, for the moment, concerned with the features' correlations to the response variable. We will assume that the more correlated a feature is to the response, the more useful it will be. Any feature that is not as strongly correlated will not be as useful to us.

Correlation coefficients are also used to determine feature interactions and redundancies. A key method of reducing overfitting in machine learning is spotting and removing these redundancies.
We will be tackling this problem in our model-based selection methods. Let's isolate the correlations between the features and the response variable, using the following code:

    # just correlations between every feature and the response
    credit_card_default.corr()['default payment next month']

    LIMIT_BAL                    -0.153520
    SEX                          -0.039961
    EDUCATION                     0.028006
    MARRIAGE                     -0.024339
    AGE                           0.013890
    PAY_0                         0.324794
    PAY_2                         0.263551
    PAY_3                         0.235253
    PAY_4                         0.216614
    PAY_5                         0.204149
    PAY_6                         0.186866
    BILL_AMT1                    -0.019644
    BILL_AMT2                    -0.014193
    BILL_AMT3                    -0.014076
    BILL_AMT4                    -0.010156
    BILL_AMT5                    -0.006760
    BILL_AMT6                    -0.005372
    PAY_AMT1                     -0.072929
    PAY_AMT2                     -0.058579
    PAY_AMT3                     -0.056250
    PAY_AMT4                     -0.056827
    PAY_AMT5                     -0.055124
    PAY_AMT6                     -0.053183
    default payment next month    1.000000

We can ignore the final row, as it is the response variable correlated perfectly to itself. We are looking for features that have correlation coefficient values close to -1 or +1. These are the features that we might assume are going to be useful. Let's use pandas filtering to isolate features that have at least .2 correlation (positive or negative).

Let's do this by first defining a pandas mask, which will act as our filter, using the following code:

    # filter only correlations stronger than .2 in either direction (positive or negative)
    credit_card_default.corr()['default payment next month'].abs() > .2

    LIMIT_BAL                     False
    SEX                           False
    EDUCATION                     False
    MARRIAGE                      False
    AGE                           False
    PAY_0                          True
    PAY_2                          True
    PAY_3                          True
    PAY_4                          True
    PAY_5                          True
    PAY_6                         False
    BILL_AMT1                     False
    BILL_AMT2                     False
    BILL_AMT3                     False
    BILL_AMT4                     False
    BILL_AMT5                     False
    BILL_AMT6                     False
    PAY_AMT1                      False
    PAY_AMT2                      False
    PAY_AMT3                      False
    PAY_AMT4                      False
    PAY_AMT5                      False
    PAY_AMT6                      False
    default payment next month     True

Every False in the preceding pandas Series represents a feature with a correlation value between -.2 and .2 inclusive, while every True corresponds to a feature with a correlation value greater than .2 or less than -.2. Let's plug this mask into our pandas filtering, using the following code:

    # store the features
    highly_correlated_features = credit_card_default.columns[credit_card_default.corr()['default payment next month'].abs() > .2]
    highly_correlated_features

    Index([u'PAY_0', u'PAY_2', u'PAY_3', u'PAY_4', u'PAY_5',
           u'default payment next month'],
          dtype='object')

The variable highly_correlated_features is supposed to hold the features of the dataframe that are highly correlated to the response; however, we do have to get rid of the name of the response column, as including that in our machine learning pipeline would be cheating:

    # drop the response variable
    highly_correlated_features = highly_correlated_features.drop('default payment next month')
    highly_correlated_features

    Index([u'PAY_0', u'PAY_2', u'PAY_3', u'PAY_4', u'PAY_5'], dtype='object')

So, now we have five features from our original dataset that are meant to be predictive of the response variable, so let's try it out with the help of the following code:

    # only include the five highly correlated features
    X_subsetted = X[highly_correlated_features]

    get_best_model_and_accuracy(d_tree, tree_params, X_subsetted, y)

    # barely worse, but about 20x faster to fit the model
    Best Accuracy: 0.819666666667
    Best Parameters: {'max_depth': 3}
    Average Time to Fit (s): 0.01
    Average Time to Score (s): 0.002

Our accuracy is definitely worse than the accuracy to beat, .8203, but also note that the fitting time improved by a factor of about 20. Our model is able to learn almost as well as with the entire dataset with only five features, and it learns that much in a far shorter timeframe.
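The helper get_best_model_and_accuracy used throughout this excerpt is defined earlier in the book and is not shown here. As a rough sketch of what such a helper might look like, assuming it is a thin wrapper around scikit-learn's GridSearchCV that prints the four metrics seen above (the exact cross-validation settings in the book may differ):

    from sklearn.model_selection import GridSearchCV

    def get_best_model_and_accuracy(model, params, X, y):
        # exhaustive grid search with cross-validation over the supplied parameters
        grid = GridSearchCV(model, params, error_score=0.0)
        grid.fit(X, y)
        print("Best Accuracy: {}".format(grid.best_score_))
        print("Best Parameters: {}".format(grid.best_params_))
        # average fit/score time of the best parameter combination across folds
        best = grid.best_index_
        print("Average Time to Fit (s): {}".format(round(grid.cv_results_['mean_fit_time'][best], 3)))
        print("Average Time to Score (s): {}".format(round(grid.cv_results_['mean_score_time'][best], 3)))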
Let's bring back our scikit-learn pipelines and include our correlation choosing methodology as a part of our preprocessing phase. To do this, we will have to create a custom transformer that invokes the logic we just went through, as a pipeline-ready class.

We will call our class the CustomCorrelationChooser and it will have to implement both a fit and a transform logic, which are:

- The fit logic will select columns from the features matrix that are higher than a specified threshold
- The transform logic will subset any future datasets to only include those columns that were deemed important

    from sklearn.base import TransformerMixin, BaseEstimator

    class CustomCorrelationChooser(TransformerMixin, BaseEstimator):
        def __init__(self, response, cols_to_keep=[], threshold=None):
            # store the response series
            self.response = response
            # store the threshold that we wish to keep
            self.threshold = threshold
            # initialize a variable that will eventually
            # hold the names of the features that we wish to keep
            self.cols_to_keep = cols_to_keep

        def transform(self, X):
            # the transform method simply selects the appropriate
            # columns from the original dataset
            return X[self.cols_to_keep]

        def fit(self, X, *_):
            # create a new dataframe that holds both features and response
            df = pd.concat([X, self.response], axis=1)
            # store names of columns that meet the correlation threshold
            self.cols_to_keep = df.columns[df.corr()[df.columns[-1]].abs() > self.threshold]
            # only keep columns in X; for example, this will remove the response variable
            self.cols_to_keep = [c for c in self.cols_to_keep if c in X.columns]
            return self

Let's take our new correlation feature selector for a spin, with the help of the following code:

    # instantiate our new feature selector
    ccc = CustomCorrelationChooser(threshold=.2, response=y)
    ccc.fit(X)

    ccc.cols_to_keep
    ['PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5']

Our class has selected the same five columns as we found earlier. Let's test out the transform functionality by calling it on our X matrix, using the following code:

    ccc.transform(X).head()

The preceding code produces a table containing only those five columns as the output. We see that the transform method has eliminated the other columns and kept only the features that met our .2 correlation threshold. Now, let's put it all together in our pipeline, with the help of the following code:

    # instantiate our feature selector with the response variable set
    ccc = CustomCorrelationChooser(response=y)

    # make our new pipeline, including the selector
    ccc_pipe = Pipeline([('correlation_select', ccc),
                         ('classifier', d_tree)])

    # make a copy of the decision tree pipeline parameters
    ccc_pipe_params = deepcopy(tree_pipe_params)

    # update that dictionary with feature selector specific parameters
    ccc_pipe_params.update({'correlation_select__threshold': [0, .1, .2, .3]})

    print(ccc_pipe_params)
    # {'correlation_select__threshold': [0, 0.1, 0.2, 0.3], 'classifier__max_depth': [None, 1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21]}

    # better than original (by a little), and a bit faster on average overall
    get_best_model_and_accuracy(ccc_pipe, ccc_pipe_params, X, y)

    Best Accuracy: 0.8206
    Best Parameters: {'correlation_select__threshold': 0.1, 'classifier__max_depth': 5}
    Average Time to Fit (s): 0.105
    Average Time to Score (s): 0.003

Wow! Our first attempt at feature selection and we have already beaten our goal (albeit by a little bit).
Our pipeline is showing us that if we threshold at 0.1, we have eliminated enough noise to improve accuracy and also cut down on the fitting time (from .158 seconds without the selector). Let's take a look at which columns our selector decided to keep:

    # check the threshold of .1
    ccc = CustomCorrelationChooser(threshold=0.1, response=y)
    ccc.fit(X)

    # check which columns were kept
    ccc.cols_to_keep
    ['LIMIT_BAL', 'PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6']

It appears that our selector has decided to keep the five columns that we found, as well as two more, the LIMIT_BAL and PAY_6 columns. Great! This is the beauty of automated pipeline gridsearching in scikit-learn. It allows our models to do what they do best and intuit things that we could not have on our own.

Feature selection using hypothesis testing

Hypothesis testing is a methodology in statistics that allows for a bit more complex statistical testing for individual features. Feature selection via hypothesis testing will attempt to select only the best features from a dataset, just as we were doing with our custom correlation chooser, but these tests rely more on formalized statistical methods and are interpreted through what are known as p-values.

A hypothesis test is a statistical test that is used to figure out whether we can apply a certain condition to an entire population, given a data sample. The result of a hypothesis test tells us whether we should believe the hypothesis or reject it for an alternative one. Based on sample data from a population, a hypothesis test determines whether or not to reject the null hypothesis. We usually use a p-value (a non-negative decimal with an upper bound of 1, which is based on our significance level) to make this conclusion.

In the case of feature selection, the hypothesis we wish to test is along the lines of: True or False: This feature has no relevance to the response variable. We want to test this hypothesis for every feature and decide whether the features hold some significance in the prediction of the response. In a way, this is how we dealt with the correlation logic. We basically said that, if a column's correlation with the response is too weak, then we say that the hypothesis that the feature has no relevance is true. If the correlation coefficient was strong enough, then we can reject the hypothesis that the feature has no relevance in favor of an alternative hypothesis, that the feature does have some relevance.

To begin to use this for our data, we will have to bring in two new modules: SelectKBest and f_classif, using the following code:

    # SelectKBest selects features according to the k highest scores of a given scoring function
    from sklearn.feature_selection import SelectKBest

    # This models a statistical test known as ANOVA
    from sklearn.feature_selection import f_classif

    # f_classif allows for negative values, not all do
    # chi2 is a very common classification criterion but only allows for positive values
    # regression has its own statistical tests

SelectKBest is basically just a wrapper that keeps a set amount of features that are the highest ranked according to some criterion. In this case, we will use the p-values of completed hypothesis tests as a ranking.

Interpreting the p-value

A p-value is a decimal between 0 and 1 that represents the probability that the data given to us occurred by chance under the hypothesis test. Simply put, the lower the p-value, the better the chance that we can reject the null hypothesis.
For our purposes, the smaller the p-value, the better the chances that the feature has some relevance to our response variable and we should keep it. The big takeaway from this is that the f_classif function will perform an ANOVA test (a type of hypothesis test) on each feature on its own (hence the name univariate testing) and assign that feature a p-value. The SelectKBest will rank the features by that p-value (the lower the better) and keep only the best k (a human input) features. Let's try this out in Python.

Ranking the p-value

Let's begin by instantiating a SelectKBest module. We will manually enter a k value, 5, meaning we wish to keep only the five best features according to the resulting p-values:

    # keep only the best five features according to p-values of ANOVA test
    k_best = SelectKBest(f_classif, k=5)

We can then fit and transform our X matrix to select the features we want, as we did before with our custom selector:

    # matrix after selecting the top 5 features
    k_best.fit_transform(X, y)

    # 30,000 rows x 5 columns
    array([[ 2, 2, -1, -1, -2],
           [-1, 2, 0, 0, 0],
           [ 0, 0, 0, 0, 0],
           ...,
           [ 4, 3, 2, -1, 0],
           [ 1, -1, 0, 0, 0],
           [ 0, 0, 0, 0, 0]])

If we want to inspect the p-values directly and see which columns were chosen, we can dive deeper into the k_best variable:

    # get the p values of columns
    k_best.pvalues_

    # make a dataframe of features and p-values
    # sort that dataframe by p-value
    p_values = pd.DataFrame({'column': X.columns, 'p_value': k_best.pvalues_}).sort_values('p_value')

    # show the top 5 features
    p_values.head()

We can see that, once again, our selector is choosing the PAY_X columns as the most important. If we take a look at our p-value column, we will notice that our values are extremely small and close to zero. A common threshold for p-values is 0.05, meaning that anything less than 0.05 may be considered significant, and these columns are extremely significant according to our tests. We can also directly see which columns meet a threshold of 0.05 using the pandas filtering methodology:

    # features with a low p value
    p_values[p_values['p_value'] < .05]

The majority of the columns have a low p-value, but not all. Let's see the columns with a higher p_value, using the following code:

    # features with a high p value
    p_values[p_values['p_value'] >= .05]

These three columns have quite a high p-value.
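To see where these p-values come from, the ANOVA F-test that f_classif runs per column can be reproduced by hand with scipy. The following sketch is my own illustration, not part of the original excerpt, and assumes the X and y objects from the running example are already loaded:

    from scipy.stats import f_oneway
    from sklearn.feature_selection import f_classif

    # f_classif computes one ANOVA F-test per column of X against the classes in y
    f_scores, p_vals = f_classif(X, y)

    # the same test for a single column, done manually:
    # split PAY_0 into one group per class of the response and compare group means
    groups = [X['PAY_0'][y == label] for label in y.unique()]
    f_stat, p_val = f_oneway(*groups)

    # the two p-values should agree
    print(p_val, p_vals[list(X.columns).index('PAY_0')])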
Let's use our SelectKBest in a pipeline to see if we can grid search our way into a better machine learning pipeline, using the following code:

    k_best = SelectKBest(f_classif)

    # Make a new pipeline with SelectKBest
    select_k_pipe = Pipeline([('k_best', k_best),
                              ('classifier', d_tree)])

    select_k_best_pipe_params = deepcopy(tree_pipe_params)

    # the 'all' literally does nothing to subset
    select_k_best_pipe_params.update({'k_best__k': list(range(1, 23)) + ['all']})

    print(select_k_best_pipe_params)
    # {'k_best__k': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 'all'], 'classifier__max_depth': [None, 1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21]}

    # comparable to our results with the correlation chooser
    get_best_model_and_accuracy(select_k_pipe, select_k_best_pipe_params, X, y)

    Best Accuracy: 0.8206
    Best Parameters: {'k_best__k': 7, 'classifier__max_depth': 5}
    Average Time to Fit (s): 0.102
    Average Time to Score (s): 0.002

It seems that our SelectKBest module is getting about the same accuracy as our custom transformer, but it's getting there a bit quicker! Let's see which columns our tests are selecting for us, with the help of the following code:

    k_best = SelectKBest(f_classif, k=7)

    # lowest 7 p values match what our custom correlation chooser chose before
    # ['LIMIT_BAL', 'PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6']
    p_values.head(7)

They appear to be the same columns that were chosen by our other statistical method. It's possible that our statistical method is limited to continually picking these seven columns for us.

There are other tests available besides ANOVA, such as Chi2, as well as tests for regression tasks. They are all included in scikit-learn's documentation. For more info on feature selection through univariate testing, check out the scikit-learn documentation here: http://scikit-learn.org/stable/modules/feature_selection.html#univariate-feature-selection

Before we move on to model-based feature selection, it's helpful to do a quick sanity check to ensure that we are on the right track. So far, we have seen two statistical methods for feature selection that gave us the same seven columns for optimal accuracy. But what if we were to take every column except those seven? We should expect a much lower accuracy and a worse pipeline overall, right? Let's make sure. The following code helps us to implement the sanity check:

    # sanity check
    # If we use only the worst columns
    the_worst_of_X = X[X.columns.drop(['LIMIT_BAL', 'PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6'])]

    # goes to show that selecting the wrong features will
    # hurt us in predictive performance
    get_best_model_and_accuracy(d_tree, tree_params, the_worst_of_X, y)

    Best Accuracy: 0.783966666667
    Best Parameters: {'max_depth': 5}
    Average Time to Fit (s): 0.21
    Average Time to Score (s): 0.002

Hence, by selecting every column except those seven, we see not only worse accuracy (almost as bad as the null accuracy), but also slower fitting times on average. We statistically selected features from the dataset for our machine learning pipeline.

[box type="note" align="" class="" width=""]This article is an excerpt from the book Feature Engineering Made Easy co-authored by Sinan Ozdemir and Divya Susarla. Do check out the book to get access to alternative techniques such as the model-based method to achieve optimum results from your machine learning application.[/box]


Jupyter and Python Scripting

Packt
21 Oct 2016
9 min read
In this article by Dan Toomey, author of the book Learning Jupyter, we will see data access in Jupyter with Python and the effect of pandas on Jupyter. We will also look at Python graphics and, lastly, Python random numbers.

Python data access in Jupyter

I started a new view for pandas using Python Data Access as the name. We will read in a large dataset and compute some standard statistics on the data. We are interested in seeing how we use pandas in Jupyter, how well the script performs, and what information is stored in the metadata (especially if it is a larger dataset).

Our script accesses the iris dataset built into one of the Python packages. All we are looking to do is read in a slightly large number of items and calculate some basic operations on the dataset. We are really interested in seeing how much of the data is cached in the IPYNB file. The Python code is:

    # import the datasets package
    from sklearn import datasets

    # pull in the iris data
    iris_dataset = datasets.load_iris()

    # grab the first two columns of data
    X = iris_dataset.data[:, :2]

    # calculate some basic statistics
    x_count = len(X.flat)
    x_min = X[:, 0].min() - .5
    x_max = X[:, 0].max() + .5
    x_mean = X[:, 0].mean()

    # display our results
    x_count, x_min, x_max, x_mean

I broke these steps into a couple of cells in Jupyter. Now, run the cells (using Cell | Run All); the only difference in the display is the last Out line, where our values are shown. It seemed to take longer to load the library (the first time I ran the script) than to read the data and calculate the statistics.

If we look in the IPYNB file for this notebook, we see that none of the data is cached in it. We simply have code references to the library, our code, and the output from when we last ran the script:

    {
      "cell_type": "code",
      "execution_count": 4,
      "metadata": { "collapsed": false },
      "outputs": [
        {
          "data": {
            "text/plain": [
              "(300, 3.7999999999999998, 8.4000000000000004, 5.8433333333333337)"
            ]
          },
          "execution_count": 4,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "# calculate some basic statistics\n",
        "x_count = len(X.flat)\n",
        "x_min = X[:, 0].min() - .5\n",
        "x_max = X[:, 0].max() + .5\n",
        "x_mean = X[:, 0].mean()\n",
        "\n",
        "# display our results\n",
        "x_count, x_min, x_max, x_mean"
      ]
    }

Python pandas in Jupyter

One of the most widely used features of Python is pandas, a freely available library of data analysis tools. In this example, we will develop a Python script that uses pandas to see if there is any effect to using it in Jupyter. I am using the Titanic dataset from http://www.kaggle.com/c/titanic-gettingStarted/download/train.csv. I am sure the same data is available from a variety of sources.

Here is the Python script that we want to run in Jupyter:

    from pandas import *

    training_set = read_csv('train.csv')
    training_set.head()

    male = training_set[training_set.sex == 'male']
    female = training_set[training_set.sex == 'female']

    womens_survival_rate = float(sum(female.survived))/len(female)
    mens_survival_rate = float(sum(male.survived))/len(male)

The result is that we calculate the survival rates of the passengers based on sex. We create a new notebook, enter the script into appropriate cells, add displays of calculated data at each point, and produce our results. Here is our notebook laid out, with displays of calculated data added at each cell. When I ran this script, I had two problems:

- On Windows, it is common to use the backslash ("\") to separate parts of a filename. However, this coding uses the backslash as a special character, so I had to switch to forward slashes ("/") in my CSV file path. I originally had a full path to the CSV in the above code example.
- The dataset column names are taken directly from the file and are case sensitive. In this case, I was originally using the 'sex' field in my script, but in the CSV file the column is named Sex. Similarly, I had to change survived to Survived.

The final script and result looks like the following when we run it.
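The corrected script appears only as a screenshot in the original article; a reconstruction of what it plausibly looks like, assuming the standard Kaggle train.csv column names (Sex and Survived, capitalized), is:

    from pandas import read_csv

    # a bare relative path (or forward slashes) avoids the Windows backslash issue
    training_set = read_csv('train.csv')
    training_set.head()

    # the column names in the CSV are capitalized: Sex, Survived
    male = training_set[training_set.Sex == 'male']
    female = training_set[training_set.Sex == 'female']

    womens_survival_rate = float(sum(female.Survived)) / len(female)
    mens_survival_rate = float(sum(male.Survived)) / len(male)

    womens_survival_rate, mens_survival_rate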
I have used the head() function to display the first few lines of the dataset. It is interesting how much detail is available for all of the passengers. If you scroll down to the results, we see that roughly 74% of the women survived versus just 19% of the men. I would like to think chivalry is not dead! Curiously, the results do not total to 100%. However, like every other dataset I have seen, there is missing and/or inaccurate data present.

Python graphics in Jupyter

How do Python graphics work in Jupyter? I started another view for this named Python Graphics so as to distinguish the work. If we were to build a sample dataset of baby names and the number of births in a year of that name, we could then plot the data. The Python coding is simple:

    import pandas
    import matplotlib
    %matplotlib inline

    baby_name = ['Alice', 'Charles', 'Diane', 'Edward']
    number_births = [96, 155, 66, 272]

    dataset = list(zip(baby_name, number_births))

    df = pandas.DataFrame(data=dataset, columns=['Name', 'Number'])

    df['Number'].plot()

The steps of the script are as follows:

- We import the graphics library (and data library) that we need
- Define our data
- Convert the data into a format that allows for easy graphical display
- Plot the data

We would expect a resultant graph of the number of births by baby name. Taking the above script and placing it into cells of our Jupyter notebook, I have broken the script into different cells for easier readability. Having different cells also allows you to develop the script easily step by step, where you can display the values computed so far to validate your results. I have done this in most of the cells by displaying the dataset and DataFrame at the bottom of those cells. When we run this script (Cell | Run All), we see the results at each step displayed as the script progresses, and finally our plot of the births.

I was curious what metadata was stored for this script. Looking into the IPYNB file, you can see the expected value for the formula cells.
The tabular data display of the DataFrame is stored as HTML, which is convenient:

    {
      "cell_type": "code",
      "execution_count": 43,
      "metadata": { "collapsed": false },
      "outputs": [
        {
          "data": {
            "text/html": [
              "<div>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "<thead>\n",
              "<tr style=\"text-align: right;\">\n",
              "<th></th>\n",
              "<th>Name</th>\n",
              "<th>Number</th>\n",
              "</tr>\n",
              "</thead>\n",
              "<tbody>\n",
              "<tr>\n",
              "<th>0</th>\n",
              "<td>Alice</td>\n",
              "<td>96</td>\n",
              "</tr>\n",
              "<tr>\n",
              "<th>1</th>\n",
              "<td>Charles</td>\n",
              "<td>155</td>\n",
              "</tr>\n",
              "<tr>\n",
              "<th>2</th>\n",
              "<td>Diane</td>\n",
              "<td>66</td>\n",
              "</tr>\n",
              "<tr>\n",
              "<th>3</th>\n",
              "<td>Edward</td>\n",
              "<td>272</td>\n",
              "</tr>\n",
              "</tbody>\n",
              "</table>\n",
              "</div>"
            ],
            "text/plain": [
              "      Name  Number\n",
              "0    Alice      96\n",
              "1  Charles     155\n",
              "2    Diane      66\n",
              "3   Edward     272"
            ]
          },
          "execution_count": 43,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],

The graphic output cell is stored like this:

    {
      "cell_type": "code",
      "execution_count": 27,
      "metadata": { "collapsed": false },
      "outputs": [
        {
          "data": {
            "text/plain": [
              "<matplotlib.axes._subplots.AxesSubplot at 0x47cf8f0>"
            ]
          },
          "execution_count": 27,
          "metadata": {},
          "output_type": "execute_result"
        },
        {
          "data": {
            "image/png": "<a few hundred lines of hexcodes> .../wc/B0RRYEH0EQAAAABJRU5ErkJggg==\n",
            "text/plain": [
              "<matplotlib.figure.Figure at 0x47d8e30>"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        }
      ],
      "source": [
        "# plot the data\n",
        "df['Number'].plot()\n"
      ]
    }

The image/png tag contains a large encoded string representation of the graphical image displayed on screen (abbreviated in the listing above). So, the actual generated image is stored in the metadata for the page.

Python random numbers in Jupyter

For many analyses we are interested in calculating repeatable results. However, much of the analysis relies on random numbers being used. In Python, you can set the seed for the random number generator to achieve repeatable results with the random.seed() function.

In this example, we simulate rolling a pair of dice and looking at the outcome. We would expect the average total of the two dice to be 7, twice the 3.5 midpoint of a single die's faces. The script we are using is this:

    import pylab
    import random

    random.seed(113)
    samples = 1000
    dice = []
    for i in range(samples):
        total = random.randint(1, 6) + random.randint(1, 6)
        dice.append(total)

    pylab.hist(dice, bins=pylab.arange(1.5, 12.6, 1.0))
    pylab.show()

Once we have the script in Jupyter and execute it, we get a histogram of the totals. I had added some more statistics, and I am not sure I would have counted on such a high standard deviation; if we increased the number of samples, this would decrease. The resulting graph was opened in a new window, much as it would if you ran this script in another Python development environment. The toolbar at the top of the graphic is extensive, allowing you to manipulate the graphic in many ways.

Summary

In this article, we walked through simple data access in Jupyter through Python. Then we saw an example of using pandas. We looked at a graphics example. Finally, we looked at an example using random numbers in a Python script.

Resources for Article:

Further resources on this subject:

- Python Data Science Up and Running [article]
- Mining Twitter with Python – Influence and Engagement [article]
- Unsupervised Learning [article]


Ensemble Methods to Optimize Machine Learning Models

Guest Contributor
07 Nov 2017
8 min read
[box type="info" align="" class="" width=""]We are happy to bring you an elegant guest post on ensemble methods by Benjamin Rogojan, popularly known as The Seattle Data Guy.[/box]

How do data scientists improve their algorithm's accuracy or the robustness of a model? A method that is tried and tested is ensemble learning. It is a must-know topic if you claim to be a data scientist and/or a machine learning engineer, especially if you are planning to go in for a data science or machine learning interview.

Essentially, ensemble learning stays true to the meaning of the word 'ensemble'. Rather than having several people singing at different octaves to create one beautiful harmony (each voice filling in the void of the other), ensemble learning uses hundreds to thousands of models of the same algorithm that work together to find the correct classification.

Another way to think about ensemble learning is the fable of the blind men and the elephant. Each blind man in the story seeks to identify the elephant in front of them. However, they work separately and come up with their own conclusions about the animal. Had they worked in unison, they might have been able to eventually figure out what they were looking at. Similarly, ensemble learning utilizes the workings of different algorithms and combines them for a successful and optimal classification. Ensemble methods such as Boosting and Bagging have led to an increased robustness of statistical models with decreased variance.

Before we begin explaining the various ensemble methods, let us have a glance at the common bond between them: bootstrapping.

Bootstrap: The common glue

An explanation of bootstrapping is occasionally missed by many data scientists. However, an understanding of bootstrapping is essential, as both ensemble methods, Boosting and Bagging, are based on the concept.

Figure 1: Bootstrapping

In machine learning terms, the bootstrap method refers to random sampling with replacement. This sample, after replacement, is referred to as a resample. This allows the model or algorithm to get a better understanding of the various biases, variances, and features that exist in the resample. Taking a sample of the data allows the resample to contain different characteristics than the sample as a whole might have contained. This would, in turn, affect the overall mean, standard deviation, and other descriptive metrics of a data set, ultimately leading to the development of more robust models. The diagram above depicts each sample population having different and non-identical pieces.

Bootstrapping is also great for small data sets that may have a tendency to overfit. In fact, we recommended this to one company that was concerned because their data sets were far from "Big Data". Bootstrapping can be a solution in this case because algorithms that utilize bootstrapping are more robust and can handle new datasets depending on the methodology chosen (boosting or bagging).

The bootstrap method can also test the stability of a solution. By using multiple sample data sets and then testing multiple models, it can increase robustness. In certain cases, one sample data set may have a larger mean than another, or a different standard deviation. This might break a model that was overfitted and not tested using data sets with different variations.

One of the many reasons bootstrapping has become so common is the increase in computing power. This allows many permutations to be done with different resamples.
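As a concrete illustration (my own, not from the original post), bootstrap resampling is simply drawing with replacement from the observed sample; a minimal numpy sketch:

    import numpy as np

    rng = np.random.default_rng(42)
    sample = np.array([2.1, 3.5, 4.0, 4.4, 5.2, 6.3, 7.1])

    # draw three bootstrap resamples, each the same size as the original sample
    for i in range(3):
        resample = rng.choice(sample, size=len(sample), replace=True)
        # each resample has its own mean and spread, which is the variation
        # that bagging and boosting exploit to build more robust models
        print(i, resample.mean().round(2), resample.std().round(2))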
Let us now move on to the two most prominent ensemble methods: Bagging and Boosting.

Ensemble Method 1: Bagging

Bagging actually refers to Bootstrap Aggregating. Most papers or posts that explain bagging algorithms are bound to refer to Leo Breiman's work, a paper published in 1996 called "Bagging Predictors". In the paper, Leo describes bagging as: "Bagging predictors is a method for generating multiple versions of a predictor and using these to get an aggregated predictor."

Bagging helps reduce variance from models that are accurate only on the data they were trained on. This problem is also known as overfitting. Overfitting happens when a function fits the data too well. Typically this is because the actual equation is highly complicated to take into account each data point and outlier.

Figure 2: Overfitting

Another example of an algorithm that can overfit easily is a decision tree. The models that are developed using decision trees require very simple heuristics. Decision trees are composed of a set of if-else statements done in a specific order. Thus, if the data set is changed to a new data set that might have some bias or difference in the spread of underlying features compared to the previous set, the model will fail to be as accurate as before. This is because the data will not fit the model well.

Bagging gets around the overfitting problem by creating its own variance amongst the data. This is done by sampling and replacing data while it tests multiple hypotheses (models). In turn, this reduces the noise by utilizing multiple samples that would most likely be made up of data with various attributes (median, average, and so on). Once each model has developed a hypothesis, the models use voting for classification or averaging for regression. This is where the "Aggregating" of "Bootstrap Aggregating" comes into play. As in the figure shown below, each hypothesis has the same weight as all the others. (When we later discuss boosting, this is one of the places the two methodologies differ.)

Figure 3: Bagging

Essentially, all these models run at the same time and vote on the hypothesis which is the most accurate. This helps to decrease variance, that is, reduce the overfit.

Ensemble Method 2: Boosting

Boosting refers to a group of algorithms that utilize weighted averages to make weak learners into stronger learners. Unlike bagging (which has each model run independently and then aggregates the outputs at the end without preference to any model), boosting is all about "teamwork". Each model that runs dictates what features the next model will focus on.

Boosting also requires bootstrapping. However, there is another difference here. Unlike bagging, boosting weights each sample of data. This means some samples will be run more often than others. Why put weights on the samples of data?

Figure 4: Boosting

When boosting runs each model, it tracks which data samples are the most successful and which are not. The data sets with the most misclassified outputs are given heavier weights. This is because such data sets are considered to have more complexity, so more iterations are required to properly train the model. During the actual classification stage, boosting also tracks each model's error rate to ensure that better models are given better weights. That way, when the "voting" occurs, as in bagging, the models with better outcomes have a stronger pull on the final output.
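Both ideas are available off the shelf in scikit-learn. The following sketch is my own illustration on synthetic data, not part of the original post, and uses default settings you would normally tune:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

    # bagging: many trees trained independently on bootstrap resamples, then voted
    bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100, random_state=0)

    # boosting: weak learners trained sequentially, re-weighting misclassified samples
    boosting = AdaBoostClassifier(n_estimators=100, random_state=0)

    print("bagging :", cross_val_score(bagging, X, y).mean())
    print("boosting:", cross_val_score(boosting, X, y).mean())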
Which of these ensemble methods is right for me?

Ensemble methods generally out-perform a single model. This is why many Kaggle winners have utilized ensemble methodologies. Another important ensemble methodology, not discussed here, is stacking.

Boosting and bagging are both great techniques to decrease variance. However, they won't fix every problem, and they themselves have their own issues. There are different reasons why you would use one over the other. Bagging is great for decreasing variance when a model is overfitted. However, boosting is likely to be the better pick of the two methods, as it is also great for decreasing bias in an underfit model. On the other hand, boosting is more likely to suffer from performance issues.

This is where experience and subject matter expertise come in! It may seem easy to jump on the first model that works. However, it is important to analyze the algorithm and all the features it selects. For instance, a decision tree that sets specific leafs shouldn't be implemented if it can't be supported with other data points and visuals. It is not just about trying AdaBoost or random forests on various datasets. The final algorithm is driven by the results an algorithm is getting and the support provided.

[author title="About the Author"]Benjamin Rogojan

Ben has spent his career focused on healthcare data. He has focused on developing algorithms to detect fraud, reduce patient readmission and redesign insurance provider policy to help reduce the overall cost of healthcare. He has also helped develop analytics for marketing and IT operations in order to optimize limited resources such as employees and budget. Ben privately consults on data science and engineering problems, both solo as well as with a company called Acheron Analytics. He has experience both working hands-on with technical problems as well as helping leadership teams develop strategies to maximize their data.[/author]


#AskTensorFlow: Twitterati ask questions on TensorFlow 2.0 - TF prebuilt binaries, Tensorboard, Keras, and Python support

Sugandha Lahoti
10 Dec 2019
5 min read
TensorFlow 2.0 was released recently with tighter integration with Keras, eager execution enabled by default, three times faster training performance, a cleaned-up API, and more.

TensorFlow 2.0 had a major API cleanup: many API symbols were removed or renamed for better consistency and clarity. It now enables eager execution by default, which effectively means that your TensorFlow code runs like numpy code. Keras has been introduced as the main high-level API to enable developers to easily leverage Keras' various model-building APIs. TensorFlow 2.0 also has the SavedModel API that allows you to save your trained machine learning model into a language-neutral format.

In May, Paige Bailey, Product Manager (TensorFlow), and Laurence Moroney, Developer Advocate at Google, sat down to discuss frequently asked questions on TensorFlow 2.0. They talked about TensorFlow prebuilt binaries, the TF 2.0 upgrade script, TensorFlow Datasets, and Python support.

Can I ask about any prebuilt binary for the RTX 2080 GPU on Ubuntu 16?

Prebuilt binaries for TensorFlow tend to be associated with a specific driver from Nvidia. If you're taking a look at any of the prebuilt binaries, take a look at what driver, or what version of the driver, is supported on that specific card. It's easy for you to go to the driver vendor and download the latest version, but that may not be the one that TensorFlow is built for or the one that it supports. So, just make sure that they actually match each other.

Do my TensorFlow scripts work with TensorFlow 2.0?

Generally, existing TensorFlow scripts do not work unchanged with TensorFlow 2.0. But TensorFlow 2.0 provides an upgrade utility that is automatically downloaded with TensorFlow 2.0. For more information, you can check out the Medium blog post that Paige and her colleague Anna created. It shows how you can run the upgrade script on an individual Python file or even Jupyter Notebooks. It will give you an export.txt file that shows you all of the symbol renames, the added keywords, and then some manual changes.

When will TensorFlow be supported in Python 3.7 and hence be accessible in Anaconda 3?

TensorFlow has made the commitment that as of January 1, 2020, it will no longer support Python 2. The team is firmly committed to Python 3 and Python 3 support.

Is it possible to run TensorBoard on Colab?

You can run TensorBoard on Colab and do different operations like smoothing, changing some of the values, and using the embedding visualizer directly from your Colab notebook in order to understand accuracies and debug model performance. You also don't have to specify ports, which means you need not remember to have multiple TensorBoard instances running; TensorBoard automatically selects one that would be a good candidate.

How would you use [TensorFlow's] feature_columns with Keras?

TensorFlow's feature_columns API is quite useful for non-numerical feature processing. Feature columns are a way of getting your data efficiently into Estimators, and you can use them in Keras. TensorFlow 2.0 also has a migration guide if you want to migrate your models from using Estimators to a more TensorFlow 2.0 style with Keras.

What are some simple data sets for testing and comparing different training methods for artificial neural networks? Are there any in TensorFlow 2.0?

Although MNIST and Fashion-MNIST are great, TensorFlow 2.0 also has TensorFlow Datasets, which provides a collection of datasets ready to use with TensorFlow. It handles downloading and preparing the data and constructing a tf.data.Dataset. TensorFlow Datasets is compatible with both TensorFlow Eager mode and Graph mode, and you can use these datasets with all of your deep learning and machine learning models with just a few lines of code.
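As a quick illustration (mine, not from the interview), loading one of these datasets typically takes only a couple of lines; the dataset name and pipeline settings below are just an example:

    import tensorflow as tf
    import tensorflow_datasets as tfds

    # download and prepare Fashion-MNIST, getting back a tf.data.Dataset
    ds = tfds.load('fashion_mnist', split='train', as_supervised=True)

    # a standard tf.data input pipeline: shuffle, batch, prefetch
    ds = ds.shuffle(1024).batch(32).prefetch(tf.data.experimental.AUTOTUNE)

    for images, labels in ds.take(1):
        print(images.shape, labels.shape)  # expected: (32, 28, 28, 1) (32,)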
What about all the web developers who are new to AI? How does TensorFlow 2.0 help them get started?

With TensorFlow 2.0, the models that you create using SavedModel can be deployed to TFLite or TensorFlow.js. The Keras layers are also supported in TensorFlow.js, so it's not just for Python developers but also for JS developers or even R developers.

You can watch Paige and Laurence answering more questions in the three-part video series available on YouTube. Some of the other questions asked were:

- Is there any TensorFlow.js transfer learning example for object detection?
- Are you going to publish an updated version of the TensorFlow for Poets tutorial from Pete Warden implementing TF 2.0, TFLite 2.0, and NN-API for faster inference on Android devices equipped with NPU/DSP?
- Will the frozen graph generated from TF 1.x work on TF 2.0?
- Which is the preferred format for saving the model going forward: SavedModel (SM) or hd5?
- What is the purpose of keeping Estimators and Keras as separate APIs?

If you want to quickly start building machine learning projects with TensorFlow 2.0, read our book TensorFlow 2.0 Quick Start Guide by Tony Holdroyd. In this book, you will get acquainted with some new practices introduced in TensorFlow 2.0. You will also learn to train your own models for effective prediction, using the high-level Keras API.

Further reading:

- TensorFlow.js contributor Kai Sasaki on how TensorFlow.js eases web-based machine learning application development
- Introducing Spleeter, a TensorFlow-based Python library that extracts voice and sound from any music track
- TensorFlow 2.0 released with tighter Keras integration, eager execution enabled by default, and more!
- Brad Miro talks TensorFlow 2.0 features and how Google is using it internally


15 things every BI professional should know about Tableau

Fatema Patrawala
17 Dec 2019
8 min read
“The art and practice of visualizing data is becoming ever more important in bridging the human-computer gap to mediate analytical insight in a meaningful way.” ―Edd Dumbill

Tableau is a powerful data visualization and discovery tool. It is an important part of a data analyst's or data scientist's skill set, with many organizations specifying it as a key skill in job adverts. In this article, we'll take a look at a few things in Tableau you need to know to successfully make a mark in your business intelligence career.

While the architecture of traditional BI tools has hardware limitations, Tableau has no such dependencies; it can function independently and requires minimal hardware support. Traditional tools are based on a complex set of technologies, whereas Tableau is based on Associative Search technology, making it intuitive, fast, and dynamic. Tableau supports in-memory, multi-thread, and multi-core computing and other advanced capabilities that traditional BI tools do not offer.

Various Tableau products

- Tableau Desktop is a self-service business analytics and data visualization suite that anyone can use. With Tableau Desktop, you can extract massive data offline from your data warehouse for live, up-to-date data analysis.
- Tableau Online / Tableau Server is an online hosting platform designed for enterprise users. It lets users working in Tableau publish and share dashboards across organizations and teams.
- Tableau Reader is a free desktop application that enables you to open and view visualizations that are built in Tableau Desktop.
- Tableau Public is free Tableau software which you can use to make visualizations, but you will need to save your workbook or worksheets to the Tableau Server for anyone else to view them.

Different data types in Tableau

All fields in a data source have a data type. The data type reflects the kind of information stored in that field, for example integers (410), dates (1/23/2015), and strings ("Wisconsin"). The data type of a field is identified in the Data pane by an icon. The data types are (source: Tableau website):

- Text (string) values
- Date values
- Date & Time values
- Numerical values
- Boolean values (relational only), for example True/False
- Geographic values (used with maps)
- Cluster Group

Measures and Dimensions in Tableau

Measures contain numeric, quantitative values that you can measure. Measures can be aggregated. When you drag a measure into the view, Tableau applies an aggregation to that measure (by default). Dimensions, on the other hand, contain qualitative values (such as names, dates, or geographical data). You can use dimensions to categorize, segment, and reveal the details in your data. Dimensions affect the level of detail in the view.

Ways to connect data in Tableau

We can either connect live to a data set or extract data into Tableau.

- Live: Connecting live to a data set leverages its computational processing and storage. New queries will go to the database and will be reflected as new or updated within the data.
- Extract: The Extract API allows you to programmatically extract and combine any data sources for use in Tableau.

There can be multiple data source connections to different sources in the same workbook. Each connection will show up under the Data tab on the left sidebar. The benefit of a Tableau extract over a live connection is that the extract can be used anywhere without a connection, and you can build your own visualizations without connecting to the database.
You can read a complete section on how to extract data in Tableau in the book Learning Tableau 2019 - Third Edition, written by Joshua Milligan. This book takes you from the foundations of the Tableau 2019 paradigm through to the advanced topics.

Joins and Blends in Tableau

Joining tables and blending data sources are two different ways to link related data together in Tableau. Joins are performed to link tables of data together on a row-by-row basis. Blends are performed to link together multiple data sources at an aggregate level.

Different filters in Tableau and the use cases in which each is more relevant than the others

In Tableau, filters are used to restrict the data coming from the database. Often, you will want to filter data in Tableau in order to perform an analysis on a subset of data, narrow your focus, or drill into detail. Tableau offers multiple ways to filter data. If you want to limit the scope of your analysis to a subset of data, you can filter the data at the source using one of the following techniques:

- Data Source Filters are applied before all other filters and are useful when you want to limit your analysis to a subset of data.
- Extract Filters limit the data that is stored in an extract (.tde or .hyper). Data source filters are often converted into extract filters if they are present when you extract the data.
- Custom SQL Filters can be accomplished using a live connection with custom SQL, which has a Tableau parameter in the WHERE clause.

Dual axis in Tableau

Dual axis is an excellent feature supported by Tableau that helps users view two scales of two measures in the same graph. Many websites, like Indeed.com and others, make use of dual axes to show the comparison between two measures and their growth rate over a specific set of years. Dual axes let you compare multiple measures at once, with two independent axes layered on top of one another.

Key components of a Tableau Dashboard

- Horizontal – Horizontal layout containers allow the designer to group worksheets and dashboard components left to right across your page and edit the height of all elements at once.
- Vertical – Vertical containers allow the user to group worksheets and dashboard components top to bottom down your page and edit the width of all elements at once.
- Text – All textual fields.
- Image Extract – A Tableau workbook is in XML format. In order to extract images, Tableau applies some codes to extract an image which can be stored in XML.
- Web [URL ACTION] – A URL action is a hyperlink that points to a web page, file, or other web-based resource outside of Tableau. You can use URL actions to link to more information about your data that may be hosted outside of your data source. To make the link relevant to your data, you can substitute field values of a selection into the URL as parameters.

If you want to learn how to design dashboards in Tableau, the book Learning Tableau 2019 gives you a step-by-step process for designing dashboards.

Why automate reports in Tableau

Once you have automated reporting, you'll have time to spend on innovative projects. What can be done manually could be performed by automation, delivering the same results in a fraction of the time. Reducing such a time-consuming and repetitive task will make you more productive and more efficient.

What is a story in Tableau? Why would you create a story, and what are stories used for?
A story is a sheet that contains a sequence of worksheets or dashboards that work together to convey information. You can create stories to show how facts are connected, provide context, demonstrate how decisions relate to outcomes, or simply make a compelling case. Each individual sheet in a story is called a story point. The primary objective of creating stories in Tableau is to communicate data to a certain audience with an intended result.

How can you create stories in Tableau?

There is a feature in Tableau named Stories that allows you to tell a story using interactive snapshots of dashboards and views. The snapshots become points in a story. This allows you to construct a guided narrative or even an entire presentation. Read the chapter 'Telling a Data Story with Dashboards' from the book Learning Tableau 2019 to create insightful dashboards in Tableau.

How to embed views into web pages?

You can embed interactive Tableau views and dashboards into web pages, blogs, wiki pages, web applications, and intranet portals. Embedded views update as the underlying data changes, or as their workbooks are updated on Tableau Server. Embedded views follow the same licensing and permission restrictions used on Tableau Server. That is, to see a Tableau view that's embedded in a web page, the person accessing the view must also have an account on Tableau Server. Alternatively, if your organization uses a core-based license on Tableau Server, a Guest account is available. This allows people in your organization to view and interact with Tableau views embedded in web pages without having to sign in to the server. Contact your server or site administrator to find out if the Guest user is enabled for the site you publish to.

What is Tableau Prep? Can we clean messy data with Tableau?

Tableau Prep extends the Tableau platform with robust options for cleaning and structuring data for analysis in Tableau. In the same way that Tableau Desktop provides a hands-on, visual experience for visualizing and analyzing data, Tableau Prep provides a hands-on, visual experience for cleaning and shaping data. If you wish to know more about Tableau Prep, or how to clean messy data to create powerful data visualizations and unlock intelligent business insights, read the book Learning Tableau 2019, written by Joshua N. Milligan.

Further reading:

- 'Tableau Day' highlights: Augmented Analytics, Tableau Prep Builder and Conductor, and more!
- Alteryx vs. Tableau: Choosing the right data analytics tool for your business
- How to do data storytelling well with Tableau [Video]


How to perform Audio-Video-Image Scraping with Python

Amarabha Banerjee
08 Mar 2018
9 min read
[box type="note" align="" class="" width=""]Our article is an excerpt from the book Web Scraping with Python, written by Richard Lawson. This book contains step-by-step tutorials on how to leverage Python programming techniques for ethical web scraping.[/box]

A common practice in scraping is the download, storage, and further processing of media content (non-web pages or data files). This media can include images, audio, and video. To store the content locally (or in a service like S3) and to do it correctly, we need to know the type of media, and it isn't enough to trust the file extension in the URL. Hence, we will learn how to download and correctly represent the media type based on information from the web server.

Another common task is the generation of thumbnails of images, videos, or even a page of a website. We will examine several techniques of how to generate thumbnails and make website page screenshots. Many times these are used on a new website as thumbnail links to the scraped media which is stored locally.

Finally, it is often necessary to be able to transcode media, such as converting non-MP4 videos to MP4, or changing the bit-rate or resolution of a video. Another scenario is to extract only the audio from a video file. We won't look at video transcoding, but we will rip MP3 audio out of an MP4 file using ffmpeg. It's a simple step from there to also transcode video with ffmpeg.

Downloading media content from the web

Downloading media content from the web is a simple process: use Requests or another library and download it just like you would HTML content.

Getting ready

There is a class named URLUtility in the urls.py module in the util folder of the solution. This class handles several of the scenarios in this chapter with downloading and parsing URLs. We will be using this class in this recipe and a few others. Make sure the modules folder is in your Python path. Also, the example for this recipe is in the 04/01_download_image.py file.

How to do it

Here is how we proceed with the recipe. The URLUtility class can download content from a URL. The code in the recipe's file is the following:

    import const
    from util.urls import URLUtility

    util = URLUtility(const.ApodEclipseImage())
    print(len(util.data))

When running this you will see the following output:

    Reading URL: https://apod.nasa.gov/apod/image/1709/BT5643s.jpg
    Read 171014 bytes
    171014

The example reads 171014 bytes of data.

How it works

The URL is defined as a constant, const.ApodEclipseImage(), in the const module:

    def ApodEclipseImage():
        return "https://apod.nasa.gov/apod/image/1709/BT5643s.jpg"

The constructor of the URLUtility class has the following implementation:

    def __init__(self, url, readNow=True):
        """ Construct the object, parse the URL, and download now if specified"""
        self._url = url
        self._response = None
        self._parsed = urlparse(url)
        if readNow:
            self.read()

The constructor stores the URL, parses it, and downloads the file with the read() method. The following is the code of the read() method:

    def read(self):
        self._response = urllib.request.urlopen(self._url)
        self._data = self._response.read()

This function uses urlopen to get a response object, and then reads the stream and stores it as a property of the object. That data can then be retrieved using the data property:

    @property
    def data(self):
        self.ensure_response()
        return self._data

The code then simply reports on the length of that data, with the value of 171014.
There's more This class will be used for other tasks such as determining content types, filename, and extensions for those files. We will examine parsing of URLs for filenames next. Parsing a URL with urllib to get the filename When downloading content from a URL, we often want to save it in a file. Often it is good enough to save the file in a file with a name found in the URL. But the URL consists of a number of fragments, so how can we find the actual filename from the URL, especially where there are often many parameters after the file name? Getting ready We will again be using the URLUtility class for this task. The code file for the recipe is 04/02_parse_url.py. How to do it Execute the recipe's file with your python interpreter. It will run the following code: util = URLUtility(const.ApodEclipseImage()) print(util.filename_without_ext) This results in the following output: Reading URL: https://apod.nasa.gov/apod/image/1709/BT5643s.jpg Read 171014 bytes The filename is: BT5643s How it works In the constructor for URLUtility, there is a call to urlib.parse.urlparse. The following demonstrates using the function interactively: >>> parsed = urlparse(const.ApodEclipseImage()) >>> parsed ParseResult(scheme='https', netloc='apod.nasa.gov', path='/apod/image/1709/BT5643s.jpg', params='', query='', fragment='') The ParseResult object contains the various components of the URL. The path element contains the path and the filename. The call to the .filename_without_ext property returns just the filename without the extension: @property def filename_without_ext(self): filename = os.path.splitext(os.path.basename(self._parsed.path))[0] return filename The call to os.path.basename returns only the filename portion of the path (including the extension). os.path.splittext() then separates the filename and the extension, and the function returns the first element of that tuple/list (the filename). There's more It may seem odd that this does not also return the extension as part of the filename. This is because we cannot assume that the content that we received actually matches the implied type from the extension. It is more accurate to determine this using headers returned by the web server. That's our next recipe. Determining the type of content for a URL When performing a GET requests for content from a web server, the web server will return a number of headers, one of which identities the type of the content from the perspective of the web server. In this recipe we learn to use that to determine what the web server considers the type of the content. Getting ready We again use the URLUtility class. The code for the recipe is in 04/03_determine_content_type_from_response.py. How to do it We proceed as follows: Execute the script for the recipe. It contains the following code: util = URLUtility(const.ApodEclipseImage()) print("The content type is: " + util.contenttype) With the following result: Reading URL: https://apod.nasa.gov/apod/image/1709/BT5643s.jpg Read 171014 bytes The content type is: image/jpeg How it works The .contentype property is implemented as follows: @property def contenttype(self): self.ensure_response() return self._response.headers['content-type'] The .headers property of the _response object is a dictionary-like class of headers. The content-type key will retrieve the content-type specified by the server. This call to the ensure_response() method simply ensures that the .read() function has been executed. There's more The headers in a response contain a wealth of information. 
If we look more closely at the headers property of the response, we can see the following headers are returned: >>> response = urllib.request.urlopen(const.ApodEclipseImage()) >>> for header in response.headers: print(header) Date Server Last-Modified ETag Accept-Ranges Content-Length Connection Content-Type Strict-Transport-Security And we can see the values for each of these headers. >>> for header in response.headers: print(header + " ==> " + response.headers[header]) Date ==> Tue, 26 Sep 2017 19:31:41 GMT Server ==> WebServer/1.0 Last-Modified ==> Thu, 31 Aug 2017 20:26:32 GMT ETag ==> "547bb44-29c06-5581275ce2b86" Accept-Ranges ==> bytes Content-Length ==> 171014 Connection ==> close Content-Type ==> image/jpeg Strict-Transport-Security ==> max-age=31536000; includeSubDomains Many of these we will not examine in this book, but for the unfamiliar it is good to know that they exist. Determining the file extension from a content type It is good practice to use the content-type header to determine the type of content, and to determine the extension to use for storing the content as a file. Getting ready We again use the URLUtility object that we created. The recipe's script is 04/04_determine_file_extension_from_contenttype.py):. How to do it Proceed by running the recipe's script. An extension for the media type can be found using the .extension property: util = URLUtility(const.ApodEclipseImage()) print("Filename from content-type: " + util.extension_from_contenttype) print("Filename from url: " + util.extension_from_url) This results in the following output: Reading URL: https://apod.nasa.gov/apod/image/1709/BT5643s.jpg Read 171014 bytes Filename from content-type: .jpg Filename from url: .jpg This reports both the extension determined from the file type, and also from the URL. These can be different, but in this case they are the same. How it works The following is the implementation of the .extension_from_contenttype property: @property def extension_from_contenttype(self): self.ensure_response() map = const.ContentTypeToExtensions() if self.contenttype in map: return map[self.contenttype] return None The first line ensures that we have read the response from the URL. The function then uses a python dictionary, defined in the const module, which contains a dictionary of content types to extension: def ContentTypeToExtensions(): return { "image/jpeg": ".jpg", "image/jpg": ".jpg", "image/png": ".png" } If the content type is in the dictionary, then the corresponding value will be returned. Otherwise, None is returned. Note the corresponding property, .extension_from_url: @property def extension_from_url(self): ext = os.path.splitext(os.path.basename(self._parsed.path))[1] return ext This uses the same technique as the .filename property to parse the URL, but instead returns the [1] element, which represents the extension instead of the base filename. To summarize, we discussed how effectively we can scrap audio, video and image content from the web using Python. If you liked our post, be sure to check out Web Scraping with Python, which gives more information on performing web scraping efficiently with Python.
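As a small supplement to these recipes: if you would rather not maintain your own content-type-to-extension dictionary, Python's standard mimetypes module covers the common cases. This is not the book's implementation, just an alternative sketch using only the standard library (the URL is again the APOD image):

```python
import mimetypes
import os
from urllib.parse import urlparse
from urllib.request import urlopen

URL = "https://apod.nasa.gov/apod/image/1709/BT5643s.jpg"

with urlopen(URL) as response:
    content_type = response.headers.get("Content-Type")   # e.g. "image/jpeg"

# Extension implied by the server-reported content type.
# Note: guess_extension may return ".jpe" instead of ".jpg" on some platforms,
# which is a good reason to keep an explicit mapping as the recipe does.
ext_from_type = mimetypes.guess_extension(content_type)

# Extension implied by the URL path, as in the earlier recipe
ext_from_url = os.path.splitext(os.path.basename(urlparse(URL).path))[1]

print("From content-type:", ext_from_type)
print("From URL:", ext_from_url)
```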
How to auto-generate texts from Shakespeare writing using deep recurrent neural networks

Savia Lobo
16 Feb 2018
6 min read
[box type="note" align="" class="" width=""]Our article is an excerpt from a book co-authored by Krishna Bhavsar, Naresh Kumar, and Pratap Dangeti, titled as Natural Language Processing with Python Cookbook. This book will give unique recipes to know various aspects of performing Natural Language Processing with NLTK—a leading Python platform for NLP.[/box] Today we will learn to use deep recurrent neural networks (RNN) to predict the next character based on the given length of a sentence. This way of training a model is able to generate automated text continuously, which can imitate the writing style of the original writer with enough training on the number of epochs and so on. Getting ready... The Project Gutenberg eBook of the complete works of William Shakespeare's dataset is used to train the network for automated text generation. Data can be downloaded from http:// www.gutenberg.org/ for the raw file used for training: >>> from  future import print_function >>> import numpy as np >>> import random >>> import sys The following code is used to create a dictionary of characters to indices and vice-versa mapping, which we will be using to convert text into indices at later stages. This is because deep learning models cannot understand English and everything needs to be mapped into indices to train these models: >>> path = 'C:UsersprataDocumentsbook_codes NLP_DL shakespeare_final.txt' >>> text = open(path).read().lower() >>> characters = sorted(list(set(text))) >>> print('corpus length:', len(text)) >>> print('total chars:', len(characters)) >>> char2indices = dict((c, i) for i, c in enumerate(characters)) >>> indices2char = dict((i, c) for i, c in enumerate(characters)) How to do it… Before training the model, various preprocessing steps are involved to make it work. The following are the major steps involved: Preprocessing: Prepare X and Y data from the given entire story text file and converting them into indices vectorized format. Deep learning model training and validation: Train and validate the deep learning model. Text generation: Generate the text with the trained model. How it works... The following lines of code describe the entire modeling process of generating text from Shakespeare's writings. Here we have chosen character length. This needs to be considered as 40 to determine the next best single character, which seems to be very fair to consider. Also, this extraction process jumps by three steps to avoid any overlapping between two consecutive extractions, to create a dataset more fairly: # cut the text in semi-redundant sequences of maxlen characters >>> maxlen = 40 >>> step = 3 >>> sentences = [] >>> next_chars = [] >>> for i in range(0, len(text) - maxlen, step): ... sentences.append(text[i: i + maxlen]) ... next_chars.append(text[i + maxlen]) ... print('nb sequences:', len(sentences)) The following screenshot depicts the total number of sentences considered, 193798, which is enough data for text generation: The next code block is used to convert the data into a vectorized format for feeding into deep learning models, as the models cannot understand anything about text, words, sentences and so on. Initially, total dimensions are created with all zeros in the NumPy array and filled with relevant places with dictionary mappings: # Converting indices into vectorized format >>> X = np.zeros((len(sentences), maxlen, len(characters)), dtype=np.bool) >>> y = np.zeros((len(sentences), len(characters)), dtype=np.bool) >>> for i, sentence in enumerate(sentences): ... 
for t, char in enumerate(sentence): ... X[i, t, char2indices[char]] = 1 ... y[i, char2indices[next_chars[i]]] = 1 >>> from keras.models import Sequential >>> from keras.layers import Dense, LSTM,Activation,Dropout >>> from keras.optimizers import RMSprop The deep learning model is created with RNN, more specifically Long Short-Term Memory networks with 128 hidden neurons, and the output is in the dimensions of the characters. The number of columns in the array is the number of characters. Finally, the softmax function is used with the RMSprop optimizer. We encourage readers to try with other various parameters to check out how results vary: #Model Building >>> model = Sequential() >>> model.add(LSTM(128, input_shape=(maxlen, len(characters)))) >>> model.add(Dense(len(characters))) >>> model.add(Activation('softmax')) >>> model.compile(loss='categorical_crossentropy', optimizer=RMSprop(lr=0.01)) >>> print (model.summary()) As mentioned earlier, deep learning models train on number indices to map input to output (given a length of 40 characters, the model will predict the next best character). The following code is used to convert the predicted indices back to the relevant character by determining the maximum index of the character: # Function to convert prediction into index >>> def pred_indices(preds, metric=1.0): ... preds = np.asarray(preds).astype('float64') ... preds = np.log(preds) / metric ... exp_preds = np.exp(preds) ... preds = exp_preds/np.sum(exp_preds) ... probs = np.random.multinomial(1, preds, 1) ... return np.argmax(probs) The model will be trained over 30 iterations with a batch size of 128. And also, the diversity has been changed to see the impact on the predictions: # Train and Evaluate the Model >>> for iteration in range(1, 30): ... print('-' * 40) ... print('Iteration', iteration) ... model.fit(X, y,batch_size=128,epochs=1).. ... start_index = random.randint(0, len(text) - maxlen - 1) ... for diversity in [0.2, 0.7,1.2]: ... print('n----- diversity:', diversity) ... generated = '' ... sentence = text[start_index: start_index + maxlen] ... generated += sentence ... print('----- Generating with seed: "' + sentence + '"') ... sys.stdout.write(generated) ... for i in range(400): ... x = np.zeros((1, maxlen, len(characters))) ... for t, char in enumerate(sentence): ... x[0, t, char2indices[char]] = 1. ... preds = model.predict(x, verbose=0)[0] ... next_index = pred_indices(preds, diversity) ... pred_char = indices2char[next_index] ... generated += pred_char ... sentence = sentence[1:] + pred_char ... sys.stdout.write(pred_char) ... sys.stdout.flush() ... print("nOne combination completed n") The results are shown in the next screenshot to compare the first iteration (Iteration 1) and final iteration (Iteration 29). It is apparent that with enough training, the text generation seems to be much better than with Iteration 1: Text generation after Iteration 29 is shown in this image: Though the text generation seems to be magical, we have generated text using Shakespeare's writings, proving that with the right training and handling, we can imitate any style of writing of a particular writer. If you found this post useful, you may check out this book Natural Language Processing with Python Cookbook to analyze sentence structure and master lexical analysis, syntactic and semantic analysis, pragmatic analysis, and other NLP techniques.  
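To make the one-hot encoding step above concrete, here is a tiny self-contained sketch. A toy corpus stands in for the Shakespeare text and maxlen is shortened so the shapes stay readable; it is not part of the original recipe:

```python
import numpy as np

# Toy stand-in for the Shakespeare corpus, just to show the encoding shape
text = "to be or not to be"
characters = sorted(set(text))
char2indices = {c: i for i, c in enumerate(characters)}

maxlen = 8                      # the chapter uses 40; 8 keeps the example readable
seed = text[:maxlen]            # "to be or"

# One-hot encode the seed exactly the way X is built above:
# shape is (1 sample, maxlen timesteps, vocabulary size)
x = np.zeros((1, maxlen, len(characters)), dtype=np.float32)
for t, char in enumerate(seed):
    x[0, t, char2indices[char]] = 1.0

print(x.shape)        # (1, 8, vocabulary size)
print(x[0, 0])        # one-hot row for the first character, 't'
```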
Using R to implement Kriging - A Spatial Interpolation technique for Geostatistics data

Guest Contributor
15 Nov 2017
7 min read
The Kriging interpolation technique is being increasingly used in geostatistics these days. But how does Kriging work to create a prediction, after all? To start with, Kriging is a method where the distance and direction between the sample data points indicate a spatial correlation. This correlation is then used to explain the different variations in the surface. In cases where the distance and direction give appropriate spatial correlation, Kriging will be able to predict surface variations in the most effective way. As such, we often see Kriging being used in Geology and Soil Sciences. Kriging generates an optimal output surface for prediction which it estimates based on a scattered set with z-values. The procedure involves investigating the z-values’ spatial behavior in an ‘interactive’ manner where advanced statistical relationships are measured (autocorrelation). Mathematically speaking, Kriging is somewhat similar to regression analysis and its whole idea is to predict the unknown value of a function at a given point by calculating the weighted average of all known functional values in the neighborhood. To get the output value for a location, we take the weighted sum of already measured values in the surrounding (all the points that we intend to consider around a specific radius), using a  formula such as the following: In a regression equation, λi would represent the weights of how far the points are from the prediction location. However, in Kriging, λi represent not just the weights of how far the measured points are from prediction location, but also how the measured points are arranged spatially around the prediction location. First, the variograms and covariance functions are generated to create the spatial autocorrelation of data. Then, that data is used to make predictions. Thus, unlike the deterministic interpolation techniques like Inverse Distance Weighted (IDW) and Spline interpolation tools, Kriging goes beyond just estimating a prediction surface. Here, it brings an element of certainty in that prediction surface. That is why experts rate kriging so highly for a strong prediction. Instead of a weather report forecasting a 2 mm rain on a certain Saturday, Kriging also tells you what is the "probability" of a 2 mm rain on that Saturday. We hope you enjoy this simple R tutorial on Kriging by Berry Boessenkool. 
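For reference, the weighted-sum estimator referred to above (the image of the formula does not survive in this excerpt) is the standard ordinary kriging predictor:

```latex
\hat{Z}(s_0) = \sum_{i=1}^{N} \lambda_i \, Z(s_i)
```

Here Z(s_i) are the measured values at the N neighboring locations, the weights λi come from the fitted semivariogram (and sum to one for ordinary kriging), and Ẑ(s_0) is the prediction at the unsampled location s_0.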
Geostatistics: Kriging - spatial interpolation between points, using semivariance We will be covering following sections in our tutorial with supported illustrations: Packages read shapefile Variogram Kriging Plotting Kriging: packages install.packages("rgeos") install.packages("sf") install.packages("geoR") library(sf) # for st_read (read shapefiles), # st_centroid, st_area, st_union library(geoR) # as.geodata, variog, variofit, # krige.control, krige.conv, legend.krige ## Warning: package ’sf’ was built under R version 3.4.1 Kriging: read shapefile / few points for demonstration x <- c(1,1,2,2,3,3,3,4,4,5,6,6,6) y <- c(4,7,3,6,2,4,6,2,6,5,1,5,7) z <- c(5,9,2,6,3,5,9,4,8,8,3,6,7) plot(x,y, pch="+", cex=z/4) Kriging: read shapefile II GEODATA <- as.geodata(cbind(x,y,z)) plot(GEODATA) Kriging: Variogram I EMP_VARIOGRAM <- variog(GEODATA) ## variog: computing omnidirectional variogram FIT_VARIOGRAM <- variofit(EMP_VARIOGRAM) ## variofit: covariance model used is matern ## variofit: weights used: npairs ## variofit: minimisation function used: optim ## Warning in variofit(EMP_VARIOGRAM): initial values not provided - running the default search ## variofit: searching for best initial value ... selected values: ## sigmasq phi tausq kappa ## initial.value "9.19" "3.65" "0" "0.5" ## status "est" "est" "est" "fix" ## loss value: 401.578968904954 Kriging: Variogram II plot(EMP_VARIOGRAM) lines(FIT_VARIOGRAM) Kriging: Kriging res <- 0.1 grid <- expand.grid(seq(min(x),max(x),res), seq(min(y),max(y),res)) krico <- krige.control(type.krige="OK", obj.model=FIT_VARIOGRAM) krobj <- krige.conv(GEODATA, locations=grid, krige=krico) ## krige.conv: model with constant mean ## krige.conv: Kriging performed using global neighbourhood # KRigingObjekt Kriging: Plotting I image(krobj, col=rainbow2(100)) legend.krige(col=rainbow2(100), x.leg=c(6.2,6.7), y.leg=c(2,6), vert=T, off=-0.5, values=krobj$predict) contour(krobj, add=T) colPoints(x,y,z, col=rainbow2(100), legend=F) points(x,y) Kriging: Plotting II library("berryFunctions") # scatterpoints by color colPoints(x,y,z, add=F, cex=2, legargs=list(y1=0.8,y2=1)) Kriging: Plotting III colPoints(grid[ ,1], grid[ ,2], krobj$predict, add=F, cex=2, col2=NA, legargs=list(y1=0.8,y2=1)) Time for a real dataset Precipitation from ca 250 gauges in Brandenburg  as Thiessen Polygons with steep gradients at edges: Exercise 41: Kriging Load and plot the shapefile in PrecBrandenburg.zip with sf::st_read. With colPoints in the package berryFunctions, add the precipitation  values at the centroids of the polygons. Calculate the variogram and fit a semivariance curve. Perform kriging on a grid with a useful resolution (keep in mind that computing time rises exponentially  with grid size). Plot the interpolated  values with image or an equivalent (Rclick 4.15) and add contour lines. What went wrong? (if you used the defaults, the result will be dissatisfying.) How can you fix it? 
Solution for exercise 41.1-2: Kriging Data # Shapefile: p <- sf::st_read("data/PrecBrandenburg/niederschlag.shp", quiet=TRUE) # Plot prep pcol <- colorRampPalette(c("red","yellow","blue"))(50) clss <- berryFunctions::classify(p$P1, breaks=50)$index # Plot par(mar = c(0,0,1.2,0)) plot(p, col=pcol[clss], max.plot=1) # P1: Precipitation # kriging coordinates cent <- sf::st_centroid(p) berryFunctions::colPoints(cent$x, cent$y, p$P1, add=T, cex=0.7, legargs=list(y1=0.8,y2=1), col=pcol) points(cent$x, cent$y, cex=0.7) Solution for exercise 41.3: Variogram library(geoR) # Semivariance: geoprec <- as.geodata(cbind(cent$x,cent$y,p$P1)) vario <- variog(geoprec, max.dist=130000) ## variog: computing omnidirectional variogram fit <-variofit(vario) ## Warning in variofit(vario): initial values not provided - running the default search ## variofit: searching for best initial value ... selected values: ## sigmasq phi tausq kappa ## initial.value "1326.72" "19999.93" "0" "0.5" ## status "est" "est" "est" "fix" ## loss value: 107266266.76371 plot(vario) ; lines(fit) # distance to closest other point: d <- sapply(1:nrow(cent), function(i) min(berryFunctions::distance( cent$x[i], cent$y[i], cent$x[-i], cent$y[-i]))) hist(d/1000, breaks=20, main="distance to closest gauge [km]") mean(d) # 8 km ## [1] 8165.633 Solution for exercise 41.4-5: Kriging # Kriging: res <- 1000 # 1 km, since stations are 8 km apart on average grid <- expand.grid(seq(min(cent$x),max(cent$x),res), seq(min(cent$y),max(cent$y),res)) krico <- krige.control(type.krige="OK", obj.model=fit) krobj <- krige.conv(geoprec, locations=grid, krige=krico) ## krige.conv: model with constant mean ## krige.conv: Kriging performed using global neighbourhood # Set values outside of Brandenburg to NA: grid_sf <- sf::st_as_sf(grid, coords=1:2, crs=sf::st_crs(p)) isinp <- sapply(sf::st_within(grid_sf, p), length) > 0 krobj2 <- krobj krobj2$predict[!isinp] <- NA Solution for exercise 41.5: Kriging Visualization geoR:::image.kriging(krobj2, col=pcol) colPoints(cent$x, cent$y, p$P1, col=pcol, zlab="Prec", cex=0.7, legargs=list(y1=0.1,y2=0.8, x1=0.78, x2=0.87, horiz=F)) plot(p, add=T, col=NA, border=8)#; points(cent$x,cent$y, cex=0.7) [author title="About the author"]Berry started working with R in 2010 during his studies of Geoecology at Potsdam University, Germany. He has since then given a number of R programming workshops and tutorials, including full-week workshops in Kyrgyzstan and Kazachstan. He has left the department for environmental science in summer 2017 to focus more on software development and teaching in the data science industry. Please follow the Github link for detailed explanations on Berry’s R courses. [/author]
ICLR 2019 Highlights: Algorithmic fairness, AI for social good, climate change, protein structures, GAN magic, adversarial ML and much more

Amrata Joshi
09 May 2019
7 min read
The ongoing ICLR 2019 (International Conference on Learning Representations) has brought a pack full of surprises and key specimens of innovation. The conference started on Monday, this week and it’s already the last day today! This article covers the highlights of ICLR 2019 and introduces you to the ongoing research carried out by experts in the field of deep learning, data science, computational biology, machine vision, speech recognition, text understanding, robotics and much more. The team behind ICLR 2019, invited papers based on Unsupervised objectives for agents, Curiosity and intrinsic motivation, Few shot reinforcement learning, Model-based planning and exploration, Representation learning for planning, Learning unsupervised goal spaces, Unsupervised skill discovery and Evaluation of unsupervised agents. https://twitter.com/alfcnz/status/1125399067490684928 ICLR 2019, sponsored by Google marks the presence of 200 researchers contributing to and learning from the academic research community by presenting papers and posters. ICLR 2019 Day 1 highlights: Neural network, Algorithmic fairness, AI for social good and much more Algorithmic fairness https://twitter.com/HanieSedghi/status/1125401294880083968 The first day of the conference started with a talk on Highlights of Recent Developments in Algorithmic Fairness by Cynthia Dwork, an American computer scientist at Harvard University. She focused on "group fairness" notions that address the relative treatment of different demographic groups. And she talked on research in the ML community that explores fairness via representations. The investigation of scoring, classifying, ranking, and auditing fairness was also discussed in this talk by Dwork. Generating high fidelity images with Subscale Pixel Networks and Multidimensional Upscaling https://twitter.com/NalKalchbrenner/status/1125455415553208321 Jacob Menick, a senior research engineer at Google, Deep Mind and Nal Kalchbrenner, staff research scientist and co-creator of the Google Brain Amsterdam research lab talked on Generating high fidelity images with Subscale Pixel Networks and Multidimensional Upscaling. They talked about the challenges involved in generating the images and how they address this issue with the help of Subscale Pixel Network (SPN). It is a conditional decoder architecture that helps in generating an image as a sequence of image slices of equal size. They also explained how Multidimensional Upscaling is used to grow an image in both size and depth via intermediate stages corresponding to distinct SPNs. There were in all 10 workshops conducted on the same day based on AI and deep learning covering topics such as, The 2nd Learning from Limited Labeled Data (LLD) Workshop: Representation Learning for Weak Supervision and Beyond Deep Reinforcement Learning Meets Structured Prediction AI for Social Good Debugging Machine Learning Models The first day also witnessed a few interesting talks on neural networks covering topics such as The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks, How Powerful are Graph Neural Networks? etc. Overall the first day was quite enriching and informative. ICLR 2019 Day 2 highlights: AI in climate change, Protein structure, adversarial machine learning, CNN models and much more AI’s role in climate change https://twitter.com/natanielruizg/status/1125763990158807040 Tuesday, also the second day of the conference, started with an interesting talk on Can Machine Learning Help to Conduct a Planetary Healthcheck? 
by Emily Shuckburgh, a Climate scientist and deputy head of the Polar Oceans team at the British Antarctic Survey. She talked about the sophisticated numerical models of the Earth’s systems which have been developed so far based on physics, chemistry and biology. She then highlighted a set of "grand challenge" problems and discussed various ways in which Machine Learning is helping to advance our capacity to address these. Protein structure with a differentiable simulator On the second day of ICLR 2019, Chris Sander, computational biologist, John Ingraham, Adam J Riesselman, and Debora Marks from Harvard University, talked on Learning protein structure with a differentiable simulator. They about the protein folding problem and their aim to bridge the gap between the expressive capacity of energy functions and the practical capabilities of their simulators by using an unrolled Monte Carlo simulation as a model for data. They also composed a neural energy function with a novel and efficient simulator which is based on Langevin dynamics for building an end-to-end-differentiable model of atomic protein structure given amino acid sequence information. They also discussed certain techniques for stabilizing backpropagation and demonstrated the model's capacity to make multimodal predictions. Adversarial Machine Learning https://twitter.com/natanielruizg/status/1125859734744117249 Day 2 was long and had Ian Goodfellow, a machine learning researcher and inventor of GANs, to talk on Adversarial Machine Learning. He talked about supervised learning works and making machine learning private, getting machine learning to work for new tasks and also reducing the dependency on large amounts of labeled data. He then discussed how the adversarial techniques in machine learning are involved in the latest research frontiers. Day 2 covered poster presentation and a few talks on Enabling Factorized Piano Music Modeling and Generation with the MAESTRO Dataset,  Learning to Remember More with Less Memorization, Learning to Remember More with Less Memorization, etc. ICLR 2019 Day 3 highlights: GAN, Autonomous learning and much more Developmental autonomous learning: AI, Cognitive Sciences and Educational Technology https://twitter.com/drew_jaegle/status/1125522499150721025 Day 3 of ICLR 2019 started with Pierre-Yves Oudeyer’s, research director at Inria talk on Developmental Autonomous Learning: AI, Cognitive Sciences and Educational Technology. He presented a research program that focuses on computational modeling of child development and learning mechanisms. He then discussed the several developmental forces that guide exploration in large real-world spaces. He also talked about the models of curiosity-driven autonomous learning that enables machines to sample and explore their own goals and learning strategies. He then explained how these models and techniques can be successfully applied in the domain of educational technologies. Generating knockoffs for feature selection using Generative Adversarial Networks (GAN) Another interesting topic on the third day of ICLR 2019 was Generating knockoffs for feature selection using Generative Adversarial Networks (GAN) by James Jordon from Oxford University, Jinsung Yoon from California University, and Mihaela Schaar Professor at UCLA. The experts talked about the Generative Adversarial Networks framework that helps in generating knockoffs with no assumptions on the feature distribution. 
They also talked about the model they created, which consists of four networks: a generator, a discriminator, a stability network, and a power network. They further demonstrated the capability of their model to perform feature selection. This was followed by a few more interesting talks, such as Deterministic Variational Inference for Robust Bayesian Neural Networks, and a series of poster presentations.
ICLR 2019 Day 4 highlights: Neural networks, RNN, neuro-symbolic concepts and much more
Learning natural language interfaces with neural models
Today's focus was more on neural models and neuro-symbolic concepts. The day started with a talk on Learning natural language interfaces with neural models by Mirella Lapata, a computer scientist. She gave an overview of recent progress on learning natural language interfaces, which allow users to interact with various devices and services using everyday language. She also addressed the structured prediction problem of mapping natural language utterances onto machine-interpretable representations. She further outlined the various challenges it poses and described a general modeling framework based on neural networks that tackles these challenges.
Ordered neurons: Integrating tree structures into Recurrent Neural Networks
https://twitter.com/mmjb86/status/1126272417444311041
The next interesting talk was on Ordered neurons: Integrating tree structures into Recurrent Neural Networks by Professors Yikang Shen, Aaron Courville, and Shawn Tan from Montreal University, and Alessandro Sordoni, a researcher at Microsoft. In this talk, the experts presented their proposed new RNN unit, ON-LSTM, which achieves good performance on four different tasks: language modeling, unsupervised parsing, targeted syntactic evaluation, and logical inference. The last day of ICLR 2019 was exciting; it let researchers present their innovations and gave attendees a chance to interact with the experts. For a complete overview of each of these sessions, you can head over to ICLR's Facebook page.
Paper in Two minutes: A novel method for resource efficient image classification
Google I/O 2019 D1 highlights: smarter display, search feature with AR capabilities, Android Q, linguistically advanced Google lens and more
Google I/O 2019: Flutter UI framework now extended for Web, Embedded, and Desktop
Why choose OpenCV over MATLAB for your next Computer Vision project

Vincy Davis
20 Dec 2019
6 min read
Scientific Computing relies on executing computer algorithms coded in different programming languages. One such interdisciplinary scientific field is the study of Computer Vision, often abbreviated as CV. Computer Vision is used to develop techniques that can automate tasks like acquiring, processing, analyzing and understanding digital images. It is also utilized for extracting high-dimensional data from the real world to produce symbolic information. In simple words, Computer Vision gives computers the ability to see, understand and process images and videos like humans. The vast advances in hardware, machine learning tools, and frameworks have resulted in the implementation of Computer Vision in various fields like IoT, manufacturing, healthcare, security, etc. Major tech firms like Amazon, Google, Microsoft, and Facebook are investing immensely in the research and development of this field. Out of the many tools and libraries available for Computer Vision nowadays, there are two major tools OpenCV and Matlab that stand out in terms of their speed and efficiency. In this article, we will have a detailed look at both of them. Further Reading [box type="shadow" align="" class="" width=""]To learn how to build interesting image recognition models like setting up license plate recognition using OpenCV, read the book “Computer Vision Projects with OpenCV and Python 3” by author Matthew Rever. The book will also guide you to design and develop production-grade Computer Vision projects by tackling real-world problems.[/box] OpenCV: An open-source multiplatform solution tailored for Computer Vision OpenCV, developed by Intel and now supported by Willow Garage, is released under the BSD 3-Clause license and is free for commercial use. It is one of the most popular computer vision tools aimed at providing a well-optimized, well tested, and open-source (C++)-based implementation for computer vision algorithms. The open-source library has interfaces for multiple languages like C++, Python, and Java and supports Linux, macOS, Windows, iOS, and Android. Many of its functions are implemented on GPU. The first stable release of OpenCV version 1.0 was in the year 2006. The OpenCV community has grown rapidly ever since and with its latest release, OpenCV version 4.1.1, it also brings improvements in the dnn (Deep Neural Networks) module, which is a popular module in the library that implements forward pass (inferencing) with deep networks, which are pre-trained using popular deep learning frameworks.  Some of the features offered by OpenCV include: imread function to read the images in the BGR (Blue-Green-Red) format by default. Easy up and downscaling for resizing an image. Supports various interpolation and downsampling methods like INTER_NEAREST to represent the nearest neighbor interpolation. Supports multiple variations of thresholding like adaptive thresholding, bitwise operations, edge detection, image filtering, image contours, and more. Enables image segmentation (Watershed Algorithm) to classify each pixel in an image to a particular class of background and foreground. Enables multiple feature-matching algorithms, like brute force matching, knn feature matching, among others. With its active community and regular updates for Machine Learning, OpenCV is only going to grow by leaps and bounds in the field of Computer Vision projects.  
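To give a feel for a few of the capabilities listed above, here is a short Python sketch using OpenCV's standard API. The input filename sample.jpg is a placeholder assumption; any image on disk will do:

```python
import cv2

# Read an image; OpenCV returns it in BGR channel order by default
img = cv2.imread("sample.jpg")           # "sample.jpg" is an illustrative path
if img is None:
    raise FileNotFoundError("sample.jpg not found")

# Downscale using nearest-neighbor interpolation
small = cv2.resize(img, None, fx=0.5, fy=0.5, interpolation=cv2.INTER_NEAREST)

# Convert to grayscale, then apply adaptive thresholding
gray = cv2.cvtColor(small, cv2.COLOR_BGR2GRAY)
thresh = cv2.adaptiveThreshold(gray, 255, cv2.ADAPTIVE_THRESH_MEAN_C,
                               cv2.THRESH_BINARY, 11, 2)

# Simple Canny edge detection
edges = cv2.Canny(gray, 100, 200)

cv2.imwrite("thresholded.png", thresh)
cv2.imwrite("edges.png", edges)
```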
MATLAB: A licensed quick prototyping tool with OpenCV integration One disadvantage of OpenCV, which makes novice computer vision users tilt towards Matlab is the former's complex nature. OpenCV is comparatively harder to learn due to lack of documentation and error handling codes. Matlab, developed by MathWorks is a proprietary programming language with a multi-paradigm numerical computing environment. It has over 3 million users worldwide and is considered one of the easiest and most productive software for engineers and scientists. It has a very powerful and swift matrix library.  Matlab also works in integration with OpenCV. This enables MATLAB users to explore, analyze, and debug designs that incorporate OpenCV algorithms. The support package of MATLAB includes the data type conversions necessary for MATLAB and OpenCV. MathWorks provided Computer Vision Toolbox renders algorithms, functions, and apps for designing and testing computer vision, 3D vision, and video processing systems. It also allows detection, tracking, feature extraction, and matching of objects. Matlab can also train custom object detectors using deep learning and machine learning algorithms such as YOLO v2, Faster R-CNN, and ACF. Most of the toolbox algorithms in Matlab support C/C++ code generation for integrating with existing code, desktop prototyping, and embedded vision system deployment. However, Matlab does not contain as many functions for computer vision as OpenCV, which has more of its functions implemented on GPU. Another issue with Matlab is that it's not open-source, it’s license is costly and the programs are not portable.  Another important factor which matters a lot in computer vision is the performance of a code, especially when working on real-time video processing.  Which has a faster execution time? OpenCV or Matlab? Along with Computer Vision, other fields also require faster execution while choosing a programming language or library for implementing any function. This factor is analyzed in detail in a paper titled “Matlab vs. OpenCV: A Comparative Study of Different Machine Learning Algorithms”.  The paper provides a very practical comparative study between Matlab and OpenCV using 20 different real datasets. The differentiation is based on the execution time for various machine learning algorithms like Classification and Regression Trees (CART), Naive Bayes, Boosting, Random Forest and K-Nearest Neighbor (KNN). The experiments were run on an Intel core 2 duo P7450 machine, with 3GB RAM, and Ubuntu 11.04 32-bit operating system on Matlab version 7.12.0.635 (R2011a), and OpenCV C++ version 2.1.  The paper states, “To compare the speed of Matlab and OpenCV for a particular machine learning algorithm, we run the algorithm 1000 times and take the average of the execution times. Averaging over 1000 experiments is more than necessary since convergence is reached after a few hundred.” The outcome of all the experiments revealed that though Matlab is a successful scientific computing environment, it is outrun by OpenCV for almost all the experiments when their execution time is considered. The paper points out that this could be due to a combination of a number of dimensionalities, sample size, and the use of training sets. One of the listed machine learning algorithms KNN produced a log time ratio of 0.8 and 0.9 on datasets D16 and D17 respectively.  Clearly, Matlab is great for exploring and fiddling with computer vision concepts as researchers and students at universities that can afford the software. 
However, when it comes to building production-ready, real-world computer vision projects, OpenCV beats Matlab hands down. You can learn about building more Computer Vision projects, like human pose estimation using TensorFlow, from our book ‘Computer Vision Projects with OpenCV and Python 3’.
Master the art of face swapping with OpenCV and Python by Sylwek Brzęczkowski, developer at TrustStamp
NVIDIA releases Kaolin, a PyTorch library to accelerate research in 3D computer vision and AI
Generating automated image captions using NLP and computer vision [Tutorial]
Computer vision is growing quickly. Here’s why.
Introducing Intel’s OpenVINO computer vision toolkit for edge computing
Jim Balsillie on Data Governance Challenges and 6 Recommendations to tackle them

Savia Lobo
05 Jun 2019
5 min read
The Canadian Parliament's Standing Committee on Access to Information, Privacy and Ethics hosted the hearing of the International Grand Committee on Big Data, Privacy and Democracy from Monday, May 27 to Wednesday, May 29.  Witnesses from at least 11 countries appeared before representatives to testify on how governments can protect democracy and citizen rights in the age of big data. This section of the hearing, which took place on May 28, includes Jim Balsillie’s take on Data Governance. Jim Balsillie, Chair, Centre for International Governance Innovation; Retired Chairman and co-CEO of BlackBerry, starts off by talking about how Data governance is the most important public policy issue of our time. It is cross-cutting with economic, social and security dimensions. It requires both national policy frameworks and international coordination. He applauded the seriousness and integrity of Mr. Zimmer Angus and Erskine Smith who have spearheaded a Canadian bipartisan effort to deal with data governance over the past three years. “My perspective is that of a capitalist and global tech entrepreneur for 30 years and counting. I'm the retired Chairman and co-CEO of Research in Motion, a Canadian technology company [that] we scaled from an idea to 20 billion in sales. While most are familiar with the iconic BlackBerry smartphones, ours was actually a platform business that connected tens of millions of users to thousands of consumer and enterprise applications via some 600 cellular carriers in over 150 countries. We understood how to leverage Metcalfe's law of network effects to create a category-defining company, so I'm deeply familiar with multi-sided platform business model strategies as well as navigating the interface between business and public policy.”, he adds. He further talks about his different observations about the nature, scale, and breadth of some collective challenges for the committee’s consideration: Disinformation in fake news is just two of the negative outcomes of unregulated attention based business models. They cannot be addressed in isolation; they have to be tackled horizontally as part of an integrated whole. To agonize over social media’s role in the proliferation of online hate, conspiracy theories, politically motivated misinformation, and harassment, is to miss the root and scale of the problem. Social media’s toxicity is not a bug, it's a feature. Technology works exactly as designed. Technology products services and networks are not built in a vacuum. Usage patterns drive product development decisions. Behavioral scientists involved with today's platforms helped design user experiences that capitalize on negative reactions because they produce far more engagement than positive reactions. Among the many valuable insights provided by whistleblowers inside the tech industry is this quote, “the dynamics of the attention economy are structurally set up to undermine the human will.” Democracy and markets work when people can make choices align with their interests. The online advertisement driven business model subverts choice and represents a fundamental threat to markets election integrity and democracy itself. Technology gets its power through the control of data. Data at the micro-personal level gives technology unprecedented power to influence. 
“Data is not the new oil; it's the new plutonium: amazingly powerful, dangerous when it spreads, difficult to clean up, and with serious consequences when improperly used.”
Data deployed through next-generation 5G networks is transforming passive infrastructure into veritable digital nervous systems.
Our current domestic and global institutions, rules, and regulatory frameworks are not designed to deal with any of these emerging challenges. Because cyberspace knows no natural borders, the effects of digital transformation cannot be hermetically sealed within national boundaries; international coordination is critical.
With these observations, Balsillie further provided six recommendations:
Eliminate tax deductibility of specific categories of online ads.
Ban personalized online advertising for elections.
Implement strict data governance regulations for political parties.
Provide effective whistleblower protections.
Add explicit personal liability alongside corporate responsibility to affect the CEO and board of directors’ decision-making.
Create a new institution for like-minded nations to address digital cooperation and stability.
Technology is becoming the new Fourth Estate
Technology is disrupting governance and, if left unchecked, could render liberal democracy obsolete. By displacing the print and broadcast media and influencing public opinion, technology is becoming the new Fourth Estate. In our system of checks and balances, this makes technology co-equal with the executive, the legislative, and the judiciary. When this new Fourth Estate declines to appear before this committee, as Silicon Valley executives are currently doing, it is symbolically asserting this aspirational co-equal status. But it is asserting that status and claiming its privileges without the traditions, disciplines, legitimacy, or transparency that checked the power of the traditional Fourth Estate. The work of this international grand committee is a vital first step towards redress of this untenable situation. Referring to what Professor Zuboff said last night, we Canadians are currently in a historic battle for the future of our democracy with a charade called Sidewalk Toronto. He concludes by saying, "I'm here to tell you that we will win that battle."
To know more, you can listen to the full hearing video titled "Meeting No. 152 ETHI - Standing Committee on Access to Information, Privacy, and Ethics" on ParlVU.
Speech2Face: A neural network that “imagines” faces from hearing voices. Is it too soon to worry about ethnic profiling?
UK lawmakers to social media: “You’re accessories to radicalization, accessories to crimes”, hearing on spread of extremist content
Key Takeaways from Sundar Pichai’s Congress hearing over user data, political bias, and Project Dragonfly
4 popular algorithms for Distance-based outlier detection

Sugandha Lahoti
01 Dec 2017
7 min read
[box type="note" align="" class="" width=""]The article is an excerpt from our book titled Mastering Java Machine Learning by Dr. Uday Kamath and  Krishna Choppella.[/box] This book introduces you to an array of expert machine learning techniques, including classification, clustering, anomaly detection, stream learning, active learning, semi-supervised learning, probabilistic graph modelling and a lot more. The article given below is extracted from Chapter 5 of the book - Real-time Stream Machine Learning, explaining 4 popular algorithms for Distance-based outlier detection. Distance-based outlier detection is the most studied, researched, and implemented method in the area of stream learning. There are many variants of the distance-based methods, based on sliding windows, the number of nearest neighbors, radius and thresholds, and other measures for considering outliers in the data. We will try to give a sampling of the most important algorithms in this article. Inputs and outputs Most algorithms take the following parameters as inputs: Window size w, corresponding to the fixed size on which the algorithm looks for outlier patterns. Sliding size s, corresponds to the number of new instances that will be added to the window, and old ones removed. The count threshold k of instances when using nearest neighbor computation. The distance threshold R used to define the outlier threshold in distances. Outliers as labels or scores (based on neighbors and distance) are outputs. How does it work? We present different variants of distance-based stream outlier algorithms, giving insights into what they do differently or uniquely. The unique elements in each algorithm define what happens when the slide expires, how a new slide is processed, and how outliers are reported. Exact Storm Exact Storm stores the data in the current window w in a well-known index structure, so that the range query search or query to find neighbors within the distance R for a given point is done efficiently. It also stores k preceding and succeeding neighbors of all data points: Expired Slide: Instances in expired slides are removed from the index structure that affects range queries but are preserved in the preceding list of neighbors. New Slide: For each data point in the new slide, range query R is executed, results are used to update the preceding and succeeding list for the instance, and the instance is stored in the index structure. Outlier Reporting: In any window, after the processing of expired and new slide elements is complete, any instance with at least k elements from the succeeding list and non-expired preceding list is reported as an outlier. Abstract-C Abstract-C keeps the index structure similar to Exact Storm but instead of preceding and succeeding lists for every object it just maintains a list of counts of neighbors for the windows the instance is participating in: Expired Slide: Instances in expired slides are removed from the index structure that affects range queries and the first element from the list of counts is removed corresponding to the last window. New Slide: For each data point in the new slide, range query R is executed and results are used to update the list count. For existing instances, the count gets updated with new neighbors and instances are added to the index structure. Outlier Reporting: In any window, after the processing of expired and new slide elements is complete, all instances with a neighbors count less than k in the current window are considered outliers. 
Direct Update of Events (DUE) DUE keeps the index structure for efficient range queries exactly like the other algorithms but has a different assumption, that when an expired slide occurs, not every instance is affected in the same way. It maintains two priority queues: the unsafe inlier queue and the outlier list. The unsafe inlier queue has sorted instances based on the increasing order of smallest expiration time of their preceding neighbors. The outlier list has all the outliers in the current window: Expired Slide: Instances in expired slides are removed from the index structure that affects range queries and the unsafe inlier queue is updated for expired neighbors. Those unsafe inliers which become outliers are removed from the priority queue and moved to the outlier list. New Slide: For each data point in the new slide, range query R is executed, results are used to update the succeeding neighbors of the point, and only the most recent preceding points are updated for the instance. Based on the updates, the point is added to the unsafe inlier priority queue or removed from the queue and added to the outlier list. Outlier Reporting: In any window, after the processing of expired and new slide elements is complete, all instances in the outlier list are reported as outliers. Micro Clustering based Algorithm (MCOD) Micro-clustering based outlier detection overcomes the computational issues of performing range queries for every data point. The micro-cluster data structure is used instead of range queries in these algorithms. A micro-cluster is centered around an instance and has a radius of R. All the points belonging to the micro-clusters become inliers. The points that are outside can be outliers or inliers and stored in a separate list. It also has a data structure similar to DUE to keep a priority queue of unsafe inliers: Expired Slide: Instances in expired slides are removed from both microclusters and the data structure with outliers and inliers. The unsafe inlier queue is updated for expired neighbors as in the DUE algorithm. Microclusters are also updated for non-expired data points. New Slide: For each data point in the new slide, the instance either becomes a center of a micro-cluster, or part of a micro-cluster or added to the event queue and the data structure of the outliers. If the point is within the distance R, it gets assigned to an existing micro-cluster; otherwise, if there are k points within R, it becomes the center of the new micro cluster; if not, it goes into the two structures of the event queue and possible outliers.   Outlier Reporting: In any window, after the processing of expired and new slide elements is complete, any instance in the outlier structure with less than k neighboring instances is reported as an outlier. Advantages and limitations The advantages and limitations are as follows:   Exact Storm is demanding in storage and CPU for storing lists and retrieving neighbors. Also, it introduces delays; even though they are implemented in efficient data structures, range queries can be slow.   Abstract-C has a small advantage over Exact Storm, as no time is spent on finding active neighbors for each instance in the window. The storage and time spent is still very much dependent on the window and slide chosen.   DUE has some advantage over Exact Storm and Abstract-C as it can efficiently re-evaluate the "inlierness" of points (that is, whether unsafe inliers remain inliers or become outliers) but sorting the structure impacts both CPU and memory. 
MCOD has distinct advantages in memory and CPU owing to the use of the micro-cluster structure and the removal of the pairwise distance computation. Storing the neighborhood information in micro-clusters helps memory too. Validation and evaluation of stream-based outliers is still an open research area. By varying parameters such as window size, number of neighbors within the radius, and so on, we can determine the sensitivity of the performance metrics (CPU time per object, number of outliers detected in the stream, and TP/precision/recall/area under the PRC curve) and assess the robustness. If you liked the above article, check out our book Mastering Java Machine Learning to explore more advanced machine learning techniques using the best Java-based tools available.
Salesforce is buying Tableau in a $15.7 billion all-stock deal

Richard Gall
10 Jun 2019
4 min read
Salesforce, one of the world's leading CRM platforms, is buying data visualization software Tableau in an all-stock deal worth $15.7 billion. The news comes just days after it emerged that Google is buying one of Tableau's competitors in the data visualization market, Looker. Taken together, the stories highlight the importance of analytics to some of the planet's biggest companies. They suggest that despite years of the big data revolution, it's only now that market-leading platforms are starting to realise that their customers want the level of capabilities offered by the best in the data visualization space. Salesforce shareholders will use their stock to purchase Tableau. As the press release published on the Salesforce site explains "each share of Tableau Class A and Class B common stock will be exchanged for 1.103 shares of Salesforce common stock, representing an enterprise value of $15.7 billion (net of cash), based on the trailing 3-day volume weighted average price of Salesforce's shares as of June 7, 2019." The acquisition is expected to be completed by the end of October 2019. https://twitter.com/tableau/status/1138040596604575750 Why is Salesforce buying Tableau? The deal is an incredible result for Tableau shareholders. At the end of last week, its market cap was $10.7 billion. This has led to some scepticism about just how good a deal this is for Salesforce. One commenter on Hacker News said "this seems really high for a company without earnings and a weird growth curve. Their ticker is cool and maybe sales force [sic] wants to be DATA on nasdaq. Otherwise, it will be hard to justify this high markup for a tool company." With Salesforce shares dropping 4.5% as markets opened this week, it seems investors are inclined to agree - Salesforce is certainly paying a premium for Tableau. However, whatever the long term impact of the acquisition, the price paid underlines the fact that Salesforce views Tableau as exceptionally important to its long term strategy. It opens up an opportunity for Salesforce to reposition and redefine itself as much more than just a CRM platform. It means it can start compete with the likes of Microsoft, which has a full suite of professional and business intelligence tools. Moreover, it also provides the platform with another way of potentially onboarding customers - given Tableau is well-known as a powerful yet accessible data visualization tool, it create an avenue through which new users can find their way to the Salesforce product. Marc Benioff, Chair and co-CEO of Salesforce, said "we are bringing together the world’s #1 CRM with the #1 analytics platform. Tableau helps people see and understand data, and Salesforce helps people engage and understand customers. It’s truly the best of both worlds for our customers--bringing together two critical platforms that every customer needs to understand their world.” Tableau has been a target for Salesforce for some time. Leaked documents from 2016 found that the data visualization was one of 14 companies that Salesforce had an interest in (another was LinkedIn, which would eventually be purchased by Microsoft). Read next: Alteryx vs. Tableau: Choosing the right data analytics tool for your business What's in it for Tableau (aside from the money...)? For Tableau, there are many other benefits of being purchased by Salesforce alongside the money. Primarily this is about expanding the platform's reach - Salesforce users are people who are interested in data with a huge range of use cases. 
By joining up with Salesforce, Tableau will become their go-to data visualization tool. "As our two companies began joint discussions," Tableau CEO Adam Selipsky said, "the possibilities of what we might do together became more and more intriguing. They have leading capabilities across many CRM areas including sales, marketing, service, application integration, AI for analytics and more. They have a vast number of field personnel selling to and servicing customers. They have incredible reach into the fabric of so many customers, all of whom need rich analytics capabilities and visual interfaces... On behalf of our customers, we began to dream about what we might accomplish if we could combine our ability to help people see and understand data with their ability to help people engage and understand customers."

What will happen to Tableau?

Tableau won't be going anywhere. It will continue to exist under its own brand, with the current leadership - including Selipsky - all remaining in place.

What does this all mean for the technology market?

At the moment, it's too early to say - but the last year or so has seen some major high-profile acquisitions by tech companies. Perhaps we're seeing the emergence of a tooling arms race, as the biggest organizations attempt to arm themselves with ecosystems of established market-leading tools. Whether this is good or bad for users remains to be seen, however.
Working with Kibana in Elasticsearch 5.x

Savia Lobo
26 Jan 2018
9 min read
[box type="note" align="" class="" width=""]Below given post is a book excerpt from Mastering Elasticsearch 5.x written by  Bharvi Dixit. This book introduces you to the new features of Elasticsearch 5.[/box] The following article showcases Kibana, a tool belongs to the Elastic Stack, and used for visualization and exploration of data residing in Elasticsearch. One can install Kibana and start to explore Elasticsearch indices in minutes — no code, no additional infrastructure required. If you have been using an older version of Kibana, you will notice that it has transformed altogether in terms of functionality. Note: This URL has all the latest changes done in Kibana 5.0: https://www.elastic.co/guide/en/kibana/current/breaking-changes- 5.0.html. Installing Kibana Similar to other Elastic Stack tools, you can visit the following URL to download Kibana 5.0.0, as per your operating system distribution: https://www.elastic.co/downloads/past-releases/kibana-5-0-0 An example of downloading and installing Kibana from the Debian package. First of all, download the package: https://artifacts.elastic.co/downloads/kibana/kibana-5.0.0-amd64.deb Then install it using the following command: sudo dpkg -i kibana-5.0.0-amd64.deb Kibana configuration Once installed, you can find the Kibana configuration file, kibana.yml, inside the/etc/kibana/ directory. All the settings related to Kibana are done only in this file. There is a big list of configuration options available inside the Kibana settings which you can learn about here: https://www.elastic.co/guide/en/kibana/current/settings.html. Starting Kibana Kibana can be started using the following command and it will be started on port 5601 bounded on localhost by default: sudo service kibana start Exploring and visualizing data on Kibana Now all the components of Elastic Stack are installed and configured, we can start exploring the awesomeness of Kibana visualizations. Kibana 5.x is supported on almost all of the latest major web browsers, including Internet Explorer 11+. To load Kibana, you just need to type localhost:5601 in your web browser. You will see different options available in the left panel of the screen, as shown in following figure: These different options are used for the following purposes: Discover: Used for data exploration where you get the access of each field along with a default time. Visualize: Used for creating visualizations of the data in your Elasticsearch indices. You can then build dashboards that display related visualizations. Dashboard: Used to display a collection of saved visualizations. Timelion: A time series data visualizer that enables you to combine totally independent data sources within a single visualization. It is based on simple expression  language. Management: A place where you perform your runtime configuration of Kibana, including both the initial setup and ongoing configuration of index patterns, advanced settings that tweak the behaviors of Kibana itself and saved objects. Dev Tools: Contains the console which is based on the Sense plugin and allows you to write Elasticsearch commands in one tab and see the responses of those commands in the other tab. 
Understanding the Kibana Management screen

The Management screen has three tabs available:

Index Patterns: For selecting and configuring index names
Saved Objects: Where all of your saved visualizations, searches, and dashboards are located
Advanced Settings: Contains advanced settings of Kibana

As you can see on the Management screen, the very first tab is for Index Patterns. Kibana asks you to configure an index pattern so that it can load all the mappings and settings from the defined index. It defaults to logstash-*; you can add as many index patterns or absolute index names as you want and can select them while creating a visualization. Since we already have an index available matching the logstash-* pattern, when you click on the Time-field name drop-down list, you will find that it shows two fields, @timestamp and received_at, which are of the date type, as shown in the following screenshot:

We will select the @timestamp field and hit the Create button. As soon as you do, the following screen appears:

In the above screenshot, you can see that Kibana has loaded all the mappings from our Logstash index. In addition, you can see three labels: blue (for marking this index as the default), yellow (for reloading the mappings; this is needed if you have updated the mapping after selecting the index pattern), and red (for deleting this index pattern altogether from Kibana).

The second tab on the Management screen is about saved objects, which contain all of your saved visualizations, searches, and dashboards, as you can see in the following screenshot. Please note that you can see here the dashboards and visualizations imported from Metricbeat, which we did a while ago.

The third option is for Advanced Settings, and you should not play with the settings shown on this page unless you are aware of the effects of tweaking them.

Discovering data on Kibana

When you move to the Discover page of Kibana, you will see a screen similar to the following:

Setting the time range and auto-refresh interval

Please note that Kibana by default loads the data of the last 15 minutes, which you can change by clicking on the clock sign in the top-right corner of the screen and selecting the desired time range. We have shown it in the following screenshot:

One more thing to look out for is that, after clicking on this clock sign, apart from the time-based settings, you will see one more option in the top corner with the name Auto-refresh. This setting tells Kibana how often it needs to query Elasticsearch. When you click on this setting, you get the option either to turn off auto-refresh completely or to select the desired time interval.

Adding fields for exploration and using the search panel

As you can see in the following screenshot, you have all your fields available inside your index. By default, Kibana shows the timestamp and _source fields on this screen, but you can add your selected fields from the left panel by just moving the cursor over them and then clicking Add. Similarly, if you want to remove a field from the columns, just move the cursor to the field's name in the column heading and click on the cross icon. In addition, Kibana also provides you with a search panel in which you can write queries. For example, in the following screenshot, I have searched for the logstash keyword inside the syslog_message field.
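Incidentally, what you type into the Discover search bar is just a query against Elasticsearch, so the same search can be reproduced from the command line. A small sketch, assuming Elasticsearch is reachable on localhost:9200 and using the logstash-* index pattern and syslog_message field from the example above:

# Same search as in the Discover screenshot: the logstash keyword in syslog_message
curl -s 'http://localhost:9200/logstash-*/_search?q=syslog_message:logstash&size=3&pretty'

# Just count the matching documents instead of returning them
curl -s 'http://localhost:9200/logstash-*/_count?q=syslog_message:logstash&pretty'

Back on the Discover screen: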
When you hit the search button, the search text gets highlighted inside the rendered responses:

Exploring more options on the Visualization page

On Kibana, you will see lots of small arrow signs for opening or collapsing sections/settings. You will see one of these arrows in the following image, in the bottom-left corner (I have also added a custom text on the image just beside the arrow):

When you click on this arrow, the time series histogram gets hidden and you get to see the following screen, which contains multiple properties such as Table, which contains the histogram data in tabular format; Request, which contains the actual JSON query sent to Elasticsearch; Response, which contains the JSON response returned from Elasticsearch; and Statistics, which shows the query execution time and the number of hits matching the query:

Using the Dashboard screen to create/load dashboards

When you click on the Dashboard panel, you first get a blank screen with some options, such as New for creating a dashboard and Open for opening an existing dashboard, along with some more options. If you are creating a dashboard from scratch, you will have to add the built visualizations onto it and then save it under some name. But since we already have a dashboard available, which we imported using Metricbeat, we will click Open, and you will see something similar to the following screenshot on your Kibana page:

Please note that if you do not have Apache installed on your system, selecting the first option, Metricbeat – Apache HTTPD server status, will load a blank dashboard. You can select any other title; for example, if you select the second option, you will see a dashboard similar to the following:

Editing an existing visualization

When you move the cursor over the visualizations presented on the dashboard, you will notice that a pencil sign appears, as shown in the following screenshot:

When you click on that pencil sign, it will open that particular visualization inside the visualization editor panel, as shown in the following screenshot. Here you can edit the properties and either override the same visualization or save it under some other name:

Please note that if you want to create a visualization from scratch, just click on the Visualize option on the left-hand side and it will guide you through the steps of creating the visualization. Kibana provides almost 10 types of visualizations. To get the details of working with each type of visualization, please follow the official Kibana documentation at this link: https://www.elastic.co/guide/en/kibana/master/createvis.html.

Using Sense

Inside the Dev Tools option, you can find the console for Kibana, which was previously known as the Sense editor. This is one of the most wonderful tools for speeding up the learning curve of Elasticsearch, since it provides auto-suggestions for all the endpoints and queries, as shown in the following screenshot:

You will see that the Kibana Console is divided into two parts: the left part is where you write your queries/requests, and after clicking the green arrow, the response from Elasticsearch is rendered inside the right-hand panel:

To summarize, we explained how to work with the Kibana tool in Elasticsearch 5.x. We explored the installation of Kibana, Kibana configuration, and moved on to exploring and visualizing data using Kibana.
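As a closing aside on the Dev Tools Console mentioned above: every request typed into it is an ordinary Elasticsearch REST call, so the same exchanges can be reproduced with curl. A minimal sketch, assuming a local node on port 9200 and the logstash-* index used throughout this excerpt:

# The kind of request you might try first in the Console, here sent with curl
curl -s 'http://localhost:9200/_cluster/health?pretty'

# A simple search body, equivalent to a GET logstash-*/_search request in the Console
curl -s -H 'Content-Type: application/json' 'http://localhost:9200/logstash-*/_search?pretty' -d '{ "query": { "match": { "syslog_message": "logstash" } }, "size": 1 }'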
If you enjoyed this excerpt and want to understand how you can scale your Elasticsearch cluster and improve its performance, check out the book Mastering Elasticsearch 5.x.
How SQL Server handles data under the hood

Sunith Shetty
27 Feb 2018
11 min read
[box type="note" align="" class="" width=""]This article is an excerpt from a book written by Marek Chmel and Vladimír Mužný titled SQL Server 2017 Administrator's Guide. In this book, you will learn the required skills needed to successfully create, design, and deploy database using SQL Server 2017.[/box] Today, we will explore how SQL Server handles data as it is of utmost importance to get an understanding of what, when, and why data should be backed. Data structures and transaction logging We can think about a database as of physical database structure consisting of tables and indexes. However, this is just a human point of view. From the SQL Server's perspective, a database is a set of precisely structured files described in a form of metadata also saved in database structures. A conceptual imagination of how every database works is very helpful when the database has to be backed up correctly. How data is stored Every database on SQL Server must have at least two files: The primary data file with the usual suffix, mdf The transaction log file with the usual suffix, ldf For lots of databases, this minimal set of files is not enough. When the database contains big amounts of data such as historical tables, or the database has big data contention such as production tracking systems, it's good practise to design more data files. Another situation when a basic set of files is not sufficient can arise when documents or pictures would be saved along with relational data. However, SQL Server still is able to store all of our data in the basic file set, but it can lead to a performance bottlenecks and management issues. That's why we need to know all possible storage types useful for different scenarios of deployment. A complete structure of files is depicted in the following image: Database A relational database is defined as a complex data type consisting of tables with a given amount of columns, and each column has its domain that is actually a data type (such as an integer or a date) optionally complemented by some constraints. From SQL Server's perspective, the database is a record written in metadata and containing the name of the database, properties of the database, and names and locations of all files or folders representing storage for the database. This is the same for user databases as well as for system databases. System databases are created automatically during SQL Server installation and are crucial for correct running of SQL Server. We know five system databases. Database master Database master is crucial for the correct running of SQL Server service. In this database is stored data about logins, all databases and their files, instance configurations, linked servers, and so on. SQL Server finds this database at startup via two startup parameters, -d and -l, followed by paths to mdf and ldf files. These parameters are very important in situations when the administrator wants to move the master's files to a different location. Changing their values is possible in the SQL Server Configuration Manager in the SQL Server service Properties dialog on the tab called startup parameters. Database msdb The database msdb serves as the SQL Server Agent service, Database Mail, and Service Broker. In this database are stored job definitions, operators, and other objects needed for administration automation. This database also stores some logs such as backup and restore events of each database. If this database is corrupted or missing, SQL Server Agent cannot start. 
Database model

The model database can be understood as a template for every new database as it is created. During database creation (see the CREATE DATABASE statement on MSDN), files are created on the defined paths, and all objects, data, and properties of the model database are created, copied, and set in the new database. This database must always exist on the instance, because tempdb is created from it at every instance startup; if model is corrupted, tempdb cannot be created and the instance will not start.

Database tempdb

Even if the tempdb database seems to be a regular database like many others, it plays a very special role in every SQL Server instance. This database is used by SQL Server itself as well as by developers to save temporary data such as table variables or static cursors. As this database is intended for data with a short lifespan (temporary data only, which can be stored during the execution of a stored procedure or until a session is disconnected), SQL Server clears this database by truncating all data from it, or by dropping and recreating it, every time it is started. As the tempdb database never contains durable data, it has some special internal behavior, which is the reason why accessing data in this database is several times faster than accessing durable data in other databases. If this database is corrupted, restart SQL Server.

Database resourcedb

The resourcedb is the fifth in our enumeration and consists of definitions for all the system objects of SQL Server, for example, sys.objects. This database is hidden and we don't need to care about it that much. It is not configurable and we don't use regular backup strategies for it. It is always placed in the installation path of SQL Server (in the binn directory) and it is backed up within the filesystem backup. In case of an accident, it is recovered as part of the filesystem as well.

Filegroup

A filegroup is an organizational metadata object containing one or more data files. A filegroup does not have its own representation in the filesystem - it's just a group of files. When any database is created, a filegroup called primary is always created. This primary filegroup always contains the primary data file.

Filegroups can be divided into the following:

Row storage filegroups: These filegroups can contain data files (mdf or ndf).
Filestream filegroups: This kind of filegroup contains not files but folders, to store binary data.
In-memory filegroup: Only one filegroup of this kind can be created in a database. Internally, it is a special case of a filestream filegroup, and it is used by SQL Server to persist data from in-memory tables.

Every filegroup has three simple properties:

Name: A descriptive name for the filegroup. The name must fulfill the naming convention criteria.
Default: In a set of filegroups of the same type, one of these filegroups has this option set to on. This means that when a new table or index is created without an explicit filegroup to store its data in, the default filegroup is used. By default, the primary filegroup is the default one.
Read-only: Every filegroup, except the primary filegroup, can be set to read-only. Let's say that a filegroup is created for last year's history. When data is moved from the current period to tables created in this historical filegroup, the filegroup can be set as read-only, and from then on it does not need to be backed up again and again.

It is a very good approach to divide the database into smaller parts - filegroups with more files.
It helps in distributing data across more physical storage and also makes the database more manageable; backups can be done part by part in shorter times, which fit better into a service window.

Data files

Every database must have at least one data file, called the primary data file. This file is always bound to the primary filegroup. It holds all the metadata of the database, such as structure descriptions (which can be seen through views such as sys.objects, sys.columns, and others), users, and so on. If the database does not have other data files (in the same or other filegroups), all user data is also stored in this file, but this approach is good enough only for smaller databases. Considering how the volume of data in the database grows over time, it is a good practice to add more data files. These files are called secondary data files. Secondary data files are optional and contain user data only.

Both types of data files have the same internal structure. Every file is divided into small 8 KB parts called data pages. SQL Server maintains several types of data pages, such as data pages, index pages, index allocation map (IAM) pages to locate the data pages of tables or indexes, global allocation map (GAM) and shared global allocation map (SGAM) pages to address objects in the database, and so on. Regardless of the type of a certain data page, SQL Server uses the data page as the smallest unit of I/O operations between hard disk and memory. Let's describe some common properties:

A data page never contains data of several objects
Data pages don't know about each other (and that's why SQL Server uses IAMs to allocate all the pages of an object)
Data pages don't have any special physical ordering
A data row must always fit within a data page

These properties may seem unimportant, but when we know them, we can better optimize and manage our databases. Did you know that a data page is the smallest storage unit that can be restored from backup?

As a data page is quite a small storage unit, SQL Server groups data pages into bigger logical units called extents. An extent is a logical allocation unit containing eight coherent data pages. When SQL Server requests data from disk, extents are read into memory. This is the reason why 64 KB NTFS clusters are recommended when formatting disk volumes for data files. Extents can be uniform or mixed. A uniform extent contains data pages belonging to one object only; on the other hand, a mixed extent contains data pages of several objects.

Transaction log

When SQL Server processes any transaction, it works in a way called two-phase commit. When a client starts a transaction by sending a single DML request or by calling the BEGIN TRAN command, SQL Server requests data pages from disk into a part of memory called the buffer cache and makes the requested changes in these data pages in memory. When the DML request is fulfilled or the COMMIT command comes from the client, the first phase of the commit is finished, but the data pages in memory differ from their original versions in the data file on disk. A data page in memory in this state is called dirty.

While a transaction runs, the transaction log file is used by SQL Server for a very detailed chronological description of every single action done during the transaction. This mechanism is called write-ahead logging, WAL for short, and is one of the oldest processes known on SQL Server.
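Before looking at the second phase, here is a tiny sketch of what the first phase looks like from a client's point of view. It is only an illustration: the database TestDB, the table dbo.Orders, and the sqlcmd connection details are placeholders, not objects from the book.

# Phase one: the INSERT is described in the transaction log and the affected data
# pages are changed in the buffer cache only; they stay dirty until a checkpoint runs
sqlcmd -S localhost -U sa -P '<YourPassword>' -d TestDB -Q "BEGIN TRAN; INSERT INTO dbo.Orders(OrderDate) VALUES (SYSDATETIME()); COMMIT;"

# Optional: see how much of each database's transaction log is currently in use
sqlcmd -S localhost -U sa -P '<YourPassword>' -Q "DBCC SQLPERF(LOGSPACE);"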
The second phase of the commit usually does not depend on the client's request and is an internal process called checkpoint. A checkpoint is a periodic action that:

searches for dirty pages in the buffer cache,
saves the dirty pages to their original data file locations,
marks these data pages as clean (or drops them out of memory to free memory space),
marks the transaction as checkpointed, or inactive, in the transaction log.

Write-ahead logging is needed by SQL Server during the recovery process. The recovery process is started on every database every time the SQL Server service starts. When the SQL Server service stops, some pages may remain in a dirty state and are lost from memory. This can lead to two possible situations:

The transaction is completely described in the transaction log, the new content of the data page is lost from memory, and the data pages are not changed in the data file
The transaction was not completed at the moment SQL Server stopped, so the transaction cannot be completely described in the transaction log either, the data pages in memory were not in a stable state (because the transaction was not finished and SQL Server cannot know whether COMMIT or ROLLBACK will occur), and the original version of the data pages in the data files is intact

SQL Server resolves these two situations when it starts. If a transaction is complete in the transaction log but was not marked as checkpointed, SQL Server executes the transaction again with both phases of the COMMIT. If the transaction was not complete in the transaction log when SQL Server stopped, SQL Server will never know what the user's intention with the transaction was, and the incomplete transaction is erased from the transaction log as if it had never started.

The recovery process described above ensures that every database is in the last known consistent state after SQL Server's startup. It's crucial for DBAs to understand write-ahead logging when planning a backup strategy, because when restoring the database, the administrator has to recognize whether it's time to run the recovery process or not.

To summarize, we introduced internal data handling, as it is important not only when performing backups and restores but also for optimizing a database. If you are interested in knowing more about how to back up, recover, and secure SQL Server, do check out the book SQL Server 2017 Administrator's Guide.
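As a practical coda, the structures and processes described in this excerpt can be inspected from the command line. The sketch below uses standard catalog views and commands; the sqlcmd connection details and the TestDB database name are placeholders for your own environment.

# Every data and log file of every database, with its physical location
sqlcmd -S localhost -U sa -P '<YourPassword>' -Q "SELECT DB_NAME(database_id) AS db, name, type_desc, physical_name FROM sys.master_files;"

# Filegroups of one database, including the default and read-only flags discussed above
sqlcmd -S localhost -U sa -P '<YourPassword>' -d TestDB -Q "SELECT name, type_desc, is_default, is_read_only FROM sys.filegroups;"

# Force the second phase of the commit manually: dirty pages are flushed to the data
# files and the corresponding log records are marked inactive
sqlcmd -S localhost -U sa -P '<YourPassword>' -d TestDB -Q "CHECKPOINT;"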