
How-To Tutorials - Data


Big Data Analytics

Packt
03 Nov 2015
10 min read
In this article, Dmitry Anoshin, the author of Learning Hunk, talks about Hadoop and Hunk: how to extract Hunk to a VM, set up a connection with Hadoop, and create dashboards.

We are living in a century of information technology. There are a lot of electronic devices around us that generate a lot of data. For example, you can surf the Internet, visit a couple of news portals, order new Air Max sneakers from a web store, write a couple of messages to your friend, and chat on Facebook. Every action produces data. Multiply these actions by the number of people who have access to the Internet, or who simply use a mobile phone, and we get really big data. How big is it? Nowadays it probably starts at terabytes, or even petabytes. Volume is not the only issue; we also struggle with the variety of data. As a result, it is not enough to analyze only structured data; we should dive deep into unstructured data, such as the machine data generated by various machines.

World-famous enterprises try to collect this extremely big data in order to monetize it and find business insights. Big data offers us new opportunities; for example, we can enrich customer data via social networks using the APIs of Facebook or Twitter. We can build customer profiles and try to predict customer wishes in order to sell our products or improve the customer experience. It is easy to say, but difficult to do. However, organizations try to overcome these challenges and use big data stores, such as Hadoop.

The big problem

Hadoop is a distributed file system and a framework for computation. It is relatively easy to get data into Hadoop; there are plenty of tools to ingest data in different formats. However, it is extremely difficult to get value out of the data you put into Hadoop. Let's look at the path from data to value. First, we start with collecting the data. Then, we spend a lot of time preparing it and making sure it is available for analysis, so that we can ask questions of it. Unfortunately, the questions we ask may not be good, or the answers we get may not be clear, and we have to repeat the cycle all over again, perhaps transforming and reformatting the data. In other words, it is a long and challenging process. What you actually want is to collect the data, spend some time preparing it, and then be able to ask questions and get answers from the data repeatedly. You can then spend your time asking multiple questions and iterating on them to refine the answers you are looking for.

The elegant solution

What if we could take Splunk and put it on top of all the data stored in Hadoop? That is exactly what Splunk did, and the name of the new product, Hunk, comes from combining Hadoop and Splunk. Let's discuss some of the solution goals the Hunk inventors had in mind when they were planning Hunk. Splunk can take data from Hadoop via the Splunk Hadoop Connection app; however, it is a bad idea to copy massive amounts of data from Hadoop to Splunk. It is much better to process the data in place, because Hadoop provides both storage and computation, so why not take advantage of both? Splunk has the extremely powerful Splunk Processing Language (SPL), with its wide range of analytic functions, and this is one of Splunk's main advantages, so it is a good idea to keep SPL in the new product. Splunk has true schema on the fly.
The data that we store in Hadoop changes constantly, so Hunk should be able to build the schema on the fly, independent of the format of the data. It is also very useful to be able to preview results: while a search is running, you get incremental results, which can dramatically reduce waiting time. For example, we don't need to wait until the MapReduce job has finished; we can look at the incremental results and, if they look wrong, restart the search query. Finally, the deployment of Hadoop is not easy, so Splunk tries to make the installation and configuration of Hunk easy for us.

Getting up Hunk

In order to start exploring the Hadoop data, we have to install Hunk on top of our Hadoop cluster. Hunk is easy to install and configure. Let's learn how to deploy Hunk version 6.2.1 on top of an existing CDH cluster. It's assumed that your VM is up and running.

Extracting Hunk to VM

To extract Hunk to the VM, perform the following steps:

1. Open the console application.
2. Run ls -la to see the list of files in your home directory:

[cloudera@quickstart ~]$ cd ~
[cloudera@quickstart ~]$ ls -la | grep hunk
-rw-r--r--   1 root     root     113913609 Mar 23 04:09 hunk-6.2.1-249325-Linux-x86_64.tgz

3. Unpack the archive:

cd /opt
sudo tar xvzf /home/cloudera/hunk-6.2.1-249325-Linux-x86_64.tgz -C /opt

Setting up Hunk variables and configuration files

Perform the following steps to set up the Hunk variables and configuration files:

1. Set the SPLUNK_HOME environment variable. This variable is already added to the profile; it is mentioned here only to stress that it must be set:

export SPLUNK_HOME=/opt/hunk

2. Use the default splunk-launch.conf. This is the basic properties file used by the Hunk service. We don't need to change anything special here, so let's use the default settings:

sudo cp /opt/hunk/etc/splunk-launch.conf.default /opt/hunk/etc/splunk-launch.conf

Running Hunk for the first time

Perform the following steps to run Hunk:

1. Start Hunk:

sudo /opt/hunk/bin/splunk start --accept-license

2. Here is sample output from the first run (some output lines have been removed to reduce the amount of log text):

This appears to be your first time running this version of Splunk.
Copying '/opt/hunk/etc/openldap/ldap.conf.default' to '/opt/hunk/etc/openldap/ldap.conf'.
Generating RSA private key, 1024 bit long modulus
Waiting for web server at http://127.0.0.1:8000 to be available.... Done
If you get stuck, we're here to help. Look for answers here: http://docs.splunk.com
The Splunk web interface is at http://vm-cluster-node1.localdomain:8000

Setting up a data provider and virtual index for the CDR data

We need to accomplish two tasks: provide a technical connector to the underlying data storage, and create a virtual index for the data on that storage. Log in to http://quickstart.cloudera:8000. The system will ask you to change the default admin user password; I set it to admin.

Setting up a connection to Hadoop

Now we are ready to set up the integration between Hadoop and Hunk. First, we need to specify the way Hunk connects to the current Hadoop installation; we are using the most recent way, YARN with MR2. Then, we have to point virtual indexes to the data stored in Hadoop. To do this, perform the following steps:

1. Click on Explore Data.
2. Click on Create a provider and fill in the form to create the data provider:

Property name               Value
Name                        hadoop-hunk-provider
Java home                   /usr/java/jdk1.7.0_67-cloudera
Hadoop home                 /usr/lib/hadoop
Hadoop version              Hadoop 2.x (YARN)
Filesystem                  hdfs://quickstart.cloudera:8020
Resource Manager Address    quickstart.cloudera:8032
Resource Scheduler Address  quickstart.cloudera:8030
HDFS Working Directory      /user/hunk
Job Queue                   default

You don't have to modify any other properties. The HDFS working directory has been created for you in advance; you can create it yourself using the following command:

sudo -u hdfs hadoop fs -mkdir -p /user/hunk

If you did everything correctly, you should see a screen similar to the following screenshot. Let's briefly discuss what we have done. We told Hunk where the Hadoop home and Java home are. Hunk uses Hadoop streaming internally, so it needs to know how to call Java and Hadoop streaming. You can inspect the jobs submitted by Hunk (discussed later) and see lines such as the following:

/opt/hunk/bin/jars/sudobash /usr/bin/hadoop jar "/opt/hunk/bin/jars/SplunkMR-s6.0-hy2.0.jar" "com.splunk.mr.SplunkMR"

This is the MapReduce JAR submitted by Hunk. We also need to tell Hunk where the YARN Resource Manager and Scheduler are located; these services allow us to request cluster resources and run jobs. The job queue can be useful in a production environment, where you might have several queues to distribute cluster resources. We set the queue name to default, since we are not discussing cluster utilization and load balancing here.

Setting up a virtual index for the data stored in Hadoop

Now it's time to create a virtual index. As example data, we are going to add the dataset with the Avro files to the virtual index:

1. Click on Explore Data and then click on Create a virtual index.
2. You'll get a message telling you that there are no indexes; just click on New Virtual Index. A virtual index is metadata: it tells Hunk where the data is located and which provider should be used to read it.

Property name         Value
Name                  milano_cdr_aggregated_10_min_activity
Path to data in HDFS  /masterdata/stream/milano_cdr

Here is an example of the screen you should see after you create your first virtual index.

Accessing data through the virtual index

To access data through the virtual index, perform the following steps:

1. Click on Explore Data and select a provider and virtual index.
2. Select part-m-00000.avro by clicking on it. The Next button will be activated after you pick a file.
3. Preview the data in the Preview Data tab. You should see how Hunk automatically detects the timestamp from our CDR data. Pay attention to the Time column and the field named time_interval in the Event column. The time_interval column holds the time of the record, and Hunk should automatically use that field as the time field.
4. Save the source type by clicking on Save As and then Next.
5. On the Entering Context Settings page, select search in the App context drop-down box. Then, navigate to Sharing context | All apps and click on Next.
6. The last step allows you to review what we've done. Click on Finish to complete the wizard.

Creating a dashboard

Now it's time to see how dashboards work. Let's find the regions where visitors face problems (status = 500) while using our online store:

index="digital_analytics" status=500 | iplocation clientip | geostats latfield=lat longfield=lon count by Country

You should see a map with the proportion of errors for each country. Now let's save it as a dashboard: click on Save as and select Dashboard panel from the drop-down menu.
Name it Web Operations. You should get a new dashboard with a single panel containing our report. We have several previously created reports; let's add them to the newly created dashboard as separate panels:

1. Click on Edit and then Edit panels.
2. Select Add new panel and then New from report, and add one of our previous reports.

Summary

In this article, you learned how to extract Hunk to a VM and how to set up the Hunk variables and configuration files. You learned how to run Hunk and how to set up a data provider and a virtual index for the CDR data. Setting up a connection to Hadoop and a virtual index for the data stored in Hadoop were also covered in detail. Finally, you learned how to create a dashboard.

Further resources on this subject:

- Identifying Big Data Evidence in Hadoop [Article]
- Big Data [Article]
- Understanding Hadoop Backup and Recovery Needs [Article]

Protecting Your Bitcoins

Packt
29 Oct 2015
32 min read
In this article by Richard Caetano, the author of the book Learning Bitcoin, we will explore ways to safely hold your own bitcoin. We will cover the following topics:

- Storing your bitcoins
- Working with brainwallets
- Understanding deterministic wallets
- Storing bitcoins in cold storage
- Good housekeeping with Bitcoin

Storing your bitcoins

The banking system has a legacy of offering various financial services to its customers. Banks offer convenient ways to spend money, such as cheques and credit cards, but the storage of money is their base service. For many centuries, banks have been a safe place to keep money. Customers rely on the interest paid on their deposits, as well as on government insurance against theft and insolvency. Savings accounts have helped make preserving wealth easy and accessible to a large population in the western world.

Yet, some people still save a portion of their wealth as cash or precious metals, usually in a personal safe at home or in a safety deposit box. They may be those who have, over the years, experienced or witnessed the downsides of banking: government confiscation, out-of-control inflation, or runs on the bank. Furthermore, a large population of the world does not have access to the western banking system. For those who live in remote areas or for those without credit, opening a bank account is virtually impossible. They must handle their own money properly to prevent loss or theft, and in some parts of the world, there can be great risk involved. These groups of people, who have little or no access to banking, are called the "underbanked".

For the underbanked population, Bitcoin offers immediate access to a global financial system. Anyone with access to the Internet, or who carries a mobile phone with the ability to send and receive SMS messages, can hold his or her own bitcoin and make global payments. They can essentially become their own bank. However, you must understand that Bitcoin is still in its infancy as a technology. Similar to the Internet circa 1995, it has demonstrated enormous potential, yet it lacks usability for a mainstream audience. As a parallel, e-mail in its early days was a challenge for most users to set up and use, yet today it's as simple as entering your e-mail address and password on your smartphone. Bitcoin has yet to develop through these stages. Yet, with some simple guidance, we can already start realizing its potential. Let's discuss some general guidelines for understanding how to become your own bank.

Bitcoin savings

In most normal cases, we only keep a small amount of cash in our hand wallets to protect ourselves from theft or accidental loss. Much of our money is kept in checking or savings accounts with easy access to pay our bills. Checking accounts are used to cover our rent, utility bills, and other payments, while our savings accounts hold money for longer-term goals, such as a down payment on buying a house.

It's highly advisable to develop a similar system for managing your Bitcoin money. Both local and online wallets provide a convenient way to access your bitcoins for day-to-day transactions. Yet there is the unlikely risk that one could lose his or her Bitcoin wallet due to an accidental computer crash or faulty backup. With online wallets, we run the risk of the website or the company becoming insolvent, or falling victim to cybercrime.
By developing a reliable system, we can adopt our own personal 'Bitcoin savings' account to hold our funds for long-term storage. Usually, these savings are kept offline to protect them from any kind of computer hacking. With protected access to our offline storage, we can periodically transfer money to and from our savings. Thus, we can arrange our Bitcoin funds much as we manage our money with our hand wallets and checking/savings accounts.

Paper wallets

As explained, a private key is a large random number that acts as the key to spend your bitcoins. A cryptographic algorithm is used to generate a private key and, from it, a public address. We can share the public address to receive bitcoins and, with the private key, spend the funds sent to the address. Generally, we rely on our Bitcoin wallet software to handle the creation and management of our private keys and public addresses. As these keys are stored on our computers and networks, they are vulnerable to hacking, hardware failures, and accidental loss.

Private keys and public addresses are, in fact, just strings of letters and numbers. This format makes it easy to move the keys offline for physical storage. Keys printed on paper are called "paper wallets" and are highly portable and convenient to store in a physical safe or a bank safety deposit box. With the private key generated and stored offline, we can safely send bitcoin to its public address.

A paper wallet must include at least one private key and its computed public address. Additionally, the paper wallet can include a QR code to make it convenient to retrieve the key and address. Figure 3.1 is an example of a paper wallet generated by Coinbase:

Figure 3.1 - Paper wallet generated from Coinbase

The paper wallet includes both the public address (labeled Public key) and the private key, both with QR codes to easily transfer them back to your online wallet. Also included on the paper wallet is a place for notes. This type of wallet is easy to print for safe storage. It is recommended that copies are stored securely in multiple locations in case the paper is destroyed. As the private key is shown in plain text, anyone who has access to this wallet has access to the funds. Do not store your paper wallet on your computer. Loss of the paper wallet due to hardware failure, hacking, spyware, or accidental loss can result in the complete loss of your bitcoin. Make sure you have multiple copies of your wallet printed and securely stored before transferring your money.

One time use paper wallets

Transactions from Bitcoin addresses must include the full amount. When sending a partial amount to a recipient, the remaining balance must be sent to a change address. Paper wallets that include only one private key are considered to be "one time use" paper wallets. While you can always send multiple transfers of bitcoin to the wallet, it is highly recommended that you spend the coins only once. Therefore, you shouldn't move a large number of bitcoins to the wallet expecting to spend a partial amount. With this in mind, when using one time use paper wallets, it's recommended that you only save a usable amount to each wallet. This amount could be a block of coins that you'd like to fully redeem to your online wallet.
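To make the earlier point that a private key is just a very large random number concrete, here is a minimal Python sketch (not from the book). It simply draws 256 bits from the operating system's secure random source and prints them as hexadecimal text; real wallet software goes further, encoding the key in Wallet Import Format (Base58Check) and deriving the public address from it with elliptic-curve cryptography, both of which are omitted here.

import secrets

# A Bitcoin private key is, at its core, a 256-bit random number.
# secrets.token_bytes() draws from a cryptographically secure source.
private_key = secrets.token_bytes(32)

# Displayed here as 64 hex characters; wallet software usually shows the key
# in Wallet Import Format (WIF) and derives the public address from it using
# elliptic-curve cryptography, which this sketch does not implement.
print("private key (hex):", private_key.hex())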
Creating a paper wallet

To create a paper wallet in Coinbase, simply log in with your username and password. Click on the Tools link in the left-hand side menu. Next, click on the Paper Wallets link from the top menu. Coinbase will prompt you to Generate a paper wallet or Import a paper wallet. Follow the links to generate a paper wallet. You can expect to see the paper wallet rendered as shown in figure 3.2:

Figure 3.2 - Creating a paper wallet with Coinbase

Coinbase generates your paper wallet entirely in your browser, without sending the private key back to its server. This is important to protect your private key from exposure to the network. You are generating the only copy of your private key. Make sure that you print and securely store multiple copies of your paper wallet before transferring any money to it. Loss of your wallet and private key will result in the loss of your bitcoin.

By clicking the Regenerate button, you can generate multiple paper wallets and store various amounts of bitcoin on each wallet. Each wallet is easily redeemable in full at Coinbase or with other Bitcoin wallet services.

Verifying your wallet's balance

After generating and printing multiple copies of your paper wallet, you're ready to transfer your funds. Coinbase will prompt you with an easy option to transfer the funds from your Coinbase wallet to your paper wallet:

Figure 3.3 - Transferring funds to your paper wallet

Figure 3.3 shows Coinbase's prompt to transfer your funds. It provides options to enter your amount in BTC or USD. Simply specify your amount and click Send. Note that Coinbase only keeps a copy of your public address. You can continue to send additional amounts to your paper wallet using the same public address.

The first time you work with paper wallets, it's advisable to send only small amounts of bitcoin, to learn and experiment with the process. Once you feel comfortable with creating and redeeming paper wallets, you can feel secure transferring larger amounts.

To verify that the funds have been moved to your paper wallet, we can use a blockchain explorer to check that the funds have been confirmed by the network. Blockchain explorers make all the transaction data from the Bitcoin network available for public review. We'll use a service called Blockchain.info to verify our paper wallet. Simply open www.blockchain.info in your browser and enter the public address from your paper wallet in the search box. If found, Blockchain.info will display a list of the transaction activity on that address:

Figure 3.4 - Blockchain.info showing transaction activity

Shown in figure 3.4 is the transaction activity for the address starting with 16p9Lt. You can quickly see the total bitcoin received and the current balance. Under the Transactions section, you can find the details of the transactions recorded by the network. Also listed are the public addresses that were combined by the wallet software, as well as the change address used to complete the transfer. Note that at least six confirmations are required before the transaction is considered confirmed.

Importing versus sweeping

When importing your private key, the wallet software will simply add the key to its list of private keys. Your Bitcoin wallet will manage your list of private keys. When sending money, it will combine the balances from multiple addresses to make the transfer. Any remaining amount will be sent back to a change address. The wallet software will automatically manage your change addresses. Some Bitcoin wallets offer the ability to sweep your private key. This involves a second step.
After importing your private key, the wallet software will make a transaction to move the full balance of your funds to a new address. This process will empty your paper wallet completely. The step to transfer the funds may require additional time to allow the network to confirm your transaction; this could take up to one hour. In addition to the confirmation time, a small miner's fee may be applied, for example in the amount of 0.0001 BTC.

If you are certain that you are the only one with access to the private key, it is safe to use the import feature. However, if you believe someone else may have access to the private key, sweeping is highly recommended. Listed next are some common Bitcoin wallets that support importing a private key:

- Coinbase (https://www.coinbase.com/): provides direct integration between your online wallet and your paper wallet. Sweeping: No.
- Electrum (https://electrum.org): provides the ability to import and see your private key for easy access to your wallet's funds. Sweeping: Yes.
- Armory (https://bitcoinarmory.com/): provides the ability to import your private key or "sweep" the entire balance. Sweeping: Yes.
- Multibit (https://multibit.org/): directly imports your private key. It may use a built-in address generator for change addresses. Sweeping: No.

Table 1 - Wallets that support importing private keys

Importing your paper wallet

To import your wallet, simply log into your Coinbase account. Click on Tools from the left-hand side menu, followed by Paper Wallet from the top menu. Then, click on the Import a paper wallet button. You will be prompted to enter the private key of your paper wallet, as shown in figure 3.5:

Figure 3.5 - Coinbase importing from a paper wallet

Simply enter the private key from your paper wallet. Coinbase will validate the key and ask you to confirm your import. If accepted, Coinbase will import your key and sweep your balance. The full amount will be transferred to your Bitcoin wallet and become available after six confirmations.

Paper wallet guidelines

Paper wallets display your public and private keys in plain text, so make sure that you keep these documents secure. While you can send funds to your wallet multiple times, it is highly recommended that you spend your balance only once and in full. Before sending large amounts of bitcoin to a paper wallet, make sure you test your ability to generate and import a paper wallet with small amounts of bitcoin. When you're comfortable with the process, you can rely on paper wallets for larger amounts.

As paper is easily destroyed or ruined, make sure that you keep multiple copies of your paper wallet in different locations, and make sure each location is secure from unwanted access.

Be careful with online wallet generators: a malicious site operator can obtain the private key from your web browser, so only use trusted paper wallet generators. You can test an online paper wallet generator by opening the page in your browser while online and then disconnecting your computer from the network. You should be able to generate your paper wallet while completely disconnected from the network, ensuring that your private keys are never sent back to the network. Coinbase is an exception in that it sends only the public address back to the server for reference. This public address is saved to make it easy to transfer funds to your paper wallet. The private key is never saved by Coinbase when generating a paper wallet.
Paper wallet services

In addition to the services mentioned, there are other services that make paper wallets easy to generate and print. Listed next in Table 2 are just a few:

- BitAddress (bitaddress.org): offers the ability to generate single wallets, bulk wallets, brainwallets, and more.
- Bitcoin Paper Wallet (bitcoinpaperwallet.com): offers a nice, stylish design and easy-to-use features. Users can purchase holographic stickers for securing the paper wallets.
- Wallet Generator (walletgenerator.net): offers printable paper wallets that fold nicely to conceal the private keys.

Table 2 - Services for generating paper wallets and brainwallets

Brainwallets

Storing our private keys offline by using a paper wallet is one way we can protect our coins from attacks on the network. Yet, having a physical copy of our keys is similar to holding a gold bar: it's still vulnerable to theft if the attacker can physically obtain the wallet. One way to protect bitcoins from online or offline theft is to have the codes recallable by memory. As holding long random private keys in memory is quite difficult, even for the best of minds, we'll have to use another method to generate our private keys.

Creating a brainwallet

A brainwallet is a way to create one or more private keys from a long phrase of random words. From the phrase, called a passphrase, we're able to generate a private key, along with its public address, to store bitcoin. We can create any passphrase we'd like. The longer the phrase and the more random the characters, the more secure it will be. Brainwallet phrases should contain at least 12 words, and it is very important that the phrase never comes from anything published, such as a book or a song. Hackers actively search for possible brainwallets by performing brute-force attacks on commonly published phrases. Here is an example of a brainwallet passphrase:

gently continue prepare history bowl shy dog accident forgive strain dirt consume

Note that the phrase is composed of 12 seemingly random words. One could use an easy-to-remember sentence rather than 12 words. Regardless of whether you record your passphrase on paper or memorize it, the idea is to use a passphrase that's easy to recall and type, yet difficult to crack. Don't let this happen to you:

"Just lost 4 BTC out of a hacked brain wallet. The pass phrase was a line from an obscure poem in Afrikaans. Somebody out there has a really comprehensive dictionary attack program running." - Reddit thread (http://redd.it/1ptuf3)

Unfortunately, this user lost their bitcoin because they chose a published line from a poem. Make sure that you choose a passphrase that is composed of multiple components of non-published text. Sadly, although warned, some users still resort to simple phrases that are easy to crack. Simple passwords such as 123456, password1, and iloveyou are still commonly used for e-mail and login accounts, and are routinely cracked. Do not use simple passwords for your brainwallet passphrase. Make sure that you use at least 12 words with additional characters and numbers.
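To see why 12 random words is a reasonable floor, here is a small back-of-the-envelope calculation (an illustration added here, not from the book). Assuming each word is drawn uniformly from a 2,048-word list, every word contributes log2(2048) = 11 bits of entropy, so 12 words give roughly 132 bits, which is far beyond practical brute force; a line copied from a published poem, by contrast, is effectively a single dictionary entry.

import math

WORDLIST_SIZE = 2048                       # assumed size of the word list
bits_per_word = math.log2(WORDLIST_SIZE)   # 11 bits of entropy per random word

for words in (4, 8, 12):
    print("%2d random words ~ %d bits of entropy" % (words, words * bits_per_word))

# A phrase taken from a book or song has far less entropy, because an
# attacker can enumerate published text instead of guessing random words.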
Using the preceding passphrase, we can generate our private key and public address using one of the many tools available online. We'll use an online service called BitAddress to generate the actual brainwallet from the passphrase. Simply open www.bitaddress.org in your browser. At first, BitAddress will ask you to move your mouse cursor around to collect enough random points to seed its random number generator. This process could take a minute or two. Once opened, select the option Brain Wallet from the top menu. In the form presented, enter the passphrase and then enter it again to confirm. Click on View to see your private key and public address. For the example shown in figure 3.6, we'll use the preceding passphrase:

Figure 3.6 - BitAddress's brainwallet feature

From the page, you can easily copy and paste the public address and use it for receiving bitcoin. Later, when you're ready to spend the coins, enter the same exact passphrase to generate the same private key and public address. Referring to our Coinbase example from earlier in the article, we can then import the private key into our wallet.

Increasing brainwallet security

As an early attempt to give people a way to "memorize" their Bitcoin wallet, brainwallets have become a target for hackers. Some users have chosen phrases or sentences from common books as their brainwallet. Unfortunately, hackers with access to large amounts of computing power were able to search for these phrases and crack some brainwallets. To improve the security of brainwallets, other methods have been developed that make them more secure. One service, called brainwallet.io, executes a time-intensive cryptographic function over the brainwallet phrase to create a seed that is very difficult to crack. It's important to know that the passphrases used with BitAddress are not compatible with brainwallet.io. To use brainwallet.io to generate a more secure brainwallet, open http://brainwallet.io:

Figure 3.7 - brainwallet.io, a more secure brainwallet generator

Brainwallet.io needs a sufficient amount of entropy to generate a private key that is difficult to reproduce. Entropy, in computer science, describes data in terms of its predictability: when data has high entropy, it is difficult to reproduce from known sources. When generating private keys, it's very important to use data that has high entropy. For generating brainwallet keys, we need data with high entropy, yet it should be easy for us to duplicate. To meet this requirement, brainwallet.io accepts your random passphrase, or can generate one from a list of random words. Additionally, it can use data from a file of your choice. Either way, the more entropy given, the stronger your passphrase will be. If you specify a passphrase, choosing at least 12 words is recommended.

Next, brainwallet.io prompts you for a salt, available in several forms: login info, personal info, or generic. Salts are used to add additional entropy to the generation of your private key; their purpose is to prevent standard dictionary attacks against your passphrase. While using brainwallet.io, this information is never sent to the server.

When ready, click the generate button, and the page will begin computing a scrypt function over your passphrase. Scrypt is a cryptographic function that requires significant computing time to execute; due to the time required for each pass, it makes brute-force attacks very difficult. brainwallet.io makes many thousands of passes to ensure that a strong seed is generated for the private key. Once finished, your new private key and public address, along with their QR codes, will be displayed for easy printing.

As an alternative, WarpWallet is also available at https://keybase.io/warp. WarpWallet also computes a private key based on many thousands of scrypt passes over a passphrase and salt combination. Remember that brainwallet.io passphrases are not compatible with WarpWallet passphrases.
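The key-stretching idea behind these services can be sketched in a few lines of Python using the standard library's hashlib.scrypt. This is only an illustration of the concept: the salt and cost parameters below are assumptions for the sketch, and the result does not reproduce the exact seeds generated by brainwallet.io or WarpWallet.

import hashlib

passphrase = "gently continue prepare history bowl shy dog accident forgive strain dirt consume"
salt = "alice@example.com"   # hypothetical salt, for example an e-mail address

# scrypt is deliberately slow and memory-hard, which makes large-scale
# dictionary attacks far more expensive than a single fast hash.
seed = hashlib.scrypt(
    passphrase.encode("utf-8"),
    salt=salt.encode("utf-8"),
    n=2**14, r=8, p=1,        # assumed cost parameters for this sketch
    dklen=32,                 # 32 bytes = a 256-bit seed for a private key
)

print("candidate private key seed:", seed.hex())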
Deterministic wallets

We have introduced brainwallets that yield one private key and public address. They are designed for one-time use and are practical for holding a fixed amount of bitcoin for a period of time. Yet, if we're making lots of transactions, it would be convenient to be able to generate unlimited public addresses, so that we can use them to receive bitcoin from different transactions or to generate change addresses. A Type 1 deterministic wallet is a simple wallet schema based on a passphrase with an index appended. By incrementing the index, an unlimited number of addresses can be created. Each new address is indexed so that its private key can be quickly retrieved.

Creating a deterministic wallet

To create a deterministic wallet, simply choose a strong passphrase, as previously described, and then append a number to represent an individual private key and public address. It's practical to do this with a spreadsheet so that you can keep a list of public addresses on file. Then, when you want to spend the bitcoin, you simply regenerate the private key using the index. Let's walk through an example. First, we choose the passphrase:

"dress retreat save scratch decide simple army piece scent ocean hand become"

Then, we append an index, a sequential number, to the passphrase:

"dress retreat save scratch decide simple army piece scent ocean hand become0"
"dress retreat save scratch decide simple army piece scent ocean hand become1"
"dress retreat save scratch decide simple army piece scent ocean hand become2"
"dress retreat save scratch decide simple army piece scent ocean hand become3"
"dress retreat save scratch decide simple army piece scent ocean hand become4"

Then, we take each passphrase, with its corresponding index, and run it through brainwallet.io, or any other brainwallet service, to generate the public address. Using a table or a spreadsheet, we can pre-generate a list of public addresses to receive bitcoin. Additionally, we can add a balance column to help track our money:

Index  Public Address                      Balance
0      1Bc2WZ2tiodYwYZCXRRrvzivKmrGKg2Ub9  0.00
1      1PXRtWnNYTXKQqgcxPDpXEvEDpkPKvKB82  0.00
2      1KdRGNADn7ipGdKb8VNcsk4exrHZZ7FuF2  0.00
3      1DNfd491t3ABLzFkYNRv8BWh8suJC9k6n2  0.00
4      17pZHju3KL4vVd2KRDDcoRdCs2RjyahXwt  0.00

Table 3 - Using a spreadsheet to track deterministic wallet addresses

Spending from a deterministic wallet

When we have money available in our wallet to spend, we can simply regenerate the private key for the index matching the public address. For example, let's say we have received 2 BTC on the address starting with 1KdRGN in the preceding table. Since we know it belongs to index #2, we can reopen the brainwallet from the passphrase:

"dress retreat save scratch decide simple army piece scent ocean hand become2"

Using brainwallet.io as our brainwallet service, we quickly regenerate the original private key and public address:

Figure 3.8 - Private key re-generated from a deterministic wallet

Finally, we import the private key into our Bitcoin wallet, as described earlier in the article. If we don't want to keep the change in our online wallet, we can simply send the change back to the next available public address in our deterministic wallet.
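The indexing scheme is easy to mimic in Python. The sketch below is an illustration, not the book's code or brainwallet.io's exact algorithm: it appends an index to the base passphrase and stretches each indexed phrase into a 32-byte seed with scrypt (the salt and cost parameters are assumptions). A wallet tool would then turn each seed into a private key and public address, as in Table 3.

import hashlib

BASE_PASSPHRASE = ("dress retreat save scratch decide simple army piece "
                   "scent ocean hand become")

def indexed_seed(index):
    # Derive a per-index seed from the base passphrase plus an appended index.
    phrase = BASE_PASSPHRASE + str(index)          # e.g. "...become2" for index 2
    return hashlib.scrypt(phrase.encode("utf-8"),
                          salt=b"example-salt",    # assumed salt for this sketch
                          n=2**14, r=8, p=1, dklen=32)

# Pre-generate seeds for the first five indices, mirroring Table 3.
for i in range(5):
    print(i, indexed_seed(i).hex()[:16] + "...")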
Pre-generating public addresses with deterministic wallets can be useful in many situations. Perhaps you want to do business with a partner and receive 12 payments over the course of one year; you can simply generate the 12 addresses in advance and keep track of each payment using a spreadsheet. Another example could apply to an e-commerce site: if you'd like to receive payment for the goods or services being sold, you can pre-generate a long list of addresses. Storing only the public addresses on your website protects you from a malicious attack on your web server. While Type 1 deterministic wallets are very useful, we'll introduce a more advanced version, the Type 2 Hierarchical Deterministic wallet, next.

Type 2 Hierarchical Deterministic wallets

Type 2 Hierarchical Deterministic (HD) wallets function similarly to Type 1 deterministic wallets, in that they are able to generate an unlimited number of private keys from a single passphrase, but they offer more advanced features. HD wallets are used by desktop, mobile, and hardware wallets as a way of securing an unlimited number of keys with a single passphrase.

HD wallets are secured by a root seed. The root seed, generated from entropy, can be a number up to 64 bytes long. To make the root seed easier to save and recover, a phrase consisting of a list of mnemonic code words is rendered. The following is an example of a root seed:

01bd4085622ab35e0cd934adbdcce6ca

To render the mnemonic code words, the root seed plus its checksum is divided into groups of 11 bits. Each group of bits represents an index between 0 and 2047, and each index is mapped to a list of 2,048 words. For each group of bits, one word is listed, which in this example generates the following phrase:

essence forehead possess embarrass giggle spirit further understand fade appreciate angel suffocate

BIP-0039 details the specification for creating mnemonic code words to generate a deterministic key, and is available at https://en.bitcoin.it/wiki/BIP_0039.
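As an added illustration of that encoding step (not from the book), the sketch below follows the recipe just described: it appends a checksum to the 16-byte root seed and splits the result into 11-bit groups. It prints the word indices rather than the words themselves, since the official 2,048-word BIP-0039 list is not reproduced here, and it is not guaranteed to reproduce the example phrase above.

import hashlib

# 128 bits of entropy: the example root seed from the text
entropy = bytes.fromhex("01bd4085622ab35e0cd934adbdcce6ca")

# Checksum: the first len(entropy)*8/32 bits of SHA-256(entropy), i.e. 4 bits here
checksum_bits = len(entropy) * 8 // 32
first_hash_byte = hashlib.sha256(entropy).digest()[0]

# Concatenate the entropy bits and the checksum bits into one bit string
bits = bin(int.from_bytes(entropy, "big"))[2:].zfill(len(entropy) * 8)
bits += bin(first_hash_byte)[2:].zfill(8)[:checksum_bits]

# Split into 11-bit groups; each group is an index between 0 and 2047
indices = [int(bits[i:i + 11], 2) for i in range(0, len(bits), 11)]
print(indices)   # a real implementation maps each index to the 2,048-word list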
In an HD wallet, the root seed is used to generate a master private key and a master chain code. The master private key is used to generate a master public key, as with normal Bitcoin private and public keys. These keys are then used to generate additional child keys in a tree-like structure. Figure 3.9 illustrates the process of creating the master keys and chain code from a root seed:

Figure 3.9 - Generating an HD wallet's root seed, code words, and master keys

Using a child key derivation function, child keys can be generated from the master or parent keys. An index is then combined with the keys and the chain code to generate and organize parent/child relationships. From each parent, two billion child keys can be created, and from each child's private key, the public key and public address can be created. In addition to generating a private key and a public address, each child can be used as a parent to generate its own list of child keys. This allows the derived keys to be organized in a tree-like structure; hierarchically, an unlimited number of keys can be created in this way.

Figure 3.10 - The relationship between master seed, parent/child chains, and public addresses

HD wallets are very practical, as thousands of keys and public addresses can be managed from one seed, and the entire tree of keys can be backed up and restored simply from the passphrase. HD wallets can also be organized and shared in various useful ways. For example, in a company or organization, a parent key and chain code could be issued to generate a list of keys for each department; each department would then have the ability to render its own set of private/public keys. Alternatively, a public parent key can be given out to generate child public keys, but not the private keys. This can be useful in the case of an audit: the organization may want the auditor to perform a balance sheet on a set of public keys, but without access to the private keys for spending. Another use case for generating public keys from a parent public key is e-commerce. As mentioned previously, you may have a website and would like to generate an unlimited number of public addresses; by generating a public parent key for the website, the shopping cart can create new public addresses in real time.

HD wallets are very useful for Bitcoin wallet applications. Next, we'll look at a software package called Electrum for setting up an HD wallet to protect your bitcoins.

Installing an HD wallet

HD wallets are very convenient and practical. To show how we can manage an unlimited number of addresses with a single passphrase, we'll install an HD wallet software package called Electrum. Electrum is an easy-to-use desktop wallet that runs on Windows, OS X, and Linux. It implements a secure HD wallet that is protected by a 12-word passphrase. It is able to synchronize with the blockchain, using servers that index all the Bitcoin transactions, to provide quick updates to your balances.

Electrum has some nice features to help protect your bitcoins. It supports multi-signature transactions, that is, transactions that require more than one key to spend coins. Multi-signature transactions are useful when you want to share the responsibility for a Bitcoin address between two or more parties, or to add an extra layer of protection to your bitcoins. Additionally, Electrum has the ability to create a watching-only version of your wallet. This allows you to give access to your public keys to another party without releasing the private keys, which can be very useful for auditing or accounting purposes.

To install Electrum, simply open the URL https://electrum.org/#download and follow the instructions for your operating system. On first installation, Electrum will create a new wallet for you, identified by a passphrase. Make sure that you protect this passphrase offline!

Figure 3.11 - Recording the passphrase from an Electrum wallet

Electrum will proceed by asking you to re-enter the passphrase to confirm that you have it recorded. Finally, it will ask you for a password. This password is used to encrypt your wallet's seed and any private keys imported into your wallet on disk. You will need this password any time you send bitcoins from your account.

Bitcoins in cold storage

If you are responsible for a large amount of bitcoin that could be exposed to online hacking or hardware failure, it is important to minimize your risk. A common schema for minimizing the risk is to split your funds between a hot wallet and cold storage. A hot wallet refers to your online wallet used for everyday deposits and withdrawals. Based on your customers' needs, you can store the minimum needed to cover the daily business; for example, Coinbase claims to hold approximately five percent of the total bitcoins on deposit in their hot wallet. The remaining amount is stored in cold storage.

Cold storage is an offline wallet for bitcoin. Addresses are generated, typically from a deterministic wallet, with their passphrase and private keys stored offline. Periodically, depending on day-to-day needs, bitcoins are transferred to and from cold storage. Additionally, bitcoins may be moved to deep cold storage.
These bitcoins are generally more difficult to retrieve. While a cold storage transfer may easily be done to cover the needs of the hot wallet, a deep cold storage schema may involve physically accessing the passphrase or private keys from a safe, a safety deposit box, or a bank vault. The reasoning is to slow down access as much as possible.

Cold storage with Electrum

We can use Electrum to create a hot wallet and a cold storage wallet. As an example, let's imagine a business owner who wants to accept bitcoin from his PC cash register. For security reasons, he may want to allow the cash register to generate new addresses to receive bitcoin, but not to spend them. Spending bitcoins from this wallet will be secured by a protected computer.

To start, create a normal Electrum wallet on the protected computer. Secure the passphrase and assign a strong password to the wallet. Then, from the menu, select Wallet | Master Public Keys. The key will be displayed as shown in figure 3.12. Copy this number and save it to a USB key.

Figure 3.12 - Your Electrum wallet's public master key

Your master public key can be used to generate new public keys, but without access to the private keys. As mentioned in the previous examples, this has many practical uses, as in our example with the cash register. Next, install Electrum on your cash register. On setup, or from File | New/Restore, choose Restore a wallet or import keys and the Standard wallet type:

Figure 3.13 - Setting up a cash register wallet with Electrum

On the next screen, Electrum will prompt you to enter your public master key. Once accepted, Electrum will generate your wallet from the public master key. When ready, your new wallet will be able to accept bitcoin without access to the private keys.

WARNING: If you import private keys into your Electrum wallet, they cannot be restored from your passphrase or public master key. They have not been generated by the root seed and exist independently in the wallet. If you import private keys, make sure to back up the wallet file after every import.

Verifying access to a private key

When working with public addresses, it may be important to prove that you have access to a private key. By using Bitcoin's cryptographic ability to sign a message, you can verify that you have access to the key without revealing it. This can be offered as proof from a trustee that they control the keys. Using Electrum's built-in message signing feature, we can use the private key in our wallet to sign a message. The message, combined with the digital signature and public address, can later be used to verify that it was signed with the original private key.

To begin, choose an address from your wallet. In Electrum, your addresses can be found under the Addresses tab. Next, right-click on an address and choose Sign/verify Message. A dialog box allowing you to sign a message will appear:

Figure 3.14 - Electrum's Sign/Verify Message features

As shown in figure 3.14, you can enter any message you like and sign it with the private key of the address shown. This process will produce a digital signature that can be shared with others to prove that you have access to the private key. To verify the signature on another computer, simply open Electrum and choose Sign/Verify Message from the Tools menu. You will be prompted with the same dialog as shown in figure 3.14. Copy and paste the message, the address, and the digital signature, and click Verify. The results will be displayed.
By requesting a signed message from someone, you can verify that they do, in fact, have control of the private key. This is useful for making sure that the trustee of a cold storage wallet has access to the private keys without releasing or sharing them. Another good use of message signing is to prove that someone controls some quantity of bitcoin: by signing a message that includes the public address holding the funds, one can see that the party is the owner of the funds. Finally, signing and verifying a message can be useful for testing your backups. You can test your private key and public address completely offline, without actually sending bitcoin to the address.

Good housekeeping with Bitcoin

To ensure the safekeeping of your bitcoin, it's important to protect your private keys by following a short list of best practices:

- Never store your private keys unencrypted on your hard drive or in the cloud: unencrypted wallets can easily be stolen by hackers, viruses, or malware. Make sure your keys are always encrypted before being saved to disk.
- Never send money to a Bitcoin address without a backup of the private keys: it's really important that you have a backup of your private key before sending money to its public address. There are stories of early adopters who have lost significant amounts of bitcoin because of hardware failures or inadvertent mistakes.
- Always test your backup process by repeating the recovery steps: when setting up a backup plan, make sure to test it by backing up your keys, sending a small amount to the address, and recovering the amount from the backup. Message signing and verification is also a useful way to test your private key backups offline.
- Ensure that you have a secure location for your paper wallets: unauthorized access to your paper wallets can result in the loss of your bitcoin. Make sure that you keep your wallets in a secure safe, in a bank safety deposit box, or in a vault. It's advisable to keep copies of your wallets in multiple locations.
- Keep multiple copies of your paper wallets: paper can easily be damaged by water or direct sunlight. Make sure that you keep multiple copies of your paper wallets in plastic bags, protected from direct light with a cover.
- Consider writing a testament or will for your Bitcoin wallets: the testament should name who has access to the bitcoin and how it will be distributed. Make sure that you include instructions on how to recover the coins.
- Never forget your wallet's password or passphrase: this sounds obvious, but it must be emphasized. There is no way to recover a lost password or passphrase.
- Always use a strong passphrase: a strong passphrase should be long and difficult to guess; it should not come from a famous publication (literature, holy books, and so on); it should not contain personal information; it should be easy to remember and type accurately; and it should not be reused between sites and applications.

Summary

So far, we've covered the basics of how to get started with Bitcoin. We've provided a tutorial for setting up an online wallet and for buying bitcoin in 15 minutes. We've covered online exchanges and marketplaces, and how to safely store and protect your bitcoin.

Further resources on this subject:

- Bitcoins – Pools and Mining [article]
- Going Viral [article]
- Introduction to the Nmap Scripting Engine [article]

Rotation Forest - A Classifier Ensemble Based on Feature Extraction

Packt
28 Oct 2015
16 min read
In this article by Gopi Subramanian, the author of the book Python Data Science Cookbook, you will learn about rotation forest, a classifier ensemble based on feature extraction.

Rotation Forest

Bagging methods based on decision tree algorithms are very popular among the data science community. The claim to fame of most of these methods is that they need almost zero data preparation compared to other methods, can obtain very good results, and can be handed to software engineers as a black box of tools. By design, bagging lends itself nicely to parallelization, so these methods can easily be applied to very large datasets in a cluster environment.

Decision tree algorithms split the input data into various regions at each level of the tree; thus, they perform implicit feature selection. Feature selection is one of the most important tasks in building a good model, so by providing implicit feature selection, trees are at an advantageous position compared to other techniques. Hence, bagging with decision trees comes with this advantage. Almost no data preparation is needed for decision trees. For example, consider the scaling of attributes: attribute scaling has no impact on the structure of the trees. Missing values also do not affect decision trees, and the effect of outliers on a decision tree is very minimal. We don't have to do explicit feature transformations to accommodate feature interactions.

One of the major complaints against tree-based methods is the difficulty of pruning the trees to avoid overfitting. Big trees tend to also fit the noise present in the underlying data and hence lead to low bias and high variance. However, when we grow a lot of trees and the final prediction is an average of the output of all the trees in the ensemble, we avoid these problems.

In this article, we will see a powerful tree-based ensemble method called rotation forest. A typical random forest requires a large number of trees in its ensemble in order to achieve good performance. Rotation forest can achieve similar or better performance with fewer trees. Additionally, the authors of this algorithm claim that the underlying estimator can be something other than a tree, so it is projected as a new framework for building ensembles, similar to gradient boosting.

The algorithm

Random forest and bagging give impressive results with very large ensembles; having a large number of estimators improves the accuracy of these methods. On the contrary, rotation forest is designed to work with a smaller number of estimators. Let's write down the steps involved in building a rotation forest. The number of trees required in the forest is typically specified by the user. Let T be the number of trees to be built. We iterate from one through T, that is, we build T trees. For each tree, perform the following steps:

1. Split the attributes in the training set into K nonoverlapping subsets of equal size, giving K smaller datasets, each with a subset of the attributes.
2. For each of the K subsets, proceed as follows:
   - Bootstrap 75% of the data from the subset and use the bootstrapped sample for the further steps.
   - Run a principal component analysis (PCA) on the ith subset and retain all the principal components. For every feature j in the ith subset, we have a principal component; let's denote it as aij, the principal component for the jth attribute in the ith subset.
   - Store the principal components for the subset.
3. Create a rotation matrix of size n x n, where n is the total number of attributes. Arrange the principal components in the matrix such that the components match the positions of the features in the original training dataset.
4. Project the training dataset onto the rotation matrix using matrix multiplication.
5. Build a decision tree with the projected dataset.
6. Store the tree and the rotation matrix.

A quick note about PCA: PCA is an unsupervised method. In multivariate problems, PCA is used to reduce the dimension of the data with minimal information loss or, in other words, to retain the maximum variation in the data. In PCA, we find the directions in the data with the most variation, that is, the eigenvectors corresponding to the largest eigenvalues of the covariance matrix, and project the data onto these directions. With a dataset (n x m) of n instances and m dimensions, PCA projects it onto a smaller subspace (n x d), where d << m. A point to note is that PCA is computationally very expensive.
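As a quick illustration of that note (an added example, not part of the recipe), the following snippet projects a small random dataset from m = 10 dimensions down to d = 3 with scikit-learn's PCA. Rotation forest differs in that it keeps all the components for each feature subset and uses them to rotate, rather than reduce, the data.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(7)
X = rng.rand(100, 10)          # n = 100 instances, m = 10 dimensions

pca = PCA(n_components=3)      # keep d = 3 directions of maximum variance
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                       # (100, 3)
print(pca.explained_variance_ratio_.sum())   # fraction of variance retained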
Programming rotation forest in Python

Now let's write Python code to implement rotation forest and test it on a classification problem. We will generate a classification dataset to demonstrate rotation forest. To our knowledge, there is no Python implementation available for rotation forest, so we will leverage scikit-learn's implementation of the decision tree classifier and use the train_test_split method for the bootstrapping. Refer to the following link to learn more about scikit-learn: http://scikit-learn.org/stable/

First, we write the necessary code to implement rotation forest and apply it on a classification problem. We start by loading all the necessary libraries. We leverage the make_classification method from the sklearn.datasets module to generate the training data, and follow it with a method to select a random subset of attributes, called get_random_subset:

from sklearn.datasets import make_classification
from sklearn.metrics import classification_report
from sklearn.cross_validation import train_test_split
from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeClassifier
import numpy as np

def get_data():
    """
    Make a sample classification dataset
    Returns : Independent variables x, dependent variable y
    """
    no_features = 50
    redundant_features = int(0.1*no_features)
    informative_features = int(0.6*no_features)
    repeated_features = int(0.1*no_features)
    x,y = make_classification(n_samples=500,n_features=no_features,flip_y=0.03,
            n_informative = informative_features, n_redundant = redundant_features,
            n_repeated = repeated_features,random_state=7)
    return x,y

def get_random_subset(iterable,k):
    subsets = []
    iteration = 0
    np.random.shuffle(iterable)
    subset = 0
    limit = len(iterable)/k
    while iteration < limit:
        if k <= len(iterable):
            subset = k
        else:
            subset = len(iterable)
        subsets.append(iterable[-subset:])
        del iterable[-subset:]
        iteration+=1
    return subsets

We then write the build_rotationtree_model function, where we build fully grown trees, and proceed to evaluate the forest's performance using the model_worth function:

def build_rotationtree_model(x_train,y_train,d,k):
    models = []
    r_matrices = []
    feature_subsets = []
    for i in range(d):
        x,_,_,_ = train_test_split(x_train,y_train,test_size=0.3,random_state=7)
        # Features ids
        feature_index = range(x.shape[1])
        # Get subsets of features
        random_k_subset = get_random_subset(feature_index,k)
        feature_subsets.append(random_k_subset)
        # Rotation matrix
        R_matrix = np.zeros((x.shape[1],x.shape[1]),dtype=float)
        for each_subset in random_k_subset:
            pca = PCA()
            x_subset = x[:,each_subset]
            pca.fit(x_subset)
            for ii in range(0,len(pca.components_)):
                for jj in range(0,len(pca.components_)):
                    R_matrix[each_subset[ii],each_subset[jj]] = pca.components_[ii,jj]
        x_transformed = x_train.dot(R_matrix)
        model = DecisionTreeClassifier()
        model.fit(x_transformed,y_train)
        models.append(model)
        r_matrices.append(R_matrix)
    return models,r_matrices,feature_subsets

def model_worth(models,r_matrices,x,y):
    predicted_ys = []
    for i,model in enumerate(models):
        x_mod = x.dot(r_matrices[i])
        predicted_y = model.predict(x_mod)
        predicted_ys.append(predicted_y)
    predicted_matrix = np.asmatrix(predicted_ys)
    final_prediction = []
    for i in range(len(y)):
        pred_from_all_models = np.ravel(predicted_matrix[:,i])
        non_zero_pred = np.nonzero(pred_from_all_models)[0]
        is_one = len(non_zero_pred) > len(models)/2
        final_prediction.append(is_one)
    print classification_report(y, final_prediction)

Finally, we write a main function used to invoke the functions that we have defined:

if __name__ == "__main__":
    x,y = get_data()
    # plot_data(x,y)
    # Divide the data into Train, dev and test
    x_train,x_test_all,y_train,y_test_all = train_test_split(x,y,test_size = 0.3,random_state=9)
    x_dev,x_test,y_dev,y_test = train_test_split(x_test_all,y_test_all,test_size=0.3,random_state=9)
    # Build a bag of models
    models,r_matrices,features = build_rotationtree_model(x_train,y_train,25,5)
    model_worth(models,r_matrices,x_train,y_train)
    model_worth(models,r_matrices,x_dev,y_dev)

Walking through our code

Let's start with our main function. We invoke get_data to get our predictor attributes, x, and our response attribute, y.
In get_data, we will leverage the make_classification method to generate our training data for our recipe:

def get_data():
    """
    Make a sample classification dataset
    Returns : Independent variable y, dependent variable x
    """
    no_features = 30
    redundant_features = int(0.1*no_features)
    informative_features = int(0.6*no_features)
    repeated_features = int(0.1*no_features)
    x,y = make_classification(n_samples=500, n_features=no_features, flip_y=0.03,
            n_informative=informative_features, n_redundant=redundant_features,
            n_repeated=repeated_features, random_state=7)
    return x,y

Let's look at the parameters passed to the make_classification method. The first parameter is the number of instances required; in this case, we say we need 500 instances. The second parameter is the number of attributes required per instance. We say that we need 30. The third parameter, flip_y, randomly interchanges 3% of the instances. This is done to introduce some noise into our data. The next parameter specifies how many of these 30 features should be informative enough to be used in our classification. We have specified that 60% of our features, that is, 18 out of 30, should be informative. The next parameter is about redundant features. These are generated as a linear combination of the informative features in order to introduce correlation among the features. Finally, the repeated features are duplicate features, which are drawn randomly from both the informative and redundant features. Let's split the data into training and testing sets using train_test_split. We will reserve 30% of our data for testing:

# Divide the data into Train, dev and test
x_train,x_test_all,y_train,y_test_all = train_test_split(x,y,test_size=0.3,random_state=9)

Once again, we will leverage train_test_split to split our test data into dev and test sets:

x_dev,x_test,y_dev,y_test = train_test_split(x_test_all,y_test_all,test_size=0.3,random_state=9)

With the data divided to build, evaluate, and test the model, we will proceed to build our models:

models,r_matrices,features = build_rotationtree_model(x_train,y_train,25,5)

We will invoke the build_rotationtree_model function to build our rotation forest. We will pass our training data (the predictor x_train and the response variable y_train), the total number of trees to be built (25 in this case), and finally the size of each feature subset (5 in this case). Let's jump into this function:

models = []
r_matrices = []
feature_subsets = []

We will begin by declaring three lists to store each decision tree, the rotation matrix for that tree, and the subset of features used in that iteration. We will proceed to build each tree in our ensemble. As a first order of business, we will bootstrap to retain only 70% of the data (test_size is 0.3):

x,_,_,_ = train_test_split(x_train,y_train,test_size=0.3,random_state=7)

We will leverage the train_test_split function from scikit-learn for the bootstrapping. We will then decide the feature subsets:

# Features ids
feature_index = range(x.shape[1])
# Get subsets of features
random_k_subset = get_random_subset(feature_index,k)
feature_subsets.append(random_k_subset)

The get_random_subset function takes the list of feature indices and the subset size, k, as parameters and returns a list of non-overlapping subsets of k features each. In this function, we will shuffle the feature index. 
The feature index is an array of numbers starting from zero and going up to one less than the number of features in our training set:

np.random.shuffle(iterable)

Let's say that we have ten features and our k value is five, indicating that we need non-overlapping subsets of five feature indices each, which takes two iterations. We will store the number of iterations needed in the limit variable:

limit = len(iterable)/k
while iteration < limit:
    if k <= len(iterable):
        subset = k
    else:
        subset = len(iterable)
    iteration+=1

If the required subset size, k, is less than or equal to the number of attributes left in the iterable, we take k entries; otherwise, we take whatever remains. As we shuffled the iterable, a different set of features ends up in each subset:

subsets.append(iterable[-subset:])

On selecting a subset (the last k entries of the shuffled iterable), we will remove it from the iterable, as we need non-overlapping sets:

del iterable[-subset:]

With all the subsets ready, we will declare our rotation matrix:

# Rotation matrix
R_matrix = np.zeros((x.shape[1],x.shape[1]),dtype=float)

As you can see, our rotation matrix is of size n x n, where n is the number of attributes in our dataset. You can see that we have used the shape attribute to declare this matrix filled with zeros:

for each_subset in random_k_subset:
    pca = PCA()
    x_subset = x[:,each_subset]
    pca.fit(x_subset)

For each of the subsets of data, having only k features, we will proceed to do a principal component analysis. We will fill our rotation matrix with the component values:

for ii in range(0,len(pca.components_)):
    for jj in range(0,len(pca.components_)):
        R_matrix[each_subset[ii],each_subset[jj]] = pca.components_[ii,jj]

For example, let's say that we have three attributes in our subset out of a total of six attributes. For illustration, let's say that our subsets are as follows: 2,4,6 and 1,3,5. Our rotation matrix, R, is of size 6 x 6. Assume that we want to fill the rotation matrix for the first subset of features. We will have three principal components, one each for 2, 4, and 6, of size 1 x 3. The output of PCA from scikit-learn is a matrix of size components x features. We will go through each component value in the for loop. At the first run, our feature of interest is two, and the cell (0,0) in the component matrix output from PCA gives the value of the contribution of feature two to component one. We have to find the right place in the rotation matrix for this value. We will use the indices from the component matrix, ii and jj, together with the subset list to get to the right place in the rotation matrix:

R_matrix[each_subset[ii],each_subset[jj]] = pca.components_[ii,jj]

Here, each_subset[0] and each_subset[0] will put us in cell (2,2) in the rotation matrix. As we go through the loop, the next component value in cell (0,1) in the component matrix will be placed in cell (2,4) in the rotation matrix and the last one in cell (2,6) in the rotation matrix. This is done for all the attributes in the first subset. Let's go to the second subset; here the first attribute is one. The cell (0,0) of the component matrix corresponds to the cell (1,1) in the rotation matrix. Proceeding this way, you will notice that the attribute component values are arranged in the same order as the attributes themselves. 
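To make the placement concrete, here is a small standalone sketch (with made-up 0-based subsets and random data, purely for illustration and not part of the recipe) that fills a rotation matrix for two feature subsets and prints which cells end up non-zero:

import numpy as np
from sklearn.decomposition import PCA

# Six attributes split into two illustrative, non-overlapping subsets (0-based indices)
subsets = [[1, 3, 5], [0, 2, 4]]
x = np.random.RandomState(1).randn(50, 6)

R_matrix = np.zeros((6, 6))
for each_subset in subsets:
    pca = PCA()
    pca.fit(x[:, each_subset])
    for ii in range(len(pca.components_)):
        for jj in range(len(pca.components_)):
            R_matrix[each_subset[ii], each_subset[jj]] = pca.components_[ii, jj]

# The non-zero entries sit only in the rows and columns belonging to each subset,
# giving a block structure aligned with the original attribute positions.
print(np.nonzero(R_matrix))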
With our rotation matrix ready, let's project our input onto the rotation matrix:

x_transformed = x_train.dot(R_matrix)

It's time now to fit our decision tree:

model = DecisionTreeClassifier()
model.fit(x_transformed,y_train)

Finally, we will store our models and the corresponding rotation matrices:

models.append(model)
r_matrices.append(R_matrix)

With our model built, let's proceed to see how good our model is on both the train and dev datasets using the model_worth function:

model_worth(models,r_matrices,x_train,y_train)
model_worth(models,r_matrices,x_dev,y_dev)

Let's see our model_worth function:

for i,model in enumerate(models):
    x_mod = x.dot(r_matrices[i])
    predicted_y = model.predict(x_mod)
    predicted_ys.append(predicted_y)

In this function, we perform prediction using each of the trees that we built. However, before doing the prediction, we will project our input using the rotation matrix. We will store all our prediction output in a list called predicted_ys. Let's say that we have 100 instances to predict and ten models in our ensemble; for each instance, we have ten predictions. We will store these as a matrix for convenience:

predicted_matrix = np.asmatrix(predicted_ys)

Now, we will proceed to give a final classification for each of our input records:

final_prediction = []
for i in range(len(y)):
    pred_from_all_models = np.ravel(predicted_matrix[:,i])
    non_zero_pred = np.nonzero(pred_from_all_models)[0]
    is_one = len(non_zero_pred) > len(models)/2
    final_prediction.append(is_one)

We will store our final prediction in a list called final_prediction. We will go through each of the predictions for our instance. Let's say that we are at the first instance (i=0 in our for loop); pred_from_all_models stores the output from all the trees in our model. It's an array of zeros and ones indicating the class each model assigned to this instance. We will make another array out of it, non_zero_pred, which has only those entries from the parent array that are non-zero. Finally, if the length of this non-zero array is greater than half the number of models that we have, we say that our final prediction is one for the instance of interest. What we have accomplished here is a classic majority-voting scheme. Let's look at how good our models are now by calling a classification report:

print classification_report(y, final_prediction)

The classification reports printed for the training and dev datasets (shown as screenshots in the original article) indicate how well the ensemble performs on seen and unseen data. References More information about rotation forest can be obtained from the following paper: Rotation Forest: A New Classifier Ensemble Method, by Juan J. Rodríguez, Ludmila I. Kuncheva, and Carlos J. Alonso. The paper also claims that when rotation forest was compared to bagging, AdaBoost, and random forest on 33 datasets, rotation forest outperformed all the other three algorithms. Similar to gradient boosting, the authors of the paper claim that rotation forest is an overall framework and the underlying base learner does not necessarily have to be a decision tree. Work is in progress on testing other algorithms such as Naïve Bayes, neural networks, and others. Summary In this article, we learned about rotation forest and saw that ensemble methods built on decision trees remain very popular in the data science community. 
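To put the comparison mentioned in the paper into practice, a quick baseline such as the following can be trained on the same kind of synthetic data so that its classification report can be compared with the rotation forest output. This is our own illustrative sketch, not part of the original recipe; note that newer scikit-learn versions expose train_test_split under sklearn.model_selection rather than sklearn.cross_validation, which the article's code uses:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.cross_validation import train_test_split
from sklearn.metrics import classification_report

# Same shape of problem as in the recipe: 30 features, 18 of them informative
x, y = make_classification(n_samples=500, n_features=30, n_informative=18,
                           n_redundant=3, n_repeated=3, flip_y=0.03, random_state=7)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=9)

# A plain random forest baseline with the same number of trees as our rotation forest
baseline = RandomForestClassifier(n_estimators=25, random_state=7)
baseline.fit(x_train, y_train)
print(classification_report(y_test, baseline.predict(x_test)))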

An Introduction to Kibana

Packt
28 Oct 2015
28 min read
In this article by Yuvraj Gupta, author of the book, Kibana Essentials, explains Kibana is a tool that is part of the ELK stack, which consists of Elasticsearch, Logstash, and Kibana. It is built and developed by Elastic. Kibana is a visualization platform that is built on top of Elasticsearch and leverages the functionalities of Elasticsearch. (For more resources related to this topic, see here.) To understand Kibana better, let's check out the following diagram: This diagram shows that Logstash is used to push data directly into Elasticsearch. This data is not limited to log data, but can include any type of data. Elasticsearch stores data that comes as input from Logstash, and Kibana uses the data stored in Elasticsearch to provide visualizations. So, Logstash provides an input stream of data to Elasticsearch, from which Kibana accesses the data and uses it to create visualizations. Kibana acts as an over-the-top layer of Elasticsearch, providing beautiful visualizations for data (structured or nonstructured) stored in it. Kibana is an open source analytics product used to search, view, and analyze data. It provides various types of visualizations to visualize data in the form of tables, charts, maps, histograms, and so on. It also provides a web-based interface that can easily handle a large amount of data. It helps create dashboards that are easy to create and helps query data in real time. Dashboards are nothing but an interface for underlying JSON documents. They are used for saving, templating, and exporting. They are simple to set up and use, which helps us play with data stored in Elasticsearch in minutes without requiring any coding. Kibana is an Apache-licensed product that aims to provide a flexible interface combined with the powerful searching capabilities of Elasticsearch. It requires a web server (included in the Kibana 4 package) and any modern web browser, that is, a browser that supports industry standards and renders the web page in the same way across all browsers, to work. It connects to Elasticsearch using the REST API. It helps to visualize data in real time with the use of dashboards to provide real-time insights. As Kibana uses the functionalities of Elasticsearch, it is easier to learn Kibana by understanding the core functionalities of Elasticsearch. In this article, we are going to take a look at the following topics: The basic concepts of Elasticsearch Installation of Java Installation of Elasticsearch Installation of Kibana Importing a JSON file into Elasticsearch Understanding Elasticsearch Elasticsearch is a search server built on top of Lucene (licensed under Apache), which is completely written in Java. It supports distributed searches in a multitenant environment. It is a scalable search engine allowing high flexibility of adding machines easily. It provides a full-text search engine combined with a RESTful web interface and JSON documents. Elasticsearch harnesses the functionalities of Lucene Java Libraries, adding up by providing proper APIs, scalability, and flexibility on top of the Lucene full-text search library. All querying done using Elasticsearch, that is, searching text, matching text, creating indexes, and so on, is implemented by Apache Lucene. Without a setup of an Elastic shield or any other proxy mechanism, any user with access to Elasticsearch API can view all the data stored in the cluster. 
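As a quick illustration of this point (assuming a default, unsecured installation listening on the standard port), anyone who can reach the node can query it directly; the exact output will depend on your cluster:

# Basic node and cluster information
curl -XGET 'http://localhost:9200/'

# Returns matching documents from every index in the cluster
curl -XGET 'http://localhost:9200/_search?pretty'

This is why production clusters are typically placed behind Shield or a reverse proxy, as mentioned above.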
The basic concepts of Elasticsearch Let's explore some of the basic concepts of Elasticsearch: Field: This is the smallest single unit of data stored in Elasticsearch. It is similar to a column in a traditional relational database. Every document contains key-value pairs, which are referred to as fields. Values in a field can contain a single value, such as integer [27], string ["Kibana"], or multiple values, such as array [1, 2, 3, 4, 5]. The field type is responsible for specifying which type of data can be stored in a particular field, for example, integer, string, date, and so on. Document: This is the simplest unit of information stored in Elasticsearch. It is a collection of fields. It is considered similar to a row of a table in a traditional relational database. A document can contain any type of entry, such as a document for a single restaurant, another document for a single cuisine, and yet another for a single order. Documents are in JavaScript Object Notation (JSON), which is a language-independent data interchange format. JSON contains key-value pairs. Every document that is stored in Elasticsearch is indexed. Every document contains a type and an ID. An example of a document that has JSON values is as follows: { "name": "Yuvraj", "age": 22, "birthdate": "2015-07-27", "bank_balance": 10500.50, "interests": ["playing games","movies","travelling"], "movie": {"name":"Titanic","genre":"Romance","year" : 1997} } In the preceding example, we can see that the document supports JSON, having key-value pairs, which are explained as follows: The name field is of the string type The age field is of the numeric type The birthdate field is of the date type The bank_balance field is of the float type The interests field contains an array The movie field contains an object (dictionary) Type: This is similar to a table in a traditional relational database. It contains a list of fields, which is defined for every document. A type is a logical segregation of indexes, whose interpretation/semantics entirely depends on you. For example, you have data about the world and you put all your data into an index. In this index, you can define a type for continent-wise data, another type for country-wise data, and a third type for region-wise data. Types are used with a mapping API; it specifies the type of its field. An example of type mapping is as follows: { "user": { "properties": { "name": { "type": "string" }, "age": { "type": "integer" }, "birthdate": { "type": "date" }, "bank_balance": { "type": "float" }, "interests": { "type": "string" }, "movie": { "properties": { "name": { "type": "string" }, "genre": { "type": "string" }, "year": { "type": "integer" } } } } } } Now, let's take a look at the core data types specified in Elasticsearch, as follows: Type Definition string This contains text, for example, "Kibana" integer This contains a 32-bit integer, for example, 7 long This contains a 64-bit integer float IEEE float, for example, 2.7 double This is a double-precision float boolean This can be true or false date This is the UTC date/time, for example, "2015-06-30T13:10:10" geo_point This is the latitude or longitude Index: This is a collection of documents (one or more than one). It is similar to a database in the analogy with traditional relational databases. For example, you can have an index for user information, transaction information, and product type. An index has a mapping; this mapping is used to define multiple types. In other words, an index can contain single or multiple types. 
An index is defined by a name, which is always used whenever referring to an index to perform search, update, and delete operations for documents. You can define any number of indexes you require. Indexes also act as logical namespaces that map documents to primary shards, which contain zero or more replica shards for replicating data. With respect to traditional databases, the basic analogy is similar to the following: MySQL => Databases => Tables => Columns/Rows Elasticsearch => Indexes => Types => Documents with Fields You can store a single document or multiple documents within a type or index. As a document is within an index, it must also be assigned to a type within an index. Moreover, the maximum number of documents that you can store in a single index is 2,147,483,519 (2 billion 147 million), which is equivalent to Integer.Max_Value. ID: This is an identifier for a document. It is used to identify each document. If it is not defined, it is autogenerated for every document.The combination of index, type, and ID must be unique for each document. Mapping: Mappings are similar to schemas in a traditional relational database. Every document in an index has a type. A mapping defines the fields, the data type for each field, and how the field should be handled by Elasticsearch. By default, a mapping is automatically generated whenever a document is indexed. If the default settings are overridden, then the mapping's definition has to be provided explicitly. Node: This is a running instance of Elasticsearch. Each node is part of a cluster. On a standalone machine, each Elasticsearch server instance corresponds to a node. Multiple nodes can be started on a single standalone machine or a single cluster. The node is responsible for storing data and helps in the indexing/searching capabilities of a cluster. By default, whenever a node is started, it is identified and assigned a random Marvel Comics character name. You can change the configuration file to name nodes as per your requirement. A node also needs to be configured in order to join a cluster, which is identifiable by the cluster name. By default, all nodes join the Elasticsearch cluster; that is, if any number of nodes are started up on a network/machine, they will automatically join the Elasticsearch cluster. Cluster: This is a collection of nodes and has one or multiple nodes; they share a single cluster name. Each cluster automatically chooses a master node, which is replaced if it fails; that is, if the master node fails, another random node will be chosen as the new master node, thus providing high availability. The cluster is responsible for holding all of the data stored and provides a unified view for search capabilities across all nodes. By default, the cluster name is Elasticsearch, and it is the identifiable parameter for all nodes in a cluster. All nodes, by default, join the Elasticsearch cluster. While using a cluster in the production phase, it is advisable to change the cluster name for ease of identification, but the default name can be used for any other purpose, such as development or testing.The Elasticsearch cluster contains single or multiple indexes, which contain single or multiple types. All types contain single or multiple documents, and every document contains single or multiple fields. Sharding: This is an important concept of Elasticsearch while understanding how Elasticsearch allows scaling of nodes, when having a large amount of data termed as big data. 
An index can store any amount of data, but if it exceeds its disk limit, then searching would become slow and be affected. For example, the disk limit is 1 TB, and an index contains a large number of documents, which may not fit completely within 1 TB in a single node. To counter such problems, Elasticsearch provides shards. These break the index into multiple pieces. Each shard acts as an independent index that is hosted on a node within a cluster. Elasticsearch is responsible for distributing shards among nodes. There are two purposes of sharding: allowing horizontal scaling of the content volume, and improving performance by providing parallel operations across various shards that are distributed on nodes (single or multiple, depending on the number of nodes running).Elasticsearch helps move shards among multiple nodes in the event of an addition of new nodes or a node failure. There are two types of shards, as follows: Primary shard: Every document is stored within a primary index. By default, every index has five primary shards. This parameter is configurable and can be changed to define more or fewer shards as per the requirement. A primary shard has to be defined before the creation of an index. If no parameters are defined, then five primary shards will automatically be created.Whenever a document is indexed, it is usually done on a primary shard initially, followed by replicas. The number of primary shards defined in an index cannot be altered once the index is created. Replica shard: Replica shards are an important feature of Elasticsearch. They help provide high availability across nodes in the cluster. By default, every primary shard has one replica shard. However, every primary shard can have zero or more replica shards as required. In an environment where failure directly affects the enterprise, it is highly recommended to use a system that provides a failover mechanism to achieve high availability. To counter this problem, Elasticsearch provides a mechanism in which it creates single or multiple copies of indexes, and these are termed as replica shards or replicas. A replica shard is a full copy of the primary shard. Replica shards can be dynamically altered. Now, let's see the purposes of creating a replica. It provides high availability in the event of failure of a node or a primary shard. If there is a failure of a primary shard, replica shards are automatically promoted to primary shards. Increase performance by providing parallel operations on replica shards to handle search requests.A replica shard is never kept on the same node as that of the primary shard from which it was copied. Inverted index: This is also a very important concept in Elasticsearch. It is used to provide fast full-text search. Instead of searching text, it searches for an index. It creates an index that lists unique words occurring in a document, along with the document list in which each word occurs. For example, suppose we have three documents. 
They have a text field, and it contains the following: I am learning Kibana Kibana is an amazing product Kibana is easy to use To create an inverted index, the text field is broken into words (also known as terms), a list of unique words is created, and also a listing is done of the document in which the term occurs, as shown in this table: Term Doc 1 Doc 2 Doc 3 I X     Am X     Learning X     Kibana X X X Is   X X An   X   Amazing   X   Product   X   Easy     X To     X Use     X Now, if we search for is Kibana, Elasticsearch will use an inverted index to display the results: Term Doc 1 Doc 2 Doc 3 Is   X X Kibana X X X With inverted indexes, Elasticsearch uses the functionality of Lucene to provide fast full-text search results. An inverted index uses an index based on keywords (terms) instead of a document-based index. REST API: This stands for Representational State Transfer. It is a stateless client-server protocol that uses HTTP requests to store, view, and delete data. It supports CRUD operations (short for Create, Read, Update, and Delete) using HTTP. It is used to communicate with Elasticsearch and is implemented by all languages. It communicates with Elasticsearch over port 9200 (by default), which is accessible from any web browser. Also, Elasticsearch can be directly communicated with via the command line using the curl command. cURL is a command-line tool used to send, view, or delete data using URL syntax, as followed by the HTTP structure. A cURL request is similar to an HTTP request, which is as follows: curl -X <VERB> '<PROTOCOL>://<HOSTNAME>:<PORT>/<PATH>?<QUERY_STRING>' -d '<BODY>' The terms marked within the <> tags are variables, which are defined as follows: VERB: This is used to provide an appropriate HTTP method, such as GET (to get data), POST, PUT (to store data), or DELETE (to delete data). PROTOCOL: This is used to define whether the HTTP or HTTPS protocol is used to send requests. HOSTNAME: This is used to define the hostname of a node present in the Elasticsearch cluster. By default, the hostname of Elasticsearch is localhost. PORT: This is used to define the port on which Elasticsearch is running. By default, Elasticsearch runs on port 9200. PATH: This is used to define the index, type, and ID where the documents will be stored, searched, or deleted. It is specified as index/type/ID. QUERY_STRING: This is used to define any additional query parameter for searching data. BODY: This is used to define a JSON-encoded request within the body. In order to put data into Elasticsearch, the following curl command is used: curl -XPUT 'http://localhost:9200/testing/test/1' -d '{"name": "Kibana" }' Here, testing is the name of the index, test is the name of the type within the index, and 1 indicates the ID number. To search for the preceding stored data, the following curl command is used: curl -XGET 'http://localhost:9200/testing/_search? The preceding commands are provided just to give you an overview of the format of the curl command. Prerequisites for installing Kibana 4.1.1 The following pieces of software need to be installed before installing Kibana 4.1.1: Java 1.8u20+ Elasticsearch v1.4.4+ A modern web browser—IE 10+, Firefox, Chrome, Safari, and so on The installation process will be covered separately for Windows and Ubuntu so that both types of users are able to understand the process of installation easily. Installation of Java In this section, JDK needs to be installed so as to access Elasticsearch. 
Oracle Java 8 (update 20 onwards) will be installed as it is the recommended version for Elasticsearch from version 1.4.4 onwards. Installation of Java on Ubuntu 14.04 Install Java 8 using the terminal and the apt package in the following manner: Add the Oracle Java Personal Package Archive (PPA) to the apt repository list: sudo add-apt-repository -y ppa:webupd8team/java In this case, we use a third-party repository; however, the WebUpd8 team is trusted to install Java. It does not include any Java binaries. Instead, the PPA directly downloads from Oracle and installs it. As shown in the preceding screenshot, you will initially be prompted for the password for running the sudo command (only when you have not logged in as root), and on successful addition to the repository, you will receive an OK message, which means that the repository has been imported. Update the apt package database to include all the latest files under the packages: sudo apt-get update Install the latest version of Oracle Java 8: sudo apt-get -y install oracle-java8-installer Also, during the installation, you will be prompted to accept the license agreement, which pops up as follows: To check whether Java has been successfully installed, type the following command in the terminal:java –version This signifies that Java has been installed successfully. Installation of Java on Windows We can install Java on windows by going through the following steps: Download the latest version of the Java JDK from the Sun Microsystems site at http://www.oracle.com/technetwork/java/javase/downloads/index.html:                                                                                     As shown in the preceding screenshot, click on the DOWNLOAD button of JDK to download. You will be redirected to the download page. There, you have to first click on the Accept License Agreement radio button, followed by the Windows version to download the .exe file, as shown here: Double-click on the file to be installed and it will open as an installer. Click on Next, accept the license by reading it, and keep clicking on Next until it shows that JDK has been installed successfully. Now, to run Java on Windows, you need to set the path of JAVA in the environment variable settings of Windows. Firstly, open the properties of My Computer. Select the Advanced system settings and then click on the Advanced tab, wherein you have to click on the environment variables option, as shown in this screenshot: After opening Environment Variables, click on New (under the System variables) and give the variable name as JAVA_HOME and variable value as C:Program FilesJavajdk1.8.0_45 (do check in your system where jdk has been installed and provide the path corresponding to the version installed as mentioned in system directory), as shown in the following screenshot: Then, double-click on the Path variable (under the System variables) and move towards the end of textbox. Insert a semicolon if it is not already inserted, and add the location of the bin folder of JDK, like this: %JAVA_HOME%bin. Next, click on OK in all the windows opened. Do not delete anything within the path variable textbox. To check whether Java is installed or not, type the following command in Command Prompt: java –version This signifies that Java has been installed successfully. Installation of Elasticsearch In this section, Elasticsearch, which is required to access Kibana, will be installed. 
Elasticsearch v1.5.2 will be installed, and this section covers the installation on Ubuntu and Windows separately. Installation of Elasticsearch on Ubuntu 14.04 To install Elasticsearch on Ubuntu, perform the following steps: Download Elasticsearch v 1.5.2 as a .tar file using the following command on the terminal:  curl -L -O https://download.elastic.co/elasticsearch/elasticsearch/elasticsearch-1.5.2.tar.gz Curl is a package that may not be installed on Ubuntu by the user. To use curl, you need to install the curl package, which can be done using the following command: sudo apt-get -y install curl Extract the downloaded .tar file using this command: tar -xvzf elasticsearch-1.5.2.tar.gzThis will extract the files and folder into the current working directory. Navigate to the bin directory within the elasticsearch-1.5.2 directory: cd elasticsearch-1.5.2/bin Now run Elasticsearch to start the node and cluster, using the following command:./elasticsearch The preceding screenshot shows that the Elasticsearch node has been started, and it has been given a random Marvel Comics character name. If this terminal is closed, Elasticsearch will stop running as this node will shut down. However, if you have multiple Elasticsearch nodes running, then shutting down a node will not result in shutting down Elasticsearch. To verify the Elasticsearch installation, open http://localhost:9200 in your browser. Installation of Elasticsearch on Windows The installation on Windows can be done by following similar steps as in the case of Ubuntu. To use curl commands on Windows, we will be installing GIT. GIT will also be used to import a sample JSON file into Elasticsearch using elasticdump, as described in the Importing a JSON file into Elasticsearch section. Installation of GIT To run curl commands on Windows, first download and install GIT, then perform the following steps: Download the GIT ZIP package from https://git-scm.com/download/win. Double-click on the downloaded file, which will walk you through the installation process. Keep clicking on Next by not changing the default options until the Finish button is clicked on. To validate the GIT installation, right-click on any folder in which you should be able to see the options of GIT, such as GIT Bash, as shown in the following screenshot: The following are the steps required to install Elasticsearch on Windows: Open GIT Bash and enter the following command in the terminal:  curl –L –O https://download.elastic.co/elasticsearch/elasticsearch/elasticsearch-1.5.2.zip Extract the downloaded ZIP package by either unzipping it using WinRar, 7Zip, and so on (if you don't have any of these, download one of them) or using the following command in GIT Bash: unzip elasticsearch-1.5.2.zip This will extract the files and folder into the directory. Then click on the extracted folder and navigate through it to reach the bin folder. Click on the elasticsearch.bat file to run Elasticsearch. The preceding screenshot shows that the Elasticsearch node has been started, and it is given a random Marvel Comics character's name. Again, if this window is closed, Elasticsearch will stop running as this node will shut down. However, if you have multiple Elasticsearch nodes running, then shutting down a node will not result in shutting down Elasticsearch. To verify the Elasticsearch installation, open http://localhost:9200 in your browser. Installation of Kibana In this section, Kibana will be installed. 
We will install Kibana v4.1.1, and this section covers installations on Ubuntu and Windows separately. Installation of Kibana on Ubuntu 14.04 To install Kibana on Ubuntu, follow these steps: Download Kibana version 4.1.1 as a .tar file using the following command in the terminal:  curl -L -O https://download.elasticsearch.org/kibana/kibana/kibana-4.1.1-linux-x64.tar.gz Extract the downloaded .tar file using this command: tar -xvzf kibana-4.1.1-linux-x64.tar.gz The preceding command will extract the files and folder into the current working directory. Navigate to the bin directory within the kibana-4.1.1-linux-x64 directory: cd kibana-4.1.1-linux-x64/bin Now run Kibana to start the node and cluster using the following command: Make sure that Elasticsearch is running. If it is not running and you try to start Kibana, the following error will be displayed after you run the preceding command: To verify the Kibana installation, open http://localhost:5601 in your browser. Installation of Kibana on Windows To install Kibana on Windows, perform the following steps: Open GIT Bash and enter the following command in the terminal:  curl -L -O https://download.elasticsearch.org/kibana/kibana/kibana-4.1.1-windows.zip Extract the downloaded ZIP package by either unzipping it using WinRar or 7Zip (download it if you don't have it), or using the following command in GIT Bash: unzip kibana-4.1.1-windows.zip This will extract the files and folder into the directory. Then click on the extracted folder and navigate through it to get to the bin folder. Click on the kibana.bat file to run Kibana. Make sure that Elasticsearch is running. If it is not running and you try to start Kibana, the following error will be displayed after you click on the kibana.bat file: Again, to verify the Kibana installation, open http://localhost:5601 in your browser. Additional information You can change the Elasticsearch configuration for your production environment, wherein you have to change parameters such as the cluster name, node name, network address, and so on. This can be done using the information mentioned in the upcoming sections.. Changing the Elasticsearch configuration To change the Elasticsearch configuration, perform the following steps: Run the following command in the terminal to open the configuration file: sudo vi ~/elasticsearch-1.5.2/config/elasticsearch.yml Windows users can open the elasticsearch.yml file from the config folder. This will open the configuration file as follows: The cluster name can be changed, as follows: #cluster.name: elasticsearch to cluster.name: "your_cluster_name". In the preceding figure, the cluster name has been changed to test. Then, we save the file. To verify that the cluster name has been changed, run Elasticsearch as mentioned in the earlier section. Then open http://localhost:9200 in the browser to verify, as shown here: In the preceding screenshot, you can notice that cluster_name has been changed to test, as specified earlier. 
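Beyond the cluster name, a few other commonly adjusted settings live in the same elasticsearch.yml file. The snippet below is a hedged sketch for the 1.x series used in this article; the values shown are placeholders and should be adapted to your environment:

cluster.name: test
node.name: "node-1"
network.host: 192.168.0.10
http.port: 9200

After saving the file, restart Elasticsearch for the changes to take effect.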
Changing the Kibana configuration To change the Kibana configuration, follow these steps: Run the following command in the terminal to open the configuration file: sudo vi ~/kibana-4.1.1-linux-x64/config/kibana.yml Windows users can open the kibana.yml file from the config folder In this file, you can change various parameters such as the port on which Kibana works, the host address on which Kibana works, the URL of Elasticsearch that you wish to connect to, and so on For example, the port on which Kibana works can be changed by changing the port address. As shown in the following screenshot, port: 5601 can be changed to any other port, such as port: 5604. Then we save the file. To check whether Kibana is running on port 5604, run Kibana as mentioned earlier. Then open http://localhost:5604 in the browser to verify, as follows: Importing a JSON file into Elasticsearch To import a JSON file into Elasticsearch, we will use the elasticdump package. It is a set of import and export tools used for Elasticsearch. It makes it easier to copy, move, and save indexes. To install elasticdump, we will require npm and Node.js as prerequisites. Installation of npm In this section, npm along with Node.js will be installed. This section covers the installation of npm and Node.js on Ubuntu and Windows separately. Installation of npm on Ubuntu 14.04 To install npm on Ubuntu, perform the following steps: Add the official Node.js PPA: sudo curl --silent --location https://deb.nodesource.com/setup_0.12 | sudo bash - As shown in the preceding screenshot, the command will add the official Node.js repository to the system and update the apt package database to include all the latest files under the packages. At the end of the execution of this command, we will be prompted to install Node.js and npm, as shown in the following screenshot: Install Node.js by entering this command in the terminal: sudo apt-get install --yes nodejs This will automatically install Node.js and npm as npm is bundled within Node.js. To check whether Node.js has been installed successfully, type the following command in the terminal: node –v Upon successful installation, it will display the version of Node.js. Now, to check whether npm has been installed successfully, type the following command in the terminal: npm –v Upon successful installation, it will show the version of npm. Installation of npm on Windows To install npm on Windows, follow these steps: Download the Windows Installer (.msi) file by going to https://nodejs.org/en/download/. Double-click on the downloaded file and keep clicking on Next to install the software. To validate the successful installation of Node.js, right-click and select GIT Bash. In GIT Bash, enter this: node –v Upon successful installation, you will be shown the version of Node.js. To validate the successful installation of npm, right-click and select GIT Bash. In GIT Bash, enter the following line: npm –v Upon successful installation, it will show the version of npm. Installing elasticdump In this section, elasticdump will be installed. It will be used to import a JSON file into Elasticsearch. It requires npm and Node.js installed. This section covers the installation on Ubuntu and Windows separately. 
Installing elasticdump on Ubuntu 14.04 Perform these steps to install elasticdump on Ubuntu: Install elasticdump by typing the following command in the terminal: sudo npm install elasticdump -g Then run elasticdump by typing this command in the terminal: elasticdump Import a sample data (JSON) file into Elasticsearch, which can be downloaded from https://github.com/guptayuvraj/Kibana_Essentials and is named tweet.json. It will be imported into Elasticsearch using the following command in the terminal: elasticdump --bulk=true --input="/home/yuvraj/Desktop/tweet.json" --output=http://localhost:9200/ Here, input provides the location of the file, as shown in the following screenshot: As you can see, data is being imported to Elasticsearch from the tweet.json file, and the dump complete message is displayed when all the records are imported to Elasticsearch successfully. Elasticsearch should be running while importing the sample file. Installing elasticdump on Windows To install elasticdump on Windows, perform the following steps: Install elasticdump by typing the following command in GIT Bash: npm install elasticdump -g                                                                                                           Then run elasticdump by typing this command in GIT Bash: elasticdump Import the sample data (JSON) file into Elasticsearch, which can be downloaded from https://github.com/guptayuvraj/Kibana_Essentials and is named tweet.json. It will be imported to Elasticsearch using the following command in GIT Bash: elasticdump --bulk=true --input="C:UsersyguptaDesktoptweet.json" --output=http://localhost:9200/ Here, input provides the location of the file. The preceding screenshot shows data being imported to Elasticsearch from the tweet.json file, and the dump complete message is displayed when all the records are imported to Elasticsearch successfully. Elasticsearch should be running while importing the sample file. To verify that the data has been imported to Elasticsearch, open http://localhost:5601 in your browser, and this is what you should see: When Kibana is opened, you have to configure an index pattern. So, if data has been imported, you can enter the index name, which is mentioned in the tweet.json file as index: tweet. After the page loads, you can see to the left under Index Patterns the name of the index that has been imported (tweet). Now mention the index name as tweet. It will then automatically detect the timestamped field and will provide you with an option to select the field. If there are multiple fields, then you can select them by clicking on Time-field name, which will provide a drop-down list of all fields available, as shown here: Finally, click on Create to create the index in Kibana. After you have clicked on Create, it will display the various fields present in this index. If you do not get the options of Time-field name and Create after entering the index name as tweet, it means that the data has not been imported into Elasticsearch. Summary In this article, you learned about Kibana, along with the basic concepts of Elasticsearch. These help in the easy understanding of Kibana. We also looked at the prerequisites for installing Kibana, followed by a detailed explanation of how to install each component individually in Ubuntu and Windows. Resources for Article: Further resources on this subject: Understanding Ranges [article] Working On Your Bot [article] Welcome to the Land of BludBorne [article]

Putting Your Database at the Heart of Azure Solutions

Packt
28 Oct 2015
19 min read
In this article by Riccardo Becker, author of the book Learning Azure DocumentDB, we will see how to build a real scenario around an Internet of Things scenario. This scenario will build a basic Internet of Things platform that can help to accelerate building your own. In this article, we will cover the following: Have a look at a fictitious scenario Learn how to combine Azure components with DocumentDB Demonstrate how to migrate data to DocumentDB (For more resources related to this topic, see here.) Introducing an Internet of Things scenario Before we start exploring different capabilities to support a real-life scenario, we will briefly explain the scenario we will use throughout this article. IoT, Inc. IoT, Inc. is a fictitious start-up company that is planning to build solutions in the Internet of Things domain. The first solution they will build is a registration hub, where IoT devices can be registered. These devices can be diverse, ranging from home automation devices up to devices that control traffic lights and street lights. The main use case for this solution is offering the capability for devices to register themselves against a hub. The hub will be built with DocumentDB as its core component and some Web API to expose this functionality. Before devices can register themselves, they need to be whitelisted in order to prevent malicious devices to start registering. In the following screenshot, we see the high-level design of the registration requirement: The first version of the solution contains the following components: A Web API containing methods to whitelist, register, unregister, and suspend devices DocumentDB, containing all the device information including information regarding other Microsoft Azure resources Event Hub, a Microsoft Azure asset that enables scalable publish-subscribe mechanism to ingress and egress millions of events per second Power BI, Microsoft’s online offering to expose reporting capabilities and the ability to share reports Obviously, we will focus on the core of the solution which is DocumentDB but it is nice to touch some of the Azure components, as well to see how well they co-operate and how easy it is to set up a demonstration for IoT scenarios. The devices on the left-hand side are chosen randomly and will be mimicked by an emulator written in C#. The Web API will expose the functionality required to let devices register themselves at the solution and start sending data afterwards (which will be ingested to the Event Hub and reported using Power BI). Technical requirements To be able to service potentially millions of devices, it is necessary that registration request from a device is being stored in a separate collection based on the country where the device is located or manufactured. Every device is being modeled in the same way, whereas additional metadata can be provided upon registration or afterwards when updating. To achieve country-based partitioning, we will create a custom PartitionResolver to achieve this goal. To extend the basic security model, we reduce the amount of sensitive information in our configuration files. Enhance searching capabilities because we want to service multiple types of devices each with their own metadata and device-specific information. Querying on all the information is desired to support full-text search and enable users to quickly search and find their devices. Designing the model Every device is being modeled similar to be able to service multiple types of devices. 
The device model contains at least the deviceid and a location. Furthermore, the device model contains a dictionary where additional device properties can be stored. The next code snippet shows the device model: [JsonProperty("id")]         public string DeviceId { get; set; }         [JsonProperty("location")]         public Point Location { get; set; }         //practically store any metadata information for this device         [JsonProperty("metadata")]         public IDictionary<string, object> MetaData { get; set; } The Location property is of type Microsoft.Azure.Documents.Spatial.Point because we want to run spatial queries later on in this section, for example, getting all the devices within 10 kilometers of a building. Building a custom partition resolver To meet the first technical requirement (partition data based on the country), we need to build a custom partition resolver. To be able to build one, we need to implement the IPartitionResolver interface and add some logic. The resolver will take the Location property of the device model and retrieves the country that corresponds with the latitude and longitude provided upon registration. In the following code snippet, you see the full implementation of the GeographyPartitionResolver class: public class GeographyPartitionResolver : IPartitionResolver     {         private readonly DocumentClient _client;         private readonly BingMapsHelper _helper;         private readonly Database _database;           public GeographyPartitionResolver(DocumentClient client, Database database)         {             _client = client;             _database = database;             _helper = new BingMapsHelper();         }         public object GetPartitionKey(object document)         {             //get the country for this document             //document should be of type DeviceModel             if (document.GetType() == typeof(DeviceModel))             {                 //get the Location and translate to country                 var country = _helper.GetCountryByLatitudeLongitude(                     (document as DeviceModel).Location.Position.Latitude,                     (document as DeviceModel).Location.Position.Longitude);                 return country;             }             return String.Empty;         }           public string ResolveForCreate(object partitionKey)         {             //get the country for this partitionkey             //check if there is a collection for the country found             var countryCollection = _client.CreateDocumentCollectionQuery(database.SelfLink).            ToList().Where(cl => cl.Id.Equals(partitionKey.ToString())).FirstOrDefault();             if (null == countryCollection)             {                 countryCollection = new DocumentCollection { Id = partitionKey.ToString() };                 countryCollection =                     _client.CreateDocumentCollectionAsync(_database.SelfLink, countryCollection).Result;             }             return countryCollection.SelfLink;         }           /// <summary>         /// Returns a list of collectionlinks for the designated partitionkey (one per country)         /// </summary>         /// <param name="partitionKey"></param>         /// <returns></returns>         public IEnumerable<string> ResolveForRead(object partitionKey)         {             var countryCollection = _client.CreateDocumentCollectionQuery(_database.SelfLink).             
ToList().Where(cl => cl.Id.Equals(partitionKey.ToString())).FirstOrDefault();               return new List<string>             {                 countryCollection.SelfLink             };         }     } In order to have the DocumentDB client use this custom PartitionResolver, we need to assign it. The code is as follows: GeographyPartitionResolver resolver = new GeographyPartitionResolver(docDbClient, _database);   docDbClient.PartitionResolvers[_database.SelfLink] = resolver; //Adding a typical device and have the resolver sort out what //country is involved and whether or not the collection already //exists (and create a collection for the country if needed), use //the next code snippet. var deviceInAmsterdam = new DeviceModel             {                 DeviceId = Guid.NewGuid().ToString(),                 Location = new Point(4.8951679, 52.3702157)             };   Document modelAmsDocument = docDbClient.CreateDocumentAsync(_database.SelfLink,                 deviceInAmsterdam).Result;             //get all the devices in Amsterdam            var doc = docDbClient.CreateDocumentQuery<DeviceModel>(                 _database.SelfLink, null, resolver.GetPartitionKey(deviceInAmsterdam)); Now that we have created a country-based PartitionResolver, we can start working on the Web API that exposes the registration method. Building the Web API A Web API is an online service that can be used by any clients running any framework that supports the HTTP programming stack. Currently, REST is a way of interacting with APIs so that we will build a REST API. Building a good API should aim for platform independence. A well-designed API should also be able to extend and evolve without affecting existing clients. First, we need to whitelist the devices that should be able to register themselves against our device registry. The whitelist should at least contain a device ID, a unique identifier for a device that is used to match during the whitelisting process. A good candidate for a device ID is the mac address of the device or some random GUID. Registering a device The registration Web API contains a POST method that does the actual registration. First, it creates access to an Event Hub (not explained here) and stores the credentials needed inside the DocumentDB document. The document is then created inside the designated collection (based on the location). To learn more about Event Hubs, please visit https://azure.microsoft.com/en-us/services/event-hubs/.  
[Route("api/registration")]         [HttpPost]         public async Task<IHttpActionResult> Post([FromBody]DeviceModel value)         {             //add the device to the designated documentDB collection (based on country)             try             { var serviceUri = ServiceBusEnvironment.CreateServiceUri("sb", serviceBusNamespace,                     String.Format("{0}/publishers/{1}", "telemetry", value.DeviceId))                     .ToString()                     .Trim('/');                 var sasToken = SharedAccessSignatureTokenProvider.GetSharedAccessSignature(EventHubKeyName,                     EventHubKey, serviceUri, TimeSpan.FromDays(365 * 100)); // hundred years will do                 //this token can be used by the device to send telemetry                 //this token and the eventhub name will be saved with the metadata of the document to be saved to DocumentDB                 value.MetaData.Add("Namespace", serviceBusNamespace);                 value.MetaData.Add("EventHubName", "telemetry");                 value.MetaData.Add("EventHubToken", sasToken);                 var document = await docDbClient.CreateDocumentAsync(_database.SelfLink, value);                 return Created(document.ContentLocation, value);            }             catch (Exception ex)             {                 return InternalServerError(ex);             }         } After this registration call, the right credentials on the Event Hub have been created for this specific device. The device is now able to ingress data to the Event Hub and have consumers like Power BI consume the data and present it. Event Hubs is a highly scalable publish-subscribe event ingestor. It can collect millions of events per second so that you can process and analyze the massive amounts of data produced by your connected devices and applications. Once collected into Event Hubs, you can transform and store the data by using any real-time analytics provider or with batching/storage adapters. At the time of writing, Microsoft announced the release of Azure IoT Suite and IoT Hubs. These solutions offer internet of things capabilities as a service and are well-suited to build our scenario as well. Increasing searching We have seen how to query our documents and retrieve the information we need. For this approach, we need to understand the DocumentDB SQL language. Microsoft has an online offering that enables full-text search called Azure Search service. This feature enables us to perform full-text searches and it also includes search behaviours similar to search engines. We could also benefit from so called type-ahead query suggestions based on the input of a user. Imagine a search box on our IoT Inc. portal that offers free text searching while the user types and search for devices that include any of the search terms on the fly. Azure Search runs on Azure; therefore, it is scalable and can easily be upgraded to offer more search and storage capacity. Azure Search stores all your data inside an index, offering full-text search capabilities on your data. Setting up Azure Search Setting up Azure Search is pretty straightforward and can be done by using the REST API it offers or on the Azure portal. We will set up the Azure Search service through the portal and later on, we will utilize the REST API to start configuring our search service. We set up the Azure Search service through the Azure portal (http://portal.azure.com). Find the Search service and fill out some information. 
In the following screenshot, we can see how we have created the free tier for Azure Search: You can see that we use the Free tier for this scenario and that there are no datasources configured yet. We will do that know by using the REST API. We will use the REST API, since it offers more insight on how the whole concept works. We use Fiddler to create a new datasource inside our search environment. The following screenshot shows how to use Fiddler to create a datasource and add a DocumentDB collection: In the Composer window of Fiddler, you can see we need to POST a payload to the Search service we created earlier. The Api-Key is mandatory and also set the content type to be JSON. Inside the body of the request, the connection information to our DocumentDB environment is need and the collection we want to add (in this case, Netherlands). Now that we have added the collection, it is time to create an Azure Search index. Again, we use Fiddler for this purpose. Since we use the free tier of Azure Search, we can only add five indexes at most. For this scenario, we add an index on ID (device ID), location, and metadata. At the time of writing, Azure Search does not support complex types. Note that the metadata node is represented as a collection of strings. We could check in the portal to see if the creation of the index was successful. Go to the Search blade and select the Search service we have just created. You can check the indexes part to see whether the index was actually created. The next step is creating an indexer. An indexer connects the index with the provided data source. Creating this indexer takes some time. You can check in the portal if the indexing process was successful. We actually find that documents are part of the index now. If your indexer needs to process thousands of documents, it might take some time for the indexing process to finish. You can check the progress of the indexer using the REST API again. https://iotinc.search.windows.net/indexers/deviceindexer/status?api-version=2015-02-28 Using this REST call returns the result of the indexing process and indicates if it is still running and also shows if there are any errors. Errors could be caused by documents that do not have the id property available. The final step involves testing to check whether the indexing works. We will search for a device ID, as shown in the next screenshot: In the Inspector tab, we can check for the results. It actually returns the correct document also containing the location field. The metadata is missing because complex JSON is not supported (yet) at the time of writing. Indexing complex JSON types is not supported yet. It is possible to add SQL queries to the data source. We could explicitly add a SELECT statement to surface the properties of the complex JSON we have like metadata or the Point property. Try adding additional queries to your data source to enable querying complex JSON types. Now that we have created an Azure Search service that indexes our DocumentDB collection(s), we can build a nice query-as-you-type field on our portal. Try this yourself. Enhancing security Microsoft Azure offers a capability to move your secrets away from your application towards Azure Key Vault. Azure Key Vault helps to protect cryptographic keys, secrets, and other information you want to store in a safe place outside your application boundaries (connectionstring are also good candidates). Key Vault can help us to protect the DocumentDB URI and its key. 
DocumentDB has no (in-place) encryption feature at the time of writing, although a lot of people already asked for it to be on the roadmap. Creating and configuring Key Vault Before we can use Key Vault, we need to create and configure it first. The easiest way to achieve this is by using PowerShell cmdlets. Please visit https://msdn.microsoft.com/en-us/mt173057.aspx to read more about PowerShell. The following PowerShell cmdlets demonstrate how to set up and configure a Key Vault: Command Description Get-AzureSubscription This command will prompt you to log in using your Microsoft Account. It returns a list of all Azure subscriptions that are available to you. Select-AzureSubscription -SubscriptionName "Windows Azure MSDN Premium" This tells PowerShell to use this subscription as being subject to our next steps. Switch-AzureMode AzureResourceManager New-AzureResourceGroup –Name 'IoTIncResourceGroup' –Location 'West Europe' This creates a new Azure Resource Group with a name and a location. New-AzureKeyVault -VaultName 'IoTIncKeyVault' -ResourceGroupName 'IoTIncResourceGroup' -Location 'West Europe' This creates a new Key Vault inside the resource group and provide a name and location. $secretvalue = ConvertTo-SecureString '<DOCUMENTDB KEY>' -AsPlainText –Force This creates a security string for my DocumentDB key. $secret = Set-AzureKeyVaultSecret -VaultName 'IoTIncKeyVault' -Name 'DocumentDBKey' -SecretValue $secretvalue This creates a key named DocumentDBKey into the vault and assigns it the secret value we have just received. Set-AzureKeyVaultAccessPolicy -VaultName 'IoTIncKeyVault' -ServicePrincipalName <SPN> -PermissionsToKeys decrypt,sign This configures the application with the Service Principal Name <SPN> to get the appropriate rights to decrypt and sign Set-AzureKeyVaultAccessPolicy -VaultName 'IoTIncKeyVault' -ServicePrincipalName <SPN> -PermissionsToSecrets Get This configures the application with SPN to also be able to get a key. Key Vault must be used together with Azure Active Directory to work. The SPN we need in the steps for powershell is actually is a client ID of an application I have set up in my Azure Active Directory. Please visit https://azure.microsoft.com/nl-nl/documentation/articles/active-directory-integrating-applications/ to see how you can create an application. Make sure to copy the client ID (which is retrievable afterwards) and the key (which is not retrievable afterwards). We use these two pieces of information to take the next step. Using Key Vault from ASP.NET In order to use the Key Vault we have created in the previous section, we need to install some NuGet packages into our solution and/or projects: Install-Package Microsoft.IdentityModel.Clients.ActiveDirectory -Version 2.16.204221202   Install-Package Microsoft.Azure.KeyVault These two packages enable us to use AD and Key Vault from our ASP.NET application. The next step is to add some configuration information to our web.config file: <add key="ClientId" value="<CLIENTID OF THE APP CREATED IN AD" />     <add key="ClientSecret" value="<THE SECRET FROM AZURE AD PORTAL>" />       <!-- SecretUri is the URI for the secret in Azure Key Vault -->     <add key="SecretUri" value="https://iotinckeyvault.vault.azure.net:443/secrets/DocumentDBKey" /> If you deploy the ASP.NET application to Azure, you could even configure these settings from the Azure portal itself, completely removing this from the web.config file. This technique adds an additional ring of security around your application. 
The following code snippet shows how to use AD and Key Vault inside the registration functionality of our scenario: //no more keys in code or .config files. Just a appid, secret and the unique URL to our key (SecretUri). When deploying to Azure we could             //even skip this by setting appid and clientsecret in the Azure Portal.             var kv = new KeyVaultClient(new KeyVaultClient.AuthenticationCallback(Utils.GetToken));             var sec = kv.GetSecretAsync(WebConfigurationManager.AppSettings["SecretUri"]).Result.Value; The Utils.GetToken method is shown next. This method retrieves an access token from AD by supplying the ClientId and the secret. Since we configured Key Vault to allow this application to get the keys, the call to GetSecretAsync() will succeed. The code is as follows: public async static Task<string> GetToken(string authority, string resource, string scope)         {             var authContext = new AuthenticationContext(authority);             ClientCredential clientCred = new ClientCredential(WebConfigurationManager.AppSettings["ClientId"],                         WebConfigurationManager.AppSettings["ClientSecret"]);             AuthenticationResult result = await authContext.AcquireTokenAsync(resource, clientCred);               if (result == null)                 throw new InvalidOperationException("Failed to obtain the JWT token");             return result.AccessToken;         } Instead of storing the key to DocumentDB somewhere in code or in the web.config file, it is now moved away to Key Vault. We could do the same with the URI to our DocumentDB and with other sensitive information as well (for example, storage account keys or connection strings). Encrypting sensitive data The documents we created in the previous section contains sensitive data like namespaces, Event Hub names, and tokens. We could also use Key Vault to encrypt those specific values to enhance our security. In case someone gets hold of a document containing the device information, he is still unable to mimic this device since the keys are encrypted. Try to use Key Vault to encrypt the sensitive information that is stored in DocumentDB before it is saved in there. Migrating data This section discusses how to use a tool to migrate data from an existing data source to DocumentDB. For this scenario, we assume that we already have a large datastore containing existing devices and their registration information (Event Hub connection information). In this section, we will see how to migrate an existing data store to our new DocumentDB environment. We use the DocumentDB Data Migration Tool for this. You can download this tool from the Microsoft Download Center (http://www.microsoft.com/en-us/download/details.aspx?id=46436) or from GitHub if you want to check the code. The tool is intuitive and enables us to migrate from several datasources: JSON files MongoDB SQL Server CSV files Azure Table storage Amazon DynamoDB HBase DocumentDB collections To demonstrate the use, we migrate our existing Netherlands collection to our United Kingdom collection. Start the tool and enter the right connection string to our DocumentDB database. We do this for both our source and target information in the tool. The connection strings you need to provide should look like this: AccountEndpoint=https://<YOURDOCDBURL>;AccountKey=<ACCOUNTKEY>;Database=<NAMEOFDATABASE>. You can click on the Verify button to make sure these are correct. 
In the Source Information field, we provide the Netherlands as being the source to pull data from. In the Target Information field, we specify the United Kingdom as the target. In the following screenshot, you can see how these settings are provided in the migration tool for the source information: The following screenshot shows the settings for the target information: It is also possible to migrate data to a collection that is not created yet. The migration tool can do this if you enter a collection name that is not available inside your database. You also need to select the pricing tier. Optionally, setting the partition key could help to distribute your documents based on this key across all collections you add in this screen. This information is sufficient to run our example. Go to the Summary tab and verify the information you entered. Press Import to start the migration process. We can verify a successful import on the Import results pane. This example is a simple migration scenario but the tool is also capable of using complex queries to only migrate those documents that need to moved or migrated. Try migrating data from an Azure Table storage table to DocumentDB by using this tool. Summary In this article, we saw how to integrate DocumentDB with other Microsoft Azure features. We discussed how to setup the Azure Search service and how create an index to our collection. We also covered how to use the Azure Search feature to enable full-text search on our documents which could enable users to query while typing. Next, we saw how to add additional security to our scenario by using Key Vault. We also discussed how to create and configure Key Vault by using PowerShell cmdlets, and we saw how to enable our ASP.NET scenario application to make use of the Key Vault .NET SDK. Then, we discussed how to retrieve the sensitive information from Key Vault instead of configuration files. Finally, we saw how to migrate an existing data source to our collection by using the DocumentDB Data Migration Tool. Resources for Article: Further resources on this subject: Microsoft Azure – Developing Web API For Mobile Apps [article] Introduction To Microsoft Azure Cloud Services [article] Security In Microsoft Azure [article]

Making 3D Visualizations

Packt
26 Oct 2015
5 min read
Python has become the preferred language of data scientists for data analysis, visualization, and machine learning. It features numerical and mathematical toolkits such as NumPy, SciPy, scikit-learn, matplotlib, and pandas, as well as an R-like environment with IPython, all used for data analysis, visualization, and machine learning. In this article by Dimitry Foures and Giuseppe Vettigli, authors of the book Python Data Visualization Cookbook, Second Edition, we will see how visualization in 3D is sometimes effective and sometimes inevitable. In this article, you will learn how 3D bars are created.

(For more resources related to this topic, see here.)

Creating 3D bars

Although matplotlib is mainly focused on 2D plotting, there are different extensions that enable us to plot over geographical maps, integrate more with Excel, and plot in 3D. These extensions are called toolkits in the matplotlib world. A toolkit is a collection of specific functions that focuses on one topic, such as plotting in 3D. Popular toolkits are Basemap, GTK Tools, Excel Tools, Natgrid, AxesGrid, and mplot3d. We will explore mplot3d further in this recipe. The mpl_toolkits.mplot3d toolkit provides some basic 3D plotting. The supported plots are scatter, surf, line, and mesh. Although this is not the best 3D plotting library, it comes with matplotlib, and we are already familiar with its interface.

Getting ready

Basically, we still need to create a figure and add the desired axes to it. The difference is that we specify a 3D projection for the figure, and the axes we add are Axes3D. Now, we can use almost the same functions for plotting; of course, the difference is in the arguments passed, because we now have three axes for which we need to provide data. For example, the mpl_toolkits.mplot3d.Axes3D.plot function specifies the xs, ys, zs, and zdir arguments. All others are transferred directly to matplotlib.axes.Axes.plot. Let's explain these specific arguments:

xs, ys: These are the coordinates for the X and Y axes.
zs: These are the value(s) for the Z axis. There can be one for all points, or one for each point.
zdir: This chooses which dimension will be the z-axis (usually this is zs, but it can be xs or ys).

The mpl_toolkits.mplot3d.art3d module contains 3D artist code and functions to convert 2D artists into 3D versions that can be added to an Axes3D; its rotate_axes function reorders the coordinates so that the axes are rotated with zdir along. The default value is z. Prepending the axis with a '-' does the inverse transform, so zdir can be x, -x, y, -y, z, or -z.

How to do it...

This is the code to demonstrate the plotting concept explained in the preceding section:

import random

import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
from mpl_toolkits.mplot3d import Axes3D

mpl.rcParams['font.size'] = 10

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')

for z in [2011, 2012, 2013, 2014]:
    xs = xrange(1, 13)
    ys = 1000 * np.random.rand(12)
    color = plt.cm.Set2(random.choice(xrange(plt.cm.Set2.N)))
    ax.bar(xs, ys, zs=z, zdir='y', color=color, alpha=0.8)
    ax.xaxis.set_major_locator(mpl.ticker.FixedLocator(xs))
    ax.yaxis.set_major_locator(mpl.ticker.FixedLocator(ys))

ax.set_xlabel('Month')
ax.set_ylabel('Year')
ax.set_zlabel('Sales Net [usd]')

plt.show()

This code produces the following figure:

How it works...

We had to do the same preparation work as in the 2D world; the difference is that we specified a 3D projection for the figure and added an Axes3D instance to plot against.
Then, we generated random data for a supposed four years of sales (2011–2014). For each year, a single Z value (zs=z) is shared by all the bars in that series. The color is picked randomly from the Set2 color map, and each Z-ordered collection of xs, ys pairs is rendered as a bar series in that color.

There's more...

Other plot types from 2D matplotlib are also available here. For example, scatter() has a similar interface to plot(), but with an added size for the point marker. We are also familiar with contour, contourf, and bar. Types that are available only in 3D are wireframe, surface, and tri-surface plots. For example, the following code plots a tri-surface plot of the popular Pringle function or, more mathematically, a hyperbolic paraboloid:

from mpl_toolkits.mplot3d import Axes3D
from matplotlib import cm
import matplotlib.pyplot as plt
import numpy as np

n_angles = 36
n_radii = 8

# An array of radii
# Does not include radius r=0, this is to eliminate duplicate points
radii = np.linspace(0.125, 1.0, n_radii)

# An array of angles
angles = np.linspace(0, 2*np.pi, n_angles, endpoint=False)

# Repeat all angles for each radius
angles = np.repeat(angles[..., np.newaxis], n_radii, axis=1)

# Convert polar (radii, angles) coords to cartesian (x, y) coords
# (0, 0) is added here. There are no duplicate points in the (x, y) plane
x = np.append(0, (radii*np.cos(angles)).flatten())
y = np.append(0, (radii*np.sin(angles)).flatten())

# Pringle surface
z = np.sin(-x*y)

fig = plt.figure()
ax = fig.gca(projection='3d')

ax.plot_trisurf(x, y, z, cmap=cm.jet, linewidth=0.2)

plt.show()

The code will give the following output:

Summary

Python Data Visualization Cookbook, Second Edition, is for developers who already know about Python programming in general. If you have heard about data visualization but don't know where to start, the book will guide you from the start and help you understand data, data formats, and data visualization, and how to use Python to visualize data. Many more visualization techniques are illustrated in the book's step-by-step, recipe-based approach to data visualization. The topics are explained sequentially as cookbook recipes consisting of a code snippet and the resulting visualization.

Resources for Article:

Further resources on this subject:

Basics of Jupyter Notebook and Python [article]
Asynchronous Programming with Python [article]
Introduction to Data Analysis and Libraries [article]

Configuring Brokers

Packt
21 Oct 2015
18 min read
In this article by Saurabh Minni, author of Apache Kafka Cookbook, we will cover the following topics:

Configuring basic settings
Configuring threads and performance
Configuring log settings
Configuring replica settings
Configuring the ZooKeeper settings
Configuring other miscellaneous parameters

(For more resources related to this topic, see here.)

This article explains the configuration of a Kafka broker. Before we get started with Kafka, it is critical to configure it to suit us best. The best part about Kafka is that it is highly configurable, and although most of the time the default settings will serve you well, when dealing with scale and performance you might want a configuration that suits your application best.

Configuring basic settings

Let's configure the basic settings for your Apache Kafka broker.

Getting ready

I assume you have already installed Kafka. Make a copy of the server.properties file from the config folder. Now, let's get cracking with your favorite editor.

How to do it...

Open your server.properties file:

The first configuration that you need to change is broker.id:
broker.id=0
Next, give a host name to your machine:
host.name=localhost
You also need to set the port number to listen on:
port=9092
Lastly, set the directory for data persistence:
log.dirs=/disk1/kafka-logs

How it works…

With these basic configuration parameters in place, your Kafka broker is ready to be set up. All you need to do is pass this new configuration file as a parameter when you start the broker. The important configurations used in the configuration file are explained here:

broker.id: This should be a non-negative integer ID. It should be unique within a cluster, as it is used for all intents and purposes as the name of the broker. It also allows the broker to be moved to a different host and/or port without additional changes on the consumer side. Its default value is 0.
host.name: The default value for this is null. If it's not specified, Kafka will bind to all interfaces on the system. If it's specified, it will bind only to that particular address. If you want clients to connect only to a particular interface, it is a good idea to specify the host name.
port: This defines the port number that the Kafka broker will listen on to accept client connections.
log.dirs: This tells the broker the directory where it should store files for the persistence of messages. You can specify multiple directories here, separated by commas. The default value for this is /tmp/kafka-logs.

There's more…

Kafka also lets you specify two more parameters, which are very interesting:

advertised.host.name: This is the hostname that is given out to producers, consumers, and other brokers to connect to. Usually, this is the same as host.name and you need not specify it.
advertised.port: This specifies the port that producers, consumers, and other brokers need to connect to. If not specified, it uses the one mentioned in the port configuration parameter.

The real use case for the preceding parameters is when you make use of bridged connections, where your internal host.name and port number might be different from the ones that external parties need to connect to.

Configuring threads and performance

When using Kafka, these are settings you usually need not modify. However, when you want to extract every last bit of performance from your machines, they come in handy.
Getting ready You are all set with your broker properties file and are set to edit it in your favorite editor. How to do it... Open your server.properties file. Change message.max.bytes: message.max.bytes=1000000 Set the number of network threads: num.network.threads=3 Set the number of IO threads: num.io.threads=8 Set the number of threads that perform background processing: background.threads=10 Set the maximum number of requests to be queued up: queued.max.requests=500 Set the send socket buffer size: socket.send.buffer.bytes=102400 Set the receive socket buffer size: socket.receive.buffer.bytes=102400 Set the maximum request size: socket.request.max.bytes=104857600 Set the number of partitions: num.partitions=1 How it works… These network and performance configurations are set to an optimal level for your application. You might need to experiment a little to come up with an optimal configuration. Here are some explanations for these confugurations: message.max.bytes: This sets the maximum size of the message that a server can receive. This should be set in order to prevent any producer from inadvertently sending extra large messages and swamping consumers. The default size should be set to 1000000. num.network.threads: This sets the number of threads running to handle a network request. If you have too many requests coming in, you need to change this value. Else, you are good to go in most use cases. The default value for this should be set to 3. num.io.threads: This sets the number of threads that are spawned for IO operations. This should be set to at least the number of disks that are present. The default value for this should be set to 8. background.threads: This sets the number of threads that run various background jobs. These include deleting old log files. The default value is 10 and you might not need to change this. queued.max.requests: This sets the size of the queue that holds pending messages, while others are processed by IO threads. If the queue is full, the network threads will stop accepting any more messages. If you have erratic loads in your application, you need to set it to some value at which this does not throttle. socket.send.buffer.bytes: This sets the SO_SNDBUFF buffer size, which is used for socket connections. socket.receive.buffer.bytes: This sets the SO_RCVBUFF buffer size, which is used for socket connections. socket.request.max.bytes: This sets the maximum request size for a server to receive. This should be smaller than the Java heap size that you have set. num.partitions: This sets the number of default partitions for any topic you create without explicitly mentioning any partition size. There's more You might also need to configure your Java installation for maximum performance. This includes settings for heap, socket size, and so on. Configuring log settings Log settings are perhaps the most important configurations that you need to change based on your system requirements. Getting ready Just open the server.properties file in your favorite editor. How to do it... Open your server.properties file. 
Here are the default values for it: Change the log.segment.bytes value: log.segment.bytes=1073741824 Set the log.roll.{ms,hours} value: log.roll.{ms,hours}=168 hours Set the log.cleanup.policy value: log.cleanup.policy=delete Set the log.retention.{ms,minutes,hours} value: log.retention.{ms,minutes,hours}=168 hours Set the log.retention.bytes value: log.retention.bytes=-1 Set the log.retention.check.interval.ms value: log.retention.check.interval.ms= 30000 Set the log.cleaner.enable value: log.cleaner.enable=false Set the log.cleaner.threads value: log.cleaner.threads=1 Set the log.cleaner.backoff.ms value: log.cleaner.backoff.ms=15000 Set the log.index.size.max.bytes value: log.index.size.max.bytes=10485760 Set the log.index.interval.bytes value: log.index.interval.bytes=4096 Set the log.flush.interval.messages value: log.flush.interval.messages=Long.MaxValue Set the log.flush.interval.ms value: log.flush.interval.ms=Long.MaxValue How it works… Here is the explanation of log settings: log.segment.bytes: This defines the maximum segment size in bytes. Once a segment reaches a particular size, a new segment file is created. A topic is stored as a bunch of segment files in a directory. This can also be set on a per topic basis. Its default value is 1 GB. log.roll.{ms,hours}: This sets the time period after which a new segment file is created even if it has not reached the required size limit. This setting can also be set on a per topic basis. Its default value is 7 days. log.cleanup.policy: The value for this can be either deleted or compacted. With the delete option set, log segments are deleted periodically when it reaches its time threshold or size limit. If a compact option is set, log compaction will be used to clean up obsolete records. This setting can be set on a per topic basis. log.retention.{ms,minutes,hours}: This sets the amount of time that logs segments are retained. This can be set on a per topic basis. The default value for this is 7 days. log.retention.bytes: This sets the maximum number of byte logs per partition that are retained before they are deleted. This value can be set for a per topic basis. When either of the log time or size limits are reached, segments are deleted. log.retention.check.interval.ms: This sets the time interval at which logs are checked for deletion to meet retention policies. The default value for this is 5 minutes. log.cleaner.enable: For log compaction to be enabled, this has to be set as true. log.cleaner.threads: This sets the number of threads that work to clean logs for compaction. log.cleaner.backoff.ms: This defines the interval at which logs check if any other logs need cleaning. log.index.size.max.bytes: This settings sets the maximum size allowed for the offset index of each log segment. This can be set for per topic basis as well. log.index.interval.bytes: This defines the byte interval at which a new entry is added to the offset index. For each fetch request, the broker performs a linear scan for a particular number of bytes to find the correct position in the log to begin and end a fetch. Setting this as a larger value will mean larger index files (and a bit more memory usage) but less scanning. log.flush.interval.messages: This is the number of messages that are kept in memory till they're flushed to the disk. Though this does not guarantee durability, it gives finer control. log.flush.interval.ms: This sets the time interval at which the messages are flushed to the disk. 
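Put together, the log-related keys discussed above would sit in server.properties along the following lines. This is only an illustrative sketch: the values are assumptions for a broker that rolls segments at 1 GB or 7 days and keeps roughly a week of data, not recommendations, so tune them to your own retention and durability requirements.

# Roll a new segment at 1 GB or after 7 days, whichever comes first
log.segment.bytes=1073741824
log.roll.hours=168

# Delete (rather than compact) old segments; keep about a week, or 50 GB per partition
log.cleanup.policy=delete
log.retention.hours=168
log.retention.bytes=53687091200
log.retention.check.interval.ms=300000

# Leave flushing to the operating system's page cache
log.flush.interval.messages=9223372036854775807
log.flush.interval.ms=9223372036854775807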
There's more Some other settings are listed at http://kafka.apache.org/documentation.html#brokerconfigs. See also More on log compassion is available at http://kafka.apache.org/documentation.html#compaction. Configuring replica settings You will also want set up a replica for reliability purposes. Let's see some of the important settings that you need to handle for replication to work best for you. Getting ready Open the server.properties file in your favorite editor. How to do it... Open your server.properties file. Here are default values for the settings: Set the default.replication.factor value: default.replication.factor=1 Set the replica.lag.time.max.ms value: replica.lag.time.max.ms=10000 Set the replica.lag.max.messages value: replica.lag.max.messages=4000 Set the replica.fetch.max.bytes value: replica.fetch.max.bytes=1048576 Set the replica.fetch.wait.max.ms value: replica.fetch.wait.max.ms=500 Set the num.replica.fetchers value: num.replica.fetchers=1 Set the replica.high.watermark.checkpoint.interval.ms value: replica.high.watermark.checkpoint.interval.ms=5000 Set the fetch.purgatory.purge.interval.requests value: fetch.purgatory.purge.interval.requests=1000 Set the producer.purgatory.purge.interval.requests value: producer.purgatory.purge.interval.requests=1000 Set the replica.socket.timeout.ms value: replica.socket.timeout.ms=30000 Set the replica.socket.receive.buffer.bytes value: replica.socket.receive.buffer.bytes=65536 How it works… Here is the explanation of the preceding settings: default.replication.factor: This sets the default replication factor for automatically created topics. replica.lag.time.max.ms: This is time period within which if a leader does not receive any fetch request, its moved out of in-sync replicas and is treated as dead. replica.lag.max.messages: This is maximum number of messages a follower can be behind the leader by before it is considered dead and not in-sync. replica.fetch.max.bytes: This sets the maximum number of bytes of data that a follower will fetch in a request from its leader. replica.fetch.wait.max.ms: This sets the maximum amount of time for the leader to respond to a replica's fetch request. num.replica.fetchers: This specifies the number of threads used to replicate messages from the leader. Increasing the number of threads increases the IO rate to a degree. replica.high.watermark.checkpoint.interval.ms: This specifies the frequency with which each replica saves its high watermark to disk for recovery. fetch.purgatory.purge.interval.requests: This sets the fetch request purgatory's purge interval. This purgatory is the place where the fetch requests are kept on hold till they can be serviced. producer.purgatory.purge.interval.requests: This sets the producer request purgatory's purge interval. This purgatory is the place where the producer requests are kept on hold till they have been serviced. There's more Some other settings are listed at http://kafka.apache.org/documentation.html#brokerconfigs. Configuring the ZooKeeper settings ZooKeeper is used in Kafka for cluster management and to maintain the details of topics. Getting ready Just open the server.properties file in your favorite editor. How to do it… Open your server.properties file. 
Here are the default values for the settings: Set the zookeeper.connect property: zookeeper.connect=127.0.0.1:2181,192.168.0.32:2181 Set the zookeeper.session.timeout.ms property: zookeeper.session.timeout.ms=6000 Set the zookeeper.connection.timeout.ms property: zookeeper.connection.timeout.ms=6000 Set the zookeeper.sync.time.ms property: zookeeper.sync.time.ms=2000 How it works… Here is the explanation of these settings: zookeeper.connect: This is where you specify the ZooKeeper connection string in the form of hostname:port. You can use comma-separated values to specify multiple ZooKeeper nodes. This ensures reliability and continuity of Kafka clusters even in the event of a ZooKeeper node being down. ZooKeeper allows you to use the chroot path to make a particular Kafka data available only under a particular path. This enables you to have the same ZooKeeper clusters support multiple Kafka clusters. Here is the method to specify connection a string in this case: host1:port1,host2:port2,host3:port3/chroot/path The preceding statement puts all the cluster data in the /chroot/path path. This path must be created prior to starting Kafka clusters and users must use the same string. zookeeper.session.timeout.ms: This specifies the time within which if the heartbeat from a server is not received, then it is considered dead. The value for this must be carefully selected because if this heartbeat has too long an interval, it will not be able to detect a dead server in time and also lead to issues. Also, if the time period is too small, a live server might be considered dead. zookeeper.connection.timeout.ms: This specifies the maximum connection time that a client waits to accept a connection. zookeeper.sync.time.ms property: This specifies the time period by which a ZooKeeper follower can be behind its leader The ZooKeeper management details from the Kafka perspective are highlighted at http://kafka.apache.org/documentation.html#zk. You can find ZooKeeper at https://zookeeper.apache.org/ See also Configuring other miscellaneous parameters Besides the configurations mentioned previously, there are some other configurations that also need to be set. Getting ready Open the server.properties file in your favorite editor. We will look at the default values of the properties in the following section. How to do it... 
Set the auto.create.topics.enable property: auto.create.topics.enable=true Set the controlled.shutdown.enable property: controlled.shutdown.enable=true Set the controlled.shutdown.max.retries property: controlled.shutdown.max.retries=3 Set the controlled.shutdown.retry.backoff.ms property: controlled.shutdown.retry.backoff.ms=5000 Set the auto.leader.rebalance.enable property: auto.leader.rebalance.enable=true Set the leader.imbalance.per.broker.percentage property: leader.imbalance.per.broker.percentage=10 Set the leader.imbalance.check.interval.seconds property: leader.imbalance.check.interval.seconds=300 Set the offset.metadata.max.bytes property: offset.metadata.max.bytes=4096 Set the max.connections.per.ip property: max.connections.per.ip=Int.MaxValue Set the connections.max.idle.ms property: connections.max.idle.ms=600000 Set the unclean.leader.election.enable property: unclean.leader.election.enable=true Set the offsets.topic.num.partitions property: offsets.topic.num.partitions=50 Set the offsets.topic.retention.minutes property: offsets.topic.retention.minutes=1440 Set the offsets.retention.check.interval.ms property: offsets.retention.check.interval.ms=600000 Set the offsets.topic.replication.factor property: offsets.topic.replication.factor=3 Set the offsets.topic.segment.bytes property: offsets.topic.segment.bytes=104857600 Set the offsets.load.buffer.size property: offsets.load.buffer.size=5242880 Set the offsets.commit.required.acks property: offsets.commit.required.acks=-1 Set the offsets.commit.timeout.ms property: offsets.commit.timeout.ms=5000 How it works… An explanation of the settings is as follows. auto.create.topics.enable: Setting this value to true will make sure that if you fetch metadata or produce messages for a nonexistent topic, it will automatically be created. Ideally, in a production environment, you should set this value to false. controlled.shutdown.enable: This is set to true to make sure that when shutdown is called on the broker, if it's the leader of any topic, then it gracefully moves all leaders to a different broker before it shuts down. This increases the availability of the system overall. controlled.shutdown.max.retries: This sets the maximum number of retries that the broker makes to perform a controlled shutdown before performing an unclean one. controlled.shutdown.retry.backoff.ms: This sets the backoff time between controlled shutdown retries. auto.leader.rebalance.enable: If this is set to true, the broker will automatically try to balance the leadership of partitions among other brokers by periodically giving leadership to the preferred replica of each partition if it's available. leader.imbalance.per.broker.percentage: This sets the percentage of leader imbalance that's allowed per broker. The cluster will rebalance the leadership if this ratio goes above the set value. leader.imbalance.check.interval.seconds: This defines the time period for checking leader imbalance. offset.metadata.max.bytes: This defines the maximum amount of metadata allowed to the client to be stored with their offset. max.connections.per.ip: This sets the maximum number of connections that the broker accepts from a given IP address. connections.max.idle.ms: This sets the maximum time till which the broker will be idle before it closes a socket connection unclean.leader.election.enable: This is set to true to allow replicas that are not in-sync replicas (ISR) in order to be allowed to become the leader. This can lead to data loss. 
This is the last option for the cluster, though. offsets.topic.num.partitions: This sets the number of partitions for the offset commit topic. This cannot be changed post deployment, so its suggested that the number be set to a higher limit. The default value for this is 50. offsets.topic.retention.minutes: This sets offsets that are older than present time be marked for deletion. Actual deletion occurs when a log cleaner run the compaction of an offset topic. offsets.retention.check.interval.ms: This sets the time interval for the checking of stale offsets. offsets.topic.replication.factor: This sets the replication factor for the offset commit topic. The higher the value, the higher the availability. If at the time of creation of an offset topic, the number of brokers is lower than the replication factor, the number of replicas created will be equal to the brokers. offsets.topic.segment.bytes: This sets the segment size for offset topics. This, if kept low, leads to faster log compaction and loads. offsets.load.buffer.size: This sets the buffer size that's to be used for reading offset segments into offset manager's cache. offsets.commit.required.acks: This sets the number of acknowledgements that are required before an offset commit can be accepted. offsets.commit.timeout.ms: This sets the time after which an offset commit will be performed in case the required number of replicas have not received the offset commit. See also There are more broker configurations that are available. Read more about them at http://kafka.apache.org/documentation.html#brokerconfigs. Summary In this article, we discussed setting basic configurations for the Kafka broker, configuring and managing threads, performance, logs, and replicas. We also discussed ZooKeeper settings that are used for cluster management and some miscellaneous parameter settings. Resources for Article: Further resources on this subject: Writing Consumers [article] Introducing Kafka [article] Testing With Groovy [article]

QlikView Tips and Tricks

Packt
20 Oct 2015
6 min read
In this article by Andrew Dove and Roger Stone, author of the book QlikView Unlocked, we will cover the following key topics: A few coding tips The surprising data sources Include files Change logs (For more resources related to this topic, see here.) A few coding tips There are many ways to improve things in QlikView. Some are techniques and others are simply useful things to know or do. Here are a few of our favourite ones. Keep the coding style constant There's actually more to this than just being a tidy developer. So, always code your function names in the same way—it doesn't matter which style you use (unless you have installation standards that require a particular style). For example, you could use MonthStart(), monthstart(), or MONTHSTART(). They're all equally valid, but for consistency, choose one and stick to it. Use MUST_INCLUDE rather than INCLUDE This feature wasn't documented at all until quite a late service release of v11.2; however, it's very useful. If you use INCLUDE and the file you're trying to include can't be found, QlikView will silently ignore it. The consequences of this are unpredictable, ranging from strange behaviour to an outright script failure. If you use MUST_INCLUDE, QlikView will complain that the included file is missing, and you can fix the problem before it causes other issues. Actually, it seems strange that INCLUDE doesn't do this, but Qlik must have its reasons. Nevertheless, always use MUST_INCLUDE to save yourself some time and effort. Put version numbers in your code QlikView doesn't have a versioning system as such, and we have yet to see one that works effectively with QlikView. So, this requires some effort on the part of the developer. Devise a versioning system and always place the version number in a variable that is displayed somewhere in the application. Updating this number every time you make a change doesn't matter, but ensure that it's updated for every release to the user and ties in with your own release logs. Do stringing in the script and not in screen objects We would have put this in anyway, but its place in the article was assured by a recent experience on a user site. They wanted four lines of address and a postcode strung together in a single field, with each part separated by a comma and a space. However, any field could contain nulls; so, to avoid addresses such as ',,,,' or ', Somewhere ,,,', there had be a check for null in every field as the fields were strung together. The table only contained about 350 rows, but it took 56 seconds to refresh on screen when the work was done in an expression in a straight table. Moving the expression to the script and presenting just the resulting single field on screen took only 0.14 seconds. (That's right; it's about a seventh of a second). Plus, it didn't adversely affect script performance. We can't think of a better example of improving screen performance. The surprising data sources QlikView will read database tables, spreadsheets, XML files, and text files, but did you know that it can also take data from a web page? If you need some standard data from the Internet, there's no need to create your own version. Just grab it from a web page! How about ISO Country Codes? Here's an example. Open the script and click on Web files… below Data from Filesto the right of the bottom section of the screen. This will open the File Wizard: Source dialogue, as in the following screenshot. 
Enter the URL where the table of data resides: Then, click on Next and in this case, select @2 under Tables, as shown in the following screenshot: Click on Finish and your script will look something similar to this: LOAD F1, Country, A2, A3, Number FROM [http://www.airlineupdate.com/content_public/codes/misc_codes/icao _nat.htm] (html, codepage is 1252, embedded labels, table is @2); Now, you've got a great lookup table in about 30 seconds; it will take another few seconds to clean it up for your own purposes. One small caveat though—web pages can change address, content, and structure, so it's worth putting in some validation around this if you think there could be any volatility. Include files We have already said that you should use MUST_INCLUDE rather than INCLUDE, but we're always surprised that many developers never use include files at all. If the same code needs to be used in more than one place, it really should be in an include file. Suppose that you have several documents that use C:QlikFilesFinanceBudgets.xlsx and that the folder name is hard coded in all of them. As soon as the file is moved to another location, you will have several modifications to make, and it's easy to miss changing a document because you may not even realise it uses the file. The solution is simple, very effective, and guaranteed to save you many reload failures. Instead of coding the full folder name, create something similar to this: LET vBudgetFolder='C:QlikFilesFinance'; Put the line into an include file, for instance, FolderNames.inc. Then, code this into each script as follows: $(MUST_INCLUDE=FolderNames.inc) Finally, when you want to refer to your Budgets.xlsx spreadsheet, code this: $(vBudgetFolder)Budgets.xlsx Now, if the folder path has to change, you only need to change one line of code in the include file, and everything will work fine as long as you implement include files in all your documents. Note that this works just as well for folders containing QVD files and so on. You can also use this technique to include LOAD from QVDs or spreadsheets because you should always aim to have just one version of the truth. Change logs Unfortunately, one of the things QlikView is not great at is version control. It can be really hard to see what has been done between versions of a document, and using the -prj folder feature can be extremely tedious and not necessarily helpful. So, this means that you, as the developer, need to maintain some discipline over version control. To do this, ensure that you have an area of comments that looks something similar to this right at the top of your script: // Demo.qvw // // Roger Stone - One QV Ltd - 04-Jul-2015 // // PURPOSE // Sample code for QlikView Unlocked - Chapter 6 // // CHANGE LOG // Initial version 0.1 // - Pull in ISO table from Internet and local Excel data // // Version 0.2 // Remove unused fields and rename incoming ISO table fields to // match local spreadsheet // Ensure that you update this every time you make a change. You could make this even more helpful by explaining why the change was made and not just what change was made. You should also comment the expressions in charts when they are changed. Summary In this article, we covered few coding tips, the surprising data sources, include files, and change logs. Resources for Article: Further resources on this subject: Qlik Sense's Vision [Article] Securing QlikView Documents [Article] Common QlikView script errors [Article]

Understanding Text Search and Hierarchies in SAP HANA

Packt
20 Oct 2015
9 min read
In this article by Vinay Singh, author of the book Real Time Analytics with SAP HANA, this article covers Full Text Search and hierarchies in SAP HANA, and how to create and use them in our data models. After completing this article, you should be able to: Create and use Full Text Search Create hierarchies—level and parent child hierarchies (For more resources related to this topic, see here.) Creating and using Full Text Search Before we proceed with the creation and use of Full Text Search, let's quickly go through the basic terms associated with it. They are as follows: Text Analysis: This is the process of analyzing unstructured text, extracting relevant information, and then transforming this information into structure information that can be leveraged in different ways. The scripts provide additional possibilities to analyze strings or large text columns by providing analysis rules for many industries in many languages in SAP HANA. Full Text Search: This capability of HANA helps to speed up search capabilities within large amounts of text data significantly. The primary function of Full Text Search is to optimize linguistic searches. Fuzzy Search: This functionality enables to find strings that match a pattern approximately (rather than exactly). It's a fault-tolerant search, meaning that a query returns records even if the search term contains additional or missing characters, or even spelling mistakes. It is an alternative to a non-fault tolerant SQL statement. The score() function: When using contains() in the where clause of a select statement, the score() function can be used to retrieve the score. This is a numeric value between 0.0 and 1.0. The score defines the similarity between the user input and the records returned by the search. A score of 0.0 means that there is no similarity. The higher the score, the more similar a record is to the search input. Some of the applied applications of fuzzy search could be: Fault-tolerant check for duplicate records. Its helps to prevent duplication entry in Systems by searching similar entries. Fault-tolerant search in text columns—for example, search documents on diode and find all documents that contain the term "triode". Fault-tolerant search in structure database content search for rhyming words, for example coffee Krispy biscuit and find toffee crisp biscuits (the standard example given by SAP). Let's see what are the use cases for text search: Combining structure and unstructured data Medicine and healthcare Patents Brand monitoring and the buying pattern of consumer Real-time analytics on a large volume of data Data from social media Finance data Sales optimization Monitoring and production planning The results of text analysis are stored in a table and therefore, can be leveraged in all the HANA- supported scenarios: Standard Analytics: Create analytical views and calculation views on top. For example, companies mentioned in news articles over time. Data mining, predictive: Using R, Predictive Analysis Library (PAL) functions. For example, clustering, time series analysis, and so on. Search-based applications: Create a search model and build a search UI with the HANA Info Access (InA) toolkit for HTML5. Text analysis results can be used to navigate and filter search results. For example, People finder, search UI for internal documents. The capabilities of HANA Full Text Search and text analysis are as follows: Native full text search Database text analysis The graphical modeling of search models Info Access toolkit for HTML5 UIs. 
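As a quick illustration of the contains() and score() pattern defined above — the table and column used here (CUSTOMERS, CUST_NAME) and the misspelled search term are hypothetical placeholders, not objects used later in this article — a fuzzy search can rank matches by similarity like this:

-- CUSTOMERS/CUST_NAME stand in for any table with a fuzzy search index on a text column
SELECT SCORE() AS similarity,
       CUST_NAME
  FROM CUSTOMERS
 WHERE CONTAINS(CUST_NAME, 'Jon Smiht', FUZZY(0.8))
 ORDER BY similarity DESC;

Exact matches score 1.0, while misspelled but similar entries still appear with a lower score, which is what makes the fault-tolerant duplicate check described earlier possible.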
The benefits of full text search: Extract unstructured content with no additional cost Combine structure and unstructured information for unified information access Less data duplication and transfer Harness the benefit of InA (Info Access toolkit ) for an HTML5 application The following are the supported data types by fuzzy search: Short text Text VARCHAR NVARCHAR Date Data with full text index. Enabling search option Before we can use the search option in any attribute or analytical view, we will need to enable this functionality in the SAP HANA Studio Preferences as shown in the following screenshot: We are well prepared to move ahead with the creation and use of Full Text search. Let's do this step by step as follows: Create the table that we will use to perform the Full Text Search statements: Create Schema <DEMO>; // I am creating , it would be already present from our previous exercises. SET SCHEMA DEMO; // Set the schema name Create a Column Table including FUZZY SEARCH indexed columns. DROP TABLE DEMO.searchtbl_FUZZY; CREATE COLUMN TABLE DEMO.searchtbl_FUZZY ( CUST_NAME TEXT FUZZY SEARCH INDEX ON, CUST_COUNTY TEXT FUZZY SEARCH INDEX ON, CUST_DEPT TEXT FUZZY SEARCH INDEX ON, ); Prepare the fuzzy search logic (SQL logic): Search for customers in the countries that contain the 'MAIN' word: SELECT score() AS score, * FROM searchtbl_FUZZY WHERE CONTAINS(cust_county, 'MAIN'); Search for customers in the countries that contain the 'MAIN' word but with Fuzzy parameter 0.4 SELECT score() AS score, * FROM searchtbl_FUZZY WHERE CONTAINS(cust_county, 'West', FUZZY(0.3)); Perform a fuzzy search for a customer working in a department that includes the department word : SELECT highlighted(cust_dept), score() AS score, * FROM searchtbl_FUZZY WHERE CONTAINS(cust_dept, 'Department', FUZZY(0.5)); Fuzzy search for all the columns by looking for the customer word: SELECT score() AS score, * FROM searchtbl_FUZZY WHERE CONTAINS(*, 'Customer', FUZZY(0.5)); Creating hierarchies Hierarchies are created to maintain data in a structured format, such as maintaining customer or employee data based on their roles and splitting the data based on geographies. Hierarchical data is very useful for organizational purposes during decision making. Two types of hierarchies can be created in SAP HANA: The level hierarchy Parent-child hierarchy The hierarchies are initially created in the attribute view and later can be combined in the analytic view or calculation view for consumption in a report as per business requirements. Let's create both types of hierarchies in attribute views. Creating level hierarchy Each level represents a position in the hierarchy. For example, a time dimension might have a hierarchy that represents data at the month, quarter, and year levels. Each level above the base level contains aggregate values for the levels below it. Create a new attribute view (for your own practice, I would suggest you to create a new one). You can also use an existing one. Use the SNWD_PD EPM sample tables. In output view, mark the following as output: In the semantic node of the view, create new hierarchy as shown in the following screenshot and fill the details: Save and Activate the view. Now the hierarchy is ready to be used in an analytical view. Add a client and node key again as output to your attribute view that you just created, that is AT_LEVEL_HIERARCY_DEMO, as we will use these two fields in Create an analytical view. It should look like the following screenshot. 
Add the attribute view created in the preceding step and the SNWD_SO_I table to the data foundation: Join client to client and product guide to node key:  Save and activate. Go to MS Excel | All Programs | Microsoft Office | Microsoft Excel 2010 then go to Data tab | From Other Sources | From Data Connection Wizard. You will get a new popup for Data Connection Wizard | Other/Advanced | SAP HANA MDX Provider: You will be asked to provide the connection details, fill the details, and test the connection (these are the same details that you used while adding the system to SAP HANA Studio). Data Connection Wizard will now ask you to choose the analytical view (choose the one that you just created in the preceding step): The preceding steps will take you to an excel sheet and you will see data as per the choices that you chose in the Pivot table field list: Create parent-child hierarchy The parent-child hierarchy is a simple, two-level hierarchy where the child element has an attribute containing the parent element. These two columns define the hierarchical relationships among the members of the dimension. The first column, called the member key column, identifies each dimension member. The other column, called the parent column, identifies the parent of each dimension member. The parent attribute determines the name of each level in the parent-child hierarchy and determines whether the data for parent members should be displayed  Let's create a parent-child hierarchy using the following steps: Create an attribute view. Create a table that has the parent-child information: The following is the sample code and the insert statement: CREATE COLUMN TABLE "DEMO"."CCTR_HIE"( "CC_CHILD" NVARCHAR(4), "CC_PARENT" NVARCHAR(4)); insert into "DEMO"."CCTR_HIE" values('','') insert into "DEMO"."CCTR_HIE" values('C11','c1'); insert into "DEMO"."CCTR_HIE" values('C12','c1'); insert into "DEMO"."CCTR_HIE" values('C13','c1'); insert into "DEMO"."CCTR_HIE" values('C14','c2'); insert into "DEMO"."CCTR_HIE" values('C21','c2'); insert into "DEMO"."CCTR_HIE" values('C22','c2'); insert into "DEMO"."CCTR_HIE" values('C31','c3'); insert into "DEMO"."CCTR_HIE" values('C1','c'); insert into "DEMO"."CCTR_HIE" values('C2','c'); insert into "DEMO"."CCTR_HIE" values('C3','c'); We will put the preceding table into our data foundation of attribute view as follows: Make CC_CHILD as the key attribute. Now let's create new hierarchy as shown in the following screenshot: Save and activate the hierarchy. Create a new analytical view and add the HIE_PARENT_CHILD_DEMO view and the CCTR_COST table in data foundation. Join CCTR to CCTR_CILD with many is to one relationship. Make sure that in the semantic node, COST is set as a measure. Save and Activate the analytical view. Preview the data. As per the business need, we can use one of the two hierarchies along with attribute view or analytical view. Summary In this article, we took a deep dive into Full Text Search, fuzzy logic, and hierarchies concepts. We learned how to create and use text search and fuzzy logic. The parent-child and level hierarchies were discussed in detail with a hands-on approach on both. Resources for Article: Further resources on this subject: Sabermetrics with Apache Spark [article] Meeting SAP Lumira [article] Achieving High-Availability on AWS Cloud [article]

An Overview of Oozie

Packt
19 Oct 2015
5 min read
In this article by Jagat Singh, the author of the book Apache Oozie Essentials, we will see a basic overview of Oozie and its concepts in brief. (For more resources related to this topic, see here.)
Concepts
Oozie is a workflow scheduler system to run Apache Hadoop jobs. Oozie workflow jobs are Directed Acyclic Graph (DAG) (https://en.wikipedia.org/wiki/Directed_acyclic_graph) representations of actions. Actions tell the job what to do. Oozie supports running jobs of various types, such as Java, MapReduce, Pig, Hive, Sqoop, Spark, and DistCp. The output of one action can be consumed by the next action to create a chained sequence. Oozie has a client-server architecture: we install the server, which stores the jobs, and we use the client to submit jobs to the server. Let's get an idea of a few basic concepts of Oozie.
Workflow
A workflow tells Oozie 'what' to do. It is a collection of actions arranged in the required dependency graph. As part of a workflow definition, we write some actions and call them in a certain order. These come in various types for the tasks we can do as part of a workflow, for example, the Hadoop filesystem action, Pig action, Hive action, MapReduce action, Spark action, and so on.
Coordinator
A coordinator tells Oozie 'when' to do it. Coordinators let us run inter-dependent workflows as data pipelines based on some starting criteria. Most Oozie jobs are triggered at a given scheduled time interval, or when an input dataset is present for triggering the job. The following are important definitions related to coordinators:
Nominal time: The scheduled time at which the job should execute. For example, we process press releases every day at 8:00 PM.
Actual time: The real time when the job ran. In some cases, if the input data does not arrive, the job might start late. This type of data-dependent job triggering is indicated by a done-flag (more on this later). The done-flag gives the signal to start the job execution.
The general skeleton template of a coordinator is shown in the following figure:
Bundles
Bundles tell Oozie which things to do together as a group. For example, a set of coordinators that can be run together to satisfy a given business requirement can be combined as a bundle.
Book case study
One of the main use cases of Hadoop is ETL data processing. Suppose that we work for a large consulting company and have won a project to set up a Big Data cluster inside a customer's data center. At a high level, the requirements are to set up an environment that will satisfy the following flow:
We get data from various sources into Hadoop (file-based loads, Sqoop-based loads)
We preprocess it with various scripts (Pig, Hive, MapReduce)
We insert that data into Hive tables for use by analysts and data scientists
Data scientists write machine learning models (Spark)
We will be using Oozie as our processing scheduling system to do all of the above. In our architecture, we have one landing server, which sits outside as the front door of the cluster. All source systems send files to us via scp, and we regularly (for example, nightly, to keep it simple) push them to HDFS using the hadoop fs -copyFromLocal command. This script is cron driven. It has very simple business logic: it runs every night at 8:00 PM and moves all the files it sees on the landing server into HDFS.
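The book does not list this push script itself; the following is only a rough sketch of what such a nightly job might look like, written here in Python. The landing directory, the HDFS target path, and the choice of Python over a shell script are all assumptions for illustration; the only real requirement is that something scheduled by cron wraps the hadoop fs -copyFromLocal command.

#!/usr/bin/env python
# Hypothetical sketch of the nightly landing-server push described above.
# LANDING_DIR and HDFS_TARGET are made-up paths; adjust them to your environment.
import os
import subprocess

LANDING_DIR = "/data/landing"   # local directory where source systems scp their files
HDFS_TARGET = "/landing"        # HDFS directory that the Oozie workflow will read

def push_files():
    for name in os.listdir(LANDING_DIR):
        local_path = os.path.join(LANDING_DIR, name)
        if not os.path.isfile(local_path):
            continue
        # Copy the file into HDFS; check_call raises an error if the copy fails.
        subprocess.check_call(
            ["hadoop", "fs", "-copyFromLocal", local_path, HDFS_TARGET])
        # Remove the local copy only after it has landed safely in HDFS.
        os.remove(local_path)

if __name__ == "__main__":
    push_files()

A crontab entry along the lines of 0 20 * * * /opt/scripts/push_to_hdfs.py (a hypothetical path) would then run it every night at 8:00 PM, matching the schedule described above.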
The Oozie part then works as follows:
Oozie picks up the files and cleans them using a Pig script that replaces all the comma (,) delimiters with pipes (|). We will write the same code using both Pig and MapReduce.
We then push those processed files into a Hive table.
For a different source system, which is a database-based MySQL table, we do a nightly Sqoop import when the load on the database is light. We extract all the records that were generated on the previous business day, and that output is also inserted into Hive tables.
Analysts and data scientists write their magical Hive scripts and Spark machine learning models on those Hive tables.
We will use Oozie to schedule all of these regular tasks.
Node types
A workflow is composed of nodes; the logical DAG of nodes represents the 'what' part of the work done by Oozie. Each node does its specified work and, on success, moves to one node or, on failure, moves to another node. For example, on success it goes to the OK node and on failure it goes to the Kill node. Nodes in an Oozie workflow are of the following types.
Control flow nodes
These nodes are responsible for defining the start, the end, and the control flow of what to do inside the workflow. They can be one of the following:
Start node
End node
Kill node
Decision node
Fork and Join nodes
Action nodes
Action nodes represent the actual processing tasks, which are executed when called. These are of various types, for example, the Pig action, Hive action, and MapReduce action.
Summary
In this article, we looked at the concepts of Oozie in brief. We also learned about the types of nodes in Oozie.
Resources for Article: Further resources on this subject: Introduction to Hadoop [article] Hadoop and HDInsight in a Heartbeat [article] Cloudera Hadoop and HP Vertica [article]

SQL Server with PowerShell

Packt
19 Oct 2015
8 min read
In this article by Donabel Santos, author of the book, SQL Server 2014 with Powershell v5 Cookbook explains scripts and snippets of code that accomplish basic SQL Server tasks using PowerShell. She discusses simple tasks such as Listing SQL Server Instances and Discovering SQL Server Services to make you comfortable working with SQL Server programmatically. However, even if ever you explore how to create some common database objects using PowerShell, keep in mind that PowerShell will not always be the best tool for the task. There will be tasks that are best completed using T-SQL. It is still good to know what is possible in PowerShell and how to do them, so you know that you have alternatives depending on your requirements or situation. For the recipes, we are going to use PowerShell ISE quite a lot. If you prefer running the script from the PowerShell console rather run running the commands from the ISE, you can save the scripts in a .ps1 file and run it from the PowerShell console. (For more resources related to this topic, see here.) Listing SQL Server Instances In this recipe, we will list all SQL Server Instances in the local network. Getting ready Log in to the server that has your SQL Server development instance as an administrator. How to do it... Let's look at the steps to list your SQL Server instances: Open PowerShell ISE as administrator. Let's use the Start-Service cmdlet to start the SQL Browser service: Import-Module SQLPS -DisableNameChecking #out of the box, the SQLBrowser is disabled. To enable: Set-Service SQLBrowser -StartupType Automatic #sql browser must be installed and running for us #to discover SQL Server instances Start-Service "SQLBrowser" Next, you need to create a ManagedComputer object to get access to instances. Type the following script and run: $instanceName = "localhost" $managedComputer = New-Object Microsoft.SqlServer.Management.Smo.Wmi.ManagedComputer $instanceName #list server instances $managedComputer.ServerInstances Your result should look similar to the one shown in the following screenshot: Notice that $managedComputer.ServerInstances gives you not only instance names, but also additional properties such as ServerProtocols, Urn, State, and so on. Confirm that these are the same instances you see from SQL Server Management Studio. Open SQL Server Management Studio. Go to Connect | Database Engine. In the Server Name dropdown, click on Browse for More. Select the Network Servers tab and check the instances listed. Your screen should look similar to this: How it works... All services in a Windows operating system are exposed and accessible using Windows Management Instrumentation (WMI). WMI is Microsoft's framework for listing, setting, and configuring any Microsoft-related resource. This framework follows Web-based Enterprise Management (WBEM). The DISTRIBUTED MANAGEMENT TASK FORCE, INC. (http://www.dmtf.org/standards/wbem) defines WBEM as follows: A set of management and Internet standard technologies developed to unify the management of distributed computing environments. WBEM provides the ability for the industry to deliver a well-integrated set of standard-based management tools, facilitating the exchange of data across otherwise disparate technologies and platforms. 
In order to access SQL Server WMI-related objects, you can create a WMI ManagedComputer instance: $managedComputer = New-Object Microsoft.SqlServer.Management.Smo.Wmi.ManagedComputer $instanceName The ManagedComputer object has access to a ServerInstance property, which in turn lists all available instances in the local network. These instances however are only identifiable if the SQL Server Browser service is running. The SQL Server Browser is a Windows Service that can provide information on installed instances in a box. You need to start this service if you want to list the SQL Server-related services. There's more... The Services instance of the ManagedComputer object can also provide similar information, but you will have to filter for the server type SqlServer: #list server instances $managedComputer.Services | Where-Object Type –eq "SqlServer" | Select-Object Name, State, Type, StartMode, ProcessId Your result should look like this: Instead of creating a WMI instance by using the New-Object method, you can also use the Get-WmiObject cmdlet when creating your variable. Get-WmiObject, however, will not expose exactly the same properties exposed by the Microsoft.SqlServer.Management.Smo.Wmi.ManagedComputer object. To list instances using Get-WmiObject, you will need to discover what namespace is available in your environment: $hostName = "localhost" $namespace = Get-WMIObject -ComputerName $hostName -Namespace rootMicrosoftSQLServer -Class "__NAMESPACE" | Where-Object Name -like "ComputerManagement*" #see matching namespace objects $namespace #see namespace names $namespace | Select-Object -ExpandProperty "__NAMESPACE" $namespace | Select-Object -ExpandProperty "Name" If you are using PowerShell v2, you will have to change the Where-Object cmdlet usage to use the curly braces {} and the $_ variable: Where-Object {$_.Name -like "ComputerManagement*" } For SQL Server 2014, the namespace value is: ROOTMicrosoftSQLServerComputerManagement12 This value can be derived from $namespace.__NAMESPACE and $namespace.Name. Once you have the namespace, you can use this with Get-WmiObject to retrieve the instances. We can use the SqlServiceType property to filter. According to MSDN (http://msdn.microsoft.com/en-us/library/ms179591.aspx), these are the values of SqlServiceType: SqlServiceType Description 1 SQL Server Service 2 SQL Server Agent Service 3 Full-Text Search Engine Service 4 Integration Services Service 5 Analysis Services Service 6 Reporting Services Service 7 SQL Browser Service Thus, to retrieve the SQL Server instances, we need to provide the full namespace ROOTMicrosoftSQLServerComputerManagement12. We also need to filter for SQL Server Service type, or SQLServiceType = 1. The code is as follows: Get-WmiObject -ComputerName $hostName -Namespace "$($namespace.__NAMESPACE)$($namespace.Name)" -Class SqlService | Where-Object SQLServiceType -eq 1 | Select-Object ServiceName, DisplayName, SQLServiceType | Format-Table –AutoSize Your result should look similar to the following screenshot: Yet another way to list all the SQL Server instances in the local network is by using the System.Data.Sql.SQLSourceEnumerator class, instead of ManagedComputer. 
This class has a static method called Instance.GetDataSources that will list all SQL Server instances: [System.Data.Sql.SqlDataSourceEnumerator]: :Instance.GetDataSources() | Format-Table -AutoSize When you execute, your result should look similar to the following: If you have multiple SQL Server versions, you can use the following code to display your instances: #list services using WMI foreach ($path in $namespace) { Write-Verbose "SQL Services in:$($path.__NAMESPACE)$($path.Name)" Get-WmiObject -ComputerName $hostName ` -Namespace "$($path.__NAMESPACE)$($path.Name)" ` -Class SqlService | Where-Object SQLServiceType -eq 1 | Select-Object ServiceName, DisplayName, SQLServiceType | Format-Table –AutoSize } Discovering SQL Server Services In this recipe, we will enumerate all SQL Server Services and list their statuses. Getting ready Check which SQL Server services are installed in your instance. Go to Start | Run and type services.msc. You should see a screen similar to this: How to do it... Let's assume you are running this script on the server box: Open PowerShell ISE as administrator. Add the following code and execute: Import-Module SQLPS -DisableNameChecking #you can replace localhost with your instance name $instanceName = "localhost" $managedComputer = New-Object Microsoft.SqlServer.Management.Smo.Wmi.ManagedComputer $instanceName #list services $managedComputer.Services | Select-Object Name, Type, ServiceState, DisplayName | Format-Table -AutoSize Your result will look similar to the one shown in the following screenshot: Items listed in your screen will vary depending on the features installed and running in your instance Confirm that these are the services that exist in your server. Check your services window. How it works... Services that are installed on a system can be queried using WMI. Specific services for SQL Server are exposed through SMO's WMI ManagedComputer object. Some of the exposed properties are as follows: ClientProtocols ConnectionSettings ServerAliases ServerInstances Services There's more... An alternative way to get SQL Server-related services is by using Get-WMIObject. We will need to pass in the host name as well as the SQL Server WMI Provider for the ComputerManagement namespace. For SQL Server 2014, this value is ROOTMicrosoftSQLServerComputerManagement12. The script to retrieve the services is provided here. Note that we are dynamically composing the WMI namespace. The code is as follows: $hostName = "localhost" $namespace = Get-WMIObject -ComputerName $hostName -NameSpace rootMicrosoftSQLServer -Class "__NAMESPACE" | Where-Object Name -like "ComputerManagement*" Get-WmiObject -ComputerName $hostname -Namespace "$($namespace.__NAMESPACE)$($namespace.Name)" -Class SqlService | Select-Object ServiceName If you have multiple SQL Server versions installed and want to see just the most recent version's services, you can limit to the latest namespace by adding Select-Object –Last 1: $namespace = Get-WMIObject -ComputerName $hostName -NameSpace rootMicrosoftSQLServer -Class "__NAMESPACE" | Where-Object Name -like "ComputerManagement*" | Select-Object –Last 1 Yet another alternative but less accurate way of listing possible SQL Server related services is the following snippet of code: #alterative - but less accurate Get-Service *SQL* This uses the Get-Service cmdlet and filters base on the service name. This is less accurate because this grabs all processes that have SQL in the name, but may not necessarily be related to SQL Server. 
For example, if you have MySQL installed, it will get picked up as a process. Conversely, this will not pick up SQL Server-related services that do not have SQL in the name, such as ReportServer.
Summary
You will find that many of these scripted tasks can be accomplished using PowerShell and SQL Management Objects (SMO). SMO is a library that exposes SQL Server classes that allow programmatic manipulation and automation of many database tasks. For some tasks, we will also explore alternative ways of accomplishing the same goals using different native PowerShell cmdlets. Now that we have a gist of SQL Server 2014 with PowerShell, let's build a full-fledged e-commerce project with SQL Server 2014 with Powershell v5 Cookbook.
Resources for Article: Further resources on this subject: Exploring Windows PowerShell 5.0 [article] Working with PowerShell [article] Installing/upgrading PowerShell [article]

Introducing Test-driven Machine Learning

Packt
14 Oct 2015
19 min read
In this article by Justin Bozonier, the author of the book Test Driven Machine Learning, we will see how to develop complex software (sometimes rooted in randomness) in small, controlled steps also it will guide you on how to begin developing solutions to machine learning problems using test-driven development (from here, this will be written as TDD). Mastering TDD is not something the book will achieve. Instead, the book will help you begin your journey and expose you to guiding principles, which you can use to creatively solve challenges as you encounter them. We will answer the following three questions in this article: What are TDD and behavior-driven development (BDD)? How do we apply these concepts to machine learning, and making inferences and predictions? How does this work in practice? (For more resources related to this topic, see here.) After having answers to these questions, we will be ready to move onto tackling real problems. The book is about applying these concepts to solve machine learning problems. This article is the largest theoretical explanation that we will have with the remainder of the theory being described by example. Due to the focus on application, you will learn much more than what you can learn about the theory of TDD and BDD. To read more about the theory and ideals, search the internet for articles written by the following: Kent Beck—The father of TDD Dan North—The father of BDD Martin Fowler—The father of refactoring, he has also created a large knowledge base, on these topics James Shore—one of the author of The Art of Agile Development, has a deep theoretical understanding of TDD, and explains the practical value of it quite well These concepts are incredibly simple and yet can take a lifetime to master. When applied to machine learning, we must find new ways to control and/or measure the random processes inherent in the algorithm. This will come up in this article as well as others. In the next section, we will develop a foundation for TDD and begin to explore its application. Test-driven development Kent Beck wrote in his seminal book on the topic that TDD consists of only two specific rules, which are as follows: Don't write a line of new code unless you first have a failing automated test Eliminate duplication This as he noted fairly quickly leads us to a mantra, really the mantra of TDD: Red, Green, Refactor. If this is a bit abstract, let me restate it that TDD is a software development process that enables a programmer to write code that specifies the intended behavior before writing any software to actually implement the behavior. The key value of TDD is that at each step of the way, you have working software as well as an itemized set of specifications. TDD is a software development process that requires the following: Writing code to detect the intended behavioral change. Rapid iteration cycle that produces working software after each iteration. It clearly defines what a bug is. If a test is not failing but a bug is found, it is not a bug. It is a new feature. Another point that Kent makes is that ultimately, this technique is meant to reduce fear in the development process. Each test is a checkpoint along the way to your goal. If you stray too far from the path and wind up in trouble, you can simply delete any tests that shouldn't apply, and then work your code back to a state where the rest of your tests pass. There's a lot of trial and error inherent in TDD, but the same matter applies to machine learning. 
The software that you design using TDD will also be modular enough to be able to have different components swapped in and out of your pipeline. You might be thinking that just thinking through test cases is equivalent to TDD. If you are like the most people, what you write is different from what you might verbally say, and very different from what you think. By writing the intent of our code before we write our code, it applies a pressure to our software design that prevents you from writing "just in case" code. By this I mean the code that we write just because we aren't sure if there will be a problem. Using TDD, we think of a test case, prove that it isn't supported currently, and then fix it. If we can't think of a test case, we don't add code. TDD can and does operate at many different levels of the software under development. Tests can be written against functions and methods, entire classes, programs, web services, neural networks, random forests, and whole machine learning pipelines. At each level, the tests are written from the perspective of the prospective client. How does this relate to machine learning? Lets take a step back and reframe what I just said. In the context of machine learning, tests can be written against functions, methods, classes, mathematical implementations, and the entire machine learning algorithms. TDD can even be used to explore technique and methods in a very directed and focused manner, much like you might use a REPL (an interactive shell where you can try out snippets of code) or the interactive (I)Python session. The TDD cycle The TDD cycle consists of writing a small function in the code that attempts to do something that we haven't programmed yet. These small test methods will have three main sections; the first section is where we set up our objects or test data; another section is where we invoke the code that we're testing; and the last section is where we validate that what happened is what we thought would happen. You will write all sorts of lazy code to get your tests to pass. If you are doing it right, then someone who is watching you should be appalled at your laziness and tiny steps. After the test goes green, you have an opportunity to refactor your code to your heart's content. In this context, refactor refers to changing how your code is written, but not changing how it behaves. Lets examine more deeply the three steps of TDD: Red, Green, and Refactor. Red First, create a failing test. Of course, this implies that you know what failure looks like in order to write the test. At the highest level in machine learning, this might be a baseline test where baseline is a better than random test. It might even be predicts random things, or even simpler always predicts the same thing. Is this terrible? Perhaps, it is to some who are enamored with the elegance and artistic beauty of his/her code. Is it a good place to start, though? Absolutely. A common issue that I have seen in machine learning is spending so much time up front, implementing the one true algorithm that hardly anything ever gets done. Getting to outperform pure randomness, though, is a useful change that can start making your business money as soon as it's deployed. Green After you have established a failing test, you can start working to get it green. If you start with a very high-level test, you may find that it helps to conceptually break that test up into multiple failing tests that are the lower-level concerns. 
I'll dive deeper into this later on in this article but for now, just know that you want to get your test passing as soon as possible; lie, cheat, and steal to get there. I promise that cheating actually makes your software's test suite that much stronger. Resist the urge to write the software in an ideal fashion. Just slap something together. You will be able to fix the issues in the next step. Refactor You got your test to pass through all the manners of hackery. Now, you get to refactor your code. Note that it is not to be interpreted loosely. Refactor specifically means to change your software without affecting its behavior. If you add the if clauses, or any other special handling, you are no longer refactoring. Then you write the software without tests. One way where you will know for sure that you are no longer refactoring is that you've broken previously passing tests. If this happens, we back up our changes until our tests pass again. It may not be obvious but this isn't all that it takes for you to know that you haven't changed behavior. Read Refactoring: Improving the Design of Existing Code, Martin Fowler for you to understand how much you should really care for refactoring. By the way of his illustration in this book, refactoring code becomes a set of forms and movements not unlike karate katas. This is a lot of general theory, but what does a test actually look like? How does this process flow in a real problem? Behavior-driven development BDD is the addition of business concerns to the technical concerns more typical of TDD. This came about as people became more experienced with TDD. They started noticing some patterns in the challenges that they were facing. One especially influential person, Dan North, proposed some specific language and structure to ease some of these issues. Some issues he noticed were the following: People had a hard time understanding what they should test next. Deciding what to name a test could be difficult. How much to test in a single test always seemed arbitrary. Now that we have some context, we can define what exactly BDD is. Simply put, it's about writing our tests in such a way that they will tell us the kind of behavior change they affect. A good litmus test might be asking oneself if the test you are writing would be worth explaining to a business stakeholder. How this solves the previous may not be completely obvious, but it may help to illustrate what this looks like in practice. It follows a structure of given, when, then. Committing to this style completely can require specific frameworks or a lot of testing ceremony. As a result, I loosely follow this in my tests as you will see soon. Here's a concrete example of a test description written in this style Given an empty dataset when the classifier is trained, it should throw an invalid operation exception. This sentence probably seems like a small enough unit of work to tackle, but notice that it's also a piece of work that any business user, who is familiar with the domain that you're working in, would understand and have an opinion on. You can read more about Dan North's point of view in this article on his website at dannorth.net/introducing-bdd/. The BDD adherents tend to use specialized tools to make the language and test result reports be as accessible to business stakeholders as possible. In my experience and from my discussions with others, this extra elegance is typically used so little that it doesn't seem worthwhile. 
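To make that given/when/then sentence concrete, here is a rough sketch of how it might turn into a plain nose-style test, written in the loose style described above rather than with a dedicated BDD framework. The NaiveClassifier and EmptyDatasetError names are invented purely for illustration; they are not code from the book.

# Hypothetical illustration: the classifier and the exception are made-up stand-ins.
class EmptyDatasetError(Exception):
    pass

class NaiveClassifier(object):
    def train(self, dataset):
        # Training on nothing is a programming error, so refuse loudly.
        if not dataset:
            raise EmptyDatasetError("cannot train on an empty dataset")

def given_an_empty_dataset_when_the_classifier_is_trained_test():
    # given
    classifier = NaiveClassifier()
    empty_dataset = []
    # when
    try:
        classifier.train(empty_dataset)
        raised = False
    except EmptyDatasetError:
        raised = True
    # then
    assert raised, 'Then it should raise an error about the empty dataset.'

Reading the test name together with the assert message tells the whole story, which is exactly what the given/when/then wording is meant to achieve.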
The approach you will learn in the book will take a simplicity first approach to make it as easy as possible for someone with zero background to get up to speed. With this in mind, lets work through an example. Our first test Let's start with an example of what a test looks like in Python. The main reason for using this is that while it is a bit of a pain to install a library, this library, in particular, will make everything that we do much simpler. The default unit test solution in Python requires a heavier set up. On top of this, by using nose, we can always mix in tests that use the built-in solution where we find that we need the extra features. First, install it like this: pip install nose If you have never used pip before, then it is time for you to know that it is a very simple way to install new Python libraries. Now, as a hello world style example, lets pretend that we're building a class that will guess a number using the previous guesses to inform it. This is the first simplest example to get us writing some code. We will use the TDD cycle that we discussed previously, and write our first test in painstaking detail. After we get through our first test and have something concrete to discuss, we will talk about the anatomy of the test that we wrote. First, we must write a failing test. The simplest failing test that I can think of is the following: def given_no_information_when_asked_to_guess_test(): number_guesser = NumberGuesser() result = number_guesser.guess() assert result is None, "Then it should provide no result." The context for assert is in the test name. Reading the test name and then the assert name should do a pretty good job of describing what is being tested. Notice that in my test, I instantiate a NumberGuesser object. You're not missing any steps; this class doesn't exist yet. This seems roughly like how I'd want to use it. So, it's a great place to start with. Since it doesn't exist, wouldn't you expect this test to fail? Lets test this hypothesis. To run the test, first make sure your test file is saved so that it ends in _tests.py. From the directory with the previous code, just run the following: nosetests When I do this, I get the following result: Here's a lot going on here, but the most informative part is near the end. The message is saying that NumberGuesser does not exist yet, which is exactly what I expected since we haven't actually written the code yet. Throughout the book, we'll reduce the detail of the stack traces that we show. For now, we'll keep things detailed to make sure that we're on the same page. At this point, we're in a red state in the TDD cycle. Use the following steps to create our first successful test: Now, create the following class in a file named NumberGuesser.py: class NumberGuesser: """Guesses numbers based on the history of your input"" Import the new class at the top of my test file with a simple import NumberGuesser statement. I rerun nosetests, and get the following: TypeError: 'module' object is not callable Oh whoops! I guess that's not the right way to import the class. This is another very tiny step, but what is important is that we are making forward progress through constant communication with our tests. We are going through extreme detail because I can't stress this point enough; bear with me for the time being. 
Change the import statement to the following: from NumberGuesser import NumberGuesser Rerun nosetests and you will see the following: AttributeError: NumberGuesser instance has no attribute 'guess' The error message has changed, and is leading to the next thing that needs to be changed. From here, just implement what we think we need for the test to pass: class NumberGuesser: """Guesses numbers based on the history of your input""" def guess(self): return None On rerunning the nosetests, we'll get the following result: That's it! Our first successful test! Some of these steps seem so tiny so as to not being worthwhile. Indeed, overtime, you may decide that you prefer to work on a different level of detail. For the sake of argument, we'll be keeping our steps pretty small if only to illustrate just how much TDD keeps us on track and guides us on what to do next. We all know how to write the code in very large, uncontrolled steps. Learning to code surgically requires intentional practice, and is worth doing explicitly. Lets take a step back and look at what this first round of testing took. Anatomy of a test Starting from a higher level, notice how I had a dialog with Python. I just wrote the test and Python complained that the class that I was testing didn't exist. Next, I created the class, but then Python complained that I didn't import it correctly. So then, I imported it correctly, and Python complained that my guess method didn't exist. In response, I implemented the way that my test expected, and Python stopped complaining. This is the spirit of TDD. You have a conversation between you and your system. You can work in steps as little or as large as you're comfortable with. What I did previously could've been entirely skipped over, and the Python class could have been written and imported correctly the first time. The longer you go without talking to the system, the more likely you are to stray from the path to getting things working as simply as possible. Lets zoom in a little deeper and dissect this simple test to see what makes it tick. Here is the same test, but I've commented it, and broken it into sections that you will see recurring in every test that you write: def given_no_information_when_asked_to_guess_test(): # given number_guesser = NumberGuesser() # when guessed_number = number_guesser.guess() # then assert guessed_number is None, 'there should be no guess.' Given This section sets up the context for the test. In the previous test, you acquired that I didn't provide any prior information to the object. In many of our machine learning tests, this will be the most complex portion of our test. We will be importing certain sets of data, sometimes making a few specific issues in the data and testing our software to handle the details that we would expect. When you think about this section of your tests, try to frame it as Given this scenario… In our test, we might say Given no prior information for NumberGuesser… When This should be one of the simplest aspects of our test. Once you've set up the context, there should be a simple action that triggers the behavior that you want to test. When you think about this section of your tests, try to frame it as When this happens… In our test we might say When NumberGuesser guesses a number… Then This section of our test will check on the state of our variables and any return result if applicable. 
Again, this section should also be fairly straightforward, as there should be only a single action that causes a change in the object under test. The reason for this is that if it takes two actions to form a test, then it is very likely that we will just want to combine the two into a single action that we can describe in terms that are meaningful in our domain. A key example may be loading the training data from a file and training a classifier. If we find ourselves doing this a lot, then why not just create a method that loads data from a file for us? In the book, you will find examples where we'll have helper functions help us determine whether our results have changed in certain ways. Typically, we should view these helper functions as code smells. Remember that our tests are the first applications of our software. Anything that we have to build in addition to our code, to understand the results, is something that we should probably (there are exceptions to every rule) just include in the code we are testing.
Given, When, Then is not a strong requirement of TDD, because our previous definition of TDD only consisted of two things (all that it requires is a failing test first and the elimination of duplication). It's a small thing to be passionate about, and if it doesn't speak to you, just translate this back into Arrange, Act, Assert in your head. At the very least, consider it, as well as why these specific, very deliberate words are used.
Applied to machine learning
At this point, you may be wondering how TDD will be used in machine learning, and whether we use it on regression or classification problems. In every machine learning algorithm, there exists a way to quantify the quality of what you're doing. In linear regression, it's your adjusted R2 value; in classification problems, it's an ROC curve (and the area beneath it) or a confusion matrix, and more. All of these are testable quantities. Of course, none of these quantities have a built-in way of saying that the algorithm is good enough. We can get around this by starting our work on every problem by first building up a completely naïve and ignorant algorithm. The scores that we get for this will basically represent plain, old, random chance. Once we have built an algorithm that can beat our random chance scores, we just start iterating, attempting to beat the next highest score that we achieve. Benchmarking algorithms is an entire field in its own right that can be delved into more deeply. In the book, we will implement a naïve algorithm to get a random chance score, and we will build up a small test suite that we can then use to pit this model against another. This will allow us to have a conversation with our machine learning models in the same manner as we had with Python earlier.
For a professional machine learning developer, it's quite likely that an ideal metric to test is a profitability model that compares risk (monetary exposure) to expected value (profit). This can help us keep a balanced view of how much error, and what kind of error, we can tolerate. In machine learning, we will never have a perfect model, and we could search for the rest of our lives for the best model. By finding a way to work your financial assumptions into the model, we will have an improved ability to decide between competing models.
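The book builds this comparison up with its own data and models; purely as an illustration of the shape such a test can take (and not the author's code), here is a sketch that pits a random-chance baseline against a stand-in candidate model, assuming scikit-learn is installed for the roc_auc_score helper:

# Illustrative sketch only: the data and the "model" are invented stand-ins.
import random
from sklearn.metrics import roc_auc_score

def load_evaluation_data():
    # Stand-in data: each example is a single number, and the label is 1 when it is large.
    examples = [random.uniform(0, 1) for _ in range(200)]
    labels = [1 if x > 0.5 else 0 for x in examples]
    return examples, labels

def naive_scores(examples):
    # The completely naive and ignorant baseline: score every example at random.
    return [random.random() for _ in examples]

def candidate_scores(examples):
    # Stand-in for a real model: it simply uses the feature itself as the score.
    return list(examples)

def model_beats_naive_baseline_test():
    # given some labelled evaluation data
    examples, labels = load_evaluation_data()
    # when both the baseline and the candidate score the same examples
    baseline_auc = roc_auc_score(labels, naive_scores(examples))
    candidate_auc = roc_auc_score(labels, candidate_scores(examples))
    # then the candidate must beat random chance
    assert candidate_auc > baseline_auc, 'The model should outperform the naive baseline.'

As better models come along, the baseline in a test like this simply becomes the best score achieved so far, which is the iterative beat-your-last-score loop described above.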
Summary
In this article, you were introduced to TDD as well as BDD. With these concepts introduced, you have a basic foundation with which to approach machine learning. We saw that specifying behavior in the form of sentences makes for an easier-to-read set of specifications for your software. Building off that foundation, we started to delve into testing at a higher level. We did this by establishing concepts that we can use to quantify classifiers: the ROC curve and the AUC metric. Now that we've seen that different models can be quantified, it follows that they can be compared. Putting all of this together, we have everything we need to explore machine learning with a test-driven methodology.
Resources for Article: Further resources on this subject: Optimization in Python [article] How to do Machine Learning with Python [article] Modeling complex functions with artificial neural networks [article]

Transactions and Operators

Packt
13 Oct 2015
14 min read
In this article by Emilien Kenler and Federico Razzoli, authors of the book MariaDB Essentials, we will look briefly at transactions and operators. (For more resources related to this topic, see here.)
Understanding transactions
A transaction is a sequence of SQL statements that are grouped into a single logical operation. Its purpose is to guarantee the integrity of data. If a transaction fails, no change will be applied to the databases. If a transaction succeeds, all of its statements will succeed. Take a look at the following example:
START TRANSACTION;
SELECT quantity FROM product WHERE id = 42;
UPDATE product SET quantity = quantity - 10 WHERE id = 42;
UPDATE customer SET money = money - (SELECT price FROM product WHERE id = 42) WHERE id = 512;
INSERT INTO product_order (product_id, quantity, customer_id) VALUES (42, 10, 512);
COMMIT;
We haven't yet discussed some of the statements used in this example. However, they are not important for understanding transactions. This sequence of statements occurs when a customer (whose id is 512) orders a product (whose id is 42). As a consequence, we need to execute the following suboperations in our database:
Check whether the desired quantity of the product is available. If not, we should not proceed.
Decrease the available quantity of items for the product that is being bought.
Decrease the amount of money in the online account of our customer.
Register the order so that the product is delivered to our customer.
These suboperations form a more complex operation. When a session is executing this operation, we do not want other connections to interfere. Consider the following scenario:
Connection A checks how many items of the product with the ID 42 are available. Only one is available, but it is enough.
Immediately after, connection B checks the availability of the same product. It finds that one is available.
Connection A decreases the quantity of the product. Now, it is 0.
Connection B decreases the same number. Now, it is -1.
Both connections create an order. Two people will pay for the same product; however, only one is available.
This is something we definitely want to avoid. However, there is another situation that we want to avoid. Imagine that the server crashes immediately after the customer's money is deducted. The order will not be written to the database, so the customer will end up paying for something he will not receive.
Fortunately, transactions prevent both these situations. They protect our database writes in two ways:
During a transaction, relevant data is locked or copied. In both these cases, two connections will not be able to modify the same rows at the same time.
The writes will not be made effective until the COMMIT command is issued. This means that if the server crashes during the transaction, all the suboperations will be rolled back. We will not have inconsistent data (such as a payment for a product that will not be delivered).
In this example, the transaction starts when we issue the START TRANSACTION command. Then, any number of operations can be performed. The COMMIT command makes the changes effective. This does not mean that if a statement fails with an error, the transaction is always aborted. In many cases, the application will receive an error and will be free to decide whether the transaction should be aborted or not. To abort the current transaction, an application can execute the ROLLBACK command.
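The article stays at the SQL prompt, but the same transaction would typically be driven from application code. Purely as an illustration (not from the book), here is a sketch using the PyMySQL driver; the package itself and the host, credentials, and database name are assumptions:

# Hypothetical sketch: requires the PyMySQL package and a reachable MariaDB server.
import pymysql

conn = pymysql.connect(host="localhost", user="app", password="secret",
                       database="shop", autocommit=False)
try:
    with conn.cursor() as cur:
        cur.execute("START TRANSACTION")
        cur.execute("SELECT quantity FROM product WHERE id = 42")
        (quantity,) = cur.fetchone()
        if quantity < 10:
            conn.rollback()   # not enough stock: abort, nothing is changed
        else:
            cur.execute("UPDATE product SET quantity = quantity - 10 WHERE id = 42")
            cur.execute("UPDATE customer SET money = money - "
                        "(SELECT price FROM product WHERE id = 42) WHERE id = 512")
            cur.execute("INSERT INTO product_order (product_id, quantity, customer_id) "
                        "VALUES (42, 10, 512)")
            conn.commit()     # make all the writes effective at once
except Exception:
    conn.rollback()           # on any error, no partial changes are applied
    raise
finally:
    conn.close()

If the process dies between the UPDATE statements, nothing is committed, which is exactly the guarantee described above.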
A transaction can consist of only one statement. This perfectly makes sense because the server could crash in the middle of the statement's execution.
The autocommit mode
In many cases, we don't want to group multiple statements in a transaction. When a transaction consists of only one statement, sending the START TRANSACTION and COMMIT statements can be annoying. For this reason, MariaDB has the autocommit mode. By default, the autocommit mode is ON. Unless a START TRANSACTION command is explicitly used, the autocommit mode causes an implicit commit after each statement. Thus, every statement is executed in a separate transaction by default. When the autocommit mode is OFF, a new transaction implicitly starts after each commit, and the COMMIT command needs to be issued explicitly. To turn autocommit ON or OFF, we can use the @@autocommit server variable as follows:
MariaDB [mwa]> SET @@autocommit = OFF;
Query OK, 0 rows affected (0.00 sec)
MariaDB [mwa]> SELECT @@autocommit;
+--------------+
| @@autocommit |
+--------------+
| 0 |
+--------------+
1 row in set (0.00 sec)
Transaction limitations in MariaDB
Transaction handling is not implemented in the core of MariaDB; instead, it is left to the storage engines. Many storage engines, such as MyISAM or MEMORY, do not implement it at all. Some of the transactional storage engines are:
InnoDB
XtraDB
TokuDB
In a sense, Aria tables are partially transactional. Although Aria ignores commands such as START TRANSACTION, COMMIT, and ROLLBACK, each statement is somewhat of a transaction. In fact, if it writes, modifies, or deletes multiple rows, the operation completely succeeds or fails, which is similar to a transaction.
Only statements that modify data can be used in a transaction. Statements that modify a table structure (such as ALTER TABLE) implicitly commit the current transaction.
Sometimes, we may not be sure whether a transaction is active or not. Usually, this happens because we are not sure whether autocommit is set to ON, or because we are not sure whether the latest statement implicitly committed a transaction. In these cases, the @@in_transaction variable can help us. Its value is 1 if a transaction is active and 0 if it is not. Here is an example:
MariaDB [mwa]> START TRANSACTION;
Query OK, 0 rows affected (0.00 sec)
MariaDB [mwa]> SELECT @@in_transaction;
+------------------+
| @@in_transaction |
+------------------+
| 1 |
+------------------+
1 row in set (0.00 sec)
MariaDB [mwa]> DROP TABLE IF EXISTS t;
Query OK, 0 rows affected, 1 warning (0.00 sec)
MariaDB [mwa]> SELECT @@in_transaction;
+------------------+
| @@in_transaction |
+------------------+
| 0 |
+------------------+
1 row in set (0.00 sec)
InnoDB is optimized to execute a huge number of short transactions. If our databases are busy and performance is important to us, we should try to avoid big transactions, both in terms of the number of statements and of execution time. This is particularly true if we have several concurrent connections that read the same tables.
Working with operators
In our examples, we have used several operators, such as equals (=) and less-than and greater-than (<, >). Now, it is time to discuss operators in general and list the most important ones. In general, an operator is a sign that takes one or more operands and returns a result. Several groups of operators exist in MariaDB. In this article, we will discuss the main types:
Comparison operators
String operators
Logical operators
Arithmetic operators
Comparison operators
A comparison operator checks whether a certain relation holds between its operands. If the relationship exists, the operator returns 1; otherwise, it returns 0. For example, let's take the equality operator, which is probably the most used:
1 = 1 -- returns 1: the equality relationship exists
1 = 0 -- returns 0: no equality relationship here
In MariaDB, 1 and 0 are used in many contexts to indicate whether something is true or false. In fact, MariaDB does not have a Boolean data type, so TRUE and FALSE are merely used as aliases for 1 and 0:
TRUE = 1 -- returns 1
FALSE = 0 -- returns 1
TRUE = FALSE -- returns 0
In a WHERE clause, a result of 0 or NULL prevents a row from being shown. All numeric results other than 0, including negative numbers, are regarded as true in this context. Non-numeric values other than NULL need to be converted to numbers in order to be evaluated by the WHERE clause. Non-numeric strings are converted to 0, whereas numeric strings are treated as numbers. Dates are converted to nonzero numbers. Consider the following example:
WHERE 1 -- is redundant; it shows all the rows
WHERE 0 -- prevents all the rows from being shown
Now, let's take a look at the following MariaDB comparison operators:
= : equality (A = B)
!= : inequality (A != B)
<> : synonym for != (A <> B)
< : less than (A < B)
> : greater than (A > B)
<= : less than or equal to (A <= B)
>= : greater than or equal to (A >= B)
IS NULL : the operand is NULL (A IS NULL)
IS NOT NULL : the operand is not NULL (A IS NOT NULL)
<=> : the operands are equal, or both are NULL (A <=> B)
BETWEEN ... AND : the left operand is within a range of values (A BETWEEN B AND C)
NOT BETWEEN ... AND : the left operand is outside the specified range (A NOT BETWEEN B AND C)
IN : the left operand is one of the items in a given list (A IN (B, C, D))
NOT IN : the left operand is not in the given list (A NOT IN (B, C, D))
Here are a couple of examples:
SELECT id FROM product WHERE price BETWEEN 100 AND 200;
DELETE FROM product WHERE id IN (100, 101, 102);
Special attention should be paid to NULL values. Almost all the preceding operators return NULL if any of their operands is NULL. The reason is quite clear: as NULL represents an unknown value, any operation involving a NULL operand returns an unknown result. However, there are some operators specifically designed to work with NULL values. IS NULL and IS NOT NULL check whether the operand is NULL. The <=> operator is a shortcut for the following code:
a = b OR (a IS NULL AND b IS NULL)
String operators
MariaDB supports certain comparison operators that are specifically designed to work with string values. This does not mean that the other operators do not work well with strings. For example, A = B works perfectly if A and B are strings. However, some particular comparisons only make sense with text values. Let's take a look at them.
The LIKE operator and its variants
This operator is often used to check whether a string starts with a given sequence of characters, ends with that sequence, or contains the sequence. More generally, LIKE checks whether a string follows a given pattern.
Its syntax is: <string_value> LIKE <pattern> The pattern is a string that can contain the following wildcard characters: _ (underscore) means: This specifies any character %: This denotes any sequence of 0 or more characters There is also a way to include these characters without their special meaning: the _ and % sequences represent the a_ and a% characters respectively. For example, take a look at the following expressions: my_text LIKE 'h_' my_text LIKE 'h%' The first expression returns 1 for 'hi', 'ha', or 'ho', but not for 'hey'. The second expression returns 1 for all these strings, including 'hey'. By default, LIKE is case insensitive, meaning that 'abc' LIKE 'ABC' returns 1. Thus, it can be used to perform a case insensitive equality check. To make LIKE case sensitive, the following BINARY keyword can be used: my_text LIKE BINARY your_text The complement of LIKE is NOT LIKE, as shown in the following code: <string_value> NOT LIKE <pattern> Here are the most common uses for LIKE: my_text LIKE 'my%' -- does my_text start with 'my'? my_text LIKE '%my' -- does my_text end with 'my'? my_text LIKE '%my%' -- does my_text contain 'my'? More complex uses are possible for LIKE. For example, the following expression can be used to check whether mail is a valid e-mail address: mail LIKE '_%@_%.__%' The preceding code snippet checks whether mail contains at least one character, a '@' character, at least one character, a dot, at least two characters in this order. In most cases, an invalid e-mail address will not pass this test. Using regular expressions with the REGEXP operator and its variants Regular expressions are string patterns that contain a meta character with special meanings in order to perform match operations and determine whether a given string matches the given pattern or not. The REGEXP operator is somewhat similar to LIKE. It checks whether a string matches a given pattern. However, REGEXP uses regular expressions with the syntax defined by the POSIX standard. Basically, this means that: Many developers, but not all, already know their syntax REGEXP uses a very expressive syntax, so the patterns can be much more complex and detailed REGEXP is much slower than LIKE; this should be preferred when possible The regular expressions syntax is a complex topic, and it cannot be covered in this article. Developers can learn about regular expressions at www.regular-expressions.info. The complement of REGEXP is NOT REGEXP. Logical operators Logical operators can be used to combine truth expressions that form a compound expression that can be true, false, or NULL. Depending on the truth values of its operands, a logical operator can return 1 or 0. MariaDB supports the following logical operators: NOT; AND; OR; XOR The NOT operator NOT is the only logical operator that takes one operand. It inverts its truth value. If the operand is true, NOT returns 0, and if the operand is false, NOT returns 1. If the operand is NULL, NOT returns NULL. Here is an example: NOT 1 -- returns 0 NOT 0 -- returns 1 NOT 1 = 1 -- returns 0 NOT 1 = NULL -- returns NULL NOT 1 <=> NULL -- returns 0 The AND operator AND returns 1 if both its operands are true and 0 in all other cases. Here is an example: 1 AND 1 -- returns 1 0 AND 1 -- returns 0 0 AND 0 -- returns 0 The OR operator OR returns 1 if at least one of its operators is true or 0 if both the operators are false. Here is an example: 1 OR 1 -- returns 1 0 OR 1 -- returns 1 0 OR 0 -- returns 0 The XOR operator XOR stands for eXclusive OR. 
It is the least used logical operator. It returns 1 if exactly one of its operands is true, and 0 if both operands are true or both are false. Take a look at the following example:
1 XOR 1 -- returns 0
1 XOR 0 -- returns 1
0 XOR 1 -- returns 1
0 XOR 0 -- returns 0
A XOR B is the equivalent of the following expression:
(A OR B) AND NOT (A AND B)
Or:
(NOT A AND B) OR (A AND NOT B)
Arithmetic operators
MariaDB supports the operators that are necessary to execute all the basic arithmetic operations. The supported arithmetic operators are:
+ for addition
- for subtraction
* for multiplication
/ for division
Depending on the MariaDB configuration, remember that a division by 0 either raises an error or returns NULL. In addition, two more operators are useful for divisions:
DIV: This returns the integer part of a division, without any decimal part or remainder
MOD or %: This returns the remainder of a division
Here is an example:
MariaDB [(none)]> SELECT 20 DIV 3 AS int_part, 20 MOD 3 AS modulus;
+----------+---------+
| int_part | modulus |
+----------+---------+
| 6 | 2 |
+----------+---------+
1 row in set (0.00 sec)
Operator precedence
MariaDB does not blindly evaluate an expression from left to right. Every operator has a given precedence. An operator that is evaluated before another one is said to have a higher precedence. In general, arithmetic and string operators have a higher priority than logical operators. The precedence of arithmetic operators reflects their precedence in common mathematical expressions. It is very important to remember the precedence of logical operators (from the highest to the lowest):
NOT
AND
XOR
OR
MariaDB supports many operators, and we did not discuss all of them. Also, the exact precedence can vary slightly depending on the MariaDB configuration. The complete precedence can be found in the MariaDB KnowledgeBase, at https://mariadb.com/kb/en/mariadb/documentation/functions-and-operators/operator-precedence/. Parentheses can be used to force MariaDB to follow a certain order. They are also useful when we do not remember the exact precedence of the operators that we want to use, as shown in the following code:
(NOT (a AND b)) OR c OR d
Summary
In this article, you learned the basics of transactions and operators.
Resources for Article: Further resources on this subject: Set Up MariaDB [Article] Installing MariaDB on Windows and Mac OS X [Article] Building a Web Application with PHP and MariaDB – Introduction to caching [Article]

Securing Your Data

Packt
12 Oct 2015
6 min read
In this article by Tyson Cadenhead, author of Socket.IO Cookbook, we will explore several topics related to security in Socket.IO applications. These topics will cover the gambit, from authentication and validation to how to use the wss:// protocol for secure WebSockets. As the WebSocket protocol opens innumerable opportunities to communicate more directly between the client and the server, people often wonder if Socket.IO is actually as secure as something such as the HTTP protocol. The answer to this question is that it depends entirely on how you implement it. WebSockets can easily be locked down to prevent malicious or accidental security holes, but as with any API interface, your security is only as tight as your weakest link. In this article, we will cover the following topics: Locking down the HTTP referrer Using secure WebSockets (For more resources related to this topic, see here.) Locking down the HTTP referrer Socket.IO is really good at getting around cross-domain issues. You can easily include the Socket.IO script from a different domain on your page, and it will just work as you may expect it to. There are some instances where you may not want your Socket.IO events to be available on every other domain. Not to worry! We can easily whitelist only the http referrers that we want so that some domains will be allowed to connect and other domains won't. How To Do It… To lock down the HTTP referrer and only allow events to whitelisted domains, follow these steps: Create two different servers that can connect to our Socket.IO instance. We will let one server listen on port 5000 and the second server listen on port 5001: var express = require('express'), app = express(), http = require('http'), socketIO = require('socket.io'), server, server2, io; app.get('/', function (req, res) { res.sendFile(__dirname + '/index.html'); }); server = http.Server(app); server.listen(5000); server2 = http.Server(app); server2.listen(5001); io = socketIO(server); When the connection is established, check the referrer in the headers. If it is a referrer that we want to give access to, we can let our connection perform its tasks and build up events as normal. If a blacklisted referrer, such as the one on port 5001 that we created, attempts a connection, we can politely decline and perhaps throw an error message back to the client, as shown in the following code: io.on('connection', function (socket) { switch (socket.request.headers.referer) { case 'http://localhost:5000/': socket.emit('permission.message', 'Okay, you're cool.'); break; default: returnsocket.emit('permission.message', 'Who invited you to this party?'); break; } }); On the client side, we can listen to the response from the server and react as appropriate using the following code: socket.on('permission.message', function (data) { document.querySelector('h1').innerHTML = data; }); How It Works… The referrer is always available in the socket.request.headers object of every socket, so we will be able to inspect it there to check whether it was a trusted source. In our case, we will use a switch statement to whitelist our domain on port 5000, but we could really use any mechanism at our disposal to perform the task. For example, if we need to dynamically whitelist domains, we can store a list of them in our database and search for it when the connection is established. Using secure WebSockets WebSocket communications can either take place over the ws:// protocol or the wss:// protocol. 
They can be thought of as the WebSocket counterparts of the HTTP and HTTPS protocols, in the sense that one is secure and the other isn't. Secure WebSockets are encrypted by the transport layer, so they are safer to use when you handle sensitive data. In this recipe, you will learn how to force our Socket.IO communications to happen over the wss:// protocol for an extra layer of encryption.

Getting Ready…

In this recipe, we will need to create a self-signed certificate so that we can serve our app locally over the HTTPS protocol. For this, we will need an npm package called pem, which allows you to create a self-signed certificate that you can provide to your server. Of course, in a real production environment, we would want a true SSL certificate instead of a self-signed one. To install pem, simply call npm install pem --save. As our certificate is self-signed, your browser will probably show a security warning when you navigate to your secure server. Just click on the Proceed to localhost link, and you'll see your application load using the HTTPS protocol.

How To Do It…

To use the secure wss:// protocol, follow these steps:

1. First, create a secure server using the built-in Node https module. We can create a self-signed certificate with the pem package so that we can serve our application over HTTPS instead of HTTP, as shown in the following code:

var https = require('https'),
    pem = require('pem'),
    express = require('express'),
    app = express(),
    socketIO = require('socket.io');

// Create a self-signed certificate with pem
pem.createCertificate({ days: 1, selfSigned: true }, function (err, keys) {

  app.get('/', function (req, res) {
    res.sendFile(__dirname + '/index.html');
  });

  // Create an https server with the certificate and key from pem
  var server = https.createServer({
    key: keys.serviceKey,
    cert: keys.certificate
  }, app).listen(5000);

  var io = socketIO(server);

  io.on('connection', function (socket) {
    var protocol = 'ws://';

    // Check the handshake to determine if it was secure or not
    if (socket.handshake.secure) {
      protocol = 'wss://';
    }

    socket.emit('hello.client', {
      message: 'This is a message from the server. It was sent using the ' + protocol + ' protocol'
    });
  });
});

2. In your client-side JavaScript, specify secure: true when you initialize your WebSocket as follows:

var socket = io('//localhost:5000', { secure: true });

socket.on('hello.client', function (data) {
  console.log(data);
});

3. Now, start your server and navigate to https://localhost:5000. Proceed past the certificate warning, and you should see a message in your browser developer tools that reads: "This is a message from the server. It was sent using the wss:// protocol".

How It Works…

The protocol of our WebSocket is actually set automatically based on the protocol of the page that it sits on. This means that a page served over the HTTP protocol will send its WebSocket communications over ws:// by default, and a page served over HTTPS will default to using the wss:// protocol. However, by setting the secure option to true, we told the WebSocket to always use wss://, no matter what.

Summary

In this article, we gave you an overview of the topics related to security in Socket.IO applications.

Further resources on this subject:

Using Socket.IO and Express together
Adding Real-time Functionality Using Socket.io
Welcome to JavaScript in the full stack

Basics of Jupyter Notebook and Python

Packt Editorial Staff
11 Oct 2015
28 min read
In this article by Cyrille Rossant, coming from his book, Learning IPython for Interactive Computing and Data Visualization - Second Edition, we will see how to use IPython console, Jupyter Notebook, and we will go through the basics of Python. Originally, IPython provided an enhanced command-line console to run Python code interactively. The Jupyter Notebook is a more recent and more sophisticated alternative to the console. Today, both tools are available, and we recommend that you learn to use both. [box type="note" align="alignleft" class="" width=""]The first chapter of the book, Chapter 1, Getting Started with IPython, contains all installation instructions. The main step is to download and install the free Anaconda distribution at https://www.continuum.io/downloads (the version of Python 3 64-bit for your operating system).[/box] Launching the IPython console To run the IPython console, type ipython in an OS terminal. There, you can write Python commands and see the results instantly. Here is a screenshot: IPython console The IPython console is most convenient when you have a command-line-based workflow and you want to execute some quick Python commands. You can exit the IPython console by typing exit. [box type="note" align="alignleft" class="" width=""]Let's mention the Qt console, which is similar to the IPython console but offers additional features such as multiline editing, enhanced tab completion, image support, and so on. The Qt console can also be integrated within a graphical application written with Python and Qt. See http://jupyter.org/qtconsole/stable/ for more information.[/box] Launching the Jupyter Notebook To run the Jupyter Notebook, open an OS terminal, go to ~/minibook/ (or into the directory where you've downloaded the book's notebooks), and type jupyter notebook. This will start the Jupyter server and open a new window in your browser (if that's not the case, go to the following URL: http://localhost:8888). Here is a screenshot of Jupyter's entry point, the Notebook dashboard: The Notebook dashboard [box type="note" align="alignleft" class="" width=""]At the time of writing, the following browsers are officially supported: Chrome 13 and greater; Safari 5 and greater; and Firefox 6 or greater. Other browsers may work also. Your mileage may vary.[/box] The Notebook is most convenient when you start a complex analysis project that will involve a substantial amount of interactive experimentation with your code. Other common use-cases include keeping track of your interactive session (like a lab notebook), or writing technical documents that involve code, equations, and figures. In the rest of this section, we will focus on the Notebook interface. [box type="note" align="alignleft" class="" width=""]Closing the Notebook server To close the Notebook server, go to the OS terminal where you launched the server from, and press Ctrl + C. You may need to confirm with y.[/box] The Notebook dashboard The dashboard contains several tabs which are as follows: Files: shows all files and notebooks in the current directory Running: shows all kernels currently running on your computer Clusters: lets you launch kernels for parallel computing A notebook is an interactive document containing code, text, and other elements. A notebook is saved in a file with the .ipynb extension. This file is a plain text file storing a JSON data structure. A kernel is a process running an interactive session. When using IPython, this kernel is a Python process. 
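Because a notebook file is just JSON text, as noted above, you can peek inside one with nothing but the standard library. Here is a minimal sketch, not taken from the original article, assuming a notebook named example.ipynb (a hypothetical file name) in the current directory and the version 4 notebook format, where cells are stored under a top-level "cells" key:

import json

# Hypothetical file name: replace it with one of your own notebooks.
with open('example.ipynb') as f:
    nb = json.load(f)  # a notebook is a plain JSON document

# Count the cells by type (code, markdown, and so on).
counts = {}
for cell in nb['cells']:
    counts[cell['cell_type']] = counts.get(cell['cell_type'], 0) + 1
print(counts)

Printing counts shows, for example, how many code cells and how many Markdown cells the notebook contains.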
There are kernels in many languages other than Python. [box type="note" align="alignleft" class="" width=""]We follow the convention to use the term notebook for a file, and Notebook for the application and the web interface.[/box] In Jupyter, notebooks and kernels are strongly separated. A notebook is a file, whereas a kernel is a process. The kernel receives snippets of code from the Notebook interface, executes them, and sends the outputs and possible errors back to the Notebook interface. Thus, in general, the kernel has no notion of the Notebook. A notebook is persistent (it's a file), whereas a kernel may be closed at the end of an interactive session and it is therefore not persistent. When a notebook is re-opened, it needs to be re-executed. In general, no more than one Notebook interface can be connected to a given kernel. However, several IPython consoles can be connected to a given kernel. The Notebook user interface To create a new notebook, click on the New button, and select Notebook (Python 3). A new browser tab opens and shows the Notebook interface as follows: A new notebook Here are the main components of the interface, from top to bottom: The notebook name, which you can change by clicking on it. This is also the name of the .ipynb file. The Menu bar gives you access to several actions pertaining to either the notebook or the kernel. To the right of the menu bar is the Kernel name. You can change the kernel language of your notebook from the Kernel menu. The Toolbar contains icons for common actions. In particular, the dropdown menu showing Code lets you change the type of a cell. Following is the main component of the UI: the actual Notebook. It consists of a linear list of cells. We will detail the structure of a cell in the following sections. Structure of a notebook cell There are two main types of cells: Markdown cells and code cells, and they are described as follows: A Markdown cell contains rich text. In addition to classic formatting options like bold or italics, we can add links, images, HTML elements, LaTeX mathematical equations, and more. A code cell contains code to be executed by the kernel. The programming language corresponds to the kernel's language. We will only use Python in this book, but you can use many other languages. You can change the type of a cell by first clicking on a cell to select it, and then choosing the cell's type in the toolbar's dropdown menu showing Markdown or Code. Markdown cells Here is a screenshot of a Markdown cell: A Markdown cell The top panel shows the cell in edit mode, while the bottom one shows it in render mode. The edit mode lets you edit the text, while the render mode lets you display the rendered cell. We will explain the differences between these modes in greater detail in the following section. Code cells Here is a screenshot of a complex code cell: Structure of a code cell This code cell contains several parts, as follows: The Prompt number shows the cell's number. This number increases every time you run the cell. Since you can run cells of a notebook out of order, nothing guarantees that code numbers are linearly increasing in a given notebook. The Input area contains a multiline text editor that lets you write one or several lines of code with syntax highlighting. The Widget area may contain graphical controls; here, it displays a slider. 
The Output area can contain multiple outputs, here: Standard output (text in black) Error output (text with a red background) Rich output (an HTML table and an image here) The Notebook modal interface The Notebook implements a modal interface similar to some text editors such as vim. Mastering this interface may represent a small learning curve for some users. Use the edit mode to write code (the selected cell has a green border, and a pen icon appears at the top right of the interface). Click inside a cell to enable the edit mode for this cell (you need to double-click with Markdown cells). Use the command mode to operate on cells (the selected cell has a gray border, and there is no pen icon). Click outside the text area of a cell to enable the command mode (you can also press the Esc key). Keyboard shortcuts are available in the Notebook interface. Type h to show them. We review here the most common ones (for Windows and Linux; shortcuts for Mac OS X may be slightly different). Keyboard shortcuts available in both modes Here are a few keyboard shortcuts that are always available when a cell is selected: Ctrl + Enter: run the cell Shift + Enter: run the cell and select the cell below Alt + Enter: run the cell and insert a new cell below Ctrl + S: save the notebook Keyboard shortcuts available in the edit mode In the edit mode, you can type code as usual, and you have access to the following keyboard shortcuts: Esc: switch to command mode Ctrl + Shift + -: split the cell Keyboard shortcuts available in the command mode In the command mode, keystrokes are bound to cell operations. Don't write code in command mode or unexpected things will happen! For example, typing dd in command mode will delete the selected cell! Here are some keyboard shortcuts available in command mode: Enter: switch to edit mode Up or k: select the previous cell Down or j: select the next cell y / m: change the cell type to code cell/Markdown cell a / b: insert a new cell above/below the current cell x / c / v: cut/copy/paste the current cell dd: delete the current cell z: undo the last delete operation Shift + =: merge the cell below h: display the help menu with the list of keyboard shortcuts Spending some time learning these shortcuts is highly recommended. References Here are a few references: Main documentation of Jupyter at http://jupyter.readthedocs.org/en/latest/ Jupyter Notebook interface explained at http://jupyter-notebook.readthedocs.org/en/latest/notebook.html A crash course on Python If you don't know Python, read this section to learn the fundamentals. Python is a very accessible language and is even taught to school children. If you have ever programmed, it will only take you a few minutes to learn the basics. Hello world Open a new notebook and type the following in the first cell: In [1]: print("Hello world!") Out[1]: Hello world! Here is a screenshot: "Hello world" in the Notebook [box type="note" align="alignleft" class="" width=""]Prompt string Note that the convention chosen in this article is to show Python code (also called the input) prefixed with In [x]: (which shouldn't be typed). This is the standard IPython prompt. Here, you should just type print("Hello world!") and then press Shift + Enter.[/box] Congratulations! You are now a Python programmer. Variables Let's use Python as a calculator. In [2]: 2 * 2 Out[2]: 4 Here, 2 * 2 is an expression statement. This operation is performed, the result is returned, and IPython displays it in the notebook cell's output. 
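One related detail worth knowing: by default, only the value of the last expression in a cell is shown as the Out[] result, while anything passed to print() is always displayed. A minimal sketch of a single cell (the numbers are arbitrary):

2 * 2         # evaluated, but its value is not displayed
print(3 * 3)  # printed output: 9
5 * 5         # last expression in the cell: shown as the Out[] value, 25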
[box type="note" align="alignleft" class="" width=""]Division In Python 3, 3 / 2 returns 1.5 (floating-point division), whereas it returns 1 in Python 2 (integer division). This can be a source of errors when porting Python 2 code to Python 3. It is recommended to always use the explicit 3.0 / 2.0 for floating-point division (by using floating-point numbers) and 3 // 2 for integer division. Both syntaxes work in Python 2 and Python 3. See http://python3porting.com/differences.html#integer-division for more details.[/box]

Other built-in mathematical operators include +, -, and ** for exponentiation, among others. You will find more details at https://docs.python.org/3/reference/expressions.html#the-power-operator.

Variables form a fundamental concept of any programming language. A variable has a name and a value. Here is how to create a new variable in Python:

In [3]: a = 2

And here is how to use an existing variable:

In [4]: a * 3
Out[4]: 6

Several variables can be defined at once (this is called unpacking):

In [5]: a, b = 2, 6

There are different types of variables. Here, we have used a number (more precisely, an integer). Other important types include floating-point numbers to represent real numbers, strings to represent text, and booleans to represent True/False values. Here are a few examples:

In [6]: somefloat = 3.1415
        sometext = 'pi is about'  # You can also use double quotes.
        print(sometext, somefloat)  # Display several variables.
Out[6]: pi is about 3.1415

Note how we used the # character to write comments. Whereas Python discards the comments completely, adding comments in the code is important when the code is to be read by other humans (including yourself in the future).

String escaping

String escaping refers to the ability to insert special characters in a string. For example, how can you insert ' and ", given that these characters are used to delimit a string in Python code? The backslash is the go-to escape character in Python (and in many other languages too). Here are a few examples:

In [7]: print("Hello \"world\"")
        print("A list:\n* item 1\n* item 2")
        print("C:\\path\\on\\windows")
        print(r"C:\path\on\windows")
Out[7]: Hello "world"
        A list:
        * item 1
        * item 2
        C:\path\on\windows
        C:\path\on\windows

The special character \n is the new line (or line feed) character. To insert a backslash, you need to escape it, which explains why it needs to be doubled as \\. You can also disable escaping by using raw literals with an r prefix before the string, like in the last example above. In this case, backslashes are considered as normal characters. This is convenient when writing Windows paths, since Windows uses backslash separators instead of forward slashes like on Unix systems. A very common error on Windows is forgetting to escape backslashes in paths: writing "C:\path" may lead to subtle errors (for example, the \t in a path such as "C:\temp" would be interpreted as a tab character). You will find the list of special characters in Python at https://docs.python.org/3.4/reference/lexical_analysis.html#string-and-bytes-literals.

Lists

A list contains a sequence of items. You can concisely instruct Python to perform repeated actions on the elements of a list. Let's first create a list of numbers as follows:

In [8]: items = [1, 3, 0, 4, 1]

Note the syntax we used to create the list: square brackets [], and commas , to separate the items.
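As a small supplementary sketch (not part of the original text), here are a few other common operations on this list; none of them modify items, so the examples that follow still apply:

3 in items      # membership test: True, 3 is one of the elements
7 in items      # False
sorted(items)   # returns a new sorted list, [0, 1, 1, 3, 4]; items itself is unchanged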
The built-in function len() returns the number of elements in a list:

In [9]: len(items)
Out[9]: 5

[box type="note" align="alignleft" class="" width=""]Python comes with a set of built-in functions, including print(), len(), max(), functional routines like filter() and map(), and container-related routines like all(), any(), range(), and sorted(). You will find the full list of built-in functions at https://docs.python.org/3.4/library/functions.html.[/box]

Now, let's compute the sum of all elements in the list. Python provides a built-in function for this:

In [10]: sum(items)
Out[10]: 9

We can also access individual elements in the list, using the following syntax:

In [11]: items[0]
Out[11]: 1
In [12]: items[-1]
Out[12]: 1

Note that indexing starts at 0 in Python: the first element of the list is indexed by 0, the second by 1, and so on. Also, -1 refers to the last element, -2, to the penultimate element, and so on. The same syntax can be used to alter elements in the list:

In [13]: items[1] = 9
         items
Out[13]: [1, 9, 0, 4, 1]

We can access sublists with the following syntax:

In [14]: items[1:3]
Out[14]: [9, 0]

Here, 1:3 represents a slice going from element 1 included (this is the second element of the list) to element 3 excluded. Thus, we get a sublist with the second and third element of the original list. The first-included/last-excluded asymmetry leads to an intuitive treatment of overlaps between consecutive slices. Also, note that slicing a list returns a new list (a shallow copy of that portion of the original), not a view; changing elements in the sublist afterwards does not affect the original list.

Python provides several other types of containers:

Tuples are immutable and contain a fixed number of elements:

In [15]: my_tuple = (1, 2, 3)
         my_tuple[1]
Out[15]: 2

Dictionaries contain key-value pairs. They are extremely useful and common:

In [16]: my_dict = {'a': 1, 'b': 2, 'c': 3}
         print('a:', my_dict['a'])
Out[16]: a: 1
In [17]: print(my_dict.keys())
Out[17]: dict_keys(['c', 'a', 'b'])

There is no notion of order in a dictionary. However, the native collections module provides an OrderedDict structure that keeps the insertion order (see https://docs.python.org/3.4/library/collections.html).

Sets, like mathematical sets, contain distinct elements:

In [18]: my_set = set([1, 2, 3, 2, 1])
         my_set
Out[18]: {1, 2, 3}

A Python object is mutable if its value can change after it has been created. Otherwise, it is immutable. For example, a string is immutable; to change it, a new string needs to be created. A list, a dictionary, or a set is mutable; elements can be added or removed. By contrast, a tuple is immutable, and it is not possible to change the elements it contains without recreating the tuple. See https://docs.python.org/3.4/reference/datamodel.html for more details.

Loops

We can run through all elements of a list using a for loop:

In [19]: for item in items:
             print(item)
Out[19]: 1
         9
         0
         4
         1

There are several things to note here:

The for item in items syntax means that a temporary variable named item is created at every iteration. This variable contains the value of every item in the list, one at a time.
Note the colon : at the end of the for statement. Forgetting it will lead to a syntax error!
The statement print(item) will be executed for all items in the list.
Note the four spaces before print: this is called the indentation. You will find more details about indentation in the next subsection.
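As a short supplementary sketch, here is how the built-in sum() used earlier could be reproduced with such a loop, accumulating the result in a variable (for illustration only; the built-in function should be preferred):

total = 0
for item in items:
    total = total + item  # add the current element to the running total
print(total)              # the same value as sum(items)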
Python supports a concise syntax to perform a given operation on all elements of a list, as follows:

In [20]: squares = [item * item for item in items]
         squares
Out[20]: [1, 81, 0, 16, 1]

This is called a list comprehension. A new list is created here; it contains the squares of all numbers in the list. This concise syntax leads to highly readable and Pythonic code.

Indentation

Indentation refers to the spaces that may appear at the beginning of some lines of code. This is a particular aspect of Python's syntax. In most programming languages, indentation is optional and is generally used to make the code visually clearer. But in Python, indentation also has a syntactic meaning. Particular indentation rules need to be followed for Python code to be correct. In general, there are two ways to indent some text: by inserting a tab character (also referred to as \t), or by inserting a number of spaces (typically, four). It is recommended to use spaces instead of tab characters. Your text editor should be configured such that the Tab key on the keyboard inserts four spaces instead of a tab character. In the Notebook, indentation is automatically configured properly, so you shouldn't worry about this issue. The question only arises if you use another text editor for your Python code. Finally, what is the meaning of indentation? In Python, indentation delimits coherent blocks of code, for example, the contents of a loop, a conditional branch, a function, and other objects. Where other languages such as C or JavaScript use curly braces to delimit such blocks, Python uses indentation.

Conditional branches

Sometimes, you need to perform different operations on your data depending on some condition. For example, let's display all even numbers in our list:

In [21]: for item in items:
             if item % 2 == 0:
                 print(item)
Out[21]: 0
         4

Again, here are several things to note:

An if statement is followed by a boolean expression.
If a and b are two integers, the modulo operand a % b returns the remainder from the division of a by b. Here, item % 2 is 0 for even numbers, and 1 for odd numbers.
The equality is represented by a double equal sign == to avoid confusion with the assignment operator = that we use when we create variables.
Like with the for loop, the if statement ends with a colon :.
The part of the code that is executed when the condition is satisfied follows the if statement. It is indented. Indentation is cumulative: since this if is inside a for loop, there are eight spaces before the print(item) statement.

Python supports a concise syntax to select all elements in a list that satisfy certain properties. Here is how to create a sublist with only even numbers:

In [22]: even = [item for item in items if item % 2 == 0]
         even
Out[22]: [0, 4]

This is also a form of list comprehension.

Functions

Code is typically organized into functions. A function encapsulates part of your code. Functions allow you to reuse bits of functionality without copy-pasting the code. Here is a function that tells whether an integer number is even or not:

In [23]: def is_even(number):
             """Return whether an integer is even or not."""
             return number % 2 == 0

There are several things to note here:

A function is defined with the def keyword.
After def comes the function name. A general convention in Python is to only use lowercase characters, and separate words with an underscore _. A function name generally starts with a verb.
The function name is followed by parentheses, with one or several variable names called the arguments.
These are the inputs of the function. There is a single argument here, named number. No type is specified for the argument. This is because Python is dynamically typed; you could pass a variable of any type. This function would work fine with floating point numbers, for example (the modulo operation works with floating point numbers in addition to integers). The body of the function is indented (and note the colon : at the end of the def statement). There is a docstring wrapped by triple quotes """. This is a particular form of comment that explains what the function does. It is not mandatory, but it is strongly recommended to write docstrings for the functions exposed to the user. The return keyword in the body of the function specifies the output of the function. Here, the output is a Boolean, obtained from the expression number % 2 == 0. It is possible to return several values; just use a comma to separate them (in this case, a tuple of Booleans would be returned). Once a function is defined, it can be called like this: In [24]: is_even(3) Out[24]: False In [25]: is_even(4) Out[25]: True Here, 3 and 4 are successively passed as arguments to the function. Positional and keyword arguments A Python function can accept an arbitrary number of arguments, called positional arguments. It can also accept optional named arguments, called keyword arguments. Here is an example: In [26]: def remainder(number, divisor=2): return number % divisor The second argument of this function, divisor, is optional. If it is not provided by the caller, it will default to the number 2, as shown here: In [27]: remainder(5) Out[27]: 1 There are two equivalent ways of specifying a keyword argument when calling a function. They are as follows: In [28]: remainder(5, 3) Out[28]: 2 In [29]: remainder(5, divisor=3) Out[29]: 2 In the first case, 3 is understood as the second argument, divisor. In the second case, the name of the argument is given explicitly by the caller. This second syntax is clearer and less error-prone than the first one. Functions can also accept arbitrary sets of positional and keyword arguments, using the following syntax: In [30]: def f(*args, **kwargs): print("Positional arguments:", args) print("Keyword arguments:", kwargs) In [31]: f(1, 2, c=3, d=4) Out[31]: Positional arguments: (1, 2) Keyword arguments: {'c': 3, 'd': 4} Inside the function, args is a tuple containing positional arguments, and kwargs is a dictionary containing keyword arguments. Passage by assignment When passing a parameter to a Python function, a reference to the object is actually passed (passage by assignment): If the passed object is mutable, it can be modified by the function If the passed object is immutable, it cannot be modified by the function Here is an example: In [32]: my_list = [1, 2] def add(some_list, value): some_list.append(value) add(my_list, 3) my_list Out[32]: [1, 2, 3] The add() function modifies an object defined outside it (in this case, the object my_list); we say this function has side-effects. A function with no side-effects is called a pure function: it doesn't modify anything in the outer context, and it deterministically returns the same result for any given set of inputs. Pure functions are to be preferred over functions with side-effects. Knowing this can help you spot out subtle bugs. There are further related concepts that are useful to know, including function scopes, naming, binding, and more. 
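To make this difference concrete, here is a short supplementary sketch contrasting a function that mutates the list it receives with one that merely rebinds its local name; only the first has a visible side-effect:

def append_value(some_list, value):
    some_list.append(value)   # mutates the object that was passed in

def rebind(some_list, value):
    some_list = [value]       # rebinds the local name only; the caller is unaffected

my_list = [1, 2]
append_value(my_list, 3)
print(my_list)                # [1, 2, 3]
rebind(my_list, 99)
print(my_list)                # still [1, 2, 3]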
Here are a couple of links: Passage by reference at https://docs.python.org/3/faq/programming.html#how-do-i-write-a-function-with-output-parameters-call-by-reference Naming, binding, and scope at https://docs.python.org/3.4/reference/executionmodel.html Errors Let's discuss errors in Python. As you learn, you will inevitably come across errors and exceptions. The Python interpreter will most of the time tell you what the problem is, and where it occurred. It is important to understand the vocabulary used by Python so that you can more quickly find and correct your errors. Let's see the following example: In [33]: def divide(a, b): return a / b In [34]: divide(1, 0) Out[34]: --------------------------------------------------------- ZeroDivisionError Traceback (most recent call last) <ipython-input-2-b77ebb6ac6f6> in <module>() ----> 1 divide(1, 0) <ipython-input-1-5c74f9fd7706> in divide(a, b) 1 def divide(a, b): ----> 2 return a / b ZeroDivisionError: division by zero Here, we defined a divide() function, and called it to divide 1 by 0. Dividing a number by 0 is an error in Python. Here, a ZeroDivisionError exception was raised. An exception is a particular type of error that can be raised at any point in a program. It is propagated from the innards of the code up to the command that launched the code. It can be caught and processed at any point. You will find more details about exceptions at https://docs.python.org/3/tutorial/errors.html, and common exception types at https://docs.python.org/3/library/exceptions.html#bltin-exceptions. The error message you see contains the stack trace, the exception type, and the exception message. The stack trace shows all function calls between the raised exception and the script calling point. The top frame, indicated by the first arrow ---->, shows the entry point of the code execution. Here, it is divide(1, 0), which was called directly in the Notebook. The error occurred while this function was called. The next and last frame is indicated by the second arrow. It corresponds to line 2 in our function divide(a, b). It is the last frame in the stack trace: this means that the error occurred there. Object-oriented programming Object-oriented programming (OOP) is a relatively advanced topic. Although we won't use it much in this book, it is useful to know the basics. Also, mastering OOP is often essential when you start to have a large code base. In Python, everything is an object. A number, a string, or a function is an object. An object is an instance of a type (also known as class). An object has attributes and methods, as specified by its type. An attribute is a variable bound to an object, giving some information about it. A method is a function that applies to the object. For example, the object 'hello' is an instance of the built-in str type (string). The type() function returns the type of an object, as shown here: In [35]: type('hello') Out[35]: str There are native types, like str or int (integer), and custom types, also called classes, that can be created by the user. In IPython, you can discover the attributes and methods of any object with the dot syntax and tab completion. For example, typing 'hello'.u and pressing Tab automatically shows us the existence of the upper() method: In [36]: 'hello'.upper() Out[36]: 'HELLO' Here, upper() is a method available to all str objects; it returns an uppercase copy of a string. A useful string method is format(). 
This simple and convenient templating system lets you generate strings dynamically, as shown in the following example: In [37]: 'Hello {0:s}!'.format('Python') Out[37]: Hello Python The {0:s} syntax means "replace this with the first argument of format(), which should be a string". The variable type after the colon is especially useful for numbers, where you can specify how to display the number (for example, .3f to display three decimals). The 0 makes it possible to replace a given value several times in a given string. You can also use a name instead of a position—for example 'Hello {name}!'.format(name='Python'). Some methods are prefixed with an underscore _; they are private and are generally not meant to be used directly. IPython's tab completion won't show you these private attributes and methods unless you explicitly type _ before pressing Tab. In practice, the most important thing to remember is that appending a dot . to any Python object and pressing Tab in IPython will show you a lot of functionality pertaining to that object. Functional programming Python is a multi-paradigm language; it notably supports imperative, object-oriented, and functional programming models. Python functions are objects and can be handled like other objects. In particular, they can be passed as arguments to other functions (also called higher-order functions). This is the essence of functional programming. Decorators provide a convenient syntax construct to define higher-order functions. Here is an example using the is_even() function from the previous Functions section: In [38]: def show_output(func): def wrapped(*args, **kwargs): output = func(*args, **kwargs) print("The result is:", output) return wrapped The show_output() function transforms an arbitrary function func() to a new function, named wrapped(), that displays the result of the function, as follows: In [39]: f = show_output(is_even) f(3) Out[39]: The result is: False Equivalently, this higher-order function can also be used with a decorator, as follows: In [40]: @show_output def square(x): return x * x In [41]: square(3) Out[41]: The result is: 9 You can find more information about Python decorators at https://en.wikipedia.org/wiki/Python_syntax_and_semantics#Decorators and at http://www.thecodeship.com/patterns/guide-to-python-function-decorators/. Python 2 and 3 Let's finish this section with a few notes about Python 2 and Python 3 compatibility issues. There are still some Python 2 code and libraries that are not compatible with Python 3. Therefore, it is sometimes useful to be aware of the differences between the two versions. One of the most obvious differences is that print is a statement in Python 2, whereas it is a function in Python 3. Therefore, print "Hello" (without parentheses) works in Python 2 but not in Python 3, while print("Hello") works in both Python 2 and Python 3. 
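One common way to smooth over this particular difference is the built-in __future__ mechanism, which makes selected Python 3 behaviors available in Python 2. Here is a minimal sketch (the import must appear at the top of the module or cell):

from __future__ import print_function, division

print("Hello")  # the function form of print now also works in Python 2
print(3 / 2)    # 1.5 in both Python 2 and Python 3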
There are several non-mutually exclusive options to write portable code that works with both versions: futures: A built-in module supporting backward-incompatible Python syntax 2to3: A built-in Python module to port Python 2 code to Python 3 six: An external lightweight library for writing compatible code Here are a few references: Official Python 2/3 wiki page at https://wiki.python.org/moin/Python2orPython3 The Porting to Python 3 book, by CreateSpace Independent Publishing Platform at http://www.python3porting.com/bookindex.html 2to3 at https://docs.python.org/3.4/library/2to3.html six at https://pythonhosted.org/six/ futures at https://docs.python.org/3.4/library/__future__.html The IPython Cookbook contains an in-depth recipe about choosing between Python 2 and 3, and how to support both. Going beyond the basics You now know the fundamentals of Python, the bare minimum that you will need in this book. As you can imagine, there is much more to say about Python. Following are a few further basic concepts that are often useful and that we cannot cover here, unfortunately. You are highly encouraged to have a look at them in the references given at the end of this section: range and enumerate pass, break, and, continue, to be used in loops Working with files Creating and importing modules The Python standard library provides a wide range of functionality (OS, network, file systems, compression, mathematics, and more) Here are some slightly more advanced concepts that you might find useful if you want to strengthen your Python skills: Regular expressions for advanced string processing Lambda functions for defining small anonymous functions Generators for controlling custom loops Exceptions for handling errors with statements for safely handling contexts Advanced object-oriented programming Metaprogramming for modifying Python code dynamically The pickle module for persisting Python objects on disk and exchanging them across a network Finally, here are a few references: Getting started with Python: https://www.python.org/about/gettingstarted/ A Python tutorial: https://docs.python.org/3/tutorial/index.html The Python Standard Library: https://docs.python.org/3/library/index.html Interactive tutorial: http://www.learnpython.org/ Codecademy Python course: http://www.codecademy.com/tracks/python Language reference (expert level): https://docs.python.org/3/reference/index.html Python Cookbook, by David Beazley and Brian K. Jones, O'Reilly Media (advanced level, highly recommended if you want to become a Python expert) Summary In this article, we have seen how to launch the IPython console and Jupyter Notebook, the different aspects of the Notebook and its user interface, the structure of the notebook cell, keyboard shortcuts that are available in the Notebook interface, and the basics of Python. Introduction to Data Analysis and Libraries Hand Gesture Recognition Using a Kinect Depth Sensor The strange relationship between objects, functions, generators and coroutines