
How-To Tutorials - Data

1210 Articles

Indexing the Data

Packt
18 Apr 2014
10 min read
(For more resources related to this topic, see here.)

Elasticsearch indexing

We have our Elasticsearch cluster up and running, and we also know how to use the Elasticsearch REST API to index our data, delete it, and retrieve it. We also know how to use search to get our documents. If you are used to SQL databases, you might know that before you can start putting data there, you need to create a structure that describes what your data looks like. Although Elasticsearch is a schema-less search engine and can figure out the data structure on the fly, we think that controlling the structure and thus defining it ourselves is a better way. In the following few pages, you'll see how to create new indices (and how to delete them). Before we look closer at the available API methods, let's see what the indexing process looks like.

Shards and replicas

The Elasticsearch index is built of one or more shards, and each of them contains part of your document set. Each of these shards can also have replicas, which are exact copies of the shard. During index creation, we can specify how many shards and replicas should be created. We can also omit this information and use the default values, either defined in the global configuration file (elasticsearch.yml) or implemented in Elasticsearch internals. If we rely on the Elasticsearch defaults, our index will end up with five shards and one replica. What does that mean? To put it simply, we will end up with 10 Lucene indices distributed among the cluster.

Are you wondering how we did the calculation and got 10 Lucene indices from five shards and one replica? The term "replica" is somewhat misleading. It means that every shard has its copy, so there are five shards and five copies. Having a shard and its replica generally means that when we index a document, we will modify both of them. That's because, to have an exact copy of a shard, Elasticsearch needs to inform all the replicas about the change in shard contents. In the case of fetching a document, we can use either the shard or its copy. In a system with many physical nodes, we will be able to place the shards and their copies on different nodes and thus use more processing power (such as disk I/O or CPU). To sum up, the conclusions are as follows:

- More shards allow us to spread indices to more servers, which means we can handle more documents without losing performance.
- More shards mean that fewer resources are required to fetch a particular document, because fewer documents are stored in a single shard compared to a deployment with fewer shards.
- More shards mean more problems when searching across the index, because we have to merge results from more shards, so the aggregation phase of the query can be more resource intensive.
- More replicas result in a fault-tolerant cluster, because when the original shard is not available, its copy will take over its role. With a single replica, the cluster can lose a shard without data loss. With two replicas, we can lose the primary shard and its single replica and everything will still work well.
- The more replicas there are, the higher the query throughput will be. That's because a query can be executed against either a shard or any of its copies.

Of course, these are not the only relationships between the number of shards and replicas in Elasticsearch. So, how many shards and replicas should we have for our indices? That depends.
We believe that the defaults are quite good, but nothing can replace a good test. Note that the number of replicas is less important because you can adjust it on a live cluster after index creation. You can remove and add them if you want and have the resources to run them. Unfortunately, this is not true when it comes to the number of shards. Once you have your index created, the only way to change the number of shards is to create another index and reindex your data.

Creating indices

When we created our first document in Elasticsearch, we didn't care about index creation at all. We just used the following command:

curl -XPUT http://localhost:9200/blog/article/1 -d '{"title": "New version of Elasticsearch released!", "content": "...", "tags": ["announce", "elasticsearch", "release"]}'

This is fine. If such an index does not exist, Elasticsearch automatically creates it for us. We can also create the index ourselves by running the following command:

curl -XPUT http://localhost:9200/blog/

We just told Elasticsearch that we want to create the index with the blog name. If everything goes right, you will see the following response from Elasticsearch:

{"acknowledged":true}

When is manual index creation necessary? There are many situations. One of them can be the inclusion of additional settings, such as the index structure or the number of shards.

Altering automatic index creation

Sometimes, you can come to the conclusion that automatic index creation is a bad thing. When you have a big system with many processes sending data into Elasticsearch, a simple typo in the index name can destroy hours of script work. You can turn off automatic index creation by adding the following line to the elasticsearch.yml configuration file:

action.auto_create_index: false

Note that action.auto_create_index is more complex than it looks. The value can be set not only to false or true. We can also use index name patterns to specify whether an index with a given name can be created automatically if it doesn't exist. For example, the following definition allows automatic creation of indices with names beginning with a, but disallows the creation of indices starting with an. The other indices aren't allowed and must be created manually (because of -*).

action.auto_create_index: -an*,+a*,-*

Note that the order of pattern definitions matters. Elasticsearch checks the patterns up to the first pattern that matches, so if you move -an* to the end, it won't be used, because +a* will be checked first.

Settings for a newly created index

The manual creation of an index is also necessary when you want to set some configuration options, such as the number of shards and replicas. Let's look at the following example:

curl -XPUT http://localhost:9200/blog/ -d '{
    "settings" : {
        "number_of_shards" : 1,
        "number_of_replicas" : 2
    }
}'

The preceding command will result in the creation of the blog index with one shard and two replicas, making a total of three physical Lucene indices. There are also other values that can be set in this way. So, we already have our new, shiny index. But there is a problem; we forgot to provide the mappings, which are responsible for describing the index structure. What can we do? Since we have no data at all, we'll go for the simplest approach: we will just delete the index. To do that, we will run a command similar to the preceding one, but instead of using the PUT HTTP method, we use DELETE.
So the actual command is as follows:

curl -XDELETE http://localhost:9200/blog

And the response will be the same as the one we saw earlier, as follows:

{"acknowledged":true}

Now that we know what an index is, how to create it, and how to delete it, we are ready to create indices with the mappings we have defined. This is a very important part, because the way data is indexed will affect the search process and the way in which documents are matched.

Mappings configuration

If you are used to SQL databases, you may know that before you can start inserting data into the database, you need to create a schema that describes what your data looks like. Although Elasticsearch is a schema-less search engine and can figure out the data structure on the fly, we think that controlling the structure and thus defining it ourselves is a better way. In the following few pages, you'll see how to create new indices (and how to delete them) and how to create mappings that suit your needs and match your data structure.

Type determining mechanism

Before we start describing how to create mappings manually, we wanted to write about one thing. Elasticsearch can guess the document structure by looking at the JSON that defines the document. In JSON, strings are surrounded by quotation marks, Booleans are defined using specific words, and numbers are just a few digits. This is a simple trick, but it usually works. For example, let's look at the following document:

{
  "field1": 10,
  "field2": "10"
}

The preceding document has two fields. The field1 field will be determined as a number (to be precise, as the long type), but field2 will be determined as a string, because it is surrounded by quotation marks. Of course, this can be the desired behavior, but sometimes the data source may omit the information about the data type and everything may be present as strings. The solution is to enable more aggressive text checking in the mapping definition by setting the numeric_detection property to true. For example, we can execute the following command during the creation of the index:

curl -XPUT http://localhost:9200/blog/?pretty -d '{
  "mappings" : {
    "article": {
      "numeric_detection" : true
    }
  }
}'

Unfortunately, the problem still exists if we want the Boolean type to be guessed. There is no option to force the guessing of Boolean types from text. In such cases, when a change of the source format is impossible, we can only define the field directly in the mappings definition. Another type that causes trouble is a date-based one. Elasticsearch tries to guess dates given as timestamps or strings that match the date format. We can define the list of recognized date formats using the dynamic_date_formats property, which allows us to specify a formats array. Let's look at the following command for creating the index and type:

curl -XPUT 'http://localhost:9200/blog/' -d '{
  "mappings" : {
    "article" : {
      "dynamic_date_formats" : ["yyyy-MM-dd hh:mm"]
    }
  }
}'

The preceding command will result in the creation of an index called blog with a single type called article. We've also used the dynamic_date_formats property with a single date format that will result in Elasticsearch using the date core type for fields matching the defined format. Elasticsearch uses the joda-time library to define date formats, so please visit http://joda-time.sourceforge.net/api-release/org/joda/time/format/DateTimeFormat.html if you are interested in finding out more about them.
Remember that the dynamic_date_formats property accepts an array of values. That means that we can handle several date formats simultaneously.
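For readers who prefer scripting the index creation, here is a minimal sketch (not from the original article) that sends the same kind of request from Python using the third-party requests library; the index name, type name, and date formats are illustrative assumptions, and an Elasticsearch node is assumed to be listening on localhost:9200.

import json
import requests  # third-party HTTP client; assumed to be installed

# Assumption: an Elasticsearch node from the same era as the article (1.x) on localhost:9200.
index_url = "http://localhost:9200/blog"

mappings = {
    "mappings": {
        "article": {
            "numeric_detection": True,
            "dynamic_date_formats": ["yyyy-MM-dd hh:mm", "yyyy-MM-dd"]
        }
    }
}

# Equivalent to the curl -XPUT commands shown above, but sent from Python.
response = requests.put(index_url, data=json.dumps(mappings))
print(response.json())  # expect {"acknowledged": true} on success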


Learning Random Forest Using Mahout

Packt
05 Mar 2015
11 min read
In this article by Ashish Gupta, author of the book Learning Apache Mahout Classification, we will learn about Random forest, which is one of the most popular techniques in classification. It builds on a machine learning technique called a decision tree. In this article, we will explore the following topics:

- Decision tree
- Random forest
- Using Mahout for Random forest

(For more resources related to this topic, see here.)

Decision tree

A decision tree is used for classification and regression problems. In simple terms, it is a predictive model that uses binary rules to calculate the target variable. In a decision tree, we use an iterative process of splitting the data into partitions, and then we split it further on branches. As in other classification model creation processes, we start with a training dataset in which target variables or class labels are defined. The algorithm tries to break all the records in the training dataset into two parts based on one of the explanatory variables. The partitioning is then applied to each new partition, and this process is continued until no more partitioning can be done. The core of the algorithm is to find the rule that determines the initial split. There are algorithms to create decision trees, such as Iterative Dichotomiser 3 (ID3), Classification and Regression Tree (CART), Chi-squared Automatic Interaction Detector (CHAID), and so on. A good explanation of ID3 can be found at http://www.cse.unsw.edu.au/~billw/cs9414/notes/ml/06prop/id3/id3.html.

To choose the best splitter at a node from the explanatory variables, the algorithm considers each variable in turn. Every possible split is considered and tried, and the best split is the one that produces the largest decrease in the diversity of the classification label within each partition. This is repeated for all variables, and the winner is chosen as the best splitter for that node. The process is continued at the next node until we reach a node where we can make the decision.

Because we create a decision tree from a training dataset, it can suffer from the overfitting problem. This behavior creates a problem with real datasets. To improve this situation, a process called pruning is used. In this process, we remove the branches and leaves of the tree to improve the performance. Algorithms used to build the tree work best at the starting or root node, since all the information is available there. Later on, with each split, there is less data, and towards the end of the tree a particular node can show patterns that are specific to the subset of data used to split it. These patterns create problems when we use them to predict on the real dataset. Pruning methods let the tree grow and then remove the smaller branches that fail to generalize.

Now let's take an example to understand the decision tree. Consider that we have an iris flower dataset. This dataset is hugely popular in the machine learning field. It was introduced by Sir Ronald Fisher. It contains 50 samples from each of three species of iris flower (Iris setosa, Iris virginica, and Iris versicolor). The four explanatory variables are the length and width of the sepals and petals in centimeters, and the target variable is the class to which the flower belongs. As you can see in the preceding diagram, all the groups were at first considered to be of the Setosa species, and then the explanatory variable petal length was used to divide the groups further. At each step, the number of misclassified items was also calculated, which shows how many items were wrongly classified.
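To make the split-selection rule concrete, here is a minimal Python sketch (not from the article, which uses Mahout) that scores a candidate binary split by its decrease in Gini impurity, one common measure of the "diversity" mentioned above; the tiny label lists are made-up examples.

def gini(labels):
    # Gini impurity of a list of class labels: 1 - sum of squared class proportions.
    total = len(labels)
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return 1.0 - sum((c / total) ** 2 for c in counts.values())

def gini_decrease(parent, left, right):
    # How much a split reduces impurity, weighted by partition sizes.
    n = len(parent)
    weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted

# Made-up labels: the candidate split below separates the classes fairly well.
parent = ["setosa", "setosa", "versicolor", "versicolor", "virginica", "virginica"]
left = ["setosa", "setosa", "versicolor"]
right = ["versicolor", "virginica", "virginica"]
print(gini_decrease(parent, left, right))  # the best splitter maximizes this value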
Moreover, the petal width variable was then taken into account. Usually, items at leaf nodes are correctly classified.

Random forest

The Random forest algorithm was developed by Leo Breiman and Adele Cutler. Random forests grow many classification trees. They are an ensemble learning method for classification and regression that constructs a number of decision trees at training time and outputs the class that is the mode of the classes output by the individual trees. Single decision trees show the bias-variance tradeoff, so they usually have either high variance or high bias. The two sources of error are:

- Bias: This is an error caused by an erroneous assumption in the learning algorithm
- Variance: This is an error caused by sensitivity to small fluctuations in the training set

Random forests attempt to mitigate this problem by averaging to find a natural balance between the two extremes. A Random forest works on the idea of bagging, which is to average noisy and unbiased models to create a model with low variance. A Random forest algorithm works as a large collection of decorrelated decision trees. To understand the idea of the Random forest algorithm, let's work with an example. Consider that we have a training dataset that has lots of features (explanatory variables) and target variables or classes. We create sample sets from the given dataset, and a different set of random features is taken into account to create each random sub-dataset. From these sub-datasets, different decision trees are created, so we have actually created a forest of different decision trees. Using these different trees, we will create a ranking system for all the classifiers. To predict the class of a new, unknown item, we will use all the decision trees and separately find out which class each tree predicts. See the following diagram for a better understanding of this concept: Different decision trees to predict the class of an unknown item.

In this particular case, we have four different decision trees. We predict the class of an unknown dataset with each of the trees. As per the preceding figure, the first decision tree provides class 2 as the predicted class, the second decision tree predicts class 5, the third decision tree predicts class 5, and the fourth decision tree predicts class 3. Now, the Random forest will vote for each class. So we have one vote each for class 2 and class 3 and two votes for class 5. Therefore, it has decided that for the new unknown dataset, the predicted class is class 5. The class that gets the most votes is chosen for the new dataset. A Random forest has a lot of benefits in classification, and a few of them are mentioned in the following list:

- The combination of learning models increases the accuracy of the classification
- It runs effectively on large datasets as well
- The generated forest can be saved and used for other datasets as well
- It can handle a large number of explanatory variables

Now that we have understood the Random forest theoretically, let's move on to Mahout and use the Random forest algorithm, which is available in Apache Mahout.

Using Mahout for Random forest

Mahout has an implementation of the Random forest algorithm. It is very easy to understand and use. So let's get started.

Dataset

We will use the NSL-KDD dataset. Since 1999, KDD'99 has been the most widely used dataset for the evaluation of anomaly detection methods. This dataset was prepared by S. J. Stolfo and is built based on the data captured in the DARPA'98 IDS evaluation program
(R. P. Lippmann, D. J. Fried, I. Graf, J. W. Haines, K. R. Kendall, D. McClung, D. Weber, S. E. Webster, D. Wyschogrod, R. K. Cunningham, and M. A. Zissman, "Evaluating intrusion detection systems: The 1998 DARPA off-line intrusion detection evaluation," DISCEX, vol. 02, p. 1012, 2000). DARPA'98 is about 4 GB of compressed raw (binary) tcpdump data of 7 weeks of network traffic, which can be processed into about 5 million connection records, each with about 100 bytes. The two weeks of test data have around 2 million connection records. The KDD training dataset consists of approximately 4,900,000 single connection vectors, each of which contains 41 features and is labeled as either normal or an attack, with exactly one specific attack type. NSL-KDD is a dataset suggested to solve some of the inherent problems of the KDD'99 dataset. You can download this dataset from http://nsl.cs.unb.ca/NSL-KDD/. We will download the KDDTrain+_20Percent.ARFF and KDDTest+.ARFF datasets. In KDDTrain+_20Percent.ARFF and KDDTest+.ARFF, remove the first 44 lines (that is, all lines starting with @attribute). If this is not done, we will not be able to generate a descriptor file.

Steps to use the Random forest algorithm in Mahout

The steps to implement the Random forest algorithm in Apache Mahout are as follows:

Transfer the test and training datasets to HDFS using the following commands:

hadoop fs -mkdir /user/hue/KDDTrain
hadoop fs -mkdir /user/hue/KDDTest
hadoop fs -put /tmp/KDDTrain+_20Percent.arff /user/hue/KDDTrain
hadoop fs -put /tmp/KDDTest+.arff /user/hue/KDDTest

Generate the descriptor file. Before you build a Random forest model based on the training data in KDDTrain+.arff, a descriptor file is required. This is because all information in the training dataset needs to be labeled. From the labeled dataset, the algorithm can understand which attributes are numerical and which are categorical. Use the following command to generate the descriptor file:

hadoop jar $MAHOUT_HOME/core/target/mahout-core-xyz.job.jar org.apache.mahout.classifier.df.tools.Describe -p /user/hue/KDDTrain/KDDTrain+_20Percent.arff -f /user/hue/KDDTrain/KDDTrain+.info -d N 3 C 2 N C 4 N C 8 N 2 C 19 N L

Jar: Mahout core jar (xyz stands for the version). If you have directly installed Mahout, it can be found under the /usr/lib/mahout folder. The main class Describe is used here and it takes three parameters:

- -p: the path of the data to be described
- -f: the location for the generated descriptor file
- -d: the information about the attributes of the data; N 3 C 2 N C 4 N C 8 N 2 C 19 N L defines that the dataset starts with a numeric attribute (N), followed by three categorical attributes (C), and so on; the final L defines the label

The output of the previous command is shown in the following screenshot:

Build the Random forest using the following command:

hadoop jar $MAHOUT_HOME/examples/target/mahout-examples-xyz-job.jar org.apache.mahout.classifier.df.mapreduce.BuildForest -Dmapred.max.split.size=1874231 -d /user/hue/KDDTrain/KDDTrain+_20Percent.arff -ds /user/hue/KDDTrain/KDDTrain+.info -sl 5 -p -t 100 -o /user/hue/nsl-forest

Jar: Mahout examples jar (xyz stands for the version). If you have directly installed Mahout, it can be found under the /usr/lib/mahout folder. The main class BuildForest is used to build the forest with the following arguments:

- -Dmapred.max.split.size indicates to Hadoop the maximum size of each partition
- -d stands for the data path
- -ds stands for the location of the descriptor file
- -sl specifies the number of variables selected randomly at each tree node; here, each tree is built using five randomly selected attributes per node
- -p uses the partial data implementation
- -t stands for the number of trees to grow; here, the command builds 100 trees using the partial implementation
- -o stands for the output path that will contain the decision forest

In the end, the process will show the following result:

Use this model to classify the new dataset:

hadoop jar $MAHOUT_HOME/examples/target/mahout-examples-xyz-job.jar org.apache.mahout.classifier.df.mapreduce.TestForest -i /user/hue/KDDTest/KDDTest+.arff -ds /user/hue/KDDTrain/KDDTrain+.info -m /user/hue/nsl-forest -a -mr -o /user/hue/predictions

Jar: Mahout examples jar (xyz stands for the version). If you have directly installed Mahout, it can be found under the /usr/lib/mahout folder. The class to test the forest takes the following parameters:

- -i indicates the path of the test data
- -ds stands for the location of the descriptor file
- -m stands for the location of the forest generated by the previous command
- -a tells it to run the analyzer to compute the confusion matrix
- -mr tells Hadoop to distribute the classification
- -o stands for the location to store the predictions in

The job provides the following confusion matrix. From the confusion matrix, it is clear that 9,396 instances were correctly classified and 315 normal instances were incorrectly classified as anomalies. The accuracy percentage is 77.7635 (correctly classified instances divided by all classified instances). The output file in the predictions folder contains a list of 0s and 1s, where 0 denotes the normal class and 1 denotes an anomaly.

Summary

In this article, we discussed the Random forest algorithm. We started our discussion by understanding the decision tree and continued with an understanding of the Random forest. We took up the NSL-KDD dataset, which is used to build predictive systems for cyber security. We used Mahout to build the Random forest, used it with the test dataset, and generated the confusion matrix and other statistics for the output.

Resources for Article:

Further resources on this subject:
- Implementing the Naïve Bayes classifier in Mahout [article]
- About Cassandra [article]
- Tuning Solr JVM and Container [article]
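Returning to the voting step described earlier, the following minimal Python sketch (a conceptual illustration, not Mahout's actual implementation) shows how a forest combines the predictions of individual trees by majority vote; the example votes mirror the four-tree example in the article.

from collections import Counter

def forest_predict(tree_votes):
    # Majority vote: the class predicted by the most trees wins.
    counts = Counter(tree_votes)
    winner, _ = counts.most_common(1)[0]
    return winner

# The four trees from the article's example predict these classes.
votes = ["class 2", "class 5", "class 5", "class 3"]
print(forest_predict(votes))  # prints "class 5", matching the article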


Practical Applications of Deep Learning

Packt
14 Jan 2016
20 min read
In this article by Yusuke Sugomori, the author of the book Deep Learning with Java, we'll first see how deep learning is actually applied. Here, you will see that the actual cases where deep learning is utilized are still very few. But why aren't there many cases even though it is such an innovative method? What is the problem? Later on, we'll think about the reasons. Furthermore, going forward, we will also consider which fields we can apply deep learning to, and we will have the chance to apply deep learning and all the related areas of artificial intelligence. The topics covered in this article include:

- The difficulties of turning deep learning models into practical applications
- The possible fields where deep learning can be applied, and ideas on how to approach these fields

We'll explore the potential of this big AI boom, which will lead to ideas and hints that you can utilize in deep learning for your research, business, and many sorts of activities.

(For more resources related to this topic, see here.)

The difficulties of deep learning

Deep learning has already achieved higher precision than humans in the image recognition field and has been applied to quite a lot of practical applications. Similarly, in the NLP field, many models have been researched. Then, how much is deep learning utilized in other fields? Surprisingly, there are still few fields where deep learning is successfully utilized. This is because deep learning is indeed innovative compared to past algorithms and definitely lets us take a big step towards materializing AI; however, it has some problems when used for practical applications.

The first problem is that there are too many model parameters in deep learning algorithms. We didn't look into this in detail when you learned about the theory and implementation of the algorithms, but deep neural networks actually have many hyperparameters that need to be decided, compared to past neural networks or other machine learning algorithms. This means we have to go through more trial and error to get high precision. Combinations of parameters that define the structure of a neural network, such as how many hidden layers are to be set or how many units each hidden layer should have, need lots of experiments. Also, the parameters for training and test configurations, such as the learning rate, need to be determined. Furthermore, parameters peculiar to each algorithm, such as the corruption level in SDA and the size of kernels in CNN, need additional trial and error. Thus, the great performance that deep learning provides is supported by steady parameter tuning. However, people only look at one side of deep learning, that it can get great precision, and they tend to forget the hard process required to reach that point. Deep learning is not magic.

In addition, deep learning often fails to train and classify data from simple problems. The shape of deep neural networks is so deep and complicated that the weights can't be well optimized. In terms of optimization, data quantities are also important. This means that deep neural networks require a significant amount of time for each training run.
To sum up, deep learning shows its worth when:

- It solves complicated and hard problems where people have no idea what features the data could be classified by
- There is sufficient training data to properly optimize deep neural networks

Compared to applications that constantly update a model using continuously updated data, applications where a model is built once from a large-scale dataset that doesn't change drastically and is then used universally are rather well suited to deep learning. Therefore, when you look at business fields, you can say that there are more cases where existing machine learning gets better results than deep learning. For example, let's assume we would like to recommend appropriate products to users on an e-commerce (EC) site. On this site, many users buy a lot of products daily, so purchase data is largely updated daily. In this case, do you use deep learning to get high-precision classification and recommendations to increase the conversion rates of users' purchases using this data? Probably not, because using existing machine learning algorithms such as Naive Bayes, collaborative filtering, SVM, and so on, we can get sufficient precision from a practical perspective and can update the model and calculate more quickly, which is usually more appreciated. This is why deep learning is not applied much in business fields. Of course, getting higher precision is better in any field, but in reality, higher precision and the necessary calculation time are in a trade-off relationship.

Although deep learning is significant in the research field, it has many hurdles yet to clear for practical applications. Besides, deep learning algorithms are not perfect, and they still need many improvements to the models themselves. For example, RNN, as mentioned earlier, can only satisfy either how past information can be reflected in a network or how precision can be obtained, although it is helped by techniques such as LSTM. Also, deep learning is still far from true AI, although it's definitely a great technique compared to past algorithms. Research on algorithms is progressing actively, but in the meantime, we need one more breakthrough to spread out and infiltrate deep learning into broader society. Maybe this is not just a problem of the model. Deep learning is suddenly booming because it is reinforced by huge developments in hardware and software. Deep learning is closely related to the development of the surrounding technology.

As mentioned earlier, there are still many hurdles to clear before deep learning can be applied more practically in the real world, but this is not impossible to achieve. It isn't possible to suddenly invent AI that achieves technological singularity, but there are some fields and methods where deep learning can be applied right away. In the next section, we'll think about what kinds of industries deep learning can be utilized in. Hopefully, it will sow the seeds for new ideas in your business or research fields.

The approaches to maximize deep learning possibilities and abilities

There are several approaches to how we can apply deep learning to various industries.
While it is true that the approach may differ depending on the task or purpose, we can briefly categorize the approaches into the following three:

- Field-oriented approach: This utilizes deep learning algorithms or models that are already thoroughly researched and can lead to great performance
- Breakdown-oriented approach: This recasts problems to which deep learning seems hard to apply as different problems to which deep learning can be applied well
- Output-oriented approach: This explores new ways of expressing the output of deep learning

These approaches are all explained in detail in the following subsections. Whether or not each approach fits your own industry, any of them could be a big hint for your activities going forward. There are still very few use cases of deep learning, and their fields of use are biased, but this means there should be many chances to create innovative and new things. Start-ups that utilize deep learning have been emerging recently, and some of them have already achieved success to some extent. You can have a significant impact on the world depending on your ideas.

Field-oriented approach

This approach doesn't require new techniques or algorithms. There are obviously fields that are well suited to current deep learning techniques, and the concept here is to dive into those fields. As explained previously, since the deep learning algorithms that have been practically studied and developed are mostly in image recognition and NLP, we'll explore some fields that can work in great harmony with them.

Medicine

Medical fields should be developed by deep learning. Tumors and cancers are detected on scanned images, which is nothing other than being able to utilize one of the strongest features of deep learning: image recognition. It is possible to dramatically increase precision using deep learning to help with the early detection of an illness and with identifying the kind of illness. Since CNN can be applied to 3D images, 3D scanned images should be analyzable relatively easily. By being adopted more in the current medical field, deep learning should contribute greatly. We can also say that deep learning can be significantly useful for the future medical field. The medical field has been under strict regulations; however, a movement to ease the regulations is progressing in some countries, probably because of the recent development of IT and its potential. Therefore, there will be business opportunities for the medical field and IT to create a synergy effect. For example, if telemedicine becomes more widespread, there is the possibility that diagnosing or identifying a disease can be done not only from a scanned image, but also from an image shown in real time on a display. Also, if electronic charts become widespread, it will be easier to analyze medical data using deep learning. This is because medical records are compatible with deep learning, as they are a dataset of texts and images. Then, the symptoms of unknown diseases could be found.

Automobiles

We can say that the surroundings of running cars are image sequences and text. Other cars and views are images, and a road sign carries text. This means we can also utilize deep learning techniques here, and it is possible to reduce the risk of accidents by improving driving assistance functions. It can be said that the ultimate type of driving assistance is the self-driving car, which is being tackled mainly by Google and Tesla.
An example that is both famous and fascinating was when George Hotz, the first person to hack the iPhone, built a self-driving car in his garage. The car was introduced in an article by Bloomberg Business (http://www.bloomberg.com/features/2015-george-hotz-self-driving-car/), which included an image of it. Self-driving cars have already been tested in the U.S., but since other countries have different traffic rules and road conditions, this idea requires further study and development before self-driving cars are commonly used worldwide. The key to success in this field is learning and recognizing surrounding cars, people, views, and traffic signs, and properly judging how to react to them. In the meantime, we don't have to focus only on utilizing deep learning techniques for the actual body of a car. Let's assume we could develop a smartphone app that has the same function as we just described, that is, recognizing and classifying surrounding images and text. Then, if you just set up the smartphone in your car, you could utilize it as a car-navigation app. In addition, it could be used, for example, as a navigation app for blind people, providing them with reliable directions.

Advert technologies

Advert (ad) technologies could expand their coverage with deep learning. When we say ad technologies, this currently means recommendation or ad networks that optimize which ad banners or products are shown. On the other hand, when we say advertising, this doesn't only mean banners or ad networks. There are various kinds of ads in the world depending on the type of media, such as TV ads, radio ads, newspaper ads, posters, flyers, and so on. We also have digital ad campaigns with YouTube, Vine, Facebook, Twitter, Snapchat, and so on. Advertising itself has changed its definition and content, but all ads have one thing in common: they consist of images and/or language. This means they are fields that deep learning is good at. Until now, we could only use user-behavior-based indicators, such as page views (PV), click-through rate (CTR), and conversion rate (CVR), to estimate the effect of an ad, but if we apply deep learning technologies, we might be able to analyze the actual content of an ad and even auto-generate ads going forward. Especially since movies and videos can only be analyzed as a combination of image recognition and NLP, video recognition, not just image recognition, will gather momentum alongside ad technologies.

Profession or practice

Professions such as doctor, lawyer, patent attorney, and accountant are considered to be roles that deep learning can replace. For example, if NLP's precision and accuracy get higher, any perusal that requires expertise can be left to a machine. As a machine can cover these time-consuming reading tasks, people can focus more on high-value tasks. In addition, if a machine classifies past judicial cases or medical cases, such as what disease caused what symptoms, we would be able to build an app like Apple's Siri that answers simple questions that usually require professional knowledge. Then, the machine could handle these professional cases to some extent if a doctor or a lawyer is too busy to help in a timely manner. It's often said that AI takes away humans' jobs, but personally, this seems incorrect. Rather, a machine takes away menial work, which should support humans. A software engineer who works on AI programming can be described as having a professional job, but this work will also change in the future.
For example, think about a car-related job, where the current work is building standard automobiles, but in the future engineers will be in a position just like pit crews for Formula 1 cars.

Sports

Deep learning can certainly contribute to sports as well. In the study field known as sports science, it has become increasingly important to analyze and examine data from sports. As an example, you may know the book or movie Moneyball, in which the win percentage of a baseball team was hugely increased by adopting a regression model. Watching sports itself is very exciting, but on the other hand, sport can be seen as a chunk of image sequences and numerical data. Since deep learning is good at identifying features that humans can't find, it will become easier to find out why certain players get good scores while others don't.

These fields we have mentioned are only a small part of the many fields where deep learning is capable of contributing significantly to development. We have looked into these fields from the perspective of whether a field has images or text, but of course deep learning should also show great performance for simple analysis of general numerical data. It should be possible to apply deep learning to various other fields such as bioinformatics, finance, agriculture, chemistry, astronomy, economics, and more.

Breakdown-oriented approach

This approach might be similar to the approach taken with traditional machine learning algorithms. We already talked about how feature engineering is the key to improving precision in machine learning. Now we can say that this feature engineering can be divided into the following two parts:

- Engineering under the constraints of a machine learning model. The typical case is making inputs discrete or continuous.
- Feature engineering to increase precision with machine learning. This tends to rely on the intuition of a researcher.

In the narrower sense, feature engineering refers to the second one, and this is the part that deep learning doesn't have to focus on, whereas the first one is definitely an important part even for deep learning. For example, it's difficult to predict stock prices using deep learning. Stock prices are volatile and it's difficult to define the inputs. Besides, how to represent the output value is also a difficult problem. Enabling deep learning to handle these inputs and outputs is also said to be feature engineering in the wider sense. If there is no limitation on the values of the original data and/or the data you would like to predict, it's difficult to feed these datasets into machine learning and deep learning algorithms, including neural networks. However, we can take a certain approach and apply a model to these problems by breaking down the inputs and/or outputs. In terms of NLP, as explained earlier, you might have thought, for example, that it would be impossible to put countless words into features in the first place, but as you already know, we can train feed-forward neural networks with words by representing them as sparse vectors and combining N-grams into them. Of course, we can use not only neural networks but also other machine learning algorithms such as SVM here. Thus, we can cultivate new fields where deep learning hasn't been applied by engineering features to fit well into deep learning models. In the meantime, when we focus on NLP, we can see that RNN and LSTM were developed to properly resolve the difficulties or tasks encountered in NLP.
This can be considered the opposite approach to feature engineering, because in this case the problem is solved by breaking down the model to fit the features. Then, how do we do this engineering for stock prediction, as we just mentioned? It's actually not difficult to think of inputs, that is, features. For example, if you predict stock prices daily, it's hard to work with raw daily stock prices as features, but if you use the rate of price change between a day and the day before, it becomes much easier to process, as the value stays within a certain range and the gradients won't explode easily. Meanwhile, what is difficult is how to deal with the outputs. Stock prices are, of course, continuous values, so the outputs can take countless values. This means that a neural network model, where the number of units in the output layer is fixed, can't handle this problem. What should we do here? Should we give up? No, wait a minute. Unfortunately, we can't predict a stock price itself, but there is an alternative prediction method. Here, the problem is that the stock prices to be predicted fall into infinite patterns. Then, can we make them into a limited number of patterns? Yes, we can. Let's forcibly make them so.

Think about the most extreme but easy to understand case: predicting whether tomorrow's stock price (strictly speaking, the close price) goes up or down, using the data on the stock price up to today. For this case, we can model it with a deep learning model whose inputs are daily price values such as the open price, the close price, and the high price. The features used here are mere examples and need to be fine-tuned when applied to real applications. The point here is that replacing the original task with this type of problem enables deep neural networks to classify the data. Furthermore, if you classify the data by how much it will go up or down, you can make more detailed predictions. For example, you could classify the data as shown in the following table:

- Class 1: Up more than 3 percent from the closing price
- Class 2: Up 1 to 3 percent from the closing price
- Class 3: Up 0 to 1 percent from the closing price
- Class 4: Down 0 to 1 percent from the closing price
- Class 5: Down 1 to 3 percent from the closing price
- Class 6: Down more than 3 percent from the closing price

Whether the prediction actually works, in other words whether the classification works, is unknown until we examine it, but the fluctuation of stock prices can be predicted within a quite narrow range by dividing the outputs into multiple classes. Once we can adapt the task to neural networks, what we should do is just examine which model gets better results. In this example, we may apply RNN, because the stock price is time-sequential data. If we look at charts showing the price as image data, we can also use CNN to predict the future price. So now we've thought about the approach by referring to examples, but to sum up in general, we can say that:

- Feature engineering for models: This is designing inputs or adjusting values to fit deep learning models, or enabling classification by setting a limitation on the outputs
- Model engineering for features: This is devising new neural network models or algorithms to solve problems in a focused field

The first one needs ideas for designing inputs and outputs to fit a model, whereas the second one needs a mathematical approach.
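The following short Python sketch (not from the book) illustrates the first approach, bucketing a daily percentage change of the closing price into the six classes in the preceding table; the thresholds follow that table and the sample values are made up.

def change_to_class(pct_change):
    # Map a daily close-to-close change (in percent) to one of six classes,
    # following the table above.
    if pct_change > 3:
        return 1
    elif pct_change > 1:
        return 2
    elif pct_change >= 0:
        return 3
    elif pct_change >= -1:
        return 4
    elif pct_change >= -3:
        return 5
    else:
        return 6

# Made-up daily percentage changes for illustration.
for change in [4.2, 0.5, -0.3, -2.7]:
    print(change, "-> Class", change_to_class(change))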
Feature engineering for models might be easier to start with if you consciously limit what the model is asked to predict.

Output-oriented approach

The two approaches mentioned previously aim to increase the percentage of correct answers for a certain field's task or problem using deep learning. Of course, that is essential and is where deep learning proves its worth; however, pushing precision to the ultimate level may not be the only way of utilizing deep learning. Another approach is to devise the outputs of deep learning by slightly changing the point of view. Let's see what this means. Deep learning is applauded as an innovative approach among researchers and technical experts in AI, but the world in general doesn't know much about its greatness yet. Rather, people pay attention to what a machine can't do. For example, people don't really focus on the image recognition capabilities of CNN on MNIST, which achieve a lower error rate than humans; instead, they criticize that a machine can't recognize images perfectly. This is probably because people expect a lot when they hear about and imagine AI. We might need to change this mindset. Let's consider DORAEMON, a Japanese national cartoon character who is also famous worldwide: a robot who has high intelligence and AI, but often makes silly mistakes. Do we criticize him? No, we just laugh it off or take it as a joke and don't get serious. Also, think about DUMMY / DUM-E, the robot arm in the movie Iron Man. It has AI as well, but makes silly mistakes. See, they make mistakes but we still like them. In this way, it might be better to emphasize the point that machines make mistakes. Changing how the output is expressed in the user interface, rather than just studying the algorithm, could be the trigger for people to adopt AI. Who knows? It's highly likely that you can gain the world's interest by thinking of ideas in creative fields, not just from the perspective of precision. Deep Dream by Google is one good example. We can do more exciting things when art or design and deep learning collaborate.

Summary

And... congratulations! You've just accomplished the learning part of deep learning with Java. Although there are still some models that have not been mentioned, you can be sure there will be no problem in acquiring and utilizing them.

Resources for Article:

Further resources on this subject:
- Setup Routine for an Enterprise Spring Application [article]
- Support for Developers of Spring Web Flow 2 [article]
- Using Spring JMX within Java Applications [article]


Introduction to Scikit-Learn

Packt
16 Feb 2016
5 min read
Introduction

Since its release in 2007, scikit-learn has become one of the most popular open source machine learning libraries for Python. scikit-learn provides algorithms for machine learning tasks including classification, regression, dimensionality reduction, and clustering. It also provides modules for extracting features, processing data, and evaluating models.

(For more resources related to this topic, see here.)

Conceived as an extension to the SciPy library, scikit-learn is built on the popular Python libraries NumPy and matplotlib. NumPy extends Python to support efficient operations on large arrays and multidimensional matrices. matplotlib provides visualization tools, and SciPy provides modules for scientific computing. scikit-learn is popular for academic research because it has a well-documented, easy-to-use, and versatile API. Developers can use scikit-learn to experiment with different algorithms by changing only a few lines of code. scikit-learn wraps some popular implementations of machine learning algorithms, such as LIBSVM and LIBLINEAR. Other Python libraries, including NLTK, include wrappers for scikit-learn. scikit-learn also includes a variety of datasets, allowing developers to focus on algorithms rather than obtaining and cleaning data. Licensed under the permissive BSD license, scikit-learn can be used in commercial applications without restrictions. Many of scikit-learn's algorithms are fast and scalable to all but massive datasets. Finally, scikit-learn is noted for its reliability; much of the library is covered by automated tests.

Installing scikit-learn

This book is written for version 0.15.1 of scikit-learn; use this version to ensure that the examples run correctly. If you have previously installed scikit-learn, you can retrieve the version number with the following code:

>>> import sklearn
>>> sklearn.__version__
'0.15.1'

If you have not previously installed scikit-learn, you can install it from a package manager or build it from the source. We will review the installation processes for Linux, OS X, and Windows in the following sections, but refer to http://scikit-learn.org/stable/install.html for the latest instructions. The following instructions only assume that you have installed Python 2.6, Python 2.7, or Python 3.2 or newer. Go to http://www.python.org/download/ for instructions on how to install Python.

Installing scikit-learn on Windows

scikit-learn requires Setuptools, a third-party package that supports packaging and installing software for Python. Setuptools can be installed on Windows by running the bootstrap script at https://bitbucket.org/pypa/setuptools/raw/bootstrap/ez_setup.py. Windows binaries for the 32- and 64-bit versions of scikit-learn are also available. If you cannot determine which version you need, install the 32-bit version. Both versions depend on NumPy 1.3 or newer. The 32-bit version of NumPy can be downloaded from http://sourceforge.net/projects/numpy/files/NumPy/. The 64-bit version can be downloaded from http://www.lfd.uci.edu/~gohlke/pythonlibs/#scikit-learn. A Windows installer for the 32-bit version of scikit-learn can be downloaded from http://sourceforge.net/projects/scikit-learn/files/. An installer for the 64-bit version of scikit-learn can be downloaded from http://www.lfd.uci.edu/~gohlke/pythonlibs/#scikit-learn. scikit-learn can also be built from the source code on Windows. Building requires a C/C++ compiler such as MinGW (http://www.mingw.org/), NumPy, SciPy, and Setuptools.
To build, clone the Git repository from https://github.com/scikit-learn/scikit-learn and execute the following command:

python setup.py install

Installing scikit-learn on Linux

There are several options to install scikit-learn on Linux, depending on your distribution. The preferred option is to use pip. You may also install it using a package manager, or build scikit-learn from its source. To install scikit-learn using pip, execute the following command:

sudo pip install scikit-learn

To build scikit-learn, clone the Git repository from https://github.com/scikit-learn/scikit-learn. Then install the following dependencies:

sudo apt-get install python-dev python-numpy python-numpy-dev python-setuptools python-scipy libatlas-dev g++

Navigate to the repository's directory and execute the following command:

python setup.py install

Installing scikit-learn on OS X

scikit-learn can be installed on OS X using MacPorts:

sudo port install py26-sklearn

If Python 2.7 is installed, run the following command:

sudo port install py27-sklearn

scikit-learn can also be installed using pip with the following command:

pip install scikit-learn

Verifying the installation

To verify that scikit-learn has been installed correctly, open a Python console and execute the following:

>>> import sklearn
>>> sklearn.__version__
'0.15.1'

To run scikit-learn's unit tests, first install the nose library. Then execute the following:

nosetests sklearn --exe

Congratulations! You've successfully installed scikit-learn.

Summary

In this article, we had a brief introduction to scikit-learn. We also covered the installation of scikit-learn on various operating systems: Windows, Linux, and OS X. You can also refer to the following books on similar topics:

- scikit-learn Cookbook (https://www.packtpub.com/big-data-and-business-intelligence/scikit-learn-cookbook)
- Learning scikit-learn: Machine Learning in Python (https://www.packtpub.com/big-data-and-business-intelligence/learning-scikit-learn-machine-learning-python)

Resources for Article:

Further resources on this subject:
- Our First Machine Learning Method – Linear Classification [article]
- Specialized Machine Learning Topics [article]
- Machine Learning [article]
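Beyond checking the version string, a minimal end-to-end smoke test (not part of the original article) is to fit a classifier on one of the bundled datasets; the snippet below uses only long-stable scikit-learn APIs, and the choice of a k-nearest neighbors classifier is just an illustrative assumption.

from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

# Load the bundled iris dataset: 150 samples, 4 features, 3 classes.
iris = load_iris()

# Fit a simple k-nearest neighbors classifier on all of the data.
clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(iris.data, iris.target)

# Score on the training data; this only confirms the install works,
# it is not a proper evaluation of generalization.
print(clf.score(iris.data, iris.target))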


Getting Started with Oracle Data Guard

Packt
02 Jul 2013
13 min read
(For more resources related to this topic, see here.)

What is Data Guard?

Data Guard, which was introduced as the standby database feature in Oracle database Version 7.3 and has carried the name Data Guard since Version 9i, is a data protection and availability solution for Oracle databases. The basic function of Oracle Data Guard is to keep a synchronized copy of a database as a standby, in order to provide for cases in which the primary database becomes inaccessible to end users, such as hardware errors, natural disasters, and so on. Each new Oracle release added new functionality to Data Guard, and the product became more and more popular with offerings such as data protection, high availability, and disaster recovery for Oracle databases.

Using Oracle Data Guard, it's possible to direct user connections to a Data Guard standby database automatically, with no data loss, in case of an outage in the primary database. Data Guard also lets you take advantage of the standby database for reporting, testing, and backup offloading. Corruptions on the primary database may be fixed automatically by using the non-corrupted data blocks on the standby database. There will be minimal outages (seconds to minutes) on the primary database during planned maintenance, such as patching and hardware changes, by using the switchover feature of Data Guard, which changes the roles of the primary and standby databases. All of these features are available with Data Guard, which doesn't require a separate installation but rather a cloning and configuration of the Oracle database.

A Data Guard configuration consists of two main components: the primary database and the standby database. The primary database is the database for which we want to take precautions against inaccessibility. Fundamentally, changes to the data of the primary database are passed to the standby database, and these changes are applied there in order to keep it synchronized. The following figure shows the general structure of Data Guard. Let's look at the standby database and its properties more closely.

Standby database

It is possible to configure a standby database simply by copying, cloning, or restoring a primary database to a different server. Then the Data Guard configuration is made on the databases in order to start the transfer of redo information from primary to standby, and also to start the apply process on the standby database. Primary and standby databases may exist on the same server; however, this kind of configuration should only be used for testing. In a production environment, the primary and standby database servers are generally preferred to be in separate data centers.

Data Guard keeps the primary and standby databases synchronized by using redo information. As you may know, transactions on an Oracle database produce redo records. This redo information keeps all of the changes made to the database. The Oracle database first creates redo information in memory (in the redo log buffers). It is then written into online redo logfiles, and when an online redo logfile is full, its content is written into an archived redo log. An Oracle database can run in the ARCHIVELOG mode or the NOARCHIVELOG mode. In the ARCHIVELOG mode, online redo logfiles are written into archived redo logs, and in the NOARCHIVELOG mode, redo logfiles are overwritten without being archived as they become full. In a Data Guard environment, the primary database must be in the ARCHIVELOG mode.
In Data Guard, the transfer of changed data from the primary to the standby database is achieved with redo, with no alternative. However, the process of applying the redo content to the standby database may vary. The different apply methods give rise to different types of standby databases. There were two kinds of standby databases before Oracle Database Version 11g: the physical standby database and the logical standby database. With Version 11g, we should mention a third type of standby database, which is the snapshot standby. Let's look at the properties of these standby database types.

Physical standby database

The physical standby database is a block-based copy of the primary database. In a physical standby environment, in addition to containing the same database objects and the same data, the primary and standby databases are identical on a block-for-block basis. Physical standby databases use the Redo Apply method to apply changes. Redo Apply uses the managed recovery process (MRP) to manage the application of the change information in the redo. In Version 11g, a physical standby database can be accessible in read-only mode while Redo Apply is working, which is called Active Data Guard. Using the Active Data Guard feature, we can offload reporting jobs from the primary to the physical standby database. The physical standby database is the only option that has no limitation on storage vendor or data types for keeping a synchronized copy of the primary database.

Logical standby database

The logical standby database is a feature introduced in Version 9i R2. In this configuration, redo data is first converted into SQL statements and then applied to the standby database. This process is called SQL Apply. This method makes it possible to access the standby database permanently and allows read/write while the replication of data is active. Thus, you're also able to create database objects on the standby database that don't exist on the primary database. So a logical standby database can be used for many other purposes along with high availability and disaster recovery. Due to the basics of SQL Apply, a logical standby database will contain the same data as the primary database, but in a different structure on the disks. One discouraging aspect of the logical standby database is the unsupported data types, objects, and DDLs. The following data types are not supported for replication in a logical standby environment:
- BFILE
- Collections (including VARRAYs and nested tables)
- Multimedia data types (including Spatial, Image, and Oracle Text)
- ROWID and UROWID
- User-defined types

The logical standby database doesn't guarantee that it contains all primary data, because of the unsupported data types, objects, and DDLs. Also, SQL Apply consumes more hardware resources. Therefore, it certainly brings more performance issues and administrative complexity than Redo Apply.

Snapshot standby database

In principle, a snapshot standby database is a special state of a physical standby database. Snapshot standby is a feature that is available with Oracle Database Version 11g. When you convert a physical standby database into a snapshot standby database, it becomes accessible for read/write. You can run tests on this database and change the data. When you're finished with the snapshot standby database, it's possible to reverse all the changes made to the database and turn it back into a physical standby again. An important point here is that a snapshot standby database can't run Redo Apply. Redo transfer continues, but the standby is not able to apply the redo.
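To make the distinction concrete, the conversion between the two roles boils down to a pair of SQL statements run on the standby. The following is a minimal sketch, not a complete procedure; the exact sequence of stopping recovery and restarting the instance in MOUNT mode depends on your configuration:

-- on the physical standby, with Redo Apply stopped
ALTER DATABASE RECOVER MANAGED STANDBY DATABASE CANCEL;
ALTER DATABASE CONVERT TO SNAPSHOT STANDBY;
-- ... run read/write tests on the snapshot standby ...
-- to revert, restart the database in MOUNT mode, then:
ALTER DATABASE CONVERT TO PHYSICAL STANDBY;
-- verify the current role
SELECT DATABASE_ROLE FROM V$DATABASE;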
Oracle Data Guard evolution

The Oracle Data Guard technology has been part of database administrators' lives for a long time, and it has evolved considerably from its beginnings up to 11g R2. Let's look at this evolution closely through the different database versions.

Version 7.3 – stone age

The functionality of keeping a duplicate database on a separate server, which can be synchronized with the primary database, came with Oracle database Version 7.3 under the name of standby database. This standby database was constantly in recovery mode, waiting for the archived redo logs to be synchronized. However, this feature was not able to automate the transfer of archived redo logs. Database administrators had to find a way to transfer archived redo logs and apply them to the standby server continuously. This was generally accomplished by a script running in the background. The only aim of the Version 7.3 standby database was disaster recovery. It was not possible to query the standby database or to open it for any purpose other than activating it in the event of failure of the primary database. Once the standby database was activated, it couldn't be returned to the standby recovery mode again.

Version 8i – first age

Oracle database Version 8i brought the much-awaited features to the standby database and made the archived log shipping and apply processes automatic, which are now called the managed standby environment and managed recovery, respectively. However, some users chose to apply the archived logs manually, because it was not possible to set a delay in the managed recovery mode. This mode carried the risk that accidental operations would be reflected on the standby database quickly. Along with the "managed" modes, 8i made it possible to open a standby database with the read-only option and allowed it to be used as a reporting database. Even though there were new features that made the tool more manageable and practical, there were still serious deficiencies. For example, when we added a datafile or created a tablespace on the primary database, these changes were not replicated to the standby database. Database administrators had to take care of this maintenance on the standby database. Also, when we opened the primary database with resetlogs or restored a backup control file, we had to re-create the standby database.

Version 9i – middle age

First of all, with this version the Oracle8i standby database was renamed to Oracle9i Data Guard. 9i Data Guard includes very important new features, which make the product much more reliable and functional. The following features were included:
- The Oracle Data Guard Broker management framework, which is used to centralize and automate the configuration, monitoring, and management of Oracle Data Guard installations, was introduced with this version.
- Zero data loss on failover was guaranteed as a configuration option.
- Switchover was introduced, which made it possible to change the roles of the primary and standby. This made it possible to accomplish planned maintenance on the primary database with very little service outage.
- Standby database administration became simpler, because new datafiles on the primary database are created automatically on the standby, and if there are missing archived logs on the standby (which is called a gap), Data Guard detects and transmits the missing logs to the standby automatically.
- The Delay option was added, which made it possible to configure a standby database that always lags behind the primary by a specified time delay.
- Parallel recovery increased recovery performance on the standby database.

In Version 9i Release 2, which was introduced in May 2002, one year after Release 1, more very important features were announced. They are as follows:
- The logical standby database was introduced, which we've mentioned earlier in this article.
- Three data protection modes were ready to use: Maximum Protection, Maximum Availability, and Maximum Performance, which offered more flexibility in configuration.
- The cascade standby database feature made it possible to configure a second standby database, which receives its redo data from the first standby database.

Version 10g – new age

The 10g version again introduced important Data Guard features, but we can say that it perhaps fell behind expectations because of the revolutionary changes in release 9i. The following new features were introduced in Version 10g:
- One of the most important features of 10g was Real-Time Apply. When running in Real-Time Apply mode, the standby database applies changes from the redo immediately after receiving it; the standby does not wait for the standby redo logfile to be archived. This provides faster switchover and failover.
- Flashback database support was introduced, which made it unnecessary to configure a delay in the Data Guard configuration. Using flashback technology, it was possible to flash back a standby database to a point in time.
- With 10g Data Guard, if we open a primary database with resetlogs, it is not required to re-create the standby database. The standby is able to recover through resetlogs.
- Version 10g made it possible to use logical standby databases for rolling upgrades of the primary database's software. This method made it possible to lessen the service outage time by performing a switchover to the logical standby database.

10g Release 2 also introduced new features to Data Guard, but these features again were not compelling enough to drive a jump to the Data Guard technology. The two most important features were Fast-Start Failover and the use of the Guaranteed restore point:
- Fast-Start Failover automated and accelerated the failover operation when the primary database was lost. This option strengthened the disaster recovery role of Oracle Data Guard.
- The Guaranteed restore point was not actually a Data Guard feature. It was a database feature, which made it possible to revert a database to the moment that the Guaranteed restore point was created, as long as there is sufficient disk space for the flashback logs. Using this feature, the following scenario became possible: activate a physical standby database after stopping Redo Apply, use it for testing with read/write operations, then revert the changes, make it a standby again, and synchronize it with the primary. Using a standby database read/write offered great flexibility to users, but archived log shipping was not able to continue while the standby was read/write, and this would cause data loss in the event of a primary database failure during that window.
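The restore-point part of that 10g R2 scenario comes down to a couple of statements. The following is a hedged, minimal sketch only; the restore point name is illustrative, and the surrounding steps of stopping Redo Apply, activating the standby, and converting it back to a physical standby are omitted and release-dependent:

-- on the standby, before opening it read/write for testing
CREATE RESTORE POINT before_testing GUARANTEE FLASHBACK DATABASE;
-- ... activate the standby and run the tests ...
-- afterwards, revert every change made during testing
SHUTDOWN IMMEDIATE;
STARTUP MOUNT;
FLASHBACK DATABASE TO RESTORE POINT before_testing;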
Version 11g – modern age

Oracle database Version 11g offered the expected jump in the Data Guard technology, especially with two new features, called Active Data Guard and snapshot standby. The following features were introduced:
- Active Data Guard has been a milestone in Data Guard history; it enables queries on a physical standby database while media recovery is active.
- Snapshot standby is a feature for using a physical standby database read/write for test purposes. As we mentioned, this was possible with the 10g R2 Guaranteed restore point feature, but with snapshot standby, 11g provides continuous archived log shipping during the period that the standby is read/write.
- It has become possible to compress redo traffic in a Data Guard configuration, which is useful with excessive redo generation rates and when resolving gaps. Compression of redo when resolving gaps was introduced in 11g R1, and compression of all redo data was introduced in 11g R2.
- Use of physical standby databases for rolling upgrades of the database software was enabled, also known as Transient Logical Standby.
- It became possible to include different operating systems, such as Windows and Linux, in a single Data Guard configuration.
- Lost-write, a serious type of data corruption arising from the storage subsystem misreporting the completed write of a block, can be detected in an 11g Data Guard configuration. Recovery is automatically stopped in such a case.
- The RMAN fast incremental backup feature, Block Change Tracking, can be run on an Active Data Guard-enabled standby database.
- Another very important enhancement in 11g was the Automatic Block Corruption Repair feature, introduced with 11g R2. With this feature, a corrupted data block on the primary database can be automatically replaced with an uncorrupted copy from a physical standby database in Active Data Guard mode, and vice versa.

We've gone through the evolution of Oracle Data Guard from its beginning until today. As you may notice, Data Guard started its life as a very simple database property, intended to keep a synchronized database copy with a lot of manual work, and now it's a sophisticated tool with advanced automation, precaution, and monitoring features. Now let's move on to the architecture and components of Oracle Data Guard 11g R2.


Installing Coherence 3.5 and Accessing the Data Grid: Part 1

Packt
31 Mar 2010
10 min read
When I first started evaluating Coherence, one of my biggest concerns was how easy it would be to set up and use, especially in a development environment. The whole idea of having to set up a cluster scared me quite a bit, as any other solution I had encountered up to that point that had the word "cluster" in it was extremely difficult and time consuming to configure. My fear was completely unfounded: getting the Coherence cluster up and running is as easy as starting Tomcat. You can start multiple Coherence nodes on a single physical machine, and they will seamlessly form a cluster. Actually, it is easier than starting Tomcat.

Installing Coherence

In order to install Coherence you need to download the latest release from the Oracle Technology Network (OTN) website. The easiest way to do so is by following the link from the main Coherence page on OTN. At the time of this writing, this page was located at http://www.oracle.com/technology/products/coherence/index.html, but that might change. If it does, you can find its new location by searching for 'Oracle Coherence' using your favorite search engine. In order to download Coherence for evaluation, you will need to have an Oracle Technology Network (OTN) account. If you don't have one, registration is easy and completely free. Once you are logged in, you will be able to access the Coherence download page, where you will find the download links for all available Coherence releases: one for Java, one for .NET, and one for each of the supported C++ platforms. You can download any of the Coherence releases you are interested in while you are there, but for the remainder of this article you will only need the first one. The latter two (.NET and C++) are client libraries that allow .NET and C++ applications to access the Coherence data grid.

Coherence ships as a single ZIP archive. Once you unpack it you should see the README.txt file containing the full product name and version number, and a single directory named coherence. Copy the contents of the coherence directory to a location of your choice on your hard drive. The common location on Windows is c:\coherence and on Unix/Linux /opt/coherence, but you are free to put it wherever you want. The last thing you need to do is to configure the environment variable COHERENCE_HOME to point to the top-level Coherence directory created in the previous step, and you are done. Coherence is a Java application, so you also need to ensure that you have Java SDK 1.4.2 or later installed and that the JAVA_HOME environment variable is properly set to point to the Java SDK installation directory. If you are using a JVM other than Sun's, you might need to edit the scripts used in the following section. For example, not all JVMs support the -server option that is used while starting the Coherence nodes, so you might need to remove it.
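On a Unix/Linux machine, for example, the environment setup might look something like the following sketch; the paths shown are assumptions and should be replaced with your own installation locations:

$ export COHERENCE_HOME=/opt/coherence
$ export JAVA_HOME=/usr/java/jdk1.6.0
$ export PATH=$JAVA_HOME/bin:$PATH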
What's in the box?

The first thing you should do after installing Coherence is become familiar with the structure of the Coherence installation directory. There are four subdirectories within the Coherence home directory:
- bin: This contains a number of useful batch files for Windows and shell scripts for Unix/Linux that can be used to start Coherence nodes or to perform various network tests
- doc: This contains the Coherence API documentation, as well as links to online copies of the Release Notes, User Guide, and Frequently Asked Questions documents
- examples: This contains several basic examples of Coherence functionality
- lib: This contains JAR files that implement Coherence functionality

Shell scripts on Unix
If you are on a Unix-based system, you will need to add execute permission to the shell scripts in the bin directory by executing the following command:

$ chmod u+x *.sh

Starting up the Coherence cluster

In order to get the Coherence cluster up and running, you need to start one or more Coherence nodes. The Coherence nodes can run on a single physical machine, or on many physical machines that are on the same network. The latter will definitely be the case for a production deployment, but for development purposes you will likely want to limit the cluster to a single desktop or laptop. The easiest way to start a Coherence node is to run the cache-server.cmd batch file on Windows or the cache-server.sh shell script on Unix. The end result in either case should be similar to the following screenshot:

There is quite a bit of information on this screen, and over time you will become familiar with each section. For now, notice two things:
- At the very top of the screen, you can see the information about the Coherence version that you are using, as well as the specific edition and the mode that the node is running in. Notice that by default you are using the most powerful, Grid Edition, in development mode.
- The MasterMemberSet section towards the bottom lists all members of the cluster and provides some useful information about the current and the oldest member of the cluster.

Now that we have a single Coherence node running, let's start another one by running the cache-server script in a different terminal window. For the most part, the output should be very similar to the previous screen, but if everything has gone according to plan, the MasterMemberSet section should reflect the fact that the second node has joined the cluster:

MasterMemberSet
  (
  ThisMember=Member(Id=2, ...)
  OldestMember=Member(Id=1, ...)
  ActualMemberSet=MemberSet(Size=2, BitSetCount=2
    Member(Id=1, ...)
    Member(Id=2, ...)
    )
  RecycleMillis=120000
  RecycleSet=MemberSet(Size=0, BitSetCount=0)
  )

You should also see several log messages on the first node's console, letting you know that another node has joined the cluster and that some of the distributed cache partitions were transferred to it. If you can see these log messages on the first node, as well as two members within the ActualMemberSet on the second node, congratulations: you have a working Coherence cluster.

Troubleshooting cluster start-up

In some cases, a Coherence node will not be able to start or to join the cluster. In general, the cause could be any of a number of networking-related issues, but in practice a few issues are responsible for the vast majority of problems.

Multicast issues

By far the most common issue is that multicast is disabled on the machine. By default, Coherence uses multicast for its cluster join protocol, and it will not be able to form the cluster unless it is enabled. You can easily check if multicast is enabled and working properly by running the multicast-test shell script within the bin directory.
If you are unable to start the cluster on a single machine, you can execute the following command from your Coherence home directory:

$ . bin/multicast-test.sh -ttl 0

This will limit the time-to-live of multicast packets to the local machine and allow you to test multicast in isolation. If everything is working properly, you should see a result similar to the following:

Starting test on ip=Aleks-Mac-Pro.home/192.168.1.7, group=/237.0.0.1:9000, ttl=0
Configuring multicast socket...
Starting listener...
Fri Aug 07 13:44:44 EDT 2009: Sent packet 1.
Fri Aug 07 13:44:44 EDT 2009: Received test packet 1 from self
Fri Aug 07 13:44:46 EDT 2009: Sent packet 2.
Fri Aug 07 13:44:46 EDT 2009: Received test packet 2 from self
Fri Aug 07 13:44:48 EDT 2009: Sent packet 3.
Fri Aug 07 13:44:48 EDT 2009: Received test packet 3 from self

If the output is different from the above, it is likely that multicast is not working properly or is disabled on your machine. This is frequently the result of a firewall or VPN software running, so the first troubleshooting step would be to disable such software and retry. If you determine that was indeed the cause of the problem, you have two options. The first, and obvious one, is to turn the offending software off while using Coherence. However, for various reasons that might not be an acceptable solution, in which case you will need to change the default Coherence behavior and tell it to use the Well-Known Addresses (WKA) feature instead of multicast for the cluster join protocol. Doing so on a development machine is very simple: all you need to do is add the following argument to the JAVA_OPTS variable within the cache-server shell script:

-Dtangosol.coherence.wka=localhost

With that in place, you should be able to start Coherence nodes even if multicast is disabled.

Localhost and loopback address
On some systems, localhost maps to a loopback address, 127.0.0.1. If that's the case, you will have to specify the actual IP address or host name for the tangosol.coherence.wka configuration parameter. The host name should be preferred, as the IP address can change as you move from network to network, or if your machine leases an IP address from a DHCP server.

As a side note, you can tell whether WKA or multicast is being used for the cluster join protocol by looking at the section above the MasterMemberSet section when the Coherence node starts. If multicast is used, you will see something similar to the following:

Group{Address=224.3.5.1, Port=35461, TTL=4}

The actual multicast group address and port depend on the Coherence version being used. As a matter of fact, you can even tell the exact version and build number from the preceding information. In this particular case, I am using the Coherence 3.5.1 release, build 461. This is done in order to prevent accidental joins of cluster members into an existing cluster. For example, you wouldn't want a node in the development environment using a newer version of Coherence that you are evaluating to join the existing production cluster, which could easily happen if the multicast group address remained the same.
On the other hand, if you are using WKA, you should see output similar to the following instead:

WellKnownAddressList(Size=1,
  WKA{Address=192.168.1.7, Port=8088}
  )

Using the WKA feature completely disables multicast in a Coherence cluster, and is recommended for most production deployments, primarily because many production environments prohibit multicast traffic altogether, and because some network switches do not route multicast traffic properly. That said, configuring WKA for production clusters is out of the scope of this article, and you should refer to the Coherence product manuals for details.

Binding issues

Another issue that sometimes comes up is that one of the ports that Coherence attempts to bind to is already in use, and you see a bind exception when attempting to start the node. By default, Coherence starts the first node on port 8088, and increments the port number by one for each subsequent node on the same machine. If for some reason that doesn't work for you, you need to identify a range of available ports for as many nodes as you are planning to start (both UDP and TCP ports with the same numbers must be available), and tell Coherence which port to use for the first node by specifying the tangosol.coherence.localport system property. For example, if you want Coherence to use port 9100 for the first node, you will need to add the following argument to the JAVA_OPTS variable in the cache-server shell script:

-Dtangosol.coherence.localport=9100
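Putting the two overrides together, the edit to a development copy of the script might look something like the following sketch. How JAVA_OPTS is defined varies between Coherence releases, so treat the exact line as an assumption and adapt it to the script you actually have:

$ vi $COHERENCE_HOME/bin/cache-server.sh
# inside the script, extend the JAVA_OPTS definition, for example:
JAVA_OPTS="$JAVA_OPTS -Dtangosol.coherence.wka=localhost -Dtangosol.coherence.localport=9100"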

Oracle: When to use Log Miner

Packt
07 May 2010
6 min read
Log Miner has both a GUI interface in OEM as well as the database package, DBMS_LOGMNR. When this utility is used by the DBA, its primary focus is to mine data from the online and archived redo logs. Internally, Oracle uses the Log Miner technology for several other features, such as Flashback Transaction Backout, Streams, and Logical Standby Databases. This section is not about how to run Log Miner, but looks at the task of identifying the information to restore. The Log Miner utility comes into play when you need to retrieve an older version of selected pieces of data without completely recovering the entire database. A complete recovery is usually a drastic measure that means downtime for all users and the possibility of lost transactions. Most often, Log Miner is used for recovery purposes when the data consists of just a few tables or a single code change. Make sure supplemental logging is turned on (see the Add Supplemental Logging section). In this case, you discover that one or more of the following conditions apply when trying to recover a small amount of data that was recently changed:
- Flashback is not enabled
- Flashback logs that are needed are no longer available
- Data that is needed is not available in the online redo logs
- Data that is needed has been overwritten in the undo segments

Go to the last place available: the archived redo logs. This requires the database to be in archivelog mode and for all archive logs that are needed to still be available or recoverable.

Identifying the data needed to restore

One of the hardest parts of restoring data is determining what to restore, the basic question being: when did the bad data become part of the collective? Think the Borg from Star Trek! When you need to execute Log Miner to retrieve data from a production database, you will need to act fast. The older the transactions, the longer it will take to recover and traverse with Log Miner. The newest (committed) transactions are processed first, proceeding backwards. The first question to ask is: when do you think the bad event happened? Searching for data can be done in several different ways:
- SCN, timestamp, or log sequence number
- The pseudo column ORA_ROWSCN

SCN, timestamp, or log sequence number

If you are lucky, the application also writes a timestamp of when the data was last changed. If that is the case, then you determine the archive log to mine by using the following queries. It is important to set the session NLS_DATE_FORMAT so that the time element is displayed along with the date, otherwise you will just get the default date format of DD-MMM-RR. The date format comes from the database startup parameters (the NLS_TERRITORY setting). Find the time when a log was archived and match that to the archive log needed.

Pseudo column ORA_ROWSCN

While this method seems very elegant, it does not work perfectly, meaning it won't always return the correct answer. As it may not work every time or accurately, it is generally not recommended for Flashback Transaction Queries. It is definitely worth trying in order to narrow the window that you will have to search. It uses the SCN information that was stored for the associated transaction in the Interested Transaction List. You know that delayed block cleanout is involved. The pseudo column ORA_ROWSCN contains information about the approximate time this table was updated for each row. In the following example, the table has three rows, with the last row being the one that was most recently updated. It gives me the time window to search the archive logs with Log Miner.
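The queries referred to above are not reproduced here, so the following is only a hedged sketch of the general idea: set the session date format, look up the archived logs that cover the suspected time window, and check the approximate change time of the rows via ORA_ROWSCN. The table name and the one-day window are illustrative assumptions:

SQL> ALTER SESSION SET NLS_DATE_FORMAT='DD-MON-YYYY HH24:MI:SS';
SQL> SELECT SEQUENCE#, NAME, FIRST_TIME, NEXT_TIME
       FROM V$ARCHIVED_LOG
      WHERE FIRST_TIME BETWEEN SYSDATE - 1 AND SYSDATE;
SQL> SELECT ORA_ROWSCN, SCN_TO_TIMESTAMP(ORA_ROWSCN) AS APPROX_CHANGE_TIME, T.*
       FROM MY_SCHEMA.MY_TABLE T;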
Log Miner is the basic technology behind several of the database Maximum Availability Architecture capabilities: Logical Standby, Streams, and the following Flashback Transaction Backout exercise.

Flashback Transaction Query and Backout

Flashback technology was first introduced in Oracle9i Database. This feature allows you to view data at different points in time and with more recent timestamps (versions), and thus provides the capability to recover previous versions of data. In this article, we are dealing with Flashback Transaction Query (FTQ) and Flashback Transaction Backout (FTB), because they both deal with transaction IDs and integrate with the Log Miner utility. See the MOS document: "What Do All 10g Flashback Features Rely on and what are their Limitations?" (Doc ID 435998.1). Flashback Transaction Query uses the transaction ID (Xid) that is stored with each row version in a Flashback Versions Query to display every transaction that changed the row. Currently, the only Flashback technology that can be used when the object(s) in question have been changed by DDL is Flashback Data Archive. There are other restrictions to using FTB with certain data types (VARRAYs, BFILEs), which match the data type restrictions for Log Miner. This basically means that if the data types aren't supported, then you can't use Log Miner to find the undo and redo log entries.

When would you use FTQ or FTB instead of the previously described methods? The answer is when the data involves several tables with multiple constraints or extensive amounts of information. Similar to Log Miner, the database can be up and running while people are working online in other schemas of the database to accomplish this restore task. An example of using FTB or FTQ would be to reverse a payroll batch job that was run with the wrong parameters. Most often a batch job is compiled code (like C or Cobol) run against the database, with parameters built in by the application vendor. A wrong parameter could be the wrong payroll period, wrong set of employees, wrong tax calculations, or wrong payroll deductions.

Enabling flashback logs

First of all, flashback needs to be enabled in the database. Oracle Flashback is the database technology intended for point-in-time recovery (PITR) by saving transactions in flashback logs. A flashback log is a temporary Oracle file and is required to be stored in the FRA, as it cannot be backed up to any other media. Extensive information on all of the ramifications of enabling flashback is found in the documentation labeled: Oracle Database Backup and Recovery User's Guide. See the following section for an example of how to enable flashback:

SYS@NEWDB> ALTER SYSTEM SET DB_RECOVERY_FILE_DEST='/backup/flash_recovery_area/NEWDB' SCOPE=BOTH;
SYS@NEWDB> ALTER SYSTEM SET DB_RECOVERY_FILE_DEST_SIZE=100M SCOPE=BOTH;
-- this is sized for a small test database
SYS@NEWDB> SHUTDOWN IMMEDIATE;
SYS@NEWDB> STARTUP MOUNT EXCLUSIVE;
SYS@NEWDB> ALTER DATABASE FLASHBACK ON;
SYS@NEWDB> ALTER DATABASE OPEN;
SYS@NEWDB> SHOW PARAMETER RECOVERY;

The following query would then verify that FLASHBACK had been turned on:

SYS@NEWDB> SELECT FLASHBACK_ON FROM V$DATABASE;
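Once flashback is enabled, a hedged sketch of how FTQ is typically driven looks like the following. The schema, table name, time window, and placeholder transaction ID are illustrative assumptions and not part of the original example:

SYS@NEWDB> SELECT VERSIONS_XID, VERSIONS_STARTTIME, VERSIONS_OPERATION, T.*
             FROM HR.EMPLOYEES
             VERSIONS BETWEEN TIMESTAMP SYSTIMESTAMP - INTERVAL '1' HOUR AND SYSTIMESTAMP T;
SYS@NEWDB> SELECT OPERATION, TABLE_NAME, UNDO_SQL
             FROM FLASHBACK_TRANSACTION_QUERY
            WHERE XID = HEXTORAW('placeholder_xid');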


Azure Feature Pack

Packt
05 Jul 2017
9 min read
In this article by Christian Cote, Matija Lah, and Dejan Sarka, the authors of the book SQL Server 2016 Integration Services Cookbook, we will see how to install the Azure Feature Pack, which in turn installs the Azure control flow tasks and data flow components. We will also see how to use the Fuzzy Lookup transformation for identity mapping.

(For more resources related to this topic, see here.)

In the early years of SQL Server, Microsoft introduced a tool to help developers and database administrators (DBAs) interact with data: Data Transformation Services (DTS). The tool was very primitive compared to SSIS, and it relied mostly on ActiveX and T-SQL to transform the data. SSIS 1.0 appeared in 2005. The tool was a game changer in the ETL world at the time: a professional and (pretty much) reliable tool for 2005. The 2008/2008 R2 versions were much the same as 2005, in the sense that they didn't add much functionality, but they made the tool more scalable. In 2012, Microsoft enhanced SSIS in many ways. They rewrote the package XML to ease source control integration and make package code easier to read. They also greatly enhanced the way packages are deployed by using an SSIS catalog in SQL Server. Having the catalog in SQL Server gives us execution reports and many views that give us access to metadata or meta-process information in our projects. Version 2014 didn't add anything for SSIS. Version 2016 brings another set of features, as you will see. We now also have the possibility to integrate with big data.

Business intelligence projects often reveal previously unseen issues with the quality of the source data. Dealing with data quality includes data quality assessment, or data profiling, data cleansing, and maintaining high quality over time. In SSIS, the Data Profiling task helps you find unclean data. The Data Profiling task is not like the other tasks in SSIS, because it is not intended to be run over and over again through a scheduled operation. Think of SSIS as being the wrapper for this tool. You use the SSIS framework to configure and run the Data Profiling task, and then you observe the results through the separate Data Profile Viewer. The output of the Data Profiling task will be used to help you in your development and design of the ETL and dimensional structures in your solution. Periodically, you may want to rerun the Data Profiling task to see how the data has changed, but the package you develop will not include the task in the overall recurring ETL process.

Azure tasks and transforms

The Azure ecosystem is becoming predominant in the Microsoft ecosystem, and SSIS has not been left behind over the past few years. The Azure Feature Pack is not an SSIS 2016-specific feature; it's also available for SSIS versions 2012 and 2014. It's worth mentioning that it appeared in July 2015, a few months before the SSIS 2016 release.

Getting ready

This section assumes that you have installed SQL Server Data Tools 2015.

How to do it...

We'll start SQL Server Data Tools and open the CustomLogging project, if not already done. In the SSIS toolbox, scroll to the Azure group. Since the Azure tools are not installed with SSDT, the Azure group is disabled in the toolbox. The tools must be downloaded using a separate installer. Click on the Azure group to expand it and click on Download Azure Feature Pack as shown in the following screenshot. Your default browser opens and the Microsoft SQL Server 2016 Integration Services Feature Pack for Azure page opens.
Click on Download as shown in the following screenshot. From the popup that appears, select both the 32-bit and 64-bit versions. The 32-bit version is necessary for SSIS package development, since SSDT is a 32-bit program. Click Next as shown in the following screenshot. As shown in the following screenshot, the files are downloaded. Once the download completes, run one of the installers you downloaded. The following screen appears; in this case, the 32-bit (x86) version is being installed. Click Next to start the installation process. As shown in the following screenshot, check the box near I accept the terms in the License Agreement and click Next. The installation then starts. The following screen appears once the installation is completed; click Finish to close the screen. Install the other feature pack you downloaded. If SSDT is open, close it. Start SSDT again and open the CustomLogging project. In the Azure group in the SSIS toolbox, you should now see the Azure tasks as in the following screenshot.

Using SSIS fuzzy components

SSIS includes two really sophisticated matching transformations in the data flow. The Fuzzy Lookup transformation is used for mapping identities. The Fuzzy Grouping transformation is used for de-duplicating. Both of them use the same algorithm for comparing strings and other data. Identity mapping and de-duplication are actually the same problem. For example, instead of mapping the identities of entities in two tables, you can union all of the data in a single table and then do the de-duplication. Or vice versa, you can join a table to itself and then do identity mapping instead of de-duplication.

Getting ready

This recipe assumes that you have successfully finished the previous recipe.

How to do it…

In SSMS, create a new table in the DQS_STAGING_DATA database in the dbo schema and name it dbo.FuzzyMatchingResults. Use the following code:

CREATE TABLE dbo.FuzzyMatchingResults
(
  CustomerKey INT NOT NULL PRIMARY KEY,
  FullName NVARCHAR(200) NULL,
  StreetAddress NVARCHAR(200) NULL,
  Updated INT NULL,
  CleanCustomerKey INT NULL
);

Switch to SSDT and continue editing the DataMatching package. Add a Fuzzy Lookup transformation below the NoMatch Multicast transformation. Rename it FuzzyMatches and connect it to the NoMatch Multicast transformation with the regular data flow path. Double-click the transformation to open its editor. On the Reference Table tab, select the connection manager you want to use to connect to your DQS_STAGING_DATA database and select the dbo.CustomersClean table. Do not store a new index or use an existing index. When the package executes the transformation for the first time, it copies the reference table, adds a key with an integer datatype to the new table, and builds an index on the key column. Next, the transformation builds an index, called a match index, on the copy of the reference table. The match index stores the results of tokenizing the values in the transformation input columns. The transformation then uses these tokens in the lookup operation. The match index is a table in a SQL Server database. When the package runs again, the transformation can either use an existing match index or create a new index. If the reference table is static, the package can avoid the potentially expensive process of rebuilding the index for repeat sessions of data cleansing. Click the Columns tab. Delete the mapping between the two CustomerKey columns. Clear the check box next to the CleanCustomerKey input column.
Select the check box next to the CustomerKey lookup column. Rename the output alias for this column to CleanCustomerKey. You are replacing the original column with the one retrieved during the lookup. Your mappings should resemble those shown in the following screenshot. Click the Advanced tab. Raise the Similarity threshold to 0.50 to reduce the matching search space; with a similarity threshold of 0.00, you would get a full cross join. Click OK.

Drag a Union All transformation below the Fuzzy Lookup transformation. Connect it to an output of the Match Multicast transformation and an output of the FuzzyMatches Fuzzy Lookup transformation. You will combine the exact and approximate matches in a single row set. Drag an OLE DB Destination below the Union All transformation. Rename it FuzzyMatchingResults and connect it with the Union All transformation. Double-click it to open the editor. Connect to your DQS_STAGING_DATA database and select the dbo.FuzzyMatchingResults table. Click the Mappings tab. Click OK. The completed data flow is shown in the following screenshot.

You need to add restartability to your package; you will truncate all destination tables. Click the Control Flow tab. Drag an Execute T-SQL Statement task above the data flow task. Connect the tasks with the green precedence constraint from the Execute T-SQL Statement task to the data flow task. The Execute T-SQL Statement task must finish successfully before the data flow task starts. Double-click the Execute T-SQL Statement task. Use the connection manager to your DQS_STAGING_DATA database. Enter the following code in the T-SQL statement textbox, and then click OK:

TRUNCATE TABLE dbo.CustomersDirtyMatch;
TRUNCATE TABLE dbo.CustomersDirtyNoMatch;
TRUNCATE TABLE dbo.FuzzyMatchingResults;

Save the solution. Execute your package in debug mode to test it. Review the results of the Fuzzy Lookup transformation in SSMS. Look for rows for which the transformation did not find a match, and for any incorrect matches. Use the following code:

-- Not matched
SELECT * FROM FuzzyMatchingResults
WHERE CleanCustomerKey IS NULL;
-- Incorrect matches
SELECT * FROM FuzzyMatchingResults
WHERE CleanCustomerKey <> CustomerKey * (-1);

You can use the following code to clean up the AdventureWorksDW2014 and DQS_STAGING_DATA databases:

USE AdventureWorksDW2014;
DROP TABLE IF EXISTS dbo.Chapter05Profiling;
DROP TABLE IF EXISTS dbo.AWCitiesStatesCountries;
USE DQS_STAGING_DATA;
DROP TABLE IF EXISTS dbo.CustomersCh05;
DROP TABLE IF EXISTS dbo.CustomersCh05DQS;
DROP TABLE IF EXISTS dbo.CustomersClean;
DROP TABLE IF EXISTS dbo.CustomersDirty;
DROP TABLE IF EXISTS dbo.CustomersDirtyMatch;
DROP TABLE IF EXISTS dbo.CustomersDirtyNoMatch;
DROP TABLE IF EXISTS dbo.CustomersDQSMatch;
DROP TABLE IF EXISTS dbo.DQSMatchingResults;
DROP TABLE IF EXISTS dbo.DQSSurvivorshipResults;
DROP TABLE IF EXISTS dbo.FuzzyMatchingResults;

When you are done, close SSMS and SSDT.

SQL Server Data Quality Services (DQS) is a knowledge-driven data quality solution. This means that it requires you to maintain one or more knowledge bases (KBs). In a KB, you maintain all knowledge related to a specific portion of data, for example, customer data. The idea of Data Quality Services is to mitigate the cleansing process. While the amount of time you need to spend on cleansing decreases, you will achieve higher and higher levels of data quality. While cleansing, you learn what types of errors to expect, discover error patterns, find domains of correct values, and so on.
You don't throw away this knowledge. You store it and use it to find and correct the same issues automatically during your next cleansing process.

Summary

We have seen how to install the Azure Feature Pack with its Azure control flow tasks and data flow components, and how to use the Fuzzy Lookup transformation.

Resources for Article:

Further resources on this subject:
- Building A Recommendation System with Azure [article]
- Introduction to Microsoft Azure Cloud Services [article]
- Windows Azure Service Bus: Key Features [article]


Signal Processing Techniques

Packt
12 Jun 2014
6 min read
(For more resources related to this topic, see here.)

Introducing the Sunspot data

Sunspots are dark spots visible on the Sun's surface. This phenomenon has been studied for many centuries by astronomers. Evidence has been found for periodic sunspot cycles. We can download up-to-date annual sunspot data from http://www.quandl.com/SIDC/SUNSPOTS_A-Sunspot-Numbers-Annual. This is provided by the Belgian Solar Influences Data Analysis Center. The data goes back to 1700 and contains more than 300 annual averages. In order to determine sunspot cycles, scientists successfully used the Hilbert-Huang transform (refer to http://en.wikipedia.org/wiki/Hilbert%E2%80%93Huang_transform). A major part of this transform is the so-called Empirical Mode Decomposition (EMD) method. The entire algorithm contains many iterative steps, and we will cover only some of them here. EMD reduces data to a group of Intrinsic Mode Functions (IMF). You can compare this to the way the Fast Fourier Transform decomposes a signal into a superposition of sine and cosine terms.

Extracting IMFs is done via a sifting process. The sifting of a signal is related to separating out the components of a signal one at a time. The first step of this process is identifying local extrema. We will perform the first step and plot the data with the extrema we found. Let's download the data in CSV format. We also need to reverse the array to have it in the correct chronological order. The following code snippet finds the indices of the local minima and maxima, respectively:

mins = signal.argrelmin(data)[0]
maxs = signal.argrelmax(data)[0]

Now we need to concatenate these arrays and use the indices to select the corresponding values. The following code accomplishes that and also plots the data:

import numpy as np
import sys
import matplotlib.pyplot as plt
from scipy import signal

data = np.loadtxt(sys.argv[1], delimiter=',', usecols=(1,), unpack=True, skiprows=1)
# reverse order
data = data[::-1]
mins = signal.argrelmin(data)[0]
maxs = signal.argrelmax(data)[0]
extrema = np.concatenate((mins, maxs))
year_range = np.arange(1700, 1700 + len(data))
plt.plot(1700 + extrema, data[extrema], 'go')
plt.plot(year_range, data)
plt.show()

We will see the following chart:

In this plot, you can see the extrema indicated with dots.

Sifting continued

The next steps in the sifting process require us to interpolate the minima and the maxima with a cubic spline. This creates an upper envelope and a lower envelope, which should surround the data. The mean of the envelopes is needed for the next iteration of the EMD process. We can interpolate the minima with the following code snippet:

spl_min = interpolate.interp1d(mins, data[mins], kind='cubic')
min_rng = np.arange(mins.min(), mins.max())
l_env = spl_min(min_rng)

Similar code can be used to interpolate the maxima. We need to be aware that the interpolation results are only valid within the range over which we are interpolating. This range is defined by the first occurrence of a minimum/maximum and ends at the last occurrence of a minimum/maximum. Unfortunately, the interpolation ranges we can define in this way for the maxima and minima do not match perfectly. So, for the purpose of plotting, we need to extract a shorter range that lies within both the maxima and minima interpolation ranges.
Have a look at the following code:

import numpy as np
import sys
import matplotlib.pyplot as plt
from scipy import signal
from scipy import interpolate

data = np.loadtxt(sys.argv[1], delimiter=',', usecols=(1,), unpack=True, skiprows=1)
# reverse order
data = data[::-1]
mins = signal.argrelmin(data)[0]
maxs = signal.argrelmax(data)[0]
extrema = np.concatenate((mins, maxs))
year_range = np.arange(1700, 1700 + len(data))
spl_min = interpolate.interp1d(mins, data[mins], kind='cubic')
min_rng = np.arange(mins.min(), mins.max())
l_env = spl_min(min_rng)
spl_max = interpolate.interp1d(maxs, data[maxs], kind='cubic')
max_rng = np.arange(maxs.min(), maxs.max())
u_env = spl_max(max_rng)
inclusive_rng = np.arange(max(min_rng[0], max_rng[0]), min(min_rng[-1], max_rng[-1]))
mid = (spl_max(inclusive_rng) + spl_min(inclusive_rng))/2
plt.plot(year_range, data)
plt.plot(1700 + min_rng, l_env, '-x')
plt.plot(1700 + max_rng, u_env, '-x')
plt.plot(1700 + inclusive_rng, mid, '--')
plt.show()

The code produces the following chart:

What you see is the observed data, with the computed envelopes and mid line. Obviously, negative values don't make any sense in this context. However, for the algorithm we only need to care about the mid line of the upper and lower envelopes. In these first two sections, we basically performed the first iteration of the EMD process. The algorithm is a bit more involved, so we will leave it up to you whether or not you want to continue with this analysis on your own.

Moving averages

Moving averages are tools commonly used to analyze time-series data. A moving average defines a window of previously seen data that is averaged each time the window slides forward by one period. The different types of moving average differ essentially in the weights used for averaging. The exponential moving average, for instance, has exponentially decreasing weights with time. This means that older values have less influence than newer values, which is sometimes desirable. We can express these exponentially decreasing weights as follows in NumPy code:

weights = np.exp(np.linspace(-1., 0., N))
weights /= weights.sum()

A simple moving average uses equal weights, which, in code, looks as follows:

def sma(arr, n):
    weights = np.ones(n) / n
    return np.convolve(weights, arr)[n-1:-n+1]

The following code plots the simple moving average for the 11- and 22-year sunspot cycles:

import numpy as np
import sys
import matplotlib.pyplot as plt

data = np.loadtxt(sys.argv[1], delimiter=',', usecols=(1,), unpack=True, skiprows=1)
# reverse order
data = data[::-1]
year_range = np.arange(1700, 1700 + len(data))

def sma(arr, n):
    weights = np.ones(n) / n
    return np.convolve(weights, arr)[n-1:-n+1]

sma11 = sma(data, 11)
sma22 = sma(data, 22)
plt.plot(year_range, data, label='Data')
plt.plot(year_range[10:], sma11, '-x', label='SMA 11')
plt.plot(year_range[21:], sma22, '--', label='SMA 22')
plt.legend()
plt.show()

In the following plot, we see the original data and the simple moving averages for the 11- and 22-year periods. As you can see, moving averages are not a good fit for this data; this is generally the case for sinusoidal data.
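As a complement to the sma function above, the exponential weights shown earlier can be turned into a full exponential moving average. The following is one possible sketch, not code from the original article; it builds the windows explicitly rather than using np.convolve, so that it is clear the newest sample in each window receives the largest weight:

import numpy as np

def ema(arr, n):
    # exponentially decreasing weights; weights[-1] == 1 is the largest
    weights = np.exp(np.linspace(-1., 0., n))
    weights /= weights.sum()
    # arr[i:i+n] is ordered oldest to newest, so the largest weight
    # multiplies the newest value in each window
    return np.array([np.dot(weights, arr[i:i + n]) for i in range(len(arr) - n + 1)])

# usage, following the plotting pattern above:
# ema11 = ema(data, 11)
# plt.plot(year_range[10:], ema11, label='EMA 11')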
Summary

This article gave us examples of signal processing and time-series analysis. We looked at the sifting process, which performs the first iteration of the EMD procedure. We also learned about moving averages, which are tools commonly used to analyze time-series data.

Resources for Article:

Further resources on this subject:
- Advanced Indexing and Array Concepts [Article]
- Fast Array Operations with NumPy [Article]
- Move Further with NumPy Modules [Article]


Using R for Statistics, Research, and Graphics

Packt
16 Sep 2014
12 min read
In this article by David Alexander Lillis, author of R Graph Essentials, we will talk about R. Developed by Professor Ross Ihaka and Dr. Robert Gentleman at Auckland University (New Zealand) during the early 1990s, the R statistics environment is a real success story. R is open source software, which you can download in a couple of minutes from the Comprehensive R Archive Network (CRAN) website (http://cran.r-project.org/), and it combines a powerful programming language, outstanding graphics, and a comprehensive range of useful statistical functions. If you need a statistics environment that includes a programming language, R is ideal. It's true that the learning curve is longer than for spreadsheet-based packages, but once you master the R programming syntax, you can develop your own very powerful analytic tools. Many contributed packages are available on the web for use with R, and very often the analytic tools you need can be downloaded at no cost.

(For more resources related to this topic, see here.)

The main problem for those new to R is the time required to master the programming language, but several nice graphical user interfaces, such as John Fox's R Commander package, are available, which make it much easier for the newcomer to develop proficiency in R than it used to be. For many statisticians and researchers, R is the package of choice because of its powerful programming language, the easy availability of code, and because it can import Excel spreadsheets, comma separated variable (.csv) spreadsheets, and text files, as well as SPSS files, STATA files, and files produced within other statistical packages. You may be looking for a tool for your own data analysis. If so, let's take a brief look at what R can do for you.

Some basic R syntax

Data can be created in R or else read in from .csv or other files as objects. For example, you can read in the data contained within a .csv file called mydata.csv as follows:

A <- read.csv("mydata.csv", h=T)
A

The object A now contains all the data of the original file. The syntax A[3,7] picks out the element in row 3 and column 7. The syntax A[14, ] selects the fourteenth row and A[,6] selects the sixth column. The functions mean(A) and sd(A) find the mean and standard deviation of each column. The syntax B <- 3*A + 7 triples each element of A, adds 7 to each element, and stores the new array as the object B. Now you could save this array as a .csv file called Outputfile.csv as follows:

write.csv(B, file="Outputfile.csv")

Statistical modeling

R provides a comprehensive range of basic statistical functions relating to the commonly-used distributions (the normal distribution, t-distribution, Poisson, gamma, and so on), and many less well known distributions. It also provides a range of non-parametric tests that are appropriate when your data are not distributed normally. Linear and non-linear regressions are easy to perform, and finding the optimum model (that is, by eliminating non-significant independent variables and non-significant factor interactions) is particularly easy. Implementing Generalized Linear Models and other commonly-used models such as Analysis of Variance, Multivariate Analysis of Variance, and Analysis of Covariance is also straightforward and, once you know the syntax, you may find that such tasks can be done more quickly in R than in other packages.
The usual post-hoc tests for identifying factor levels that are significantly different from the other levels (for example, the Tukey and Scheffé tests) are available, and testing for interactions between factors is easy. Factor Analysis, and the related Principal Components Analysis, are well known data reduction techniques that enable you to explain your data in terms of smaller sets of independent variables (or factors). Both methods are available in R, and code for complex designs, including One and Two Way Repeated Measures, and Four Way ANOVA (for example, two repeated measures and two between-subjects), can be written relatively easily or downloaded from various websites (for example, http://www.personality-project.org/r/). Other analytic tools include Cluster Analysis, Discriminant Analysis, Multidimensional Scaling, and Correspondence Analysis. R also provides various methods for fitting analytic models to data and smoothing (for example, lowess and spline-based methods).

Miscellaneous packages for specialist methods

You can find some very useful packages of R code for fields as diverse as biometry, epidemiology, astrophysics, econometrics, financial and actuarial modeling, the social sciences, and psychology. For example, if you are interested in astrophysics, the Penn State Astrophysics School offers a nice website that includes both tutorials and code (http://www.iiap.res.in/astrostat/RTutorials.html). Here I'll mention just a few of the popular techniques.

Monte Carlo methods

A number of sources give excellent accounts of how to perform Monte Carlo simulations in R (that is, drawing samples from multidimensional distributions and estimating expected values). A valuable text is Christian Robert's book Introducing Monte Carlo Methods with R. Murali Haran gives another interesting astrophysical example on the CAStR website (http://www.stat.psu.edu/~mharan/MCMCtut/MCMC.html).

Structural Equation Modeling

Structural Equation Modeling (SEM) is becoming increasingly popular in the social sciences and economics as an alternative to other modeling techniques such as multiple regression, factor analysis, and analysis of covariance. Essentially, SEM is a kind of multiple regression that takes account of factor interactions, nonlinearities, measurement error, multiple latent independent variables, and latent dependent variables. Useful references for conducting SEM in R include those of Revelle, Farnsworth (2008), and Fox (2002 and 2006).

Data mining

A number of very useful resources are available for anyone contemplating data mining using R. For example, Luis Torgo has just published a book on data mining using R, and presents case studies, along with the datasets and code, which the interested student can work through. Torgo's book provides the usual analytic and graphical techniques used every day by data miners, including visualization techniques, dealing with missing values, developing prediction models, and methods for evaluating the performance of your models. Also of interest to the data miner is the Rattle GUI (R Analytical Tool to Learn Easily). Rattle is a data mining facility for analyzing very large data sets. It provides many useful statistical and graphical data summaries, presents mechanisms for developing a variety of models, and summarizes the performance of your models.

Graphics in R

Quite simply, the quality and range of graphics available through R is superb and, in my view, vastly superior to those of any other package I have encountered.
Of course, you have to write the necessary code, but once you have mastered this skill, you have access to wonderful graphics. You can write your own code from scratch, but many websites provide helpful examples, complete with code, which you can download and modify to suit your own needs. R's base graphics (graphics created without the use of any additional contributed packages) are superb, but various graphics packages such as ggplot2 (and the associated qplot function) help you to create wonderful graphs. R's graphics capabilities include, but are not limited to, the following:

Base graphics in R
- Basic graphics techniques and syntax
- Creating scatterplots and line plots
- Customizing axes, colors, and symbols
- Adding text – legends, titles, and axis labels
- Adding lines – interpolation lines, regression lines, and curves
- Increasing complexity – graphing three variables, multiple plots, or multiple axes
- Saving your plots to multiple formats – PDF, postscript, and JPG
- Including mathematical expressions on your plots
- Making graphs clear and pretty – including a grid, point labels, and shading
- Shading and coloring your plot
- Creating bar charts, histograms, boxplots, pie charts, and dotcharts
- Adding loess smoothers
- Scatterplot matrices
- R's color palettes
- Adding error bars

Creating graphs using qplot
- Using basic qplot graphics techniques and syntax to customize in easy steps
- Creating scatterplots and line plots in qplot
- Mapping symbol size, symbol type, and symbol color to categorical data
- Including regressions and confidence intervals on your graphs
- Shading and coloring your graph
- Creating bar charts, histograms, boxplots, pie charts, and dotcharts
- Labelling points on your graph

Creating graphs using ggplot
- Plotting options – backgrounds, sizes, transparencies, and colors
- Superimposing points
- Controlling symbol shapes and using pretty color schemes
- Stacked, clustered, and paneled bar charts
- Methods for detailed customization of lines, point labels, smoothers, confidence bands, and error bars

The following graph records information on the heights in centimeters and weights in kilograms of patients in a medical study. The curve in red gives a smoothed version of the data, created using locally weighted scatterplot smoothing. Both the graph, and the modelling required to produce the smoothed curve, were performed in R.

Here is another graph. It gives the heights and body masses of female patients receiving treatment in a hospital. Each patient is identified by name. This graph was created very easily using ggplot, and shows the default background produced by ggplot (a grey plotting background and white grid lines).

Next, we see a histogram of patients' heights and body masses, partitioned by gender. The bars are given in orange and ivory. The ggplot package provides a wide range of colors and hues, as well as a wide range of color palettes.

Finally, we see a line graph of height against age for a group of four children. The graph includes both points and lines, and we have a unique color for each child. The ggplot package makes it possible to create attractive and effective graphs for research and data analysis.
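To give a flavour of the ggplot2 syntax behind graphs like these, here is a small hedged sketch. The data frame and variable names are invented purely for illustration; they are not the datasets used to produce the figures described above:

library(ggplot2)

# hypothetical patient data
patients <- data.frame(
  height = c(162, 175, 158, 181, 169, 172, 166, 178),
  weight = c(65, 82, 58, 90, 74, 79, 70, 85),
  gender = c("F", "M", "F", "M", "F", "M", "F", "M")
)

# scatterplot coloured by gender, with a fitted regression line
# and its confidence band
ggplot(patients, aes(x = height, y = weight)) +
  geom_point(aes(colour = gender), size = 3) +
  geom_smooth(method = "lm", se = TRUE) +
  labs(x = "Height (cm)", y = "Weight (kg)")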
Some level of mastery of R has become, for many applications, essential for taking advantage of these developments. Spatial analysis, where R provides integrated access to capabilities that are spread across many different computer programs, is a good example. A few years ago, I would not have recommended R as a statistics environment for generalist data analysts or postgraduate students, except those working directly in areas involving statistical modeling. However, many tutorials are downloadable from the Internet and a number of organizations provide online tutorials and/or face-to-face workshops (for example, The Analysis Factor, http://www.theanalysisfactor.com/). In addition, the appearance of GUIs, such as R Commander and the new iNZight GUI (designed for use in schools), makes it easier for non-specialists to learn and use R effectively. I am most happy to provide advice to anyone contemplating learning to use this outstanding statistical and research tool.

References
Some useful material on R is as follows:

L'analyse des données. Tome 1: La taxinomie, Tome 2: L'analyse des correspondances, Benzécri, J. P., Dunod, Paris, 1973.
Computation of Correspondence Analysis, Blasius, J. and Greenacre, M. J. (1994). In M. J. Greenacre and J. Blasius (eds.), Correspondence Analysis in the Social Sciences, pp. 53-75, Academic Press, London.
Statistics: An Introduction using R, Crawley, M. J. (m.crawley@imperial.ac.uk), Imperial College, Silwood Park, Ascot, Berks. Published in 2005 by John Wiley & Sons, Ltd. http://eu.wiley.com/WileyCDA/WileyTitle/productCd-0470022973,subjectCd-ST05.html (ISBN 0-470-02297-3). http://www3.imperial.ac.uk/naturalsciences/research/statisticsusingr.
Structural Equation Models: Appendix to An R and S-PLUS Companion to Applied Regression, Fox, John, http://cran.r-project.org/doc/contrib/Fox-Companion/appendix-sems.pdf.
Getting Started with the R Commander, Fox, John, 26 August 2006.
The R Commander: A Basic-Statistics Graphical User Interface to R, Fox, John, Journal of Statistical Software, September 2005, Volume 14, Issue 9. http://www.jstatsoft.org/.
Structural Equation Modeling With the sem Package in R, Fox, John, Structural Equation Modeling, 13(3), 465-486, Lawrence Erlbaum Associates, Inc., 2006.
Biplots in Biomedical Research, Gabriel, K. R. and Odoroff, C., Statistics in Medicine, 9, 469-485, 1990.
Theory and Applications of Correspondence Analysis, Greenacre, M. J., Academic Press, London, 1984.
Using R for Data Analysis and Graphics: Introduction, Code and Commentary, Maindonald, J. H., Centre for Mathematics and its Applications, Australian National University.
Introducing Monte Carlo Methods with R, Series Use R, Robert, Christian P. and Casella, George, 2010, XX, 284 p., Softcover, ISBN 978-1-4419-1575-7.

Useful tutorials available on the web are as follows:

An Introduction to R: examples for Actuaries, De Silva, N., 2006, http://toolkit.pbworks.com/f/R%20Examples%20for%20Actuaries%20v0.1-1.pdf.
Econometrics in R, Farnsworth, Grant V., October 26, 2008, http://cran.r-project.org/doc/contrib/Farnsworth-EconometricsInR.pdf.
An Introduction to the R Language, Harte, David, Statistics Research Associates Limited, www.statsresearch.co.nz.
Quick R, Kabakoff, Rob, http://www.statmethods.net/index.html.
R for SAS and SPSS Users, Muenchen, Bob, http://RforSASandSPSSusers.com.
Statistical Analysis with R - a quick start, Nenadić, C. and Zucchini, Walter.
R for Beginners, Paradis, Emmanuel (paradis@isem.univ-montp2.fr), Institut des Sciences de l'Évolution, Université Montpellier II, F-34095 Montpellier cedex 05, France.
Data Mining with R: learning by case studies, Torgo, Luis, http://www.liaad.up.pt/~ltorgo/DataMiningWithR/.
SimpleR - Using R for Introductory Statistics, Verzani, John, http://cran.r-project.org/doc/contrib/Verzani-SimpleR.pdf.
Time Series Analysis and Its Applications: With R Examples, http://www.stat.pitt.edu/stoffer/tsa2/textRcode.htm#ch2.
The irises of the Gaspé peninsula, E. Anderson, Bulletin of the American Iris Society, 59, 2-5, 1935.

Social Media Insight Using Naive Bayes

Packt
22 Feb 2016
48 min read
Text-based datasets contain a lot of information, whether they are books, historical documents, social media, e-mail, or any of the other ways we communicate via writing. Extracting features from text-based datasets and using them for classification is a difficult problem. There are, however, some common patterns for text mining.

We look at disambiguating terms in social media using the Naive Bayes algorithm, which is a powerful and surprisingly simple algorithm. Naive Bayes takes a few shortcuts to properly compute the probabilities for classification, hence the term naive in the name. It can also be extended to other types of datasets quite easily and doesn't rely on numerical features. The model in this article is a baseline for text mining studies, as the process can work reasonably well for a variety of datasets. We will cover the following topics in this article:

- Downloading data from social network APIs
- Transformers for text
- Naive Bayes classifier
- Using JSON for saving and loading datasets
- The NLTK library for extracting features from text
- The F-measure for evaluation

Disambiguation
Text is often called an unstructured format. There is a lot of information there, but it is just there; no headings, no required format, loose syntax, and other problems prohibit the easy extraction of information from text. The data is also highly connected, with lots of mentions and cross-references—just not in a format that allows us to easily extract it!

We can compare the information stored in a book with that stored in a large database to see the difference. In the book, there are characters, themes, places, and lots of information. However, the book needs to be read and, more importantly, interpreted to gain this information. The database sits on your server with column names and data types. All the information is there and the level of interpretation needed is quite low. Information about the data, such as its type or meaning, is called metadata, and text lacks it. A book also contains some metadata in the form of a table of contents and an index, but the degree is significantly lower than that of a database.

One of the problems is term disambiguation. When a person uses the word bank, is this a financial message or an environmental message (such as river bank)? This type of disambiguation is quite easy in many circumstances for humans (although there are still troubles), but much harder for computers to do.

In this article, we will look at disambiguating the use of the term Python on Twitter's stream. A message on Twitter is called a tweet and is limited to 140 characters. This means there is little room for context. There isn't much metadata available, although hashtags are often used to denote the topic of the tweet. When people talk about Python, they could be talking about the following things:

- The programming language Python
- Monty Python, the classic comedy group
- The snake Python
- A make of shoe called Python

There can be many other things called Python. The aim of our experiment is to take a tweet mentioning Python and determine whether it is talking about the programming language, based only on the content of the tweet.

Downloading data from a social network
We are going to download a corpus of data from Twitter and use it to sort out spam from useful content. Twitter provides a robust API for collecting information from its servers and this API is free for small-scale usage.
It is, however, subject to some conditions that you'll need to be aware of if you start using Twitter's data in a commercial setting.

First, you'll need to sign up for a Twitter account (which is free). Go to http://twitter.com and register an account if you do not already have one. Next, you'll need to ensure that you don't make too many requests in a given time window. This limit is currently 180 requests per hour. It can be tricky ensuring that you don't breach this limit, so it is highly recommended that you use a library to talk to Twitter's API.

You will need a key to access Twitter's data. Go to http://twitter.com and sign in to your account. When you are logged in, go to https://apps.twitter.com/ and click on Create New App. Create a name and description for your app, along with a website address. If you don't have a website to use, insert a placeholder. Leave the Callback URL field blank for this app—we won't need it. Agree to the terms of use (if you do) and click on Create your Twitter application. Keep the resulting website open—you'll need the access keys that are on this page.

Next, we need a library to talk to Twitter. There are many options; the one I like is simply called twitter, and is the official Twitter Python library. You can install twitter using pip3 install twitter if you are using pip to install your packages. If you are using another system, check the documentation at https://github.com/sixohsix/twitter.

Create a new IPython Notebook to download the data. We will create several notebooks in this article for various purposes, so it might be a good idea to also create a folder to keep track of them. This first notebook, ch6_get_twitter, is specifically for downloading new Twitter data.

First, we import the twitter library and set our authorization tokens. The consumer key and consumer secret will be available on the Keys and Access Tokens tab on your Twitter app's page. To get the access tokens, you'll need to click on the Create my access token button, which is on the same page. Enter the keys into the appropriate places in the following code:

import twitter
consumer_key = "<Your Consumer Key Here>"
consumer_secret = "<Your Consumer Secret Here>"
access_token = "<Your Access Token Here>"
access_token_secret = "<Your Access Token Secret Here>"
authorization = twitter.OAuth(access_token, access_token_secret, consumer_key, consumer_secret)

We are going to get our tweets from Twitter's search function. We will create a reader that connects to twitter using our authorization, and then use that reader to perform searches. In the Notebook, we set the filename where the tweets will be stored:

import os
output_filename = os.path.join(os.path.expanduser("~"), "Data", "twitter", "python_tweets.json")

We also need the json library for saving our tweets:

import json

Next, create an object that can read from Twitter. We create this object with our authorization object that we set up earlier:

t = twitter.Twitter(auth=authorization)

We then open our output file for writing. We open it for appending—this allows us to rerun the script to obtain more tweets. We then use our Twitter connection to perform a search for the word Python. We only want the statuses that are returned for our dataset. This code takes the tweet, uses the json library to create a string representation using the dumps function, and then writes it to the file.
It then creates a blank line under the tweet so that we can easily distinguish where one tweet starts and ends in our file:

with open(output_filename, 'a') as output_file:
    search_results = t.search.tweets(q="python", count=100)['statuses']
    for tweet in search_results:
        if 'text' in tweet:
            output_file.write(json.dumps(tweet))
            output_file.write("\n\n")

In the preceding loop, we also perform a check to see whether there is text in the tweet or not. Not all of the objects returned by twitter will be actual tweets (some will be actions to delete tweets and others). The key difference is the inclusion of text as a key, which we test for.

Running this for a few minutes will result in 100 tweets being added to the output file. You can keep rerunning this script to add more tweets to your dataset, keeping in mind that you may get some duplicates in the output file if you rerun it too fast (that is, before Twitter gets new tweets to return!).

Loading and classifying the dataset
After we have collected a set of tweets (our dataset), we need labels to perform classification. We are going to label the dataset by setting up a form in an IPython Notebook to allow us to enter the labels.

The dataset we have stored is nearly in a JSON format. JSON is a format for data that doesn't impose much structure and is directly readable in JavaScript (hence the name, JavaScript Object Notation). JSON defines basic objects such as numbers, strings, lists, and dictionaries, making it a good format for storing datasets if they contain data that isn't numerical. If your dataset is fully numerical, you would save space and time using a matrix-based format like in NumPy.

A key difference between our dataset and real JSON is that we included newlines between tweets. The reason for this was to allow us to easily append new tweets (the actual JSON format doesn't allow this easily). Our format is a JSON representation of a tweet, followed by a newline, followed by the next tweet, and so on. To parse it, we can use the json library, but we will first have to split the file by newlines to get the actual tweet objects themselves.

Set up a new IPython Notebook (I called mine ch6_label_twitter) and enter the dataset's filename. This is the same filename in which we saved the data in the previous section. We also define the filename that we will use to save the labels to. The code is as follows:

import os
input_filename = os.path.join(os.path.expanduser("~"), "Data", "twitter", "python_tweets.json")
labels_filename = os.path.join(os.path.expanduser("~"), "Data", "twitter", "python_classes.json")

As stated, we will use the json library, so import that too:

import json

We create a list that will store the tweets we received from the file:

tweets = []

We then iterate over each line in the file. We aren't interested in lines with no information (they separate the tweets for us), so check if the length of the line (minus any whitespace characters) is zero. If it is, ignore it and move to the next line. Otherwise, load the tweet using json.loads (which loads a JSON object from a string) and add it to our list of tweets. The code is as follows:

with open(input_filename) as inf:
    for line in inf:
        if len(line.strip()) == 0:
            continue
        tweets.append(json.loads(line))

We are now interested in classifying whether an item is relevant to us or not (in this case, relevant means refers to the programming language Python).
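Before building the labeling form, a quick optional check (a minimal sketch, assuming the file parsed correctly and contains at least one tweet) confirms what was loaded:

print("Loaded {} tweets".format(len(tweets)))
print(tweets[0]['text'])  # the text of the first tweet we collected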
We will use the IPython Notebook's ability to embed HTML and talk between JavaScript and Python to create a viewer of tweets to allow us to easily and quickly classify the tweets as spam or not. The code will present a new tweet to the user (you) and ask for a label: is it relevant or not? It will then store the input and present the next tweet to be labeled.

First, we create a list for storing the labels. These labels will record whether or not the given tweet refers to the programming language Python, and they will allow our classifier to learn how to differentiate between meanings. We also check if we have any labels already and load them. This helps if you need to close the notebook down midway through labeling. This code will load the labels from where you left off. It is generally a good idea to consider how to save at midpoints for tasks like this. Nothing hurts quite like losing an hour of work because your computer crashed before you saved the labels! The code is as follows:

labels = []
if os.path.exists(labels_filename):
    with open(labels_filename) as inf:
        labels = json.load(inf)

Next, we create a simple function that will return the next tweet that needs to be labeled. We can work out which is the next tweet by finding the first one that hasn't yet been labeled. The code is as follows:

def get_next_tweet():
    return tweets[len(labels)]['text']

The next step in our experiment is to collect information from the user (you!) on which tweets are referring to Python (the programming language) and which are not. As of yet, there is not a good, straightforward way to get interactive feedback with pure Python in IPython Notebooks. For this reason, we will use some JavaScript and HTML to get this input from the user.

Next we create some JavaScript in the IPython Notebook to run our input. Notebooks allow us to use magic functions to embed HTML and JavaScript (among other things) directly into the Notebook itself. Start a new cell with the following line at the top:

%%javascript

The code in here will be in JavaScript, hence the curly braces that are coming up. Don't worry, we will get back to Python soon. Keep in mind here that the following code must be in the same cell as the %%javascript magic function.

The first function we will define in JavaScript shows how easy it is to talk to your Python code from JavaScript in IPython Notebooks. This function, if called, will add a label to the labels array (which is in Python code). To do this, we load the IPython kernel as a JavaScript object and give it a Python command to execute. The code is as follows:

function set_label(label){
    var kernel = IPython.notebook.kernel;
    kernel.execute("labels.append(" + label + ")");
    load_next_tweet();
}

At the end of that function, we call the load_next_tweet function. This function loads the next tweet to be labeled. It runs on the same principle; we load the IPython kernel and give it a command to execute (calling the get_next_tweet function we defined earlier). However, in this case we want to get the result. This is a little more difficult. We need to define a callback, which is a function that is called when the data is returned. The format for defining callbacks is outside the scope of this book. If you are interested in more advanced JavaScript/Python integration, consult the IPython documentation.
The code is as follows:

function load_next_tweet(){
    var code_input = "get_next_tweet()";
    var kernel = IPython.notebook.kernel;
    var callbacks = { 'iopub' : {'output' : handle_output}};
    kernel.execute(code_input, callbacks, {silent:false});
}

The callback function is called handle_output, which we will define now. This function gets called when the Python function that kernel.execute calls returns a value. As before, the full format of this is outside the scope of this book. However, for our purposes the result is returned as data of the type text/plain, which we extract and show in the #tweet_text div of the form we are going to create in the next cell. The code is as follows:

function handle_output(out){
    var res = out.content.data["text/plain"];
    $("div#tweet_text").html(res);
}

Our form will have a div that shows the next tweet to be labeled, which we will give the ID #tweet_text. We also create a textbox to enable us to capture key presses (otherwise, the Notebook will capture them and JavaScript won't do anything). This allows us to use the keyboard to set labels of 1 or 0, which is faster than using the mouse to click buttons—given that we will need to label at least 100 tweets.

Run the previous cell to embed some JavaScript into the page, although nothing will be shown to you in the results section.

We are going to use a different magic function now, %%html. Unsurprisingly, this magic function allows us to directly embed HTML into our Notebook. In a new cell, start with this line:

%%html

For this cell, we will be coding in HTML and a little JavaScript. First, define a div element to store our current tweet to be labeled. I've also added some instructions for using this form. Then, create the #tweet_text div that will store the text of the next tweet to be labeled. As stated before, we need to create a textbox to be able to capture key presses. The code is as follows:

<div name="tweetbox">
Instructions: Click in textbox. Enter a 1 if the tweet is relevant, enter 0 otherwise.<br>
Tweet: <div id="tweet_text" value="text"></div><br>
<input type=text id="capture"></input><br>
</div>

Don't run the cell just yet!

We create the JavaScript for capturing the key presses. This has to be defined after creating the form, as the #tweet_text div doesn't exist until the above code runs. We use the JQuery library (which IPython is already using, so we don't need to include the JavaScript file) to add a function that is called when key presses are made on the #capture textbox we defined. However, keep in mind that this is a %%html cell and not a JavaScript cell, so we need to enclose this JavaScript in the <script> tags.

We are only interested in key presses if the user presses the 0 or the 1, in which case the relevant label is added. We can determine which key was pressed by the ASCII value stored in e.which. If the user presses 0 or 1, we append the label and clear out the textbox. The code is as follows:

<script>
$("input#capture").keypress(function(e) {
    if(e.which == 48) {
        set_label(0);
        $("input#capture").val("");
    } else if (e.which == 49){
        set_label(1);
        $("input#capture").val("");
    }
});

All other key presses are ignored. As a last bit of JavaScript for this article (I promise), we call the load_next_tweet() function. This will set the first tweet to be labeled and then close off the JavaScript. The code is as follows:

load_next_tweet();
</script>

After you run this cell, you will get an HTML textbox, alongside the first tweet's text.
Click in the textbox and enter 1 if the tweet is relevant to our goal (in this case, if it relates to the programming language Python) and 0 if it is not. After you do this, the next tweet will load. Enter the label and the next one will load. This continues until the tweets run out.

When you finish all of this, simply save the labels to the output filename we defined earlier for the class values:

with open(labels_filename, 'w') as outf:
    json.dump(labels, outf)

You can call the preceding code even if you haven't finished. Any labeling you have done to that point will be saved. Running this Notebook again will pick up where you left off and you can keep labeling your tweets.

This might take a while! If you have a lot of tweets in your dataset, you'll need to classify all of them. If you are pushed for time, you can download the same dataset I used, which contains classifications.

Creating a replicable dataset from Twitter
In data mining, there are lots of variables. These aren't just in the data mining algorithms—they also appear in the data collection, environment, and many other factors. Being able to replicate your results is important as it enables you to verify or improve upon your results.

Getting 80 percent accuracy on one dataset with algorithm X, and 90 percent accuracy on another dataset with algorithm Y, doesn't mean that Y is better. We need to be able to test on the same dataset in the same conditions to be able to properly compare.

On running the preceding code, you will get a different dataset to the one I created and used. The main reason is that Twitter will return different search results for you than me, based on the time you performed the search. Even after that, your labeling of tweets might be different from mine. While there are obvious examples where a given tweet relates to the Python programming language, there will always be gray areas where the labeling isn't obvious. One tough gray area I ran into was tweets in non-English languages that I couldn't read. In this specific instance, there are options in Twitter's API for setting the language, but even these aren't going to be perfect.

Due to these factors, it is difficult to replicate experiments on databases that are extracted from social media, and Twitter is no exception. Twitter explicitly disallows sharing datasets directly. One solution to this is to share tweet IDs only, which you can share freely. In this section, we will first create a tweet ID dataset that we can freely share. Then, we will see how to download the original tweets from this file to recreate the original dataset.

First, we save the replicable dataset of tweet IDs. Creating another new IPython Notebook, first set up the filenames. This is done in the same way we did labeling, but there is a new filename where we can store the replicable dataset.
The code is as follows:

import os
input_filename = os.path.join(os.path.expanduser("~"), "Data", "twitter", "python_tweets.json")
labels_filename = os.path.join(os.path.expanduser("~"), "Data", "twitter", "python_classes.json")
replicable_dataset = os.path.join(os.path.expanduser("~"), "Data", "twitter", "replicable_dataset.json")

We load the tweets and labels as we did in the previous notebook:

import json
tweets = []
with open(input_filename) as inf:
    for line in inf:
        if len(line.strip()) == 0:
            continue
        tweets.append(json.loads(line))
if os.path.exists(labels_filename):
    with open(labels_filename) as inf:
        labels = json.load(inf)

Now we create a dataset by looping over both the tweets and labels at the same time and saving those in a list:

dataset = [(tweet['id'], label) for tweet, label in zip(tweets, labels)]

Finally, we save the results in our file:

with open(replicable_dataset, 'w') as outf:
    json.dump(dataset, outf)

Now that we have the tweet IDs and labels saved, we can recreate the original dataset. If you are looking to recreate the dataset I used for this article, it can be found in the code bundle that comes with this book.

Loading the preceding dataset is not difficult, but it can take some time. Start a new IPython Notebook and set the dataset, label, and tweet ID filenames as before. I've adjusted the filenames here to ensure that you don't overwrite your previously collected dataset, but feel free to change these if you want. The code is as follows:

import os
tweet_filename = os.path.join(os.path.expanduser("~"), "Data", "twitter", "replicable_python_tweets.json")
labels_filename = os.path.join(os.path.expanduser("~"), "Data", "twitter", "replicable_python_classes.json")
replicable_dataset = os.path.join(os.path.expanduser("~"), "Data", "twitter", "replicable_dataset.json")

Then load the tweet IDs from the file using JSON:

import json
with open(replicable_dataset) as inf:
    tweet_ids = json.load(inf)

Saving the labels is very easy. We just iterate through this dataset and extract the IDs. We could do this quite easily with just two lines of code (open the file and save the tweets). However, we can't guarantee that we will get all the tweets we are after (for example, some may have been changed to private since collecting the dataset) and therefore the labels would be incorrectly indexed against the data.

As an example, I tried to recreate the dataset just one day after collecting it and already two of the tweets were missing (they might have been deleted or made private by the user). For this reason, it is important to only print out the labels that we need. To do this, we first create an empty actual_labels list to store the labels for tweets that we actually recover from Twitter, and then create a dictionary mapping the tweet IDs to the labels. The code is as follows:

actual_labels = []
label_mapping = dict(tweet_ids)

Next, we are going to set up a connection to Twitter to collect all of these tweets. This is going to take a little longer.
Import the twitter library that we used before, create an authorization token, and use that to create the twitter object:

import twitter
consumer_key = "<Your Consumer Key Here>"
consumer_secret = "<Your Consumer Secret Here>"
access_token = "<Your Access Token Here>"
access_token_secret = "<Your Access Token Secret Here>"
authorization = twitter.OAuth(access_token, access_token_secret, consumer_key, consumer_secret)
t = twitter.Twitter(auth=authorization)

Iterate over each of the twitter IDs by extracting the IDs into a list using the following command:

all_ids = [tweet_id for tweet_id, label in tweet_ids]

Then, we open our output file to save the tweets:

with open(tweet_filename, 'a') as output_file:

The Twitter API allows us to get 100 tweets at a time. Therefore, we iterate over each batch of 100 tweets:

    for start_index in range(0, len(tweet_ids), 100):

To search by ID, we first create a string that joins all of the IDs (in this batch) together:

        id_string = ",".join(str(i) for i in all_ids[start_index:start_index+100])

Next, we perform a statuses/lookup API call, which is defined by Twitter. We pass our list of IDs (which we turned into a string) into the API call in order to have those tweets returned to us:

        search_results = t.statuses.lookup(_id=id_string)

Then for each tweet in the search results, we save it to our file in the same way we did when we were collecting the dataset originally:

        for tweet in search_results:
            if 'text' in tweet:
                output_file.write(json.dumps(tweet))
                output_file.write("\n\n")

As a final step here (and still under the preceding if block), we want to store the labeling of this tweet. We can do this using the label_mapping dictionary we created before, looking up the tweet ID. The code is as follows:

                actual_labels.append(label_mapping[tweet['id']])

Run the previous cell and the code will collect all of the tweets for you. If you created a really big dataset, this may take a while—Twitter does rate-limit requests. As a final step here, save the actual_labels to our classes file:

with open(labels_filename, 'w') as outf:
    json.dump(actual_labels, outf)

Text transformers
Now that we have our dataset, how are we going to perform data mining on it? Text-based datasets include books, essays, websites, manuscripts, programming code, and other forms of written expression. All of the algorithms we have seen so far deal with numerical or categorical features, so how do we convert our text into a format that the algorithm can deal with?

There are a number of measurements that could be taken. For instance, average word length and average sentence length are used to predict the readability of a document. However, there are lots of feature types, such as word occurrence, which we will now investigate.

Bag-of-words
One of the simplest but highly effective models is to simply count each word in the dataset. We create a matrix, where each row represents a document in our dataset and each column represents a word. The value of the cell is the frequency of that word in the document. Here's an excerpt from The Lord of the Rings, J.R.R. Tolkien:

Three Rings for the Elven-kings under the sky,
Seven for the Dwarf-lords in halls of stone,
Nine for Mortal Men, doomed to die,
One for the Dark Lord on his dark throne
In the Land of Mordor where the Shadows lie.
One Ring to rule them all, One Ring to find them,
One Ring to bring them all and in the darkness bind them.
In the Land of Mordor where the Shadows lie.
- J.R.R. Tolkien's epigraph to The Lord of the Rings
The word the appears nine times in this quote, while the words in, for, to, and one each appear four times. The word ring appears three times, as does the word of. We can create a dataset from this, choosing a subset of words and counting the frequency:

Word        the   one   ring   to
Frequency     9     4      3    4

We can use the Counter class to do a simple count for a given string. When counting words, it is normal to convert all letters to lowercase, which we do when creating the string. The code is as follows:

s = """Three Rings for the Elven-kings under the sky,
Seven for the Dwarf-lords in halls of stone,
Nine for Mortal Men, doomed to die,
One for the Dark Lord on his dark throne
In the Land of Mordor where the Shadows lie.
One Ring to rule them all, One Ring to find them,
One Ring to bring them all and in the darkness bind them.
In the Land of Mordor where the Shadows lie. """.lower()
words = s.split()
from collections import Counter
c = Counter(words)

Printing c.most_common(5) gives the list of the top five most frequently occurring words. Ties are not handled well, as only five are given and a very large number of words all share a tie for fifth place.

The bag-of-words model has three major types. The first is to use the raw frequencies, as shown in the preceding example. This does have a drawback when documents vary in size from few words to many words, as the overall values will be very different. The second model is to use the normalized frequency, where each document's sum equals 1. This is a much better solution, as the length of the document doesn't matter as much. The third type is to simply use binary features—a value is 1 if the word occurs at all and 0 if it doesn't. We will use the binary representation in this article.

Another popular (arguably more popular) method for performing normalization is called term frequency - inverse document frequency, or tf-idf. In this weighting scheme, term counts are first normalized to frequencies and then divided by the number of documents in which the term appears in the corpus.

There are a number of libraries for working with text data in Python. We will use a major one, called the Natural Language ToolKit (NLTK). The scikit-learn library also has the CountVectorizer class that performs a similar action, and it is recommended you take a look at it. However, the NLTK version has more options for word tokenization. If you are doing natural language processing in Python, NLTK is a great library to use.

N-grams
A step up from single bag-of-words features is that of n-grams. An n-gram is a subsequence of n consecutive tokens. In this context, a word n-gram is a set of n words that appear in a row. They are counted the same way, with the n-grams forming a word that is put in the bag. The value of a cell in this dataset is the frequency that a particular n-gram appears in the given document.

The value of n is a parameter. For English, setting it to between 2 and 5 is a good start, although some applications call for higher values. As an example, for n=3, we extract the first few n-grams in the following quote:

Always look on the bright side of life.

The first n-gram (of size 3) is Always look on, the second is look on the, the third is on the bright. As you can see, the n-grams overlap and cover three words. Word n-grams have advantages over using single words. This simple concept introduces some context to word use by considering its local environment, without a large overhead of understanding the language computationally.
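For a concrete feel of word n-grams, here is a minimal sketch that extracts the trigrams from the preceding quote using NLTK's ngrams helper (assuming NLTK is installed and the punkt tokenizer data has been downloaded with nltk.download('punkt')):

from nltk import word_tokenize
from nltk.util import ngrams

quote = "Always look on the bright side of life."
tokens = word_tokenize(quote.lower())

# Each trigram is a tuple of three consecutive tokens.
trigrams = list(ngrams(tokens, 3))
print(trigrams[:3])
# Starts with: [('always', 'look', 'on'), ('look', 'on', 'the'), ('on', 'the', 'bright')]

The same idea extends to character n-grams by sliding a window over the string itself rather than over the token list.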
A disadvantage of using n-grams is that the matrix becomes even sparser—word n-grams are unlikely to appear twice (especially in tweets and other short documents!). Especially for social media and other short documents, word n-grams are unlikely to appear in too many different tweets, unless it is a retweet. However, in larger documents, word n-grams are quite effective for many applications.

Another form of n-gram for text documents is the character n-gram. Rather than using sets of words, we simply use sets of characters (although character n-grams have lots of options for how they are computed!). This type of dataset can pick up words that are misspelled, as well as providing other benefits. We will test character n-grams in this article.

Other features
There are other features that can be extracted too. These include syntactic features, such as the usage of particular words in sentences. Part-of-speech tags are also popular for data mining applications that need to understand meaning in text. Such feature types won't be covered in this book. If you are interested in learning more, I recommend Python 3 Text Processing with NLTK 3 Cookbook, Jacob Perkins, Packt Publishing.

Naive Bayes
Naive Bayes is a probabilistic model that is, unsurprisingly, built upon a naive interpretation of Bayesian statistics. Despite the naive aspect, the method performs very well in a large number of contexts. It can be used for classification of many different feature types and formats, but we will focus on one in this article: binary features in the bag-of-words model.

Bayes' theorem
For most of us, when we were taught statistics, we started from a frequentist approach. In this approach, we assume the data comes from some distribution and we aim to determine what the parameters are for that distribution. However, those parameters are (perhaps incorrectly) assumed to be fixed. We use our model to describe the data, even testing to ensure the data fits our model.

Bayesian statistics instead models how people (non-statisticians) actually reason. We have some data and we use that data to update our model about how likely something is to occur. In Bayesian statistics, we use the data to describe the model rather than using a model and confirming it with data (as per the frequentist approach).

Bayes' theorem computes the value of P(A|B), that is, knowing that B has occurred, what is the probability of A. In most cases, B is an observed event such as it rained yesterday, and A is a prediction it will rain today. For data mining, B is usually we observed this sample and A is it belongs to this class. We will see how to use Bayes' theorem for data mining in the next section. The equation for Bayes' theorem is given as follows:

P(A|B) = P(B|A) x P(A) / P(B)

As an example, we want to determine the probability that an e-mail containing the word drugs is spam (as we believe that such an e-mail may be pharmaceutical spam). A, in this context, is the probability that this e-mail is spam. We can compute P(A), called the prior belief, directly from a training dataset by computing the percentage of e-mails in our dataset that are spam. If our dataset contains 30 spam messages for every 100 e-mails, P(A) is 30/100 or 0.3.

B, in this context, is this e-mail contains the word 'drugs'. Likewise, we can compute P(B) by computing the percentage of e-mails in our dataset containing the word drugs. If 10 e-mails in every 100 of our training dataset contain the word drugs, P(B) is 10/100 or 0.1.
Note that we don't care whether the e-mail is spam or not when computing this value. P(B|A) is the probability that an e-mail contains the word drugs if it is spam. It is also easy to compute from our training dataset. We look through our training set for spam e-mails and compute the percentage of them that contain the word drugs. Of our 30 spam e-mails, if 6 contain the word drugs, then P(B|A) is calculated as 6/30 or 0.2.

From here, we use Bayes' theorem to compute P(A|B), which is the probability that an e-mail containing the word drugs is spam. Using the previous equation, we see the result is 0.2 x 0.3 / 0.1 = 0.6. This indicates that if an e-mail has the word drugs in it, there is a 60 percent chance that it is spam.

Note the empirical nature of the preceding example—we use evidence directly from our training dataset, not from some preconceived distribution. In contrast, a frequentist view would rely on us creating a distribution of the probability of words in e-mails to compute similar equations.

Naive Bayes algorithm
Looking back at our Bayes' theorem equation, we can use it to compute the probability that a given sample belongs to a given class. This allows the equation to be used as a classification algorithm. With C as a given class and D as a sample in our dataset, we create the elements necessary for Bayes' theorem, and subsequently Naive Bayes. Naive Bayes is a classification algorithm that utilizes Bayes' theorem to compute the probability that a new data sample belongs to a particular class.

P(C) is the probability of a class, which is computed from the training dataset itself (as we did with the spam example). We simply compute the percentage of samples in our training dataset that belong to the given class.

P(D) is the probability of a given data sample. It can be difficult to compute this, as the sample is a complex interaction between different features, but luckily it is constant across all classes. Therefore, we don't need to compute it at all. We will see later how to get around this issue.

P(D|C) is the probability of the data point belonging to the class. This could also be difficult to compute due to the different features. However, this is where we introduce the naive part of the Naive Bayes algorithm. We naively assume that each feature is independent of the others. Rather than computing the full probability of P(D|C), we compute the probability of each feature D1, D2, D3, and so on. Then, we multiply them together:

P(D|C) = P(D1|C) x P(D2|C) x ... x P(Dn|C)

Each of these values is relatively easy to compute with binary features; we simply compute the percentage of times it is equal in our sample dataset. In contrast, if we were to perform a non-naive Bayes version of this part, we would need to compute the correlations between different features for each class. Such computation is infeasible at best, and nearly impossible without vast amounts of data or adequate language analysis models.

From here, the algorithm is straightforward. We compute P(C|D) for each possible class, ignoring the P(D) term. Then we choose the class with the highest probability. As the P(D) term is consistent across each of the classes, ignoring it has no impact on the final prediction.

How it works
As an example, suppose we have the following (binary) feature values from a sample in our dataset: [1, 0, 0, 1]. Our training dataset contains two classes, with 75 percent of samples belonging to class 0 and 25 percent belonging to class 1.
The likelihoods of the feature values for each class are as follows:

For class 0: [0.3, 0.4, 0.4, 0.7]
For class 1: [0.7, 0.3, 0.4, 0.9]

These values are to be interpreted as: for feature 1, it is a 1 in 30 percent of cases for class 0. We can now compute the probability that this sample should belong to class 0. P(C=0) = 0.75, which is the probability that the class is 0. P(D) isn't needed for the Naive Bayes algorithm. Let's take a look at the calculation:

P(D|C=0) = P(D1|C=0) x P(D2|C=0) x P(D3|C=0) x P(D4|C=0)
         = 0.3 x 0.6 x 0.6 x 0.7
         = 0.0756

The second and third values are 0.6, because the value of that feature in the sample was 0. The listed probabilities are for values of 1 for each feature. Therefore, the probability of a 0 is its inverse: P(0) = 1 - P(1).

Now, we can compute the probability of the data point belonging to this class. An important point to note is that we haven't computed P(D), so this isn't a real probability. However, it is good enough to compare against the same value for the probability of class 1. Let's take a look at the calculation:

P(C=0|D) = P(C=0) x P(D|C=0)
         = 0.75 x 0.0756
         = 0.0567

Now, we compute the same values for class 1:

P(C=1) = 0.25

P(D) isn't needed for Naive Bayes. Let's take a look at the calculation:

P(D|C=1) = P(D1|C=1) x P(D2|C=1) x P(D3|C=1) x P(D4|C=1)
         = 0.7 x 0.7 x 0.6 x 0.9
         = 0.2646

P(C=1|D) = P(C=1) x P(D|C=1)
         = 0.25 x 0.2646
         = 0.06615

Normally, P(C=0|D) + P(C=1|D) should equal 1. After all, those are the only two possible options! However, the probabilities here do not sum to 1 because we haven't included the computation of P(D) in our equations.

The data point should be classified as belonging to class 1. You may have guessed this while going through the equations anyway; however, you may have been a bit surprised that the final decision was so close. After all, the probabilities in computing P(D|C) were much, much higher for class 1. This is because we introduced a prior belief that most samples generally belong to class 0. If the classes had been of equal sizes, the resulting probabilities would be much different. Try it yourself by changing both P(C=0) and P(C=1) to 0.5 for equal class sizes and computing the result again.

Application
We will now create a pipeline that takes a tweet and determines whether it is relevant or not, based only on the content of that tweet. To perform the word extraction, we will be using NLTK, a library that contains a large number of tools for performing analysis on natural language. We will use NLTK in future articles as well.

To get NLTK on your computer, use pip to install the package:

pip3 install nltk

If that doesn't work, see the NLTK installation instructions at www.nltk.org/install.html.

We are going to create a pipeline to extract the word features and classify the tweets using Naive Bayes. Our pipeline has the following steps:

1. Transform the original text documents into a dictionary of counts using NLTK's word_tokenize function.
2. Transform those dictionaries into a vector matrix using the DictVectorizer transformer in scikit-learn. This is necessary to enable the Naive Bayes classifier to read the feature values extracted in the first step.
3. Train the Naive Bayes classifier, as we have seen in previous articles.

We will need to create another Notebook (the last one for this article!) called ch6_classify_twitter for performing the classification.

Extracting word counts
We are going to use NLTK to extract our word counts.
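NLTK's word_tokenize function splits a string into individual tokens. As a minimal sketch (the example string is made up, purely for illustration, and assumes the punkt tokenizer data has been downloaded):

from nltk import word_tokenize

# A hypothetical example string, just to show what the tokenizer produces.
print(word_tokenize("I love the Python programming language!"))
# ['I', 'love', 'the', 'Python', 'programming', 'language', '!']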
We still want to use it in a pipeline, but NLTK doesn't conform to our transformer interface. We will therefore need to create a basic transformer with both fit and transform methods, enabling us to use it in a pipeline.

First, set up the transformer class. We don't need to fit anything in this class, as this transformer simply extracts the words in the document. Therefore, our fit is an empty function, except that it returns self, which is necessary for transformer objects. Our transform is a little more complicated. We want to extract each word from each document and record True if it was discovered. We are only using binary features here—True if the word is in the document, False otherwise. If we wanted to use the frequency, we would set up counting dictionaries. Let's take a look at the code:

from nltk import word_tokenize
from sklearn.base import TransformerMixin

class NLTKBOW(TransformerMixin):
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return [{word: True for word in word_tokenize(document)}
                for document in X]

The result is a list of dictionaries, where the first dictionary contains the words in the first tweet, and so on. Each dictionary has a word as a key and the value True to indicate that this word was discovered. Any word not in the dictionary will be assumed to have not occurred in the tweet. Explicitly stating that a word's occurrence is False would also work, but would take up needless space to store.

Converting dictionaries to a matrix
This step converts the dictionaries built in the previous step into a matrix that can be used with a classifier. This step is made quite simple through the DictVectorizer transformer. The DictVectorizer class simply takes a list of dictionaries and converts them into a matrix. The features in this matrix are the keys in each of the dictionaries, and the values correspond to the occurrence of those features in each sample. Dictionaries are easy to create in code, but many data algorithm implementations prefer matrices. This makes DictVectorizer a very useful class.

In our dataset, each dictionary has words as keys, and a word only appears if it actually occurs in the tweet. Therefore, our matrix will have each word as a feature and a value of True in the cell if the word occurred in the tweet. To use DictVectorizer, simply import it using the following command:

from sklearn.feature_extraction import DictVectorizer

Training the Naive Bayes classifier
Finally, we need to set up a classifier, and we are using Naive Bayes for this article. As our dataset contains only binary features, we use the BernoulliNB classifier, which is designed for binary features. As a classifier, it is very easy to use. As with DictVectorizer, we simply import it and add it to our pipeline:

from sklearn.naive_bayes import BernoulliNB

Putting it all together
Now comes the moment to put all of these pieces together. In our IPython Notebook, set the filenames and load the dataset and classes as we have done before. Set the filenames for both the tweets themselves (not the IDs!) and the labels that we assigned to them. The code is as follows:

import os
input_filename = os.path.join(os.path.expanduser("~"), "Data", "twitter", "python_tweets.json")
labels_filename = os.path.join(os.path.expanduser("~"), "Data", "twitter", "python_classes.json")

Load the tweets themselves. We are only interested in the content of the tweets, so we extract the text value and store only that.
The code is as follows:

import json

tweets = []
with open(input_filename) as inf:
    for line in inf:
        if len(line.strip()) == 0:
            continue
        tweets.append(json.loads(line)['text'])

Load the labels for each of the tweets:

with open(labels_filename) as inf:
    labels = json.load(inf)

Now, create a pipeline putting together the components from before. Our pipeline has three parts:

- The NLTKBOW transformer we created
- A DictVectorizer transformer
- A BernoulliNB classifier

The code is as follows:

from sklearn.pipeline import Pipeline

pipeline = Pipeline([('bag-of-words', NLTKBOW()),
                     ('vectorizer', DictVectorizer()),
                     ('naive-bayes', BernoulliNB())])

We can nearly run our pipeline now, which we will do with cross_val_score as we have done many times before. Before that, though, we will introduce a better evaluation metric than the accuracy metric we used before. As we will see, accuracy is not adequate for datasets where the number of samples in each class is different.

Evaluation using the F1-score
When choosing an evaluation metric, it is always important to consider cases where that evaluation metric is not useful. Accuracy is a good evaluation metric in many cases, as it is easy to understand and simple to compute. However, it can be easily faked. In other words, in many cases you can create algorithms that have a high accuracy but poor utility. While our dataset of tweets contains about 50 percent programming-related and 50 percent nonprogramming tweets (your results may vary), many datasets aren't as balanced as this.

As an example, an e-mail spam filter may expect to see more than 80 percent of incoming e-mails be spam. A spam filter that simply labels everything as spam is quite useless; however, it will obtain an accuracy of 80 percent!

To get around this problem, we can use other evaluation metrics. One of the most commonly employed is called the f1-score (also called f-score, f-measure, or one of many other variations on this term). The f1-score is defined on a per-class basis and is based on two concepts: precision and recall. The precision is the percentage of all the samples that were predicted as belonging to a specific class that were actually from that class. The recall is the percentage of samples in the dataset that are in a class and were actually labeled as belonging to that class.

In the case of our application, we could compute the value for both classes (relevant and not relevant). However, we are really interested in the relevant tweets. Therefore, our precision computation becomes the question: of all the tweets that were predicted as being relevant, what percentage were actually relevant? Likewise, the recall becomes the question: of all the relevant tweets in the dataset, how many were predicted as being relevant?

After you compute both the precision and recall, the f1-score is the harmonic mean of the precision and recall:

F1 = 2 x (precision x recall) / (precision + recall)

To use the f1-score in scikit-learn methods, simply set the scoring parameter to f1. By default, this will return the f1-score of the class with label 1. Running the code on our dataset, we simply use the following lines of code:

from sklearn.model_selection import cross_val_score  # in older scikit-learn releases: sklearn.cross_validation
scores = cross_val_score(pipeline, tweets, labels, scoring='f1')

We then print out the average of the scores:

import numpy as np
print("Score: {:.3f}".format(np.mean(scores)))

The result is 0.798, which means we can accurately determine if a tweet using Python relates to the programming language nearly 80 percent of the time. This is using a dataset with only 200 tweets in it.
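For intuition, here is a minimal sketch (with made-up predictions, not our tweet data) showing how precision, recall, and the f1-score relate, using scikit-learn's metrics functions:

from sklearn.metrics import precision_score, recall_score, f1_score

# Hypothetical ground truth and predictions, purely for illustration.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

p = precision_score(y_true, y_pred)  # 3 of the 4 predicted positives are correct: 0.75
r = recall_score(y_true, y_pred)     # 3 of the 4 actual positives were found: 0.75
print(p, r, f1_score(y_true, y_pred))  # harmonic mean: 2 x (0.75 x 0.75) / (0.75 + 0.75) = 0.75

With only 200 labeled tweets, both precision and recall are quite sensitive to a handful of misclassified examples.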
Go back and collect more data and you will find that the results increase! More data usually means better accuracy, but it is not guaranteed!

Getting useful features from models
One question you may ask is: what are the best features for determining if a tweet is relevant or not? We can extract this information from our Naive Bayes model and find out which features are the best individually, according to Naive Bayes.

First, we fit a new model. While cross_val_score gives us a score across different folds of cross-validated testing data, it doesn't easily give us the trained models themselves. To do this, we simply fit our pipeline with the tweets, creating a new model. The code is as follows:

model = pipeline.fit(tweets, labels)

Note that we aren't really evaluating the model here, so we don't need to be as careful with the training/testing split. However, before you put these features into practice, you should evaluate them on a separate test split. We skip over that here for the sake of clarity.

A pipeline gives you access to the individual steps through the named_steps attribute and the name of the step (we defined these names ourselves when we created the pipeline object itself). For instance, we can get the Naive Bayes model and, from it, the probabilities for each word (the feature_log_prob_ attribute of the trained BernoulliNB model):

nb = model.named_steps['naive-bayes']
feature_probabilities = nb.feature_log_prob_

These are stored as log probabilities, which is simply log(P(A|f)), where f is a given feature. The reason these are stored as log probabilities is that the actual values are very low. For instance, the first value is -3.486, which correlates to a probability under 0.03 percent. Logarithm probabilities are used in computations involving small probabilities like this, as they stop underflow errors where very small values are just rounded to zero. Given that all of the probabilities are multiplied together, a single value of 0 would result in the whole answer always being 0! Regardless, the relationship between values is still the same; the higher the value, the more useful that feature is.

We can get the most useful features by sorting the array of logarithm probabilities. We want descending order, so we simply negate the values first. The code is as follows:

top_features = np.argsort(-feature_probabilities[1])[:50]

The preceding code will just give us the indices and not the actual feature values. This isn't very useful, so we will map the features' indices to the actual values. The key is the DictVectorizer step of the pipeline, which created the matrices for us. Luckily, this also records the mapping, allowing us to find the feature names that correlate to different columns. We can extract that transformer from the pipeline:

dv = model.named_steps['vectorizer']

From here, we can print out the names of the top features by looking them up in the feature_names_ attribute of DictVectorizer. Enter the following lines into a new cell and run it to print out a list of the top features:

for i, feature_index in enumerate(top_features):
    print(i, dv.feature_names_[feature_index],
          np.exp(feature_probabilities[1][feature_index]))

The first few features include :, http, # and @. These are likely to be noise (although the use of a colon is not very common outside programming), based on the data we collected. Collecting more data is critical to smoothing out these issues.
Looking through the list, though, we get a number of more obvious programming features:

7 for 0.188679245283
11 with 0.141509433962
28 installing 0.0660377358491
29 Top 0.0660377358491
34 Developer 0.0566037735849
35 library 0.0566037735849
36 ] 0.0566037735849
37 [ 0.0566037735849
41 version 0.0471698113208
43 error 0.0471698113208

There are some others too that refer to Python in a work context, and therefore might be referring to the programming language (although freelance snake handlers may also use similar terms, they are less common on Twitter):

22 jobs 0.0660377358491
30 looking 0.0566037735849
31 Job 0.0566037735849
34 Developer 0.0566037735849
38 Freelancer 0.0471698113208
40 projects 0.0471698113208
47 We're 0.0471698113208

That last one is usually in the format: We're looking for a candidate for this job. Looking through these features gives us quite a few benefits. We could train people to recognize these tweets, look for commonalities (which give insight into a topic), or even get rid of features that make no sense. For example, the word RT appears quite high in this list; however, this is a common Twitter phrase for retweet (that is, forwarding on someone else's tweet). An expert could decide to remove this word from the list, making the classifier less prone to the noise we introduced by having a small dataset.

Summary
In this article, we looked at text mining—how to extract features from text, how to use those features, and ways of extending those features. In doing this, we looked at putting a tweet in context—was this tweet mentioning python referring to the programming language?

We downloaded data from a web-based API, getting tweets from the popular microblogging website Twitter. This gave us a dataset that we labeled using a form we built directly in the IPython Notebook.

We also looked at reproducibility of experiments. While Twitter doesn't allow you to send copies of your data to others, it allows you to send the tweets' IDs. Using this, we created code that saved the IDs and recreated most of the original dataset. Not all tweets were returned; some had been deleted in the time since the ID list was created and the dataset was reproduced.

We used a Naive Bayes classifier to perform our text classification. This is built upon Bayes' theorem, which uses data to update the model, unlike the frequentist method that often starts with the model first. This allows the model to incorporate and update with new data, and to incorporate a prior belief. In addition, the naive part allows us to easily compute the probabilities without dealing with complex correlations between features.

The features we extracted were word occurrences—did this word occur in this tweet? This model is called bag-of-words. While it discards information about where a word was used, it still achieves a high accuracy on many datasets. This entire pipeline of using the bag-of-words model with Naive Bayes is quite robust. You will find that it can achieve quite good scores on most text-based tasks. It is a great baseline to use before trying more advanced models. As another advantage, the Naive Bayes classifier doesn't have any parameters that need to be set (although there are some if you wish to do some tinkering).

In the next article, we will look at extracting features from another type of data, graphs, in order to make recommendations on who to follow on social media.
Resources for Article: Further resources on this subject: Putting the Fun in Functional Python [article] Python Data Analysis Utilities [article] Leveraging Python in the World of Big Data [article]

article-image-crypto-cash-is-missing-from-the-wallet-of-dead-cryptocurrency-entrepreneur-gerald-cotten-find-it-and-you-could-get-100000
Richard Gall
05 Mar 2019
3 min read
Save for later

Crypto-cash is missing from the wallet of dead cryptocurrency entrepreneur Gerald Cotten - find it, and you could get $100,000

In theory, stealing cryptocurrency should be impossible. But a mystery has emerged that seems to throw all that into question and even suggests a bigger, much stranger conspiracy. Gerald Cotten, the founder of cryptocurrency exchange QuadrigaCX, died in December in India. He was believed to have left $136 million USD worth of crypto-cash in 'cold wallets' on his own laptop, to which only he had access. However, investigators from EY, who have been working on closing QuadrigaCX following Cotten's death, were surprised to find that the wallets were empty. In fact, it's believed the crypto-cash had disappeared from them months before Cotten died.

A cryptocurrency mystery now involving the FBI
The only lead in this mystery is the fact that the EY investigators have found other user accounts that appear to be linked to Gerald Cotten. There's a chance that Cotten used these to trade on his own exchange, but the nature of these exchanges remains a little unclear. To add to the intrigue, Fortune reported yesterday that the FBI is working with the Royal Canadian Mounted Police to investigate the missing money. This information came from Jesse Powell, CEO of another cryptocurrency company called Kraken. Powell told Fortune that both the FBI and the Mounted Police have been in touch with him about the mystery surrounding QuadrigaCX. Powell has offered a reward of $100,000 to anyone who can locate the missing cryptocurrency funds.

So what actually happened to Gerald Cotten and his crypto-cash?
The story has many layers of complexity. There are rumors that Cotten faked his own death. For example, Cotten filed a will just 12 days before his death, leaving a significant amount of wealth and assets to his wife. And while sources from the hospital in India where Cotten is believed to have died say he died of cardiac arrest, as Fortune explains, "Cotten's body was handled by hotel staff after an embalmer refused to receive it" - something which is, at the very least, strange. It should be noted that there is certainly no clear evidence that Cotten faked his own death - only missing pieces that encourage such rumors. A further subplot, which might or might not be useful in cracking this case, emerged late last week when Canada's Globe and Mail reported that QuadrigaCX's co-founder has a history of identity theft and using digital currencies to launder money.

Where could the money be?
There is, as you might expect, no shortage of theories about where the cash could be. A few days ago, it was suggested that it might be possible to locate Cotten's Ethereum funds: a blog post by James Edwards, the editor of cryptocurrency blog zerononcense, claimed that Ethereum linked to QuadrigaCX can be found in Bitfinex, Poloniex, and Jesse Powell's Kraken. "It appears that a significant amount of Ethereum (600,000+ ETH) was transferred to these exchanges as a means of 'storage' during the years that QuadrigaCX was in operation and offering Ethereum on their exchange," Edwards writes. Edwards is keen for his findings to be the starting point for a clearer line of inquiry, free from speculation and conspiracy. He wrote that he hoped that it would be "a helpful addition to the QuadrigaCX narrative, rather than a conspiratorial piece that speculates on whether the exchange or its owners have been honest."

article-image-top-announcements-from-the-tensorflow-dev-summit-2019
Sugandha Lahoti
08 Mar 2019
5 min read
Save for later

Top announcements from the TensorFlow Dev Summit 2019

The two-day TensorFlow Dev Summit 2019 has just wrapped up, leaving in its wake major updates to the TensorFlow ecosystem. The major announcement was the release of the first alpha version of the much-anticipated TensorFlow 2.0. Also announced were TensorFlow Lite 1.0, TensorFlow Federated, TensorFlow Privacy, and more.

TensorFlow Federated
In a Medium blog post, Alex Ingerman (Product Manager) and Krzys Ostrowski (Research Scientist) introduced the TensorFlow Federated framework on the first day. This open source framework is useful for experimenting with machine learning and other computations on decentralized data. As the name suggests, this framework uses Federated Learning, a learning approach introduced by Google in 2017. This technique enables ML models to collaboratively learn a shared prediction model while keeping all the training data on the device, thus freeing machine learning from the need to store the data in the cloud. The authors note that TFF is based on their experiences with developing federated learning technology at Google. TFF uses the Federated Learning API to express an ML model architecture, and then train it across data provided by multiple developers, while keeping each developer's data separate and local. It also uses the Federated Core (FC) API, a set of lower-level primitives, which enables the expression of a broad range of computations over a decentralized dataset. The authors conclude, "With TFF, we are excited to put a flexible, open framework for locally simulating decentralized computations into the hands of all TensorFlow users. You can try out TFF in your browser, with just a few clicks, by walking through the tutorials."

TensorFlow 2.0.0-alpha0
The event also saw the release of the first alpha version of the TensorFlow 2.0 framework, which comes with fewer APIs. First introduced last August by Martin Wicke, an engineer at Google, TensorFlow 2.0 is expected to come with:
Easy model building with Keras and eager execution.
Robust model deployment in production on any platform.
Powerful experimentation for research.
API simplification by reducing duplication and removing deprecated endpoints.
The first teaser, the TensorFlow 2.0.0-alpha0 version, comes with the following changes:
API clean-up, including removing tf.app, tf.flags, and tf.logging in favor of absl-py.
No more global variables with helper methods like tf.global_variables_initializer and tf.get_global_step.
Functions, not sessions (tf.Session and session.run -> tf.function).
Added support for TensorFlow Lite in TensorFlow 2.0.
tf.contrib has been deprecated, and its functionality has been either migrated to the core TensorFlow API, to tensorflow/addons, or removed entirely.
Checkpoint breakage for RNNs and for Optimizers.
Minor bug fixes have also been made to the Keras and Python API and tf.estimator. Read the full list of bug fixes in the changelog.
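To give a feel for the "functions, not sessions" change listed above, here is a minimal, hedged sketch of TF 2.x-style code. It is purely illustrative (not taken from the summit announcements) and assumes a TensorFlow 2.x installation:

import tensorflow as tf

# TF 2.x style: eager execution by default, no tf.Session or session.run.
# Decorating a Python function with tf.function traces it into a graph.
@tf.function
def weighted_sum(x, w):
    return tf.reduce_sum(x * w)

x = tf.constant([1.0, 2.0, 3.0])
w = tf.constant([0.5, 0.25, 0.25])
print(weighted_sum(x, w))  # tf.Tensor(1.75, shape=(), dtype=float32)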
TensorFlow Lite 1.0
The TF-Lite framework is designed to aid developers in deploying machine learning and artificial intelligence models on mobile and IoT devices. Lite was first introduced at the I/O developer conference in May 2017 and in developer preview later that year. At the TensorFlow Dev Summit, the team announced a new version of this framework, TensorFlow Lite 1.0. According to a post by VentureBeat, improvements include selective registration and quantization during and after training for faster, smaller models. The team behind TF-Lite 1.0 says that quantization has helped them achieve up to 4 times compression of some models.

TensorFlow Privacy
Another interesting library released at the TensorFlow Dev Summit was TensorFlow Privacy. This Python-based open source library helps developers train their machine-learning models with strong privacy guarantees. To achieve this, it takes inspiration from the principles of differential privacy. This technique offers strong mathematical guarantees that models do not learn or remember details about any specific user when training on user data. TensorFlow Privacy includes implementations of TensorFlow optimizers for training machine learning models with differential privacy. For more information, you can go through the technical whitepaper describing its privacy mechanisms in more detail. The creators also note that "no expertise in privacy or its underlying mathematics should be required for using TensorFlow Privacy. Those using standard TensorFlow mechanisms should not have to change their model architectures, training procedures, or processes."

TensorFlow Replicator
TF-Replicator, also released at the TensorFlow Dev Summit, is a software library that helps researchers deploy their TensorFlow models on GPUs and Cloud TPUs. To do this, the creators assure that developers require minimal effort and need no previous experience with distributed systems. For multi-GPU computation, TF-Replicator relies on an "in-graph replication" pattern, where the computation for each device is replicated in the same TensorFlow graph. When TF-Replicator builds an in-graph replicated computation, it first builds the computation for each device independently and leaves placeholders where cross-device computation has been specified by the user. Once the sub-graphs for all devices have been built, TF-Replicator connects them by replacing the placeholders with actual cross-device computation. For a more comprehensive description, you can go through the research paper.

These were the top announcements made at the TensorFlow Dev Summit 2019. You can go through the keynote and other videos of the announcements and tutorials on this YouTube playlist.

TensorFlow 2.0 to be released soon with eager execution, removal of redundant APIs, tf.function and more.
TensorFlow 2.0 is coming. Here's what we can expect.
Google introduces and open-sources Lingvo, a scalable TensorFlow framework for Sequence-to-Sequence Modeling
article-image-introduction-practical-business-intelligence
Packt
10 Nov 2016
20 min read
Save for later

Introduction to Practical Business Intelligence

In this article, Ahmed Sherif, author of the book Practical Business Intelligence, explains what business intelligence is. Before answering this question, I want to pose and answer another one: what isn't business intelligence? It is not spreadsheet analysis done on transactional data with thousands of rows. One of the goals of Business Intelligence, or BI, is to shield the users of the data from the intelligent logic lurking behind the scenes of the application that is delivering the data to them. If the integrity of the data is compromised in any way by an individual not intimately familiar with the data source, then there cannot, by definition, be intelligence in the business decisions made with that same data. The single source of truth is the key for any Business Intelligence operation, whether it is a mom and pop soda shop or a Fortune 500 company. Any report, dashboard, or analytical application that delivers information to a user through a BI tool, but whose numbers cannot be tied back to the original source, will break the trust between the user and the data and defeat the purpose of Business Intelligence.

(For more resources related to this topic, see here.)

In my opinion, the most successful tools used for business intelligence directly shield the business user from the query logic used for displaying that same data in some form of visual manner. Business Intelligence has taken many forms in terms of labels over the years. Business Intelligence is the process of delivering actionable business decisions from analytical manipulation and presentation of data within the confines of a business environment. This delivery process is where we will focus our attention. The beauty of BI is that it is not owned by any one particular tool that is proprietary to a specific industry or company. Business Intelligence can be delivered using many different tools, some of which were not originally intended to be used for BI. The tool itself should not be the source where the query logic is applied to generate the business logic of the data. The tool should primarily serve as the delivery mechanism of the query that is generated by the data warehouse that houses both the data and the logic.

In this chapter we will cover the following topics:
Understanding the Kimball method
Understanding business intelligence
Data and SQL
Working with data and SQL
Working with business intelligence tools
Downloading and installing MS SQL Server 2014
Downloading and installing AdventureWorks

Understanding the Kimball method
As we discuss the data warehouse where our data is being housed, we would be remiss not to bring up Ralph Kimball, one of the original architects of the data warehouse. Kimball's methodology incorporated dimensional modeling, which has become the standard for modeling a data warehouse for Business Intelligence purposes. Dimensional modeling incorporates joining tables that have detail data and tables that have lookup data.

A detail table is known as a fact table in dimensional modeling. An example of a fact table would be a table holding thousands of rows of transactional sales from a retail store. The table will house several IDs affiliated with the product, the sales person, the purchase date, and the purchaser, just to name a few. Additionally, the fact table will store numeric data for each individual transaction, such as sales quantity or sales amount, to name a few examples. These numeric values will be referred to as measures.

While there is usually one fact table, there will also be several lookup or dimensional tables, with one table for each ID that is used in the fact table. So, for example, there would be one dimensional table for the product name affiliated with a product ID. There would be one dimensional table for the month, week, day, and year of the ID affiliated with the date. These dimensional tables are also referred to as Lookup Tables, because they look up the name that a dimension ID is affiliated with. Usually, you would find as many dimensional tables as there are IDs in the fact table. The dimensional tables would all be joined to the one fact table, creating something of a 'star' look. Hence, this type of table join is known as a star schema, which is represented diagrammatically in the following figure.

It is customary that the fact table will be the largest table in a data warehouse, while the lookup tables will all be quite small in rows, some as small as one row. The tables are joined by IDs, also known as surrogate keys. Surrogate keys allow for the most efficient join between a fact table and a dimensional table, as they usually have an integer data type. As more and more detail is added to a dimensional table, each new dimension is just given the next number in line, usually starting with 1. Query performance between table joins suffers when we introduce non-numeric characters into the join or, worse, symbols (although most databases will not allow that).
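As an illustrative aside (not part of the original article, which builds these joins in SQL Server), the same fact-to-dimension pattern can be sketched in Python with pandas; the table and column names below are hypothetical:

import pandas as pd

# Hypothetical star schema: a fact table keyed by surrogate integer IDs,
# plus a small product dimension (lookup) table.
fact_sales = pd.DataFrame({
    "ProductID":     [1, 2, 1, 3],
    "SalesQuantity": [5, 2, 7, 1],
    "SalesAmount":   [50.0, 40.0, 70.0, 15.0],
})
dim_product = pd.DataFrame({
    "ProductID":   [1, 2, 3],
    "ProductName": ["Road Bike", "Helmet", "Water Bottle"],
})

# Join the fact table to the dimension on the surrogate key, then aggregate
# the measures by the dimension attribute.
joined = fact_sales.merge(dim_product, on="ProductID", how="left")
print(joined.groupby("ProductName")[["SalesQuantity", "SalesAmount"]].sum())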
Understanding business intelligence architecture
I will continuously hammer home the point that the various tools utilized to deliver the visual and graphical BI components should not house any internal logic to filter data out of the tool, nor should they be the source of any built-in calculations. The tools themselves should not house this logic, as they will be utilized by many different users. If each user who develops a BI app off of the tool incorporates different internal filters within the tool, the single source of truth tying back to the data warehouse will become multiple sources of truth. Any logic applied to the data to filter out a specific dimension or to calculate a specific measure should be applied in the data warehouse and then pulled into the tool. For example, if the requirement for a BI dashboard was to show current year and prior year sales for US regions only, the filter for region code would ideally be applied in the data warehouse as opposed to inside of the tool.

The following is a query written in SQL joining two tables from the AdventureWorks database that highlights the difference between dimensions and measures. The 'region' column is a dimension column and the 'SalesYTD' and 'SalesPY' columns are measure columns. In this example, 'TerritoryID' is serving as the key join between 'SalesTerritory' and 'SalesPerson'. Since the measures are coming from the 'SalesPerson' table, that table will serve as the fact table and 'SalesPerson.TerritoryID' will serve as the fact ID. Since the Region column is dimensional and coming from the 'SalesTerritory' table, that table will serve as the dimensional or lookup table and 'SalesTerritory.TerritoryID' will serve as the dimension ID. In a finely-tuned data warehouse, both the fact ID and dimension ID would be indexed to allow for efficient query performance.
This performance is obtained by sorting the IDs numerically, so that a row from one table that is being joined to another table does not have to be searched against the entire table but only a subset of that table. When the table is only a few hundred rows, it may not seem necessary to index columns, but when the table grows to a few hundred million rows, it may become necessary.

Select region.Name as Region
,round(sum(sales.SalesYTD),2) as SalesYTD
,round(sum(sales.SalesLastYear),2) as SalesPY
FROM [AdventureWorks2014].[Sales].[SalesTerritory] region
left outer join [AdventureWorks2014].[Sales].[SalesPerson] sales
on sales.TerritoryID = region.TerritoryID
where region.CountryRegionCode = 'US'
Group by region.Name
order by region.Name asc

There are several reasons why applying the logic at the database level is considered a best practice. Most of the time, these requests for filtering data or manipulating calculations are done at the BI tool level because it is easier for the developer than going to the source. However, if these filters are being performed due to data quality issues, then applying logic at the reporting level is only masking an underlying data issue that needs to be addressed across the entire data warehouse. You would be doing yourself a disservice in the long run, as you would be establishing a precedent that data quality is handled by the report developer as opposed to the database administrator. You are just adding additional work onto your plate.

Ideal BI tools will quickly connect to the data source and then allow for slicing and dicing of your dimensions and measures in a manner that will quickly inform the business of useful and practical information. Ultimately, the choice of a BI tool by an individual or an organization will come down to the ease of use of the tool as well as the flexibility to showcase the data through various components such as graphs, charts, widgets, and infographics.

Management
If you are a Business Intelligence manager looking to establish a department with a variety of tools to help flesh out your requirements, this article could serve as a good source for interview questions to weed out unqualified candidates. A manager could use it to distinguish some of the nuances between these different skillsets and prioritize hiring based on immediate needs.

Data Scientist
The term Data Scientist has been misused in the BI industry, in my humble opinion. It has been lumped in with Data Analyst as well as BI Developer. Unfortunately, these three positions have separate skillsets, and you will do yourself a disservice by assuming one person can do multiple positions successfully. A Data Scientist will be able to apply statistical algorithms to the data that is being extracted from the BI tools and make predictions about what will happen in the future with that same data set. Due to this skillset, a Data Scientist may find the chapters focusing on R and Python to be of particular importance because of their abilities to leverage predictive capabilities within their BI delivery mechanisms.

Data Analyst
A Data Analyst is probably the second most misused position behind a Data Scientist. Typically, a Data Analyst should be analyzing the data that is coming out of the BI tools that are connected to the data warehouse. Most Data Analysts are comfortable working with Microsoft Excel. Often, they are asked to take on additional roles in developing dashboards that require additional programming skills.
This is where they would find some comfort using a tool like Power BI, Tableau, or QlikView. These tools allow a Data Analyst to quickly develop a storyboard or visualization that enables quick analysis with minimal programming skills.

Visualization Developer
A 'dataviz' developer is someone who can create complex visualizations out of data and showcase interesting interactions between different measures inside of a dataset that cannot necessarily be seen with a traditional chart or graph. More often than not, these developers possess some programming background such as JavaScript, HTML, or CSS. These developers are also used to developing applications directly for the web and would therefore find D3.js a comfortable environment to program in.

Working with Data and SQL
The examples and exercises will come from the AdventureWorks database. The AdventureWorks database has a comprehensive list of tables that mimics an actual bicycle retailer. The examples will draw on different tables from the database to highlight BI reporting from the various segments appropriate for the AdventureWorks Company. These segments include Human Resources, Manufacturing, Sales, Purchasing, and Contact Management. A different segment of the data will be highlighted in each chapter, utilizing a specific set of tools. A cursory understanding of SQL (Structured Query Language) will be helpful to get a grasp of how data is being aggregated with dimensions and measures. Additionally, an understanding of the SQL statements used will help with the validation process to ensure a single source of truth between the source data and the output inside of the BI tool of choice. For more information about learning SQL, visit the following website: www.sqlfordummies.com

Working with business intelligence tools
Over the course of the last 20 years, there has been a growing number of software products released that are geared towards Business Intelligence. In addition, there have been a number of software products and programming languages that were not initially built for BI but later became a staple for the industry. The tools used here were chosen based on the fact that they are either built on open source technology or they are products from companies that provide free versions of their software for development purposes. Many of the big enterprise firms have their own BI tools, and they are quite popular. However, unless you have a license with them, it is unlikely that you will be able to use their tool without having to shell out a small fortune.

Power BI and Excel
Power BI is one of the relatively newer BI tools from Microsoft. It is known as a self-service solution and integrates seamlessly with other data sources such as Microsoft Excel and Microsoft SQL Server. Our primary purpose in using Power BI will be to generate interactive dashboards, reports, and datasets for users. In addition to using Power BI, we will also focus on utilizing Microsoft Excel to assist with some data analysis and validation of results that are being pulled from our data warehouse. Pivot tables are very popular within MS Excel and will be used to validate aggregation done inside of the data warehouse.

D3.js
D3.js, also known as data-driven documents, is a JavaScript library known for delivering beautiful visualizations by manipulating documents based on data. Since D3 is rooted in JavaScript, all visualizations make a seamless transition to the web.
D3 allows for major customization of any part of a visualization, and because of this flexibility, it will require a steeper learning curve than probably any other software program covered. D3 can consume data easily as a .json or a .csv file. Additionally, the data can also be embedded directly within the JavaScript code that renders the visualization on the web.

R
R is a free and open source statistical programming language that produces beautiful graphics. The R language has been widely used among the statistical community and, more recently, in the data science and machine learning community as well. Due to this fact, it has picked up steam in recent years as a platform for displaying and delivering effective and practical BI. In addition to visualizing BI, R has the ability to also visualize predictive analysis with algorithms and forecasts. While R is a bit raw in its interface, there are some IDEs (Integrated Development Environments) that have been developed to ease the user experience. RStudio will be used to deliver the visualizations developed within R.

Python
Python is considered the most traditional programming language of all the different languages used here. It is a widely used, general-purpose programming language with several modules that are very powerful for analyzing and visualizing data. Similar to R, Python is a bit raw in its own form for delivering beautiful graphics as a BI tool; however, with the incorporation of an IDE, the user interface becomes a much more pleasurable development experience. PyCharm will be the IDE used to develop BI with Python. PyCharm is free to use and allows creation of the IPython Notebook, which delivers seamless integration between Python and the powerful modules that will assist with BI. As a note, all code in Python will be developed using Python 3 syntax.

QlikView
QlikView is a software company specializing in delivering business intelligence solutions using their desktop tool. QlikView is one of the leaders in delivering quick visualizations based on data and queries through their desktop application. They advertise themselves as self-service BI for business users. While they do offer solutions that target more enterprise organizations, they also offer a free version of their tool for personal use. Tableau is probably the closest competitor in terms of delivering similar BI solutions.

Tableau
Tableau is a software company specializing in delivering business intelligence solutions using their desktop tool. If this sounds familiar to QlikView, it's probably because it's true. Both are leaders in the field of establishing a delivery mechanism with easy installation, setup, and connectivity to the available data. Tableau has a free version of their desktop tool. Again, Tableau excels at delivering both beautiful visualizations quickly as well as self-service data discovery for more advanced business users.

Microsoft SQL Server
Microsoft SQL Server will serve as the data warehouse for the examples that we will work through with the BI tools. Microsoft SQL Server is relatively simple to install and set up, and it is free to download. Additionally, there are example databases that configure seamlessly with it, such as the AdventureWorks database.

Downloading and Installing MS SQL Server 2014
First things first. We will need to get started with getting our database and data warehouse up and running so that we can begin to develop our BI environment. We will visit the Microsoft website below to start the download selection process.
https://www.microsoft.com/en-us/download/details.aspx?id=42299

Select the language that is applicable to you and also select the MS SQL Server Express version with Advanced features, 64-bit edition, as shown in the following screenshot. Ideally, you'll want to be working with a 64-bit edition when dealing with servers. After selecting the file, the download process should begin. Depending on your connection speed, it could take some time, as the file is slightly larger than 1 GB.

The next step in the process is selecting a new stand-alone instance of SQL Server 2014, unless you already have a version and wish to upgrade instead, as shown in the following screenshot. After accepting the license terms, continue through the steps in the Global Rules as well as the Product Updates to get to the setup installation files. For the feature selection tab, make sure the following features are selected for your installation, as shown in the following screenshot.

Our preference is to give the named instance of this database a name related to the work we are doing. Since this will be used for Business Intelligence, I went ahead and named this instance 'SQLBI', as shown in the following screenshot.

The default Server Configuration settings are sufficient for now; there is no need to change anything under that section, as shown in the following screenshot. Unless you are required to do so within your company or organization, for personal use it is sufficient to just go with Windows Authentication mode for sign-on, as shown in the following screenshot.

We will not need to do any configuring of Reporting Services, so it is sufficient for our purposes to just go with installing Reporting Services Native mode without any need for configuration at this time. At this point the installation will proceed and may take anywhere between 20-30 minutes, depending on the CPU resources. If you continue to have issues with your installation, you can visit the following website from Microsoft for additional help.

http://social.technet.microsoft.com/wiki/contents/articles/23878.installing-sql-server-2014-step-by-step-tutorial.aspx

Ultimately, if everything with the installation is successful, you'll want to see that all portions of the installation have a green check mark next to their name and are labeled 'Successful', as shown in the following screenshot.

Downloading and Installing AdventureWorks
We are almost finished with getting our business intelligence data warehouse complete. We are now at the stage where we will extract and load data into our data warehouse. The last part is to download and install the AdventureWorks database from Microsoft. The zipped file for AdventureWorks 2014 is located at the following website from Microsoft:

https://msftdbprodsamples.codeplex.com/downloads/get/880661

Once the file is downloaded and unzipped, you will find a file named the following: AdventureWorks2014.bak

Copy that file and paste it into the following folder, where it will be incorporated with your Microsoft SQL Server 2014 Express Edition:

C:\Program Files\Microsoft SQL Server\MSSQL12.SQLBI\MSSQL\Backup

Also note that the MSSQL12.SQLBI subfolder will vary user by user, depending on what you named your SQL instance when you were installing MS SQL Server 2014.
Once that has been copied over, we can fire up Management Studio for SQL Server 2014 and start a blank new query by going to File | New | Query with Current Connection. Once you have a blank query set up, copy and paste the following code into the query window and execute it:

use [master]
Restore database AdventureWorks2014
from disk = 'C:\Program Files\Microsoft SQL Server\MSSQL12.SQLBI\MSSQL\Backup\AdventureWorks2014.bak'
with move 'AdventureWorks2014_data'
to 'C:\Program Files\Microsoft SQL Server\MSSQL12.SQLBI\MSSQL\DATA\AdventureWorks2014.mdf',
Move 'AdventureWorks2014_log'
to 'C:\Program Files\Microsoft SQL Server\MSSQL12.SQLBI\MSSQL\DATA\AdventureWorks2014.ldf'
, replace

Once again, please note that the MSSQL12.SQLBI subfolder will vary user by user, depending on what you named your SQL instance when you were installing MS SQL Server 2014. At this point, you should have received a message within the database saying that Microsoft SQL Server has processed 24248 pages for database 'AdventureWorks2014'. Once you have refreshed your database tab in the upper left-hand corner of SQL Server, the AdventureWorks database will become visible, as well as all of the appropriate tables, as shown in the following screenshot.

One final step is to verify that your login account has all of the appropriate server settings. When you right-click on the SQL Server name in the upper left-hand portion of Management Studio, select Properties. Select Permissions inside Properties. Find your username and check all of the rights under the Grant column, as shown in the following screenshot.

Finally, we also need to ensure that the folder that houses Microsoft SQL Server 2014 has the appropriate rights enabled for your current user. That specific folder is located under C:\Program Files\Microsoft SQL Server. For the purposes of our exercises, we will assign all rights for the SQL Server user to this folder, as shown in the following screenshot.

We are now ready to begin connecting our BI tools to our data!

Summary
The emphasis will be placed on implementing Business Intelligence best practices within the various tools that will be used, based on the different levels of data that are provided within the AdventureWorks database. In the next chapter, we will cover extracting additional data from the web that will be joined to the AdventureWorks database. This process is known as web scraping and can be performed with great success using tools such as Python and R. In addition to collecting the data, we will focus on transforming the collected data for optimal query performance.

Resources for Article: Further resources on this subject: LabVIEW Basics [article] Thinking Probabilistically [article] Clustering Methods [article]

article-image-including-charts-and-graphics-pentaho-reports-part-2
Packt
29 Oct 2009
6 min read
Save for later

Including Charts and Graphics in Pentaho Reports (Part 2)

Ring chart
The ring chart is identical to the pie chart, except that it renders as a ring versus a complete pie. In addition to sharing all the properties similar to the pie chart, it also defines the following rendering property:

Options Property Group
section-depth: This property defines the percentage of the radius to render the section as. The default value is set to 0.5.

Ring chart example
For this example, simply open the defined pie chart example and select the Ring chart type. Also, set the section-depth to 0.1, in order to generate the following effect:

Multi pie chart
The multi pie chart renders a group of pie charts, based on a category dataset. This meta-chart renders individual series data as a pie chart, each broken into individual categories within the individual pie charts. The multi pie chart utilizes the common properties defined above, including the category dataset properties. In addition to the standard set of properties, it also defines the following two properties:

Options Property Group
label-format: This label defines how each item within a chart is rendered. The default value is set to "{0}". The format string may also contain any of the following: {0} to render the item name, {1} to render the item value, and {2} to render the item percentage in relation to the entire pie chart.
by-row: This value defaults to True. If set to False, the series and category fields are reversed, and individual charts render series information.

Note that the horizontal, series-color, stacked and stacked-percent properties do not apply to this chart type.

Multi pie chart example
This example demonstrates the distribution of purchased item types, based on payment type. To begin, create a new report. You'll reuse the bar chart's SQL query. Now, place a new Chart element into the Report Header. Edit the chart, selecting Multi Pie as the chart type. To configure the dataset for this chart, select ITEMCATEGORY as the category-column. Set the value-columns property to QUANTITY and the series-by-field to PAYMENTTYPE.

Waterfall chart
The waterfall chart displays a unique stacked bar chart that spans categories. This chart is useful when comparing categories to one another. The last category in a waterfall chart normally equals the total of all the other categories to render appropriately, but this is based on the dataset, not the chart rendering. The waterfall chart utilizes the common properties defined above, including the category dataset properties. The stacked property is not available for this chart. There are no additional properties defined for the waterfall chart.

Waterfall chart example
In this example, you'll compare, by type, the quantity of items in your inventory. Normally, the last category would be used to display the total values. The chart will render the data provided with or without a summary series, so you'll just use the example SQL query from the bar chart example. Add a Chart element to the Report Header and select Waterfall as the chart type. Set the category-column to ITEMCATEGORY, the value-columns to QUANTITY, and the series-by-value property to Quantity. Now, apply your changes and preview the results.

Bar line chart
The bar line chart combines the bar and line charts, allowing visualization of trends with categories, along with comparisons. The bar line chart is unique in that it requires two category datasets to populate the chart.
The first dataset populates the bar chart, and the second dataset populates the line chart. The bar line chart utilizes the common properties defined above, including the category dataset properties. This chart also inherits the properties from both the bar chart as well as the line chart. This chart also has certain additional properties, which are listed in the following table:

Required Property Group
bar-data-source: The name of the first dataset required by the bar line chart, which populates the bars in the chart. This value is automatically populated with the correct name.
line-data-source: The name of the second dataset required by the bar line chart, which populates the lines in the chart. This value is automatically populated with the correct name.

Bar Options Property Group
ctgry-tick-font: Defines the Java font that renders the Categories.

Line Options Property Group
line-series-color: Defines the color in which to render each line series.
line-tick-fmt: Specifies the Java DecimalFormat string for rendering the Line Axis Labels.
lines-label-font: Defines the Java font to use when rendering line labels.
line-tick-font: Defines the Java font to use when rendering the Line Axis Labels.

As part of the bar line chart, a second y-axis is defined for the lines. The property group Y2-Axis is available with similar properties as the standard y-axis.

Bar line chart example
To demonstrate the bar line chart, you'll reuse the SQL query from the area chart example. Create a new report, and add a Chart element to the Report Header. Edit the chart, and select Bar Line as the chart type. You'll begin by configuring the first dataset. Set the category-column to ITEMCATEGORY, the value-columns to COST, and the series-by-value property to Cost. To configure the second dataset, set the category-column to ITEMCATEGORY, the value-columns to SALEPRICE, and the series-by-value property to Sale Price. Set the x-axis-label-width to 2.0, and reduce the x-font size to 7. Also, set show-legend to True. You're now ready to preview the bar line chart.

Bubble chart
The bubble chart allows you to view three dimensions of data. The first two dimensions are your traditional X and Y dimensions, also known as domain and range. The third dimension is expressed by the size of the individual bubbles rendered. The bubble chart utilizes the common properties defined above, including the XY series dataset properties. The bubble chart also defines the following properties:

Options Property Group
max-bubble-size: This value defines the diameter of the largest bubble to render. All other bubble sizes are relative to the maximum bubble size. The default value is 0, so this value must be set to a reasonable value for rendering of bubbles to take place. Note that this value is based on pixels, not the domain or range values.

The bubble chart defines the following additional dataset property:

Required Property Group
z-value-columns: This is the data source column to use for the Z value, which specifies the bubble diameter relative to the maximum bubble size.

Bubble chart example
In this example, you need to define a three-dimensional SQL query to populate the chart.
You'll use inventory information, and calculate Item Category Margin:

SELECT
"INVENTORY"."ITEMCATEGORY",
"INVENTORY"."ONHAND",
"INVENTORY"."ONORDER",
"INVENTORY"."COST",
"INVENTORY"."SALEPRICE",
"INVENTORY"."SALEPRICE" - "INVENTORY"."COST" MARGIN
FROM
"INVENTORY"
ORDER BY
"INVENTORY"."ITEMCATEGORY" ASC

Now that you have a SQL query to work with, add a Chart element to the Report Header and select Bubble as the chart type. First, you'll populate the correct dataset fields. Set the series-by-field property to ITEMCATEGORY. Now, set the X, Y, and Z value columns to ONHAND, SALEPRICE, and MARGIN. You're now ready to customize the chart rendering. Set the x-title to On Hand, the y-title to Sales Price, the max-bubble-size to 100, and the show-legend property to True. The final result should look like this: