How-To Tutorials - Data

Hadoop Monitoring and its aspects

Packt
04 May 2015
8 min read
In this article by Gurmukh Singh, the author of the book Monitoring Hadoop, we look at why monitoring Hadoop is important. It also explains various other Hadoop-related concepts, such as its architecture and Ganglia (a tool used to monitor Hadoop).

In any enterprise, big or small, it is very important to monitor the health of all its components, such as servers, network devices, and databases, and make sure things are working as intended. Monitoring is critical for any business that depends on infrastructure; it provides the signals needed to take the necessary actions in case of failures. Monitoring can be very complex in a real production environment, with many components and configurations. There might be different security zones, different ways in which servers are set up, or the same database might be used in many different ways, with servers listening on various service ports.

Before diving into setting up monitoring and logging for Hadoop, it is very important to understand the basics of monitoring, how it works, and some commonly used tools in the market. In Hadoop, we can monitor resources and services, and also collect metrics from various Hadoop counters. There are many tools available in the market, and one of them is Nagios, which is widely used.

Nagios is a powerful monitoring system that provides you with instant awareness of your organization's mission-critical IT infrastructure. By using Nagios, you can:

- Plan release cycles and rollouts before things get outdated
- Detect problems early, before they cause an outage
- Automate and improve the response across the organization

Nagios architecture

Nagios is based on a simple server-client architecture, in which the server can execute checks remotely using NRPE agents on the Linux clients. The results of the execution are captured by the server, which raises alerts accordingly. The checks could be for memory, disk, CPU utilization, network, database connections, and many more. Nagios provides the flexibility to use either active or passive checks.
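As a rough illustration of how such checks are wired up on the Nagios side, here is a minimal sketch of Nagios object definitions for a NameNode host; the host name, address, and NRPE command names are hypothetical, and the referenced commands would have to be configured on the monitored node's NRPE agent.

  define host {
      use        linux-server
      host_name  namenode01            ; hypothetical NameNode host
      address    192.168.10.11
  }

  define service {
      use                  generic-service
      host_name            namenode01
      service_description  NameNode disk space
      check_command        check_nrpe!check_disk    ; run the standard check_disk plugin via NRPE
  }

  define service {
      use                  generic-service
      host_name            namenode01
      service_description  NameNode process
      check_command        check_nrpe!check_namenode_proc   ; hypothetical NRPE command that checks the daemon
  }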
Ganglia

Ganglia is a beautiful tool for aggregating stats and plotting them nicely. Nagios gives you events and alerts; Ganglia aggregates the data and presents it in a meaningful way. What if you want to look at total CPU or memory for a cluster of 2000 nodes, or total free disk space across 1000 nodes? Some of the key features of Ganglia are:

- View historical and real-time metrics of a single node or of the entire cluster
- Use the data to make decisions on cluster sizing and performance

Ganglia components

- Ganglia Monitoring Daemon (gmond): This runs on the nodes that need to be monitored, captures state changes, and sends updates using XDR to a central daemon.
- Ganglia Meta Daemon (gmetad): This collects data from gmond and other gmetad daemons. The data is indexed and stored on disk in a round-robin fashion. There is also a Ganglia frontend for a meaningful display of the information collected.

All these tools can be integrated with Hadoop to monitor it and capture its metrics.

Integration with Hadoop

There are many important components in Hadoop that need to be monitored, such as NameNode uptime, disk space, memory utilization, and heap size. Similarly, on the DataNode we need to monitor disk usage and memory utilization, and across the MapReduce components we need to monitor the job execution flow status. To know what to monitor, we must understand how the Hadoop daemons communicate with each other.

There are lots of ports used in Hadoop; some are for internal communication, such as job scheduling and replication, while others are for user interactions. They may be exposed using TCP or HTTP. The Hadoop daemons provide information over HTTP about logs, stacks, and metrics that can be used for troubleshooting. The NameNode can expose information about the file system, live or dead nodes, and block reports from the DataNodes, and the JobTracker can expose information for tracking running jobs. Hadoop uses TCP, HTTP, IPC, or sockets for communication among the nodes or daemons.

The YARN framework

YARN (Yet Another Resource Negotiator) is the new MapReduce framework. It is designed to scale for large clusters and performs much better compared to the old framework. There is a new set of daemons in the new framework, and it is good to understand how they communicate with each other.

Logging in Hadoop

In Hadoop, each daemon writes its own logs, and the severity of logging is configurable. The logs in Hadoop can be related to the daemons or to the jobs submitted. They are useful for troubleshooting slowness, issues with MapReduce tasks, connectivity problems, and platform bugs. The logs generated can be user level, like the TaskTracker logs on each node, or can be related to master daemons such as the NameNode and JobTracker. In the newer YARN platform, there is a feature to move the logs to HDFS after the initial logging. In Hadoop 1.x, user log management is done using UserLogManager, which cleans and truncates logs according to retention and size parameters such as mapred.userlog.retain.hours and mapreduce.cluster.map.userlog.retain-size, respectively. The tasks' standard output and error streams are piped to the Unix tail program, so only the required size is retained.

The following are some of the challenges of log management in Hadoop:

- Excessive logging: The truncation of logs is not done until the tasks finish. For many jobs, this can cause disk space issues, as the amount of data written is quite large.
- Truncation: We cannot always say what to log and how much is good enough. For some users, 500 KB of logs might be enough, but for others 10 MB might not suffice.
- Retention: How long to retain logs, 1 month or 6 months? There is no rule, but there are best practices and governance requirements. In many countries, there is regulation in place to keep data for 1 year. A good practice for any organization is to keep logs for at least 6 months.
- Analysis: What if we want to look at historical data, or aggregate logs onto a central system and analyze them? In Hadoop, logs are served over HTTP for a single node by default.

Some of the issues stated above have been addressed in the YARN framework. Rather than truncating logs on individual nodes, the logs can be moved to HDFS and processed using other tools. The logs are written at the per-application level into per-application directories. The user can access these logs through the command line or the web UI, for example, with $HADOOP_YARN_HOME/bin/yarn logs.
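How this log aggregation is switched on is not shown in the article; as a hedged sketch, the standard YARN properties below (set in yarn-site.xml) enable it, with an illustrative HDFS path and retention period rather than values taken from the book:

  <property>
    <name>yarn.log-aggregation-enable</name>
    <value>true</value>
  </property>
  <property>
    <!-- HDFS directory for the aggregated container logs (illustrative path) -->
    <name>yarn.nodemanager.remote-app-log-dir</name>
    <value>/app-logs</value>
  </property>
  <property>
    <!-- keep aggregated logs for 7 days (illustrative value) -->
    <name>yarn.log-aggregation.retain-seconds</name>
    <value>604800</value>
  </property>

The aggregated logs of a finished application can then be fetched from the command line with yarn logs -applicationId <application ID>.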
Hadoop metrics

In Hadoop, there are many daemons running, such as the DataNode, NameNode, and JobTracker, and each of these daemons captures a lot of information about the components it works on. Similarly, in the YARN framework we have the ResourceManager, NodeManager, and ApplicationMaster, each of which exposes metrics, explained in the following sections under Metrics2. For example, the DataNode collects metrics such as the number of blocks it has (for advertising to the NameNode), the number of replicated blocks, and metrics about the various reads and writes from clients. In addition to this, there can be metrics related to events, and so on. Hence, it is very important to gather these metrics for the smooth working of the Hadoop cluster; they also help in debugging if something goes wrong.

For this, Hadoop has a metrics system for collecting all this information. There are two versions of the metrics system: Metrics and Metrics2, for Hadoop 1.x and Hadoop 2.x respectively. They are configured through the files hadoop-metrics.properties and hadoop-metrics2.properties for the respective Hadoop versions.

Configuring Metrics2

For Hadoop version 2, which uses the YARN framework, the metrics can be configured using hadoop-metrics2.properties, under the $HADOOP_HOME directory:

  *.sink.file.class=org.apache.hadoop.metrics2.sink.FileSink
  *.period=10
  namenode.sink.file.filename=namenode-metrics.out
  datanode.sink.file.filename=datanode-metrics.out
  jobtracker.sink.file.filename=jobtracker-metrics.out
  tasktracker.sink.file.filename=tasktracker-metrics.out
  maptask.sink.file.filename=maptask-metrics.out
  reducetask.sink.file.filename=reducetask-metrics.out

Hadoop metrics configuration for Ganglia

Firstly, we need to define a sink class for Ganglia:

  *.sink.ganglia.class=org.apache.hadoop.metrics2.sink.ganglia.GangliaSink31

Secondly, we need to define how often the source should be polled for data. Here, we are polling every 30 seconds:

  *.sink.ganglia.period=30

Then we define the retention for the metrics:

  *.sink.ganglia.dmax=jvm.metrics.threadsBlocked=70,jvm.metrics.memHeapUsedM=40

Summary

In this article, we learned about Hadoop monitoring and its importance, and also covered related concepts such as Nagios, Ganglia, Hadoop logging, and metrics.
Machine Learning

Packt
30 Apr 2015
15 min read
In this article by Alberto Boschetti and Luca Massaron, authors of the book Python Data Science Essentials, you will learn how to deal with big data and get an overview of Stochastic Gradient Descent (SGD).

Dealing with big data

Big data challenges data science projects from four points of view: volume (data quantity), velocity, variety, and veracity. The Scikit-learn package offers a range of classes and functions that will help you work effectively with data so big that it cannot entirely fit in the memory of your computer, no matter what its specifications may be. Before providing you with an overview of big data solutions, we have to create or import some datasets in order to give you a better idea of the scalability and performance of different algorithms. This will require about 1.5 gigabytes of your hard disk, which will be freed after the experiment.

Creating some big datasets as examples

As a typical example of big data analysis, we will use some textual data from the Internet and take advantage of the available fetch_20newsgroups dataset, which contains 11,314 posts, each one averaging about 206 words, that appeared in 20 different newsgroups:

  In:
  import numpy as np
  from sklearn.datasets import fetch_20newsgroups

  newsgroups_dataset = fetch_20newsgroups(shuffle=True,
      remove=('headers', 'footers', 'quotes'), random_state=6)
  print 'Posts inside the data: %s' % np.shape(newsgroups_dataset.data)
  print 'Average number of words for post: %0.0f' % np.mean(
      [len(text.split(' ')) for text in newsgroups_dataset.data])

  Out:
  Posts inside the data: 11314
  Average number of words for post: 206

Instead, to work out a generic classification example, we will create three synthetic datasets that contain from 100,000 up to 10 million cases. You can create and use any of them according to your computer's resources. We will always refer to the largest one for our experiments:

  In:
  from sklearn.datasets import make_classification

  X, y = make_classification(n_samples=10**5, n_features=5,
      n_informative=3, random_state=101)
  D = np.c_[y, X]
  np.savetxt('huge_dataset_10__5.csv', D, delimiter=",")
  # the saved file should be around 14.6 MB
  del(D, X, y)

  X, y = make_classification(n_samples=10**6, n_features=5,
      n_informative=3, random_state=101)
  D = np.c_[y, X]
  np.savetxt('huge_dataset_10__6.csv', D, delimiter=",")
  # the saved file should be around 146 MB
  del(D, X, y)

  X, y = make_classification(n_samples=10**7, n_features=5,
      n_informative=3, random_state=101)
  D = np.c_[y, X]
  np.savetxt('huge_dataset_10__7.csv', D, delimiter=",")
  # the saved file should be around 1.46 GB
  del(D, X, y)

After creating and using any of the datasets, you can remove them with the following commands:

  import os
  os.remove('huge_dataset_10__5.csv')
  os.remove('huge_dataset_10__6.csv')
  os.remove('huge_dataset_10__7.csv')

Scalability with volume

The trick to managing high volumes of data without loading too many megabytes or gigabytes into memory is to incrementally update the parameters of your algorithm until all the observations have been processed at least once by the machine learner. This is possible in Scikit-learn thanks to the .partial_fit() method, which is available for a certain number of supervised and unsupervised algorithms.
Using the .partial_fit() method and providing some basic information (for example, for classification, you should know beforehand the number of classes to be predicted), you can immediately start fitting your model, even if you have only a single observation or a few. This approach is called incremental learning. The chunks of data that you incrementally feed into the learning algorithm are called batches.

The critical points of incremental learning are as follows:

- Batch size
- Data preprocessing
- Number of passes with the same examples
- Validation and parameter fine-tuning

Batch size generally depends on your available memory. The principle is that the larger the data chunks, the better, since the data sample becomes more representative of the data distribution as its size grows.

Data preprocessing is also challenging. Incremental learning algorithms work well with data in the range of [-1,+1] or [0,+1] (for instance, Multinomial Naive Bayes won't accept negative values). However, to scale into such a precise range, you need to know the range of each variable beforehand. Alternatively, you have to either pass over all the data once and record the minimum and maximum values, or derive them from the first batch, trimming any following observations that exceed the initial maximum and minimum values.

The number of passes can become a problem. As you pass the same examples multiple times, you help the predictive coefficients converge to an optimum solution. However, if you pass too many of the same observations, the algorithm will tend to overfit; that is, it will adapt too much to the data repeated too many times. Some algorithms, like the SGD family, are also very sensitive to the order in which the examples are presented for learning. Therefore, you have to either set their shuffle option (shuffle=True) or shuffle the file rows before learning starts, keeping in mind that, for efficacy, the order of the rows proposed for learning should be random.

Validation is not easy either, especially if you have to validate against unseen chunks, validate in a progressive way, or hold out some observations from every chunk. The latter is also the best way to reserve a sample for grid search or some other optimization.

In our example, we entrust SGDClassifier with a log loss (basically a logistic regression) to learn how to predict a binary outcome given 10**7 observations:

  In:
  from sklearn.linear_model import SGDClassifier
  from sklearn.preprocessing import MinMaxScaler
  import pandas as pd
  import numpy as np

  streaming = pd.read_csv('huge_dataset_10__7.csv', header=None, chunksize=10000)
  learner = SGDClassifier(loss='log')
  minmax_scaler = MinMaxScaler(feature_range=(0, 1))
  cumulative_accuracy = list()

  for n, chunk in enumerate(streaming):
      if n == 0:
          minmax_scaler.fit(chunk.ix[:, 1:].values)
      X = minmax_scaler.transform(chunk.ix[:, 1:].values)
      X[X > 1] = 1
      X[X < 0] = 0
      y = chunk.ix[:, 0]
      if n > 8:
          cumulative_accuracy.append(learner.score(X, y))
      learner.partial_fit(X, y, classes=np.unique(y))

  print 'Progressive validation mean accuracy %0.3f' % np.mean(cumulative_accuracy)

  Out:
  Progressive validation mean accuracy 0.660

First, pandas' read_csv allows us to iterate over the file by reading batches of 10,000 observations (the number can be increased or decreased according to your computing resources). We use the MinMaxScaler to record the range of each variable on the first batch.
For the following batches, we apply the rule that if a value exceeds one of the limits of [0,+1], it is trimmed to the nearest limit. Eventually, starting from the 10th batch, we record the accuracy of the learning algorithm on each newly received batch before using it to update the training. In the end, the accumulated accuracy scores are averaged, offering a global performance estimate.

Keeping up with velocity

There are various algorithms that work with incremental learning. For classification, we will recall the following:

- sklearn.naive_bayes.MultinomialNB
- sklearn.naive_bayes.BernoulliNB
- sklearn.linear_model.Perceptron
- sklearn.linear_model.SGDClassifier
- sklearn.linear_model.PassiveAggressiveClassifier

For regression, we will recall the following:

- sklearn.linear_model.SGDRegressor
- sklearn.linear_model.PassiveAggressiveRegressor

As for velocity, they are all comparable in speed. You can try the following script for yourself:

  In:
  from sklearn.naive_bayes import MultinomialNB
  from sklearn.naive_bayes import BernoulliNB
  from sklearn.linear_model import Perceptron
  from sklearn.linear_model import SGDClassifier
  from sklearn.linear_model import PassiveAggressiveClassifier
  from sklearn.preprocessing import MinMaxScaler
  import pandas as pd
  import numpy as np
  from datetime import datetime

  classifiers = {
      'SGDClassifier hinge loss': SGDClassifier(loss='hinge', random_state=101),
      'SGDClassifier log loss': SGDClassifier(loss='log', random_state=101),
      'Perceptron': Perceptron(random_state=101),
      'BernoulliNB': BernoulliNB(),
      'PassiveAggressiveClassifier': PassiveAggressiveClassifier(random_state=101)
  }
  huge_dataset = 'huge_dataset_10__6.csv'

  for algorithm in classifiers:
      start = datetime.now()
      minmax_scaler = MinMaxScaler(feature_range=(0, 1))
      streaming = pd.read_csv(huge_dataset, header=None, chunksize=100)
      learner = classifiers[algorithm]
      cumulative_accuracy = list()
      for n, chunk in enumerate(streaming):
          y = chunk.ix[:, 0]
          X = chunk.ix[:, 1:]
          if n > 50:
              cumulative_accuracy.append(learner.score(X, y))
          learner.partial_fit(X, y, classes=np.unique(y))
      elapsed_time = datetime.now() - start
      print algorithm + ' : mean accuracy %0.3f in %s secs' % (
          np.mean(cumulative_accuracy), elapsed_time.total_seconds())

  Out:
  BernoulliNB : mean accuracy 0.734 in 41.101 secs
  Perceptron : mean accuracy 0.616 in 37.479 secs
  SGDClassifier hinge loss : mean accuracy 0.712 in 38.43 secs
  SGDClassifier log loss : mean accuracy 0.716 in 39.618 secs
  PassiveAggressiveClassifier : mean accuracy 0.625 in 40.622 secs

As a general note, remember that smaller batches are slower, since they imply more disk access from a database or a file, which is always a bottleneck.

Dealing with variety

Variety is another big data characteristic. This is especially true when we are dealing with textual data or very large categorical variables (for example, variables storing website names in programmatic advertising). As you learn from batches of examples and unfold categories or words, each one becomes an appropriate and exclusive variable. You may find it difficult to handle the challenge of variety and the unpredictability of large streams of data. The Scikit-learn package provides you with a simple and fast way to implement the hashing trick and completely forget about the problem of defining a rigid variable structure in advance. The hashing trick uses hash functions and sparse matrices in order to save your time, resources, and hassle. Hash functions map any input they receive in a deterministic way.
It doesn't matter whether you feed them numbers or strings; they will always return an integer number in a certain range. Sparse matrices, instead, are arrays that record only the values that are not zero, since their default value is zero for any combination of row and column. Therefore, the hashing trick bounds every possible input, even one never seen before, to a certain range or position in a corresponding input sparse matrix, whose cell is loaded with a nonzero value.

For instance, if your input is Python, a hashing command like abs(hash('Python')) can transform it into the integer number 539294296 and then assign the value of 1 to the cell at column index 539294296. The hash function is a very fast and convenient way to always obtain the same column index given the same input. Using only absolute values ensures that each index corresponds to a column in our array (negative indexes just start from the last column; hence in Python, each column of an array can be addressed by both a positive and a negative number).

The example that follows uses the HashingVectorizer class, a convenient class that automatically takes documents, separates the words, and transforms them, thanks to the hashing trick, into an input matrix. The script aims to learn why posts are published in 20 distinct newsgroups on the basis of the words used in the existing posts of those newsgroups:

  In:
  import pandas as pd
  import numpy as np
  from sklearn.linear_model import SGDClassifier
  from sklearn.feature_extraction.text import HashingVectorizer

  def streaming():
      for response, item in zip(newsgroups_dataset.target, newsgroups_dataset.data):
          yield response, item

  hashing_trick = HashingVectorizer(stop_words='english', norm='l2', non_negative=True)
  learner = SGDClassifier(random_state=101)
  texts = list()
  targets = list()

  for n, (target, text) in enumerate(streaming()):
      texts.append(text)
      targets.append(target)
      if n % 1000 == 0 and n > 0:
          learning_chunk = hashing_trick.transform(texts)
          if n > 1000:
              last_validation_score = learner.score(learning_chunk, targets)
          learner.partial_fit(learning_chunk, targets, classes=[k for k in range(20)])
          texts, targets = list(), list()

  print 'Last validation score: %0.3f' % last_validation_score

  Out:
  Last validation score: 0.710

At this point, no matter what text you input, the predictive algorithm will always answer by pointing out a class. In our case, it points out a newsgroup suitable for the post to appear in. Let's try out this algorithm with a text taken from a classified ad:

  In:
  New_text = ['A 2014 red Toyota Prius v Five with fewer than 14K miles. Powered by a reliable 1.8L four cylinder hybrid engine that averages 44mpg in the city and 40mpg on the highway.']
  text_vector = hashing_trick.transform(New_text)
  print np.shape(text_vector), type(text_vector)
  print 'Predicted newsgroup: %s' % newsgroups_dataset.target_names[learner.predict(text_vector)[0]]

  Out:
  (1, 1048576) <class 'scipy.sparse.csr.csr_matrix'>
  Predicted newsgroup: rec.autos

Naturally, you may change the New_text variable and discover where your text would most likely be displayed in a newsgroup. Note that the HashingVectorizer class has transformed the text into a csr_matrix (which is quite an efficient sparse matrix) with about 1 million columns.
A quick overview of Stochastic Gradient Descent (SGD)

We will close this article on big data with a quick overview of the SGD family, comprising SGDClassifier (for classification) and SGDRegressor (for regression). Like other classifiers, they can be fit by using the .fit() method (passing the in-memory dataset row by row to the learning algorithm) or the previously seen .partial_fit() method based on batches. In the latter case, if you are classifying, you have to declare the predicted classes with the classes parameter. It accepts a list containing all the class codes that it should expect to meet during the training phase.

SGDClassifier behaves like a logistic regression when the loss parameter is set to log. It turns into a linear SVC if the loss is set to hinge. It can also take the form of other loss functions, or even of loss functions intended for regression. SGDRegressor mimics a linear regression using the squared_loss loss parameter. The huber loss, instead, transforms the squared loss into a linear loss beyond a certain distance epsilon (another parameter to be fixed). It can also act as a linear SVR using the epsilon_insensitive loss function, or the slightly different squared_epsilon_insensitive (which penalizes outliers more). As in other situations in machine learning, the performance of the different loss functions on your data science problem cannot be estimated a priori. Anyway, please take into account that if you are doing classification and you need an estimation of class probabilities, your choice is limited to log or modified_huber only.

Key parameters that require tuning for this algorithm to work best with your data are:

- n_iter: The number of iterations over the data; the more the passes, the better the optimization of the algorithm. However, there is a higher risk of overfitting. Empirically, SGD tends to converge to a stable solution after having seen 10**6 examples. Given your examples, set your number of iterations accordingly.
- penalty: You have to choose l1, l2, or elasticnet, which are all different regularization strategies, in order to avoid overfitting because of overparametrization (using too many unnecessary parameters leads to memorizing observations rather than learning patterns). Briefly, l1 tends to reduce unhelpful coefficients to zero, l2 just attenuates them, and elasticnet is a mix of the l1 and l2 strategies.
- alpha: This is a multiplier of the regularization term; the higher the alpha, the stronger the regularization. We advise you to find the best alpha value by performing a grid search ranging from 10**-7 to 10**-1 (see the sketch after this list).
- l1_ratio: The l1 ratio is used for the elastic net penalty.
- learning_rate: This sets how much the coefficients are affected by every single example. Usually, it is 'optimal' for classifiers and 'invscaling' for regression. If you want to use invscaling for classification, you'll have to set eta0 and power_t (invscaling = eta0 / (t**power_t)). With invscaling, you can start with a lower learning rate, which is less than the optimal rate, though it will decrease more slowly.
- epsilon: This should be used if your loss is huber, epsilon_insensitive, or squared_epsilon_insensitive.
- shuffle: If this is True, the algorithm will shuffle the order of the training data in order to improve the generalization of the learning.
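The following is a minimal sketch of such an alpha grid search on the smallest synthetic dataset created earlier; the parameter grid, the 3-fold cross-validation, and the use of the in-memory 100,000-row file are our own illustrative choices rather than the book's code (and on recent scikit-learn versions, GridSearchCV lives in sklearn.model_selection and n_iter has been renamed max_iter):

  In:
  import numpy as np
  import pandas as pd
  from sklearn.linear_model import SGDClassifier
  from sklearn.grid_search import GridSearchCV  # sklearn.model_selection in newer versions

  # load the smallest synthetic dataset entirely in memory for the search
  D = pd.read_csv('huge_dataset_10__5.csv', header=None).values
  y, X = D[:, 0], D[:, 1:]

  # explore alpha from 10**-7 to 10**-1, as suggested above
  param_grid = {'alpha': 10.0 ** -np.arange(1, 8)}
  search = GridSearchCV(
      SGDClassifier(loss='log', shuffle=True, random_state=101),
      param_grid, cv=3)
  search.fit(X, y)
  print 'Best alpha: %s' % search.best_params_['alpha']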
Summary

In this article, we introduced the essentials of machine learning for big data. We discussed creating some big data examples, scalability with volume, keeping up with velocity, dealing with variety, and a more advanced technique, SGD.
Algorithmic Trading

Packt
29 Apr 2015
16 min read
In this article by James Ma Weiming, author of the book Mastering Python for Finance, we will see how algorithmic trading automates the systematic trading process, where orders are executed at the best price possible based on a variety of factors, such as pricing, timing, and volume. Some brokerage firms may offer an application programming interface (API) as part of their service offering to customers who wish to deploy their own trading algorithms. An algorithmic trading system must be highly robust and handle any point of failure during order execution. Network configuration, hardware, memory management and speed, and user experience are some factors to be considered when designing a system for executing orders. Designing larger systems inevitably adds complexity to the framework.

As soon as a position in a market is opened, it is subject to various types of risk, such as market risk. To preserve the trading capital as much as possible, it is important to incorporate risk management measures into the trading system. Perhaps the most common risk measure used in the financial industry is the value-at-risk (VaR) technique. We will discuss the beauty and the flaws of VaR, and how it can be incorporated into the trading system that we will develop in this article.

In this article, we will cover the following topics:

- An overview of algorithmic trading
- A list of brokers and system vendors with a public API
- Choosing a programming language for a trading system
- Setting up API access on the Interactive Brokers (IB) trading platform
- Using the IbPy module to interact with the IB Trader WorkStation (TWS)

Introduction to algorithmic trading

In the 1990s, exchanges had already begun to use electronic trading systems. By 1997, 44 exchanges worldwide used automated systems for trading futures and options, with more exchanges in the process of developing automated technology. Exchanges such as the Chicago Board of Trade (CBOT) and the London International Financial Futures and Options Exchange (LIFFE) used their electronic trading systems as an after-hours complement to traditional open outcry trading in the pits, giving traders 24-hour access to the exchange's risk management tools. With improvements in technology, technology-based trading became less expensive, fueling the growth of trading platforms that are faster and more powerful. Higher reliability of order execution and lower rates of message transmission error deepened financial institutions' reliance on technology. The majority of asset managers, proprietary traders, and market makers have since moved from the trading pits to electronic trading floors.

As systematic or computerized trading became more commonplace, speed became the most important factor in determining the outcome of a trade. Quants utilizing sophisticated fundamental models are able to recompute fair values of trading products on the fly and execute trading decisions, enabling them to reap profits at the expense of fundamental traders using traditional tools. This gave rise to the term high-frequency trading (HFT), which relies on fast computers to execute trading decisions before anyone else can. HFT has evolved into a billion-dollar industry.

Algorithmic trading refers to the automation of the systematic trading process, where order execution is heavily optimized to give the best price possible. It is not part of the portfolio allocation process.
Banks, hedge funds, brokerage firms, clearing firms, and trading firms typically have their servers placed right next to the electronic exchange to receive the latest market prices and to perform the fastest order execution possible. They bring enormous trading volumes to the exchange. Anyone who wishes to participate in low-latency, high-volume trading activities, such as complex event processing or capturing fleeting price discrepancies, by acquiring exchange connectivity may do so in the form of co-location, where his or her server hardware can be placed on a rack right next to the exchange for a fee. The Financial Information Exchange (FIX) protocol is the industry standard for electronic communication with the exchange from a private server for direct market access (DMA) to real-time information. C++ is the common choice of programming language for trading over the FIX protocol, though other languages, such as the .NET Framework common languages and Java, can be used. Before creating an algorithmic trading platform, you would need to assess various factors, such as speed and ease of learning, before deciding on a specific language for the purpose.

Brokerage firms provide a trading platform of some sort to their customers so that they can execute orders on selected exchanges, in return for commission fees. Some brokerage firms may offer an API as part of their service offering to technically inclined customers who wish to run their own trading algorithms. In most circumstances, customers may also choose from a number of commercial trading platforms offered by third-party vendors. Some of these trading platforms may also offer API access to route orders electronically to the exchange. It is important to read the API documentation beforehand to understand the technical capabilities offered by your broker and to formulate an approach for developing an algorithmic trading system.

List of trading platforms with a public API

The following list shows some brokers and trading platform vendors who have their API documentation publicly available, together with the programming languages they support:

- Interactive Brokers (https://www.interactivebrokers.com/en/index.php?f=1325): C++, Posix C++, Java, and Visual Basic for ActiveX
- E*Trade (https://developer.etrade.com): Java, PHP, and C++
- IG (http://labs.ig.com/): REST, Java, FIX, and Microsoft .NET Framework 4.0
- Tradier (https://developer.tradier.com): Java, Perl, Python, and Ruby
- TradeKing (https://developers.tradeking.com): Java, Node.js, PHP, R, and Ruby
- Cunningham Trading Systems (http://www.ctsfutures.com/wiki/T4%20API%2040.MainPage.ashx): Microsoft .NET Framework 4.0
- CQG (http://cqg.com/Products/CQG-API.aspx): C#, C++, Excel, MATLAB, and VB.NET
- Trading Technologies (https://developer.tradingtechnologies.com): Microsoft .NET Framework 4.0
- OANDA (http://developer.oanda.com): REST, Java, FIX, and MT4

Which is the best programming language to use?

With many choices of programming languages available to interface with brokers or vendors, the question that comes naturally to anyone starting out in algorithmic trading platform development is: which language should I use? The short answer is that there is no single best programming language. How your product will be developed, the performance metrics to follow, the costs involved, the latency threshold, the risk measures, and the expected user interface are all pieces of the puzzle to be taken into consideration.
The risk manager, execution engine, and portfolio optimizer are some of the major components that will affect the design of your system. Your existing trading infrastructure, choice of operating system, programming language compiler capability, and available software tools pose further constraints on the system's design, development, and deployment.

System functionalities

It is important to define the outcomes of your trading system. One outcome could be a research-based system that is more concerned with obtaining high-quality data from data vendors, performing computations or running models, and evaluating a strategy through signal generation. Part of the research component might include a data-cleaning module or a backtesting interface to run a strategy with theoretical parameters over historical data. CPU speed, memory size, and bandwidth are factors to be considered while designing such a system.

Another outcome could be an execution-based system that is more concerned with risk management and order handling features to ensure the timely execution of multiple orders. The system must be highly robust and handle any point of failure during order execution. As such, network configuration, hardware, memory management and speed, and user experience are some factors to be considered when designing a system for executing orders. A system may contain one or more of these functionalities. Designing larger systems inevitably adds complexity to the framework. It is recommended that you choose one or more programming languages that can address and balance the development speed, ease of development, scalability, and reliability of your trading system.

Algorithmic trading with Interactive Brokers and IbPy

In this section, we will build a working algorithmic trading platform that will authenticate with Interactive Brokers (IB), log in, retrieve market data, and send orders. IB is one of the most popular brokers in the trading community and has a long history of API development, and there are plenty of articles on the use of its API available on the Web. IB serves clients ranging from hedge funds to retail traders. Although the API does not support Python directly, Python wrappers such as IbPy are available to make the API calls to the IB interface. The IB API is unique to its own implementation, and every broker has its own API handling methods. Nevertheless, the documents and sample applications provided by your broker demonstrate the core functionality of every API interface, which can be easily integrated into an algorithmic trading system if designed properly.

Getting Interactive Brokers' Trader WorkStation

The official page for IB is https://www.interactivebrokers.com. Here, you can find a wealth of information regarding trading and investing for retail and institutional traders. In this section, we will take a look at how to get the Trader WorkStation X (TWS) installed and running on your local workstation before setting up an algorithmic trading system using Python. Note that we will perform simulated trading on a demonstration account. If your trading strategy turns out to be profitable, head to the OPEN AN ACCOUNT section of the IB website to open a live trading account. Rules, regulations, market data fees, exchange fees, commissions, and other conditions depend on the broker of your choice. In addition, market conditions are vastly different from the simulated environment.
You are encouraged to perform extensive testing on your algorithmic trading system before running it on live markets. The following key steps describe how to install TWS on your local workstation, log in to the demonstration account, and set it up for API use:

1. From IB's official website, navigate to TRADING, and then select Standalone TWS. Choose the installation executable that is suitable for your local workstation. TWS runs on Java; therefore, ensure that the Java runtime plugin is already installed on your local workstation.
2. When prompted during the installation process, choose the Trader_WorkStation_X and IB Gateway options. The Trader WorkStation X (TWS) is the trading platform with full order management functionality. The IB Gateway program accepts and processes the API connections without any of the order management features of TWS. We will not cover the use of the IB Gateway, but you may find it useful later.
3. Select the destination directory on your local workstation where TWS will place all the required files.
4. When the installation is completed, a TWS shortcut icon will appear together with your list of installed applications. Double-click on the icon to start the TWS program.
5. When TWS starts, you will be prompted to enter your login credentials. To log in to the demonstration account, type edemo in the username field and demouser in the password field.
6. Once we have managed to load our demo account on TWS, we can set up its API functionality. On the toolbar, click on Configure.
7. Under the Configuration tree, open the API node to reveal further options and select Settings. Note that the Socket port is 7496, and we add the IP address of the workstation housing our algorithmic trading system to the list of trusted IP addresses, which in this case is 127.0.0.1. Ensure that the Enable ActiveX and Socket Clients option is selected to allow socket connections to TWS.
8. Click on OK to save all the changes. TWS is now ready to accept orders and market data requests from our algorithmic trading system.

Getting IbPy – the IB API wrapper

IbPy is an add-on module for Python that wraps the IB API. It is open source and can be found at https://github.com/blampe/IbPy. Head to this URL and download the source files. Unzip the source folder, use the Terminal to navigate to this directory, and type python setup.py install to install IbPy as part of the Python runtime environment. The use of IbPy is similar to the API calls documented on the IB website. The documentation for IbPy is at https://code.google.com/p/ibpy/w/list.

A simple order routing mechanism

In this section, we will start interacting with TWS using Python by establishing a connection and sending out a market order to the exchange. Once IbPy is installed, import the following necessary modules into our Python script:

  from ib.ext.Contract import Contract
  from ib.ext.Order import Order
  from ib.opt import Connection

Next, implement the logging functions to handle calls from the server. The error_handler method is invoked whenever the API encounters an error, which is accompanied by a message. The server_handler method is dedicated to handling all the other forms of returned API messages. The msg variable is an ib.opt.message object and references the method calls defined by the IB API EWrapper methods. The API documentation can be accessed at https://www.interactivebrokers.com/en/software/api/api.htm.
The following is the Python code for the error_handler and server_handler methods:

  def error_handler(msg):
      print "Server Error:", msg

  def server_handler(msg):
      print "Server Msg:", msg.typeName, "-", msg

We will place a sample order for the stock AAPL. The contract specifications of the order are defined by the Contract class object found in the ib.ext.Contract module. We will create a method called create_contract that returns a new instance of this object:

  def create_contract(symbol, sec_type, exch, prim_exch, curr):
      contract = Contract()
      contract.m_symbol = symbol
      contract.m_secType = sec_type
      contract.m_exchange = exch
      contract.m_primaryExch = prim_exch
      contract.m_currency = curr
      return contract

The Order class object is used to place an order with TWS. Let's define a method called create_order that will return a new instance of the object:

  def create_order(order_type, quantity, action):
      order = Order()
      order.m_orderType = order_type
      order.m_totalQuantity = quantity
      order.m_action = action
      return order

After the required methods are created, we can begin to script the main functionality. Let's initialize the required variables:

  if __name__ == "__main__":
      client_id = 100
      order_id = 1
      port = 7496
      tws_conn = None

Note that the client_id variable is our assigned integer that identifies the instance of the client communicating with TWS. The order_id variable is our assigned integer that identifies the order queue number sent to TWS; each new order requires this value to be incremented sequentially. The port number has the same value as defined in the API settings of TWS earlier. The tws_conn variable holds the connection to TWS; let's initialize it with an empty value for now.

Let's use a try block that encapsulates the Connection.create method to handle the socket connections to TWS in a graceful manner:

  try:
      # Establish connection to TWS.
      tws_conn = Connection.create(port=port, clientId=client_id)
      tws_conn.connect()

      # Assign error handling function.
      tws_conn.register(error_handler, 'Error')

      # Assign server messages handling function.
      tws_conn.registerAll(server_handler)
  finally:
      # Disconnect from TWS
      if tws_conn is not None:
          tws_conn.disconnect()

The port and clientId parameter fields define this connection. After the connection instance is created, the connect method will try to connect to TWS. When the connection to TWS has successfully opened, it is time to register listeners to receive notifications from the server. The register method associates a function handler with a particular event, while the registerAll method associates a handler with all the messages generated. This is where the error_handler and server_handler methods declared earlier are used.

Before sending our very first order of 100 shares of AAPL to the exchange, we will call the create_contract method to create a new contract object for AAPL. Then, we will call the create_order method to create a new Order object to go long 100 shares. Finally, we will call the placeOrder method of the Connection class to send this order to TWS:

  # Create a contract for AAPL stock using SMART order routing.
  aapl_contract = create_contract('AAPL', 'STK', 'SMART', 'SMART', 'USD')

  # Go long 100 shares of AAPL.
  aapl_order = create_order('MKT', 100, 'BUY')

  # Place order on IB TWS.
  tws_conn.placeOrder(order_id, aapl_contract, aapl_order)

That's it! Let's run our Python script.
We should get output similar to the following:

  Server Error: <error id=-1, errorCode=2104, errorMsg=Market data farm connection is OK:ibdemo>
  Server Response: error, <error id=-1, errorCode=2104, errorMsg=Market data farm connection is OK:ibdemo>
  Server Version: 75
  TWS Time at connection:20141210 23:14:17 CST
  Server Msg: managedAccounts - <managedAccounts accountsList=DU15200>
  Server Msg: nextValidId - <nextValidId orderId=1>
  Server Error: <error id=-1, errorCode=2104, errorMsg=Market data farm connection is OK:ibdemo>
  Server Msg: error - <error id=-1, errorCode=2104, errorMsg=Market data farm connection is OK:ibdemo>
  Server Error: <error id=-1, errorCode=2107, errorMsg=HMDS data farm connection is inactive but should be available upon demand.demohmds>
  Server Msg: error - <error id=-1, errorCode=2107, errorMsg=HMDS data farm connection is inactive but should be available upon demand.demohmds>

Basically, what these messages say is that there are no errors and the connections are OK. Should the simulated order be executed successfully during market trading hours, the trade will be reflected in TWS.

The full source code of our implementation is given as follows:

  """ A Simple Order Routing Mechanism """
  from ib.ext.Contract import Contract
  from ib.ext.Order import Order
  from ib.opt import Connection

  def error_handler(msg):
      print "Server Error:", msg

  def server_handler(msg):
      print "Server Msg:", msg.typeName, "-", msg

  def create_contract(symbol, sec_type, exch, prim_exch, curr):
      contract = Contract()
      contract.m_symbol = symbol
      contract.m_secType = sec_type
      contract.m_exchange = exch
      contract.m_primaryExch = prim_exch
      contract.m_currency = curr
      return contract

  def create_order(order_type, quantity, action):
      order = Order()
      order.m_orderType = order_type
      order.m_totalQuantity = quantity
      order.m_action = action
      return order

  if __name__ == "__main__":
      client_id = 1
      order_id = 119
      port = 7496
      tws_conn = None
      try:
          # Establish connection to TWS.
          tws_conn = Connection.create(port=port, clientId=client_id)
          tws_conn.connect()

          # Assign error handling function.
          tws_conn.register(error_handler, 'Error')

          # Assign server messages handling function.
          tws_conn.registerAll(server_handler)

          # Create AAPL contract and send order.
          aapl_contract = create_contract('AAPL', 'STK', 'SMART', 'SMART', 'USD')

          # Go long 100 shares of AAPL.
          aapl_order = create_order('MKT', 100, 'BUY')

          # Place order on IB TWS.
          tws_conn.placeOrder(order_id, aapl_contract, aapl_order)
      finally:
          # Disconnect from TWS.
          if tws_conn is not None:
              tws_conn.disconnect()

Summary

In this article, we were introduced to the evolution of trading from the pits to the electronic trading platform, and learned how algorithmic trading came about. We looked at some brokers offering API access to their trading services. To help us get started on our journey in developing an algorithmic trading system, we used the TWS of IB and the IbPy Python module. In our first trading program, we successfully sent an order to our broker through the TWS API using a demonstration account.
Apache Solr and Big Data – integration with MongoDB

Packt
27 Apr 2015
9 min read
In this article by Hrishikesh Vijay Karambelkar, author of the book Scaling Big Data with Hadoop and Solr - Second Edition, we will look at Apache Solr and MongoDB together. In an enterprise, data is generated by all the software that participates in day-to-day operations. This data has different formats, and bringing it in for big data processing requires a storage system that is flexible enough to accommodate data with varying data models. A NoSQL database, by its design, is best suited for this kind of storage requirement. One of the primary objectives of NoSQL is horizontal scaling, that is, the P in the CAP theorem, but this comes at the cost of sacrificing Consistency or Availability. Visit http://en.wikipedia.org/wiki/CAP_theorem to understand more about the CAP theorem.

What is NoSQL and how is it related to Big Data?

As we have seen, data models for NoSQL differ completely from those of a relational database. With the flexible data model, it becomes very easy for developers to quickly integrate with the NoSQL database and bring in large volumes of data from different data sources. This makes the NoSQL database ideal for big data storage, since big data demands that different data types be brought together under one umbrella. NoSQL also offers different data models, such as the key-value (KV) store, the document store, and Big Table storage.

In addition to a flexible schema, NoSQL offers scalability and high performance, which is again one of the most important factors to be considered while working with big data. NoSQL was designed to be a distributed type of database. While traditional relational stores rely on the high computing power of CPUs and large memory on a centralized system, NoSQL can run on low-cost, commodity hardware. These servers can be added to or removed from the cluster running NoSQL dynamically, making the NoSQL database easier to scale. NoSQL enables the most advanced features of a database, such as data partitioning, index sharding, distributed queries, caching, and so on.

Although NoSQL offers optimized storage for big data, it may not be able to replace the relational database. Relational databases offer transactions (ACID), high CRUD throughput, data integrity, and a structured database design approach, which are required in many applications and which NoSQL may not support. Hence, NoSQL is most suited for big data, where there is less need for the data to be transactional.

MongoDB at a glance

MongoDB is one of the popular NoSQL databases (just like Cassandra). MongoDB supports storing arbitrary schemas in its own document-oriented storage and uses a JSON-based information pipe for any communication with the server. This database is designed to work with heavy data. Today, many organizations are focusing on utilizing MongoDB for various enterprise applications.

MongoDB provides high availability and load balancing. Each data unit is replicated, and the combination of a data unit with its copies is called a replica set. Replicas in MongoDB can be either primary or secondary. The primary is the active replica, which is used for direct read-write operations, while a secondary replica works like a backup for the primary. MongoDB supports searches by field, range queries, and regular expression searches. Queries can return specific fields of documents and can also include user-defined JavaScript functions. Any field in a MongoDB document can be indexed. More information about MongoDB can be read at https://www.mongodb.org/.
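To make these query capabilities concrete, here is a minimal mongo shell sketch; the logs collection and its fields are hypothetical and used only for illustration, not taken from the article:

  // query by field
  db.logs.find({ state: "AL" })

  // range query on a numeric field
  db.logs.find({ visitors: { $gt: 50 } })

  // regular expression search on a string field
  db.logs.find({ city: /^AC/ })

  // return only specific fields (projection)
  db.logs.find({ state: "AL" }, { city: 1, visitors: 1 })

  // index any field to speed up these queries (MongoDB 2.x syntax)
  db.logs.ensureIndex({ city: 1 })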
The data in MongoDB is eventually consistent. Apache Solr can be used with MongoDB to enable database searching capabilities on a MongoDB-based data store. Unlike Cassandra, where the Solr indexes are stored directly in Cassandra through Solandra, MongoDB's integration with Solr brings the indexes into Solr's own optimized storage. There are various ways in which the data residing in MongoDB can be analyzed and searched.

MongoDB's replication works by recording all operations made on a database in a log file, called the oplog (operation log). Mongo's oplog keeps a rolling record of all operations that modify the data stored in your databases. Many implementers suggest reading this log file using a standard file I/O program and pushing the data directly to Apache Solr using cURL or SolrJ. Since the oplog is a capped collection with an upper limit on storage, it is feasible to sync such queries with Apache Solr. The oplog also provides tailable cursors on the database. These cursors can provide a natural order for the documents loaded into MongoDB, thereby preserving their order. However, we are going to look at a different approach.

In this approach, MongoDB is exposed as a database to Apache Solr through a custom database driver. Apache Solr reads MongoDB data through the DataImportHandler, which in turn calls the JDBC-based MongoDB driver to connect to MongoDB and run the data import utilities. Since MongoDB supports replica sets, it manages the distribution of data across nodes. It also supports sharding, just like Apache Solr.

Installing MongoDB

To install MongoDB in your development environment, follow these steps:

1. Download the latest version of MongoDB from https://www.mongodb.org/downloads for your operating system.
2. Unzip the zipped folder. MongoDB comes with a default set of command-line components and utilities:
   - bin/mongod: The database process
   - bin/mongos: The sharding controller
   - bin/mongo: The database shell (uses interactive JavaScript)
3. Now, create a directory for MongoDB, which it will use for user data creation and management, and run the following command to start the single-node server:

     $ bin/mongod --dbpath <path to your data directory> --rest

   In this case, the --rest parameter enables support for simple REST APIs that can be used for getting the status. Once the server is started, access http://localhost:28017 from your favorite browser, and you should see the administration status page.
4. Now that you have successfully installed MongoDB, try loading a sample dataset from the book by opening a new command-line interface, changing the directory to $MONGODB_HOME, and running the following command. Please note that the database name is solr-test:

     $ bin/mongoimport --db solr-test --collection zips --file "<file-dir>/samples/zips.json"

5. You can see the stored data using the MongoDB-based CLI by running the following set of commands from your shell:

     $ bin/mongo
     MongoDB shell version: 2.4.9
     connecting to: test
     Welcome to the MongoDB shell.
     For interactive help, type "help".
     For more comprehensive documentation, see
            http://docs.mongodb.org/
     Questions? Try the support group
            http://groups.google.com/group/mongodb-user
     > use test
     Switched to db test
     > show dbs
     exampledb       0.203125GB
     local   0.078125GB
     test   0.203125GB
     > db.zips.find({city:"ACMAR"})
     { "city" : "ACMAR", "loc" : [ -86.51557, 33.584132 ], "pop" : 6055, "state" : "AL", "_id" : "35004" }

Congratulations! MongoDB is installed successfully.

Creating Solr indexes from MongoDB

To use MongoDB as a database with Solr, you will need a JDBC driver built for MongoDB. However, the Mongo-JDBC driver has certain limitations, and it does not work with the Apache Solr DataImportHandler. So, I have extended Mongo-JDBC to work with the Solr-based DataImportHandler. The project repository is available at https://github.com/hrishik/solr-mongodb-dih. Let's look at the setup procedure for enabling MongoDB-based Solr integration:

1. You may not require the complete package from the solr-mongodb-dih repository, but just the jar file. This can be downloaded from https://github.com/hrishik/solr-mongodb-dih/tree/master/sample-jar. You will also need the following additional jar files, which are available for download with the book Scaling Big Data with Hadoop and Solr, Second Edition:
   - jsqlparser.jar
   - mongo.jar
2. In your Solr setup, copy these jar files into the library path, that is, the $SOLR_WAR_LOCATION/WEB-INF/lib folder. Alternatively, point your container classpath variable to link them up.
3. Using the simple Java source code DataLoad.java (https://github.com/hrishik/solr-mongodb-dih/blob/master/examples/DataLoad.java), populate the database with a sample schema and tables that you will use to load into Apache Solr.
4. Now create a data source file (data-source-config.xml) as follows:

     <dataConfig>
       <dataSource name="mongod" type="JdbcDataSource"
           driver="com.mongodb.jdbc.MongoDriver"
           url="mongodb://localhost/solr-test"/>
       <document>
         <entity name="nameage" dataSource="mongod"
             query="select name, price from grocery">
           <field column="name" name="name"/>
           <field column="name" name="id"/>
           <!-- other fields -->
         </entity>
       </document>
     </dataConfig>

5. Copy solr-dataimporthandler-*.jar from your contrib directory to the container/application library path.
6. Modify $SOLR_COLLECTION_ROOT/conf/solr-config.xml with the DIH entry:

     <!-- DIH starts -->
     <requestHandler name="/dataimport"
         class="org.apache.solr.handler.dataimport.DataImportHandler">
       <lst name="defaults">
         <str name="config"><path to config>/data-source-config.xml</str>
       </lst>
     </requestHandler>
     <!-- DIH ends -->

Once this configuration is done, you are ready to test it. Access http://localhost:8983/solr/dataimport?command=full-import from your browser to run the full import on Apache Solr. You will see that your import handler has run successfully and has loaded the data into the Solr store. You can validate the content created by your new MongoDB DIH by accessing the Solr Admin page and running a query.

Using this connector, you can perform full-import operations on various data elements. Since MongoDB is not a relational database, it does not support join queries. However, it supports selects, order by, and so on.

Summary

In this article, we looked at Apache Solr and MongoDB together and covered the distributed aspects of an enterprise search.
Integrating a D3.js visualization into a simple AngularJS application

Packt
27 Apr 2015
19 min read
In this article by Christoph Körner, author of the book Data Visualization with D3 and AngularJS, we will apply the acquired knowledge to integrate a D3.js visualization into a simple AngularJS application. First, we will set up an AngularJS template that serves as a boilerplate for the examples and the application. We will see a typical directory structure for an AngularJS project and initialize a controller. Similar to the previous example, the controller will generate random data that we want to display in an autoupdating chart. Next, we will wrap D3.js in a factory and create a directive for the visualization. You will learn how to isolate the components from each other. We will create a simple AngularJS directive and write a custom compile function to create and update the chart. (For more resources related to this topic, see here.) Setting up an AngularJS application To get started with this article, I assume that you feel comfortable with the main concepts of AngularJS: the application structure, controllers, directives, services, dependency injection, and scopes. I will use these concepts without introducing them in great detail, so if you do not know about one of these topics, first try an intermediate AngularJS tutorial. Organizing the directory To begin with, we will create a simple AngularJS boilerplate for the examples and the visualization application. We will use this boilerplate during the development of the sample application. Let's create a project root directory that contains the following files and folders: bower_components/: This directory contains all third-party components src/: This directory contains all source files src/app.js: This file contains source of the application src/app.css: CSS layout of the application test/: This directory contains all test files (test/config/ contains all test configurations, test/spec/ contains all unit tests, and test/e2e/ contains all integration tests) index.html: This is the starting point of the application Installing AngularJS In this article, we use the AngularJS version 1.3.14, but different patch versions (~1.3.0) should also work fine with the examples. Let's first install AngularJS with the Bower package manager. Therefore, we execute the following command in the root directory of the project: bower install angular#1.3.14 Now, AngularJS is downloaded and installed to the bower_components/ directory. If you don't want to use Bower, you can also simply download the source files from the AngularJS website and put them in a libs/ directory. Note that—if you develop large AngularJS applications—you most likely want to create a separate bower.json file and keep track of all your third-party dependencies. Bootstrapping the index file We can move on to the next step and code the index.html file that serves as a starting point for the application and all examples of this section. We need to include the JavaScript application files and the corresponding CSS layouts, the same for the chart component. Then, we need to initialize AngularJS by placing an ng-app attribute to the html tag; this will create the root scope of the application. 
Here, we will call the AngularJS application myApp, as shown in the following code: <html ng-app="myApp"> <head>    <!-- Include 3rd party libraries -->    <script src="bower_components/d3/d3.js" charset="UTF-   8"></script>    <script src="bower_components/angular/angular.js"     charset="UTF-8"></script>      <!-- Include the application files -->    <script src="src/app.js"></script>    <link href="src/app.css" rel="stylesheet">      <!-- Include the files of the chart component -->    <script src="src/chart.js"></script>    <link href="src/chart.css" rel="stylesheet">   </head> <body>    <!-- AngularJS example go here --> </body> </html> For all the examples in this section, I will use the exact same setup as the preceding code. I will only change the body of the HTML page or the JavaScript or CSS sources of the application. I will indicate to which file the code belongs to with a comment for each code snippet. If you are not using Bower and previously downloaded D3.js and AngularJS in a libs/ directory, refer to this directory when including the JavaScript files. Adding a module and a controller Next, we initialize the AngularJS module in the app.js file and create a main controller for the application. The controller should create random data (that represent some simple logs) in a fixed interval. Let's generate some random number of visitors every second and store all data points on the scope as follows: /* src/app.js */ // Application Module angular.module('myApp', [])   // Main application controller .controller('MainCtrl', ['$scope', '$interval', function ($scope, $interval) {      var time = new Date('2014-01-01 00:00:00 +0100');      // Random data point generator    var randPoint = function() {      var rand = Math.random;      return { time: time.toString(), visitors: rand()*100 };    }      // We store a list of logs    $scope.logs = [ randPoint() ];      $interval(function() {     time.setSeconds(time.getSeconds() + 1);      $scope.logs.push(randPoint());    }, 1000); }]); In the preceding example, we define an array of logs on the scope that we initialize with a random point. Every second, we will push a new random point to the logs. The points contain a number of visitors and a timestamp—starting with the date 2014-01-01 00:00:00 (timezone GMT+01) and counting up a second on each iteration. I want to keep it simple for now; therefore, we will use just a very basic example of random access log entries. Consider to use the cleaner controller as syntax for larger AngularJS applications because it makes the scopes in HTML templates explicit! However, for compatibility reasons, I will use the standard controller and $scope notation. Integrating D3.js into AngularJS We bootstrapped a simple AngularJS application in the previous section. Now, the goal is to integrate a D3.js component seamlessly into an AngularJS application—in an Angular way. This means that we have to design the AngularJS application and the visualization component such that the modules are fully encapsulated and reusable. In order to do so, we will use a separation on different levels: Code of different components goes into different files Code of the visualization library goes into a separate module Inside a module, we divide logics into controllers, services, and directives Using this clear separation allows you to keep files and modules organized and clean. If at anytime we want to replace the D3.js backend with a canvas pixel graphic, we can just implement it without interfering with the main application. 
This means that we want to use a new module of the visualization component and dependency injection. These modules enable us to have full control of the separate visualization component without touching the main application and they will make the component maintainable, reusable, and testable. Organizing the directory First, we add the new files for the visualization component to the project: src/: This is the default directory to store all the file components for the project src/chart.js: This is the JS source of the chart component src/chart.css: This is the CSS layout for the chart component test/test/config/: This directory contains all test configurations test/spec/test/spec/chart.spec.js: This file contains the unit tests of the chart component test/e2e/chart.e2e.js: This file contains the integration tests of the chart component If you develop large AngularJS applications, this is probably not the folder structure that you are aiming for. Especially in bigger applications, you will most likely want to have components in separate folders and directives and services in separate files. Then, we will encapsulate the visualization from the main application and create the new myChart module for it. This will make it possible to inject the visualization component or parts of it—for example just the chart directive—to the main application. Wrapping D3.js In this module, we will wrap D3.js—which is available via the global d3 variable—in a service; actually, we will use a factory to just return the reference to the d3 variable. This enables us to pass D3.js as a dependency inside the newly created module wherever we need it. The advantage of doing so is that the injectable d3 component—or some parts of it—can be mocked for testing easily. Let's assume we are loading data from a remote resource and do not want to wait for the time to load the resource every time we test the component. Then, the fact that we can mock and override functions without having to modify anything within the component will become very handy. Another great advantage will be defining custom localization configurations directly in the factory. This will guarantee that we have the proper localization wherever we use D3.js in the component. Moreover, in every component, we use the injected d3 variable in a private scope of a function and not in the global scope. This is absolutely necessary for clean and encapsulated components; we should never use any variables from global scope within an AngularJS component. Now, let's create a second module that stores all the visualization-specific code dependent on D3.js. Thus, we want to create an injectable factory for D3.js, as shown in the following code: /* src/chart.js */ // Chart Module   angular.module('myChart', [])   // D3 Factory .factory('d3', function() {   /* We could declare locals or other D3.js      specific configurations here. */   return d3; }); In the preceding example, we returned d3 without modifying it from the global scope. We can also define custom D3.js specific configurations here (such as locals and formatters). We can go one step further and load the complete D3.js code inside this factory so that d3 will not be available in the global scope at all. However, we don't use this approach here to keep things as simple and understandable as possible. We need to make this module or parts of it available to the main application. 
In AngularJS, we can do this by injecting the myChart module into the myApp application as follows: /* src/app.js */   angular.module('myApp', ['myChart']); Usually, we will just inject the directives and services of the visualization module that we want to use in the application, not the whole module. However, for the start and to access all parts of the visualization, we will leave it like this. We can use the components of the chart module now on the AngularJS application by injecting them into the controllers, services, and directives. The boilerplate—with a simple chart.js and chart.css file—is now ready. We can start to design the chart directive. A chart directive Next, we want to create a reusable and testable chart directive. The first question that comes into one's mind is where to put which functionality? Should we create a svg element as parent for the directive or a div element? Should we draw a data point as a circle in svg and use ng-repeat to replicate these points in the chart? Or should we better create and modify all data points with D3.js? I will answer all these question in the following sections. A directive for SVG As a general rule, we can say that different concepts should be encapsulated so that they can be replaced anytime by a new technology. Hence, we will use AngularJS with an element directive as a parent element for the visualization. We will bind the data and the options of the chart to the private scope of the directive. In the directive itself, we will create the complete chart including the parent svg container, the axis, and all data points using D3.js. Let's first add a simple directive for the chart component: /* src/chart.js */ …   // Scatter Chart Directive .directive('myScatterChart', ["d3", function(d3){      return {      restrict: 'E',      scope: {        },      compile: function( element, attrs, transclude ) {                   // Create a SVG root element        var svg = d3.select(element[0]).append('svg');          // Return the link function        return function(scope, element, attrs) { };      }    }; }]); In the preceding example, we first inject d3 to the directive by passing it as an argument to the caller function. Then, we return a directive as an element with a private scope. Next, we define a custom compile function that returns the link function of the directive. This is important because we need to create the svg container for the visualization during the compilation of the directive. Then, during the link phase of the directive, we need to draw the visualization. Let's try to define some of these directives and look at the generated output. We define three directives in the index.html file, as shown in the following code: <!-- index.html --> <div ng-controller="MainCtrl">   <!-- We can use the visualization directives here --> <!-- The first chart --> <my-scatter-chart class="chart"></my-scatter-chart>   <!-- A second chart --> <my-scatter-chart class="chart"></my-scatter-chart>   <!-- Another chart --> <my-scatter-chart class="chart"></my-scatter-chart>   </div> If we look at the output of the html page in the developer tools, we can see that for each base element of the directive, we created a svg parent element for the visualization: Output of the HTML page In the resulting DOM tree, we can see that three svg elements are appended to the directives. We can now start to draw the chart in these directives. Let's fill these elements with some awesome charts. 
Implementing a custom compile function First, let's add a data attribute to the isolated scope of the directive. This gives us access to the dataset, which we will later pass to the directive in the HTML template. Next, we extend the compile function of the directive to create a g group container for the data points and the axis. We will also add a watcher that checks for changes of the scope data array. Every time the data changes, we call a draw() function that redraws the chart of the directive. Let's get started: /* src/capp..js */ ... // Scatter Chart Directive .directive('myScatterChart', ["d3", function(d3){        // we will soon implement this function    var draw = function(svg, width, height, data){ … };      return {      restrict: 'E',      scope: {        data: '='      },      compile: function( element, attrs, transclude ) {          // Create a SVG root element        var svg = d3.select(element[0]).append('svg');          svg.append('g').attr('class', 'data');        svg.append('g').attr('class', 'x-axis axis');        svg.append('g').attr('class', 'y-axis axis');          // Define the dimensions for the chart        var width = 600, height = 300;          // Return the link function        return function(scope, element, attrs) {            // Watch the data attribute of the scope          scope.$watch('data', function(newVal, oldVal, scope) {              // Update the chart            draw(svg, width, height, scope.data);          }, true);        };      }    }; }]); Now, we implement the draw() function in the beginning of the directive. Drawing charts So far, the chart directive should look like the following code. We will now implement the draw() function, draw axis, and time series data. We start with setting the height and width for the svg element as follows: /* src/chart.js */ ...   // Scatter Chart Directive .directive('myScatterChart', ["d3", function(d3){      function draw(svg, width, height, data) {      svg        .attr('width', width)        .attr('height', height);      // code continues here }      return {      restrict: 'E',      scope: {        data: '='      },      compile: function( element, attrs, transclude ) { ... } }]); Axis, scale, range, and domain We first need to create the scales for the data and then the axis for the chart. The implementation looks very similar to the scatter chart. We want to update the axis with the minimum and maximum values of the dataset; therefore, we also add this code to the draw() function: /* src/chart.js --> myScatterChart --> draw() */   function draw(svg, width, height, data) { ... 
// Define a margin
var margin = 30;

// Define x-scale
var xScale = d3.time.scale()
   .domain([
     d3.min(data, function(d) { return d.time; }),
     d3.max(data, function(d) { return d.time; })
   ])
   .range([margin, width-margin]);

// Define x-axis
var xAxis = d3.svg.axis()
   .scale(xScale)
   .orient('top')
   .tickFormat(d3.time.format('%S'));

// Define y-scale
var yScale = d3.scale.linear()
   .domain([0, d3.max(data, function(d) { return d.visitors; })])
   .range([margin, height-margin]);

// Define y-axis
var yAxis = d3.svg.axis()
   .scale(yScale)
   .orient('left')
   .tickFormat(d3.format('f'));

// Draw x-axis
svg.select('.x-axis')
   .attr("transform", "translate(0, " + margin + ")")
   .call(xAxis);

// Draw y-axis
svg.select('.y-axis')
   .attr("transform", "translate(" + margin + ")")
   .call(yAxis);
}

In the preceding code, we create a time scale for the x axis and a linear scale for the y axis and adapt the domains of both axes to match the minimum and maximum values of the dataset (we can also use the d3.extent() function to return min and max at the same time). Then, we define the pixel range for our chart area. Next, we create two axis objects with the previously defined scales and specify the tick format of each axis. We want to display the number of seconds that have passed on the x axis and an integer value of the number of visitors on the y axis. In the end, we draw the axes by calling the axis generator on the axis selection.

Joining the data points

Now, we will draw the data points. We finish the draw() function with this code:

/* src/chart.js --> myScatterChart --> draw() */
function draw(svg, width, height, data) {
...
// Add the new data points
svg.select('.data')
   .selectAll('circle').data(data)
   .enter()
   .append('circle');

// Update all data points
svg.select('.data')
   .selectAll('circle').data(data)
   .attr('r', 2.5)
   .attr('cx', function(d) { return xScale(d.time); })
   .attr('cy', function(d) { return yScale(d.visitors); });
}

In the preceding code, we first create circle elements for the enter join, that is, for the data points that have no corresponding circle in the selection. Then, we update the attributes of the center point of all circle elements of the chart. Let's look at the generated output of the application:

Output of the chart directive

We notice that the axes and the whole chart scale as soon as new data points are added to the chart. In fact, this result looks very similar to the previous example, with the main difference that we used a directive to draw this chart. This means that the data of the visualization belongs to the application, where it is stored and updated, whereas the directive is completely decoupled from the data. To achieve a nice output like the one in the previous figure, we need to add some styles to the chart.css file, as shown in the following code:

/* src/chart.css */
.axis path, .axis line {
   fill: none;
   stroke: #999;
   shape-rendering: crispEdges;
}
.tick {
   font: 10px sans-serif;
}
circle {
   fill: steelblue;
}

We need to disable the filling of the axes and enable crisp-edges rendering; this will give the whole visualization a much better look.

Summary

In this article, you learned how to properly integrate a D3.js component into an AngularJS application—the Angular way. All files, modules, and components should be maintainable, testable, and reusable.
You learned how to set up an AngularJS application and how to structure the folder structure for the visualization component. We put different responsibilities in different files and modules. Every piece that we can separate from the main application can be reused in another application; the goal is to use as much modularization as possible. As a next step, we created the visualization directive by implementing a custom compile function. This gives us access to the first compilation of the element—where we can append the svg element as a parent for the visualization—and other container elements. Resources for Article: Further resources on this subject: AngularJS Performance [article] An introduction to testing AngularJS directives [article] Our App and Tool Stack [article]
Solr Indexing Internals

Packt
23 Apr 2015
9 min read
In this article by Jayant Kumar, author of the book Apache Solr Search Patterns, we will discuss use cases for Solr in e-commerce and job sites. We will look at the problems faced while providing search in an e-commerce or job site: The e-commerce problem statement The job site problem statement Challenges of large-scale indexing (For more resources related to this topic, see here.) The e-commerce problem statement E-commerce provides an easy way to sell products to a large customer base. However, there is a lot of competition among multiple e-commerce sites. When users land on an e-commerce site, they expect to find what they are looking for quickly and easily. Also, users are not sure about the brands or the actual products they want to purchase. They have a very broad idea about what they want to buy. Many customers nowadays search for their products on Google rather than visiting specific e-commerce sites. They believe that Google will take them to the e-commerce sites that have their product. The purpose of any e-commerce website is to help customers narrow down their broad ideas and enable them to finalize the products they want to purchase. For example, suppose a customer is interested in purchasing a mobile. His or her search for a mobile should list mobile brands, operating systems on mobiles, screen size of mobiles, and all other features as facets. As the customer selects more and more features or options from the facets provided, the search narrows down to a small list of mobiles that suit his or her choice. If the list is small enough and the customer likes one of the mobiles listed, he or she will make the purchase. The challenge is also that each category will have a different set of facets to be displayed. For example, searching for books should display their format, as in paperpack or hardcover, author name, book series, language, and other facets related to books. These facets were different for mobiles that we discussed earlier. Similarly, each category will have different facets and it needs to be designed properly so that customers can narrow down to their preferred products, irrespective of the category they are looking into. The takeaway from this is that categorization and feature listing of products should be taken care of. Misrepresentation of features can lead to incorrect search results. Another takeaway is that we need to provide multiple facets in the search results. For example, while displaying the list of all mobiles, we need to provide facets for a brand. Once a brand is selected, another set of facets for operating systems, network, and mobile phone features has to be provided. As more and more facets are selected, we still need to show facets within the remaining products. Example of facet selection on Amazon.com Another problem is that we do not know what product the customer is searching for. A site that displays a huge list of products from different categories, such as electronics, mobiles, clothes, or books, needs to be able to identify what the customer is searching for. A customer can be searching for samsung, which can be in mobiles, tablets, electronics, or computers. The site should be able to identify whether the customer has input the author name or the book name. Identifying the input would help in increasing the relevance of the result set by increasing the precision of the search results. Most e-commerce sites provide search suggestions that include the category to help customers target the right category during their search. 
Amazon, for example, provides search suggestions that include both latest searched terms and products along with category-wise suggestions: Search suggestions on Amazon.com It is also important that products are added to the index as soon as they are available. It is even more important that they are removed from the index or marked as sold out as soon as their stock is exhausted. For this, modifications to the index should be immediately visible in the search. This is facilitated by a concept in Solr known as Near Real Time Indexing and Search (NRT). The job site problem statement A job site serves a dual purpose. On the one hand, it provides jobs to candidates, and on the other, it serves as a database of registered candidates' profiles for companies to shortlist. A job search has to be very intuitive for the candidates so that they can find jobs suiting their skills, position, industry, role, and location, or even by the company name. As it is important to keep the candidates engaged during their job search, it is important to provide facets on the abovementioned criteria so that they can narrow down to the job of their choice. The searches by candidates are not very elaborate. If the search is generic, the results need to have high precision. On the other hand, if the search does not return many results, then recall has to be high to keep the candidate engaged on the site. Providing a personalized job search to candidates on the basis of their profiles and past search history makes sense for the candidates. On the recruiter side, the search provided over the candidate database is required to have a huge set of fields to search upon every data point that the candidate has entered. The recruiters are very selective when it comes to searching for candidates for specific jobs. Educational qualification, industry, function, key skills, designation, location, and experience are some of the fields provided to the recruiter during a search. In such cases, the precision has to be high. The recruiter would like a certain candidate and may be interested in more candidates similar to the selected candidate. The more like this search in Solr can be used to provide a search for candidates similar to a selected candidate. NRT is important as the site should be able to provide a job or a candidate for a search as soon as any one of them is added to the database by either the recruiter or the candidate. The promptness of the site is an important factor in keeping users engaged on the site. Challenges of large-scale indexing Let us understand how indexing happens and what can be done to speed it up. We will also look at the challenges faced during the indexing of a large number of documents or bulky documents. An e-commerce site is a perfect example of a site containing a large number of products, while a job site is an example of a search where documents are bulky because of the content in candidate resumes. During indexing, Solr first analyzes the documents and converts them into tokens that are stored in the RAM buffer. When the RAM buffer is full, data is flushed into a segment on the disk. When the numbers of segments are more than that defined in the MergeFactor class of the Solr configuration, the segments are merged. Data is also written to disk when a commit is made in Solr. Let us discuss a few points to make Solr indexing fast and to handle a large index containing a huge number of documents. 
Using multiple threads for indexing on Solr

We can divide our data into smaller chunks, and each chunk can be indexed in a separate thread. Ideally, the number of threads should be twice the number of processor cores to avoid a lot of context switching. However, we can increase the number of threads beyond that and check for performance improvements.

Using the Java binary format of data for indexing

Instead of using XML files, we can use the Java bin format for indexing. This removes a lot of the overhead of parsing an XML file and converting it into a usable binary format. The way to use the Java bin format is to write our own program for creating fields, adding fields to documents, and finally adding documents to the index. Here is some sample code:

//Create an instance of the Solr server
String SOLR_URL = "http://localhost:8983/solr";
SolrServer server = new HttpSolrServer(SOLR_URL);

//Create a collection of documents to add to the Solr server
SolrInputDocument doc1 = new SolrInputDocument();
doc1.addField("id", 1);
doc1.addField("desc", "description text for doc 1");

SolrInputDocument doc2 = new SolrInputDocument();
doc2.addField("id", 2);
doc2.addField("desc", "description text for doc 2");

Collection<SolrInputDocument> docs = new ArrayList<SolrInputDocument>();
docs.add(doc1);
docs.add(doc2);

//Add the collection of documents to the Solr server and commit.
server.add(docs);
server.commit();

Here is the API reference for the HttpSolrServer class: http://lucene.apache.org/solr/4_6_0/solr-solrj/org/apache/solr/client/solrj/impl/HttpSolrServer.html. Add all files from the <solr_directory>/dist folder to the classpath for compiling and running the HttpSolrServer program.

Using the ConcurrentUpdateSolrServer class for indexing

Using the ConcurrentUpdateSolrServer class instead of the HttpSolrServer class can provide performance benefits, as the former uses buffers to store processed documents before sending them to the Solr server. We can also specify the number of background threads used to empty the buffers. The API docs for ConcurrentUpdateSolrServer are available at the following link: http://lucene.apache.org/solr/4_6_0/solr-solrj/org/apache/solr/client/solrj/impl/ConcurrentUpdateSolrServer.html

The constructor for the ConcurrentUpdateSolrServer class is defined as:

ConcurrentUpdateSolrServer(String solrServerUrl, int queueSize, int threadCount)

Here, queueSize is the size of the buffer and threadCount is the number of background threads used to flush the buffers to the index on disk. Note that using too many threads can increase the context switching between threads and reduce performance. In order to optimize the number of threads, we should monitor performance (documents indexed per minute) after each increase and ensure that there is no decrease in performance.

Summary

In this article, we briefly looked at the problems faced by e-commerce and job websites during indexing and search. We discussed the challenges faced while indexing a large number of documents, and we saw some tips on improving the speed of indexing. Resources for Article: Further resources on this subject: Tuning Solr JVM and Container [article] Apache Solr PHP Integration [article] Boost Your search [article]
Text Mining with R: Part 2

Robi Sen
16 Apr 2015
4 min read
In Part 1, we covered the basics of doing text mining in R by selecting data, preparing it, and cleaning it, then performing various operations on it to visualize that data. In this post, we look at a simple use case showing how we can derive real meaning and value from a visualization, by seeing how a simple word cloud can help you understand the impact of an advertisement.

Building the document matrix

A common technique in text mining is to use a matrix of documents and terms, called a document term matrix. A document term matrix is simply a matrix whose columns are terms and whose rows are documents, with each cell holding the number of occurrences of a term within a document. If you reverse the order and have terms as rows and documents as columns, it is called a term document matrix. For example, let's say we have two documents, D1 and D2:

D1 = "I like cats"
D2 = "I hate cats"

Then the document term matrix would look like this:

      I   like   hate   cats
D1    1   1      0      1
D2    1   0      1      1

For our project, to make a document term matrix in R, all you need to do is use DocumentTermMatrix() like this:

tdm <- DocumentTermMatrix(mycorpus)

You can see information on your document term matrix by using print():

print(tdm)
<<DocumentTermMatrix (documents: 4688, terms: 18363)>>
Non-/sparse entries: 44400/86041344
Sparsity           : 100%
Maximal term length: 65
Weighting          : term frequency (tf)

Next, we need to sum up all the values in each term column so that we can derive the frequency of each term's occurrence. We also want to sort those values from highest to lowest. You can use this code:

m <- as.matrix(tdm)
v <- sort(colSums(m), decreasing=TRUE)

Next, we will use names() to pull each term's name, which in our case is a word. Then we build a data frame of the words and their frequencies. Finally, we create our word cloud, removing any terms that occur fewer than 45 times to reduce clutter. You could also use max.words to limit the total number of words in your word cloud. So your final code should look like this:

words <- names(v)
d <- data.frame(word=words, freq=v)
wordcloud(d$word, d$freq, min.freq=45)

If you run this in RStudio, you should see something like the figure, which shows the words with the highest occurrence in our corpus. The wordcloud() function automatically scales the drawn words by their frequency value. From here you can do a lot with your word cloud, including changing the scale, associating colors with various values, and much more. You can read more about wordcloud here.

While word clouds are often used on the web for things like blogs, news sites, and other similar use cases, they have real value for data analysis beyond being visual indicators that help users find terms of interest. For example, if you look at the word cloud we generated, you will notice that one of the most popular terms mentioned in tweets is chocolate. Doing a short inspection of our CSV document for the term chocolate, we find a lot of people mentioning the word in a variety of contexts, but one of the most common is in relation to a specific Super Bowl ad. For example, here is a tweet:

Alexalabesky 41673.39 Chocolate chips and peanut butter 0 0 0 Unknown Unknown Unknown Unknown Unknown

This appeared after the airing of this advertisement from Butterfinger. So even with this simple R code, we can derive real meaning from social media: in this case, the measurable impact of an advertisement during the Super Bowl.
Summary

In this post, we looked at a simple use case showing how we can derive real meaning and value from a visualization, by seeing how a simple word cloud can help you understand the impact of an advertisement.

About the author

Robi Sen, CSO at Department 13, is an experienced inventor, serial entrepreneur, and futurist whose dynamic twenty-plus year career in technology, engineering, and research has led him to work on cutting-edge projects for DARPA, TSWG, SOCOM, RRTO, NASA, DOE, and the DOD. Robi also has extensive experience in the commercial space, including the co-creation of several successful start-up companies. He has worked with companies such as UnderArmour, Sony, CISCO, IBM, and many others to help build out new products and services. Robi specializes in bringing his unique vision and thought process to difficult and complex problems, allowing companies and organizations to find innovative solutions that they can rapidly operationalize or go to market with.
Visualization

Packt
15 Apr 2015
29 min read
Humans are visual creatures and have evolved to be able to quickly notice the meaning when information is presented in certain ways that cause the wiring in our brains to have the light bulb of insight turn on. This "aha" can often be performed very quickly, given the correct tools, instead of through tedious numerical analysis. Tools for data analysis, such as pandas, take advantage of being able to quickly and iteratively provide the user to take data, process it, and quickly visualize the meaning. Often, much of what you will do with pandas is massaging your data to be able to visualize it in one or more visual patterns, in an attempt to get to "aha" by simply glancing at the visual representation of the information. In this article by Michael Heydt, author of the book Learning pandas we will cover common patterns in visualizing data with pandas. It is not meant to be exhaustive in coverage. The goal is to give you the required knowledge to create beautiful data visualizations on pandas data quickly and with very few lines of code. (For more resources related to this topic, see here.) This article is presented in three sections. The first introduces you to the general concepts of programming visualizations with pandas, emphasizing the process of creating time-series charts. We will also dive into techniques to label axes and create legends, colors, line styles, and markets. The second part of the article will then focus on the many types of data visualizations commonly used in pandas programs and data sciences, including: Bar plots Histograms Box and whisker charts Area plots Scatter plots Density plots Scatter plot matrixes Heatmaps The final section will briefly look at creating composite plots by dividing plots into subparts and drawing multiple plots within a single graphical canvas. Setting up the IPython notebook The first step to plot with pandas data, is to first include the appropriate libraries, primarily, matplotlib. The examples in this article will all be based on the following imports, where the plotting capabilities are from matplotlib, which will be aliased with plt: In [1]:# import pandas, numpy and datetimeimport numpy as npimport pandas as pd# needed for representing dates and timesimport datetimefrom datetime import datetime# Set some pandas options for controlling outputpd.set_option('display.notebook_repr_html', False)pd.set_option('display.max_columns', 10)pd.set_option('display.max_rows', 10)# used for seeding random number sequencesseedval = 111111# matplotlibimport matplotlib as mpl# matplotlib plotting functionsimport matplotlib.pyplot as plt# we want our plots inline%matplotlib inline The %matplotlib inline line is the statement that tells matplotlib to produce inline graphics. This will make the resulting graphs appear either inside your IPython notebook or IPython session. All examples will seed the random number generator with 111111, so that the graphs remain the same every time they run. Plotting basics with pandas The pandas library itself performs data manipulation. It does not provide data visualization capabilities itself. The visualization of data in pandas data structures is handed off by pandas to other robust visualization libraries that are part of the Python ecosystem, most commonly, matplotlib, which is what we will use in this article. All of the visualizations and techniques covered in this article can be performed without pandas. These techniques are all available independently in matplotlib. 
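For instance, the following is a minimal sketch, not taken from the book, of what a comparable time-series plot looks like when you call matplotlib directly; the random-walk data and the date range here are illustrative stand-ins:

# matplotlib-only sketch (illustrative; dates and data are made up for this example)
import datetime
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.dates as mdates

np.random.seed(111111)
start = datetime.datetime(2012, 1, 1)
dates = [start + datetime.timedelta(days=i) for i in range(1096)]  # three years of days
walk = np.random.randn(1096).cumsum()                              # a random walk

fig, ax = plt.subplots()
ax.plot(dates, walk)                                          # x and y passed explicitly
ax.xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m'))   # date labels formatted by hand
fig.autofmt_xdate()
plt.show()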
pandas tightly integrates with matplotlib, and by doing this, it is very simple to go directly from pandas data to a matplotlib visualization without having to work with intermediate forms of data. pandas does not draw the graphs, but it will tell matplotlib how to draw graphs using pandas data, taking care of many details on your behalf, such as automatically selecting Series for plots, labeling axes, creating legends, and defaulting color. Therefore, you often have to write very little code to create stunning visualizations. Creating time-series charts with .plot() One of the most common data visualizations created, is of the time-series data. Visualizing a time series in pandas is as simple as calling .plot() on a DataFrame or Series object. To demonstrate, the following creates a time series representing a random walk of values over time, akin to the movements in the price of a stock: In [2]:# generate a random walk time-seriesnp.random.seed(seedval)s = pd.Series(np.random.randn(1096),index=pd.date_range('2012-01-01','2014-12-31'))walk_ts = s.cumsum()# this plots the walk - just that easy :)walk_ts.plot(); The ; character at the end suppresses the generation of an IPython out tag, as well as the trace information. It is a common practice to execute the following statement to produce plots that have a richer visual style. This sets a pandas option that makes resulting plots have a shaded background and what is considered a slightly more pleasing style: In [3]:# tells pandas plots to use a default style# which has a background fillpd.options.display.mpl_style = 'default'walk_ts.plot(); The .plot() method on pandas objects is a wrapper function around the matplotlib libraries' plot() function. It makes plots of pandas data very easy to create. It is coded to know how to use the data in the pandas objects to create the appropriate plots for the data, handling many of the details of plot generation, such as selecting series, labeling, and axes generation. In this situation, the .plot() method determines that as Series contains dates for its index that the x axis should be formatted as dates and it selects a default color for the data. This example used a single series and the result would be the same using DataFrame with a single column. As an example, the following produces the same graph with one small difference. It has added a legend to the graph, which charts by default, generated from a DataFrame object, will have a legend even if there is only one series of data: In [4]:# a DataFrame with a single column will produce# the same plot as plotting the Series it is created fromwalk_df = pd.DataFrame(walk_ts)walk_df.plot(); The .plot() function is smart enough to know whether DataFrame has multiple columns, and it should create multiple lines/series in the plot and include a key for each, and also select a distinct color for each line. 
This is demonstrated with the following example: In [5]:# generate two random walks, one in each of# two columns in a DataFramenp.random.seed(seedval)df = pd.DataFrame(np.random.randn(1096, 2),index=walk_ts.index, columns=list('AB'))walk_df = df.cumsum()walk_df.head()Out [5]:A B2012-01-01 -1.878324 1.3623672012-01-02 -2.804186 1.4272612012-01-03 -3.241758 3.1653682012-01-04 -2.750550 3.3326852012-01-05 -1.620667 2.930017In [6]:# plot the DataFrame, which will plot a line# for each column, with a legendwalk_df.plot(); If you want to use one column of DataFrame as the labels on the x axis of the plot instead of the index labels, you can use the x and y parameters to the .plot() method, giving the x parameter the name of the column to use as the x axis and y parameter the names of the columns to be used as data in the plot. The following recreates the random walks as columns 'A' and 'B', creates a column 'C' with sequential values starting with 0, and uses these values as the x axis labels and the 'A' and 'B' columns values as the two plotted lines: In [7]:# copy the walkdf2 = walk_df.copy()# add a column C which is 0 .. 1096df2['C'] = pd.Series(np.arange(0, len(df2)), index=df2.index)# instead of dates on the x axis, use the 'C' column,# which will label the axis with 0..1000df2.plot(x='C', y=['A', 'B']); The .plot() functions, provided by pandas for the Series and DataFrame objects, take care of most of the details of generating plots. However, if you want to modify characteristics of the generated plots beyond their capabilities, you can directly use the matplotlib functions or one of more of the many optional parameters of the .plot() method. Adorning and styling your time-series plot The built-in .plot() method has many options that you can use to change the content in the plot. We will cover several of the common options used in most plots. Adding a title and changing axes labels The title of the chart can be set using the title parameter of the .plot() method. Axes labels are not set with .plot(), but by directly using the plt.ylabel() and plt.xlabel() functions after calling .plot(): In [8]:# create a time-series chart with a title and specific# x and y axes labels# the title is set in the .plot() method as a parameterwalk_df.plot(title='Title of the Chart')# explicitly set the x and y axes labels after the .plot()plt.xlabel('Time')plt.ylabel('Money'); The labels in this plot were added after the call to .plot(). A question that may be asked, is that if the plot is generated in the call to .plot(), then how are they changed on the plot? The answer, is that plots in matplotlib are not displayed until either .show() is called on the plot or the code reaches the end of the execution and returns to the interactive prompt. At either of these points, any plot generated by plot commands will be flushed out to the display. In this example, although .plot() is called, the plot is not generated until the IPython notebook code section finishes completion, so the changes for labels and title are added to the plot. Specifying the legend content and position To change the text used in the legend (the default is the column name from DataFrame), you can use the ax object returned from the .plot() method to modify the text using its .legend() method. 
The ax object is an AxesSubplot object, which is a representation of the elements of the plot, that can be used to change various aspects of the plot before it is generated: In [9]:# change the legend items to be different# from the names of the columns in the DataFrameax = walk_df.plot(title='Title of the Chart')# this sets the legend labelsax.legend(['1', '2']); The location of the legend can be set using the loc parameter of the .legend() method. By default, pandas sets the location to 'best', which tells matplotlib to examine the data and determine the best place to put the legend. However, you can also specify any of the following to position the legend more specifically (you can use either the string or the numeric code): Text Code 'best' 0 'upper right' 1 'upper left' 2 'lower left' 3 'lower right' 4 'right' 5 'center left' 6 'center right' 7 'lower center' 8 'upper center' 9 'center' 10 In our last chart, the 'best' option actually had the legend overlap the line from one of the series. We can reposition the legend in the upper center of the chart, which will prevent this and create a better chart of this data: In [10]:# change the position of the legendax = walk_df.plot(title='Title of the Chart')# put the legend in the upper center of the chartax.legend(['1', '2'], loc='upper center'); Legends can also be turned off with the legend parameter: In [11]:# omit the legend by using legend=Falsewalk_df.plot(title='Title of the Chart', legend=False); There are more possibilities for locating and actually controlling the content of the legend, but we leave that for you to do some more experimentation. Specifying line colors, styles, thickness, and markers pandas automatically sets the colors of each series on any chart. If you would like to specify your own color, you can do so by supplying style code to the style parameter of the plot function. pandas has a number of built-in single character code for colors, several of which are listed here: b: Blue g: Green r: Red c: Cyan m: Magenta y: Yellow k: Black w: White It is also possible to specify the color using a hexadecimal RGB code of the #RRGGBB format. To demonstrate both options, the following example sets the color of the first series to green using a single digit code and the second series to red using the hexadecimal code: In [12]:# change the line colors on the plot# use character code for the first line,# hex RGB for the secondwalk_df.plot(style=['g', '#FF0000']); Line styles can be specified using a line style code. These can be used in combination with the color style codes, following the color code. The following are examples of several useful line style codes: '-' = solid '--' = dashed ':' = dotted '-.' = dot-dashed '.' = points The following plot demonstrates these five line styles by drawing five data series, each with one of these styles. Notice how each style item now consists of a color symbol and a line style code: In [13]:# show off different line stylest = np.arange(0., 5., 0.2)legend_labels = ['Solid', 'Dashed', 'Dotted','Dot-dashed', 'Points']line_style = pd.DataFrame({0 : t,1 : t**1.5,2 : t**2.0,3 : t**2.5,4 : t**3.0})# generate the plot, specifying color and line style for each lineax = line_style.plot(style=['r-', 'g--', 'b:', 'm-.', 'k:'])# set the legendax.legend(legend_labels, loc='upper left'); The thickness of lines can be specified using the lw parameter of .plot(). This can be passed a thickness for multiple lines, by passing a list of widths, or a single width that is applied to all lines. 
The following redraws the graph with a line width of 3, making the lines a little more pronounced: In [14]:# regenerate the plot, specifying color and line style# for each line and a line width of 3 for all linesax = line_style.plot(style=['r-', 'g--', 'b:', 'm-.', 'k:'], lw=3)ax.legend(legend_labels, loc='upper left'); Markers on a line can also be specified using abbreviations in the style code. There are quite a few marker types provided and you can see them all at http://matplotlib.org/api/markers_api.html. We will examine five of them in the following chart by having each series use a different marker from the following: circles, stars, triangles, diamonds, and points. The type of marker is also specified using a code at the end of the style: In [15]:# redraw, adding markers to the linesax = line_style.plot(style=['r-o', 'g--^', 'b:*','m-.D', 'k:o'], lw=3)ax.legend(legend_labels, loc='upper left'); Specifying tick mark locations and tick labels Every plot we have seen to this point, has used the default tick marks and labels on the ticks that pandas decides are appropriate for the plot. These can also be customized using various matplotlib functions. We will demonstrate how ticks are handled by first examining a simple DataFrame. We can retrieve the locations of the ticks that were generated on the x axis using the plt.xticks() method. This method returns two values, the location, and the actual labels: In [16]:# a simple plot to use to examine ticksticks_data = pd.DataFrame(np.arange(0,5))ticks_data.plot()ticks, labels = plt.xticks()ticksOut [16]:array([ 0. , 0.5, 1. , 1.5, 2. , 2.5, 3. , 3.5, 4. ]) This array contains the locations of the ticks in units of the values along the x axis. pandas has decided that a range of 0 through 4 (the min and max) and an interval of 0.5 is appropriate. If we want to use other locations, we can provide these by passing them to plt.xticks() as a list. The following demonstrates these using even integers from -1 to 5, which will both change the extents of the axis, as well as remove non integral labels: In [17]:# resize x axis to (-1, 5), and draw ticks# only at integer valuesticks_data = pd.DataFrame(np.arange(0,5))ticks_data.plot()plt.xticks(np.arange(-1, 6)); Also, we can specify new labels at these locations by passing them as the second parameter. Just as an example, we can change the y axis ticks and labels to integral values and consecutive alpha characters using the following: In [18]:# rename y axis tick labels to A, B, C, D, and Eticks_data = pd.DataFrame(np.arange(0,5))ticks_data.plot()plt.yticks(np.arange(0, 5), list("ABCDE")); Formatting axes tick date labels using formatters The formatting of axes labels whose underlying data types is datetime is performed using locators and formatters. Locators control the position of the ticks, and the formatters control the formatting of the labels. To facilitate locating ticks and formatting labels based on dates, matplotlib provides several classes in maptplotlib.dates to help facilitate the process: MinuteLocator, HourLocator, DayLocator, WeekdayLocator, MonthLocator, and YearLocator: These are specific locators coded to determine where ticks for each type of date field will be found on the axis DateFormatter: This is a class that can be used to format date objects into labels on the axis By default, the default locator and formatter are AutoDateLocator and AutoDateFormatter, respectively. 
You can change these by providing different objects to use the appropriate methods on the specific axis object. To demonstrate, we will use a subset of the random walk data from earlier, which represents just the data from January through February of 2014. Plotting this gives us the following output: In [19]:# plot January-February 2014 from the random walkwalk_df.loc['2014-01':'2014-02'].plot(); The labels on the x axis of this plot have two series of labels, the minor and the major. The minor labels in this plot contain the day of the month, and the major contains the year and month (the year only for the first month). We can set locators and formatters for each of the minor and major levels. This will be demonstrated by changing the minor labels to be located at the Monday of each week and to contain the date and day of the week (right now, the chart uses weekly and only Friday's date—without the day name). On the major labels, we will use the monthly location and always include both the month name and the year: In [20]:# this import styles helps us type lessfrom matplotlib.dates import WeekdayLocator, DateFormatter, MonthLocator# plot Jan-Feb 2014ax = walk_df.loc['2014-01':'2014-02'].plot()# do the minor labelsweekday_locator = WeekdayLocator(byweekday=(0), interval=1)ax.xaxis.set_minor_locator(weekday_locator)ax.xaxis.set_minor_formatter(DateFormatter("%dn%a"))# do the major labelsax.xaxis.set_major_locator(MonthLocator())ax.xaxis.set_major_formatter(DateFormatter('nnn%bn%Y')); This is almost what we wanted. However, note that the year is being reported as 45. This, unfortunately, seems to be an issue between pandas and the matplotlib representation of values for the year. The best reference I have on this is this following link from Stack Overflow (http://stackoverflow.com/questions/12945971/pandas-timeseries-plot-setting-x-axis-major-and-minor-ticks-and-labels). So, it appears to create a plot with custom-date-based labels, we need to avoid the pandas .plot() and need to kick all the way down to using matplotlib. Fortunately, this is not too hard. The following changes the code slightly and renders what we wanted: In [21]:# this gets around the pandas / matplotlib year issue# need to reference the subset twice, so let's make a variablewalk_subset = walk_df['2014-01':'2014-02']# this gets the plot so we can use it, we can ignore figfig, ax = plt.subplots()# inform matplotlib that we will use the following as dates# note we need to convert the index to a pydatetime seriesax.plot_date(walk_subset.index.to_pydatetime(), walk_subset, '-')# do the minor labelsweekday_locator = WeekdayLocator(byweekday=(0), interval=1)ax.xaxis.set_minor_locator(weekday_locator)ax.xaxis.set_minor_formatter(DateFormatter('%dn%a'))# do the major labelsax.xaxis.set_major_locator(MonthLocator())ax.xaxis.set_major_formatter(DateFormatter('nnn%bn%Y'));ax.xaxis.set_major_formatter(DateFormatter('nnn%bn%Y')); To add grid lines for the minor axes ticks, you can use the .grid() method of the x axis object of the plot, the first parameter specifying the lines to use and the second parameter specifying the minor or major set of ticks. 
The following replots this graph without the major grid line and with the minor grid lines: In [22]:# this gets the plot so we can use it, we can ignore figfig, ax = plt.subplots()# inform matplotlib that we will use the following as dates# note we need to convert the index to a pydatetime seriesax.plot_date(walk_subset.index.to_pydatetime(), walk_subset, '-')# do the minor labelsweekday_locator = WeekdayLocator(byweekday=(0), interval=1)ax.xaxis.set_minor_locator(weekday_locator)ax.xaxis.set_minor_formatter(DateFormatter('%dn%a'))ax.xaxis.grid(True, "minor") # turn on minor tick grid linesax.xaxis.grid(False, "major") # turn off major tick grid lines# do the major labelsax.xaxis.set_major_locator(MonthLocator())ax.xaxis.set_major_formatter(DateFormatter('nnn%bn%Y')); The last demonstration of formatting will use only the major labels but on a weekly basis and using a YYYY-MM-DD format. However, because these would overlap, we will specify that they should be rotated to prevent the overlap. This is done using the fig.autofmt_xdate() function: In [23]:# this gets the plot so we can use it, we can ignore figfig, ax = plt.subplots()# inform matplotlib that we will use the following as dates# note we need to convert the index to a pydatetime seriesax.plot_date(walk_subset.index.to_pydatetime(), walk_subset, '-')ax.xaxis.grid(True, "major") # turn off major tick grid lines# do the major labelsax.xaxis.set_major_locator(weekday_locator)ax.xaxis.set_major_formatter(DateFormatter('%Y-%m-%d'));# informs to rotate date labelsfig.autofmt_xdate(); Common plots used in statistical analyses Having seen how to create, lay out, and annotate time-series charts, we will now look at creating a number of charts, other than time series that are commonplace in presenting statistical information. Bar plots Bar plots are useful in order to visualize the relative differences in values of non time-series data. Bar plots can be created using the kind='bar' parameter of the .plot() method: In [24]:# make a bar plot# create a small series of 10 random values centered at 0.0np.random.seed(seedval)s = pd.Series(np.random.rand(10) - 0.5)# plot the bar charts.plot(kind='bar'); If the data being plotted consists of multiple columns, a multiple series bar plot will be created: In [25]:# draw a multiple series bar chart# generate 4 columns of 10 random valuesnp.random.seed(seedval)df2 = pd.DataFrame(np.random.rand(10, 4),columns=['a', 'b', 'c', 'd'])# draw the multi-series bar chartdf2.plot(kind='bar'); If you would prefer stacked bars, you can use the stacked parameter, setting it to True: In [26]:# horizontal stacked bar chartdf2.plot(kind='bar', stacked=True); If you want the bars to be horizontally aligned, you can use kind='barh': In [27]:# horizontal stacked bar chartdf2.plot(kind='barh', stacked=True); Histograms Histograms are useful for visualizing distributions of data. The following shows you a histogram of generating 1000 values from the normal distribution: In [28]:# create a histogramnp.random.seed(seedval)# 1000 random numbersdfh = pd.DataFrame(np.random.randn(1000))# draw the histogramdfh.hist(); The resolution of a histogram can be controlled by specifying the number of bins to allocate to the graph. The default is 10, and increasing the number of bins gives finer detail to the histogram. 
The following increases the number of bins to 100:

In [29]:
# histogram again, but with more bins
dfh.hist(bins=100);

If the data has multiple series, the histogram function will automatically generate multiple histograms, one for each series:

In [30]:
# generate a multiple histogram plot
# create DataFrame with 4 columns of 1000 random values
np.random.seed(seedval)
dfh = pd.DataFrame(np.random.randn(1000, 4),
                   columns=['a', 'b', 'c', 'd'])

# draw the chart. There are four columns so pandas draws
# four histograms
dfh.hist();

If you want to overlay multiple histograms on the same graph (to give a quick visual comparison of the distributions), you can call the pyplot.hist() function multiple times before .show() is called to render the chart:

In [31]:
# directly use pyplot to overlay multiple histograms
# generate two distributions, each with a different
# mean and standard deviation
np.random.seed(seedval)
x = [np.random.normal(3, 1) for _ in range(400)]
y = [np.random.normal(4, 2) for _ in range(400)]

# specify the bins (-10 to 10 with 100 bins)
bins = np.linspace(-10, 10, 100)

# generate plot x using plt.hist, 50% transparent
plt.hist(x, bins, alpha=0.5, label='x')

# generate plot y using plt.hist, 50% transparent
plt.hist(y, bins, alpha=0.5, label='y')
plt.legend(loc='upper right');

Box and whisker charts

Box plots come from descriptive statistics and are a useful way of graphically depicting the distributions of categorical data using quartiles. Each box represents the values between the first and third quartiles of the data, with a line across the box at the median. Each whisker reaches out to show the extent of the data up to 1.5 times the interquartile range below the first quartile and above the third quartile:

In [32]:
# create a box plot
# generate the series
np.random.seed(seedval)
dfb = pd.DataFrame(np.random.randn(10, 5))

# generate the plot
dfb.boxplot(return_type='axes');

There are ways to overlay dots and show outliers, but for brevity, they will not be covered in this text.

Area plots

Area plots are used to represent cumulative totals over time, to demonstrate the change in trends over time among related attributes. They can also be "stacked" to demonstrate representative totals across all variables. Area plots are generated by specifying kind='area'. A stacked area chart is the default:

In [33]:
# create a stacked area plot
# generate a 4-column data frame of random data
np.random.seed(seedval)
dfa = pd.DataFrame(np.random.rand(10, 4),
                   columns=['a', 'b', 'c', 'd'])

# create the area plot
dfa.plot(kind='area');

To produce an unstacked plot, specify stacked=False:

In [34]:
# do not stack the area plot
dfa.plot(kind='area', stacked=False);

By default, unstacked plots have an alpha value of 0.5, so that it is possible to see how the data series overlap.

Scatter plots

A scatter plot displays the correlation between a pair of variables. A scatter plot can be created from a DataFrame using .plot() and specifying kind='scatter', as well as specifying the x and y columns from the DataFrame source:

In [35]:
# generate a scatter plot of two series of normally
# distributed random values
# we would expect this to cluster around 0,0
np.random.seed(111111)
sp_df = pd.DataFrame(np.random.randn(10000, 2),
                     columns=['a', 'b'])
sp_df.plot(kind='scatter', x='a', y='b')

We can easily create more elaborate scatter plots by dropping down a little lower into matplotlib.
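The key ingredients for these more elaborate plots are matplotlib's per-point color (c) and size (s) arrays accepted by plt.scatter(). Here is a minimal, self-contained sketch using made-up arrays (the data and seed are assumptions for illustration, not taken from the book's example):

import numpy as np
import matplotlib.pyplot as plt

np.random.seed(0)                  # arbitrary seed for the sketch
x = np.random.randn(100)
y = np.random.randn(100)
sizes = 100 * np.random.rand(100)  # marker area, one value per point
colors = np.random.rand(100)       # values mapped through the current colormap

# each point gets its own size and color; alpha makes overlaps visible
plt.scatter(x, y, c=colors, s=sizes, alpha=0.5)
plt.colorbar();                    # show the scale used for the colors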
The following code gets Google stock data for the year of 2011 and calculates delta in the closing price per day, and renders close versus volume as bubbles of different sizes, derived on the size of the values in the data: In [36]:# get Google stock data from 1/1/2011 to 12/31/2011from pandas.io.data import DataReaderstock_data = DataReader("GOOGL", "yahoo",datetime(2011, 1, 1),datetime(2011, 12, 31))# % change per daydelta = np.diff(stock_data["Adj Close"])/stock_data["Adj Close"][:-1]# this calculates size of markersvolume = (15 * stock_data.Volume[:-2] / stock_data.Volume[0])**2close = 0.003 * stock_data.Close[:-2] / 0.003 * stock_data.Open[:-2]# generate scatter plotfig, ax = plt.subplots()ax.scatter(delta[:-1], delta[1:], c=close, s=volume, alpha=0.5)# add some labels and styleax.set_xlabel(r'$Delta_i$', fontsize=20)ax.set_ylabel(r'$Delta_{i+1}$', fontsize=20)ax.set_title('Volume and percent change')ax.grid(True); Note the nomenclature for the x and y axes labels, which creates a nice mathematical style for the labels. Density plot You can create kernel density estimation plots using the .plot() method and setting the kind='kde' parameter. A kernel density estimate plot, instead of being a pure empirical representation of the data, makes an attempt and estimates the true distribution of the data, and hence smoothes it into a continuous plot. The following generates a normal distributed set of numbers, displays it as a histogram, and overlays the kde plot: In [37]:# create a kde density plot# generate a series of 1000 random numbersnp.random.seed(seedval)s = pd.Series(np.random.randn(1000))# generate the plots.hist(normed=True) # shows the barss.plot(kind='kde'); The scatter plot matrix The final composite graph we'll look at in this article is one that is provided by pandas in its plotting tools subcomponent: the scatter plot matrix. A scatter plot matrix is a popular way of determining whether there is a linear correlation between multiple variables. The following creates a scatter plot matrix with random values, which then shows a scatter plot for each combination, as well as a kde graph for each variable: In [38]:# create a scatter plot matrix# import this classfrom pandas.tools.plotting import scatter_matrix# generate DataFrame with 4 columns of 1000 random numbersnp.random.seed(111111)df_spm = pd.DataFrame(np.random.randn(1000, 4),columns=['a', 'b', 'c', 'd'])# create the scatter matrixscatter_matrix(df_spm, alpha=0.2, figsize=(6, 6), diagonal='kde'); Heatmaps A heatmap is a graphical representation of data, where values within a matrix are represented by colors. This is an effective means to show relationships of values that are measured at the intersection of two variables, at each intersection of the rows and the columns of the matrix. A common scenario, is to have the values in the matrix normalized to 0.0 through 1.0 and have the intersections between a row and column represent the correlation between the two variables. Values with less correlation (0.0) are the darkest, and those with the highest correlation (1.0) are white. 
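A typical source for such a matrix is the pairwise correlation of a DataFrame's columns, which pandas can compute directly. The following is a small sketch with synthetic data (the frame and seed are assumptions for illustration, not the book's data):

import numpy as np
import pandas as pd

np.random.seed(0)   # arbitrary seed for the sketch
df = pd.DataFrame(np.random.randn(100, 4),
                  columns=['a', 'b', 'c', 'd'])

# pairwise Pearson correlations: a square matrix with values from -1.0 to 1.0,
# which is exactly the kind of matrix a heatmap renders well
corr = df.corr()
print(corr)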
Heatmaps are easily created with pandas and matplotlib using the .imshow() function: In [39]:# create a heatmap# start with data for the heatmaps = pd.Series([0.0, 0.1, 0.2, 0.3, 0.4],['V', 'W', 'X', 'Y', 'Z'])heatmap_data = pd.DataFrame({'A' : s + 0.0,'B' : s + 0.1,'C' : s + 0.2,'D' : s + 0.3,'E' : s + 0.4,'F' : s + 0.5,'G' : s + 0.6})heatmap_dataOut [39]:A B C D E F GV 0.0 0.1 0.2 0.3 0.4 0.5 0.6W 0.1 0.2 0.3 0.4 0.5 0.6 0.7X 0.2 0.3 0.4 0.5 0.6 0.7 0.8Y 0.3 0.4 0.5 0.6 0.7 0.8 0.9Z 0.4 0.5 0.6 0.7 0.8 0.9 1.0In [40]:# generate the heatmapplt.imshow(heatmap_data, cmap='hot', interpolation='none')plt.colorbar() # add the scale of colors bar# set the labelsplt.xticks(range(len(heatmap_data.columns)), heatmap_data.columns)plt.yticks(range(len(heatmap_data)), heatmap_data.index); Multiple plots in a single chart It is often useful to contrast data by displaying multiple plots next to each other. This is actually quite easy to when using matplotlib. To draw multiple subplots on a grid, we can make multiple calls to plt.subplot2grid(), each time passing the size of the grid the subplot is to be located on (shape=(height, width)) and the location on the grid of the upper-left section of the subplot (loc=(row, column)). Each call to plt.subplot2grid() returns a different AxesSubplot object that can be used to reference the specific subplot and direct the rendering into. The following demonstrates this, by creating a plot with two subplots based on a two row by one column grid (shape=(2,1)). The first subplot, referred to by ax1, is located in the first row (loc=(0,0)), and the second, referred to as ax2, is in the second row (loc=(1,0)): In [41]:# create two sub plots on the new plot using a 2x1 grid# ax1 is the upper rowax1 = plt.subplot2grid(shape=(2,1), loc=(0,0))# and ax2 is in the lower rowax2 = plt.subplot2grid(shape=(2,1), loc=(1,0)) The subplots have been created, but we have not drawn into either yet. The size of any subplot can be specified using the rowspan and colspan parameters in each call to plt.subplot2grid(). This actually feels a lot like placing content in HTML tables. The following demonstrates a more complicated layout of five plots, specifying different row and column spans for each: In [42]:# layout sub plots on a 4x4 grid# ax1 on top row, 4 columns wideax1 = plt.subplot2grid((4,4), (0,0), colspan=4)# ax2 is row 2, leftmost and 2 columns wideax2 = plt.subplot2grid((4,4), (1,0), colspan=2)# ax3 is 2 cols wide and 2 rows high, starting# on second row and the third columnax3 = plt.subplot2grid((4,4), (1,2), colspan=2, rowspan=2)# ax4 1 high 1 wide, in row 4 column 0ax4 = plt.subplot2grid((4,4), (2,0))# ax4 1 high 1 wide, in row 4 column 1ax5 = plt.subplot2grid((4,4), (2,1)); To draw into a specific subplot using the pandas .plot() method, you can pass the specific axes into the plot function via the ax parameter. The following demonstrates this by extracting each series from the random walk we created at the beginning of this article, and drawing each into different subplots: In [43]:# demonstrating drawing into specific sub-plots# generate a layout of 2 rows 1 column# create the subplots, one on each rowax5 = plt.subplot2grid((2,1), (0,0))ax6 = plt.subplot2grid((2,1), (1,0))# plot column 0 of walk_df into top row of the gridwalk_df[[0]].plot(ax = ax5)# and column 1 of walk_df into bottom rowwalk_df[[1]].plot(ax = ax6); Using this technique, we can perform combinations of different series of data, such as a stock close versus volume graph. 
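When the stacked subplots share the same date axis, as in the close-versus-volume example that follows, it can also be convenient to let matplotlib link the x axes so that they stay aligned when zooming or resizing. A minimal sketch of that variant, using synthetic stand-in data rather than the book's stock data:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# assumed stand-ins for a price series and a volume series
idx = pd.date_range('2014-01-01', periods=100)
price = pd.Series(np.random.randn(100).cumsum() + 100, index=idx)
volume = pd.Series(np.random.randint(1000, 5000, 100), index=idx)

# two rows, one column, with a shared (linked) x axis
fig, (ax_top, ax_bottom) = plt.subplots(2, 1, sharex=True)
ax_top.plot(price.index, price.values)
ax_bottom.bar(volume.index, volume.values);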
Given the data we read during a previous example for Google, the following will plot the volume versus the closing price:

In [44]:
# draw the close on the top chart
top = plt.subplot2grid((4,4), (0, 0), rowspan=3, colspan=4)
top.plot(stock_data.index, stock_data['Close'], label='Close')
plt.title('Google Closing Stock Price 2011')

# draw the volume chart on the bottom
bottom = plt.subplot2grid((4,4), (3,0), rowspan=1, colspan=4)
bottom.bar(stock_data.index, stock_data['Volume'])
plt.title('Google Trading Volume')

# set the size of the plot
plt.gcf().set_size_inches(15,8)

Summary

Visualizing your data is one of the best ways to quickly understand the story being told by the data. Python, pandas, and matplotlib (and a few other libraries) provide a means of very quickly, and with a few lines of code, getting the gist of what you are trying to discover, as well as the underlying message (and displaying it beautifully too). In this article, we examined many of the most common means of visualizing data from pandas. There are also a lot of interesting visualizations that were not covered, and indeed, the concept of data visualization with pandas and/or Python is the subject of entire texts, but I believe this article provides a much-needed reference to get up and running with the visualizations that provide most of what is needed.

Resources for Article:

Further resources on this subject:
Prototyping Arduino Projects using Python [Article]
Classifying with Real-world Examples [Article]
Python functions – Avoid repeating code [Article]


Work Item Querying

Packt
07 Apr 2015
9 min read
In this article by Dipti Chhatrapati, author of Reporting in TFS, shows us that work items are the primary element project managers and team leaders focus on to track and identify the pending work to be completed. A team member uses work items to track their personal work queue. In order to achieve the current status of the project via work items, it's essential to query work items based on the requirements. This article will cover the following topics: Team project scenario Work item queries Search box queries Flat queries Direct link queries Tree queries (For more resources related to this topic, see here.) Team project scenario Here, we are considering a sports item website that helps user to buy sport items from an item gallery based on their category. The user has to register for membership in order to buy sport products such as footballs, tennis rackets, cricket bats, and so on. Moreover, a registered user can also view/add sport-related articles or news, which will be visible to everyone irrespective of whether they are anonymous or registered. This project is mapped with TFS and has a repository created in TFS Server with work items such as user stories, tasks, bugs, and test cases to plan and track the project's work. We have the following TFS configuration settings for the team project: Team Foundation Server: DIPSTFS Website project: SportsWeb Team project: SportsWebTeamProject Team Foundation Server URL: http://dipstfs:8080/tfs Team project collection URL: http://dipstfs:8080/tfs/DefaultCollection Team Project URL: http://dipstfs:8080/tfs/DefaultCollection/SportsWebTeamProject Team project administrators: DIPSTFSDipsAdministrator Team project members: DIPSTFSDipti Chhatrapati, DIPSTFSBjoern H Rapp, DIPSTFSEdric Taylor, DIPSTFSJohn Smith, DIPSTFSNelson Hall, DIPSTFSScott Harley The following figure shows the project with TFS configuration and setup: Work item queries Work item queries smoothen the process of identifying the status of the team project; this helps in creating a custom report in TFS. We can query work items by a search box or a query editor via Team Web Access. For more information on Work Item Queries, have a look at following links: http://msdn.microsoft.com/en-us/library/ms181308(v=vs.110).aspx http://msdn.microsoft.com/en-us/library/dd286638.aspx There are three types of queries: Flat queries Direct link queries Tree queries Search box queries We can find a work item using the search box available in the team project web portal, which is shown in the following screenshot: You can type in keywords in the search box located on top right of the team project web portal site; for example master, will result in the following work items: The search box content menu also has the ability to find work items based on assignment, status, created by, or work item type, as shown in the following screenshot: The search box finds items using shortcut filters or by specifying keywords or phrases, specific fields/field values, assignment or date modifications, or using the equals, contains, and not operators. For more information on search box filtering, have a look at http://msdn.microsoft.com/en-us/library/cc668120.aspx. 
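Besides the search box and the query editor, work item queries can also be issued programmatically against the team project collection. The following is a rough, hedged sketch only: it assumes a TFS version whose REST API exposes the WIQL endpoint (check your server's API documentation before relying on it), the credentials are placeholders, and on-premises servers typically require Windows/NTLM authentication rather than the basic authentication shown here. The collection URL and team project name are the ones listed earlier for this scenario:

import requests

collection_url = "http://dipstfs:8080/tfs/DefaultCollection"   # from the scenario above
project = "SportsWebTeamProject"

# a flat query expressed in WIQL: active tasks with their IDs, titles, and states
wiql = {
    "query": "SELECT [System.Id], [System.Title], [System.State] "
             "FROM WorkItems "
             "WHERE [System.WorkItemType] = 'Task' "
             "AND [System.State] = 'Active'"
}

response = requests.post(
    "{0}/{1}/_apis/wit/wiql?api-version=1.0".format(collection_url, project),
    json=wiql,
    auth=("DIPSTFS\\username", "password"),   # placeholder credentials
)
print(response.json())   # references to the matching work items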
Flat queries A flat query list of work items is used when you want to perform the following tasks: Finding a work item with an unknown ID Checking the status or other columns of work items Finding work items that you want to link to other work items Exporting work items to Microsoft Office, Microsoft Excel, and Office Project for bulk updates to column fields Generating a report about a set of work items As a general practice, to easily find work items, a team member can create Shared Queries, which are predefined queries shared across the team. They can be created, modified, and saved as a new query too. The following steps demonstrate how to open a flat query list and create a new query list: In the team project web portal, expand Shared Query List located on the left-hand side and click on the My Tasks query, as shown in the following screenshot: The resulting work items generated by the My Tasks query will be shown in the Work item pane, as shown in the following screenshot: As there are now three active tasks and two new tasks, we will create the My Active Tasks flat Query. To do so, click on Editor, as shown here: Add a clause to filter work items by Active State: Now click on the Save Query as… icon to save the query as My Active Task: Enter the query name and folder as appropriate. Here, we will save the query in the Shared Queries Folder and click on OK: Click on Results to view the work items for the My Active Tasks query and it will display the items, as shown in the following screenshot: Now let's have a look at how to create a query that represents all the work item details of different sprints/iterations. For example, you have a number of sprints in the Release 1 iteration and another release to test an application that's named Test Release 1 that you can find in Team Web Access site's settings page under the Iterations tab, as indicated in the following screenshot: In order to fetch the work item data of all the sprints to know which task is allocated to which team member in which sprint, go to the Backlogs tab and click on Create query: Specify the query name and folder location to store the query. Then click on OK: Then click on the link as indicated in the following screenshot, which will redirect you to the created query: Click on Flat list of work items and remove all the conditions except the iteration path, as shown in the following screenshot: Now save the query and run it. Add columns such as Work Item Type, State, Iteration Path, Title, and Assigned To as appropriate. As a result, this query will display the work items available under the team project for different sprints or releases, as indicated in the following screenshot: To filter work items based on the sprintreleaseiteration, change the iteration path condition for Value to Sprint 1, as indicated in the following screenshot: Finally, save and run the query, which will return the work items available under Sprint 1 of the Release 1 iteration: For more information on flat queries, have a look at http://msdn.microsoft.com/en-us/library/ms181308(v=vs.110).aspx. Direct link queries There are work items that are dependent on other work items such as tasks, bugs, and issues, and they can be tracked using direct links. They help determine risks and dependencies in order to collaborate among teams effectively. 
Direct link queries help perform the following tasks: Creating a custom view of linked work items Tracking dependencies across team projects and manage the commitments made to other project teams Assessing changes to work items that you do not own but that your work items depend on The following steps demonstrate how to generate a linked query list: Open My Tasks List from Shared Queries. Click on Editor. Click on Work items and direct links, as shown in the following screenshot: Specify the clause for the work item type: Task in Filters for linked work items: We can filter the first level work items by choosing the following option: The meanings of the filter options are described as follows: Only return work items that have the specified links: This option returns only the top-level work items that have links to work items. Return all top level work items: This option returns all the work items whether they have linked work items or not. This option also returns the second-level work items that are linked to the first-level work items. Only return work items that do not have the specified links: This option returns only the top-level work items those are not linked to any work items. Run the query, save it as My Linked Tasks and click on OK: Click on Results to view the linked tasks as configured previously. For more information on direct link queries, have a look at http://msdn.microsoft.com/en-us/library/dd286501(v=vs.110).aspx. Tree queries To view nested work items, tree queries are used by selecting the Tree of Work Items query type. Tree queries are used to execute following tasks: Viewing the hierarchy Finding parent or child work items Changing the tree hierarchy Exporting the tree view to Microsoft Excel for either bulk updates to column fields or to change the tree hierarchy The following steps demonstrate how to generate a tree query list: Open the My Tasks list from Shared Queries. Click on Editor. Click on Tree of work items, as shown in the following screenshot: Define the filter criteria for both parent and child work items. Specify the clause for work item type: Task in Filters for linked work items. Also, select Match top-level work items first. We can filter linked work items by choosing the following option: To find linked children, select Match top-level work items first and, to find linked parents, select Match linked work items first. Run the query, save it as My Tree Tasks, and click on OK. Click on Results to view the linked tasks as configured previously: For more information on Tree queries, have a look at: http://msdn.microsoft.com/en-us/library/dd286633(v=vs.110).aspx Summary In this article, we reviewed the team project scenario; and we also walked through the types of work item queries that produce work items we need in order to know the status of work progress. Resources for Article: Further resources on this subject: Creating a basic JavaScript plugin [article] Building Financial Functions into Excel 2010 [article] Team Foundation Server 2012 [article]


Working with Blender

Packt
06 Apr 2015
15 min read
In this article by Jos Dirksen, author of Learning Three.js – the JavaScript 3D Library for WebGL - Second Edition, we will learn about Blender and also about how to load models in Three.js using different formats. (For more resources related to this topic, see here.) Before we get started with the configuration, we'll show the result that we'll be aiming for. In the following screenshot, you can see a simple Blender model that we exported with the Three.js plugin and imported in Three.js with THREE.JSONLoader: Installing the Three.js exporter in Blender To get Blender to export Three.js models, we first need to add the Three.js exporter to Blender. The following steps are for Mac OS X but are pretty much the same on Windows and Linux. You can download Blender from www.blender.org and follow the platform-specific installation instructions. After installation, you can add the Three.js plugin. First, locate the addons directory from your Blender installation using a terminal window: On my Mac, it's located here: ./blender.app/Contents/MacOS/2.70/scripts/addons. For Windows, this directory can be found at the following location: C:UsersUSERNAMEAppDataRoamingBlender FoundationBlender2.7Xscriptsaddons. And for Linux, you can find this directory here: /home/USERNAME/.config/blender/2.7X/scripts/addons. Next, you need to get the Three.js distribution and unpack it locally. In this distribution, you can find the following folder: utils/exporters/blender/2.65/scripts/addons/. In this directory, there is a single subdirectory with the name io_mesh_threejs. Copy this directory to the addons folder of your Blender installation. Now, all we need to do is start Blender and enable the exporter. In Blender, open Blender User Preferences (File | User Preferences). In the window that opens, select the Addons tab, and in the search box, type three. This will show the following screen: At this point, the Three.js plugin is found, but it is still disabled. Check the small checkbox to the right, and the Three.js exporter will be enabled. As a final check to see whether everything is working correctly, open the File | Export menu option, and you'll see Three.js listed as an export option. This is shown in the following screenshot: With the plugin installed, we can load our first model. Loading and exporting a model from Blender As an example, we've added a simple Blender model named misc_chair01.blend in the assets/models folder, which you can find in the sources for this article. In this section, we'll load this model and show the minimal steps it takes to export this model to Three.js. First, we need to load this model in Blender. Use File | Open and navigate to the folder containing the misc_chair01.blend file. Select this file and click on Open. This will show you a screen that looks somewhat like this: Exporting this model to the Three.js JSON format is pretty straightforward. From the File menu, open Export | Three.js, type in the name of the export file, and select Export Three.js. This will create a JSON file in a format Three.js understands. A part of the contents of this file is shown next: {   "metadata" : {    "formatVersion" : 3.1,    "generatedBy"   : "Blender 2.7 Exporter",    "vertices"     : 208,    "faces"         : 124,    "normals"       : 115,    "colors"       : 0,    "uvs"          : [270,151],    "materials"     : 1,    "morphTargets" : 0,    "bones"         : 0 }, ... However, we aren't completely done. In the previous screenshot, you can see that the chair contains a wooden texture. 
If you look through the JSON export, you can see that the export for the chair also specifies a material, as follows: "materials": [{ "DbgColor": 15658734, "DbgIndex": 0, "DbgName": "misc_chair01", "blending": "NormalBlending", "colorAmbient": [0.53132, 0.25074, 0.147919], "colorDiffuse": [0.53132, 0.25074, 0.147919], "colorSpecular": [0.0, 0.0, 0.0], "depthTest": true, "depthWrite": true, "mapDiffuse": "misc_chair01_col.jpg", "mapDiffuseWrap": ["repeat", "repeat"], "shading": "Lambert", "specularCoef": 50, "transparency": 1.0, "transparent": false, "vertexColors": false }], This material specifies a texture, misc_chair01_col.jpg, for the mapDiffuse property. So, besides exporting the model, we also need to make sure the texture file is also available to Three.js. Luckily, we can save this texture directly from Blender. In Blender, open the UV/Image Editor view. You can select this view from the drop-down menu on the left-hand side of the File menu option. This will replace the top menu with the following: Make sure the texture you want to export is selected, misc_chair_01_col.jpg in our case (you can select a different one using the small image icon). Next, click on the Image menu and use the Save as Image menu option to save the image. Save it in the same folder where you saved the model using the name specified in the JSON export file. At this point, we're ready to load the model into Three.js. The code to load this into Three.js at this point looks like this: var loader = new THREE.JSONLoader(); loader.load('../assets/models/misc_chair01.js', function (geometry, mat) { mesh = new THREE.Mesh(geometry, mat[0]);   mesh.scale.x = 15; mesh.scale.y = 15; mesh.scale.z = 15;   scene.add(mesh);   }, '../assets/models/'); We've already seen JSONLoader before, but this time, we use the load function instead of the parse function. In this function, we specify the URL we want to load (points to the exported JSON file), a callback that is called when the object is loaded, and the location, ../assets/models/, where the texture can be found (relative to the page). This callback takes two parameters: geometry and mat. The geometry parameter contains the model, and the mat parameter contains an array of material objects. We know that there is only one material, so when we create THREE.Mesh, we directly reference that material. If you open the 05-blender-from-json.html example, you can see the chair we just exported from Blender. Using the Three.js exporter isn't the only way of loading models from Blender into Three.js. Three.js understands a number of 3D file formats, and Blender can export in a couple of those formats. Using the Three.js format, however, is very easy, and if things go wrong, they are often quickly found. In the following section, we'll look at a couple of the formats Three.js supports and also show a Blender-based example for the OBJ and MTL file formats. Importing from 3D file formats At the beginning of this article, we listed a number of formats that are supported by Three.js. In this section, we'll quickly walk through a couple of examples for those formats. Note that for all these formats, an additional JavaScript file needs to be included. You can find all these files in the Three.js distribution in the examples/js/loaders directory. The OBJ and MTL formats OBJ and MTL are companion formats and often used together. The OBJ file defines the geometry, and the MTL file defines the materials that are used. Both OBJ and MTL are text-based formats. 
A part of an OBJ file looks like this: v -0.032442 0.010796 0.025935 v -0.028519 0.013697 0.026201 v -0.029086 0.014533 0.021409 usemtl Material s 1 f 2731 2735 2736 2732 f 2732 2736 3043 3044 The MTL file defines materials like this: newmtl Material Ns 56.862745 Ka 0.000000 0.000000 0.000000 Kd 0.360725 0.227524 0.127497 Ks 0.010000 0.010000 0.010000 Ni 1.000000 d 1.000000 illum 2 The OBJ and MTL formats by Three.js are understood well and are also supported by Blender. So, as an alternative, you could choose to export models from Blender in the OBJ/MTL format instead of the Three.js JSON format. Three.js has two different loaders you can use. If you only want to load the geometry, you can use OBJLoader. We used this loader for our example (06-load-obj.html). The following screenshot shows this example: To import this in Three.js, you have to add the OBJLoader JavaScript file: <script type="text/javascript" src="../libs/OBJLoader.js"> </script> Import the model like this: var loader = new THREE.OBJLoader(); loader.load('../assets/models/pinecone.obj', function (loadedMesh) { var material = new THREE.MeshLambertMaterial({color: 0x5C3A21});   // loadedMesh is a group of meshes. For // each mesh set the material, and compute the information // three.js needs for rendering. loadedMesh.children.forEach(function (child) {    child.material = material;    child.geometry.computeFaceNormals();    child.geometry.computeVertexNormals(); });   mesh = loadedMesh; loadedMesh.scale.set(100, 100, 100); loadedMesh.rotation.x = -0.3; scene.add(loadedMesh); }); In this code, we use OBJLoader to load the model from a URL. Once the model is loaded, the callback we provide is called, and we add the model to the scene. Usually, a good first step is to print out the response from the callback to the console to understand how the loaded object is built up. Often with these loaders, the geometry or mesh is returned as a hierarchy of groups. Understanding this makes it much easier to place and apply the correct material and take any other additional steps. Also, look at the position of a couple of vertices to determine whether you need to scale the model up or down and where to position the camera. In this example, we've also made the calls to computeFaceNormals and computeVertexNormals. This is required to ensure that the material used (THREE.MeshLambertMaterial) is rendered correctly. The next example (07-load-obj-mtl.html) uses OBJMTLLoader to load a model and directly assign a material. 
The following screenshot shows this example: First, we need to add the correct loaders to the page: <script type="text/javascript" src="../libs/OBJLoader.js"> </script> <script type="text/javascript" src="../libs/MTLLoader.js"> </script> <script type="text/javascript" src="../libs/OBJMTLLoader.js"> </script> We can load the model from the OBJ and MTL files like this: var loader = new THREE.OBJMTLLoader(); loader.load('../assets/models/butterfly.obj', '../assets/ models/butterfly.mtl', function(object) { // configure the wings var wing2 = object.children[5].children[0]; var wing1 = object.children[4].children[0];   wing1.material.opacity = 0.6; wing1.material.transparent = true; wing1.material.depthTest = false; wing1.material.side = THREE.DoubleSide;   wing2.material.opacity = 0.6; wing2.material.depthTest = false; wing2.material.transparent = true; wing2.material.side = THREE.DoubleSide;   object.scale.set(140, 140, 140); mesh = object; scene.add(mesh);   mesh.rotation.x = 0.2; mesh.rotation.y = -1.3; }); The first thing to mention before we look at the code is that if you receive an OBJ file, an MTL file, and the required texture files, you'll have to check how the MTL file references the textures. These should be referenced relative to the MTL file and not as an absolute path. The code itself isn't that different from the one we saw for THREE.ObjLoader. We specify the location of the OBJ file, the location of the MTL file, and the function to call when the model is loaded. The model we've used as an example in this case is a complex model. So, we set some specific properties in the callback to fix some rendering issues, as follows: The opacity in the source files was set incorrectly, which caused the wings to be invisible. So, to fix that, we set the opacity and transparent properties ourselves. By default, Three.js only renders one side of an object. Since we look at the wings from two sides, we need to set the side property to the THREE.DoubleSide value. The wings caused some unwanted artifacts when they needed to be rendered on top of each other. We've fixed that by setting the depthTest property to false. This has a slight impact on performance but can often solve some strange rendering artifacts. But, as you can see, you can easily load complex models directly into Three.js and render them in real time in your browser. You might need to fine-tune some material properties though. Loading a Collada model Collada models (extension is .dae) are another very common format for defining scenes and models (and animations as well). In a Collada model, it is not just the geometry that is defined, but also the materials. It's even possible to define light sources. To load Collada models, you have to take pretty much the same steps as for the OBJ and MTL models. You start by including the correct loader: <script type="text/javascript" src="../libs/ColladaLoader.js"> </script> For this example, we'll load the following model: Loading a truck model is once again pretty simple: var mesh; loader.load("../assets/models/dae/Truck_dae.dae", function   (result) { mesh = result.scene.children[0].children[0].clone(); mesh.scale.set(4, 4, 4); scene.add(mesh); }); The main difference here is the result of the object that is returned to the callback. The result object has the following structure: var result = {   scene: scene, morphs: morphs, skins: skins, animations: animData, dae: {    ... } }; In this article, we're interested in the objects that are in the scene parameter. 
I first printed out the scene to the console to look where the mesh was that I was interested in, which was result.scene.children[0].children[0]. All that was left to do was scale it to a reasonable size and add it to the scene. A final note on this specific example—when I loaded this model for the first time, the materials didn't render correctly. The reason was that the textures used the .tga format, which isn't supported in WebGL. To fix this, I had to convert the .tga files to .png and edit the XML of the .dae model to point to these .png files. As you can see, for most complex models, including materials, you often have to take some additional steps to get the desired results. By looking closely at how the materials are configured (using console.log()) or replacing them with test materials, problems are often easy to spot. Loading the STL, CTM, VTK, AWD, Assimp, VRML, and Babylon models We're going to quickly skim over these file formats as they all follow the same principles: Include [NameOfFormat]Loader.js in your web page. Use [NameOfFormat]Loader.load() to load a URL. Check what the response format for the callback looks like and render the result. We have included an example for all these formats: Name Example Screenshot STL 08-load-STL.html CTM 09-load-CTM.html VTK 10-load-vtk.html AWD 11-load-awd.html Assimp 12-load-assimp.html VRML 13-load-vrml.html Babylon The Babylon loader is slightly different from the other loaders in this table. With this loader, you don't load a single THREE.Mesh or THREE.Geometry instance, but with this loader, you load a complete scene, including lights.   14-load-babylon.html If you look at the source code for these examples, you might see that for some of them, we need to change some material properties or do some scaling before the model is rendered correctly. The reason we need to do this is because of the way the model is created in its external application, giving it different dimensions and grouping than we normally use in Three.js. Summary In this article, we've almost shown all the supported file formats. Using models from external sources isn't that hard to do in Three.js. Especially for simple models, you only have to take a few simple steps. When working with external models, or creating them using grouping and merging, it is good to keep a couple of things in mind. The first thing you need to remember is that when you group objects, they still remain available as individual objects. Transformations applied to the parent also affect the children, but you can still transform the children individually. Besides grouping, you can also merge geometries together. With this approach, you lose the individual geometries and get a single new geometry. This is especially useful when you're dealing with thousands of geometries you need to render and you're running into performance issues. Three.js supports a large number of external formats. When using these format loaders, it's a good idea to look through the source code and log out the information received in the callback. This will help you to understand the steps you need to take to get the correct mesh and set it to the correct position and scale. Often, when the model doesn't show correctly, this is caused by its material settings. It could be that incompatible texture formats are used, opacity is incorrectly defined, or the format contains incorrect links to the texture images. 
It is usually a good idea to use a test material to determine whether the model itself is loaded correctly and log the loaded material to the JavaScript console to check for unexpected values. It is also possible to export meshes and scenes, but remember that GeometryExporter, SceneExporter, and SceneLoader of Three.js are still work in progress. Resources for Article: Further resources on this subject: Creating the maze and animating the cube [article] Mesh animation [article] Working with the Basic Components That Make Up a Three.js Scene [article]

Machine Learning Using Spark MLlib

Packt
01 Apr 2015
22 min read
This Spark machine learning tutorial is by Krishna Sankar, the author of Fast Data Processing with Spark Second Edition. One of the major attractions of Spark is the ability to scale computation massively, and that is exactly what you need for machine learning algorithms. But the caveat is that all machine learning algorithms cannot be effectively parallelized. Each algorithm has its own challenges for parallelization, whether it is task parallelism or data parallelism. Having said that, Spark is becoming the de-facto platform for building machine learning algorithms and applications. For example, Apache Mahout is moving away from Hadoop MapReduce and implementing the algorithms in Spark (see the first reference at the end of this article). The developers working on the Spark MLlib are implementing more and more machine algorithms in a scalable and concise manner in the Spark framework. For the latest information on this, you can refer to the Spark site at https://spark.apache.org/docs/latest/mllib-guide.html, which is the authoritative source. This article covers the following machine learning algorithms: Basic statistics Linear regression Classification Clustering Recommendations The Spark machine learning algorithm table The Spark machine learning algorithms implemented in Spark 1.1.0 org.apache.spark.mllib for Scala and Java, and in pyspark.mllib for Python is shown in the following table: Algorithm Feature Notes Basic statistics Summary statistics Mean, variance, count, max, min, and numNonZeros   Correlations Spearman and Pearson correlation   Stratified sampling sampleBykey, sampleByKeyExact—With and without replacement   Hypothesis testing Pearson's chi-squared goodness of fit test   Random data generation RandomRDDs Normal, Poisson, and so on Regression Linear models Linear regression—least square, Lasso, and ridge regression Classification Binary classification Logistic regression, SVM, decision trees, and naïve Bayes   Multi-class classification Decision trees, naïve Bayes, and so on Recommendation Collaborative filtering Alternating least squares Clustering k-means   Dimensionality reduction SVD PCA   Feature extraction TF-IDF Word2Vec StandardScaler Normalizer   Optimization SGD L-BFGS   Spark MLlib examples Now, let's look at how to use the algorithms. Naturally, we need interesting datasets to implement the algorithms; we will use appropriate datasets for the algorithms shown in the next section. The code and data files are available in the GitHub repository at https://github.com/xsankar/fdps-vii. We'll keep it updated with corrections. Basic statistics Let's read the car mileage data into an RDD and then compute some basic statistics. We will use a simple parse class to parse a line of data. This will work if you know the type and the structure of your CSV file. We will use this technique for the examples in this article: import org.apache.spark.SparkContext import org.apache.spark.mllib.stat. {MultivariateStatisticalSummary, Statistics} import org.apache.spark.mllib.linalg.Vector import org.apache.spark.mllib.linalg.Vectors import org.apache.spark.rdd.RDD   object MLlib01 { // def getCurrentDirectory = new java.io.File( "." 
).getCanonicalPath // def parseCarData(inpLine : String) : Array[Double] = {    val values = inpLine.split(',')    val mpg = values(0).toDouble    val displacement = values(1).toDouble    val hp = values(2).toInt    val torque = values(3).toInt    val CRatio = values(4).toDouble    val RARatio = values(5).toDouble    val CarbBarrells = values(6).toInt    val NoOfSpeed = values(7).toInt    val length = values(8).toDouble    val width = values(9).toDouble    val weight = values(10).toDouble    val automatic = values(11).toInt    return Array(mpg,displacement,hp,    torque,CRatio,RARatio,CarbBarrells,    NoOfSpeed,length,width,weight,automatic) } // def main(args: Array[String]) {    println(getCurrentDirectory)    val sc = new SparkContext("local","Chapter 9")    println(s"Running Spark Version ${sc.version}")    //    val dataFile = sc.textFile("/Users/ksankar/fdps-vii/data/car-     milage-no-hdr.csv")    val carRDD = dataFile.map(line => parseCarData(line))    //    // Let us find summary statistics    //    val vectors: RDD[Vector] = carRDD.map(v => Vectors.dense(v))    val summary = Statistics.colStats(vectors)    carRDD.foreach(ln=> {ln.foreach(no => print("%6.2f | "     .format(no))); println()})    print("Max :");summary.max.toArray.foreach(m => print("%5.1f |     ".format(m)));println    print("Min :");summary.min.toArray.foreach(m => print("%5.1f |     ".format(m)));println    print("Mean :");summary.mean.toArray.foreach(m => print("%5.1f     | ".format(m)));println    } } This program will produce the following output: Let's also run some correlations, as shown here: // // correlations // val hp = vectors.map(x => x(2)) val weight = vectors.map(x => x(10)) var corP = Statistics.corr(hp,weight,"pearson") // default println("hp to weight : Pearson Correlation = %2.4f".format(corP)) var corS = Statistics.corr(hp,weight,"spearman") // Need to   specify println("hp to weight : Spearman Correlation = %2.4f" .format(corS)) // val raRatio = vectors.map(x => x(5)) val width = vectors.map(x => x(9)) corP = Statistics.corr(raRatio,width,"pearson") // default println("raRatio to width : Pearson Correlation = %2.4f" .format(corP)) corS = Statistics.corr(raRatio,width,"spearman") // Need to   specify println("raRatio to width : Spearman Correlation = %2.4f" .format(corS)) // This will produce interesting results as shown in the next screenshot: While this might seem too much work to calculate the correlation of a tiny dataset, remember that this will scale to datasets consisting of 1,000,000 rows or even a billion rows! Linear regression Linear regression takes a little more work than statistics. We need the LabeledPoint class as well as a few more parameters such as the learning rate, that is, the step size. 
We will also split the dataset into training and test, as shown here:    //    // def carDataToLP(inpArray : Array[Double]) : LabeledPoint = {    return new LabeledPoint( inpArray(0),Vectors.dense (       inpArray(1), inpArray(2), inpArray(3),       inpArray(4), inpArray(5), inpArray(6), inpArray(7),       inpArray(8), inpArray(9), inpArray(10), inpArray(11) ) )    } // Linear Regression    //    val carRDDLP = carRDD.map(x => carDataToLP(x)) // create a     labeled point RDD    println(carRDDLP.count())    println(carRDDLP.first().label)    println(carRDDLP.first().features)    //    // Let us split the data set into training & test set using a     very simple filter    //    val carRDDLPTrain = carRDDLP.filter( x => x.features(9) <=     4000)    val carRDDLPTest = carRDDLP.filter( x => x.features(9) > 4000)    println("Training Set : " + "%3d".format     (carRDDLPTrain.count()))    println("Training Set : " + "%3d".format(carRDDLPTest.count()))    //    // Train a Linear Regression Model    // numIterations = 100, stepsize = 0.000000001    // without such a small step size the algorithm will diverge    //    val mdlLR = LinearRegressionWithSGD.train     (carRDDLPTrain,100,0.000000001)    println(mdlLR.intercept) // Intercept is turned off when using     LinearRegressionSGD object, so intercept will always be 0 for     this code      println(mdlLR.weights)    //    // Now let us use the model to predict our test set    //    val valuesAndPreds = carRDDLPTest.map(p => (p.label,     mdlLR.predict(p.features)))    val mse = valuesAndPreds.map( vp => math.pow( (vp._1 - vp._2),2     ) ).        reduce(_+_) / valuesAndPreds.count()    println("Mean Squared Error     = " + "%6.3f".format(mse))    println("Root Mean Squared Error = " + "%6.3f"     .format(math.sqrt(mse)))    // Let us print what the model predicted    valuesAndPreds.take(20).foreach(m => println("%5.1f | %5.1f |"     .format(m._1,m._2))) The run result will be as expected, as shown in the next screenshot: The prediction is not that impressive. There are a couple of reasons for this. There might be quadratic effects; some of the variables might be correlated (for example, length, width, and weight, and so we might not need all three to predict the mpg value). Finally, we might not need all the 10 features anyways. I leave it to you to try with different combinations of features. (In the parseCarData function, take only a subset of the variables; for example take hp, weight, and number of speed and see which combination minimizes the mse value.) Classification Classification is very similar to linear regression. The algorithms take labeled points, and the train process has various parameters to tweak the algorithm to fit the needs of an application. The returned model can be used to predict the class of a labeled point. Here is a quick example using the titanic dataset: For our example, we will keep the same structure as the linear regression example. First, we will parse the full dataset line and then later keep it simple by creating a labeled point with a set of selected features, as shown in the following code: import org.apache.spark.SparkContext import org.apache.spark.mllib.regression.LabeledPoint import org.apache.spark.mllib.linalg.Vectors import org.apache.spark.mllib.tree.DecisionTree   object Chapter0802 { // def getCurrentDirectory = new java.io.File( "."     
).getCanonicalPath // // 0 pclass,1 survived,2 l.name,3.f.name, 4 sex,5 age,6 sibsp,7       parch,8 ticket,9 fare,10 cabin, // 11 embarked,12 boat,13 body,14 home.dest // def str2Double(x: String) : Double = {    try {      x.toDouble    } catch {      case e: Exception => 0.0    } } // def parsePassengerDataToLP(inpLine : String) : LabeledPoint = {    val values = inpLine.split(',')    //println(values)    //println(values.length)    //    val pclass = str2Double(values(0))    val survived = str2Double(values(1))    // skip last name, first name    var sex = 0    if (values(4) == "male") {      sex = 1    }    var age = 0.0 // a better choice would be the average of all       ages    age = str2Double(values(5))    //    var sibsp = 0.0    age = str2Double(values(6))    //    var parch = 0.0    age = str2Double(values(7))    //    var fare = 0.0    fare = str2Double(values(9))    return new LabeledPoint(survived,Vectors.dense     (pclass,sex,age,sibsp,parch,fare)) } Now that we have setup the routines to parse the data, let's dive into the main program: // def main(args: Array[String]): Unit = {    println(getCurrentDirectory)    val sc = new SparkContext("local","Chapter 8")    println(s"Running Spark Version ${sc.version}")    //    val dataFile = sc.textFile("/Users/ksankar/bdtc-2014     /titanic/titanic3_01.csv")    val titanicRDDLP = dataFile.map(_.trim).filter( _.length > 1).      map(line => parsePassengerDataToLP(line))    //    println(titanicRDDLP.count())    //titanicRDDLP.foreach(println)    //    println(titanicRDDLP.first().label)    println(titanicRDDLP.first().features)    //    val categoricalFeaturesInfo = Map[Int, Int]()    val mdlTree = DecisionTree.trainClassifier(titanicRDDLP, 2, //       numClasses        categoricalFeaturesInfo, // all features are continuous        "gini", // impurity        5, // Maxdepth        32) //maxBins    //    println(mdlTree.depth)    println(mdlTree) The tree is interesting to inspect. Check it out here:    //    // Let us predict on the dataset and see how well it works.    // In the real world, we should split the data to train & test       and then predict the test data:    //    val predictions = mdlTree.predict(titanicRDDLP.     map(x=>x.features))    val labelsAndPreds = titanicRDDLP.     map(x=>x.label).zip(predictions)    //    val mse = labelsAndPreds.map( vp => math.pow( (vp._1 -       vp._2),2 ) ).        reduce(_+_) / labelsAndPreds.count()    println("Mean Squared Error = " + "%6f".format(mse))    //    // labelsAndPreds.foreach(println)    //    val correctVals = labelsAndPreds.aggregate(0.0)((x, rec) => x       + (rec._1 == rec._2).compare(false), _ + _)    val accuracy = correctVals/labelsAndPreds.count()    println("Accuracy = " + "%3.2f%%".format(accuracy*100))    //    println("*** Done ***") } } The result obtained when you run the program is as expected. The printout of the tree is interesting, as shown here: Running Spark Version 1.1.1 14/11/28 18:41:27 INFO MemoryStore: ensureFreeSpace(163705) called with curMem=0, maxMem=2061647216 [..] 14/11/28 18:41:27 INFO SparkContext: Job finished: count at Chapter0802.scala:56, took 0.260993 s 1309 14/11/28 18:41:27 INFO SparkContext: Starting job: first at Chapter0802.scala:59 [..] 14/11/28 18:41:27 INFO SparkContext: Job finished: first at Chapter0802.scala:59, took 0.016479 s 1.0 14/11/28 18:41:27 INFO SparkContext: Starting job: first at Chapter0802.scala:60 [..] 
14/11/28 18:41:27 INFO SparkContext: Job finished: first at Chapter0802.scala:60, took 0.014408 s [1.0,0.0,0.0,0.0,0.0,211.3375] 14/11/28 18:41:27 INFO SparkContext: Starting job: take at DecisionTreeMetadata.scala:66 [..] 14/11/28 18:41:28 INFO DecisionTree: Internal timing for DecisionTree: 14/11/28 18:41:28 INFO DecisionTree:   init: 0.36408 total: 0.95518 extractNodeInfo: 7.3E-4 findSplitsBins: 0.249814 extractInfoForLowerLevels: 7.74E-4 findBestSplits: 0.565394 chooseSplits: 0.201012 aggregation: 0.362411 5 DecisionTreeModel classifier If (feature 1 <= 0.0)    If (feature 0 <= 2.0)    If (feature 5 <= 26.0)      If (feature 2 <= 1.0)      If (feature 0 <= 1.0)        Predict: 1.0      Else (feature 0 > 1.0)        Predict: 1.0      Else (feature 2 > 1.0)      Predict: 1.0    Else (feature 5 > 26.0)      If (feature 2 <= 1.0)      If (feature 5 <= 38.0021)        Predict: 1.0      Else (feature 5 > 38.0021)        Predict: 1.0      Else (feature 2 > 1.0)      If (feature 5 <= 79.42500000000001)        Predict: 1.0      Else (feature 5 > 79.42500000000001)        Predict: 1.0    Else (feature 0 > 2.0)    If (feature 5 <= 25.4667)      If (feature 5 <= 7.2292)      If (feature 5 <= 7.05)        Predict: 1.0      Else (feature 5 > 7.05)        Predict: 1.0      Else (feature 5 > 7.2292)      If (feature 5 <= 15.5646)        Predict: 0.0      Else (feature 5 > 15.5646)        Predict: 1.0    Else (feature 5 > 25.4667)      If (feature 5 <= 38.0021)      If (feature 5 <= 30.6958)        Predict: 0.0      Else (feature 5 > 30.6958)        Predict: 0.0      Else (feature 5 > 38.0021)      Predict: 0.0 Else (feature 1 > 0.0)    If (feature 0 <= 1.0)    If (feature 5 <= 26.0)      If (feature 5 <= 7.05)      If (feature 5 <= 0.0)        Predict: 0.0      Else (feature 5 > 0.0)        Predict: 0.0      Else (feature 5 > 7.05)      Predict: 0.0    Else (feature 5 > 26.0)      If (feature 5 <= 30.6958)      If (feature 2 <= 0.0)        Predict: 0.0      Else (feature 2 > 0.0)        Predict: 0.0      Else (feature 5 > 30.6958)      If (feature 2 <= 1.0)        Predict: 0.0      Else (feature 2 > 1.0)        Predict: 1.0    Else (feature 0 > 1.0)    If (feature 2 <= 0.0)      If (feature 5 <= 38.0021)      If (feature 5 <= 14.4583)        Predict: 0.0      Else (feature 5 > 14.4583)        Predict: 0.0      Else (feature 5 > 38.0021)      If (feature 0 <= 2.0)        Predict: 0.0      Else (feature 0 > 2.0)        Predict: 1.0    Else (feature 2 > 0.0)      If (feature 5 <= 26.0)      If (feature 2 <= 1.0)        Predict: 0.0      Else (feature 2 > 1.0)        Predict: 0.0      Else (feature 5 > 26.0)      If (feature 0 <= 2.0)        Predict: 0.0      Else (feature 0 > 2.0)        Predict: 0.0   14/11/28 18:41:28 INFO SparkContext: Starting job: reduce at Chapter0802.scala:79 [..] 14/11/28 18:41:28 INFO SparkContext: Job finished: count at Chapter0802.scala:79, took 0.077973 s Mean Squared Error = 0.200153 14/11/28 18:41:28 INFO SparkContext: Starting job: aggregate at Chapter0802.scala:84 [..] 14/11/28 18:41:28 INFO SparkContext: Job finished: count at Chapter0802.scala:85, took 0.042592 s Accuracy = 79.98% *** Done *** In the real world, one would create a training and a test dataset and train the model on the training dataset and then predict on the test dataset. Then we can calculate the mse and minimize it on various feature combinations, some of which could also be engineered features. Clustering Spark MLlib has implemented the k-means clustering algorithm. 
The model training and prediction interfaces are similar to other machine learning algorithms. Let's see how it works by going through an example. Let's use a sample data that has two dimensions x and y. The plot of the points would look like the following screenshot: From the preceding graph, we can see that four clusters form one solution. Let's try with k=2 and k=4. Let's see how the Spark clustering algorithm handles this dataset and the groupings: import org.apache.spark.SparkContext import org.apache.spark.mllib.linalg.{Vector,Vectors} import org.apache.spark.mllib.clustering.KMeans   object Chapter0803 { def parsePoints(inpLine : String) : Vector = {    val values = inpLine.split(',')    val x = values(0).toInt    val y = values(1).toInt    return Vectors.dense(x,y) } //   def main(args: Array[String]): Unit = {    val sc = new SparkContext("local","Chapter 8")    println(s"Running Spark Version ${sc.version}")    //    val dataFile = sc.textFile("/Users/ksankar/bdtc-2014/cluster-     points/cluster-points.csv")    val points = dataFile.map(_.trim).filter( _.length > 1).     map(line => parsePoints(line))    //  println(points.count())    //    var numClusters = 2    val numIterations = 20    var mdlKMeans = KMeans.train(points, numClusters,       numIterations)    //    println(mdlKMeans.clusterCenters)    //    var clusterPred = points.map(x=>mdlKMeans.predict(x))    var clusterMap = points.zip(clusterPred)    //    clusterMap.foreach(println)    //    clusterMap.saveAsTextFile("/Users/ksankar/bdtc-2014/cluster-     points/2-cluster.csv")    //    // Now let us try 4 centers:    //    numClusters = 4    mdlKMeans = KMeans.train(points, numClusters, numIterations)    clusterPred = points.map(x=>mdlKMeans.predict(x))    clusterMap = points.zip(clusterPred)    clusterMap.saveAsTextFile("/Users/ksankar/bdtc-2014/cluster-     points/4-cluster.csv")    clusterMap.foreach(println) } } The results of the run would be as shown in the next screenshot (your run could give slightly different results): The k=2 graph shown in the next screenshot looks as expected: With k=4 the results are as shown in the following screenshot: The plot shown in the following screenshot confirms that the clusters are obtained as expected. Spark does understand clustering! Bear in mind that the results could vary a little between runs because the clustering algorithm picks the centers randomly and grows from there. With k=4, the results are stable; but with k=2, there is room for partitioning the points in different ways. Try it out a few times and see the results. Recommendation The recommendation algorithms fall under five general mechanisms, namely, knowledge-based, demographic-based, content-based, collaborative filtering (item-based or user-based), and latent factor-based. Usually, the collaborative filtering is computationally intensive—Spark implements the Alternating Least Square (ALS) algorithm authored by Yehuda Koren, available at http://dl.acm.org/citation.cfm?id=1608614. It is user-based collaborative filtering using the method of learning latent factors, which can scale to a large dataset. Let's quickly use the movielens medium dataset to implement a recommendation using Spark. There are some interesting RDD transformations. 
Apart from that, the code is not that complex, as shown next: import org.apache.spark.SparkContext import org.apache.spark.SparkContext._ // for implicit   conversations import org.apache.spark.mllib.recommendation.Rating import org.apache.spark.mllib.recommendation.ALS   object Chapter0804 { def parseRating1(line : String) : (Int,Int,Double,Int) = {    //println(x)    val x = line.split("::")    val userId = x(0).toInt    val movieId = x(1).toInt    val rating = x(2).toDouble    val timeStamp = x(3).toInt/10    return (userId,movieId,rating,timeStamp) } // def parseRating(x : (Int,Int,Double,Int)) : Rating = {    val userId = x._1    val movieId = x._2    val rating = x._3    val timeStamp = x._4 // ignore    return new Rating(userId,movieId,rating) } // Now that we have the parsers in place, let's focus on the main program, as shown next: def main(args: Array[String]): Unit = {    val sc = new SparkContext("local","Chapter 8")    println(s"Running Spark Version ${sc.version}")    //    val moviesFile = sc.textFile("/Users/ksankar/bdtc-     2014/movielens/medium/movies.dat")    val moviesRDD = moviesFile.map(line => line.split("::"))    println(moviesRDD.count())    //    val ratingsFile = sc.textFile("/Users/ksankar/bdtc-     2014/movielens/medium/ratings.dat")    val ratingsRDD = ratingsFile.map(line => parseRating1(line))    println(ratingsRDD.count())    //    ratingsRDD.take(5).foreach(println) // always check the RDD    //    val numRatings = ratingsRDD.count()    val numUsers = ratingsRDD.map(r => r._1).distinct().count()    val numMovies = ratingsRDD.map(r => r._2).distinct().count()    println("Got %d ratings from %d users on %d movies.".          format(numRatings, numUsers, numMovies)) Split the dataset into training, validation, and test. We can use any random dataset. But here we will use the last digit of the timestamp: val trainSet = ratingsRDD.filter(x => (x._4 % 10) < 6) .map(x=>parseRating(x))    val validationSet = ratingsRDD.filter(x => (x._4 % 10) >= 6 &       (x._4 % 10) < 8).map(x=>parseRating(x))    val testSet = ratingsRDD.filter(x => (x._4 % 10) >= 8)     .map(x=>parseRating(x))    println("Training: "+ "%d".format(trainSet.count()) +      ", validation: " + "%d".format(validationSet.count()) + ",         test: " + "%d".format(testSet.count()) + ".")    //    // Now train the model using the training set:    val rank = 10    val numIterations = 20    val mdlALS = ALS.train(trainSet,rank,numIterations)    //    // prepare validation set for prediction    //    val userMovie = validationSet.map {      case Rating(user, movie, rate) =>(user, movie)    }    //    // Predict and convert to Key-Value PairRDD    val predictions = mdlALS.predict(userMovie).map {      case Rating(user, movie, rate) => ((user, movie), rate)    }    //    println(predictions.count())    predictions.take(5).foreach(println)    //    // Now convert the validation set to PairRDD:    //    val validationPairRDD = validationSet.map(r => ((r.user,       r.product), r.rating))    println(validationPairRDD.count())    validationPairRDD.take(5).foreach(println)    println(validationPairRDD.getClass())    println(predictions.getClass())    //    // Now join the validation set with predictions.    // Then we can figure out how good our recommendations are.    
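   //
   // (The join below is keyed on (user, movie); each joined element has the shape
   // ((user, movie), (actual rating, predicted rating)), which is what the MSE
   // calculation further down unpacks as r._2._1 and r._2._2.)
   //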
// Tip:    //   Need to import org.apache.spark.SparkContext._    //   Then MappedRDD would be converted implicitly to PairRDD    //    val ratingsAndPreds = validationPairRDD.join(predictions)    println(ratingsAndPreds.count())    ratingsAndPreds.take(3).foreach(println)    //    val mse = ratingsAndPreds.map(r => {      math.pow((r._2._1 - r._2._2),2)    }).reduce(_+_) / ratingsAndPreds.count()    val rmse = math.sqrt(mse)    println("MSE = %2.5f".format(mse) + " RMSE = %2.5f"     .format(rmse))    println("** Done **") } } The run results, as shown in the next screenshot, are obtained as expected: Check the following screenshot as well: Some more information is available at: The Goodby MapReduce article from Mahout News (https://mahout.apache.org/) https://spark.apache.org/docs/latest/mllib-guide.html A Collaborative Filtering ALS paper (http://dl.acm.org/citation.cfm?id=1608614) A good presentation on decision trees (http://spark-summit.org/wp-content/uploads/2014/07/Scalable-Distributed-Decision-Trees-in-Spark-Made-Das-Sparks-Talwalkar.pdf) A recommended hands-on exercise from Spark Summit 2014 (https://databricks-training.s3.amazonaws.com/movie-recommendation-with-mllib.html) Summary In this article, we looked at the most common machine learning algorithms. Naturally, ML is a vast subject and requires lot more study, experimentation, and practical experience on interesting data science problems. Two books that are relevant to Spark Machine Learning are Packt's own books Machine Learning with Spark, Nick Pentreath, and O'Reilly's Advanced Analytics with Spark, Sandy Ryza, Uri Laserson, Sean Owen, and Josh Wills. Both are excellent books that you can refer to. Resources for Article: Further resources on this subject: Driving Visual Analyses with Automobile Data (Python) [article] The Spark programming model [article] Using the Spark Shell [article]

Installing PostgreSQL

In this article by Hans-Jürgen Schönig, author of the book Troubleshooting PostgreSQL, we will cover what can go wrong during the installation process and what can be done to avoid those things from happening. At the end of the article, you should be able to avoid all of the pitfalls, traps, and dangers you might face during the setup process. (For more resources related to this topic, see here.) For this article, I have compiled some of the core problems that I have seen over the years, as follows: Deciding on a version during installation Memory and kernel issues Preventing problems by adding checksums to your database instance Wrong encodings and subsequent import errors Polluted template databases Killing the postmaster badly At the end of the article, you should be able to install PostgreSQL and protect yourself against the most common issues popping up immediately after installation. Deciding on a version number The first thing to work on when installing PostgreSQL is to decide on the version number. In general, a PostgreSQL version number consists of three digits. Here are some examples: 9.4.0, 9.4.1, or 9.4.2 9.3.4, 9.3.5, or 9.3.6 The last digit is the so-called minor release. When a new minor release is issued, it generally means that some bugs have been fixed (for example, some time zone changes, crashes, and so on). There will never be new features, missing functions, or changes of that sort in a minor release. The same applies to something truly important—the storage format. It won't change with a new minor release. These little facts have a wide range of consequences. As the binary format and the functionality are unchanged, you can simply upgrade your binaries, restart PostgreSQL, and enjoy your improved minor release. When the digit in the middle changes, things get a bit more complex. A changing middle digit is called a major release. It usually happens around once a year and provides you with significant new functionality. If this happens, we cannot just stop or start the database anymore to replace the binaries. If the first digit changes, something really important has happened. Examples of such important events were introductions of SQL (6.0), the Windows port (8.0), streaming replication (9.0), and so on. Technically, there is no difference between the first and the second digit—they mean the same thing to the end user. However, a migration process is needed. The question that now arises is this: if you have a choice, which version of PostgreSQL should you use? Well, in general, it is a good idea to take the latest stable release. In PostgreSQL, every version number following the design patterns I just outlined is a stable release. As of PostgreSQL 9.4, the PostgreSQL community provides fixes for versions as old as PostgreSQL 9.0. So, if you are running an older version of PostgreSQL, you can still enjoy bug fixes and so on. Methods of installing PostgreSQL Before digging into troubleshooting itself, the installation process will be outlined. The following choices are available: Installing binary packages Installing from source Installing from source is not too hard to do. However, this article will focus on installing binary packages only. Nowadays, most people (not including me) like to install PostgreSQL from binary packages because it is easier and faster. Basically, two types of binary packages are common these days: RPM (Red Hat-based) and DEB (Debian-based). Installing RPM packages Most Linux distributions include PostgreSQL. 
However, the shipped PostgreSQL version is somewhat ancient in many cases. Recently, I saw a Linux distribution that still featured PostgreSQL 8.4, a version already abandoned by the PostgreSQL community. Distributors tend to ship older versions to ensure that new bugs are not introduced into their systems. For high-performance production servers, outdated versions might not be the best idea, however. Clearly, for many people, it is not feasible to run long-outdated versions of PostgreSQL. Therefore, it makes sense to make use of repositories provided by the community. The Yum repository shows which distributions we can use RPMs for, at http://yum.postgresql.org/repopackages.php. Once you have found your distribution, the first thing is to install this repository information for Fedora 20 as it is shown in the next listing: yum install http://yum.postgresql.org/9.4/fedora/fedora-20-x86_64/pgdg-fedora94-9.4-1.noarch.rpm Once the repository has been added, we can install PostgreSQL: yum install postgresql94-server postgresql94-contrib /usr/pgsql-9.4/bin/postgresql94-setup initdb systemctl enable postgresql-9.4.service systemctl start postgresql-9.4.service First of all, PostgreSQL 9.4 is installed. Then a so-called database instance is created (initdb). Next, the service is enabled to make sure that it is always there after a reboot, and finally, the postgresql-9.4 service is started. The term database instance is an important concept. It basically describes an entire PostgreSQL environment (setup). A database instance is fired up when PostgreSQL is started. Databases are part of a database instance. Installing Debian packages Installing Debian packages is also not too hard. By the way, the process on Ubuntu as well as on some other similar distributions is the same as that on Debian, so you can directly use the knowledge gained from this article for other distributions. A simple file called /etc/apt/sources.list.d/pgdg.list can be created, and a line for the PostgreSQL repository (all the following steps can be done as root user or using sudo) can be added: deb http://apt.postgresql.org/pub/repos/apt/ YOUR_DEBIAN_VERSION_HERE-pgdg main So, in the case of Debian Wheezy, the following line would be useful: deb http://apt.postgresql.org/pub/repos/apt/ wheezy-pgdg main Once we have added the repository, we can import the signing key: $# wget --quiet -O - https://www.postgresql.org/media/keys/ACCC4CF8.asc | apt-key add - OK Voilà! Things are mostly done. In the next step, the repository information can be updated: apt-get update Once this has been done successfully, it is time to install PostgreSQL: apt-get install "postgresql-9.4" If no error is issued by the operating system, it means you have successfully installed PostgreSQL. The beauty here is that PostgreSQL will fire up automatically after a restart. A simple database instance has also been created for you. If everything has worked as expected, you can give it a try and log in to the database: root@chantal:~# su - postgres $ psql postgres psql (9.4.1) Type "help" for help. postgres=# Memory and kernel issues After this brief introduction to installing PostgreSQL, it is time to focus on some of the most common problems. Fixing memory issues Some of the most important issues are related to the kernel and memory. Up to version 9.2, PostgreSQL was using the classical system V shared memory to cache data, store locks, and so on. Since PostgreSQL 9.3, things have changed, solving most issues people had been facing during installation. 
However, in PostgreSQL 9.2 or before, you might have faced the following error message: FATAL: Could not create shared memory segment DETAIL: Failed system call was shmget (key=5432001, size=1122263040, 03600) HINT: This error usually means that PostgreSQL's request for a shared memory segment exceeded your kernel's SHMMAX parameter. You can either reduce the request size or reconfigure the kernel with larger SHMMAX. To reduce the request size (currently 1122263040 bytes), reduce PostgreSQL's shared memory usage, perhaps by reducing shared_buffers or max_connections. If the request size is already small, it's possible that it is less than your kernel's SHMMIN parameter, in which case raising the request size or reconfiguring SHMMIN is called for. The PostgreSQL documentation contains more information about shared memory configuration. If you are facing a message like this, it means that the kernel does not provide you with enough shared memory to satisfy your needs. Where does this need for shared memory come from? Back in the old days, PostgreSQL stored a lot of stuff, such as the I/O cache (shared_buffers, locks, autovacuum-related information and a lot more), in the shared memory. Traditionally, most Linux distributions have had a tight grip on the memory, and they don't issue large shared memory segments; for example, Red Hat has long limited the maximum amount of shared memory available to applications to 32 MB. For most applications, this is not enough to run PostgreSQL in a useful way—especially not if performance does matter (and it usually does). To fix this problem, you have to adjust kernel parameters. Managing Kernel Resources of the PostgreSQL Administrator's Guide will tell you exactly why we have to adjust kernel parameters. For more information, check out the PostgreSQL documentation at http://www.postgresql.org/docs/9.4/static/kernel-resources.htm. This article describes all the kernel parameters that are relevant to PostgreSQL. Note that every operating system needs slightly different values here (for open files, semaphores, and so on). Adjusting kernel parameters for Linux In this article, parameters relevant to Linux will be covered. If shmget (previously mentioned) fails, two parameters must be changed: $ sysctl -w kernel.shmmax=17179869184 $ sysctl -w kernel.shmall=4194304 In this example, shmmax and shmall have been adjusted to 16 GB. Note that shmmax is in bytes while shmall is in 4k blocks. The kernel will now provide you with a great deal of shared memory. Also, there is more; to handle concurrency, PostgreSQL needs something called semaphores. These semaphores are also provided by the operating system. The following kernel variables are available: SEMMNI: This is the maximum number of semaphore identifiers. It should be at least ceil((max_connections + autovacuum_max_workers + 4) / 16). SEMMNS: This is the maximum number of system-wide semaphores. It should be at least ceil((max_connections + autovacuum_max_workers + 4) / 16) * 17, and it should have room for other applications in addition to this. SEMMSL: This is the maximum number of semaphores per set. It should be at least 17. SEMMAP: This is the number of entries in the semaphore map. SEMVMX: This is the maximum value of the semaphore. It should be at least 1000. Don't change these variables unless you really have to. Changes can be made with sysctl, as was shown for the shared memory. 
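Keep in mind that values set with sysctl -w do not survive a reboot. To make them permanent, the usual approach is to add them to /etc/sysctl.conf (or a file under /etc/sysctl.d/, depending on the distribution) and reload the file. The following sketch simply reuses the example values from above; they are not a recommendation for your system:
# /etc/sysctl.conf (example values from the preceding section)
kernel.shmmax = 17179869184
kernel.shmall = 4194304

# apply the settings without a reboot
sysctl -p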
Adjusting kernel parameters for Mac OS X If you happen to run Mac OS X and plan to run a large system, there are also some kernel parameters that need changes. Again, /etc/sysctl.conf has to be changed. Here is an example: kern.sysv.shmmax=4194304 kern.sysv.shmmin=1 kern.sysv.shmmni=32 kern.sysv.shmseg=8 kern.sysv.shmall=1024 Mac OS X is somewhat nasty to configure. The reason is that you have to set all five parameters to make this work. Otherwise, your changes will be silently ignored, and this can be really painful. In addition to that, it has to be assured that SHMMAX is an exact multiple of 4096. If it is not, trouble is near. If you want to change these parameters on the fly, recent versions of OS X provide a systcl command just like Linux. Here is how it works: sysctl -w kern.sysv.shmmax sysctl -w kern.sysv.shmmin sysctl -w kern.sysv.shmmni sysctl -w kern.sysv.shmseg sysctl -w kern.sysv.shmall Fixing other kernel-related limitations If you are planning to run a large-scale system, it can also be beneficial to raise the maximum number of open files allowed. To do that, /etc/security/limits.conf can be adapted, as shown in the next example: postgres   hard   nofile   1024 postgres   soft   nofile   1024 This example says that the postgres user can have up to 1,024 open files per session. Note that this is only important for large systems; open files won't hurt an average setup. Adding checksums to a database instance When PostgreSQL is installed, a so-called database instance is created. This step is performed by a program called initdb, which is a part of every PostgreSQL setup. Most binary packages will do this for you and you don't have to do this by hand. Why should you care then? If you happen to run a highly critical system, it could be worthwhile to add checksums to the database instance. What is the purpose of checksums? In many cases, it is assumed that crashes happen instantly—something blows up and a system fails. This is not always the case. In many scenarios, the problem starts silently. RAM may start to break, or the filesystem may start to develop slight corruption. When the problem surfaces, it may be too late. Checksums have been implemented to fight this very problem. Whenever a piece of data is written or read, the checksum is checked. If this is done, a problem can be detected as it develops. How can those checksums be enabled? All you have to do is to add -k to initdb (just change your init scripts to enable this during instance creation). Don't worry! The performance penalty of this feature can hardly be measured, so it is safe and fast to enable its functionality. Keep in mind that this feature can really help prevent problems at fairly low costs (especially when your I/O system is lousy). Preventing encoding-related issues Encoding-related problems are some of the most frequent problems that occur when people start with a fresh PostgreSQL setup. In PostgreSQL, every database in your instance has a specific encoding. One database might be en_US@UTF-8, while some other database might have been created as de_AT@UTF-8 (which denotes German as it is used in Austria). To figure out which encodings your database system is using, try to run psql -l from your Unix shell. What you will get is a list of all databases in the instance that include those encodings. So where can we actually expect trouble? Once a database has been created, many people would want to load data into the system. Let's assume that you are loading data into the aUTF-8 database. 
However, the data you are loading contains some ASCII characters such as ä, ö, and so on. The ASCII code for ö is 148. Binary 148 is not a valid Unicode character. In Unicode, U+00F6 is needed. Boom! Your import will fail and PostgreSQL will error out. If you are planning to load data into a new database, ensure that the encoding or character set of the data is the same as that of the database. Otherwise, you may face ugly surprises. To create a database using the correct locale, check out the syntax of CREATE DATABASE: test=# h CREATE DATABASE Command:     CREATE DATABASE Description: create a new database Syntax: CREATE DATABASE name    [ [ WITH ] [ OWNER [=] user_name ]            [ TEMPLATE [=] template ]            [ ENCODING [=] encoding ]            [ LC_COLLATE [=] lc_collate ]            [ LC_CTYPE [=] lc_ctype ]           [ TABLESPACE [=] tablespace_name ]            [ CONNECTION LIMIT [=] connlimit ] ] ENCODING and the LC* settings are used here to define the proper encoding for your new database. Avoiding template pollution It is somewhat important to understand what happens during the creation of a new database in your system. The most important point is that CREATE DATABASE (unless told otherwise) clones the template1 database, which is available in all PostgreSQL setups. This cloning has some important implications. If you have loaded a very large amount of data into template1, all of that will be copied every time you create a new database. In many cases, this is not really desirable but happens by mistake. People new to PostgreSQL sometimes put data into template1 because they don't know where else to place new tables and so on. The consequences can be disastrous. However, you can also use this common pitfall to your advantage. You can place the functions you want in all your databases in template1 (maybe for monitoring or whatever benefits). Killing the postmaster After PostgreSQL has been installed and started, many people wonder how to stop it. The most simplistic way is, of course, to use your service postgresql stop or /etc/init.d/postgresql stop init scripts. However, some administrators tend to be a bit crueler and use kill -9 to terminate PostgreSQL. In general, this is not really beneficial because it will cause some nasty side effects. Why is this so? The PostgreSQL architecture works like this: when you start PostgreSQL you are starting a process called postmaster. Whenever a new connection comes in, this postmaster forks and creates a so-called backend process (BE). This process is in charge of handling exactly one connection. In a working system, you might see hundreds of processes serving hundreds of users. The important thing here is that all of those processes are synchronized through some common chunk of memory (traditionally, shared memory, and in the more recent versions, mapped memory), and all of them have access to this chunk. What might happen if a database connection or any other process in the PostgreSQL infrastructure is killed with kill -9? A process modifying this common chunk of memory might die while making a change. The process killed cannot defend itself against the onslaught, so who can guarantee that the shared memory is not corrupted due to the interruption? This is exactly when the postmaster steps in. It ensures that one of these backend processes has died unexpectedly. To prevent the potential corruption from spreading, it kills every other database connection, goes into recovery mode, and fixes the database instance. 
Then new database connections are allowed again. While this makes a lot of sense, it can be quite disturbing to those users who are connected to the database system. Therefore, it is highly recommended not to use kill -9. A normal kill will be fine. Keep in mind that a kill -9 cannot corrupt your database instance, which will always start up again. However, it is pretty nasty to kick everybody out of the system just because of one process! Summary In this article we have learned how to install PostgreSQL using binary packages. Some of the most common problems and pitfalls, including encoding-related issues, checksums, and versioning were discussed. Resources for Article: Further resources on this subject: Getting Started with PostgreSQL [article] PostgreSQL Cookbook - High Availability and Replication [article] PostgreSQL – New Features [article]

Factor variables in R

This article by Jaynal Abedin and Kishor Kumar Das, authors of the book Data Manipulation with R Second Edition, will discuss factor variables in R. In any data analysis task, the majority of the time is dedicated to data cleaning and preprocessing. Sometimes, it is considered that about 80 percent of the effort is devoted to data cleaning before conducting the actual analysis. Also, in real-world data, we often work with categorical variables. A variable that takes only a limited number of distinct values is usually known as a categorical variable, and in R, it is known as a factor. Working with categorical variables in R is a bit technical, and in this article, we have tried to demystify this process of dealing with categorical variables. (For more resources related to this topic, see here.) During data analysis, the factor variable sometimes plays an important role, particularly in studying the relationship between two categorical variables. In this section, we will see some important aspects of factor manipulation. When a factor variable is first created, it stores all its levels along with the factor. But if we take any subset of that factor variable, it inherits all its levels from the original factor levels. This feature sometimes creates confusion in understanding the results. Numeric variables are convenient during statistical analysis, but sometimes, we need to create categorical (factor) variables from numeric variables. We can create a limited number of categories from a numeric variable using a series of conditional statements, but this is not an efficient way to perform this operation. In R, cut is a generic command to create factor variables from numeric variables. The split-apply-combine strategy Data manipulation is an integral part of data cleaning and analysis. For large data, it is always preferable to perform the operation within a subgroup of a dataset to speed up the process. In R, this type of data manipulation can be done with base functionality, but for large-scale data, it requires considerable amount of coding and eventually takes a longer time to process. In the case of big data, we can split the dataset, perform the manipulation or analysis, and then again combine the results into a single output. This type of split using base R is not efficient, and to overcome this limitation, Wickham developed an R package, plyr, where he efficiently implemented the split-apply-combine strategy. Often, we require similar types of operations in different subgroups of a dataset, such as group-wise summarization, standardization, and statistical modeling. This type of task requires us to break down a big problem into manageable pieces, perform operations on each piece separately, and finally combine the output of each piece into a single piece of output. To understand the split-apply-combine strategy intuitively, we can compare it with the map-reduce strategy for processing large amounts of data, recently popularized by Google. In the map-reduce strategy, the map step corresponds to split and apply and the reduce step consists of combining. The map-reduce approach is primarily designed to deal with a highly parallel environment where the work has been done by several hundreds or thousands of computers independently. The split-apply-combine strategy creates an opportunity to see the similarities of problems across subgroups that were not previously connected. 
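As a quick illustration of the idea in base R (using the built-in mtcars dataset purely as an example), we can split the cars by the factor cyl, apply a mean to each group, and combine the results into a single object; the dplyr equivalent of the same group-wise summary is shown as well:
# split by number of cylinders, apply mean mpg per group, combine into a named vector
with(mtcars, tapply(mpg, factor(cyl), mean))

# the same group-wise summary with dplyr (returns a data frame)
library(dplyr)
mtcars %>% group_by(cyl) %>% summarise(mean_mpg = mean(mpg))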
This strategy can be used in many existing tools, such as the GROUP BY operation in SAS, PivotTable in MS Excel, and the SQL GROUP BY operator. The plyr package works on every type of data structure, whereas the dplyr package is designed to work only on data frames. The dplyr package offers a complete set of functions to perform every kind of data manipulation we would need in the process of analysis. These functions take a data frame as the input and also produce a data frame as the output, hence the name dplyr. There are two different types of functions in the dplyr package: single-table and aggregate. The single-table function takes a data frame as the input and an action such as subsetting a data frame, generating new columns in the data frame, or rearranging a data frame. The aggregate function takes a column as the input and produces a single value as the output, which is mostly used to summarize columns. These functions do not allow us to perform any group-wise operation, but a combination of these functions with the group_by() function allows us to implement the split-apply-combine approach. Reshaping a dataset Reshaping data is a common and tedious task in real-life data manipulation and analysis. A dataset might come with different levels of grouping, and we need to implement some reorientation to perform certain types of analyses. A dataset's layout could be long or wide. In a long layout, multiple rows represent a single subject's record, whereas in a wide layout, a single row represents a single subject's record. Statistical analysis sometimes requires wide data and sometimes long data, and in such cases, we need to be able to fluently and fluidly reshape the data to meet the requirements of statistical analysis. Data reshaping is just a rearrangement of the form of the data—it does not change the content of the dataset. In this article, we will show you different layouts of the same dataset and see how they can be transferred from one layout to another. This article mainly highlights the melt and cast paradigm of reshaping datasets, which is implemented in the reshape contributed package. Later on, this same package is reimplemented with a new name, reshape2, which is much more time and memory efficient. A single dataset can be rearranged in many different ways, but before going into rearrangement, let's look back at how we usually perceive a dataset. Whenever we think about any dataset, we think of a two-dimensional arrangement where a row represents a subject's (a subject could be a person and is typically the respondent in a survey) information for all the variables in a dataset, and a column represents the information for each characteristic for all subjects. This means that rows indicate records and columns indicate variables, characteristics, or attributes. This is the typical layout of a dataset. In this arrangement, one or more variables might play a role as an identifier, and others are measured characteristics. For the purpose of reshaping, we can group the variables into two groups: identifier variables and measured variables: The identifier variables: These help us identify the subject from whom we took information on different characteristics. Typically, identifier variables are qualitative in nature and take a limited number of unique values. In database terminology, an identifier is termed as the primary key, and this can be a single variable or a composite of multiple variables. 
The measured variables: These are those characteristics whose information we took from a subject of interest. These can be qualitative, quantitative, or a mixture of both. Now, beyond this typical structure of a dataset, we can think differently, where we will have only identification variables and a value. The identification variable identifies a subject along with which the measured variable the value represents. In this new paradigm, each row represents one observation of one variable. In the new paradigm, this is termed as melting and it produces molten data. The difference between this new layout of the data and the typical layout is that it now contains only the ID variable and a new column, value, which represents the value of that observation. Text processing Text data is one of the most important areas in the field of data analytics. Nowadays, we are producing a huge amount of text data through various media every day; for example, Twitter posts, blog writing, and Facebook posts are all major sources of text data. Text data can be used to retrieve information, in sentiment analysis and even entity recognition. Summary This article briefly explained the factor variables, the split-apply-combine strategy, reshaping a dataset in R, and text processing. Resources for Article: Further resources on this subject: Introduction to S4 Classes [Article] Warming Up [Article] Driving Visual Analyses with Automobile Data (Python) [Article]

PostgreSQL – New Features

In this article, Jayadevan Maymala, author of the book, PostgreSQL for Data Architects, you will see how to troubleshoot the initial hiccups faced by people who are new to PostgreSQL. We will look at a few useful, but not commonly used data types. We will also cover pgbadger, a nifty third-party tool that can run through a PostgreSQL log. This tool can tell us a lot about what is happening in the cluster. Also, we will look at a few key features that are part of PostgreSQL 9.4 release. We will cover a couple of useful extensions. (For more resources related to this topic, see here.) Interesting data types We will start with the data types. PostgreSQL does have all the common data types we see in databases. These include: The number data types (smallint, integer, bigint, decimal, numeric, real, and double) The character data types (varchar, char, and text) The binary data types The date/time data types (including date, timestamp without timezone, and timestamp with timezone) BOOLEAN data types However, this is all standard fare. Let's start off by looking at the RANGE data type. RANGE This is a data type that can be used to capture values that fall in a specific range. Let's look at a few examples of use cases. Cars can be categorized as compact, convertible, MPV, SUV, and so on. Each of these categories will have a price range. For example, the price range of a category of cars can start from $15,000 at the lower end and the price range at the upper end can start from $40,000. We can have meeting rooms booked for different time slots. Each room is booked during different time slots and is available accordingly. Then, there are use cases that involve shift timings for employees. Each shift begins at a specific time, ends at a specific time, and involves a specific number of hours on duty. We would also need to capture the swipe-in and swipe-out time for employees. These are some use cases where we can consider range types. Range is a high-level data type; we can use int4range as the appropriate subtype for the car price range scenario. For the booking the meeting rooms and shifting use cases, we can consider tsrange or tstzrange (if we want to capture time zone as well). It makes sense to explore the possibility of using range data types in most scenarios, which involve the following features: From and to timestamps/dates for room reservations Lower and upper limit for price/discount ranges Scheduling jobs Timesheets Let's now look at an example. We have three meeting rooms. The rooms can be booked and the entries for reservations made go into another table (basic normalization principles). How can we find rooms that are not booked for a specific time period, say, 10:45 to 11:15? 
We will look at this with and without the range data type: CREATE TABLE rooms(id serial, descr varchar(50));   INSERT INTO rooms(descr) SELECT concat('Room ', generate_series(1,3));   CREATE TABLE room_book (id serial , room_id integer, from_time timestamp, to_time timestamp , res tsrange);   INSERT INTO room_book (room_id,from_time,to_time,res) values(1,'2014-7-30 10:00:00', '2014-7-30 11:00:00', '(2014-7-30 10:00:00,2014-7-30 11:00:00)');   INSERT INTO room_book (room_id,from_time,to_time,res) values(2,'2014-7-30 10:00:00', '2014-7-30 10:40:00', '(2014-7-30 10:00,2014-7-30 10:40:00)');   INSERT INTO room_book (room_id,from_time,to_time,res) values(2,'2014-7-30 11:20:00', '2014-7-30 12:00:00', '(2014-7-30 11:20:00,2014-7-30 12:00:00)');   INSERT INTO room_book (room_id,from_time,to_time,res) values(3,'2014-7-30 11:00:00', '2014-7-30 11:30:00', '(2014-7-30 11:00:00,2014-7-30 11:30:00)'); PostgreSQL has the OVERLAPS operator. This can be used to get all the reservations that overlap with the period for which we wanted to book a room: SELECT room_id FROM room_book WHERE (from_time,to_time) OVERLAPS ('2014-07-30 10:45:00','2014-07-30 11:15:00'); If we eliminate these room IDs from the master list, we have the list of rooms available. So, we prefix the following command to the preceding SQL: SELECT id FROM rooms EXCEPT We get a room ID that is not booked from 10:45 to 11:15. This is the old way of doing it. With the range data type, we can write the following SQL statement: SELECT id FROM rooms EXCEPT SELECT room_id FROM room_book WHERE res && '(2014-07-30 10:45:00,2014-07-30 11:15:00)'; Do look up GIST indexes to improve the performance of queries that use range operators. Another way of achieving the same is to use the following command: SELECT id FROM rooms EXCEPT SELECT room_id FROM room_book WHERE '2014-07-30 10:45:00' < to_time AND '2014-07-30 11:15:00' > from_time; Now, let's look at the finer points of how a range is represented. The range values can be opened using [ or ( and closed with ] or ). [ means include the lower value and ( means exclude the lower value. The closing (] or )) has a similar effect on the upper values. When we do not specify anything, [) is assumed, implying include the lower value, but exclude the upper value. Note that the lower bound is 3 and upper bound is 6 when we mention 3,5, as shown here: SELECT int4range(3,5,'[)') lowerincl ,int4range(3,5,'[]') bothincl, int4range(3,5,'()') bothexcl , int4range(3,5,'[)') upperexcl; lowerincl | bothincl | bothexcl | upperexcl -----------+----------+----------+----------- [3,5)       | [3,6)       | [4,5)       | [3,5) Using network address types The network address types are cidr, inet, and macaddr. These are used to capture IPv4, IPv6, and Mac addresses. Let's look at a few use cases. When we have a website that is open to public, a number of users from different parts of the world access it. We may want to analyze the access patterns. Very often, websites can be used by users without registering or providing address information. In such cases, it becomes even more important that we get some insight into the users based on the country/city and similar location information. When anonymous users access our website, an IP is usually all we get to link the user to a country or city. Often, this becomes our not-so-accurate unique identifier (along with cookies) to keep track of repeat visits, to analyze website-usage patterns, and so on. 
The network address types can also be useful when we develop applications that monitor a number of systems in different networks to check whether they are up and running, to monitor resource consumption of the systems in the network, and so on. While data types (such as VARCHAR or BIGINT) can be used to store IP addresses, it's recommended to use one of the built-in types PostgreSQL provides to store network addresses. There are three data types to store network addresses. They are as follows: inet: This data type can be used to store an IPV4 or IPV6 address along with its subnet. The format in which data is to be inserted is Address/y, where y is the number of bits in the netmask. cidr: This data type can also be used to store networks and network addresses. Once we specify the subnet mask for a cidr data type, PostgreSQL will throw an error if we set bits beyond the mask, as shown in the following example: CREATE TABLE nettb (id serial, intclmn inet, cidrclmn cidr); CREATE TABLE INSERT INTO nettb (intclmn , cidrclmn) VALUES ('192.168.64.2/32', '192.168.64.2/32'); INSERT 0 1 INSERT INTO nettb (intclmn , cidrclmn) VALUES ('192.168.64.2/24', '192.168.64.2/24'); ERROR: invalid cidr value: "192.168.64.2/24" LINE 1: ...b (intclmn , cidrclmn) VALUES ('192.168.64.2/24', '192.168.6...                                                              ^ DETAIL: Value has bits set to right of mask. INSERT INTO nettb (intclmn , cidrclmn) VALUES ('192.168.64.2/24', '192.168.64.0/24'); INSERT 0 1 SELECT * FROM nettb; id |     intclmn     |   cidrclmn     ----+-----------------+----------------- 1 | 192.168.64.2   | 192.168.64.2/32 2 | 192.168.64.2/24 | 192.168.64.0/24 Let's also look at a couple of useful operators available within network address types. Does an IP fall in a subnet? This can be figured out using <<=, as shown here: SELECT id,intclmn FROM nettb ; id |   intclmn   ----+-------------- 1 | 192.168.64.2 3 | 192.168.12.2 4 | 192.168.13.2 5 | 192.168.12.4   SELECT id,intclmn FROM nettb where intclmn <<= inet'192.168.12.2/24'; id |   intclmn   3 | 192.168.12.2 5 | 192.168.12.4   SELECT id,intclmn FROM nettb where intclmn <<= inet'192.168.12.2/32'; id |   intclmn   3 | 192.168.12.2 The operator used in the preceding command checks whether the column value is contained within or equal to the value we provided. Similarly, we have the equality operator, that is, greater than or equal to, bitwise AND, bitwise OR, and other standard operators. The macaddr data type can be used to store Mac addresses in different formats. hstore for key-value pairs A key-value store available in PostgreSQL is hstore. Many applications have requirements that make developers look for a schema-less data store. They end up turning to one of the NoSQL databases (Cassandra) or the simple and more prevalent stores such as Redis or Riak. While it makes sense to opt for one of these if the objective is to achieve horizontal scalability, it does make the system a bit complex because we now have more moving parts. After all, most applications do need a relational database to take care of all the important transactions along with the ability to write SQL to fetch data with different projections. If a part of the application needs to have a key-value store (and horizontal scalability is not the prime objective), the hstore data type in PostgreSQL should serve the purpose. It may not be necessary to make the system more complex by using different technologies that will also add to the maintenance overhead. 
Sometimes, what we want is not an entirely schema-less database, but some flexibility where we are certain about most of our entities and their attributes but are unsure about a few. For example, a person is sure to have a few key attributes such as first name, date of birth, and a couple of other attributes (irrespective of his nationality). However, there could be other attributes that undergo change. A U.S. citizen is likely to have a Social Security Number (SSN); someone from Canada has a Social Insurance Number (SIN). Some countries may provide more than one identifier. There can be more attributes with a similar pattern. There is usually a master attribute table (which links the IDs to attribute names) and a master table for the entities. Writing queries against tables designed on an EAV approach can get tricky. Using hstore may be an easier way of accomplishing the same. Let's see how we can do this using hstore with a simple example. The hstore key-value store is an extension and has to be installed using CREATE EXTENSION hstore. We will model a customer table with first_name and an hstore column to hold all the dynamic attributes: CREATE TABLE customer(id serial, first_name varchar(50), dynamic_attributes hstore); INSERT INTO customer (first_name ,dynamic_attributes) VALUES ('Michael','ssn=>"123-465-798" '), ('Smith','ssn=>"129-465-798" '), ('James','ssn=>"No data" '), ('Ram','uuid=>"1234567891" , npr=>"XYZ5678", ratnum=>"Somanyidentifiers" '); Now, let's try retrieving all customers with their SSN, as shown here: SELECT first_name, dynamic_attributes FROM customer        WHERE dynamic_attributes ? 'ssn'; first_name | dynamic_attributes Michael   | "ssn"=>"123-465-798" Smith     | "ssn"=>"129-465-798" James     | "ssn"=>"No data" Also, those with a specific SSN: SELECT first_name,dynamic_attributes FROM customer        WHERE dynamic_attributes -> 'ssn'= '123-465-798'; first_name | dynamic_attributes - Michael   | "ssn"=>"123-465-798" If we want to get records that do not contain a specific SSN, just use the following command: WHERE NOT dynamic_attributes -> 'ssn'= '123-465-798' Also, replacing it with WHERE NOT dynamic_attributes ? 'ssn'; gives us the following command: first_name |                          dynamic_attributes         ------------+----------------------------------------------------- Ram       | "npr"=>"XYZ5678", "uuid"=>"1234567891", "ratnum"=>"Somanyidentifiers" As is the case with all data types in PostgreSQL, there are a number of functions and operators available to fetch data selectively, update data, and so on. We must always use the appropriate data types. This is not just for the sake of doing it right, but because of the number of operators and functions available with a focus on each data type; hstore stores only text. We can use it to store numeric values, but these values will be stored as text. We can index the hstore columns to improve performance. The type of index to be used depends on the operators we will be using frequently. json/jsonb JavaScript Object Notation (JSON) is an open standard format used to transmit data in a human-readable format. It's a language-independent data format and is considered an alternative to XML. It's really lightweight compared to XML and has been steadily gaining popularity in the last few years. PostgreSQL added the JSON data type in Version 9.2 with a limited set of functions and operators. Quite a few new functions and operators were added in Version 9.3. 
Version 9.4 adds one more data type: jsonb.json, which is very similar to JSONB. The jsonb data type stores data in binary format. It also removes white spaces (which are insignificant) and avoids duplicate object keys. As a result of these differences, JSONB has an overhead when data goes in, while JSON has extra processing overhead when data is retrieved (consider how often each data point will be written and read). The number of operators available with each of these data types is also slightly different. As it's possible to cast one data type to the other, which one should we use depends on the use case. If the data will be stored as it is and retrieved without any operations, JSON should suffice. However, if we plan to use operators extensively and want indexing support, JSONB is a better choice. Also, if we want to preserve whitespace, key ordering, and duplicate keys, JSON is the right choice. Now, let's look at an example. Assume that we are doing a proof of concept project for a library management system. There are a number of categories of items (ranging from books to DVDs). We wouldn't have information about all the categories of items and their attributes at the piloting stage. For the pilot stage, we could use a table design with the JSON data type to hold various items and their attributes: CREATE TABLE items (    item_id serial,    details json ); Now, we will add records. All DVDs go into one record, books go into another, and so on: INSERT INTO items (details) VALUES ('{                  "DVDs" :[                         {"Name":"The Making of Thunderstorms", "Types":"Educational",                          "Age-group":"5-10","Produced By":"National Geographic"                          },                          {"Name":"My nightmares", "Types":"Movies", "Categories":"Horror",                          "Certificate":"A", "Director":"Dracula","Actors":                                [{"Name":"Meena"},{"Name":"Lucy"},{"Name":"Van Helsing"}]                          },                          {"Name":"My Cousin Vinny", "Types":"Movies", "Categories":"Suspense",                          "Certificate":"A", "Director": "Jonathan Lynn","Actors":                          [{"Name":"Joe "},{"Name":"Marissa"}] }] }' ); A better approach would be to have one record for each item. Now, let's take a look at a few JSON functions: SELECT   details->>'DVDs' dvds, pg_typeof(details->>'DVDs') datatype      FROM items; SELECT   details->'DVDs' dvds ,pg_typeof(details->'DVDs') datatype      FROM items; Note the difference between ->> and -> in the following screenshot. We are using the pg_typeof function to clearly see the data type returned by the functions. Both return the JSON object field. The first function returns text and the second function returns JSON: Now, let's try something a bit more complex: retrieve all movies in DVDs in which Meena acted with the following SQL statement: WITH tmp (dvds) AS (SELECT json_array_elements(details->'DVDs') det FROM items) SELECT * FROM tmp , json_array_elements(tmp.dvds#>'{Actors}') as a WHERE    a->>'Name'='Meena'; We get the record as shown here: We used one more function and a couple of operators. The json_array_elements expands a JSON array to a set of JSON elements. So, we first extracted the array for DVDs. We also created a temporary table, which ceases to exist as soon as the query is over, using the WITH clause. In the next part, we extracted the elements of the array actors from DVDs. 
Then, we checked whether the Name element is equal to Meena. XML PostgreSQL added the xml data type in Version 8.3. Extensible Markup Language (XML) has a set of rules to encode documents in a format that is both human-readable and machine-readable. This data type is best used to store documents. XML became the standard way of data exchanging information across systems. XML can be used to represent complex data structures such as hierarchical data. However, XML is heavy and verbose; it takes more bytes per data point compared to the JSON format. As a result, JSON is referred to as fat-free XML. XML structure can be verified against XML Schema Definition Documents (XSD). In short, XML is heavy and more sophisticated, whereas JSON is lightweight and faster to process. We need to configure PostgreSQL with libxml support (./configure --with-libxml) and then restart the cluster for XML features to work. There is no need to reinitialize the database cluster. Inserting and verifying XML data Now, let's take a look at what we can do with the xml data type in PostgreSQL: CREATE TABLE tbl_xml(id serial, docmnt xml); INSERT INTO tbl_xml(docmnt ) VALUES ('Not xml'); INSERT INTO tbl_xml (docmnt)        SELECT query_to_xml( 'SELECT now()',true,false,'') ; SELECT xml_is_well_formed_document(docmnt::text), docmnt        FROM tbl_xml; Then, take a look at the following screenshot: First, we created a table with a column to store the XML data. Then, we inserted a record, which is not in the XML format, into the table. Next, we used the query_to_xml function to get the output of a query in the XML format. We inserted this into the table. Then, we used a function to check whether the data in the table is well-formed XML. Generating XML files for table definitions and data We can use the table_to_xml function if we want to dump the data from a table in the XML format. Append and_xmlschema so that the function becomes table_to_xml_and_xmlschema, which will also generate the schema definition before dumping the content. If we want to generate just the definitions, we can use table_to_xmlschema. PostgreSQL also provides the xpath function to extract data as follows: SELECT xpath('/table/row/now/text()',docmnt) FROM tbl_xml        WHERE id = 2;                xpath               ------------------------------------ {2014-07-29T16:55:00.781533+05:30} Using properly designed tables with separate columns to capture each attribute is always the best approach from a performance standpoint and update/write-options perspective. Data types such as json/xml are best used to temporarily store data when we need to provide feeds/extracts/views to other systems or when we get data from external systems. They can also be used to store documents. The maximum size for a field is 1 GB. We must consider this when we use the database to store text/document data. pgbadger Now, we will look at a must-have tool if we have just started with PostgreSQL and want to analyze the events taking place in the database. For those coming from an Oracle background, this tool provides reports similar to AWR reports, although the information is more query-centric. It does not include data regarding host configuration, wait statistics, and so on. Analyzing the activities in a live cluster provides a lot of insight. It tells us about load, bottlenecks, which queries get executed frequently (we can focus more on them for optimization). It even tells us if the parameters are set right, although a bit indirectly. 
For example, if we see that there are many temp files getting created while a specific query is getting executed, we know that we either have a buffer issue or have not written the query right. For pgbadger to effectively scan the log file and produce useful reports, we should get our logging configuration right as follows: log_destination = 'stderr' logging_collector = on log_directory = 'pg_log' log_filename = 'postgresql-%Y-%m-%d.log' log_min_duration_statement = 0 log_connections = on log_disconnections = on log_duration = on log_line_prefix = '%t [%p]: [%l-1] user=%u,db=%d ' log_lock_waits = on track_activity_query_size = 2048 It might be necessary to restart the cluster for some of these changes to take effect. We will also ensure that there is some load on the database using pgbench. It's a utility that ships with PostgreSQL and can be used to benchmark PostgreSQL on our servers. We can initialize the tables required for pgbench by executing the following command at shell prompt: pgbench -i pgp This creates a few tables on the pgp database. We can log in to psql (database pgp) and check: \dt              List of relations Schema |       Name      | Type | Owner   --------+------------------+-------+---------- public | pgbench_accounts | table | postgres public | pgbench_branches | table | postgres public | pgbench_history | table | postgres    public | pgbench_tellers | table | postgres Now, we can run pgbench to generate load on the database with the following command: pgbench -c 5 -T10 pgp The T option passes the duration for which pgbench should continue execution in seconds, c passes the number of clients, and pgp is the database. At shell prompt, execute: wget https://github.com/dalibo/pgbadger/archive/master.zip Once the file is downloaded, unzip the file using the following command: unzip master.zip Use cd to the directory pgbadger-master as follows: cd pgbadger-master Execute the following command: ./pgbadger /pgdata/9.3/pg_log/postgresql-2014-07-31.log –o myoutput.html Replace the log file name in the command with the actual name. It will generate a myoutput.html file. The HTML file generated will have a wealth of information about what happened in the cluster with great charts/tables. In fact, it takes quite a bit of time to go through the report. Here is a sample chart that provides the distribution of queries based on execution time: The following screenshot gives an idea about the number of performance metrics provided by the report: If our objective is to troubleshoot performance bottlenecks, the slowest individual queries and most frequent queries under the top drop-down list is the right place to start. Once the queries are identified, locks, temporary file generation, and so on can be studied to identify the root cause. Of course, EXPLAIN is the best option when we want to refine individual queries. If the objective is to understand how busy the cluster is, the Overview section and Sessions are the right places to explore. The logging configuration used may create huge log files in systems with a lot of activity. Tweak the parameters appropriately to ensure that this does not happen. With this, we covered most of the interesting data types, an interesting extension and a must-use tool from PostgreSQL ecosystem. Now, let's cover a few interesting features in PostgreSQL Version 9.4. Features over time Applying filters in Versions 8.0, 9.0, and 9.4 gives us a good idea about how quickly features are getting added to the database. 
Interesting features in 9.4 Each version of PostgreSQL adds many features grouped into different categories (such as performance, backend, data types, and so on). We will look at a few features that are more likely to be of interest (because they help us improve performance or they make maintenance and configuration easy). Keeping the buffer ready As we saw earlier, reads from disk have a significant overhead compared to those from memory. There are quite a few occasions when disk reads are unavoidable. Let's see a few examples. In a data warehouse, the Extract, Transform, Load (ETL) process, which may happen once a day usually, involves a lot of raw data getting processed in memory before being loaded into the final tables. This data is mostly transactional data. The master data, which does not get processed on a regular basis, may be evicted from memory as a result of this churn. Reports typically depend a lot on master data. When users refresh their reports after ETL, it's highly likely that the master data will be read from disk, resulting in a drop in the response time. If we could ensure that the master data as well as the recently processed data is in the buffer, it can really improve user experience. In a transactional system like an airline reservation system, a change to the fare rule may result in most of the fares being recalculated. This is a situation similar to the one described previously, ensuring that the fares and availability data for the most frequently searched routes in the buffer can provide a better user experience. This applies to an e-commerce site selling products also. If the product/price/inventory data is always available in memory, it can be retrieved very fast. You must use PostgreSQL 9.4 for trying out the code in the following sections. So, how can we ensure that the data is available in the buffer? A pg_prewarm module has been added as an extension to provide this functionality. The basic syntax is very simple: SELECT pg_prewarm('tablename');. This command will populate the buffers with data from the table. It's also possible to mention the blocks that should be loaded into the buffer from the table. We will install the extension in a database, create a table, and populate some data. Then, we will stop the server, drop buffers (OS), and restart the server. We will see how much time a SELECT count(*) takes. We will repeat the exercise, but we will use pg_prewarm before executing SELECT count(*) at psql: CREATE EXTENSION pg_prewarm; CREATE TABLE myt(id SERIAL, name VARCHAR(40)); INSERT INTO myt(name) SELECT concat(generate_series(1,10000),'name'); Now, stop the server using pg_ctl at the shell prompt: pg_ctl stop -m immediate Clean OS buffers using the following command at the shell prompt (will need to use sudo to do this): echo 1 > /proc/sys/vm/drop_caches The command may vary depending on the OS. Restart the cluster using pg_ctl start. Then, execute the following command: SELECT COUNT(*) FROM myt; Time: 333.115 ms We should repeat the steps of shutting down the server, dropping the cache, and starting PostgreSQL. Then, execute SELECT pg_prewarm('myt'); before SELECT count(*). The response time goes down significantly. Executing pg_prewarm does take some time, which is close to the time taken to execute the SELECT count(*) against a cold cache. However, the objective is to ensure that the user does not experience a delay. 
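With the buffers prewarmed, the same count now comes back in a few milliseconds instead of a few hundred: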
SELECT COUNT(*) FROM myt; count ------- 10000 (1 row) Time: 7.002 ms Better recoverability A new parameter called recovery_min_apply_delay has been added in 9.4. This will go to the recovery.conf file of the slave server. With this, we can control the replay of transactions on the slave server. We can set this to approximately 5 minutes and then the standby will replay the transaction from the master when the standby system time is 5 minutes past the time of commit at the master. This provides a bit more flexibility when it comes to recovering from mistakes. When we keep the value at 1 hour, the changes at the master will be replayed at the slave after one hour. If we realize that something went wrong on the master server, we have about 1 hour to stop the transaction replay so that the action that caused the issue (for example, accidental dropping of a table) doesn't get replayed at the slave. Easy-to-change parameters An ALTER SYSTEM command has been introduced so that we don't have to edit postgresql.conf to change parameters. The entry will go to a file named postgresql.auto.conf. We can execute ALTER SYSTEM SET work_mem='12MB'; and then check the file at psql: \! more postgresql.auto.conf # Do not edit this file manually! # It will be overwritten by ALTER SYSTEM command. work_mem = '12MB' We must execute SELECT pg_reload_conf(); to ensure that the changes are propagated. Logical decoding and consumption of changes Version 9.4 introduces physical and logical replication slots. We will look at logical slots as they let us track changes and filter specific transactions. This lets us pick and choose from the transactions that have been committed. We can grab some of the changes, decode, and possibly replay on a remote server. We do not have to have an all-or-nothing replication. As of now, we will have to do a lot of work to decode/move the changes. Two parameter changes are necessary to set this up. These are as follows: The max_replication_slots parameter (set to at least 1) and wal_level (set to logical). Then, we can connect to a database and create a slot as follows: SELECT * FROM pg_create_logical_replication_slot('myslot','test_decoding'); The first parameter is the name we give to our slot and the second parameter is the plugin to be used. Test_decoding is the sample plugin available, which converts WAL entries into text representations as follows: INSERT INTO myt(id) values (4); INSERT INTO myt(name) values ('abc'); Now, we will try retrieving the entries: SELECT * FROM pg_logical_slot_peek_changes('myslot',NULL,NULL); Then, check the following screenshot: This function lets us take a look at the changes without consuming them so that the changes can be accessed again: SELECT * FROM pg_logical_slot_get_changes('myslot',NULL,NULL); This is shown in the following screenshot: This function is similar to the peek function, but the changes are no longer available to be fetched again as they get consumed. Summary In this article, we covered a few data types that data architects will find interesting. We also covered what is probably the best utility available to parse the PostgreSQL log file to produce excellent reports. We also looked at some of the interesting features in PostgreSQL version 9.4, which will be of interest to data architects. Resources for Article: Further resources on this subject: PostgreSQL as an Extensible RDBMS [article] Getting Started with PostgreSQL [article] PostgreSQL Cookbook - High Availability and Replication [article]

Basic Concepts of Machine Learning and Logistic Regression Example in Mahout

Packt
30 Mar 2015
33 min read
In this article by Chandramani Tiwary, author of the book, Learning Apache Mahout, we will discuss some core concepts of machine learning and discuss the steps of building a logistic regression classifier in Mahout. (For more resources related to this topic, see here.) The purpose of this article is to understand the core concepts of machine learning. We will focus on understanding the steps involved in, resolving different types of problems and application areas in machine learning. In particular we will cover the following topics: Supervised learning Unsupervised learning The recommender system Model efficacy A wide range of software applications today try to replace or augment human judgment. Artificial Intelligence is a branch of computer science that has long been trying to replicate human intelligence. A subset of AI, referred to as machine learning, tries to build intelligent systems by using the data. For example, a machine learning system can learn to classify different species of flowers or group-related news items together to form categories such as news, sports, politics, and so on, and for each of these tasks, the system will learn using data. For each of the tasks, the corresponding algorithm would look at the data and try to learn from it. Supervised learning Supervised learning deals with training algorithms with labeled data, inputs for which the outcome or target variables are known, and then predicting the outcome/target with the trained model for unseen future data. For example, historical e-mail data will have individual e-mails marked as ham or spam; this data is then used for training a model that can predict future e-mails as ham or spam. Supervised learning problems can be broadly divided into two major areas, classification and regression. Classification deals with predicting categorical variables or classes; for example, whether an e-mail is ham or spam or whether a customer is going to renew a subscription or not, for example a postpaid telecom subscription. This target variable is discrete, and has a predefined set of values. Regression deals with a target variable, which is continuous. For example, when we need to predict house prices, the target variable price is continuous and doesn't have a predefined set of values. In order to solve a given problem of supervised learning, one has to perform the following steps. Determine the objective The first major step is to define the objective of the problem. Identification of class labels, what is the acceptable prediction accuracy, how far in the future is prediction required, is insight more important or is accuracy of classification the driving factor, these are the typical objectives that need to be defined. For example, for a churn classification problem, we could define the objective as identifying customers who are most likely to churn within three months. In this case, the class label from the historical data would be whether a customer has churned or not, with insights into the reasons for the churn and a prediction of churn at least three months in advance. Decide the training data After the objective of the problem has been defined, the next step is to decide what training data should be used. The training data is directly guided by the objective of the problem to be solved. For example, in the case of an e-mail classification system, it would be historical e-mails, related metadata, and a label marking each e-mail as spam or ham. 
For the problem of churn analysis, different data points collected about a customer such as product usage, support case, and so on, and a target label for whether a customer has churned or is active, together form the training data. Churn Analytics is a major problem area for a lot of businesses domains such as BFSI, telecommunications, and SaaS. Churn is applicable in circumstances where there is a concept of term-bound subscription. For example, postpaid telecom customers subscribe for a monthly term and can choose to renew or cancel their subscription. A customer who cancels this subscription is called a churned customer. Create and clean the training set The next step in a machine learning project is to gather and clean the dataset. The sample dataset needs to be representative of the real-world data, though all available data should be used, if possible. For example, if we assume that 10 percent of e-mails are spam, then our sample should ideally start with 10 percent spam and 90 percent ham. Thus, a set of input rows and corresponding target labels are gathered from data sources such as warehouses, or logs, or operational database systems. If possible, it is advisable to use all the data available rather than sampling the data. Cleaning data for data quality purposes forms part of this process. For example, training data inclusion criteria should also be explored in this step. An example of this in the case of customer analytics is to decide the minimum age or type of customers to use in the training set, for example including customers aged at least six months. Feature extraction Determine and create the feature set from the training data. Features or predictor variables are representations of the training data that is used as input to a model. Feature extraction involves transforming and summarizing that data. The performance of the learned model depends strongly on its input feature set. This process is primarily called feature extraction and requires good understanding of data and is aided by domain expertise. For example, for churn analytics, we use demography information from the CRM, product adoption (phone usage in case of telecom), age of customer, and payment and subscription history as the features for the model. The number of features extracted should neither be too large nor too small; feature extraction is more art than science and, optimum feature representation can be achieved after some iterations. Typically, the dataset is constructed such that each row corresponds to one variable outcome. For example, in the churn problem, the training dataset would be constructed so that every row represents a customer. Train the models We need to try out different supervised learning algorithms. This step is called training the model and is an iterative process where you might try building different training samples and try out different combinations of features. For example, we may choose to use support vector machines or decision trees depending upon the objective of the study, the type of problem, and the available data. Machine learning algorithms can be bucketed into groups based on the ability of a user to interpret how the predictions were arrived at. If the model can be interpreted easily, then it is called a white box, for example decision tree and logistic regression, and if the model cannot be interpreted easily, they belong to the black box models, for example support vector machine (SVM). 
If the objective is to gain insight, a white box model such as decision tree or logistic regression can be used, and if robust prediction is the criteria, then algorithms such as neural networks or support vector machines can be used. While training a model, there are a few techniques that we should keep in mind, like bagging and boosting. Bagging Bootstrap aggregating, which is also known as bagging, is a technique where the data is taken from the original dataset S times to make S new datasets. The datasets are the same size as the original. Each dataset is built by randomly selecting an example from the original with replacement. By with replacement we mean that you can select the same example more than once. This property allows you to have values in the new dataset that are repeated, and some values from the original won't be present in the new set. Bagging helps in reducing the variance of a model and can be used to train different models using the same datasets. The final conclusion is arrived at after considering the output of each model. For example, let's assume our data is a, b, c, d, e, f, g, and h. By sampling our data five times, we can create five different samples as follows: Sample 1: a, b, c, c, e, f, g, h Sample 2: a, b, c, d, d, f, g, h Sample 3: a, b, c, c, e, f, h, h Sample 4: a, b, c, e, e, f, g, h Sample 5: a, b, b, e, e, f, g, h As we sample with replacement, we get the same examples more than once. Now we can train five different models using the five sample datasets. Now, for the prediction; as each model will provide the output, let's assume classes are yes and no, and the final outcome would be the class with maximum votes. If three models say yes and two no, then the final prediction would be class yes. Boosting Boosting is a technique similar to bagging. In boosting and bagging, you always use the same type of classifier. But in boosting, the different classifiers are trained sequentially. Each new classifier is trained based on the performance of those already trained, but gives greater weight to examples that were misclassified by the previous classifier. Boosting focuses new classifiers in the sequence on previously misclassified data. Boosting also differs from bagging in its approach of calculating the final prediction. The output is calculated from a weighted sum of all classifiers, as opposed to the method of equal weights used in bagging. The weights assigned to the classifier output in boosting are based on the performance of the classifier in the previous iteration. Validation After collecting the training set and extracting the features, you need to train the model and validate it on unseen samples. There are many approaches for creating the unseen sample called the validation set. We will be discussing a couple of them shortly. Holdout-set validation One approach to creating the validation set is to divide the feature set into train and test samples. We use the train set to train the model and test set to validate it. The actual percentage split varies from case to case but commonly it is split at 70 percent train and 30 percent test. It is also not uncommon to create three sets, train, test and validation set. Train and test set is created from data out of all considered time periods but the validation set is created from the most recent data. K-fold cross validation Another approach is to divide the data into k equal size folds or parts and then use k-1 of them for training and one for testing. 
The process is repeated k times so that each set is used as a validation set once and the metrics are collected over all the runs. The general standard is to use k as 10, which is called 10-fold cross-validation. Evaluation The objective of evaluation is to test the generalization of a classifier. By generalization, we mean how good the model performs on future data. Ideally, evaluation should be done on an unseen sample, separate to the validation sample or by cross-validation. There are standard metrics to evaluate a classifier against. There are a few things to consider while training a classifier that we should keep in mind. Bias-variance trade-off The first aspect to keep in mind is the trade-off between bias and variance. To understand the meaning of bias and variance, let's assume that we have several different, but equally good, training datasets for a specific supervised learning problem. We train different models using the same technique; for example, build different decision trees using the different training datasets available. Bias measures how far off in general a model's predictions are from the correct value. Bias can be measured as the average difference between a predicted output and its actual value. A learning algorithm is biased for a particular input X if, when trained on different training sets, it is incorrect when predicting the correct output for X. Variance is how greatly the predictions for a given point vary between different realizations of the model. A learning algorithm has high variance for a particular input X if it predicts different output values for X when trained on different training sets. Generally, there will be a trade-off between bias and variance. A learning algorithm with low bias must be flexible so that it can fit the data well. But if the learning algorithm is too flexible, it will fit each training dataset differently, and hence have high variance. A key aspect of many supervised learning methods is that they are able to adjust this trade-off between bias and variance. The plot on the top left is the scatter plot of the original data. The plot on the top right is a fit with high bias; the error in prediction in this case will be high. The bottom left image is a fit with high variance; the model is very flexible, and error on the training set is low but the prediction on unseen data will have a much higher degree of error as compared to the training set. The bottom right plot is an optimum fit with a good trade-off of bias and variance. The model explains the data well and will perform in a similar way for unseen data too. If the bias-variance trade-off is not optimized, it leads to problems of under-fitting and over-fitting. The plot shows a visual representation of the bias-variance trade-off. Over-fitting occurs when an estimator is too flexible and tries to fit the data too closely. High variance and low bias leads to over-fitting of data. Under-fitting occurs when a model is not flexible enough to capture the underlying trends in the observed data. Low variance and high bias leads to under-fitting of data. Function complexity and amount of training data The second aspect to consider is the amount of training data needed to properly represent the learning task. The amount of data required is proportional to the complexity of the data and learning task at hand. For example, if the features in the data have low interaction and are smaller in number, we could train a model with a small amount of data. 
In this case, a learning algorithm with high bias and low variance is better suited. But if the learning task at hand is complex and has a large number of features with higher degree of interaction, then a large amount of training data is required. In this case, a learning algorithm with low bias and high variance is better suited. It is difficult to actually determine the amount of data needed, but the complexity of the task provides some indications. Dimensionality of the input space A third aspect to consider is the dimensionality of the input space. By dimensionality, we mean the number of features the training set has. If the input feature set has a very high number of features, any machine learning algorithm will require a huge amount of data to build a good model. In practice, it is advisable to remove any extra dimensionality before training the model; this is likely to improve the accuracy of the learned function. Techniques like feature selection and dimensionality reduction can be used for this. Noise in data The fourth issue is noise. Noise refers to inaccuracies in data due to various issues. Noise can be present either in the predictor variables, or in the target variable. Both lead to model inaccuracies and reduce the generalization of the model. In practice, there are several approaches to alleviate noise in the data; first would be to identify and then remove the noisy training examples prior to training the supervised learning algorithm, and second would be to have an early stopping criteria to prevent over-fitting. Unsupervised learning Unsupervised learning deals with unlabeled data. The objective is to observe structure in data and find patterns. Tasks like cluster analysis, association rule mining, outlier detection, dimensionality reduction, and so on can be modeled as unsupervised learning problems. As the tasks involved in unsupervised learning vary vastly, there is no single process outline that we can follow. We will follow the process of some of the most common unsupervised learning problems. Cluster analysis Cluster analysis is a subset of unsupervised learning that aims to create groups of similar items from a set of items. Real life examples could be clustering movies according to various attributes like genre, length, ratings, and so on. Cluster analysis helps us identify interesting groups of objects that we are interested in. It could be items we encounter in day-to-day life such as movies, songs according to taste, or interests of users in terms of their demography or purchasing patterns. Let's consider a small example so you understand what we mean by interesting groups and understand the power of clustering. We will use the Iris dataset, which is a standard dataset used for academic research and it contains five variables: sepal length, sepal width, petal length, petal width, and species with 150 observations. The first plot we see shows petal length against petal width. Each color represents a different species. The second plot is the groups identified by clustering the data. Looking at the plot, we can see that the plot of petal length against petal width clearly separates the species of the Iris flower and in the process, it clusters the group's flowers of the same species together. Cluster analysis can be used to identify interesting patterns in data. The process of clustering involves these four steps. We will discuss each of them in the section ahead. 
Objective Feature representation Algorithm for clustering A stopping criteria Objective What do we want to cluster? This is an important question. Let's assume we have a large customer base for some kind of an e-commerce site and we want to group them together. How do we want to group them? Do we want to group our users according to their demography, such as age, location, income, and so on or are we interested in grouping them together? A clear objective is a good start, though it is not uncommon to start without an objective and see what can be done with the available data. Feature representation As with any machine learning task, feature representation is important for cluster analysis too. Creating derived features, summarizing data, and converting categorical variables to continuous variables are some of the common tasks. The feature representation needs to represent the objective of clustering. For example, if the objective is to cluster users based upon purchasing behavior, then features should be derived from purchase transaction and user demography information. If the objective is to cluster documents, then features should be extracted from the text of the document. Feature normalization To compare the feature vectors, we need to normalize them. Normalization could be across rows or across columns. In most cases, both are normalized. Row normalization The objective of normalizing rows is to make the objects to be clustered, comparable. Let's assume we are clustering organizations based upon their e-mailing behavior. Now organizations are very large and very small, but the objective is to capture the e-mailing behavior, irrespective of size of the organization. In this scenario, we need to figure out a way to normalize rows representing each organization, so that they can be compared. In this case, dividing by user count in each respective organization could give us a good feature representation. Row normalization is mostly driven by the business domain and requires domain expertise. Column normalization The range of data across columns varies across datasets. The unit could be different or the range of columns could be different, or both. There are many ways of normalizing data. Which technique to use varies from case to case and depends upon the objective. A few of them are discussed here. Rescaling The simplest method is to rescale the range of features to make the features independent of each other. The aim is scale the range in [0, 1] or [−1, 1]: Here x is the original value and x', the rescaled valued. Standardization Feature standardization allows for the values of each feature in the data to have zero-mean and unit-variance. In general, we first calculate the mean and standard deviation for each feature and then subtract the mean in each feature. Then, we divide the mean subtracted values of each feature by its standard deviation: Xs = (X – mean(X)) / standard deviation(X). A notion of similarity and dissimilarity Once we have the objective defined, it leads to the idea of similarity and dissimilarity of object or data points. Since we need to group things together based on similarity, we need a way to measure similarity. Likewise to keep dissimilar things apart, we need a notion of dissimilarity. This idea is represented in machine learning by the idea of a distance measure. Distance measure, as the name suggests, is used to measure the distance between two objects or data points. 
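Before looking at the individual distance measures, here is a minimal Python sketch of the two column normalization approaches described above. The min-max rescaling formula is the standard one and is written out here as an assumption, since the original equation is not reproduced in this text; the standardization step follows the Xs = (X – mean(X)) / standard deviation(X) formula given above.

# Minimal sketch of column normalization, assuming each feature is a list of numbers.
def rescale(values):
    # Min-max rescaling to the [0, 1] range: x' = (x - min) / (max - min)
    lo, hi = min(values), max(values)
    return [(x - lo) / float(hi - lo) for x in values]

def standardize(values):
    # Zero mean, unit variance: Xs = (X - mean(X)) / standard deviation(X)
    n = len(values)
    mean = sum(values) / float(n)
    std = (sum((x - mean) ** 2 for x in values) / float(n)) ** 0.5
    return [(x - mean) / std for x in values]

ages = [23, 35, 46, 58, 64]
print(rescale(ages))       # values now lie between 0 and 1
print(standardize(ages))   # values now have zero mean and unit variance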
Euclidean distance measure Euclidean distance measure is the most commonly used and intuitive distance measure: Squared Euclidean distance measure The standard Euclidean distance, when squared, places progressively greater weight on objects that are farther apart as compared to the nearer objects. The equation to calculate squared Euclidean measure is shown here: Manhattan distance measure Manhattan distance measure is defined as the sum of the absolute difference of the coordinates of two points. The distance between two points measured along axes at right angles. In a plane with p1 at (x1, y1) and p2 at (x2, y2), it is |x1 - x2| + |y1 - y2|: Cosine distance measure The cosine distance measure measures the angle between two points. When this angle is small, the vectors must be pointing in the same direction, and so in some sense the points are close. The cosine of this angle is near one when the angle is small, and decreases as it gets larger. The cosine distance equation subtracts the cosine value from one in order to give a proper distance, which is 0 when close and larger otherwise. The cosine distance measure doesn't account for the length of the two vectors; all that matters is that the points are in the same direction from the origin. Also note that the cosine distance measure ranges from 0.0, if the two vectors are along the same direction, to 2.0, when the two vectors are in opposite directions: Tanimoto distance measure The Tanimoto distance measure, like the cosine distance measure, measures the angle between two points, as well as the relative distance between the points: Apart from the standard distance measure, we can also define our own distance measure. Custom distance measure can be explored when existing ones are not able to measure the similarity between items. Algorithm for clustering The type of clustering algorithm to be used is driven by the objective of the problem at hand. There are several options and the predominant ones are density-based clustering, distance-based clustering, distribution-based clustering, and hierarchical clustering. The choice of algorithm to be used depends upon the objective of the problem. A stopping criteria We need to know when to stop the clustering process. The stopping criteria could be decided in different ways: one way is when the cluster centroids don't move beyond a certain margin after multiple iterations, a second way is when the density of the clusters have stabilized, and third way could be based upon the number of iterations, for example stopping the algorithm after 100 iterations. The stopping criteria depends upon the algorithm used, the goal being to stop when we have good enough clusters. Logistic regression Logistic regression is a probabilistic classification model. It provides the probability of a particular instance belonging to a class. It is used to predict the probability of binary outcomes. Logistic regression is computationally inexpensive, is relatively easier to implement, and can be interpreted easily. Logistic regression belongs to the class of discriminative models. The other class of algorithms is generative models. Let's try to understand the differences between the two. Suppose we have some input data represented by X and a target variable Y, the learning task obviously is P(Y|X), finding the conditional probability of Y occurring given X. 
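Before continuing with the generative versus discriminative discussion, here is a small Python sketch of the distance measures described above. The formulas are the standard textbook definitions (the original equations are not reproduced in this text), so treat this as an illustration rather than Mahout's own implementations.

import math

def euclidean(a, b):
    # Square root of the sum of squared coordinate differences
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def squared_euclidean(a, b):
    # Same as above without the square root; weights far-apart points more heavily
    return sum((x - y) ** 2 for x, y in zip(a, b))

def manhattan(a, b):
    # Sum of absolute coordinate differences: |x1 - x2| + |y1 - y2| + ...
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_distance(a, b):
    # 1 - cosine of the angle between the vectors; 0.0 same direction, 2.0 opposite
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def tanimoto_distance(a, b):
    # Accounts for both the angle and the relative lengths of the vectors
    dot = sum(x * y for x, y in zip(a, b))
    na2 = sum(x * x for x in a)
    nb2 = sum(y * y for y in b)
    return 1.0 - dot / (na2 + nb2 - dot)

p1, p2 = [1.0, 2.0], [4.0, 6.0]
print(euclidean(p1, p2), squared_euclidean(p1, p2), manhattan(p1, p2))
print(cosine_distance(p1, p2), tanimoto_distance(p1, p2))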
A generative model concerns itself with learning the joint probability P(Y, X), whereas a discriminative model directly learns the conditional probability P(Y|X) from the training set, which is the actual objective of classification. A generative model first learns P(Y, X) and then gets to P(Y|X) by conditioning on X using Bayes' theorem. In more intuitive terms, generative models first learn the distribution of the data and, in doing so, model how the data is actually generated. Discriminative models don't try to learn the underlying data distribution; they are concerned with finding the decision boundaries for the classification. Since generative models learn the distribution, it is possible to generate synthetic samples of X, Y; this is not possible with discriminative models. Some common examples of generative and discriminative models are as follows:

Generative: naïve Bayes, Latent Dirichlet allocation
Discriminative: logistic regression, SVM, neural networks

Logistic regression belongs to the family of statistical techniques called regression. For regression problems and a few other optimization problems, we first define a hypothesis, then define a cost function, and optimize it using an optimization algorithm such as gradient descent. The optimization algorithm tries to find the regression coefficients that best fit the data.

Let's assume that the target variable is Y and the predictor variable or feature is X. Any regression problem starts with defining the hypothesis function, for example an equation of the predictor variable such as h(x) = w0 + w1*x, then defines a cost function, and then tweaks the weights, in this case w0 and w1, to minimize or maximize the cost function by using an optimization algorithm.

For logistic regression, the predicted target needs to fall between zero and one. We start by defining the hypothesis function for it as h(x) = f(z) = 1 / (1 + e^-z), where z = xw. Here, f(z) is the sigmoid or logistic function that has a range of zero to one, x is the matrix of features, and w is the vector of weights. The next step is to define the cost function, which measures the difference between predicted and actual values. The objective of the optimization algorithm here is to find the weight vector w that fits the regression coefficients so that the difference between predicted and actual target values is minimized. We will discuss gradient descent as the choice for the optimization algorithm shortly. To find the local minimum of a function using gradient descent, one takes steps proportional to the negative of the gradient of that function at the current point. This will give us the optimum value of the vector w once we achieve the stopping criteria. The stopping criteria is when the change in the weight vector falls below a certain threshold, although sometimes it could be set to a predefined number of iterations.

Logistic regression falls into the category of white box techniques and can be interpreted. Features or variables are of two major types, categorical and continuous, defined as follows:

Categorical variable: This is a variable or feature that can take on a limited, and usually fixed, number of possible values. For example, variables such as industry, zip code, and country are categorical variables.
Continuous variable: This is a variable that can take on any value between its minimum and maximum value, or range. For example, variables such as age and price are continuous variables.

Mahout logistic regression command line

Mahout employs a modified version of gradient descent called stochastic gradient descent (SGD).
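To make the hypothesis function and the idea of a stochastic update concrete, here is a minimal Python sketch. It is not Mahout's OnlineLogisticRegression; it is just the textbook logistic hypothesis plus a single SGD weight update, with an assumed learning rate and made-up feature vectors.

import math

def sigmoid(z):
    # Logistic function: maps any real z into the (0, 1) range
    return 1.0 / (1.0 + math.exp(-z))

def predict(weights, features):
    # Hypothesis: f(w . x), interpreted as the probability of the positive class
    z = sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

def sgd_update(weights, features, label, learning_rate=0.1):
    # One stochastic step: adjust the weights using a single (features, label) instance
    error = label - predict(weights, features)
    return [w + learning_rate * error * x for w, x in zip(weights, features)]

# Tiny example: one positive and one negative instance, with 1.0 as a bias feature
w = [0.0, 0.0, 0.0]
for features, label in [([1.0, 2.0, 0.5], 1), ([1.0, 0.2, 3.0], 0)]:
    w = sgd_update(w, features, label)
print(w)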
Gradient descent, as described previously, uses the whole dataset on each update. This is fine for a small dataset of a few hundred examples, but with billions of data points containing thousands of features, it's unnecessarily expensive in terms of computational resources. An alternative is to update the weights using only one instance at a time; this is known as stochastic gradient descent. Stochastic gradient descent is an example of an online learning algorithm, so called because we can incrementally update the classifier as new data comes in, rather than all at once. The all-at-once method is known as batch processing.

We will now train and test a logistic regression model using Mahout, and discuss both command line and code examples. The first step is to get the data and explore it.

Getting the data

The dataset required for this article is included in the code repository that comes with this book, in the learningApacheMahout/data/chapter4 directory. If you wish to download the data yourself, it is available from the UCI Machine Learning Repository, which hosts many datasets for machine learning. You can check out the other datasets available for further practice via this link: http://archive.ics.uci.edu/ml/datasets.html.

Create a folder in your home directory with the following commands:

cd $HOME
mkdir bank_data
cd bank_data

Download the data into the bank_data directory:

wget http://archive.ics.uci.edu/ml/machine-learning-databases/00222/bank-additional.zip

Unzip the file using whichever utility you like; we use unzip:

unzip bank-additional.zip
cd bank-additional

We are interested in the file bank-additional-full.csv. Copy the file to the learningApacheMahout/data/chapter4 directory. The file is semicolon delimited, the values are enclosed by ", and it has a header line with column names. We will use sed to preprocess the data. The sed editor is a very powerful editor in Linux and the command to use it is as follows:

sed -e 's/STRING_TO_REPLACE/STRING_TO_REPLACE_IT/g' fileName > Output_fileName

For in-place editing, the command is as follows:

sed -i 's/STRING_TO_REPLACE/STRING_TO_REPLACE_IT/g' fileName

The commands to replace ; with , and to remove " are as follows:

sed -e 's/;/,/g' bank-additional-full.csv > input_bank_data.csv
sed -i 's/"//g' input_bank_data.csv

The dataset contains demographic and previous campaign-related data about a client, and the outcome of whether or not the client subscribed to the term deposit. We are interested in training a model that can predict whether a client will subscribe to a term deposit, given the input data.
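Before looking at the variables in detail, a quick sanity check of the cleaned file can save trouble later. The following sketch assumes pandas is installed and that input_bank_data.csv (with its header still intact at this point) is in the current directory.

import pandas as pd

# Load the comma-delimited file produced by the sed commands above
df = pd.read_csv("input_bank_data.csv")

print(df.shape)                 # number of rows and columns
print(df["y"].value_counts())   # class balance of the target variable
print(df.dtypes.head(10))       # a peek at the inferred column types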
The following table shows the various input variables along with their types:

Column name | Description | Variable type
age | The age of the client | Numeric
job | The type of job, for example, entrepreneur, housemaid, management | Categorical
marital | The marital status of the client | Categorical
education | The education level of the client | Categorical
default | Whether the client has defaulted on credit | Categorical
housing | Whether the client has a housing loan | Categorical
loan | Whether the client has a personal loan | Categorical
contact | The contact communication type | Categorical
month | The last contact month of the year | Categorical
day_of_week | The last contact day of the week | Categorical
duration | The last contact duration, in seconds | Numeric
campaign | The number of contacts performed during this campaign | Numeric
pdays | The number of days that passed since the last contact | Numeric
previous | The number of contacts performed before this campaign | Numeric
poutcome | The outcome of the previous marketing campaign | Categorical
emp.var.rate | The employment variation rate - quarterly indicator | Numeric
cons.price.idx | The consumer price index - monthly indicator | Numeric
cons.conf.idx | The consumer confidence index - monthly indicator | Numeric
euribor3m | The euribor three month rate - daily indicator | Numeric
nr.employed | The number of employees - quarterly indicator | Numeric

Model building via command line

Mahout provides a command line implementation of logistic regression, and we will first build a model using it. Logistic regression in Mahout does not have a MapReduce implementation, but as it uses stochastic gradient descent, it is pretty fast, even for large datasets. The Mahout Java class is OnlineLogisticRegression in the org.apache.mahout.classifier.sgd package.

Splitting the dataset

To split a dataset, we can use the Mahout split command. Let's look at the split command arguments as follows:

mahout split --help

We need to remove the first line before running the split command, because the file contains a header line and the split command doesn't make any special allowance for header lines; it would land on an arbitrary line in one of the split files. We first remove the header line from the input_bank_data.csv file:

sed -i '1d' input_bank_data.csv
mkdir input_bank
cp input_bank_data.csv input_bank

Logistic regression in Mahout is implemented for single-machine execution. We set the variable MAHOUT_LOCAL to instruct Mahout to execute in the local mode:

export MAHOUT_LOCAL=TRUE

mahout split --input input_bank --trainingOutput train_data --testOutput test_data -xm sequential --randomSelectionPct 30

This creates two datasets, with the split based on the number passed to the --randomSelectionPct argument. The split command can run on both Hadoop and the local file system. For the current execution, it runs in the local mode on the local file system and splits the data into two sets: 70 percent as train in the train_data directory and 30 percent as test in the test_data directory.
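If you just want to see what a 70/30 split looks like without running Mahout, a rough Python equivalent of --randomSelectionPct 30 is sketched below; it is only an illustration and not a replacement for the mahout split command, and the output file names are made up.

import random

# Illustrative 70/30 split of the header-less CSV, mirroring --randomSelectionPct 30
random.seed(42)
with open("input_bank/input_bank_data.csv") as f:
    lines = f.readlines()

train, test = [], []
for line in lines:
    (test if random.random() < 0.30 else train).append(line)

open("train_data_py.csv", "w").writelines(train)
open("test_data_py.csv", "w").writelines(test)
print(len(train), len(test))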
Next, we restore the header line to the train and test files as follows: sed -i '1s/^/age,job,marital,education,default,housing,loan,contact,month,day_of_week,duration,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,yn/' train_data/input_bank_data.csv sed -i '1s/^/age,job,marital,education,default,housing,loan,contact,month,day_of_week,duration,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,yn/' test_data/input_bank_data.csv Train the model command line option Let's have a look at some important and commonly used parameters and their descriptions: mahout trainlogistic ––help   --help print this list --quiet be extra quiet --input "input directory from where to get the training data" --output "output directory to store the model" --target "the name of the target variable" --categories "the number of target categories to be considered" --predictors "a list of predictor variables" --types "a list of predictor variables types (numeric, word or text)" --passes "the number of times to pass over the input data" --lambda "the amount of coeffiecient decay to use" --rate     "learningRate the learning rate" --noBias "do not include a bias term" --features "the number of internal hashed features to use"   mahout trainlogistic --input train_data/input_bank_data.csv --output model --target y --predictors age job marital education default housing loan contact month day_of_week duration campaign pdays previous poutcome emp.var.rate cons.price.idx cons.conf.idx euribor3m nr.employed --types n w w w w w w w w w n n n n w n n n n n --features 20 --passes 100 --rate 50 --categories 2 We pass the input filename and the output folder name, identify the target variable name using --target option, the predictors using the --predictors option, and the variable or predictor type using --types option. Numeric predictors are represented using 'n', and categorical variables are predicted using 'w'. Learning rate passed using --rate is used by gradient descent to determine the step size for each descent. We pass the maximum number of passes over data as 100 and categories as 2. The output is given below, which represents 'y', the target variable, as a sum of predictor variables multiplied by coefficient or weights. As we have not included the --noBias option, we see the intercept term in the equation: y ~ -990.322*Intercept Term + -131.624*age + -11.436*campaign + -990.322*cons.conf.idx + -14.006*cons.price.idx + -15.447*contact=cellular + -9.738*contact=telephone + 5.943*day_of_week=fri + -988.624*day_of_week=mon + 10.551*day_of_week=thu + 11.177*day_of_week=tue + -131.624*day_of_week=wed + -8.061*default=no + 12.301*default=unknown + -131.541*default=yes + 6210.316*duration + -17.755*education=basic.4y + 4.618*education=basic.6y + 8.780*education=basic.9y + -11.501*education=high.school + 0.492*education=illiterate + 17.412*education=professional.course + 6202.572*education=university.degree + -979.771*education=unknown + -189.978*emp.var.rate + -6.319*euribor3m + -21.495*housing=no + -14.435*housing=unknown + 6210.316*housing=yes + -190.295*job=admin. 
+ 23.169*job=blue-collar + 6202.200*job=entrepreneur + 6202.200*job=housemaid + -3.208*job=management + -15.447*job=retired + 1.781*job=self-employed + 11.396*job=services + -6.637*job=student + 6202.572*job=technician + -9.976*job=unemployed + -4.575*job=unknown + -12.143*loan=no + -0.386*loan=unknown + -197.722*loan=yes + -12.308*marital=divorced + -9.185*marital=married + -1004.328*marital=single + 8.559*marital=unknown + -11.501*month=apr + 9.110*month=aug + -1180.300*month=dec + -189.978*month=jul + 14.316*month=jun + -124.764*month=mar + 6203.997*month=may + -0.884*month=nov + -9.761*month=oct + 12.301*month=sep + -990.322*nr.employed + -189.978*pdays + -14.323*poutcome=failure + 4.874*poutcome=nonexistent + -7.191*poutcome=success + 1.698*previous Interpreting the output The output of the trainlogistic command is an equation representing the sum of all predictor variables multiplied by their respective coefficient. The coefficients give the change in the log-odds of the outcome for one unit increase in the corresponding feature or predictor variable. Odds are represented as the ratio of probabilities, and they express the relative probabilities of occurrence or nonoccurrence of an event. If we take the base 10 logarithm of odds and multiply the results by 10, it gives us the log-odds. Let's take an example to understand it better. Let's assume that the probability of some event E occurring is 75 percent: P(E)=75%=75/100=3/4 The probability of E not happening is as follows: 1-P(A)=25%=25/100=1/4 The odds in favor of E occurring are P(E)/(1-P(E))=3:1 and odds against it would be 1:3. This shows that the event is three times more likely to occur than to not occur. Log-odds would be 10*log(3). For example, a unit increase in the age will decrease the log-odds of the client subscribing to a term deposit by 97.148 times, whereas a unit increase in cons.conf.idx will increase the log-odds by 1051.996. Here, the change is measured by keeping other variables at the same value. Testing the model After the model is trained, it's time to test the model's performance by using a validation set. Mahout has the runlogistic command for the same, the options are as follows: mahout runlogistic ––help We run the following command on the command line: mahout runlogistic --auc --confusion --input train_data/input_bank_data.csv --model model   AUC = 0.59 confusion: [[25189.0, 2613.0], [424.0, 606.0]] entropy: [[NaN, NaN], [-45.3, -7.1]] To get the scores for each instance, we use the --scores option as follows: mahout runlogistic --scores --input train_data/input_bank_data.csv --model model To test the model on the test data, we will pass on the test file created during the split process as follows: mahout runlogistic --auc --confusion --input test_data/input_bank_data.csv --model model   AUC = 0.60 confusion: [[10743.0, 1118.0], [192.0, 303.0]] entropy: [[NaN, NaN], [-45.2, -7.5]] Prediction Mahout doesn't have an out of the box command line for implementation of logistic regression for prediction of new samples. Note that the new samples for the prediction won't have the target label y, we need to predict that value. There is a way to work around this, though; we can use mahout runlogistic for generating a prediction by adding a dummy column as the y target variable and adding some random values. The runlogistic command expects the target variable to be present, hence the dummy columns are added. We can then get the predicted score using the --scores option. 
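To make the test-set results above a little more concrete, the following Python sketch derives accuracy, precision, and recall from the confusion matrix reported by runlogistic on the test data. The orientation of the matrix (which cells are true negatives, false positives, and so on) is an assumption here, so verify it against the Mahout documentation before relying on the derived numbers.

# Confusion matrix reported by runlogistic on the test data:
# [[10743.0, 1118.0], [192.0, 303.0]]
# Assumed layout: rows = actual class (no, yes), columns = predicted class (no, yes)
tn, fp = 10743.0, 1118.0
fn, tp = 192.0, 303.0

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)

print("accuracy  = %.3f" % accuracy)
print("precision = %.3f" % precision)
print("recall    = %.3f" % recall)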
Summary In this article, we covered the basic machine learning concepts. We also saw the logistic regression example in Mahout. Resources for Article:   Further resources on this subject: Implementing the Naïve Bayes classifier in Mahout [article] Learning Random Forest Using Mahout [article] Understanding the HBase Ecosystem [article]