
How-To Tutorials


Training and Visualizing a neural network with R

Oli Huggins
16 Feb 2016
8 min read
The development of a neural network is inspired by human brain activities. As such, this type of network is a computational model that mimics the pattern of the human mind. In contrast to this, support vector machines first map input data into a high-dimensional feature space defined by the kernel function, and find the optimum hyperplane that separates the training data by the maximum margin. In short, we can think of support vector machines as a linear algorithm in a high-dimensional space. In this article, we will cover:

Training a neural network with neuralnet
Visualizing a neural network trained by neuralnet

Training a neural network with neuralnet

The neural network is constructed with an interconnected group of nodes, which involves the input, connected weights, processing element, and output. Neural networks can be applied to many areas, such as classification, clustering, and prediction. To train a neural network in R, you can use neuralnet, which is built to train multilayer perceptrons in the context of regression analysis and contains many flexible functions for training feed-forward neural networks. In this recipe, we will introduce how to use neuralnet to train a neural network.

Getting ready

In this recipe, we will use the iris dataset as our example dataset. We will first split the iris dataset into training and testing datasets.

How to do it...

Perform the following steps to train a neural network with neuralnet:

First, load the iris dataset and split the data into training and testing datasets:

> data(iris)
> ind <- sample(2, nrow(iris), replace = TRUE, prob=c(0.7, 0.3))
> trainset = iris[ind == 1,]
> testset = iris[ind == 2,]

Then, install and load the neuralnet package:

> install.packages("neuralnet")
> library(neuralnet)

Add the columns versicolor, setosa, and virginica based on the name matched value in the Species column:

> trainset$setosa = trainset$Species == "setosa"
> trainset$virginica = trainset$Species == "virginica"
> trainset$versicolor = trainset$Species == "versicolor"

Next, train the neural network with the neuralnet function with three hidden neurons in each layer. Notice that the results may vary with each training, so you might not get the same result:

> network = neuralnet(versicolor + virginica + setosa ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, trainset, hidden=3)
> network
Call: neuralnet(formula = versicolor + virginica + setosa ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, data = trainset, hidden = 3)
1 repetition was calculated.
Error Reached Threshold Steps 1 0.8156100175 0.009994274769 11063 Now, you can view the summary information by accessing the result.matrix attribute of the built neural network model: > network$result.matrix 1 error 0.815610017474 reached.threshold 0.009994274769 steps 11063.000000000000 Intercept.to.1layhid1 1.686593311644 Sepal.Length.to.1layhid1 0.947415215237 Sepal.Width.to.1layhid1 -7.220058260187 Petal.Length.to.1layhid1 1.790333443486 Petal.Width.to.1layhid1 9.943109233330 Intercept.to.1layhid2 1.411026063895 Sepal.Length.to.1layhid2 0.240309549505 Sepal.Width.to.1layhid2 0.480654059973 Petal.Length.to.1layhid2 2.221435192437 Petal.Width.to.1layhid2 0.154879347818 Intercept.to.1layhid3 24.399329878242 Sepal.Length.to.1layhid3 3.313958088512 Sepal.Width.to.1layhid3 5.845670010464 Petal.Length.to.1layhid3 -6.337082722485 Petal.Width.to.1layhid3 -17.990352566695 Intercept.to.versicolor -1.959842102421 1layhid.1.to.versicolor 1.010292389835 1layhid.2.to.versicolor 0.936519720978 1layhid.3.to.versicolor 1.023305801833 Intercept.to.virginica -0.908909982893 1layhid.1.to.virginica -0.009904635231 1layhid.2.to.virginica 1.931747950462 1layhid.3.to.virginica -1.021438938226 Intercept.to.setosa 1.500533827729 1layhid.1.to.setosa -1.001683936613 1layhid.2.to.setosa -0.498758815934 1layhid.3.to.setosa -0.001881935696 Lastly, you can view the generalized weight by accessing it in the network: > head(network$generalized.weights[[1]]) How it works... The neural network is a network made up of artificial neurons (or nodes). There are three types of neurons within the network: input neurons, hidden neurons, and output neurons. In the network, neurons are connected; the connection strength between neurons is called weights. If the weight is greater than zero, it is in an excitation status. Otherwise, it is in an inhibition status. Input neurons receive the input information; the higher the input value, the greater the activation. Then, the activation value is passed through the network in regard to weights and transfer functions in the graph. The hidden neurons (or output neurons) then sum up the activation values and modify the summed values with the transfer function. The activation value then flows through hidden neurons and stops when it reaches the output nodes. As a result, one can use the output value from the output neurons to classify the data. Artificial Neural Network The advantages of a neural network are: firstly, it can detect a nonlinear relationship between the dependent and independent variable. Secondly, one can efficiently train large datasets using the parallel architecture. Thirdly, it is a nonparametric model so that one can eliminate errors in the estimation of parameters. The main disadvantages of neural network are that it often converges to the local minimum rather than the global minimum. Also, it might over-fit when the training process goes on for too long. In this recipe, we demonstrate how to train a neural network. First, we split the iris dataset into training and testing datasets, and then install the neuralnet package and load the library into an R session. Next, we add the columns versicolor, setosa, and virginica based on the name matched value in the Species column, respectively. We then use the neuralnet function to train the network model. Besides specifying the label (the column where the name equals to versicolor, virginica, and setosa) and training attributes in the function, we also configure the number of hidden neurons (vertices) as three in each layer. 
Then, we examine the basic information about the training process and the trained network saved in network. The output message shows that the training process needed 11,063 steps until all the absolute partial derivatives of the error function were lower than 0.01 (specified in the threshold). The error refers to the likelihood of calculating the Akaike Information Criterion (AIC). To see detailed information on this, you can access the result.matrix of the built neural network to see the estimated weights. The output reveals that the estimated weights range from -18 to 24.40; the intercepts of the first hidden layer are 1.69, 1.41, and 24.40, and the four weights leading to the first hidden neuron are estimated as 0.95 (Sepal.Length), -7.22 (Sepal.Width), 1.79 (Petal.Length), and 9.94 (Petal.Width). We can lastly determine that the trained neural network information includes generalized weights, which express the effect of each covariate. In this recipe, the model generates 12 generalized weights, which are the combinations of the four covariates (Sepal.Length, Sepal.Width, Petal.Length, Petal.Width) with the three responses (setosa, virginica, versicolor).

See also

For a more detailed introduction to neuralnet, one can refer to the following paper: Günther, F., and Fritsch, S. (2010). neuralnet: Training of neural networks. The R Journal, 2(1), 30-38.

Visualizing a neural network trained by neuralnet

The package neuralnet provides the plot function to visualize a built neural network and the gwplot function to visualize generalized weights. In the following recipe, we will cover how to use these two functions.

Getting ready

You need to have completed the previous recipe by training a neural network and have all the basic information saved in network.

How to do it...

Perform the following steps to visualize the neural network and the generalized weights:

You can visualize the trained neural network with the plot function:

> plot(network)

Figure 10: The plot of the trained neural network

Furthermore, you can use gwplot to visualize the generalized weights:

> par(mfrow=c(2,2))
> gwplot(network, selected.covariate="Petal.Width")
> gwplot(network, selected.covariate="Sepal.Width")
> gwplot(network, selected.covariate="Petal.Length")
> gwplot(network, selected.covariate="Sepal.Length")

Figure 11: The plot of generalized weights

How it works...

In this recipe, we demonstrate how to visualize the trained neural network and the generalized weights of each trained attribute. The plot includes the estimated weights, intercepts, and basic information about the training process. At the bottom of the figure, one can find the overall error and the number of steps required to converge. If all the generalized weights are close to zero on the plot, it means the covariate has little effect. However, if the overall variance is greater than one, it means the covariate has a nonlinear effect.

See also

For more information about gwplot, one can use the help function to access the following document:

> ?gwplot

Summary

To learn more about machine learning with R, the following books published by Packt Publishing (https://www.packtpub.com/) are recommended: Machine Learning with R (Second Edition) - Read Online Mastering Machine Learning with R - Read Online Resources for Article: Further resources on this subject: Introduction to Machine Learning with R [article] Hive Security [article] Spark - Architecture and First Program [article]

Machine learning and Python – the Dream Team

Packt
16 Feb 2016
3 min read
In this article we will be learning more about machine learning and Python. Machine learning (ML) teaches machines how to carry out tasks by themselves. It is that simple. The complexity comes with the details, and that is most likely the reason you are reading this article.

Machine learning and Python – the dream team

The goal of machine learning is to teach machines (software) to carry out tasks by providing them with a couple of examples (how to do or not do a task). Let us assume that each morning when you turn on your computer, you perform the same task of moving e-mails around so that only those e-mails belonging to a particular topic end up in the same folder. After some time, you feel bored and think of automating this chore. One way would be to start analyzing your brain and writing down all the rules your brain processes while you are shuffling your e-mails. However, this will be quite cumbersome and always imperfect. While you will miss some rules, you will over-specify others. A better and more future-proof way would be to automate this process by choosing a set of e-mail meta information and body/folder name pairs and letting an algorithm come up with the best rule set. The pairs would be your training data, and the resulting rule set (also called a model) could then be applied to future e-mails, which we have not yet seen. This is machine learning in its simplest form.

Of course, machine learning (often also referred to as data mining or predictive analysis) is not a brand new field in itself. Quite the contrary, its success over recent years can be attributed to the pragmatic way of using rock-solid techniques and insights from other successful fields, for example, statistics. There, the purpose is for us humans to get insights into the data by learning more about the underlying patterns and relationships. As you read more and more about successful applications of machine learning (you have checked out kaggle.com already, haven't you?), you will see that applied statistics is a common field among machine learning experts.

As you will see later, the process of coming up with a decent ML approach is never a waterfall-like process. Instead, you will see yourself going back and forth in your analysis, trying out different versions of your input data on diverse sets of ML algorithms. It is this explorative nature that lends itself perfectly to Python. Being an interpreted high-level programming language, it may seem that Python was designed specifically for the process of trying out different things. What is more, it does this very fast. Sure enough, it is slower than C or similar statically typed programming languages; nevertheless, with a myriad of easy-to-use libraries that are often written in C, you don't have to sacrifice speed for agility.

Summary

In this article we learned about machine learning and its goals. To learn more please refer to the following books: Building Machine Learning Systems with Python - Second Edition (https://www.packtpub.com/big-data-and-business-intelligence/building-machine-learning-systems-python-second-edition) Expert Python Programming (https://www.packtpub.com/application-development/expert-python-programming) Resources for Article: Further resources on this subject: Python Design Patterns in Depth – The Observer Pattern [article] Python Design Patterns in Depth: The Factory Pattern [article] Customizing IPython [article]
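To make the e-mail-sorting example above concrete, here is a minimal sketch of the train-then-apply idea. It assumes scikit-learn is installed; the folder names and meta-information fields are invented purely for illustration.

# Minimal sketch of the e-mail example above: learn folder assignments
# from (meta-information, folder) pairs, then apply the model to unseen mail.
# Assumes scikit-learn is installed; field names and values are illustrative.
from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeClassifier

# Training data: e-mail meta information paired with the folder it belongs to.
emails = [
    {"sender_domain": "github.com", "has_attachment": 0, "subject_word": "build"},
    {"sender_domain": "newsletter.example", "has_attachment": 0, "subject_word": "sale"},
    {"sender_domain": "github.com", "has_attachment": 1, "subject_word": "release"},
]
folders = ["dev", "promotions", "dev"]

vectorizer = DictVectorizer(sparse=False)          # turn dicts into a numeric matrix
X = vectorizer.fit_transform(emails)
model = DecisionTreeClassifier().fit(X, folders)   # the learned "rule set"

# Apply the learned rule set to a future, unseen e-mail.
new_mail = {"sender_domain": "github.com", "has_attachment": 0, "subject_word": "release"}
print(model.predict(vectorizer.transform([new_mail]))[0])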

Data mining

Packt
16 Feb 2016
11 min read
Let's talk about data mining. What is data mining? Data mining is the discovery of a model in data; it's also called exploratory data analysis, and discovers useful, valid, unexpected, and understandable knowledge from the data. Some goals are shared with other sciences, such as statistics, artificial intelligence, machine learning, and pattern recognition. Data mining has been frequently treated as an algorithmic problem in most cases. Clustering, classification, association rule learning, anomaly detection, regression, and summarization are all part of the tasks belonging to data mining. (For more resources related to this topic, see here.) The data mining methods can be summarized into two main categories of data mining problems: feature extraction and summarization. Feature extraction This is to extract the most prominent features of the data and ignore the rest. Here are some examples: Frequent itemsets: This model makes sense for data that consists of baskets of small sets of items. Similar items: Sometimes your data looks like a collection of sets and the objective is to find pairs of sets that have a relatively large fraction of their elements in common. It's a fundamental problem of data mining. Summarization The target is to summarize the dataset succinctly and approximately, such as clustering, which is the process of examining a collection of points (data) and grouping the points into clusters according to some measure. The goal is that points in the same cluster have a small distance from one another, while points in different clusters are at a large distance from one another. The data mining process There are two popular processes to define the data mining process in different perspectives, and the more widely adopted one is CRISP-DM: Cross-Industry Standard Process for Data Mining(CRISP-DM) Sample, Explore, Modify, Model, Assess (SEMMA), which was developed by the SAS Institute, USA CRISP-DM There are six phases in this process that are shown in the following figure; it is not rigid, but often has a great deal of backtracking: Let's look at the phases in detail: Business understanding: This task includes determining business objectives, assessing the current situation, establishing data mining goals, and developing a plan. Data understanding: This task evaluates data requirements and includes initial data collection, data description, data exploration, and the verification of data quality. Data preparation: Once available, data resources are identified in the last step. Then, the data needs to be selected, cleaned, and then built into the desired form and format. Modeling: Visualization and cluster analysis are useful for initial analysis. The initial association rules can be developed by applying tools such as generalized rule induction. This is a data mining technique to discover knowledge represented as rules to illustrate the data in the view of causal relationship between conditional factors and a given decision/outcome. The models appropriate to the data type can also be applied. Evaluation :The results should be evaluated in the context specified by the business objectives in the first step. This leads to the identification of new needs and in turn reverts to the prior phases in most cases. Deployment: Data mining can be used to both verify previously held hypotheses or for knowledge. 
SEMMA Here is an overview of the process for SEMMA: Let's look at these processes in detail: Sample: In this step, a portion of a large dataset is extracted Explore: To gain a better understanding of the dataset, unanticipated trends and anomalies are searched in this step Modify: The variables are created, selected, and transformed to focus on the model construction process Model: A variable combination of models is searched to predict a desired outcome Assess: The findings from the data mining process are evaluated by its usefulness and reliability Social network mining As we mentioned before, data mining finds a model on data and the mining of social network finds the model on graph data in which the social network is represented. Social network mining is one application of web data mining; the popular applications are social sciences and bibliometry, PageRank and HITS, shortcomings of the coarse-grained graph model, enhanced models and techniques, evaluation of topic distillation, and measuring and modeling the Web. Social network When it comes to the discussion of social networks, you will think of Facebook, Google+, LinkedIn, and so on. The essential characteristics of a social network are as follows: There is a collection of entities that participate in the network. Typically, these entities are people, but they could be something else entirely. There is at least one relationship between the entities of the network. On Facebook, this relationship is called friends. Sometimes, the relationship is all-or-nothing; two people are either friends or they are not. However, in other examples of social networks, the relationship has a degree. This degree could be discrete, for example, friends, family, acquaintances, or none as in Google+. It could be a real number; an example would be the fraction of the average day that two people spend talking to each other. There is an assumption of nonrandomness or locality. This condition is the hardest to formalize, but the intuition is that relationships tend to cluster. That is, if entity A is related to both B and C, then there is a higher probability than average that B and C are related. Here are some varieties of social networks: Telephone networks: The nodes in this network are phone numbers and represent individuals E-mail networks: The nodes represent e-mail addresses, which represent individuals Collaboration networks: The nodes here represent individuals who published research papers; the edge connecting two nodes represent two individuals who published one or more papers jointly Social networks are modeled as undirected graphs. The entities are the nodes, and an edge connects two nodes if the nodes are related by the relationship that characterizes the network. If there is a degree associated with the relationship, this degree is represented by labeling the edges. Here is an example in which Coleman's High School Friendship Data from the sna R package is used for analysis. The data is from a research on friendship ties between 73 boys in a high school in one chosen academic year; reported ties for all informants are provided for two time points (fall and spring). The dataset's name is coleman, which is an array type in R language. The node denotes a specific student and the line represents the tie between two students. 
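To make the graph representation just described concrete, here is a small plain-Python sketch (with invented names rather than the coleman data) that stores an undirected friendship network as adjacency sets and checks the locality intuition that two friends of the same person are often friends themselves.

# Toy undirected friendship network stored as an adjacency mapping.
# The names are invented; this is not the coleman dataset mentioned above.
friends = {
    "A": {"B", "C", "D"},
    "B": {"A", "C"},
    "C": {"A", "B"},
    "D": {"A"},
}

def are_related(x, y):
    # The relationship is symmetric, so one lookup is enough.
    return y in friends[x]

# Locality / triadic closure: for every pair of neighbours of a node,
# count how often that pair is itself connected.
closed, not_closed = 0, 0
for node, nbrs in friends.items():
    nbrs = sorted(nbrs)
    for i in range(len(nbrs)):
        for j in range(i + 1, len(nbrs)):
            if are_related(nbrs[i], nbrs[j]):
                closed += 1
            else:
                not_closed += 1

# Prints "3 2": the single A-B-C triangle is counted once from each of its
# three vertices, while two neighbour pairs remain unconnected.
print(closed, not_closed)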
Text mining

Text mining is based on text data and is concerned with extracting relevant information from large natural language texts, and searching for interesting relationships, syntactical correlations, or semantic associations between the extracted entities or terms. It is also defined as automatic or semiautomatic processing of text. The related algorithms include text clustering, text classification, natural language processing, and web mining. One of the characteristics of text mining is text mixed with numbers or, from another point of view, the hybrid data type contained in the source dataset. The text is usually a collection of unstructured documents, which will be preprocessed and transformed into a numerical and structured representation. After the transformation, most of the data mining algorithms can be applied with good effect. The process of text mining is described as follows:

Text mining starts from preparing the text corpus, which consists of reports, letters, and so forth
The second step is to build a semistructured text database that is based on the text corpus
The third step is to build a term-document matrix in which the term frequency is included
The final result is further analysis, such as text analysis, semantic analysis, information retrieval, and information summarization

Information retrieval and text mining

Information retrieval helps users find information, most commonly associated with online documents. It focuses on the acquisition, organization, storage, retrieval, and distribution of information. The task of Information Retrieval (IR) is to retrieve relevant documents in response to a query. The fundamental technique of IR is measuring similarity. The key steps in IR are as follows:

Specify a query. The following are some of the types of queries:
Keyword query: This is expressed by a list of keywords to find documents that contain at least one keyword
Boolean query: This is constructed with Boolean operators and keywords
Phrase query: This is a query that consists of a sequence of words that makes up a phrase
Proximity query: This is a downgraded version of the phrase query and can be a combination of keywords and phrases
Full document query: This query is a full document to find other documents similar to the query document
Natural language questions: This query helps to express users' requirements as a natural language question
Search the document collection.
Return the subset of relevant documents.

Mining text for prediction

Prediction of results from text is just as ambitious as prediction from numerical data mining and has similar problems associated with numerical classification. It is generally a classification issue. Prediction from text needs prior experience, from the sample, to learn how to draw a prediction on new documents. Once text is transformed into numeric data, prediction methods can be applied.

Web data mining

Web mining aims to discover useful information or knowledge from the web hyperlink structure, page, and usage data. The Web is one of the biggest data sources to serve as the input for data mining applications. Web data mining is based on IR, machine learning (ML), statistics, pattern recognition, and data mining. Web mining is not purely a data mining problem because of the heterogeneous and semistructured or unstructured web data, although many data mining approaches can be applied to it.
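Before turning to specific web mining tasks, here is a minimal Python sketch of the term-document matrix step from the text-mining process described above; the two short documents are invented for illustration.

# Minimal sketch of the term-document matrix step: rows are documents,
# columns are terms, and each cell holds a term frequency.
from collections import Counter

corpus = [
    "data mining discovers useful knowledge from data",
    "text mining extracts information from natural language text",
]
tokenized = [doc.lower().split() for doc in corpus]
vocabulary = sorted(set(term for doc in tokenized for term in doc))

term_document_matrix = [
    [Counter(doc)[term] for term in vocabulary]  # frequency of each term in this document
    for doc in tokenized
]

print(vocabulary)
for row in term_document_matrix:
    print(row)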
Web mining tasks can be divided into at least three types:

Web structure mining: This helps to find useful information or valuable structural summaries about sites and pages from hyperlinks
Web content mining: This helps to mine useful information from web page contents
Web usage mining: This helps to discover user access patterns from web logs to detect intrusion, fraud, and attempted break-ins

The algorithms applied to web data mining originate from classical data mining algorithms; they share many similarities, such as the mining process; however, differences exist too. The characteristics of web data mining make it different from data mining for the following reasons:

The data is unstructured
The information on the Web keeps changing and the amount of data keeps growing
Any data type is available on the Web, such as structured and unstructured data
Heterogeneous information is on the web; redundant pages are present too
Vast amounts of information on the web are linked
The data is noisy

Web data mining differs from data mining in the huge, dynamic volume of the source dataset, the wide variety of data formats, and so on. The most popular data mining tasks related to the Web are as follows:

Information extraction (IE): The task of IE consists of a couple of steps: tokenization, sentence segmentation, part-of-speech assignment, named entity identification, phrasal parsing, sentential parsing, semantic interpretation, discourse interpretation, template filling, and merging.
Natural language processing (NLP): This researches the linguistic characteristics of human-human and human-machine interaction, models of linguistic competence and performance, frameworks to implement processes with such models, iterative refinement of those processes and models, and evaluation techniques for the resulting systems. Classical NLP tasks related to web data mining are tagging, knowledge representation, ontologies, and so on.
Question answering: The goal is to find the answer from a collection of text to questions in natural language format. It can be categorized into slot filling, limited domain, and open domain, with greater difficulty for the latter. One simple example is based on a predefined FAQ to answer queries from customers.
Resource discovery: The popular applications are collecting important pages preferentially; similarity search using link topology, topical locality and focused crawling; and discovering communities.

Summary

We have looked at the broad aspects of data mining here. In case you are wondering what to look at next, check out how to "data mine" in R with Learning Data Mining with R (https://www.packtpub.com/big-data-and-business-intelligence/learning-data-mining-r). If R is not your taste, you can "data mine" with Python as well. Check out Learning Data Mining with Python (https://www.packtpub.com/big-data-and-business-intelligence/learning-data-mining-python). Resources for Article: Further resources on this subject: Machine Learning with R [Article] Machine learning and Python – the Dream Team [Article] Machine Learning in Bioinformatics [Article]

R vs Pandas

Packt
16 Feb 2016
1 min read
This article focuses on comparing pandas with R, the statistical package on which much of pandas' functionality is modeled. It is intended as a guide for R users who wish to use pandas, and for users who wish to replicate in pandas functionality that they have seen in R code. It focuses on some key features available to R users and shows how to achieve similar functionality in pandas by using some illustrative examples. This article assumes that you have the R statistical package installed. If not, it can be downloaded and installed from here: http://www.r-project.org/. By the end of the article, data analysis users should have a good grasp of the data analysis capabilities of R as compared to pandas, enabling them to transition to or use pandas, should they need to. The various topics addressed in this article include the following:

R data types and their pandas equivalents
Slicing and selection
Arithmetic operations on datatype columns
Aggregation and GroupBy (see the short sketch after this list)
Matching
Split-apply-combine
Melting and reshaping
Factors and categorical data
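As a small taste of the "Aggregation and GroupBy" topic listed above, the following sketch assumes pandas is installed and uses an invented data frame; a rough R equivalent is noted in a comment.

# Minimal sketch of aggregation with groupby in pandas; the data is illustrative.
import pandas as pd

df = pd.DataFrame({
    "group": ["a", "a", "b", "b", "b"],
    "value": [1, 2, 3, 4, 5],
})

# pandas: split-apply-combine via groupby, here taking the mean per group.
print(df.groupby("group")["value"].mean())

# Rough R equivalent for comparison:
#   aggregate(value ~ group, data = df, FUN = mean)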

Python Design Patterns in Depth – The Observer Pattern

Packt
16 Feb 2016
11 min read
In this article, you will see how to notify a group of objects when the state of another object changes. A very popular example lies in the Model-View-Controller (MVC) pattern. Assume that we are using the data of the same model in two views, for instance in a pie chart and in a spreadsheet. Whenever the model is modified, both the views need to be updated. That's the role of the Observer design pattern [Eckel08, page 213]. The Observer pattern describes a publish-subscribe relationship between a single object, the publisher, which is also known as the subject or observable, and one or more objects, the subscribers, also known as observers. In the MVC example, the publisher is the model and the subscribers are the views. However, MVC is not the only publish-subscribe example. Subscribing to a news feed such as RSS or Atom is another example. Many readers can subscribe to the feed, typically using a feed reader, and every time a new item is added, they receive the update automatically. The ideas behind Observer are the same as the ideas behind MVC and the separation of concerns principle, that is, to increase decoupling between the publisher and subscribers, and to make it easy to add/remove subscribers at runtime. Additionally, the publisher is not concerned about who its observers are. It just sends notifications to all the subscribers [GOF95, page 327].

A real-life example

In reality, an auction resembles Observer. Every auction bidder has a number paddle that is raised whenever they want to place a bid. Whenever the paddle is raised by a bidder, the auctioneer acts as the subject by updating the price of the bid and broadcasting the new price to all bidders (subscribers). The following figure, courtesy of www.sourcemaking.com [j.mp/observerpat], shows how the Observer pattern relates to an auction:

A software example

The django-observer package [j.mp/django-obs] is a third-party Django package that can be used to register callback functions that are executed when there are changes in several Django fields. Many different types of fields are supported (CharField, IntegerField, and so forth). RabbitMQ is a library that can be used to add asynchronous messaging support to an application. Several messaging protocols are supported, such as HTTP and AMQP. RabbitMQ can be used in a Python application to implement a publish-subscribe pattern, which is nothing more than the Observer design pattern [j.mp/rabbitmqobs].

Use cases

We generally use the Observer pattern when we want to inform/update one or more objects (observers/subscribers) about a change that happened to another object (subject/publisher/observable). The number of observers, as well as who the observers are, may vary and can be changed dynamically (at runtime). We can think of many cases where Observer can be useful. Whether it is RSS, Atom, or another format, the idea is the same; you follow a feed, and every time it is updated, you receive a notification about the update [Zlobin13, page 60]. The same concept exists in social networking. If you are connected to another person using a social networking service, and your connection updates something, you are notified about it. It doesn't matter if the connection is a Twitter user that you follow, a real friend on Facebook, or a business colleague on LinkedIn. Event-driven systems are another example where Observer can be (and usually is) used. In such systems, listeners are used to "listen" for specific events.
The listeners are triggered when an event they are listening to is created. This can be typing a specific key (of the keyboard), moving the mouse, and more. The event plays the role of the publisher and the listeners play the role of the observers. The key point in this case is that multiple listeners (observers) can be attached to a single event (publisher) [j.mp/magobs]. Implementation We will implement a data formatter. The ideas described here are based on the ActiveState Python Observer code recipe [j.mp/pythonobs]. There is a default formatter that shows a value in the decimal format. However, we can add/register more formatters. In this example, we will add a hex and binary formatter. Every time the value of the default formatter is updated, the registered formatters are notified and take action. In this case, the action is to show the new value in the relevant format. Observer is actually one of the patterns where inheritance makes sense. We can have a base Publisher class that contains the common functionality of adding, removing, and notifying observers. Our DefaultFormatter class derives from Publisher and adds the formatter-specific functionality. We can dynamically add and remove observers on demand. The following class diagram shows an instance of the example using two observers: HexFormatter and BinaryFormatter. Note that, because class diagrams are static, they cannot show the whole lifetime of a system, only the state of it at a specific point in time. We begin with the Publisher class. The observers are kept in the observers list. The add() method registers a new observer, or throws an error if it already exists. The remove() method unregisters an existing observer, or throws an exception if it does not exist. Finally, the notify() method informs all observers about a change: class Publisher: def __init__(self): self.observers = [] def add(self, observer): if observer not in self.observers: self.observers.append(observer) else: print('Failed to add: {}'.format(observer)) def remove(self, observer): try: self.observers.remove(observer) except ValueError: print('Failed to remove: {}'.format(observer)) def notify(self): [o.notify(self) for o in self.observers] Let's continue with the DefaultFormatter class. The first thing that __init__() does is call __init__() method of the base class, since this is not done automatically in Python. A DefaultFormatter instance has name to make it easier for us to track its status. We use name mangling in the _data variable to state that it should not be accessed directly. Note that this is always possible in Python [Lott14, page 54] but fellow developers have no excuse for doing so, since the code already states that they shouldn't. There is a serious reason for using name mangling in this case. Stay tuned. DefaultFormatter treats the _data variable as an integer, and the default value is zero: class DefaultFormatter(Publisher): def __init__(self, name): Publisher.__init__(self) self.name = name self._data = 0 The __str__() method returns information about the name of the publisher and the value of _data. type(self).__name__ is a handy trick to get the name of a class without hardcoding it. It is one of those things that make the code less readable but easier to maintain. It is up to you to decide if you like it or not: def __str__(self): return "{}: '{}' has data = {}".format(type(self).__name__, self.name, self._data) There are two data() methods. The first one uses the @property decorator to give read access to the _data variable. 
Using this, we can just execute object.data instead of object.data(): @property def data(self): return self._data The second data() method is more interesting. It uses the @setter decorator, which is called every time the assignment (=) operator is used to assign a new value to the _data variable. This method also tries to cast a new value to an integer, and does exception handling in case this operation fails: @data.setter def data(self, new_value): try: self._data = int(new_value) except ValueError as e: print('Error: {}'.format(e)) else: self.notify() The next step is to add the observers. The functionality of HexFormatter and BinaryFormatter is very similar. The only difference between them is how they format the value of data received by the publisher, that is, in hexadecimal and binary, respectively: class HexFormatter: def notify(self, publisher): print("{}: '{}' has now hex data = {}".format(type(self).__name__, publisher.name, hex(publisher.data))) class BinaryFormatter: def notify(self, publisher): print("{}: '{}' has now bin data = {}".format(type(self).__name__, publisher.name, bin(publisher.data))) No example is fun without some test data. The main() function initially creates a DefaultFormatter instance named test1 and afterwards attaches (and detaches) the two available observers. Exception handling is also exercised to make sure that the application does not crash when erroneous data is passed by the user. Moreover, things such as trying to add the same observer twice or removing an observer that does not exist should cause no crashes: def main(): df = DefaultFormatter('test1') print(df) print() hf = HexFormatter() df.add(hf) df.data = 3 print(df) print() bf = BinaryFormatter() df.add(bf) df.data = 21 print(df) print() df.remove(hf) df.data = 40 print(df) print() df.remove(hf) df.add(bf) df.data = 'hello' print(df) print() df.data = 15.8 print(df) Here's how the full code of the example (observer.py) looks: class Publisher: def __init__(self): self.observers = [] def add(self, observer): if observer not in self.observers: self.observers.append(observer) else: print('Failed to add: {}'.format(observer)) def remove(self, observer): try: self.observers.remove(observer) except ValueError: print('Failed to remove: {}'.format(observer)) def notify(self): [o.notify(self) for o in self.observers] class DefaultFormatter(Publisher): def __init__(self, name): Publisher.__init__(self) self.name = name self._data = 0 def __str__(self): return "{}: '{}' has data = {}".format(type(self).__name__, self.name, self._data) @property def data(self): return self._data @data.setter def data(self, new_value): try: self._data = int(new_value) except ValueError as e: print('Error: {}'.format(e)) else: self.notify() class HexFormatter: def notify(self, publisher): print("{}: '{}' has now hex data = {}".format(type(self).__name__, publisher.name, hex(publisher.data))) class BinaryFormatter: def notify(self, publisher): print("{}: '{}' has now bin data = {}".format(type(self).__name__, publisher.name, bin(publisher.data))) def main(): df = DefaultFormatter('test1') print(df) print() hf = HexFormatter() df.add(hf) df.data = 3 print(df) print() bf = BinaryFormatter() df.add(bf) df.data = 21 print(df) print() df.remove(hf) df.data = 40 print(df) print() df.remove(hf) df.add(bf) df.data = 'hello' print(df) print() df.data = 15.8 print(df) if __name__ == '__main__': main() Executing observer.py gives the following output: >>> python3 observer.py DefaultFormatter: 'test1' has data = 0 HexFormatter: 'test1' 
has now hex data = 0x3 DefaultFormatter: 'test1' has data = 3 HexFormatter: 'test1' has now hex data = 0x15 BinaryFormatter: 'test1' has now bin data = 0b10101 DefaultFormatter: 'test1' has data = 21 BinaryFormatter: 'test1' has now bin data = 0b101000 DefaultFormatter: 'test1' has data = 40 Failed to remove: <__main__.HexFormatter object at 0x7f30a2fb82e8> Failed to add: <__main__.BinaryFormatter object at 0x7f30a2fb8320> Error: invalid literal for int() with base 10: 'hello' BinaryFormatter: 'test1' has now bin data = 0b101000 DefaultFormatter: 'test1' has data = 40 BinaryFormatter: 'test1' has now bin data = 0b1111 DefaultFormatter: 'test1' has data = 15 What we see in the output is that as the extra observers are added, more (and relevant) output is shown, and when an observer is removed, it is not notified any longer. That's exactly what we want: runtime notifications that we are able to enable/disable on demand. The defensive programming part of the application also seems to work fine. Trying to do funny things such as removing an observer that does not exist or adding the same observer twice is not allowed. The messages shown are not very user-friendly but I leave that up to you as an exercise. Runtime failures of trying to pass a string when the API expects a number are also properly handled without causing the application to crash/terminate. This example would be much more interesting if it were interactive. Even a simple menu that allows the user to attach/detach observers at runtime and modify the value of DefaultFormatter would be nice because the runtime aspect becomes much more visible. Feel free to do it. Another nice exercise is to add more observers. For example, you can add an octal formatter, a roman numeral formatter, or any other observer that uses your favorite representation. Be creative and have fun! Summary In this article, we covered the Observer design pattern. We use Observer when we want to be able to inform/notify all stakeholders (an object or a group of objects) when the state of an object changes. An important feature of observer is that the number of subscribers/observers as well as who the subscribers are may vary and can be changed at runtime. You can refer more books on this topic mentioned as follows: Expert Python Programming: https://www.packtpub.com/application-development/expert-python-programming Learning Python Design Patterns: https://www.packtpub.com/application-development/learning-python-design-patterns Resources for Article: Further resources on this subject: Python Design Patterns in Depth: The Singleton Pattern[article] Python Design Patterns in Depth: The Factory Pattern[article] Automating Processes with ModelBuilder and Python[article]
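As a sketch of the octal-formatter exercise suggested above, one more observer can follow the same notify() protocol as HexFormatter and BinaryFormatter; the class name OctalFormatter is just one possible choice, not part of the original example.

# One possible take on the octal-formatter exercise suggested above.
# It follows the same notify() protocol as HexFormatter and BinaryFormatter.
class OctalFormatter:
    def notify(self, publisher):
        print("{}: '{}' has now oct data = {}".format(
            type(self).__name__, publisher.name, oct(publisher.data)))

# Usage, reusing the DefaultFormatter publisher from the example:
#   df = DefaultFormatter('test1')
#   df.add(OctalFormatter())
#   df.data = 42   # OctalFormatter: 'test1' has now oct data = 0o52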

Deep learning in R

Packt
15 Feb 2016
12 min read
As the title suggests, in this article, we will be taking a look at some of the deep learning models in R. Some of the pioneering advancements in neural networks research in the last decade have opened up a new frontier in machine learning that is generally called by the name deep learning. The general definition of deep learning is: a class of machine learning techniques where many layers of information processing stages in hierarchical supervised architectures are exploited for unsupervised feature learning and for pattern analysis/classification. The essence of deep learning is to compute hierarchical features or representations of the observational data, where the higher-level features or factors are defined from lower-level ones. Although there are many similar definitions and architectures for deep learning, two common elements in all of them are: multiple layers of nonlinear information processing, and supervised or unsupervised learning of feature representations at each layer from the features learned at the previous layer. The initial works on deep learning were based on multilayer neural network models. Recently, many other forms of models are also used, such as deep kernel machines and deep Q-networks.

Researchers have experimented with multilayer neural networks even in previous decades. However, two reasons were limiting any progress with learning using such architectures. The first reason is that the learning of the parameters of the network is a nonconvex optimization problem, and one often gets stuck at poor local minima when starting from random initial conditions. The second reason is that the associated computational requirements were huge. A breakthrough for the first problem came when Geoffrey Hinton developed a fast algorithm for learning a special class of neural networks called deep belief nets (DBN). We will describe DBNs in more detail in the later sections. The high computational power requirements were met with the advancement in computing using general purpose graphical processing units (GPGPUs). What made deep learning so popular for practical applications is the significant improvement in accuracy achieved in automatic speech recognition and computer vision. For example, the word error rate in automatic speech recognition of switchboard conversational speech had reached a saturation of around 40% after years of research. However, using deep learning, the word error rate was reduced dramatically to close to 10% in a matter of a few years. Another well-known example is how a deep convolutional neural network achieved the lowest error rate of 15.3% in the 2012 ImageNet Large Scale Visual Recognition Challenge, compared to state-of-the-art methods that gave 26.2% as the lowest error rate.

In this article, we will describe one class of deep learning models called deep belief networks. Interested readers are requested to read the book by Li Deng and Dong Yu for a detailed understanding of various methods and applications of deep learning. We will also illustrate the use of DBN with the R package darch.

Restricted Boltzmann machines

A restricted Boltzmann machine (RBM) is a two-layer network (bi-partite graph), in which one layer is a visible layer (v) and the second layer is a hidden layer (h).
All nodes in the visible layer and all nodes in the hidden layer are connected by undirected edges, and there are no connections between nodes in the same layer.

An RBM is characterized by the joint distribution of the states of all visible units v = {v1, v2, ..., vM} and the states of all hidden units h = {h1, h2, ..., hN}, given by:

P(v, h | θ) = exp(-E(v, h | θ)) / Z

Here, E(v, h | θ) is called the energy function, and Z = Σv Σh exp(-E(v, h | θ)) is the normalization constant, known by the name partition function in Statistical Physics nomenclature. There are mainly two types of RBMs. In the first one, both v and h are Bernoulli random variables. In the second type, h is a Bernoulli random variable whereas v is a Gaussian random variable. For a Bernoulli RBM, the energy function is given by:

E(v, h | θ) = -Σi Σj Wij vi hj - Σi bi vi - Σj aj hj

Here, Wij represents the weight of the edge between nodes vi and hj; bi and aj are bias parameters for the visible and hidden layers, respectively. For this energy function, the exact expressions for the conditional probabilities can be derived as follows:

P(hj = 1 | v) = σ(Σi Wij vi + aj)
P(vi = 1 | h) = σ(Σj Wij hj + bi)

Here, σ(x) is the logistic function 1/(1+exp(-x)). If the input variables are continuous, one can use the Gaussian RBM; with unit-variance visible units, its energy function is given by:

E(v, h | θ) = Σi (vi - bi)²/2 - Σi Σj Wij vi hj - Σj aj hj

Also, in this case, the conditional probability of hj keeps the logistic form above, whereas the conditional probability of vi becomes a normal distribution with mean Σj Wij hj + bi and variance 1.

Now that we have described the basic architecture of an RBM, how is it trained? If we try to use the standard approach of taking the gradient of the log-likelihood, we get the following update rule:

∂ log P(v | θ) / ∂Wij = Edata(vi hj) - Emodel(vi hj)

Here, Edata(vi hj) is the expectation of vi hj computed using the dataset, and Emodel(vi hj) is the same expectation computed using the model. However, one cannot use this exact expression for updating weights because Emodel(vi hj) is difficult to compute. The first breakthrough to solve this problem, and hence to train deep neural networks, came when Hinton and team proposed an algorithm called Contrastive Divergence (CD). The essence of the algorithm is described in the next paragraph.

The idea is to approximate Emodel(vi hj) by using values of vi and hj generated using Gibbs sampling from the conditional distributions mentioned previously. One scheme of doing this is as follows:

Initialize v(0) from the dataset.
Find h(0) by sampling from the conditional distribution h(0) ~ p(h | v(0)).
Find v(1) by sampling from the conditional distribution v(1) ~ p(v | h(0)).
Find h(1) by sampling from the conditional distribution h(1) ~ p(h | v(1)).

Once we find the values of v(1) and h(1), use vi(1) hj(1), which is the product of the ith component of v(1) and the jth component of h(1), as an approximation for Emodel(vi hj). This is called the CD-1 algorithm. One can generalize this to use the values from the kth step of Gibbs sampling; that is known as the CD-k algorithm. One can easily see the connection between RBMs and Bayesian inference. Since the CD algorithm is like a posterior density estimate, one could say that RBMs are trained using a Bayesian inference approach. Although the Contrastive Divergence algorithm looks simple, one needs to be very careful in training RBMs, otherwise the model can overfit. Readers who are interested in using RBMs in practical applications should refer to the technical report where this is discussed in detail.

Deep belief networks

One can stack several RBMs, one on top of each other, such that the values of the hidden units in layer n-1 (hi,n-1) become the values of the visible units in the nth layer (vi,n), and so on. The resulting network is called a deep belief network.
It was one of the main architectures used in early deep learning networks for pretraining. The idea of pretraining a NN is the following: in the standard three-layer (input-hidden-output) NN, one can start with random initial values for the weights and, using the backpropagation algorithm, find a good minimum of the log-likelihood function. However, when the number of layers increases, the straightforward application of backpropagation does not work because, starting from the output layer, as we compute the gradient values for the layers deep inside, their magnitude becomes very small. This is called the vanishing gradient problem. As a result, the network will get trapped in some poor local minima. Backpropagation still works if we are starting from the neighborhood of a good minimum. To achieve this, a DNN is often pretrained in an unsupervised way using a DBN. Instead of starting from random values of weights, first train a DBN in an unsupervised way and use the weights from the DBN as initial weights for a corresponding supervised DNN. It was seen that such DNNs pretrained using DBNs perform much better.

The layer-wise pretraining of a DBN proceeds as follows. Start with the first RBM and train it using the input data in the visible layer and the CD algorithm (or its latest, better variants). Then, stack a second RBM on top of this. For this RBM, use values sampled from the hidden layer of the first RBM as the values for its visible layer. Continue this process for the desired number of layers. The outputs of the hidden units from the top layer can also be used as inputs for training a supervised model. For this, add a conventional NN layer at the top of the DBN with the desired number of classes as the number of output nodes. The input for this NN would be the output from the top layer of the DBN. This is called the DBN-DNN architecture. Here, a DBN's role is generating highly efficient features (the output of the top layer of the DBN) automatically from the input data for the supervised NN in the top layer. The architecture of a five-layer DBN-DNN for a binary classification task is shown in the following figure:

The last layer is trained using the backpropagation algorithm in a supervised manner for the two classes c1 and c2. We will illustrate the training and classification with such a DBN-DNN using the darch R package.

The darch R package

The darch package, written by Martin Drees, is one of the R packages with which one can begin doing deep learning in R. It implements the DBN described in the previous section. The package can be downloaded from https://cran.r-project.org/web/packages/darch/index.html. The main class in the darch package implements deep architectures and provides the ability to train them with Contrastive Divergence and fine-tune them with backpropagation, resilient backpropagation, and conjugate gradients. New instances of the class are created with the newDArch constructor. It is called with the following arguments: a vector containing the number of nodes in each layer, the batch size, a Boolean variable to indicate whether to use the ff package for computing weights and outputs, and the name of the function for generating the weight matrices. Let us create a network having two input units, four hidden units, and one output unit:

install.packages("darch")  # one time
> library(darch)
> darch <- newDArch(c(2,4,1), batchSize = 2, genWeightFunc = generateWeights)
INFO [2015-07-19 18:50:29] Constructing a darch with 3 layers.
INFO [2015-07-19 18:50:29] Generating RBMs.
INFO [2015-07-19 18:50:29] Construct new RBM instance with 2 visible and 4 hidden units.
INFO [2015-07-19 18:50:29] Construct new RBM instance with 4 visible and 1 hidden units.

Let us train the DBN with a toy dataset. We are using this because training any realistic example would take a long time, hours if not days. Let us create an input dataset containing two columns and four rows:

> inputs <- matrix(c(0,0,0,1,1,0,1,1), ncol=2, byrow=TRUE)
> outputs <- matrix(c(0,1,1,0), nrow=4)

Now, let us pretrain the DBN using the input data:

> darch <- preTrainDArch(darch, inputs, maxEpoch=1000)

We can have a look at the weights learned at any layer using the getLayerWeights() function. Let us see what the hidden layer looks like:

> getLayerWeights(darch, index=1)
[[1]]
          [,1]        [,2]      [,3]      [,4]
[1,]  8.167022   0.4874743 -7.563470 -6.951426
[2,]  2.024671 -10.7012389  1.313231  1.070006
[3,] -5.391781   5.5878931  3.254914  3.000914

Now, let's do backpropagation for supervised learning. For this, we need to first set the layer functions to sigmoidUnitDerivative:

> layers <- getLayers(darch)
> for(i in length(layers):1){
    layers[[i]][[2]] <- sigmoidUnitDerivative
  }
> setLayers(darch) <- layers
> rm(layers)

Finally, the following two lines perform the backpropagation:

> setFineTuneFunction(darch) <- backpropagation
> darch <- fineTuneDArch(darch, inputs, outputs, maxEpoch=1000)

We can see the prediction quality of the DBN on the training data itself by running darch as follows:

> darch <- getExecuteFunction(darch)(darch, inputs)
> outputs_darch <- getExecOutputs(darch)
> outputs_darch[[2]]
             [,1]
[1,] 9.998474e-01
[2,] 4.921130e-05
[3,] 9.997649e-01
[4,] 3.796699e-05

Comparing with the actual output, the DBN has predicted the wrong output for the first and second input rows. Since this example was just to illustrate how to use the darch package, we are not worried about the 50% accuracy here.

Other deep learning packages in R

Although there are some other deep learning packages in R, such as deepnet and RcppDL, compared with libraries in other languages such as Cuda (C++) and Theano (Python), R does not yet have good native libraries for deep learning. The only available package is a wrapper for the Java-based deep learning open source project H2O. This R package, h2o, allows running H2O via its REST API from within R. Readers who are interested in serious deep learning projects and applications should use H2O through the h2o package in R. One needs to install H2O on the machine to use h2o.

Summary

We have learned one of the latest advances in neural networks, which is called deep learning. It can be used to solve many problems, such as computer vision and natural language processing, that involve highly cognitive elements. The artificial intelligent systems using deep learning were able to achieve accuracies comparable to human intelligence in tasks such as speech recognition and image classification. To know more about Bayesian modeling in R, check out Learning Bayesian Models with R (https://www.packtpub.com/big-data-and-business-intelligence/learning-bayesian-models-r). You can also check out our other R books, Data Analysis with R (https://www.packtpub.com/big-data-and-business-intelligence/data-analysis-r), and Machine Learning with R - Second Edition (https://www.packtpub.com/big-data-and-business-intelligence/machine-learning-r-second-edition). Resources for Article: Further resources on this subject: Working with Data – Exploratory Data Analysis [article] Big Data Analytics [article] Deep learning in R [article]

Python Design Patterns in Depth: The Factory Pattern

Packt
15 Feb 2016
17 min read
Creational design patterns deal with object creation [j.mp/wikicrea]. The aim of a creational design pattern is to provide better alternatives for situations where direct object creation (which in Python happens through the __init__() function [j.mp/divefunc], [Lott14, page 26]) is not convenient. In the Factory design pattern, a client asks for an object without knowing where the object is coming from (that is, which class is used to generate it). The idea behind a factory is to simplify object creation. It is easier to track which objects are created if this is done through a central function, in contrast to letting a client create objects using a direct class instantiation [Eckel08, page 187]. A factory reduces the complexity of maintaining an application by decoupling the code that creates an object from the code that uses it [Zlobin13, page 30]. Factories typically come in two forms: the Factory Method, which is a method (or, in Pythonic terms, a function) that returns a different object per input parameter [j.mp/factorympat], and the Abstract Factory, which is a group of Factory Methods used to create a family of related products [GOF95, page 100], [j.mp/absfpat].

Factory Method

In the Factory Method, we execute a single function, passing a parameter that provides information about what we want. We are not required to know any details about how the object is implemented and where it is coming from.

A real-life example

An example of the Factory Method pattern used in reality is in plastic toy construction. The molding powder used to construct plastic toys is the same, but different figures can be produced using different plastic molds. This is like having a Factory Method in which the input is the name of the figure that we want (soldier, dinosaur) and the output is the plastic figure that we requested. The toy construction case is shown in the following figure, which is provided by www.sourcemaking.com [j.mp/factorympat].

A software example

The Django framework uses the Factory Method pattern for creating the fields of a form. The forms module of Django supports the creation of different kinds of fields (CharField, EmailField) and customizations (max_length, required) [j.mp/djangofacm].

Use cases

If you realize that you cannot track the objects created by your application because the code that creates them is in many different places instead of a single function/method, you should consider using the Factory Method pattern [Eckel08, page 187]. The Factory Method centralizes object creation, and tracking your objects becomes much easier. Note that it is absolutely fine to create more than one Factory Method, and this is how it is typically done in practice. Each Factory Method logically groups the creation of objects that have similarities. For example, one Factory Method might be responsible for connecting you to different databases (MySQL, SQLite), another Factory Method might be responsible for creating the geometrical object that you request (circle, triangle), and so on. The Factory Method is also useful when you want to decouple object creation from object usage. We are not coupled/bound to a specific class when creating an object; we just provide partial information about what we want by calling a function. This means that introducing changes to the function is easy without requiring any changes to the code that uses it [Zlobin13, page 30].
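Before moving on, here is a minimal, generic sketch of a Factory Method for the geometrical-object case mentioned above. The class and function names are only illustrative; the article's own factory, built around XML and JSON connectors, follows in the Implementation section.

# A minimal, generic Factory Method sketch for the geometrical-object case
# mentioned above; names are illustrative and separate from the connector
# example developed later in this article.
class Circle:
    def draw(self):
        return 'drawing a circle'

class Triangle:
    def draw(self):
        return 'drawing a triangle'

def shape_factory(name):
    """Return a shape object for the requested name; the caller never
    instantiates Circle or Triangle directly."""
    shapes = {'circle': Circle, 'triangle': Triangle}
    if name not in shapes:
        raise ValueError('unknown shape: {}'.format(name))
    return shapes[name]()

if __name__ == '__main__':
    for name in ('circle', 'triangle'):
        print(shape_factory(name).draw())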
Another use case worth mentioning is related to improving the performance and memory usage of an application. A Factory Method can improve performance and memory usage by creating new objects only when it is absolutely necessary [Zlobin13, page 28]. When we create objects using direct class instantiation, extra memory is allocated every time a new object is created (unless the class uses caching internally, which is usually not the case). We can see this in practice in the following code (file id.py), which creates two instances of the same class A and uses the id() function to compare their memory addresses. The addresses are also printed in the output so that we can inspect them. The fact that the memory addresses are different means that two distinct objects are created, as follows:

class A(object):
    pass

if __name__ == '__main__':
    a = A()
    b = A()
    print(id(a) == id(b))
    print(a, b)

Executing id.py on my computer gives the following output:

>>> python3 id.py
False
<__main__.A object at 0x7f5771de8f60> <__main__.A object at 0x7f5771df2208>

Note that the addresses you see when you execute the file are not the same as the ones I see, because they depend on the current memory layout and allocation. But the result must be the same: the two addresses should be different. There's one exception that happens if you write and execute the code in the Python Read-Eval-Print Loop (REPL) (the interactive prompt), but that's a REPL-specific optimization which does not happen normally.
Implementation
Data comes in many forms. There are two main file categories for storing/retrieving data: human-readable files and binary files. Examples of human-readable files are XML, Atom, YAML, and JSON. Examples of binary files are the .sq3 file format used by SQLite and the .mp3 file format used for music. In this example, we will focus on two popular human-readable formats: XML and JSON. Although human-readable files are generally slower to parse than binary files, they make data exchange, inspection, and modification much easier. For this reason, it is advised to prefer working with human-readable files, unless there are other restrictions that do not allow it (mainly unacceptable performance and proprietary binary formats). In this problem, we have some input data stored in an XML file and a JSON file, and we want to parse them and retrieve some information. At the same time, we want to centralize the client's connection to those (and all future) external services. We will use the Factory Method to solve this problem. The example focuses only on XML and JSON, but adding support for more services should be straightforward. First, let's take a look at the data files.
The XML file, person.xml, is based on the Wikipedia example [j.mp/wikijson] and contains information about individuals (firstName, lastName, gender, and so on) as follows: <persons>   <person>     <firstName>John</firstName>     <lastName>Smith</lastName>     <age>25</age>     <address>       <streetAddress>21 2nd Street</streetAddress>       <city>New York</city>       <state>NY</state>       <postalCode>10021</postalCode>     </address>     <phoneNumbers>       <phoneNumber type="home">212 555-1234</phoneNumber>       <phoneNumber type="fax">646 555-4567</phoneNumber>     </phoneNumbers>     <gender>       <type>male</type>     </gender>   </person>   <person>     <firstName>Jimy</firstName>     <lastName>Liar</lastName>     <age>19</age>     <address>       <streetAddress>18 2nd Street</streetAddress>       <city>New York</city>       <state>NY</state>       <postalCode>10021</postalCode>     </address>     <phoneNumbers>       <phoneNumber type="home">212 555-1234</phoneNumber>     </phoneNumbers>     <gender>       <type>male</type>     </gender>   </person>   <person>     <firstName>Patty</firstName>     <lastName>Liar</lastName>     <age>20</age>     <address>       <streetAddress>18 2nd Street</streetAddress>       <city>New York</city>       <state>NY</state>       <postalCode>10021</postalCode>     </address>     <phoneNumbers>       <phoneNumber type="home">212 555-1234</phoneNumber>       <phoneNumber type="mobile">001 452-8819</phoneNumber>     </phoneNumbers>     <gender>       <type>female</type>     </gender>   </person> </persons> The JSON file, donut.json, comes from the GitHub account of Adobe [j.mp/adobejson] and contains donut information (type, price/unit i.e. ppu, topping, and so on) as follows: [   {     "id": "0001",     "type": "donut",     "name": "Cake",     "ppu": 0.55,     "batters": {       "batter": [         { "id": "1001", "type": "Regular" },         { "id": "1002", "type": "Chocolate" },         { "id": "1003", "type": "Blueberry" },         { "id": "1004", "type": "Devil's Food" }       ]     },     "topping": [       { "id": "5001", "type": "None" },       { "id": "5002", "type": "Glazed" },       { "id": "5005", "type": "Sugar" },       { "id": "5007", "type": "Powdered Sugar" },       { "id": "5006", "type": "Chocolate with Sprinkles" },       { "id": "5003", "type": "Chocolate" },       { "id": "5004", "type": "Maple" }     ]   },   {     "id": "0002",     "type": "donut",     "name": "Raised",     "ppu": 0.55,     "batters": {       "batter": [         { "id": "1001", "type": "Regular" }       ]     },     "topping": [       { "id": "5001", "type": "None" },       { "id": "5002", "type": "Glazed" },       { "id": "5005", "type": "Sugar" },       { "id": "5003", "type": "Chocolate" },       { "id": "5004", "type": "Maple" }     ]   },   {     "id": "0003",     "type": "donut",     "name": "Old Fashioned",     "ppu": 0.55,     "batters": {       "batter": [         { "id": "1001", "type": "Regular" },         { "id": "1002", "type": "Chocolate" }       ]     },     "topping": [       { "id": "5001", "type": "None" },       { "id": "5002", "type": "Glazed" },       { "id": "5003", "type": "Chocolate" },       { "id": "5004", "type": "Maple" }     ]   } ] We will use two libraries that are part of the Python distribution for working with XML and JSON: xml.etree.ElementTree and json as follows: import xml.etree.ElementTree as etree import json The JSONConnector class parses the JSON file and has a parsed_data() method that returns all data as a 
dictionary (dict). The property decorator is used to make parsed_data() appear as a normal variable instead of a method as follows: class JSONConnector:     def __init__(self, filepath):         self.data = dict()         with open(filepath, mode='r', encoding='utf-8') as f:             self.data = json.load(f)       @property     def parsed_data(self):         return self.data The XMLConnector class parses the XML file and has a parsed_data() method that returns all data as a list of xml.etree.Element as follows: class XMLConnector:     def __init__(self, filepath):         self.tree = etree.parse(filepath)     @property    def parsed_data(self):         return self.tree The connection_factory() function is a Factory Method. It returns an instance of JSONConnector or XMLConnector depending on the extension of the input file path as follows: def connection_factory(filepath):     if filepath.endswith('json'):         connector = JSONConnector     elif filepath.endswith('xml'):         connector = XMLConnector     else:         raise ValueError('Cannot connect to {}'.format(filepath))     return connector(filepath) The connect_to() function is a wrapper of connection_factory(). It adds exception handling as follows: def connect_to(filepath):     factory = None     try:         factory = connection_factory(filepath)     except ValueError as ve:         print(ve)     return factory The main() function demonstrates how the Factory Method design pattern can be used. The first part makes sure that exception handling is effective as follows: def main():     sqlite_factory = connect_to('data/person.sq3') The next part shows how to work with the XML files using the Factory Method. XPath is used to find all person elements that have the last name Liar. For each matched person, the basic name and phone number information are shown as follows: xml_factory = connect_to('data/person.xml')     xml_data = xml_factory.parsed_data()     liars = xml_data.findall     (".//{person}[{lastName}='{}']".format('Liar'))     print('found: {} persons'.format(len(liars)))     for liar in liars:         print('first name:         {}'.format(liar.find('firstName').text))         print('last name: {}'.format(liar.find('lastName').text))         [print('phone number ({}):'.format(p.attrib['type']),         p.text) for p in liar.find('phoneNumbers')] The final part shows how to work with the JSON files using the Factory Method. 
Here, there's no pattern matching, and therefore the name, price, and topping of all donuts are shown as follows: json_factory = connect_to('data/donut.json')     json_data = json_factory.parsed_data     print('found: {} donuts'.format(len(json_data)))     for donut in json_data:         print('name: {}'.format(donut['name']))         print('price: ${}'.format(donut['ppu']))         [print('topping: {} {}'.format(t['id'], t['type'])) for t         in donut['topping']] For completeness, here is the complete code of the Factory Method implementation (factory_method.py) as follows: import xml.etree.ElementTree as etree import json class JSONConnector:     def __init__(self, filepath):         self.data = dict()         with open(filepath, mode='r', encoding='utf-8') as f:             self.data = json.load(f)     @property     def parsed_data(self):         return self.data class XMLConnector:     def __init__(self, filepath):         self.tree = etree.parse(filepath)     @property     def parsed_data(self):         return self.tree def connection_factory(filepath):     if filepath.endswith('json'):         connector = JSONConnector     elif filepath.endswith('xml'):         connector = XMLConnector     else:         raise ValueError('Cannot connect to {}'.format(filepath))     return connector(filepath) def connect_to(filepath):     factory = None     try:        factory = connection_factory(filepath)     except ValueError as ve:         print(ve)     return factory def main():     sqlite_factory = connect_to('data/person.sq3')     print()     xml_factory = connect_to('data/person.xml')     xml_data = xml_factory.parsed_data     liars = xml_data.findall(".//{}[{}='{}']".format('person',     'lastName', 'Liar'))     print('found: {} persons'.format(len(liars)))     for liar in liars:         print('first name:         {}'.format(liar.find('firstName').text))         print('last name: {}'.format(liar.find('lastName').text))         [print('phone number ({}):'.format(p.attrib['type']),         p.text) for p in liar.find('phoneNumbers')]     print()     json_factory = connect_to('data/donut.json')     json_data = json_factory.parsed_data     print('found: {} donuts'.format(len(json_data)))     for donut in json_data:     print('name: {}'.format(donut['name']))     print('price: ${}'.format(donut['ppu']))     [print('topping: {} {}'.format(t['id'], t['type'])) for t     in donut['topping']] if __name__ == '__main__':     main() Here is the output of this program as follows: >>> python3 factory_method.pyCannot connect to data/person.sq3found: 2 personsfirst name: Jimylast name: Liarphone number (home): 212 555-1234first name: Pattylast name: Liarphone number (home): 212 555-1234phone number (mobile): 001 452-8819found: 3 donutsname: Cakeprice: $0.55topping: 5001 Nonetopping: 5002 Glazedtopping: 5005 Sugartopping: 5007 Powdered Sugartopping: 5006 Chocolate with Sprinklestopping: 5003 Chocolatetopping: 5004 Maplename: Raisedprice: $0.55topping: 5001 Nonetopping: 5002 Glazedtopping: 5005 Sugartopping: 5003 Chocolatetopping: 5004 Maplename: Old Fashionedprice: $0.55topping: 5001 Nonetopping: 5002 Glazedtopping: 5003 Chocolatetopping: 5004 Maple Notice that although JSONConnector and XMLConnector have the same interfaces, what is returned by parsed_data() is not handled in a uniform way. Different codes must be used to work with each connector. 
Although it would be nice to be able to use the same code for all connectors, this is rarely realistic unless we use some kind of common mapping for the data, which is very often provided by external data providers. Assuming that you can use exactly the same code for handling the XML and JSON files, what changes are required to support a third format, for example, SQLite? Find an SQLite file or create your own and try it. As it is now, the code does not forbid direct instantiation of a connector. Is it possible to forbid this? Try doing it (hint: functions in Python can have nested classes).
Summary
To learn more about design patterns in depth, the following books published by Packt Publishing (https://www.packtpub.com/) are recommended:
Learning Python Design Patterns (https://www.packtpub.com/application-development/learning-python-design-patterns)
Learning Python Design Patterns – Second Edition (https://www.packtpub.com/application-development/learning-python-design-patterns-second-edition)
Resources for Article:
Further resources on this subject:
Recommending Movies at Scale (Python) [article]
An In-depth Look at Ansible Plugins [article]
Elucidating the Game-changing Phenomenon of the Docker-inspired Containerization Paradigm [article]

An In-depth Look at Ansible Plugins

Packt
15 Feb 2016
9 min read
In this article by Rishabh Das, author of the book Extending Ansible, we will deep dive into what Ansible plugins are and how you can write your own custom Ansible plugin. The article will discuss the different types of Ansible plugins in detail and explore them on a code level. The article will walk you through the Ansible Python API and using the extension points to write your own Ansible plugins. (For more resources related to this topic, see here.) Lookup plugins Lookup plugins are designed to read data from different sources and feed them to Ansible. The data source can be either from the local filesystem on the controller node or from an external data source. These may also be for file formats that are not natively supported by Ansible. If you decide to write your own lookup plugin, you need to drop it in one of the following directories for Ansible to pick it up during the execution of an Ansible playbook. A directory named lookup_plugins in the project root At ~/.ansible/plugins/lookup_plugins/ At /usr/share/ansible_plugins/lookup_plugins/ By default, a number of lookup plugins are already available in Ansible. Let's discuss some of the commonly used lookup plugins. Lookup plugin – file This is the most basic type of lookup plugin available in Ansible. It reads through the file content on the controller node. The data read from the file can then be fed to the Ansible playbook as a variable. In the most basic form, usage of file lookup is demonstrated in the following Ansible playbook: --- - hosts: all vars: data: "{{ lookup('file', './test-file.txt') }}" tasks: - debug: msg="File contents {{ data }}" The preceding playbook will read data off a local file, test-file.txt, from the playbook's root directory into a data variable. This variable is then fed to the task : debug module, which uses the data variable to print it onscreen. Lookup plugin – csvfile The csvfile lookup plugin was designed to read data from a CSV file on the controller node. This lookup module is designed to take in several parameters, which are discussed in this table: Parameter Default value Description file ansible.csv This is the file to read data from delimiter TAB This is the delimiter used in the CSV file, usually ','. col 1 This is the column number (index) default Empty string This returns this value if the requested key is not found in the CSV file Let's take an example of reading data from the following CSV file. The CSV file contains population and area details of different cities: File: city-data.csv City, Area, Population Pune, 700, 2.5 Million Bangalore, 741, 4.3 Million Mumbai, 603, 12 Million This file lies in the controller node at the root of the Ansible play. To read off data from this file, the csvfile lookup plugin is used. The following Ansible play tries to read the population of Mumbai from the preceding CSV file: Ansible Play – test-csv.yaml --- - hosts: all tasks: - debug: msg="Population of Mumbai is {{lookup('csvfile', 'Mumbai file=city-data.csv delimiter=, col=2')}}"   Lookup plugin – dig The dig lookup plugin can be used to run DNS queries against Fully Qualified Domain Name (FQDN). You can customize the lookup plugin's output using the different flags that are supported by the plugin. In the most basic form, it returns the IP of the given FQDN. This plugin has a dependency on the python-dns package. This should be installed on the controller node. 
The following Ansible play explains how to fetch the TXT record for any FQDN: --- - hosts: all tasks: - debug: msg="TXT record {{ lookup('dig', 'yahoo.com./TXT') }}" - debug: msg="IP of yahoo.com {{lookup('dig', 'yahoo.com', wantlist=True)}}" The preceding Ansible play will fetch the TXT records in step one and the IPs associated with FQDN yahoo.com in second. It is also possible to perform reverse DNS lookups using the dig plugin with the following syntax: - debug: msg="Reverse DNS for 8.8.8.8 is {{ lookup('dig', '8.8.8.8/PTR') }}" Lookup plugin – ini The ini lookup plugin is designed to read data off an .ini file. The INI file, in general, is a collection of key-value pairs under defined sections. The ini lookup plugin supports the following parameters: Parameter Default value Description type ini This is the type of file. It currently supports two formats: ini and property. file ansible.ini This is the name of file to read data from. section global This is the section of the ini file from which the specified key needs to be read. re false If the key is a regular expression, we need to set this to true. default Empty string If the requested key is not found in the ini file, we need to return this.   Taking an example of the following ini file, let's try to read some keys using the ini lookup plugin. The file is named network.ini: [default] bind_host = 0.0.0.0 bind_port = 9696 log_dir = /var/log/network [plugins] core_plugin = rdas-net firewall = yes The following Ansible play will read off the keys from the ini file: --- - hosts: all tasks: - debug: msg="core plugin {{ lookup('ini', 'core_plugin file=network.ini section=plugins') }}" - debug: msg="core plugin {{ lookup('ini', 'bind_port file=network.ini section=default') }}" The ini lookup plugin can also be used to read off values through a file that does not contain sections—for instance, a Java property file. Loops – lookup plugins for Iteration There are times when you need to perform the same task over and over again. It might be the case of installing various dependencies for a package or multiple inputs that go through the same operation—for instance, while checking and starting various services. Just like any other programming language provides a way to iterate over data to perform repetitive tasks, Ansible also provides a clean way to carry out the same operation. The concept is called looping and is provided by Ansible lookup plugins. Loops in Ansible are generally identified as those starting with “with_”. Ansible supports a number of looping options. A few of the most commonly used are discussed in the following section. Standard loop – with_items This is the simplest and most commonly used loop in Ansible. It is used to iterate over an item list and perform some operation on it. The following Ansible play demonstrates the use of the with_items lookup loop: --- - hosts: all tasks: - name: Install packages yum: name={{ item }} state=present with_items: - vim - wget - ipython The with_items lookup loop supports the use of hashes in which you can access the variables using the .<keyname> item in the Ansible playbook. The following playbook demonstrates the use of with_item to iterate over a given hash: --- - hosts: all tasks: - name: Create directories with specific permissions file: path={{item.dir}} state=directory mode={{item.mode | int}} with_items: - { dir: '/tmp/ansible', mode: 755 } - { dir: '/tmp/rdas', mode: 755 } The preceding playbook will create two directories with the specified permission sets. 
If you look closely while accessing the mode key from item, there is a | int filter. This is a Jinja2 filter, which is used to convert a string to an integer.
Do-until loop – until
This has the same implementation as in any other programming language. It executes a task at least once and keeps executing it until a specific condition is reached. As a minimal sketch (the /usr/bin/check_service command and the "running" string are hypothetical placeholders), the loop is usually written with until, retries, and delay:

- shell: /usr/bin/check_service
  register: result
  until: result.stdout.find("running") != -1
  retries: 5
  delay: 10

Creating your own lookup plugin
This article covered some already available Ansible lookup plugins and explained how they can be used. This section will try to replicate the functionality of the dig lookup to get the IP address of a given FQDN. This will be done without using the dnspython library; it will use Python's basic socket library instead. The following example is only a demonstration of how you can write your own Ansible lookup plugin:

import socket

class LookupModule(object):

    def __init__(self, basedir=None, **kwargs):
        self.basedir = basedir

    def run(self, hostname, inject=None, **kwargs):
        hostname = str(hostname)
        try:
            host_detail = socket.gethostbyname(hostname)
        except Exception:
            host_detail = 'Invalid Hostname'
        return host_detail

The preceding code is a lookup plugin; let's call it hostip. As you can see, there is a class named LookupModule. Ansible identifies a Python file or module as a lookup plugin only when there exists a class called LookupModule. The module takes an argument, hostname, and checks whether there exists an IP corresponding to it—that is, whether it can be resolved to a valid IP address. If yes, it returns the IP address of the requested FQDN. If not, it returns Invalid Hostname. To use this module, place it in the lookup_plugins directory at the root of the Ansible play. The following playbook demonstrates how you can use the hostip lookup just created:

---
- hosts: all
  tasks:
    - debug: msg="{{lookup('hostip', item, wantlist=True)}}"
      with_items:
        - www.google.co.in
        - saliux.wordpress.com
        - www.twitter.com

The preceding play will loop through the list of websites and pass each as an argument to the hostip lookup plugin. This will in turn return the IP associated with the requested domain. If you notice, an argument wantlist=True is also passed when the hostip lookup plugin is called. This is to handle multiple outputs; that is, if there are multiple values associated with the requested domain, the values will be returned as a list. This makes it easy to iterate over the output values.
Summary
This article picked up on how the Ansible Python API for plugins is implemented in various Ansible plugins. The article discussed the various types of plugins in detail, both from the implementation point of view and on a code level. The article also demonstrated how to write sample plugins by creating custom lookup plugins. By the end of this article, you should be able to write your own custom plugin for Ansible.
Resources for Article:
Further resources on this subject:
Mastering Ansible – Protecting Your Secrets with Ansible [article]
Ansible – An Introduction [article]
Getting Started with Ansible [article]

Python Design Patterns in Depth: The Singleton Pattern

Packt
15 Feb 2016
14 min read
There are situations where you need to create only one instance of data throughout the lifetime of a program. This can be a class instance, a list, or a dictionary, for example. The creation of a second instance is undesirable and can result in logical errors or malfunctioning of the program. The design pattern that allows you to create only one instance of data is called singleton. In this article, you will learn about module-level, classic, and borg singletons; you'll also learn how they work, when to use them, and how to build a two-threaded web crawler that uses a singleton to access a shared resource.
(For more resources related to this topic, see here.)
Singleton is the best candidate when the requirements are as follows:
Controlling concurrent access to a shared resource
If you need a global point of access for the resource from multiple or different parts of the system
When you need to have only one object
Some typical use cases of using a singleton are:
The logging class and its subclasses (global point of access for the logging class to send messages to the log)
Printer spooler (your application should only have a single instance of the spooler in order to avoid conflicting requests for the same resource)
Managing a connection to a database
File manager
Retrieving and storing information on external configuration files
Read-only singletons storing some global states (user language, time, time zone, application path, and so on)
There are several ways to implement singletons. We will look at the module-level singleton, classic singletons, and the borg singleton.
Module-level singleton
All modules are singletons by nature because of Python's module importing steps:
Check whether the module is already imported. If yes, return it.
If not, find the module, initialize it, and return it.
Initializing a module means executing its code, including all module-level assignments. When you import the module for the first time, all of the initializations are done; however, if you try to import the module a second time, Python returns the already initialized module. Thus, the initialization is not repeated, and you get the previously imported module with all of its data. So, if you want to quickly make a singleton, use the following steps and keep the shared data as a module attribute.

singletone.py:
only_one_var = "I'm only one var"

module1.py:
import singletone
print singletone.only_one_var
singletone.only_one_var += " after modification"
import module2

module2.py:
import singletone
print singletone.only_one_var

Here, if you import a global variable from a singleton module and change its value in the module1 module, module2 will get the changed variable. This approach is quick and sometimes is all that you need; however, we need to consider the following points:
It's pretty error-prone. For example, if you happen to forget the global statements, variables local to the function will be created, and the module's variables won't be changed, which is not what you want.
It's ugly, especially if you have a lot of objects that should remain as singletons.
They pollute the module namespace with unnecessary variables.
They don't permit lazy allocation and initialization; all global variables will be loaded during the module import process.
It's not possible to reuse the code because you cannot use inheritance.
No special methods and no object-oriented programming benefits at all.
Classic singleton
In the classic singleton in Python, we check whether an instance is already created.
If it is created, we return it; otherwise, we create a new instance, assign it to a class attribute, and return it. Let's try to create a dedicated singleton class: class Singleton(object): def __new__(cls): if not hasattr(cls, 'instance'): cls.instance = super(Singleton, cls).__new__(cls) return cls.instance Here, before creating the instance, we check for the special __new__ method, which is called right before __init__ if we had created an instance earlier. If not, we create a new instance; otherwise, we return the already created instance. Let's check how it works: >>> singleton = Singleton() >>> another_singleton = Singleton() >>> singleton is another_singleton True >>> singleton.only_one_var = "I'm only one var" >>> another_singleton.only_one_var I'm only one var Try to subclass the Singleton class with another one. class Child(Singleton): pass If it's a successor of Singleton, all of its instances should also be the instances of Singleton, thus sharing its states. But this doesn't work as illustrated in the following code: >>> child = Child() >>> child is singleton >>> False >>> child.only_one_var AttributeError: Child instance has no attribute 'only_one_var' To avoid this situation, the borg singleton is used. Borg singleton Borg is also known as monostate. In the borg pattern, all of the instances are different, but they share the same state. In the following code , the shared state is maintained in the _shared_state attribute. And all new instances of the Borg class will have this state as defined in the __new__ class method. class Borg(object):    _shared_state = {}    def __new__(cls, *args, **kwargs):        obj = super(Borg, cls).__new__(cls, *args, **kwargs)        obj.__dict__ = cls._shared_state        return obj Generally, Python stores the instance state in the __dict__ dictionary and when instantiated normally, every instance will have its own __dict__. But, here we deliberately assign the class variable _shared_state to all of the created instances. Here is how it works with subclassing: class Child(Borg):    pass>>> borg = Borg()>>> another_borg = Borg()>>> borg is another_borgFalse>>> child = Child()>>> borg.only_one_var = "I'm the only one var">>> child.only_one_varI'm the only one var So, despite the fact that you can't compare objects by their identity, using the is statement, all child objects share the parents' state. If you want to have a class, which is a descendant of the Borg class but has a different state, you can reset shared_state as follows: class AnotherChild(Borg):    _shared_state = {}>>> another_child = AnotherChild()>>> another_child.only_one_varAttributeError: AnotherChild instance has no attribute 'shared_state' Which type of singleton should be used is up to you. If you expect that your singleton will not be inherited, you can choose the classic singleton; otherwise, it's better to stick with borg. Implementation in Python As a practical example, we'll create a simple web crawler that scans a website you open on it, follows all the links that lead to the same website but to other pages, and downloads all of the images it'll find. To do this, we'll need two functions: a function that scans a website for links, which leads to other pages to build a set of pages to visit, and a function that scans a page for images and downloads them. To make it quicker, we'll download images in two threads. 
These two threads should not interfere with each other, so don't scan pages if another thread has already scanned them, and don't download images that are already downloaded. So, a set with downloaded images and scanned web pages will be a shared resource for our application, and we'll keep it in a singleton instance. In this example, you will need a library for parsing and screen scraping websites named BeautifulSoup and an HTTP client library httplib2. It should be sufficient to install both with either of the following commands: $ sudo pip install BeautifulSoup httplib2 $ sudo easy_install BeautifulSoup httplib2 First of all, we'll create a Singleton class. Let's use the classic singleton in this example: import httplib2import osimport reimport threadingimport urllibfrom urlparse import urlparse, urljoinfrom BeautifulSoup import BeautifulSoupclass Singleton(object):    def __new__(cls):        if not hasattr(cls, 'instance'):             cls.instance = super(Singleton, cls).__new__(cls)        return cls.instance It will return the singleton objects to all parts of the code that request it. Next, we'll create a class for creating a thread. In this thread, we'll download images from the website: class ImageDownloaderThread(threading.Thread):    """A thread for downloading images in parallel."""    def __init__(self, thread_id, name, counter):        threading.Thread.__init__(self)        self.name = name    def run(self):        print 'Starting thread ', self.name        download_images(self.name)        print 'Finished thread ', self.name The following function traverses the website using BFS algorithms, finds links, and adds them to a set for further downloading. We are able to specify the maximum links to follow if the website is too large. def traverse_site(max_links=10):    link_parser_singleton = Singleton()    # While we have pages to parse in queue    while link_parser_singleton.queue_to_parse:        # If collected enough links to download images, return        if len(link_parser_singleton.to_visit) == max_links:            return        url = link_parser_singleton.queue_to_parse.pop()        http = httplib2.Http()        try:            status, response = http.request(url)        except Exception:            continue        # Skip if not a web page        if status.get('content-type') != 'text/html':            continue        # Add the link to queue for downloading images        link_parser_singleton.to_visit.add(url)        print 'Added', url, 'to queue'        bs = BeautifulSoup(response)        for link in BeautifulSoup.findAll(bs, 'a'):            link_url = link.get('href')            # <img> tag may not contain href attribute            if not link_url:                continue            parsed = urlparse(link_url)            # If link follows to external webpage, skip it            if parsed.netloc and parsed.netloc != parsed_root.netloc:                continue            # Construct a full url from a link which can be relative            link_url = (parsed.scheme or parsed_root.scheme) + '://' + (parsed.netloc or parsed_root.netloc) + parsed.path or ''            # If link was added previously, skip it            if link_url in link_parser_singleton.to_visit:                continue            # Add a link for further parsing            link_parser_singleton.queue_to_parse = [link_url] + link_parser_singleton.queue_to_parse The following function downloads images from the last web resource page in the singleton.to_visit queue and saves it to the img directory. 
Here, we use a singleton for synchronizing shared data, which is a set of pages to visit between two threads: def download_images(thread_name):    singleton = Singleton()    # While we have pages where we have not download images    while singleton.to_visit:        url = singleton.to_visit.pop()        http = httplib2.Http()        print thread_name, 'Starting downloading images from', url        try:            status, response = http.request(url)        except Exception:            continue        bs = BeautifulSoup(response)       # Find all <img> tags        images = BeautifulSoup.findAll(bs, 'img')        for image in images:            # Get image source url which can be absolute or relative            src = image.get('src')            # Construct a full url. If the image url is relative,            # it will be prepended with webpage domain.            # If the image url is absolute, it will remain as is            src = urljoin(url, src)            # Get a base name, for example 'image.png' to name file locally            basename = os.path.basename(src)            if src not in singleton.downloaded:                singleton.downloaded.add(src)                print 'Downloading', src                # Download image to local filesystem                urllib.urlretrieve(src, os.path.join('images', basename))        print thread_name, 'finished downloading images from', url Our client code is as follows: if __name__ == '__main__':    root = 'http://python.org'    parsed_root = urlparse(root)    singleton = Singleton()    singleton.queue_to_parse = [root]    # A set of urls to download images from    singleton.to_visit = set()    # Downloaded images    singleton.downloaded = set()    traverse_site()    # Create images directory if not exists    if not os.path.exists('images'):        os.makedirs('images')    # Create new threads    thread1 = ImageDownloaderThread(1, "Thread-1", 1)    thread2 = ImageDownloaderThread(2, "Thread-2", 2)    # Start new Threads    thread1.start()    thread2.start() Run a crawler using the following command: $ python crawler.py You should get the following output (your output may vary because the order in which the threads access resources is not predictable): If you go to the images directory, you will find the downloaded images there. Summary To learn more about design patterns in depth, the following books published by Packt Publishing (https://www.packtpub.com/) are recommended: Learning Python Design Patterns – Second Edition (https://www.packtpub.com/application-development/learning-python-design-patterns-second-edition) Mastering Python Design Patterns (https://www.packtpub.com/application-development/mastering-python-design-patterns) Resources for Article: Further resources on this subject: Python Design Patterns in Depth: The Factory Pattern [Article] Recommending Movies at Scale (Python) [Article] Customizing IPython [Article]

Getting Started with etcd

Packt
15 Feb 2016
6 min read
In this article we will cover etcd, CoreOS's central hub of all services that provides a reliable way of storing shared data across cluster machines and monitoring it. In this article, we will cover the following topics: Introducing etcd Reading and writing to etcd from the host machine Reading and writing from an application container Watching etcd changes A TTL (time to live) example use cases of etcd (For more resources related to this topic, see here.) Introducing etcd The etcd function is an open source distributed key value store on a computer network where information is stored on more than one node and data is replicated using the Raft consensus algorithm. The etcd function is used to store the CoreOS cluster service discovery and the shared configuration. The configuration is stored in the write-ahead log and includes the cluster member ID, cluster ID and cluster configuration, and everything else that is put there container applications running in the cluster. The etcd function runs on each cluster's central services role machine, and gracefully handles master election during network partitions and in the event of a loss of the current master. Reading and writing to etcd from the host machine You are going to learn how read and write to ectd from the host machine. We will use both the etcdctl and curl examples here. Logging in to host To login to CoreOS VM, follow these steps: Boot your CoreOS VM installed. In your terminal, type this: $ cdcoreos-vagrant $ vagrant up We need to login to the host via ssh: $ vagrant ssh Reading and writing to ectd Let's read and write to etcd using etcdctl. So, perform these steps: Set with etcdctl a message1 key with Book1 as the value: $ etcdctl set /message1 Book1Book1 (we got respond for our successful write to etcd Now, let's read the key value to double-check whether everything is fine there: $ etcdctl get /message1 Book1 Perfect! Next, let's try to do the same using curl via an HTTP-based API. The curl function is handy for accessing etcd from any place from where you have access to etcd cluster but don't want/need to use the etcdctl client: $ curl -L -X PUT http://127.0.0.1:2379/v2/keys/message2 -d value="Book2" {"action":"set","key":"/message2","prevValue":"Book1","value":"Book2","index":13371} Let's read it: $ curl -L http://127.0.0.1:2379/v2/keys/message2 {"action":"get","node":{"key":"/message2","value":"Book2","modifiedIndex":13371,"createdIndex":13371}} Using the HTTP-based etcd API means that etcd can be read from and written to by client applications without the need to interact with the command line. Now, if we want to delete the key value pair, we type the following command: $ etcdctl rm /message1 $ curl -L -X DELETE http://127.0.0.1:2379/v2/keys/message2 Also, we can add a key value pair to a directory, as directories are created automatically when a key is placed inside. We need only one command to put a key inside a directory: $ etcdctl set /foo-directory/foo-key somekey Let's now check the directory's content: $ etcdctl ls /foo-directory –recursive /foo-directory/foo-key Finally, we get the key value from the directory by typing: $ etcdctl get /foo-directory/foo-key somekey Reading and writing from the application container Usually, application containers (this is a general term for docker, rkt, and other types of containers) do not have etcdctl or even curl installed by default. Installing curl is much easier than installing etcdctl. 
For our example, we will use the AlpineLinux docker image, which is very small in size and will not take much time to pull from docker registry: Firstly, we need to check the docker0 interface IP, which we will use with curl: $ echo"$(ifconfig docker0 | awk'/<inet>/ { print $2}'):2379" 10.1.42.1:2379 Let's run the docker container with a bash shell: $ docker run -it alpine ash We should see something like this in Command Prompt:/ #. As curl is not installed by default on AlpineLinux, we need to install it: $ apk update&&apk add curl $ curl -L http://10.1.42.1:2379/v2/keys/ {"action":"get","node":{"key":"/","dir":true,"nodes":[{"key":"/coreos.com","dir":true,"modifiedIndex":3,"createdIndex":3}]}} Repeat steps 3 and 4 from the previous subtopic so that you understand that it does not matter where you are connecting to etcd from, curl still works in the same way. Press Ctrl +D to exit from the docker container. Watching changes in etcd This time, let's watch the key changes in etcd. Watching key changes is useful when we have, for example, one fleet unit with nginx writing its port to etcd, and another reverse proxy application watching for changes and updating its config: We need to create a directory in etcd first: $ etcdctlmkdir /foo-data Next, we watch for changes in this directory: $ etcdctl watch /foo-data--recursive Now open another CoreOS shell in a new terminal window: $ cdcoreos-vagrant $ vagrantssh We put a new key to /foo-data directory: $ etcdctl set /foo-data/Book is_cool In the first terminal, we should see a notification saying that the key was changed: is_cool A TTL (time to live) examples Sometimes, it is handy to put a time to live (TTL) for a key to expire in a certain amount of time. This is useful, for example,in the case of watching a key with a 60 second TTL, from a reverse proxy. So, if the nginx fleet service has not updated the key, it will expire in 60 seconds and will be removed from etcd. Then the reverse proxy checks for it and does not find it. Hence, it will remove the nginx service from config. Let's set TTL for 30 seconds in this example: Type this in a terminal: $ etcdctl set /foo "I'm Expiring in 30 sec" --ttl 30 I'm Expiring in 30 sec Verify that the key is still there: $ etcdctl get /foo I'm Expiring in 30 sec Check againafter 30 seconds : $ etcdctl get /foo If your requested key has already expired, you will be returned Error: 100: Error: 100: Key not found (/foo) [17053] This time the key got deleted by etcd because we put a TTL of 30 seconds for it. TTL is very handy to use to communicate between different services using etcd as the checking point. Use cases of etcd Application containers running on worker nodes with etcd in proxy mode can read and write to an etcd cluster. Very common etcd use cases are as follows: storing database connection settings, cache settings, and shared settings. For example, the Vulcand proxy server (http://vulcanproxy.com/) uses etcd to store web host connection details, and it becomes available for all cluster-connected worker machines. Another example could be to store a database password for MySQL and retrieve it when running an application container. 
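As a minimal sketch of that last use case, the following commands store a MySQL password under a key and then read it back from inside an application container through the HTTP API. The key path /config/mysql/password, the password value, and the docker0 address 10.1.42.1 are assumptions carried over from the earlier examples, not fixed names; adapt them to your own setup.

$ etcdctl set /config/mysql/password mysecretpass
mysecretpass

From inside the container (as in the AlpineLinux example above):

$ curl -L http://10.1.42.1:2379/v2/keys/config/mysql/password

The JSON response carries the stored value in its node value field, just like the earlier curl read examples, so the container can fetch the credential at startup instead of having it baked into its image. Adding --ttl to the etcdctl set command makes the credential expire automatically if it should be rotated periodically.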
Summary
To learn more about CoreOS, the following book published by Packt Publishing (https://www.packtpub.com/) is recommended:
Learning CoreOS (https://www.packtpub.com/networking-and-servers/learning-coreos)
Resources for Article:
Further resources on this subject:
Mastering CentOS 7 Linux Server [article]
Linux Shell Scripting [article]
What is Kali Linux [article]

Modular Programming in ECMAScript 6

Packt
15 Feb 2016
18 min read
Modular programming is one of the most important and frequently used software design techniques. Unfortunately, JavaScript didn't support modules natively that lead JavaScript programmers to use alternative techniques to achieve modular programming in JavaScript. But now, ES6 brings modules into JavaScript officially. This article is all about how to create and import JavaScript modules. In this article, we will first learn how the modules were created earlier, and then we will jump to the new built-in module system that was introduced in ES6, known as the ES6 modules. In this article, we'll cover: What is modular programming? The benefits of modular programming The basics of IIFE modules, AMD, UMD, and CommonJS Creating and importing the ES6 modules The basics of the Modular Loader Creating a basic JavaScript library using modules (For more resources related to this topic, see here.) The JavaScript modules in a nutshell The practice of breaking down programs and libraries into modules is called modular programming. In JavaScript, a module is a collection of related objects, functions, and other components of a program or library that are wrapped together and isolated from the scope of the rest of the program or library. A module exports some variables to the outside program to let it access the components wrapped by the module. To use a module, a program needs to import the module and the variables exported by the module. A module can also be split into further modules called as its submodules, thus creating a module hierarchy. Modular programming has many benefits. Some benefits are: It keeps our code both cleanly separated and organized by splitting into multiple modules Modular programming leads to fewer global variables, that is, it eliminates the problem of global variables, because modules don't interface via the global scope, and each module has its own scope Makes code reusability easier as importing and using the same modules in different projects is easier It allows many programmers to collaborate on the same program or library, by making each programmer to work on a particular module with a particular functionality Bugs in an application can easily be easily identified as they are localized to a particular module Implementing modules – the old way Before ES6, JavaScript had never supported modules natively. Developers used other techniques and third-party libraries to implement modules in JavaScript. Using Immediately-invoked function expression (IIFE), Asynchronous Module Definition (AMD), CommonJS, and Universal Module Definition (UMD) are various popular ways of implementing modules in ES5. As these ways were not native to JavaScript, they had several problems. Let's see an overview of each of these old ways of implementing modules. The Immediately-Invoked Function Expression The IIFE is used to create an anonymous function that invokes itself. Creating modules using IIFE is the most popular way of creating modules. Let's see an example of how to create a module using IIFE: //Module Starts (function(window){   var sum = function(x, y){     return x + y;   }     var sub = function(x, y){     return x - y;   }   var math = {     findSum: function(a, b){       return sum(a,b);     },     findSub: function(a, b){       return sub(a, b);     }   }   window.math = math; })(window) //Module Ends console.log(math.findSum(1, 2)); //Output "3" console.log(math.findSub(1, 2)); //Output "-1" Here, we created a module using IIFE. 
The sum and sub variables are global to the module, but not visible outside of the module. The math variable is exported by the module to the main program to expose the functionalities that it provides. This module works completely independent of the program, and can be imported by any other program by simply copying it into the source code, or importing it as a separate file. A library using IIFE, such as jQuery, wraps its all of its APIs in a single IIFE module. When a program uses a jQuery library, it automatically imports the module. Asynchronous Module Definition AMD is a specification for implementing modules in browser. AMD is designed by keeping the browser limitations in mind, that is, it imports modules asynchronously to prevent blocking the loading of a webpage. As AMD is not a native browser specification, we need to use an AMD library. RequireJS is the most popular AMD library. Let's see an example on how to create and import modules using RequireJS. According to the AMD specification, every module needs to be represented by a separate file. So first, create a file named math.js that represents a module. Here is the sample code that will be inside the module: define(function(){   var sum = function(x, y){     return x + y;   }   var sub = function(x, y){     return x - y;   }   var math = {     findSum: function(a, b){       return sum(a,b);     },     findSub: function(a, b){       return sub(a, b);     }   }   return math; }); Here, the module exports the math variable to expose its functionality. Now, let's create a file named index.js, which acts like the main program that imports the module and the exported variables. Here is the code that will be inside the index.js file: require(["math"], function(math){   console.log(math.findSum(1, 2)); //Output "3"   console.log(math.findSub(1, 2)); //Output "-1" }) Here, math variable in the first parameter is the name of the file that is treated as the AMD module. The .js extension to the file name is added automatically by RequireJS. The math variable, which is in the second parameter, references the exported variable. Here, the module is imported asynchronously, and the callback is also executed asynchronously. CommonJS CommonJS is a specification for implementing modules in Node.js. According to the CommonJS specification, every module needs to be represented by a separate file. The CommonJS modules are imported synchronously. Let's see an example on how to create and import modules using CommonJS. First, we will create a file named math.js that represents a module. Here is a sample code that will be inside the module: var sum = function(x, y){   return x + y; } var sub = function(x, y){   return x - y; } var math = {   findSum: function(a, b){     return sum(a,b);   },   findSub: function(a, b){     return sub(a, b);   } } exports.math = math; Here, the module exports the math variable to expose its functionality. Now, let's create a file named index.js, which acts like the main program that imports the module. Here is the code that will be inside the index.js file: var math = require("./math").math; console.log(math.findSum(1, 2)); //Output "3" console.log(math.findSub(1, 2)); //Output "-1" Here, the math variable is the name of the file that is treated as module. The .js extension to the file name is added automatically by CommonJS. Universal Module Definition We saw three different specifications of implementing modules. These three specifications have their own respective ways of creating and importing modules. 
Wouldn't it have been great if we can create modules that can be imported as an IIFE, AMD, or CommonJS module? UMD is a set of techniques that is used to create modules that can be imported as an IIFE, CommonJS, or AMD module. Therefore now, a program can import third-party modules, irrespective of what module specification it is using. The most popular UMD technique is returnExports. According to the returnExports technique, every module needs to be represented by a separate file. So, let's create a file named math.js that represents a module. Here is the sample code that will be inside the module: (function (root, factory) {   //Environment Detection   if (typeof define === 'function' && define.amd) {     define([], factory);   } else if (typeof exports === 'object') {     module.exports = factory();   } else {     root.returnExports = factory();   } }(this, function () {   //Module Definition   var sum = function(x, y){     return x + y;   }   var sub = function(x, y){     return x - y;   }   var math = {     findSum: function(a, b){       return sum(a,b);     },     findSub: function(a, b){       return sub(a, b);     }   }   return math; })); Now, you can successfully import the math.js module any way that you wish, for instance, by using CommonJS, RequireJS, or IIFE. Implementing modules – the new way ES6 introduced a new module system called ES6 modules. The ES6 modules are supported natively and therefore, they can be referred as the standard JavaScript modules. You should consider using ES6 modules instead of the old ways, because they have neater syntax, better performance, and many new APIs that are likely to be packed as the ES6 modules. Let's have a look at the ES6 modules in detail. Creating the ES6 modules Every ES6 module needs to be represented by a separate .js file. An ES6 module can contain any JavaScript code, and it can export any number of variables. A module can export a variable, function, class, or any other entity. We need to use the export statement in a module to export variables. The export statement comes in many different formats. Here are the formats: export {variableName}; export {variableName1, variableName2, variableName3}; export {variableName as myVariableName}; export {variableName1 as myVariableName1, variableName2 as myVariableName2}; export {variableName as default}; export {variableName as default, variableName1 as myVariableName1, variableName2}; export default function(){}; export {variableName1, variableName2} from "myAnotherModule"; export * from "myAnotherModule"; Here are the differences in these formats: The first format exports a variable. The second format is used to export multiple variables. The third format is used to export a variable with another name, that is, an alias. The fourth format is used to export multiple variables with different names. The fifth format uses default as the alias. We will find out the use of this later in this article. The sixth format is similar to fourth format, but it also has the default alias. The seventh format works similar to fifth format, but here you can place an expression instead of a variable name. The eighth format is used to export the exported variables of a submodule. The ninth format is used to export all the exported variables of a submodule. Here are some important things that you need to know about the export statement: An export statement can be used anywhere in a module. It's not compulsory to use it at the end of the module. There can be any number of export statements in a module. 
You cannot export variables on demand. For example, placing the export statement in the if…else condition throws an error. Therefore, we can say that the module structure needs to be static, that is, exports can be determined on compile time. You cannot export the same variable name or alias multiple times. But you can export a variable multiple times with a different alias. All the code inside a module is executed in the strict mode by default. The values of the exported variables can be changed inside the module that exported them. Importing the ES6 modules To import a module, we need to use the import statement. The import statement comes in many different formats. Here are the formats: import x from "module-relative-path"; import {x} from "module-relative-path"; import {x1 as x2} from "module-relative-path"; import {x1, x2} from "module-relative-path"; import {x1, x2 as x3} from "module-relative-path"; import x, {x1, x2} from "module-relative-path"; import "module-relative-path"; import * as x from "module-relative-path"; import x1, * as x2 from "module-relative-path"; An import statement consists of two parts: the variable names we want to import and the relative path of the module. Here are the differences in these formats: In the first format, the default alias is imported. The x is alias of the default alias. In the second format, the x variable is imported. The third format is the same as the second format. It's just that x2 is an alias of x1. In the fourth format, we import the x1 and x2 variables. In the fifth format, we import the x1 and x2 variables. The x3 is an alias of the x2 variable. In the sixth format, we import the x1 and x2 variable, and the default alias. The x is an alias of the default alias. In the seventh format, we just import the module. We do not import any of the variables exported by the module. In the eighth format, we import all the variables, and wrap them in an object called x. Even the default alias is imported. The ninth format is the same as the eighth format. Here, we give another alias to the default alias.[RR1]  Here are some important things that you need to know about the import statement: While importing a variable, if we import it with an alias, then to refer to that variable, we have to use the alias and not the actual variable name, that is, the actual variable name will not be visible, only the alias will be visible. The import statement doesn't import a copy of the exported variables; rather, it makes the variables available in the scope of the program that imports it. Therefore, if you make a change to an exported variable inside the module, then the change is visible to the program that imports it. The imported variables are read-only, that is, you cannot reassign them to something else outside of the scope of the module that exports them. A module can only be imported once in a single instance of a JavaScript engine. If we try to import it again, then the already imported instance of the module will be used. We cannot import modules on demand. For example, placing the import statement in the if…else condition throws an error. Therefore, we can say that the imports should be able to be determined on compile time. The ES6 imports are faster than the AMD and CommonJS imports, because the ES6 imports are supported natively and also as importing modules and exporting variables are not decided on demand. Therefore, it makes JavaScript engine easier to optimize performance. 
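To tie the export and import formats together, here is a small self-contained sketch. The file names math_utils.js and app.js and the exported names are made-up examples, not part of the library built later in this article; any module path that your loader can resolve works the same way.

// math_utils.js
const PI = 3.141592653589793;

function square(x) {
  return x * x;
}

function cube(x) {
  return x * x * x;
}

// two named exports and one default export
export {square, cube};
export default PI;

// app.js
import PI, {square, cube as cubed} from "math_utils";

console.log(square(3)); // 9
console.log(cubed(2));  // 8
console.log(PI);        // 3.141592653589793

Note that inside app.js only the alias cubed is visible, not the original name cube, and PI here is just the importer's name for the default export; both behaviors follow the rules listed above.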
The module loader A module loader is a component of a JavaScript engine that is responsible for importing modules. The import statement uses the build-in module loader to import modules. The built-in module loaders of the different JavaScript environments use different module loading mechanisms. For example, when we import a module in JavaScript running in the browsers, then the module is loaded from the server. On the other hand, when we import a module in Node.js, then the module is loaded from filesystem. The module loader loads modules in a different manner, in different environments, to optimize the performance. For example, in the browsers, the module loader loads and executes modules asynchronously in order to prevent the importing of the modules that block the loading of a webpage. You can programmatically interact with the built-in module loader using the module loader API to customize its behavior, intercept module loading, and fetch the modules on demand. We can also use this API to create our own custom module loaders. The specifications of the module loader are not specified in ES6. It is a separate standard, controlled by the WHATWG browser standard group. You can find the specifications of the module loader at http://whatwg.github.io/loader/. The ES6 specifications only specify the import and export statements. Using modules in browsers The code inside the <script> tag doesn't support the import statement, because the tag's synchronous nature is incompatible with the asynchronicity of the modules in browsers. Instead, you need to use the new <module> tag to import modules. Using the new <module> tag, we can define a script as a module. Now, this module can import other modules using the import statement. If you want to import a module using the <script> tag, then you have to use the Module Loader API. The specifications of the <module> tag are not specified in ES6. Using modules in the eval() function You cannot use the import and export statements in the eval() function. To import modules in the eval() function, you need to use the Module Loader API. The default exports vs. the named exports When we export a variable with the default alias, then it's called as a default export. Obviously, there can only be one default export in a module, as an alias can be used only once. All the other exports except the default export are called as named exports. It's recommended that a module should either use default export or named exports. It's not a good practice to use both together. The default export is used when we want to export only one variable. On the other hand, the named exports are used when we want to export the multiple variables. Diving into an example Let's create a basic JavaScript library using the ES6 modules. This will help us understand how to use the import and export statements. We will also learn how a module can import other modules. The library that we will create is going to be a math library, which provides basic logarithmic and trigonometric functions. Let's get started with creating our library: Create a file named math.js, and a directory named math_modules. Inside the math_modules directory, create two files named logarithm.js and trigonometry.js, respectively. Here, the math.js file is the root module, whereas the logarithm.js and the trigonometry.js files are its submodules. 
Place this code inside the logarithm.js file: var LN2 = Math.LN2; var LN10 = Math.LN10;   function getLN2() {   return LN2; }   function getLN10() {   return LN10; }   export {getLN2, getLN10}; Here, the module exports the functions as named exports. It's preferable for the low-level modules in a module hierarchy to export all of their variables separately, because a program may need just one exported variable of a library. In this case, a program can import this module and a particular function directly. Loading all the modules when you need just one module is a bad idea in terms of performance. Similarly, place this code in the trigonometry.js file: var cos = Math.cos; var sin = Math.sin; function getSin(value) {   return sin(value); } function getCos(value) {   return cos(value); } export {getCos, getSin}; Here we do something similar. Place this code inside the math.js file, which acts as the root module: import * as logarithm from "math_modules/logarithm"; import * as trigonometry from "math_modules/trigonometry"; export default {   logarithm: logarithm,   trigonometry: trigonometry } It doesn't contain any library functions. Instead, it makes it easy for a program to import the complete library. It imports its submodules and then exports their exported variables to the main program. If the logarithm.js and trigonometry.js scripts depend on other submodules, the math.js module shouldn't import those submodules, because logarithm.js and trigonometry.js already import them. Here is the code with which a program can import the complete library: import math from "math"; console.log(math.trigonometry.getSin(3)); console.log(math.logarithm.getLN2()); Summary In this article, we saw what modular programming is and learned different modular programming specifications. We also saw different ways to create modules using JavaScript. Technologies such as the IIFE, CommonJS, AMD, UMD, and ES6 modules were covered. Finally, we created a basic library using the modular programming design technique. Now, you should be confident enough to build JavaScript apps using the ES6 modules. To learn more about ECMAScript and JavaScript, the following books published by Packt Publishing (https://www.packtpub.com/) are recommended: JavaScript at Scale (https://www.packtpub.com/web-development/javascript-scale) Google Apps Script for Beginners (https://www.packtpub.com/web-development/google-apps-script-beginners) Learning TypeScript (https://www.packtpub.com/web-development/learning-typescript) JavaScript Concurrency (https://www.packtpub.com/web-development/javascript-concurrency) You can also watch out for an upcoming title, Mastering JavaScript Object-Oriented Programming, on Packt Publishing's website at https://www.packtpub.com/web-development/mastering-javascript-object-oriented-programming. Resources for Article: Further resources on this subject: Concurrency Principles [article] Using Client Methods [article] HTML5 APIs [article]
Recommending Movies at Scale (Python)

Packt
15 Feb 2016
57 min read
In this article, we will cover the following recipes: Modeling preference expressions Understanding the data Ingesting the movie review data Finding the highest-scoring movies Improving the movie-rating system Measuring the distance between users in the preference space Computing the correlation between users Finding the best critic for a user Predicting movie ratings for users Collaboratively filtering item by item Building a nonnegative matrix factorization model Loading the entire dataset into the memory Dumping the SVD-based model to the disk Training the SVD-based model Testing the SVD-based model (For more resources related to this topic, see here.) Introduction From books to movies to people to follow on Twitter, recommender systems carve the deluge of information on the Internet into a more personalized flow, thus improving the performance of e-commerce, web, and social applications. It is no great surprise, given the success of Amazon-monetizing recommendations and the Netflix Prize, that any discussion of personalization or data-theoretic prediction would involve a recommender. What is surprising is how simple recommenders are to implement yet how susceptible they are to vagaries of sparse data and overfitting. Consider a non-algorithmic approach to eliciting recommendations; one of the easiest ways to garner a recommendation is to look at the preferences of someone we trust. We are implicitly comparing our preferences to theirs, and the more similarities you share, the more likely you are to discover novel, shared preferences. However, everyone is unique, and our preferences exist across a variety of categories and domains. What if you could leverage the preferences of a great number of people and not just those you trust? In the aggregate, you would be able to see patterns, not just of people like you, but also "anti-recommendations"— things to stay away from, cautioned by the people not like you. You would, hopefully, also see subtle delineations across the shared preference space of groups of people who share parts of your own unique experience. It is this basic premise that a group of techniques called "collaborative filtering" use to make recommendations. Simply stated, this premise can be boiled down to the assumption that those who have similar past preferences will share the same preferences in the future. This is from a human perspective, of course, and a typical corollary to this assumption is from the perspective of the things being preferred—sets of items that are preferred by the same people will be more likely to preferred together in the future—and this is the basis for what is commonly described in the literature as user-centric collaborative filtering versus item-centric collaborative filtering. The term collaborative filtering was coined by David Goldberg in a paper titled Using collaborative filtering to weave an information tapestry, ACM, where he proposed a system called Tapestry, which was designed at Xerox PARC in 1992, to annotate documents as interesting or uninteresting and to give document recommendations to people who are searching for good reads. Collaborative filtering algorithms search large groupings of preference expressions to find similarities to some input preference or preferences. The output from these algorithms is a ranked list of suggestions that is a subset of all possible preferences, and hence, it's called "filtering". The "collaborative" comes from the use of many other peoples' preferences in order to find suggestions for themselves. 
This can be seen either as a search of the space of preferences (for brute-force techniques), a clustering problem (grouping similarly preferred items), or even some other predictive model. Many algorithmic attempts have been created in order to optimize or solve this problem across sparse or large datasets, and we will discuss a few of them in this article. The goals of this article are: Understanding how to model preferences from a variety of sources Learning how to compute similarities using distance metrics Modeling recommendations using matrix factorization for star ratings These two different models will be implemented in Python using readily available datasets on the Web. To demonstrate the techniques in this article, we will use the oft-cited MovieLens database from the University of Minnesota that contains star ratings of moviegoers for their preferred movies. Modeling preference expressions We have already pointed out that companies such as Amazon track purchases and page views to make recommendations, Goodreads and Yelp use 5-star ratings and text reviews, and sites such as Reddit or Stack Overflow use simple up/down voting. You can see that preference can be expressed in the data in different ways, from Boolean flags to voting to ratings. However the preferences are expressed, recommendations are made by finding groups of similarities in these preference expressions, which is where you leverage the core assumption of collaborative filtering. More formally, we understand that two people, Bob and Alice, share a preference for a specific item or widget. If Alice also has a preference for a different item, say, a sprocket, then Bob has a better than random chance of also sharing a preference for the sprocket. We believe that Bob and Alice's taste similarities can be expressed in an aggregate via a large number of preferences, and by leveraging the collaborative nature of groups, we can filter the world of products. How to do it… We will model preference expressions over the next few recipes, including: Understanding the data Ingesting the movie review data Finding the highest rated movies Improving the movie-rating system How it works… A preference expression is an instance of a model of demonstrable relative selection. That is to say, preference expressions are data points that are used to show the subjective ranking of a group of items for a person. Even more formally, we should say that preference expressions are not simply relative, but also temporal—for example, the statement of preference also has a fixed time relativity as well as item relativity. While it would be nice to think that we can subjectively and accurately express our preferences in a global context (for example, rate a movie as compared to all other movies), our tastes, in fact, change over time, and we can really only consider how we rank items relative to each other. Models of preference must take this into account and attempt to alleviate the biases that this causes. The most common types of preference expression models simplify the problem of ranking by making the expression numerically fuzzy, for example: Boolean expressions (yes or no) Up and down voting (such as abstain, dislike) Weighted signaling (the number of clicks or actions) Broad ranked classification (stars, hated or loved) The idea is to create a preference model for an individual user—a numerical model of the set of preference expressions for a particular individual. 
Models build the individual preference expressions into a useful user-specific context that can be computed against. Further reasoning can be performed on the models in order to alleviate time-based biases or to perform ontological reasoning or other categorizations. As the relationships between entities get more complex, you can express their relative preferences by assigning behavioral weights to each type of semantic connection. However, choosing the weights is difficult and requires research, which is why fuzzy generalizations are preferred. As an example, the following table shows you some well-known ranking preference systems:

Reddit Voting       Online Shopping       Star Reviews
Up Vote      1      Bought         2      Love      5
No Vote      0      Viewed         1      Liked     4
Down Vote   -1      No purchase    0      Neutral   3
                                          Dislike   2
                                          Hate      1

For the rest of this article, we will only consider a single, very common preference expression: star ratings on a scale of 1 to 5. Understanding the data Understanding your data is critical to all data-related work. In this recipe, we acquire and take a first look at the data that we will be using to build our recommendation engine. Getting ready To prepare for this recipe, and the rest of the article, download the MovieLens data from the GroupLens website of the University of Minnesota. You can find the data at http://grouplens.org/datasets/movielens/. In this article, we will use the smaller MovieLens 100k dataset (4.7 MB in size) in order to load the entire model into the memory with ease. How to do it… Perform the following steps to better understand the data that we will be working with throughout this article: Download the data from http://grouplens.org/datasets/movielens/. The 100K dataset is the one that you want (ml-100k.zip). Unzip the downloaded data into the directory of your choice. The two files that we are mainly concerned with are u.data, which contains the user movie ratings, and u.item, which contains movie information and details. 
To get a sense of each file, use the head command at the command prompt for Mac and Linux or the more command for Windows: head -n 5 u.item Note that if you are working on a computer running the Microsoft Windows operating system and not using a virtual machine (not recommended), you do not have access to the head command; instead, use the following command: more u.item 2 n The preceding command gives you the following output: 1|Toy Story (1995)|01-Jan-1995||http://us.imdb.com/M/title- exact?Toy%20Story%20(1995)|0|0|0|1|1|1|0|0|0|0|0|0|0|0|0|0|0|0 |0 2|GoldenEye (1995)|01-Jan-1995||http://us.imdb.com/M/title- exact?GoldenEye%20(1995)|0|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0|1|0|0 3|Four Rooms (1995)|01-Jan-1995||http://us.imdb.com/M/title- exact?Four%20Rooms%20(1995)|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|1| 0|0 4|Get Shorty (1995)|01-Jan-1995||http://us.imdb.com/M/title- exact?Get%20Shorty%20(1995)|0|1|0|0|0|1|0|0|1|0|0|0|0|0|0|0|0| 0|0 5|Copycat (1995)|01-Jan-1995||http://us.imdb.com/M/title- exact?Copycat%20(1995)|0|0|0|0|0|0|1|0|1|0|0|0|0|0|0|0|1|0|0 The following command will produce the given output: head -n 5 u.data For Windows, you can use the following command: more u.item 2 n 196 242 3 881250949 186 302 3 891717742 22 377 1 878887116 244 51 2 880606923 166 346 1 886397596 How it works… The two main files that we will be using are as follows: u.data: This contains the user moving ratings u.item: This contains the movie information and other details Both are character-delimited files; u.data, which is the main file, is tab delimited, and u.item is pipe delimited. For u.data, the first column is the user ID, the second column is the movie ID, the third is the star rating, and the last is the timestamp. The u.item file contains much more information, including the ID, title, release date, and even a URL to IMDB. Interestingly, this file also has a Boolean array indicating the genre(s) of each movie, including (in order) action, adventure, animation, children, comedy, crime, documentary, drama, fantasy, film-noir, horror, musical, mystery, romance, sci-fi, thriller, war, and western. There's more… Free, web-scale datasets that are appropriate for building recommendation engines are few and far between. As a result, the movie lens dataset is a very popular choice for such a task but there are others as well. The well-known Netflix Prize dataset has been pulled down by Netflix. However, there is a dump of all user-contributed content from the Stack Exchange network (including Stack Overflow) available via the Internet Archive (https://archive.org/details/stackexchange). Additionally, there is a book-crossing dataset that contains over a million ratings of about a quarter million different books (http://www2.informatik.uni-freiburg.de/~cziegler/BX/). Ingesting the movie review data Recommendation engines require large amounts of training data in order to do a good job, which is why they're often relegated to big data projects. However, to build a recommendation engine, we must first get the required data into memory and, due to the size of the data, must do so in a memory-safe and efficient way. Luckily, Python has all of the tools to get the job done, and this recipe shows you how. Getting ready You will need to have the appropriate movie lens dataset downloaded, as specified in the preceding recipe. Ensure that you have NumPy correctly installed. 
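If you would rather stay in Python than rely on head or more (which can be handy on Windows), a few lines like the following will confirm that the unzipped files are in place. This snippet is purely illustrative, written in the Python 2 style used throughout this article, and assumes that the ml-100k folder sits in your current working directory:

# Peek at the first record of each file and count the rows.
for name in ('ml-100k/u.data', 'ml-100k/u.item'):
    with open(name) as f:
        lines = f.readlines()
    print '%s: %d rows' % (name, len(lines))
    print lines[0].strip()

For the 100k dataset, u.data should report 100,000 rows; if it doesn't, re-check the download and extraction before moving on.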
How to do it… The following steps guide you through the creation of the functions that we will need in order to load the datasets into the memory: Open your favorite Python editor or IDE. There is a lot of code, so it should be far simpler to enter it directly into a text file than into the Read Eval Print Loop (REPL). We create a function to import the movie reviews: import csv from datetime import datetime def load_reviews(path, **kwargs): """ Loads MovieLens reviews """ options = { 'fieldnames': ('userid', 'movieid', 'rating', 'timestamp'), 'delimiter': '\t', } options.update(kwargs) parse_date = lambda r,k: datetime.fromtimestamp(float(r[k])) parse_int = lambda r,k: int(r[k]) with open(path, 'rb') as reviews: reader = csv.DictReader(reviews, **options) for row in reader: row['userid'] = parse_int(row, 'userid') row['movieid'] = parse_int(row, 'movieid') row['rating'] = parse_int(row, 'rating') row['timestamp'] = parse_date(row, 'timestamp') yield row We create a helper function to help import the data: import os def relative_path(path): """ Returns a path relative from this code file """ dirname = os.path.dirname(os.path.realpath('__file__')) path = os.path.join(dirname, path) return os.path.normpath(path) We create another function to load the movie information: def load_movies(path, **kwargs): """ Loads MovieLens movies """ options = { 'fieldnames': ('movieid', 'title', 'release', 'video', 'url'), 'delimiter': '|', 'restkey': 'genre', } options.update(kwargs) parse_int = lambda r,k: int(r[k]) parse_date = lambda r,k: datetime.strptime(r[k], '%d-%b-%Y') if r[k] else None with open(path, 'rb') as movies: reader = csv.DictReader(movies, **options) for row in reader: row['movieid'] = parse_int(row, 'movieid') row['release'] = parse_date(row, 'release') row['video'] = parse_date(row, 'video') yield row Finally, we start creating a MovieLens class that will be augmented in later recipes: from collections import defaultdict class MovieLens(object): """ Data structure to build our recommender model on. """ def __init__(self, udata, uitem): """ Instantiate with a path to u.data and u.item """ self.udata = udata self.uitem = uitem self.movies = {} self.reviews = defaultdict(dict) self.load_dataset() def load_dataset(self): """ Loads the two datasets into memory, indexed on the ID. """ for movie in load_movies(self.uitem): self.movies[movie['movieid']] = movie for review in load_reviews(self.udata): self.reviews[review['userid']][review['movieid']] = review Ensure that the functions have been imported into your REPL or the IPython workspace, and type the following, making sure that the path to the data files is appropriate for your system: data = relative_path('data/ml-100k/u.data') item = relative_path('data/ml-100k/u.item') model = MovieLens(data, item) How it works… The methodology that we use for the two data-loading functions (load_reviews and load_movies) is simple, but it takes care of the details of parsing the data from the disk. We created a function that takes a path to our dataset and then any optional keywords. We know that we have specific ways in which we need to interact with the csv module, so we create default options, passing in the field names of the rows along with the delimiter, which is '\t' (a tab). The options.update(kwargs) line means that we'll accept whatever users pass to this function. We then created internal parsing functions using a lambda function in Python. These simple parsers take a row and a key as input and return the converted input. 
This is an example of using lambda as internal, reusable code blocks and is a common technique in Python. Finally, we open our file and create a csv.DictReader function with our options. Iterating through the rows in the reader, we parse the fields that we want to be int and datetime, respectively, and then yield the row. Note that as we are unsure about the actual size of the input file, we are doing this in a memory-safe manner using Python generators. Using yield instead of return ensures that Python creates a generator under the hood and does not load the entire dataset into the memory. We'll use each of these methodologies to load the datasets at various times through our computation that uses this dataset. We'll need to know where these files are at all times, which can be a pain, especially in larger code bases; in the There's more… section, we'll discuss a Python pro-tip to alleviate this concern. Finally, we created a data structure, which is the MovieLens class, with which we can hold our reviews' data. This structure takes the udata and uitem paths, and then, it loads the movies and reviews into two Python dictionaries that are indexed by movieid and userid, respectively. To instantiate this object, you will execute something as follows: data = relative_path('../data/ml-100k/u.data') item = relative_path('../data/ml-100k/u.item') model = MovieLens(data, item) Note that the preceding commands assume that you have your data in a folder called data. We can now load the whole dataset into the memory, indexed on the various IDs specified in the dataset. Did you notice the use of the relative_path function? When dealing with fixtures such as these to build models, the data is often included with the code. When you specify a path in Python, such as data/ml-100k/u.data, it looks it up relative to the current working directory where you ran the script. To help ease this trouble, you can specify the paths that are relative to the code itself: import os def relative_path(path): """ Returns a path relative from this code file """ dirname = os.path.dirname(os.path.realpath('__file__')) path = os.path.join(dirname, path) return os.path.normpath(path) Keep in mind that this holds the entire data structure in memory; in the case of the 100k dataset, this will require 54.1 MB, which isn't too bad for modern machines. However, we should also keep in mind that we'll generally build recommenders using far more than just 100,000 reviews. This is why we have configured the data structure the way we have—very similar to a database. To grow the system, you will replace the reviews and movies properties with database access functions or properties, which will yield data types expected by our methods. Finding the highest-scoring movies If you're looking for a good movie, you'll often want to see the most popular or best rated movies overall. Initially, we'll take a naïve approach to compute a movie's aggregate rating by averaging the user reviews for each movie. This technique will also demonstrate how to access the data in our MovieLens class. Getting ready These recipes are sequential in nature. Thus, you should have completed the previous recipes in the article before starting with this one. How to do it… Follow these steps to output numeric scores for all movies in the dataset and compute a top-10 list: Augment the MovieLens class with a new method to get all reviews for a particular movie: class MovieLens(object): ... 
def reviews_for_movie(self, movieid): """ Yields the reviews for a given movie """ for review in self.reviews.values(): if movieid in review: yield review[movieid] Then, add an additional method to compute the top 10 movies reviewed by users: import heapq from operator import itemgetter class MovieLens(object): ... def average_reviews(self): """ Averages the star rating for all movies. Yields a tuple of movieid, the average rating, and the number of reviews. """ for movieid in self.movies: reviews = list(r['rating'] for r in self.reviews_for_movie(movieid)) average = sum(reviews) / float(len(reviews)) yield (movieid, average, len(reviews)) def top_rated(self, n=10): """ Yields the n top rated movies """ return heapq.nlargest(n, self.average_reviews(), key=itemgetter(1)) Note that the … notation just below class MovieLens(object): signifies that we will be appending the average_reviews method to the existing MovieLens class. Now, let's print the top-rated results: for mid, avg, num in model.top_rated(10): title = model.movies[mid]['title'] print "[%0.3f average rating (%i reviews)] %s" % (avg, num,title) Executing the preceding commands in your REPL should produce the following output: [5.000 average rating (1 reviews)] Entertaining Angels: The Dorothy Day Story (1996) [5.000 average rating (2 reviews)] Santa with Muscles (1996) [5.000 average rating (1 reviews)] Great Day in Harlem, A (1994) [5.000 average rating (1 reviews)] They Made Me a Criminal (1939) [5.000 average rating (1 reviews)] Aiqing wansui (1994) [5.000 average rating (1 reviews)] Someone Else's America (1995) [5.000 average rating (2 reviews)] Saint of Fort Washington, The (1993) [5.000 average rating (3 reviews)] Prefontaine (1997) [5.000 average rating (3 reviews)] Star Kid (1997) [5.000 average rating (1 reviews)] Marlene Dietrich: Shadow and Light (1996) How it works… The new reviews_for_movie() method that is added to the MovieLens class iterates through our review dictionary values (which are indexed by the userid parameter), checks whether the movieid value has been reviewed by the user, and then presents that review dictionary. We will need such functionality for the next method. With the average_review() method, we have created another generator function that goes through all of our movies and all of their reviews and presents the movie ID, the average rating, and the number of reviews. The top_rated function uses the heapq module to quickly sort the reviews based on the average. The heapq data structure, also known as the priority queue algorithm, is the Python implementation of an abstract data structure with interesting and useful properties. Heaps are binary trees that are built so that every parent node has a value that is either less than or equal to any of its children nodes. Thus, the smallest element is the root of the tree, which can be accessed in constant time, which is a very desirable property. With heapq, Python developers have an efficient means to insert new values in an ordered data structure and also return sorted values. There's more… Here, we run into our first problem—some of the top-rated movies only have one review (and conversely, so do the worst-rated movies). How do you compare Casablanca, which has a 4.457 average rating (243 reviews), with Santa with Muscles, which has a 5.000 average rating (2 reviews)? We are sure that those two reviewers really liked Santa with Muscles, but the high rating for Casablanca is probably more meaningful because more people liked it. 
Most recommenders with star ratings will simply output the average rating along with the number of reviewers, allowing the user to determine their quality; however, as data scientists, we can do better in the next recipe. See also The heapq documentation available at https://docs.python.org/2/library/heapq.html Improving the movie-rating system We don't want to build a recommendation engine with a system that considers the likely straight-to-DVD Santa with Muscles as generally superior to Casablanca. Thus, the naïve scoring approach used previously must be improved upon and is the focus of this recipe. Getting ready Make sure that you have completed the previous recipes in this article first. How to do it… The following steps implement and test a new movie-scoring algorithm: Let's implement a new Bayesian movie-scoring algorithm as shown in the following function, adding it to the MovieLens class: def bayesian_average(self, c=59, m=3): """ Reports the Bayesian average with parameters c and m. """ for movieid in self.movies: reviews = list(r['rating'] for r in self.reviews_for_movie(movieid)) average = ((c * m) + sum(reviews)) / float(c + len(reviews)) yield (movieid, average, len(reviews)) Next, we will replace the top_rated method in the MovieLens class with the version in the following commands that uses the new bayesian_average method from the preceding step: def top_rated(self, n=10): """ Yields the n top rated movies """ return heapq.nlargest(n, self.bayesian_average(), key=itemgetter(1)) Printing our new top-10 list looks a bit more familiar to us, and Casablanca is now happily rated number 4: [4.234 average rating (583 reviews)] Star Wars (1977) [4.224 average rating (298 reviews)] Schindler's List (1993) [4.196 average rating (283 reviews)] Shawshank Redemption, The (1994) [4.172 average rating (243 reviews)] Casablanca (1942) [4.135 average rating (267 reviews)] Usual Suspects, The (1995) [4.123 average rating (413 reviews)] Godfather, The (1972) [4.120 average rating (390 reviews)] Silence of the Lambs, The (1991) [4.098 average rating (420 reviews)] Raiders of the Lost Ark (1981) [4.082 average rating (209 reviews)] Rear Window (1954) [4.066 average rating (350 reviews)] Titanic (1997) How it works… Taking the average of movie reviews, as shown in the previous recipe, simply did not work because some movies did not have enough ratings to give a meaningful comparison to movies with more ratings. What we'd really like is to have every single movie critic rate every single movie. Given that this is impossible, we could derive an estimate for how the movie would be rated if an infinite number of people rated the movie; this is hard to infer from one data point, so we should say that we would like to estimate the movie rating as if the same, average number of people had given it a rating (in effect, tempering our results by the number of reviews). This estimate can be computed with a Bayesian average, implemented in the bayesian_average() function, to infer these ratings based on the following equation: bayesian_average = (C*m + sum of ratings) / (C + N), where N is the number of ratings the movie has received. Here, m is our prior for the average of stars, and C is a confidence parameter that is equivalent to the number of observations in our posterior. Determining priors can be a complicated and magical art. Rather than taking the complex path of fitting a Dirichlet distribution to our data, we can simply choose an m prior of 3 with our 5-star rating system, which means that our prior assumes that star ratings tend to be reviewed around the median value. 
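To see what these defaults do to the problem cases from the previous recipe, here is a quick, purely illustrative back-of-the-envelope check (the review counts and averages come from the earlier output):

def quick_bayesian(c, m, ratings_sum, n):
    # (C*m + sum of ratings) / (C + number of ratings)
    return (c * m + ratings_sum) / float(c + n)

# Santa with Muscles: two 5-star reviews
print quick_bayesian(59, 3, 5 + 5, 2)          # roughly 3.07, so it drops out of the top 10

# Casablanca: 243 reviews averaging 4.457 stars
print quick_bayesian(59, 3, 4.457 * 243, 243)  # roughly 4.17, matching the new top-10 list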
In choosing C, you are expressing how many reviews are needed to get away from the prior; we can compute this by looking at the average number of reviews per movie: print float(sum(num for mid, avg, num in model.average_reviews())) / len(model.movies) This gives us an average number of 59.4, which we use as the default value in our function definition. There's more… Play around with the C parameter. You should find that if you change the parameter so that C = 50, the top-10 list subtly shifts; in this case, Schindler's List and Star Wars are swapped in rankings, as are Raiders of the Lost Ark and Rear Window— note that both the swapped movies have far more reviews than the former, which means that the higher C parameter was balancing the fewer ratings of the other movie. See also See how Yelp deals with this challenge at http://venturebeat.com/2009/10/12/how-yelp-deals-with-everybody-getting-four-stars-on-average/ Measuring the distance between users in the preference space The two most recognizable types of collaborative filtering systems are user-based recommenders and item-based recommenders. If one were to imagine that the preference space is an N-dimensional feature space where either users or items are plotted, then we would say that similar users or items tend to cluster near each other in this preference space; hence, an alternative name for this type of collaborative filtering is nearest neighbor recommenders. A crucial step in this process is to come up with a similarity or distance metric with which we can compare critics to each other or mutually preferred items. This metric is then used to perform pairwise comparisons of a particular user to all other users, or conversely, for an item to be compared to all other items. Normalized comparisons are then used to determine recommendations. Although the computational space can become exceedingly large, distance metrics themselves are not difficult to compute, and in this recipe, we will explore a few as well as implement our first recommender system. In this recipe, we will measure the distance between users; in the recipe after this one, we will look at another similarity distance indicator. Getting ready We will continue to build on the MovieLens class from the section titled Modeling Preference. If you have not had the opportunity to review this section, please have the code for that class ready. Importantly, we will want to access the data structures, MovieLens.movies and MovieLens.reviews, that have been loaded from the CSV files on the disk. How to do it… The following set of steps provide instructions on how to compute the Euclidean distance between users: Augment the MovieLens class with a new method, shared_preferences, to pull out movies that have been rated by two critics, A and B: class MovieLens(objects): ... 
def shared_preferences(self, criticA, criticB): """ Returns the intersection of ratings for two critics """ if criticA not in self.reviews: raise KeyError("Couldn't find critic '%s' in data" % criticA) if criticB not in self.reviews: raise KeyError("Couldn't find critic '%s' in data" % criticB) moviesA = set(self.reviews[criticA].keys()) moviesB = set(self.reviews[criticB].keys()) shared = moviesA & moviesB # Intersection operator # Create a reviews dictionary to return reviews = {} for movieid in shared: reviews[movieid] = ( self.reviews[criticA][movieid]['rating'], self.reviews[criticB][movieid]['rating'], ) return reviews Then, implement a function that computes the Euclidean distance between two critics using their shared movie preferences as a vector for the computation. This method will also be part of the MovieLens class: from math import sqrt ... def euclidean_distance(self, criticA, criticB): """ Reports the Euclidean distance of two critics, A&B by performing a J-dimensional Euclidean calculation of each of their preference vectors for the intersection of movies the critics have rated. """ # Get the intersection of the rated titles in the data. preferences = self.shared_preferences(criticA, criticB) # If they have no rankings in common, return 0. if len(preferences) == 0: return 0 # Sum the squares of the differences sum_of_squares = sum([pow(a-b, 2) for a, b in preferences.values()]) # Return the inverse of the distance to give a higher score to # folks who are more similar (e.g. less distance) add 1 to prevent # division by zero errors and normalize ranks in [0, 1] return 1 / (1 + sqrt(sum_of_squares)) With the preceding code implemented, test it in REPL: >>> data = relative_path('data/ml-100k/u.data') >>> item = relative_path('data/ml-100k/u.item') >>> model = MovieLens(data, item) >>> print model.euclidean_distance(232, 532) 0.1023021629920016 How it works… The new shared_preferences() method of the MovieLens class determines the shared preference space of two users. Critically, we can only compare users (the criticA and criticB input parameters) based on the things that they have both rated. This function uses Python sets to determine the list of movies that both A and B reviewed (the intersection of the movies A has rated and the movies B has rated). The function then iterates over this set, returning a dictionary whose keys are the movie IDs and the values are a tuple of ratings, for example, (ratingA, ratingB) for each movie that both users have rated. We can now use this dataset to compute similarity scores, which is done by the second function. The euclidean_distance() function takes two critics as the input, A and B, and computes the distance between users in preference space. Here, we have chosen to implement the Euclidean distance metric (the two-dimensional variation is well known to those who remember the Pythagorean theorem), but we could have implemented other metrics as well. This function will return a real number from 0 to 1, where 0 is less similar (farther apart) critics and 1 is more similar (closer together) critics. There's more… The Manhattan distance is another very popular metric and a very simple one to understand. It can simply sum the absolute values of the pairwise differences between elements of each vector. 
Or, in code, it can be executed in this manner: manhattan = sum([abs(a-b) for a, b in preferences.values()]) This metric is also called the city-block distance because, conceptually, it is as if you were counting the number of blocks north/south and east/west one would have to walk between two points in the city. Before implementing it for this recipe, you would also want to invert and normalize the value in some fashion to return a value in the [0, 1] range. See also The distance overview from Wikipedia available at http://en.wikipedia.org/wiki/Distance The Taxicab geometry from Wikipedia available at http://en.wikipedia.org/wiki/Taxicab_geometry Computing the correlation between users In the previous recipe, we used one out of many possible distance measures to capture the distance between the movie reviews of users. This distance between two specific users is not changed even if there are five or five million other users. In this recipe, we will compute the correlation between users in the preference space. Like distance metrics, there are many correlation metrics. The most popular of these are Pearson or Spearman correlations or Cosine distance. Unlike distance metrics, the correlation will change depending on the number of users and movies. Getting ready We will be continuing the efforts of the previous recipes again, so make sure you understand each one. How to do it… The following pearson_correlation method computes the Pearson correlation for two critics, criticA and criticB, and is added to the MovieLens class: def pearson_correlation(self, criticA, criticB): """ Returns the Pearson Correlation of two critics, A and B by performing the PPMC calculation on the scatter plot of (a, b) ratings on the shared set of critiqued titles. """ # Get the set of mutually rated items preferences = self.shared_preferences(criticA, criticB) # Store the length to save traversals of the len computation. # If they have no rankings in common, return 0. length = len(preferences) if length == 0: return 0 # Loop through the preferences of each critic once and compute the # various summations that are required for our final calculation. sumA = sumB = sumSquareA = sumSquareB = sumProducts = 0 for a, b in preferences.values(): sumA += a sumB += b sumSquareA += pow(a, 2) sumSquareB += pow(b, 2) sumProducts += a*b # Calculate Pearson Score numerator = (sumProducts*length) - (sumA*sumB) denominator = sqrt(((sumSquareA*length) - pow(sumA, 2)) * ((sumSquareB*length) - pow(sumB, 2))) # Prevent division by zero. if denominator == 0: return 0 return abs(numerator / denominator) How it works… The Pearson correlation computes the "product moment", which is the mean of the product of mean-adjusted random variables and is defined as the covariance of the two variables (a and b, in our case) divided by the product of the standard deviation of a and the standard deviation of b. As a formula, this looks like the following: r = cov(a, b) / (σa * σb) For a finite sample, which is what we have, the detailed formula, which was implemented in the preceding function, is as follows: r = (n*Σab - Σa*Σb) / sqrt((n*Σa² - (Σa)²) * (n*Σb² - (Σb)²)) Another way to think about the Pearson correlation is as a measure of the linear dependence between two variables. It returns a score of -1 to 1, where negative scores closer to -1 indicate a stronger negative correlation, and positive scores closer to 1 indicate a stronger, positive correlation. A score of 0 means that the two variables are not correlated. 
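As a quick, illustrative sanity check (not part of the original recipe), the hand-rolled calculation above should agree with NumPy's corrcoef function on a toy pair of shared rating vectors, up to the absolute value that the method returns:

import numpy as np

# Toy ratings two critics gave to the same five movies (made-up values).
a = [5.0, 3.0, 4.0, 4.0, 2.0]
b = [4.0, 3.0, 5.0, 3.0, 1.0]

r = np.corrcoef(a, b)[0, 1]   # NumPy's Pearson correlation coefficient
print abs(r)                  # should match pearson_correlation on the same shared ratings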
In order for us to perform comparisons, we want to normalize our similarity metrics in the space of [0, 1] so that 0 means less similar and 1 means more similar, so we return the absolute value: >>> print model.pearson_correlation(232, 532) 0.06025793538385047 There's more… We have explored two distance metrics: the Euclidean distance and the Pearson correlation. There are many more, including the Spearman correlation, Tantimoto scores, Jaccard distance, Cosine similarity, and Manhattan distance, to name a few. Choosing the right distance metric for the dataset of your recommender along with the type of preference expression used is crucial to ensuring success in this style of recommender. It's up to the reader to explore this space further based on his or her interest and particular dataset. Finding the best critic for a user Now that we have two different ways to compute a similarity distance between users, we can determine the best critics for a particular user and see how similar they are to an individual's preferences. Getting ready Make sure that you have completed the previous recipes before tackling this one. How to do it… Implement a new method for the MovieLens class, similar_critics(), that locates the best match for a user: import heapq ... def similar_critics(self, user, metric='euclidean', n=None): """ Finds, ranks similar critics for the user according to the specified distance metric. Returns the top n similar critics if n is specified. """ # Metric jump table metrics = { 'euclidean': self.euclidean_distance, 'pearson': self.pearson_correlation, } distance = metrics.get(metric, None) # Handle problems that might occur if user not in self.reviews: raise KeyError("Unknown user, '%s'." % user) if not distance or not callable(distance): raise KeyError("Unknown or unprogrammed distance metric '%s'." % metric) # Compute user to critic similarities for all critics critics = {} for critic in self.reviews: # Don't compare against yourself! if critic == user: continue critics[critic] = distance(user, critic) if n: return heapq.nlargest(n, critics.items(), key=itemgetter(1)) return critics How it works… The similar_critics method, added to the MovieLens class, serves as the heart of this recipe. It takes as parameters the targeted user and two optional parameters: the metric to be used, which defaults to euclidean, and the number of results to be returned, which defaults to None. As you can see, this flexible method uses a jump table to determine what algorithm is to be used (you can pass in euclidean or pearson to choose the distance metric). Every other critic is compared to the current user (except a comparison of the user against themselves). The results are then sorted using the flexible heapq module and the top n results are returned. To test out our implementation, print out the results of the run for both similarity distances: >>> for item in model.similar_critics(232, 'euclidean', n=10): print "%4i: %0.3f" % item 688: 1.000 914: 1.000 47: 0.500 78: 0.500 170: 0.500 335: 0.500 341: 0.500 101: 0.414 155: 0.414 309: 0.414 >>> for item in model.similar_critics(232, 'pearson', n=10): print "%4i: %0.3f" % item 33: 1.000 36: 1.000 155: 1.000 260: 1.000 289: 1.000 302: 1.000 309: 1.000 317: 1.000 511: 1.000 769: 1.000 These scores are clearly very different, and it appears that Pearson thinks that there are much more similar users than the Euclidean distance metric. The Euclidean distance metric tends to favor users who have rated fewer items exactly the same. 
Pearson correlation favors more scores that fit well linearly, and therefore, Pearson corrects grade inflation where two critics might rate movies very similarly, but one user rates them consistently one star higher than the other. If you plot out how many shared rankings each critic has, you'll see that the data is very sparse. Here is the preceding data with the number of rankings appended: Euclidean scores: 688: 1.000 (1 shared rankings) 914: 1.000 (2 shared rankings) 47: 0.500 (5 shared rankings) 78: 0.500 (3 shared rankings) 170: 0.500 (1 shared rankings) Pearson scores: 33: 1.000 (2 shared rankings) 36: 1.000 (3 shared rankings) 155: 1.000 (2 shared rankings) 260: 1.000 (3 shared rankings) 289: 1.000 (3 shared rankings) Therefore, it is not enough to find similar critics and use their ratings to predict our users' scores; instead, we will have to aggregate the scores of all of the critics, regardless of similarity, and predict ratings for the movies we haven't rated. Predicting movie ratings for users To predict how we might rate a particular movie, we can compute a weighted average of critics who have also rated the same movies as the user. The weight will be the similarity of the critic to user—if a critic has not rated a movie, then their similarity will not contribute to the overall ranking of the movie. Getting ready Ensure that you have completed the previous recipes in this large, cumulative article. How to do it… The following steps walk you through the prediction of movie ratings for users: First, add the predict_ranking function to the MovieLens class in order to predict the ranking a user might give a particular movie with similar critics: def predict_ranking(self, user, movie, metric='euclidean', critics=None): """ Predicts the ranking a user might give a movie based on the weighted average of the critics similar to the that user. """ critics = critics or self.similar_critics(user, metric=metric) total = 0.0 simsum = 0.0 for critic, similarity in critics.items(): if movie in self.reviews[critic]: total += similarity * self.reviews[critic][movie]['rating'] simsum += similarity if simsum == 0.0: return 0.0 return total / simsum Next, add the predict_all_rankings method to the MovieLens class: def predict_all_rankings(self, user, metric='euclidean', n=None): """ Predicts all rankings for all movies, if n is specified returns the top n movies and their predicted ranking. """ critics = self.similar_critics(user, metric=metric) movies = { movie: self.predict_ranking(user, movie, metric, critics) for movie in self.movies } if n: return heapq.nlargest(n, movies.items(), key=itemgetter(1)) return movies How it works… The predict_ranking method takes a user and a movie along with a string specifying the distance metric and returns the predicted rating for that movie for that particular user. A fourth argument, critics, is meant to be an optimization for the predict_all_rankings method, which we'll discuss shortly. The prediction gathers all critics who are similar to the user and computes the weighted total rating of the critics, filtered by those who actually did rate the movie in question. The weights are simply their similarity to the user, computed by the distance metric. 
This total is then normalized by the sum of the similarities to move the rating back into the space of 1 to 5 stars: >>> print model.predict_ranking(422, 50, 'euclidean') 4.35413151722 >>> print model.predict_ranking(422, 50, 'pearson') 4.3566797826 Here, we can see the predictions for Star Wars (ID 50 in our MovieLens dataset) for the user 422. The Euclidean and Pearson computations are very close to each other (which isn't necessarily to be expected), but the prediction is also very close to the user's actual rating, which is 4. The predict_all_rankings method computes the ranking predictions for all movies for a particular user according to the passed-in metric. It optionally takes a value, n, to return the top n best matches. This function optimizes the similar critics' lookup by only executing it once and then passing those discovered critics to the predict_ranking function in order to improve the performance. However, this method must be run on every single movie in the dataset: >>> for mid, rating in model.predict_all_rankings(578, 'pearson', 10): ... print "%0.3f: %s" % (rating, model.movies[mid]['title']) 5.000: Prefontaine (1997) 5.000: Santa with Muscles (1996) 5.000: Marlene Dietrich: Shadow and Light (1996) 5.000: Star Kid (1997) 5.000: Aiqing wansui (1994) 5.000: Someone Else's America (1995) 5.000: Great Day in Harlem, A (1994) 5.000: Saint of Fort Washington, The (1993) 4.954: Anna (1996) 4.817: Innocents, The (1961) As you can see, we have now computed what our recommender thinks the top movies for this particular user are, along with what we think the user will rate the movie! The top-10 list of average movie ratings plays a huge rule here and a potential improvement could be to use the Bayesian averaging in addition to the similarity weighting, but that is left for the reader to implement. Collaboratively filtering item by item So far, we have compared users to other users in order to make our predictions. However, the similarity space can be partitioned in two ways. User-centric collaborative filtering plots users in the preference space and discovers how similar users are to each other. These similarities are then used to predict rankings, aligning the user with similar critics. Item-centric collaborative filtering does just the opposite; it plots the items together in the preference space and makes recommendations according to how similar a group of items are to another group. Item-based collaborative filtering is a common optimization as the similarity of items changes slowly. Once enough data has been gathered, reviewers adding reviews does not necessarily change the fact that Toy Story is more similar to Babe than The Terminator, and users who prefer Toy Story might prefer the former to the latter. Therefore, you can simply compute item similarities once in a single offline-process and use that as a static mapping for recommendations, updating the results on a semi-regular basis. This recipe will walk you through item-by-item collaborative filtering. Getting ready This recipe requires the completion of the previous recipes in this article. 
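As a rough sketch of the offline idea mentioned above (the helper names here are hypothetical, and the code assumes the similar_items method built in this recipe along with the prefs tweak to the distance functions described later), the item similarities could be computed once and pickled for cheap lookups at recommendation time:

import pickle

def dump_item_similarities(model, path, metric='pearson'):
    # Compute similar items for every movie once, offline, and store the mapping.
    similarities = {}
    for movieid in model.movies:
        similarities[movieid] = model.similar_items(movieid, metric=metric)
    with open(path, 'wb') as f:
        pickle.dump(similarities, f)

def load_item_similarities(path):
    # At serving time, read the precomputed mapping instead of recomputing it.
    with open(path, 'rb') as f:
        return pickle.load(f)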
How to do it… Construct the following function to perform item-by-item collaborative filtering: def shared_critics(self, movieA, movieB): """ Returns the intersection of critics for two items, A and B """ if movieA not in self.movies: raise KeyError("Couldn't find movie '%s' in data" %movieA) if movieB not in self.movies: raise KeyError("Couldn't find movie '%s' in data" %movieB) criticsA = set(critic for critic in self.reviews if movieA in self.reviews[critic]) criticsB = set(critic for critic in self.reviews if movieB in self.reviews[critic]) shared = criticsA & criticsB # Intersection operator # Create the reviews dictionary to return reviews = {} for critic in shared: reviews[critic] = ( self.reviews[critic][movieA]['rating'], self.reviews[critic][movieB]['rating'], ) return reviews def similar_items(self, movie, metric='euclidean', n=None): # Metric jump table metrics = { 'euclidean': self.euclidean_distance, 'pearson': self.pearson_correlation, } distance = metrics.get(metric, None) # Handle problems that might occur if movie not in self.reviews: raise KeyError("Unknown movie, '%s'." % movie) if not distance or not callable(distance): raise KeyError("Unknown or unprogrammed distance metric '%s'." % metric) items = {} for item in self.movies: if item == movie: continue items[item] = distance(item, movie, prefs='movies') if n: return heapq.nlargest(n, items.items(), key=itemgetter(1)) return items How it works… To perform item-by-item collaborative filtering, the same distance metrics can be used but must be updated to use the preferences from shared_critics rather than shared_preferences (for example, item similarity versus user similarity). Update the functions to accept a prefs parameter that determines which preferences are to be used, but I'll leave that to the reader as it is only two lines of code. If you print out the list of similar items for a particular movie, you can see some interesting results. For example, review the similarity results for The Crying Game (1992), which has an ID of 631: for movie, similarity in model.similar_items(631, 'pearson').items(): print "%0.3f: %s" % (similarity, model.movies[movie]['title']) 0.127: Toy Story (1995) 0.209: GoldenEye (1995) 0.069: Four Rooms (1995) 0.039: Get Shorty (1995) 0.340: Copycat (1995) 0.225: Shanghai Triad (Yao a yao yao dao waipo qiao) (1995) 0.232: Twelve Monkeys (1995) ... This crime thriller is not very similar to Toy Story, which is a children's movie, but is more similar to Copycat, which is another crime thriller. Of course, critics who have rated many movies skew the results, and more movie reviews are needed before this normalizes into something more compelling. It is presumed that the item similarity scores are run regularly but do not need to be computed in real time. Given a set of computed item similarities, computing recommendations are as follows: def predict_ranking(self, user, movie, metric='euclidean'): movies = self.similar_items(movie, metric=metric) total = 0.0 simsum = 0.0 for relmovie, similarity in movies.items(): # Ignore movies already reviewed by user if relmovie in self.reviews[user]: total += similarity * self.reviews[user][relmovie]['rating'] simsum += similarity if simsum == 0.0: return 0.0 return total / simsum This method simply uses the inverted item-to-item similarity scores rather than the user-to-user similarity scores. Since similar items can be computed offline, the lookup for movies via the self.similar_items method should be a database lookup rather than a real-time computation. 
>>> print model.predict_ranking(232, 52, 'pearson') 3.980443976 You can then compute a ranked list of all possible recommendations in a similar way as the user-to-user recommendations. Building a nonnegative matrix factorization model A general improvement on the basic cross-wise nearest-neighbor similarity scoring of collaborative filtering is a matrix factorization method, which is also known as Singular Value Decomposition (SVD). Matrix factorization methods attempt to explain the ratings through the discovery of latent features that are not easily identifiable by analysts. For instance, this technique can expose possible features such as the amount of action, family friendliness, or fine-tuned genre discovery in our movies dataset. What's especially interesting about these features is that they are continuous and not discrete values and can represent an individual's preference along a continuum. In this sense, the model can explore shades of characteristics, for example, perhaps a critic in the movie reviews' dataset, such as action flicks with a strong female lead that are set in European countries. A James Bond movie might represent a shade of that type of movie even though it only ticks the set in European countries and action genre boxes. Depending on how similarly reviewers rate the movie, the strength of the female counterpart to James Bond will determine how they might like the movie. Also, extremely helpfully, the matrix factorization model does well on sparse data, that is data with few recommendation and movie pairs. Reviews' data is particularly sparse because not everyone has rated the same movies and there is a massive set of available movies. SVD can also be performed in parallel, making it a good choice for much larger datasets. In the remaining recipes in this article, we will build a nonnegative matrix factorization model in order to improve our recommendation engine. How to do it… Loading the entire dataset into the memory. Dumping the SVD-based model to the disk. Training the SVD-based model. Testing the SVD-based model. How it works… Matrix factorization, or SVD works, by finding two matrices such that when you take their dot product (also known as the inner product or scalar product), you will get a close approximation of the original matrix. We have expressed our training matrix as a sparse N x M matrix of users to movies where the values are the 5-star rating if it exists, otherwise, the value is blank or 0. By factoring the model with the values that we have and then taking the dot product of the two matrices produced by the factorization, we hope to fill in the blank spots in our original matrix with a prediction of how the user would have rated the movie in that column. The intuition is that there should be some latent features that determine how users rate an item, and these latent features are expressed through the semantics of their previous ratings. If we can discover the latent features, we will be able to predict new ratings. Additionally, there should be fewer features than there are users and movies (otherwise, each movie or user would be a unique feature). This is why we compose our factored matrices by some feature length before taking their dot product. Mathematically, this task is expressed as follows. If we have a set of U users and M movies, let R of size |U| x |M| be the matrix that contains the ratings of users. 
Assuming that we have K latent features, we find two matrices, P and Q, where P is |U| x K and Q is |M| x K, such that the dot product of P and the transpose of Q approximates R. P therefore represents the strength of the associations between users and features, and Q represents the association of movies with features. There are a few ways to go about factorization, but the choice we made was to perform gradient descent. Gradient descent initializes two random P and Q matrices, computes their dot product, and then minimizes the error compared to the original matrix by traveling down a slope of an error function (the gradient). This way, the algorithm hopes to find a local minimum where the error is within an acceptable threshold.

Our function computes the error as the squared difference between the predicted value and the actual value; for a single observed rating r_ij, the error is e_ij = r_ij - P_i · Q_j. To minimize the error, we modify the values p_ik and q_kj by descending along the gradient of the current error slope. Differentiating our error equation with respect to p yields:

∂(e_ij²)/∂p_ik = -2 * e_ij * q_kj

We then differentiate our error equation with respect to the variable q, which yields the following equation:

∂(e_ij²)/∂q_kj = -2 * e_ij * p_ik

We can then derive our learning rule, which updates the values in P and Q by a constant learning rate, α, together with the regularization term, β, that appears in our code:

p_ik = p_ik + α * (2 * e_ij * q_kj - β * p_ik)
q_kj = q_kj + α * (2 * e_ij * p_ik - β * q_kj)

This learning rate, α, should not be too large because it determines how big of a step we take towards the minimum, and it is possible to step across to the other side of the error curve. It should also not be too small, otherwise it will take forever to converge. We continue to update our P and Q matrices, minimizing the error until the sum of the squared errors is below some threshold, 0.001 in our code, or until we have performed a maximum number of iterations.

Matrix factorization has become an important technique for recommender systems, particularly those that leverage Likert-scale-like preference expressions, notably star ratings. The Netflix Prize challenge has shown us that matrix-factored approaches perform with a high degree of accuracy for ratings prediction tasks. Additionally, matrix factorization is a compact, memory-efficient representation of the parameter space for a model; it can be trained in parallel, can support multiple feature vectors, and can be improved with confidence levels. Generally, these models are used to solve cold-start problems with sparse reviews and in an ensemble with more complex hybrid recommenders that also compute content-based recommendations.

See also Wikipedia's overview of the dot product available at http://en.wikipedia.org/wiki/Dot_product

Loading the entire dataset into the memory The first step in building a nonnegative factorization model is to load the entire dataset into memory. For this task, we will rely heavily on NumPy. Getting ready In order to complete this recipe, you'll have to download the MovieLens database from the University of Minnesota GroupLens page at http://grouplens.org/datasets/movielens/ and unzip it in a working directory where your code will be. We will also use NumPy significantly in this code, so please ensure that you have this numerical analysis package downloaded and ready. Additionally, we will use the load_reviews function from the previous recipes. If you have not had the opportunity to review the appropriate section, please have the code for that function ready. How to do it… To build our matrix factorization model, we'll need to create a wrapper for the predictor that loads the entire dataset into memory. We will perform the following steps: We create the following Recommender class as shown.
Please note that this class depends on the previously created and discussed load_reviews function: import numpy as np import csv class Recommender(object): def __init__(self, udata): self.udata = udata self.users = None self.movies = None self.reviews = None self.load_dataset() def load_dataset(self): """ Load an index of users & movies as a heap and reviews table as a N x M array where N is the number of users and M is the number of movies. Note that order matters so that we can look up values outside of the matrix! """ self.users = set([]) self.movies = set([]) for review in load_reviews(self.udata): self.users.add(review['userid']) self.movies.add(review['movieid']) self.users = sorted(self.users) self.movies = sorted(self.movies) self.reviews = np.zeros(shape=(len(self.users), len(self.movies))) for review in load_reviews(self.udata): uid = self.users.index(review['userid']) mid = self.movies.index(review['movieid']) self.reviews[uid, mid] = review['rating'] With this defined, we can instantiate a model by typing the following command: data_path = '../data/ml-100k/u.data' model = Recommender(data_path) How it works… Let's go over this code line by line. The instantiation of our recommender requires a path to the u.data file; creates holders for our list of users, movies, and reviews; and then loads the dataset. We need to hold the entire dataset in memory for reasons that we will see later. The basic data structure to perform our matrix factorization on is an N x M matrix where N is the number of users and M is the number of movies. To create this, we will first load all the movies and users into an ordered list so that we can look up the index of the user or movie by its ID. In the case of MovieLens, all of the IDs are contiguous from 1; however, this might not always be the case. It is good practice to have an index lookup table. Otherwise, you will be unable to fetch recommendations from our computation! Once we have our index lookup lists, we create a NumPy array of all zeroes in the size of the length of our users' list by the length of our movies list. Keep in mind that the rows are users and the columns are movies! We then go through the ratings data a second time and then add the value of the rating at the uid, mid index location of our matrix. Note that if a user hasn't rated a movie, their rating is 0. This is important! Print the array out by entering model.reviews, and you should see something as follows: [[ 5. 3. 4. ..., 0. 0. 0.] [ 4. 0. 0. ..., 0. 0. 0.] [ 0. 0. 0. ..., 0. 0. 0.] ..., [ 5. 0. 0. ..., 0. 0. 0.] [ 0. 0. 0. ..., 0. 0. 0.] [ 0. 5. 0. ..., 0. 0. 0.]] There's more… Let's get a sense of how sparse or dense our dataset is by adding the following two methods to the Recommender class: def sparsity(self): """ Report the percent of elements that are zero in the array """ return 1 - self.density() def density(self): """ Return the percent of elements that are nonzero in the array """ nonzero = float(np.count_nonzero(self.reviews)) return nonzero / self.reviews.size Adding these methods to our Recommender class will help us evaluate our recommender, and it will also help us identify recommenders in the future. Print out the results: print "%0.3f%% sparse" % model.sparsity() print "%0.3f%% dense" % model.density() You should see that the MovieLens 100k dataset is 0.937 percent sparse and 0.063 percent dense. This is very important to keep note of along with the size of the reviews dataset. 
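Since the matrix is this sparse, it is worth knowing that the same data could also be held in a SciPy sparse structure rather than a dense NumPy array. The build_sparse_matrix helper below is only an illustrative sketch that is not used anywhere else in this article; it assumes SciPy is installed alongside NumPy:

import numpy as np
from scipy.sparse import dok_matrix

def build_sparse_matrix(reviews, users, movies):
    """
    Hypothetical variant of load_dataset(): store the ratings in a
    dictionary-of-keys sparse matrix so that unrated cells take no memory,
    and use dictionary lookups instead of repeated list.index() calls.
    """
    user_index = {uid: i for i, uid in enumerate(users)}
    movie_index = {mid: i for i, mid in enumerate(movies)}
    matrix = dok_matrix((len(users), len(movies)), dtype=np.float64)
    for review in reviews:
        uid = user_index[review['userid']]
        mid = movie_index[review['movieid']]
        matrix[uid, mid] = review['rating']
    return matrix.tocsr()  # compressed sparse row format for fast row slicing

Keep in mind that the factorization code later in this article indexes the dense array directly, so adopting a sparse representation would mean adapting that code as well.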
Sparsity, which is common to most recommender systems, means that we might be able to use sparse matrix algorithms and optimizations. Additionally, as we begin to save models, this will help us identify the models as we load them from serialized files on the disk. Dumping the SVD-based model to the disk Before we build our model, which will take a long time to train, we should create a mechanism for us to load and dump our model to the disk. If we have a way of saving the parameterization of the factored matrix, then we can reuse our model without having to train it every time we want to use it—this is a very big deal since this model will take hours to train! Luckily, Python has a built-in tool for serializing and deserializing Python objects—the pickle module. How to do it… Update the Recommender class as follows: import pickle class Recommender(object): @classmethod def load(klass, pickle_path): """ Instantiates the class by deserializing the pickle. Note that the object returned may not be an exact match to the code in this class (if it was saved before updates). """ with open(pickle_path, 'rb') as pkl: return pickle.load(pkl) def __init__(self, udata, description=None): self.udata = udata self.users = None self.movies = None self.reviews = None # Descriptive properties self.build_start = None self.build_finish = None self.description = None # Model properties self.model = None self.features = 2 self.steps = 5000 self.alpha = 0.0002 self.beta = 0.02 self.load_dataset() def dump(self, pickle_path): """ Dump the object into a serialized file using the pickle module. This will allow us to quickly reload our model in the future. """ with open(pickle_path, 'wb') as pkl: pickle.dump(self, pkl) How it works… The @classmethod feature is a decorator in Python for declaring a class method instead of an instance method. The first argument that is passed in is the type instead of an instance (which we usually refer to as self). The load class method takes a path to a file on the disk that contains a serialized pickle object, which it then loads using the pickle module. Note that the class that is returned might not be an exact match with the Recommender class at the time you run the code—this is because the pickle module saves the class, including methods and properties, exactly as it was when you dumped it. Speaking of dumping, the dump method provides the opposite functionality, allowing you to serialize the methods, properties, and data to disk in order to be loaded again in the future. To help us identify the objects that we're dumping and loading from disk, we've also added some descriptive properties including a description, some build parameters, and some timestamps to our __init__ function. Training the SVD-based model We're now ready to write our functions that factor our training dataset and build our recommender model. You can see the required functions in this recipe. How to do it… We construct the following functions to train our model. Note that these functions are not part of the Recommender class: def initialize(R, K): """ Returns initial matrices for an N X M matrix, R and K features. :param R: the matrix to be factorized :param K: the number of latent features :returns: P, Q initial matrices of N x K and M x K sizes """ N, M = R.shape P = np.random.rand(N,K) Q = np.random.rand(M,K) return P, Q def factor(R, P=None, Q=None, K=2, steps=5000, alpha=0.0002, beta=0.02): """ Performs matrix factorization on R with given parameters. 
:param R: A matrix to be factorized, dimension N x M :param P: an initial matrix of dimension N x K :param Q: an initial matrix of dimension M x K :param K: the number of latent features :param steps: the maximum number of iterations to optimize in :param alpha: the learning rate for gradient descen :param beta: the regularization parameter :returns: final matrices P and Q """ if not P or not Q: P, Q = initialize(R, K) Q = Q.T rows, cols = R.shape for step in xrange(steps): for i in xrange(rows): for j in xrange(cols): if R[i,j] > 0: eij = R[i,j] - np.dot(P[i,:], Q[:,j]) for k in xrange(K): P[i,k] = P[i,k] + alpha * (2 * eij * Q[k,j] - beta * P[i,k]) Q[k,j] = Q[k,j] + alpha * (2 * eij * P[i,k] - beta * Q[k,j]) e = 0 for i in xrange(rows): for j in xrange(cols): if R[i,j] > 0: e = e + pow(R[i,j] - np.dot(P[i,:], Q[:,j]), 2) for k in xrange(K): e = e + (beta/2) * (pow(P[i,k], 2) + pow(Q[k,j], 2)) if e < 0.001: break return P, Q.T How it works… We discussed the theory and the mathematics of what we are doing in the previous recipe, Building a non-negative matrix factorization model, so let's talk about the code. The initialize function creates two matrices, P and Q, that have a size related to the reviews matrix and the number of features, namely N x K and M x K, where N is the number of users and M is the number of movies. Their values are initialized to random numbers that are between 0.0 and 1.0. The factor function computes P and Q using gradient descent such that the dot product of P and Q is within a mean squared error of less than 0.001 or 5000 steps that have gone by, whichever comes first. Especially note that only values that are greater than 0 are computed. These are the values that we're trying to predict; therefore, we do not want to attempt to match them in our code (otherwise, the model will be trained on zero ratings)! This is also the reason that you can't use NumPy's built-in Singular Value Decomposition (SVD) function, which is np.linalg.svd or np.linalg.solve. There's more… Let's use these factorization functions to build our model and to save the model to disk once it has been built—this way, we can load the model at our convenience using the dump and load methods in the class. Add the following method to the Recommender class: def build(self, output=None): """ Trains the model by employing matrix factorization on training data set, (sparse reviews matrix). The model is the dot product of the P and Q decomposed matrices from the factorization. """ options = { 'K': self.features, 'steps': self.steps, 'alpha': self.alpha, 'beta': self.beta, } self.build_start = time.time() self.P, self.Q = factor(self.reviews, **options) self.model = np.dot(self.P, self.Q.T) self.build_finish = time.time() if output: self.dump(output) This helper function will allow us to quickly build our model. Note that we're also saving P and Q—the parameters of our latent features. This isn't necessary, as our predictive model is the dot product of the two factored matrices. Deciding whether or not to save this information in your model is a trade-off between re-training time (you can potentially start from the current P and Q parameters although you must beware of the overfit) and disk space, as pickle will be larger on the disk with these matrices saved. To build this model and dump the data to the disk, run the following code: model = Recommender(relative_path('../data/ml-100k/u.data')) model.build('reccod.pickle') Warning! This will take a long time to build! 
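The triple nested Python loops in factor() are the main reason the build takes so long. If training time becomes a problem, the same error and update rules can be expressed with NumPy broadcasting as a batch-gradient variant. The sketch below is an illustration rather than the book's implementation, and because every step now aggregates the error over all observed ratings at once, the learning rate would likely need retuning:

import numpy as np

def factor_vectorized(R, K=2, steps=5000, alpha=0.0002, beta=0.02):
    """
    Batch-gradient sketch (not the original recipe): same squared error
    and regularized update rules as factor(), computed with NumPy
    broadcasting instead of per-element Python loops. Only observed
    (nonzero) ratings contribute to the error.
    """
    N, M = R.shape
    P = np.random.rand(N, K)
    Q = np.random.rand(M, K)
    mask = (R > 0).astype(float)           # 1 where a rating exists
    for step in range(steps):
        E = mask * (R - P.dot(Q.T))        # error on observed cells only
        P += alpha * (2 * E.dot(Q) - beta * P)
        Q += alpha * (2 * E.T.dot(P) - beta * Q)
        sse = (E ** 2).sum() + (beta / 2) * ((P ** 2).sum() + (Q ** 2).sum())
        if sse < 0.001:
            break
    return P, Q

Vectorized updates of this kind usually run far faster than the pure-Python loops, but the results should be verified against the original factor() implementation before relying on them. For reference, the figures that follow are for the original, loop-based build.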
On a 2013 MacBook Pro with a 2.8 GHz processor, this process took roughly 9 hours 15 minutes and required 23.1 MB of memory; this is not insignificant for most of the Python scripts you might be used to writing! It is not a bad idea to continue through the rest of the recipe before building your model. It is also probably not a bad idea to test your code on a smaller test set of 100 records before moving on to the entire process! Additionally, if you don't have the time to train the model, you can find the pickle module of our model in the errata of this book. Testing the SVD-based model This recipe brings this article on recommendation engines to a close. We use our new nonnegative matrix factorization-based model and take a look at some of the predicted reviews. How to do it… The final step in leveraging our model is to access the predicted reviews for a movie based on our model: def predict_ranking(self, user, movie): uidx = self.users.index(user) midx = self.movies.index(movie) if self.reviews[uidx, midx] > 0: return None return self.model[uidx, midx] How it works… Computing the ranking is relatively easy; we simply need to look up the index of the user and the index of the movie and look up the predicted rating in our model. This is why it is so essential to save an ordered list of the users and movies in our pickle module; this way, if the data changes (we add users or movies) but the change isn't reflected in our model, an exception is raised. Because models are historical predictions and not sensitive to changes in time, we need to ensure that we continually retrain our model with new data. This method also returns None if we know the ranking of the user (for example, it's not a prediction); we'll leverage this in the next step. There's more… To predict the highest-ranked movies, we can leverage the previous function to order the highest predicted rankings for our user: import heapq from operator import itemgetter def top_rated(self, user, n=12): movies = [(mid, self.predict_ranking(user, mid)) for mid in self.movies] return heapq.nlargest(n, movies, key=itemgetter(1)) We can now print out the top-predicted movies that have not been rated by the user: >>> rec = Recommender.load('reccod.pickle') >>> for item in rec.top_rated(234): ... print "%i: %0.3f" % item 814: 4.437 1642: 4.362 1491: 4.361 1599: 4.343 1536: 4.324 1500: 4.323 1449: 4.281 1650: 4.147 1645: 4.135 1467: 4.133 1636: 4.133 1651: 4.132 It's then simply a matter of using the movie ID to look up the movie in our movies database. Summary To learn more about Data Science, the following books published by Packt Publishing (https://www.packtpub.com/) are recommended: Principles of Data Science (https://www.packtpub.com/big-data-and-business-intelligence/principles-data-science) Python Data Science Cookbook (https://www.packtpub.com/big-data-and-business-intelligence/python-data-science-cookbook) R for Data Science (https://www.packtpub.com/big-data-and-business-intelligence/r-data-science) Resources for Article: Further resources on this subject: Big Data[article] Big Data Analysis (R and Hadoop)[article] Visualization of Big Data[article]

article-image-changing-views
Packt
15 Feb 2016
25 min read
Save for later

Changing Views

In this article by Christopher Pitt, author of the book React Components, has explained how to change sections without reloading the page. We'll use this knowledge to create the public pages of the website our CMS is meant to control. (For more resources related to this topic, see here.) Location, location, and location Before we can learn about alternatives to reloading pages, let's take a look at how the browser manages reloads. You've probably encountered the window object. It's a global catch-all object for browser-based functionality and state. It's also the default this scope in any HTML page: We've even accessed window before. When we rendered to document.body or used document.querySelector, the window object was assumed. It's the same as if we were to call window.document.querySelector. Most of the time document is the only property we need. That doesn't mean it's the only property useful to us. Try the following, in the console: console.log(window.location); You should see something similar to: Location {     hash: ""     host: "127.0.0.1:3000"     hostname: "127.0.0.1"     href: "http://127.0.0.1:3000/examples/login.html"     origin: "http://127.0.0.1:3000"     pathname: "/examples/login.html"     port: "3000"     ... } If we were trying to work out which components to show based on the browser URL, this would be an excellent place to start. Not only can we read from this object, but we can also write to it: <script>     window.location.href = "http://material-ui.com"; </script> Putting this in an HTML page or entering that line of JavaScript in the console will redirect the browser to material-ui.com. It's the same if you click on the link. And if it's to a different page (than the one the browser is pointing at), then it will cause a full page reload. A bit of history So how does this help us? We're trying to avoid full page reloads, after all. Let's experiment with this object. Let's see what happens when we add something like #page-admin to the URL: Adding #page-admin to the URL leads to the window.location.hash property being populated with the same page. What's more, changing the hash value won't reload the page! It's the same as if we clicked on a link that had that hash in the href attribute. We can modify it without causing full page reloads, and each modification will store a new entry in the browser history. Using this trick, we can step through a number of different "states" without reloading the page, and be able to backtrack each with the browser's back button. Using browser history Let's put this trick to use in our CMS. 
First, let's add a couple functions to our Nav component: export default (props) => {     // ...define class names       var redirect = (event, section) => {         window.location.hash = `#${section}`;         event.preventDefault();     }       return <div className={drawerClassNames}>         <header className="demo-drawer-header">             <img src="images/user.jpg"                  className="demo-avatar" />         </header>         <nav className={navClassNames}>             <a className="mdl-navigation__link"                href="/examples/login.html"                onClick={(e) => redirect(e, "login")}>                 <i className={buttonIconClassNames}                    role="presentation">                     lock                 </i>                 Login             </a>             <a className="mdl-navigation__link"                href="/examples/page-admin.html"                onClick={(e) => redirect(e, "page-admin")}>                 <i className={buttonIconClassNames}                    role="presentation">                     pages                 </i>                 Pages             </a>         </nav>     </div>; }; We add an onClick attribute to our navigation links. We've created a special function that will change window.location.hash and prevent the default full page reload behavior the links would otherwise have caused. This is a neat use of arrow functions, but we're ultimately creating three new functions in each render call. Remember that this can be expensive, so it's best to move the function creation out of render. We'll replace this shortly. It's also interesting to see template strings in action. Instead of "#" + section, we can use `#${section}` to interpolate the section name. It's not as useful in small strings, but becomes increasingly useful in large ones. Clicking on the navigation links will now change the URL hash. 
We can add to this behavior by rendering different components when the navigation links are clicked: import React from "react"; import ReactDOM from "react-dom"; import Component from "src/component"; import Login from "src/login"; import Backend from "src/backend"; import PageAdmin from "src/page-admin";   class Nav extends Component {     render() {         // ...define class names           return <div className={drawerClassNames}>             <header className="demo-drawer-header">                 <img src="images/user.jpg"                      className="demo-avatar" />             </header>             <nav className={navClassNames}>                 <a className="mdl-navigation__link"                    href="/examples/login.html"                    onClick={(e) => this.redirect(e, "login")}>                     <i className={buttonIconClassNames}                        role="presentation">                         lock                     </i>                     Login                 </a>                 <a className="mdl-navigation__link"                    href="/examples/page-admin.html"                    onClick={(e) => this.redirect(e, "page-admin")}>                     <i className={buttonIconClassNames}                        role="presentation">                         pages                     </i>                     Pages                 </a>             </nav>         </div>;     }       redirect(event, section) {         window.location.hash = `#${section}`;           var component = null;           switch (section) {             case "login":                 component = <Login />;                 break;             case "page-admin":                 var backend = new Backend();                 component = <PageAdmin backend={backend} />;                 break;         }           var layoutClassNames = [             "demo-layout",             "mdl-layout",             "mdl-js-layout",             "mdl-layout--fixed-drawer"         ].join(" ");           ReactDOM.render(             <div className={layoutClassNames}>                 <Nav />                 {component}             </div>,             document.querySelector(".react")         );           event.preventDefault();     } };   export default Nav; We've had to convert the Nav function to a Nav class. We want to create the redirect method outside of render (as that is more efficient) and also isolate the choice of which component to render. Using a class also gives us a way to name and reference Nav, so we can create a new instance to overwrite it from within the redirect method. It's not ideal packaging this kind of code within a component, so we'll clean that up in a bit. We can now switch between different sections without full page reloads. There is one problem still to solve. When we use the browser back button, the components don't change to reflect the component that should be shown for each hash. We can solve this in a couple of ways. 
The first approach we can try is checking the hash frequently: componentDidMount() {     var hash = window.location.hash;       setInterval(() => {         if (hash !== window.location.hash) {             hash = window.location.hash;             this.redirect(null, hash.slice(1), true);         }     }, 100); }   redirect(event, section, respondingToHashChange = false) {     if (!respondingToHashChange) {         window.location.hash = `#${section}`;     }       var component = null;       switch (section) {         case "login":             component = <Login />;             break;         case "page-admin":             var backend = new Backend();             component = <PageAdmin backend={backend} />;             break;     }       var layoutClassNames = [         "demo-layout",         "mdl-layout",         "mdl-js-layout",         "mdl-layout--fixed-drawer"     ].join(" ");       ReactDOM.render(         <div className={layoutClassNames}>             <Nav />             {component}         </div>,         document.querySelector(".react")     );       if (event) {         event.preventDefault();     } } Our redirect method has an extra parameter, to apply the new hash whenever we're not responding to a hash change. We've also wrapped the call to event.preventDefault in case we don't have a click event to work with. Other than those changes, the redirect method is the same. We've also added a componentDidMount method, in which we have a call to setInterval. We store the initial window.location.hash and check 10 times a second to see if it has change. The hash value is #login or #page-admin, so we slice the first character off and pass the rest to the redirect method. Try clicking on the different navigation links, and then use the browser back button. The second option is to use the newish pushState and popState methods on the window.history object. They're not very well supported yet, so you need to be careful to handle older browsers or sure you don't need to handle them. You can learn more about pushState and popState at https://developer.mozilla.org/en-US/docs/Web/API/History_API. Using a router Our hash code is functional but invasive. We shouldn't be calling the render method from inside a component (at least not one we own). So instead, we're going to use a popular router to manage this stuff for us. 
Download it with the following: $ npm install react-router --save Then we need to join login.html and page-admin.html back into the same file: <!DOCTYPE html> <html>     <head>         <script src="/node_modules/babel-core/browser.js"></script>         <script src="/node_modules/systemjs/dist/system.js"></script>         <script src="https://storage.googleapis.com/code.getmdl.io/1.0.6/material.min.js"></script>         <link rel="stylesheet" href="https://storage.googleapis.com/code.getmdl.io/1.0.6/material.indigo-pink.min.css" />         <link rel="stylesheet" href="https://fonts.googleapis.com/icon?family=Material+Icons" />         <link rel="stylesheet" href="admin.css" />     </head>     <body class="         mdl-demo         mdl-color--grey-100         mdl-color-text--grey-700         mdl-base">         <div class="react"></div>         <script>             System.config({                 "transpiler": "babel",                 "map": {                     "react": "/examples/react/react",                     "react-dom": "/examples/react/react-dom",                     "router": "/node_modules/react-router/umd/ReactRouter"                 },                 "baseURL": "../",                 "defaultJSExtensions": true             });               System.import("examples/admin");         </script>     </body> </html> Notice how we've added the ReactRouter file to the import map? We'll use that in admin.js. First, let's define our layout component: var App = function(props) {     var layoutClassNames = [         "demo-layout",         "mdl-layout",         "mdl-js-layout",         "mdl-layout--fixed-drawer"     ].join(" ");       return (         <div className={layoutClassNames}>             <Nav />             {props.children}         </div>     ); }; This creates the page layout we've been using and allows a dynamic content component. Every React component has a this.props.children property (or props.children in the case of a stateless component), which is an array of nested components. For example, consider this component: <App>     <Login /> </App> Inside the App component, this.props.children will be an array with a single item—an instance of the Login. Next, we'll define handler components for the two sections we want to route: var LoginHandler = function() {     return <Login />; }; var PageAdminHandler = function() {     var backend = new Backend();     return <PageAdmin backend={backend} />; }; We don't really need to wrap Login in LoginHandler but I've chosen to do it to be consistent with PageAdminHandler. PageAdmin expects an instance of Backend, so we have to wrap it as we see in this example. Now we can define routes for our CMS: ReactDOM.render(     <Router history={browserHistory}>         <Route path="/" component={App}>             <IndexRoute component={LoginHandler} />             <Route path="login" component={LoginHandler} />             <Route path="page-admin" component={PageAdminHandler} />         </Route>     </Router>,     document.querySelector(".react") ); There's a single root route, for the path /. It creates an instance of App, so we always get the same layout. Then we nest a login route and a page-admin route. These create instances of their respective components. We also define an IndexRoute so that the login page will be displayed as a landing page. 
We need to remove our custom history code from Nav: import React from "react"; import ReactDOM from "react-dom"; import { Link } from "router";   export default (props) => {     // ...define class names       return <div className={drawerClassNames}>         <header className="demo-drawer-header">             <img src="images/user.jpg"                  className="demo-avatar" />         </header>         <nav className={navClassNames}>             <Link className="mdl-navigation__link" to="login">                 <i className={buttonIconClassNames}                    role="presentation">                     lock                 </i>                 Login             </Link>             <Link className="mdl-navigation__link" to="page-admin">                 <i className={buttonIconClassNames}                    role="presentation">                     pages                 </i>                 Pages             </Link>         </nav>     </div>; }; And since we no longer need a separate redirect method, we can convert the class back into a statement component (function). Notice we've swapped anchor components for a new Link component. This interacts with the router to show the correct section when we click on the navigation links. We can also change the route paths without needing to update this component (unless we also change the route names). Creating public pages Now that we can easily switch between CMS sections, we can use the same trick to show the public pages of our website. Let's create a new HTML page just for these: <!DOCTYPE html> <html>     <head>         <script src="/node_modules/babel-core/browser.js"></script>         <script src="/node_modules/systemjs/dist/system.js"></script>     </head>     <body>         <div class="react"></div>         <script>             System.config({                 "transpiler": "babel",                 "map": {                     "react": "/examples/react/react",                     "react-dom": "/examples/react/react-dom",                     "router": "/node_modules/react-router/umd/ReactRouter"                 },                 "baseURL": "../",                 "defaultJSExtensions": true             });               System.import("examples/index");         </script>     </body> </html> This is a reduced form of admin.html without the material design resources. I think we can ignore the appearance of these pages for the moment, while we focus on the navigation. The public pages are almost 100%, so we can use stateless components for them. Let's begin with the layout component: var App = function(props) {     return (         <div className="layout">             <Nav pages={props.route.backend.all()} />             {props.children}         </div>     ); }; This is similar to the App admin component, but it also has a reference to a Backend. 
We define that when we render the components: var backend = new Backend(); ReactDOM.render(     <Router history={browserHistory}>         <Route path="/" component={App} backend={backend}>             <IndexRoute component={StaticPage} backend={backend} />             <Route path="pages/:page" component={StaticPage} backend={backend} />         </Route>     </Router>,     document.querySelector(".react") ); For this to work, we also need to define a StaticPage: var StaticPage = function(props) {     var id = props.params.page || 1;     var backend = props.route.backend;       var pages = backend.all().filter(         (page) => {             return page.id == id;         }     );       if (pages.length < 1) {         return <div>not found</div>;     }       return (         <div className="page">             <h1>{pages[0].title}</h1>             {pages[0].content}         </div>     ); }; This component is more interesting. We access the params property, which is a map of all the URL path parameters defined for this route. We have :page in the path (pages/:page), so when we go to pages/1, the params object is {"page":1}. We also pass a Backend to Page, so we can fetch all pages and filter them by page.id. If no page.id is provided, we default to 1. After filtering, we check to see if there are any pages. If not, we return a simple not found message. Otherwise, we render the content of the first page in the array (since we expect the array to have a length of at least 1). We now have a page for the public pages of the website: Summary In this article, we learned about how the browser stores URL history and how we can manipulate it to load different sections without full page reloads. Resources for Article:   Further resources on this subject: Introduction to Akka [article] An Introduction to ReactJs [article] ECMAScript 6 Standard [article]
article-image-searching-your-data
Packt
12 Feb 2016
22 min read
Save for later

Searching Your Data

In this article by Rafał Kuć and Marek Rogozinski the authors of this book Elasticsearch Server Third Edition, we dived into Elasticsearch indexing. We learned a lot when it comes to data handling. We saw how to tune Elasticsearch schema-less mechanism and we now know how to create our own mappings. We also saw the core types of Elasticsearch and we used analyzers – both the one that comes out of the box with Elasticsearch and the one we define ourselves. We used bulk indexing, and we added additional internal information to our indices. Finally, we learned what segment merging is, how we can fine tune it, and how to use routing in Elasticsearch and what it gives us. This article is fully dedicated to querying. By the end of this article, you will have learned the following topics: How to query Elasticsearch Using the script process Understanding the querying process (For more resources related to this topic, see here.) Querying Elasticsearch So far, when we searched our data, we used the REST API and a simple query or the GET request. Similarly, when we were changing the index, we also used the REST API and sent the JSON-structured data to Elasticsearch. Regardless of the type of operation we wanted to perform, whether it was a mapping change or document indexation, we used JSON structured request body to inform Elasticsearch about the operation details. A similar situation happens when we want to send more than a simple query to Elasticsearch we structure it using the JSON objects and send it to Elasticsearch in the request body. This is called the query DSL. In a broader view, Elasticsearch supports two kinds of queries: basic ones and compound ones. Basic queries, such as the term query, are used for querying the actual data. The second type of query is the compound query, such as the bool query, which can combine multiple queries. However, this is not the whole picture. In addition to these two types of queries, certain queries can have filters that are used to narrow down your results with certain criteria. Filter queries don't affect scoring and are usually very efficient and easily cached. To make it even more complicated, queries can contain other queries (don't worry; we will try to explain all this!). Furthermore, some queries can contain filters and others can contain both queries and filters. Although this is not everything, we will stick with this working explanation for now. The example data If not stated otherwise, the following mappings will be used for the rest of the article: { "book" : { "properties" : { "author" : { "type" : "string" }, "characters" : { "type" : "string" }, "copies" : { "type" : "long", "ignore_malformed" : false }, "otitle" : { "type" : "string" }, "tags" : { "type" : "string", "index" : "not_analyzed" }, "title" : { "type" : "string" }, "year" : { "type" : "long", "ignore_malformed" : false, "index" : "analyzed" }, "available" : { "type" : "boolean" } } } } The preceding mappings represent a simple library and were used to create the library index. One thing to remember is that Elasticsearch will analyze the string based fields if we don't configure it differently. 
The preceding mappings were stored in the mapping.json file and in order to create the mentioned library index we can use the following commands: curl -XPOST 'localhost:9200/library' curl -XPUT 'localhost:9200/library/book/_mapping' -d @mapping.json We also used the following sample data as the example ones for this article: { "index": {"_index": "library", "_type": "book", "_id": "1"}} { "title": "All Quiet on the Western Front","otitle": "Im Westen nichts Neues","author": "Erich Maria Remarque","year": 1929,"characters": ["Paul Bäumer", "Albert Kropp", "Haie Westhus", "Fredrich Müller", "Stanislaus Katczinsky", "Tjaden"],"tags": ["novel"],"copies": 1, "available": true, "section" : 3} { "index": {"_index": "library", "_type": "book", "_id": "2"}} { "title": "Catch-22","author": "Joseph Heller","year": 1961,"characters": ["John Yossarian", "Captain Aardvark", "Chaplain Tappman", "Colonel Cathcart", "Doctor Daneeka"],"tags": ["novel"],"copies": 6, "available" : false, "section" : 1} { "index": {"_index": "library", "_type": "book", "_id": "3"}} { "title": "The Complete Sherlock Holmes","author": "Arthur Conan Doyle","year": 1936,"characters": ["Sherlock Holmes","Dr. Watson", "G. Lestrade"],"tags": [],"copies": 0, "available" : false, "section" : 12} { "index": {"_index": "library", "_type": "book", "_id": "4"}} { "title": "Crime and Punishment","otitle": "Преступлéние и наказáние","author": "Fyodor Dostoevsky","year": 1886,"characters": ["Raskolnikov", "Sofia Semyonovna Marmeladova"],"tags": [],"copies": 0, "available" : true} We stored our sample data in the documents.json file, and we use the following command to index it: curl -s -XPOST 'localhost:9200/_bulk' --data-binary @documents.json A simple query The simplest way to query Elasticsearch is to use the URI request query. For example, to search for the word crime in the title field, you could send a query using the following command: curl -XGET 'localhost:9200/library/book/_search?q=title:crime&pretty' This is a very simple, but limited, way of submitting queries to Elasticsearch. If we look from the point of view of the Elasticsearch query DSL, the preceding query is the query_string query. It searches for the documents that have the term crime in the title field and can be rewritten as follows: { "query" : { "query_string" : { "query" : "title:crime" } } } Sending a query using the query DSL is a bit different, but still not rocket science. We send the GET (POST is also accepted in case your tool or library doesn't allow sending request body in HTTP GET requests) HTTP request to the _search REST endpoint as earlier and include the query in the request body. Let's take a look at the following command: curl -XGET 'localhost:9200/library/book/_search?pretty' -d '{ "query" : { "query_string" : { "query" : "title:crime" } } }' As you can see, we used the request body (the -d switch) to send the whole JSON-structured query to Elasticsearch. The pretty request parameter tells Elasticsearch to structure the response in such a way that we humans can read it more easily. 
In response to the preceding command, we get the following output: { "took" : 4, "timed_out" : false, "_shards" : { "total" : 5, "successful" : 5, "failed" : 0 }, "hits" : { "total" : 1, "max_score" : 0.5, "hits" : [ { "_index" : "library", "_type" : "book", "_id" : "4", "_score" : 0.5, "_source" : { "title" : "Crime and Punishment", "otitle" : "Преступлéние и наказáние", "author" : "Fyodor Dostoevsky", "year" : 1886, "characters" : [ "Raskolnikov", "Sofia Semyonovna Marmeladova" ], "tags" : [ ], "copies" : 0, "available" : true } } ] } } Nice! We got our first search results with the query DSL. Paging and result size Elasticsearch allows us to control how many results we want to get (at most) and from which result we want to start. The following are the two additional properties that can be set in the request body: from: This property specifies the document that we want to have our results from. Its default value is 0, which means that we want to get our results from the first document. size: This property specifies the maximum number of documents we want as the result of a single query (which defaults to 10). For example, if weare only interested in aggregations results and don't care about the documents returned by the query, we can set this parameter to 0. If we want our query to get documents starting from the tenth item on the list and get 20 of items from there on, we send the following query: curl -XGET 'localhost:9200/library/book/_search?pretty' -d '{ "from" : 9, "size" : 20, "query" : { "query_string" : { "query" : "title:crime" } } }' Returning the version value In addition to all the information returned, Elasticsearch can return the version of the document. To do this, we need to add the version property with the value of true to the top level of our JSON object. So, the final query, which requests for version information, will look as follows: curl -XGET 'localhost:9200/library/book/_search?pretty' -d '{ "version" : true, "query" : { "query_string" : { "query" : "title:crime" } } }' After running the preceding query, we get the following results: { "took" : 4, "timed_out" : false, "_shards" : { "total" : 5, "successful" : 5, "failed" : 0 }, "hits" : { "total" : 1, "max_score" : 0.5, "hits" : [ { "_index" : "library", "_type" : "book", "_id" : "4", "_version" : 1, "_score" : 0.5, "_source" : { "title" : "Crime and Punishment", "otitle" : "Преступлéние и наказáние", "author" : "Fyodor Dostoevsky", "year" : 1886, "characters" : [ "Raskolnikov", "Sofia Semyonovna Marmeladova" ], "tags" : [ ], "copies" : 0, "available" : true } } ] } } As you can see, the _version section is present for the single hit we got. Limiting the score For nonstandard use cases, Elasticsearch provides a feature that lets us filter the results on the basis of a minimum score value that the document must have to be considered a match. In order to use this feature, we must provide the min_score value at the top level of our JSON object with the value of the minimum score. 
For example, if we want our query to only return documents with a score higher than 0.75, we send the following query: curl -XGET 'localhost:9200/library/book/_search?pretty' -d '{ "min_score" : 0.75, "query" : { "query_string" : { "query" : "title:crime" } } }' We get the following response after running the preceding query: { "took" : 3, "timed_out" : false, "_shards" : { "total" : 5, "successful" : 5, "failed" : 0 }, "hits" : { "total" : 0, "max_score" : null, "hits" : [ ] } } If you look at the previous examples, the score of our document was 0.5, which is lower than 0.75, and thus we didn't get any documents in response. Limiting the score usually doesn't make much sense because comparing scores between the queries is quite hard. However, maybe in your case, this functionality will be needed. Choosing the fields that we want to return With the use of the fields array in the request body, Elasticsearch allows us to define which fields to include in the response. Remember that you can only return these fields if they are marked as stored in the mappings used to create the index, or if the _source field was used (Elasticsearch uses the _source field to provide the stored values and the _source field is turned on by default). So, for example, to return only the title and the year fields in the results (for each document), send the following query to Elasticsearch: curl -XGET 'localhost:9200/library/book/_search?pretty' -d '{ "fields" : [ "title", "year" ], "query" : { "query_string" : { "query" : "title:crime" } } }' In response, we get the following output: { "took" : 5, "timed_out" : false, "_shards" : { "total" : 5, "successful" : 5, "failed" : 0 }, "hits" : { "total" : 1, "max_score" : 0.5, "hits" : [ { "_index" : "library", "_type" : "book", "_id" : "4", "_score" : 0.5, "fields" : { "title" : [ "Crime and Punishment" ], "year" : [ 1886 ] } } ] } } As you can see, everything worked as we wanted to. There are four things we will like to share with you, which are as follows: If we don't define the fields array, it will use the default value and return the _source field if available. If we use the _source field and request a field that is not stored, then that field will be extracted from the _source field (however, this requires additional processing). If we want to return all the stored fields, we just pass an asterisk (*) as the field name. From a performance point of view, it's better to return the _source field instead of multiple stored fields. This is because getting multiple stored fields may be slower compared to retrieving a single _source field. Source filtering In addition to choosing which fields are returned, Elasticsearch allows us to use the so-called source filtering. This functionality allows us to control which fields are returned from the _source field. Elasticsearch exposes several ways to do this. The simplest source filtering allows us to decide whether a document should be returned or not. 
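All of the request bodies in this article can also be sent from application code rather than curl. As a minimal sketch, the helper below uses the third-party requests package (an assumption on our part; the article's examples only rely on curl) to send a query DSL body to a local node and return the hits section:

import json
import requests

def search_library(body, host='http://localhost:9200',
                   index='library', doc_type='book'):
    """
    Hypothetical helper: send a query DSL body to Elasticsearch and
    return the parsed hits section, mirroring the curl examples above.
    """
    url = '%s/%s/%s/_search?pretty' % (host, index, doc_type)
    response = requests.get(url, data=json.dumps(body))
    response.raise_for_status()
    return response.json()['hits']

# Example call, equivalent to the paging query shown earlier:
hits = search_library({
    'from': 9,
    'size': 20,
    'query': {'query_string': {'query': 'title:crime'}},
})

Returning to the min_score property introduced above, the next query shows it in action.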
Consider the following query: curl -XGET 'localhost:9200/library/book/_search?pretty' -d '{ "_source" : false, "query" : { "query_string" : { "query" : "title:crime" } } }' The result retuned by Elasticsearch should be similar to the following one: { "took" : 12, "timed_out" : false, "_shards" : { "total" : 5, "successful" : 5, "failed" : 0 }, "hits" : { "total" : 1, "max_score" : 0.5, "hits" : [ { "_index" : "library", "_type" : "book", "_id" : "4", "_score" : 0.5 } ] } } Note that the response is limited to base information about a document and the _source field was not included. If you use Elasticsearch as a second source of data and content of the document is served from SQL database or cache, the document identifier is all you need. The second way is similar to as described in the preceding fields, although we define which fields should be returned in the document source itself. Let's see that using the following example query: curl -XGET 'localhost:9200/library/book/_search?pretty' -d '{ "_source" : ["title", "otitle"], "query" : { "query_string" : { "query" : "title:crime" } } }' We wanted to get the title and the otitle document fields in the returned _source field. Elasticsearch extracted those values from the original _source value and included the _source field only with the requested fields. The whole response returned by Elasticsearch looked as follows: { "took" : 2, "timed_out" : false, "_shards" : { "total" : 5, "successful" : 5, "failed" : 0 }, "hits" : { "total" : 1, "max_score" : 0.5, "hits" : [ { "_index" : "library", "_type" : "book", "_id" : "4", "_score" : 0.5, "_source" : { "otitle" : "Преступлéние и наказáние", "title" : "Crime and Punishment" } } ] } } We can also use asterisk to select which fields should be returned in the _source field; for example, title* will return value for the title field and for title10 (if we have such field in our data). If we have more extended document with nested part, we can use notation with dot; for example, title.* to select all the fields nested under the title object. Finally, we can also specify explicitly which fields we want to include and which to exclude from the _source field. We can include fields using the include property and we can exclude fields using the exclude property (both of them are arrays of values). For example, if we want the returned _source field to include all the fields starting with the letter t but not the title field, we will run the following query: curl -XGET 'localhost:9200/library/book/_search?pretty' -d '{ "_source" : { "include" : [ "t*"], "exclude" : ["title"] }, "query" : { "query_string" : { "query" : "title:crime" } } }' Using the script fields Elasticsearch allows us to use script-evaluated values that will be returned with the result documents. To use the script fields functionality, we add the script_fields section to our JSON query object and an object with a name of our choice for each scripted value that we want to return. For example, to return a value named correctYear, which is calculated as the year field minus 1800, we run the following query: curl -XGET 'localhost:9200/library/book/_search?pretty' -d '{ "script_fields" : { "correctYear" : { "script" : "doc["year"].value - 1800" } }, "query" : { "query_string" : { "query" : "title:crime" } } }' By default, Elasticsearch doesn't allow us to use dynamic scripting. 
If you tried the preceding query, you probably got an error with information stating that the scripts of type [inline] with operation [search] and language [groovy] are disabled. To make this example work, you should add the script.inline: on property to the elasticsearch.yml file. However, this exposes a security threat. Using the doc notation, like we did in the preceding example, allows us to catch the results returned and speed up script execution at the cost of higher memory consumption. We also get limited to single-valued and single term fields. If we care about memory usage, or if we are using more complicated field values, we can always use the _source field. The same query using the _source field looks as follows: curl -XGET 'localhost:9200/library/book/_search?pretty' -d '{ "script_fields" : { "correctYear" : { "script" : "_source.year - 1800" } }, "query" : { "query_string" : { "query" : "title:crime" } } }' The following response is returned by Elasticsearch with dynamic scripting enabled: { "took" : 76, "timed_out" : false, "_shards" : { "total" : 5, "successful" : 5, "failed" : 0 }, "hits" : { "total" : 1, "max_score" : 0.5, "hits" : [ { "_index" : "library", "_type" : "book", "_id" : "4", "_score" : 0.5, "fields" : { "correctYear" : [ 86 ] } } ] } } As you can see, we got the calculated correctYear field in response. Passing parameters to the script fields Let's take a look at one more feature of the script fields - passing of additional parameters. Instead of having the value 1800 in the equation, we can usea variable name and pass its value in the params section. If we do this, our query will look as follows: curl -XGET 'localhost:9200/library/book/_search?pretty' -d '{ "script_fields" : { "correctYear" : { "script" : "_source.year - paramYear", "params" : { "paramYear" : 1800 } } }, "query" : { "query_string" : { "query" : "title:crime" } } }' As you can see, we added the paramYear variable as part of the scripted equation and provided its value in the params section. This allows Elasticsearch to execute the same script with different parameter values in a slightly more efficient way. Understanding the querying process After reading the previous section, we now know how querying works in Elasticsearch. You know that Elasticsearch, in most cases, needs to scatter the query across multiple nodes, get the results, merge them, fetch the relevant documents from one or more shards, and return the final results to the client requesting the documents. What we didn't talk about are two additional things that define how queries behave: search type and query execution preference. We will now concentrate on these functionalities of Elasticsearch. Query logic Elasticsearch is a distributed search engine and so all functionality provided must be distributed in its nature. It is exactly the same with querying. Because we would like to discuss some more advanced topics on how to control the query process, we first need to know how it works. Let's now get back to how querying works. By default, if we don't alter anything, the query process will consist of two phases: the scatter and the gather phase. The aggregator node (the one that receivesthe request) will run the scatter phase first. During that phase, the query is distributed to all the shards that our index is built of (of course if routing is not used). For example, if it is built of 5 shards and 1 replica then 5 physical shards will be queried (we don't need to query a shard and its replica as they contain the same data). 
Each of the queried shards will only return the document identifier and the score of the document. The node that sent the scatter query will wait for all the shards to complete their task, gather the results, and sort them appropriately (in this case, from top scoring to the lowest scoring ones). After that, a new request will be sent to build the search results. However, now only to those shards that held the documents to build the response. In most cases, Elasticsearch won't send the request to all the shards but to its subset. That's because we usually don't get the complete result of the query but only a portion of it. This phase is called the gather phase. After all the documents are gathered, the final response is built and returned as the query result. This is the basic and default Elasticsearch behavior but we can change it. Search type Elasticsearch allows us to choose how we want our query to be processed internally. We can do that by specifying the search type. There are different situations where different search type are appropriate: sometimes one can care only about the performance while sometimes query relevance is the most important factor. You should remember that each shard is a small Lucene index and in order to return more relevant results, some information, such as frequencies, needs to be transferred between the shards. To control how the queries are executed, we can pass the search_type request parameter and set it to one of the following values: query_then_fetch: In the first step, the query is executed to get the information needed to sort and rank the documents. This step is executed against all the shards. Then only the relevant shards are queried for the actual content of the documents. Different from query_and_fetch, the maximum number of results returned by this query type will be equal to the size parameter. This is the search type used by default if no search type is provided with the query, and this is the query type we described previously. dfs_query_then_fetch: Again, as with the previous dfs_query_and_fetch, dfs_query_then_fetch is similar to its counterpart query_then_fetch. However, it contains an additional phase comparing which calculates distributed term frequencies. There are also two deprecated search types: count and scan. The first one is deprecated starting from Elasticsearch 2.0 and the second one starting with Elasticsearch 2.1. The first search type used to provide benefits where only aggregations or the number of documents was relevant, but now it is enough to add size equal to 0 to your queries. The scan request was used for scrolling functionality. So if we would like to use the simplest search type, we would run the following command: curl -XGET 'localhost:9200/library/book/_search?pretty&search_type=query_then_fetch' -d '{ "query" : { "term" : { "title" : "crime" } } }' Search execution preference In addition to the possibility of controlling how the query is executed, we can also control on which shards to execute the query. By default, Elasticsearch uses shards and replicas, both the ones available on the node we've sent the request and on the other nodes in the cluster. The default behavior is mostly the proper method of shard preference of queries. But there may be times when we want to change the default behavior. For example, you may want the search to be only executed on the primary shards. 
Search execution preference

In addition to being able to control how the query is executed, we can also control which shards it is executed on. By default, Elasticsearch uses both primary shards and replicas, the ones available on the node we've sent the request to as well as the ones on the other nodes in the cluster. The default behavior is usually the right choice, but there may be times when we want to change it; for example, we may want the search to be executed only on the primary shards. To do that, we can set the preference request parameter to one of the following values:

_primary: The operation will be executed only on the primary shards, so the replicas won't be used. This can be useful when we need the latest information from the index but our data is not replicated right away.
_primary_first: The operation will be executed on the primary shards if they are available. If not, it will be executed on the other shards.
_replica: The operation will be executed only on the replica shards.
_replica_first: This operation is similar to _primary_first, but uses the replica shards. The operation will be executed on the replica shards if possible, and on the primary shards if the replicas are not available.
_local: The operation will be executed on the shards available on the node to which the request was sent; if such shards are not present, the request will be forwarded to the appropriate nodes.
_only_node:node_id: The operation will be executed on the node with the provided node identifier.
_only_nodes:nodes_spec: The operation will be executed on the nodes defined in nodes_spec. This can be an IP address, a name, a name or IP address using wildcards, and so on. For example, if nodes_spec is set to 192.168.1.*, the operation will be run on the nodes with IP addresses starting with 192.168.1.
_prefer_node:node_id: Elasticsearch will try to execute the operation on the node with the provided identifier. However, if the node is not available, the operation will be executed on the nodes that are available.
_shards:1,2: Elasticsearch will execute the operation on the shards with the given identifiers; in this case, on the shards with identifiers 1 and 2. The _shards parameter can be combined with other preferences, but the shard identifiers need to be provided first, for example, _shards:1,2;_local.
Custom value: Any custom string value may be passed. Requests with the same value provided will be executed on the same shards.

For example, if we would like to execute a query only on the local shards, we would run the following command:

curl -XGET 'localhost:9200/library/_search?pretty&preference=_local' -d '{
 "query" : {
  "term" : { "title" : "crime" }
 }
}'
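The custom value option is handy when repeated searches from the same user should hit the same set of shards, which keeps scoring and paging consistent between requests. In the following sketch, user_12345 is an arbitrary value, for example a session or user identifier:

curl -XGET 'localhost:9200/library/_search?pretty&preference=user_12345' -d '{
 "query" : {
  "term" : { "title" : "crime" }
 }
}'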
Search shards API

While discussing the search preference, we would also like to mention the search shards API exposed by Elasticsearch. This API allows us to check which shards a query will be executed on. In order to use it, run a request against the _search_shards REST endpoint. For example, to see how our query will be executed, we run the following command:

curl -XGET 'localhost:9200/library/_search_shards?pretty' -d '{
 "query" : {
  "match_all" : {}
 }
}'

The response to the preceding command will be as follows:

{
 "nodes" : {
  "my0DcA_MTImm4NE3cG3ZIg" : {
   "name" : "Cloud 9",
   "transport_address" : "127.0.0.1:9300",
   "attributes" : { }
  }
 },
 "shards" : [
  [ {
   "state" : "STARTED",
   "primary" : true,
   "node" : "my0DcA_MTImm4NE3cG3ZIg",
   "relocating_node" : null,
   "shard" : 0,
   "index" : "library",
   "version" : 4,
   "allocation_id" : { "id" : "9ayLDbL1RVSyJRYIJkuAxg" }
  } ],
  [ {
   "state" : "STARTED",
   "primary" : true,
   "node" : "my0DcA_MTImm4NE3cG3ZIg",
   "relocating_node" : null,
   "shard" : 1,
   "index" : "library",
   "version" : 4,
   "allocation_id" : { "id" : "wfpvtaLER-KVyOsuD46Yqg" }
  } ],
  [ {
   "state" : "STARTED",
   "primary" : true,
   "node" : "my0DcA_MTImm4NE3cG3ZIg",
   "relocating_node" : null,
   "shard" : 2,
   "index" : "library",
   "version" : 4,
   "allocation_id" : { "id" : "zrLPWhCOSTmjlb8TY5rYQA" }
  } ],
  [ {
   "state" : "STARTED",
   "primary" : true,
   "node" : "my0DcA_MTImm4NE3cG3ZIg",
   "relocating_node" : null,
   "shard" : 3,
   "index" : "library",
   "version" : 4,
   "allocation_id" : { "id" : "efnvY7YcSz6X8X8USacA7g" }
  } ],
  [ {
   "state" : "STARTED",
   "primary" : true,
   "node" : "my0DcA_MTImm4NE3cG3ZIg",
   "relocating_node" : null,
   "shard" : 4,
   "index" : "library",
   "version" : 4,
   "allocation_id" : { "id" : "XJHW2J63QUKdh3bK3T2nzA" }
  } ]
 ]
}

As you can see, the response returned by Elasticsearch contains the information about the shards that will be used during the query process. Of course, with the search shards API we can also use additional parameters that control the querying process: routing, preference, and local. We are already familiar with the first two. The local parameter is a Boolean (true or false) that tells Elasticsearch to use the cluster state information stored on the local node (local set to true) instead of the state from the master node (local set to false). This allows us to diagnose problems with cluster state synchronization.

Summary

This article has been all about querying Elasticsearch. We started by looking at how to query Elasticsearch and what Elasticsearch does when it needs to handle a query. We also learned about the basic and compound queries, so we are now able to use both simple queries as well as the ones that group multiple smaller queries together. Finally, we discussed how to choose the right query for a given use case.

Resources for Article:

Further resources on this subject:
Extending ElasticSearch with Scripting [article]
Integrating Elasticsearch with the Hadoop ecosystem [article]
Elasticsearch Administration [article]
The Factory Method Pattern

Packt
10 Feb 2016
10 min read
In this article by Anshul Verma and Jitendra Zaa, authors of the book Apex Design Patterns, we will discuss some problems that can occur mainly during the creation of class instances, and how we can write the code for the creation of objects in a simpler, easier to maintain, and more scalable way.

(For more resources related to this topic, see here.)

In this article, we will discuss the factory method creational design pattern. Often, we find that some classes have common features (behavior) and can be considered classes of the same family. For example, multiple payment classes represent a family of payment services. Credit card, debit card, and net banking are some examples of payment classes that have common methods, such as makePayment, authorizePayment, and so on. Using the factory method pattern, we can develop controller classes that use these payment services without knowing the actual payment type at design time.

The factory method pattern is a creational design pattern used to create objects of classes from the same family without knowing the exact class name at design time. Using the factory method pattern, classes can be instantiated from a common factory method. The advantage of using this pattern is that it delegates the creation of an object to another class and provides a good level of abstraction.

Let's learn this pattern using the following example:

The Universal Call Center company is new in business and provides free admin support to customers to resolve issues related to their products. A call center agent can provide some information about the product support; for example, the Service Level Agreement (SLA) or the total number of support tickets allowed to be opened per month. A developer came up with the following class:

public class AdminBasicSupport {
    /**
     * return SLA in hours
     */
    public Integer getSLA() {
        return 40;
    }

    /**
     * Total allowed support tickets allowed every month
     */
    public Integer allowedTickets() {
        // As this is basic support
        return 9999;
    }
}

Now, to get the SLA of AdminBasicSupport, we need to use the following code every time:

AdminBasicSupport support = new AdminBasicSupport();
System.debug('Support SLA is - ' + support.getSLA());

Output: Support SLA is - 40

The Universal Call Center company was doing very well, and in order to grow the business and increase the profit, they started premium support for customers who were willing to pay for cases and get quicker support. To set it apart from the basic support, they changed the SLA to 12 hours and allowed a maximum of 50 cases to be opened in one month. A developer had many choices to make this happen in the existing code. However, instead of changing the existing code, they created a new class that would handle only the premium support-related functionalities. This was a good decision because of the single responsibility principle.
public class AdminPremiumSupport {
    /**
     * return SLA in hours
     */
    public Integer getSLA() {
        return 12;
    }

    /**
     * Total allowed support tickets allowed every month is 50
     */
    public Integer allowedTickets() {
        return 50;
    }
}

Now, every time any information regarding the SLA or the allowed tickets per month is needed, the following Apex code can be used:

if(Account.supportType__c == 'AdminBasic') {
    AdminBasicSupport support = new AdminBasicSupport();
    System.debug('Support SLA is - ' + support.getSLA());
}else{
    AdminPremiumSupport support = new AdminPremiumSupport();
    System.debug('Support SLA is - ' + support.getSLA());
}

As we can see in the preceding example, instead of adding conditions to the existing class, the developer decided to go with a new class. Each class has its own responsibility and needs to be changed for only one reason. If any change is needed in the basic support, then only one class needs to be changed. As we know, this design principle is called the Single Responsibility Principle.

Business was doing exceptionally well in the call center, and they planned to start gold and platinum support as well. Developers started facing issues with the current approach. They already had two classes for the basic and premium support, and requests for two more classes were in the pipeline. There was also no guarantee that the set of support types would remain the same in the future. Every new support type needs a new class, and therefore the existing code needs to be updated to instantiate these classes. The following code will be needed to instantiate them:

if(Account.supportType__c == 'AdminBasic') {
    AdminBasicSupport support = new AdminBasicSupport();
    System.debug('Support SLA is - ' + support.getSLA());
}else if(Account.supportType__c == 'AdminPremier') {
    AdminPremiumSupport support = new AdminPremiumSupport();
    System.debug('Support SLA is - ' + support.getSLA());
}else if(Account.supportType__c == 'AdminGold') {
    AdminGoldSupport support = new AdminGoldSupport();
    System.debug('Support SLA is - ' + support.getSLA());
}else{
    AdminPlatinumSupport support = new AdminPlatinumSupport();
    System.debug('Support SLA is - ' + support.getSLA());
}

We are only considering the getSLA() method, but in a real application there can be other methods and scenarios as well. The preceding code snippet clearly depicts the code duplication and the maintenance nightmare. The following image shows the overall complexity of the example that we are discussing:

Although a separate class is used for each support type, the introduction of a new support class will lead to changes in all the existing code locations where these classes are used. The development team started brainstorming to make sure that the code would be easy to extend in the future with the least impact on the existing code. One of the developers suggested using an interface for all the support classes so that every class has the same methods and can be referred to through the interface. The following interface was finalized to reduce the code duplication:

public interface IAdminSupport {
    Integer getSLA();
    Integer allowedTickets();
}

Methods defined within an interface have no access modifiers and contain just their signatures.

Once the interface was created, it was time to update the existing classes. In our case, only one line needed to be changed in each class, and the remaining part of the code stayed the same, because both classes already had the getSLA() and allowedTickets() methods.
Let's take a look at the following line of code:

public class AdminBasicSupport{

This will be changed to the following code:

public class AdminBasicSupportImpl implements IAdminSupport{

Similarly, the following line of code:

public class AdminPremiumSupport{

This will be changed to the following code:

public class AdminPremiumSupportImpl implements IAdminSupport{

In the same way, the AdminGoldSupportImpl and AdminPlatinumSupportImpl classes are written.

A class diagram is a type of Unified Modeling Language (UML) diagram that describes classes, methods, attributes, and their relationships, among other objects in a system. You can read more about class diagrams at https://en.wikipedia.org/wiki/Class_diagram.

The following image shows a class diagram of the code written by the developers using an interface:

Now, the code to instantiate the different support type classes can be rewritten as follows:

IAdminSupport support = null;
if(Account.supportType__c == 'AdminBasic') {
    support = new AdminBasicSupportImpl();
}else if(Account.supportType__c == 'AdminPremier') {
    support = new AdminPremiumSupportImpl();
}else if(Account.supportType__c == 'AdminGold') {
    support = new AdminGoldSupportImpl();
}else{
    support = new AdminPlatinumSupportImpl();
}
System.debug('Support SLA is - ' + support.getSLA());

There is no switch case statement in Apex, and that's why multiple if and else statements are written. As per the product team, a new compiler that supports it may be released in 2016. You can vote for this idea at https://success.salesforce.com/ideaView?id=08730000000BrSIAA0.

As we can see, the preceding code does little more than create an instance of the required concrete class and then use the interface to access its methods. This concept is known as programming to an interface, and it is one of the most widely recommended OOP principles. As interfaces are a kind of contract, we already know which methods will be implemented by the concrete classes, and we can completely rely on the interface to call them, which hides their complex implementation and logic. This has many advantages, a few of which are loose coupling and dependency injection.

A concrete class is a complete class that can be used to instantiate objects. Any class that is not abstract or an interface can be considered a concrete class.
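To see the benefit in client code, consider a small helper that depends only on the interface. SupportSummary is a hypothetical class, not part of the book's example, but it sketches how code written against IAdminSupport stays unaware of the concrete support classes:

public class SupportSummary {
    // Works with any IAdminSupport implementation; the caller never needs
    // to know which concrete support class it received
    public static String describe(IAdminSupport support) {
        return 'SLA: ' + support.getSLA() + ' hours, tickets per month: ' + support.allowedTickets();
    }
}

It could be exercised, for example, in anonymous Apex:

IAdminSupport basic = new AdminBasicSupportImpl();
System.debug(SupportSummary.describe(basic));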
We still have one problem with the previous approach. The code to instantiate the concrete classes is still present at many locations and will still require changes if a new support type is added. If we can delegate the creation of the concrete classes to another class, then our code will be completely independent of the existing code and of new support types. This concept of delegating the decision and creation of similar types of classes is known as the factory method pattern.

The following class can be used to create the concrete classes and will act as a factory:

/**
 * This factory class is used to instantiate the concrete class
 * of the respective support type
 */
public class AdminSupportFactory {
    public static IAdminSupport getInstance(String supporttype) {
        IAdminSupport support = null;
        if(supporttype == 'AdminBasic') {
            support = new AdminBasicSupportImpl();
        }else if(supporttype == 'AdminPremier') {
            support = new AdminPremiumSupportImpl();
        }else if(supporttype == 'AdminGold') {
            support = new AdminGoldSupportImpl();
        }else if(supporttype == 'AdminPlatinum') {
            support = new AdminPlatinumSupportImpl();
        }
        return support;
    }
}

In the preceding code, we only need to call the getInstance(String) method, and this method will take the decision and return the actual implementation. As the return type is an interface, we already know the methods that are defined, and we can use them without actually knowing the implementation. This is a very good example of abstraction.

The final class diagram of the factory method pattern that we discussed will look like this:

The following code snippet can be used repeatedly by any client code to instantiate a class of any support type:

IAdminSupport support = AdminSupportFactory.getInstance('AdminBasic');
System.debug('Support SLA is - ' + support.getSLA());

Output: Support SLA is - 40

Reflection in Apex

The problem with the preceding design is that whenever a new support type needs to be added, we need to add a condition to AdminSupportFactory. We can store the mapping between a support type and its concrete class name in a custom setting. This way, whenever a new concrete class is added, we don't even need to change the factory class; only a new entry needs to be added to the custom setting. Consider a custom setting named Support_Type__c with a text field named Class_Name__c and the following records:

Name            Class name
AdminBasic      AdminBasicSupportImpl
AdminGolden     AdminGoldSupportImpl
AdminPlatinum   AdminPlatinumSupportImpl
AdminPremier    AdminPremiumSupportImpl

Using reflection, the AdminSupportFactory class can now be rewritten to instantiate the support types at runtime as follows:

/**
 * This factory class is used to instantiate the concrete class
 * of the respective support type
 */
public class AdminSupportFactory {
    public static IAdminSupport getInstance(String supporttype) {
        // Read the custom setting to get the actual class name on the basis of the support type
        Support_Type__c supportTypeInfo = Support_Type__c.getValues(supporttype);
        // From the custom setting, get the appropriate class name
        Type t = Type.forName(supportTypeInfo.Class_Name__c);
        IAdminSupport retVal = (IAdminSupport)t.newInstance();
        return retVal;
    }
}

In the preceding code, we are using the Type system class. This is a very powerful class used to instantiate a new class at runtime. It has the following two important methods:

forName: This returns the type that corresponds to the class name passed as a string
newInstance: This creates a new object of the specified type

Inspecting classes, methods, and variables at runtime without knowing the class names, or instantiating new objects and invoking methods at runtime, is known as reflection in computer science.

One more advantage of using the factory method, a custom setting, and reflection together is that if, in the future, one of the support types needs to be permanently replaced by another service type, we simply need to change the appropriate mapping in the custom setting, without any changes to the code.
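As a quick, standalone sketch of the two Type methods, the following can be run as anonymous Apex once the classes above exist; the class name would normally come from the custom setting record rather than being hardcoded:

// Resolve a class by name at runtime and instantiate it dynamically
Type t = Type.forName('AdminGoldSupportImpl');
IAdminSupport support = (IAdminSupport) t.newInstance();
System.debug('Support SLA is - ' + support.getSLA());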
Summary

In this article, we discussed how to deal with various situations that can occur while instantiating objects, using the factory method design pattern.

Resources for Article:

Further resources on this subject:
Getting Your APEX Components Logic Right [article]
AJAX Implementation in APEX [article]
Custom Coding with Apex [article]