Data | Tech News, Tutorials & Expert Insights


03 Mar 2015

13 min read

Performance Considerations


Packt

01 Jun 2016

8 min read

Classifier Construction


In this article by Pratik Joshi, author of the book Python Machine Learning Cookbook, we will build a simple classifier using supervised learning, and then go on to build a logistic regression classifier.

Building a simple classifier

In the field of machine learning, classification refers to the process of using the characteristics of data to separate it into a certain number of classes. A supervised learning classifier builds a model using labeled training data, and then uses this model to classify unknown data. Let's take a look at how to build a simple classifier.

How to do it…

Before we begin, make sure that you have imported the numpy and matplotlib.pyplot packages:

    import numpy as np
    import matplotlib.pyplot as plt

After this, let's create some sample data:

    X = np.array([[3,1], [2,5], [1,8], [6,4], [5,2], [3,5], [4,7], [4,-1]])

Let's assign some labels to these points:

    y = [0, 1, 1, 0, 0, 1, 1, 0]

As we have only two classes, the list y contains 0s and 1s. In general, if you have N classes, then the values in y will range from 0 to N-1. Let's separate the data into classes based on the labels:

    class_0 = np.array([X[i] for i in range(len(X)) if y[i]==0])
    class_1 = np.array([X[i] for i in range(len(X)) if y[i]==1])

In the preceding two lines, we just use the mapping between X and y to create two arrays, one per class. To get an idea about our data, let's plot it, as follows:

    plt.figure()
    plt.scatter(class_0[:,0], class_0[:,1], color='black', marker='s')
    plt.scatter(class_1[:,0], class_1[:,1], color='black', marker='x')

This is a scatterplot where we use squares and crosses to plot the points. In this context, the marker parameter specifies the shape that you want to use. We use squares to denote points in class_0 and crosses to denote points in class_1. If you run this code, you will see a figure with the two classes plotted as squares and crosses.

If you were asked to inspect the datapoints visually and draw a separating line, what would you do? You would simply draw a line in between them.
Let's go ahead and do this:

    line_x = range(10)
    line_y = line_x

We just created a line with the mathematical equation y = x. Let's plot this, as follows:

    plt.figure()
    plt.scatter(class_0[:,0], class_0[:,1], color='black', marker='s')
    plt.scatter(class_1[:,0], class_1[:,1], color='black', marker='x')
    plt.plot(line_x, line_y, color='black', linewidth=3)
    plt.show()

If you run this code, you should see the two classes with the line y = x drawn between them.

There's more…

We built a really simple classifier using the following rule: the input point (a, b) belongs to class_0 if a is greater than or equal to b; otherwise, it belongs to class_1. If you inspect the points one by one, you will see that this is true. That's it! You just built a linear classifier that can classify unknown data. It's a linear classifier because the separating line is a straight line. If it were a curve, it would be a nonlinear classifier.

This approach worked fine because there were a limited number of points, and we could visually inspect them. What if there are thousands of points? How do we generalize this process? Let's discuss this in the next section.

Building a logistic regression classifier

Despite the word regression being present in the name, logistic regression is actually used for classification purposes. Given a set of datapoints, our goal is to build a model that can draw linear boundaries between our classes. It extracts these boundaries by solving a set of equations derived from the training data. Let's see how to do that in Python.

We will use the logistic_regression.py file that is already provided to you as a reference. Assuming that you have imported the necessary packages (numpy as np, matplotlib.pyplot as plt, and linear_model from sklearn), let's create some sample data along with training labels:

    X = np.array([[4, 7], [3.5, 8], [3.1, 6.2], [0.5, 1], [1, 2], [1.2, 1.9], [6, 2], [5.7, 1.5], [5.4, 2.2]])
    y = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])

Here, we assume that we have three classes.
Let's initialize the logistic regression classifier:

    classifier = linear_model.LogisticRegression(solver='liblinear', C=100)

There are a number of input parameters that can be specified for the preceding function, but a couple of important ones are solver and C. The solver parameter specifies the optimization algorithm that is used to fit the model. The C parameter controls the regularization strength: it is the inverse of the regularization strength, so a lower value indicates stronger regularization. Let's train the classifier:

    classifier.fit(X, y)

Let's draw datapoints and boundaries:

    plot_classifier(classifier, X, y)

We need to define this function (in a script, the definition has to appear before the call above):

    def plot_classifier(classifier, X, y):
        # define ranges to plot the figure
        x_min, x_max = min(X[:, 0]) - 1.0, max(X[:, 0]) + 1.0
        y_min, y_max = min(X[:, 1]) - 1.0, max(X[:, 1]) + 1.0

The preceding values indicate the range of values that we want to use in our figure. These values usually range from the minimum value to the maximum value present in our data. We add some buffers, such as 1.0 in the preceding lines, for clarity.

In order to plot the boundaries, we need to evaluate the function across a grid of points and plot it. Let's go ahead and define the grid:

        # denotes the step size that will be used in the mesh grid
        step_size = 0.01

        # define the mesh grid
        x_values, y_values = np.meshgrid(np.arange(x_min, x_max, step_size), np.arange(y_min, y_max, step_size))

The x_values and y_values variables contain the grid of points where the function will be evaluated.
Let's compute the output of the classifier for all these points:

        # compute the classifier output
        mesh_output = classifier.predict(np.c_[x_values.ravel(), y_values.ravel()])

        # reshape the array
        mesh_output = mesh_output.reshape(x_values.shape)

Let's plot the boundaries using colored regions:

        # Plot the output using a colored plot
        plt.figure()

        # choose a color scheme
        plt.pcolormesh(x_values, y_values, mesh_output, cmap=plt.cm.Set1)

The pcolormesh function draws a 2D pseudocolor plot: it takes the 2D grid points and the value associated with each of them, and draws the different regions using a color scheme. You can find all the color scheme options at http://matplotlib.org/examples/color/colormaps_reference.html.

Let's overlay the training points on the plot:

        plt.scatter(X[:, 0], X[:, 1], c=y, edgecolors='black', linewidth=2, cmap=plt.cm.Paired)

        # specify the boundaries of the figure
        plt.xlim(x_values.min(), x_values.max())
        plt.ylim(y_values.min(), y_values.max())

        # specify the ticks on the X and Y axes
        plt.xticks((np.arange(int(min(X[:, 0])-1), int(max(X[:, 0])+1), 1.0)))
        plt.yticks((np.arange(int(min(X[:, 1])-1), int(max(X[:, 1])+1), 1.0)))

        plt.show()

Here, plt.scatter plots the points on the 2D graph. X[:, 0] takes all the values in column 0 of X (the x-coordinates in our case), and X[:, 1] takes column 1 (the y-coordinates). The c=y parameter indicates the color sequence. We use the target labels to map to colors using cmap. We basically want different colors based on the target labels; hence, we use y as the mapping.

The limits of the display figure are set using plt.xlim and plt.ylim. In order to mark the axes with values, we need to use plt.xticks and plt.yticks. These functions mark the axes with values so that it's easier for us to see where the points are located. In the preceding code, we want the ticks to lie between the minimum and maximum values with a buffer of 1 unit. We also want these ticks to be integers. So, we use the int() function to truncate the values to integers.
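The fragments of plot_classifier shown above can be assembled into one runnable function. The following is a minimal sketch rather than the book's exact file, and it makes a few assumptions that are not from the article: it selects the non-interactive Agg backend so that it also runs without a display, it takes step_size as an argument and returns the predicted grid so the result can be inspected, it passes shading="auto" to pcolormesh, and it trains the model with scikit-learn's default solver instead of solver='liblinear'.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; remove this line to open a window
import matplotlib.pyplot as plt
import numpy as np
from sklearn import linear_model


def plot_classifier(classifier, X, y, step_size=0.01):
    """Draw the decision regions of a fitted 2D classifier; return the grid predictions."""
    # plotting range: the data range plus a buffer of 1.0 on each side
    x_min, x_max = X[:, 0].min() - 1.0, X[:, 0].max() + 1.0
    y_min, y_max = X[:, 1].min() - 1.0, X[:, 1].max() + 1.0

    # grid of points at which the classifier is evaluated
    x_values, y_values = np.meshgrid(np.arange(x_min, x_max, step_size),
                                     np.arange(y_min, y_max, step_size))

    # predict a class for every grid point, then restore the grid shape
    mesh_output = classifier.predict(np.c_[x_values.ravel(), y_values.ravel()])
    mesh_output = mesh_output.reshape(x_values.shape)

    # colored regions first, training points on top
    plt.figure()
    plt.pcolormesh(x_values, y_values, mesh_output, cmap=plt.cm.Set1, shading="auto")
    plt.scatter(X[:, 0], X[:, 1], c=y, edgecolors="black", linewidth=2, cmap=plt.cm.Paired)

    # figure limits and integer ticks
    plt.xlim(x_values.min(), x_values.max())
    plt.ylim(y_values.min(), y_values.max())
    plt.xticks(np.arange(int(X[:, 0].min() - 1), int(X[:, 0].max() + 1), 1.0))
    plt.yticks(np.arange(int(X[:, 1].min() - 1), int(X[:, 1].max() + 1), 1.0))
    return mesh_output


X = np.array([[4, 7], [3.5, 8], [3.1, 6.2], [0.5, 1], [1, 2], [1.2, 1.9],
              [6, 2], [5.7, 1.5], [5.4, 2.2]])
y = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])

# assumption: default solver here, not the article's solver='liblinear'
classifier = linear_model.LogisticRegression(C=100, max_iter=1000)
classifier.fit(X, y)
mesh_output = plot_classifier(classifier, X, y, step_size=0.05)
```

A larger step_size, as in the call above, makes the grid coarser and the function faster. With an interactive backend, calling plt.show() after plot_classifier displays the figure.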
If you run this code, you should see a figure with the training points overlaid on the colored regions that the classifier assigns to each class.

Let's see how the C parameter affects our model. C is the inverse of the regularization strength, so a larger C means weaker regularization and, in relative terms, a higher penalty for misclassifying the training points. Compare the figure that you get with C set to 1.0 against the figure that you get with C set to 10000. As we increase C, there is a higher penalty for misclassification. Hence, the boundaries fit the training data more closely.

Summary

We successfully employed supervised learning to build a simple classifier. We subsequently went on to construct a logistic regression classifier and saw the different results of tweaking C, the inverse regularization strength parameter.
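The effect of C can also be checked numerically instead of visually. The sketch below is an illustration under assumptions that are not from the article: it uses scikit-learn's default solver with a generous max_iter rather than solver='liblinear', and it uses the norm of the learned weights as a rough indicator of how strongly the model is regularized (weaker regularization allows larger weights).

```python
import numpy as np
from sklearn import linear_model

X = np.array([[4, 7], [3.5, 8], [3.1, 6.2], [0.5, 1], [1, 2], [1.2, 1.9],
              [6, 2], [5.7, 1.5], [5.4, 2.2]])
y = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])


def fit_with_c(c_value):
    """Fit a logistic regression model with the given inverse regularization strength."""
    # assumption: default solver and a high iteration cap, not the article's settings
    model = linear_model.LogisticRegression(C=c_value, max_iter=10000)
    model.fit(X, y)
    return model


weight_norms = {}
for c_value in (1.0, 100.0, 10000.0):
    model = fit_with_c(c_value)
    # a larger weight norm means the regularization term is holding the model back less
    weight_norms[c_value] = float(np.linalg.norm(model.coef_))
    print(f"C={c_value:>8}: |w|={weight_norms[c_value]:.2f}, "
          f"training accuracy={model.score(X, y):.2f}")
```

The weight norm grows as C grows, which is the numerical counterpart of the boundaries following the training points more closely.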


Packt

02 Sep 2015

12 min read

Meeting SAP Lumira


In this article, Dmitry Anoshin, author of the book SAP Lumira Essentials, talks about living in a century of information technology. There are a lot of electronic devices around us that generate lots of data. For example, you can surf the Internet, visit a couple of news portals, order new Nike Air Max shoes from a web store, write a couple of messages to your friend, and chat on Facebook. Every action you take produces data. Multiply those actions by the number of people who have access to the Internet or just use a cell phone, and we get really BIG DATA. Of course, you have a question: how big is it? Nowadays, it starts from terabytes or even petabytes. Volume is not the only issue; we also struggle with the variety of data. As a result, it is not enough to analyze only structured data. We should dive deep into unstructured data, such as machine data, which is generated by various machines.

Nowadays, we need a new core competence, dealing with Big Data, because these vast data volumes won't just be stored; they need to be analysed and mined for information that management can use to make the right business decisions. This helps to make the business more competitive and efficient.

Unfortunately, in modern organizations there are still many manual steps needed in order to get data and try to answer your business questions. You need the help of your IT staff, or you need to wait until new data is available in your enterprise data warehouse. In addition, you are often working with an inflexible BI tool, which can only refresh a report or export it to Excel. You definitely need a new approach, one that gives you a competitive advantage, dramatically reduces errors, and accelerates business decisions.
So, we can highlight some of the key points for this kind of analytics:

- Integrating data from heterogeneous systems
- Giving more access to data
- Using sophisticated analytics
- Reducing manual coding
- Simplifying processes
- Reducing time to prepare data
- Focusing on self-service
- Leveraging powerful computing resources

We could continue this list with many other bullet points. If you are a fan of traditional BI tools, you may think that this is almost impossible. Yes, you are right, it is impossible. That's why we need to change the rules of the game. As the business world changes, you must change as well. Maybe you have guessed what this means, but if not, I can help you. I will focus on a new approach to data analytics, which is more flexible and powerful. It is called data discovery. Of course, we need the right tool in order to overcome all the challenges of the modern world. That's why we have chosen SAP Lumira, one of the most powerful data discovery tools on the market. But before diving deep into this amazing tool, let's consider some of the challenges of data discovery that are in our path, as well as its advantages.

Data discovery challenges

Let's imagine that you have several terabytes of data. Unfortunately, it is raw, unstructured data. In order to get business insight from this data, you have to spend a lot of time preparing and cleaning it. In addition, you are restricted by the capabilities of your machine. That's why a good data discovery tool usually combines software and hardware. As a result, this gives you more power for exploratory data analysis.

Let's imagine that this entire Big Data store is in Hadoop or any NoSQL data store. You have to be at least a good programmer in order to do analytics on this data. Here we find another benefit of a good data discovery tool: it gives a powerful tool to business users, who are not as technical and maybe don't even know SQL.
Apache Hadoop is an open source software project that enables distributed processing of large data sets across clusters of commodity servers. It is designed to scale up from a single server to thousands of machines, with a very high degree of fault tolerance. Rather than relying on high-end hardware, the resilience of these clusters comes from the software's ability to detect and handle failures at the application layer.

A NoSQL data store is a next-generation database, mostly addressing some of the following points: non-relational, distributed, open source, and horizontally scalable.

Data discovery versus business intelligence

You may be confused about data discovery and business intelligence technologies; it seems that they are very close to each other, or even that BI tools can do everything that data discovery can do. So why do we need a separate data discovery tool, such as SAP Lumira? In order to better understand the difference between the two technologies, you can look at the table below:

 | Enterprise BI | Data discovery
Key users | All users | Advanced analysts
Approach | Vertically-oriented (top to bottom), semantic layers, requests to existing repositories | Vertically-oriented (bottom-up), mashup, putting data in the selected repository
Interface | Reports, dashboards | Visualization
Users | Reporting | Analysis
Implementation | By IT consultants | By business users

Let's consider the pros and cons of data discovery.

Pros:

- Rapidly analyze data with a short shelf life
- Ideal for small teams
- Best for tactical analysis
- Great for answering one-off questions quickly

Cons:

- Difficult to handle for enterprise organizations
- Difficult for junior users
- Lack of scalability

As a result, it is clear that BI and data discovery handle their own tasks and complement each other.

The role of data discovery

Most organizations have a data warehouse. It was designed to support daily operations and to help make business decisions. But sometimes organizations need to meet new challenges.
For example, a retail company wants to improve its customer experience and decides to work closely with the customer database. Analysts try to segment customers into cohorts and try to analyse customers' behavior. They need to handle all the customer data, which is quite big. In addition, they can use external data in order to learn more about their customers. If they start to use a corporate BI tool, every interaction, such as adding a new field or filter, can take 10-30 minutes. Another issue is adding a new field to an existing report. Usually, it is impossible without the help of IT staff, due to security or the complexities of the enterprise BI solution. This is unacceptable in a modern business. Analysts want to get an answer to their business questions immediately, and they prefer to visualize data because, as you know, people take in visual information much more readily than text. In addition, these analysts can be independent of IT. They have their data discovery tool, and they can connect to any data source in the organization and check their crazy hypotheses.

There are hundreds of examples where BI and DWH are weak and data discovery is strong.

Introducing SAP Lumira

Starting from this point, we will focus on learning SAP Lumira. First of all, we need to understand what SAP Lumira is exactly. SAP Lumira is a family of data discovery tools that gives us an opportunity to create amazing visualizations or even tell fantastic stories based on our big or small data. We can connect to most of the popular data sources, such as Relational Database Management Systems (RDBMSs), flat files, Excel spreadsheets, or SAP applications. We are able to create datasets with measures, dimensions, hierarchies, or variables. In addition, Lumira allows us to prepare, edit, and clean our data before it is processed.

SAP Lumira offers us a huge arsenal of graphical charts and tables to visualize our data.
In addition, we can create data stories or even infographics based on our data by grouping charts, single cells, or tables together on boards to create presentation-style dashboards. Moreover, we can add images or text in order to add details. The following are the three main products in the Lumira family offered by SAP:

- SAP Lumira Desktop
- SAP Lumira Server
- SAP Lumira Cloud

Lumira Desktop comes in either a personal edition or a standard edition. Both of them give you the opportunity to analyse data on your local machine. You can even share your visualizations or insights via PDF or XLS.

Lumira Server also comes in two variations, Edge and Server. As you know, SAP BusinessObjects also has two types of license for the same software, Edge and Enterprise, and they differ only in terms of the number of users and the type of license. The Edge version is smaller; for example, it can cover the needs of a team or even a whole department.

Lumira Cloud is Software as a Service (SaaS). It helps to quickly visualize large volumes of data without having to sacrifice performance or security. It is designed to speed up time to insight. In addition, it saves time and money with flexible licensing options.

Data connectors

By now, we have met SAP Lumira for the first time, played with the interface, and adjusted its general settings. In addition, we can find a menu in the middle of the window that lists several steps, which help us to discover our data and gain business insights. In this article, we start from the first step, exploring data in SAP Lumira: we create a document and acquire a dataset, which can include part or all of the original values from a data source. This is done through Acquire Data. Let's click on Acquire Data. A new window will come up. There are four areas in this window. They are: A list of possible data sources (1): Here, the user can connect to his data source.
Recently used objects (2): The user can open his previous connections or files. Ordinary buttons (3), such as Previous, Next, Create, and Cancel. A small chat box (4), which we can find on almost every page: SAP Lumira cares about the quality of the product and gives the user the opportunity to take a screenshot and send feedback to SAP.

Let's go deeper and consider each connection more closely in the table below:

Data source | Description
Microsoft Excel | Excel data sheets
Flat file | CSV, TXT, LOG, PRN, or TSV
SAP HANA | There are two possible ways: Offline (downloading data) and Online (connected to SAP HANA)
SAP BusinessObjects universe | UNV or UNX
SQL Databases | Query data via SQL from relational databases
SAP Business Warehouse | Downloaded data from a BEx Query or an InfoProvider

Let's try to connect some data sources and extract some data from them.

Microsoft spreadsheets

Let's start with the easiest exercise. For example, our inventory manager asked us to analyse flop products, which are not popular, and he sent us two Excel spreadsheets, Unicorn_flop_products.xls and Unicorn_flop_price.xls. There are two different spreadsheets because prices and product attributes are in different systems. Both files have a unique field, SKU. As a result, it is possible to merge them by this field and analyse them as one dataset.

SKU, or stock keeping unit, is a distinct item for sale, such as a product or service, and the attributes associated with the item distinguish it from other items. For a product, these attributes include, but are not limited to, manufacturer, product description, material, size, color, packaging, and warranty terms. When a business takes inventory, it counts the quantity of each SKU.

Connecting to the SAP BO universe

The universe is a core concept in the SAP BusinessObjects BI platform. It is the semantic layer that isolates business users from the technical complexities of the databases where their corporate information is stored.
For the ease of the end user, universes are made up of objects and classes that map to data in the database, using everyday terms that describe their business environment.

Introducing Unicorn Fashion universe

The Unicorn Fashion company uses the SAP BusinessObjects BI platform (BIP) as its primary BI tool. There is another Unicorn Fashion universe, which was built based on the unicorn datamart. It has a similar structure and joins as the datamart. The following image shows the Unicorn Fashion universe. It unites two business processes, Sales (orange) and Stock (green), and has the following structure in the business layer:

- Product: This specifies the attributes of an SKU, such as brand, category, ant, and so on
- Price: This denotes the different pricing of the SKU
- Sales: This specifies the sales business process
- Order: This denotes the order number, the shipping information, and order measures
- Sales Date: This specifies the attributes of the order date, such as month, year, and so on
- Sales Measures: This denotes various aggregated measures, such as shipped items, revenue waterfall, and so on
- Stock: This specifies the information about the quantity in stock
- Stock Date: This denotes the attributes of the stock date, such as month, year, and so on

Summary

The book is a step-by-step guide to learning SAP Lumira essentials, starting from an overview of the SAP Lumira family of products. We will demonstrate various data discovery techniques using real-world scenarios of an online e-commerce retailer. Moreover, we have detailed recipes for the installation, administration, and customization of SAP Lumira. In addition, we will show how to work with data, starting from acquiring data from various data sources, then preparing it and visualizing it through the rich functionality of SAP Lumira. Finally, it teaches how to present data via a data story or infographic and publish it across your organization or the World Wide Web.
Learn data discovery techniques, build amazing visualizations, create fantastic stories, and share these visualizations electronically with one of the most powerful tools, SAP Lumira. Moreover, we will focus on extracting data from different sources, such as plain text, Microsoft Excel spreadsheets, the SAP BusinessObjects BI Platform, SAP HANA, and SQL databases. Finally, it will teach how to publish the result of your painstaking work on various mediums, such as SAP BI Clients, SAP Lumira Cloud, and so on.


Packt

08 Apr 2010

4 min read

Unveil the Power of Your Business Data with Oracle Discoverer



Packt

20 Oct 2009

5 min read

Customizing Page Management in Liferay Portal 5.2 Systems Development


Customizing page management with more features

The Ext Manage Pages portlet not only clones the out-of-the-box Manage Pages portlet, but it also extends the model and service to support customized data, for example, Keywords. We can make these Keywords localized too.

Adding localized feature

Liferay portal is designed to handle as many languages as you want to support. By default, it supports up to 22 languages. When a page is loading, the portal will detect the language, pull up the corresponding language file, and display the text in the correct language. We want the Keywords to be localized too. For example, the default language is English (United States) and the localized language is Deutsch (Deutschland). Thus, you have the ability to enter not only the Name and HTML Title in German, but also the Keywords in German.

As shown in the following screenshot, when you change the language of the page to German using the language portlet, you will see the entire web site changed to German, including the portlet title and input fields. For example, the title of the portlet now has the Ext Seiteneinstellungen value and the Keywords label now becomes Schlüsselwörter.

How do we implement this feature? In other words, how do we customize the language display in the page management? Let's add the localized feature for the Ext Manage Pages portlet.

Extending model for locale

First of all, we need to extend the model and implement that model in order to support the localized feature. For the ExtLayout model, let's add the locale methods first. Locate the ExtLayout.java file from the com.ext.portlet.layout.model package in the /ext/ext-service/src folder, and open it.
Add the following lines before the line } in ExtLayout.java and save it:

    public String getKeywords(Locale locale);
    public String getKeywords(String localeLanguageId);
    public String getKeywords(Locale locale, boolean useDefault);
    public String getKeywords(String localeLanguageId, boolean useDefault);
    public void setKeywords(String keywords, Locale locale);

As shown in the code above, it adds getting and setting methods for the Keywords field with locale features. Now let's add the implementation for the ExtLayout model. Locate the ExtLayoutImpl.java file from the com.ext.portlet.layout.model.impl package in the /ext/ext-impl/src folder and open it. Add the following lines before the last } in the ExtLayoutImpl.java file and save it:

    public String getKeywords(Locale locale) {
        String localeLanguageId = LocaleUtil.toLanguageId(locale);
        return getKeywords(localeLanguageId);
    }

    public String getKeywords(String localeLanguageId) {
        return LocalizationUtil.getLocalization(getKeywords(), localeLanguageId);
    }

    public String getKeywords(Locale locale, boolean useDefault) {
        String localeLanguageId = LocaleUtil.toLanguageId(locale);
        return getKeywords(localeLanguageId, useDefault);
    }

    public String getKeywords(String localeLanguageId, boolean useDefault) {
        return LocalizationUtil.getLocalization(
            getKeywords(), localeLanguageId, useDefault);
    }

    public void setKeywords(String keywords, Locale locale) {
        String localeLanguageId = LocaleUtil.toLanguageId(locale);
        if (Validator.isNotNull(keywords)) {
            setKeywords(LocalizationUtil.updateLocalization(
                getKeywords(), "keywords", keywords, localeLanguageId));
        }
        else {
            setKeywords(LocalizationUtil.removeLocalization(
                getKeywords(), "keywords", localeLanguageId));
        }
    }

As shown in the code above, it adds the implementation for the get and set methods of the ExtLayout model.

Customizing language properties

Language files have locale-specific definitions.
By default, Language.properties (at /portal/portal-impl/src/content) contains English phrase variations further defined for United States, while Language_de.properties (at /portal/portal-impl/src/content) contains German phrase variations further defined for Germany. In Ext, Language-ext.properties (available at /ext/ext-impl/src/content) contains English phrase variations further defined for United States, while Language-ext_de.properties (should be available at /ext/ext-impl/src/content) contains German phrase variations further defined for Germany.

First, let's add a message in Language-ext.properties, by using the following steps. Locate the Language-ext.properties file in the /ext/ext-impl/src/content folder and open it. Add the following line after the line view-reports=View Reports for Books and save it:

    keywords=Keywords

This code specifies the keywords message key with a Keywords value in English.

Then we need to add the German language feature in Language-ext_de.properties as follows. Create a language file Language-ext_de.properties in the /ext/ext-impl/src/content folder and open it. Add the following lines at the beginning and save it:

    ## Portlet names
    javax.portlet.title.EXT_1=Berichte
    javax.portlet.title.jsp_portlet=JSP Portlet
    javax.portlet.title.book_reports=Berichte für das Buch
    javax.portlet.title.extLayoutManagement=Ext Seiteneinstellungen
    javax.portlet.title.extCommunities=Ext Communities

    ## Messages
    view-reports=Ansicht-Berichte für Bücher
    keywords=Schlüsselwörter

    ## Category titles
    category.book=Buch

    ## Model resources
    model.resource.com.ext.portlet.reports.model.ReportsEntry= Buch

    ## Action names
    action.ADD_BOOK=Fügen Sie Buch hinzu

As shown in the code above, it specifies the same keys as those of Language-ext.properties, but all the keys' values are specified in German instead of English. For example, the message keywords has a Schlüsselwörter value in German.
In addition, you can set German as the default language and Germany as the default country if it is required. Here are the simple steps to do so. Locate the system-ext.properties file in the /ext/ext-impl/src folder and open it. Add the following lines at the end of system-ext.properties and save it:

    user.country=DE
    user.language=de

The code above sets the default locale: the language German (Deutsch) and the country Germany (Deutschland).

In general, there are many language files, for example Language-ext.properties and Language-ext_de.properties, and some language files overwrite others when they are loaded at runtime. For example, Language-ext_de.properties will overwrite Language-ext.properties when the language is set as German. These are the three simple rules which indicate the priorities of these language files:

- The ext versions take precedence over the non-ext versions.
- The language-specific versions, for example _de, take precedence over the non language-specific versions.
- The location-specific versions, such as -ext_de, take precedence over the non location-specific versions.

For instance, the following is the ranking for the German language, from highest to lowest priority:

- Language-ext_de.properties
- Language_de.properties
- Language-ext.properties
- Language.properties
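The ranking above amounts to an ordered lookup: a key is resolved from the first file in the list that defines it. The following Python sketch only models that idea and is not Liferay's implementation. The keywords and view-reports entries come from the article; the cancel and save entries, the get_message helper, and the choice to return the key itself when nothing matches are hypothetical and exist only for illustration.

```python
from collections import ChainMap

# Entries taken from the article
language_ext_de = {"keywords": "Schlüsselwörter",
                   "view-reports": "Ansicht-Berichte für Bücher"}
language_ext = {"keywords": "Keywords",
                "view-reports": "View Reports for Books"}

# Hypothetical entries, standing in for the portal's own language files
language_de = {"cancel": "Abbrechen"}
language = {"cancel": "Cancel", "save": "Save"}

# ChainMap searches its maps from left to right, so the order below is the
# ranking for German, from highest to lowest priority
german_messages = ChainMap(language_ext_de, language_de, language_ext, language)


def get_message(key):
    """Return the highest-priority value for key, or the key itself if undefined."""
    return german_messages.get(key, key)


for key in ("keywords", "cancel", "save"):
    print(key, "->", get_message(key))
```

Here keywords resolves from Language-ext_de.properties, cancel from Language_de.properties, and save falls through to Language.properties because none of the higher-priority files define it.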


Packt

22 May 2014

13 min read

A/B Testing – Statistical Experiments for the Web


Defining A/B testing

At its most fundamental level, A/B testing just involves creating two different versions of a web page. Sometimes, the changes are major redesigns of the site or the user experience, but usually, the changes are as simple as changing the text on a button. Then, for a short period of time, new visitors are randomly shown one of the two versions of the page. The site tracks their behavior, and the experiment determines whether one version or the other increases the users' interaction with the site. This may mean more click-throughs, more purchases, or any other measurable behavior.

This is similar to methods in other domains that go by different names. The basic framework randomly tests two or more groups simultaneously and is sometimes called a randomized controlled experiment or an online controlled experiment. It's also sometimes referred to as split testing, as the participants are split into two groups. These are all examples of between-subjects experiment design. Experiments that use these designs all split the participants into two groups. One group, the control group, gets the original environment. The other group, the test group, gets the modified environment that those conducting the experiment are interested in testing.

Experiments of this sort can be single-blind or double-blind. In single-blind experiments, the subjects don't know which group they belong to. In double-blind experiments, those conducting the experiments also don't know which group the subjects they're interacting with belong to. This safeguards the experiments against biases that can be introduced by participants or experimenters being aware of the group assignment. For example, participants could get more engaged if they believe they're in the test group because this is newer in some way. Or, an experimenter could treat a subject differently in a subtle way because of the group that they belong to.
As the computer is the one that directly conducts the experiment, and because those visiting your website aren't aware of which group they belong to, website A/B testing is generally an example of a double-blind experiment.

Of course, this is an argument for only conducting the test on new visitors. Otherwise, the user might recognize that the design has changed and throw the experiment off. For example, the users may be more likely to click on a new button when they recognize that the button is, in fact, new. However, if they are new to the site as a whole, then the button itself may not stand out enough to warrant extra attention.

In some cases, an experiment can test more than two variants of the site. This divides the test subjects into more groups, so there need to be more subjects available in order to compensate. Otherwise, the experiment's statistical validity might be in jeopardy. If each group doesn't have enough subjects, and therefore observations, then there is a larger error rate for the test, and results will need to be more extreme to be significant.

In general, though, you'll want to have as many subjects as you reasonably can. Of course, this is always a trade-off. Getting 500 or 1000 subjects may take a while, given the typical traffic of many websites, but you still need to take action within a reasonable amount of time and put the results of the experiment into effect. So we'll talk later about how to determine the number of subjects that you actually need to get a certain level of significance.

Another wrinkle is that you'll want to know as soon as possible whether one option is clearly better, so that you can begin to profit from it early. In the multi-armed bandit problem, this is framed as exploration versus exploitation. This refers to the tension in experiment design (and in other domains) between exploring the problem space and exploiting the resources you've found in the experiment so far.
We won't get into this further, but it is a factor to stay aware of as you perform A/B tests in the future. Because of the power and simplicity of A/B testing, it's being widely used in a variety of domains. For example, marketing and advertising make extensive use of it. Also, it has become a powerful way to test and improve measurable interactions between your website and those who visit it online. The primary requirement is that the interaction be somewhat limited and very measurable. Interesting would not make a good metric; the click-through rate or pages visited, however, would. Because of this, A/B tests validate changes in the placement or in the text of buttons that call for action from the users. For example, a test might compare the performance of Click for more! against Learn more now!. Another test may check whether a button placed in the upper-right section increases sales versus one in the center of the page. These changes are all incremental, and you probably don't want to break a large site redesign into pieces and test all of them individually. In a larger redesign, several changes may work together and reinforce each other. Testing them incrementally and only applying the ones that increase some metric can result in a design that's not aesthetically pleasing, is difficult to maintain, and costs you users in the long run. In these cases, A/B testing is not recommended. Some other things that are regularly tested in A/B tests include the following parts of a web page: The wording, size, and placement of a call-to-action button The headline and product description The length, layout, and fields in a form The overall layout and style of the website as a larger test, which is not broken down The pricing and promotional offers of products The images on the landing page The amount of text on a page Now that we have an understanding of what A/B testing is and what it can do for us, let's see what it will take to set up and perform an A/B test. 
Conducting an A/B test

In creating an A/B test, we need to decide several things, and then we need to put our plan into action. We'll walk through those decisions here and create a simple set of web pages that will test the aspects of design that we are interested in changing, based upon the behavior of the user. Before we start building stuff, though, we need to think through our experiment and what we'll need to build.

Planning the experiment

For this article, we're going to pretend that we have a website for selling widgets (or rather, looking at the website Widgets!). The web page in this screenshot is the control page. Currently, we're getting 24 percent click-through on it from the Learn more! button. We're interested in the text of the button. If it read Order now! instead of Learn more!, it might generate more click-through. (Of course, actually explaining what the product is and what problems it solves might be more effective, but one can't have everything.) This will be the test page, and we're hoping that we can increase the click-through rate to 29 percent (an absolute increase of five percentage points). Now that we have two versions of the page to experiment with, we can frame the experiment statistically and figure out how many subjects we'll need for each version of the page in order to detect a statistically meaningful increase in the click-through rate on that button.

Framing the statistics

First, we need to frame our experiment in terms of the null-hypothesis test. In this case, the null hypothesis would look something like this: Changing the button copy from Learn more! to Order now! would not improve the click-through rate. Remember, this is the statement that we're hoping to disprove (or fail to disprove) in the course of this experiment. Now we need to think about the sample size. This needs to be fixed in advance.
To find the sample size, we'll use the standard error formula, solved for the number of observations needed for about a 95 percent confidence interval. This gets us in the ballpark of how large our sample should be:

    n = 16σ² / δ²

In this, δ is the minimum effect to detect and σ² is the sample variance. If we are testing for something like a percent increase in the click-through, the variance is σ² = p(1 – p), where p is the initial click-through rate with the control page. So for this experiment, the variance will be 0.24(1 – 0.24) or 0.1824. This would make the sample size for each variant 16(0.1824 / 0.05²), or almost 1170.

The code to compute this in Clojure is fairly simple:

    (defn get-target-sample [rate min-effect]
      (let [v (* rate (- 1.0 rate))]
        (* 16.0 (/ v (* min-effect min-effect)))))

Running the code from the prompt gives us the response that we expect:

    user=> (get-target-sample 0.24 0.05)
    1167.36

Part of the reason to calculate the number of participants needed is that monitoring the progress of the experiment and stopping it prematurely can invalidate the results of the test. Stopping early increases the risk of false positives, where the experiment says it has disproved the null hypothesis when it really hasn't. This seems counterintuitive, doesn't it? Once we have significant results, we should be able to stop the test. Let's work through it.

Let's say that in actuality, there's no difference between the control page and the test page. That is, both sets of copy for the button get approximately the same click-through rate. If we're attempting to get p ≤ 0.05, then it means that the test will return a false positive five percent of the time. It will incorrectly say that there is a significant difference between the click-through rates of the two buttons five percent of the time. Let's say that we're running the test and planning to get 3,000 subjects. We end up checking the results after every 1,000 participants.
Let's break down what might happen:

    Run      A    B    C    D    E    F    G    H
    1000     No   No   No   No   Yes  Yes  Yes  Yes
    2000     No   No   Yes  Yes  No   Yes  No   Yes
    3000     No   Yes  No   Yes  No   No   Yes  Yes
    Final    No   Yes  No   Yes  No   No   Yes  Yes
    Stopped  No   Yes  Yes  Yes  Yes  Yes  Yes  Yes

Let's read this table. Each lettered column represents a scenario for how the significance of the results may change over the run of the test. The rows represent the number of observations that have been made. The row labeled Final represents the experiment's true finishing result, and the row labeled Stopped represents the result if the experiment is stopped as soon as a significant result is seen. The final results show us that out of eight different scenarios, the final result would be significant in four cases (B, D, G, and H). However, if the experiment is stopped prematurely, then it will be significant in seven cases (all but A). The test could drastically over-generate false positives. In fact, most statistical tests assume that the sample size is fixed before the test is run. It's exciting to get good results, so we'll design our system so that we can't easily stop it prematurely. We'll just take that temptation away. With this in mind, let's consider how we can implement this test.

Building the experiment

There are several options to actually implement the A/B test. We'll consider several of them and weigh their pros and cons. Ultimately, the option that works best for you really depends on your circumstances. However, we'll pick one for this article and use it to implement the test for it.

Looking at options to build the site

The first way to implement A/B testing is to use a server-side implementation. In this case, all of the processing and tracking is handled on the server, and visitors' actions would be tracked using GET or POST parameters on the URL for the resource that the experiment is attempting to drive traffic towards.
The steps for this process would go something like the following ones: A new user visits the site and requests for the page that contains the button or copy that is being tested. The server recognizes that this is a new user and assigns the user a tracking number. It assigns the user to one of the test groups. It adds a row in a database that contains the tracking number and the test group that the user is part of. It returns the page to the user with the copy, image, or design that is reflective of the control or test group. The user views the returned page and decides whether to click on the button or link or not. If the server receives a request for the button's or link's target, it updates the user's row in the tracking table to show us that the interaction was a success, that is, that the user did a click-through or made a purchase. This way of handling it keeps everything on the server, so it allows more control and configuration over exactly how you want to conduct your experiment. A second way of implementing this would be to do everything using JavaScript (or ClojureScript, https://github.com/clojure/clojurescript). In this scenario, the code on the page itself would randomly decide whether the user belonged to the control or the test group, and it would notify the server that a new observation in the experiment was beginning. It would then update the page with the appropriate copy or image. Most of the rest of this interaction is the same as the one in previous scenario. However, the complete steps are as follows: A new user visits the site and requests for the page that contains the button or copy being tested. The server inserts some JavaScript to handle the A/B test into the page. As the page is being rendered, the JavaScript library generates a new tracking number for the user. It assigns the user to one of the test groups. It renders that page for the group that the user belongs to, which is either the control group or the test group. 
It notifies the server of the user's tracking number and the group. The server takes this notification and adds a row for the observation in the database. The JavaScript in the browser tracks the user's next move either by directly notifying the server using an AJAX call or indirectly using a GET parameter in the URL for the next page. The server receives the notification whichever way it's sent and updates the row in the database.

The downside of this is that having JavaScript take care of rendering the experiment might take slightly longer and may throw off the experiment. It's also slightly more complicated, because there are more parts that have to communicate. However, the benefit is that you can create a JavaScript library, easily throw a small script tag into the page, and immediately have a new A/B experiment running.

In reality, though, you'll probably just use a service that handles this and more for you. However, it still makes sense to understand what these services provide, and that's what this article tries to do: by understanding how to perform an A/B test, you can make better use of A/B testing vendors and services.
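The server-side flow described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation built for this article: the Experiment class, its method names, and the in-memory dictionary standing in for the database table are all hypothetical, and the code assumes Python 3.

```python
import itertools
import random


class Experiment:
    """In-memory stand-in for the tracking table (hypothetical design)."""

    def __init__(self, groups=("control", "test"), seed=None):
        self.groups = groups
        self.rng = random.Random(seed)   # a seed only makes demo runs repeatable
        self.ids = itertools.count(1)    # source of tracking numbers
        self.rows = {}                   # tracking number -> row

    def new_visitor(self):
        """Assign a tracking number and a random group, then record the row."""
        tracking_id = next(self.ids)
        group = self.rng.choice(self.groups)
        self.rows[tracking_id] = {"group": group, "success": False}
        return tracking_id, group

    def record_success(self, tracking_id):
        """Mark the visitor's row when the button's target is requested."""
        self.rows[tracking_id]["success"] = True

    def click_through_rates(self):
        """Return the fraction of successful visitors in each group."""
        rates = {}
        for group in self.groups:
            rows = [r for r in self.rows.values() if r["group"] == group]
            hits = sum(1 for r in rows if r["success"])
            rates[group] = hits / len(rows) if rows else 0.0
        return rates
```

A web handler would call new_visitor() when it serves the page to a new user and record_success() when it receives the request for the button's target. Note that the sketch deliberately offers no way to stop the experiment early.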



Packt

10 Jan 2011

14 min read

Getting Started with Python 2.6 Text Processing


Python 2.6 Text Processing: Beginners Guide. The easiest way to learn how to manipulate text with Python. Deals with the most important textual data formats you will encounter. Learn to use the most popular text processing libraries available for Python. Packed with examples to guide you through.

Categorizing types of text data

Textual data comes in a variety of formats. For our purposes, we'll categorize text into three very broad groups. Breaking the data down into segments helps us to understand the problem a bit better, and subsequently choose a parsing approach. Each one of these sweeping groups can be further broken down into more detailed chunks. One thing to remember when working your way through the book is that text content isn't limited to the Latin alphabet. This is especially true when dealing with data acquired via the Internet.

Providing information through markup

Structured text includes formats such as XML and HTML. These formats generally consist of text content surrounded by special symbols or markers that give extra meaning to a file's contents. These additional tags are usually meant to convey information to the processing application and to arrange information in a tree-like structure. Markup allows a developer to define his or her own data structure, yet rely on standardized parsers to extract elements. For example, consider the following contrived HTML document.

    <html>
        <head>
            <title>Hello, World!</title>
        </head>
        <body>
            <p>
                Hi there, all of you earthlings.
            </p>
            <p>
                Take us to your leader.
            </p>
        </body>
    </html>

In this example, our document's title is clearly identified because it is surrounded by opening and closing <title> and </title> tags. Note that although the document's tags give each element a meaning, it's still up to the application developer to understand what to do with a title object or a p element.
Notice that while it still has meaning to us humans, it is also laid out in such a way as to make it computer friendly. One interesting aspect to these formats is that it's possible to embed references to validation rules as well as the actual document structure. This is a nice benefit in that we're able to rely on the parser to perform markup validation for us. This makes our job much easier as it's possible to trust that the input structure is valid. Meaning through structured formats Text data that falls into this category includes things such as configuration files, marker delimited data, e-mail message text, and JavaScript Object Notation web data. Content within this second category does not contain explicit markup much like XML and HTML does, but the structure and formatting is required as it conveys meaning and information about the text to the parsing application. For example, consider the format of a Windows INI file or a Linux system's /etc/hosts file. There are no tags, but the column on the left clearly means something other than the column on the right. Python provides a collection of modules and libraries intended to help us handle popular formats from this category. Understanding freeform content This category contains data that does not fall into the previous two groupings. This describes e-mail message content, letters, article copy, and other unstructured character-based content. However, this is where we'll largely have to look at building our own processing components. There are external packages available to us if we wish to perform common functions. Some examples include full text searching and more advanced natural language processing. Ensuring you have Python installed Our first order of business is to ensure that you have Python installed. We'll be working with Python 2.6 and we assume that you're using that same version. If there are any drastic differences in earlier releases, we'll make a note of them as we go along. 
All of the examples should still function properly with Python 2.4 and later versions. If you don't have Python installed, you can download the latest 2.X version from http://www.python.org. Most Linux distributions, as well as Mac OS, usually have a version of Python preinstalled. At the time of this writing, Python 2.6 was the latest version available, while 2.7 was in an alpha state. Providing support for Python 3 The examples in this book are written for Python 2. However, wherever possible, we will provide code that has already been ported to Python 3. You can find the Python 3 code in the Python3 directories in the code bundle available on the Packt Publishing FTP site. Unfortunately, we can't promise that all of the third-party libraries that we'll use will support Python 3. The Python community is working hard to port popular modules to version 3.0. However, as the versions are incompatible, there is a lot of work remaining. In situations where we cannot provide example code, we'll note this. Implementing a simple cipher Let's get going early here and implement our first script to get a feel for what's in store. A Caesar Cipher is a simple form of cryptography in which each letter of the alphabet is shifted down by a number of letters. They're generally of no cryptographic use when applied alone, but they do have some valid applications when paired with more advanced techniques. This preceding diagram depicts a cipher with an offset of three. Any X found in the source data would simply become an A in the output data. Likewise, any A found in the input data would become a D. Time for action – implementing a ROT13 encoder The most popular implementation of this system is ROT13. As its name suggests, ROT13 shifts – or rotates – each letter by 13 spaces to produce an encrypted result. As the English alphabet has 26 letters, we simply run it a second time on the encrypted text in order to get back to our original result. 
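Before writing the ROT13 script itself, the general shift can be checked with a minimal sketch. The caesar_shift name and its offset parameter are illustrative assumptions, not part of the book's code, and the sketch builds its lookup table with plain string slicing.

```python
import string


def caesar_shift(text, offset):
    """Shift each ASCII letter by `offset` places, wrapping around at Z."""
    lower = string.ascii_lowercase
    upper = string.ascii_uppercase
    k = offset % 26
    # Map every letter to the letter k places further along its alphabet.
    table = dict(zip(lower + upper,
                     lower[k:] + lower[:k] + upper[k:] + upper[:k]))
    # Characters that are not ASCII letters pass through unchanged.
    return "".join(table.get(ch, ch) for ch in text)
```

With an offset of three, caesar_shift('XA', 3) returns 'AD', as described for the diagram. With an offset of 13, applying the function twice returns the original text, because 13 + 13 = 26 brings every letter back to where it started.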
Let's implement a simple version of that algorithm. Start your favorite text editor and create a new Python source file. Save it as rot13.py. Enter the following code exactly as you see it below and save the file.

    import sys
    import string

    CHAR_MAP = dict(zip(
        string.ascii_lowercase,
        string.ascii_lowercase[13:26] + string.ascii_lowercase[0:13]
        )
    )

    def rotate13_letter(letter):
        """
        Return the 13-char rotation of a letter.
        """
        do_upper = False
        if letter.isupper():
            do_upper = True

        letter = letter.lower()
        if letter not in CHAR_MAP:
            return letter
        else:
            letter = CHAR_MAP[letter]
            if do_upper:
                letter = letter.upper()

        return letter

    if __name__ == '__main__':
        for char in sys.argv[1]:
            sys.stdout.write(rotate13_letter(char))
        sys.stdout.write('\n')

Now, from a command line, execute the script as follows. If you've entered all of the code correctly, you should see the same output.

    $ python rot13.py 'We are the knights who say, nee!'
    Jr ner gur xavtugf jub fnl, arr!

Run the script a second time, using the output of the first run as the new input string. If everything was entered correctly, the original text should be printed to the console.

    $ python rot13.py 'Jr ner gur xavtugf jub fnl, arr!'
    We are the knights who say, nee!

What just happened?

We implemented a simple text-oriented cipher using a collection of Python's string handling features. We were able to see it put to use for both encoding and decoding source text. We saw a lot of stuff in this little example, so you should have a good feel for what can be accomplished using the standard Python string object. Following our initial module imports, we defined a dictionary named CHAR_MAP, which gives us a nice and simple way to shift our letters by the required 13 places. Each dictionary key is a source letter, and its value is the target letter! We also took advantage of string slicing here. In our translation function rotate13_letter, we checked whether our input character was uppercase or lowercase and then saved that as a Boolean variable. We then forced our input to lowercase for the translation work.
As ROT13 operates on letters alone, we only performed a rotation if our input character was a letter of the Latin alphabet. We allowed other values to simply pass through. We could have just as easily forced our string to a pure uppercased value. The last thing we do in our function is restore the letter to its proper case, if necessary. This should familiarize you with upper- and lowercasing of Python ASCII strings. We're able to change the case of an entire string using this same method; it's not limited to single characters.

    >>> name = 'Ryan Miller'
    >>> name.upper()
    'RYAN MILLER'
    >>> "PLEASE DO NOT SHOUT".lower()
    'please do not shout'
    >>>

It's worth pointing out here that a single character string is still a string. There is not a char type, which you may be familiar with if you're coming from a different language such as C or C++. However, it is possible to translate between characters and their ASCII codes using the ord and chr built-in functions and a string with a length of one. Notice how we were able to loop through a string directly using the Python for syntax. A string object is a standard Python iterable, and we can walk through it as shown in the following session. In practice, however, this isn't something you'll normally do. In most cases, it makes sense to rely on existing libraries.

    $ python
    Python 2.6.1 (r261:67515, Jul 7 2009, 23:51:51)
    [GCC 4.2.1 (Apple Inc. build 5646)] on darwin
    Type "help", "copyright", "credits" or "license" for more information.
    >>> for char in "Foo":
    ...     print char
    ...
    F
    o
    o
    >>>

Finally, you should note that we ended our script with an if statement such as the following:

    if __name__ == '__main__':

Python modules all contain an internal __name__ variable that corresponds to the name of the module. If a module is executed directly from the command line, as this script is, its __name__ value is set to __main__. This code only runs if we've executed this script directly. It will not run if we import this code from a different script. You can import the code directly from the command line and see for yourself.

    $ python
    Python 2.6.1 (r261:67515, Jul 7 2009, 23:51:51)
    [GCC 4.2.1 (Apple Inc. build 5646)] on darwin
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import rot13
    >>> dir(rot13)
    ['CHAR_MAP', '__builtins__', '__doc__', '__file__', '__name__', '__package__', 'rotate13_letter', 'string', 'sys']
    >>>

Notice how we were able to import our module and see all of the methods and attributes inside of it, but the driver code did not execute.

Have a go hero – more translation work

Each Python string instance contains a collection of methods that operate on one or more characters. You can easily display all of the available methods and attributes by using the dir function. For example, enter the following command into a Python window. Python responds by printing a list of all methods on a string object.

    >>> dir("content")
    ['__add__', '__class__', '__contains__', '__delattr__', '__doc__',
    '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__',
    '__getnewargs__', '__getslice__', '__gt__', '__hash__', '__init__',
    '__le__', '__len__', '__lt__', '__mod__', '__mul__', '__ne__', '__new__',
    '__reduce__', '__reduce_ex__', '__repr__', '__rmod__', '__rmul__',
    '__setattr__', '__sizeof__', '__str__', '__subclasshook__',
    '_formatter_field_name_split', '_formatter_parser', 'capitalize',
    'center', 'count', 'decode', 'encode', 'endswith', 'expandtabs', 'find',
    'format', 'index', 'isalnum', 'isalpha', 'isdigit', 'islower', 'isspace',
    'istitle', 'isupper', 'join', 'ljust', 'lower', 'lstrip', 'partition',
    'replace', 'rfind', 'rindex', 'rjust', 'rpartition', 'rsplit', 'rstrip',
    'split', 'splitlines', 'startswith', 'strip', 'swapcase', 'title',
    'translate', 'upper', 'zfill']
    >>>

Much like the isupper and islower methods discussed previously, we also have an isspace method. Using this method, in combination with your newfound knowledge of Python strings, update the function we defined previously to translate spaces to underscores and underscores to spaces.

Processing structured markup with a filter

Our ROT13 application works great for simple one-line strings that we can fit on the command line. However, it wouldn't work very well if we wanted to encode an entire file, such as the HTML document we took a look at earlier. In order to support larger text documents, we'll need to change the way we accept input. We'll redesign our application to work as a filter. A filter is an application that reads data from its standard input file descriptor and writes to its standard output file descriptor. This allows users to create command pipelines that allow multiple utilities to be strung together. If you've ever typed a command such as cat /etc/hosts | grep mydomain.com, you've set up a pipeline. In many circumstances, data is fed into the pipeline via the keyboard and completes its journey when a processed result is displayed on the screen.

Time for action – processing as a filter

Let's make the changes required to allow our simple ROT13 processor to work as a command-line filter. This will allow us to process larger files. Create a new source file and enter the following code. When complete, save the file as rot13-b.py.

    import sys
    import string

    CHAR_MAP = dict(zip(
        string.ascii_lowercase,
        string.ascii_lowercase[13:26] + string.ascii_lowercase[0:13]
        )
    )

    def rotate13_letter(letter):
        """
        Return the 13-char rotation of a letter.
        """
        do_upper = False
        if letter.isupper():
            do_upper = True

        letter = letter.lower()
        if letter not in CHAR_MAP:
            return letter
        else:
            letter = CHAR_MAP[letter]
            if do_upper:
                letter = letter.upper()

        return letter

    if __name__ == '__main__':
        for line in sys.stdin:
            for char in line:
                sys.stdout.write(rotate13_letter(char))

Enter the following HTML data into a new text file and save it as sample_page.html. We'll use this as example input to our updated ROT13 script.

    <html>
        <head>
            <title>Hello, World!</title>
        </head>
        <body>
            <p>
                Hi there, all of you earthlings.
            </p>
            <p>
                Take us to your leader.
            </p>
        </body>
    </html>

Now, run our rot13-b.py example and provide our HTML document as standard input data. The exact method used will vary with your operating system. If you've entered the code successfully, you should simply see a new prompt.

    $ cat sample_page.html | python rot13-b.py > rot13.html
    $

The contents of rot13.html should be as follows. If that's not the case, double back and make sure everything is correct.

    <ugzy>
        <urnq>
            <gvgyr>Uryyb, Jbeyq!</gvgyr>
        </urnq>
        <obql>
            <c>
                Uv gurer, nyy bs lbh rneguyvatf.
            </c>
            <c>
                Gnxr hf gb lbhe yrnqre.
            </c>
        </obql>
    </ugzy>

Open the translated HTML file using your web browser.

What just happened?

We updated our ROT13 script to read standard input data rather than rely on a command-line option. Doing this provides optimal configurability going forward and lets us feed input of varying length from a collection of different sources. We did this by looping on each line available on the sys.stdin file stream and calling our translation function. We wrote each character returned by that function to the sys.stdout stream. Next, we ran our updated script via the command line, using sample_page.html as input and redirecting the encoded version to rot13.html. As you can see, there is a major problem with our output. We should have a proper page title and our content should be broken down into different paragraphs. Remember, structured markup text is sprinkled with tag elements that define its structure and organization. In this example, we not only translated the text content, we also translated the markup tags, rendering them meaningless. A web browser would not be able to display this data properly.
We'll need to update our processor code to ignore the tags. We'll do just that in the next section.
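The next section of the book is not reproduced here, so what follows is only one possible approach, not the author's implementation. It is a minimal sketch that assumes a tag is everything between a < and the next >, and it uses the standard library's rot_13 codec instead of our own CHAR_MAP. The function name rot13_skip_tags is hypothetical.

```python
import codecs


def rot13_skip_tags(text):
    """ROT13 the text content but copy anything inside <...> unchanged."""
    out = []
    in_tag = False
    for ch in text:
        if ch == "<":
            in_tag = True            # a tag starts here
        if in_tag:
            out.append(ch)           # copy markup as-is
        else:
            out.append(codecs.encode(ch, "rot_13"))
        if ch == ">":
            in_tag = False           # the tag has ended
    return "".join(out)
```

Under these assumptions, rot13_skip_tags('<title>Hello, World!</title>') returns '<title>Uryyb, Jbeyq!</title>'. The sketch is deliberately naive: it does not handle comments, character entities, or a > inside an attribute value, which is one reason to rely on a real markup parser for anything beyond a demonstration.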


Packt

23 Apr 2015

9 min read

Solr Indexing Internals


In this article by Jayant Kumar, author of the book Apache Solr Search Patterns, we will discuss use cases for Solr in e-commerce and job sites. We will look at the problems faced while providing search in an e-commerce or job site:

The e-commerce problem statement
The job site problem statement
Challenges of large-scale indexing

The e-commerce problem statement

E-commerce provides an easy way to sell products to a large customer base. However, there is a lot of competition among multiple e-commerce sites. When users land on an e-commerce site, they expect to find what they are looking for quickly and easily. Often, users are not sure about the brands or the actual products they want to purchase. They have only a very broad idea about what they want to buy. Many customers nowadays search for their products on Google rather than visiting specific e-commerce sites. They believe that Google will take them to the e-commerce sites that have their product.

The purpose of any e-commerce website is to help customers narrow down their broad ideas and enable them to finalize the products they want to purchase. For example, suppose a customer is interested in purchasing a mobile. His or her search for a mobile should list mobile brands, operating systems on mobiles, screen size of mobiles, and all other features as facets. As the customer selects more and more features or options from the facets provided, the search narrows down to a small list of mobiles that suit his or her choice. If the list is small enough and the customer likes one of the mobiles listed, he or she will make the purchase.

The challenge is also that each category will have a different set of facets to be displayed. For example, searching for books should display their format, as in paperback or hardcover, author name, book series, language, and other facets related to books. These facets differ from those for the mobiles that we discussed earlier.
Similarly, each category will have different facets and it needs to be designed properly so that customers can narrow down to their preferred products, irrespective of the category they are looking into. The takeaway from this is that categorization and feature listing of products should be taken care of. Misrepresentation of features can lead to incorrect search results. Another takeaway is that we need to provide multiple facets in the search results. For example, while displaying the list of all mobiles, we need to provide facets for a brand. Once a brand is selected, another set of facets for operating systems, network, and mobile phone features has to be provided. As more and more facets are selected, we still need to show facets within the remaining products. Example of facet selection on Amazon.com Another problem is that we do not know what product the customer is searching for. A site that displays a huge list of products from different categories, such as electronics, mobiles, clothes, or books, needs to be able to identify what the customer is searching for. A customer can be searching for samsung, which can be in mobiles, tablets, electronics, or computers. The site should be able to identify whether the customer has input the author name or the book name. Identifying the input would help in increasing the relevance of the result set by increasing the precision of the search results. Most e-commerce sites provide search suggestions that include the category to help customers target the right category during their search. Amazon, for example, provides search suggestions that include both latest searched terms and products along with category-wise suggestions: Search suggestions on Amazon.com It is also important that products are added to the index as soon as they are available. It is even more important that they are removed from the index or marked as sold out as soon as their stock is exhausted. 
For this, modifications to the index should be immediately visible in the search. This is facilitated by a concept in Solr known as Near Real Time Indexing and Search (NRT). The job site problem statement A job site serves a dual purpose. On the one hand, it provides jobs to candidates, and on the other, it serves as a database of registered candidates' profiles for companies to shortlist. A job search has to be very intuitive for the candidates so that they can find jobs suiting their skills, position, industry, role, and location, or even by the company name. As it is important to keep the candidates engaged during their job search, it is important to provide facets on the abovementioned criteria so that they can narrow down to the job of their choice. The searches by candidates are not very elaborate. If the search is generic, the results need to have high precision. On the other hand, if the search does not return many results, then recall has to be high to keep the candidate engaged on the site. Providing a personalized job search to candidates on the basis of their profiles and past search history makes sense for the candidates. On the recruiter side, the search provided over the candidate database is required to have a huge set of fields to search upon every data point that the candidate has entered. The recruiters are very selective when it comes to searching for candidates for specific jobs. Educational qualification, industry, function, key skills, designation, location, and experience are some of the fields provided to the recruiter during a search. In such cases, the precision has to be high. The recruiter would like a certain candidate and may be interested in more candidates similar to the selected candidate. The more like this search in Solr can be used to provide a search for candidates similar to a selected candidate. 
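As a rough illustration of the facets and the more like this search mentioned above, the following Python 3 sketch only builds request URLs. The facet, facet.field, mlt, and mlt.fl parameters are standard Solr request parameters, but the core URLs and every field name used in the usage note are assumptions made for this example, and whether MoreLikeThis works on a given handler depends on your Solr configuration.

```python
from urllib.parse import urlencode


def faceted_query(select_url, keywords, facet_fields):
    """Build a search URL that also asks Solr for facet counts."""
    params = [("q", keywords), ("facet", "true")]
    # One facet.field parameter per field to narrow down on.
    params += [("facet.field", field) for field in facet_fields]
    return select_url + "?" + urlencode(params)


def more_like_this_query(select_url, doc_id, similarity_fields):
    """Build a URL asking for documents similar to the one with `doc_id`."""
    params = [
        ("q", "id:%s" % doc_id),
        ("mlt", "true"),
        ("mlt.fl", ",".join(similarity_fields)),
    ]
    return select_url + "?" + urlencode(params)
```

For example, a job search could call faceted_query('http://localhost:8983/solr/jobs/select', 'java developer', ['location', 'industry']), and a recruiter-side search could call more_like_this_query with a hypothetical candidates core and fields such as skills and designation.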
NRT is important because the site should be able to return a job or a candidate in search results as soon as either is added to the database by the recruiter or the candidate. The promptness of the site is an important factor in keeping users engaged.

Challenges of large-scale indexing

Let us understand how indexing happens and what can be done to speed it up. We will also look at the challenges faced when indexing a large number of documents or bulky documents. An e-commerce site is a perfect example of a site containing a large number of products, while a job site is an example of a search where documents are bulky because of the content in candidate resumes. During indexing, Solr first analyzes the documents and converts them into tokens that are stored in the RAM buffer. When the RAM buffer is full, data is flushed into a segment on the disk. When the number of segments exceeds the value of the mergeFactor setting in the Solr configuration, the segments are merged. Data is also written to disk when a commit is made in Solr. Let us discuss a few ways to make Solr indexing fast and to handle a large index containing a huge number of documents.

Using multiple threads for indexing on Solr

We can divide our data into smaller chunks and index each chunk in a separate thread. As a rule of thumb, the number of threads should be about twice the number of processor cores to avoid excessive context switching. However, we can increase the number of threads beyond that and check whether performance improves.

Using the Java binary format of data for indexing

Instead of using XML files, we can use the Java binary (javabin) format for indexing. This removes much of the overhead of parsing an XML file and converting it into a usable binary format. To use the javabin format, we write our own program that creates fields, adds fields to documents, and finally adds documents to the index.
Here is some sample code:

    import java.util.ArrayList;
    import java.util.Collection;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    // Create an instance of the Solr server
    String SOLR_URL = "http://localhost:8983/solr";
    SolrServer server = new HttpSolrServer(SOLR_URL);

    // Create the documents to add to the Solr server
    SolrInputDocument doc1 = new SolrInputDocument();
    doc1.addField("id", 1);
    doc1.addField("desc", "description text for doc 1");
    SolrInputDocument doc2 = new SolrInputDocument();
    doc2.addField("id", 2);
    doc2.addField("desc", "description text for doc 2");

    // Collect the documents
    Collection<SolrInputDocument> docs = new ArrayList<SolrInputDocument>();
    docs.add(doc1);
    docs.add(doc2);

    // Add the collection of documents to the Solr server and commit
    server.add(docs);
    server.commit();

The API reference for the HttpSolrServer class is available here: http://lucene.apache.org/solr/4_6_0/solr-solrj/org/apache/solr/client/solrj/impl/HttpSolrServer.html

Add all files from the <solr_directory>/dist folder to the classpath for compiling and running the program.

Using the ConcurrentUpdateSolrServer class for indexing

Using the ConcurrentUpdateSolrServer class instead of the HttpSolrServer class can provide performance benefits, as the former uses buffers to store processed documents before sending them to the Solr server. We can also specify the number of background threads used to empty the buffers. The API docs for ConcurrentUpdateSolrServer are found at the following link: http://lucene.apache.org/solr/4_6_0/solr-solrj/org/apache/solr/client/solrj/impl/ConcurrentUpdateSolrServer.html

The constructor for the ConcurrentUpdateSolrServer class is defined as:

    ConcurrentUpdateSolrServer(String solrServerUrl, int queueSize, int threadCount)

Here, queueSize is the size of the buffer and threadCount is the number of background threads used to flush the buffers to the index on disk. Note that using too many threads can increase the context switching between threads and reduce performance.
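The buffering idea behind queueSize and threadCount can be illustrated with a small, language-neutral sketch. The Python code below is not SolrJ and does not talk to Solr: buffered_index, send_batch, and batch_size are hypothetical names, and send_batch merely stands in for the HTTP update request that a real client would issue.

```python
import queue
import threading


def buffered_index(docs, send_batch, queue_size=100, thread_count=2, batch_size=10):
    """Push documents through a bounded buffer drained by background threads."""
    buffer = queue.Queue(maxsize=queue_size)  # plays the role of queueSize
    sentinel = object()                       # tells a worker to stop

    def worker():
        batch = []
        while True:
            item = buffer.get()
            if item is sentinel:
                break
            batch.append(item)
            if len(batch) >= batch_size:
                send_batch(batch)
                batch = []
        if batch:                             # flush whatever is left
            send_batch(batch)

    threads = [threading.Thread(target=worker) for _ in range(thread_count)]
    for t in threads:
        t.start()
    for doc in docs:
        buffer.put(doc)                       # blocks while the buffer is full
    for _ in threads:
        buffer.put(sentinel)                  # one stop signal per worker
    for t in threads:
        t.join()


# Stand-in for the server: collect everything that was "sent".
sent = []
lock = threading.Lock()


def send_batch(batch):
    with lock:
        sent.extend(batch)


docs = [{"id": i, "desc": "description text for doc %d" % i} for i in range(1, 51)]
buffered_index(docs, send_batch, queue_size=20, thread_count=4, batch_size=8)
print(len(sent))  # 50
```

The producer only blocks when the buffer is full, so document preparation and sending overlap; raising thread_count helps until the extra threads start competing with each other.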
In order to optimize the number of threads, we should monitor performance (docs indexed per minute) after each increase and ensure that there is no decrease in performance.

Summary

In this article, we saw in brief the problems faced by e-commerce and job websites during indexing and search. We discussed the challenges faced while indexing a large number of documents. We also saw some tips on improving the speed of indexing.

Packt

24 Oct 2009

3 min read

Installation and basic features of EnterpriseDB

Installing EnterpriseDB

Download Postgres Plus Advanced Server 8.3 (pgplus-advsvr-windows-83012b.exe, 120 MB) from the following site: http://www.enterprisedb.com/products/download.do

After downloading the program, double-click the executable file. You may need to choose a language from a list; here, English has been chosen. The welcome window is displayed as shown. Click Next. Choose the option you need, reading the notes on this page to make the choice. Here, Oracle compatibility has been chosen. Click Next, or choose a different location by browsing. Here, the default location is accepted. Click Next. The window that shows up displays all the features that are available. Pick the features you want; here, all features are chosen. Click Next. The next window shows links to the JDBC drivers for connecting to Oracle and MySQL. Click Next. In the window that shows up, you need to choose the operating system user ID and password. Read the cautionary remarks on this page. Click Next. At this point, your antivirus program may ask for permission to run the program. McAfee is the antivirus program on this computer. In the window that is displayed, you may need to choose the administrator's login credentials. You may also browse and select the data destination directory. Here, the default is accepted. Click Next. In the window that is displayed, you may choose the type of environment in which the server will be used as well as the workload you expect to place on it. The dynamic tuning options available are:

Server Utilization
Development: This is a development machine and many other applications will be running on it. Stress testing should not be performed with this configuration. EnterpriseDB will use a minimal amount of memory.
Mixed: Several applications will be running on this machine. Choose this option for web/application servers.
Dedicated: This machine is dedicated to running EnterpriseDB, which will use the available memory to optimize performance.

Workload Profile
Transaction Processing: The running application is transaction intensive.
General Purpose: The database will be used for transaction processing as well as complex queries and reporting.
Reporting: The database will be used for reporting applications.

For this tutorial, the Mixed option for Server Utilization and General Purpose for Workload Profile were chosen. Click Next. The Summary page is displayed, showing all the options chosen. Click the Install button. A window with a progress bar is displayed. You may get a warning from the antivirus program on your computer asking you to allow the file to be executed. Click OK to allow the installation.

Packt

11 Mar 2014

7 min read

Self-service reporting

Server 2012 Power View – Self-service Reporting

Self-service reporting is when business users have the ability to create personalized reports and analytical queries without requiring the IT department to get involved. There will be some basic work that the IT department must do, namely creating the various data marts that the reporting tools will use as well as deploying those reporting tools. However, once that is done, IT will be freed from creating reports so that they can work on other tasks. Instead, the people who know the data best, the business users, will be able to build the reports.

Here is a typical scenario that occurs when a self-service reporting solution is not in place: a business user wants a report created, so they fill out a report request that gets routed to IT. The IT department is backlogged with report requests, so it takes them weeks to get back to the user. When they do, they interview the user to get more details about exactly what data the user wants on the report and the look of the report (the business requirements). The IT person may not know the data that well, so they will have to be educated by the user on what the data means. This leads to mistakes in understanding what the user is requesting. The IT person may come away with an incorrect assumption about what data the report should contain or how it should look. Then, the IT person goes back and creates the report. A week or so goes by and they show the user the report. Then, they hear things from the user such as "that is not correct" or "that is not what I meant". The IT person fixes the report and presents it to the user once again. More problems are noticed, fixes are made, and this cycle is repeated four to five times before the report is finally up to the user's satisfaction. In the end, a lot of time has been wasted by the business user and the IT person, and the finished version of the report took far longer than it should have.
This is where a self-service reporting tool such as Power View comes in. It is intuitive and easy enough to use that most business users can start developing reports with little or no training. The interface is visually appealing, which makes report writing more enjoyable. This results in users creating their own reports, thereby empowering businesses to make timely, proactive decisions and explore issues much more effectively than before. In this article, we will cover the major features and functions of Power View, including the setup, various ways to start Power View, data visualizations, the user interface, data models, deploying and sharing reports, multiple views, chart highlighting, slicing, filters, sorting, exporting to PowerPoint, and finally, design tips. We will also talk about PowerPivot and the Business Intelligence Semantic Model (BISM). By the end of the article, you should be able to jump right in and start creating reports.

Getting started

Power View was first introduced as a new integrated reporting feature of SQL Server 2012 (Enterprise or BI Edition) with SharePoint 2010 Enterprise Edition. It has also been built directly into Excel 2013 and made available as an add-in that you can simply enable (although it is not possible to share Power View reports between SharePoint and Excel). Power View allows users to quickly create highly visual and interactive reports via a What You See Is What You Get (WYSIWYG) interface. The following screenshot gives an example of a type of report you can build with Power View, which includes various types of visualizations:

Sales Dashboard

The following screenshot is another example of a Power View report that makes heavy use of slicers along with a bar chart and tables:

Promotion Dashboard

We will start by discussing PowerPivot and BISM and will then go over the setup procedures for the two possible ways to use Power View: through SharePoint or via Excel 2013.
PowerPivot It is important to understand what PowerPivot is and how it relates to Power View. PowerPivot is a data analysis add-on for Microsoft Excel. With it, you can mash large amounts of data together that you can then analyze and aggregate all in one workbook, bypassing the Excel maximum worksheet size of one million rows. It uses a powerful data engine to analyze and query large volumes of data very quickly. There are many data sources that you can use to import data into PowerPivot. Once the data is imported, it becomes part of a data model, which is simply a collection of tables that have relationships between them. Since the data is in Excel, it is immediately available to PivotTables, PivotCharts, and Power View. PowerPivot is implemented in an application window separate from Excel that gives you the ability to do such things as insert and delete columns, format text, hide columns from client tools, change column names, and add images. Once you complete your changes, you have the option of uploading (publishing) the PowerPivot workbook to a PowerPivot Gallery or document library (on a BI site) in SharePoint (a PowerPivot Gallery is a special type of SharePoint document library that provides document and preview management for published Excel workbooks that contain PowerPivot data). This will allow you to share the data model inside PowerPivot with others. To publish your PowerPivot workbook to SharePoint, perform the following steps: Open the Excel file that contains the PowerPivot workbook. Select the File tab on the ribbon. If using Excel 2013, click on Save As and then click on Browse and enter the SharePoint location of the PowerPivot Gallery (see the next screenshot). If using Excel 2010, click on Save & Send, click on Save to SharePoint, and then click on Browse. Click on Save and the file will then be uploaded to SharePoint and immediately be made available to others. 
Saving files to the PowerPivot Gallery

A Power View report can be built from the PowerPivot workbook in the PowerPivot Gallery in SharePoint or from the PowerPivot workbook in an Excel 2013 file.

Business Intelligence Semantic Model

Business Intelligence Semantic Model (BISM) is a new data model that was introduced by Microsoft in SQL Server 2012. It is a single unified BI platform that exposes one model for all end-user experiences. It is a hybrid model that offers two storage implementations: the multidimensional data model (formerly called OLAP) and the tabular data model, which uses the xVelocity engine (formerly called VertiPaq), both of which are hosted in SQL Server Analysis Services (SSAS). The tabular data model uses the same data storage method as PowerPivot, which relies on an in-memory analytics engine to deliver fast access to tabular data. Tabular data models are built using SQL Server Data Tools (SSDT) and can be created from scratch or by importing a PowerPivot data model contained within an Excel workbook. Once the model is complete, it is deployed to an SSAS server instance configured for tabular storage mode to make it available for others to use. This provides a great way to create a self-service BI solution, and then make it a department solution and then an enterprise solution, as shown:

Self-service solution: A business user loads data into PowerPivot and analyzes the data, making improvements along the way.
Department solution: The Excel file that contains the PowerPivot workbook is deployed to a SharePoint site used by the department (in which the active data model actually resides in an SSAS instance and not in the Excel file). Department members use and enhance the data model over time.
Enterprise solution: The PowerPivot data model from the SharePoint site is imported into a tabular data model by the IT department.
Security is added and then the model is deployed to SSAS so the entire company can use it.

Summary

In this article, we learned about the features of Power View and how it is an excellent tool for self-service reporting. We also talked about PowerPivot and how it relates to Power View.

Packt

20 Feb 2018

17 min read

Decision Trees

Packt

10 Nov 2016

6 min read

Data Clustering

In this article by Rodolfo Bonnin, the author of the book Building Machine Learning Projects with TensorFlow, we will start applying data transforming operations. We will begin finding interesting patterns in some given information, discovering groups of data, or clusters, by using clustering techniques. In the process, we'll also gain two new tools: the ability to generate synthetic sample sets from a collection of representative data structures via the scikit-learn library, and the ability to graphically plot our data and model results, this time via the matplotlib library. The topics we will cover in this article are as follows:

Getting an idea of how clustering works, and comparing it to alternative existing classification techniques
Using scikit-learn and matplotlib to enrich the possibilities of dataset choices, and to get a professional-looking graphical representation of the data
Implementing the K-means clustering algorithm
Testing some variations of the K-means method to improve the fit and/or the convergence rate

Three types of learning from data

Based on how we approach the supervision of the samples, we can distinguish three types of learning:

Unsupervised learning: The fully unsupervised approach directly takes a number of undetermined elements and builds a classification of them, looking at different properties that could determine their class
Semi-supervised learning: The semi-supervised approach has a number of known classified items and then applies techniques to discover the class of the remaining items
Supervised learning: In supervised learning, we start from a population of samples whose type is known beforehand, and then build a model from it

Normally there are three sample populations: one from which the model grows, called the training set; one that is used to test the model, called the test set; and then there are the samples for which we will be doing classification.
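The separation between the samples a model is built from and the samples used to test it can be sketched in a few lines of plain Python. This is only an illustration under simple assumptions: split_samples, the 25% test fraction, and the toy labeled data are hypothetical and not taken from the book.

```python
import random


def split_samples(samples, test_fraction=0.25, seed=0):
    """Shuffle labeled samples and split them into (training, test) lists."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Toy labeled samples: ((feature_1, feature_2), label)
labeled = [((x, x % 3), x % 2) for x in range(20)]
training_set, test_set = split_samples(labeled)
print(len(training_set), len(test_set))  # 15 5
```

The model is fitted on the first list only, and the second list is held back to estimate how well it generalizes.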
Types of data learning based on supervision: unsupervised, semi-supervised, and supervised

Unsupervised data clustering

One of the simplest operations that can be initially applied to an unknown dataset is to try to understand the possible grouping or common features that the dataset members have. To do so, we can try to find representative points that summarize a balance of the parameters of the members of each group. This value could be, for example, the mean or the median of all the cluster members. This also leads to the idea of defining a notion of distance between members: all the members of a group should be at shorter distances from each other and from their representative point than from the central points of the other groups. In the following image, we can see the results of a typical clustering algorithm and the representation of the cluster centers:

Sample clustering algorithm output

K-means

K-means is a very well-known clustering algorithm that can be easily implemented. It is very straightforward and can lead (depending on the data layout) to a good initial understanding of the provided information.

Mechanics of K-means

K-means tries to divide a set of samples into K disjoint groups or clusters, using as its main indicator the mean value (be it 1D, 2D, and so on) of the members. This point is normally called the centroid, referring to the arithmetic entity with the same name. One important characteristic of K-means is that K must be provided beforehand, so some previous knowledge of the data is needed to avoid a non-representative result.

Algorithm iteration criterion

The criterion and goal of this method is to minimize the sum of squared distances from each cluster member to the centroid of the cluster that contains it. This is also known as minimization of inertia.
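The inertia criterion described above can be computed directly. The following is a minimal sketch in plain Python; the function names and the four sample points are illustrative assumptions, not code from the book.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points of the same dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def inertia(points, labels, centroids):
    """Sum of squared distances from every point to the centroid of its cluster."""
    return sum(squared_distance(p, centroids[k]) for p, k in zip(points, labels))


points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
centroids = [(0.0, 1.0), (10.0, 1.0)]

print(inertia(points, [0, 0, 1, 1], centroids))  # 4.0
print(inertia(points, [0, 1, 0, 1], centroids))  # 204.0
```

The first assignment groups each point with its nearby centroid and yields a low inertia; the second mixes the groups and is penalized heavily, which is exactly what the minimization is meant to avoid.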
Error minimization criteria for K-means

K-means algorithm breakdown

The mechanism of the K-means algorithm can be summarized in the following graphic:

Simplified flow chart of the K-means process

And this is a simplified summary of the algorithm:

We start with unclassified samples and take K elements as the starting centroids. There are also possible simplifications of this algorithm that take the first K elements in the element list, for the sake of brevity.
We then calculate the distances between the samples and the chosen centroids, assign each sample to its nearest centroid, and so get the first calculated centroids (or other representative values). In the illustration, you can see the centroids moving toward a more intuitive position.
After the centroids change, their displacement will cause the individual distances to change, and so the cluster membership can change.
This is when we recalculate the centroids and repeat the previous steps, as long as the stop condition isn't met.

The stopping conditions could be of various types:

After n iterations. It could be that we chose too large a number and will have unnecessary rounds of computing, or that the algorithm converges slowly and, if the centroids have not yet stabilized, we will get unconvincing results. This stop condition could also be used as a last resort if we have a really long iterative process.
Referring to the previous mean result, a possibly better criterion for the convergence of the iterations is to look at the changes of the centroids, be it in total displacement or in total cluster element switches.
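The steps and stop conditions above can be sketched as a short, dependency-free implementation. This is a simplified illustration rather than the book's TensorFlow code: it starts from the first K samples, stops when no sample changes cluster, and uses max_iterations as the last-resort limit; all names and the sample points are assumptions made for the example.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(members):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(members) for coords in zip(*members))


def kmeans(points, k, max_iterations=100):
    # Simplification: take the first k samples as the starting centroids.
    centroids = [tuple(p) for p in points[:k]]
    labels = [None] * len(points)
    for _ in range(max_iterations):       # last-resort stop condition
        # Assign every sample to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:          # no sample switched cluster: converged
            break
        labels = new_labels
        # Recalculate each centroid as the mean of its current members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:                   # keep the old centroid if a cluster empties
                centroids[c] = mean_point(members)
    return centroids, labels


points = [(1.0, 1.0), (9.0, 9.0), (1.5, 2.0), (8.0, 8.0), (1.0, 0.6), (9.0, 11.0)]
centroids, labels = kmeans(points, k=2)
print(labels)  # [0, 1, 0, 1, 0, 1]
```

Because the starting centroids are simply the first K samples, different orderings of the same data can converge to different clusterings, which is one reason several initializations are usually compared in practice.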
The criterion based on cluster element switches is the one normally employed, so we stop the process once no element changes cluster:

K-means simplified graphic

Pros and cons of K-means

The advantages of this method are:

It scales very well (most of the calculations can be run in parallel)
It has been used in a very large range of applications

But its simplicity also has a price (the no silver bullet rule applies):

It requires a priori knowledge (the number of possible clusters should be known beforehand)
Outliers can pull the centroids away, as they carry the same weight as any other sample
Because it assumes that clusters are convex and isotropic, it doesn't work well with clusters that are not roughly circular

Summary

In this article, we got a simple overview of some of the most basic models we can implement, but we tried to be as detailed in the explanation as possible. From now on, we are able to generate synthetic datasets, allowing us to rapidly test the adequacy of a model for different data configurations and so evaluate the advantages and shortcomings of each without having to load models with a greater number of unknown characteristics.

You can also refer to the following books on similar topics:

Getting Started with TensorFlow: https://www.packtpub.com/big-data-and-business-intelligence/getting-started-tensorflow
R Machine Learning Essentials: https://www.packtpub.com/big-data-and-business-intelligence/r-machine-learning-essentials
Building Machine Learning Systems with Python - Second Edition: https://www.packtpub.com/big-data-and-business-intelligence/building-machine-learning-systems-python-second-edition

Packt

15 Jul 2014

6 min read

Securing the WAL Stream

(For more resources related to this topic, see here.) The primary mechanism that PostgreSQL uses to provide a data durability guarantee is through its Write Ahead Log (WAL). All transactional data is written to this location before ever being committed to database files. Once WAL files are no longer necessary for crash recovery, PostgreSQL will either delete or archive them. For the purposes of a highly available server, we recommend that you keep these important files as long as possible. There are several reasons for this; they are as follows: Archived WAL files can be used for Point In Time Recovery (PITR) If you are using streaming replication, interrupted streams can be re-established by applying WAL files until the replica has caught up WAL files can be reused to service multiple server copies In order to gain these benefits, we need to enable PostgreSQL WAL archiving and save these files until we no longer need them. This section will address our recommendations for long term storage of WAL files. Getting ready In order to properly archive WAL files, we recommend that you provision a server dedicated to backups or file storage. Depending on the transaction volume, an active PostgreSQL database might produce thousands of these on a daily basis. At 16 MB apiece, this is not an idle concern. For instance, for a 1 TB database, we recommend at least 3 TB of storage space. In addition, we will be using rsync as a daemon on this archive server. To install this on a Debian-based server, execute the following command as a root-level user: sudo apt-get install rsync Red-Hat-based systems will need this command instead: sudo yum install rsync xinetd How to do it... Our archive server has a 3 TB mount at the /db directory and is named arc_server on our network. The PostgreSQL source server resides at 192.168.56.10. Follow these steps for long-term storage of important WAL files on an archive server Enable rsync to run as a daemon on the archive server. 
On Debian-based systems, edit the /etc/default/rsync file and change the RSYNC_ENABLE variable to true. On Red-Hat-based systems, edit the /etc/xinetd.d/rsync file and change the disable parameter to no.

Create a directory, owned by the postgres user, to store archived WAL files with these commands:

    sudo mkdir /db/pg_archived
    sudo chown postgres:postgres /db/pg_archived

Create a file named /etc/rsyncd.conf and fill it with the following contents:

    [wal_store]
    path = /db/pg_archived
    comment = DB WAL Archives
    uid = postgres
    gid = postgres
    read only = false
    hosts allow = 192.168.56.10
    hosts deny = *

Start the rsync daemon. Debian-based systems should execute the following command:

    sudo service rsync start

Red-Hat-based systems can start rsync with this command instead:

    sudo service xinetd start

Change the archive_mode and archive_command parameters in postgresql.conf to read the following:

    archive_mode = on
    archive_command = 'rsync -aq %p arc_server::wal_store/%f'

Restart the PostgreSQL server with a command similar to this:

    pg_ctl -D $PGDATA restart

How it works

The rsync utility is normally used to transfer files between two servers. However, we can take advantage of running it as a daemon to avoid the connection overhead imposed by using SSH as the rsync transport. Our first step is to ensure that the service is not disabled in some manner, which would make the rest of this guide moot. Next, we need a place to store archived WAL files on the archive server. Assuming that we have 3 TB of space in the /db directory, we simply claim /db/pg_archived as the desired storage location. There should be enough space to use /db for backups as well, but we won't discuss that here. Next, we create a file named /etc/rsyncd.conf, which configures how rsync operates as a daemon. Here, we name the /db/pg_archived directory wal_store so that we can address the path by its name when sending files.
We give it a human-readable name and ensure that files are owned by the postgres user, as this user also controls most of the PostgreSQL-related services. The next, and possibly the most important step, is to block all hosts but the primary PostgreSQL server from writing to this location. We set hosts deny to *, which blocks every server. Then, we set hosts allow to the primary database server's IP address so that only it has access. If everything goes well, we can start the rsync (or xinetd on Red Hat systems) service and we can see that in the following screenshot: Next, we enable archive_mode by setting it to on. With archive mode enabled, we can specify a command that will execute when PostgreSQL no longer needs a WAL file for crash recovery. In this case, we invoke the rsync command with the -a parameter to preserve elements such as file ownership, timestamps, and so on. In addition, we specify the -q setting to suppress output, as PostgreSQL only checks the command exit status to determine its success. In the archive_command setting, the %p value represents the full path to the WAL file, and %f resolves to the filename. In this context, we're sending the WAL file to the archive server at the wal_store module we defined in rsyncd.conf. Once we restart PostgreSQL, it will start storing all the old WAL files by sending them to the archive server. In case any rsync command fails because the archive server is unreachable, PostgreSQL will keep trying to send it until it is successful. If the archive server is unreachable for too long, we suggest that you change the archive_command setting to store files elsewhere. This prevents accidentally overfilling the PostgreSQL server storage. There's more... As we will likely want to use the WAL files on other servers, we suggest that you make a list of all the servers that could need WAL files. 
Then, modify the rsyncd.conf file on the archive server and add this section:

    [wal_fetch]
    path = /db/pg_archived
    comment = DB WAL Archive Retrieval
    uid = postgres
    gid = postgres
    read only = true
    hosts allow = host1, host2, host3
    hosts deny = *

Now, we can fetch WAL files from any of the hosts in hosts allow. As these are dedicated PostgreSQL replicas, recovery servers, or other defined roles, this makes the archive server a central location for all our WAL needs.

See also

We suggest that you read more about the archive_mode and archive_command settings on the PostgreSQL site. We've included a link here: http://www.postgresql.org/docs/9.3/static/runtime-config-wal.html

The rsyncd.conf file also has its own manual page. Read it with this command to learn more about the available settings:

    man rsyncd.conf

Summary

In this article, we learned how to secure the WAL stream by archiving WAL files to a dedicated server with rsync.
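As a recap of how the %p and %f placeholders in archive_command behave, the substitution can be imitated in a few lines. This is only a sketch of the documented placeholder rules, not PostgreSQL's own implementation: expand_archive_command is a hypothetical helper, and the WAL path used below is an invented example.

```python
import os


def expand_archive_command(template, wal_path):
    """Replace %p, %f and %% in an archive_command-style template."""
    replacements = {
        "p": wal_path,                    # path of the WAL file to archive
        "f": os.path.basename(wal_path),  # file name only
        "%": "%",                         # %% yields a literal percent sign
    }
    out, i = [], 0
    while i < len(template):
        ch = template[i]
        if ch == "%" and i + 1 < len(template) and template[i + 1] in replacements:
            out.append(replacements[template[i + 1]])
            i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)


command = expand_archive_command(
    "rsync -aq %p arc_server::wal_store/%f",
    "pg_xlog/000000010000000000000001",   # hypothetical WAL segment
)
print(command)
# rsync -aq pg_xlog/000000010000000000000001 arc_server::wal_store/000000010000000000000001
```

Printing the expanded command like this is a convenient way to check what would actually be executed before a new archive_command is put into postgresql.conf.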

Packt

09 Sep 2015

22 min read

Sabermetrics with Apache Spark

In this article by Rindra Ramamonjison, the author of the book Apache Spark Graph Processing, we will gain useful insights that are required to quickly process big data and handle its complexities. It is no secret that analytics has made a big impact in sports. The quest for an objective understanding of the game even has a name: "sabermetrics". Analytics has proven invaluable in many aspects, from building dream teams under tight cap constraints, to selecting game-specific strategies, to actively engaging with fans, and so on. In the following sections, we will analyze NCAA Men's college basketball game stats, gathered during a single season. As sports data experts, we are going to leverage Spark's graph processing library to answer several questions in retrospect. Apache Spark is a fast, general-purpose technology, which greatly simplifies the parallel processing of large data that is distributed over a computing cluster. While Spark handles different types of processing, here, we will focus on its graph-processing capability. In particular, our goal is to expose the powerful yet generic graph-aggregation operator of Spark, aggregateMessages. We can think of this operator as a version of MapReduce for aggregating the neighborhood information in graphs. In fact, many graph-processing algorithms, such as PageRank, rely on iteratively accessing the properties of neighboring vertices and adjacent edges. By applying aggregateMessages on the NCAA College Basketball datasets, we will:

Identify the basic mechanisms and understand the patterns for using aggregateMessages
Apply aggregateMessages to create custom graph aggregation operations
Optimize the performance and efficiency of aggregateMessages

NCAA College Basketball datasets

As an illustrative example, the NCAA College Basketball datasets consist of two CSV datasets.
The first one, called teams.csv, contains the list of all the college teams that played in the NCAA Division I competition. Each team is associated with a 4-digit ID number. The second dataset, called stats.csv, contains the score and statistics of every game played during the 2014-2015 regular season.

Loading team data into RDDs

To start with, we parse and load these datasets into RDDs (Resilient Distributed Datasets), which are the core Spark abstraction for any data that is distributed and stored over a cluster. First, we create a class called GameStats that records a team's statistics during a game:

    case class GameStats(
      val score: Int,
      val fieldGoalMade: Int,
      val fieldGoalAttempt: Int,
      val threePointerMade: Int,
      val threePointerAttempt: Int,
      val threeThrowsMade: Int,
      val threeThrowsAttempt: Int,
      val offensiveRebound: Int,
      val defensiveRebound: Int,
      val assist: Int,
      val turnOver: Int,
      val steal: Int,
      val block: Int,
      val personalFoul: Int
    )

Loading game stats into RDDs

We also add the following methods to GameStats in order to know how efficient a team's offense was. Note that the free-throw counts are stored in the fields named threeThrowsMade and threeThrowsAttempt:

    // Field goal percentage
    def fgPercent: Double = 100.0 * fieldGoalMade / fieldGoalAttempt

    // Three point percentage
    def tpPercent: Double = 100.0 * threePointerMade / threePointerAttempt

    // Free throw percentage
    def ftPercent: Double = 100.0 * threeThrowsMade / threeThrowsAttempt

    override def toString: String = "Score: " + score

Next, we create a couple of classes for the game results:

    abstract class GameResult(
      val season: Int,
      val day: Int,
      val loc: String
    )

    case class FullResult(
      override val season: Int,
      override val day: Int,
      override val loc: String,
      val winnerStats: GameStats,
      val loserStats: GameStats
    ) extends GameResult(season, day, loc)

FullResult has the year and day of the season, the location where the game was played, and the game statistics of both the winning and losing teams. Next, we will create a statistics graph of the regular season.
In this graph, the nodes are the teams, whereas each edge corresponds to a specific game. To create the graph, let's parse the CSV file called teams.csv into the RDD teams:

    val teams: RDD[(VertexId, String)] =
      sc.textFile("./data/teams.csv").
        filter(! _.startsWith("#")).
        map { line =>
          val row = line split ','
          (row(0).toInt, row(1))
        }

We can check the first few teams in this new RDD:

    scala> teams.take(3).foreach{println}
    (1101,Abilene Chr)
    (1102,Air Force)
    (1103,Akron)

We do the same thing to obtain an RDD of the game results, which will have the type RDD[Edge[FullResult]]. We just parse stats.csv and record the fields that we need:

- The ID of the winning team
- The ID of the losing team
- The game statistics of both teams

    val detailedStats: RDD[Edge[FullResult]] =
      sc.textFile("./data/stats.csv").
        filter(! _.startsWith("#")).
        map { line =>
          val row = line split ','
          Edge(row(2).toInt, row(4).toInt,
            FullResult(
              row(0).toInt, row(1).toInt, row(6),
              // Statistics of the winning team
              GameStats(
                score = row(3).toInt,
                fieldGoalMade = row(8).toInt,
                fieldGoalAttempt = row(9).toInt,
                threePointerMade = row(10).toInt,
                threePointerAttempt = row(11).toInt,
                threeThrowsMade = row(12).toInt,
                threeThrowsAttempt = row(13).toInt,
                offensiveRebound = row(14).toInt,
                defensiveRebound = row(15).toInt,
                assist = row(16).toInt,
                turnOver = row(17).toInt,
                steal = row(18).toInt,
                block = row(19).toInt,
                personalFoul = row(20).toInt
              ),
              // Statistics of the losing team
              GameStats(
                score = row(5).toInt,
                fieldGoalMade = row(21).toInt,
                fieldGoalAttempt = row(22).toInt,
                threePointerMade = row(23).toInt,
                threePointerAttempt = row(24).toInt,
                threeThrowsMade = row(25).toInt,
                threeThrowsAttempt = row(26).toInt,
                offensiveRebound = row(27).toInt,
                defensiveRebound = row(28).toInt,
                assist = row(29).toInt,
                turnOver = row(30).toInt,
                steal = row(31).toInt,
                block = row(32).toInt,
                personalFoul = row(33).toInt
              )
            )
          )
        }

We can avoid typing all this by using the nice spark-csv package that reads CSV files into SchemaRDD.
Let's check what we got:

    scala> detailedStats.take(3).foreach(println)
    Edge(1165,1384,FullResult(2006,8,N,Score: 75-54))
    Edge(1393,1126,FullResult(2006,8,H,Score: 68-37))
    Edge(1107,1324,FullResult(2006,9,N,Score: 90-73))

We then create our score graph using the collection of teams (of the type RDD[(VertexId, String)]) as vertices, and the collection called detailedStats (of the type RDD[Edge[FullResult]]) as edges:

    scala> val scoreGraph = Graph(teams, detailedStats)

Out of curiosity, let's see which teams won against the 2015 NCAA national champion Duke during the regular season. It seems Duke lost only four games during the regular season:

    scala> scoreGraph.triplets.filter(_.dstAttr == "Duke").foreach(println)
    ((1274,Miami FL),(1181,Duke),FullResult(2015,71,A,Score: 90-74))
    ((1301,NC State),(1181,Duke),FullResult(2015,69,H,Score: 87-75))
    ((1323,Notre Dame),(1181,Duke),FullResult(2015,86,H,Score: 77-73))
    ((1323,Notre Dame),(1181,Duke),FullResult(2015,130,N,Score: 74-64))

Aggregating game stats

Now that our graph is ready, let's start aggregating the stats data in scoreGraph. In Spark, aggregateMessages is the operator for this kind of job. For example, let's find out the average field goals made per game by the winners. In other words, the games that a team has lost will not be counted. To get the average for each team, we first need the number of games won by the team and the total field goals that the team made in these games:

    // Aggregate the total field goals made by winning teams
    type Msg = (Int, Int)
    type Context = EdgeContext[String, FullResult, Msg]
    val winningFieldGoalMade: VertexRDD[Msg] = scoreGraph aggregateMessages(
      // sendMsg
      (ec: Context) => ec.sendToSrc((1, ec.attr.winnerStats.fieldGoalMade)),
      // mergeMsg
      (x: Msg, y: Msg) => (x._1 + y._1, x._2 + y._2)
    )

The aggregateMessages operator

There is a lot going on in the previous call to aggregateMessages. So, let's see it working in slow motion.
When we called aggregateMessages on the scoreGraph, we had to pass two functions as arguments.

sendMsg

The first function has the signature EdgeContext[VD, ED, Msg] => Unit. It takes an EdgeContext as input. Since it does not return anything, its return type is Unit. This function is needed for sending messages between the nodes.

Okay, but what is the EdgeContext type? EdgeContext represents an edge along with its neighboring nodes. It can access both the edge attribute and the source and destination nodes' attributes. In addition, EdgeContext has two methods to send messages along the edge to its source node or to its destination node. These methods are called sendToSrc and sendToDst respectively. The type of the messages being sent through the graph is defined by Msg. Similar to vertex and edge types, we can define the concrete type that Msg takes as we wish.

mergeMsg

In addition to sendMsg, the second function that we need to pass to aggregateMessages is a mergeMsg function with the (Msg, Msg) => Msg signature. As its name implies, mergeMsg is used to merge two messages received at a node into a new one. Its output must also be of the Msg type. Using these two functions, aggregateMessages returns the aggregated messages inside a VertexRDD[Msg].

Example

In our example, we need to aggregate the number of games won and the number of field goals made in those games. Therefore, Msg is simply a pair of Int. Furthermore, each edge context needs to send a message only to its source node, that is, the winning team. This is because we want to compute the total field goals made by each team for only the games that it has won. The actual message sent to each "winner" node is the pair of integers (1, ec.attr.winnerStats.fieldGoalMade). Here, 1 serves as a counter for the number of games won by the source node. The second integer, which is the number of field goals in one game, is extracted from the edge attribute.
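To make the send-then-merge flow described above concrete, here is a minimal plain-Python model of it. This is only a sketch of the semantics, not GraphX code: the `Edge` tuple, the `aggregate_messages` helper, and the three games are all hypothetical stand-ins invented for illustration.

```python
from collections import namedtuple

# Hypothetical, simplified edge: winner id, loser id, and field goals made by each side.
Edge = namedtuple("Edge", ["src", "dst", "winner_fg", "loser_fg"])


def aggregate_messages(edges, send_msg, merge_msg):
    """Toy model: run send_msg on every edge, merging messages that reach the same vertex."""
    inbox = {}

    def deliver(vertex_id, msg):
        # The first message is stored as-is; later ones are combined with merge_msg.
        inbox[vertex_id] = merge_msg(inbox[vertex_id], msg) if vertex_id in inbox else msg

    for edge in edges:
        send_msg(edge, deliver)
    return inbox


# Made-up games: team 1 beats teams 2 and 3, team 2 beats team 3.
edges = [Edge(1, 2, 30, 20), Edge(1, 3, 26, 25), Edge(2, 3, 24, 22)]

totals = aggregate_messages(
    edges,
    # sendMsg: only the winner (source) receives (1, field goals made)
    send_msg=lambda e, deliver: deliver(e.src, (1, e.winner_fg)),
    # mergeMsg: add the counters and the totals component-wise
    merge_msg=lambda x, y: (x[0] + y[0], x[1] + y[1]),
)

# Equivalent of the mapValues step: total / count per team.
averages = {team: total / count for team, (count, total) in totals.items()}
print(totals)
print(averages)
```

Team 3 never appears in the result because it never receives a message, which mirrors how a vertex without incoming messages is absent from the aggregated output.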
As we set out to compute the average field goals per winning game for all teams, we need to apply the mapValues operator to the output of aggregateMessages, as follows:

    // Average field goals made per game by the winning teams
    val avgWinningFieldGoalMade: VertexRDD[Double] =
      winningFieldGoalMade mapValues (
        (id: VertexId, x: Msg) => x match {
          case (count: Int, total: Int) => total.toDouble / count
        })

Here is the output:

    scala> avgWinningFieldGoalMade.take(5).foreach(println)
    (1260,24.71641791044776)
    (1410,23.56578947368421)
    (1426,26.239436619718308)
    (1166,26.137614678899084)
    (1434,25.34285714285714)

Abstracting out the aggregation

This was kind of cool! We can surely do the same thing for the average points per game scored by the winning teams:

    // Aggregate the points scored by winning teams
    val winnerTotalPoints: VertexRDD[(Int, Int)] = scoreGraph.aggregateMessages(
      // sendMsg
      triplet => triplet.sendToSrc((1, triplet.attr.winnerStats.score)),
      // mergeMsg
      (x, y) => (x._1 + y._1, x._2 + y._2)
    )

    // Average points per game scored by winning teams
    val winnersPPG: VertexRDD[Double] =
      winnerTotalPoints mapValues (
        (id: VertexId, x: (Int, Int)) => x match {
          case (count: Int, total: Int) => total.toDouble / count
        })

Let's check the output:

    scala> winnersPPG.take(5).foreach(println)
    (1260,71.19402985074628)
    (1410,71.11842105263158)
    (1426,76.30281690140845)
    (1166,76.89449541284404)
    (1434,74.28571428571429)

What if the coach wants to know the top five teams with the highest average three pointers made per winning game? By the way, he might also ask about the teams that are the most efficient in three pointers.

Keeping things DRY

We can copy and modify the previous code, but that would be quite repetitive. Instead, let's abstract out the average aggregation operator so that it can work on any statistics that the coach needs. Luckily, Scala's higher-order functions are there to help in this task.
Let's define the functions that take a team's GameStats as input and return the specific statistic that we are interested in. For now, we will need the number of three pointers made and the three-point percentage:

    // Getting individual stats
    def threePointMade(stats: GameStats) = stats.threePointerMade
    def threePointPercent(stats: GameStats) = stats.tpPercent

Then, we create a generic function that takes as input a stats graph and one of the functions defined previously, which has the signature GameStats => Double:

    // Generic function for stats averaging
    def averageWinnerStat(graph: Graph[String, FullResult])(getStat: GameStats => Double): VertexRDD[Double] = {
      type Msg = (Int, Double)
      val winningScore: VertexRDD[Msg] = graph.aggregateMessages[Msg](
        // sendMsg
        triplet => triplet.sendToSrc((1, getStat(triplet.attr.winnerStats))),
        // mergeMsg
        (x, y) => (x._1 + y._1, x._2 + y._2)
      )
      winningScore mapValues (
        (id: VertexId, x: Msg) => x match {
          case (count: Int, total: Double) => total / count
        })
    }

Now, we can get the average stats by passing the threePointMade and threePointPercent functions to averageWinnerStat:

    val winnersThreePointMade = averageWinnerStat(scoreGraph)(threePointMade)
    val winnersThreePointPercent = averageWinnerStat(scoreGraph)(threePointPercent)

With little effort, we can tell the coach which five winning teams score the highest number of threes per game:

    scala> winnersThreePointMade.sortBy(_._2,false).take(5).foreach(println)
    (1440,11.274336283185841)
    (1125,9.521929824561404)
    (1407,9.008849557522124)
    (1172,8.967441860465117)
    (1248,8.915384615384616)

While we are at it, let's find out the five most efficient teams in three pointers:

    scala> winnersThreePointPercent.sortBy(_._2,false).take(5).foreach(println)
    (1101,46.90555728464225)
    (1147,44.224282479431224)
    (1294,43.754532434101534)
    (1339,43.52308905887638)
    (1176,43.080814169045105)

Interestingly, the teams that made the most three pointers per winning game are not always the ones that are the most efficient at it. But that is okay because at least they have won these games.

Coach wants more numbers

The coach is not convinced by this argument. He asks us to get the same statistics, but he wants the average over all the games that each team has played. We then have to aggregate the information at all the nodes, and not only at the source nodes (the winners). To make our previous abstraction more flexible, let's create the following types:

    trait Teams
    case class Winners() extends Teams
    case class Losers() extends Teams
    case class AllTeams() extends Teams

We modify the previous higher-order function to have an extra argument of the type Teams, which will help us specify the nodes where we want to collect and aggregate the required game stats. The new function becomes the following:

    def averageStat(graph: Graph[String, FullResult])(getStat: GameStats => Double, tms: Teams): VertexRDD[Double] = {
      type Msg = (Int, Double)
      val aggrStats: VertexRDD[Msg] = graph.aggregateMessages[Msg](
        // sendMsg
        tms match {
          case _ : Winners => t => t.sendToSrc((1, getStat(t.attr.winnerStats)))
          case _ : Losers => t => t.sendToDst((1, getStat(t.attr.loserStats)))
          case _ => t => {
            t.sendToSrc((1, getStat(t.attr.winnerStats)))
            t.sendToDst((1, getStat(t.attr.loserStats)))
          }
        },
        // mergeMsg
        (x, y) => (x._1 + y._1, x._2 + y._2)
      )
      aggrStats mapValues (
        (id: VertexId, x: Msg) => x match {
          case (count: Int, total: Double) => total / count
        })
    }

Now, averageStat allows us to choose whether we want to aggregate the stats for winners only, for losers only, or for all the teams. Since the coach wants the overall stats averaged over all the games played, we aggregate the stats by passing the AllTeams() flag to averageStat. In this case, we define the sendMsg argument in aggregateMessages to send the required stats to both the source (the winner) and the destination (the loser) using the EdgeContext class's sendToSrc and sendToDst functions respectively. This mechanism is pretty straightforward.
We just need to make sure that we send the right information to the right node. In this case, we send winnerStats to the winner and loserStats to the loser. Okay, you get the idea now. So, let's apply it to please our coach. Here are the teams with the overall highest three pointers per game:

    // Average three pointers made per game for all teams
    val allThreePointMade = averageStat(scoreGraph)(threePointMade, AllTeams())

    scala> allThreePointMade.sortBy(_._2, false).take(5).foreach(println)
    (1440,10.180811808118081)
    (1125,9.098412698412698)
    (1172,8.575657894736842)
    (1184,8.428571428571429)
    (1407,8.411149825783973)

And here are the five most efficient teams overall in three pointers per game:

    // Average three point percentage for all teams
    val allThreePointPercent = averageStat(scoreGraph)(threePointPercent, AllTeams())

Let's check the output:

    scala> allThreePointPercent.sortBy(_._2,false).take(5).foreach(println)
    (1429,38.8351815824302)
    (1323,38.522819895594)
    (1181,38.43052051444854)
    (1294,38.41227053353959)
    (1101,38.097896464168954)

Actually, there is only a 2 percent difference between the most efficient team and the one in the fiftieth position. Most NCAA teams are therefore pretty efficient behind the line. I bet the coach knew this already!

Average points per game

We can also reuse the averageStat function to get the average points per game for the winners. The score argument used here is assumed to be a helper defined in the same way as threePointMade, that is, a function that returns stats.score for a given GameStats. In particular, let's take a look at the two teams that won games with the highest and lowest scores:

    // Winning teams
    val winnerAvgPPG = averageStat(scoreGraph)(score, Winners())

Let's check the output:

    scala> winnerAvgPPG.max()(Ordering.by(_._2))
    res36: (org.apache.spark.graphx.VertexId, Double) = (1322,90.73333333333333)

    scala> winnerAvgPPG.min()(Ordering.by(_._2))
    res39: (org.apache.spark.graphx.VertexId, Double) = (1197,60.5)

Apparently, the most defensive team can win games by scoring only 60 points, whereas the most offensive team can score an average of 90 points.
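The receiver selection that averageStat performs (winners only, losers only, or both ends of each game) can be modelled in a few lines of plain Python. This is a sketch under stated assumptions, not the article's Scala code: the `average_stat` function, the `"winners"`/`"losers"`/`"all"` strings, and the two games are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
def average_stat(games, get_stat, teams="all"):
    """Average a per-game statistic per team.

    games: list of (winner_id, loser_id, winner_stats, loser_stats) tuples.
    teams: "winners", "losers" or "all"; decides which end of each game receives a message.
    """
    acc = {}  # team id -> (games counted, running total)

    def send(team, value):
        count, total = acc.get(team, (0, 0.0))
        acc[team] = (count + 1, total + value)

    for winner, loser, winner_stats, loser_stats in games:
        if teams in ("winners", "all"):
            send(winner, get_stat(winner_stats))  # winner gets its own stats
        if teams in ("losers", "all"):
            send(loser, get_stat(loser_stats))    # loser gets its own stats
    return {team: total / count for team, (count, total) in acc.items()}


def score(stats):
    return stats["score"]


# Made-up data: team 1 beats team 2 by 80-70, then team 2 beats team 1 by 60-50.
games = [
    (1, 2, {"score": 80}, {"score": 70}),
    (2, 1, {"score": 60}, {"score": 50}),
]

winners_ppg = average_stat(games, score, "winners")
losers_ppg = average_stat(games, score, "losers")
all_ppg = average_stat(games, score, "all")
print(winners_ppg, losers_ppg, all_ppg)
```

In this toy data both teams average 65 points over all games, even though their averages in wins and in losses differ, which is exactly the distinction the coach is asking about.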
Next, let's average the points per game over all the games played and look at the two teams with the best and worst offense during the 2015 season:

    // Average points per game of all teams
    val allAvgPPG = averageStat(scoreGraph)(score, AllTeams())

Let's see the output:

    scala> allAvgPPG.max()(Ordering.by(_._2))
    res42: (org.apache.spark.graphx.VertexId, Double) = (1322,83.81481481481481)

    scala> allAvgPPG.min()(Ordering.by(_._2))
    res43: (org.apache.spark.graphx.VertexId, Double) = (1212,51.111111111111114)

To no one's surprise, the best offensive team is the same as the one that scores the most in winning games. An average of about 50 points per game is simply not enough for a team to win games.

Defense stats: the D matters as in direction

Previously, we obtained statistics, such as the field goals or the three-point percentage, that a team achieves. What if we instead want to aggregate the average points or rebounds that each team concedes to its opponents? To compute this, we define a new higher-order function called averageConcededStat. Compared to averageStat, this function needs to send loserStats to the winning team and winnerStats to the losing team.
To make things more interesting, we are going to make the team name a part of the message Msg:

    def averageConcededStat(graph: Graph[String, FullResult])(getStat: GameStats => Double, rxs: Teams): VertexRDD[(String, Double)] = {
      type Msg = (Int, Double, String)
      val aggrStats: VertexRDD[Msg] = graph.aggregateMessages[Msg](
        // sendMsg
        rxs match {
          case _ : Winners => t => t.sendToSrc((1, getStat(t.attr.loserStats), t.srcAttr))
          case _ : Losers => t => t.sendToDst((1, getStat(t.attr.winnerStats), t.dstAttr))
          case _ => t => {
            t.sendToSrc((1, getStat(t.attr.loserStats), t.srcAttr))
            t.sendToDst((1, getStat(t.attr.winnerStats), t.dstAttr))
          }
        },
        // mergeMsg
        (x, y) => (x._1 + y._1, x._2 + y._2, x._3)
      )
      aggrStats mapValues (
        (id: VertexId, x: Msg) => x match {
          case (count: Int, total: Double, name: String) => (name, total / count)
        })
    }

With this, we can calculate the average points conceded by the winning and losing teams as follows:

    val winnersAvgConcededPoints = averageConcededStat(scoreGraph)(score, Winners())
    val losersAvgConcededPoints = averageConcededStat(scoreGraph)(score, Losers())

Let's check the output:

    scala> losersAvgConcededPoints.min()(Ordering.by(_._2))
    res: (VertexId, (String, Double)) = (1101,(Abilene Chr,74.04761904761905))

    scala> winnersAvgConcededPoints.min()(Ordering.by(_._2))
    res: (org.apache.spark.graphx.VertexId, (String, Double)) = (1101,(Abilene Chr,74.04761904761905))

    scala> losersAvgConcededPoints.max()(Ordering.by(_._2))
    res: (VertexId, (String, Double)) = (1464,(Youngstown St,78.85714285714286))

    scala> winnersAvgConcededPoints.max()(Ordering.by(_._2))
    res: (VertexId, (String, Double)) = (1464,(Youngstown St,71.125))

The previous output tells us that Abilene Christian University is the most defensive team. They concede the fewest points whether they win a game or not. On the other hand, Youngstown has the worst defense.

Joining aggregated stats into graphs

The previous example shows us how flexible the aggregateMessages operator is.
We can define the Msg type of the messages to be aggregated to fit our needs. Moreover, we can select which nodes receive the messages. Finally, we can also define how we want to merge the messages. As a final example, let's aggregate many statistics about each team and join this information into the nodes of the graph. To start, we create a dedicated class for the team stats:

    // Average stats of all teams
    case class TeamStat(
        wins: Int = 0      // Number of wins
       ,losses: Int = 0    // Number of losses
       ,ppg: Int = 0       // Points per game
       ,pcg: Int = 0       // Points conceded per game
       ,fgp: Double = 0    // Field goal percentage
       ,tpp: Double = 0    // Three point percentage
       ,ftp: Double = 0    // Free throw percentage
    ){
      override def toString = wins + "-" + losses
    }

Then, we collect the average stats for all teams using aggregateMessages. For this, we define the type of the message to be an 8-element tuple that holds the counters for games played, wins, and losses, followed by the other statistics that will be stored in TeamStat as listed previously:

    type Msg = (Int, Int, Int, Int, Int, Double, Double, Double)

    val aggrStats: VertexRDD[Msg] = scoreGraph.aggregateMessages(
      // sendMsg
      t => {
        t.sendToSrc((
          1, 1, 0,
          t.attr.winnerStats.score,
          t.attr.loserStats.score,
          t.attr.winnerStats.fgPercent,
          t.attr.winnerStats.tpPercent,
          t.attr.winnerStats.ftPercent
        ))
        t.sendToDst((
          1, 0, 1,
          t.attr.loserStats.score,
          t.attr.winnerStats.score,
          t.attr.loserStats.fgPercent,
          t.attr.loserStats.tpPercent,
          t.attr.loserStats.ftPercent
        ))
      },
      // mergeMsg
      (x, y) => (
        x._1 + y._1, x._2 + y._2, x._3 + y._3, x._4 + y._4,
        x._5 + y._5, x._6 + y._6, x._7 + y._7, x._8 + y._8
      )
    )

Given the aggregated messages called aggrStats, we map them into a collection of TeamStat:

    val teamStats: VertexRDD[TeamStat] = aggrStats mapValues {
      (id: VertexId, m: Msg) => m match {
        case (count: Int, wins: Int, losses: Int, totPts: Int, totConcPts: Int,
              totFG: Double, totTP: Double, totFT: Double) =>
          TeamStat(wins, losses, totPts/count, totConcPts/count,
                   totFG/count, totTP/count, totFT/count)
      }
    }

Next, let's join teamStats into the graph. For this, we first create a class called Team as a new type for the vertex attribute. Team will have a name and a TeamStat:

    case class Team(name: String, stats: Option[TeamStat]) {
      override def toString = name + ": " + stats
    }

Next, we use the joinVertices operator that we have seen in the previous chapter:

    // Joining the average stats to vertex attributes
    def addTeamStat(id: VertexId, t: Team, stats: TeamStat) = Team(t.name, Some(stats))

    val statsGraph: Graph[Team, FullResult] =
      scoreGraph.mapVertices((_, name) => Team(name, None)).
        joinVertices(teamStats)(addTeamStat)

We can see that the join has worked well by printing the first three vertices in the new graph called statsGraph:

    scala> statsGraph.vertices.take(3).foreach(println)
    (1260,Loyola-Chicago: Some(17-13))
    (1410,TX Pan American: Some(7-21))
    (1426,UT Arlington: Some(15-15))

To conclude this task, let's find out the top 10 teams in the regular season. To do so, we define an ordering for Option[TeamStat] as follows:

    import scala.math.Ordering

    object winsOrdering extends Ordering[Option[TeamStat]] {
      def compare(x: Option[TeamStat], y: Option[TeamStat]) = (x, y) match {
        case (None, None) => 0
        case (Some(a), None) => 1
        case (None, Some(b)) => -1
        case (Some(a), Some(b)) =>
          if (a.wins == b.wins) a.losses compare b.losses
          else a.wins compare b.wins
      }
    }

Finally, we get the following:

    import scala.reflect.classTag
    import scala.reflect.ClassTag

    scala> statsGraph.vertices.sortBy(v => v._2.stats,false)(winsOrdering, classTag[Option[TeamStat]]).
         | take(10).foreach(println)
    (1246,Kentucky: Some(34-0))
    (1437,Villanova: Some(32-2))
    (1112,Arizona: Some(31-3))
    (1458,Wisconsin: Some(31-3))
    (1211,Gonzaga: Some(31-2))
    (1320,Northern Iowa: Some(30-3))
    (1323,Notre Dame: Some(29-5))
    (1181,Duke: Some(29-4))
    (1438,Virginia: Some(29-3))
    (1268,Maryland: Some(27-6))

Note that the ClassTag parameter is required in sortBy to make use of Scala's reflection. This is why we had the previous imports.

Performance optimization with tripletFields

In addition to sendMsg and mergeMsg, aggregateMessages can also take an optional argument called tripletFields, which indicates what data is accessed in the EdgeContext. The main reason for explicitly specifying such information is to help optimize the performance of the aggregateMessages operation.

In fact, TripletFields represents a subset of the fields of EdgeTriplet, and it enables GraphX to populate only these fields when necessary. The default value is TripletFields.All, which means that the sendMsg function may access any of the fields in the EdgeContext. Otherwise, the tripletFields argument is used to tell GraphX that only part of the EdgeContext will be required so that an efficient join strategy can be used. All the possible options for tripletFields are listed here:

- TripletFields.All: Expose all the fields (source, edge, and destination)
- TripletFields.Dst: Expose the destination and edge fields, but not the source field
- TripletFields.EdgeOnly: Expose only the edge field
- TripletFields.None: None of the triplet fields are exposed
- TripletFields.Src: Expose the source and edge fields, but not the destination field

Using our previous example, if we are interested in computing the total number of wins and losses for each team, we will not need to access any field of the EdgeContext. In this case, we should use TripletFields.None to indicate so:

    // Number of wins of the teams
    val numWins: VertexRDD[Int] = scoreGraph.aggregateMessages(
      triplet => {
        triplet.sendToSrc(1) // No attribute is passed but an integer
      },
      (x, y) => x + y,
      TripletFields.None
    )

    // Number of losses of the teams
    val numLosses: VertexRDD[Int] = scoreGraph.aggregateMessages(
      triplet => {
        triplet.sendToDst(1) // No attribute is passed but an integer
      },
      (x, y) => x + y,
      TripletFields.None
    )

To see that this works, let's print the five teams with the most wins and the five teams with the most losses:

    scala> numWins.sortBy(_._2,false).take(5).foreach(println)
    (1246,34)
    (1437,32)
    (1112,31)
    (1458,31)
    (1211,31)

    scala> numLosses.sortBy(_._2, false).take(5).foreach(println)
    (1363,28)
    (1146,27)
    (1212,27)
    (1197,27)
    (1263,27)

Should you want the names of the top five teams, you need to access the srcAttr attribute. In this case, we need to set tripletFields to TripletFields.Src:

    // Kentucky as the undefeated team in the regular season
    val numWinsOfTeams: VertexRDD[(String, Int)] = scoreGraph.aggregateMessages(
      t => {
        t.sendToSrc((t.srcAttr, 1)) // Pass source attribute only
      },
      (x, y) => (x._1, x._2 + y._2),
      TripletFields.Src
    )

Et voila!

    scala> numWinsOfTeams.sortBy(_._2._2, false).take(5).foreach(println)
    (1246,(Kentucky,34))
    (1437,(Villanova,32))
    (1112,(Arizona,31))
    (1458,(Wisconsin,31))
    (1211,(Gonzaga,31))

    scala> numWinsOfTeams.sortBy(_._2._2).take(5).foreach(println)
    (1146,(Cent Arkansas,2))
    (1197,(Florida A&M,2))
    (1398,(Tennessee St,3))
    (1263,(Maine,3))
    (1420,(UMBC,4))

Kentucky has not lost any of its 34 games during the regular season. Too bad that they could not make it into the championship final.

Warning about the mapReduceTriplets operator

Prior to Spark 1.2, there was no aggregateMessages method in Graph. Instead, the now deprecated mapReduceTriplets was the primary aggregation operator.
The API for mapReduceTriplets is:

    class Graph[VD, ED] {
      def mapReduceTriplets[Msg](
          map: EdgeTriplet[VD, ED] => Iterator[(VertexId, Msg)],
          reduce: (Msg, Msg) => Msg)
        : VertexRDD[Msg]
    }

Compared to mapReduceTriplets, the new operator called aggregateMessages is more expressive as it employs the message passing mechanism instead of returning an iterator of messages as mapReduceTriplets does. In addition, aggregateMessages lets the user explicitly specify the TripletFields object for performance improvement, as we explained previously. In addition to the API improvements, aggregateMessages is optimized for performance. Because mapReduceTriplets is now deprecated, we will not discuss it further. If you have to use it with earlier versions of Spark, you can refer to the Spark programming guide.

Summary

In brief, aggregateMessages is a useful and generic operator that provides a functional abstraction for aggregating neighborhood information in Spark graphs. Its definition is summarized here:

    class Graph[VD, ED] {
      def aggregateMessages[Msg: ClassTag](
          sendMsg: EdgeContext[VD, ED, Msg] => Unit,
          mergeMsg: (Msg, Msg) => Msg,
          tripletFields: TripletFields = TripletFields.All)
        : VertexRDD[Msg]
    }

This operator applies a user-defined sendMsg function to each edge in the graph using an EdgeContext. Each EdgeContext accesses the required information about the edge and passes this information to its source node and/or destination node using sendToSrc and/or sendToDst respectively. After all the messages are received by the nodes, the mergeMsg function is used to aggregate these messages at each node.
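The difference in API shape between the two operators (returning an iterator of messages versus calling a send function) can be illustrated outside Spark. The following plain-Python sketch is an assumption-laden model, not the GraphX implementation: both helper functions and the three made-up games are hypothetical, and the snippet says nothing about the performance differences mentioned above.

```python
# Made-up games as (winner, loser) pairs.
edges = [("A", "B"), ("A", "C"), ("B", "C")]


def map_reduce_triplets(edges, map_fn, reduce_fn):
    """Iterator style: map_fn returns an iterator of (vertex, message) pairs per edge."""
    out = {}
    for edge in edges:
        for vertex, msg in map_fn(edge):
            out[vertex] = reduce_fn(out[vertex], msg) if vertex in out else msg
    return out


def aggregate_messages(edges, send_msg, merge_msg):
    """Message-passing style: send_msg calls send(vertex, message) and returns nothing."""
    out = {}

    def send(vertex, msg):
        out[vertex] = merge_msg(out[vertex], msg) if vertex in out else msg

    for edge in edges:
        send_msg(edge, send)
    return out


def add(x, y):
    return x + y


# Count wins per team with each style.
wins_iterator_style = map_reduce_triplets(edges, lambda e: iter([(e[0], 1)]), add)
wins_message_style = aggregate_messages(edges, lambda e, send: send(e[0], 1), add)
print(wins_iterator_style, wins_message_style)
```

Both calls produce the same win counts; the message-passing version simply avoids building an intermediate iterator for every edge.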
Some interesting reads

- Six keys to sports analytics
- Moneyball: The Art Of Winning An Unfair Game
- Golden State Warriors at the forefront of NBA data analysis
- How Data and Analytics Have Changed 'The Beautiful Game'
- NHL, SAP partnership to lead statistical revolution

Packt

15 Nov 2016

10 min read

Manual and Automated Testing

In this article by Claus Führer, the author of the book Scientific Computing with Python 3, we focus on two aspects of testing for scientific programming: manual and automated testing. Manual testing is what every programmer does to quickly check that an implementation is working. Automated testing is the refined, automated variant of that idea. We will introduce some tools available for automated testing in general, with a view to the particular case of scientific computing.

Manual Testing

During the development of code, you do a lot of small tests in order to check its functionality. This could be called manual testing. Typically, you would test that a given function does what it is supposed to do by manually testing the function in an interactive environment. For instance, suppose that you implement the bisection algorithm. It is an algorithm that finds a zero (root) of a scalar nonlinear function. To start the algorithm, an interval has to be given with the property that the function takes different signs at the interval boundaries. You would then typically test an implementation of that algorithm by checking:

- that a solution is found when the function has opposite signs at the interval boundaries
- that an exception is raised when the function has the same sign at the interval boundaries

Manual testing, as necessary as it may seem to be, is unsatisfactory. Once you have convinced yourself that the code does what it is supposed to do, you formulate a relatively small number of demonstration examples to convince others of the quality of the code. At that stage, one often loses interest in the tests made during development, and they are forgotten or even deleted. As soon as you change a detail and things no longer work correctly, you might regret that your earlier tests are no longer available.

Automatic Testing

The correct way to develop any piece of code is to use automatic testing.
The advantages are:

- The automated repetition of a large number of tests after every code refactoring and before new versions are launched
- A silent documentation of the use of the code
- A documentation of the test coverage of your code: did things work before a change, or was a certain aspect never tested?

We suggest developing tests in parallel with the code. Good design of tests is an art of its own, and there is rarely an investment that guarantees such a good pay-off in development time savings as the investment in good tests. Now we will go through the implementation of a simple algorithm with the automated testing methods in mind.

Testing the bisection algorithm

Let us examine automated testing for the bisection algorithm. With this algorithm, a zero of a real valued function is found. An implementation of the algorithm can have the following form:

    def bisect(f, a, b, tol=1.e-8):
        """
        Implementation of the bisection algorithm
        f real valued function
        a,b interval boundaries (float) with the property f(a)*f(b)<=0
        tol tolerance (float)
        """
        if f(a) * f(b) > 0:
            raise ValueError("Incorrect initial interval [a, b]")
        for i in range(100):
            c = (a + b) / 2.
            if f(a) * f(c) <= 0:
                b = c
            else:
                a = c
            if abs(a - b) < tol:
                return (a + b) / 2
        raise Exception('No root found within the given tolerance {}'.format(tol))

We assume this to be stored in a file bisection.py. As a first test case, we test that the zero of the function f(x) = x is found:

    from numpy import allclose

    def test_identity():
        result = bisect(lambda x: x, -1., 1.)  # lambda defines f(x) = x inline
        expected = 0.
        assert allclose(result, expected), 'expected zero not found'

    test_identity()

In this code, you meet the Python keyword assert for the first time. It raises an AssertionError exception if its first argument evaluates to False. Its optional second argument is a string with additional information. We use the function allclose in order to test floats for equality. Let us comment on some of the features of the test function.
We use an assertion to make sure that an exception will be raised if the code does not behave as expected.
We have to manually run the test in the line test_identity(). There are many tools to automate this kind of call.

Let us now set up a test that checks if bisect raises an exception when the function has the same sign on both ends of the interval. For now, we will suppose that the exception raised is a ValueError exception.

Example: Checking the sign for the bisection algorithm.

    def test_badinput():
        try:
            bisect(lambda x: x, 0.5, 1)
        except ValueError:
            pass
        else:
            raise AssertionError()

    test_badinput()

In this case an AssertionError is raised if bisect raises no exception at all. An exception of a type other than ValueError is not caught by the test and simply propagates. There are tools to simplify the above construction to check that an exception is raised.

Another useful kind of test is the edge case test. Here we test arguments or user input which is likely to create mathematically undefined situations or states of the program not foreseen by the programmer. For instance, what happens if both bounds are equal? What happens if a>b? We can easily set up such tests, for instance:

    def test_equal_boundaries():
        result = bisect(lambda x: x, 1., 1.)
        expected = 0.
        assert allclose(result, expected), 'test equal interval bounds failed'

    def test_reverse_boundaries():
        result = bisect(lambda x: x, 1., -1.)
        expected = 0.
        assert allclose(result, expected), 'test reverse interval bounds failed'

    test_equal_boundaries()
    test_reverse_boundaries()

Note that with the implementation above, test_equal_boundaries does not pass: f(1)*f(1) > 0, so bisect raises a ValueError before the first iteration. Bringing such behavior to light is the purpose of an edge case test.

Using unittest

The standard Python package unittest greatly facilitates automated testing. That package requires that we rewrite our tests a little to be compatible. The first test would have to be rewritten in a class, as follows:

    from bisection import bisect
    import unittest

    class TestIdentity(unittest.TestCase):
        def test(self):
            result = bisect(lambda x: x, -1.2, 1., tol=1.e-8)
            expected = 0.
            self.assertAlmostEqual(result, expected)

    if __name__ == '__main__':
        unittest.main()

Let us examine the differences compared to the previous implementation. First, the test is now a method and a part of a class. The class must inherit from unittest.TestCase. The test method's name must start with test. Note that we may now use one of the assertion tools of the package, namely assertAlmostEqual. Finally, the tests are run using unittest.main. We recommend writing the tests in a file separate from the code to be tested. That is why it starts with an import. The test passes and returns

    Ran 1 test in 0.002s

    OK

If we had run it with a loose tolerance parameter, e.g., 1.e-3, a failure of the test would have been reported:

    F
    ==========================================================
    FAIL: test (__main__.TestIdentity)
    ----------------------------------------------------------------------
    Traceback (most recent call last):
      File "<ipython-input-11-e44778304d6f>", line 5, in test
        self.assertAlmostEqual(result, expected)
    AssertionError: 0.00017089843750002018 != 0.0 within 7 places

    ---------------------------------------------------------------------
    Ran 1 test in 0.004s

    FAILED (failures=1)

Tests can and should be grouped together as methods of a test class:

Example:

    import unittest
    from bisection import bisect

    class TestIdentity(unittest.TestCase):
        def identity_fcn(self, x):
            return x

        def test_functionality(self):
            result = bisect(self.identity_fcn, -1.2, 1., tol=1.e-8)
            expected = 0.
            self.assertAlmostEqual(result, expected)

        def test_reverse_boundaries(self):
            result = bisect(self.identity_fcn, 1., -1.)
            expected = 0.
            self.assertAlmostEqual(result, expected)

        def test_exceeded_tolerance(self):
            tol = 1.e-80
            self.assertRaises(Exception, bisect, self.identity_fcn, -1.2, 1., tol)

    if __name__ == '__main__':
        unittest.main()

Here, the last test needs some comments: We used the method unittest.TestCase.assertRaises. It tests whether an exception is correctly raised.
Its first parameter is the exception type, for example ValueError or Exception, and its second argument is the function that is expected to raise the exception. The remaining arguments are the arguments for this function. The command unittest.main() creates an instance of the class TestIdentity and executes those methods whose names start with test.

Test setUp and tearDown

The class unittest.TestCase provides two special methods, setUp and tearDown, which are run before and after every call to a test method. This is needed when testing generators, which are exhausted after every test. We demonstrate this here by testing a program which checks in which line of a file a given string occurs for the first time:

    class NotFoundError(Exception):
        pass

    def find_string(file, string):
        for i, lines in enumerate(file.readlines()):
            if string in lines:
                return i
        raise NotFoundError('String {} not found in File {}'.format(string, file.name))

We assume that this code is saved in a file find_in_file.py. A test has to prepare a file, open it, and remove it after the test:

    import unittest
    import os  # used for, e.g., deleting files
    from find_in_file import find_string, NotFoundError

    class TestFindInFile(unittest.TestCase):
        def setUp(self):
            file = open('test_file.txt', 'w')
            file.write('aha')
            file.close()
            self.file = open('test_file.txt', 'r')

        def tearDown(self):
            self.file.close()  # close the file before deleting it
            os.remove(self.file.name)

        def test_exists(self):
            line_no = find_string(self.file, 'aha')
            self.assertEqual(line_no, 0)

        def test_not_exists(self):
            self.assertRaises(NotFoundError, find_string, self.file, 'bha')

    if __name__ == '__main__':
        unittest.main()

Before each test setUp is run, and afterwards tearDown is executed.

Parametrizing Tests

One frequently wants to repeat the same test set-up with different data sets.
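For comparison with the approach developed in this section, the following minimal sketch repeats one check over several data sets with subTest, which is part of unittest.TestCase in Python 3.4 and later. It is not the method of this article: all data sets stay inside a single test method instead of becoming test methods of their own. The bisect shown here is a simplified stand-in so that the block runs on its own, and the names CASES, TestZeros, and run_tests are invented for illustration.

```python
import unittest


def bisect(f, a, b, tol=1.e-8):
    """Simplified stand-in for the article's bisect (an assumption, not the original file)."""
    if f(a) * f(b) > 0:
        raise ValueError("Incorrect initial interval [a,b]")
    for _ in range(100):
        c = (a + b) / 2.
        # keep the half of the interval on which the sign changes
        if f(a) * f(c) <= 0:
            b = c
        else:
            a = c
        if abs(a - b) < tol:
            return (a + b) / 2
    raise Exception("No root found within the given tolerance {}".format(tol))


# Hypothetical data sets: (name, function, interval)
CASES = [
    ("identity", lambda x: x, (-1.2, 1.)),
    ("parabola", lambda x: x**2 - 1, (0, 10.)),
    ("cubic", lambda x: x**3 - 2 * x**2, (0.1, 5.)),
]


class TestZeros(unittest.TestCase):
    def test_zeros(self):
        for name, fcn, interval in CASES:
            # A failing data set is reported under its own name,
            # and the loop continues with the remaining data sets.
            with self.subTest(name=name):
                result = bisect(fcn, *interval, tol=1.e-8)
                self.assertAlmostEqual(fcn(result), 0.)


def run_tests():
    """Run TestZeros programmatically and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestZeros)
    return unittest.TextTestRunner(verbosity=0).run(suite)


if __name__ == "__main__":
    run_tests()
```

The trade-off is that the runner counts this as one test, so the data sets cannot be selected or reported as individual test methods.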
One way to do this with unittest is to generate test cases automatically with the corresponding methods injected. To this end, we first construct a test case with one or several methods that will be used when we later set up test methods. Let us consider the bisection method again and let us check if the values it returns are really zeros of the given function.

We first build the test case and the method which we will use for the tests:

    import unittest
    from bisection import bisect

    class Tests(unittest.TestCase):
        def checkifzero(self, fcn_with_zero, interval):
            result = bisect(fcn_with_zero, *interval, tol=1.e-8)
            function_value = fcn_with_zero(result)
            expected = 0.
            self.assertAlmostEqual(function_value, expected)

Then we dynamically create test functions as attributes of this class:

    test_data = [
        {'name': 'identity', 'function': lambda x: x,
         'interval': [-1.2, 1.]},
        {'name': 'parabola', 'function': lambda x: x**2 - 1,
         'interval': [0, 10.]},
        {'name': 'cubic', 'function': lambda x: x**3 - 2*x**2,
         'interval': [0.1, 5.]},
    ]

    def make_test_function(dic):
        return lambda self: self.checkifzero(dic['function'], dic['interval'])

    for data in test_data:
        # the method name must start with test to be found by unittest
        setattr(Tests, "test_{name}".format(name=data['name']),
                make_test_function(data))

    if __name__ == '__main__':
        unittest.main()

In this example the data is provided as a list of dictionaries. A function make_test_function dynamically generates a test function which uses a particular data dictionary to perform the test with the previously defined method checkifzero. This test function is made a method of the TestCase class by using the Python command setattr.

Summary

No program development without testing! In this article we showed the importance of well organized and documented tests. Some professionals even start development by first specifying tests. A useful tool for automatic testing is unittest, which we explained in detail. While testing improves the reliability of code, profiling is needed to improve the performance.
Alternative ways to code may result in large performance differences. We showed how to measure computation time and how to localize bottlenecks in your code.
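The article mentions that there are tools to simplify the try/except construction used to check that an exception is raised. One such tool is assertRaises used as a context manager, which is part of the standard unittest.TestCase API. The following is a minimal sketch under stated assumptions: the bisect shown is a simplified stand-in so that the block runs on its own, and the names TestBadInput and run_tests are invented for illustration.

```python
import unittest


def bisect(f, a, b, tol=1.e-8):
    """Simplified stand-in for the article's bisect (an assumption, not the original file)."""
    if f(a) * f(b) > 0:
        raise ValueError("Incorrect initial interval [a,b]")
    for _ in range(100):
        c = (a + b) / 2.
        if f(a) * f(c) <= 0:
            b = c
        else:
            a = c
        if abs(a - b) < tol:
            return (a + b) / 2
    raise Exception("No root found within the given tolerance {}".format(tol))


class TestBadInput(unittest.TestCase):
    def test_same_sign(self):
        # The test fails if the block does not raise ValueError
        with self.assertRaises(ValueError):
            bisect(lambda x: x, 0.5, 1.)

    def test_equal_boundaries(self):
        # Edge case: f(1)*f(1) > 0, so the sign check rejects the interval
        with self.assertRaises(ValueError):
            bisect(lambda x: x, 1., 1.)

    def test_message(self):
        # The caught exception is available for further checks
        with self.assertRaises(ValueError) as context:
            bisect(lambda x: x, 0.5, 1.)
        self.assertIn("initial interval", str(context.exception))


def run_tests():
    """Run TestBadInput programmatically and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestBadInput)
    return unittest.TextTestRunner(verbosity=0).run(suite)


if __name__ == "__main__":
    run_tests()
```

Compared with the try/except/else form, the context manager keeps the call under test readable as an ordinary function call. An exception of a different type still propagates and is reported by the runner as an error.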
