How-To Tutorials - Data

Packt
13 Dec 2016
13 min read

SQL Tuning Enhancements in Oracle 12c

Background

Performance tuning is one of the most critical areas of Oracle database administration, and a good knowledge of SQL tuning helps DBAs tune production databases on a daily basis. Over the years the Oracle optimizer has gone through several enhancements, and each release has improved on the one before it. Oracle 12c is no different: Oracle has improved the optimizer and added new features in this release. In this article we look at some of the new optimizer features that help us tune our queries.

Objective

In this article, Advait Deo and Indira Karnati, authors of the book OCP Upgrade 1Z0-060 Exam Guide, discuss new features of the Oracle 12c optimizer and how they help improve SQL plans. The article also covers some limitations of the optimizer in previous releases and how Oracle has overcome them in this release. Specifically, we discuss dynamic plans and how they work.

SQL Tuning

Before we go into the details of these new features, let us rewind and check what we had in Oracle 11g.

Behavior in Oracle 11g R1

Whenever a SQL statement is executed for the first time, the optimizer generates an execution plan for it based on the statistics available for the objects used in the plan. If statistics are not available, if the optimizer thinks the existing statistics are of low quality, or if the SQL uses complex predicates for which the optimizer cannot estimate the cardinality, the optimizer may choose to use dynamic sampling for those tables. Based on the resulting statistics, the optimizer generates the plan and executes the SQL. There are two problems with this approach:

First, statistics generated by dynamic sampling may not be of good quality, because they are gathered in limited time and are based on a limited sample size.
A trade-off is made between minimizing the sampling overhead and approaching a higher level of accuracy.

Second, the plan generated using this approach may not be accurate, because the estimated cardinality may differ a lot from the actual cardinality. The next time the query executes, it goes through a soft parse and picks up the same plan.

Behavior in Oracle 11g R2

To overcome these drawbacks, Oracle enhanced dynamic sampling further in Oracle 11g Release 2. In 11.2, Oracle automatically enables dynamic sampling when the query runs if statistics are missing or if the optimizer thinks the current statistics are not up to the mark. The optimizer also decides the level of dynamic sampling, provided the user has not set the OPTIMIZER_DYNAMIC_SAMPLING parameter to a non-default value (the default value is 2). So, with this parameter at its default in Oracle 11g R2, the optimizer decides when to use dynamic sampling in a query and at what level.

Oracle also introduced a new feature in Oracle 11g R2 called cardinality feedback. Its purpose is to improve the performance of SQL statements that are executed repeatedly and for which the optimizer does not have the correct cardinality, perhaps because of missing statistics, complex predicate conditions, or some other reason. In such cases, cardinality feedback is very useful.

Cardinality feedback works as follows. During the first execution, the plan for the SQL is generated in the traditional way, without cardinality feedback. However, during the optimization stage of the first execution, the optimizer notes all the estimates that are of low quality (due to missing statistics, complex predicates, or some other reason), and monitoring is enabled for the cursor that is created.
If this monitoring is enabled during the optimization stage, then, at the end of the first execution, some cardinality estimates in the plan are compared with the actual cardinalities to see how significant the variation is. If they differ significantly, the actual cardinalities for those predicates are stored with the cursor and are used directly for the next execution instead of being discarded and estimated again. So when the query executes the next time, it is optimized again (a hard parse happens), but this time the optimizer uses the actual cardinalities saved from the first execution and comes up with a better plan.

But even with these improvements, there are drawbacks:

- With cardinality feedback, corrected cardinality estimates are available only for the next execution, not for the first. So the first execution can still run with a poor plan.
- The dynamic sampling improvements (the optimizer deciding whether dynamic sampling should be used, and at what level) apply only to parallel queries, not to queries that run serially.
- Dynamic sampling does not cover joins and GROUP BY columns.

Oracle 12c provides new improvements that eliminate these drawbacks of Oracle 11g R2.

Adaptive execution plans: dynamic plans

The Oracle optimizer chooses the best execution plan for a query based on all the information available to it. Sometimes the optimizer does not have sufficient or good-quality statistics, which makes it difficult to generate optimal plans. In Oracle 12c, the optimizer has been enhanced to adapt a poorly performing execution plan at run time and to prevent a poor plan from being chosen on subsequent executions. An adaptive plan can change the execution plan during the current run when the optimizer's estimates prove to be wrong.
This is made possible by collecting statistics at critical places in the plan when the query starts executing. A query is internally split into multiple steps, and the optimizer generates multiple sub-plans for every step. The optimizer compares the statistics collected at these critical points with the estimated cardinality. If it finds a deviation beyond the set threshold, it picks a different sub-plan for those steps. This improves the ability of the query-processing engine to generate better execution plans.

What happens in adaptive plan execution?

In Oracle 12c, the optimizer generates dynamic plans. A dynamic plan is an execution plan that has many built-in sub-plans. A sub-plan is a portion of the plan that the optimizer can switch to as an alternative at run time. When the first execution starts, the optimizer observes statistics at various critical stages in the plan, and it makes the final decision about each sub-plan based on the observations made during execution up to that point.

Going deeper into the logic of the dynamic plan, the optimizer places statistics collectors at various critical stages in the plan. These critical stages are the places where the optimizer has to join two tables or decide on the optimal degree of parallelism. During execution of the plan, the statistics collector buffers a portion of the rows. The portion of the plan preceding the statistics collector can have alternative sub-plans, each of which is valid for a subset of the possible values returned by the collector. In other words, each sub-plan has a different threshold range. Based on the data returned by the statistics collector, the sub-plan whose threshold range matches is chosen.

For example, while building the query plan, the optimizer can insert a statistics collector before a join of two tables.
It can have multiple sub-plans, one for each type of join it can perform between the two tables. If the number of rows returned by the statistics collector on the first table is below the threshold value, the optimizer might go with the sub-plan containing the nested loop join. If the number of rows is above the threshold value, the optimizer might choose the second sub-plan and go with the hash join. After the optimizer chooses a sub-plan, buffering is disabled: the statistics collector stops collecting rows and passes them through instead. On subsequent executions of the same SQL, the optimizer does no buffering and uses the same plan.

With dynamic plans, the optimizer adapts to poor plan choices, and correct decisions are made at various steps during run time. Instead of using predetermined execution plans, adaptive plans enable the optimizer to postpone the final plan decision until statement execution time.

Consider the following simple query:

SELECT a.sales_rep, b.product, SUM(a.amt)
FROM sales a, product b
WHERE a.product_id = b.product_id
GROUP BY a.sales_rep, b.product

When the query plan is initially built, the optimizer places the statistics collector before the join. It scans the first table (SALES) and, based on the number of rows returned, decides on the appropriate type of join.
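The buffering decision described above can be illustrated with a small Python toy model. This is only a sketch of the idea, not Oracle's implementation: the function name, the threshold value, and the choice to treat a row count equal to the threshold as "large" are all assumptions made for illustration.

```python
from typing import Iterable, List, Tuple


def choose_join_method(rows: Iterable[int], threshold: int) -> Tuple[str, List[int]]:
    """Toy statistics collector: buffer rows from the first table until the
    join method can be decided. Returns the method and the buffered rows."""
    buffered: List[int] = []
    for row in rows:
        buffered.append(row)
        # Hypothetical rule: once the buffered count reaches the threshold,
        # the nested loop sub-plan is ruled out, so buffering can stop early.
        if len(buffered) >= threshold:
            return "HASH JOIN", buffered
    # The input ended below the threshold: few rows, so probe row by row.
    return "NESTED LOOPS", buffered


small_method, small_buffer = choose_join_method(range(5), threshold=100)
large_method, large_buffer = choose_join_method(range(10_000), threshold=100)
print(small_method, len(small_buffer))  # NESTED LOOPS 5
print(large_method, len(large_buffer))  # HASH JOIN 100
```

Note that the large input is never read to the end: the collector only needs enough rows to know which side of the threshold it is on, which matches the article's point that buffering stops once a sub-plan is chosen.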
Figure: the statistics collector placed at various stages of the plan.

Enabling adaptive execution plans

To enable adaptive execution plans, the following conditions must be met:

- optimizer_features_enable should be set to at least 12.1.0.1
- optimizer_adaptive_reporting_only should be set to FALSE (the default)

If you set the OPTIMIZER_ADAPTIVE_REPORTING_ONLY parameter to TRUE, the adaptive execution plan feature runs in reporting-only mode: it collects the information for adaptive optimization but doesn't actually use it to change execution plans.

You can check whether a statement uses an adaptive plan, and whether its final plan has been resolved, by looking at the IS_RESOLVED_ADAPTIVE_PLAN column in the V$SQL view (the column is NULL when the plan is not adaptive).

Join methods and parallel distribution methods are the two areas where Oracle 12c implements adaptive plans.

Adaptive execution plans and join methods

Here is an example that shows what an adaptive execution plan looks like. Instead of simulating a new query in the database and checking whether the adaptive plan has worked, I used one of the queries in the database that is already using an adaptive plan. You can find many such queries by checking V$SQL for is_resolved_adaptive_plan = 'Y'. The following query lists all SQL statements that use adaptive plans:

SELECT sql_id FROM v$sql WHERE is_resolved_adaptive_plan = 'Y';

While evaluating the plan, the optimizer uses the cardinality of the join to select the better join method. The statistics collector starts buffering the rows from the first table; if the number of rows exceeds the threshold value, the optimizer chooses a hash join, and if it is below the threshold value, the optimizer goes for a nested loop join.
The following is the resulting plan:

SQL> SELECT * FROM TABLE(DBMS_XPLAN.display_cursor(sql_id=>'dhpn35zupm8ck',cursor_child_no=>0));

Plan hash value: 3790265618

----------------------------------------------------------------------------------------------------
| Id  | Operation                                | Name    | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                         |         |       |       |   445 (100)|          |
|   1 |  SORT ORDER BY                           |         |     1 |    73 |   445   (1)| 00:00:01 |
|   2 |   NESTED LOOPS                           |         |     1 |    73 |   444   (0)| 00:00:01 |
|   3 |    NESTED LOOPS                          |         |   151 |    73 |   444   (0)| 00:00:01 |
|*  4 |     TABLE ACCESS BY INDEX ROWID BATCHED  | OBJ$    |   151 |  7701 |   293   (0)| 00:00:01 |
|*  5 |      INDEX FULL SCAN                     | I_OBJ3  |     1 |       |    20   (0)| 00:00:01 |
|*  6 |     INDEX UNIQUE SCAN                    | I_TYPE2 |     1 |       |     0   (0)|          |
|*  7 |    TABLE ACCESS BY INDEX ROWID           | TYPE$   |     1 |    22 |     1   (0)| 00:00:01 |
----------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   4 - filter(SYSDATE@!-"O"."CTIME">.0007)
   5 - filter("O"."OID$" IS NOT NULL)
   6 - access("O"."OID$"="T"."TVOID")
   7 - filter(BITAND("T"."PROPERTIES",8388608)=8388608)

Note
-----
   - this is an adaptive plan

If we check this plan, the Note section tells us that this is an adaptive plan. This means the optimizer must have started with some default plan based on the statistics on the tables and indexes, and during execution it changed the join method for a sub-plan. You can check which step the optimizer changed and at what point it collected the statistics.
You can display this using the new format option of DBMS_XPLAN.DISPLAY_CURSOR, format => 'adaptive', which results in the following:

DEO>SELECT * FROM TABLE(DBMS_XPLAN.display_cursor(sql_id=>'dhpn35zupm8ck',cursor_child_no=>0,format=>'adaptive'));

Plan hash value: 3790265618

--------------------------------------------------------------------------------------------------------
|   Id  | Operation                                  | Name    | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------------------------------
|     0 | SELECT STATEMENT                           |         |       |       |   445 (100)|          |
|     1 |  SORT ORDER BY                             |         |     1 |    73 |   445   (1)| 00:00:01 |
|- *  2 |   HASH JOIN                                |         |     1 |    73 |   444   (0)| 00:00:01 |
|     3 |    NESTED LOOPS                            |         |     1 |    73 |   444   (0)| 00:00:01 |
|     4 |     NESTED LOOPS                           |         |   151 |    73 |   444   (0)| 00:00:01 |
|-    5 |      STATISTICS COLLECTOR                  |         |       |       |            |          |
|  *  6 |       TABLE ACCESS BY INDEX ROWID BATCHED  | OBJ$    |   151 |  7701 |   293   (0)| 00:00:01 |
|  *  7 |        INDEX FULL SCAN                     | I_OBJ3  |     1 |       |    20   (0)| 00:00:01 |
|  *  8 |      INDEX UNIQUE SCAN                     | I_TYPE2 |     1 |       |     0   (0)|          |
|  *  9 |     TABLE ACCESS BY INDEX ROWID            | TYPE$   |     1 |    22 |     1   (0)| 00:00:01 |
|- * 10 |    TABLE ACCESS FULL                       | TYPE$   |     1 |    22 |     1   (0)| 00:00:01 |
--------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("O"."OID$"="T"."TVOID")
   6 - filter(SYSDATE@!-"O"."CTIME">.0007)
   7 - filter("O"."OID$" IS NOT NULL)
   8 - access("O"."OID$"="T"."TVOID")
   9 - filter(BITAND("T"."PROPERTIES",8388608)=8388608)
  10 - filter(BITAND("T"."PROPERTIES",8388608)=8388608)

Note
-----
   - this is an adaptive plan (rows marked '-' are inactive)

In this output, you can see three extra steps: 2, 5, and 10. These steps were present in the original plan when the query started. Initially, the optimizer generated a plan with a hash join on the outer tables.
During run time, the optimizer started collecting the rows returned from the OBJ$ table (step 6), as we can see from the STATISTICS COLLECTOR at step 5. Once the rows were buffered, the optimizer found that the number of rows returned by the OBJ$ table was below the threshold, so it could use a nested loop join instead of a hash join. The rows marked with - at the beginning belong to the original plan and are removed from the final plan. In their place, the final plan uses steps 3, 8, and 9: the full table scan on the TYPE$ table at step 10 is replaced by an index unique scan of I_TYPE2 at step 8, followed by a table access by index rowid at step 9.

Adaptive plans and parallel distribution methods

Adaptive plans are also useful for recovering from poor distribution methods when running SQL in parallel. Parallel execution often requires data redistribution to perform parallel sorts, joins, and aggregates. The database can choose among multiple data distribution methods to perform these operations. The number of rows to be distributed, along with the number of parallel server processes, determines the data distribution method. If many parallel server processes distribute only a few rows, the database chooses a broadcast distribution method and sends the entire result set to all the parallel server processes. On the other hand, if a few processes distribute many rows, the database distributes the rows equally among the parallel server processes by choosing a hash distribution method.

In adaptive plans, the optimizer does not commit to a specific distribution method up front. Instead, it starts with an adaptive parallel data distribution technique called hybrid data distribution. It places a statistics collector to buffer the rows returned by the table. Based on the number of rows returned, the optimizer decides the distribution method.
If the number of rows returned is below the threshold, the data distribution method switches to broadcast distribution. If the number of rows returned is above the threshold, the data distribution method switches to hash distribution.

Summary

In this article we looked at the new features of the Oracle 12c optimizer that help us tune our queries.
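As a closing illustration, the hybrid distribution decision described in the parallel distribution section can be sketched in Python. This is a toy model under stated assumptions: the article does not give the threshold, so the value used here (twice the degree of parallelism) is an assumption for illustration, and both function names are hypothetical.

```python
from typing import Iterable, List


def choose_distribution(row_count: int, dop: int) -> str:
    """Pick a distribution method for row_count rows across dop parallel servers."""
    # Assumed threshold for illustration only: twice the degree of parallelism.
    threshold = 2 * dop
    if row_count < threshold:
        # Few rows: it is cheap to send every row to every parallel server.
        return "BROADCAST"
    # Many rows: spread them across the servers by hashing the join key.
    return "HASH"


def hash_distribute(keys: Iterable[int], dop: int) -> List[List[int]]:
    """Toy stand-in for hash distribution: assign each key to exactly one server."""
    servers: List[List[int]] = [[] for _ in range(dop)]
    for key in keys:
        servers[key % dop].append(key)
    return servers


print(choose_distribution(3, dop=4))      # BROADCAST
print(choose_distribution(1000, dop=4))   # HASH
print(hash_distribute(range(8), dop=4))   # [[0, 4], [1, 5], [2, 6], [3, 7]]
```

The point of deferring the choice is visible in the two calls to choose_distribution: the same plan handles both a tiny and a large row source without committing to either method before the row count is known.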

Packt
12 Dec 2016
11 min read

Tableau Data Extract Best Practices

In this article by Jenny Zhang, author of the book Tableau 10.0 Best Practices, you will learn best practices for Tableau data extracts. We will look at different ways of creating Tableau data extracts and the technical details of how a Tableau data extract works. We will learn how to create extracts from large volumes of data efficiently, and then how to upload and manage Tableau data extracts in Tableau Online. We will also look at refreshing Tableau data extracts, which keeps your data up to date automatically. Finally, we will look at using the Tableau Web Data Connector to create data extracts.

Different ways of creating Tableau data extracts

Tableau provides a few ways to create extracts.

Direct connect to original data sources

Creating an extract by connecting to the original data source (databases, Salesforce, Google Analytics, and so on) maintains the connection to the original data source. You can right-click the extract to edit it and refresh it from the original data source.

Duplicate of an extract

If you duplicate an extract by right-clicking the data extract and choosing Duplicate, Tableau creates a new .tde file that still maintains the connection to the original data source. Refreshing the duplicated data extract does not refresh the original data extract that you created the duplicate from.

Connect to a Tableau Extract File

If you create a data extract by connecting to a Tableau extract file (.tde), you will not have a connection to the original data source that the extract was created from, since you are just connecting to a local .tde file. You cannot edit or refresh the data from the original data source. Duplicating this extract, which is connected to the local .tde file, will NOT create a new .tde file; the duplicate still points to the same local .tde file.

You can right-click and choose Extract Data to create an extract out of an extract.
But we do not normally do that.

Technical details of how a Tableau data extract works

Tableau data extract's design principle

A Tableau extract (.tde) file is a compressed snapshot of data extracted from a large variety of original data sources (Excel, databases, Salesforce, NoSQL, and so on). It is stored on disk and loaded into memory as required to create a Tableau viz. Two design principles of the Tableau extract make it ideal for data analytics.

The first principle is that a Tableau extract is a columnar store. Columnar databases store values by column rather than by row. The benefit is that the input/output time required to access and aggregate the values in a column is significantly reduced. That is why Tableau extracts are great for data analytics.

The second principle is how a Tableau extract is structured to make the best use of your computer's memory. This affects how it is loaded into memory and used by Tableau. To understand this principle, we need to understand how a Tableau extract is created and used as a data source for visualizations.

When Tableau creates a data extract, it defines the structure of the .tde file and creates a separate file for each column in the original data source. When Tableau retrieves data from the original data source, it sorts, compresses, and adds the values for each column to that column's file. After that, the individual column files are combined with metadata to form a single file with as many individual memory-mapped files as there are columns in the original data source.

Because a Tableau data extract file is a memory-mapped file, when Tableau requests data from a .tde file, the data is loaded directly into memory by the operating system. Tableau does not have to open, process, or decompress the file. If needed, the operating system continues to move data in and out of RAM to ensure that all of the requested data is made available to Tableau.
This means that Tableau can query data that is bigger than the RAM on the computer.

Benefits of using Tableau data extract

The following are the seven main benefits of using Tableau data extracts:

- Performance: Using a Tableau data extract can increase performance when the underlying data source is slow. It can also speed up Custom SQL.
- Reduce load: Using a Tableau data extract instead of a live connection to a database reduces the load on the database that can result from heavy traffic.
- Portability: A Tableau data extract can be bundled with the visualizations in a packaged workbook for sharing with others.
- Pre-aggregation: When creating an extract, you can choose to aggregate your data for certain dimensions. An aggregated extract is smaller and contains only aggregated data. Accessing the aggregated values in a visualization is very fast, since all of the work to derive them has already been done. You can choose the level of aggregation. For example, you can aggregate your measures to month, quarter, or year.
- Materialize calculated fields: When you choose to optimize the extract, all of the calculated fields that have been defined are converted to static values upon the next full refresh. They become additional data fields that can be accessed and aggregated as quickly as any other field in the extract. The performance improvement can be significant, especially for string calculations, which are much slower than numeric or date calculations.
- Publish to Tableau Public and Tableau Online: Tableau Public only supports Tableau extract files. Although Tableau Online can connect to some cloud-based data sources, Tableau data extracts are most commonly used.
- Support for certain functions not available when using a live connection: Certain functions, such as count distinct, are available for some data sources only when using a Tableau data extract.
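The pre-aggregation benefit in the list above can be illustrated outside Tableau with a short Python sketch. It is not how Tableau builds an extract internally; the sample rows, field order, and function name are hypothetical and only show why an extract aggregated to month is smaller and faster to read than one holding daily detail.

```python
from collections import defaultdict
from datetime import date

# Hypothetical daily sales rows: (order date, region, amount).
daily_rows = [
    (date(2016, 1, 5), "East", 100.0),
    (date(2016, 1, 20), "East", 50.0),
    (date(2016, 1, 7), "West", 70.0),
    (date(2016, 2, 3), "East", 30.0),
    (date(2016, 2, 14), "East", 20.0),
]


def aggregate_to_month(rows):
    """Sum amounts per (year, month, region), discarding the daily detail."""
    totals = defaultdict(float)
    for day, region, amount in rows:
        totals[(day.year, day.month, region)] += amount
    return dict(totals)


monthly = aggregate_to_month(daily_rows)
print(len(daily_rows), "->", len(monthly))  # 5 -> 3
print(monthly[(2016, 1, "East")])           # 150.0
```

The trade-off is the one the article describes: the monthly totals are ready to use, but the daily values can no longer be recovered from the aggregated data, so choose the aggregation level to match the finest detail your visualizations need.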
How to create extract with large volume of data efficiently

Load very large Excel file to Tableau

If you have an Excel file with lots of data and lots of formulas, it can take a long time to load into Tableau. The best practice is to save the Excel file as a .csv file and remove all the formulas.

Aggregate the values to higher dimension

If you do not need the values at the level of detail of the underlying data source, aggregating to a higher level significantly reduces the extract size and improves performance.

Use Data Source Filter

Add a data source filter by right-clicking the data source and choosing Edit Data Source Filter to remove the data you do not need before creating the extract.

Hide Unused Fields

Hiding unused fields before creating a data extract can speed up extract creation and also saves storage space.

Upload and manage Tableau data extract in Tableau online

Create Workbook just for extracts

One way to create extracts is to create them in different workbooks. The advantage is that you can create extracts on the fly when you need them. The disadvantage is that once you have created many extracts, they are very difficult to manage; you can hardly remember which dashboard has which extracts.

A better solution is to use one workbook just to create data extracts and then upload the extracts to Tableau Online. When you need to create visualizations, you can use the extracts in Tableau Online. If you want to organize the extracts further, you can use different workbooks for different types of data sources. For example, you can use one workbook for Excel files, one for local databases, one for web-based data, and so on.

Upload data extracts to default project

The default project in Tableau Online is a good place to store your data extracts. The reason is that the default project cannot be deleted.
Another benefit is that when you use the command line to refresh the data extracts, you do not need to specify the project name if they are in the default project.

Make sure Tableau online/server has enough space

In Tableau Online/Server, it's important to make sure that the backgrounder has enough disk space to store existing Tableau data extracts as well as to refresh them and create new ones. A good rule of thumb is that the disk available to the backgrounder should be two to three times the size of the data extracts expected to be stored on it.

Refresh Tableau data extract

Local refresh of the published extract:

1. Download a local copy of the data source from Tableau Online.
   - Go to the Data Sources tab.
   - Click the name of the extract you want to download.
   - Click Download.
2. Refresh the local copy.
   - Open the extract file in Tableau Desktop.
   - Right-click the data source and choose Extract, then Refresh.
3. Publish the refreshed extract to Tableau Online.
   - Right-click the extract and click Publish to Server.
   - When asked whether you wish to overwrite the file with the same name, click Yes.

NOTE 1: If you need to make changes to any metadata, do so before publishing to the server.

NOTE 2: If you use the data extract in Tableau Online to create visualizations for multiple workbooks (which I believe you do, since that is the benefit of using a shared data source in Tableau Online), be very careful when making any changes to the calculated fields, groups, or other metadata. If calculations created in the local workbook have the same names as calculations in the data extract in Tableau Online, the Tableau Online version of the calculation will overwrite what you created in the local workbook. So make sure the data extract that will be published to Tableau Online contains the correct calculations.

Schedule data extract refresh in Tableau Online

Only cloud-based data sources (for example, Salesforce and Google Analytics) can be refreshed using scheduled jobs in Tableau Online. One option is to use the Tableau Desktop command line to refresh non-cloud-based data sources in Tableau Online. The Windows scheduler can be used to automate the refresh jobs that update extracts via the Tableau Desktop command line. Another option is to use the sync application or to refresh the extracts manually using Tableau Desktop.

NOTE: If you use the command line to refresh the extract, + cannot be used in the data extract name.

Tips for Incremental Refreshes

The following are tips for incremental refreshes:

- Incremental extracts retrieve only new records from the underlying data source, which reduces the time required to refresh the data extract.
- If there are no new records to add during an incremental extract, the processes associated with performing an incremental extract still execute.
- The performance of incremental refreshes decreases over time. This is because incremental extracts only grow in size, so the amount of data and the areas of memory that must be accessed to satisfy requests only grow as well. In addition, larger files are more likely to be fragmented on disk than smaller ones.
- When performing an incremental refresh of an extract, records are not replaced. Therefore, using a date field such as "Last Updated" in an incremental refresh could result in duplicate rows in the extract.
- Incremental refreshes are not possible after an additional file has been appended to a file-based data source, because the extract has multiple sources at that point.

Use Tableau web connector to create data extract

What is Tableau web connector?

The Tableau Web Data Connector is an API for people who want to write code to connect to web-based data such as a web page. The connectors are written in JavaScript. It seems that these web connectors connect mainly to web pages, web services, and so on, although they can also connect to local files.
How to use Tableau web connector?

Click Data | New Data Source | Web Data Connector.

Is the Tableau web connection live?

The data is pulled when the connection is built, and Tableau stores the data locally in a Tableau extract. You can still refresh the data manually or via scheduled jobs.

Are there any Tableau web connectors available?

Here is a list of web connectors from the Tableau community:

- Alteryx: http://data.theinformationlab.co.uk/alteryx.html
- Facebook: http://tableaujunkie.com/post/123558558693/facebook-web-data-connector

You can check the Tableau community for more web connectors.

Summary

In summary, keep in mind the following best practices for data extracts:

- Use a full refresh when possible.
- Fully refresh incrementally refreshed extracts on a regular basis.
- Publish data extracts to Tableau Online/Server to avoid duplicates.
- Hide unused fields and use filters before creating extracts to improve performance and save storage space.
- Make sure there is enough contiguous disk space for the largest extract file. A good way to achieve this is to use SSD drives.
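The advice to fully refresh incrementally refreshed extracts follows from the duplicate-row behavior noted in the incremental refresh tips. The Python toy model below shows the effect; it is a sketch under assumptions, not Tableau's refresh logic, and the row layout, field names, and function names are hypothetical.

```python
def full_refresh(source):
    """Rebuild the extract from scratch as a copy of the current source rows."""
    return [dict(row) for row in source]


def incremental_refresh(extract, source, key_field):
    """Append source rows whose key_field exceeds the highest value already
    in the extract. Existing rows are never replaced."""
    high_water = max((row[key_field] for row in extract), default=None)
    new_rows = [
        dict(row) for row in source
        if high_water is None or row[key_field] > high_water
    ]
    return extract + new_rows


# Day 1: two open tickets, both last updated on day 1.
day1 = [
    {"id": 1, "status": "open", "last_updated": 1},
    {"id": 2, "status": "open", "last_updated": 1},
]
extract = full_refresh(day1)

# Day 2: ticket 1 is closed, so its "last updated" value moves forward.
day2 = [
    {"id": 1, "status": "closed", "last_updated": 2},
    {"id": 2, "status": "open", "last_updated": 1},
]
stale = incremental_refresh(extract, day2, key_field="last_updated")

print([row["id"] for row in stale])  # [1, 2, 1]
print(len(full_refresh(day2)))       # 2
```

After the incremental refresh, ticket 1 appears twice (once open, once closed), while a full refresh of the same source returns the two current rows. An identifier that only grows for genuinely new records avoids the duplicates, but updates to existing rows are then picked up only by a full refresh.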

Packt
09 Dec 2016
4 min read

What’s New in SQL Server 2016 Reporting Services

In this article by Robert C. Cain, coauthor of the book SQL Server 2016 Reporting Services Cookbook, we'll take a brief tour of the new features in SQL Server 2016 Reporting Services.

SQL Server 2016 Reporting Services is a true evolution in reporting technology. After making few changes to SSRS over the last several releases, Microsoft unveiled a virtual cornucopia of new features.

Report Portal

The old Report Manager has received a complete facelift, along with many new features. Along with it came a rename: it is now known as the Report Portal.

Figure: the new Report Portal.

KPIs

KPIs are the first feature you'll notice. The Report Portal can display key performance indicators directly, so your users can get important metrics at a glance without opening reports. In addition, these KPIs can be linked to other report items such as reports and dashboards, so that a user can simply click on them to find more information.

Mobile Reporting

Microsoft recognized that the users in your organization no longer use just a computer to retrieve their information. Mobile devices, such as phones and tablets, are now commonplace. You could, of course, design individual reports for each platform, but that would cause a lot of repetitive work and limit reuse. To solve this, Microsoft has incorporated a new tool, Mobile Reports. This allows you to create an attractive dashboard that can be displayed in any web browser. In addition, you can easily rearrange the dashboard layout to optimize it for both phones and tablets. This means you can create your report once and use it on multiple platforms.

Figure: the same mobile report shown in a web browser, on a tablet, and on a phone.

Paginated reports

Traditional SSRS reports have now been renamed Paginated Reports, and are still a critical element in reporting.
These provide the detailed information needed for day-to-day activities in your company. Paginated reports have received several enhancements. First, there are two new chart types, Sunburst and TreeMap. Reports may now be exported to a new format, PowerPoint. Additionally, all reports are now rendered in HTML5. This makes them accessible to any browser, including those running on tablets or other platforms such as Linux or the Mac.

PowerBI

PowerBI Desktop reports may now be housed within the Report Portal. Currently, opening one will launch the PowerBI Desktop application. However, Microsoft has announced that in an upcoming update to SSRS 2016, PowerBI reports will be displayed directly within the Report Portal without the need to open the external app.

Reporting applications

Speaking of apps, the Report Builder has received a facelift, updating it to a more modern user interface with a color scheme that matches the Report Portal. Report Builder has also been decoupled from the installation of SQL Server. In previous versions, Report Builder was part of the SQL Server install, or it was available as a separate download. With SQL Server 2016, both the Report Builder and the Mobile Reporting tool are separate downloads, making it easier to keep them current as new versions are released. The Report Portal now contains links to download these tools.

Excel

Excel workbooks, often used as a reporting tool themselves, may now be housed within the Report Portal. Opening them will launch Excel, similar to the way in which PowerBI reports currently work.

Summary

This article summarizes just some of the many enhancements to SQL Server 2016 Reporting Services. With this release, Microsoft has worked toward meeting the needs of many users in the corporate environment, including the need for mobile reporting, dashboards, and enhanced paginated reports. 
For more details about these and many more features, see the book SQL Server 2016 Reporting Services Cookbook, by Dinesh Priyankara and Robert C. Cain.

Packt
08 Dec 2016
13 min read

Event detection from the news headlines in Hadoop

In this article by Anurag Shrivastava, author of Hadoop Blueprints, we will learn how to build a text analytics system that detects specific events in random news headlines.

The Internet has become the main source of news in the world. Thousands of websites constantly publish and update news stories from around the world. Not every news item is relevant to everyone, but some news items are critical for some people or businesses. For example, if you were a major car manufacturer based in Germany with suppliers located in India, you would be interested in news from the region that could affect your supply chain.

Road accidents in India are a major social and economic problem. They cause a large number of fatalities and result in the loss of capital. In this example, we will build a system that detects whether a news item refers to a road accident event. Let us define what we mean by that.

A road accident event may or may not result in fatal injuries. One or more vehicles and pedestrians may be involved in the accident. A non road accident event news item is everything else that cannot be categorized as a road accident event. It could be a trend analysis related to road accidents or something totally unrelated.

Technology stack

To build this system, we will use the following technologies:

Task              Technology
Data storage      HDFS
Data processing   Hadoop MapReduce
Query engine      Hive and Hive UDF
Data ingestion    Curl and HDFS copy
Event detection   OpenNLP

The event detection system is a machine learning based natural language processing system. The natural language processing component provides the intelligence to detect events in random headline sentences from news items.

OpenNLP

OpenNLP is an open source natural language processing framework from the Apache Software Foundation. 
You can download version 1.6.0 from https://opennlp.apache.org/ to run the examples in this blog. It is capable of detecting entities, document categories, parts of speech, and so on in text written by humans. We will use the document categorization feature of OpenNLP in our system. Document categorization requires you to train an OpenNLP model with sample text. The model that results from training is then used to categorize new text.

Our training data looks as follows:

r 1.46 lakh lives lost on Indian roads last year - The Hindu.
r Indian road accident data | OpenGovernmentData (OGD) platform...
r 400 people die everyday in road accidents in India: Report - India TV.
n Top Indian female biker dies in road accident during country-wide tour.
n Thirty die in road accidents in north India mountains - World - Dunya...
n India's top woman biker Veenu Paliwal dies in road accident: India...
r Accidents on India's deadly roads cost the economy over $8 billion...
n Thirty die in road accidents in north India mountains (The Express)

The first column can take two values:

n indicates that the news item is a road accident event
r indicates that the news item is not a road accident event (everything else)

This training set has a total of 200 lines. Please note that OpenNLP requires at least 15000 lines in the training set to deliver good results. Because we do not have that much training data, we will start with a small set but remain aware of the limitations of our model. You will see that even with a small training dataset, this model works reasonably well.

Let us train and build our model:

$ opennlp DoccatTrainer -model en-doccat.bin -lang en -data roadaccident.train.prn -encoding UTF-8

Here the file roadaccident.train.prn contains the training data. The output file en-doccat.bin contains the model which we will use in our data pipeline. 
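Before training, it can help to check how balanced the two labels are in the training file. The following is a minimal sketch in Python, assuming the one-sample-per-line layout shown above (a category, a space, then the headline). The helper names parse_training_line and label_counts are hypothetical and are not part of OpenNLP.

```python
from collections import Counter


def parse_training_line(line):
    """Split one 'category<space>text' line into a (category, text) tuple."""
    category, _, text = line.strip().partition(" ")
    return category, text.strip()


def label_counts(lines):
    """Count the samples per category, skipping blank lines."""
    return Counter(parse_training_line(l)[0] for l in lines if l.strip())


# A few lines in the same layout as roadaccident.train.prn
sample = [
    "r 1.46 lakh lives lost on Indian roads last year - The Hindu.",
    "n Top Indian female biker dies in road accident during country-wide tour.",
    "n Thirty die in road accidents in north India mountains (The Express)",
]

counts = label_counts(sample)
print(counts)
```

For the real file, the same function could be fed the lines of roadaccident.train.prn. A heavily skewed count would be one possible reason for a classifier that favors one label.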
We have built our model using the command line utility, but it is also possible to build the model programmatically. The training data file is a plain text file, which you can expand with a bigger corpus to make the model smarter.

Next we will build the data pipeline as follows:

Fetch RSS feeds

This component will fetch RSS news feeds from popular news websites. In this case, we will use just one news feed, from Google. We can always add more sites after our first RSS feed has been integrated. The whole RSS feed can be downloaded using the following command:

$ curl "https://news.google.com/news?cf=all&hl=en&ned=in&topic=n&output=rss"

The previous command downloads the news headlines for India. You can customize the RSS feed for your region by visiting the Google News site at https://news.google.com.

Scheduler

Our scheduler will fetch the RSS feed once every 6 hours. Let us assume that a 6 hour interval gives us a good likelihood of fetching fresh news items. We will wrap our feed fetching script in a shell file and invoke it using cron. The script is as follows:

$ cat feedfetch.sh
# Build a timestamped file name, download the feed, and copy it into HDFS
NAME="newsfeed-$(date +%Y-%m-%dT%H.%M.%S)"
curl "https://news.google.com/news?cf=all&hl=en&ned=in&topic=n&output=rss" > "$NAME"
hadoop fs -put "$NAME" /xml/rss/newsfeeds

The cron job setup line will be as follows (replace /home/hduser/mycommand with the path to your feedfetch.sh script):

0 */6 * * * /home/hduser/mycommand

Edit your cron table using the following command and add the setup line to it:

$ crontab -e

Loading data in HDFS

To load data into HDFS, we will use the HDFS put command, which copies the downloaded RSS feed into a directory in HDFS. Let us make the directory in HDFS where our feed fetcher script will store the RSS feeds:

$ hadoop fs -mkdir -p /xml/rss/newsfeeds

Query using Hive

First we will create an external table in Hive for the new RSS feeds. Using XPath-based select queries, we will extract the news headlines from the RSS feeds. 
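An XPath query such as //item/title/text() selects the text of every title element that sits inside an item element. To get a feel for what that returns outside Hive, here is a small sketch using Python's standard xml.etree.ElementTree module. The SAMPLE_RSS string and the extract_headlines helper are made up for illustration and are much simpler than a real Google News feed.

```python
import xml.etree.ElementTree as ET

# A tiny, invented RSS document with the same item/title structure as a news feed
SAMPLE_RSS = """<rss version="2.0"><channel>
  <title>Top Stories</title>
  <item><title>Headline one - Source A</title></item>
  <item><title>Headline two - Source B</title></item>
</channel></rss>"""


def extract_headlines(rss_xml):
    """Return the text of every <title> nested in an <item>, in document order."""
    root = ET.fromstring(rss_xml)
    return [t.text for t in root.findall(".//item/title") if t.text]


headlines = extract_headlines(SAMPLE_RSS)
print(headlines)
```

Note that the channel-level title ("Top Stories") is not selected, because it is not inside an item element. The same holds for the Hive query used in this article.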
These headlines will then be passed to a UDF to detect their categories. The table is created as follows:

CREATE EXTERNAL TABLE IF NOT EXISTS rssnews(
  document STRING)
COMMENT 'RSS Feeds from media'
STORED AS TEXTFILE
LOCATION '/xml/rss/newsfeeds';

The following command parses the XML to retrieve the titles (the headlines) and explodes them into a single-column table:

SELECT explode(xpath(document, '//item/title/text()')) FROM rssnews;

The sample output of the above command on my system is as follows:

hive> select explode(xpath(document, '//item/title/text()')) from rssnews;
Query ID = hduser_20161010134407_dcbcfd1c-53ac-4c87-976e-275a61ac3e8d
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1475744961620_0016, Tracking URL = http://localhost:8088/proxy/application_1475744961620_0016/
Kill Command = /home/hduser/hadoop-2.7.1/bin/hadoop job -kill job_1475744961620_0016
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2016-10-10 14:46:14,022 Stage-1 map = 0%, reduce = 0%
2016-10-10 14:46:20,464 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 4.69 sec
MapReduce Total cumulative CPU time: 4 seconds 690 msec
Ended Job = job_1475744961620_0016
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 4.69 sec HDFS Read: 120671 HDFS Write: 1713 SUCCESS
Total MapReduce CPU Time Spent: 4 seconds 690 msec
OK
China dispels hopes of early breakthrough on NSG, sticks to its guns on Azhar - The Hindu
Pampore attack: Militants holed up inside govt building; combing operations intensify - Firstpost
CPI(M) worker hacked to death in Kannur - The Hindu
Akhilesh Yadav's comment on PM Modi's Lucknow visit shows Samajwadi Party's insecurity: BJP - The Indian Express
PMO maintains no data about petitions personally read by PM - Daily News & Analysis
AIADMK launches social media campaign to put an end to rumours regarding Amma's health - Times of India
Pakistan, India using us to play politics: Former Baloch CM - Times of India
Indian soldier, who recited patriotic poem against Pakistan, gets death threat - Zee News
This Dussehra effigies of 'terrorism' to go up in flames - Business Standard
'Personal reasons behind Rohith's suicide': Read commission's report - Hindustan Times
Time taken: 5.56 seconds, Fetched: 10 row(s)

Hive UDF

Our Hive User Defined Function (UDF) uses the method categorizeDoc, which takes a news headline and suggests whether or not it reports a road accident event, as explained earlier. The class is as follows:

package com.mycompany.app;

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import opennlp.tools.doccat.DoccatModel;
import opennlp.tools.doccat.DocumentCategorizerME;
import opennlp.tools.util.InvalidFormatException;

import org.apache.hadoop.hive.ql.exec.Description;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

@Description(
    name = "getCategory",
    value = "_FUNC_(string) - gets the category of a document")
public final class MyUDF extends UDF {

  // Called by Hive once per row; returns the predicted category or a failure message.
  public Text evaluate(Text input) {
    if (input == null) return null;
    try {
      return new Text(categorizeDoc(input.toString()));
    } catch (Exception ex) {
      ex.printStackTrace();
      return new Text("Sorry Failed: >> " + input.toString());
    }
  }

  public String categorizeDoc(String doc) throws InvalidFormatException, IOException {
    // Load the trained model from the working directory (it is reloaded on every call).
    DoccatModel model;
    try (InputStream is = new FileInputStream("./en-doccat.bin")) {
      model = new DoccatModel(is);
    }
    DocumentCategorizerME classificationME = new DocumentCategorizerME(model);
    // Score the headline against each category and return the most likely one.
    double[] classDistribution = classificationME.categorize(doc);
    return classificationME.getBestCategory(classDistribution);
  }
}

The function categorizeDoc takes a single string as input. It loads the model which we created earlier from the file en-doccat.bin in the local directory. 
Finally it calls the classifier, which returns the result to the calling function. The calling class MyUDF extends the Hive UDF class. Its evaluate method calls categorizeDoc for each input string. If the call succeeds, the value is returned to the calling program; otherwise a message is returned which indicates that the category detection has failed.

The pom.xml file to build the above file is as follows:

$ cat pom.xml
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.mycompany</groupId>
  <artifactId>app</artifactId>
  <version>1.0</version>
  <packaging>jar</packaging>
  <properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <maven.compiler.source>1.7</maven.compiler.source>
    <maven.compiler.target>1.7</maven.compiler.target>
  </properties>
  <dependencies>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>4.12</version>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client</artifactId>
      <version>2.7.1</version>
      <type>jar</type>
    </dependency>
    <dependency>
      <groupId>org.apache.hive</groupId>
      <artifactId>hive-exec</artifactId>
      <version>2.0.0</version>
      <type>jar</type>
    </dependency>
    <dependency>
      <groupId>org.apache.opennlp</groupId>
      <artifactId>opennlp-tools</artifactId>
      <version>1.6.0</version>
    </dependency>
  </dependencies>
  <build>
    <pluginManagement>
      <plugins>
        <plugin>
          <groupId>org.apache.maven.plugins</groupId>
          <artifactId>maven-surefire-plugin</artifactId>
          <version>2.8</version>
        </plugin>
        <plugin>
          <artifactId>maven-assembly-plugin</artifactId>
          <configuration>
            <archive>
              <manifest>
                <mainClass>com.mycompany.app.App</mainClass>
              </manifest>
            </archive>
            <descriptorRefs>
              <descriptorRef>jar-with-dependencies</descriptorRef>
            </descriptorRefs>
          </configuration>
        </plugin>
      </plugins>
    </pluginManagement>
  </build>
</project>

You can build the jar with all the dependencies in it using the following command:

$ mvn clean compile assembly:single

The resulting jar file app-1.0-jar-with-dependencies.jar can be found in the target directory. Let us use this jar file in Hive to categorize the news headlines as follows:

Copy the jar file to the bin subdirectory in the Hive root:

$ cp app-1.0-jar-with-dependencies.jar $HIVE_ROOT/bin

Copy the trained model to the bin subdirectory in the Hive root:

$ cp en-doccat.bin $HIVE_ROOT/bin

Run the categorization queries

Run Hive:

$ hive

Add the jar file in Hive:

hive> ADD JAR ./app-1.0-jar-with-dependencies.jar ;

Create a temporary categorization function catDoc:

hive> CREATE TEMPORARY FUNCTION catDoc as 'com.mycompany.app.MyUDF';

Create a table headlines to hold the headlines extracted from the RSS feed:

hive> create table headlines( headline string);

Insert the extracted headlines into the table headlines:

hive> insert overwrite table headlines select explode(xpath(document, '//item/title/text()')) from rssnews;

Let's test our UDF by manually passing it a real news headline from a newspaper website:

hive> select catDoc("8 die as SUV falls into river while crossing bridge in Ghazipur") ;
OK
N

The output is N, which means this is indeed a headline about a road accident incident. 
This is reasonably good, so now let us run this function for all the headlines:

hive> select headline, catDoc(*) from headlines;
OK
China dispels hopes of early breakthrough on NSG, sticks to its guns on Azhar - The Hindu    r
Pampore attack: Militants holed up inside govt building; combing operations intensify - Firstpost    r
Akhilesh Yadav Backs Rahul Gandhi's 'Dalali' Remark - NDTV    r
PMO maintains no data about petitions personally read by PM Narendra Modi - Economic Times    n
Mobile Internet Services Suspended In Protest-Hit Nashik - NDTV    n
Pakistan, India using us to play politics: Former Baloch CM - Times of India    r
CBI arrests Central Excise superintendent for taking bribe - Economic Times    n
Be extra vigilant during festivals: Centre's advisory to states - Times of India    r
CPI-M worker killed in Kerala - Business Standard    n
Burqa-clad VHP activist thrashed for sneaking into Muslim women gathering - The Hindu    r
Time taken: 0.121 seconds, Fetched: 10 row(s)

You can see that our headline detection function works and outputs r or n. In the above example, we see several false positives, where a headline has been incorrectly identified as a road accident (n). Better training for our model can improve the quality of our results.

Further reading

The book Hadoop Blueprints covers several case studies where we can apply Hadoop, HDFS, data ingestion tools such as Flume and Sqoop, query and visualization tools such as Hive and Zeppelin, and machine learning tools such as BigML and Spark to build solutions. You will discover, for example, how to build a fraud detection system using Hadoop or how to build a data lake.

Summary

In this article we have learned to build a text analytics system which detects specific events in random news headlines, using Hadoop, HDFS, Hive, and OpenNLP.

Packt
07 Dec 2016
23 min read

Build a Chatbot

In this article written by Alexander T. Combs, author of the book Python Machine Learning Blueprints, we are going to learn how to construct a chatbot from scratch. Along the way, we'll learn more about the history of the field and its future prospects.

We'll cover the following topics in this article:

The Turing test
The history of chatbots
The design of chatbots
Building a chatbot

The Turing test

In 1950, Alan Turing posed the question, “Can machines think?” in one of his more famous papers. Being the practical genius that he was, he decided not to tackle this question head on but to pose it in the framework of the “problem of other minds” instead. This problem poses the question, “How do we truly know that other people have minds like our own?”. As we can only observe their behavior and not the inner workings of their mind, we must take it on faith that they are like us. Turing proposed that if a machine behaved as if it was intelligent, then in fact, we should view it as such. This, in a nutshell, is the Turing test. Passing this test means convincing humans that a machine is a fellow human.

You may wonder at this point whether a program has ever passed the Turing test successfully. The answer is that there really is no official Turing test. The closest thing to an official event is the Loebner Prize. This is a contest held annually that awards modest prizes to chatbots that appear to be the most convincingly human to a panel of judges. If this sounds rigorous and academic, it has been described as anything but. Loebner's sponsorship for the contest is said to have been a product of his total disdain for labor of any kind.

Hopefully, by this point, you have realized that any claims about “passing the Turing test” are more spectacle than actual science. However, despite this, Turing's original concept is a useful construct. 

The history of chatbots

I still remember the day my father sat me down in front of his IBM XT with its black and green flickering monitor and asked me to talk to “it”. The conversation went something like the following:

>Hello? How are you feeling today?
>Hi. Who is this?
>Hi there… how are you today?
>I'm fine. You didn't tell me who you are.
>We should be discussing you, not me.
>We should? Why?
>Let's change focus a bit… Tell me about your family.
>You are impossible to have a conversation with.
>Can you elaborate on that?
>You are killing me.
>Perhaps you would like me to be killing you.
>Ok, you are freaking me out.
>Perhaps you're really talking about yourself?
>What?? That makes no sense.
>How would an answer to that help you?
>END
>Very Interesting.
>QUIT
>Thank you for talking with me.
>#$###$#@@….

The “it” that I talked to was a program called ELIZA. Written in the mid-1960s at MIT's Artificial Intelligence Lab by Joseph Weizenbaum, ELIZA was created to mimic the responses of a Rogerian psychotherapist. Though nearly comical when examined in any depth, the program was capable of convincing some users that they were chatting with an actual human. This was a remarkable feat considering it was a scant 200 lines of code that used randomization and regular expressions to parrot back responses. Even today, this simple program remains a staple of popular culture. If you ask Siri who ELIZA is, she will tell you she is a friend and brilliant psychiatrist.

If ELIZA was an early example of chatbots, what have we seen since? In recent years, there has been an explosion of new chatbots; the most notable of these is Cleverbot. Cleverbot was released to the world via the web in 1997. Since then, this bot has racked up hundreds of millions of conversations. Unlike early chatbots, Cleverbot (as the name suggests) appears to become more intelligent with each conversation. 
Though the exact details of the workings of the algorithm are difficult to find, it is said to work by recording all conversations in a database and finding the most appropriate response by identifying the most similar questions and responses in the database.

I made up a nonsensical question in the following screenshot, and you can see that it found something similar to the object of my question in terms of a string match.

[Screenshot: a nonsensical question put to Cleverbot and its reply]

I persisted:

[Screenshot: the follow-up exchange with Cleverbot]

Again I got something…similar? You'll also notice that topics can persist across the conversation. In response to my answer, I was asked to go into more detail and justify my answer. This is one of the things that appears to make Cleverbot, well, clever.

While chatbots that learn from humans can be quite amusing, they can also have a darker side. Just this past year, Microsoft released a chatbot named Tay on Twitter. People were invited to ask questions of Tay, and Tay would respond in accordance with her “personality”. Microsoft had apparently programmed the bot to appear to be a 19-year-old American girl. She was intended to be your virtual “bestie”; the only problem was she started sounding like she would rather hang with the Nazi youth than you. As a result of these unbelievably inflammatory tweets, Microsoft was forced to pull Tay off Twitter and issue an apology:

“As many of you know by now, on Wednesday we launched a chatbot called Tay. We are deeply sorry for the unintended offensive and hurtful tweets from Tay, which do not represent who we are or what we stand for, nor how we designed Tay. Tay is now offline and we'll look to bring Tay back only when we are confident we can better anticipate malicious intent that conflicts with our principles and values.”
-March 25, 2016 Official Microsoft Blog

Clearly, brands that want to release chatbots into the wild in the future should take a lesson from this debacle. There is no doubt that brands are embracing chatbots. Everyone from Facebook to Taco Bell is getting in on the game. 
Witness the TacoBot:

[Screenshot: a conversation with TacoBot]

Yes, this is a real thing, and despite stumbles such as Tay, there is a good chance the future of UI looks a lot like TacoBot. One last example might even help explain why.

Quartz recently launched an app that turns news into a conversation. Rather than lay out the day's stories as a flat list, it engages you in a chat, as if you were getting news from a friend. David Gasca, a PM at Twitter, describes his experience using the app in a post on Medium. He describes how the conversational nature invoked feelings that were normally only triggered in human relationships. This is his take on how he felt when he encountered an ad in the app:

“Unlike a simple display ad, in a conversational relationship with my app, I feel like I owe something to it: I want to click. At the most subconscious level, I feel the need to reciprocate and not let the app down: The app has given me this content. It's been very nice so far and I enjoyed the GIFs. I should probably click since it's asking nicely.”

If this experience is universal, and I expect that it is, this could be the next big thing in advertising, and have no doubt that advertising profits will drive UI design:

“The more the bot acts like a human, the more it will be treated like a human.”
-Matt Webb, technologist and co-author of Mind Hacks

At this point, you are probably dying to know how these things work, so let's get on with it!

The design of chatbots

The original ELIZA application was two-hundred odd lines of code. The Python NLTK implementation is similarly short. An excerpt can be seen at the following link from NLTK's website (http://www.nltk.org/_modules/nltk/chat/eliza.html). 
I have also reproduced an excerpt below:

# Natural Language Toolkit: Eliza
#
# Copyright (C) 2001-2016 NLTK Project
# Authors: Steven Bird <stevenbird1@gmail.com>
#          Edward Loper <edloper@gmail.com>
# URL: <http://nltk.org/>
# For license information, see LICENSE.TXT

# Based on an Eliza implementation by Joe Strout <joe@strout.net>,
# Jeff Epler <jepler@inetnebr.com> and Jez Higgins <mailto:jez@jezuk.co.uk>.

# a translation table used to convert things you say into things the
# computer says back, e.g. "I am" --> "you are"

from __future__ import print_function
from nltk.chat.util import Chat, reflections

# a table of response pairs, where each pair consists of a
# regular expression, and a list of possible responses,
# with group-macros labelled as %1, %2.

pairs = (
    (r'I need (.*)',
     ("Why do you need %1?",
      "Would it really help you to get %1?",
      "Are you sure you need %1?")),

    (r'Why don\'t you (.*)',
     ("Do you really think I don't %1?",
      "Perhaps eventually I will %1.",
      "Do you really want me to %1?")),

    # [snip]

    (r'(.*)\?',
     ("Why do you ask that?",
      "Please consider whether you can answer your own question.",
      "Perhaps the answer lies within yourself?",
      "Why don't you tell me?")),

    (r'quit',
     ("Thank you for talking with me.",
      "Good-bye.",
      "Thank you, that will be $150. Have a good day!")),

    (r'(.*)',
     ("Please tell me more.",
      "Let's change focus a bit... Tell me about your family.",
      "Can you elaborate on that?",
      "Why do you say that %1?",
      "I see.",
      "Very interesting.",
      "%1.",
      "I see. And what does that tell you?",
      "How does that make you feel?",
      "How do you feel when you say that?"))
)

eliza_chatbot = Chat(pairs, reflections)

def eliza_chat():
    print("Therapist\n---------")
    print("Talk to the program by typing in plain English, using normal upper-")
    print('and lower-case letters and punctuation. Enter "quit" when done.')
    print('='*72)
    print("Hello. How are you feeling today?")
    eliza_chatbot.converse()

def demo():
    eliza_chat()

if __name__ == "__main__":
    demo()

As you can see from this code, input text is parsed and then matched against a series of regular expressions. Once the input is matched, a randomized response (that sometimes echoes back a portion of the input) is returned. So, something such as “I need a taco” would trigger a response of “Would it really help you to get a taco?” Obviously, the answer is yes, and fortunately, we have advanced to the point that technology can provide one to you (bless you, TacoBot), but this was still in the early days. Shockingly, some people did actually believe ELIZA was a real human.

However, what about more advanced bots? How are they constructed?

Surprisingly, most of the chatbots that you're likely to encounter don't even use machine learning; they use what's known as retrieval-based models. This means responses are predefined according to the question and the context. The most common architecture for these bots is something called Artificial Intelligence Markup Language (AIML). AIML is an XML-based schema that represents how the bot should respond to the user's input. It's really just a more advanced version of how ELIZA works.

Let's take a look at how responses are generated using AIML. First, all inputs are preprocessed to normalize them. This means when you input “Waaazzup???”, it is mapped to “WHAT IS UP”. This preprocessing step funnels down the myriad ways of saying the same thing into one input that can run against a single rule. Punctuation and other extraneous inputs are removed as well at this point. Once this is complete, the input is matched against the appropriate rule. The following is a sample template:

<category>
  <pattern>WHAT IS UP</pattern>
  <template>The sky, duh. Pfft. Humans...</template>
</category>

This is the basic setup, but you can also layer in wildcards, randomization, and prioritization schemes. 
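The normalize-then-match flow described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how an AIML interpreter is actually implemented: the functions normalize, pattern_to_regex, and respond are hypothetical, the normalization only uppercases and strips punctuation (it does not map slang such as “Waaazzup” to “WHAT IS UP”), and a * is treated as “one or more words”.

```python
import re


def normalize(text):
    """Uppercase the input and drop punctuation (a crude stand-in for AIML preprocessing)."""
    return re.sub(r"[^A-Z0-9 ]", "", text.upper()).strip()


def pattern_to_regex(pattern):
    """Compile an AIML-style pattern; each '*' token captures one or more words."""
    parts = [r"(.+)" if tok == "*" else re.escape(tok) for tok in pattern.split()]
    return re.compile("^" + " ".join(parts) + "$")


# (pattern, template) rules; {star} is replaced by whatever the wildcard captured
RULES = [
    ("* FOR ME", "I'm a bot. I don't {star}. Ever."),
    ("WHAT IS UP", "The sky, duh. Pfft. Humans..."),
]


def respond(text):
    """Return the template of the first matching rule, or None if nothing matches."""
    cleaned = normalize(text)
    for pattern, template in RULES:
        match = pattern_to_regex(pattern).match(cleaned)
        if match:
            star = match.group(1).lower() if match.groups() else ""
            return template.format(star=star)
    return None


print(respond("What is up?"))
print(respond("Dance for me!"))
```

Rules are tried in order, so listing more specific patterns first is one simple way to get a prioritization scheme. Randomization could be added by storing several templates per pattern and picking one at random.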
For example, the following pattern uses wildcard matching:

<category>
  <pattern>* FOR ME</pattern>
  <template>I'm a bot. I don't <star/>. Ever.</template>
</category>

Here, the * wildcard matches one or more words before FOR ME and then repeats these back in the output template. If the user were to type in “Dance for me!”, the response would be, “I'm a bot. I don't dance. Ever.”

As you can see, these rules don't make for anything that approximates any type of real intelligence, but there are a few tricks that strengthen the illusion. One of the better ones is the ability to generate responses conditioned on a topic. For example, here is a rule that invokes a topic:

<category>
  <pattern>I LIKE TURTLES</pattern>
  <template>I feel like this whole <set name="topic">turtles</set> thing could be a problem. What do you like about them?</template>
</category>

Once the topic is set, the rules specific to that context can be matched:

<topic name="turtles">
  <category>
    <pattern>* SHELL IS *</pattern>
    <template>I dislike turtles primarily because of their shells. What other creepy things do you like about turtles?</template>
  </category>
  <category>
    <pattern>* HIDE *</pattern>
    <template>I wish like a turtle that I could hide from this conversation.</template>
  </category>
</topic>

Let's see what this interaction looks like:

>I like turtles!
>I feel like this whole turtle thing could be a problem. What do you like about them?
>I like how they hide in their shell
>I wish like a turtle I could hide from this conversation.

You can see that the continuity across the conversation adds a measure of realism.

You probably think that this can't be state-of-the-art in this age of deep learning, and you're right. While most bots are rule-based, the next generation of chatbots is emerging, and they are based on neural networks. 
In 2015, Oriol Vinyals and Quoc Le of Google published a paper (http://arxiv.org/pdf/1506.05869v1.pdf), which described the construction of a neural network based on sequence-to-sequence models. This type of model maps an input sequence, such as “ABC”, to an output sequence, such as “XYZ”. These inputs and outputs can be, for example, translations from one language to another. However, in the case of their work here, the training data was not language translation, but rather tech support transcripts and movie dialog. While the results from both models are interesting, it was the interactions based on the movie model that stole the headlines.

The following are sample interactions taken from the paper:

[Image: sample conversation from the paper]

None of this was explicitly encoded by humans or present in a training set as asked, and yet, looking at this, it is frighteningly like speaking with a human. However, let's see more…

[Image: sample conversation from the paper]

Note that the model responds with what appears to be knowledge of gender (he, she), of place (England), and career (player). Even questions of meaning, ethics, and morality are fair game:

[Image: sample conversation from the paper]

The conversation continues:

[Image: sample conversation from the paper]

If this transcript doesn't give you a slight chill of fear for the future, there's a chance you may already be some sort of AI.

I wholeheartedly recommend reading the entire paper. It isn't overly technical, and it will definitely give you a glimpse of where this technology is headed.

We talked a lot about the history, types, and design of chatbots, but let's now move on to building our own!

Building a chatbot

Now, having seen what is possible in terms of chatbots, you most likely want to build the best, most state-of-the-art, Google-level bot out there, right? Well, just put that out of your mind right now because we will do just the opposite! We will build the best, most awful bot ever!

Let me tell you why. Building a chatbot comparable to what Google built takes some serious hardware and time. 
You aren't going to whip up a model on your MacBook Pro that takes anything less than a month or two to run with any type of real training set. This means that you will have to rent some time on an AWS box, and not just any box. This box will need to have some heavy-duty specs and preferably be GPU-enabled. You are more than welcome to attempt such a thing. However, if your goal is just to build something very cool and engaging, I have you covered here. I should also warn you in advance, although Cleverbot is no Tay, the conversations can get a bit salty. If you are easily offended, you may want to find a different training set. Ok, let's get started! First, as always, we need training data. Again, as always, this is the most challenging step in the process. Fortunately, I have come across an amazing repository of conversational data. The notsocleverbot.com site has people submit the most absurd conversations they have with Cleverbot. How can you ask for a better training set? Let's take a look at a sample conversation between Cleverbot and a user from the site: So, this is where we'll begin. We'll need to download the transcripts from the site to get started: You'll just need to paste the link into the form on the page. The format will be like the following: http://www.notsocleverbot.com/index.php?page=1. Once this is submitted, the site will process the request and return a page back that looks like the following: From here, if everything looks right, click on the pink Done button near the top right. The site will process the page and then bring you to the following page: Next, click on the Show URL Generator button in the middle: Next, you can set the range of numbers that you'd like to download from. For example, 1-20, by 1 step. Obviously, the more pages you capture, the better this model will be. However, remember that you are taxing the server, so please be considerate. 
Once this is done, click on Add to list and hit Return in the text box, and you should be able to click on Save. It will begin running, and when it is complete, you will be able to download the data as a CSV file.

Next, we'll use our Jupyter notebook to examine and process the data. We'll first import pandas and the Python regular expressions library, re. We will also set the option in pandas to widen our column width so that we can see the data better:

    import pandas as pd
    import re
    pd.set_option('display.max_colwidth', 200)

Now, we'll load in our data:

    df = pd.read_csv('/Users/alexcombs/Downloads/nscb.csv')
    df

The preceding code will result in the following output: As we're only interested in the first column, the conversation data, we'll parse this out:

    convo = df.iloc[:,0]
    convo

The preceding code will result in the following output: You should be able to make out that we have interactions between User and Cleverbot, and that either can initiate the conversation. To get the data in the format that we need, we'll have to parse it into question and response pairs. We aren't necessarily concerned with who says what, but we are concerned with matching up each response to each question. You'll see why in a bit. Let's now perform a bit of regular expression magic on the text:

    clist = []

    def qa_pairs(x):
        # Capture the text that follows each "Speaker: " prefix, up to the end of the line
        cpairs = re.findall(r": (.*?)(?:$|\n)", x)
        # Pair every utterance with the one that follows it
        clist.extend(list(zip(cpairs, cpairs[1:])))

    convo.map(qa_pairs);
    convo_frame = pd.Series(dict(clist)).to_frame().reset_index()
    convo_frame.columns = ['q', 'a']

The preceding code results in the following output: Okay, there's a lot of code there. What just happened? We first created a list to hold our question and response tuples. We then passed our conversations through a function to split them into these pairs using regular expressions. Finally, we set it all into a pandas DataFrame with columns labelled q and a.
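To see what the regular expression does on its own, here is a minimal sketch that runs the same pairing logic on a made-up transcript. The sample string and the list-returning variant of qa_pairs are assumptions for illustration only; the real input is the first column of the downloaded CSV.

```python
import re

def qa_pairs(conversation):
    """Return (utterance, reply) tuples for one conversation string."""
    # Capture the text after each "Speaker: " prefix, up to the end of the line.
    turns = re.findall(r": (.*?)(?:$|\n)", conversation)
    # Pair each utterance with the one that immediately follows it.
    return list(zip(turns, turns[1:]))

# Hypothetical transcript in the "Speaker: text" layout described above.
sample = "User: Hi there.\nCleverbot: Hello.\nUser: How are you?"
pairs = qa_pairs(sample)
for q, a in pairs:
    print(repr(q), "->", repr(a))

# dict() keeps only the last reply seen for a repeated question.
lookup = dict(pairs + [("Hello.", "Goodbye.")])
print(lookup["Hello."])
```

One consequence worth noting: because the pairs are passed through dict before becoming a DataFrame, each distinct question ends up with a single answer, so earlier replies to a repeated question are dropped.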
We will now apply a bit of algorithm magic to match up the closest question to the one a user inputs:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    vectorizer = TfidfVectorizer(ngram_range=(1,3))
    vec = vectorizer.fit_transform(convo_frame['q'])

What we did in the preceding code was to import the TfidfVectorizer class and the cosine_similarity function. We then used our training data to create a tf-idf matrix. We can now use this to transform our own new questions and measure the similarity to existing questions in our training set. We covered cosine similarity and tf-idf algorithms in detail, so flip back there if you want to understand how these work under the hood. Let's now get our similarity scores:

    my_q = vectorizer.transform(['Hi. My name is Alex.'])
    cs = cosine_similarity(my_q, vec)
    rs = pd.Series(cs[0]).sort_values(ascending=False)
    top5 = rs.iloc[0:5]
    top5

The preceding code results in the following output: What are we looking at here? This is the cosine similarity between the question I asked and the top five closest questions. To the left is the index and on the right is the cosine similarity. Let's take a look at these:

    convo_frame.iloc[top5.index]['q']

This results in the following output: As you can see, nothing is exactly the same, but there are definitely some similarities. Let's now take a look at the response:

    rsi = rs.index[0]
    rsi
    convo_frame.iloc[rsi]['a']

The preceding code results in the following output: Okay, so our bot seems to have an attitude already. Let's push further.
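For readers who want to see the similarity score without scikit-learn, the following is a rough sketch that uses only the standard library. It is a simplification, not what TfidfVectorizer does: it compares plain word counts rather than tf-idf weighted n-grams, and the helper names (bag_of_words, cosine, closest) and sample questions are made up for illustration.

```python
import math
import re
from collections import Counter

def bag_of_words(text):
    """Lower-case the text and count its word tokens."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is treated as similar to nothing
    return dot / (norm_a * norm_b)

def closest(question, known_questions):
    """Index of the known question most similar to `question`."""
    query = bag_of_words(question)
    scores = [cosine(query, bag_of_words(k)) for k in known_questions]
    return max(range(len(scores)), key=scores.__getitem__)

# Made-up questions standing in for the 'q' column.
known = ["What is your name?", "Do you like cake?", "Hi, my name is Bob."]
best = closest("Hi. My name is Alex.", known)
print(known[best])
```

The design is the same as in the main text: turn every question into a vector, score the new question against all of them, and return whatever is stored at the best-scoring index.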
We'll create a handy function so that we can test a number of statements easily:

    def get_response(q):
        my_q = vectorizer.transform([q])
        cs = cosine_similarity(my_q, vec)
        rs = pd.Series(cs[0]).sort_values(ascending=False)
        rsi = rs.index[0]
        return convo_frame.iloc[rsi]['a']

    get_response('Yes, I am clearly more clever than you will ever be!')

This results in the following output: We have clearly created a monster, so we'll continue:

    get_response('You are a stupid machine. Why must I prove anything to you?')

This results in the following output: I'm enjoying this. Let's keep rolling with it:

    get_response('My spirit animal is a menacing cat. What is yours?')

To which I responded:

    get_response("I mean I didn't actually name it.")

This results in the following output: Continuing:

    get_response('Do you have a name suggestion?')

This results in the following output: To which I respond:

    get_response('I think it might be a bit aggressive for a kitten')

This results in the following output: I attempt to calm the situation:

    get_response('No need to involve the police.')

This results in the following output: And finally,

    get_response('And I you, Cleverbot')

This results in the following output: Remarkably, this may be one of the best conversations I've had in a while: bot or no bot.

Now that we have created this cake-based intelligence, let's set it up so that we can actually chat with it via text message. We'll need a few things to make this work. The first is a Twilio account. They will give you a free account that lets you send and receive text messages. Go to http://www.twilio.com and click to sign up for a free developer API key. You'll set up some login credentials and they will text your phone to confirm your number. Once this is set up, you'll be able to find the details in their Quickstart documentation. Make sure that you select Python from the drop-down menu in the upper left-hand corner.
Sending messages from Python code is a breeze, but you will need to request a Twilio number. This is the number that you will use to send and receive messages in your code. The receiving bit is a little more complicated because it requires you to have a web server running. The documentation is succinct, so you shouldn't have that hard a time getting it set up. You will need to paste a public-facing Flask server's URL in under the area where you manage your Twilio numbers. Just click on the number and it will bring you to the spot to paste in your URL: Once this is all set up, you will just need to make sure that you have your Flask web server up and running. I have condensed all the code here for you to use on your Flask app:

    from flask import Flask, request
    import twilio.twiml
    import pandas as pd
    import re
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    app = Flask(__name__)

    PATH_TO_CSV = 'your/path/here.csv'
    df = pd.read_csv(PATH_TO_CSV)
    convo = df.iloc[:,0]

    clist = []

    def qa_pairs(x):
        cpairs = re.findall(r": (.*?)(?:$|\n)", x)
        clist.extend(list(zip(cpairs, cpairs[1:])))

    convo.map(qa_pairs);
    convo_frame = pd.Series(dict(clist)).to_frame().reset_index()
    convo_frame.columns = ['q', 'a']

    vectorizer = TfidfVectorizer(ngram_range=(1,3))
    vec = vectorizer.fit_transform(convo_frame['q'])

    def get_response(q):
        # Return the stored answer whose question is most similar to q
        my_q = vectorizer.transform([q])
        cs = cosine_similarity(my_q, vec)
        rs = pd.Series(cs[0]).sort_values(ascending=False)
        rsi = rs.index[0]
        return convo_frame.iloc[rsi]['a']

    @app.route("/", methods=['GET', 'POST'])
    def sms_reply():
        # Twilio sends the text of the incoming message in the 'Body' field
        input_str = request.values.get('Body')
        # Note: twilio.twiml.Response is the older helper-library interface;
        # newer releases provide twilio.twiml.messaging_response.MessagingResponse
        resp = twilio.twiml.Response()
        if input_str:
            resp.message(get_response(input_str))
        else:
            resp.message('Something bad happened here.')
        return str(resp)

It looks like there is a lot going on, but essentially we use the same code that we used before, only now we grab the POST data that Twilio sends (the text body specifically) rather than the data we hand-entered before into our get_response function. If all goes as planned, you should have your very own weirdo bestie that you can text anytime, and what could be better than that!

Summary

In this article, we had a full tour of the chatbot landscape. It is clear that we are just on the cusp of an explosion of these sorts of applications. The Conversational UI revolution is just about to begin. Hopefully, this article has inspired you to create your own bot, but if not, at least perhaps you have a much richer understanding of how these applications work and how they will shape our future. I'll let the app say the final words:

    get_response("Say goodbye, Clevercake")

Resources for Article: Further resources on this subject: Supervised Machine Learning [article] Unsupervised Learning [article] Specialized Machine Learning Topics [article]

Packt
02 Dec 2016
5 min read

Define the Necessary Connections

In this article by Robert van Mölken and Phil Wilkins, the authors of the book Implementing Oracle Integration Cloud Service, we will see how to create connections, which are one of the core components of an integration. We can easily navigate to the Designer Portal and start creating connections. (For more resources related to this topic, see here.)

On the home page, click the Create link of the Connection tile as given in the following screenshot: Because we clicked this link, the Connections page, which lists all created connections, is loaded, and a modal dialogue automatically opens on top of the list. This pop-up shows all the adapter types we can create. For our first integration we define two technology adapter connections: an inbound SOAP connection and an outbound REST connection.

Inbound SOAP connection

In the pop-up we can scroll down the list and find the SOAP adapter, but the modal dialogue also includes a search field. Just search on SOAP and the list will show the adapters matching the search criteria: Find your adapter by searching on the name, or change the appearance from card to list view to show more adapters at once. Click Select to open the New Connection page. Before we can set up any adapter-specific configuration, every creation starts with choosing a name and an optional description: Create the connection with the following details:

Connection Name: FlightAirlinesSOAP_Ch2
Identifier: This will be proposed based on the connection name, and there is no need to change it unless you'd like an alternate name. It is usually the name in all CAPITALS and without spaces, and it has a maximum length of 32 characters.
Connection Role: Trigger. The role chosen restricts the connection to be used only in the selected role(s).
Description: This receives Airline objects as a SOAP service.

Click the Create button to accept the details. This will bring us to the specific adapter configuration page where we can add and modify the necessary properties.
The one thing all the adapters have in common is the optional Email Address under Connection Administration. Notifications are sent to this email address when problems or changes occur in the connection. A SOAP connection consists of three sections: Connection Properties, Security, and an optional Agent Group. On the right side of each section we can find a button to configure its properties. Let's configure each section using the following steps:

1. Click the Configure Connectivity button. Instead of entering a URL, we are uploading the WSDL file.
2. Check the box in the Upload File column.
3. Click the newly shown Upload button.
4. Upload the file ICSBook-Ch2-FlightAirlines-Source WSDL.
5. Click OK to save the properties.
6. Click the Configure Credentials button. In the pop-up that is shown we can configure the security credentials. We have the choice of Basic authentication, Username Password Token, or No Security Policy. Because we use it for our inbound connection, we don't have to configure this.
7. Select No Security Policy from the dropdown list. This removes the username and password fields.
8. Click OK to save the properties.
9. We leave the Agent Group section untouched. We can attach an Agent Group if we want to use it as an outbound connection to an on-premises web service.
10. Click Test to check if the connection is working (otherwise it can't be used). For SOAP and REST it simply pings the given domain to check the connectivity, but others, for example the Oracle SaaS adapters, also authenticate and collect metadata.
11. Click the Save button at the top of the page to persist our changes.
12. Click Exit Connection to return to the list from where we started.

Outbound REST connection

Now that the inbound connection is created, we can create our REST adapter. Click the Create New Connection button to show the Create Connection pop-up again and select the REST adapter.
Create the connection with the following details:

Connection Name: FlightAirlinesREST_Ch2
Identifier: This will be proposed based on the connection name
Connection Role: Invoke
Description: This returns the Airline objects as a REST/JSON service
Email Address: Your email address, to which notifications will be sent

Let's configure the connection properties using the following steps:

1. Click the Configure Connectivity button.
2. Select REST API Base URL for the Connection Type.
3. Enter the URL where your Apiary mock is running: http://private-xxxx-yourapidomain.apiary-mock.com.
4. Click OK to save the values.

Next, configure the security credentials using the following steps:

1. Click the Configure Credentials button.
2. Select No Security Policy for the Security Policy. This removes the username and password fields.
3. Click the OK button to save our choice.
4. Click Test at the top to check if the connection is working.
5. Click the Save button at the top of the page to persist our changes.
6. Click Exit Connection to return to the list from where we started.

Troubleshooting

If the test fails for one of these connections, check whether the correct WSDL is used, or whether the connection URL for the REST adapter exists and is reachable.

Summary

In this article we looked at the processes of creating and testing the necessary connections and the creation of the integration itself. We have seen an inbound SOAP connection and an outbound REST connection. In demonstrating the integration we have also seen how to use Apiary to document and mock our backend REST service.

Resources for Article: Further resources on this subject: Getting Started with a Cloud-Only Scenario [article] Extending Oracle VM Management [article] Docker Hosts [article]
Packt
18 Nov 2016
11 min read

Suggesters for Improving User Search Experience

In this article by Bharvi Dixit, the author of the book Mastering ElasticSearch 5.0 - Third Edition, we will focus on improving the user search experience using suggesters, which allow you to correct user query spelling mistakes and build efficient autocomplete mechanisms. First, let's look at the query possibilities and the responses returned by Elasticsearch. We will try to show you the general principles, and then we will get into more details about each of the available suggesters. (For more resources related to this topic, see here.)

Using the suggester under search

Before Elasticsearch 5.0, it was possible to get suggestions for a given text by using a dedicated _suggest REST endpoint. In Elasticsearch 5.0, this dedicated _suggest endpoint has been deprecated in favor of the suggest API. In this release, suggest-only search requests have been optimized for performance, and we can execute the suggestions using the _search endpoint. Similar to the query object, we can use a suggest object, and what we need to provide inside the suggest object is the text to analyze and the type of suggester to use (term or phrase). So if we would like to get suggestions for the words chrimes in wordl (note that we've misspelled the words on purpose), we would run the following query:

    curl -XPOST "http://localhost:9200/wikinews/_search?pretty" -d'
    {
      "suggest": {
        "first_suggestion": {
          "text": "chrimes in wordl",
          "term": {
            "field": "title"
          }
        }
      }
    }'

The dedicated _suggest endpoint has been deprecated in Elasticsearch version 5.0 and might be removed in future releases, so you are advised to send suggestion requests to the _search endpoint. All the examples covered in this article use the same _search endpoint for suggest requests. As you can see, the suggestion request is wrapped inside the suggest object and is sent to Elasticsearch in its own object with the name we chose (in the preceding case, it is first_suggestion).
Next, we specify the text for which we want the suggestion to be returned using the text parameter. Finally, we add the suggester object, which is either term or phrase. The suggester object contains its configuration, which, for the term suggester used in the preceding command, is the field we want to use for suggestions (the field property). We can also send more than one suggestion at a time by adding multiple suggestion names. For example, if in addition to the preceding suggestion we would also include a suggestion for the word arest, we would use the following command:

    curl -XPOST "http://localhost:9200/wikinews/_search?pretty" -d'
    {
      "suggest": {
        "first_suggestion": {
          "text": "chrimes in wordl",
          "term": { "field": "title" }
        },
        "second_suggestion": {
          "text": "arest",
          "term": { "field": "text" }
        }
      }
    }'

Understanding the suggester response

Let's now look at the example response for the suggestion query we have executed. Although the response will differ for each suggester type, let's look at the response returned by Elasticsearch for the first command we've sent in the preceding code that used the term suggester:

    {
      "took" : 5,
      "timed_out" : false,
      "_shards" : { "total" : 5, "successful" : 5, "failed" : 0 },
      "hits" : { "total" : 0, "max_score" : 0.0, "hits" : [ ] },
      "suggest" : {
        "first_suggestion" : [
          {
            "text" : "chrimes", "offset" : 0, "length" : 7,
            "options" : [
              { "text" : "crimes", "score" : 0.8333333, "freq" : 36 },
              { "text" : "choices", "score" : 0.71428573, "freq" : 2 },
              { "text" : "chrome", "score" : 0.6666666, "freq" : 2 },
              { "text" : "chimps", "score" : 0.6666666, "freq" : 1 },
              { "text" : "crimea", "score" : 0.6666666, "freq" : 1 }
            ]
          },
          {
            "text" : "in", "offset" : 8, "length" : 2,
            "options" : [ ]
          },
          {
            "text" : "wordl", "offset" : 11, "length" : 5,
            "options" : [
              { "text" : "world", "score" : 0.8, "freq" : 436 },
              { "text" : "words", "score" : 0.8, "freq" : 6 },
              { "text" : "word", "score" : 0.75, "freq" : 9 },
              { "text" : "worth", "score" : 0.6, "freq" : 21 },
              { "text" : "worst", "score" : 0.6, "freq" : 16 }
            ]
          }
        ]
      }
    }

As you can see in the preceding response, the term suggester returns a list of possible suggestions for each term that was present in the text parameter of our first_suggestion section. For each term, the term suggester will return an array of possible suggestions with additional information. Looking at the data returned for the wordl term, we can see the original word (the text parameter), its offset in the original text parameter (the offset parameter), and its length (the length parameter). The options array contains suggestions for the given word and will be empty if Elasticsearch doesn't find any suggestions. Each entry in this array is a suggestion and is characterized by the following properties:

text: This is the text of the suggestion.
score: This is the suggestion score; the higher the score, the better the suggestion will be.
freq: This is the frequency of the suggestion. The frequency represents how many times the word appears in documents in the index we are running the suggestion query against. The higher the frequency, the more documents will have the suggested word in their fields and the higher the chance that the suggestion is the one we are looking for.

Please remember that the phrase suggester response will differ from the one returned by the term suggester.

The term suggester

The term suggester works on the basis of the edit distance, which means that the best suggestion is the one that needs the fewest characters changed or removed to turn it into the original word. For example, let's take the words worl and work. In order to change the worl term to work, we need to change the l letter to k, so it means a distance of one. Of course, the text provided to the suggester is analyzed and then terms are chosen to be suggested.

The phrase suggester

The term suggester provides a great way to correct user spelling mistakes on a per-term basis.
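The per-term edit distance mentioned for the term suggester can be made concrete with a small sketch. This is plain Levenshtein distance written for illustration, with a hypothetical function name; Elasticsearch uses its own optimized string distance, which may score some cases differently (for instance, transposed letters).

```python
def edit_distance(source, target):
    """Levenshtein distance: fewest single-character inserts, deletes, or substitutions."""
    # previous[j] holds the distance between the processed prefix of source and target[:j].
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]
        for j, t_char in enumerate(target, start=1):
            cost = 0 if s_char == t_char else 1
            current.append(min(
                previous[j] + 1,         # delete s_char
                current[j - 1] + 1,      # insert t_char
                previous[j - 1] + cost,  # substitute, or keep on a match
            ))
        previous = current
    return previous[-1]

print(edit_distance("worl", "work"))       # one substitution: l -> k
print(edit_distance("chrimes", "crimes"))  # one deletion: h
```

A smaller distance means the candidate is closer to what the user typed, which is the intuition behind the ordering of the options array in the response above.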
However, if we would like to get back phrases, it is not possible to do that when using this suggester. This is why the phrase suggester was introduced. It is built on top of the term suggester and adds additional phrase calculation logic to it so that whole phrases can be returned instead of individual terms. It uses N-gram based language models to calculate how good the suggestion is, and it will probably be a better choice than the term suggester for suggesting whole phrases. The N-gram approach divides terms in the index into grams, that is, word fragments built of one or more letters. For example, if we would like to divide the word mastering into bi-grams (a two letter N-gram), it would look like this: ma as st te er ri in ng.

The completion suggester

So far we have read about the term and phrase suggesters, which are used for providing suggestions. The completion suggester is completely different: it is a prefix-based suggester that allows us to create autocomplete (search-as-you-type) functionality in a very performance-effective way, because it stores complicated structures in the index instead of calculating them during query time. This suggester is not about correcting user spelling mistakes. In Elasticsearch 5.0, the completion suggester has gone through a complete rewrite. Both the syntax and the data structures of the completion type field have been changed, and so has the response structure. Many new features and speed optimizations have been introduced in the completion suggester. One of these features is making the completion suggester near real time, which allows deleted suggestions to be omitted from suggestion results as soon as they are deleted.

The logic behind the completion suggester

The prefix suggester is based on the data structure called Finite State Transducer (FST) (for more information, refer to http://en.wikipedia.org/wiki/Finite_state_transducer).
Although it is highly efficient, it may require significant resources to build on systems with large amounts of data in them: systems that Elasticsearch is perfectly suitable for. If we would like to build such a structure on the nodes after each restart or cluster state change, we may lose performance. Because of this, the Elasticsearch creators decided to build an FST-like structure during index time and store it in the index so that it can be loaded into the memory when needed.

Using the completion suggester

To use a prefix-based suggester, we need to properly index our data with a dedicated field type called completion. It stores the FST-like structure in the index. In order to illustrate how to use this suggester, let's assume that we want to create an autocomplete feature to allow us to show book authors, which we store in an additional index. In addition to authors' names, we want to return the identifiers of the books they wrote in order to search for them with an additional query. We start with creating the authors index by running the following command:

    curl -XPUT "http://localhost:9200/authors" -d'
    {
      "mappings": {
        "author": {
          "properties": {
            "name": { "type": "keyword" },
            "suggest": { "type": "completion" }
          }
        }
      }
    }'

Our index will contain a single type called author. Each document will have two fields: the name field, which is the name of the author, and the suggest field, which is the field we will use for autocomplete. The suggest field is the one we are interested in; we've defined it using the completion type, which will result in storing the FST-like structure in the index.

Implementing your own autocompletion

The completion suggester has been designed to be a powerful and easily implemented solution for autocomplete, but it supports only prefix queries. Most of the time autocomplete need only work as a prefix query. For example, if I type elastic, then I expect elasticsearch as a suggestion, not nonelastic.
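The difference between prefix-only completion and more general partial matching can be sketched in a few lines of plain Python. This is only an illustration of the idea, not how Elasticsearch stores or queries anything: the helper names are hypothetical, and the min_gram and max_gram defaults of 2 and 20 are assumed values chosen for the example.

```python
def ngrams(token, min_gram=2, max_gram=20):
    """All substrings of `token` whose length lies between min_gram and max_gram."""
    token = token.lower()
    grams = set()
    for start in range(len(token)):
        for size in range(min_gram, max_gram + 1):
            if start + size > len(token):
                break  # longer grams would run past the end of the token
            grams.add(token[start:start + size])
    return grams

def prefix_match(query, candidates):
    """Candidates that start with the query, as a prefix-based suggester would return."""
    return [c for c in candidates if c.lower().startswith(query.lower())]

def ngram_match(query, candidates):
    """Candidates containing the query as one of their N-grams (partial-word match)."""
    return [c for c in candidates if query.lower() in ngrams(c)]

names = ["elasticsearch", "nonelastic", "elastic band"]
print(prefix_match("elastic", names))  # only names that begin with the query
print(ngram_match("elastic", names))   # also matches in the middle of a word
print(ngram_match("search", names))    # a prefix match would find nothing here
```

The trade-off is visible even at this scale: the N-gram set for each name is much larger than the name itself, which is the price paid for being able to match anywhere inside it.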
There are some use cases where one wants to implement more general partial word completion. The completion suggester fails to fulfill this requirement. The second limitation of the completion suggester is that it does not allow advanced queries and filters. To get rid of both these limitations, we are going to implement a custom autocomplete feature based on N-grams, which works in almost all scenarios.

Creating index

Let's create an index location-suggestion with the following settings and mappings:

    curl -XPUT "http://localhost:9200/location-suggestion" -d'
    {
      "settings": {
        "index": {
          "analysis": {
            "filter": {
              "nGram_filter": {
                "token_chars": [ "letter", "digit", "punctuation", "symbol", "whitespace" ],
                "min_gram": "2",
                "type": "nGram",
                "max_gram": "20"
              }
            },
            "analyzer": {
              "nGram_analyzer": {
                "filter": [ "lowercase", "asciifolding", "nGram_filter" ],
                "type": "custom",
                "tokenizer": "whitespace"
              },
              "whitespace_analyzer": {
                "filter": [ "lowercase", "asciifolding" ],
                "type": "custom",
                "tokenizer": "whitespace"
              }
            }
          }
        }
      },
      "mappings": {
        "locations": {
          "properties": {
            "name": {
              "type": "text",
              "analyzer": "nGram_analyzer",
              "search_analyzer": "whitespace_analyzer"
            },
            "country": { "type": "keyword" }
          }
        }
      }
    }'

Understanding the parameters

If you look carefully at the preceding curl request for creating the index, it contains both settings and mappings. We will now look at them in detail, one by one.

Configuring settings

Our settings contain two custom analyzers: nGram_analyzer and whitespace_analyzer. We have made a custom whitespace_analyzer using the whitespace tokenizer just to make sure that all the tokens are indexed in lowercase and ASCII-folded form. Our main interest is nGram_analyzer, which contains a custom filter, nGram_filter, with the following parameters:

type: Specifies the type of token filter, which is nGram in our case.
token_chars: Specifies what kind of characters are allowed in the generated tokens. Punctuation and special characters are generally removed from the token streams, but in our example we have intended to keep them. We have also kept whitespace so that when a text contains United States and a user searches for u s, United States still appears in the suggestions.
min_gram and max_gram: These two attributes set the minimum and maximum length of substrings that will be generated and added to the lookup table. For example, according to our settings for the index, the token India will generate the following tokens: [ "di", "dia", "ia", "in", "ind", "indi", "india", "nd", "ndi", "ndia" ]

Configuring mappings

The document type of our index is locations and it has two fields, name and country. The most important thing to see is the way analyzers have been defined for the name field, which will be used for autosuggestion. For this field we have set the index analyzer to our custom nGram_analyzer, whereas the search analyzer is set to whitespace_analyzer. The index_analyzer parameter is no longer supported from Elasticsearch version 5.0 onward. Also, if you want to configure the search_analyzer property for a field, then you must configure the analyzer property too, the way we have shown in the preceding example.

Summary

In this article we focused on improving the user search experience. We started with the term and phrase suggesters and then covered search as you type, that is, the autocompletion feature implemented using the completion suggester. We also saw the limitations of the completion suggester in handling advanced queries and partial matching, which we then solved by implementing our custom completion using N-grams.

Resources for Article: Further resources on this subject: Searching Your Data [article] Understanding Mesos Internals [article] Big Data Analysis (R and Hadoop) [article]

Packt
18 Nov 2016
10 min read

Introducing Algorithm Design Paradigms

In this article by David Julian and Benjamin Baka, authors of the book Python Data Structures and Algorithm, we will discern three broad approaches to algorithm design. They are as follows:

Divide and conquer
Greedy algorithms
Dynamic programming

(For more resources related to this topic, see here.)

As the name suggests, the divide and conquer paradigm involves breaking a problem into smaller subproblems, and then in some way combining the results to obtain a global solution. This is a very common and natural problem solving technique and is, arguably, the most used approach to algorithm design.

Greedy algorithms often involve optimization and combinatorial problems; the classic example is applying it to the traveling salesperson problem, where a greedy approach always chooses the closest destination first. This shortest path strategy involves finding the best solution to a local problem in the hope that this will lead to a global solution.

The dynamic programming approach is useful when our subproblems overlap. This is different from divide and conquer. Rather than breaking our problem into independent subproblems, with dynamic programming, intermediate results are cached and can be used in subsequent operations. Like divide and conquer, it uses recursion. However, dynamic programming allows us to compare results at different stages. This can have a performance advantage over divide and conquer for some problems because it is often quicker to retrieve a previously calculated result from memory rather than having to recalculate it.

Recursion and backtracking

Recursion is particularly useful for divide and conquer problems; however, it can be difficult to understand exactly what is happening, since each recursive call is itself spinning off other recursive calls. At the core of a recursive function are two types of cases: base cases, which tell the recursion when to terminate, and recursive cases, which call the function they are in.
A simple problem that naturally lends itself to a recursive solution is calculating factorials. The recursive factorial algorithm defines two cases: the base case, when n is zero, and the recursive case, when n is greater than zero. A typical implementation is shown in the following code:

    def factorial(n):
        # test for a base case
        if n == 0:
            return 1
        # make a calculation and a recursive call
        f = n * factorial(n - 1)
        print(f)
        return f

    factorial(4)

This code prints out the values 1, 2, 6, 24. To calculate 4!, we require four recursive calls plus the initial parent call. On each recursion, a copy of the method's variables is stored in memory. Once the method returns, it is removed from memory. One way to visualize this process is as a stack of calls: factorial(4) waits on factorial(3), which waits on factorial(2), and so on down to factorial(0); the results are then multiplied together as each call returns.

It may not necessarily be clear whether recursion or iteration is the better solution to a particular problem; after all, they both repeat a series of operations and both are very well suited to divide and conquer approaches to algorithm design. An iteration churns away until the problem is done. Recursion breaks the problem down into smaller chunks and then combines the results. Iteration is often easier for programmers because the control stays local to a loop, whereas recursion can more closely represent mathematical concepts such as factorials. Recursive calls are stored in memory, whereas iterations are not. This creates a tradeoff between processor cycles and memory usage, so choosing which one to use may depend on whether the task is processor or memory intensive. The following table outlines the key differences between recursion and iteration.
Recursion | Iteration
Terminates when a base case is reached | Terminates when a defined condition is met
Each recursive call requires space in memory | Each iteration is not stored in memory
An infinite recursion results in a stack overflow error | An infinite iteration will run while the hardware is powered
Some problems are naturally better suited to recursive solutions | Iterative solutions may not always be obvious

Backtracking

Backtracking is a form of recursion that is particularly useful for types of problems such as traversing tree structures, where we are presented with a number of options at each node, from which we must choose one. Subsequently, we are presented with a different set of options, and depending on the series of choices made, either a goal state or a dead end is reached. If it is the latter, we must backtrack to a previous node and traverse a different branch. Backtracking is a divide and conquer method for exhaustive search. Importantly, backtracking prunes branches that cannot give a result.

An example of backtracking is given by the following. Here, we have used a recursive approach to generate all the possible strings of a given length n that can be built from the characters of a given string, s:

    def bitStr(n, s):
        if n == 1:
            return s
        return [digit + bits for digit in bitStr(1, s) for bits in bitStr(n - 1, s)]

    print(bitStr(3, 'abc'))

This generates the following output:

    ['aaa', 'aab', 'aac', 'aba', 'abb', 'abc', 'aca', 'acb', 'acc', 'baa', 'bab', 'bac', 'bba', 'bbb', 'bbc', 'bca', 'bcb', 'bcc', 'caa', 'cab', 'cac', 'cba', 'cbb', 'cbc', 'cca', 'ccb', 'ccc']

Note the double list comprehension and the two recursive calls within this comprehension. This recursively concatenates each element of the initial sequence, returned when n = 1, with each element of the list generated in the previous recursive call. In this sense, it is backtracking to uncover previously ungenerated combinations. The final list that is returned contains all n-letter combinations of the characters of the initial string.
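The string example above explores every branch and never prunes. As a complementary sketch of the pruning idea, the following code searches for subsets of numbers that add up to a target and abandons a branch as soon as it cannot succeed. The subset-sum problem and the names subsets_with_sum and extend are assumptions chosen for illustration; they do not come from the article. The pruning test is only valid for positive integers.

```python
def subsets_with_sum(numbers, target):
    """Return every subset of the positive integers in `numbers` summing to `target`."""
    solutions = []

    def extend(start, chosen, total):
        if total == target:                 # goal state reached
            solutions.append(list(chosen))
            return
        for i in range(start, len(numbers)):
            value = numbers[i]
            if total + value > target:      # dead end: prune this branch
                continue
            chosen.append(value)            # choose one option at this node
            extend(i + 1, chosen, total + value)
            chosen.pop()                    # backtrack and try the next option

    extend(0, [], 0)
    return solutions

print(subsets_with_sum([3, 4, 5, 2], 9))
```

Each recursive call corresponds to a node in the search tree; the chosen.pop() line is the backtracking step that returns to the previous node before another branch is tried.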
Divide and conquer – long multiplication

For recursion to be more than just a clever trick, we need to understand how to compare it to other approaches, such as iteration, and to understand when its use will lead to a faster algorithm. An iterative algorithm that we are all familiar with is the procedure we learned in primary math classes for multiplying two large numbers, that is, long multiplication. If you remember, long multiplication involves iterative multiplying and carry operations followed by shifting and addition operations. Our aim here is to examine ways to measure how efficient this procedure is and to attempt to answer the question: is this the most efficient procedure we can use for multiplying two large numbers together?

Multiplying two 4-digit numbers together requires 16 single-digit multiplication operations, and we can generalize to say that multiplying two n-digit numbers requires approximately $n^2$ multiplication operations.

This method of analyzing algorithms, in terms of the number of computational primitives such as multiplication and addition, is important because it gives us a way to understand the relationship between the time it takes to complete a certain computation and the size of the input to that computation. In particular, we want to know what happens when the input, the number of digits, n, is very large. Can we do better?

A recursive approach

It turns out that in the case of long multiplication, the answer is yes; there are in fact several algorithms for multiplying large numbers that require fewer operations. One of the most well-known alternatives to long multiplication is the Karatsuba algorithm, published in 1962. This takes a fundamentally different approach: rather than iteratively multiplying single-digit numbers, it recursively carries out multiplication operations on progressively smaller inputs. Recursive programs call themselves on smaller subsets of the input.
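The operation count for long multiplication quoted above can be checked with a short sketch. The function below is an assumed, simplified rendering of the schoolbook procedure (the name long_multiply and the digit-list representation are choices made here, not taken from the article); it returns the product together with the number of single-digit multiplications it performed.

```python
def long_multiply(x, y):
    """Schoolbook multiplication of non-negative integers.

    Returns (product, count), where count is the number of
    single-digit multiplications performed.
    """
    xs = [int(ch) for ch in reversed(str(x))]   # least significant digit first
    ys = [int(ch) for ch in reversed(str(y))]
    result = [0] * (len(xs) + len(ys))
    count = 0
    for i, dx in enumerate(xs):
        carry = 0
        for j, dy in enumerate(ys):
            count += 1                           # one single-digit multiplication
            total = result[i + j] + dx * dy + carry
            result[i + j] = total % 10
            carry = total // 10
        result[i + len(ys)] += carry             # final carry of this row
    product = int("".join(str(d) for d in reversed(result)))
    return product, count

product, count = long_multiply(2345, 6789)
print(product == 2345 * 6789, count)
```

For two 4-digit inputs the counter reaches 16, and for two n-digit inputs it reaches n times n, which is the quadratic growth the analysis refers to.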
The first step in building a recursive algorithm is to decompose a large number into several smaller numbers. The most natural way to do this is to simply split the number into halves: the first half comprising the most significant digits and the second half comprising the least significant digits. For example, our four-digit number, 2345, becomes a pair of two-digit numbers, 23 and 45. We can write a more general decomposition of any two n-digit numbers x and y using the following, where m is any positive integer less than n.

For the number x:

$$x = 10^m a + b$$

For the number y:

$$y = 10^m c + d$$

So, we can now rewrite our multiplication problem x and y as follows:

$$xy = (10^m a + b)(10^m c + d)$$

When we expand and gather like terms we get the following:

$$xy = 10^{2m} ac + 10^m (ad + bc) + bd$$

More conveniently, we can write it like this (equation 1):

$$xy = 10^{2m} z_2 + 10^m z_1 + z_0 \qquad (1)$$

Here,

$$z_2 = ac, \quad z_1 = ad + bc, \quad z_0 = bd$$

It should be pointed out that this suggests a recursive approach to multiplying two numbers, since this procedure itself involves multiplication. Specifically, the products ac, ad, bc, and bd all involve numbers smaller than the input numbers, and so it is conceivable that we could apply the same operation as a partial solution to the overall problem. This algorithm so far consists of four recursive multiplication steps, and it is not immediately clear whether it will be faster than the classic long multiplication approach.

What we have discussed so far in regard to the recursive approach to multiplication has been well known to mathematicians since the late 19th century. The Karatsuba algorithm improves on this by making the following observation. We really only need to know three quantities, $z_2 = ac$, $z_1 = ad + bc$, and $z_0 = bd$, to solve equation 1. We need to know the values of a, b, c, and d only insofar as they contribute to the overall sums and products involved in calculating the quantities $z_2$, $z_1$, and $z_0$. This suggests the possibility that perhaps we can reduce the number of recursive steps. It turns out that this is indeed the case.
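As a quick numerical check of the decomposition, the sketch below splits two numbers as x = 10^m a + b and y = 10^m c + d and rebuilds the product from the three quantities z2, z1, and z0. The article only supplies the number 2345; the second operand 6789, the choice m = 2, and the helper name split are assumptions made for this illustration.

```python
def split(number, m):
    """Split `number` into (high, low) so that number == high * 10**m + low."""
    return divmod(number, 10 ** m)

x, y, m = 2345, 6789, 2      # y and m are illustrative choices
a, b = split(x, m)           # a = 23, b = 45
c, d = split(y, m)           # c = 67, d = 89

z2 = a * c                   # coefficient of 10**(2m)
z1 = a * d + b * c           # coefficient of 10**m
z0 = b * d                   # constant term

product = 10 ** (2 * m) * z2 + 10 ** m * z1 + z0
print(product == x * y)
```

At this stage four smaller products (ac, ad, bc, and bd) are still needed, which is the version of the method that predates Karatsuba's observation.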
Since the products ac and bd are already in their simplest form, it seems unlikely that we can eliminate these calculations. We can, however, make the following observation:

$$(a + b)(c + d) = ac + bd + ad + bc$$

When we subtract the quantities ac and bd, which we have calculated in the previous recursive steps, we get the quantity we need, namely ad + bc:

$$(a + b)(c + d) - ac - bd = ad + bc$$

This shows that we can indeed compute the sum of ad and bc without separately computing each of the individual quantities. In summary, we can improve on equation 1 by reducing from four recursive steps to three. These three steps are as follows:

  • Recursively calculate ac.
  • Recursively calculate bd.
  • Recursively calculate (a + b)(c + d) and subtract ac and bd.

The following code shows a Python implementation of the Karatsuba algorithm:

    from math import ceil, log10

    def karatsuba(x, y):
        # The base case for recursion
        if x < 10 or y < 10:
            return x * y
        # sets n, the number of digits in the highest input number
        n = max(int(log10(x) + 1), int(log10(y) + 1))
        # rounds up n/2
        n_2 = int(ceil(n / 2.0))
        # adds 1 if n is uneven
        n = n if n % 2 == 0 else n + 1
        # splits the input numbers
        a, b = divmod(x, 10**n_2)
        c, d = divmod(y, 10**n_2)
        # applies the three recursive steps
        ac = karatsuba(a, c)
        bd = karatsuba(b, d)
        ad_bc = karatsuba((a + b), (c + d)) - ac - bd
        # performs the multiplication
        return ((10**n) * ac) + bd + ((10**n_2) * ad_bc)

To satisfy ourselves that this does indeed work, we can run the following test function:

    import random

    def test():
        for i in range(1000):
            x = random.randint(1, 10**5)
            y = random.randint(1, 10**5)
            expected = x * y
            result = karatsuba(x, y)
            if result != expected:
                return "failed"
        return "ok"

Summary

In this article, we looked at three broad approaches to algorithm design and at a way to recursively multiply large numbers. We also saw how to use backtracking for exhaustive search and generating strings.

Packt
15 Nov 2016
10 min read

Manual and Automated Testing

In this article by Claus Führer, the author of the book Scientific Computing with Python 3, we focus on two aspects of testing for scientific programming: manual and automated testing. Manual testing is what is done by every programmer to quickly check that an implementation is working. Automated testing is the refined, automated variant of that idea. We will introduce some tools available for automatic testing in general, with a view to the particular case of scientific computing.

Manual Testing

During the development of code you do a lot of small tests in order to check its functionality. This could be called manual testing. Typically, you would test that a given function does what it is supposed to do by manually testing the function in an interactive environment. For instance, suppose that you implement the bisection algorithm. It is an algorithm that finds a zero (root) of a scalar nonlinear function. To start the algorithm, an interval has to be given with the property that the function takes different signs on the interval boundaries. You would then typically test an implementation of that algorithm by checking:

  • That a solution is found when the function has opposite signs at the interval boundaries
  • That an exception is raised when the function has the same sign at the interval boundaries

Manual testing, as necessary as it may seem to be, is unsatisfactory. Once you have convinced yourself that the code does what it is supposed to do, you formulate a relatively small number of demonstration examples to convince others of the quality of the code. At that stage one often loses interest in the tests made during development, and they are forgotten or even deleted. As soon as you change a detail and things no longer work correctly, you might regret that your earlier tests are no longer available.

Automatic Testing

The correct way to develop any piece of code is to use automatic testing.
The advantages are:

  • The automated repetition of a large number of tests after every code refactoring and before new versions are launched
  • A silent documentation of the use of the code
  • A documentation of the test coverage of your code: Did things work before a change or was a certain aspect never tested?

We suggest developing tests in parallel to the code. Good design of tests is an art of its own, and there is rarely an investment which guarantees such a good pay-off in development time savings as the investment in good tests. Now we will go through the implementation of a simple algorithm with the automated testing methods in mind.

Testing the bisection algorithm

Let us examine automated testing for the bisection algorithm. With this algorithm a zero of a real valued function is found. An implementation of the algorithm can have the following form:

    def bisect(f, a, b, tol=1.e-8):
        """
        Implementation of the bisection algorithm
        f real valued function
        a,b interval boundaries (float) with the property
        f(a)*f(b)<=0
        tol tolerance (float)
        """
        if f(a) * f(b) > 0:
            raise ValueError("Incorrect initial interval [a,b]")
        for i in range(100):
            c = (a + b) / 2.
            if f(a) * f(c) <= 0:
                b = c
            else:
                a = c
            if abs(a - b) < tol:
                return (a + b) / 2
        raise Exception('No root found within the given tolerance {}'.format(tol))

We assume this to be stored in a file bisection.py. As a first test case we test that the zero of the function F(x) = x is found:

    from numpy import allclose

    def test_identity():
        result = bisect(lambda x: x, -1., 1.)
        expected = 0.
        assert allclose(result, expected), 'expected zero not found'

    test_identity()

In this code you meet the Python keyword assert for the first time. It raises an AssertionError exception if its first argument evaluates to False. Its optional second argument is a string with additional information. We use the function allclose in order to test for equality of floats. Let us comment on some of the features of the test function.
  • We use an assertion to make sure that an exception will be raised if the code does not behave as expected.
  • We have to manually run the test in the line test_identity().

There are many tools to automate this kind of call. Let us now set up a test that checks if bisect raises an exception when the function has the same sign on both ends of the interval. For now, we will suppose that the exception raised is a ValueError exception.

Example: Checking the sign for the bisection algorithm.

    def test_badinput():
        try:
            bisect(lambda x: x, 0.5, 1)
        except ValueError:
            pass
        else:
            raise AssertionError()

    test_badinput()

In this case an AssertionError is raised if no exception is raised at all; an exception of any type other than ValueError simply propagates and also makes the test fail. There are tools to simplify the above construction to check that an exception is raised.

Another useful kind of test is the edge case test. Here we test arguments or user input which are likely to create mathematically undefined situations or states of the program not foreseen by the programmer. For instance, what happens if both bounds are equal? What happens if a>b? We can easily set up such tests by using, for instance:

    def test_equal_boundaries():
        result = bisect(lambda x: x, 1., 1.)
        expected = 0.
        assert allclose(result, expected), 'test equal interval bounds failed'

    def test_reverse_boundaries():
        result = bisect(lambda x: x, 1., -1.)
        expected = 0.
        assert allclose(result, expected), 'test reverse interval bounds failed'

    test_equal_boundaries()
    test_reverse_boundaries()

Using unittest

The standard Python package unittest greatly facilitates automated testing. That package requires that we rewrite our tests a little to be compatible. The first test would have to be rewritten in a class, as follows:

    from bisection import bisect
    import unittest

    class TestIdentity(unittest.TestCase):
        def test(self):
            result = bisect(lambda x: x, -1.2, 1., tol=1.e-8)
            expected = 0.
            self.assertAlmostEqual(result, expected)

    if __name__ == '__main__':
        unittest.main()

Let us examine the differences to the previous implementation. First, the test is now a method and a part of a class. The class must inherit from unittest.TestCase. The test method's name must start with test. Note that we may now use one of the assertion tools of the package, namely assertAlmostEqual. Finally, the tests are run using unittest.main. We recommend writing the tests in a file separate from the code to be tested. That's why it starts with an import. The test passes and returns

    Ran 1 test in 0.002s
    OK

If we had run it with a loose tolerance parameter, e.g., 1.e-3, a failure of the test would have been reported:

    F
    ======================================================================
    FAIL: test (__main__.TestIdentity)
    ----------------------------------------------------------------------
    Traceback (most recent call last):
      File "<ipython-input-11-e44778304d6f>", line 5, in test
        self.assertAlmostEqual(result, expected)
    AssertionError: 0.00017089843750002018 != 0.0 within 7 places
    ----------------------------------------------------------------------
    Ran 1 test in 0.004s
    FAILED (failures=1)

Tests can and should be grouped together as methods of a test class:

Example:

    import unittest
    from bisection import bisect

    class TestIdentity(unittest.TestCase):
        def identity_fcn(self, x):
            return x

        def test_functionality(self):
            result = bisect(self.identity_fcn, -1.2, 1., tol=1.e-8)
            expected = 0.
            self.assertAlmostEqual(result, expected)

        def test_reverse_boundaries(self):
            result = bisect(self.identity_fcn, 1., -1.)
            expected = 0.
            self.assertAlmostEqual(result, expected)

        def test_exceeded_tolerance(self):
            tol = 1.e-80
            self.assertRaises(Exception, bisect, self.identity_fcn, -1.2, 1., tol)

    if __name__ == '__main__':
        unittest.main()

Here, the last test needs some comments. We used the method unittest.TestCase.assertRaises. It tests whether an exception is correctly raised.
Its first parameter is the exception type, for example, ValueError or Exception, and its second argument is the function which is expected to raise the exception. The remaining arguments are the arguments for this function. The command unittest.main() creates an instance of the class TestIdentity and executes those methods whose names start with test.

Test setUp and tearDown

The class unittest.TestCase provides two special methods, setUp and tearDown, which are run before and after every call to a test method. This is needed when testing generators, which are exhausted after every test. We demonstrate this here by testing a program which checks in which line in a file a given string occurs for the first time:

    class NotFoundError(Exception):
        pass

    def find_string(file, string):
        for i, lines in enumerate(file.readlines()):
            if string in lines:
                return i
        raise NotFoundError('String {} not found in File {}'.format(string, file.name))

We assume that this code is saved in a file find_in_file.py. A test has to prepare a file, open it, and remove it after the test:

    import unittest
    import os  # used for, e.g., deleting files
    from find_in_file import find_string, NotFoundError

    class TestFindInFile(unittest.TestCase):
        def setUp(self):
            file = open('test_file.txt', 'w')
            file.write('aha')
            file.close()
            self.file = open('test_file.txt', 'r')

        def tearDown(self):
            # close the file before deleting it
            self.file.close()
            os.remove(self.file.name)

        def test_exists(self):
            line_no = find_string(self.file, 'aha')
            self.assertEqual(line_no, 0)

        def test_not_exists(self):
            self.assertRaises(NotFoundError, find_string, self.file, 'bha')

    if __name__ == '__main__':
        unittest.main()

Before each test setUp is run, and afterwards tearDown is executed.

Parametrizing Tests

One frequently wants to repeat the same test set-up with different data sets.
When using the functionalities of unittest, this requires automatically generating test cases with the corresponding methods injected. To this end, we first construct a test case with one or several methods that will be used when we later set up test methods. Let us consider the bisection method again and check if the values it returns are really zeros of the given function. We first build the test case and the method which we will use for the tests:

    class Tests(unittest.TestCase):
        def checkifzero(self, fcn_with_zero, interval):
            result = bisect(fcn_with_zero, *interval, tol=1.e-8)
            function_value = fcn_with_zero(result)
            expected = 0.
            self.assertAlmostEqual(function_value, expected)

Then we dynamically create test functions as attributes of this class:

    test_data = [
        {'name': 'identity', 'function': lambda x: x, 'interval': [-1.2, 1.]},
        {'name': 'parabola', 'function': lambda x: x**2 - 1, 'interval': [0, 10.]},
        {'name': 'cubic', 'function': lambda x: x**3 - 2*x**2, 'interval': [0.1, 5.]},
    ]

    def make_test_function(dic):
        return lambda self: self.checkifzero(dic['function'], dic['interval'])

    for data in test_data:
        setattr(Tests, "test_{name}".format(name=data['name']), make_test_function(data))

    if __name__ == '__main__':
        unittest.main()

In this example the data is provided as a list of dictionaries. A function make_test_function dynamically generates a test function which uses a particular data dictionary to perform the test with the previously defined method checkifzero. This test function is made a method of the TestCase class by using the Python command setattr.

Summary

No program development without testing! In this article we showed the importance of well organized and documented tests. Some professionals even start development by first specifying tests. A useful tool for automatic testing is unittest, which we explained in detail. While testing improves the reliability of a code, profiling is needed to improve the performance.
Alternative ways to code may result in large performance differences. We showed how to measure computation time and how to localize bottlenecks in your code.
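As a complement to the section on parametrizing tests, here is a minimal sketch of a different technique: instead of injecting methods with setattr, the standard unittest method subTest (available since Python 3.4) runs one test method over several data sets and reports each failing data set separately. The bisect function is repeated from the article so that the block is self-contained; the names TEST_DATA, TestZeros, and run_tests are assumptions of this sketch.

```python
import unittest

def bisect(f, a, b, tol=1.e-8):
    """Bisection as in the article, repeated so this block runs on its own."""
    if f(a) * f(b) > 0:
        raise ValueError("Incorrect initial interval [a,b]")
    for _ in range(100):
        c = (a + b) / 2.
        if f(a) * f(c) <= 0:
            b = c
        else:
            a = c
        if abs(a - b) < tol:
            return (a + b) / 2
    raise Exception("No root found within the given tolerance {}".format(tol))

# (name, function, interval) triples mirroring the article's test data
TEST_DATA = [
    ("identity", lambda x: x, (-1.2, 1.)),
    ("parabola", lambda x: x**2 - 1, (0, 10.)),
    ("cubic", lambda x: x**3 - 2 * x**2, (0.1, 5.)),
]

class TestZeros(unittest.TestCase):
    def test_returns_zero(self):
        for name, fcn, interval in TEST_DATA:
            with self.subTest(name=name):       # each data set is reported separately
                result = bisect(fcn, *interval, tol=1.e-8)
                self.assertAlmostEqual(fcn(result), 0.)

def run_tests():
    """Run the test case programmatically and return the result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestZeros)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

The trade-off is that subTest keeps everything in one visible method, whereas the setattr approach produces one named test method per data set, which some test runners list individually.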

Packt
14 Nov 2016
6 min read

The TensorFlow Toolbox

In this article by Saif Ahmed, author of the book Machine Learning with TensorFlow, we learn how most machine learning platforms are focused toward scientists and practitioners in academic or industrial settings. Accordingly, while quite powerful, they are often rough around the edges and have few user-experience features.

Quite a bit of effort goes into peeking at the model at various stages and viewing and aggregating performance across models and runs. Even viewing the neural network can involve far more effort than expected. While this was acceptable when neural networks were simple and only a few layers deep, today's networks are far deeper. In 2015, Microsoft won the annual ImageNet competition using a deep network with 152 layers. Visualizing such networks can be difficult, and peeking at weights and biases can be overwhelming.

Practitioners started using home-built visualizers and bootstrapped tools to analyze their networks and run performance. TensorFlow changed this by releasing TensorBoard directly alongside the overall platform release. TensorBoard runs out of the box with no additional installations or setup. Users just need to instrument their code according to what they wish to capture. It features plotting of events, learning rate, and loss over time; histograms for weights and biases; and images. The Graph Explorer allows interactive reviews of the neural network.

A quick preview

You can follow along with the code here: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/models/image/cifar10/cifar10_train.py

The example uses the CIFAR-10 image set. The CIFAR-10 dataset consists of 60,000 images in ten classes compiled by Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. The dataset has become one of several standard learning tools and benchmarks for machine learning efforts.

Let's start with the Graph Explorer. We can immediately see a convolutional network being used.
This is not surprising as we're trying to classify images here. This is just one possible view of the graph. You can try the Graph Explorer as well. It allows deep dives into individual components.

Our next stop on the quick preview is the EVENTS tab. This tab shows scalar data over time. The different statistics are grouped into individual tabs on the right-hand side. The following screenshot shows a number of popular scalar statistics, such as loss, learning rate, cross entropy, and sparsity across multiple parts of the network.

The HISTOGRAMS tab is a close cousin as it shows tensor data over time. Despite the name, as of TensorFlow v0.7, it does not actually display histograms. Rather, it shows summaries of tensor data using percentiles. The summary view is shown in the following figure. Just like with the EVENTS tab, the data is grouped into tabs on the right-hand side. Different runs can be toggled on and off, and runs can be shown overlaid, allowing interesting comparisons. It features three runs, which we can see on the left side, and we'll look at just the softmax function and associated parameters. For now, don't worry too much about what these mean; we're just looking at what we can achieve for our own classifiers.

However, the summary view does not do justice to the utility of the HISTOGRAMS tab. Instead, we will zoom into a single graph to observe what is going on. This is shown in the following figure: Notice that each histogram chart shows a time series of nine lines. The top is the maximum, the middle the median, and the bottom the minimum. The three lines directly above and below the median are the one and a half standard deviation, one standard deviation, and half standard deviation marks. Obviously, this does not represent multimodal distributions well, as it is not a histogram. However, it does provide a quick gist of what would otherwise be a mountain of data to sift through.
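To get a feel for what those nine lines summarize, the following sketch computes nine values for a single step of tensor data. It is not TensorBoard's code: it assumes that the standard deviation marks are taken as the percentiles that a normal distribution would place at those positions, and the names STD_MARKS, percentile, and nine_line_summary are hypothetical. It requires Python 3.8 or later for statistics.NormalDist.

```python
from statistics import NormalDist

# Positions of the seven inner lines, in standard deviations from the median
STD_MARKS = (-1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5)

def percentile(sorted_values, fraction):
    """Linear-interpolation percentile of an already sorted, non-empty list."""
    position = fraction * (len(sorted_values) - 1)
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    weight = position - lower
    return sorted_values[lower] * (1 - weight) + sorted_values[upper] * weight

def nine_line_summary(values):
    """Return [min, seven percentile marks, max] for one step of tensor values."""
    ordered = sorted(values)
    fractions = [NormalDist().cdf(k) for k in STD_MARKS]   # e.g. cdf(0) is 0.5
    inner = [percentile(ordered, f) for f in fractions]
    return [ordered[0]] + inner + [ordered[-1]]

print(nine_line_summary(list(range(101))))
```

Computing such a summary at every training step and plotting each of the nine positions against the step number would give a chart of the kind described above.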
A couple of things to note are how data can be collected and segregated by runs, how different data streams can be collected, how we can enlarge the views, and how we can zoom into each of the graphs. Enough of graphics, let's jump into code so we can run this for ourselves!

Installing TensorBoard

TensorFlow comes prepackaged with TensorBoard, so it will already be installed. It runs as a locally served web application accessible via the browser at http://0.0.0.0:6006. Conveniently, there is no server-side code or configuration required. Depending on where your paths are, you may be able to run it directly, as follows:

    tensorboard --logdir=/tmp/tensorlogs

If your paths are not correct, you may need to prefix the application accordingly, as shown in the following command line:

    tf_install_dir/tensorflow/tensorboard --logdir=/tmp/tensorlogs

On Linux, you can run it in the background and just let it keep running, as follows:

    nohup tensorboard --logdir=/tmp/tensorlogs &

Some thought should be put into the directory structure though. The Runs list on the left side of the dashboard is driven by subdirectories in the logdir location. The following image shows two runs: MNIST_Run1 and MNIST_Run2. Having an organized runs folder will allow plotting successive runs side by side to see differences.

When initializing the writer, you will pass in the log_location as the first parameter, as follows:

    writer = tf.train.SummaryWriter(log_location, sess.graph_def)

Consider saving a base location and appending run-specific subdirectories for each run. This will help organize outputs without expending more thought on it. We'll discuss more about this later.

Incorporating hooks into our code

The best way to get started with TensorBoard is by taking existing working examples and instrumenting them with the code required for TensorBoard. We will do this for several common training scripts.
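The earlier advice about keeping a base log location and appending a run-specific subdirectory can be sketched with the standard library alone. The helper name run_log_dir and the use of the system temporary directory are assumptions made for this illustration; the run names are the ones mentioned in the article. The returned path is what would be passed as the log location when creating a summary writer.

```python
import os
import tempfile

def run_log_dir(base_dir, run_name):
    """Return (and create if needed) a run-specific subdirectory under base_dir."""
    path = os.path.join(base_dir, run_name)
    os.makedirs(path, exist_ok=True)   # safe to call again for an existing run
    return path

# Assumed base location; the article uses /tmp/tensorlogs on Linux
base = os.path.join(tempfile.gettempdir(), "tensorlogs")

for run in ("MNIST_Run1", "MNIST_Run2"):
    print(run_log_dir(base, run))
```

Pointing tensorboard --logdir at the base directory would then list each subdirectory as a separate run.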
Summary

In this article, we covered the major areas of TensorBoard: EVENTS, HISTOGRAMS, and viewing GRAPH. We modified popular models to see the exact changes required before TensorBoard could be up and running. This should have demonstrated the fairly minimal effort required to get started with TensorBoard.
Packt
10 Nov 2016
6 min read

Data Clustering

In this article by Rodolfo Bonnin, the author of the book Building Machine Learning Projects with TensorFlow, we will start applying data transforming operations. We will begin finding interesting patterns in some given information, discovering groups of data, or clusters, using clustering techniques.

In this process we'll also gain two new tools: the ability to generate synthetic sample sets from a collection of representative data structures via the scikit-learn library, and the ability to graphically plot our data and model results, this time via the matplotlib library.

The topics we will cover in this article are as follows:

  • Getting an idea of how clustering works, and comparing it to alternative existing classification techniques
  • Using scikit-learn and matplotlib to enrich the possibilities of dataset choices, and to get professional-looking graphical representations of the data
  • Implementing the K-means clustering algorithm
  • Testing some variations of the K-means method to improve the fit and/or the convergence rate

Three types of learning from data

Based on how we approach the supervision of the samples, we can distinguish three types of learning:

  • Unsupervised learning: The fully unsupervised approach directly takes a number of undetermined elements and builds a classification of them, looking at different properties that could determine their class
  • Semi-supervised learning: The semi-supervised approach has a number of known classified items and then applies techniques to discover the class of the remaining items
  • Supervised learning: In supervised learning, we start from a population of samples, which have a known type beforehand, and then build a model from it

Normally there are three sample populations: one from which the model grows, called the training set; one that is used to test the model, called the test set; and then there are the samples for which we will be doing classification.
Figure: Types of data learning based on supervision: unsupervised, semi-supervised, and supervised

Unsupervised data clustering

One of the simplest operations that can be initially applied to an unknown dataset is to try to understand the possible grouping or common features that the dataset members have. To do so, we could try to find representative points in them that summarize a balance of the parameters of the members of the group. This value could be, for example, the mean or the median of all the cluster members. This also leads to the idea of defining a notion of distance between members: all the members of a group should be at shorter distances from each other and from their representative point than from the central points of the other groups. In the following image, we can see the results of a typical clustering algorithm and the representation of the cluster centers:

Figure: Sample clustering algorithm output

K-means

K-means is a very well-known clustering algorithm that can be easily implemented. It is very straightforward and can lead (depending on the data layout) to a good initial understanding of the provided information.

Mechanics of K-means

K-means tries to divide a set of samples into K disjoint groups or clusters, using as a main indicator the mean value (be it 1D, 2D, and so on) of the members. This point is normally called the centroid, referring to the arithmetic entity with the same name. One important characteristic of K-means is that K should be provided beforehand, and so some previous knowledge of the data is needed to avoid a non-representative result.

Algorithm iteration criterion

The criterion and goal of this method is to minimize the sum of squared distances from each cluster's members to the centroid of that cluster, taken over all clusters. This is also known as minimization of inertia.
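The inertia criterion can be made concrete with a small pure-Python sketch. The function names squared_distance, centroid, and inertia, as well as the four sample points, are assumptions made for illustration; libraries such as scikit-learn compute this quantity internally.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((pi - qi) ** 2 for pi, qi in zip(p, q))

def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def inertia(clusters):
    """Sum of squared distances from every member to its own cluster centroid."""
    total = 0.0
    for members in clusters:
        center = centroid(members)
        total += sum(squared_distance(p, center) for p in members)
    return total

# Two illustrative 2D clusters with two members each
clusters = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 10.0), (12.0, 10.0)]]
print(inertia(clusters))
```

A clustering that groups nearby points together yields a lower inertia than one that mixes distant points, which is why the algorithm uses it as the quantity to minimize.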
Figure: Error minimization criteria for K-means

K-means algorithm breakdown

The mechanism of the K-means algorithm can be summarized in the following graphic:

Figure: Simplified flow chart of the K-means process

And this is a simplified summary of the algorithm:

  • We start with unclassified samples and take K elements as the starting centroids. There are also possible simplifications of this algorithm that take the first elements in the element list, for the sake of brevity.
  • We then calculate the distances between the samples and the chosen centroids, assign each sample to its nearest centroid, and so get the first calculated centroids (or other representative values). In the illustration, you can see the centroids moving toward more sensible positions.
  • After the centroids change, their displacement will cause the individual distances to change, and so the cluster membership can change.
  • This is the time when we recalculate the centroids and repeat the first steps, in case the stop condition isn't met.

The stopping conditions could be of various types:

  • After n iterations. It could be that we chose too large a number and have unnecessary rounds of computing, or that the algorithm converges slowly and we get very unconvincing results if the centroids are not yet stable. This stop condition could also be used as a last resort if we have a really long iterative process.
  • Referring to the previous mean result, a possibly better criterion for the convergence of the iterations is to take a look at the changes of the centroids, be it in total displacement or total cluster element switches.
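The steps and stopping conditions above can be sketched in a few lines of pure Python. This is a simplified illustration rather than the book's TensorFlow implementation: it assumes the first K samples serve as starting centroids, stops when no sample switches cluster, and keeps a maximum iteration count only as a last resort. The names kmeans, mean_point, and squared_distance are hypothetical.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((pi - qi) ** 2 for pi, qi in zip(p, q))

def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def kmeans(samples, k, max_iter=100):
    """Return (centroids, labels) for a list of points and a cluster count k."""
    centroids = [tuple(s) for s in samples[:k]]   # simplification: first K samples
    labels = [None] * len(samples)
    for _ in range(max_iter):                     # last-resort stop condition
        # assign every sample to its nearest centroid
        new_labels = [
            min(range(k), key=lambda j: squared_distance(s, centroids[j]))
            for s in samples
        ]
        if new_labels == labels:                  # no sample switched cluster
            break
        labels = new_labels
        # recalculate each centroid from its current members
        for j in range(k):
            members = [s for s, lab in zip(samples, labels) if lab == j]
            if members:                           # keep the old centroid if empty
                centroids[j] = mean_point(members)
    return centroids, labels

samples = [(0.0, 0.0), (10.0, 10.0), (0.0, 2.0), (12.0, 10.0)]
print(kmeans(samples, 2))
```

Because the result depends on the starting centroids, practical implementations usually choose them more carefully than this sketch does.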
The last criterion (no more cluster element switches) is the one normally employed, so we will stop the process once no more elements change clusters:

K-means simplified graphic

Pros and cons of K-means

The advantages of this method are:

It scales very well (most of the calculations can be run in parallel)
It has been used in a very large range of applications

But its simplicity also has a price (the no silver bullet rule applies):

It requires a priori knowledge (the number of possible clusters should be known beforehand)
Outlier values can pull the centroids away, as they have the same weight as any other sample
As it assumes that the clusters are convex and isotropic, it doesn't work very well with clusters that are not roughly circular

Summary

In this article, we got a simple overview of some of the most basic models we can implement, but we tried to be as detailed in the explanation as possible. From now on, we are able to generate synthetic datasets, allowing us to rapidly test the adequacy of a model for different data configurations and so evaluate its advantages and shortcomings without having to load models with a greater number of unknown characteristics.

You can also refer to the following books on similar topics:

Getting Started with TensorFlow: https://www.packtpub.com/big-data-and-business-intelligence/getting-started-tensorflow
R Machine Learning Essentials: https://www.packtpub.com/big-data-and-business-intelligence/r-machine-learning-essentials
Building Machine Learning Systems with Python - Second Edition: https://www.packtpub.com/big-data-and-business-intelligence/building-machine-learning-systems-python-second-edition

Resources for Article:

Further resources on this subject:

Supervised Machine Learning [article]
Unsupervised Learning [article]
Preprocessing the Data [article]

Packt
10 Nov 2016
20 min read

Introduction to Practical Business Intelligence

In this article, Ahmed Sherif, author of the book Practical Business Intelligence, explains what business intelligence is. Before answering this question, I want to pose and answer another question: what isn't business intelligence? It is not spreadsheet analysis done with transactional data with thousands of rows. One of the goals of Business Intelligence, or BI, is to shield the users of the data from the intelligent logic lurking behind the scenes of the application that is delivering the data to them. If the integrity of the data is compromised in any way by an individual not intimately familiar with the data source, then there cannot, by definition, be intelligence in the business decisions made with that same data. The single source of truth is the key for any Business Intelligence operation, whether it is a mom-and-pop soda shop or a Fortune 500 company. Any report, dashboard, or analytical application that delivers information to a user through a BI tool, but whose numbers cannot be tied back to the original source, will break the trust between the user and the data and will defeat the purpose of Business Intelligence.

In my opinion, the most successful tools used for business intelligence directly shield the business user from the query logic used for displaying that same data in some visual form. Business Intelligence has taken many forms in terms of labels over the years. Business Intelligence is the process of delivering actionable business decisions from analytical manipulation and presentation of data within the confines of a business environment. The delivery process mentioned in the definition is what this book will focus its attention on. The beauty of BI is that it is not owned by any one particular tool that is proprietary to a specific industry or company. Business Intelligence can be delivered using many different tools, some of which were not even originally intended to be used for BI.
The tool itself should not be the source where the query logic is applied to generate the business logic of the data. The tool should primarily serve as the delivery mechanism of the query that is generated by the data warehouse, which houses both the data and the logic.

In this chapter we will cover the following topics:

Understanding the Kimball method
Understanding business intelligence
Data and SQL
Working with data and SQL
Working with business intelligence tools
Downloading and installing MS SQL Server 2014
Downloading and installing AdventureWorks

Understanding the Kimball method

As we discuss the data warehouse where our data is being housed, we would be remiss not to bring up Ralph Kimball, one of the original architects of the data warehouse. Kimball's methodology incorporated dimensional modeling, which has become the standard for modeling a data warehouse for Business Intelligence purposes. Dimensional modeling incorporates joining tables that have detail data and tables that have lookup data. A detail table is known as a fact table in dimensional modeling. An example of a fact table would be a table holding thousands of rows of transactional sales from a retail store. The table will house several IDs affiliated with the product, the salesperson, the purchase date, and the purchaser, just to name a few. Additionally, the fact table will store numeric data for each individual transaction, such as sales quantity or sales amount, to name a few examples. These numeric values will be referred to as measures. While there is usually one fact table, there will also be several lookup or dimensional tables, with one table for each ID that is used in the fact table. So, for example, there would be one dimensional table for the product name affiliated with a product ID. There would be one dimensional table for the month, week, day, and year affiliated with the date ID.
These dimensional tables are also referred to as lookup tables, because they look up the name that a dimension ID is affiliated with. Usually, you would find as many dimensional tables as there are IDs in the fact table. The dimensional tables would all be joined to the one fact table, creating something of a 'star' look. Hence, this type of table join is known as a star schema, which is represented diagrammatically in the following figure. It is customary that the fact table will be the largest table in a data warehouse while the lookup tables will all be quite small in rows, some as small as one row. The tables are joined by IDs, also known as surrogate keys. Surrogate keys allow for the most efficient join between a fact table and a dimensional table as they are usually of an integer data type. As more and more detail is added to a dimensional table, each new dimension value is just given the next number in line, usually starting with 1. Query performance between table joins suffers when we introduce non-numeric characters into the join or, worse, symbols (although most databases will not allow that).

Understanding business intelligence architecture

I will continuously hammer home the point that the various tools utilized to deliver the visual and graphical BI components should not house any internal logic to filter data out of the tool, nor should they be the source of any built-in calculations. The tools themselves should not house this logic as they will be utilized by many different users. If each user who develops a BI app off of the tool incorporates different internal filters within the tool, the single source of truth tying back to the data warehouse will become multiple sources of truth. Any logic applied to the data to filter out a specific dimension or to calculate a specific measure should be applied in the data warehouse and then pulled into the tool.
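To make the star schema and the "logic in the warehouse" idea concrete, here is a small sketch that uses Python's built-in sqlite3 module as a stand-in for a data warehouse. The table names, column names, and sample rows are invented for illustration and are not taken from AdventureWorks.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Dimension (lookup) tables: one row per ID, integer surrogate keys.
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, product_name TEXT);
    CREATE TABLE dim_region  (region_id INTEGER PRIMARY KEY, region_name TEXT,
                              country_code TEXT);

    -- Fact table: IDs pointing at the dimensions plus numeric measures.
    CREATE TABLE fact_sales (
        product_id   INTEGER REFERENCES dim_product(product_id),
        region_id    INTEGER REFERENCES dim_region(region_id),
        sales_qty    INTEGER,
        sales_amount REAL
    );

    INSERT INTO dim_product VALUES (1, 'Road bike'), (2, 'Helmet');
    INSERT INTO dim_region  VALUES (1, 'Northwest', 'US'), (2, 'Canada', 'CA');
    INSERT INTO fact_sales  VALUES (1, 1, 2, 3000.0), (2, 1, 5, 250.0),
                                   (1, 2, 1, 1500.0);

    -- The filter and the aggregation live in the warehouse, not in the BI tool.
    CREATE VIEW v_us_sales_by_region AS
    SELECT r.region_name AS region, SUM(f.sales_amount) AS sales_amount
    FROM fact_sales f
    JOIN dim_region r ON r.region_id = f.region_id
    WHERE r.country_code = 'US'
    GROUP BY r.region_name;
    """
)

# A BI tool would only need to read the view.
rows = conn.execute(
    "SELECT region, sales_amount FROM v_us_sales_by_region"
).fetchall()
print(rows)  # [('Northwest', 3250.0)]
```

Every tool that reads the view gets the same filtered numbers, which is one way of keeping a single source of truth.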
For example, if the requirement for a BI dashboard was to show current year and prior year sales for US regions only, the filter for region code would ideally be applied in the data warehouse as opposed to inside of the tool. The following is a query written in SQL joining two tables from the AdventureWorks database that highlights the difference between dimensions and measures. The 'Region' column is a dimension column and 'SalesYTD' and 'SalesPY' are measure columns. In this example, 'TerritoryID' is serving as the join key between 'SalesTerritory' and 'SalesPerson'. Since the measures are coming from the 'SalesPerson' table, that table will serve as the fact table and 'SalesPerson.TerritoryID' will serve as the fact ID. Since the Region column is dimensional and coming from the 'SalesTerritory' table, that table will serve as the dimensional or lookup table and 'SalesTerritory.TerritoryID' will serve as the dimension ID. In a finely tuned data warehouse, both the fact ID and dimension ID would be indexed to allow for efficient query performance. This performance is obtained by sorting the IDs numerically, so that joining a row from one table to another table requires searching only a subset of that table rather than the entire table. When the table is only a few hundred rows, it may not seem necessary to index columns, but when the table grows to a few hundred million rows, it may become necessary.

-- Dimension: Region; measures: SalesYTD and SalesPY
SELECT region.Name AS Region
      ,ROUND(SUM(sales.SalesYTD), 2) AS SalesYTD
      ,ROUND(SUM(sales.SalesLastYear), 2) AS SalesPY
FROM [AdventureWorks2014].[Sales].[SalesTerritory] region
LEFT OUTER JOIN [AdventureWorks2014].[Sales].[SalesPerson] sales
    ON sales.TerritoryID = region.TerritoryID
-- The US-only filter is applied in the warehouse, not in the BI tool
WHERE region.CountryRegionCode = 'US'
GROUP BY region.Name
ORDER BY region.Name ASC

There are several reasons why applying the logic at the database level is considered a best practice.
Most of the time, these requests for filtering data or manipulating calculations are done at the BI tool level because it is easier for the developer than going to the source. However, if these filters are being performed due to data quality issues, then applying logic at the reporting level is only masking an underlying data issue that needs to be addressed across the entire data warehouse. You would be doing yourself a disservice in the long run, as you would be establishing a precedent that data quality is handled by the report developer as opposed to the database administrator. You are just adding additional work onto your plate. Ideal BI tools will quickly connect to the data source and then allow for slicing and dicing of your dimensions and measures in a manner that will quickly inform the business of useful and practical information. Ultimately, the choice of a BI tool by an individual or an organization will come down to the ease of use of the tool as well as the flexibility to showcase the data through various components such as graphs, charts, widgets, and infographics.

Management

If you are a Business Intelligence manager looking to establish a department with a variety of tools to help flesh out your requirements, this book could serve as a good source for interview questions to weed out unqualified candidates. A manager could use it to distinguish some of the nuances between these different skillsets and prioritize hiring based on immediate needs.

Data Scientist

The term Data Scientist has been misused in the BI industry, in my humble opinion. It has been lumped in with Data Analyst as well as BI Developer. Unfortunately, these three positions have separate skillsets and you will do yourself a disservice by assuming one person can do multiple positions successfully.
A Data Scientist will be able to apply statistical algorithms behind the data that is being extracted from the BI tools and make predictions about what will happen in the future with that same data set. Due to this skillset, a Data Scientist may find the chapters focusing on R and Python to be of particular importance because of their abilities to leverage predictive capabilities within their BI delivery mechanisms.

Data Analyst

A Data Analyst is probably the second most misused position behind a Data Scientist. Typically, a Data Analyst should be analyzing the data that is coming out of the BI tools that are connected to the data warehouse. Most Data Analysts are comfortable working with Microsoft Excel. Oftentimes they are asked to take on additional roles in developing dashboards that require additional programming skills. This is where they would find some comfort using a tool like Power BI, Tableau, or QlikView. These tools allow a Data Analyst to quickly develop a storyboard or visualization that enables quick analysis with minimal programming skills.

Visualization Developer

A 'dataviz' developer is someone who can create complex visualizations out of data and showcase interesting interactions between different measures inside of a dataset that cannot necessarily be seen with a traditional chart or graph. More often than not, these developers possess some programming background such as JavaScript, HTML, or CSS. These developers are also used to developing applications directly for the web and therefore would find D3.js a comfortable environment to program in.

Working with Data and SQL

The examples and exercises will come from the AdventureWorks database. The AdventureWorks database has a comprehensive list of tables that mimics an actual bicycle retailer. The examples will draw on different tables from the database to highlight BI reporting from the various segments appropriate for the AdventureWorks Company.
These segments include Human Resources, Manufacturing, Sales, Purchasing, and Contact Management. A different segment of the data will be highlighted in each chapter utilizing a specific set of tools. A cursory understanding of SQL (Structured Query Language) will be helpful to get a grasp of how data is being aggregated with dimensions and measures. Additionally, an understanding of the SQL statements used will help with the validation process to ensure a single source of truth between the source data and the output inside of the BI tool of choice. For more information about learning SQL, visit the following website: www.sqlfordummies.com

Working with business intelligence tools

Over the course of the last 20 years, there has been a growing number of software products released that were geared towards Business Intelligence. In addition, there have been a number of software products and programming languages that were not initially built for BI but later on became a staple of the industry. The tools used were chosen based on the fact that they were either built off of open source technology or they were products from companies that provided free versions of their software for development purposes. Many of the big enterprise firms have their own BI tools and they are quite popular. However, unless you have a license with them, it is unlikely that you will be able to use their tool without having to shell out a small fortune.

Power BI and Excel

Power BI is one of the newer BI tools from Microsoft. It is known as a self-service solution and integrates seamlessly with data sources such as Microsoft Excel and Microsoft SQL Server. Our primary purpose in using Power BI will be to generate interactive dashboards, reports, and datasets for users. In addition to using Power BI, we will also focus on utilizing Microsoft Excel to assist with some data analysis and validation of results that are being pulled from our data warehouse.
Pivot tables are very popular within MS Excel and will be used to validate aggregation done inside of the data warehouse.

D3.js

D3.js, also known as data-driven documents, is a JavaScript library known for delivering beautiful visualizations by manipulating documents based on data. Since D3 is rooted in JavaScript, all visualizations make a seamless transition to the web. D3 allows for major customization of any part of a visualization and, because of this flexibility, it has a steeper learning curve than probably any other software program. D3 can consume data easily as a .json or a .csv file. Additionally, the data can also be embedded directly within the JavaScript code that renders the visualization on the web.

R

R is a free and open source statistical programming language that produces beautiful graphics. The R language has been widely used among the statistical community and more recently in the data science and machine learning community as well. Due to this fact, it has picked up steam in recent years as a platform for displaying and delivering effective and practical BI. In addition to visualizing BI, R has the ability to also visualize predictive analysis with algorithms and forecasts. While R is a bit raw in its interface, some IDEs (Integrated Development Environments) have been developed to ease the user experience. RStudio will be used to deliver the visualizations developed within R.

Python

Python is considered the most traditional programming language of all the languages covered here. It is a widely used general purpose programming language with several modules that are very powerful in analyzing and visualizing data. Similar to R, Python is a bit raw in its own form for delivering beautiful graphics as a BI tool; however, with the incorporation of an IDE, the user interface becomes a much more pleasurable development experience. PyCharm will be the IDE used to develop BI with Python.
PyCharm is free to use and allows creation of the IPython notebook, which delivers seamless integration between Python and the powerful modules that will assist with BI. As a note, all code in Python will be developed using the Python 3 syntax.

QlikView

QlikView is a product from Qlik, a software company specializing in delivering business intelligence solutions using its desktop tool. QlikView is one of the leaders in delivering quick visualizations based on data and queries through its desktop application. It is advertised as self-service BI for business users. While Qlik does offer solutions that target more enterprise organizations, it also offers a free version of the tool for personal use. Tableau is probably the closest competitor in terms of delivering similar BI solutions.

Tableau

Tableau is a software company specializing in delivering business intelligence solutions using their desktop tool. If this sounds similar to QlikView, that's because it is. Both are leaders in the field of establishing a delivery mechanism with easy installation, setup, and connectivity to the available data. Tableau has a free version of their desktop tool. Again, Tableau excels at delivering both beautiful visualizations quickly and self-service data discovery to more advanced business users.

Microsoft SQL Server

Microsoft SQL Server will serve as the data warehouse for the examples that we will use with the BI tools. Microsoft SQL Server is relatively simple to install and set up, and it is free to download. Additionally, there are example databases that configure seamlessly with it, such as the AdventureWorks database.

Downloading and Installing MS SQL Server 2014

First things first. We will need to get our database and data warehouse up and running so that we can begin to develop our BI environment. We will visit the Microsoft website below to start the download selection process.
https://www.microsoft.com/en-us/download/details.aspx?id=42299

Select the language that is applicable to you and also select the MS SQL Server Express version with Advanced features in the 64-bit edition, as shown in the following screenshot. Ideally you'll want to be working in a 64-bit edition when dealing with servers. After selecting the file, the download process should begin. Depending on your connection speed it could take some time, as the file is slightly larger than 1 GB.

The next step in the process is selecting a new stand-alone instance of SQL Server 2014, unless you already have a version and wish to upgrade instead, as shown in the following screenshot.

After accepting the license terms, continue through the steps in the Global Rules as well as the Product Updates to get to the setup installation files. For the feature selection tab, make sure the following features are selected for your installation, as shown in the following screenshot.

Our preference is to give a named instance of this database a label related to the work we are doing. Since this will be used for Business Intelligence, I went ahead and named this instance 'SQLBI', as shown in the following screenshot:

The default Server Configuration settings are sufficient for now; there is no need to change anything under that section, as shown in the following screenshot.

Unless you are required to do otherwise within your company or organization, for personal use it is sufficient to just go with Windows Authentication mode for sign-on, as shown in the following screenshot.

We will not need to do any configuring of Reporting Services, so it is sufficient for our purposes to just go with installing Reporting Services Native mode without any need for configuration at this time. At this point the installation will proceed and may take anywhere between 20 and 30 minutes, depending on the CPU resources.
If you continue to have issues with your installation, you can visit the following website from Microsoft for additional help: http://social.technet.microsoft.com/wiki/contents/articles/23878.installing-sql-server-2014-step-by-step-tutorial.aspx

Ultimately, if everything with the installation is successful, you'll want to see all portions of the installation have a green check mark next to their name and be labeled 'Successful', as shown in the following screenshot.

Downloading and Installing AdventureWorks

We are almost finished with getting our business intelligence data warehouse complete. We are now at the stage where we will extract and load data into our data warehouse. The last part is to download and install the AdventureWorks database from Microsoft. The zipped file for AdventureWorks 2014 is located at the following website from Microsoft: https://msftdbprodsamples.codeplex.com/downloads/get/880661

Once the file is downloaded and unzipped, you will find a file named the following:

AdventureWorks2014.bak

Copy that file and paste it in the following folder, where it will be incorporated with your Microsoft SQL Server 2014 Express Edition:

C:\Program Files\Microsoft SQL Server\MSSQL12.SQLBI\MSSQL\Backup

Also note that the MSSQL12.SQLBI subfolder will vary user by user depending on what you named your SQL instance when you were installing MS SQL Server 2014.
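Because the instance subfolder differs from one installation to the next, it can help to build the paths from the instance name instead of typing them by hand. The following Python sketch does that and assembles a T-SQL statement of the standard RESTORE DATABASE ... FROM DISK ... WITH MOVE form. The helper names are hypothetical, and the default folder layout and the logical file names AdventureWorks2014_data and AdventureWorks2014_log are assumptions that should be checked against your own installation and backup file.

```python
from pathlib import PureWindowsPath

# Root folder and version prefix for SQL Server 2014 (assumed default layout).
SQL_SERVER_ROOT = PureWindowsPath("C:/Program Files/Microsoft SQL Server")


def instance_folder(instance_name):
    """Folder of a named SQL Server 2014 instance, e.g. MSSQL12.SQLBI."""
    return SQL_SERVER_ROOT / f"MSSQL12.{instance_name}" / "MSSQL"


def restore_statement(instance_name, database="AdventureWorks2014"):
    """Build a T-SQL RESTORE statement for the instance (hypothetical helper)."""
    base = instance_folder(instance_name)
    backup = base / "Backup" / f"{database}.bak"
    data = base / "DATA" / f"{database}.mdf"
    log = base / "DATA" / f"{database}.ldf"
    return (
        f"RESTORE DATABASE {database}\n"
        f"FROM DISK = '{backup}'\n"
        f"WITH MOVE '{database}_data' TO '{data}',\n"
        f"     MOVE '{database}_log' TO '{log}',\n"
        f"     REPLACE"
    )


print(instance_folder("SQLBI") / "Backup")
print(restore_statement("SQLBI"))
```

The generated text is only a string; it still has to be pasted into a query window and executed against the server.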
Once that has been copied over, we can fire up Management Studio for SQL Server 2014 and start up a blank new query by going to File | New | Query with Current Connection.

Once you have a blank query set up, copy and paste the following code into it and execute it:

USE [master]
RESTORE DATABASE AdventureWorks2014
FROM DISK = 'C:\Program Files\Microsoft SQL Server\MSSQL12.SQLBI\MSSQL\Backup\AdventureWorks2014.bak'
WITH MOVE 'AdventureWorks2014_data' TO 'C:\Program Files\Microsoft SQL Server\MSSQL12.SQLBI\MSSQL\DATA\AdventureWorks2014.mdf',
MOVE 'AdventureWorks2014_log' TO 'C:\Program Files\Microsoft SQL Server\MSSQL12.SQLBI\MSSQL\DATA\AdventureWorks2014.ldf',
REPLACE

Once again, please note that the MSSQL12.SQLBI subfolder will vary user by user depending on what you named your SQL instance when you were installing MS SQL Server 2014. At this point you should have received a message saying that Microsoft SQL Server has processed 24248 pages for database 'AdventureWorks2014'. Once you have refreshed your database tab in the upper left-hand corner of SQL Server, the AdventureWorks database will become visible, as well as all of the appropriate tables, as shown in the following screenshot:

One final step is to verify that your login account has all of the appropriate server settings. Right-click on the SQL Server name in the upper left-hand portion of Management Studio and select Properties. Select Permissions inside Properties. Find your username and check all of the rights under the Grant column, as shown in the following screenshot:

Finally, we also need to ensure that the folder that houses Microsoft SQL Server 2014 has the appropriate rights enabled for your current user. That specific folder is located under C:\Program Files\Microsoft SQL Server.
For purposes of our exercises, we will assign all rights for the SQL Server user to the following folder as shown in the following screenshot: We are now ready to begin connecting our BI tools to our data! Summary The emphasis will be placed on implementing Business Intelligence best practices within the various tools that will be used based on the different levels of data that is provided within the AdventureWorks database. In the next chapter we will cover extracting additional data from the web that will be joined to the AdventureWorks database. This process is known as web scraping and can be performed with great success using tools such as Python and R. In addition to collecting the data, we will focus on transforming the collected data for optimal query performance. Resources for Article: Further resources on this subject: LabVIEW Basics [article] Thinking Probabilistically [article] Clustering Methods [article]

Packt
09 Nov 2016
34 min read

Introduction to R Programming Language and Statistical Environment

In this article by Simon Walkowiak, author of the book Big Data Analytics with R, we will have the opportunity to learn some of the most important R functions from the base R installation and well-known third-party packages used for data crunching, transformation, and analysis. More specifically, in this article you will learn to:

Understand the landscape of available R data structures
Be guided through a number of R operations allowing you to import data from standard and proprietary data formats
Carry out essential data cleaning and processing activities such as subsetting, aggregating, creating contingency tables, and so on
Inspect the data by implementing a selection of Exploratory Data Analysis techniques such as descriptive statistics
Apply basic statistical methods to estimate correlation parameters between two (Pearson's r) or more variables (multiple regressions), or find the differences between means for two (t-tests) or more groups (analysis of variance, ANOVA)
Be introduced to more advanced data modeling tasks like logistic and Poisson regressions

Learning R

This book assumes that you have been previously exposed to the R programming language, and this article serves more as a revision, and an overview, of the most essential operations, rather than a very thorough handbook on R. The goal of this work is to present you with specific R applications related to Big Data and the way you can combine R with your existing Big Data analytics workflows, instead of teaching you the basics of data processing in R. There is a substantial number of great introductory and beginner-level books on R available at IT specialized bookstores or online, directly from Packt Publishing and other respected publishers, as well as on the Amazon store.
Some of the recommendations include the following: R in Action: Data Analysis and Graphics with R by Robert Kabacoff (2015), 2nd edition, Manning Publications R Cookbook by Paul Teetor (2011), O'Reilly Discovering Statistics Using R by Andy Field, Jeremy Miles, and Zoe Field (2012), SAGE Publications R for Data Science by Dan Toomey (2014), Packt Publishing An alternative route to the acquisition of good practical R skills is through a large number of online resources, or more traditional tutor-led in-class training courses. The first option offers you an almost limitless choice of websites, blogs, and online guides. A good starting point is the main and previously mentioned Comprehensive R Archive Network (CRAN) page (https://cran.r-project.org/), which, apart from the R core software, contains several well-maintained manuals and Task Views—community run indexes of R packages dealing with specific statistical or data management issues. R-bloggers on the other hand (http://www.r-bloggers.com/) deliver regular news on R in the form of R-related blog posts or tutorials prepared by R enthusiasts and data scientists. Other interesting online sources, which you will probably find yourself using quite often, are as follows: http://www.inside-r.org/—news and information from and by R community http://www.rdocumentation.org/—a useful search engine of R packages and functions http://blog.rstudio.org/—a blog run and edited by RStudio engineers http://www.statmethods.net/—a very informative tutorial-laden website based on the popular R book R in Action by Rob Kabacoff However, it is very likely that after some initial reading, and several months of playing with R, your most frequent destinations to seek further R-related information and obtain help on more complex use cases for specific functions will become StackOverflow(http://stackoverflow.com/) and, even better, StackExchange (http://stackexchange.com/). 
StackExchange is in fact a network of support and question-and-answer community-run websites, which address many problems related to statistical, mathematical, biological, and other methods or concepts, whereas StackOverflow, which is currently one of the sub-sites under the StackExchange label, focuses more on applied programming issues and provides users with coding hints and solutions in most (if not all) programming languages known to developers. Both tend to be very popular amongst R users, and as of late December 2015, there were almost 120,000 R-tagged questions asked on StackOverflow. The http://stackoverflow.com/tags/r/info page also contains numerous links and further references to free interactive R learning resources, online books and manuals and many other. Another good idea is to start your R adventure from user-friendly online training courses available through online-learning providers like Coursera (https://www.coursera.org), DataCamp (https://www.datacamp.com), edX (https://www.edx.org), or CodeSchool (https://www.codeschool.com). Of course, owing to the nature of such courses, a successful acquisition of R skills is somewhat subjective, however, in recent years, they have grown in popularity enormously, and they have also gained rather positive reviews from employers and recruiters alike. Online courses may then be very suitable, especially for those who, for various reasons, cannot attend a traditional university degree with R components, or just prefer to learn R at their own leisure or around their working hours. Before we move on to the practical part, whichever strategy you are going to use to learn R, please do not be discouraged by the first difficulties. 
R, like any other programming language, or should I say, like any other language (including foreign languages), needs time, patience, long hours of practice, and a large number of varied exercises to let you explore many different dimensions and complexities of its syntax and rich libraries of functions. If you are still struggling with your R skills, however, I am sure the next section will get them off the ground. Revisiting R basics In the following section we will present a short revision of the most useful and frequently applied R functions and statements. We will start from a quick R and RStudio installation guide and then proceed to creating R data structures, data manipulation, and transformation techniques, and basic methods used in the Exploratory Data Analysis (EDA). Although the R codes listed in this book have been tested extensively, as always in such cases, please make sure that your equipment is not faulty and that you will be running all the following scripts at your own risk. Getting R and RStudio ready Depending on your operating system (whether Mac OS X, Windows, or Linux) you can download and install specific base R files directly from https://cran.r-project.org/. If you prefer to use RStudio IDE you still need to install R core available from CRAN website first and then download and run installers of the most recent version of RStudio IDE specific for your platform from https://www.rstudio.com/products/rstudio/download/. Personally I prefer to use RStudio, owing to its practical add-ons such as code highlighting and more user-friendly GUI, however, there is no particular reason why you can't use just the simple R core installation if you want to. Having said that, in this book we will be using RStudio in most of the examples. 
All code snippets have been executed on a MacBook Pro laptop with the Mac OS X (Yosemite) operating system, a 2.3 GHz Intel Core i5 processor, a 1TB solid-state drive, and 16GB of RAM, but you should also be fine with a much weaker configuration. In this article we won't be using any large data, and even in the remaining parts of this book the data sets used are limited to approximately 100MB to 130MB in size each. You are also provided with links and references to full Big Data whenever possible.

If you would like to follow the practical parts of this book you are advised to download and unzip the R code and data for each article from the web page created for this book by Packt Publishing. If you use this book in PDF format it is not advisable to copy the code and paste it into the R console. When printed, some characters (like quotation marks " ") may be encoded differently than in R and the execution of such commands may result in errors being returned by the R console.

Once you have downloaded both the R core and RStudio installation files, follow the on-screen instructions for each installer. When you have finished installing them, open RStudio. Upon initialization you should see its GUI with a number of windows distributed on the screen. The largest one is the console in which you input and execute the code, line by line. You can also invoke the editor panel (recommended) by clicking on the white empty file icon in the top left corner of RStudio, or alternatively by navigating to File | New File | R Script. If you have downloaded the R code from the book page of the Packt Publishing website, you may also just click on Open an existing file (Ctrl + O) (a yellow open folder icon) and locate the downloaded R code on your computer's hard drive (or navigate to File | Open File…).

Now your RStudio session is open and we can adjust some of the most essential settings.
First, you need to set your working directory to the location on your hard drive where your data files are. If you know the specific location you can just type the setwd() command with a full and exact path to the location of your data as follows:

> setwd("/Users/simonwalkowiak/Desktop/data")

Of course your actual path will differ from mine, shown in the preceding code. Please mind, however, that if you copy the path from the Windows Explorer address bar you will need to change the backslashes to forward slashes / (or to double backslashes \\). Also, the path needs to be kept within the quotation marks "…". Alternatively you can set your working directory by navigating to Session | Set Working Directory | Choose Directory… to manually select the folder in which you store the data for this session.

Apart from the ones we have already described, there are other ways to set your working directory correctly. In fact most operations, and even more complex data analysis and processing activities, can be achieved in R in numerous ways. For obvious reasons, we won't be presenting all of them, but will focus on the frequently used methods and some tips and hints applicable to special or difficult scenarios. You can check whether your working directory has been set correctly by invoking the following line:

> getwd()
[1] "/Users/simonwalkowiak/Desktop/data"

As you can see, the getwd() function returned the correct destination for my previously defined working directory.

Setting the URLs to R repositories

It is always good practice to check whether your R repositories are set correctly. R repositories are servers located at various institutes and organizations around the world, which store recent updates and new versions of third-party R packages. It is recommended that you set the URL of your default repository to the CRAN server and choose a mirror that is located relatively close to you.
To set the repositories you may use the following code:

> setRepositories(addURLs = c(CRAN = "https://cran.r-project.org/"))

You can check your current, or default, repository URLs by invoking the following function:

> getOption("repos")

The output will confirm your URL selection:

                         CRAN
"https://cran.r-project.org/"

You will be able to choose specific mirrors when you install a new package for the first time during the session, or you may navigate to Tools | Global Options… | Packages. In the Package management section of the window you can alter the default CRAN mirror location; click on the Change… button to adjust it.

Once your repository URLs and working directory are set, you can go on to create data structures that are typical for the R programming language.

R data structures

The concept of data structures in various programming languages is extremely important and cannot be overlooked. Similarly in R, the available data structures allow you to hold any type of data and use them for further processing and analysis. The kind of data structure that you use puts certain constraints on how you can access and process the data stored in it, and on what manipulation techniques you can use. This section will briefly guide you through a number of basic data structures available in the R language.

Vectors

Whenever I teach statistical computing courses, I always start by introducing R learners to vectors as the first data structure they should get familiar with. Vectors are one-dimensional structures that can hold data of a single type, such as numeric, character, or logical. In simple terms, a vector is a sequence of values of some type (for example numeric, character, or logical) of a specified length. The most important thing that you need to remember is that an atomic vector may contain only one type of data.

Let's then create a vector with 10 random deviates from a standard normal distribution, and store all its elements in an object which we will call vector1.
In your RStudio console (or its editor) type the following:

> vector1 <- rnorm(10)

Let's now see the contents of our newly created vector1:

> vector1
 [1] -0.37758383 -2.30857701  2.97803059 -0.03848892  1.38250714
 [6]  0.13337065 -0.51647388 -0.81756661  0.75457226 -0.01954176

As we drew random values, your vector most likely contains different elements from the ones shown in the preceding example. Let's then make sure that my new vector (vector2) is the same as yours. In order to do this we need to set a seed from which we will be drawing the values:

> set.seed(123)
> vector2 <- rnorm(10, mean=3, sd=2)
> vector2
 [1] 1.8790487 2.5396450 6.1174166 3.1410168 3.2585755 6.4301300
 [7] 3.9218324 0.4698775 1.6262943 2.1086761

In the preceding code we've set the seed to an arbitrary number (123) in order to allow you to replicate the values of the elements stored in vector2, and we've also used some optional parameters of the rnorm() function, which enabled us to specify two characteristics of our data, that is, the arithmetic mean (set to 3) and the standard deviation (set to 2). If you wish to inspect all available arguments of the rnorm() function, its default settings, and examples of how to use it in practice, type ?rnorm to view help and information on that specific function.

However, probably the most common way in which you will be creating a vector of data is by using the c() function (c stands for concatenate) and then explicitly passing the values of each element of the vector:

> vector3 <- c(6, 8, 7, 1, 2, 3, 9, 6, 7, 6)
> vector3
 [1] 6 8 7 1 2 3 9 6 7 6

In the preceding example we've created vector3 with 10 numeric elements. You can apply the length() function to any data structure to inspect the number of its elements:

> length(vector3)
[1] 10

The class() and mode() functions allow you to determine how to handle the elements of vector3 and how the data are stored in vector3 respectively.

> class(vector3)
[1] "numeric"
> mode(vector3)
[1] "numeric"

The subtle difference between the two functions becomes clearer if we create a vector that holds the levels of a categorical variable (known as a factor in R) with character values:

> vector4 <- c("poor", "good", "good", "average", "average", "good", "poor", "good", "average", "good")
> vector4
 [1] "poor"    "good"    "good"    "average" "average" "good"    "poor"
 [8] "good"    "average" "good"
> class(vector4)
[1] "character"
> mode(vector4)
[1] "character"
> levels(vector4)
NULL

In the preceding example, both the class() and mode() outputs of our character vector are the same, as we still haven't set it to be treated as a categorical variable, and we haven't defined its levels (the output of the levels() function is empty, that is, NULL). In the following code we will explicitly set the vector to be recognized as categorical with three levels:

> vector4 <- factor(vector4, levels = c("poor", "average", "good"))
> vector4
 [1] poor    good    good    average average good    poor    good
 [9] average good
Levels: poor average good

The sequence of levels doesn't imply that our vector is ordered. We can order the levels of factors in R using the ordered() command. For example, you may want to arrange the levels of vector4 in reverse order, starting from "good":

> vector4.ord <- ordered(vector4, levels = c("good", "average", "poor"))
> vector4.ord
 [1] poor    good    good    average average good    poor    good
 [9] average good
Levels: good < average < poor

You can see from the output that R has now properly recognized the order of the levels that we defined. We can now apply the class() and mode() functions to the vector4.ord object:

> class(vector4.ord)
[1] "ordered" "factor"
> mode(vector4.ord)
[1] "numeric"

You may very likely be wondering why the mode() function returned the "numeric" type instead of "character". The answer is simple. When we set the levels of our factor, R assigned the values 1, 2, and 3 to "good", "average" and "poor" respectively, in exactly the same order as we defined them in the ordered() function. You can check this using the levels() and str() functions:

> levels(vector4.ord)
[1] "good"    "average" "poor"
> str(vector4.ord)
 Ord.factor w/ 3 levels "good"<"average"<..: 3 1 1 2 2 1 3 1 2 1

Just to finalize the subject of vectors, let's create a logical vector, which contains only TRUE and FALSE values:

> vector5 <- c(TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE)
> vector5
 [1]  TRUE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE

Similarly, for all the other vectors already presented, feel free to check their structure, class, mode, and length using the appropriate functions shown in this section. What outputs did those commands return?

Scalars

The reason why I always start from vectors is that scalars just seem trivial when they follow vectors. To simplify things even more, think of scalars as one-element vectors which are traditionally used to hold some constant values, for example:

> a1 <- 5
> a1
[1] 5

Of course you may use scalars in computations and also assign any one-element outputs of mathematical or statistical operations to another, arbitrarily named scalar, for example:

> a2 <- 4
> a3 <- a1 + a2
> a3
[1] 9

In order to complete this short subsection on scalars, create two separate scalars which will hold a character and a logical value.

Matrices

A matrix is a two-dimensional R data structure in which each of its elements must be of the same type, that is, numeric, character, or logical. As matrices consist of rows and columns, their shape resembles tables.
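The single-type rule stated above for atomic vectors and matrices can be seen directly when mixed values are combined. The following is only a minimal sketch; the object names mixed and m are illustrative and are not used elsewhere in this article.

```r
# Combining different types in c() silently coerces every element to the most
# flexible type present (logical < numeric < character).
mixed <- c(1, TRUE, "a")
class(mixed)          # "character"

# Without any character values, logicals are coerced to numbers instead.
c(1.5, TRUE, FALSE)   # 1.5 1.0 0.0

# A matrix follows the same rule, since it is an atomic vector with dimensions.
m <- matrix(c(1, 2, "x", 4), nrow = 2)
class(m[1, 1])        # "character"
```

If you run this sketch, you may want to call rm(mixed, m) afterwards so that the number of objects reported by ls() later in this article still matches.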
In fact, when creating a matrix, you can specify how you want to distribute values across its rows and columns, for example:

> y <- matrix(1:20, nrow=5, ncol=4)
> y
     [,1] [,2] [,3] [,4]
[1,]    1    6   11   16
[2,]    2    7   12   17
[3,]    3    8   13   18
[4,]    4    9   14   19
[5,]    5   10   15   20

In the preceding example we have allocated a sequence of 20 values (from 1 to 20) into five rows and four columns, and by default they have been distributed by column. We may now create another matrix in which we will distribute the values by rows and give names to rows and columns using the dimnames argument (dimnames stands for names of dimensions) in the matrix() function:

> rows <- c("R1", "R2", "R3", "R4", "R5")
> columns <- c("C1", "C2", "C3", "C4")
> z <- matrix(1:20, nrow=5, ncol=4, byrow=TRUE, dimnames=list(rows, columns))
> z
   C1 C2 C3 C4
R1  1  2  3  4
R2  5  6  7  8
R3  9 10 11 12
R4 13 14 15 16
R5 17 18 19 20

As we are talking about matrices, it's hard not to mention how to extract specific elements stored in a matrix. This skill will actually turn out to be very useful when we get to subsetting real data sets. Looking at the matrix y, for which we didn't define any names of its rows and columns, notice how R denotes them. The row numbers come in the format [r, ], where r is the consecutive number of a row, whereas the columns are identified by [ ,c], where c is the consecutive number of a column. If you then wished to extract the value stored in the fourth row of the second column of our matrix y, you could use the following code to do so:

> y[4,2]
[1] 9

In case you wanted to extract the whole of column number three from our matrix y, you could type the following:

> y[,3]
[1] 11 12 13 14 15

As you can see, we don't even need to leave an empty space before the comma in order for this short script to work. Let's now imagine you would like to extract three values stored in the second, third and fifth rows of the first column of our matrix z with named rows and columns. In this case, you may still use the previously shown notation; you do not need to refer explicitly to the names of the dimensions of our matrix z. Additionally, notice that to extract several values we have to specify their row locations as a vector; hence we put their row coordinates inside the c() function, which we previously used to create vectors:

> z[c(2, 3, 5), 1]
R2 R3 R5
 5  9 17

Similar rules of extracting data will apply to other data structures in R such as arrays, lists, and data frames, which we are going to present next.

Arrays

Arrays are very similar to matrices with only one exception: they may contain more than two dimensions. However, just like matrices or vectors, they may only hold one type of data. In the R language, arrays are created using the array() function:

> array1 <- array(1:20, dim=c(2,2,5))
> array1
, , 1

     [,1] [,2]
[1,]    1    3
[2,]    2    4

, , 2

     [,1] [,2]
[1,]    5    7
[2,]    6    8

, , 3

     [,1] [,2]
[1,]    9   11
[2,]   10   12

, , 4

     [,1] [,2]
[1,]   13   15
[2,]   14   16

, , 5

     [,1] [,2]
[1,]   17   19
[2,]   18   20

The dim argument, which was used within the array() function, specifies the size of each dimension across which you want to distribute your data. As we had 20 values (from 1 to 20) we had to make sure that our array could hold all 20 elements, therefore we decided to arrange them into two rows, two columns, and five layers along the third dimension (2 x 2 x 5 = 20). You can check the dimensionality of your multi-dimensional R objects with the dim() command:

> dim(array1)
[1] 2 2 5

As with matrices, you can use standard rules for extracting specific elements from your arrays. The only difference is that now you have additional dimensions to take care of.
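To illustrate those additional dimensions, here is a minimal sketch that rebuilds an array of the same shape as array1 (under the hypothetical name arr, so that nothing created above is overwritten) and extracts whole slices by leaving index positions empty.

```r
arr <- array(1:20, dim = c(2, 2, 5))

# Leaving the row and column positions empty returns the whole third layer
# as a 2 x 2 matrix.
arr[, , 3]

# Fixing the row and column instead returns one value from each of the five
# layers, as a plain vector.
arr[1, 2, ]                     # 3 7 11 15 19

# drop = FALSE keeps all three dimensions even when one of them has length 1.
dim(arr[, , 3, drop = FALSE])   # 2 2 1
```

As before, rm(arr) removes the illustrative object if you want the ls() output later in this article to match.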
Let's assume you want to extract a specific value located in the second row of the first column in the fourth layer (that is, at index 4 of the third dimension) of our array1:

> array1[2, 1, 4]
[1] 14

Also, if you need to find the location of a specific value, for example 11, within the array, you can simply type the following line:

> which(array1==11, arr.ind=TRUE)
     dim1 dim2 dim3
[1,]    1    2    3

Here, the which() function returns the indices of the array (arr.ind=TRUE) at which the stored value equals 11 (hence ==). As we had only one instance of the value 11 in our array, there is only one row specifying its location in the output. If we had more instances of 11, additional rows would be returned indicating the indices of each element equal to 11.

Data frames

The following two short subsections concern two of probably the most widely used R data structures. Data frames are very similar to matrices, but they may contain different types of data. Here you might have suddenly thought of a typical rectangular data set with rows and columns, or observations and variables. In fact you are correct. Most data sets are indeed imported into R as data frames.
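The contrast with matrices can be sketched in a few lines. The objects toy.df and toy.m below are hypothetical and are not used elsewhere in this article; stringsAsFactors = FALSE is set explicitly so that the result does not depend on the R version in use.

```r
# A data frame keeps a separate type for each column...
toy.df <- data.frame(id = 1:3,
                     grade = c("a", "b", "c"),
                     score = c(2.5, 3.0, 4.5),
                     stringsAsFactors = FALSE)
sapply(toy.df, class)   # "integer" "character" "numeric"

# ...whereas converting it to a matrix forces one common type on every cell.
toy.m <- as.matrix(toy.df)
is.character(toy.m)     # TRUE

# Internally a data frame is a list of equal-length column vectors.
is.list(toy.df)         # TRUE
```

Calling rm(toy.df, toy.m) afterwards keeps the object count reported by ls() later in this article unchanged.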
You can also create a simple data frame manually with the data.frame() function, but as each column in the data frame may be of a different type, we must first create vectors which will hold data for specific columns:

> subjectID <- c(1:10)
> age <- c(37,23,42,25,22,25,48,19,22,38)
> gender <- c("male", "male", "male", "male", "male", "female", "female", "female", "female", "female")
> lifesat <- c(9,7,8,10,4,10,8,7,8,9)
> health <- c("good", "average", "average", "good", "poor", "average", "good", "poor", "average", "good")
> paid <- c(T, F, F, T, T, T, F, F, F, T)
> dataset <- data.frame(subjectID, age, gender, lifesat, health, paid)
> dataset
   subjectID age gender lifesat  health  paid
1          1  37   male       9    good  TRUE
2          2  23   male       7 average FALSE
3          3  42   male       8 average FALSE
4          4  25   male      10    good  TRUE
5          5  22   male       4    poor  TRUE
6          6  25 female      10 average  TRUE
7          7  48 female       8    good FALSE
8          8  19 female       7    poor FALSE
9          9  22 female       8 average FALSE
10        10  38 female       9    good  TRUE

The preceding example presents a simple data frame which contains some dummy data, possibly a sample from a basic psychological experiment, which measured subjects' life satisfaction (lifesat) and their health status (health), and also collected other socio-demographic information such as age and gender, and whether the participant was a paid subject or a volunteer. As we deal with various types of data, the elements for each column had to be amalgamated into a single data frame structure using the data.frame() command, specifying the names of the objects (vectors) in which we stored all the values. You can inspect the structure of this data frame with the previously mentioned str() function:

> str(dataset)
'data.frame':   10 obs. of  6 variables:
 $ subjectID: int  1 2 3 4 5 6 7 8 9 10
 $ age      : num  37 23 42 25 22 25 48 19 22 38
 $ gender   : Factor w/ 2 levels "female","male": 2 2 2 2 2 1 1 1 1 1
 $ lifesat  : num  9 7 8 10 4 10 8 7 8 9
 $ health   : Factor w/ 3 levels "average","good",..: 2 1 1 2 3 1 2 3 1 2
 $ paid     : logi  TRUE FALSE FALSE TRUE TRUE TRUE ...

The output of str() gives you some basic insights into the shape and format of your data in the dataset object, for example, the number of observations and variables, the names of the variables, the types of data they hold, and examples of values for each variable. (In R 4.0.0 and later, data.frame() keeps character columns as chr rather than converting them to factors, unless stringsAsFactors = TRUE is set, so your output for gender and health may differ from the one shown here.)

While discussing data frames, it may also be useful to introduce you to another way of creating subsets. As presented earlier, you may apply standard extraction rules to subset the data of interest. For example, suppose you want to print only those columns which contain age, gender, and life satisfaction information from our dataset data frame. You may use the following two alternatives (the output is not shown to save space, but feel free to run them):

> dataset[,2:4] #or
> dataset[, c("age", "gender", "lifesat")]

Both lines of code will produce exactly the same results. The subset() function, however, gives you the additional capability of defining conditional statements which will filter the data, based on the output of logical operators. You can replicate the preceding output using subset() in the following way:

> subset(dataset, select=c("age", "gender", "lifesat"))

Assume now that you want to create a subset with all subjects who are over 30 years old, and with a score of greater than or equal to eight on the life satisfaction scale (lifesat).
The subset() function comes in very handy:

> subset(dataset, age > 30 & lifesat >= 8)
   subjectID age gender lifesat  health  paid
1          1  37   male       9    good  TRUE
3          3  42   male       8 average FALSE
7          7  48 female       8    good FALSE
10        10  38 female       9    good  TRUE

Or you may want to produce an output with the two socio-demographic variables of age and gender, for only those subjects who were paid to participate in this experiment:

> subset(dataset, paid==TRUE, select=c("age", "gender"))
   age gender
1   37   male
4   25   male
5   22   male
6   25 female
10  38 female

We will perform much more thorough and complex data transformations on real data frames in the second part of this article.

Lists

A list in R is a data structure which is a collection of other objects. For example, in a list you can store vectors, scalars, matrices, arrays, data frames, and even other lists. In fact, lists in R are vectors, but they differ from the atomic vectors which we introduced earlier in this section in that lists can hold many different types of data. In the following example, we will construct a simple list (using the list() function) which will include a variety of other data structures:

> simple.vector1 <- c(1, 29, 21, 3, 4, 55)
> simple.matrix <- matrix(1:24, nrow=4, ncol=6, byrow=TRUE)
> simple.scalar1 <- 5
> simple.scalar2 <- "The List"
> simple.vector2 <- c("easy", "moderate", "difficult")
> simple.list <- list(name=simple.scalar2, matrix=simple.matrix, vector=simple.vector1, scalar=simple.scalar1, difficulty=simple.vector2)
> simple.list
$name
[1] "The List"

$matrix
     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]    1    2    3    4    5    6
[2,]    7    8    9   10   11   12
[3,]   13   14   15   16   17   18
[4,]   19   20   21   22   23   24

$vector
[1]  1 29 21  3  4 55

$scalar
[1] 5

$difficulty
[1] "easy"      "moderate"  "difficult"

> str(simple.list)
List of 5
 $ name      : chr "The List"
 $ matrix    : int [1:4, 1:6] 1 7 13 19 2 8 14 20 3 9 ...
 $ vector    : num [1:6] 1 29 21 3 4 55
 $ scalar    : num 5
 $ difficulty: chr [1:3] "easy" "moderate" "difficult"

Looking at the preceding output, you can see that we have assigned names to each component in our list and the str() function prints them as if they were variables of a standard rectangular data set.

In order to extract specific elements from a list, you first need to use the double square bracket notation [[x]] to identify a component x within the list. For example, assuming you want to print the element stored in the first row and the third column of the second component, you may use the following line in R:

> simple.list[[2]][1,3]
[1] 3

Owing to their flexibility, lists are commonly used as the preferred data structures in the outputs of statistical functions. It is therefore important for you to know how you can deal with lists and what sort of methods you can apply to extract and process the data stored in them.

Once you are familiar with the basic features of data structures available in R, you may wish to visit Hadley Wickham's online book at http://adv-r.had.co.nz/ in which he explains various more advanced concepts related to each native data structure in the R language, and different techniques of subsetting data, depending on the way they are stored.

Exporting R data objects

In the previous section we created numerous objects, which you can inspect in the Environment tab window in RStudio.
Alternatively, you may use the ls() function to list all objects stored in your global environment:

> ls()

If you've followed the article along and run the script for this book line by line, the ls() function should return 27 objects:

 [1] "a1"             "a2"             "a3"
 [4] "age"            "array1"         "columns"
 [7] "dataset"        "gender"         "health"
[10] "lifesat"        "paid"           "rows"
[13] "simple.list"    "simple.matrix"  "simple.scalar1"
[16] "simple.scalar2" "simple.vector1" "simple.vector2"
[19] "subjectID"      "vector1"        "vector2"
[22] "vector3"        "vector4"        "vector4.ord"
[25] "vector5"        "y"              "z"

In this section we will present various methods of saving the created objects to your local drive and exporting their contents to a number of the most commonly used file formats.

Sometimes, for various reasons, it may happen that you need to leave your project and exit RStudio or shut your PC down. If you do not save your created objects, you will lose all of them the moment you close RStudio. Remember that R stores created data objects in the RAM of your machine, and whenever these objects are not in use any longer, R frees them from the memory, which simply means that they get deleted. Of course this might turn out to be quite costly, especially if you had not saved your original R script, which would have enabled you to replicate all the steps of your data processing activities when you start a new session in R. In order to prevent the objects from being deleted, you can save all or selected ones as .RData files on your hard drive. In the first case, you may use the save.image() function, which saves your whole current workspace with all objects to your current working directory:

> save.image(file = "workspace.RData")

If you are dealing with large objects, first make sure you have enough storage space available on your drive (this is normally not a problem any longer), or alternatively you can reduce the size of the saved objects using one of the compression methods available. For example, the above workspace.RData file was 3,751 bytes in size with the default settings (which already apply gzip compression), but when xz compression was applied the size of the resulting file decreased to 3,568 bytes.

> save.image(file = "workspace2.RData", compress = "xz")

Of course, the difference in sizes in the presented example is minuscule, as we are dealing with very small objects; however, it gets much more significant for bigger data structures. The trade-off of applying a stronger compression method is the time it takes for R to save and load .RData files.

If you prefer to save only chosen objects (for example the dataset data frame and the simple.list list) you can achieve this with the save() function:

> save(dataset, simple.list, file = "two_objects.RData")

You may now test whether the above solutions worked by clearing your global environment of all objects, and then loading one of the created files, for example:

> rm(list=ls())
> load("workspace2.RData")

As an additional exercise, feel free to explore other functions which allow you to write text representations of R objects, for example dump() or dput(). More specifically, run the following commands and compare the returned outputs:

> dump(ls(), file = "dump.R", append = FALSE)
> dput(dataset, file = "dput.txt")

The save.image() and save() functions only create images of your workspace or selected objects on the hard drive. It is an entirely different story if you want to export some of the objects to data files of specified formats, for example, comma-separated, tab-delimited, or proprietary formats like Microsoft Excel, SPSS, or Stata.
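Before moving on to those formats, the save-and-restore cycle described above can be verified with a short round trip. This is a minimal sketch that uses a temporary file and a hypothetical object named scores, so that it does not touch the files or objects created earlier.

```r
scores <- c(a = 1.5, b = 2.5, c = 4)        # hypothetical object
tmp <- tempfile(fileext = ".RData")

save(scores, file = tmp, compress = "xz")    # write a compressed .RData file
original <- scores
rm(scores)                                   # simulate losing the object

restored_names <- load(tmp)                  # load() returns the names it restored
restored_names                               # "scores"
identical(scores, original)                  # TRUE
```

Capturing the return value of load() is a convenient way to find out which objects a given .RData file has just put back into your environment.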
The easiest way to export R objects to generic file formats like CSV, TXT, or TAB is through the cat() function, but it only works on atomic vectors:

> cat(age, file="age.txt", sep=",", fill=TRUE, labels=NULL, append=TRUE)
> cat(age, file="age.csv", sep=",", fill=TRUE, labels=NULL, append=TRUE)

The preceding code creates two files, one as a text file and another one in a comma-separated format, both of which contain values from the age vector that we had previously created for the dataset data frame. The sep argument is a character vector of strings to append after each element, the fill option is a logical argument which controls whether the output is automatically broken into lines (if set to TRUE), the labels parameter allows you to add a character vector of labels for each printed line of data in the file, and the append logical argument enables you to append the output of the call to an already existing file with the same name.

In order to export vectors and matrices to TXT, CSV, or TAB formats you can use the write() function, which writes out a matrix or a vector in a specified number of columns, for example:

> write(age, file="agedata.csv", ncolumns=2, append=TRUE, sep=",")
> write(y, file="matrix_y.tab", ncolumns=2, append=FALSE, sep="\t")

Another method of exporting matrices is provided by the MASS package (make sure to install it with the install.packages("MASS") function) through the write.matrix() command:

> library(MASS)
> write.matrix(y, file="ymatrix.txt", sep=",")

For large matrices, the write.matrix() function allows users to specify the size of the blocks in which the data are written through the blocksize argument.

Probably the most common R data structure that you are going to export to different file formats will be a data frame.
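Before turning to data frames, one detail of the write() function shown above is worth a quick check: it writes the elements of a matrix in column-major order, so the lines of the output file match the rows of the matrix only if the matrix is transposed first and ncolumns is set to its number of columns. The sketch below assumes a small hypothetical matrix m.small and a temporary file.

```r
m.small <- matrix(1:6, nrow = 2)      # rows are 1 3 5 and 2 4 6
tmp <- tempfile(fileext = ".tab")

# Transpose, then write ncol(m.small) values per line, separated by tabs.
write(t(m.small), file = tmp, ncolumns = ncol(m.small), sep = "\t")

readLines(tmp)                         # "1\t3\t5" "2\t4\t6"

# Reading the file back recovers the original layout.
back <- as.matrix(read.table(tmp, sep = "\t"))
all(back == m.small)                   # TRUE
```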
The generic write.table() function gives you an option to save your processed data frame objects to standard data formats, for example TAB, TXT, or CSV:

> write.table(dataset, file="dataset1.txt", append=TRUE, sep=",", na="NA", col.names=TRUE, row.names=FALSE, dec=".")

The append and sep arguments should already be clear to you as they were explained earlier. In the na option you may specify an arbitrary string to use for missing values in the data. The logical parameter col.names allows users to include the names of the columns in the output file, and the dec parameter sets the string used for decimal points and must be a single character. In the example, we set row.names to FALSE, as the names of the rows in the data are the same as the values of the subjectID column. However, it is very likely that in other data sets the ID variable may differ from the names (or numbers) of rows, so you may want to control it depending on the characteristics of your data.

Two similar functions, write.csv() and write.csv2(), are just convenience wrappers for saving CSV files, and they only differ from the generic write.table() function in the default settings of some of their parameters, for example sep and dec. Feel free to explore these subtle differences at your leisure.

To complete this section of the article we need to present how to export your R data frames to third-party formats. Amongst several frequently used methods, at least four of them are worth mentioning here.
First, if you wish to write a data frame to a proprietary Microsoft Excel format, such as XLS or XLSX, you should probably use the WriteXLS package (please use install.packages("WriteXLS") if you have not done it yet) and its WriteXLS() function:

> library(WriteXLS)
> WriteXLS("dataset", "dataset1.xlsx", SheetNames=NULL, row.names=FALSE, col.names=TRUE, AdjWidth=TRUE, envir=parent.frame())

The WriteXLS() command offers users a number of interesting options, for instance you can set the names of the worksheets (SheetNames argument), adjust the widths of columns depending on the number of characters of the longest value (AdjWidth), or even freeze rows and columns just as you do in Excel (FreezeRow and FreezeCol parameters).

Please note that in order for the WriteXLS package to work, you need to have Perl installed on your machine. The package creates Excel files using Perl scripts called WriteXLS.pl for Excel 2003 (XLS) files, and WriteXLSX.pl for Excel 2007 and later version (XLSX) files. If Perl is not present on your system, please make sure to download and install it from https://www.perl.org/get.html. After the Perl installation, you may have to restart your R session and load the WriteXLS package again to apply the changes. For solutions to common Perl issues please visit the following websites: https://www.perl.org/docs.html, http://www.ahinea.com/en/tech/perl-unicode-struggle.html, and http://www.perl.com/pub/2012/04/perlunicook-standard-preamble.html or search StackOverflow and similar websites for R and Perl related specific problems.

Another very useful way of writing R objects to the XLSX format is provided by the openxlsx package through the write.xlsx() function, which, apart from data frames, also allows lists to be easily written to Excel spreadsheets. Please note that Windows users may need to install the Rtools package in order to use openxlsx functionalities.
The write.xlsx() function gives you a large choice of possible options to set, including a custom style to apply to column names (through headerStyle argument), the color of cell borders (borderColour), or even its line style (borderStyle). The following example utilizes only the most common and minimal arguments required to write a list to the XLSX file, but be encouraged to explore other options offered by this very flexible function: > write.xlsx(simple.list, file = "simple_list.xlsx") A third-party package called foreign makes it possible to write data frames to other formats used by well-known statistical tools such as SPSS, Stata, or SAS. When creating files, the write.foreign() function requires users to specify the names of both the data and code files. Data files hold raw data, whereas code files contain scripts with the data structure and metadata (value and variable labels, variable formats, and so on) written in the proprietary syntax. In the following example, the code writes the dataset data frame to the SPSS format: > library(foreign) > write.foreign(dataset, "datafile.txt", "codefile.txt", package="SPSS") Finally, another package called rio contains only three functions, allowing users to quickly import(), export() and convert() data between a large array of file formats, (for example TSV, CSV, RDS, RData, JSON, DTA, SAV, and many more). The package, in fact, is dependent on a number of other R libraries, some of which, for example foreign and openxlsx, have already been presented in this article. The rio package does not introduce any new functionalities apart from the default arguments characteristic for underlying export functions, so you still need to be familiar with the original functions and their parameters if you require more advanced exporting capabilities. 
But if you are only looking for a no-fuss general export function, the rio package is definitely a good shortcut to take:

> library(rio)
> export(dataset, format = "stata")
> export(dataset, "dataset1.csv", col.names = TRUE, na = "NA")

Summary

In this article, we have provided you with quite a bit of theory and, hopefully, a lot of practical examples of the data structures available to R users. You've created several objects of different types, and you've become familiar with the variety of data and file formats that R has to offer. We then showed you how to save R objects held in your R workspace to external files on your hard drive, or to export them to various standard and proprietary file formats.
Packt
09 Nov 2016
7 min read

Machine Learning Technique: Supervised Learning

In this article by Andrea Isoni, author of the book Machine Learning for the Web, the most relevant regression and classification techniques are discussed. All of these algorithms share the same background procedure, and usually the name of the algorithm refers to both a classification and a regression method. Linear regression algorithms, Naive Bayes, decision trees, and support vector machines are going to be discussed in the following sections. To understand how to employ the techniques, a classification and a regression problem will be solved using the mentioned methods. Essentially, a labeled training dataset will be used to train the models, which means finding the values of the parameters, as we discussed in the introduction. As usual, the code is available in the author's GitHub folder at https://github.com/ai2010/machine_learning_for_the_web/tree/master/chapter_3/.

We will conclude the article with an extra algorithm that may be used for classification, although it is not specifically designed for this purpose (the hidden Markov model). We will now begin by explaining the general causes of error in the methods when predicting the true labels associated with a dataset.

Model error estimation

We said that the trained model is used to predict the labels of new data, and the quality of the prediction depends on the ability of the model to generalize, that is, to correctly predict cases not present in the training data. This is a well-known problem in the literature and is related to two concepts: the bias and the variance of the outputs. The bias is the error due to a wrong assumption in the algorithm. Given a point $x^{(t)}$ with label $y_t$, the model is biased if, when it is trained with different training sets, the predicted label $y_t^{pred}$ is systematically different from $y_t$. The variance error instead refers to how widely the predicted labels of the given point $x^{(t)}$ are scattered across those training sets.
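The two error sources can be made concrete with a small simulation. The sketch below is only an illustration under stated assumptions: the true relationship (label = 3x plus Gaussian noise), the two toy predictors, and all function names are hypothetical and do not come from the book's code. It trains each predictor on many random training sets and measures, at one test point, how far the average prediction is from the true label (bias) and how much the predictions scatter (variance).

```python
import random
import statistics

def true_label(x):
    # Assumed ground truth for this illustration only.
    return 3.0 * x

def sample_training_set(rng, n=20, noise_sd=0.5):
    # Draw n noisy (x, y) pairs with x uniform in [0, 1].
    xs = [rng.random() for _ in range(n)]
    ys = [true_label(x) + rng.gauss(0.0, noise_sd) for x in xs]
    return xs, ys

def predict_mean(xs, ys, x_t):
    # Ignores x_t entirely: a wrong assumption, hence a biased model.
    return statistics.mean(ys)

def predict_nearest(xs, ys, x_t):
    # Copies the label of the single nearest training point: noisy, hence high variance.
    i = min(range(len(xs)), key=lambda k: abs(xs[k] - x_t))
    return ys[i]

def bias_and_variance(predict, x_t, trials=2000, seed=0):
    # Retrain on many independent training sets and collect the predictions at x_t.
    rng = random.Random(seed)
    preds = []
    for _ in range(trials):
        xs, ys = sample_training_set(rng)
        preds.append(predict(xs, ys, x_t))
    bias = statistics.mean(preds) - true_label(x_t)
    variance = statistics.pvariance(preds)
    return bias, variance

if __name__ == "__main__":
    for name, model in [("mean", predict_mean), ("nearest", predict_nearest)]:
        b, v = bias_and_variance(model, x_t=0.9)
        print(name, "bias:", round(b, 3), "variance:", round(v, 3))
```

In this setup the constant predictor should show the larger bias and the nearest-point predictor the larger variance, which mirrors the distinction drawn in the text.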
A classic example to explain the concepts is to consider a circle with the true value at the center (true label), as shown in the following figure. The closer the predicted labels are to the center, the more unbiased the model and the lower the variance (top left in the following figure). The other three cases are also shown here:

Figure: Variance and bias example.

A model with low variance and low bias errors will have its predicted labels, that is, the blue dots (as shown in the preceding figure), concentrated on the red center (true label). A high bias error occurs when the predictions are far away from the true label, while high variance appears when the predictions are spread over a wide range of values.

We have already seen that labels can be continuous or discrete, corresponding to regression and classification problems, respectively. Most of the models are suitable for solving both problems, and we are going to use the words regression and classification to refer to the same model. More formally, given a set of N data points $x^{(i)}$, $i = 1, \dots, N$, and corresponding labels $y_i$, a model with a set of estimated parameters $\hat{\theta}$ (approximating the true parameter values $\theta$) will have a mean square error (MSE) equal to:

$$\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N}\left(y_i^{pred} - y_i\right)^2$$

where $y_i^{pred}$ is the label predicted by the model for $x^{(i)}$. We will use the MSE as a measure to evaluate the methods discussed in this article. Now we will start describing the generalized linear methods.

Generalized linear models

The generalized linear model is a group of models that try to find the M parameters $\theta_j$ that form a linear relationship between the labels $y_i$ and the feature vector $x^{(i)}$, as follows:

$$y_i = \sum_{j=1}^{M}\theta_j x_j^{(i)} + \epsilon_i$$

Here, $\epsilon_i$ are the errors of the model. The algorithm for finding the parameters tries to minimize the total error of the model, defined by the cost function J:

$$J = \frac{1}{2}\sum_{i=1}^{N}\epsilon_i^2$$

The minimization of J is achieved using an iterative algorithm called batch gradient descent:

$$\theta_j \leftarrow \theta_j - \alpha\frac{\partial J}{\partial\theta_j}, \qquad j = 1, \dots, M$$

Here, $\alpha$ is called the learning rate, and it is a trade-off between convergence speed and convergence precision.
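Batch gradient descent can be sketched in a few lines of plain Python. This is a minimal illustration, not the book's implementation: it assumes the usual squared-error cost (half the sum of squared differences between prediction and label), a linear model, and a hand-picked learning rate called alpha in the code; the function names and the toy data are hypothetical.

```python
def predict(theta, x):
    # Linear model: weighted sum of the features.
    return sum(t * xj for t, xj in zip(theta, x))

def cost(theta, X, y):
    # Assumed cost: half the sum of squared errors over the whole training set.
    return 0.5 * sum((predict(theta, xi) - yi) ** 2 for xi, yi in zip(X, y))

def batch_gradient_descent(X, y, alpha=0.02, iterations=2000):
    theta = [0.0] * len(X[0])
    for _ in range(iterations):
        # Errors are computed on the entire training set before any update.
        errors = [predict(theta, xi) - yi for xi, yi in zip(X, y)]
        gradient = [sum(e * xi[j] for e, xi in zip(errors, X))
                    for j in range(len(theta))]
        # Every parameter moves against its gradient, scaled by the learning rate.
        theta = [t - alpha * g for t, g in zip(theta, gradient)]
    return theta

# Toy data generated from y = 1 + 2x; the first feature is a constant 1 (intercept).
X = [[1.0, float(v)] for v in range(5)]
y = [1.0 + 2.0 * row[1] for row in X]
theta = batch_gradient_descent(X, y)

if __name__ == "__main__":
    print("estimated parameters:", [round(t, 4) for t in theta])
    print("final cost:", cost(theta, X, y))
```

A learning rate that is too large makes the updates diverge, while one that is too small needs many more iterations, which is the trade-off mentioned above.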
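An alternative algorithm is stochastic gradient descent, which loops over the training examples $i = 1, \dots, N$:

$$\theta_j \leftarrow \theta_j - \alpha\frac{\partial J_i}{\partial\theta_j}, \qquad j = 1, \dots, M$$

where $J_i$ is the contribution of example i to the cost function. Each $\theta_j$ is updated for each training example i instead of waiting to sum over the entire training set. This algorithm converges near the minimum of J, typically faster than batch gradient descent, but the final solution may oscillate around the real values of the parameters. The following paragraphs describe the most common models and the corresponding cost function, J.

Linear regression

Linear regression is the simplest algorithm and is based on the model:

$$h_\theta(x^{(i)}) = \sum_{j=1}^{M}\theta_j x_j^{(i)}$$

The cost function and update rule are:

$$J = \frac{1}{2}\sum_{i=1}^{N}\left(h_\theta(x^{(i)}) - y_i\right)^2, \qquad \theta_j \leftarrow \theta_j + \alpha\sum_{i=1}^{N}\left(y_i - h_\theta(x^{(i)})\right)x_j^{(i)}$$

Ridge regression

Ridge regression, also known as Tikhonov regularization, adds a term to the cost function J such that:

$$J = \frac{1}{2}\sum_{i=1}^{N}\left(h_\theta(x^{(i)}) - y_i\right)^2 + \frac{\lambda}{2}\sum_{j=1}^{M}\theta_j^2$$

where $\lambda$ is the regularization parameter. The additional term serves to prefer a certain set of parameters over all the possible solutions by penalizing all the parameters $\theta_j$ different from 0. The final set of $\theta_j$ is shrunk around 0, lowering the variance of the parameters but introducing a bias error. Indicating with the superscript l the parameters from the linear regression and with the superscript r those from ridge regression, the ridge regression parameters are related to them by the following formula (valid for orthonormal features):

$$\theta_j^{r} = \frac{\theta_j^{l}}{1 + \lambda}$$

This clearly shows that the larger the $\lambda$ value, the more the ridge parameters are shrunk around 0.

Lasso regression

Lasso regression is an algorithm similar to ridge regression, the only difference being that the regularization term is the sum of the absolute values of the parameters:

$$J = \frac{1}{2}\sum_{i=1}^{N}\left(h_\theta(x^{(i)}) - y_i\right)^2 + \lambda\sum_{j=1}^{M}\left|\theta_j\right|$$

Logistic regression

Despite the name, this algorithm is used for (binary) classification problems, so we define the labels $y_i \in \{0, 1\}$. The model is given by the so-called logistic function, expressed by:

$$h_\theta(x^{(i)}) = \frac{1}{1 + e^{-\sum_{j=1}^{M}\theta_j x_j^{(i)}}}$$

In this case, the cost function is defined as follows:

$$J = -\sum_{i=1}^{N}\left[y_i\log h_\theta(x^{(i)}) + (1 - y_i)\log\left(1 - h_\theta(x^{(i)})\right)\right]$$

From this, the update rule is formally the same as for linear regression (but the model definition, $h_\theta$, is different):

$$\theta_j \leftarrow \theta_j + \alpha\sum_{i=1}^{N}\left(y_i - h_\theta(x^{(i)})\right)x_j^{(i)}$$

Note that the prediction for a point p, $h_\theta(x^{(p)})$, is a continuous value between 0 and 1.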
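A short sketch may help to see how the logistic model produces values between 0 and 1 and how the same form of update rule is applied to it. The code below is an illustration under assumptions, not the book's code: it uses plain batch gradient descent on the log-loss cost, a fixed learning rate, and a tiny hypothetical one-feature dataset with a constant first feature acting as the intercept.

```python
import math

def sigmoid(z):
    # Logistic function: maps any real number into the open interval (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(theta, x):
    # Model output for one point: logistic function of the linear combination.
    return sigmoid(sum(t * xj for t, xj in zip(theta, x)))

def train_logistic(X, y, alpha=0.05, iterations=5000):
    theta = [0.0] * len(X[0])
    for _ in range(iterations):
        # Same form as the linear regression update, with the logistic model inside.
        errors = [predict_proba(theta, xi) - yi for xi, yi in zip(X, y)]
        gradient = [sum(e * xi[j] for e, xi in zip(errors, X))
                    for j in range(len(theta))]
        theta = [t - alpha * g for t, g in zip(theta, gradient)]
    return theta

# Hypothetical data: small feature values belong to class 0, large ones to class 1.
X = [[1.0, float(v)] for v in range(6)]
y = [0, 0, 0, 1, 1, 1]
theta = train_logistic(X, y)
probabilities = [predict_proba(theta, xi) for xi in X]
# A probability of at least 0.5 is read as class 1, anything lower as class 0.
predicted = [1 if p >= 0.5 else 0 for p in probabilities]

if __name__ == "__main__":
    print("probabilities:", [round(p, 3) for p in probabilities])
    print("predicted labels:", predicted)
```

The continuous outputs only become class labels once a cut-off is applied to them; 0.5 is the conventional choice used in this sketch.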
So usually, to estimate the class label, we set a threshold at $h_\theta(x^{(p)}) = 0.5$ such that:

$$y_p = \begin{cases} 1 & \text{if } h_\theta(x^{(p)}) \geq 0.5 \\ 0 & \text{otherwise} \end{cases}$$

The logistic regression algorithm is applicable to multiple label problems using the techniques one versus all or one versus one. Using the first method, a problem with K classes is solved by training K logistic regression models, each one assuming the labels of the considered class j as 1 and all the rest as 0. The second approach consists of training a model for each pair of labels ($K(K-1)/2$ trained models).

Probabilistic interpretation of generalized linear models

Now that we have seen the generalized linear model, let's find the parameters $\theta_j$ that satisfy the relationship:

$$y_i = h_\theta(x^{(i)}) + \epsilon_i$$

In the case of linear regression, we can assume $\epsilon_i$ to be normally distributed with mean 0 and variance $\sigma^2$, such that the probability $p(\epsilon_i)$ is equivalent to:

$$p(y_i \mid x^{(i)}; \theta) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{\left(y_i - h_\theta(x^{(i)})\right)^2}{2\sigma^2}\right)$$

Therefore, the total likelihood of the system can be expressed as follows:

$$L(\theta) = \prod_{i=1}^{N} p(y_i \mid x^{(i)}; \theta)$$

In the case of the logistic regression algorithm, we are assuming that the logistic function itself is the probability:

$$P(y_i = 1 \mid x^{(i)}; \theta) = h_\theta(x^{(i)}), \qquad P(y_i = 0 \mid x^{(i)}; \theta) = 1 - h_\theta(x^{(i)})$$

Then the likelihood can be expressed by:

$$L(\theta) = \prod_{i=1}^{N} h_\theta(x^{(i)})^{y_i}\left(1 - h_\theta(x^{(i)})\right)^{1 - y_i}$$

In both cases, it can be shown that maximizing the likelihood is equivalent to minimizing the cost function, so the gradient descent will be the same.

k-nearest neighbours (KNN)

This is a very simple classification (or regression) method in which, given a set of feature vectors $x^{(i)}$, $i = 1, \dots, N$, with corresponding labels $y_i$, a test point $x^{(t)}$ is assigned the label value with the majority of occurrences among its K nearest neighbors $x^{(k)}$, $k = 1, \dots, K$, found using a distance measure such as the following:

Euclidean: $d(x^{(t)}, x^{(i)}) = \sqrt{\sum_{j}\left(x_j^{(t)} - x_j^{(i)}\right)^2}$
Manhattan: $d(x^{(t)}, x^{(i)}) = \sum_{j}\left|x_j^{(t)} - x_j^{(i)}\right|$
Minkowski: $d(x^{(t)}, x^{(i)}) = \left(\sum_{j}\left|x_j^{(t)} - x_j^{(i)}\right|^q\right)^{1/q}$ (if q=2, this reduces to the Euclidean distance)

In the case of regression, the value $y_t$ is calculated by replacing the majority of occurrences with the average of the labels, $y_t = \frac{1}{K}\sum_{k=1}^{K} y_k$. The simplest average (or the majority of occurrences) has uniform weights, so each point has the same importance regardless of its actual distance from $x^{(t)}$. However, a weighted average with weights equal to the inverse distance from $x^{(t)}$ may be used.
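The KNN procedure described above is small enough to sketch in plain Python. This is an illustrative sketch rather than the book's implementation (which relies on sklearn): the function names and the toy points are hypothetical, and an exact match with a training point is handled by returning that point's value, which is a design choice made here to avoid dividing by a zero distance.

```python
from collections import Counter

def minkowski(a, b, q=2):
    # q=1 gives the Manhattan distance, q=2 the Euclidean distance.
    return sum(abs(ai - bi) ** q for ai, bi in zip(a, b)) ** (1.0 / q)

def k_nearest(points, labels, x_t, k, q=2):
    # Sort the training points by distance from x_t and keep the closest k.
    ranked = sorted(zip(points, labels), key=lambda pl: minkowski(pl[0], x_t, q))
    return ranked[:k]

def knn_classify(points, labels, x_t, k=3, q=2):
    # Majority vote among the k nearest neighbours (uniform weights).
    votes = Counter(label for _, label in k_nearest(points, labels, x_t, k, q))
    return votes.most_common(1)[0][0]

def knn_regress(points, values, x_t, k=3, q=2, weighted=False):
    neighbours = k_nearest(points, values, x_t, k, q)
    if not weighted:
        # Plain average of the neighbours' values.
        return sum(v for _, v in neighbours) / len(neighbours)
    distances = [minkowski(p, x_t, q) for p, _ in neighbours]
    for d, (_, v) in zip(distances, neighbours):
        if d == 0.0:
            # x_t coincides with a training point: use its value directly.
            return v
    # Inverse-distance weights: closer neighbours count more.
    weights = [1.0 / d for d in distances]
    return sum(w * v for w, (_, v) in zip(weights, neighbours)) / sum(weights)

points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = ["a", "a", "a", "b", "b", "b"]
values = [0.0, 1.0, 1.0, 10.0, 11.0, 11.0]

if __name__ == "__main__":
    print(knn_classify(points, labels, (0.5, 0.5)))
    print(knn_regress(points, values, (5.5, 5.5), weighted=True))
```

Sorting all points is the simplest approach and is fine for small datasets; library implementations typically use tree structures to find neighbours faster.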
Summary

In this article, the major classification and regression algorithms, together with the techniques to implement them, were discussed. You should now be able to understand in which situation each method can be used and how to implement it using Python and its libraries (sklearn and pandas).
Packt
02 Nov 2016
37 min read

Getting Started with Python Packages

In this article by Luca Massaron and Alberto Boschetti, the authors of the book Python Data Science Essentials - Second Edition, we will cover the steps for installing Python and the different installation packages, and take a glance at the essential packages that constitute a complete data science toolbox.

Whether you are an eager learner of data science or a well-grounded data science practitioner, you can take advantage of this essential introduction to Python for data science. You can use it to the fullest if you already have at least some previous experience in basic coding, in writing general-purpose computer programs in Python, or in some other data-analysis-specific language such as MATLAB or R.

Introducing data science and Python

Data science is a relatively new knowledge domain, though its core components have been studied and researched for many years by the computer science community. Its components include linear algebra, statistical modelling, visualization, computational linguistics, graph analysis, machine learning, business intelligence, and data storage and retrieval. Because the domain is new, you have to take into consideration that its frontiers are still somewhat blurred and dynamic. Since data science is made of various constituent disciplines, please also keep in mind that there are different profiles of data scientists depending on their competencies and areas of expertise.

In such a situation, what is the best tool of the trade that you can learn and effectively use in your career as a data scientist? We believe that the best tool is Python, and we intend to provide you with all the essential information that you will need for a quick start. In addition, other tools such as R and MATLAB provide data scientists with specialized tools to solve specific problems in statistical analysis and matrix manipulation in data science.
However, only Python really completes your data scientist skill set. This multipurpose language is suitable for both development and production; it can handle small- to large-scale data problems, and it is easy to learn and grasp no matter what your background or experience is.

Created in 1991 as a general-purpose, interpreted, and object-oriented language, Python has slowly and steadily conquered the scientific community and grown into a mature ecosystem of specialized packages for data processing and analysis. It allows you to run countless fast experiments, develop theories easily, and deploy scientific applications promptly. At present, the core Python characteristics that render it an indispensable data science tool are as follows:

• It offers a large, mature system of packages for data analysis and machine learning. You will get all that you may need in the course of a data analysis, and sometimes even more.
• Python can easily integrate different tools and offers a truly unifying ground for different languages, data strategies, and learning algorithms that can be fitted together easily and which can concretely help data scientists forge powerful solutions. There are packages that allow you to call code in other languages (in Java, C, FORTRAN, R, or Julia), outsourcing some of the computations to them and improving your script performance.
• It is very versatile. No matter what your programming background or style is (object-oriented, procedural, or even functional), you will enjoy programming with Python.
• It is cross-platform; your solutions will work smoothly on Windows, Linux, and Mac OS systems. You won't have to worry all that much about portability.
• Although interpreted, it is undoubtedly fast compared to other mainstream data analysis languages such as R and MATLAB (though it is not comparable to C, Java, and the newly emerged Julia language).
Moreover, there are also static compilers such as Cython, which translates Python code into C, and just-in-time compilers such as PyPy, both of which can deliver higher performance.
• It can work with large in-memory data because of its minimal memory footprint and excellent memory management. The memory garbage collector will often save the day when you load, transform, dice, slice, save, or discard data using various iterations and reiterations of data wrangling.
• It is very simple to learn and use. After you grasp the basics, there's no better way to learn more than by immediately starting with the coding.

Moreover, the number of data scientists using Python is continuously growing: new packages and improvements are released by the community every day, making the Python ecosystem increasingly prolific and rich for data science.

Installing Python

First, let's introduce all the settings you need in order to create a fully working data science environment to test the examples and experiment with the code that we are going to provide you with. Python is an open source, object-oriented, and cross-platform programming language. Compared to some of its direct competitors (for instance, C++ or Java), Python is very concise. It allows you to build a working software prototype in a very short time. Yet it has become the most used language in the data scientist's toolbox not just because of that. It is also a general-purpose language, and it is very flexible due to a variety of available packages that solve a wide spectrum of problems and necessities.

Python 2 or Python 3?

There are two main branches of Python: 2.7.x and 3.x. At the time of writing this article, the Python foundation (www.python.org) is offering downloads for Python version 2.7.11 and 3.5.1.
Although the third version is the newest, the older one is still the most used version in the scientific area, since a few packages (check the website py3readiness.org for a compatibility overview) won't run otherwise yet. In addition, there is no immediate backward compatibility between Python 3 and 2. In fact, if you try to run some code developed for Python 2 with a Python 3 interpreter, it may not work. Major changes have been made to the newest version, and that has affected past compatibility. Some data scientists, having built most of their work on Python 2 and its packages, are reluctant to switch to the new version.

We intend to address a larger audience of data scientists, data analysts, and developers, who may not have such a strong legacy with Python 2. Thus, we agreed that it would be better to work with Python 3 rather than the older version. We suggest using a version such as Python 3.4 or above. After all, Python 3 is the present and the future of Python. It is the only version that will be further developed and improved by the Python foundation, and it will be the default version of the future on many operating systems.

Anyway, if you are currently working with version 2 and you prefer to keep on working with it, you can still run the examples. In fact, for the most part, our code will simply work on Python 2 after having the code itself preceded by these imports:

from __future__ import (absolute_import, division, print_function, unicode_literals)
from builtins import *
from future import standard_library
standard_library.install_aliases()

The from __future__ import commands should always occur at the beginning of your scripts, or else Python may report an error.
As described on the Python-future website (python-future.org), these imports will help convert several Python 3-only constructs to a form compatible with both Python 3 and Python 2 (and in any case, most Python 3 code should simply work on Python 2 even without the aforementioned imports). In order to run the preceding commands successfully, if the future package is not already available on your system, you should install it (version >= 0.15.2) by executing the following command from a shell:

$> pip install -U future

If you're interested in understanding the differences between Python 2 and Python 3 further, we recommend reading the wiki page offered by the Python foundation itself: wiki.python.org/moin/Python2orPython3.

Step-by-step installation

Novice data scientists who have never used Python (who likely don't have the language readily installed on their machines) need to first download the installer from the main website of the project, www.python.org/downloads/, and then install it on their local machine. We will now cover the steps that give you full control over what is installed on your machine. This is very useful when you have to set up single machines to deal with different tasks in data science. Anyway, please be warned that a step-by-step installation really takes time and effort. Instead, installing a ready-made scientific distribution will lessen the burden of installation procedures, and it may be well suited for first starting and learning because it saves you time and sometimes even trouble, though it will put a large number of packages (most of which we won't use) on your computer all at once.

This being a multiplatform programming language, you'll find installers for machines that run either Windows or Unix-like operating systems. Please remember that some of the latest versions of most Linux distributions (such as CentOS, Fedora, Red Hat Enterprise, and Ubuntu) have Python 2 packaged in the repository.
In such a case, and in the case that you already have a Python version on your computer (since our examples run on Python 3), you first have to check exactly which version you are running. To do such a check, just follow these instructions:

Open a Python shell: type python in the terminal, or click on any Python icon you find on your system. Then, after Python has started, run the following code in the Python interactive shell or REPL to test the installation:

>>> import sys
>>> print(sys.version_info)

If you can read that your Python version has the major=2 attribute, it means that you are running a Python 2 instance. Otherwise, if the attribute is valued 3, or if the print statement reports back something like v3.x.x (for instance v3.5.1), you are running the right version of Python and you are ready to move forward.

To clarify the operations we have just mentioned, when a command is given in the terminal command line, we prefix the command with $>. Otherwise, if it's for the Python REPL, it's preceded by >>>.

The installation of packages

Python won't come bundled with all you need, unless you take a specific premade distribution. Therefore, to install the packages you need, you can use either pip or easy_install. Both tools run in the command line and make the process of installation, upgrade, and removal of Python packages a breeze. To check which tools have been installed on your local machine, run the following command:

$> pip

Alternatively, you can also run this command:

$> easy_install

If both of these commands end with an error, you need to install one of them. To install pip, follow the instructions given at pip.pypa.io/en/latest/installing.html. We recommend that you use pip because it is thought of as an improvement over easy_install. Moreover, easy_install is going to be dropped in the future, and pip has important advantages over it.
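The two checks described in this section (which major version of Python is running, and whether pip or easy_install can be found on the path) can also be scripted. The following is a minimal sketch using only the standard library; the helper names and the list of candidate command names are hypothetical choices made for illustration.

```python
import shutil
import sys

def python_major_version():
    # 2 for a Python 2 interpreter, 3 for a Python 3 interpreter.
    return sys.version_info.major

def available_installers(candidates=("pip", "pip3", "easy_install")):
    # shutil.which returns the full path of a command found on the PATH, or None.
    return {name: shutil.which(name) for name in candidates}

if __name__ == "__main__":
    print("Python major version:", python_major_version())
    for name, path in available_installers().items():
        print(name, "->", path if path else "not found")
```

A command reported as not found here would also fail when typed at the $> prompt, so this is only a convenience, not a replacement for the manual check.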
It is preferable to install everything using pip because:

• It is the preferred package manager for Python 3. Starting with Python 2.7.9 and Python 3.4, it is included by default with the Python binary installers.
• It provides an uninstall functionality.
• It rolls back and leaves your system clear if, for whatever reason, the package installation fails.

Using easy_install in spite of pip's advantages makes sense if you are working on Windows, because pip won't always install pre-compiled binary packages. Sometimes it will try to build the package's extensions directly from C source, thus requiring a properly configured compiler (and that's not an easy task on Windows). This depends on whether the package is distributed as eggs (pip cannot directly use their binaries, so it needs to build from their source code) or wheels (in this case, pip can install binaries if available, as explained here: pythonwheels.com/). Instead, easy_install will always install available binaries from eggs and wheels. Therefore, if you are experiencing unexpected difficulties installing a package, easy_install can save your day (at some price anyway, as we just mentioned in the list).

The most recent versions of Python should already have pip installed by default. Therefore, you may have it already installed on your system. If not, the safest way is to download the get-pip.py script from bootstrap.pypa.io/get-pip.py and then run it using the following:

$> python get-pip.py

The script will also install setuptools from pypi.python.org/pypi/setuptools, which also contains easy_install. You're now ready to install the packages you need in order to run the examples provided in this article.
To install the generic <package-name> package, you just need to run this command:

$> pip install <package-name>

Alternatively, you can run the following command:

$> easy_install <package-name>

After this, the <package-name> package and all its dependencies will be downloaded and installed.

Note that on some systems, pip might be named pip3 and easy_install might be named easy_install-3 to stress the fact that both operate on packages for Python 3. If you're unsure, check the version of Python that pip is operating on with:

$> pip -V

For easy_install, the command is slightly different:

$> easy_install --version

If you're not certain whether a library has been installed or not, just try to import a module inside it. If the Python interpreter raises an ImportError error, it can be concluded that the package has not been installed. This is what happens when the NumPy library has been installed:

>>> import numpy

This is what happens if it's not installed:

>>> import numpy
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: No module named numpy

In the latter case, you'll need to first install it through pip or easy_install. Take care that you don't confuse packages with modules. With pip, you install a package; in Python, you import a module. Sometimes, the package and the module have the same name, but in many cases, they don't match. For example, the sklearn module is included in the package named Scikit-learn. Finally, to search and browse the packages available for Python, look at pypi.python.org.

Package upgrades

More often than not, you will find yourself in a situation where you have to upgrade a package because either the new version is required by a dependency or it has additional features that you would like to use.
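Before installing or upgrading anything, it can help to check programmatically whether a module can be imported and which version of a package is installed. The sketch below is an illustration under assumptions: it relies on importlib.metadata, which is only available from Python 3.8 onwards (later than the versions discussed in this article), and the helper names are hypothetical. It also reflects the package/module distinction made above: the first helper takes a module name (such as sklearn), the second a package name (such as scikit-learn).

```python
import importlib.util
from importlib import metadata  # standard library from Python 3.8 onwards

def is_importable(module_name):
    # True if a top-level module with this name can be found, without importing it.
    return importlib.util.find_spec(module_name) is not None

def installed_version(package_name):
    # Version string of an installed package, or None if it is not installed.
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None

if __name__ == "__main__":
    print("json importable:", is_importable("json"))
    print("numpy importable:", is_importable("numpy"))
    print("numpy version:", installed_version("numpy"))
```

On older interpreters, the __version__ attribute of the imported module remains the simplest way to read the version.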
First, check the version of the library you have installed by glancing at the __version__ attribute, as shown in the following example for numpy:

>>> import numpy
>>> numpy.__version__  # 2 underscores before and after
'1.9.2'

Now, if you want to update it to a newer release, say the 1.11.0 version, you can run the following command from the command line:

$> pip install -U numpy==1.11.0

Alternatively, you can use the following command:

$> easy_install --upgrade numpy==1.11.0

Finally, if you're interested in upgrading it to the latest available version, simply run this command:

$> pip install -U numpy

You can alternatively run the following command:

$> easy_install --upgrade numpy

Scientific distributions

As you've read so far, creating a working environment is a time-consuming operation for a data scientist. You first need to install Python and then, one by one, you can install all the libraries that you will need (sometimes, the installation procedures may not go as smoothly as you'd hoped). If you want to save time and effort and want to ensure that you have a fully working Python environment that is ready to use, you can just download, install, and use a scientific Python distribution. Apart from Python, these distributions also include a variety of preinstalled packages, and sometimes they even have additional tools and an IDE. A few of them are very well known among data scientists, and in the following sections you will find some of the key features of each of these distributions. We suggest that you promptly download and install a scientific distribution, such as Anaconda (which is the most complete one).

Anaconda (continuum.io/downloads) is a Python distribution offered by Continuum Analytics that includes nearly 200 packages, among them NumPy, SciPy, pandas, Jupyter, Matplotlib, Scikit-learn, and NLTK. It's a cross-platform distribution (Windows, Linux, and Mac OS X) that can be installed on machines with other existing Python distributions and versions.
Its base version is free, whereas add-ons that contain advanced features are charged separately. Anaconda introduces conda, a binary package manager, as a command-line tool to manage your package installations. As stated on the website, Anaconda's goal is to provide an enterprise-ready Python distribution for large-scale processing, predictive analytics, and scientific computing.

Leveraging conda to install packages

If you've decided to install an Anaconda distribution, you can take advantage of the conda binary installer we mentioned previously. Anyway, conda is an open source package management system, and consequently it can be installed separately from an Anaconda distribution. You can test immediately whether conda is available on your system. Open a shell and type:

$> conda -V

If conda is available, its version will be displayed; otherwise, an error will be reported. If conda is not available, you can quickly install it on your system by going to conda.pydata.org/miniconda.html and installing the Miniconda software suitable for your computer. Miniconda is a minimal installation that only includes conda and its dependencies.

conda can help you manage two tasks: installing packages and creating virtual environments. In this section, we will explore how conda can help you easily install most of the packages you may need in your data science projects. Before starting, please make sure you have the latest version of conda at hand:

$> conda update conda

Now you can install any package you need.
To install the generic <package-name> package, you just need to run the following command:

$> conda install <package-name>

You can also install a particular version of the package just by pointing it out:

$> conda install <package-name>=1.11.0

Similarly, you can install multiple packages at once by listing all their names:

$> conda install <package-name-1> <package-name-2>

If you just need to update a package that you previously installed, you can keep on using conda:

$> conda update <package-name>

You can update all the available packages simply by using the --all argument:

$> conda update --all

Finally, conda can also uninstall packages for you:

$> conda remove <package-name>

If you would like to know more about conda, you can read its documentation at conda.pydata.org/docs/index.html. In summary, its main advantage is that it handles binaries even better than easy_install (by providing pre-built packages on Windows without any need to compile them from source) but without its problems and limitations. With conda, packages are easy to install, update, and even uninstall. On the other hand, conda cannot install directly from a git server (so it cannot access the latest version of many packages under development), and it doesn't cover all the packages available on PyPI as pip itself does.

Enthought Canopy

Enthought Canopy (enthought.com/products/canopy) is a Python distribution by Enthought Inc. It includes more than 200 preinstalled packages, such as NumPy, SciPy, Matplotlib, Jupyter, and pandas. This distribution is targeted at engineers, data scientists, quantitative and data analysts, and enterprises. Its base version (named Canopy Express) is free, but if you need advanced features, you have to buy a paid version. It's a multiplatform distribution, and its command-line install tool is canopy_cli.
PythonXY

PythonXY (python-xy.github.io) is a free, open source Python distribution maintained by the community. It includes a number of packages, such as NumPy, SciPy, NetworkX, Jupyter, and Scikit-learn. It also includes Spyder, an interactive development environment inspired by the MATLAB IDE. It works only on Microsoft Windows, and its command-line installation tool is pip.

WinPython

WinPython (winpython.sourceforge.net) is also a free, open source Python distribution maintained by the community. It is designed for scientists, and includes many packages such as NumPy, SciPy, Matplotlib, and Jupyter. It also includes Spyder as an IDE. It is free and portable. You can put WinPython into any directory, or even onto a USB flash drive, and at the same time maintain multiple copies and versions of it on your system. It works only on Microsoft Windows, and its command-line tool is the WinPython Package Manager (WPPM).

Explaining virtual environments

No matter whether you have chosen to install a stand-alone Python or a scientific distribution, you may have noticed that you are bound on your system to the Python version you have installed. The only exception, for Windows users, is to use a WinPython distribution, since it is a portable installation and you can have as many different installations as you need.

A simple solution to break free of such a limitation is to use virtualenv, a tool to create isolated Python environments. By using different Python environments, you can easily achieve these things:

• Testing any new package installation or experimenting on your Python environment without any fear of breaking anything in an irreparable way. In this case, you need a version of Python that acts as a sandbox.
• Having at hand multiple Python versions (both Python 2 and Python 3), geared with different versions of installed packages.
This can help you in dealing with different versions of Python for different purposes (for instance, some of the packages we are going to present only work on Windows with Python 3.4, which is not the latest release).
• Taking a replicable snapshot of your Python environment easily and having your data science prototypes work smoothly on any other computer or in production. In this case, your main concern is the immutability and replicability of your working environment.

You can find documentation about virtualenv at virtualenv.readthedocs.io/en/stable, though we are going to provide you with all the directions you need to start using it immediately. In order to take advantage of virtualenv, you first have to install it on your system:

$> pip install virtualenv

After the installation completes, you can start building your virtual environments. Before proceeding, you have to make a few decisions:

• If you have more than one version of Python installed on your system, you have to decide which version to pick. Otherwise, virtualenv will take the Python version it was installed with on your system. In order to set a different Python version, pass the -p argument followed by the version of Python you want or by the path of the Python executable to be used (for instance, -p python2.7, or a path to a Python executable such as -p c:\Anaconda2\python.exe).
• With virtualenv, when required to install a certain package, it will install it from scratch, even if it is already available at the system level (in the Python directory you created the virtual environment from). This default behavior makes sense because it allows you to create a completely separate, empty environment. In order to save disk space and limit the installation time of all the packages, you may instead decide to take advantage of the packages already available on your system by using the argument --system-site-packages.
- You may want to be able to move your virtual environment around later, across Python installations and even among different machines. In that case, you may want to make all of the environment's scripts work relative to the path the environment is placed in by using the argument --relocatable.

After deciding on the Python version, the linking to existing global packages, and the relocatability of the virtual environment, you just launch the command from a shell, declaring the name you would like to assign to your new environment:

$> virtualenv clone

virtualenv will create a new directory with the name you provided, in the path from which you launched the command. To start using it, enter the directory and run the activate script, which virtualenv places in the Scripts subdirectory on Windows:

$> cd clone
$> Scripts\activate

On Linux and Mac, the script is in the bin subdirectory and has to be sourced:

$> cd clone
$> source bin/activate

At this point, you can start working in your separate Python environment, installing packages and working with code.

If you need to install multiple packages at once, you can use a special pip function, pip freeze, which lists all the packages (and their versions) installed on your system. You can record the entire list in a text file with this command:

$> pip freeze > requirements.txt

After saving the list in a text file, just take it into your virtual environment and install all the packages with a single command:

$> pip install -r requirements.txt

Each package will be installed according to the order in the list (packages are listed in case-insensitive sorted order). If a package requires other packages that appear later in the list, that's not a big deal because pip manages such situations automatically. So if your package requires NumPy and NumPy is not yet installed, pip will install it first.
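To make the format of the file written by pip freeze more concrete, here is a minimal sketch that reads name==version lines and compares two such lists. The function names parse_requirements and missing_or_different are hypothetical helpers written for this illustration, not part of pip, and the sketch assumes only simple exact pins (it ignores editable installs, URLs, and environment markers). The sample versions are the ones quoted in this article.

```python
def parse_requirements(text):
    """Return {package_name: version} for simple 'name==version' lines.

    Hypothetical helper: only exact pins are handled; every other kind of
    line (blank lines, comments, editable installs) is skipped.
    """
    pins = {}
    for raw_line in text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or "==" not in line:
            continue
        name, version = line.split("==", 1)
        pins[name.strip()] = version.strip()
    return pins


def missing_or_different(source, target):
    """Pins from source that are absent from target or pinned differently."""
    return {name: version
            for name, version in source.items()
            if target.get(name) != version}


# Sample text shaped like the output of pip freeze.
system_packages = parse_requirements("""\
numpy==1.11.0
pandas==0.18.1
scipy==0.17.1
""")
environment_packages = parse_requirements("numpy==1.11.0\n")

# Packages that pip install -r requirements.txt would still have to provide.
print(sorted(missing_or_different(system_packages, environment_packages)))
```

Comparing the two dictionaries in this way is just one convenient way to see what a fresh environment lacks; pip itself resolves and installs the packages when you run pip install -r requirements.txt.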
When you have finished installing packages and using your environment for scripting and experimenting, return to your system defaults by issuing this command:

$> deactivate

If you want to remove the virtual environment completely, after deactivating it and leaving the environment's directory, you just have to delete the environment's directory recursively. For instance, on Windows you do this:

$> rd /s /q clone

On Linux and Mac, the command is:

$> rm -r -f clone

If you work extensively with virtual environments, you should consider using virtualenvwrapper, a set of wrappers for virtualenv that helps you manage multiple virtual environments easily. It can be found at bitbucket.org/dhellmann/virtualenvwrapper. If you are operating on a Unix system (Linux or OS X), another solution worth mentioning is pyenv (which can be found at https://github.com/yyuu/pyenv). It lets you set your main Python version, install multiple versions, and create virtual environments. Its peculiarity is that it does not depend on Python being installed and works perfectly at the user level (no need for sudo commands).

conda for managing environments

If you have installed the Anaconda distribution, or you have tried conda using a Miniconda installation, you can also take advantage of the conda command to run virtual environments as an alternative to virtualenv. Let's see in practice how to use conda for that. We can check which environments are available like this:

$> conda info -e

This command reports which conda-based environments you can use on your system. Most likely, your only environment will be "root", pointing to your Anaconda distribution's folder.

As an example, we can create an environment based on Python version 3.4, having all the necessary Anaconda-packaged libraries installed.
That makes sense, for instance, for using the package Theano together with Python 3 on Windows (because of an issue we will explain in a few paragraphs). To create such an environment, just do:

$> conda create -n python34 python=3.4 anaconda

The command asks for a particular Python version (3.4) and requires the installation of all the packages available in the Anaconda distribution (the argument anaconda). It names the environment python34 using the argument -n. The complete installation should take a while, given the large number of packages in the Anaconda installation. After the installation has completed, you can activate the environment:

$> activate python34

If you need to install additional packages in your environment, when it is activated, you just do:

$> conda install -n python34 <package-name1> <package-name2>

That is, you list the required packages after the name of your environment. Naturally, you can also use pip install, as you would in a virtualenv environment.

You can also use a file instead of listing all the packages by name yourself.
You can create a list in an environment using the list argument and redirecting the output to a file:

$> conda list -e > requirements.txt

Then, in your target environment, you can install the entire list using:

$> conda install --file requirements.txt

You can even create an environment based on a requirements list:

$> conda create -n python34 python=3.4 --file requirements.txt

Finally, after having used the environment, to close the session, you simply do this:

$> deactivate

Unlike virtualenv, conda has a specialized argument to completely remove an environment from your system:

$> conda remove -n python34 --all

A glance at the essential packages

We mentioned that the two most relevant characteristics of Python are its ability to integrate with other languages and its mature package system, which is well embodied by PyPI (the Python Package Index: pypi.python.org/pypi), a common repository for the majority of Python open source packages that is constantly maintained and updated.

The packages that we are now going to introduce are strongly analytical, and together they constitute a complete data science toolbox. All the packages consist of extensively tested functions that are highly optimized for both memory usage and performance. A walkthrough on how to install them is provided next.

Partially inspired by similar tools present in the R and MATLAB environments, we will explore together how a few selected Python commands allow you to handle data efficiently and then explore, transform, experiment, and learn from it without having to write too much code or reinvent the wheel.

NumPy

NumPy, which is Travis Oliphant's creation, is the true analytical workhorse of the Python language. It provides the user with multidimensional arrays, along with a large set of functions to perform a multiplicity of mathematical operations on these arrays.
Arrays are blocks of data arranged along multiple dimensions, which implement mathematical vectors and matrices. Characterized by optimal memory allocation, arrays are useful not just for storing data, but also for fast matrix operations (vectorization), which are indispensable when you wish to solve ad hoc data science problems:

Website: www.numpy.org
Version at the time of print: 1.11.0
Suggested install command: pip install numpy

As a convention largely adopted by the Python community, when importing NumPy, it is suggested that you alias it as np:

import numpy as np

SciPy

An original project by Travis Oliphant, Pearu Peterson, and Eric Jones, SciPy completes NumPy's functionalities, offering a larger variety of scientific algorithms for linear algebra, sparse matrices, signal and image processing, optimization, fast Fourier transformation, and much more:

Website: www.scipy.org
Version at the time of print: 0.17.1
Suggested install command: pip install scipy

pandas

The pandas package deals with everything that NumPy and SciPy cannot do. Thanks to its specific data structures, namely DataFrames and Series, pandas allows you to handle complex tables of data of different types (which is something that NumPy's arrays cannot do) and time series. Thanks to Wes McKinney's creation, you will be able to load data from a variety of sources easily and smoothly. You can then slice, dice, handle missing elements, add, rename, aggregate, reshape, and finally visualize your data at will:

Website: pandas.pydata.org
Version at the time of print: 0.18.1
Suggested install command: pip install pandas

Conventionally, pandas is imported as pd:

import pandas as pd

Scikit-learn

Started as part of the SciKits (SciPy Toolkits), Scikit-learn is the core of data science operations in Python. It offers all that you may need in terms of data preprocessing, supervised and unsupervised learning, model selection, validation, and error metrics.
Scikit-learn started in 2007 as a Google Summer of Code project by David Cournapeau. Since 2013, it has been taken over by the researchers at INRIA (French Institute for Research in Computer Science and Automation):

Website: scikit-learn.org/stable
Version at the time of print: 0.17.1
Suggested install command: pip install scikit-learn

Note that the imported module is named sklearn.

Jupyter

A scientific approach requires the fast experimentation of different hypotheses in a reproducible fashion. Initially named IPython and limited to working only with the Python language, Jupyter was created by Fernando Perez in order to address the need for an interactive Python command shell (which is based on shell, web browser, and the application interface), with graphical integration, customizable commands, rich history (in the JSON format), and computational parallelism for enhanced performance. Jupyter is our favoured choice; it is used to clearly and effectively illustrate operations with scripts and data, and the consequent results:

Website: jupyter.org
Version at the time of print: 1.0.0 (ipykernel = 4.3.1)
Suggested install command: pip install jupyter

Matplotlib

Originally developed by John Hunter, matplotlib is a library that contains all the building blocks that are required to create quality plots from arrays and to visualize them interactively. You can find all the MATLAB-like plotting frameworks inside the pylab module:

Website: matplotlib.org
Version at the time of print: 1.5.1
Suggested install command: pip install matplotlib

You can simply import what you need for your visualization purposes with the following command:

import matplotlib.pyplot as plt

Statsmodels

Previously part of SciKits, statsmodels was conceived as a complement to SciPy's statistical functions.
It features generalized linear models, discrete choice models, time series analysis, and a series of descriptive statistics as well as parametric and nonparametric tests:

Website: statsmodels.sourceforge.net
Version at the time of print: 0.6.1
Suggested install command: pip install statsmodels

Beautiful Soup

Beautiful Soup, a creation of Leonard Richardson, is a great tool to scrape data from HTML and XML files retrieved from the Internet. It works incredibly well, even in the case of tag soups (hence the name), which are collections of malformed, contradictory, and incorrect tags. After choosing your parser (the HTML parser included in Python's standard library works fine), thanks to Beautiful Soup, you can navigate through the objects in the page and extract text, tables, and any other information that you may find useful:

Website: www.crummy.com/software/BeautifulSoup
Version at the time of print: 4.4.1
Suggested install command: pip install beautifulsoup4

Note that the imported module is named bs4.

NetworkX

Developed by the Los Alamos National Laboratory, NetworkX is a package specialized in the creation, manipulation, analysis, and graphical representation of real-life network data (it can easily operate with graphs made up of a million nodes and edges). Besides specialized data structures for graphs and fine visualization methods (2D and 3D), it provides the user with many standard graph measures and algorithms, such as the shortest path, centrality, components, communities, clustering, and PageRank:

Website: networkx.github.io
Version at the time of print: 1.11
Suggested install command: pip install networkx

Conventionally, NetworkX is imported as nx:

import networkx as nx

NLTK

The Natural Language Toolkit (NLTK) provides access to corpora and lexical resources and to a complete suite of functions for statistical Natural Language Processing (NLP), ranging from tokenizers to part-of-speech taggers and from tree models to named-entity recognition.
Initially, Steven Bird and Edward Loper created the package as an NLP teaching infrastructure for their course at the University of Pennsylvania. Now, it is a fantastic tool that you can use to prototype and build NLP systems:

Website: www.nltk.org
Version at the time of print: 3.2.1
Suggested install command: pip install nltk

Gensim

Gensim, programmed by Radim Rehurek, is an open source package that is suitable for the analysis of large textual collections with the help of parallel distributable online algorithms. Among advanced functionalities, it implements Latent Semantic Analysis (LSA), topic modelling by Latent Dirichlet Allocation (LDA), and Google's word2vec, a powerful algorithm that transforms text into vector features that can be used in supervised and unsupervised machine learning:

Website: radimrehurek.com/gensim
Version at the time of print: 0.12.4
Suggested install command: pip install gensim

PyPy

PyPy is not a package; it is an alternative implementation of Python 2.7.8 that supports most of the commonly used Python standard packages (unfortunately, NumPy is currently not fully supported). As an advantage, it offers enhanced speed and memory handling. Thus, it is very useful for heavy-duty operations on large chunks of data and it should be part of your big data handling strategies:

Website: pypy.org/
Version at the time of print: 5.1
Download page: pypy.org/download.html

XGBoost

XGBoost is a scalable, portable, and distributed gradient boosting library (a tree ensemble machine learning algorithm). Initially created by Tianqi Chen from the University of Washington, it has been enriched by a Python wrapper by Bing Xu and an R interface by Tong He (you can read the story behind XGBoost directly from its principal creator at homes.cs.washington.edu/~tqchen/2016/03/10/story-and-lessons-behind-the-evolution-of-xgboost.html).
XGBoost is available for Python, R, Java, Scala, Julia, and C++, and it can work both on a single machine (leveraging multithreading) and in Hadoop and Spark clusters:

Website: xgboost.readthedocs.io/en/latest
Version at the time of print: 0.4
Download page: github.com/dmlc/xgboost

Detailed instructions for installing XGBoost on your system can be found at this page: github.com/dmlc/xgboost/blob/master/doc/build.md

The installation of XGBoost on both Linux and MacOS is quite straightforward, whereas it is a little bit trickier for Windows users. On a POSIX system you just have ...

For this reason, we provide specific installation steps to get XGBoost working on Windows:

First download and install Git for Windows (git-for-windows.github.io).

Then you need a MinGW compiler present on your system. You can download it from www.mingw.org according to the characteristics of your system.

From the command line, execute:

$> git clone --recursive https://github.com/dmlc/xgboost
$> cd xgboost
$> git submodule init
$> git submodule update

Then, still from the command line, copy the configuration for 64-bit systems to be the default one:

$> copy make\mingw64.mk config.mk

Alternatively, you just copy the plain 32-bit version:

$> copy make\mingw.mk config.mk

After copying the configuration file, you can run the compiler, setting it to use four threads in order to speed up the compiling procedure:

$> mingw32-make -j4

In MinGW, the make command comes with the name mingw32-make. If you are using a different compiler, the previous command may not work; then you can simply try:

$> make -j4

Finally, if the compiler completes its work without errors, you can install the package in your Python with this:

$> cd python-package
$> python setup.py install

After following all the preceding instructions, if you try to import XGBoost in Python and it still doesn't load and results in an error, it may well be that Python cannot find MinGW's g++ runtime libraries.
You just need to find the location of MinGW's binaries on your computer (in our case, it was C:\mingw-w64\mingw64\bin; just modify the next code to use yours) and place the following code snippet before importing XGBoost:

import os
# Raw string, so that the backslashes in the Windows path are kept literally
mingw_path = r'C:\mingw-w64\mingw64\bin'
os.environ['PATH'] = mingw_path + ';' + os.environ['PATH']
import xgboost as xgb

As with many other projects under continuous development, depending on the state of the XGBoost project, the preceding installation commands may temporarily fail to work at the time you try them. Usually, waiting for an update of the project or opening an issue with the authors of the package solves the problem.

Theano

Theano is a Python library that allows you to define, optimize, and evaluate mathematical expressions involving multi-dimensional arrays efficiently. Basically, it provides you with all the building blocks you need to create deep neural networks. Created by academics (an entire development team; you can read their names on their most recent paper at arxiv.org/pdf/1605.02688.pdf), Theano has been used for large scale and intensive computations since 2007:

Website: deeplearning.net/software/theano
Release at the time of print: 0.8.2

In spite of many installation problems experienced by users in the past (especially Windows users), the installation of Theano should be straightforward, the package now being available on PyPI:

$> pip install Theano

If you want the most up-to-date version of the package, you can get it by cloning it from GitHub:

$> git clone git://github.com/Theano/Theano.git

Then you can proceed with direct Python installation:

$> cd Theano
$> python setup.py install

To test your installation, you can run the following from shell/CMD and verify the reports:

$> pip install nose
$> pip install nose-parameterized
$> nosetests theano

If you are working on a Windows OS and the previous instructions don't work, you can try these steps using the conda command provided by the Anaconda distribution:

1. Install TDM GCC x64 (this can be found at tdm-gcc.tdragon.net)
2. Open an Anaconda prompt interface and execute:

$> conda update conda
$> conda update --all
$> conda install mingw libpython
$> pip install git+git://github.com/Theano/Theano.git

Theano needs libpython, which isn't compatible yet with Python version 3.5. So if your Windows installation is not working, this is the likely cause. In any case, Theano installs perfectly on Python version 3.4. Our suggestion in this case is to create a virtual Python environment based on version 3.4, and to install and use Theano only on that specific version. Directions on how to create virtual environments are provided in the paragraph about virtualenv and conda create.

In addition, Theano's website provides some information for Windows users; it could support you when everything else fails: deeplearning.net/software/theano/install_windows.html

An important requirement for Theano to scale out on GPUs is to install the Nvidia CUDA drivers and SDK for code generation and execution on the GPU. If you do not know much about the CUDA Toolkit, you can start from this web page in order to understand more about the technology being used: developer.nvidia.com/cuda-toolkit

Therefore, if your computer has an Nvidia GPU, you can find all the necessary instructions for installing CUDA on this tutorial page from Nvidia itself: docs.nvidia.com/cuda/cuda-quick-start-guide/index.html

Keras

Keras is a minimalist and highly modular neural networks library, written in Python and capable of running on top of either Theano or TensorFlow (the open source software library for numerical computation released by Google).
Keras was created by François Chollet, a machine learning researcher working at Google:

Website: keras.io
Version at the time of print: 1.0.3
Suggested installation from PyPI:

$> pip install keras

As an alternative, you can install the latest available version (which is advisable since the package is in continuous development) using the command:

$> pip install git+git://github.com/fchollet/keras.git

Summary

In this article, we performed a lot of installations, from Python packages to examples. They were installed either directly or by using a scientific distribution. We also introduced Jupyter notebooks and demonstrated how you can have access to the data run in the tutorials.
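As a quick check after the installations described in this article, the following minimal sketch reports which of the packages can actually be imported in the current environment. The names ESSENTIAL_MODULES and check_modules are hypothetical, chosen for this illustration; the module names are the import names of the packages presented above (note sklearn and bs4), and the sketch assumes that each module exposes a __version__ attribute, falling back to "unknown" when it does not.

```python
import importlib

# Import names of the packages presented in this article. sklearn and bs4
# differ from the names used with pip install (scikit-learn, beautifulsoup4).
ESSENTIAL_MODULES = [
    "numpy", "scipy", "pandas", "sklearn", "matplotlib", "statsmodels",
    "bs4", "networkx", "nltk", "gensim", "xgboost", "theano", "keras",
]


def check_modules(module_names):
    """Map each module name to its version string, or None if not importable."""
    report = {}
    for name in module_names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            report[name] = None  # not installed in this environment
            continue
        # Assumption: the version is exposed as __version__.
        report[name] = getattr(module, "__version__", "unknown")
    return report


if __name__ == "__main__":
    for name, version in check_modules(ESSENTIAL_MODULES).items():
        status = version if version is not None else "NOT INSTALLED"
        print("{:<12} {}".format(name, status))
```

Running the script inside an activated virtualenv or conda environment shows what that environment contains, which is a simple way to confirm that it is isolated from the system installation.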