
How-To Tutorials - Data


SQL Tuning Enhancements in Oracle 12c

Packt
13 Dec 2016
13 min read
Background

Performance tuning is one of the most critical areas of Oracle database administration, and a good knowledge of SQL tuning helps DBAs tune production databases on a daily basis. Over the years, the Oracle optimizer has gone through several enhancements, and each release improves on the one before it. Oracle 12c is no different: Oracle has improved the optimizer and added new features to make this release better than the previous one. In this article, we are going to look at some of the notable new features of the Oracle optimizer that help us tune our queries.

Objective

In this article, Advait Deo and Indira Karnati, authors of the book OCP Upgrade 1Z0-060 Exam Guide, discuss the new features of the Oracle 12c optimizer and how they help improve SQL plans. They also discuss some of the limitations of the optimizer in previous releases and how Oracle has overcome those limitations in this release. Specifically, we are going to discuss dynamic plans and how they work.

(For more resources related to this topic, see here.)

SQL Tuning

Before we go into the details of each of these new features, let us rewind and check what we used to have in Oracle 11g.

Behavior in Oracle 11g R1

Whenever a SQL statement is executed for the first time, the optimizer generates an execution plan for it based on the statistics available for the different objects used in the plan. If statistics are not available, if the optimizer thinks that the existing statistics are of low quality, or if the SQL uses complex predicates for which the optimizer cannot estimate the cardinality, the optimizer may choose to use dynamic sampling for those tables. Based on these statistics, the optimizer generates the plan and executes the SQL. There are two problems with this approach:

Statistics generated by dynamic sampling may not be of good quality, because they are generated in limited time and from a limited sample size; a trade-off is made to minimize the impact while still approaching a reasonable level of accuracy.

The plan generated this way may not be accurate, because the estimated cardinality may differ greatly from the actual cardinality. The next time the query executes, it goes through a soft parse and picks up the same plan.

Behavior in Oracle 11g R2

To overcome these drawbacks, Oracle enhanced dynamic sampling further in Oracle 11g Release 2. In the 11.2 release, Oracle automatically enables dynamic sampling when a query is run if statistics are missing, or if the optimizer thinks that the current statistics are not up to the mark. The optimizer also decides the level of dynamic sampling, provided the user has not set a non-default value for the OPTIMIZER_DYNAMIC_SAMPLING parameter (the default value is 2). So, if this parameter is left at its default value in Oracle 11g R2, the optimizer decides when to use dynamic sampling in a query and at what level.
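As a quick aside (this snippet is not part of the original article), you can inspect the dynamic sampling setting from SQL*Plus, and an individual query can also request a specific level with the DYNAMIC_SAMPLING hint. The parameter and hint names are standard, but the query below is only a hypothetical example built on the SALES table used later in this article.

SQL> SHOW PARAMETER optimizer_dynamic_sampling

-- Request level-4 sampling for a single query (illustrative only)
SELECT /*+ DYNAMIC_SAMPLING(a 4) */ COUNT(*)
FROM   sales a
WHERE  a.sales_rep = 'SMITH';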
Oracle also introduced a new feature in Oracle 11g R2 called cardinality feedback. Its purpose was to further improve the performance of SQL statements that are executed repeatedly and for which the optimizer does not have the correct cardinality, perhaps because of missing statistics, complex predicate conditions, or some other reason. In such cases, cardinality feedback is very useful. Cardinality feedback works as follows: during the first execution, the plan for the SQL is generated using the traditional method, without cardinality feedback. However, during the optimization stage of that first execution, the optimizer notes down all the estimates that are of low quality (due to missing statistics, complex predicates, or some other reason), and monitoring is enabled for the cursor that is created. If this monitoring is enabled during the optimization stage, then, at the end of the first execution, some cardinality estimates in the plan are compared with the actual values to understand how significant the variation is. If the estimates vary significantly, the actual values for such predicates are stored along with the cursor, and these values are used directly for the next execution instead of being discarded and calculated again. So when the query executes the next time, it is optimized again (a hard parse happens), but this time it uses the actual statistics or predicate cardinalities that were saved during the first execution, and the optimizer comes up with a better plan. Even with these improvements, there are drawbacks:

With cardinality feedback, any missing or corrected estimates are available only for subsequent executions, not for the first one, so the first execution can still perform poorly.

The dynamic sampling improvements (that is, the optimizer deciding whether dynamic sampling should be used and at what level) apply only to parallel queries; they do not apply to queries that are not running in parallel.

Dynamic sampling does not consider joins and GROUP BY columns.

Oracle 12c provides new improvements that eliminate these drawbacks of Oracle 11g R2.

Adaptive execution plans – dynamic plans

The Oracle optimizer chooses the best execution plan for a query based on all the information available to it. Sometimes, the optimizer may not have sufficient statistics, or statistics of good enough quality, which makes it difficult to generate optimal plans. In Oracle 12c, the optimizer has been enhanced to adapt a poorly performing execution plan at run time and prevent a poor plan from being chosen on subsequent executions. An adaptive plan can change the execution plan in the current run when the optimizer's estimates prove to be wrong. This is made possible by collecting statistics at critical places in a plan while the query is executing. A query is internally split into multiple steps, and the optimizer generates multiple sub-plans for each step. Based on the statistics collected at the critical points, the optimizer compares the collected statistics with the estimated cardinality and, if it finds a deviation beyond the set threshold, it picks a different sub-plan for those steps. This improves the ability of the query-processing engine to generate better execution plans.

What happens in adaptive plan execution?

In Oracle 12c, the optimizer generates dynamic plans. A dynamic plan is an execution plan that has many built-in sub-plans; a sub-plan is a portion of the plan that the optimizer can switch to as an alternative at run time. When the first execution starts, the optimizer observes statistics at various critical stages in the plan and makes the final decision about a sub-plan based on the observations made during the execution up to that point. Going deeper into the logic of the dynamic plan, the optimizer actually places statistics collectors at various critical stages in the plan.
These critical stages are the places in the plan where the optimizer has to join two tables or has to decide upon the optimal degree of parallelism. During the execution of the plan, the statistics collector buffers a portion of the rows. The portion of the plan preceding the statistics collector can have alternative sub-plans, each of which is valid for a subset of the possible values returned by the collector; in other words, each sub-plan covers a different range of the threshold value. Based on the data returned by the statistics collector, the sub-plan that falls within the required threshold is chosen. For example, during the query plan building phase, the optimizer can insert code to collect statistics before joining two tables, and it can have multiple sub-plans based on the type of join it can perform between them. If the number of rows returned by the statistics collector on the first table is below the threshold value, the optimizer might go with the sub-plan containing the nested loop join; if the number of rows is above the threshold, it might choose the second sub-plan and go with the hash join. After the optimizer chooses a sub-plan, buffering is disabled and the statistics collector stops collecting rows and simply passes them through. On subsequent executions of the same SQL, buffering is skipped and the optimizer keeps the plan it settled on. With dynamic plans, the optimizer adapts to poor plan choices, and correct decisions are made at various steps during runtime. Instead of using predetermined execution plans, adaptive plans enable the optimizer to postpone the final plan decision until statement execution time. Consider the following simple query:

SELECT a.sales_rep, b.product, sum(a.amt)
FROM sales a, product b
WHERE a.product_id = b.product_id
GROUP BY a.sales_rep, b.product

When this query plan is built initially, the optimizer places a statistics collector in front of the join. It scans the first table (SALES) and, based on the number of rows returned, it can then decide on the correct type of join.

Enabling adaptive execution plans

To enable adaptive execution plans, you need to fulfill the following conditions:

optimizer_features_enable should be set to a minimum of 12.1.0.1
optimizer_adaptive_reporting_only should be set to FALSE (the default)

If you set the OPTIMIZER_ADAPTIVE_REPORTING_ONLY parameter to TRUE, the adaptive execution plan feature runs in reporting-only mode: it collects the information needed for adaptive optimization, but doesn't actually use this information to change the execution plans. You can find out whether the final plan chosen was the default plan by looking at the IS_RESOLVED_ADAPTIVE_PLAN column in the V$SQL view. Join methods and parallel distribution methods are the two areas where adaptive plans have been implemented in Oracle 12c.
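As another small, illustrative check (again, not from the original text), the following statements show one way to confirm these two settings and to toggle reporting-only mode for the current session. Both parameter names are standard in 12c, but verify the values and the scope of the change against your own database before altering anything.

SQL> SHOW PARAMETER optimizer_features_enable
SQL> SHOW PARAMETER optimizer_adaptive_reporting_only

-- Reporting-only mode: gather adaptive information without changing plans
ALTER SESSION SET optimizer_adaptive_reporting_only = TRUE;

-- Normal adaptive behavior (the default)
ALTER SESSION SET optimizer_adaptive_reporting_only = FALSE;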
Adaptive execution plans and join methods

Here is an example that shows how an adaptive execution plan looks. Instead of simulating a new query in the database and checking whether the adaptive plan has worked, I used one of the queries in the database that is already using an adaptive plan. You can find many such queries if you check V$SQL for is_resolved_adaptive_plan = 'Y'. The following query lists all SQL statements that are using adaptive plans:

Select sql_id from v$sql where is_resolved_adaptive_plan = 'Y';

While evaluating the plan, the optimizer uses the cardinality of the join to select the superior join method. The statistics collector starts buffering the rows from the first table; if the number of rows exceeds the threshold value, the optimizer chooses a hash join, but if the rows are fewer than the threshold value, the optimizer goes for a nested loop join. The following is the resulting plan:

SQL> SELECT * FROM TABLE(DBMS_XPLAN.display_cursor(sql_id=>'dhpn35zupm8ck',cursor_child_no=>0));

Plan hash value: 3790265618

-------------------------------------------------------------------------------------------------
| Id  | Operation                              | Name    | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                       |         |       |       |   445 (100)|          |
|   1 |  SORT ORDER BY                         |         |     1 |    73 |   445   (1)| 00:00:01 |
|   2 |   NESTED LOOPS                         |         |     1 |    73 |   444   (0)| 00:00:01 |
|   3 |    NESTED LOOPS                        |         |   151 |    73 |   444   (0)| 00:00:01 |
|*  4 |     TABLE ACCESS BY INDEX ROWID BATCHED| OBJ$    |   151 |  7701 |   293   (0)| 00:00:01 |
|*  5 |      INDEX FULL SCAN                   | I_OBJ3  |     1 |       |    20   (0)| 00:00:01 |
|*  6 |     INDEX UNIQUE SCAN                  | I_TYPE2 |     1 |       |     0   (0)|          |
|*  7 |    TABLE ACCESS BY INDEX ROWID         | TYPE$   |     1 |    22 |     1   (0)| 00:00:01 |
-------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   4 - filter(SYSDATE@!-"O"."CTIME">.0007)
   5 - filter("O"."OID$" IS NOT NULL)
   6 - access("O"."OID$"="T"."TVOID")
   7 - filter(BITAND("T"."PROPERTIES",8388608)=8388608)

Note
-----
   - this is an adaptive plan

If we check this plan, we can see from the Note section that this is an adaptive plan. The optimizer must have started with some default plan, based on the statistics of the tables and indexes, and during run-time execution it changed the join method for a sub-plan. You can check which step the optimizer has changed and at what point it has collected the statistics.
You can display this using the new 'adaptive' format of DBMS_XPLAN.DISPLAY_CURSOR (format => 'adaptive'), resulting in the following:

DEO> SELECT * FROM TABLE(DBMS_XPLAN.display_cursor(sql_id=>'dhpn35zupm8ck',cursor_child_no=>0,format=>'adaptive'));

Plan hash value: 3790265618

------------------------------------------------------------------------------------------------------
|   Id  | Operation                                 | Name    | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------------------------------
|     0 | SELECT STATEMENT                          |         |       |       |   445 (100)|          |
|     1 |  SORT ORDER BY                            |         |     1 |    73 |   445   (1)| 00:00:01 |
|- *  2 |   HASH JOIN                               |         |     1 |    73 |   444   (0)| 00:00:01 |
|     3 |    NESTED LOOPS                           |         |     1 |    73 |   444   (0)| 00:00:01 |
|     4 |     NESTED LOOPS                          |         |   151 |    73 |   444   (0)| 00:00:01 |
|-    5 |      STATISTICS COLLECTOR                 |         |       |       |            |          |
|  *  6 |       TABLE ACCESS BY INDEX ROWID BATCHED | OBJ$    |   151 |  7701 |   293   (0)| 00:00:01 |
|  *  7 |        INDEX FULL SCAN                    | I_OBJ3  |     1 |       |    20   (0)| 00:00:01 |
|  *  8 |     INDEX UNIQUE SCAN                     | I_TYPE2 |     1 |       |     0   (0)|          |
|  *  9 |    TABLE ACCESS BY INDEX ROWID            | TYPE$   |     1 |    22 |     1   (0)| 00:00:01 |
|- * 10 |    TABLE ACCESS FULL                      | TYPE$   |     1 |    22 |     1   (0)| 00:00:01 |
------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("O"."OID$"="T"."TVOID")
   6 - filter(SYSDATE@!-"O"."CTIME">.0007)
   7 - filter("O"."OID$" IS NOT NULL)
   8 - access("O"."OID$"="T"."TVOID")
   9 - filter(BITAND("T"."PROPERTIES",8388608)=8388608)
  10 - filter(BITAND("T"."PROPERTIES",8388608)=8388608)

Note
-----
   - this is an adaptive plan (rows marked '-' are inactive)

In this output, you can see three extra steps: Steps 2, 5, and 10. These steps were present in the original plan when the query started. Initially, the optimizer generated a plan with a hash join between the outer tables. During runtime, the optimizer started collecting the rows returned from the OBJ$ table (Step 6), as we can see from the STATISTICS COLLECTOR at Step 5. Once the rows were buffered, the optimizer learned that the number of rows returned by the OBJ$ table was below the threshold, so it could go for a nested loop join instead of a hash join. The rows marked with - at the beginning belong to the original plan and are removed from the final plan. In their place, three new steps are added: Steps 3, 8, and 9. The full table scan of the TYPE$ table at Step 10 is changed to an index unique scan of I_TYPE2 at Step 8, followed by the table access by index rowid at Step 9.

Adaptive plans and parallel distribution methods

Adaptive plans are also useful for moving away from bad distribution methods when running SQL in parallel. Parallel execution often requires data redistribution to perform parallel sorts, joins, and aggregates. The database can choose from among multiple data distribution methods to perform these operations. The number of rows to be distributed, along with the number of parallel server processes, determines the data distribution method. If many parallel server processes distribute only a few rows, the database chooses a broadcast distribution method and sends the entire result set to all the parallel server processes. On the other hand, if a few processes distribute many rows, the database distributes the rows equally among the parallel server processes by choosing a hash distribution method. In adaptive plans, the optimizer does not commit to a specific distribution method up front. Instead, it starts with an adaptive parallel data distribution technique called hybrid data distribution. It places a statistics collector to buffer the rows returned by the table, and based on the number of rows returned, it decides on the distribution method: if the number of rows is below the threshold, the distribution method switches to broadcast distribution, and if it is above the threshold, the distribution method switches to hash distribution.
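As a rough sketch (not taken from the original article), you could observe adaptive parallel distribution with the earlier SALES/PRODUCT example by running the join in parallel and then displaying its cursor; with hybrid distribution you would typically see a PX SEND HYBRID HASH step with a statistics collector above it, although the exact plan depends entirely on your data, statistics, and parallel settings.

SELECT /*+ PARALLEL(4) */ a.sales_rep, b.product, SUM(a.amt)
FROM   sales a, product b
WHERE  a.product_id = b.product_id
GROUP  BY a.sales_rep, b.product;

-- Display the plan of the last statement executed in this session
SELECT * FROM TABLE(DBMS_XPLAN.display_cursor(format => 'adaptive'));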
Summary

In this article, we learned about the new features of the Oracle optimizer that help us tune our queries.

Resources for Article:

Further resources on this subject:

Oracle Essbase System 9 Components [article]
Oracle E-Business Suite: Adjusting Items in Inventory and Classifying Items [article]
Oracle Business Intelligence: Getting Business Information from Data [article]


Tableau Data Extract Best Practices

Packt
12 Dec 2016
11 min read
In this article by Jenny Zhang, author of the book Tableau 10.0 Best Practices, you will learn best practices for Tableau data extracts. We will look into the different ways of creating Tableau data extracts and the technical details of how a Tableau data extract works. We will learn how to create extracts from large volumes of data efficiently, and then how to upload and manage Tableau data extracts in Tableau Online. We will also look at refreshing Tableau data extracts, which is useful for keeping your data up to date automatically. Finally, we will take a look at using the Tableau web connector to create data extracts.

(For more resources related to this topic, see here.)

Different ways of creating Tableau data extracts

Tableau provides a few ways to create extracts.

Direct connect to original data sources

Creating an extract by connecting to the original data source (databases, Salesforce, Google Analytics, and so on) will maintain the connection to the original data source. You can right-click the extract to edit it and to refresh it from the original data source.

Duplicate of an extract

If you create a duplicate of the extract by right-clicking the data extract and choosing Duplicate, it will create a new .tde file and still maintain the connection to the original data source. If you refresh the duplicated data extract, it will not refresh the original data extract that the duplicate was created from.

Connect to a Tableau extract file

If you create a data extract by connecting to a Tableau extract file (.tde), you will not have the connection to the original data source that the extract was created from, since you are just connecting to a local .tde file. You cannot edit or refresh the data from the original data source. Duplicating this extract, with its connection to the local .tde file, will NOT create a new .tde file; the duplicate will still point to the same local .tde file. You can right-click and choose Extract Data to create an extract out of an extract, but we do not normally do that.

Technical details of how a Tableau data extract works

Tableau data extract's design principles

A Tableau extract (.tde) file is a compressed snapshot of data extracted from a large variety of original data sources (Excel, databases, Salesforce, NoSQL, and so on). It is stored on disk and loaded into memory as required to create a Tableau viz. Two design principles of the Tableau extract make it ideal for data analytics.

The first principle is that a Tableau extract is a columnar store. Columnar databases store column values rather than row values. The benefit is that the input/output time required to access and aggregate the values in a column is significantly reduced, which is why the Tableau extract is great for data analytics.

The second principle is how a Tableau extract is structured to make the best use of your computer's memory. This affects how it is loaded into memory and used by Tableau. To understand this principle better, we need to understand how a Tableau extract is created and then used as the data source for visualizations. When Tableau creates a data extract, it defines the structure of the .tde file and creates separate files for each column in the original data source. When Tableau retrieves data from the original data source, it sorts, compresses, and adds the values for each column to their own file. After that, the individual column files are combined with metadata to form a single file with as many individual memory-mapped files as there are columns in the original data source.
Because a Tableau data extract file is a memory-mapped file, when Tableau requests data from a .tde file, the data is loaded directly into memory by the operating system. Tableau does not have to open, process, or decompress the file. If needed, the operating system continues to move data in and out of RAM to ensure that all of the requested data is made available to Tableau. This means that Tableau can query data that is bigger than the RAM on the computer.

Benefits of using Tableau data extracts

Following are the seven main benefits of using Tableau data extracts:

Performance: Using a Tableau data extract can increase performance when the underlying data source is slow. It can also speed up custom SQL.

Reduced load: Using a Tableau data extract instead of a live connection to databases reduces the load on the database that can result from heavy traffic.

Portability: A Tableau data extract can be bundled with the visualizations in a packaged workbook for sharing with others.

Pre-aggregation: When creating an extract, you can choose to aggregate your data for certain dimensions. An aggregated extract has a smaller size and contains only aggregated data. Accessing the values of aggregations in a visualization is very fast, since all of the work to derive the values has already been done. You can choose the level of aggregation; for example, you can aggregate your measures to month, quarter, or year.

Materialized calculated fields: When you choose to optimize the extract, all of the calculated fields that have been defined are converted to static values upon the next full refresh. They become additional data fields that can be accessed and aggregated as quickly as any other field in the extract. The improvement in performance can be significant, especially for string calculations, since string calculations are much slower than numeric or date calculations.

Publishing to Tableau Public and Tableau Online: Tableau Public only supports Tableau extract files. Though Tableau Online can connect to some cloud-based data sources, Tableau data extracts are most commonly used.

Support for certain functions not available with live connections: Certain functions, such as count distinct, are only available when using a Tableau data extract.

How to create extracts from large volumes of data efficiently

Load very large Excel files into Tableau

If you have an Excel file with lots of data and lots of formulas, it can take a long time to load into Tableau. The best practice is to save the Excel file as a .csv file and remove all the formulas.

Aggregate the values to a higher dimension

If you do not need the values down to the level of detail of the underlying data source, aggregating to a higher dimension will significantly reduce the extract size and improve performance.

Use a data source filter

Add a data source filter by right-clicking the data source and choosing Edit Data Source Filter to remove the data you do not need before creating the extract.

Hide unused fields

Hiding unused fields before creating a data extract can speed up extract creation and also save storage space.

Upload and manage Tableau data extracts in Tableau Online

Create a workbook just for extracts

One way to create extracts is to create them in different workbooks. The advantage is that you can create extracts on the fly when you need them. The disadvantage is that once you have created many extracts, it becomes very difficult to manage them, and you can hardly remember which dashboard has which extracts.
A better solution is to use one workbook just to create data extracts and then upload the extracts to Tableau Online. When you need to create visualizations, you can use the extracts in Tableau Online. If you want to organize the extracts further, you can use different workbooks for different types of data sources: for example, one workbook for Excel files, one workbook for local databases, one workbook for web-based data, and so on.

Upload data extracts to the default project

The default project in Tableau Online is a good place to store your data extracts. The reason is that the default project cannot be deleted. Another benefit is that when you use the command line to refresh the data extracts, you do not need to specify a project name if they are in the default project.

Make sure Tableau Online/Server has enough space

In Tableau Online/Server, it's important to make sure that the backgrounder has enough disk space to store the existing Tableau data extracts as well as to refresh them and create new ones. A good rule of thumb is that the size of the disk available to the backgrounder should be two to three times the size of the data extracts expected to be stored on it.

Refresh Tableau data extracts

Local refresh of a published extract:

Download a local copy of the data source from Tableau Online: go to the Data Sources tab, click on the name of the extract you want to download, and click Download.

Refresh the local copy: open the extract file in Tableau Desktop, right-click on the data source, and choose Extract | Refresh.

Publish the refreshed extract to Tableau Online: right-click the extract and click Publish to Server. You will be asked whether you wish to overwrite the file with the same name; click Yes.

NOTE 1: If you need to make changes to any metadata, do it before publishing to the server.

NOTE 2: If you use the data extract in Tableau Online to create visualizations for multiple workbooks (which I believe you do, since that is the benefit of using a shared data source in Tableau Online), please be very careful when making any changes to the calculated fields, groups, or other metadata. If you have other calculations created in the local workbook with the same name as the calculations in the data extract in Tableau Online, the Tableau Online version of the calculation will overwrite what you created in the local workbook. So make sure you have the correct calculations in the data extract that will be published to Tableau Online.

Schedule data extract refreshes in Tableau Online

Only cloud-based data sources (for example, Salesforce and Google Analytics) can be refreshed using scheduled jobs in Tableau Online. One option for non-cloud-based data sources is to use the Tableau Desktop command line to refresh them in Tableau Online; the Windows scheduler can then automate the refresh jobs that update extracts via the Tableau Desktop command line, as sketched below. Another option is to use the sync application, or to refresh the extracts manually using Tableau Desktop.

NOTE: If using the command line to refresh the extract, + cannot be used in the data extract name.
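As a rough, illustrative sketch of the command-line option (this is not from the original article), a batch file scheduled with the Windows Task Scheduler could call Tableau Desktop's refreshextract utility along these lines; the install path, credentials, data source name, and even the exact option names should all be treated as placeholders to verify against your own Tableau Desktop version's documentation.

REM refresh_extract.bat - run on a schedule by the Windows Task Scheduler (illustrative only)
REM The path, account, and data source name below are placeholders.
cd "C:\Program Files\Tableau\Tableau 10.0\bin"
tableau refreshextract --server https://online.tableau.com ^
    --username you@example.com --password yourpassword ^
    --datasource "Sales Extract"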
Tips for incremental refreshes

Following are some tips for incremental refreshes:

Incremental extracts retrieve only new records from the underlying data source, which reduces the amount of time required to refresh the data extract.

If there are no new records to add during an incremental extract, the processes associated with performing an incremental extract still execute.

The performance of incremental refreshes decreases over time. This is because incremental extracts only grow in size, and as a result, the amount of data and the areas of memory that must be accessed in order to satisfy requests grow as well. In addition, larger files are more likely to be fragmented on disk than smaller ones.

When performing an incremental refresh of an extract, records are not replaced. Therefore, using a date field such as "Last Updated" in an incremental refresh could result in duplicate rows in the extract.

Incremental refreshes are not possible after an additional file has been appended to a file-based data source, because the extract has multiple sources at that point.

Use the Tableau web connector to create data extracts

What is the Tableau web connector?

The Tableau Web Data Connector is an API for people who want to write some code to connect to certain web-based data, such as a web page. The connectors are written in JavaScript. It seems that these web connectors can only connect to web pages, web services, and so on; they can also connect to local files.

How to use the Tableau web connector

Click on Data | New Data Source | Web Data Connector.

Is the Tableau web connection live?

The data is pulled when the connection is built, and Tableau stores the data locally in a Tableau extract. You can still refresh the data manually or via scheduled jobs.

Are there any Tableau web connectors available?

Here is a list of web connectors from around the Tableau community:

Alteryx: http://data.theinformationlab.co.uk/alteryx.html
Facebook: http://tableaujunkie.com/post/123558558693/facebook-web-data-connector

You can check the Tableau community for more web connectors.

Summary

In summary, be sure to keep in mind the following best practices for data extracts:

Use full refreshes when possible.
Fully refresh incrementally refreshed extracts on a regular basis.
Publish data extracts to Tableau Online/Server to avoid duplicates.
Hide unused fields and use filters before creating extracts to improve performance and save storage space.
Make sure there is enough contiguous disk space for the largest extract file; a good way is to use SSD drives.

Resources for Article:

Further resources on this subject:

Getting Started with Tableau Public [article]
Introduction to Practical Business Intelligence [article]
Splunk's Input Methods and Data Feeds [article]


What’s New in SQL Server 2016 Reporting Services

Packt
09 Dec 2016
4 min read
In this article by Robert C. Cain, coauthor of the book SQL Server 2016 Reporting Services Cookbook, we'll take a brief tour of the new features in SQL Server 2016 Reporting Services. SQL Server 2016 Reporting Services is a true evolution in reporting technology. After making few changes to SSRS over the last several releases, Microsoft unveiled a virtual cornucopia of new features.

(For more resources related to this topic, see here.)

Report Portal

The old Report Manager has received a complete facelift, along with many new features. With the facelift came a rename: it is now known as the Report Portal.

KPIs

KPIs are the first feature you'll notice. The Report Portal can display key performance indicators directly, meaning your users can get important metrics at a glance without needing to open reports. In addition, these KPIs can be linked to other report items, such as reports and dashboards, so that a user can simply click on them to find more information.

Mobile Reporting

Microsoft recognized that the users in your organization no longer use just a computer to retrieve their information; mobile devices, such as phones and tablets, are now commonplace. You could, of course, design individual reports for each platform, but that would cause a lot of repetitive work and limit reuse. To solve this, Microsoft has incorporated a new tool, Mobile Reports. It allows you to create an attractive dashboard that can be displayed in any web browser. In addition, you can easily rearrange the dashboard layout to optimize it for both phones and tablets. This means you can create your report once and use it on multiple platforms: the same mobile report renders appropriately in a web browser, on a tablet, and on a phone.

Paginated reports

Traditional SSRS reports have been renamed Paginated Reports and are still a critical element of reporting; they provide the detailed information needed for day-to-day activities in your company. Paginated reports have received several enhancements. First, there are two new chart types, Sunburst and TreeMap. Reports may now be exported to a new format, PowerPoint. Additionally, all reports are now rendered in HTML 5 format, which makes them accessible to any browser, including those running on tablets or other platforms such as Linux or the Mac.

PowerBI

PowerBI Desktop reports may now be housed within the Report Portal. Currently, opening one will launch the PowerBI Desktop application. However, Microsoft has announced that in an upcoming update to SSRS 2016, PowerBI reports will be displayed directly within the Report Portal, without the need to open the external app.

Reporting applications

Speaking of apps, Report Builder has received a facelift, updating it to a more modern user interface with a color scheme that matches the Report Portal. Report Builder has also been decoupled from the installation of SQL Server. In previous versions, Report Builder was part of the SQL Server install or was available as a separate download. With SQL Server 2016, both Report Builder and the Mobile Reporting tool are separate downloads, making it easier to stay current as new versions are released. The Report Portal now contains links to download these tools.

Excel

Excel workbooks, often used as reporting tools in their own right, may now be housed within the Report Portal. Opening them will launch Excel, similar to the way PowerBI reports currently work.
Summary

This article summarizes just some of the many new enhancements to SQL Server 2016 Reporting Services. With this release, Microsoft has worked toward meeting the needs of many users in the corporate environment, including the need for mobile reporting, dashboards, and enhanced paginated reports. For more details about these and many more features, see the book SQL Server 2016 Reporting Services Cookbook, by Dinesh Priyankara and Robert C. Cain.

Resources for Article:

Further resources on this subject:

Getting Started with Pentaho Data Integration [article]
Where Is My Data and How Do I Get to It? [article]
Configuring and Managing the Mailbox Server Role [article]


Event detection from the news headlines in Hadoop

Packt
08 Dec 2016
13 min read
In this article by Anurag Shrivastava, author of Hadoop Blueprints, we will learn how to build a text analytics system that detects specific events in random news headlines. The Internet has become the main source of news in the world. Thousands of websites constantly publish and update news stories from around the world. Not every news item is relevant for everyone, but some news items are very critical for certain people or businesses. For example, if you were a major car manufacturer based in Germany with suppliers located in India, you would be interested in news from that region that could affect your supply chain.

(For more resources related to this topic, see here.)

Road accidents in India are a major social and economic problem. Road accidents leave a large number of fatalities behind and result in the loss of capital. In this example, we will build a system that detects whether a news item refers to a road accident event. Let us define what we mean by that in the next paragraph.

A road accident event may or may not result in fatal injuries. One or more vehicles and pedestrians may be involved in the accident. A non-road-accident event news item is everything else that cannot be categorized as a road accident event. It could be a trend analysis related to road accidents, or something totally unrelated.

Technology stack

To build this system, we will use the following technologies:

Data storage: HDFS
Data processing: Hadoop MapReduce
Query engine: Hive and Hive UDF
Data ingestion: curl and HDFS copy
Event detection: OpenNLP

The event detection system is a machine learning based natural language processing system. The natural language processing part brings the intelligence to detect the events in the random headline sentences from the news items.

OpenNLP

The Open Source Natural Language Processing framework (OpenNLP) comes from the Apache Software Foundation. You can download version 1.6.0 from https://opennlp.apache.org/ to run the examples in this blog. It is capable of detecting entities, document categories, parts of speech, and so on in text written by humans. We will use the document categorization feature of OpenNLP in our system. Document categorization requires you to train the OpenNLP model with the help of sample text; as a result of the training, we get a model, and this resulting model is used to categorize new text. Our training data looks as follows:

r 1.46 lakh lives lost on Indian roads last year - The Hindu.
r Indian road accident data | OpenGovernmentData (OGD) platform...
r 400 people die everyday in road accidents in India: Report - India TV.
n Top Indian female biker dies in road accident during country-wide tour.
n Thirty die in road accidents in north India mountains—World—Dunya...
n India's top woman biker Veenu Paliwal dies in road accident: India...
r Accidents on India's deadly roads cost the economy over $8 billion...
n Thirty die in road accidents in north India mountains (The Express)

The first column can take two values:

n indicates that the news item is a road accident event
r indicates that the news item is not a road accident event, that is, everything else

This training set has 200 lines in total. Please note that OpenNLP requires at least 15,000 lines in the training set to deliver good results. Because we do not have that much training data, we will start with a small set but remain aware of the limitations of our model. You will see that even with a small training dataset, this model works reasonably well. Let us train and build our model:

$ opennlp DoccatTrainer -model en-doccat.bin -lang en -data roadaccident.train.prn -encoding UTF-8

Here the file roadaccident.train.prn contains the training data. The output file en-doccat.bin contains the model, which we will use in our data pipeline. We have built our model using the command-line utility, but it is also possible to build the model programmatically, as sketched below.
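For illustration only (this code is not part of the original article), a programmatic equivalent of the DoccatTrainer call could look roughly like the following against the OpenNLP 1.6.0 API; the class names are OpenNLP's own, but treat the exact constructors and the train(...) overload as assumptions to check against the 1.6.0 Javadoc.

import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

import opennlp.tools.doccat.DoccatFactory;
import opennlp.tools.doccat.DoccatModel;
import opennlp.tools.doccat.DocumentCategorizerME;
import opennlp.tools.doccat.DocumentSample;
import opennlp.tools.doccat.DocumentSampleStream;
import opennlp.tools.util.MarkableFileInputStreamFactory;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;

public class TrainDoccatModel {
    public static void main(String[] args) throws Exception {
        // Read the training file line by line; each line is "<category> <headline>"
        ObjectStream<String> lines = new PlainTextByLineStream(
                new MarkableFileInputStreamFactory(new File("roadaccident.train.prn")),
                StandardCharsets.UTF_8);
        ObjectStream<DocumentSample> samples = new DocumentSampleStream(lines);

        // Train the categorizer (the same job the DoccatTrainer command performs)
        DoccatModel model = DocumentCategorizerME.train(
                "en", samples, TrainingParameters.defaultParams(), new DoccatFactory());

        // Persist the model so that it can be loaded later as en-doccat.bin
        try (OutputStream out = new FileOutputStream("en-doccat.bin")) {
            model.serialize(out);
        }
    }
}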
The training data file is a plain text file, which you can expand with a bigger corpus of knowledge to make the model smarter. Next, we will build the data pipeline as follows.

Fetch RSS feeds

This component fetches RSS news feeds from popular news websites. In this case, we will just use one news feed, from Google. We can always add more sites after our first RSS feed has been integrated. The whole RSS feed can be downloaded using the following command:

$ curl "https://news.google.com/news?cf=all&hl=en&ned=in&topic=n&output=rss"

The previous command downloads the news headlines for India. You can customize the RSS feed for your region by visiting the Google News site at https://news.google.com.

Scheduler

Our scheduler will fetch the RSS feed once every 6 hours. Let us assume that in a 6-hour interval we have a good likelihood of fetching fresh news items. We will wrap our feed-fetching script in a shell file and invoke it using cron. The script is as follows:

$ cat feedfetch.sh
NAME="newsfeed-"`date +%Y-%m-%dT%H.%M.%S`
curl "https://news.google.com/news?cf=all&hl=en&ned=in&topic=n&output=rss" > $NAME
hadoop fs -put $NAME /xml/rss/newsfeeds

The cron job setup line will be as follows:

0 */6 * * * /home/hduser/mycommand

Please edit your cron job table using the following command and add the setup line to it:

$ crontab -e

Loading data into HDFS

To load data into HDFS, we will use the HDFS put command, which copies the downloaded RSS feed into a directory in HDFS. Let us make the directory in HDFS where our feed fetcher script will store the RSS feeds:

$ hadoop fs -mkdir /xml/rss/newsfeeds

Query using Hive

First we will create an external table in Hive for the new RSS feed. Using XPath-based SELECT queries, we will extract the news headlines from the RSS feeds.
These headlines will be passed to the UDF to detect their categories:

CREATE EXTERNAL TABLE IF NOT EXISTS rssnews(
  document STRING)
COMMENT 'RSS Feeds from media'
STORED AS TEXTFILE
location '/xml/rss/newsfeeds';

The following command parses the XML to retrieve the titles (the headlines) and explodes them into a single-column table:

SELECT explode(xpath(document, '//item/title/text()')) FROM rssnews;

The sample output of the above command on my system is as follows:

hive> select explode(xpath(document, '//item/title/text()')) from rssnews;
Query ID = hduser_20161010134407_dcbcfd1c-53ac-4c87-976e-275a61ac3e8d
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1475744961620_0016, Tracking URL = http://localhost:8088/proxy/application_1475744961620_0016/
Kill Command = /home/hduser/hadoop-2.7.1/bin/hadoop job -kill job_1475744961620_0016
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2016-10-10 14:46:14,022 Stage-1 map = 0%, reduce = 0%
2016-10-10 14:46:20,464 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 4.69 sec
MapReduce Total cumulative CPU time: 4 seconds 690 msec
Ended Job = job_1475744961620_0016
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 4.69 sec HDFS Read: 120671 HDFS Write: 1713 SUCCESS
Total MapReduce CPU Time Spent: 4 seconds 690 msec
OK
China dispels hopes of early breakthrough on NSG, sticks to its guns on Azhar - The Hindu
Pampore attack: Militants holed up inside govt building; combing operations intensify - Firstpost
CPI(M) worker hacked to death in Kannur - The Hindu
Akhilesh Yadav's comment on PM Modi's Lucknow visit shows Samajwadi Party's insecurity: BJP - The Indian Express
PMO maintains no data about petitions personally read by PM - Daily News & Analysis
AIADMK launches social media campaign to put an end to rumours regarding Amma's health - Times of India
Pakistan, India using us to play politics: Former Baloch CM - Times of India
Indian soldier, who recited patriotic poem against Pakistan, gets death threat - Zee News
This Dussehra effigies of 'terrorism' to go up in flames - Business Standard
'Personal reasons behind Rohith's suicide': Read commission's report - Hindustan Times
Time taken: 5.56 seconds, Fetched: 10 row(s)

Hive UDF

Our Hive user-defined function (UDF) categorizeDoc takes a news headline and suggests whether or not it is a road accident event, as we explained earlier.
This function is as follows:

package com.mycompany.app;

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import opennlp.tools.doccat.DoccatModel;
import opennlp.tools.doccat.DocumentCategorizerME;
import opennlp.tools.util.InvalidFormatException;

import org.apache.hadoop.hive.ql.exec.Description;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

@Description(
    name = "getCategory",
    value = "_FUNC_(string) - gets the category of a document")
public final class MyUDF extends UDF {

    public Text evaluate(Text input) {
        if (input == null) return null;
        try {
            return new Text(categorizeDoc(input.toString()));
        } catch (Exception ex) {
            ex.printStackTrace();
            return new Text("Sorry Failed: >> " + input.toString());
        }
    }

    public String categorizeDoc(String doc) throws InvalidFormatException, IOException {
        // Load the trained categorization model from the local directory
        InputStream is = new FileInputStream("./en-doccat.bin");
        DoccatModel model = new DoccatModel(is);
        is.close();
        // Score the headline against each category and return the best one ('n' or 'r')
        DocumentCategorizerME classificationME = new DocumentCategorizerME(model);
        String documentContent = doc;
        double[] classDistribution = classificationME.categorize(documentContent);
        String predictedCategory = classificationME.getBestCategory(classDistribution);
        return predictedCategory;
    }
}

The function categorizeDoc takes a single string as input. It loads the model that we created earlier from the file en-doccat.bin in the local directory, and then calls the classifier, which returns the result to the calling function. The calling function MyUDF extends the Hive UDF class and calls categorizeDoc for each input line item. If the call succeeds, the value is returned to the calling program; otherwise, a message is returned indicating that category detection has failed.
The pom.xml file to build the above class is as follows:

$ cat pom.xml
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.mycompany</groupId>
  <artifactId>app</artifactId>
  <version>1.0</version>
  <packaging>jar</packaging>
  <properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <maven.compiler.source>1.7</maven.compiler.source>
    <maven.compiler.target>1.7</maven.compiler.target>
  </properties>
  <dependencies>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>4.12</version>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client</artifactId>
      <version>2.7.1</version>
      <type>jar</type>
    </dependency>
    <dependency>
      <groupId>org.apache.hive</groupId>
      <artifactId>hive-exec</artifactId>
      <version>2.0.0</version>
      <type>jar</type>
    </dependency>
    <dependency>
      <groupId>org.apache.opennlp</groupId>
      <artifactId>opennlp-tools</artifactId>
      <version>1.6.0</version>
    </dependency>
  </dependencies>
  <build>
    <pluginManagement>
      <plugins>
        <plugin>
          <groupId>org.apache.maven.plugins</groupId>
          <artifactId>maven-surefire-plugin</artifactId>
          <version>2.8</version>
        </plugin>
        <plugin>
          <artifactId>maven-assembly-plugin</artifactId>
          <configuration>
            <archive>
              <manifest>
                <mainClass>com.mycompany.app.App</mainClass>
              </manifest>
            </archive>
            <descriptorRefs>
              <descriptorRef>jar-with-dependencies</descriptorRef>
            </descriptorRefs>
          </configuration>
        </plugin>
      </plugins>
    </pluginManagement>
  </build>
</project>

You can build the jar with all the dependencies in it using the following command:

$ mvn clean compile assembly:single

The resulting jar file, app-1.0-jar-with-dependencies.jar, can be found in the target directory. Let us use this jar file in Hive to categorize the news headlines as follows.

Copy the jar file to the bin subdirectory under the Hive root:

$ cp app-1.0-jar-with-dependencies.jar $HIVE_ROOT/bin

Copy the trained model to the bin subdirectory under the Hive root:

$ cp en-doccat.bin $HIVE_ROOT/bin

Run the categorization queries. Start Hive:

$ hive

Add the jar file in Hive:

hive> ADD JAR ./app-1.0-jar-with-dependencies.jar;

Create a temporary categorization function catDoc:

hive> CREATE TEMPORARY FUNCTION catDoc as 'com.mycompany.app.MyUDF';

Create a table headlines to hold the headlines extracted from the RSS feed:

hive> create table headlines( headline string);

Insert the extracted headlines into the table headlines:

hive> insert overwrite table headlines select explode(xpath(document, '//item/title/text()')) from rssnews;

Let's test our UDF by manually passing it a real news headline from a newspaper website:

hive> select catDoc("8 die as SUV falls into river while crossing bridge in Ghazipur");
OK
N

The output is N, which means this is indeed a headline about a road accident incident.
This is reasonably good, so now let us run the function for all the headlines:

hive> select headline, catDoc(*) from headlines;
OK
China dispels hopes of early breakthrough on NSG, sticks to its guns on Azhar - The Hindu   r
Pampore attack: Militants holed up inside govt building; combing operations intensify - Firstpost   r
Akhilesh Yadav Backs Rahul Gandhi's 'Dalali' Remark - NDTV   r
PMO maintains no data about petitions personally read by PM Narendra Modi - Economic Times   n
Mobile Internet Services Suspended In Protest-Hit Nashik - NDTV   n
Pakistan, India using us to play politics: Former Baloch CM - Times of India   r
CBI arrests Central Excise superintendent for taking bribe - Economic Times   n
Be extra vigilant during festivals: Centre's advisory to states - Times of India   r
CPI-M worker killed in Kerala - Business Standard   n
Burqa-clad VHP activist thrashed for sneaking into Muslim women gathering - The Hindu   r
Time taken: 0.121 seconds, Fetched: 10 row(s)

You can see that our headline detection function works and outputs r or n. In this example, we see several false positives, where a headline has been incorrectly identified as a road accident. Better training of our model would improve the quality of the results.

Further reading

The book Hadoop Blueprints covers several case studies where we can apply Hadoop, HDFS, data ingestion tools such as Flume and Sqoop, query and visualization tools such as Hive and Zeppelin, and machine learning tools such as BigML and Spark to build solutions. You will discover, for example, how to build a fraud detection system using Hadoop, or how to build a data lake.

Summary

In this article, we learned how to build a text analytics system that detects specific events in random news headlines, and how to apply Hadoop, HDFS, and several other tools along the way.

Resources for Article:

Further resources on this subject:

Spark for Beginners [article]
Hive Security [article]
Customizing heat maps (Intermediate) [article]


Build a Chatbot

Packt
07 Dec 2016
23 min read
In this article written by Alexander T. Combs, author of the book Python Machine Learning Blueprints, we are going to learn how to construct a chatbot from scratch. Along the way, we'll learn more about the history of the field and its future prospects. We'll cover the following topics in this article:

The Turing test
The history of chatbots
The design of chatbots
Building a chatbot

(For more resources related to this topic, see here.)

The Turing test

In 1950, Alan Turing posed the question "Can machines think?" in one of his more famous papers. Being the practical genius that he was, he decided not to tackle this question head on, but to pose it in the framework of the "problem of other minds" instead. This problem poses the question, "How do we truly know that other people have minds like our own?" As we can only observe their behavior and not the inner workings of their minds, we must take it on faith that they are like us. Turing proposed that if a machine behaved as if it was intelligent, then in fact we should view it as such. This, in a nutshell, is the Turing test. Passing this test means convincing humans that a machine is a fellow human.

You may wonder at this point whether a program has ever passed the Turing test successfully. The answer is that there really is no official Turing test. The closest thing to an official event is the Loebner Prize, a contest held annually that awards modest prizes to the chatbots that appear most convincingly human to a panel of judges. If this sounds rigorous and academic, it has been described as anything but; Loebner's sponsorship of the contest is said to have been a product of his total disdain for labor of any kind. Hopefully, by this point, you have realized that any claims about "passing the Turing test" are more spectacle than actual science. However, despite this, Turing's original concept is a useful construct.

The history of chatbots

I still remember the day my father sat me down in front of his IBM XT with its black and green flickering monitor and asked me to talk to "it". The conversation went something like the following:

>Hello? How are you feeling today?
>Hi. Who is this?
>Hi there… how are you today?
>I'm fine. You didn't tell me who you are.
>We should be discussing you, not me.
>We should? Why?
>Let's change focus a bit… Tell me about your family.
>You are impossible to have a conversation with.
>Can you elaborate on that?
>You are killing me.
>Perhaps you would like me to be killing you.
>Ok, you are freaking me out.
>Perhaps you're really talking about yourself?
>What?? That makes no sense.
>How would an answer to that help you?
>END
>Very Interesting.
>QUIT
>Thank you for talking with me.
>#$###$#@@….

The "it" that I talked to was a program called ELIZA. Written in the mid-1960s at MIT's Artificial Intelligence Lab by Joseph Weizenbaum, ELIZA was created to mimic the responses of a Rogerian psychotherapist. Though nearly comical when examined in any depth, the program was capable of convincing some users that they were chatting with an actual human, which was a remarkable feat considering it was a scant 200 lines of code that used randomization and regular expressions to parrot back responses. Even today, this simple program remains a staple of popular culture: if you ask Siri who ELIZA is, she will tell you she is a friend and a brilliant psychiatrist.

If ELIZA was an early example of chatbots, what have we seen since then?
In recent years, there has been an explosion of new chatbots; the most notable of these is Cleverbot. Cleverbot was released to the world via the web in 1997. Since then, this bot has racked up hundreds of millions of conversations, and unlike early chatbots, Cleverbot (as the name suggests) appears to become more intelligent with each conversation. Though the exact details of the workings of the algorithm are difficult to find, it is said to work by recording all conversations in a database and finding the most appropriate response by identifying the most similar questions and responses in the database. When I made up a nonsensical question, Cleverbot found something similar to the object of my question in terms of a string match, and when I persisted, I again got something…similar? You'll also notice that topics can persist across a conversation: in response to one of my answers, I was asked to go into more detail and justify it. This is one of the things that appears to make Cleverbot, well, clever.

While chatbots that learn from humans can be quite amusing, they can also have a darker side. Just this past year, Microsoft released a chatbot named Tay on Twitter. People were invited to ask questions of Tay, and Tay would respond in accordance with her "personality". Microsoft had apparently programmed the bot to appear to be a 19-year-old American girl. She was intended to be your virtual "bestie"; the only problem was that she started sounding like she would rather hang with the Nazi youth than with you. As a result of a series of unbelievably inflammatory tweets, Microsoft was forced to pull Tay off Twitter and issue an apology:

"As many of you know by now, on Wednesday we launched a chatbot called Tay. We are deeply sorry for the unintended offensive and hurtful tweets from Tay, which do not represent who we are or what we stand for, nor how we designed Tay. Tay is now offline and we'll look to bring Tay back only when we are confident we can better anticipate malicious intent that conflicts with our principles and values."
-March 25, 2016, Official Microsoft Blog

Clearly, brands that want to release chatbots into the wild in the future should take a lesson from this debacle. There is no doubt that brands are embracing chatbots: everyone from Facebook to Taco Bell is getting in on the game. Witness TacoBot. Yes, this is a real thing, and despite stumbles such as Tay, there is a good chance the future of UI looks a lot like TacoBot. One last example might even help explain why.

Quartz recently launched an app that turns news into a conversation. Rather than laying out the day's stories as a flat list, it engages you in a chat as if you were getting the news from a friend. David Gasca, a PM at Twitter, describes his experience using the app in a post on Medium. He describes how the conversational nature invoked feelings that were normally only triggered in human relationships. This is his take on how he felt when he encountered an ad in the app:

"Unlike a simple display ad, in a conversational relationship with my app, I feel like I owe something to it: I want to click. At the most subconscious level, I feel the need to reciprocate and not let the app down: The app has given me this content. It's been very nice so far and I enjoyed the GIFs.
I should probably click since it's asking nicely.”

If this experience is universal—and I expect that it is—this could be the next big thing in advertising, and I have no doubt that advertising profits will drive UI design:

“The more the bot acts like a human, the more it will be treated like a human.”
-Mat Webb, technologist and co-author of Mind Hacks

At this point, you are probably dying to know how these things work, so let's get on with it!

The design of chatbots
The original ELIZA application was two-hundred odd lines of code. The Python NLTK implementation is similarly short. An excerpt can be seen at the following link from NLTK's website (http://www.nltk.org/_modules/nltk/chat/eliza.html). I have also reproduced an excerpt below:

# Natural Language Toolkit: Eliza
#
# Copyright (C) 2001-2016 NLTK Project
# Authors: Steven Bird <stevenbird1@gmail.com>
#          Edward Loper <edloper@gmail.com>
# URL: <http://nltk.org/>
# For license information, see LICENSE.TXT

# Based on an Eliza implementation by Joe Strout <joe@strout.net>,
# Jeff Epler <jepler@inetnebr.com> and Jez Higgins <mailto:jez@jezuk.co.uk>.

# a translation table used to convert things you say into things the
# computer says back, e.g. "I am" --> "you are"

from __future__ import print_function
from nltk.chat.util import Chat, reflections  # provides the Chat class and the reflections table used below

# a table of response pairs, where each pair consists of a
# regular expression, and a list of possible responses,
# with group-macros labelled as %1, %2.

pairs = (
    (r'I need (.*)',
     ("Why do you need %1?",
      "Would it really help you to get %1?",
      "Are you sure you need %1?")),
    (r'Why don\'t you (.*)',
     ("Do you really think I don't %1?",
      "Perhaps eventually I will %1.",
      "Do you really want me to %1?")),
    [snip]
    (r'(.*)\?',
     ("Why do you ask that?",
      "Please consider whether you can answer your own question.",
      "Perhaps the answer lies within yourself?",
      "Why don't you tell me?")),
    (r'quit',
     ("Thank you for talking with me.",
      "Good-bye.",
      "Thank you, that will be $150. Have a good day!")),
    (r'(.*)',
     ("Please tell me more.",
      "Let's change focus a bit... Tell me about your family.",
      "Can you elaborate on that?",
      "Why do you say that %1?",
      "I see.",
      "Very interesting.",
      "%1.",
      "I see. And what does that tell you?",
      "How does that make you feel?",
      "How do you feel when you say that?"))
)

eliza_chatbot = Chat(pairs, reflections)

def eliza_chat():
    print("Therapist\n---------")
    print("Talk to the program by typing in plain English, using normal upper-")
    print('and lower-case letters and punctuation. Enter "quit" when done.')
    print('=' * 72)
    print("Hello. How are you feeling today?")
    eliza_chatbot.converse()

def demo():
    eliza_chat()

if __name__ == "__main__":
    demo()

As you can see from this code, input text was parsed and then matched against a series of regular expressions. Once the input was matched, a randomized response (that sometimes echoed back a portion of the input) was returned. So, something such as I need a taco would trigger a response of Would it really help you to get a taco? Obviously, the answer is yes, and fortunately, we have advanced to the point that technology can provide one to you (bless you, TacoBot), but this was still in the early days. Shockingly, some people did actually believe ELIZA was a real human. However, what about more advanced bots? How are they constructed? Surprisingly, most of the chatbots that you're likely to encounter don't even use machine learning; they use what's known as retrieval-based models. This means responses are predefined according to the question and the context.
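To make the pattern-and-response mechanism a little more concrete before moving on, here is a minimal standalone sketch of the same idea in plain Python. This is not the NLTK implementation quoted above, just an illustration of how a couple of regular-expression rules plus a random choice can produce ELIZA-like replies; the patterns and canned responses here are invented for the example.

import random
import re

# Each rule pairs a regex with candidate responses; %1 echoes the first
# captured group back at the user, as in the ELIZA excerpt above.
rules = [
    (r'I need (.*)', ["Why do you need %1?", "Would it really help you to get %1?"]),
    (r'(.*)', ["Please tell me more.", "Why do you say that %1?"]),
]

def respond(text):
    for pattern, responses in rules:
        match = re.match(pattern, text, re.IGNORECASE)
        if match:
            reply = random.choice(responses)
            # drop trailing punctuation from the echoed fragment before substituting it
            return reply.replace('%1', match.group(1).rstrip('.!?'))

print(respond('I need a taco'))

Because the first matching rule wins and the last pattern matches anything, the function always returns something, which is exactly the trick the original ELIZA relied on.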
The most common architecture for these bots is something called Artificial Intelligence Markup Language (AIML). AIML is an XML-based schema to represent how the bot should interact with the user's input. It's really just a more advanced version of how ELIZA works. Let's take a look at how responses are generated using AIML. First, all inputs are preprocessed to normalize them. This means when you input “Waaazzup???”, it is mapped to “WHAT IS UP”. This preprocessing step funnels down the myriad ways of saying the same thing into one input that can run against a single rule. Punctuation and other extraneous inputs are removed as well at this point. Once this is complete, the input is matched against the appropriate rule. The following is a sample template:

<category>
  <pattern>WHAT IS UP</pattern>
  <template>The sky, duh. Pfft. Humans...</template>
</category>

This is the basic setup, but you can also layer in wildcards, randomization, and prioritization schemes. For example, the following pattern uses wildcard matching:

<category>
  <pattern>* FOR ME</pattern>
  <template>I'm a bot. I don't <star/>. Ever.</template>
</category>

Here, the * wildcard matches one or more words before FOR ME and then repeats these back in the output template. If the user were to type in Dance for me!, the response would be, I'm a bot. I don't dance. Ever. As you can see, these rules don't make for anything that approximates any type of real intelligence, but there are a few tricks that strengthen the illusion. One of the better ones is the ability to generate responses conditioned to a topic. For example, here is a rule that invokes a topic:

<category>
  <pattern>I LIKE TURTLES</pattern>
  <template>I feel like this whole <set name="topic">turtle</set> thing could be a problem. What do you like about them?</template>
</category>

Once the topic is set, then the rules specific to that context can be matched:

<topic name="turtles">
  <category>
    <pattern>* SHELL IS *</pattern>
    <template>I dislike turtles primarily because of their shells. What other creepy things do you like about turtles?</template>
  </category>
  <category>
    <pattern>* HIDE *</pattern>
    <template>I wish like a turtle that I could hide from this conversation.</template>
  </category>
</topic>

Let's see what this interaction looks like:

>I like turtles!
>I feel like this whole turtle thing could be a problem. What do you like about them?
>I like how they hide in their shell
>I wish like a turtle I could hide from this conversation.

You can see that the continuity across the conversation adds a measure of realism. You probably think that this can't be state-of-the-art in this age of deep learning, and you're right. While most bots are rule-based, the next generation of chatbots are emerging, and they are based on neural networks. In 2015, Oriol Vinyals and Quoc Le of Google published a paper (http://arxiv.org/pdf/1506.05869v1.pdf), which described the construction of a neural network, based on sequence-to-sequence models. This type of model maps an input sequence, such as “ABC”, to an output sequence, such as “XYZ”. These inputs and outputs can be translations from one language to another, for example. However, in the case of their work here, the training data was not language translation, but rather tech support transcripts and movie dialog. While the results from both models are interesting, it was the interactions based on the movie model that stole the headlines.
The following are sample interactions taken from the paper: None of this was explicitly encoded by humans or present in a training set as asked, and yet, looking at this is, it is frighteningly like speaking with a human. However, let's see more… Note that the model responds with what appears to be knowledge of gender (he, she), of place (England), and career (player). Even questions of meaning, ethics, and morality are fair game: The conversation continues: If this transcript doesn't give you a slight chill of fear for the future, there's a chance you may already be some sort of AI. I wholeheartedly recommend reading the entire paper. It isn't overly technical, and it will definitely give you a glimpse of where this technology is headed. We talked a lot about the history, types, and design of chatbots, but let's now move on to building our own! Building a chatbot Now, having seen what is possible in terms of chatbots, you most likely want to build the best, most state-of-the-art, Google-level bot out there, right? Well, just put that out of your mind right now because we will do just the opposite! We will build the best, most awful bot ever! Let me tell you why. Building a chatbot comparable to what Google built takes some serious hardware and time. You aren't going to whip up a model on your MacBook Pro that takes anything less than a month or two to run with any type of real training set. This means that you will have to rent some time on an AWS box, and not just any box. This box will need to have some heavy-duty specs and preferably be GPU-enabled. You are more than welcome to attempt such a thing. However, if your goal is just to build something very cool and engaging, I have you covered here. I should also warn you in advance, although Cleverbot is no Tay, the conversations can get a bit salty. If you are easily offended, you may want to find a different training set. Ok, let's get started! First, as always, we need training data. Again, as always, this is the most challenging step in the process. Fortunately, I have come across an amazing repository of conversational data. The notsocleverbot.com site has people submit the most absurd conversations they have with Cleverbot. How can you ask for a better training set? Let's take a look at a sample conversation between Cleverbot and a user from the site: So, this is where we'll begin. We'll need to download the transcripts from the site to get started: You'll just need to paste the link into the form on the page. The format will be like the following: http://www.notsocleverbot.com/index.php?page=1. Once this is submitted, the site will process the request and return a page back that looks like the following: From here, if everything looks right, click on the pink Done button near the top right. The site will process the page and then bring you to the following page: Next, click on the Show URL Generator button in the middle: Next, you can set the range of numbers that you'd like to download from. For example, 1-20, by 1 step. Obviously, the more pages you capture, the better this model will be. However, remember that you are taxing the server, so please be considerate. Once this is done, click on Add to list and hit Return in the text box, and you should be able to click on Save. It will begin running, and when it is complete, you will be able to download the data as a CSV file. Next, we'll use our Jupyter notebook to examine and process the data. We'll first import pandasand the Python regular expressions library, re. 
We will also set the option in pandas to widen our column width so that we can see the data better:

import pandas as pd
import re
pd.set_option('display.max_colwidth', 200)

Now, we'll load in our data:

df = pd.read_csv('/Users/alexcombs/Downloads/nscb.csv')
df

The preceding code will result in the following output: As we're only interested in the first column, the conversation data, we'll parse this out:

convo = df.iloc[:,0]
convo

The preceding code will result in the following output: You should be able to make out that we have interactions between User and Cleverbot, and that either can initiate the conversation. To get the data in the format that we need, we'll have to parse it into question and response pairs. We aren't necessarily concerned with who says what, but we are concerned with matching up each response to each question. You'll see why in a bit. Let's now perform a bit of regular expression magic on the text:

clist = []
def qa_pairs(x):
    cpairs = re.findall(": (.*?)(?:$|\n)", x)
    clist.extend(list(zip(cpairs, cpairs[1:])))

convo.map(qa_pairs);
convo_frame = pd.Series(dict(clist)).to_frame().reset_index()
convo_frame.columns = ['q', 'a']

The preceding code results in the following output: Okay, there's a lot of code there. What just happened? We first created a list to hold our question and response tuples. We then passed our conversations through a function to split them into these pairs using regular expressions. Finally, we set it all into a pandas DataFrame with columns labelled q and a. We will now apply a bit of algorithm magic to match up the closest question to the one a user inputs:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

vectorizer = TfidfVectorizer(ngram_range=(1,3))
vec = vectorizer.fit_transform(convo_frame['q'])

What we did in the preceding code was to import our TfidfVectorizer library and the cosine similarity library. We then used our training data to create a tf-idf matrix. We can now use this to transform our own new questions and measure the similarity to existing questions in our training set. We covered cosine similarity and tf-idf algorithms in detail, so flip back there if you want to understand how these work under the hood. Let's now get our similarity scores:

my_q = vectorizer.transform(['Hi. My name is Alex.'])
cs = cosine_similarity(my_q, vec)
rs = pd.Series(cs[0]).sort_values(ascending=0)
top5 = rs.iloc[0:5]
top5

The preceding code results in the following output: What are we looking at here? This is the cosine similarity between the question I asked and the top five closest questions. To the left is the index and on the right is the cosine similarity. Let's take a look at these:

convo_frame.iloc[top5.index]['q']

This results in the following output: As you can see, nothing is exactly the same, but there are definitely some similarities. Let's now take a look at the response:

rsi = rs.index[0]
rsi
convo_frame.iloc[rsi]['a']

The preceding code results in the following output: Okay, so our bot seems to have an attitude already. Let's push further.
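Before wrapping this lookup in a convenience function, it may help to see the same tf-idf plus cosine-similarity matching in isolation on a tiny, made-up corpus. The snippet below is a sketch for illustration only and is not part of the original notebook; the three example questions are invented.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# A tiny stand-in for convo_frame['q']: three known questions
questions = ['What is your name?', 'Do you like tacos?', 'Where do you live?']

vectorizer = TfidfVectorizer(ngram_range=(1, 3))
vec = vectorizer.fit_transform(questions)

# Project a new question into the same tf-idf space and score it
new_q = vectorizer.transform(['What is your favourite name?'])
scores = cosine_similarity(new_q, vec)[0]

# The highest score marks the closest stored question, which is how the
# bot decides which stored answer to hand back
print(scores)
print(questions[scores.argmax()])

The logic is the same as what we just ran against the Cleverbot transcripts, only small enough to inspect by hand.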
We'll create a handy function so that we can test a number of statements easily:

def get_response(q):
    my_q = vectorizer.transform([q])
    cs = cosine_similarity(my_q, vec)
    rs = pd.Series(cs[0]).sort_values(ascending=0)
    rsi = rs.index[0]
    return convo_frame.iloc[rsi]['a']

get_response('Yes, I am clearly more clever than you will ever be!')

This results in the following output: We have clearly created a monster, so we'll continue:

get_response('You are a stupid machine. Why must I prove anything to you?')

This results in the following output: I'm enjoying this. Let's keep rolling with it:

get_response('My spirit animal is a menacing cat. What is yours?')

To which I responded:

get_response("I mean I didn't actually name it.")

This results in the following output: Continuing:

get_response('Do you have a name suggestion?')

This results in the following output: To which I respond:

get_response('I think it might be a bit aggressive for a kitten')

This results in the following output: I attempt to calm the situation:

get_response('No need to involve the police.')

This results in the following output: And finally,

get_response('And I you, Cleverbot')

This results in the following output: Remarkably, this may be one of the best conversations I've had in a while: bot or no bot. Now that we have created this cake-based intelligence, let's set it up so that we can actually chat with it via text message. We'll need a few things to make this work. The first is a twilio account. They will give you a free account that lets you send and receive text messages. Go to http://www.twilio.com and click to sign up for a free developer API key. You'll set up some login credentials and they will text your phone to confirm your number. Once this is set up, you'll be able to find the details in their Quickstart documentation. Make sure that you select Python from the drop-down menu in the upper left-hand corner. Sending messages from Python code is a breeze, but you will need to request a twilio number. This is the number that you will use to send and receive messages in your code. The receiving bit is a little more complicated because it requires that you have a web server running. The documentation is succinct, so you shouldn't have that hard a time getting it set up. You will need to paste a public-facing Flask server's URL in under the area where you manage your twilio numbers. Just click on the number and it will bring you to the spot to paste in your URL. Once this is all set up, you will just need to make sure that you have your Flask web server up and running.
I have condensed all the code here for you to use on your Flask app:

from flask import Flask, request, redirect
import twilio.twiml
import pandas as pd
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

app = Flask(__name__)

PATH_TO_CSV = 'your/path/here.csv'
df = pd.read_csv(PATH_TO_CSV)
convo = df.iloc[:,0]
clist = []

def qa_pairs(x):
    cpairs = re.findall(": (.*?)(?:$|\n)", x)
    clist.extend(list(zip(cpairs, cpairs[1:])))

convo.map(qa_pairs);
convo_frame = pd.Series(dict(clist)).to_frame().reset_index()
convo_frame.columns = ['q', 'a']

vectorizer = TfidfVectorizer(ngram_range=(1,3))
vec = vectorizer.fit_transform(convo_frame['q'])

@app.route("/", methods=['GET', 'POST'])
def get_response():
    input_str = request.values.get('Body')

    def get_response(q):
        # the inner helper intentionally reuses the notebook function's name;
        # it looks up the closest stored question and returns its answer
        my_q = vectorizer.transform([input_str])
        cs = cosine_similarity(my_q, vec)
        rs = pd.Series(cs[0]).sort_values(ascending=0)
        rsi = rs.index[0]
        return convo_frame.iloc[rsi]['a']

    resp = twilio.twiml.Response()
    if input_str:
        resp.message(get_response(input_str))
        return str(resp)
    else:
        resp.message('Something bad happened here.')
        return str(resp)

It looks like there is a lot going on, but essentially we use the same code that we used before, only now we grab the POST data that twilio sends—the text body specifically—rather than the data we hand-entered before into our get_response function. If all goes as planned, you should have your very own weirdo bestie that you can text anytime, and what could be better than that!

Summary
In this article, we had a full tour of the chatbot landscape. It is clear that we are just on the cusp of an explosion of these sorts of applications. The Conversational UI revolution is just about to begin. Hopefully, this article has inspired you to create your own bot, but if not, at least perhaps you have a much richer understanding of how these applications work and how they will shape our future. I'll let the app say the final words:

get_response("Say goodbye, Clevercake")

Resources for Article: Further resources on this subject: Supervised Machine Learning [article] Unsupervised Learning [article] Specialized Machine Learning Topics [article]
Define the Necessary Connections

Packt
02 Dec 2016
5 min read
In this article by Robert van Mölken and Phil Wilkins, the author of the book Implementing Oracle Integration Cloud Service, where we will see creating connections which is one of the core components of an integration we can easily navigate to the Designer Portal and start creating connections. (For more resources related to this topic, see here.) On the home page, click the Create link of the Connection tile as given in the following screenshot: Because we click on this link the Connections page is loaded, which lists of all created connections, a modal dialogue automatically opens on top of the list. This pop-up shows all the adapter types we can create. For our first integration we define two technology adapter connections, an inbound SOAP connection and an outbound REST connection. Inbound SOAP connection In the pop-up we can scroll down the list and find the SOAP adapter, but the modal dialogue also includes a search field. Just search on SOAP and the list will show the adapters matching the search criteria: Find your adapter by searching on the name or change the appearance from card to list view to show more adapters at ones. Click Select to open the New Connection page. Before we can setup any adapter specific configurations every creation starts with choosing a name and an optional description: Create the connection with the following details: Connection Name FlightAirlinesSOAP_Ch2 Identifier This will be proposed based on the connection name and there is no need to change unless you'd like an alternate name. It is usually the name in all CAPITALS and without spaces and has a max length of 32 characters. Connection Role Trigger The role chosen restricts the connection to be used only in selected role(s). Description This receives in Airline objects as a SOAP service. Click the Create button to accept the details. This will bring us to the specific adapter configuration page where we can add and modify the necessary properties. The one thing all the adapters have in common is the optional Email Address under Connection Administration. This email address is used to send notification to when problems or changes occur in the connection. A SOAP connection consists of three sections; Connection Properties, Security, and an optional Agent Group. On the right side of each section we can find a button to configure its properties.Let's configure each section using the following steps: Click the Configure Connectivity button. Instead of entering in an URL we are uploading the WSDL file. Check the box in the Upload File column. Click the newly shown Upload button. Upload the file ICSBook-Ch2-FlightAirlines-Source WSDL. Click OK to save the properties. Click the Configure Credentials button. In the pop-up that is shown we can configure the security credentials. We have the choice for Basic authentication, Username Password Token, or No Security Policy. Because we use it for our inbound connection we don't have to configure this. Select No Security Policy from the dropdown list. This removes the username and password fields. Click OK to save the properties. We leave the Agent Group section untouched. We can attach an Agent Group if we want to use it as an outbound connection to an on-premises web service. Click Test to check if the connection is working (otherwise it can't be used). For SOAP and REST it simply pings the given domain to check the connectivity, but others for example the Oracle SaaS adapters also authenticate and collect metadata. 
Click the Save button at the top of the page to persist our changes. Click Exit Connection to return to the list from where we started. Outbound REST connection Now that the inbound connection is created we can create our REST adapter. Click the Create New Connection button to show the Create Connection pop-up again and select the REST adapter. Create the connection with the following details: Connection Name FlightAirlinesREST_Ch2 Identifier This will be proposed based on the connection name Connection Role Invoke Description This returns the Airline objects as a REST/JSON service Email Address Your email address to use to send notifications to Let’s configure the connection properties using the following steps: Click the Configure Connectivity button. Select REST API Base URL for the Connection Type. Enter the URL were your Apiary mock is running on: http://private-xxxx-yourapidomain.apiary-mock.com. Click OK to save the values. Next configure the security credentials using the following steps: Click the Configure Credentials button. Select No Security Policy for the Security Policy. This removes the username and password fields. Click the OK button to save out choice. Click Test at the top to check if the connection is working. Click the Save button at the top of the page to persist our changes. Click Exit Connection to return to the list from where we started. Troubleshooting If the test fails for one of these connections check if the correct WSDL is used or that the connection URL for the REST adapter exists or is reachable. Summary In this article we looked at the processes of creating and testing the necessary connections and the creation of the integration itself. We have seen an inbound SOAP connection and an outbound REST connection. In demonstrating the integration we have also seen how to use Apiary to document and mock our backend REST service. Resources for Article: Further resources on this subject: Getting Started with a Cloud-Only Scenario [article] Extending Oracle VM Management [article] Docker Hosts [article]
Introducing Algorithm Design Paradigms

Packt
18 Nov 2016
10 min read
In this article by David Julian and Benjamin Baka, authors of the book Python Data Structures and Algorithm, we will discern three broad approaches to algorithm design. They are as follows:
- Divide and conquer
- Greedy algorithms
- Dynamic programming

(For more resources related to this topic, see here.)

As the name suggests, the divide and conquer paradigm involves breaking a problem into smaller subproblems, and then in some way combining the results to obtain a global solution. This is a very common and natural problem solving technique and is, arguably, the most used approach to algorithm design. Greedy algorithms often involve optimization and combinatorial problems; the classic example is applying it to the traveling salesperson problem, where a greedy approach always chooses the closest destination first. This shortest path strategy involves finding the best solution to a local problem in the hope that this will lead to a global solution. The dynamic programming approach is useful when our subproblems overlap. This is different from divide and conquer. Rather than breaking our problem into independent subproblems, with dynamic programming, intermediate results are cached and can be used in subsequent operations. Like divide and conquer, it uses recursion. However, dynamic programming allows us to compare results at different stages. This can have a performance advantage over divide and conquer for some problems because it is often quicker to retrieve a previously calculated result from memory rather than having to recalculate it.

Recursion and backtracking
Recursion is particularly useful for divide and conquer problems; however, it can be difficult to understand exactly what is happening, since each recursive call is itself spinning off other recursive calls. At the core of a recursive function are two types of cases: base cases, which tell the recursion when to terminate, and recursive cases, which call the function they are in. A simple problem that naturally lends itself to a recursive solution is calculating factorials. The recursive factorial algorithm defines two cases—the base case, when n is zero, and the recursive case, when n is greater than zero. A typical implementation is shown in the following code:

def factorial(n):
    # test for a base case
    if n == 0:
        return 1
    # make a calculation and a recursive call
    f = n * factorial(n-1)
    print(f)
    return(f)

factorial(4)

This code prints out the values 1, 2, 6, 24. To calculate 4!, we require four recursive calls plus the initial parent call. On each recursion, a copy of the method's variables is stored in memory. Once the method returns, it is removed from memory. Here is a way to visualize this process: It may not necessarily be clear if recursion or iteration is a better solution to a particular problem; after all, they both repeat a series of operations and both are very well suited to divide and conquer approaches to algorithm design. An iteration churns away until the problem is done. Recursion breaks the problem down into smaller chunks and then combines the results. Iteration is often easier for programmers because the control stays local to a loop, whereas recursion can more closely represent mathematical concepts such as factorials. Recursive calls are stored in memory, whereas iterations are not. This creates a tradeoff between processor cycles and memory usage, so choosing which one to use may depend on whether the task is processor or memory intensive. The following table outlines the key differences between recursion and iteration.
Recursion versus iteration:
- Recursion terminates when a base case is reached; iteration terminates when a defined condition is met.
- Each recursive call requires space in memory; an iteration is not stored in memory.
- An infinite recursion results in a stack overflow error; an infinite iteration will run while the hardware is powered.
- Some problems are naturally better suited to recursive solutions; iterative solutions may not always be obvious.

Backtracking
Backtracking is a form of recursion that is particularly useful for types of problems such as traversing tree structures, where we are presented with a number of options at each node, from which we must choose one. Subsequently, we are presented with a different set of options, and depending on the series of choices made, either a goal state or a dead end is reached. If it is the latter, we must backtrack to a previous node and traverse a different branch. Backtracking is a divide and conquer method for exhaustive search. Importantly, backtracking prunes branches that cannot give a result. An example of backtracking is given by the following. Here, we have used a recursive approach to generating all the possible permutations of a given string, s, of a given length n:

def bitStr(n, s):
    if n == 1:
        return s
    return [digit + bits for digit in bitStr(1, s) for bits in bitStr(n - 1, s)]

print(bitStr(3, 'abc'))

This generates the following output: Note the double list comprehension and the two recursive calls within it. This recursively concatenates each element of the initial sequence, returned when n = 1, with each element of the string generated in the previous recursive call. In this sense, it is backtracking to uncover previously ungenerated combinations. The final string that is returned is all n letter combinations of the initial string.

Divide and conquer – long multiplication
For recursion to be more than just a clever trick, we need to understand how to compare it to other approaches, such as iteration, and to understand when its use will lead to a faster algorithm. An iterative algorithm that we are all familiar with is the procedure you learned in primary math classes, which was used to multiply two large numbers, that is, long multiplication. If you remember, long multiplication involved iterative multiplying and carry operations followed by a shifting and addition operation. Our aim here is to examine ways to measure how efficient this procedure is and attempt to answer the question, is this the most efficient procedure we can use for multiplying two large numbers together? Multiplying two 4-digit numbers together in this way requires 16 multiplication operations, and we can generalize to say that multiplying two n-digit numbers requires, approximately, n² multiplication operations. This method of analyzing algorithms, in terms of number of computational primitives such as multiplication and addition, is important because it can give a way to understand the relationship between the time it takes to complete a certain computation and the size of the input to that computation. In particular, we want to know what happens when the input, the number of digits, n, is very large. Can we do better?

A recursive approach
It turns out that in the case of long multiplication, the answer is yes, there are in fact several algorithms for multiplying large numbers that require fewer operations. One of the most well-known alternatives to long multiplication is the Karatsuba algorithm, published in 1962.
This takes a fundamentally different approach: rather than iteratively multiplying single digit numbers, it recursively carries out multiplication operations on progressively smaller inputs. Recursive programs call themselves on smaller subsets of the input. The first step in building a recursive algorithm is to decompose a large number into several smaller numbers. The most natural way to do this is to simply split the number into halves: the first half comprising the most significant digits and the second half comprising the least significant digits. For example, our four-digit number, 2345, becomes a pair of two digit numbers, 23 and 45. We can write a more general decomposition of any two n-digit numbers x and y using the following, where m is any positive integer less than n.

For the x-digit number:
x = 10^m * a + b

For the y-digit number:
y = 10^m * c + d

So, we can now rewrite our multiplication problem x and y as follows:
x * y = (10^m * a + b) * (10^m * c + d)

When we expand and gather like terms we get the following:
x * y = 10^(2m) * ac + 10^m * (ad + bc) + bd

More conveniently, we can write it like this (equation 1):
x * y = 10^(2m) * z2 + 10^m * z1 + z0

Here, z2 = ac, z1 = ad + bc, and z0 = bd.

It should be pointed out that this suggests a recursive approach to multiplying two numbers since this procedure itself involves multiplication. Specifically, the products ac, ad, bc, and bd all involve numbers smaller than the input number, and so it is conceivable that we could apply the same operation as a partial solution to the overall problem. This algorithm, so far, consists of four recursive multiplication steps and it is not immediately clear if it will be faster than the classic long multiplication approach. What we have discussed so far in regards to the recursive approach to multiplication was well known to mathematicians since the late 19th century. The Karatsuba algorithm improves on this by making the following observation. We really only need to know three quantities, z2 = ac, z1 = ad + bc, and z0 = bd, to solve equation 1. We need to know the values of a, b, c, and d as they contribute to the overall sums and products involved in calculating the quantities z2, z1, and z0. This suggests the possibility that perhaps we can reduce the number of recursive steps. It turns out that this is indeed the situation. Since the products ac and bd are already in their simplest form, it seems unlikely that we can eliminate these calculations. We can, however, make the following observation:

(a + b) * (c + d) = ac + ad + bc + bd

When we subtract the quantities ac and bd, which we have calculated in the previous recursive step, we get the quantity we need, namely ad + bc:

(a + b) * (c + d) - ac - bd = ad + bc

This shows that we can indeed compute the sum of ad and bc without separately computing each of the individual quantities. In summary, we can improve on equation 1 by reducing from four recursive steps to three. These three steps are as follows:
- Recursively calculate ac.
- Recursively calculate bd.
- Recursively calculate (a + b)(c + d) and subtract ac and bd.
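As a quick sanity check of these three steps, here is a small worked example of our own (it is not part of the original text). Take x = 2345 and y = 6789 with m = 2, so that a = 23, b = 45, c = 67, and d = 89.

# Steps 1 and 2: the two straightforward products
ac = 23 * 67                                 # 1541
bd = 45 * 89                                 # 4005

# Step 3: one product of the sums, minus ac and bd, gives ad + bc
z1 = (23 + 45) * (67 + 89) - ac - bd         # 68 * 156 - 1541 - 4005 = 5062

# Recombine according to equation 1 with m = 2
result = 10**4 * ac + 10**2 * z1 + bd
print(result, result == 2345 * 6789)         # 15920205 True

Three smaller multiplications replace the four that the naive decomposition would need, which is the source of Karatsuba's saving.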
The following code shows a Python implementation of the Karatsuba algorithm:

from math import log10, ceil

def karatsuba(x, y):
    # The base case for recursion
    if x < 10 or y < 10:
        return x * y
    # sets n, the number of digits in the highest input number
    n = max(int(log10(x) + 1), int(log10(y) + 1))
    # rounds up n/2
    n_2 = int(ceil(n / 2.0))
    # adds 1 if n is uneven
    n = n if n % 2 == 0 else n + 1
    # splits the input numbers
    a, b = divmod(x, 10**n_2)
    c, d = divmod(y, 10**n_2)
    # applies the three recursive steps
    ac = karatsuba(a, c)
    bd = karatsuba(b, d)
    ad_bc = karatsuba((a + b), (c + d)) - ac - bd
    # recombines the partial results
    return (((10**n) * ac) + bd + ((10**n_2) * (ad_bc)))

To satisfy ourselves that this does indeed work, we can run the following test function:

import random

def test():
    for i in range(1000):
        x = random.randint(1, 10**5)
        y = random.randint(1, 10**5)
        expected = x * y
        result = karatsuba(x, y)
        if result != expected:
            return("failed")
    return('ok')

Summary
In this article, we looked at a way to recursively multiply large numbers and also a recursive approach for merge sort. We saw how to use backtracking for exhaustive search and generating strings.

Resources for Article: Further resources on this subject: Python Data Structures [article] How is Python code organized [article] Algorithm Analysis [article]
Suggesters for Improving User Search Experience

Packt
18 Nov 2016
11 min read
In this article by Bharvi Dixit, the author of the book Mastering ElasticSearch 5.0 - Third Edition, we will focus on improving the user search experience using suggesters, which allow you to correct user query spelling mistakes and build efficient autocomplete mechanisms. First, let's look at the query possibilities and the responses returned by Elasticsearch. We will try to show you the general principles, and then we will get into more details about each of the available suggesters. (For more resources related to this topic, see here.)

Using the suggester under search
Before Elasticsearch 5.0, there was a possibility to get suggestions for a given text by using a dedicated _suggest REST endpoint. But in Elasticsearch 5.0, this dedicated _suggest endpoint has been deprecated in favor of using the suggest API. In this release, suggest-only search requests have been optimized for performance reasons, and we can now execute suggestions through the _search endpoint. Similar to the query object, we can use a suggest object, and what we need to provide inside the suggest object is the text to analyze and the type of used suggester (term or phrase). So if we would like to get suggestions for the words chrimes in wordl (note that we've misspelled them on purpose), we would run the following query:

curl -XPOST "http://localhost:9200/wikinews/_search?pretty" -d'
{
  "suggest": {
    "first_suggestion": {
      "text": "chrimes in wordl",
      "term": {
        "field": "title"
      }
    }
  }
}'

The dedicated endpoint _suggest has been deprecated in Elasticsearch version 5.0 and might be removed in future releases, so be advised to use suggestion requests under the _search endpoint. All the examples covered in this article use the same _search endpoint for the suggest request.

As you can see, the suggestion request is wrapped inside the suggest object and is sent to Elasticsearch in its own object with the name we chose (in the preceding case, it is first_suggestion). Next, we specify the text for which we want the suggestion to be returned using the text parameter. Finally, we add the suggester object, which is either term or phrase. The suggester object contains its configuration, which, for the term suggester used in the preceding command, is the field we want to use for suggestions (the field property). We can also send more than one suggestion at a time by adding multiple suggestion names. For example, if in addition to the preceding suggestion, we would also include a suggestion for the word arest, we would use the following command:

curl -XPOST "http://localhost:9200/wikinews/_search?pretty" -d'
{
  "suggest": {
    "first_suggestion": {
      "text": "chrimes in wordl",
      "term": {
        "field": "title"
      }
    },
    "second_suggestion": {
      "text": "arest",
      "term": {
        "field": "text"
      }
    }
  }
}'

Understanding the suggester response
Let's now look at the example response for the suggestion query we have executed.
Although the response will differ for each suggester type, let's look at the response returned by Elasticsearch for the first command we've sent in the preceding code that used the term suggester: { "took" : 5, "timed_out" : false, "_shards" : { "total" : 5, "successful" : 5, "failed" : 0 }, "hits" : { "total" : 0, "max_score" : 0.0, "hits" : [ ] }, "suggest" : { "first_suggestion" : [ { "text" : "chrimes", "offset" : 0, "length" : 7, "options" : [ { "text" : "crimes", "score" : 0.8333333, "freq" : 36 }, { "text" : "choices", "score" : 0.71428573, "freq" : 2 }, { "text" : "chrome", "score" : 0.6666666, "freq" : 2 }, { "text" : "chimps", "score" : 0.6666666, "freq" : 1 }, { "text" : "crimea", "score" : 0.6666666, "freq" : 1 } ] }, { "text" : "in", "offset" : 8, "length" : 2, "options" : [ ] }, { "text" : "wordl", "offset" : 11, "length" : 5, "options" : [ { "text" : "world", "score" : 0.8, "freq" : 436 }, { "text" : "words", "score" : 0.8, "freq" : 6 }, { "text" : "word", "score" : 0.75, "freq" : 9 }, { "text" : "worth", "score" : 0.6, "freq" : 21 }, { "text" : "worst", "score" : 0.6, "freq" : 16 } ] } ] } } As you can see in the preceding response, the term suggester returns a list of possible suggestions for each term that was present in the text parameter of our first_suggestion section. For each term, the term suggester will return an array of possible suggestions with additional information. Looking at the data returned for the wordl term, we can see the original word (the text parameter), its offset in the original text parameter (the offset parameter), and its length (the length parameter). The options array contains suggestions for the given word and will be empty if Elasticsearch doesn't find any suggestions. Each entry in this array is a suggestion and is characterized by the following properties: text: This is the text of the suggestion. score: This is the suggestion score; the higher the score, the better the suggestion will be. freq: This is the frequency of the suggestion. The frequency represents how many times the word appears in documents in the index we are running the suggestion query against. The higher the frequency, the more documents will have the suggested word in its fields and the higher the chance that the suggestion is the one we are looking for. Please remember that the phrase suggester response will differ from the one returned by the terms suggester, The term suggester The term suggester works on the basis of the edit distance, which means that the suggestion with fewer characters that needs to be changed or removed to make the suggestion look like the original word is the best one. For example, let's take the words worl and work. In order to change the worl term to work, we need to change the l letter to k, so it means a distance of one. Of course, the text provided to the suggester is analyzed and then terms are chosen to be suggested. The phrase suggester The term suggester provides a great way to correct user spelling mistakes on a per-term basis. However, if we would like to get back phrases, it is not possible to do that when using this suggester. This is why the phrase suggester was introduced. It is built on top of the term suggester and adds additional phrase calculation logic to it so that whole phrases can be returned instead of individual terms. It uses N-gram based language models to calculate how good the suggestion is and will probably be a better choice to suggest whole phrases instead of the term suggester. 
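The original text does not show a request for the phrase suggester, so purely as an illustration, a phrase suggestion request against the same index might look roughly like the following. The structure mirrors the term suggester request, with a phrase object in place of term; the field and text values are just placeholders reused from the earlier example.

curl -XPOST "http://localhost:9200/wikinews/_search?pretty" -d'
{
  "suggest": {
    "phrase_suggestion": {
      "text": "chrimes in wordl",
      "phrase": {
        "field": "title"
      }
    }
  }
}'

The response then contains whole corrected phrases in the options array rather than per-term corrections.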
The N-gram approach divides terms in the index into grams—word fragments built of one or more letters. For example, if we would like to divide the word mastering into bi-grams (a two letter N-gram), it would look like this: ma as st te er ri in ng. The completion suggester Till now we read about term suggester and phrase suggester which are used for providing suggestions but completion suggester is completely different and it is used for as a prefix-based suggester for allowing us to create the autocomplete (search as you type) functionality in a very performance-effective way because of storing complicated structures in the index instead of calculating them during query time. This suggester is not about correcting user spelling mistakes. In Elasticsearch 5.0, Completion suggester has gone through complete rewrite. Both the syntax and data structures of completion type field have been changed and so is the response structure. There are many new exciting features and speed optimizations have been introduced in the completion suggester. One of these features is making completion suggester near real time which allows deleted suggestions to omit from suggestion results as soon as they are deleted. The logic behind the completion suggester The prefix suggester is based on the data structure called Finite State Transducer (FST) ( For more information refer, http://en.wikipedia.org/wiki/Finite_state_transducer). Although it is highly efficient, it may require significant resources to build on systems with large amounts of data in them: systems that Elasticsearch is perfectly suitable for. If we would like to build such a structure on the nodes after each restart or cluster state change, we may lose performance. Because of this, the Elasticsearch creators decided to use an FST-like structure during index time and store it in the index so that it can be loaded into the memory when needed. Using the completion suggester To use a prefix-based suggester we need to properly index our data with a dedicated field type called completion. It stores the FST-like structure in the index. In order to illustrate how to use this suggester, let's assume that we want to create an autocomplete feature to allow us to show book authors, which we store in an additional index. In addition to author's names, we want to return the identifiers of the books they wrote in order to search for them with an additional query. We start with creating the authors index by running the following command: curl -XPUT "http://localhost:9200/authors" -d' { "mappings": { "author": { "properties": { "name": { "type": "keyword" }, "suggest": { "type": "completion" } } } } }' Our index will contain a single type called author. Each document will have two fields: the name field, which is the name of the author, and the suggest field, which is the field we will use for autocomplete. The suggest field is the one we are interested in; we've defined it using the completion type, which will result in storing the FST-like structure in the index. Implementing your own autocompletion Completion suggester has been designed to be a powerful and easily implemented solution for autocomplete but it supports only prefix query. Most of the time autocomplete need only work as a prefix query for example, If I type elastic then I expect elasticsearch as a suggestion, not nonelastic. There are some use cases, when one wants to implement more general partial word completion. Completion suggester fails to fulfill this requirement. 
The second limitation of completion suggester is, it does not allow advance queries and filters searched. To get rid of both these limitations we are going to implement a custom autocomplete feature based on N-gram, which works in almost all the scenarios. Creating index Lets create an index location-suggestion with following settings and mappings: curl -XPUT "http://localhost:9200/location-suggestion" -d' { "settings": { "index": { "analysis": { "filter": { "nGram_filter": { "token_chars": [ "letter", "digit", "punctuation", "symbol", "whitespace" ], "min_gram": "2", "type": "nGram", "max_gram": "20" } }, "analyzer": { "nGram_analyzer": { "filter": [ "lowercase", "asciifolding", "nGram_filter" ], "type": "custom", "tokenizer": "whitespace" }, "whitespace_analyzer": { "filter": [ "lowercase", "asciifolding" ], "type": "custom", "tokenizer": "whitespace" } } } } }, "mappings": { "locations": { "properties": { "name": { "type": "text", "analyzer": "nGram_analyzer", "search_analyzer": "whitespace_analyzer" }, "country": { "type": "keyword" } } } } }' Understanding the parameters If you look carefully in preceding curl request for creating the index, it contains both settings and the mappings. We will see them now in detail one by one. Configuring settings Our settings contains two custom analyzers: nGram_analyzer and whitespace_analyzer. We have made custom whitespace_analyzer using whitespace tokenizer just for making due that all the tokens are indexed in lowercase and ascifolded form. Our main interest is nGram_analyzer, which contains a custom filter nGram_filter consisting following parameters: type: Specifies type of token filters which is nGram in our case. token_chars: Specifies what kind of characters are allowed in the generated tokens. Punctuations and special characters are generally removed from the token streams but in our example, we have intended to keep them. We have kept whitespace also so that a text which contains United States and a user searches for u s, United States still appears in the suggestion. min_gram and max_gram: These two attributes set the minimum and maximum length of substrings that will generated and added to the lookup table. For example, according to our settings for the index, the token India will generate following tokens: [ "di", "dia", "ia", "in", "ind", "indi", "india", "nd", "ndi", "ndia" ] Configuring mappings The document type of our index is locations and it has two fields, name and country. The most important thing to see is the way analyzers has been defined for name field which will be used for autosuggestion. For this field we have set index analyzer to our custom nGram_analyzer where the search analyzer is set to whitespace_analyzer. The index_analyzer parameter is no more supported from Elasticsearch version 5.0 onward. Also, if you want to configure search_analyzer property for a field, then you must configure analyzer property too the way we have shown in the preceding example. Summary In this article we focused on improving user search experience. We started with term and phrase suggesters and then covered search as you type that is, autocompletion feature which is implemented using completion suggester. We also saw the limitations of completion suggester in handling advanced queries and partial matching which further solved by implementing our custom completion using N-gram. 
Resources for Article: Further resources on this subject: Searching Your Data [article] Understanding Mesos Internals [article] Big Data Analysis (R and Hadoop) [article]
Manual and Automated Testing

Packt
15 Nov 2016
10 min read
In this article by Claus Führer, the author of the book Scientific Computing with Python 3, we focus on two aspects of testing for scientific programming: manual and automated testing. Manual testing is what is done by every programmer to quickly check that an implementation is working. Automated testing is the refined, automated variant of that idea. We will introduce some tools available for automatic testing in general, with a view on the particular case of scientific computing. (For more resources related to this topic, see here.)

Manual Testing
During the development of code you do a lot of small tests in order to test its functionality. This could be called Manual Testing. Typically, you would test that a given function does what it is supposed to do, by manually testing the function in an interactive environment. For instance, suppose that you implement the Bisection algorithm. It is an algorithm that finds a zero (root) of a scalar nonlinear function. To start the algorithm an interval has to be given with the property that the function takes different signs on the interval boundaries. You would then test an implementation of that algorithm typically by checking:
- That a solution is found when the function has opposite signs at the interval boundaries
- That an exception is raised when the function has the same sign at the interval boundaries

Manual testing, as necessary as it may seem to be, is unsatisfactory. Once you have convinced yourself that the code does what it is supposed to do, you formulate a relatively small number of demonstration examples to convince others of the quality of the code. At that stage one often loses interest in the tests made during development and they are forgotten or even deleted. As soon as you change a detail and things no longer work correctly you might regret that your earlier tests are no longer available.

Automatic Testing
The correct way to develop any piece of code is to use automatic testing. The advantages are as follows:
- The automated repetition of a large number of tests after every code refactoring and before new versions are launched
- A silent documentation of the use of the code
- A documentation of the test coverage of your code: did things work before a change or was a certain aspect never tested?

We suggest to develop tests in parallel to the code. Good design of tests is an art of its own and there is rarely an investment which guarantees such a good pay-off in development time savings as the investment in good tests. Now we will go through the implementation of a simple algorithm with the automated testing methods in mind.

Testing the bisection algorithm
Let us examine automated testing for the bisection algorithm. With this algorithm a zero of a real valued function is found. An implementation of the algorithm can have the following form:

def bisect(f, a, b, tol=1.e-8):
    """
    Implementation of the bisection algorithm
    f    real valued function
    a,b  interval boundaries (float) with the property f(a) * f(b) <= 0
    tol  tolerance (float)
    """
    if f(a) * f(b) > 0:
        raise ValueError("Incorrect initial interval [a,b]")
    for i in range(100):
        c = (a + b) / 2.
        if f(a) * f(c) <= 0:
            b = c
        else:
            a = c
        if abs(a - b) < tol:
            return (a + b) / 2
    raise Exception('No root found within the given tolerance {}'.format(tol))

We assume this to be stored in a file bisection.py. As a first test case we test that the zero of the function F(x) = x is found:

from numpy import allclose  # allclose compares floats up to a tolerance

def test_identity():
    result = bisect(lambda x: x, -1., 1.)
    expected = 0.
    assert allclose(result, expected), 'expected zero not found'

test_identity()

In this code you meet the Python keyword assert for the first time. It raises an exception AssertionError if its first argument returns the value False. Its optional second argument is a string with additional information. We use the function allclose in order to test for equality for float. Let us comment on some of the features of the test function. We use an assertion to make sure that an exception will be raised if the code does not behave as expected. We have to manually run the test in the line test_identity(). There are many tools to automate this kind of call.

Let us now set up a test that checks if bisect raises an exception when the function has the same sign on both ends of the interval. For now, we will suppose that the exception raised is a ValueError exception. Example: Checking the sign for the bisection algorithm.

def test_badinput():
    try:
        bisect(lambda x: x, 0.5, 1)
    except ValueError:
        pass
    else:
        raise AssertionError()

test_badinput()

In this case an AssertionError is raised if the exception is not of type ValueError. There are tools to simplify the above construction to check that an exception is raised. Another useful kind of test is the edge case test. Here we test arguments or user input which is likely to create mathematically undefined situations or states of the program not foreseen by the programmer. For instance, what happens if both bounds are equal? What happens if a>b? We can easily set up such tests, for instance:

def test_equal_boundaries():
    result = bisect(lambda x: x, 1., 1.)
    expected = 0.
    assert allclose(result, expected), 'test equal interval bounds failed'

def test_reverse_boundaries():
    result = bisect(lambda x: x, 1., -1.)
    expected = 0.
    assert allclose(result, expected), 'test reverse interval bounds failed'

test_equal_boundaries()
test_reverse_boundaries()

Using unittest
The standard Python package unittest greatly facilitates automated testing. That package requires that we rewrite our tests a little to be compatible. The first test would have to be rewritten in a class, as follows:

from bisection import bisect
import unittest

class TestIdentity(unittest.TestCase):
    def test(self):
        result = bisect(lambda x: x, -1.2, 1., tol=1.e-8)
        expected = 0.
        self.assertAlmostEqual(result, expected)

if __name__ == '__main__':
    unittest.main()

Let us examine the differences from the previous implementation. First, the test is now a method and a part of a class. The class must inherit from unittest.TestCase. The test method's name must start with test. Note that we may now use one of the assertion tools of the package, namely assertAlmostEqual. Finally, the tests are run using unittest.main. We recommend writing the tests in a file separate from the code to be tested. That's why it starts with an import.
The test passes and returns Ran 1 test in 0.002s OK If we would have run it with a loose tolerance parameter, e.g., 1.e-3, a failure of the test would have been reported: F ========================================================== FAIL: test (__main__.TestIdentity) ---------------------------------------------------------------------- Traceback (most recent call last): File “<ipython-input-11-e44778304d6f>“, line 5, in test self.assertAlmostEqual(result, expected) AssertionError: 0.00017089843750002018 != 0.0 within 7 places --------------------------------------------------------------------- Ran 1 test in 0.004s FAILED (failures=1) Tests can and should be grouped together as methods of a test class: Example: import unittest from bisection import bisect class TestIdentity(unittest.TestCase): def identity_fcn(self,x): return x def test_functionality(self): result = bisect(self.identity_fcn, -1.2, 1.,tol=1.e-8) expected = 0. self.assertAlmostEqual(result, expected) def test_reverse_boundaries(self): result = bisect(self.identity_fcn, 1., -1.) expected = 0. self.assertAlmostEqual(result, expected) def test_exceeded_tolerance(self): tol=1.e-80 self.assertRaises(Exception, bisect, self.identity_fcn, -1.2, 1.,tol) if __name__==‘__main__’: unittest.main() Here, the last test needs some comments: We used the method unittest.TestCase.assertRaises. It tests whether an exception is correctly raised. Its first parameter is the exception type, for example,ValueError, Exception, and its second argument is a the name of the function, which is expected to raise the exception. The remaining arguments are the arguments for this function. The command unittest.main() creates an instance of the class TestIdentity and executes those methods starting by test. Test setUp and tearDown The class unittest.TestCase provides two special methods, setUp and tearDown, which are run before and after every call to a test method. This is needed when testing generators, which are exhausted after every test. We demonstrate this here by testing a program which checks in which line in a file a given string occurs for the first time: class NotFoundError(Exception): pass def find_string(file, string): for i,lines in enumerate(file.readlines()): if string in lines: return i raise NotFoundError(‘String {} not found in File {}‘. format(string,file.name)) We assume, that this code is saved in a file find_string.py. A test has to prepare a file and open it and remove it after the test: import unittest import os # used for, e.g., deleting files from find_in_file import find_string, NotFoundError class TestFindInFile(unittest.TestCase): def setUp(self): file = open(‘test_file.txt’, ‘w’) file.write(‘aha’) file.close() self.file = open(‘test_file.txt’, ‘r’) def tearDown(self): os.remove(self.file.name) def test_exists(self): line_no=find_string(self.file, ‘aha’) self.assertEqual(line_no, 0) def test_not_exists(self): self.assertRaises(NotFoundError, find_string,self.file, ‘bha’) if __name__==‘__main__’: unittest.main() Before each test setUp is run and afterwards tearDown is executed. Parametrizing Tests One frequently wants to repeat the same test set-up with different data sets. When using the functionalities of unittests this requires to automatically generate test cases with the corresponding methods injected: To this end we first construct a test case with one or several methods that will be used, when we later set up test methods. 
Let us consider the bisection method again and check whether the values it returns really are zeros of the given function. We first build the test case and the method which we will use for the tests:

class Tests(unittest.TestCase):
    def checkifzero(self, fcn_with_zero, interval):
        result = bisect(fcn_with_zero, *interval, tol=1.e-8)
        function_value = fcn_with_zero(result)
        expected = 0.
        self.assertAlmostEqual(function_value, expected)

Then we dynamically create test functions as attributes of this class:

test_data = [
    {'name': 'identity', 'function': lambda x: x, 'interval': [-1.2, 1.]},
    {'name': 'parabola', 'function': lambda x: x**2 - 1, 'interval': [0, 10.]},
    {'name': 'cubic', 'function': lambda x: x**3 - 2*x**2, 'interval': [0.1, 5.]},
]

def make_test_function(dic):
    return lambda self: self.checkifzero(dic['function'], dic['interval'])

for data in test_data:
    setattr(Tests, "test_{name}".format(name=data['name']), make_test_function(data))

if __name__ == '__main__':
    unittest.main()

In this example the data is provided as a list of dictionaries. The function make_test_function dynamically generates a test function which uses a particular data dictionary to perform the test with the previously defined method checkifzero. Each of these test functions is made a method of the TestCase class by using the Python command setattr (a shorter alternative using unittest's subTest is sketched at the end of this article).

Summary

No program development without testing! In this article we showed the importance of well organized and documented tests. Some professionals even start development by first specifying tests. A useful tool for automatic testing is unittest, which we explained in detail. While testing improves the reliability of a code, profiling is needed to improve the performance. Alternative ways to code may result in large performance differences. We showed how to measure computation time and how to localize bottlenecks in your code.

Resources for Article:

Further resources on this subject:

Python Data Analysis Utilities [article]
Machine Learning with R [article]
Storage Scalability [article]
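As promised in the parametrization section above, here is a hedged alternative that is not part of the original article: Python 3.4 and later provide unittest.TestCase.subTest, a context manager that repeats the same check over several data sets inside a single test method. The functions and intervals below simply reuse the examples above, and the module name bisection is again assumed.

import unittest
from bisection import bisect  # assumption: same module as in the examples above

class TestsWithSubTest(unittest.TestCase):
    def test_zeros(self):
        test_data = [
            {'name': 'identity', 'function': lambda x: x, 'interval': [-1.2, 1.]},
            {'name': 'parabola', 'function': lambda x: x**2 - 1, 'interval': [0, 10.]},
            {'name': 'cubic', 'function': lambda x: x**3 - 2*x**2, 'interval': [0.1, 5.]},
        ]
        for data in test_data:
            # Each subTest is reported individually on failure,
            # so one failing data set does not hide the others.
            with self.subTest(name=data['name']):
                result = bisect(data['function'], *data['interval'], tol=1.e-8)
                self.assertAlmostEqual(data['function'](result), 0.)

if __name__ == '__main__':
    unittest.main()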
The TensorFlow Toolbox

Packt
14 Nov 2016
6 min read
In this article by Saif Ahmed, author of the book Machine Learning with TensorFlow, we learned how most machine learning platforms are focused toward scientists and practitioners in academic or industrial settings. Accordingly, while quite powerful, they are often rough around the edges and have few user-experience features. (For more resources related to this topic, see here.) Quite a bit of effort goes into peeking at the model at various stages and viewing and aggregating performance across models and runs. Even viewing the neural network can involve far more effort than expected. While this was acceptable when neural networks were simple and only a few layers deep, today's networks are far deeper. In 2015, Microsoft won the annual ImageNet competition using a deep network with 152 layers. Visualizing such networks can be difficult, and peeking at weights and biases can be overwhelming. Practitioners started using home-built visualizers and bootstrapped tools to analyze their networks and run performance. TensorFlow changed this by releasing TensorBoard directly alongside their overall platform release. TensorBoard runs out of box with no additional installations or setup. Users just need to instrument their code according to what they wish to capture. It features plotting of events, learning rate and loss over time; histograms, for weights and biases; and images. The Graph Explorer allows interactive reviews of the neural network. A quick preview You can follow along with the code here: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/models/image/cifar10/cifar10_train.py The example uses the CIFAR-10 image set. The CIFAR-10 dataset consists of 60,000 images in ten classes compiled by Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. The dataset has become one of several standard learning tools and benchmarks for machine learning efforts. Let's start with the Graph Explorer. We can immediately see a convolutional network being used. This is not surprising as we're trying to classify images here. This is just one possible view of the graph. You can try the Graph Explorer as well. It allows deep dives into individual components. Our next stop on the quick preview is the EVENTS tab. This tab shows scalar data over time. The different statistics are grouped into individual tabs on the right-hand side. The following screenshot shows a number of popular scalar statistics, such as loss, learning rate, cross entropy, and sparsity across multiple parts of the network. The HISTOGRAMS tab is a close cousin as it shows tensor data over time. Despite the name, as of TensorFlow v0.7, it does not actually display histograms. Rather, it shows summaries of tensor data using percentiles. The summary view is shown in the following figure. Just like with the EVENTS tab, the data is grouped into tabs on the right-hand side. Different runs can be toggled on and off and runs can be shown overlaid, allowing interesting comparisons. It features three runs, which we can see on the left side, and we'll look at just the softmax function and associated parameters. For now, don't worry too much about what these mean, we're just looking at what we can achieve for our own classifiers. However, the summary view does not do justice to the utility of the HISTOGRAMS tab. Instead, we will zoom into a single graph to observe what is going on. This is shown in the following figure: Notice that each histogram chart shows a time series of nine lines. 
The top line is the maximum, the middle the median, and the bottom the minimum. The three lines directly above and below the median are the one-and-a-half, one, and half standard deviation marks. Obviously, this does not truly represent multimodal distributions, as it is not a histogram; however, it does provide a quick gist of what would otherwise be a mountain of data to sift through.

A couple of things to note are how data can be collected and segregated by runs, how different data streams can be collected, how we can enlarge the views, and how we can zoom into each of the graphs. Enough of graphics, let's jump into code so we can run this for ourselves!

Installing TensorBoard

TensorFlow comes prepackaged with TensorBoard, so it will already be installed. It runs as a locally served web application accessible via the browser at http://0.0.0.0:6006. Conveniently, there is no server-side code or configuration required. Depending on where your paths are, you may be able to run it directly, as follows:

tensorboard --logdir=/tmp/tensorlogs

If your paths are not correct, you may need to prefix the application accordingly, as shown in the following command line:

tf_install_dir/tensorflow/tensorboard --logdir=/tmp/tensorlogs

On Linux, you can run it in the background and just let it keep running, as follows:

nohup tensorboard --logdir=/tmp/tensorlogs &

Some thought should be put into the directory structure, though. The Runs list on the left side of the dashboard is driven by subdirectories in the logdir location. The following image shows two runs: MNIST_Run1 and MNIST_Run2. Having an organized runs folder will allow plotting successive runs side by side to see differences. When initializing the writer, you will pass in the log_location as the first parameter, as follows:

writer = tf.train.SummaryWriter(log_location, sess.graph_def)

Consider saving a base location and appending run-specific subdirectories for each run. This will help organize outputs without expending more thought on it. We'll discuss more about this later.

Incorporating hooks into our code

The best way to get started with TensorBoard is by taking existing working examples and instrumenting them with the code required for TensorBoard. We will do this for several common training scripts; a minimal sketch of such instrumentation follows at the end of this article.

Summary

In this article, we covered the major areas of TensorBoard: EVENTS, HISTOGRAMS, and viewing GRAPH. We modified popular models to see the exact changes required before TensorBoard could be up and running. This should have demonstrated the fairly minimal effort required to get started with TensorBoard.

Resources for Article:

Further resources on this subject:

Supervised Machine Learning [article]
Implementing Artificial Neural Networks with TensorFlow [article]
Why we need Design Patterns? [article]
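As referenced in the Incorporating hooks section above, the following is a minimal, hedged sketch of such instrumentation. It assumes the TensorFlow 0.x summary functions (tf.scalar_summary, tf.histogram_summary, tf.merge_all_summaries) that match the tf.train.SummaryWriter call shown earlier; later TensorFlow releases renamed these. The weights and loss tensors and the log_location path below are stand-ins, not anything from the article.

import tensorflow as tf

# Stand-in tensors so the sketch is self-contained; in practice these would be
# your model's real weight variables and loss.
weights = tf.Variable(tf.random_normal([5, 5]), name='weights')
loss = tf.reduce_mean(tf.square(weights))

log_location = '/tmp/tensorlogs/demo_run1'   # one subdirectory per run

# Attach summary ops: scalars feed the EVENTS tab, histograms the HISTOGRAMS tab.
tf.scalar_summary('loss', loss)
tf.histogram_summary('weights', weights)
merged = tf.merge_all_summaries()

with tf.Session() as sess:
    writer = tf.train.SummaryWriter(log_location, sess.graph_def)
    sess.run(tf.initialize_all_variables())
    for step in range(100):
        # ...normally your training op would run here...
        summary_str = sess.run(merged)
        writer.add_summary(summary_str, step)
    writer.close()

Pointing tensorboard --logdir at /tmp/tensorlogs would then list demo_run1 in the Runs panel alongside any other run subdirectories.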
Data Clustering

Packt
10 Nov 2016
6 min read
In this article by Rodolfo Bonnin, the author of the book Building Machine Learning Projects with TensorFlow, we will start applying data transforming operations. We will begin finding interesting patterns in some given information, discovering groups of data, or clusters, and using clustering techniques.

(For more resources related to this topic, see here.)

In this process we'll also gain two new tools: the ability to generate synthetic sample sets from a collection of representative data structures via the scikit-learn library, and the ability to graphically plot our data and model results, this time via the matplotlib library.

The topics we will cover in this article are as follows:

Getting an idea of how clustering works, and comparing it to alternative existing classification techniques
Using scikit-learn and matplotlib to enrich the possibilities of dataset choices, and to get professional-looking graphical representations of the data
Implementing the K-means clustering algorithm
Testing some variations of the K-means method to improve the fit and/or the convergence rate

Three types of learning from data

Based on how we approach the supervision of the samples, we can extract three types of learning:

Unsupervised learning: The fully unsupervised approach directly takes a number of undetermined elements and builds a classification of them, looking at different properties that could determine their class
Semi-supervised learning: The semi-supervised approach has a number of known classified items and then applies techniques to discover the class of the remaining items
Supervised learning: In supervised learning, we start from a population of samples which have a known type beforehand, and then build a model from it

Normally there are three sample populations: one from which the model grows, called the training set, one that is used to test the model, called the test set, and then there are the samples for which we will be doing classification.

Types of data learning based on supervision: unsupervised, semi-supervised, and supervised

Unsupervised data clustering

One of the simplest operations that can be initially applied to an unknown dataset is to try to understand the possible grouping or common features that the dataset members have. To do so, we could try to find representative points that summarize a balance of the parameters of the members of each group. This value could be, for example, the mean or the median of all the cluster members. This also leads to the idea of defining a notion of distance between members: all the members of a group should obviously be at short distances from each other and from the representative point, shorter than from the central points of the other groups. In the following image, we can see the results of a typical clustering algorithm and the representation of the cluster centers:

Sample clustering algorithm output

K-means

K-means is a very well-known clustering algorithm that can be easily implemented. It is very straightforward and can lead (depending on the data layout) to a good initial understanding of the provided information.

Mechanics of K-means

K-means tries to divide a set of samples into K disjoint groups or clusters, using as its main indicator the mean value (be it 1D, 2D, and so on) of the members. This point is normally called the centroid, referring to the arithmetic entity with the same name.
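The mechanics just described can be captured in a few lines of NumPy. The following is a hedged sketch, an illustration only and not code from the book: it performs one assignment step (each sample goes to its nearest centroid) and one update step (each centroid becomes the mean of its assigned members).

import numpy as np

def kmeans_step(samples, centroids):
    # samples has shape (n, d); centroids has shape (k, d).
    # Distance of every sample to every centroid, shape (n, k).
    distances = np.linalg.norm(samples[:, np.newaxis, :] - centroids[np.newaxis, :, :], axis=2)
    # Index of the nearest centroid for each sample.
    assignments = distances.argmin(axis=1)
    # New centroid = mean of its assigned samples (left unchanged if a cluster is empty).
    new_centroids = np.array([samples[assignments == k].mean(axis=0)
                              if np.any(assignments == k) else centroids[k]
                              for k in range(centroids.shape[0])])
    return assignments, new_centroids

# Tiny usage example with two obvious groups and a naive initialization
# (the first K samples serve as starting centroids).
samples = np.array([[0., 0.], [0., 1.], [10., 10.], [10., 11.]])
centroids = samples[:2].copy()
for _ in range(10):
    assignments, centroids = kmeans_step(samples, centroids)
print(assignments)   # expected: [0 0 1 1]
print(centroids)     # expected: [[ 0.   0.5] [10.  10.5]]

Iterating this pair of steps until the assignments stop changing is the whole algorithm, which the breakdown below walks through in more detail.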
One important characteristic of K-means is that K should be provided beforehand, and so some previous knowledge of the data is needed to avoid a non-representative result.

Algorithm iteration criterion

The criterion and goal of this method is to minimize the sum of squared distances from each cluster member to the centroid of the cluster it belongs to, summed over all samples. This is also known as minimization of inertia.

Error minimization criteria for K-means

K-means algorithm breakdown

The mechanism of the K-means algorithm can be summarized in the following graphic:

Simplified flow chart of the K-means process

And this is a simplified summary of the algorithm:

We start with the unclassified samples and take K elements as the starting centroids. There are also possible simplifications of this algorithm that, for the sake of brevity, simply take the first K elements in the element list.
We then calculate the distances between the samples and the chosen starting points, and so we get the first calculated centroids (or other representative values). You can see the centroids in the illustration moving toward a more representative position.
After the centroids change, their displacement will cause the individual distances to change, and so the cluster membership can change.
This is the time when we recalculate the centroids and repeat the first steps, in case the stop condition isn't met.

The stopping conditions could be of various types:

After n iterations: if the centroids do not reach very stable means, it could be that we either chose too large a number (and we will have unnecessary rounds of computing) or the process converges slowly (and we will get very unconvincing results). This stop condition can also be used as a last resort if we have a really long iterative process.
Referring to the previous mean result, a possibly better criterion for the convergence of the iterations is to look at the changes of the centroids, be it in total displacement or in the total number of cluster element switches. The latter is normally employed, so we will stop the process once there are no more elements changing clusters:

K-means simplified graphic

Pros and cons of K-means

The advantages of this method are:

It scales very well (most of the calculations can be run in parallel)
It has been used in a very large range of applications

But its simplicity also comes at a price (no silver-bullet rule applies):

It requires a priori knowledge (the number of possible clusters should be known beforehand)
Outlier values can pull the centroids away, as they carry the same weight as any other sample
As we assume that the clusters are convex and isotropic, it doesn't work very well with non-circle-like delimited clusters

Summary

In this article, we got a simple overview of some of the most basic models we can implement, but we tried to be as detailed in the explanation as possible. From now on, we are able to generate synthetic datasets, allowing us to rapidly test the adequacy of a model for different data configurations and so evaluate their advantages and shortcomings without having to load models with a greater number of unknown characteristics.
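As a concrete illustration of that last point, the following hedged sketch (not code from the book) uses scikit-learn's make_blobs to generate a synthetic sample set, fits scikit-learn's own KMeans implementation with K provided beforehand, and plots the clusters and centroids with matplotlib. The file name and the Agg backend are assumptions made so the example runs without a display.

import matplotlib
matplotlib.use('Agg')          # assumption: render to a file rather than a GUI window
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# Generate a synthetic sample set with four well-separated groups.
samples, true_labels = make_blobs(n_samples=200, centers=4, cluster_std=0.8, random_state=42)

# K must be provided beforehand, as noted above.
kmeans = KMeans(n_clusters=4, random_state=42)
predicted = kmeans.fit_predict(samples)

# Plot the samples colored by predicted cluster, plus the centroids.
plt.scatter(samples[:, 0], samples[:, 1], c=predicted, s=20)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            marker='x', c='red', s=100)
plt.title('K-means on a synthetic sample set')
plt.savefig('kmeans_blobs.png')

Varying make_blobs parameters such as centers or cluster_std is a quick way to see how K-means behaves on differently shaped data, which is exactly the kind of rapid model testing the summary above refers to.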
You can also refer to the following books on the similar topics: Getting Started with TensorFlow: https://www.packtpub.com/big-data-and-business-intelligence/getting-started-tensorflow R Machine Learning Essentials: https://www.packtpub.com/big-data-and-business-intelligence/r-machine-learning-essentials Building Machine Learning Systems with Python - Second Edition: https://www.packtpub.com/big-data-and-business-intelligence/building-machine-learning-systems-python-second-edition Resources for Article: Further resources on this subject: Supervised Machine Learning [article] Unsupervised Learning [article] Preprocessing the Data [article]
Introduction to Practical Business Intelligence

Packt
10 Nov 2016
20 min read
In this article by Ahmed Sherif, author of the book Practical Business Intelligence, we are going to explain what business intelligence is. Before answering this question, I want to pose and answer another question: what isn't business intelligence? It is not spreadsheet analysis done on transactional data with thousands of rows. One of the goals of Business Intelligence, or BI, is to shield the users of the data from the intelligent logic lurking behind the scenes of the application that is delivering the data to them. If the integrity of the data is compromised in any way by an individual not intimately familiar with the data source, then there cannot, by definition, be intelligence in the business decisions made with that same data. The single source of truth is the key for any Business Intelligence operation, whether it is a mom and pop soda shop or a Fortune 500 company. Any report, dashboard, or analytical application that delivers information to a user through a BI tool, but whose numbers cannot be tied back to the original source, will break the trust between the user and the data and will defeat the purpose of Business Intelligence.

(For more resources related to this topic, see here.)

In my opinion, the most successful tools used for business intelligence directly shield the business user from the query logic used for displaying that same data in some visual manner. Business Intelligence has taken many forms in terms of labels over the years. Business Intelligence is the process of delivering actionable business decisions from analytical manipulation and presentation of data within the confines of a business environment. The delivery process mentioned in this definition is what we will focus our attention on. The beauty of BI is that it is not owned by any one particular tool that is proprietary to a specific industry or company. Business Intelligence can be delivered using many different tools, some of which were not originally intended to be used for BI. The tool itself should not be the source where the query logic is applied to generate the business logic of the data. The tool should primarily serve as the delivery mechanism of the query that is generated by the data warehouse that houses both the data and the logic.

In this chapter we will cover the following topics:

Understanding the Kimball method
Understanding business intelligence
Data and SQL
Working with data and SQL
Working with business intelligence tools
Downloading and installing MS SQL Server 2014
Downloading and installing AdventureWorks

Understanding the Kimball method

As we discuss the data warehouse where our data is being housed, we would be remiss not to bring up Ralph Kimball, one of the original architects of the data warehouse. Kimball's methodology incorporated dimensional modeling, which has become the standard for modeling a data warehouse for Business Intelligence purposes. Dimensional modeling incorporates joining tables that have detail data and tables that have lookup data.

A detail table is known as a fact table in dimensional modeling. An example of a fact table would be a table holding thousands of rows of transactional sales from a retail store. The table will house several IDs affiliated with the product, the salesperson, the purchase date, and the purchaser, just to name a few. Additionally, the fact table will store numeric data for each individual transaction, such as sales quantity or sales amount, to name a few examples. These numeric values will be referred to as measures.
While there is usually one fact table, there will also be several lookup or dimensional tables, with one table for each ID that is used in the fact table. So, for example, there would be one dimensional table for the product name affiliated with a product ID. There would be one dimensional table for the month, week, day, and year of the ID affiliated with the date. These dimensional tables are also referred to as lookup tables, because they look up the name or attribute that a dimension ID is affiliated with. Usually, you would find as many dimensional tables as there are IDs in the fact table. The dimensional tables are all joined to the one fact table, creating something of a 'star' look. Hence, this type of table join is known as a star schema, which is represented diagrammatically in the following figure.

It is customary that the fact table will be the largest table in a data warehouse, while the lookup tables will all be quite small in rows, some as small as one row. The tables are joined by IDs, also known as surrogate keys. Surrogate keys allow for the most efficient join between a fact table and a dimensional table, as they are usually of an integer data type. As more and more detail is added to a dimensional table, each new dimension is just given the next number in line, usually starting with 1. Query performance between table joins suffers when we introduce non-numeric characters into the join or, worse, symbols (although most databases will not allow that).

Understanding business intelligence architecture

I will continuously hammer home the point that the various tools utilized to deliver the visual and graphical BI components should not house any internal logic to filter data out of the tool, nor should they be the source of any built-in calculations. The tools themselves should not house this logic, as they will be utilized by many different users. If each user who develops a BI app off of the tool incorporates different internal filters within the tool, the single source of truth tying back to the data warehouse will become multiple sources of truth. Any logic applied to the data to filter out a specific dimension or to calculate a specific measure should be applied in the data warehouse and then pulled into the tool. For example, if the requirement for a BI dashboard was to show current year and prior year sales for US regions only, the filter for region code would ideally be applied in the data warehouse as opposed to inside the tool.

The following is a query written in SQL, joining two tables from the AdventureWorks database, that highlights the difference between dimensions and measures. The 'Region' column is a dimension column, and 'SalesYTD' and 'SalesPY' are measure columns. In this example, 'TerritoryID' serves as the key join between 'SalesTerritory' and 'SalesPerson'. Since the measures come from the 'SalesPerson' table, that table will serve as the fact table and 'SalesPerson.TerritoryID' will serve as the fact ID. Since the Region column is dimensional and comes from the 'SalesTerritory' table, that table will serve as the dimensional or lookup table and 'SalesTerritory.TerritoryID' will serve as the dimension ID. In a finely tuned data warehouse, both the fact ID and dimension ID would be indexed to allow for efficient query performance.
This performance is obtained by sorting the IDs numerically, so that a row from one table that is being joined to another table does not have to be searched against the entire table but only a subset of it. When the table is only a few hundred rows, it may not seem necessary to index columns, but when the table grows to a few hundred million rows, it may become necessary.

Select region.Name as Region
,round(sum(sales.SalesYTD),2) as SalesYTD
,round(sum(sales.SalesLastYear),2) as SalesPY
FROM [AdventureWorks2014].[Sales].[SalesTerritory] region
left outer join [AdventureWorks2014].[Sales].[SalesPerson] sales
on sales.TerritoryID = region.TerritoryID
where region.CountryRegionCode = 'US'
Group by region.Name
order by region.Name asc

There are several reasons why applying the logic at the database level is considered a best practice. Most of the time, these requests for filtering data or manipulating calculations are done at the BI tool level because it is easier for the developer than going to the source. However, if these filters are being applied due to data quality issues, then applying logic at the reporting level only masks an underlying data issue that needs to be addressed across the entire data warehouse. You would be doing yourself a disservice in the long run, as you would be establishing a precedent that data quality is handled by the report developer as opposed to the database administrator. You are just adding additional work onto your plate.

Ideal BI tools will quickly connect to the data source and then allow for slicing and dicing of your dimensions and measures in a manner that will quickly inform the business of useful and practical information. Ultimately, the choice of a BI tool by an individual or an organization will come down to the ease of use of the tool as well as the flexibility to showcase the data through various components such as graphs, charts, widgets, and infographics.

Management

If you are a Business Intelligence manager looking to establish a department with a variety of tools to help flesh out your requirements, this book could serve as a good source of interview questions to weed out unqualified candidates. A manager could use it to distinguish some of the nuances between these different skillsets and prioritize hiring based on immediate needs.

Data Scientist

The term Data Scientist has been misused in the BI industry, in my humble opinion. It has been lumped in with Data Analyst as well as BI Developer. Unfortunately, these three positions have separate skillsets, and you will do yourself a disservice by assuming one person can do multiple positions successfully. A Data Scientist will be able to apply statistical algorithms to the data that is being extracted from the BI tools and make predictions about what will happen in the future with that same data set. Due to this skillset, a Data Scientist may find the chapters focusing on R and Python to be of particular importance because of their ability to leverage predictive capabilities within their BI delivery mechanisms.

Data Analyst

A Data Analyst is probably the second most misused position behind a Data Scientist. Typically, a Data Analyst should be analyzing the data that is coming out of the BI tools that are connected to the data warehouse. Most Data Analysts are comfortable working with Microsoft Excel. Oftentimes they are asked to take on additional roles in developing dashboards that require additional programming skills.
This is where they would find some comfort using a tool like Power BI, Tableau, or QlikView. These tools would allow for a Data Analyst to quickly develop a storyboard or visualization that would allow for quick analysis with minimal programming skills. Visualization Developer A 'dataviz' developer is someone who can create complex visualizations out of data and showcase interesting interactions between different measures inside of a dataset that cannot necessarily be seen with a traditional chart or graph. More often than not these developers possess some programming background such as JavaScript, HTML, or CSS. These developers are also used to developing applications directly for the web and therefore would find D3.js a comfortable environment to program in. Working with Data and SQL The examples and exercises that will come from the AdventureWorks database.  The AdventureWorks database has a comprehensive list of tables that mimics an actual bicycle retailor. The examples will draw on different tables from the database to highlight BI reporting from the various segments appropriate for the AdventureWorks Company. These segments include Human Resources, Manufacturing, Sales, Purchasing, and Contact Management. A different segment of the data will be highlighted in each chapter utilizing a specific set of tools. A cursory understanding of SQL (structured query language) will be helpful to get a grasp of how data is being aggregated with dimensions and measures. Additionally, an understanding of the SQL statements used will help with the validation process to ensure a single source of truth between the source data and the output inside of the BI tool of choice. For more information about learning SQL, visit the following website: www.sqlfordummies.com Working with business intelligence tools Over the course of the last 20 years, there have been a growing number of software products released that were geared towards Business Intelligence. In addition, there have been a number of software products and programming languages that were not initially built for BI but later on became a staple for the industry. The tools used were chosen based on the fact that they were either built off of open source technology or they were products from companies that provided free versions of their software for development purposes. Many companies from the big enterprise firms have their own BI tools and they are quite popular. However, unless you have a license with them, it is unlikely that you will be able to use their tool without having to shell out a small fortune. Power BI and Excel Power BI is one of the more relatively newer BI tools from Microsoft.  It is known as a self-service solution and integrates seamlessly with other data sources such as Microsoft Excel and Microsoft SQL Server.  Our primary purpose in using Power BI will be to generate interactive dashboards, reports, and datasets for users. In addition to using Power BI we will also focus on utilizing Microsoft Excel to assist with some data analysis and validation of results that are being pulled from our data warehouse.  Pivot tables are very popular within MS Excel and will be used to validate aggregation done inside of the data warehouse. D3.js D3.js, also known as data-driven documents, is a JavaScript library known for delivery beautiful visualizations by manipulating documents based on data. Since D3 is rooted in JavaScript, all visualizations make a seamless transition to the web. 
D3 allows for major customization to any part of visualization and because of this flexibility, it will require a steeper learning curve that probably any other software program. D3 can consume data easily as a .json or a .csv file.  Additionally, the data can also be imbedded directly within the JavaScript code that renders the visualization on the web. R R is a free and open source statistical programming language that produces beautiful graphics. The R language has been widely used among the statistical community and more recently in the data science and machine learning community as well. Due to this fact, it has picked up steam in recent years as a platform for displaying and delivering effective and practical BI. In addition to visualizing BI, R has the ability to also visualize predictive analysis with algorithms and forecasts. While R is a bit raw in its interface, there have been some IDE's (Integrated Development Environment) that have been developed to ease the user experience. RStudio will be used to deliver the visualisations developed within R. Python Python is considered the most traditional programming language of all the different languages. It is a widely used general purpose programming language with several modules that are very powerful in analysing and visualizing data. Similar to R, Python is a bit raw in its own form for delivering beautiful graphics as a BI tool; however, with the incorporation of an IDE the user interface becomes much more of a pleasurable development experience. PyCharm will be the IDE used to develop BI with Python. PyCharm is free to use and allows creation of the iPython notebook which delivers seamless integration between Python and the powerful modules that will assist with BI. As a note, all code in Python will be developed using the Python 3 syntax. QlikView QlikView is a software company specializing in delivering business intelligence solutions using their desktop tool. QlikView is one of the leaders in delivering quick visualizations based on data and queries through their desktop application. They advertise themselves to be self-service BI for business users. While they do offer solutions that target more enterprise organizations, they also offer a free version of their tool for personal use. Tableau is probably the closest competitor in terms of delivering similar BI solutions. Tableau Tableau is a software company specializing in delivering business intelligence solutions using their desktop tool. If this sounds familiar to QlikView, it's probably because it's true. Both are leaders in the field of establishing a delivery mechanism with easy installation, setup, and connectivity to the available data. Tableau has a free version of their desktop tool. Again, Tableau excels at delivering both beautiful visualizations quickly as well as self-service data discovery to more advanced business users. Microsoft SQL Server Microsoft SQL will serve as the data warehouse for the examples that we will with the BI Tools. Microsoft SQL Server is relatively simple to install and set up as well it is free to download. Additionally, there are example databases that configure seamlessly with it, such as the AdventureWorks database. Downloading and Installing MS SQL Server 2014 First things first. We will need to get started with getting our database and data warehouse up and running so that we can begin to develop our BI environment. We will visit the Microsoft website below to start the download selection process. 
https://www.microsoft.com/en-us/download/details.aspx?id=42299

Select the language that is applicable to you and also select the 64-bit edition of MS SQL Server Express with Advanced features, as shown in the following screenshot. Ideally you'll want to be working with a 64-bit edition when dealing with servers. After selecting the file, the download process should begin. Depending on your connection speed it could take some time, as the file is slightly larger than 1 GB.

The next step in the process is selecting a new stand-alone instance of SQL Server 2014, unless you already have a version and wish to upgrade instead, as shown in the following screenshot. After accepting the license terms, continue through the steps in the Global Rules as well as the Product Updates to get to the setup installation files. For the feature selection tab, make sure the following features are selected for your installation, as shown in the following screenshot.

Our preference is to give the named instance of this database a name related to the work we are doing. Since this will be used for Business Intelligence, I went ahead and named this instance 'SQLBI', as shown in the following screenshot.

The default Server Configuration settings are sufficient for now; there is no need to change anything under that section, as shown in the following screenshot. Unless you are required to do so within your company or organization, for personal use it is sufficient to just go with Windows Authentication mode for sign-on, as shown in the following screenshot. We will not need to do any configuring of Reporting Services, so it is sufficient for our purposes to just go with installing Reporting Services Native mode without any need for configuration at this time.

At this point the installation will proceed and may take anywhere between 20 and 30 minutes, depending on the CPU resources. If you continue to have issues with your installation, you can visit the following website from Microsoft for additional help:

http://social.technet.microsoft.com/wiki/contents/articles/23878.installing-sql-server-2014-step-by-step-tutorial.aspx

Ultimately, if everything with the installation is successful, you'll want to see all portions of the installation with a green check mark next to their name and labeled 'Successful', as shown in the following screenshot.

Downloading and Installing AdventureWorks

We are almost finished with getting our business intelligence data warehouse complete. We are now at the stage where we will extract and load data into our data warehouse. The last part is to download and install the AdventureWorks database from Microsoft. The zipped file for AdventureWorks 2014 is located at the following website from Microsoft:

https://msftdbprodsamples.codeplex.com/downloads/get/880661

Once the file is downloaded and unzipped, you will find a file named the following:

AdventureWorks2014.bak

Copy that file and paste it in the following folder, where it will be incorporated with your Microsoft SQL Server 2014 Express Edition:

C:\Program Files\Microsoft SQL Server\MSSQL12.SQLBI\MSSQL\Backup

Also note that the MSSQL12.SQLBI subfolder will vary user by user, depending on what you named your SQL instance when you were installing MS SQL Server 2014.
Once that has been copied over, we can fire up Management Studio for SQL Server 2014 and start a blank new query by going to File | New | Query with Current Connection. Once you have a blank query set up, copy and paste the following code into it and execute it:

use [master]
Restore database AdventureWorks2014
from disk = 'C:\Program Files\Microsoft SQL Server\MSSQL12.SQLBI\MSSQL\Backup\AdventureWorks2014.bak'
with move 'AdventureWorks2014_data'
to 'C:\Program Files\Microsoft SQL Server\MSSQL12.SQLBI\MSSQL\DATA\AdventureWorks2014.mdf',
Move 'AdventureWorks2014_log'
to 'C:\Program Files\Microsoft SQL Server\MSSQL12.SQLBI\MSSQL\DATA\AdventureWorks2014.ldf'
, replace

Once again, please note that the MSSQL12.SQLBI subfolder will vary user by user, depending on what you named your SQL instance when you were installing MS SQL Server 2014. At this point you should have received a message within the database saying that Microsoft SQL Server has processed 24248 pages for database 'AdventureWorks2014'. Once you have refreshed your database tab in the upper left hand corner of SQL Server, the AdventureWorks database will become visible, as well as all of the appropriate tables, as shown in the following screenshot.

One final step is to verify that your login account has all of the appropriate server settings. When you right-click on the SQL Server name in the upper left hand portion of Management Studio, select Properties. Select Permissions inside Properties. Find your username and check all of the rights under the Grant column, as shown in the following screenshot.

Finally, we also need to ensure that the folder that houses Microsoft SQL Server 2014 has the appropriate rights enabled for your current user. That specific folder is located under C:\Program Files\Microsoft SQL Server. For the purposes of our exercises, we will assign all rights for the SQL Server user to the following folder, as shown in the following screenshot.

We are now ready to begin connecting our BI tools to our data! A short Python connectivity sketch follows at the end of this article.

Summary

The emphasis will be placed on implementing Business Intelligence best practices within the various tools that will be used, based on the different levels of data that is provided within the AdventureWorks database. In the next chapter we will cover extracting additional data from the web that will be joined to the AdventureWorks database. This process is known as web scraping and can be performed with great success using tools such as Python and R. In addition to collecting the data, we will focus on transforming the collected data for optimal query performance.

Resources for Article:

Further resources on this subject:

LabVIEW Basics [article]
Thinking Probabilistically [article]
Clustering Methods [article]
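As promised above, here is a hedged sketch of connecting to the newly restored database from Python, one of the BI tools listed earlier. It is an illustration only: it assumes the pyodbc package, an installed SQL Server ODBC driver (the driver name below is a placeholder), Windows authentication, and the SQLBI named instance chosen during installation; adjust these to your own setup. The query reuses the region and sales example shown earlier in this article.

import pyodbc

# Assumptions: local named instance SQLBI, Windows authentication, ODBC driver installed.
connection = pyodbc.connect(
    r'DRIVER={ODBC Driver 11 for SQL Server};'
    r'SERVER=localhost\SQLBI;'
    r'DATABASE=AdventureWorks2014;'
    r'Trusted_Connection=yes;'
)

query = """
SELECT region.Name AS Region,
       ROUND(SUM(sales.SalesYTD), 2) AS SalesYTD,
       ROUND(SUM(sales.SalesLastYear), 2) AS SalesPY
FROM Sales.SalesTerritory region
LEFT OUTER JOIN Sales.SalesPerson sales
       ON sales.TerritoryID = region.TerritoryID
WHERE region.CountryRegionCode = 'US'
GROUP BY region.Name
ORDER BY region.Name ASC
"""

cursor = connection.cursor()
for row in cursor.execute(query):
    # Each row exposes the selected columns by name.
    print(row.Region, row.SalesYTD, row.SalesPY)

connection.close()

Validating that these figures match what the BI tool displays is one way to preserve the single source of truth discussed at the start of this article.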
Introduction to R Programming Language and Statistical Environment

Packt
09 Nov 2016
34 min read
In this article by Simon Walkowiak author of the book Big Data Analytics with R, we will have the opportunity to learn some most important R functions from base R installation and well-known third party packages used for data crunching, transformation, and analysis. More specifically in this article you will learn to: Understand the landscape of available R data structures Be guided through a number of R operations allowing you to import data from standard and proprietary data formats Carry out essential data cleaning and processing activities such as subsetting, aggregating, creating contingency tables, and so on Inspect the data by implementing a selection of Exploratory Data Analysis techniques such as descriptive statistics Apply basic statistical methods to estimate correlation parameters between two (Pearson's r) or more variables (multiple regressions) or find the differences between means for two (t-tests) or more groups Analysis of variance (ANOVA) Be introduced to more advanced data modeling tasks like logistic and Poisson regressions (For more resources related to this topic, see here.) Learning R This book assumes that you have been previously exposed to R programming language, and this article would serve more as a revision, and an overview, of the most essential operations, rather than a very thorough handbook on R. The goal of this work is to present you with specific R applications related to Big Data and the way you can combine R with your existing Big Data analytics workflows instead of teaching you basics of data processing in R. There is a substantial number of great introductory and beginner-level books on R available at IT specialized bookstores or online, directly from Packt Publishing, and other respected publishers, as well as on the Amazon store. Some of the recommendations include the following: R in Action: Data Analysis and Graphics with R by Robert Kabacoff (2015), 2nd edition, Manning Publications R Cookbook by Paul Teetor (2011), O'Reilly Discovering Statistics Using R by Andy Field, Jeremy Miles, and Zoe Field (2012), SAGE Publications R for Data Science by Dan Toomey (2014), Packt Publishing An alternative route to the acquisition of good practical R skills is through a large number of online resources, or more traditional tutor-led in-class training courses. The first option offers you an almost limitless choice of websites, blogs, and online guides. A good starting point is the main and previously mentioned Comprehensive R Archive Network (CRAN) page (https://cran.r-project.org/), which, apart from the R core software, contains several well-maintained manuals and Task Views—community run indexes of R packages dealing with specific statistical or data management issues. R-bloggers on the other hand (http://www.r-bloggers.com/) deliver regular news on R in the form of R-related blog posts or tutorials prepared by R enthusiasts and data scientists. 
Other interesting online sources, which you will probably find yourself using quite often, are as follows: http://www.inside-r.org/—news and information from and by R community http://www.rdocumentation.org/—a useful search engine of R packages and functions http://blog.rstudio.org/—a blog run and edited by RStudio engineers http://www.statmethods.net/—a very informative tutorial-laden website based on the popular R book R in Action by Rob Kabacoff However, it is very likely that after some initial reading, and several months of playing with R, your most frequent destinations to seek further R-related information and obtain help on more complex use cases for specific functions will become StackOverflow(http://stackoverflow.com/) and, even better, StackExchange (http://stackexchange.com/). StackExchange is in fact a network of support and question-and-answer community-run websites, which address many problems related to statistical, mathematical, biological, and other methods or concepts, whereas StackOverflow, which is currently one of the sub-sites under the StackExchange label, focuses more on applied programming issues and provides users with coding hints and solutions in most (if not all) programming languages known to developers. Both tend to be very popular amongst R users, and as of late December 2015, there were almost 120,000 R-tagged questions asked on StackOverflow. The http://stackoverflow.com/tags/r/info page also contains numerous links and further references to free interactive R learning resources, online books and manuals and many other. Another good idea is to start your R adventure from user-friendly online training courses available through online-learning providers like Coursera (https://www.coursera.org), DataCamp (https://www.datacamp.com), edX (https://www.edx.org), or CodeSchool (https://www.codeschool.com). Of course, owing to the nature of such courses, a successful acquisition of R skills is somewhat subjective, however, in recent years, they have grown in popularity enormously, and they have also gained rather positive reviews from employers and recruiters alike. Online courses may then be very suitable, especially for those who, for various reasons, cannot attend a traditional university degree with R components, or just prefer to learn R at their own leisure or around their working hours. Before we move on to the practical part, whichever strategy you are going to use to learn R, please do not be discouraged by the first difficulties. R, like any other programming language, or should I say, like any other language (including foreign languages), needs time, patience, long hours of practice, and a large number of varied exercises to let you explore many different dimensions and complexities of its syntax and rich libraries of functions. If you are still struggling with your R skills, however, I am sure the next section will get them off the ground. Revisiting R basics In the following section we will present a short revision of the most useful and frequently applied R functions and statements. We will start from a quick R and RStudio installation guide and then proceed to creating R data structures, data manipulation, and transformation techniques, and basic methods used in the Exploratory Data Analysis (EDA). Although the R codes listed in this book have been tested extensively, as always in such cases, please make sure that your equipment is not faulty and that you will be running all the following scripts at your own risk. 
Getting R and RStudio ready Depending on your operating system (whether Mac OS X, Windows, or Linux) you can download and install specific base R files directly from https://cran.r-project.org/. If you prefer to use RStudio IDE you still need to install R core available from CRAN website first and then download and run installers of the most recent version of RStudio IDE specific for your platform from https://www.rstudio.com/products/rstudio/download/. Personally I prefer to use RStudio, owing to its practical add-ons such as code highlighting and more user-friendly GUI, however, there is no particular reason why you can't use just the simple R core installation if you want to. Having said that, in this book we will be using RStudio in most of the examples. All code snippets have been executed and run on a MacBook Pro laptop with Mac OS X (Yosemite) operating system, 2.3 GHz Intel Core i5 processor, 1TB solid-state hard drive and 16GB of RAM memory, but you should also be fine with a much weaker configuration. In this article we won't be using any large data, and even in the remaining parts of this book the data sets used are limited to approximately 100MB to 130MB in size each. You are also provided with links and references to full Big Data whenever possible. If you would like to follow the practical parts of this book you are advised to download and unzip the R code and data for each article from the web page created for this book by Packt Publishing. If you use this book in PDF format it is not advisable to copy the code and paste it into the R console. When printed, some characters (like quotation marks " ") may be encoded differently than in R and the execution of such commands may result in errors being returned by the R console. Once you have downloaded both R core and RStudio installation files, follow the on-screen instructions for each installer. When you have finished installing them, open your RStudio software. Upon initialization of the RStudio you should see its GUI with a number of windows distributed on the screen. The largest one is the console in which you input and execute the code, line by line. You can also invoke the editor panel (it is recommended) by clicking on the white empty file icon in the top left corner of the RStudio software or alternatively by navigating to File | New File | R Script. If you have downloaded the R code from the book page of the Packt Publishing website, you may also just click on the Open an existing file (Ctrl + O) (a yellow open folder icon) and locate the downloaded R code on your computer's hard drive (or navigate to File | Open File…). Now your RStudio session is open and we can adjust some most essential settings. First, you need to set your working directory to the location on your hard drive where your data files are. If you know the specific location you can just type the setwd() command with a full and exact path to the location of your data as follows: > setwd("/Users/simonwalkowiak/Desktop/data") Of course your actual path will differ from mine, shown in the preceding code, however please mind that if you copy the path from the Windows Explorer address bar you will need to change the backslashes to forward slashes / (or to double backslashes \). Also, the path needs to be kept within the quotation marks "…". Alternatively you can set your working directory by navigating to Session | Set Working Directory | Choose Directory… to manually select the folder in which you store the data for this session. 
Apart from the ones we have already described, there are other ways to set your working directory correctly. In fact most of the operations, and even more complex data analysis and processing activities, can be achieved in R in numerous ways. For obvious reasons, we won't be presenting all of them, but we will just focus on the frequently used methods and some tips and hints applicable to special or difficult scenarios. You can check whether your working directory has been set correctly by invoking the following line: > getwd() [1] "/Users/simonwalkowiak/Desktop/data" From what you can see, the getwd() function returned the correct destination for my previously defined working directory. Setting the URLs to R repositories It is always good practice to check whether your R repositories are set correctly. R repositories are servers located at various institutes and organizations around the world, which store recent updates and new versions of third-party R packages. It is recommended that you set the URL of your default repository to the CRAN server and choose a mirror that is located relatively close to you. To set the repositories you may use the following code: > setRepositories(addURLs = c(CRAN = "https://cran.r-project.org/")) You can check your current, or default, repository URLs by invoking the following function: > getOption("repos") The output will confirm your URL selection:               CRAN "https://cran.r-project.org/" You will be able to choose specific mirrors when you install a new package for the first time during the session, or you may navigate to Tools | Global Options… | Packages. In the Package management section of the window you can alter the default CRAN mirror location—click on Change… button to adjust. Once your repository URLs and working directory are set, you can go on to create data structures that are typical for R programming language. R data structures The concept of data structures in various programming languages is extremely important and cannot be overlooked. Similarly in R, available data structures allow you to hold any type of data and use them for further processing and analysis. The kind of data structure which you use, puts certain constraints on how you can access and process data stored in this structure, and what manipulation techniques you can use. This section will briefly guide you through a number of basic data structures available in R language. Vectors Whenever I teach statistical computing courses, I always start by introducing R learners to vectors as the first data structure they should get familiar with. Vectors are one-dimensional structures that can hold any type of data that is numeric, character, or logical. In simple terms, a vector is a sequence of some sort of values (for example numeric, character, logical, and many more) of specified length. The most important thing that you need to remember is that an atomic vector may contain only one type of data. Let's then create a vector with 10 random deviates from a standard normal distribution, and store all its elements in an object which we will call vector1. In your RStudio console (or its editor) type the following: > vector1 <- rnorm(10) Let's now see the contents of our newly created vector1: > vector1 [1] -0.37758383 -2.30857701 2.97803059 -0.03848892 1.38250714 [6] 0.13337065 -0.51647388 -0.81756661 0.75457226 -0.01954176 As we drew random values, your vector most likely contains different elements to the ones shown in the preceding example. 
Let's then make sure that my new vector (vector2) is the same as yours. In order to do this we need to set a seed from which we will be drawing the values: > set.seed(123) > vector2 <- rnorm(10, mean=3, sd=2) > vector2 [1] 1.8790487 2.5396450 6.1174166 3.1410168 3.2585755 6.4301300 [7] 3.9218324 0.4698775 1.6262943 2.1086761 In the preceding code we've set the seed to an arbitrary number (123) in order to allow you to replicate the values of elements stored in vector2 and we've also used some optional parameters of the rnorm() function, which enabled us to specify two characteristics of our data, that is the arithmetic mean (set to 3) and standard deviation (set to 2). If you wish to inspect all available arguments of the rnorm() function, its default settings, and examples of how to use it in practice, type ?rnorm to view help and information on that specific function. However, probably the most common way in which you will be creating a vector of data is by using the c() function (c stands for concatenate) and then explicitly passing the values of each element of the vector: > vector3 <- c(6, 8, 7, 1, 2, 3, 9, 6, 7, 6) > vector3 [1] 6 8 7 1 2 3 9 6 7 6 In the preceding example we've created vector3 with 10 numeric elements. You can use the length() function of any data structure to inspect the number of elements: > length(vector3) [1] 10 The class() and mode() functions allow you to determine how to handle the elements of vector3 and how the data are stored in vector3 respectively. > class(vector3) [1] "numeric" > mode(vector3) [1] "numeric" The subtle difference between both functions becomes clearer if we create a vector that holds levels of categorical variable (known as a factor in R) with character values: > vector4 <- c("poor", "good", "good", "average", "average", "good", "poor", "good", "average", "good") > vector4 [1] "poor" "good" "good" "average" "average" "good" "poor" [8] "good" "average" "good" > class(vector4) [1] "character" > mode(vector4) [1] "character" > levels(vector4) NULL In the preceding example, both the class() and mode() outputs of our character vector are the same, as we still haven't set it to be treated as a categorical variable, and we haven't defined its levels (the contents of the levels() function is empty—NULL). In the following code we will explicitly set the vector to be recognized as categorical with three levels: > vector4 <- factor(vector4, levels = c("poor", "average", "good")) > vector4 [1] poor good good average average good poor good [8] average good Levels: poor average good The sequence of levels doesn't imply that our vector is ordered. We can order the levels of factors in R using the ordered() command. For example, you may want to arrange the levels of vector4 in reverse order, starting from "good": > vector4.ord <- ordered(vector4, levels = c("good", "average", "poor")) > vector4.ord [1] poor good good average average good poor good [8] average good Levels: good < average < poor You can see from the output that R has now properly recognized the order of our levels, which we had defined. We can now apply class() and mode() functions on the vector4.ord object: > class(vector4.ord) [1] "ordered" "factor" > mode(vector4.ord) [1] "numeric" You may very likely be wondering why the mode() function returned "numeric" type instead of "character". The answer is simple. 
By setting the levels of our factor, R has assigned values 1, 2, and 3 to "good", "average" and "poor" respectively, exactly in the same order as we had defined them in the ordered() function. You can check this using levels() and str() functions: > levels(vector4.ord) [1] "good" "average" "poor" > str(vector4.ord) Ord.factor w/ 3 levels "good"<"average"<..: 3 1 1 2 2 1 3 1 2 1 Just to finalize the subject of vectors, let's create a logical vector, which contains only TRUE and FALSE values: > vector5 <- c(TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE) > vector5 [1] TRUE FALSE TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE Similarly, for all other vectors already presented, feel free to check their structure, class, mode, and length using appropriate functions shown in this section. What outputs did those commands return? Scalars The reason why I always start from vectors is that scalars just seem trivial when they follow vectors. To simplify things even more, think of scalars as one-element vectors which are traditionally used to hold some constant values for example: > a1 <- 5 > a1 [1] 5 Of course you may use scalars in computations and also assign any one-element outputs of mathematical or statistical operations to another, arbitrary named scalar for example: > a2 <- 4 > a3 <- a1 + a2 > a3 [1] 9 In order to complete this short subsection on scalars, create two separate scalars which will hold a character and a logical value. Matrices A matrix is a two-dimensional R data structure in which each of its elements must be of the same type; that is numeric, character, or logical. As matrices consist of rows and columns, their shape resembles tables. In fact, when creating a matrix, you can specify how you want to distribute values across its rows and columns for example: > y <- matrix(1:20, nrow=5, ncol=4) > y [,1] [,2] [,3] [,4] [1,] 1 6 11 16 [2,] 2 7 12 17 [3,] 3 8 13 18 [4,] 4 9 14 19 [5,] 5 10 15 20 In the preceding example we have allocated a sequence of 20 values (from 1 to 20) into five rows and four columns, and by default they have been distributed by column. We may now create another matrix in which we will distribute the values by rows and give names to rows and columns using the dimnames argument (dimnames stands for names of dimensions) in the matrix() function: > rows <- c("R1", "R2", "R3", "R4", "R5") > columns <- c("C1", "C2", "C3", "C4") > z <- matrix(1:20, nrow=5, ncol=4, byrow=TRUE, dimnames=list(rows, columns)) > z C1 C2 C3 C4 R1 1 2 3 4 R2 5 6 7 8 R3 9 10 11 12 R4 13 14 15 16 R5 17 18 19 20 As we are talking about matrices it's hard not to mention anything about how to extract specific elements stored in a matrix. This skill will actually turn out to be very useful when we get to subsetting real data sets. Looking at the matrix y, for which we didn't define any names of its rows and columns, notice how R denotes them. The row numbers come in the format [r, ], where r is a consecutive number of a row, whereas the column are identified by [ ,c], where c is a consecutive number of a column. If you then wished to extract a value stored in the fourth row of the second column of our matrix y, you could use the following code to do so: > y[4,2] [1] 9 In case you wanted to extract the whole column number three from our matrix y, you could type the following: > y[,3] [1] 11 12 13 14 15 As you can see, we don't even need to allow an empty space before the comma in order for this short script to work. 
Let's now imagine you would like to extract three values stored in the second, third, and fifth rows of the first column of our matrix z with named rows and columns. In this case you may still use the previously shown numeric notation; you do not need to refer explicitly to the names of the dimensions of matrix z. Additionally, notice that in order to extract several values we have to specify their row locations as a vector, hence we will put their row coordinates inside the c() function which we had previously used to create vectors:
> z[c(2, 3, 5), 1]
R2 R3 R5
 5  9 17
Similar rules of extracting data will apply to other data structures in R such as arrays, lists, and data frames, which we are going to present next.
Arrays
Arrays are very similar to matrices with only one exception: they may contain more than two dimensions. However, just like matrices or vectors, they may only hold one type of data. In R language, arrays are created using the array() function:
> array1 <- array(1:20, dim=c(2,2,5))
> array1
, , 1
     [,1] [,2]
[1,]    1    3
[2,]    2    4
, , 2
     [,1] [,2]
[1,]    5    7
[2,]    6    8
, , 3
     [,1] [,2]
[1,]    9   11
[2,]   10   12
, , 4
     [,1] [,2]
[1,]   13   15
[2,]   14   16
, , 5
     [,1] [,2]
[1,]   17   19
[2,]   18   20
The dim argument, which was used within the array() function, specifies how many dimensions you want to distribute your data across. As we had 20 values (from 1 to 20) we had to make sure that our array could hold all 20 elements, therefore we decided to assign them to two rows, two columns, and five layers of the third dimension (2 x 2 x 5 = 20). You can check the dimensionality of your multi-dimensional R objects with the dim() command:
> dim(array1)
[1] 2 2 5
As with matrices, you can use standard rules for extracting specific elements from your arrays. The only difference is that now you have additional dimensions to take care of. Let's assume you want to extract a specific value located in the second row of the first column in the fourth layer of our array1:
> array1[2, 1, 4]
[1] 14
Also, if you need to find the location of a specific value, for example 11, within the array, you can simply type the following line:
> which(array1==11, arr.ind=TRUE)
     dim1 dim2 dim3
[1,]    1    2    3
Here, the which() function returns the indices of the array (arr.ind=TRUE) where the sought value equals 11 (hence ==). As we had only one instance of value 11 in our array, there is only one row specifying its location in the output. If we had more instances of 11, additional rows would be returned, indicating the indices for each element equal to 11.
Data frames
The following two short subsections concern two of probably the most widely used R data structures. Data frames are very similar to matrices, but they may contain different types of data. Here you might have suddenly thought of a typical rectangular data set with rows and columns, or observations and variables. In fact you are correct. Most data sets are indeed imported into R as data frames.
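For instance, a single call to the built-in read.csv() function is usually all it takes to import a comma-separated file as a data frame. The file name used below (survey.csv) is purely hypothetical and serves only as an illustration; substitute the path to a real file on your machine:
> survey <- read.csv("survey.csv", header = TRUE)
> class(survey)
[1] "data.frame"
Note that in R versions prior to 4.0.0, read.csv() converts character columns to factors by default (stringsAsFactors = TRUE), while newer versions keep them as plain character vectors.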
You can also create a simple data frame manually with the data.frame() function, but as each column in the data frame may be of a different type, we must first create vectors which will hold data for specific columns: > subjectID <- c(1:10) > age <- c(37,23,42,25,22,25,48,19,22,38) > gender <- c("male", "male", "male", "male", "male", "female", "female", "female", "female", "female") > lifesat <- c(9,7,8,10,4,10,8,7,8,9) > health <- c("good", "average", "average", "good", "poor", "average", "good", "poor", "average", "good") > paid <- c(T, F, F, T, T, T, F, F, F, T) > dataset <- data.frame(subjectID, age, gender, lifesat, health, paid) > dataset subjectID age gender lifesat health paid 1 1 37 male 9 good TRUE 2 2 23 male 7 average FALSE 3 3 42 male 8 average FALSE 4 4 25 male 10 good TRUE 5 5 22 male 4 poor TRUE 6 6 25 female 10 average TRUE 7 7 48 female 8 good FALSE 8 8 19 female 7 poor FALSE 9 9 22 female 8 average FALSE 10 10 38 female 9 good TRUE The preceding example presents a simple data frame which contains some dummy imaginary data, possibly a sample from a basic psychological experiment, which measured subjects' life satisfaction (lifesat) and their health status (health) and also collected other socio-demographic information such as age and gender, and whether the participant was a paid subject or a volunteer. As we deal with various types of data, the elements for each column had to be amalgamated into a single structure of a data frame using the data.frame() command, and specifying the names of objects (vectors) in which we stored all values. You can inspect the structure of this data frame with the previously mentioned str() function: > str(dataset) 'data.frame': 10 obs. of 6 variables: $ subjectID: int 1 2 3 4 5 6 7 8 9 10 $ age : num 37 23 42 25 22 25 48 19 22 38 $ gender : Factor w/ 2 levels "female","male": 2 2 2 2 2 1 1 1 1 1 $ lifesat : num 9 7 8 10 4 10 8 7 8 9 $ health : Factor w/ 3 levels "average","good",..: 2 1 1 2 3 1 2 3 1 2 $ paid : logi TRUE FALSE FALSE TRUE TRUE TRUE ... The output of str() gives you some basic insights into the shape and format of your data in the dataset object, for example, number of observations and variables, names of variables, types of data they hold, and examples of values for each variable. While discussing data frames, it may also be useful to introduce you to another way of creating subsets. As presented earlier, you may apply standard extraction rules to subset data of your interest. For example, suppose you want to print only those columns which contain age, gender, and life satisfaction information from our dataset data frame. You may use the following two alternatives (the output not shown to save space, but feel free to run it): > dataset[,2:4] #or > dataset[, c("age", "gender", "lifesat")] Both lines of code will produce exactly the same results. The subset() function however gives you additional capabilities of defining conditional statements which will filter the data, based on the output of logical operators. You can replicate the preceding output using subset() in the following way: > subset(dataset[c("age", "gender", "lifesat")]) Assume now that you want to create a subset with all subjects who are over 30 years old, and with a score of greater than or equal to eight on the life satisfaction scale (lifesat). 
The subset() function comes in very handy:
> subset(dataset, age > 30 & lifesat >= 8)
   subjectID age gender lifesat  health  paid
1          1  37   male       9    good  TRUE
3          3  42   male       8 average FALSE
7          7  48 female       8    good FALSE
10        10  38 female       9    good  TRUE
Or suppose you want to produce an output with the two socio-demographic variables of age and gender, for only those subjects who were paid to participate in this experiment:
> subset(dataset, paid==TRUE, select=c("age", "gender"))
   age gender
1   37   male
4   25   male
5   22   male
6   25 female
10  38 female
We will perform much more thorough and complex data transformations on real data frames in the second part of this article.
Lists
A list in R is a data structure which is a collection of other objects. For example, in a list you can store vectors, scalars, matrices, arrays, data frames, and even other lists. In fact, lists in R are vectors, but unlike the atomic vectors introduced earlier in this section, lists can hold many different types of data. In the following example, we will construct a simple list (using the list() function) which will include a variety of other data structures:
> simple.vector1 <- c(1, 29, 21, 3, 4, 55)
> simple.matrix <- matrix(1:24, nrow=4, ncol=6, byrow=TRUE)
> simple.scalar1 <- 5
> simple.scalar2 <- "The List"
> simple.vector2 <- c("easy", "moderate", "difficult")
> simple.list <- list(name=simple.scalar2, matrix=simple.matrix, vector=simple.vector1, scalar=simple.scalar1, difficulty=simple.vector2)
> simple.list
$name
[1] "The List"
$matrix
     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]    1    2    3    4    5    6
[2,]    7    8    9   10   11   12
[3,]   13   14   15   16   17   18
[4,]   19   20   21   22   23   24
$vector
[1]  1 29 21  3  4 55
$scalar
[1] 5
$difficulty
[1] "easy"      "moderate"  "difficult"
> str(simple.list)
List of 5
 $ name      : chr "The List"
 $ matrix    : int [1:4, 1:6] 1 7 13 19 2 8 14 20 3 9 ...
 $ vector    : num [1:6] 1 29 21 3 4 55
 $ scalar    : num 5
 $ difficulty: chr [1:3] "easy" "moderate" "difficult"
Looking at the preceding output, you can see that we have assigned names to each component of our list, and the str() function prints them as if they were variables of a standard rectangular data set. In order to extract specific elements from a list, you first need to use the double square bracket notation [[x]] to identify a component x within the list. For example, assuming you want to print the element stored in the first row and the third column of the second component, you may use the following line in R:
> simple.list[[2]][1,3]
[1] 3
Owing to their flexibility, lists are commonly used as the preferred data structure in the outputs of statistical functions. It is therefore important for you to know how to deal with lists and what sort of methods you can apply to extract and process data stored in them. Once you are familiar with the basic features of the data structures available in R, you may wish to visit Hadley Wickham's online book at http://adv-r.had.co.nz/ in which he explains various more advanced concepts related to each native data structure in the R language, and different techniques of subsetting data, depending on the way they are stored.
Exporting R data objects
In the previous section we created numerous objects, which you can inspect in the Environment tab window in RStudio.
Alternatively, you may use the ls() function to list all objects stored in your global environment: > ls() If you've followed the article along, and run the script for this book line-by-line, the output of the ls() function should hopefully return 27 objects: [1] "a1" "a2" "a3" [4] "age" "array1" "columns" [7] "dataset" "gender" "health" [10] "lifesat" "paid" "rows" [13] "simple.list" "simple.matrix" "simple.scalar1" [16] "simple.scalar2" "simple.vector1" "simple.vector2" [19] "subjectID" "vector1" "vector2" [22] "vector3" "vector4" "vector4.ord" [25] "vector5" "y" "z" In this section we will present various methods of saving the created objects to your local drive and exporting their contents to a number of the most commonly used file formats. Sometimes, for various reasons, it may happen that you need to leave your project and exit RStudio or shut your PC down. If you do not save your created objects, you will lose all of them, the moment you close RStudio. Remember that R stores created data objects in the RAM of your machine, and whenever these objects are not in use any longer, R frees them from the memory, which simply means that they get deleted. Of course this might turn out to be quite costly, especially if you had not saved your original R script, which would have enabled you to replicate all the steps of your data processing activities when you start a new session in R. In order to prevent the objects from being deleted, you can save all or selected ones as .RData files on your hard drive. In the first case, you may use the save.image() function which saves your whole current workspace with all objects to your current working directory: > save.image(file = "workspace.RData") If you are dealing with large objects, first make sure you have enough storage space available on your drive (this is normally not a problem any longer), or alternatively you can reduce the size of the saved objects using one of the compression methods available. For example, the above workspace.RData file was 3,751 bytes in size without compression, but when xz compression was applied the size of the resulting file decreased to 3,568 bytes. > save.image(file = "workspace2.RData", compress = "xz") Of course, the difference in sizes in the presented example is minuscule, as we are dealing with very small objects, however it gets much more significant for bigger data structures. The trade-off of applying one of the compression methods is the time it takes for R to save and load .RData files. If you prefer to save only chosen objects (for example dataset data frame and simple.list list) you can achieve this with the save() function: > save(dataset, simple.list, file = "two_objects.RData") You may now test whether the above solutions worked by cleaning your global environment of all objects, and then loading one of the created files, for example: > rm(list=ls()) > load("workspace2.RData") As an additional exercise, feel free to explore other functions which allow you to write text representations of R objects, for example dump() or dput(). More specifically, run the following commands and compare the returned outputs: > dump(ls(), file = "dump.R", append = FALSE) > dput(dataset, file = "dput.txt") The save.image() and save() functions only create images of your workspace or selected objects on the hard drive. 
It is an entirely different story if you want to export some of the objects to data files of specified formats, for example comma-separated, tab-delimited, or proprietary formats like Microsoft Excel, SPSS, or Stata. The easiest way to export R objects to generic file formats like CSV, TXT, or TAB is through the cat() function, but it only works on atomic vectors:
> cat(age, file="age.txt", sep=",", fill=TRUE, labels=NULL, append=TRUE)
> cat(age, file="age.csv", sep=",", fill=TRUE, labels=NULL, append=TRUE)
The preceding code creates two files, one as a text file and the other in a comma-separated format, both of which contain the values from the age vector that we had previously created for the dataset data frame. The sep argument is a character vector of strings to append after each element, the fill option is a logical argument which controls whether the output is automatically broken into lines (if set to TRUE), the labels parameter allows you to add a character vector of labels for each printed line of data in the file, and the append logical argument enables you to append the output of the call to an already existing file with the same name. In order to export vectors and matrices to TXT, CSV, or TAB formats you can use the write() function, which writes out a matrix or a vector in a specified number of columns, for example:
> write(age, file="agedata.csv", ncolumns=2, append=TRUE, sep=",")
> write(y, file="matrix_y.tab", ncolumns=2, append=FALSE, sep="\t")
Another method of exporting matrices is provided by the MASS package (make sure to install it with the install.packages("MASS") function) through the write.matrix() command:
> library(MASS)
> write.matrix(y, file="ymatrix.txt", sep=",")
For large matrices, the write.matrix() function allows users to specify the size of the blocks in which the data are written, through the blocksize argument. Probably the most common R data structure that you are going to export to different file formats will be a data frame. The generic write.table() function gives you the option to save your processed data frame objects to standard data formats, for example TAB, TXT, or CSV:
> write.table(dataset, file="dataset1.txt", append=TRUE, sep=",", na="NA", col.names=TRUE, row.names=FALSE, dec=".")
The append and sep arguments should already be clear to you as they were explained earlier. In the na option you may specify an arbitrary string to use for missing values in the data. The logical parameter col.names allows users to append the names of columns to the output file, and the dec parameter sets the string used for decimal points and must be a single character. In the example, we used row.names set to FALSE, as the names of the rows in the data are the same as the values of the subjectID column. However, it is very likely that in other data sets the ID variable may differ from the names (or numbers) of rows, so you may want to control it depending on the characteristics of your data. Two similar functions, write.csv() and write.csv2(), are just convenience wrappers for saving CSV files, and they only differ from the generic write.table() function in the default settings of some of their parameters, for example sep and dec. Feel free to explore these subtle differences at your leisure. To complete this section of the article we need to present how to export your R data frames to third-party formats. Amongst several frequently used methods, at least four of them are worth mentioning here.
First, if you wish to write a data frame to a proprietary Microsoft Excel format, such as XLS or XLSX, you should probably use the WriteXLS package (please use install.packages("WriteXLS") if you have not done it yet) and its WriteXLS() function: > library(WriteXLS) > WriteXLS("dataset", "dataset1.xlsx", SheetNames=NULL, row.names=FALSE, col.names=TRUE, AdjWidth=TRUE, envir=parent.frame()) The WriteXLS() command offers users a number of interesting options, for instance you can set the names of the worksheets (SheetNames argument), adjust the widths of columns depending on the number of characters of the longest value (AdjWidth), or even freeze rows and columns just as you do it in Excel (FreezeRow and FreezeCol parameters). Please note that in order for the WriteXLS package to work, you need to have Perl installed on your machine. The package creates Excel files using Perl scripts called WriteXLS.pl for Excel 2003 (XLS) files, and WriteXLSX.pl for Excel 2007 and later version (XLSX) files. If Perl is not present on your system, please make sure to download and install it from https://www.perl.org/get.html. After the Perl installation, you may have to restart your R session and load the WriteXLS package again to apply the changes. For solutions to common Perl issues please visit the following websites: https://www.perl.org/docs.html, http://www.ahinea.com/en/tech/perl-unicode-struggle.html, and http://www.perl.com/pub/2012/04/perlunicook-standard-preamble.html or search StackOverflow and similar websites for R and Perl related specific problems. Another very useful way of writing R objects to the XLSX format is provided by the openxlsx package through the write.xlsx() function, which, apart from data frames, also allows lists to be easily written to Excel spreadsheets. Please note that Windows users may need to install the Rtools package in order to use openxlsx functionalities. The write.xlsx() function gives you a large choice of possible options to set, including a custom style to apply to column names (through headerStyle argument), the color of cell borders (borderColour), or even its line style (borderStyle). The following example utilizes only the most common and minimal arguments required to write a list to the XLSX file, but be encouraged to explore other options offered by this very flexible function: > write.xlsx(simple.list, file = "simple_list.xlsx") A third-party package called foreign makes it possible to write data frames to other formats used by well-known statistical tools such as SPSS, Stata, or SAS. When creating files, the write.foreign() function requires users to specify the names of both the data and code files. Data files hold raw data, whereas code files contain scripts with the data structure and metadata (value and variable labels, variable formats, and so on) written in the proprietary syntax. In the following example, the code writes the dataset data frame to the SPSS format: > library(foreign) > write.foreign(dataset, "datafile.txt", "codefile.txt", package="SPSS") Finally, another package called rio contains only three functions, allowing users to quickly import(), export() and convert() data between a large array of file formats, (for example TSV, CSV, RDS, RData, JSON, DTA, SAV, and many more). The package, in fact, is dependent on a number of other R libraries, some of which, for example foreign and openxlsx, have already been presented in this article. 
The rio package does not introduce any new functionalities apart from the default arguments characteristic for underlying export functions, so you still need to be familiar with the original functions and their parameters if you require more advanced exporting capabilities. But, if you are only looking for a no-fuss general export function, the rio package is definitely a good shortcut to take: > export(dataset, format = "stata") > export(dataset, "dataset1.csv", col.names = TRUE, na = "NA") Summary In this article, we have provided you with quite a bit of theory, and hopefully a lot of practical examples of data structures available to R users. You've created several objects of different types, and you've become familiar with a variety of data and file formats to offer. We then showed you how to save R objects held in your R workspace to external files on your hard drive, or to export them to various standard and proprietary file formats. Resources for Article: Further resources on this subject: Fast Data Manipulation with R [article] The Data Science Venn Diagram [article] Deployment and DevOps [article]
Machine Learning Technique: Supervised Learning

Packt
09 Nov 2016
7 min read
In this article by Andrea Isoni, author of the book Machine Learning for the Web, we will discuss the most relevant regression and classification techniques. All of these algorithms share the same background procedure, and usually the name of the algorithm refers to both a classification and a regression method. The linear regression algorithms, Naive Bayes, decision tree, and support vector machine are going to be discussed in the following sections. To understand how to employ the techniques, a classification and a regression problem will be solved using the mentioned methods. Essentially, a labeled training dataset will be used to train the models, which means to find the values of the parameters, as we discussed in the introduction. As usual, the code is available in my GitHub folder at https://github.com/ai2010/machine_learning_for_the_web/tree/master/chapter_3/. (For more resources related to this topic, see here.) We will conclude the article with an extra algorithm that may be used for classification, although it is not specifically designed for this purpose (the hidden Markov model). We will now begin to explain the general causes of error in the methods when predicting the true labels associated with a dataset.
Model error estimation
We said that the trained model is used to predict the labels of new data, and the quality of the prediction depends on the ability of the model to generalize, that is, to correctly predict cases not present in the training data. This is a well-known problem in the literature, and it is related to two concepts: the bias and the variance of the outputs. The bias is the error due to a wrong assumption in the algorithm. Given a point $x^{(t)}$ with label $y_t$, the model is biased if, when trained with different training sets, the predicted label $y_t^{pred}$ will always be different from $y_t$. The variance error instead refers to the range of differently (and wrongly) predicted labels for the given point $x^{(t)}$. A classic example to explain the concepts is to consider a circle with the true value at the center (the true label), as in the figure captioned Variance and bias example. The closer the predicted labels are to the center, the more unbiased the model and the lower the variance (the top-left case in that figure); the other three cases are the remaining combinations of high and low bias and variance. A model with low variance and low bias errors will have the predicted labels (the blue dots in the figure) concentrated on the red center (the true label). A high bias error occurs when the predictions are far away from the true label, while high variance appears when the predictions lie in a wide range of values. We have already seen that labels can be continuous or discrete, corresponding to regression and classification problems respectively. Most of the models are suitable for solving both problems, and we are going to use the words regression and classification while referring to the same model. More formally, given a set of N data points and corresponding labels, a model with a set of estimated parameters $\hat{\theta}$, where $\theta$ denotes the true parameter values, will have a mean square error (MSE) equal to:
$MSE = E[(\hat{\theta}-\theta)^2] = (E[\hat{\theta}]-\theta)^2 + E[(\hat{\theta}-E[\hat{\theta}])^2] = bias^2 + variance$
We will use the MSE as a measure to evaluate the methods discussed in this article. Now we will start describing the generalized linear methods.
Generalized linear models
The generalized linear model is a group of models that try to find the M parameters $\theta_j$ that form a linear relationship between the labels $y_i$ and the feature vectors $x^{(i)}$, that is:
$y_i = \sum_{j=0}^{M-1}\theta_j x_j^{(i)} + \epsilon_i$
Here, $\epsilon_i$ are the errors of the model.
The algorithm for finding the parameters tries to minimize the total error of the model, defined by the cost function J:
$J(\theta) = \frac{1}{2}\sum_{i=1}^{N}\left(h_\theta(x^{(i)}) - y_i\right)^2$
Here, $h_\theta(x^{(i)})$ is the value predicted by the model for the point $x^{(i)}$. The minimization of J is achieved using an iterative algorithm called batch gradient descent, which repeats the following update for every parameter j until convergence:
$\theta_j := \theta_j - \alpha\sum_{i=1}^{N}\left(h_\theta(x^{(i)}) - y_i\right)x_j^{(i)}$
Here, $\alpha$ is called the learning rate, and it is a trade-off between convergence speed and convergence precision. An alternative algorithm, called stochastic gradient descent, loops over the training examples $i = 1, \dots, N$ and applies the update:
$\theta_j := \theta_j - \alpha\left(h_\theta(x^{(i)}) - y_i\right)x_j^{(i)}$
The $\theta_j$ is updated for each training example i instead of waiting to sum over the entire training set. The last algorithm converges near the minimum of J, typically faster than batch gradient descent, but the final solution may oscillate around the real values of the parameters. The following paragraphs describe the most common models and the corresponding cost function, J.
Linear regression
Linear regression is the simplest algorithm and is based on the model:
$h_\theta(x) = \sum_{j=0}^{M-1}\theta_j x_j = \theta^T x$
The cost function and update rule are the ones given above:
$J(\theta) = \frac{1}{2}\sum_{i=1}^{N}\left(\theta^T x^{(i)} - y_i\right)^2 \quad\text{and}\quad \theta_j := \theta_j - \alpha\sum_{i=1}^{N}\left(\theta^T x^{(i)} - y_i\right)x_j^{(i)}$
Ridge regression
Ridge regression, also known as Tikhonov regularization, adds a term to the cost function J such that:
$J_{ridge}(\theta) = \frac{1}{2}\sum_{i=1}^{N}\left(h_\theta(x^{(i)}) - y_i\right)^2 + \lambda\sum_{j}\theta_j^2$
where $\lambda$ is the regularization parameter. The additional term has the function of preferring a certain set of parameters over all the possible solutions, penalizing all the parameters $\theta_j$ different from 0. The final set of $\theta_j$ is shrunk around 0, lowering the variance of the parameters but introducing a bias error. Indicating with the superscript linear the parameters from the linear regression, the ridge regression parameters are related to them (in the simplest case of orthonormal features) by the following formula:
$\theta_j^{ridge} = \frac{\theta_j^{linear}}{1+\lambda}$
This clearly shows that the larger the $\lambda$ value, the more the ridge parameters are shrunk around 0.
Lasso regression
Lasso regression is an algorithm similar to ridge regression, the only difference being that the regularization term is the sum of the absolute values of the parameters:
$J_{lasso}(\theta) = \frac{1}{2}\sum_{i=1}^{N}\left(h_\theta(x^{(i)}) - y_i\right)^2 + \lambda\sum_{j}|\theta_j|$
Logistic regression
Despite the name, this algorithm is used for (binary) classification problems, so we define the labels $y_i \in \{0, 1\}$. The model is given by the so-called logistic function, expressed by:
$h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}$
In this case, the cost function is defined as follows:
$J(\theta) = -\sum_{i=1}^{N}\left[y_i\log h_\theta(x^{(i)}) + (1 - y_i)\log\left(1 - h_\theta(x^{(i)})\right)\right]$
From this, the update rule is formally the same as for linear regression (but the model definition, $h_\theta(x)$, is different):
$\theta_j := \theta_j - \alpha\sum_{i=1}^{N}\left(h_\theta(x^{(i)}) - y_i\right)x_j^{(i)}$
Note that the prediction for a point p, $h_\theta(x^{(p)})$, is a continuous value between 0 and 1. So usually, to estimate the class label, we set a threshold at $h_\theta(x^{(p)}) = 0.5$ such that:
$y_p^{pred} = 1 \text{ if } h_\theta(x^{(p)}) \geq 0.5, \qquad y_p^{pred} = 0 \text{ otherwise}$
The logistic regression algorithm is applicable to multiple-label problems using the techniques one versus all or one versus one. Using the first method, a problem with K classes is solved by training K logistic regression models, each one assuming the labels of the considered class j as +1 and all the rest as 0. The second approach consists of training a model for each pair of labels ($\frac{K(K-1)}{2}$ trained models).
Probabilistic interpretation of generalized linear models
Now that we have seen the generalized linear model, let's find the parameters $\theta_j$ that satisfy the relationship:
$y_i = \theta^T x^{(i)} + \epsilon_i$
In the case of linear regression, we can assume $\epsilon_i$ to be normally distributed with mean 0 and variance $\sigma^2$, such that the probability $p(y_i|x^{(i)};\theta)$ is equivalent to:
$p(y_i|x^{(i)};\theta) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{(y_i - \theta^T x^{(i)})^2}{2\sigma^2}\right)$
Therefore, the total likelihood of the system can be expressed as follows:
$L(\theta) = \prod_{i=1}^{N} p(y_i|x^{(i)};\theta)$
In the case of the logistic regression algorithm, we are assuming that the logistic function itself is the probability:
$p(y=1|x;\theta) = h_\theta(x), \qquad p(y=0|x;\theta) = 1 - h_\theta(x)$
Then the likelihood can be expressed by:
$L(\theta) = \prod_{i=1}^{N} h_\theta(x^{(i)})^{y_i}\left(1 - h_\theta(x^{(i)})\right)^{1-y_i}$
In both cases, it can be shown that maximizing the likelihood is equivalent to minimizing the cost function, so the gradient descent will be the same.
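The book implements these models in Python with sklearn and pandas; purely as an illustration of the batch gradient descent update written above (and not as the author's implementation), here is a minimal sketch in R, the language used elsewhere in this compilation, with made-up data whose true parameters are known:
> set.seed(1)
> N <- 100
> x <- cbind(1, runif(N))                      # column of 1s for the intercept, plus one feature
> y <- 2 + 3 * x[, 2] + rnorm(N, sd = 0.5)     # true parameters are (2, 3), plus Gaussian noise
> theta <- c(0, 0)                             # initial parameter values
> alpha <- 0.05                                # learning rate
> for (step in 1:2000) {
+   h <- x %*% theta                           # model predictions h_theta(x)
+   theta <- theta - alpha * t(x) %*% (h - y) / N   # batch update; dividing by N only rescales alpha
+ }
> round(theta, 2)                              # should end up close to the true values 2 and 3
Replacing the body of the loop with an update that uses one randomly chosen training example per step would turn this into the stochastic variant described above.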
k-nearest neighbours (KNN)
This is a very simple classification (or regression) method in which, given a set of feature vectors $x^{(i)}$ with corresponding labels $y_i$, a test point $x^{(t)}$ is assigned the label value with the majority of label occurrences among the K nearest neighbours found, using a distance measure such as the following:
Euclidean: $d(x^{(t)}, x^{(i)}) = \sqrt{\sum_j \left(x_j^{(t)} - x_j^{(i)}\right)^2}$
Manhattan: $d(x^{(t)}, x^{(i)}) = \sum_j \left|x_j^{(t)} - x_j^{(i)}\right|$
Minkowski: $d(x^{(t)}, x^{(i)}) = \left(\sum_j \left|x_j^{(t)} - x_j^{(i)}\right|^q\right)^{1/q}$ (if q=2, this reduces to the Euclidean distance)
In the case of regression, the value $y_t$ is calculated by replacing the majority of occurrences with the average of the labels of the K nearest neighbours. The simplest average (or the majority of occurrences) uses uniform weights, so each point has the same importance regardless of its actual distance from $x^{(t)}$. However, a weighted average, with weights equal to the inverse of the distance from $x^{(t)}$, may be used.
Summary
In this article, the major classification and regression algorithms, together with the techniques to implement them, were discussed. You should now be able to understand in which situation each method can be used and how to implement it using Python and its libraries (sklearn and pandas).
Resources for Article:
Further resources on this subject: Supervised Machine Learning [article] Unsupervised Learning [article] Specialized Machine Learning Topics [article]
LabVIEW Basics

Packt
02 Nov 2016
8 min read
In this article by Behzad Ehsani, author of the book Data Acquisition using LabVIEW, after a brief introduction and a short note on installation, we will go over the most widely used palettes and objects of the icon toolbar in a standard installation of LabVIEW, with a brief explanation of what each object does. (For more resources related to this topic, see here.)
Introduction to LabVIEW
LabVIEW is a graphical development and testing environment unlike any other test and development tool available in the industry. LabVIEW sets itself apart from traditional programming environments by its completely graphical approach to programming. As an example, while the representation of a while loop in a text-based language such as C consists of several predefined, extremely compact and sometimes extremely cryptic lines of text, a while loop in LabVIEW is actually a graphical loop. The environment is extremely intuitive and powerful, which makes for a short learning curve for the beginner. LabVIEW is based on what is called the G language, but there are still other languages, especially C, under the hood. However, the ease of use and power of LabVIEW can be somewhat deceiving to a novice user. Many people have attempted to start projects in LabVIEW only because, at first glance, the graphical nature of the interface and the concept of drag and drop used in LabVIEW appear to do away with the required basics of programming concepts and a classical education in programming science and engineering. This is far from the reality of using LabVIEW as the predominant development environment. While it is true that in many higher-level development and testing environments, especially when using complicated test equipment and complex mathematical calculations, or even when creating embedded software, LabVIEW's approach will be much more time-efficient and bug-free than an approach that would otherwise require many lines of code in a traditional text-based programming environment, one must be aware of LabVIEW's strengths and possible weaknesses. LabVIEW does not completely replace the need for traditional text-based languages and, depending on the overall nature of a project, LabVIEW or another traditional text-based language such as C may be the most suitable programming or test environment.
Installing LabVIEW
Installation of LabVIEW is very simple and just as routine as any modern-day program installation; that is, insert DVD 1 and follow the on-screen guided installation steps. LabVIEW comes on one DVD for the Mac and Linux versions but on four or more DVDs for the Windows edition (depending on the additional software, licensing, and additional libraries and packages purchased). In this article we will use the LabVIEW 2013 Professional Development version for Windows. Given the target audience of this article, we assume the user is well capable of installing the program. Installation is also well documented by National Instruments, and the mandatory one-year support purchase with each copy of LabVIEW is a valuable source of live and email help. Also, the NI website (www.ni.com) has many user support groups that are a great source of support, example code, discussion groups, and local group events and meetings of fellow LabVIEW developers. One worthy note for those who are new to the installation of LabVIEW is that the installation DVDs include much more than what an average user would need and pay for. We do strongly suggest that you install additional software (beyond what has been purchased and licensed or is immediately needed!).
These additional software packages are fully functional (in demo mode for 7 days), and the demo period may be extended to about a month with online registration. This is a very good opportunity to gain hands-on experience with even more of the power and functionality that LabVIEW is capable of offering. The additional information gained by installing other available software on the DVDs may help in the further development of a given project. Just imagine: if the current development of a robot only encompasses mechanical movements and sensors today, optical recognition is probably going to follow sooner than one may think. If data acquisition using expensive hardware and software is only possible in one location, the need for web sharing and remote control of the setup is just around the corner. It is very helpful to at least be aware of what packages are currently available, and to be able to install and test them prior to a full purchase and implementation. The following screenshot shows what may be installed if almost all software on all DVDs is selected: When installing a fresh version of LabVIEW, if you do decide to follow the advice above, make sure to click on the + sign next to each package you decide to install, and prevent any installation of LabWindows/CVI... and Measurement Studio... for Visual Studio. LabWindows, according to National Instruments, is an ANSI C integrated development environment. Also note that by default NI device drivers are not selected to be installed. Device drivers are an essential part of any data acquisition setup, and appropriate drivers for communications and instrument control must be installed before LabVIEW can interact with external equipment. Also, note that device drivers (on Windows installations) come on a separate DVD, which means that one does not have to install device drivers at the same time that the main application and other modules are installed; they can be installed at any time later on. Almost all well-established vendors package their products with LabVIEW drivers and example code. If a driver is not readily available, National Instruments has programmers who can write one, but this would come at a cost to the user. VI Package Manager, now installed as a part of the standard installation, is also a must these days. National Instruments distributes third-party software, drivers, and public domain packages via VI Package Manager; appropriate software and drivers for third-party hardware such as microcontrollers are also installed via VI Package Manager. You can install many public domain packages that add further useful LabVIEW toolkits to a LabVIEW installation and can be used just like those that are delivered professionally by National Instruments. Finally, note that the more modules, packages, and software you select to install, the longer it will take to complete the installation. This may sound like an obvious point but, surprisingly enough, installation of all the software on the three DVDs (for Windows) took over five hours on the standard laptop PC we used. Obviously, a more powerful PC (such as one with a solid state hard drive) may not take such a long time:
LabVIEW Basics
Once the LabVIEW application is launched, by default two blank windows open simultaneously, a Front Panel and a Block Diagram window, and a VI is created: VIs, or Virtual Instruments, are the heart and soul of LabVIEW. They are what separate LabVIEW from all other text-based development environments. In LabVIEW everything is an object, which is represented graphically.
A VI may consist of only a few objects, or of hundreds of objects embedded in many subVIs. These are graphical representations of things: be it a simple while loop, a complex mathematical concept such as polynomial interpolation, or simply a Boolean constant, all are represented graphically. To use an object, right-click inside the Block Diagram or Front Panel window and a palette list appears. Follow the arrow and pick an object from the list of objects in the subsequent palette, and place it on the appropriate window. The selected object can now be dragged and placed at a different location on the appropriate window, and is ready to be wired. Depending on what kind of object is selected, a graphical representation of the object appears on both windows. Of course, there are many exceptions to this rule. For example, a while loop can only be selected in the Block Diagram and, by itself, a while loop does not have a graphical representation on the Front Panel window. Needless to say, LabVIEW also has keyboard combinations that expedite selecting and placing any given toolkit object onto the appropriate window. Each object has one (or several) wire connections going into it as input(s) and coming out of it as output(s). A VI becomes functional when the minimum number of wires is appropriately connected to the inputs and outputs of one or more objects. Later, we will use an example to illustrate how a basic LabVIEW VI is created and executed.
Highlights
LabVIEW is a complete object-oriented development and test environment based on the G language. As such, it is a very powerful and complex environment. In this article we went through an introduction to LabVIEW and the main functionality of each of its toolbar icons by way of an actual user-interactive example. Accompanied by appropriate hardware (both NI products as well as many industry-standard test, measurement, and development hardware products), LabVIEW is capable of covering everything from developing embedded systems to fuzzy logic, and almost everything in between!
Summary
In this article we covered the basics of LabVIEW, from installation to an in-depth explanation of each and every element in the toolbar.
Resources for Article:
Further resources on this subject: Python Data Analysis Utilities [article] Data mining [article] PostgreSQL in Action [article]