How-To Tutorials - Data

1229 Articles
Kunal Chaudhari
17 Feb 2018
7 min read

How to perform predictive forecasting in SAP Analytics Cloud

[box type="note" align="" class="" width=""]This article is an excerpt from the book Learning SAP Analytics Cloud, written by Riaz Ahmed. The book covers the features of SAP Analytics Cloud that help you collaborate, predict, and solve business intelligence problems with cloud computing.[/box]

In this article, we will use predictive forecasting with a trend time series chart to view net revenue trends over the range of a year.

Time series forecasting is only supported for planning models in SAP Analytics Cloud, so you need planning rights and a planning license to run a predictive time-series forecast. However, you can still add a predictive forecast by creating a trend time series chart based on an analytical model to estimate future values.

A predictive time-series forecast runs an algorithm on historical data to predict future values for specific measures. For this type of chart, you can forecast a maximum of three different measures, and you have to specify the time for the prediction and the past time periods to use as historical data.

1. Add a blank chart from the Insert toolbar.
2. Set Data Source to the BestRun_Demo model.
3. Select the Time Series chart from the Trend category.
4. In the Measures section, click on the Add Measure link, and select Net Revenue.
5. Finally, click on the Add Dimension link in the Time section, and select Date as the chart's dimension.

The output of your selections is depicted in the first view in the following screenshot. Every chart you create on your story page has its own unique elements that let you navigate and drill into details. The trend time series chart also allows you to zoom in to different time periods and scroll across the entire timeline. For example, the first figure in the following illustration provides a one-year view (A) of net revenue trends, that is, from January to December 2015.
Click on the six months link (B) to see the corresponding output, as illustrated in the second view. Drag the rectangle box (C) to the left or right to scroll across the entire timeline.

Adding a forecast

Click on the last data point, representing December 2015, and select Add Forecast from the More Actions menu (D) to add a forecast.

The Predictive Forecast panel appears on the right side and displays the maximum number of forecast periods. Using the slider (E) in this section, you can reduce the number of forecast periods. By default, the slider shows the maximum number (seven in the current scenario), which is determined by the amount of historical data you have. In the Forecast On section, you see the measure (F) you selected for the chart. If required, you can forecast a maximum of three different measures in this type of chart; you add them in the Builder panel. For the time being, click on OK to accept the default values for the forecast, as illustrated in the following screenshot.

The forecast is added to the chart. It is indicated by a highlighted area (G) and a dotted line (H). Click on the 1 year link (I) to see an output similar to the one illustrated in the screenshot under the Modifying forecast section. As you can see, several data points represent the forecast. The top and bottom of the highlighted area indicate the upper and lower bounds of the prediction range, and the data points fall in the middle (on the dotted line) of the forecast range for each time period. Select a data point to see the Upper Confidence Bound (J) and Lower Confidence Bound (K) values.

Modifying forecast

You can modify a forecast using the link provided in the Forecast section at the bottom of the Builder panel. Select the chart, and scroll to the bottom of the Builder panel. Click on the Edit icon (L) to see the Predictive Forecast panel again. Review your settings, and make the required changes in this panel.
For example, drag the slider toward the left to set the Forecast Periods value to 3 (M). Click on OK to save your settings. The chart should now display the forecast for three months: January, February, and March 2016 (N).

Adding a time calculation

If you want to display values such as year-over-year sales trends or year-to-date totals in your chart, you can use the time calculation feature of SAP Analytics Cloud, which provides several calculation options. To use this feature, your chart must contain a time dimension with the appropriate level of granularity. For example, if you want to see quarter-over-quarter results, the time dimension must include quarterly or even monthly results. Space constraints prevent us from going through all these options, so we will use the year-over-year option to compare yearly results and get an idea of the feature.

The following instructions first create a bar chart that shows the sold quantities of the four product categories, and then add a time calculation to the chart to reveal the year-over-year changes in quantity sold for each category.

1. As usual, add a blank chart to the page using the chart option on the Insert toolbar.
2. Select the Best Run model as Data Source for the chart.
3. Select the Bar/Column chart from the Comparison category.
4. In the Measures section, click on the Add Measure link, and select Quantity Sold.
5. Click on the Add Dimension link in the Dimensions section, and select Product as the chart's dimension.

The chart appears on the page. At this stage, if you click on the More icon for the Quantity Sold measure, you will see that the Add Time Calculation option (A) is grayed out. This is because time calculations require a time dimension in the chart, which we will add next. Click on the Add Dimension link in the Dimensions section, and select Date to add this time dimension to the chart.
The chart transforms, as illustrated in the following screenshot. To display the results in the chart at the year level, you need to apply a filter as follows:

1. Click on the filter icon in the Date dimension, and select Filter by Member.
2. In the Set Members for Date dialog box, expand the all node, and select 2014, 2015, and 2016 individually.

Once again, the chart changes to reflect the filter, as illustrated in the following screenshot. Now that a time dimension has been added to the chart, we can add a time calculation to it as follows:

1. Click on the More icon in the Quantity Sold measure.
2. Select Add Time Calculation from the menu.
3. Choose Year Over Year.

New bars (A) and a corresponding legend (B) are added to the chart to help you compare yearly results, as shown in the following screenshot.

To summarize, this article gave you hands-on exposure to predictive forecasting in SAP Analytics Cloud: you learned how to use a trend time series chart to view net revenue trends throughout the range of a year, and how to add a time calculation to a chart.

If you enjoyed this excerpt, check out the book Learning SAP Analytics Cloud to get an understanding of the SAP Analytics Cloud platform and how to create better BI solutions.
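The Year Over Year calculation used above compares each year's total with the previous year's. The following Python sketch illustrates that arithmetic outside SAP Analytics Cloud; it is not the product's implementation, the function name `year_over_year` is hypothetical, and the quantities are made-up sample values rather than data from the Best Run model.

```python
def year_over_year(totals_by_year):
    """Return {year: fractional change versus the previous year}.

    A year is skipped when the previous year is missing or zero,
    because no meaningful ratio can be computed for it.
    """
    changes = {}
    for year in sorted(totals_by_year):
        previous = totals_by_year.get(year - 1)
        if previous:  # None (missing) and 0 are both skipped
            changes[year] = (totals_by_year[year] - previous) / previous
    return changes


# Hypothetical yearly totals of Quantity Sold for one product category.
quantity_sold = {2014: 1200, 2015: 1500, 2016: 1350}

for year, change in year_over_year(quantity_sold).items():
    print(f"{year}: {change:+.1%} versus {year - 1}")
```

The first year in the filter (2014 here) gets no value because it has no earlier year to compare against, which is why the time dimension must contain at least two periods at the chosen granularity.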

Fatema Patrawala
17 Feb 2018
11 min read

How to implement In-Memory OLTP on SQL Server in Linux

[box type="note" align="" class="" width=""]The following article is an excerpt from the book SQL Server on Linux, authored by Jasmin Azemović. This book is a handy guide to setting up and implementing your SQL Server solution on the open source Linux platform.[/box]

Today we will learn about the basics of In-Memory OLTP and how to implement it on SQL Server on Linux through the following topics:

• Elements of performance
• What is In-Memory OLTP
• Implementation

Elements of performance

How do you know if you have a performance issue in your database environment? Well, let's put it in these terms: you notice it (the good), users start calling technical support and complaining about how everything is slow (the bad), or you don't know about your performance issues (the ugly). Try never to get into the last category.

The good

Achieving the best performance is an iterative process in which you define a set of tasks that you execute on a regular basis and monitor their results. Here is a list that will give you an idea and guide you through this process:

1. Establish the baseline
2. Define the problem
3. Fix one thing at a time
4. Test and re-establish the baseline
5. Repeat everything

Establishing the baseline is the critical part. In most cases, it is not possible without real stress testing. For example: how many users can your system handle on the current configuration? The next step is to measure the processing time. Do your queries or stored procedures require milliseconds, seconds, or minutes to execute?

Now you need to monitor your database server using a set of tools and correct methodologies. During that process, you notice that some queries show signs of performance degradation. This is the point that defines the problem. Let's say that frequent UPDATE and DELETE operations are resulting in index fragmentation. The next step is to fix this issue with REORGANIZE or REBUILD index operations. Test your solution in a controlled environment and then in production.
Results can be better, the same, or worse. It depends, and there is no magic answer here. Maybe something else is now creating the problem: disk, memory, CPU, network, and so on. In this step, you should re-establish the old baseline or a new one. Measuring performance is a process that never ends. You should keep monitoring the system and stay alert.

The bad

If you are in this category, then you probably have an issue with establishing the baseline and setting up alerts for the system. So, users are becoming your alerts, and that is a bad thing. The rest of the steps are the same, except re-establishing the baseline. But this can be your wake-up call to move yourself into the good category.

The ugly

This means that you don't know, or you don't want to know, about performance issues. The best-case scenario is a headline on some news portal, but that is the ugly thing. Every decent DBA should try to be light years away from this category.

What do you need to start working with performance measuring, monitoring, and fixing? Here are some tips that can help you:

• Know the data and the app
• Know your server and its capacity
• Use dynamic management views (DMVs):
    • sys.dm_os_wait_stats
    • sys.dm_exec_query_stats
    • sys.dm_db_index_operational_stats
• Look for top queries by reads, writes, CPU, and execution count
• Put everything into LibreOffice Calc or another spreadsheet application and do some basic comparative math

Fortunately, there is something in the field that can make your life really easy. It can boost your environment to the scale of warp speed (I am a Star Trek fan).

What is In-Memory OLTP?

The SQL Server In-Memory feature is unique in the database world for a very simple reason: it is built into the database engine itself. It is not a separate database solution, and this brings some major benefits. One of these benefits is that in most cases you don't have to rewrite entire SQL Server applications to see performance benefits.
On average, you will see 10x more speed while you are testing the new In-Memory capabilities. Sometimes you will even see up to a 50x improvement, but it all depends on the amount of business logic that is done in the database via stored procedures. The more logic there is in the database, the greater the performance increase. The more business logic sits in the app, the less opportunity there is for a performance increase. This is one of the reasons for always separating the database world from the rest of the application layer.

In-Memory OLTP has built-in compatibility with other, non-memory tables. This way you can use the memory you have for the most heavily used tables and leave the others on disk. This also means you won't have to go out and buy expensive new hardware to make large In-Memory databases work; you can optimize In-Memory to fit your existing hardware.

In-Memory OLTP was introduced in SQL Server 2014. One of the first companies to use this feature, during the development of the 2014 version, was Bwin, an online gaming company. With In-Memory OLTP they improved their transaction speed by 16x without investing in expensive new hardware. The same company has achieved 1.2 million requests per second on SQL Server 2016 with a single machine using In-Memory OLTP: https://blogs.msdn.microsoft.com/sqlcat/2016/10/26/how-bwin-is-using-sql-server-2016-in-memory-oltp-to-achieve-unprecedented-performance-and-scale/

Not every application will benefit from In-Memory OLTP. If an application is not suffering from performance problems related to concurrency, IO pressure, or blocking, it's probably not a good candidate. If the application has long-running transactions that consume large amounts of buffer space, such as ETL processing, it's probably not a good candidate either.
The best candidates are applications that run high volumes of small, fast transactions with repeatable query plans, such as order processing, reservation systems, stock trading, and ticket processing. The biggest benefits will be seen on systems that suffer performance penalties from tables with concurrency issues related to a large number of users and locking/blocking. Applications that heavily use tempdb for temporary tables could also benefit from In-Memory OLTP by creating those tables as memory optimized and performing the expensive sorts, groupings, and selective queries on the memory-optimized tables.

In-Memory OLTP quick start

An important thing to remember is that a database that will contain memory-optimized tables must have a MEMORY_OPTIMIZED_DATA filegroup. This filegroup is used for storing the checkpoint files needed by SQL Server to recover the memory-optimized tables. Here is a simple DDL statement to create a database that is prepared for In-Memory tables:

    USE master
    GO

    CREATE DATABASE InMemorySandbox
    ON
    PRIMARY (NAME = InMemorySandbox_data,
        FILENAME = '/var/opt/mssql/data/InMemorySandbox_data_data.mdf',
        SIZE = 500MB),
    FILEGROUP InMemorySandbox_fg
        CONTAINS MEMORY_OPTIMIZED_DATA
        (NAME = InMemorySandbox_dir,
        FILENAME = '/var/opt/mssql/data/InMemorySandbox_dir')
    LOG ON (NAME = InMemorySandbox_log,
        FILENAME = '/var/opt/mssql/data/InMemorySandbox_data_data.ldf',
        SIZE = 500MB)
    GO

The next example alters an existing database and configures it to access memory-optimized tables. This part is helpful when you need to test and/or migrate current business solutions:

    -- First, check the compatibility level of the database.
    -- Minimum is 130
    USE AdventureWorks
    GO
    SELECT T.compatibility_level
    FROM sys.databases AS T
    WHERE T.name = DB_NAME();
    GO

    compatibility_level
    -------------------
    120

    (1 row(s) affected)

    -- Change the compatibility level
    ALTER DATABASE CURRENT
    SET COMPATIBILITY_LEVEL = 130;
    GO

    -- Modify the transaction isolation level
    ALTER DATABASE CURRENT SET
    MEMORY_OPTIMIZED_ELEVATE_TO_SNAPSHOT = ON
    GO

    -- Finally, create the memory-optimized filegroup
    ALTER DATABASE AdventureWorks
    ADD FILEGROUP AdventureWorks_fg CONTAINS
    MEMORY_OPTIMIZED_DATA
    GO

    -- Add a container to the new filegroup
    ALTER DATABASE AdventureWorks ADD FILE
    (NAME = 'AdventureWorks_mem',
    FILENAME = '/var/opt/mssql/data/AdventureWorks_mem')
    TO FILEGROUP AdventureWorks_fg
    GO

How to create a memory-optimized table?

The syntax for creating memory-optimized tables is almost the same as the syntax for creating classic disk-based tables. You need to specify that the table is a memory-optimized table, which is done using the MEMORY_OPTIMIZED = ON clause. A memory-optimized table can be created with two DURABILITY values:

• SCHEMA_AND_DATA (default)
• SCHEMA_ONLY

If you define a memory-optimized table with DURABILITY = SCHEMA_ONLY, changes to the table's data are not logged and the data is not persisted on disk. However, the schema is persisted as part of the database metadata. A side effect is that an empty table will be available after the database is recovered during a restart of the SQL Server on Linux service.

The following table is a summary of key differences between those two DURABILITY options.

When you create a memory-optimized table, the database engine generates DML routines just for accessing that table and loads them as DLL files.
SQL Server itself does not perform the data manipulation; instead, it calls the appropriate DLL.

Now let's add some memory-optimized tables to our sample database:

    USE InMemorySandbox
    GO

    -- Create a durable memory-optimized table
    CREATE TABLE Basket(
        BasketID INT IDENTITY(1,1)
            PRIMARY KEY NONCLUSTERED,
        UserID INT NOT NULL INDEX ix_UserID
            NONCLUSTERED HASH WITH (BUCKET_COUNT=1000000),
        CreatedDate DATETIME2 NOT NULL,
        TotalPrice MONEY) WITH (MEMORY_OPTIMIZED=ON)
    GO

    -- Create a non-durable table
    CREATE TABLE UserLogs (
        SessionID INT IDENTITY(1,1)
            PRIMARY KEY NONCLUSTERED HASH WITH (BUCKET_COUNT=400000),
        UserID INT NOT NULL,
        CreatedDate DATETIME2 NOT NULL,
        BasketID INT,
        INDEX ix_UserID
            NONCLUSTERED HASH (UserID) WITH (BUCKET_COUNT=400000))
        WITH (MEMORY_OPTIMIZED=ON, DURABILITY=SCHEMA_ONLY)
    GO

    -- Add some sample records
    INSERT INTO UserLogs VALUES
        (432, SYSDATETIME(), 1),
        (231, SYSDATETIME(), 7),
        (256, SYSDATETIME(), 7),
        (134, SYSDATETIME(), NULL),
        (858, SYSDATETIME(), 2),
        (965, SYSDATETIME(), NULL)
    GO

    INSERT INTO Basket VALUES
        (231, SYSDATETIME(), 536),
        (256, SYSDATETIME(), 6547),
        (432, SYSDATETIME(), 23.6),
        (134, SYSDATETIME(), NULL)
    GO

    -- Check the content of the tables
    SELECT SessionID, UserID, BasketID
    FROM UserLogs
    GO

    SELECT BasketID, UserID
    FROM Basket
    GO

What is a natively compiled stored procedure?

This is another great feature that comes with the In-Memory package. In a nutshell, it is a classic SQL stored procedure, but it is compiled into machine code for blazing fast performance. Natively compiled stored procedures are stored as native DLLs, enabling faster data access and more efficient query execution than traditional T-SQL.
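A note on the BUCKET_COUNT values used for the hash indexes on Basket and UserLogs: SQL Server rounds the requested bucket count up to the next power of two. The following Python sketch (an illustration only, not part of the book's T-SQL; the function name `effective_bucket_count` is hypothetical) shows what the two requested values become.

```python
def effective_bucket_count(requested: int) -> int:
    """Round a requested BUCKET_COUNT up to the next power of two."""
    if requested < 1:
        raise ValueError("requested bucket count must be positive")
    # (requested - 1).bit_length() is the exponent of the smallest
    # power of two that is greater than or equal to `requested`.
    return 1 << (requested - 1).bit_length()


# The two values requested in the CREATE TABLE statements.
for requested in (1_000_000, 400_000):
    print(f"BUCKET_COUNT={requested} -> {effective_bucket_count(requested)} buckets")
```

Microsoft's general guidance is to size the bucket count at roughly one to two times the number of distinct key values, so it is worth checking the rounded figure against the number of distinct UserID or SessionID values you expect.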
Now you will create a natively compiled stored procedure to insert 1,000,000 rows into Basket:

    USE InMemorySandbox
    GO

    CREATE PROCEDURE dbo.usp_BasketInsert @InsertCount INT
    WITH NATIVE_COMPILATION, SCHEMABINDING AS
    BEGIN ATOMIC
    WITH
        (TRANSACTION ISOLATION LEVEL = SNAPSHOT,
        LANGUAGE = N'us_english')
        DECLARE @i INT = 0
        WHILE @i < @InsertCount
        BEGIN
            INSERT INTO dbo.Basket VALUES (1, SYSDATETIME(), NULL)
            SET @i += 1
        END
    END
    GO

    -- Add 1000000 records
    EXEC dbo.usp_BasketInsert 1000000
    GO

The insert part should be blazing fast. Again, it depends on your environment (CPU, RAM, disk, and virtualization). My insert was done in less than three seconds on an average machine, but a significant improvement should be visible in any case. Execute the following SELECT statement to count the number of records:

    SELECT COUNT(*)
    FROM dbo.Basket
    GO

    -----------
    1000004

    (1 row(s) affected)

In my case, counting one million records took less than one second. It is really hard to achieve this performance on any kind of disk. Let's try another query. We want to know how much time it will take to find the top 10 records where the insert time was longer than 10 microseconds:

    -- Replace the literal with a start time close to your own insert run:
    -- DATEDIFF returns an INT, so large gaps in microseconds overflow.
    SELECT TOP 10 BasketID, CreatedDate
    FROM dbo.Basket
    WHERE DATEDIFF
        (MICROSECOND, '2017-05-30 15:17:20.9308732', CreatedDate)
        > 10
    GO

Again, query execution time was less than a second. Even if you remove TOP and try to get all the records, it will take less than a second (in my case). The advantages of In-Memory tables are more than obvious.

We learned about the basic concepts of In-Memory OLTP and how to implement it on new and existing databases. We also saw that a memory-optimized table can be created with two DURABILITY values and, finally, we created In-Memory tables and a natively compiled stored procedure.
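One caveat when adapting the DATEDIFF query above: DATEDIFF returns a 32-bit INT, so the gap it can measure in microseconds is limited. The following Python sketch (an illustration of the arithmetic, not T-SQL; the helper names are hypothetical) computes the largest span that fits for a given unit.

```python
from datetime import datetime, timedelta

INT_MAX = 2**31 - 1  # largest value a T-SQL INT can hold


def max_datediff_span(unit_in_microseconds: int) -> timedelta:
    """Largest gap DATEDIFF can report for a unit of the given size."""
    return timedelta(microseconds=INT_MAX * unit_in_microseconds)


def fits_in_datediff(start: datetime, end: datetime,
                     unit_in_microseconds: int) -> bool:
    """True if the gap between start and end fits in an INT for that unit."""
    return abs(end - start) <= max_datediff_span(unit_in_microseconds)


print("MICROSECOND limit:", max_datediff_span(1))
print("MILLISECOND limit:", max_datediff_span(1_000))
```

With MICROSECOND the limit is a little under 36 minutes, so the start time in the query has to be close to the time of your own insert run. On SQL Server 2016 and later, DATEDIFF_BIG returns a BIGINT and avoids the limit.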
If you found this article useful, check out the book SQL Server on Linux, which covers advanced SQL Server topics and demonstrates the process of setting up a SQL Server database solution in the Linux environment.

Fatema Patrawala
17 Feb 2018
14 min read

How to use LabVIEW for data acquisition

[box type="note" align="" class="" width=""]This article is an excerpt taken from the book Data Acquisition Using LabVIEW, written by Behzad Ehsani. In this book you will learn to transform physical phenomena into computer-acceptable data using an object-oriented language.[/box]

Today we will discuss the basics of LabVIEW, focusing on its installation and on an example of a LabVIEW program, which is generally known as a Virtual Instrument (VI).

Introduction to LabVIEW

LabVIEW is a graphical development and testing environment unlike any other test and development tool available in the industry. LabVIEW sets itself apart from traditional programming environments by its completely graphical approach to programming. As an example, while the representation of a while loop in a text-based language such as C consists of several predefined, extremely compact, and sometimes extremely cryptic lines of text, a while loop in LabVIEW is actually a graphical loop. The environment is extremely intuitive and powerful, which makes for a short learning curve for the beginner. LabVIEW is based on what is called the G language, but there are still other languages, especially C, under the hood.

However, the ease of use and power of LabVIEW can be somewhat deceiving to a novice user. Many people have attempted to start projects in LabVIEW only because, at first glance, the graphical nature of the interface and the drag-and-drop concept used in LabVIEW appear to do away with the need for basic programming concepts and a classical education in programming science and engineering. This is far from the reality of using LabVIEW as the predominant development environment.
It is true that in many higher-level development and testing environments, especially when using complicated test equipment and complex mathematical calculations, or even creating embedded software, LabVIEW's approach is much more time-efficient and less bug-prone than writing the equivalent lines of code in a traditional text-based programming environment. Even so, one must be aware of LabVIEW's strengths and possible weaknesses.

LabVIEW does not completely replace the need for traditional text-based languages and, depending on the overall nature of a project, either LabVIEW or a traditional text-based language such as C may be the most suitable programming or test environment.

Installing LabVIEW

Installation of LabVIEW is very simple and just as routine as any modern-day program installation; that is, insert DVD 1 and follow the onscreen guided installation steps. LabVIEW comes on one DVD for the Mac and Linux versions but on four or more DVDs for the Windows edition (depending on additional software, different licensing, and additional libraries and packages purchased). In this book, we will use the LabVIEW 2013 Professional Development version for Windows.

Given the target audience of this book, we assume the user is fully capable of installing the program. Installation is also well documented by National Instruments (NI), and the mandatory 1-year support purchase with each copy of LabVIEW is a valuable source of live and e-mail help. The NI website (www.ni.com) also has many user support groups that are a great source of support, example code, discussion groups, local group events and meetings of fellow LabVIEW developers, and so on.

It's worth noting for those who are new to the installation of LabVIEW that the installation DVDs include much more than what an average user would need and pay for. We do strongly suggest that you install additional software (beyond what has been purchased and licensed or immediately needed!).
This additional software is fully functional in demo mode for 7 days, which may be extended for about a month with online registration. This is a very good opportunity to have hands-on experience with even more of the power and functionality that LabVIEW is capable of offering. The additional information gained by installing the other software available on the DVDs may help in further development of a given project.

Just imagine: if the current development of a robot only encompasses mechanical movements and sensors today, optical recognition is probably going to follow sooner than one may think. If data acquisition using expensive hardware and software is possible in one location today, the need for web sharing and remote control of the setup is just around the corner. It is very helpful to at least be aware of what packages are currently available and to be able to install and test them prior to a full purchase and implementation. The following screenshot shows what may be installed if almost all the software on all the DVDs is selected.

When installing a fresh version of LabVIEW, if you do decide to follow the given advice, make sure to click on the + sign next to each package you decide to install and prevent any installation of LabWindows/CVI... and Measurement Studio... for Visual Studio. LabWindows, according to NI, is an ANSI C integrated development environment.

Also note that, by default, NI device drivers are not selected to be installed. Device drivers are an essential part of any data acquisition, and appropriate drivers for communications and instrument control must be installed before LabVIEW can interact with external equipment. Device drivers (on Windows installations) come on a separate DVD, which means that they do not have to be installed at the same time as the main application and other modules; they can be installed at any time later on.
Almost all well-established vendors package their products with LabVIEW drivers and example code. If a driver is not readily available, NI has programmers who can write one, but this comes at a cost to the user.

VI Package Manager, now installed as part of the standard installation, is also a must these days. NI distributes third-party software, drivers, and public domain packages via VI Package Manager. We are going to use examples with Arduino (http://www.arduino.cc) microcontrollers in later chapters of this book. Appropriate software and drivers for these microcontrollers are installed via VI Package Manager. You can install many public domain packages that add useful LabVIEW toolkits to a LabVIEW installation, and these can be used just like those delivered professionally by NI.

Finally, note that the more modules, packages, and software are selected for installation, the longer the installation will take. This may sound like an obvious point but, surprisingly enough, installation of all software on the three DVDs (for Windows) took over 5 hours on the standard laptop or PC we used. Obviously, a more powerful PC (such as one with a solid-state drive) may not take as long.

Basic LabVIEW VI

Once the LabVIEW application is launched, two blank windows open simultaneously by default (a Front Panel and a Block Diagram window), and a VI is created.

VIs are the heart and soul of LabVIEW. They are what separate LabVIEW from all text-based development environments. In LabVIEW, everything is an object that is represented graphically. A VI may consist of only a few objects or of hundreds of objects embedded in many subVIs. Everything, be it a simple while loop, a complex mathematical concept such as Polynomial Interpolation, or simply a Boolean constant, is represented graphically.
To use an object, right-click inside the Block Diagram or Front Panel window; a palette list appears. Follow the arrow, pick an object from the list of objects in the subsequent palette, and place it on the appropriate window. The selected object can now be dragged to different locations on that window and is ready to be wired.

Depending on what kind of object is selected, a graphical representation of the object appears on both windows. Of course, there are many exceptions to this rule. For example, a while loop can only be selected in the Block Diagram and, by itself, a while loop does not have a graphical representation on the Front Panel window. Needless to say, LabVIEW also has keyboard combinations that expedite selecting and placing any given toolkit object onto the appropriate window.

Each object has one (or several) wire connections going into it as input(s) and coming out as its output(s). A VI becomes functional when a minimum number of wires are appropriately connected to the inputs and outputs of one or more objects.

Example 1 – counter with a gauge

This is a fairly simple program with simple user interaction. Once the program has been launched, it uses a while loop to wait for user input. This is typical behavior for almost any user-friendly program. For example, if the user launches Microsoft Office, the program launches and waits for the user to pick a menu item, click on a button, or perform any other action that the program may provide. Similarly, this program starts execution but waits in a loop for the user to choose a command. In this case, only a simple Start or Stop is available. If the Start button is clicked, the program uses a for loop to simply count from 0 to 10 in intervals of 200 milliseconds. After each count is completed, the gauge on the Front Panel, the GUI part of the program, is updated to show the current count.
The counter is then set to the zero location of the gauge and the program awaits subsequent user input. If the Start button is clicked again, this action is repeated and, obviously, if the Stop button is clicked, the program exits.

Although this example is very simple, you can find in it many of the concepts that are often used in much more elaborate programs. Let's walk through the code and point out some of these concepts. The following steps not only walk the reader through the example code but are also a brief tutorial on how to use LabVIEW, how to utilize each working window, and how to wire objects.

Launch LabVIEW and, from the File menu, choose New VI and follow these steps:

1. Right-click on the Block Diagram window.
2. From Programming Functions, choose Structures and select While Loop.
3. Click (and hold) and drag the cursor to create a (resizable) rectangle.
4. On the bottom-left corner, right-click on the wire to the stop loop and choose Create a control. Note that a Stop button appears on both the Block Diagram and Front Panel windows.
5. Inside the while loop box, right-click on the Block Diagram window and, from Programming Functions, choose Structures and select Case Structures. Click (and hold) and drag the cursor to create a (resizable) rectangle.
6. On the Front Panel window, next to the Stop button created, right-click and, from Modern Controls, choose Boolean and select an OK button. Double-click on the text label of the OK button and replace the OK button text with Start. Note that an OK button is also created on the Block Diagram window, and the text label on that button also changed when you changed the text label on the Front Panel window.
7. On the Block Diagram window, drag and drop the newly created Start button next to the tiny green question mark on the left-hand side of the Case Structure box, outside of the case structure but inside the while loop. Wire the Start button to the Case Structure.
8. Inside the Case Structure box, right-click on the Block Diagram window and, from Programming Functions, choose Structures and select For Loop. Click (and hold) and drag the cursor to create a (resizable) rectangle.
9. Inside the Case Structure box, right-click on N on the top-left side of the For Loop and choose Create Constant. A blue integer box with a value of 0 will be connected to the For Loop. This is the number of iterations the for loop is going to run. Change 0 to 11.
10. Inside the For Loop box, right-click on the Block Diagram window and, from Programming Functions, choose Timing and select Wait(ms).
11. Right-click on the Wait function created in the previous step and connect an integer value of 200, in the same way as you created the constant for N.
12. On the Front Panel window, right-click and, from Modern functions, choose Gauge. Note that a Gauge function will appear on the Block Diagram window too. If the function is not inside the For Loop, drag and drop it inside the For Loop.
13. Inside the For Loop, on the Block Diagram window, connect the iteration count i to the Gauge.
14. On the Block Diagram, right-click on the Gauge and, under the Create submenu, choose Local variable. If it is not already inside the while loop, drag and drop it inside the while loop but outside of the case structure.
15. Right-click on the local variable created in the previous step and connect a zero to the input of the local variable.
16. Click on the Clean Up icon on the main menu bar of the Block Diagram window, and drag and move items on the Front Panel window so that both windows look similar to the following screenshots.

Creating a project is a must

When LabVIEW is launched, a default screen such as the one in the following screenshot appears.

The most common way of using LabVIEW, at least at the beginning of a small project or test program, is to create a new VI. A common rule of programming is that each function, or in this case each VI, should not be larger than a page.
Keep in mind that, by nature, LabVIEW will have two windows to begin with and, being a graphical programming environment only, each VI may require more screen space than the similar text based development environment. To start off development and in order to set up all devices and connections required for tasks such as data acquisition, a developer may get the job done by simply creating one, and, more likely several VIs. Speaking from experience among engineers and other developers (in other words, in situations where R&D looms more heavily on the project than collecting raw data), quick VIs are more efficient initially, but almost all projects that start in this fashion end up growing very quickly and other people and other departments will need be involved and/or be fed the gathered data. In most cases, within a short time from the beginning of the project, technicians from the same department or related teams may be need to be trained to use the software in development. This is why it is best to develop the habit of creating a new project from the very beginning. Note the center button on the left-hand window in the preceding screenshot. Creating a new project (as opposed to creating VIs and sub-VIs) has many advantages and it is a must if the program created will have to run as an executable on computers that do not have LabVIEW installed on them. Later versions of LabVIEW have streamlined the creation of a project and have added many templates and starting points to them. Although, for the sake of simplicity, we created our first example with the creation of a simple VI, one could almost as easily create a project and choose from many starting points, templates, and other concepts (such as architecture) in LabVIEW. The most useful starting point for a complete and user-friendly application for data acquisition would be a state machine. 
Throughout the book, we will create simple VIs as a quick and simple way to illustrate a point but, by the end of the book, we will collect all of the VIs, icons, drivers, and sub-VIs in one complete state machine, all collected in one complete project. From the project created, we will create a standalone application that will not need the LabVIEW environment to execute, which could run on any computer that has LabVIEW runtime engine installed on it. To summarize, we went through the basics of LabVIEW and the main functionality of each of its icons by way of an actual user-interactive example. LabVIEW is capable of developing embedded systems, fuzzy logic, and almost everything in between! If you are interested to know more about LabVIEW, check out this book Data Acquisition Using LabVIEW.    
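For readers more used to text-based languages, the control flow of Example 1 can be restated roughly as follows. This is only a hedged Python analogue, not LabVIEW code: the function name run_counter, the commands list standing in for button clicks, and the update_gauge callback are all assumptions made for illustration.

```python
import time


def run_counter(commands, update_gauge, wait_ms=200, iterations=11):
    """Rough text-language analogue of the Example 1 VI (illustrative only).

    commands     -- hypothetical stand-in for the user's button clicks
    update_gauge -- hypothetical callback playing the role of the Gauge
    """
    for command in commands:              # outer while loop: wait for input
        if command == "stop":             # Stop button ends the loop
            break
        if command == "start":            # case structure, True case
            for i in range(iterations):   # for loop with N = 11 counts 0..10
                time.sleep(wait_ms / 1000.0)  # Wait (ms) with a constant of 200
                update_gauge(i)           # iteration terminal i wired to Gauge
        update_gauge(0)                   # local variable writes zero to Gauge


# Quick demonstration with the wait shortened to zero.
demo_readings = []
run_counter(["start", "stop"], demo_readings.append, wait_ms=0)
print(demo_readings)
```

The sketch fixes one execution order, whereas a dataflow diagram only orders nodes that are connected by wires, so treat it as a reading aid rather than an exact model of the VI.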

Sugandha Lahoti
16 Feb 2018
8 min read

Manipulating text data using Python Regular Expressions (regex)

Note: This article is an excerpt from a book written by Allan Visochek titled Practical Data Wrangling. This book covers practical data wrangling techniques in Python and R to turn your noisy data into relevant, insight-ready information.

In today's tutorial, we will learn how to manipulate text data using regular expressions in Python.

What is a regular expression

A regular expression, or regex for short, is simply a sequence of characters that specifies a certain search pattern. Regular expressions have been around for quite a while and are a field of computer science in and of themselves. In Python, regular expression operations are handled using Python's built-in re module. In this section, I will walk through the basics of creating regular expressions and using them. You can implement a regular expression with the following steps:

1. Specify a pattern string.
2. Compile the pattern string to a regular expression object.
3. Use the regular expression object to search a string for the pattern.
4. Optional: Extract the matched pattern from the string.

Writing and using a regular expression

The first step to creating a regular expression in Python is to import the re module:

    import re

Python regular expressions are expressed using pattern strings, which are strings that specify the desired search pattern. In its simplest form, a pattern string can consist only of letters, numbers, and spaces. The following pattern string expresses a search query for an exact sequence of characters. You can think of each character as an individual pattern. In later examples, I will discuss more sophisticated patterns:

    import re
    pattern_string = "this is the pattern"

The next step is to process the pattern string into an object that Python can use in order to search for the pattern. This is done using the compile() function of the re module. The compile() function takes the pattern string as an argument and returns a regex object:

    import re
    pattern_string = "this is the pattern"
    regex = re.compile(pattern_string)

Once you have a regex object, you can use it to search within a search string for the pattern specified in the pattern string. A search string is just the name for the string in which you are looking for a pattern. To search for the pattern, you can use the search() method of the regex object as follows:

    import re
    pattern_string = "this is the pattern"
    regex = re.compile(pattern_string)
    match = regex.search("this is the pattern")

If the pattern specified in the pattern string is in the search string, the search() method will return a match object. Otherwise, it returns None, which is Python's empty value. Since Python interprets True and False values rather loosely, the result of the search function can be used like a Boolean value in an if statement, which can be rather convenient:

    # ... (continuing from the code above)
    match = regex.search("this is the pattern")
    if match:
        print("this was a match!")

The search string this is the pattern should produce a match, because it matches exactly the pattern specified in the pattern string. The search function will produce a match if the pattern is found at any point in the search string, as the following demonstrates:

    # ... (continuing from the code above)
    match = regex.search("this is the pattern")
    if match:
        print("this was a match!")
    if regex.search("*** this is the pattern ***"):
        print("this was a match!")
    if not regex.search("this is not the pattern"):
        print("this was not a match!")

Special Characters

Regular expressions depend on the use of certain special characters in order to express patterns. Due to this, the following characters should not be used directly unless they are used for their intended purpose:

    . ^ $ * + ? {} () [] |

If you do need to use any of the previously mentioned characters in a pattern string to search for that character, you can write the character preceded by a backslash character. This is called escaping characters. The examples from here on use raw strings (the r prefix) so that Python passes the backslash through to the regex engine unchanged. Here's an example:

    pattern_string = r"c\*b"
    ## matches "c*b"

If you need to search for the backslash character itself, you use two backslash characters, as follows:

    pattern_string = r"c\\b"
    ## matches "c\b"

Matching whitespace

Using \s at any point in the pattern string matches a whitespace character. This is more general than the space character, as it also applies to tabs and newline characters:

    # ... (continuing from the code above)
    a_space_b = re.compile(r"a\sb")
    if a_space_b.search("a b"):
        print("'a b' is a match!")
    if a_space_b.search("1234 a b 1234"):
        print("'1234 a b 1234' is a match")
    if a_space_b.search("ab"):
        print("'ab' is a match")  # not printed: there is no whitespace between a and b

Matching the start of string

If the ^ character is used at the beginning of the pattern string, the regular expression will only produce a match if the pattern is found at the beginning of the search string:

    # ... (continuing from the code above)
    a_at_start = re.compile("^a")
    if a_at_start.search("a"):
        print("'a' is a match")
    if a_at_start.search("a 1234"):
        print("'a 1234' is a match")
    if a_at_start.search("1234 a"):
        print("'1234 a' is a match")  # not printed

Matching the end of a string

Similarly, if the $ symbol is used at the end of the pattern string, the regular expression will only produce a match if the pattern appears at the end of the search string:

    # ... (continuing from the code above)
    a_at_end = re.compile("a$")
    if a_at_end.search("a"):
        print("'a' is a match")
    if a_at_end.search("a 1234"):
        print("'a 1234' is a match")  # not printed
    if a_at_end.search("1234 a"):
        print("'1234 a' is a match")

Matching a range of characters

It is possible to match a range of characters instead of just one. This can add some flexibility to the pattern:

    [A-Z] matches all capital letters
    [a-z] matches all lowercase letters
    [0-9] matches all digits

    # ... (continuing from the code above)
    lower_case_letter = re.compile("[a-z]")
    if lower_case_letter.search("a"):
        print("'a' is a match")
    if lower_case_letter.search("B"):
        print("'B' is a match")  # not printed
    if lower_case_letter.search("123 A B 2"):
        print("'123 A B 2' is a match")  # not printed
    digit = re.compile("[0-9]")
    if digit.search("1"):
        print("'1' is a match")
    if digit.search("342"):
        print("'342' is a match")
    if digit.search("asdf abcd"):
        print("'asdf abcd' is a match")  # not printed

Matching any one of several patterns

If there is a fixed number of patterns that would constitute a match, they can be combined using the following syntax:

    (<pattern1>|<pattern2>|<pattern3>)

The following a_or_b regular expression will match any string where there is either an a character or a b character:

    # ... (continuing from the code above)
    a_or_b = re.compile("(a|b)")
    if a_or_b.search("a"):
        print("'a' is a match")
    if a_or_b.search("b"):
        print("'b' is a match")
    if a_or_b.search("c"):
        print("'c' is a match")  # not printed

Matching a sequence instead of just one character

If the + character comes after another character or pattern, the regular expression will match an arbitrarily long sequence (one or more repetitions) of that pattern. This is quite useful, because it makes it easy to express something like a word or number that can be of arbitrary length.

Putting patterns together

More sophisticated patterns can be produced by combining pattern strings one after the other. In the following example, I've created a regular expression that searches for a number strictly followed by a word. The pattern string that generates the regular expression is composed of the following:

    A pattern string that matches a sequence of digits: [0-9]+
    A pattern string that matches a whitespace character: \s
    A pattern string that matches a sequence of letters: [a-z]+
    A pattern string that matches either the end of the string or a whitespace character: (\s|$)

    # ... (continuing from the code above)
    number_then_word = re.compile(r"[0-9]+\s[a-z]+(\s|$)")

The regex split() function

Regex objects in Python also have a split() method. The split method splits the search string into a list of substrings. The splits occur at each location along the string where the pattern is identified. The result is a list of the strings that occur between instances of the pattern. If the pattern occurs at the beginning or end of the search string, an empty string is included at the beginning or end of the resulting list, respectively. Note that because the a_or_b pattern is wrapped in parentheses (a capturing group), split() also includes each matched a or b in the result; a pattern without parentheses, such as a|b, returns only the substrings between matches:

    # ... (continuing from the code above)
    print(a_or_b.split("123a456b789"))
    print(a_or_b.split("a1b"))

If you are interested, the Python documentation has more complete coverage of regular expressions. It can be found at https://docs.python.org/3.6/library/re.html.

We saw various ways of using regular expressions in Python. To know more about data wrangling techniques using simple and real-world datasets, you may check out the book Practical Data Wrangling.
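The article describes escaping, the + quantifier, and the combined number_then_word pattern without running them, so the following is a minimal sketch that exercises each one. The helper first_match and the sample strings are assumptions introduced here for illustration; only re.compile(), search(), and split() come from the article.

```python
import re

# Escaping: the backslash makes "*" a literal character.
literal_star = re.compile(r"c\*b")

# "+" repeats the preceding pattern one or more times.
digits = re.compile(r"[0-9]+")

# Digits, one whitespace character, lowercase letters, then whitespace or end.
number_then_word = re.compile(r"[0-9]+\s[a-z]+(\s|$)")


def first_match(regex, text):
    """Return the text of the first match, or None if there is no match."""
    match = regex.search(text)
    return match.group(0) if match else None


print(first_match(literal_star, "abc*bd"))             # c*b
print(first_match(digits, "order 12345 shipped"))      # 12345
print(first_match(number_then_word, "I have 3 apples"))  # 3 apples
print(first_match(number_then_word, "3apples"))        # None

# split() with and without a capturing group around the alternatives.
print(re.compile(r"a|b").split("123a456b789"))    # ['123', '456', '789']
print(re.compile(r"(a|b)").split("a1b"))          # ['', 'a', '1', 'b', '']
```

Calling group(0) on the match object is one way to carry out the optional fourth step, extracting the matched pattern from the string.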

Vijin Boricha
16 Feb 2018
4 min read

How to build and enable the Jenkins Mesos plugin

Note: This article is an excerpt from a book by David Blomquist and Tomasz Janiszewski titled Apache Mesos Cookbook. From this book, you will get to know tips and tricks along with best practices to follow when working with Mesos.

In today's tutorial, we will learn about building and enabling the Jenkins Mesos plugin.

Building the Jenkins Mesos plugin

By default, Jenkins uses statically created agents and runs jobs on them. We can extend this behavior with a plugin that will make Jenkins use Mesos as a resource manager. Jenkins will register as a Mesos framework and accept offers when it needs to run a job.

How to do it

Installing the Jenkins Mesos plugin is a little harder than installing Marathon. There are no official binary packages for it, so it must be installed from source:

1. First of all, we need to download the source code:

    curl -L https://github.com/jenkinsci/mesos-plugin/archive/mesos-0.14.0.tar.gz | tar -zx
    cd jenkinsci-mesos-plugin-*

2. The plugin is written in Java, and to build it we need Maven (mvn):

    sudo apt install maven

3. Finally, build the package:

    mvn package

If everything goes smoothly, you should see a message that all tests passed, and the plugin package will be placed in target/mesos.hpi.

Jenkins is written in Java and presents an API for creating plugins. Plugins do not have to be written in Java, but they must be compatible with those interfaces, so most plugins are written in Java. The natural choice for building a Java application is Maven, although Gradle is getting more and more popular. The Jenkins Mesos plugin uses the Mesos native library to communicate with Mesos. This communication method is now deprecated, so the plugin does not support all Mesos features that are available with the Mesos HTTP API.

Enabling the Jenkins Mesos plugin

Here you will learn how to enable the Mesos Jenkins plugin and configure a job to be run on Mesos.

How to do it...

The first step is to install the Mesos Jenkins plugin. To do so, navigate to the Plugin Manager by clicking Manage Jenkins | Manage Plugins, and select the Advanced tab. You should see the following screen. Click Choose file and select the previously built plugin to upload it.

Once the plugin is installed, you have to configure it. To do so, go to the configuration (Manage Jenkins | Configure System). At the bottom of the page, the cloud section should appear. Fill in all the fields with the desired configuration values.

This was the last step of the plugin installation. If you now disable Advanced On-demand framework registration, you should see the Jenkins Scheduler registered in the Mesos frameworks.

Remember to set Slave username to an existing system user on the Mesos agents. It will be used to run your jobs. By default, it will be jenkins. You can create it on the agents with the following command:

    adduser jenkins

Be careful when providing an IP or hostname for Mesos and Jenkins. It must match the IP used later by the scheduler for communication. By default, the Mesos native library binds to the interface that the hostname resolves to. This could lead to problems in communication, especially when receiving messages from Mesos. If you see that Jenkins is connected, but jobs are stuck and agents do not start, check whether Jenkins is registered with the proper IP. You can set the IP used by Jenkins by adding the following line in /etc/default/jenkins (in this example, we assume Jenkins should bind on 10.10.10.10):

    LIBPROCESS_IP=10.10.10.10

We learned about building and enabling the Jenkins Mesos plugin. You can learn more about how to configure and maintain Apache Mesos from Apache Mesos Cookbook.
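Since the article warns that the Mesos native library binds to whatever address the hostname resolves to, a quick check on the Jenkins host can show whether LIBPROCESS_IP needs to be set. The sketch below is an illustration using only the Python standard library; the function names and the injectable resolver parameter are assumptions made here, and 10.10.10.10 is simply the example address from the article.

```python
import socket


def resolved_ip(hostname=None, resolver=socket.gethostbyname):
    """Return the IPv4 address the hostname resolves to, or None on failure."""
    name = hostname or socket.gethostname()
    try:
        return resolver(name)
    except OSError:  # socket.gaierror is a subclass of OSError
        return None


def needs_libprocess_ip(expected_ip, hostname=None,
                        resolver=socket.gethostbyname):
    """True when the resolved address differs from the one Mesos should use."""
    return resolved_ip(hostname, resolver) != expected_ip


if needs_libprocess_ip("10.10.10.10"):
    print("Hostname resolves to", resolved_ip(),
          "- consider setting LIBPROCESS_IP=10.10.10.10")
else:
    print("Hostname already resolves to 10.10.10.10")
```

This only inspects name resolution on the local machine; it does not contact Mesos or Jenkins.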

Fatema Patrawala
16 Feb 2018
3 min read

How to install Elasticsearch in Ubuntu and Windows

Note: This article is an extract from the book Mastering Elastic Stack, co-authored by Ravi Kumar Gupta and Yuvraj Gupta. This book brushes up your basic knowledge of implementing the Elastic Stack and then dives deep into complex and advanced implementations.

In today's tutorial, we aim to learn the installation of Elasticsearch v5.1.1 on Ubuntu and Windows.

Installation of Elasticsearch on Ubuntu 14.04

In order to install Elasticsearch on Ubuntu, refer to the following steps:

1. Download Elasticsearch 5.1.1 as a Debian package using the terminal:

    wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-5.1.1.deb

2. Install the Debian package using the following command:

    sudo dpkg -i elasticsearch-5.1.1.deb

Elasticsearch will be installed in the /usr/share/elasticsearch directory. The configuration files will be present at /etc/elasticsearch. The init script will be present at /etc/init.d/elasticsearch. The log files will be present within the /var/log/elasticsearch directory.

3. Configure Elasticsearch to run automatically on bootup. If you are using a SysV init distribution, then run the following command:

    sudo update-rc.d elasticsearch defaults 95 10

The preceding command will print on screen:

    Adding system startup for /etc/init.d/elasticsearch

Check the status of Elasticsearch using the following command:

    sudo service elasticsearch status

Run Elasticsearch as a service using the following command:

    sudo service elasticsearch start

Elasticsearch may not start if you have any plugin installed that is not supported from version ES-5.0.x onwards. As such plugins have been deprecated, you must uninstall any of them left over from a prior version of ES. Remove a plugin after going to ES Home using the following command:

    bin/elasticsearch-plugin remove head

Usage of the Elasticsearch service command:

    sudo service elasticsearch {start|stop|restart|force-reload|status}

If you are using a systemd distribution, then run the following commands:

    sudo /bin/systemctl daemon-reload
    sudo /bin/systemctl enable elasticsearch.service

To verify the Elasticsearch installation, open http://localhost:9200 in a browser or run the following command from the command line:

    curl -X GET http://localhost:9200

Installation of Elasticsearch on Windows

In order to install Elasticsearch on Windows, refer to the following steps:

1. Download Elasticsearch 5.1.1 from its site using the following link: https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-5.1.1.zip. Opening the link downloads the ZIP package.
2. Extract the downloaded ZIP package using WinRAR, 7-Zip, or similar extraction software (if you don't have one of these, then download it). This will extract the files and folders into the directory.
3. Open the extracted folder and navigate to the bin folder inside it.
4. Click on the elasticsearch.bat file to run Elasticsearch. If this window is closed, Elasticsearch will stop running, as the node will shut down.
5. To verify the Elasticsearch installation, open http://localhost:9200 in the browser.

Installation of Elasticsearch as a service

After installing Elasticsearch as previously mentioned, open Command Prompt after navigating to the bin folder and use the following command:

    elasticsearch-service.bat install

Usage:

    elasticsearch-service.bat install | remove | start | stop | manager

To summarize, we learned how to install Elasticsearch on Ubuntu and Windows. If you are keen to know more about how to work with the Elastic Stack in a production environment, you can grab our comprehensive guide Mastering Elastic Stack.
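The verification step (opening http://localhost:9200) can also be scripted. The following is a minimal sketch using only the Python standard library; the function names are assumptions made for illustration, and it relies on the root endpoint returning a JSON document with a version.number field, which is what a running node normally reports.

```python
import json
from urllib.error import URLError
from urllib.request import urlopen


def parse_version(body):
    """Extract the version number from the JSON returned by the root endpoint."""
    info = json.loads(body)
    return info["version"]["number"]


def elasticsearch_version(url="http://localhost:9200", timeout=5):
    """Return the running node's version, or None if it cannot be reached."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return parse_version(response.read().decode("utf-8"))
    except (URLError, OSError):
        return None


# Parsing a trimmed-down sample response, without needing a running node.
sample_body = '{"name": "node-1", "version": {"number": "5.1.1"}}'
print(parse_version(sample_body))
```

With the node from this tutorial running, calling elasticsearch_version() should report 5.1.1; a None result suggests the service is not started or is listening elsewhere.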
Savia Lobo
16 Feb 2018
6 min read

How to auto-generate texts from Shakespeare writing using deep recurrent neural networks

Note: Our article is an excerpt from a book co-authored by Krishna Bhavsar, Naresh Kumar, and Pratap Dangeti, titled Natural Language Processing with Python Cookbook. This book gives unique recipes covering various aspects of performing Natural Language Processing with NLTK, a leading Python platform for NLP.

Today we will learn to use deep recurrent neural networks (RNN) to predict the next character based on a given length of preceding text. A model trained this way can generate text continuously, and with enough training epochs it can imitate the writing style of the original writer.

Getting ready...

The Project Gutenberg eBook of the complete works of William Shakespeare is used as the dataset to train the network for automated text generation. The raw file used for training can be downloaded from http://www.gutenberg.org/:

    >>> from __future__ import print_function
    >>> import numpy as np
    >>> import random
    >>> import sys

The following code is used to create a dictionary mapping characters to indices and vice versa, which we will use to convert text into indices at later stages. This is because deep learning models cannot understand English, and everything needs to be mapped into indices to train these models:

    >>> path = r'C:\Users\prata\Documents\book_codes\NLP_DL\shakespeare_final.txt'
    >>> text = open(path).read().lower()
    >>> characters = sorted(list(set(text)))
    >>> print('corpus length:', len(text))
    >>> print('total chars:', len(characters))
    >>> char2indices = dict((c, i) for i, c in enumerate(characters))
    >>> indices2char = dict((i, c) for i, c in enumerate(characters))

How to do it…

Before training the model, various preprocessing steps are involved to make it work. The following are the major steps involved:

1. Preprocessing: Prepare X and Y data from the entire story text file and convert them into an indexed, vectorized format.
2. Deep learning model training and validation: Train and validate the deep learning model.
3. Text generation: Generate the text with the trained model.

How it works...

The following lines of code describe the entire modeling process of generating text from Shakespeare's writings. Here we have chosen a sequence length of 40 characters as the context for predicting the next single character, which seems a fair choice. The extraction window also advances three characters at a time rather than one, which reduces (but does not remove) the overlap between consecutive extractions; the resulting sequences are semi-redundant, as the code comment says:

    # cut the text in semi-redundant sequences of maxlen characters
    >>> maxlen = 40
    >>> step = 3
    >>> sentences = []
    >>> next_chars = []
    >>> for i in range(0, len(text) - maxlen, step):
    ...     sentences.append(text[i: i + maxlen])
    ...     next_chars.append(text[i + maxlen])
    ...
    >>> print('nb sequences:', len(sentences))

The following screenshot depicts the total number of sentences considered, 193798, which is enough data for text generation.

The next code block is used to convert the data into a vectorized format for feeding into deep learning models, as the models cannot understand anything about text, words, sentences, and so on. Initially, NumPy arrays of the full dimensions are created with all zeros, and the relevant positions are then filled in using the dictionary mappings:

    # Converting indices into vectorized format
    >>> X = np.zeros((len(sentences), maxlen, len(characters)), dtype=bool)
    >>> y = np.zeros((len(sentences), len(characters)), dtype=bool)
    >>> for i, sentence in enumerate(sentences):
    ...     for t, char in enumerate(sentence):
    ...         X[i, t, char2indices[char]] = 1
    ...     y[i, char2indices[next_chars[i]]] = 1
    ...
    >>> from keras.models import Sequential
    >>> from keras.layers import Dense, LSTM, Activation, Dropout
    >>> from keras.optimizers import RMSprop

The deep learning model is created with an RNN, more specifically a Long Short-Term Memory (LSTM) network with 128 hidden neurons, and the output has one dimension per character. The number of columns in the array is the number of characters. Finally, the softmax function is used with the RMSprop optimizer. We encourage readers to try various other parameters to see how the results vary:

    # Model Building
    >>> model = Sequential()
    >>> model.add(LSTM(128, input_shape=(maxlen, len(characters))))
    >>> model.add(Dense(len(characters)))
    >>> model.add(Activation('softmax'))
    >>> # newer Keras releases name this argument learning_rate instead of lr
    >>> model.compile(loss='categorical_crossentropy', optimizer=RMSprop(lr=0.01))
    >>> print(model.summary())

As mentioned earlier, deep learning models train on number indices to map input to output (given a length of 40 characters, the model will predict the next best character). The following function rescales the predicted probabilities by the diversity value (the metric argument), samples one character index from the rescaled distribution, and returns that index so it can be converted back to the relevant character:

    # Function to convert prediction into index
    >>> def pred_indices(preds, metric=1.0):
    ...     preds = np.asarray(preds).astype('float64')
    ...     preds = np.log(preds) / metric
    ...     exp_preds = np.exp(preds)
    ...     preds = exp_preds / np.sum(exp_preds)
    ...     probs = np.random.multinomial(1, preds, 1)
    ...     return np.argmax(probs)

The model will be trained over 29 iterations (range(1, 30)) with a batch size of 128. The diversity is also varied to see its impact on the predictions:

    # Train and Evaluate the Model
    >>> for iteration in range(1, 30):
    ...     print('-' * 40)
    ...     print('Iteration', iteration)
    ...     model.fit(X, y, batch_size=128, epochs=1)
    ...     start_index = random.randint(0, len(text) - maxlen - 1)
    ...     for diversity in [0.2, 0.7, 1.2]:
    ...         print('\n----- diversity:', diversity)
    ...         generated = ''
    ...         sentence = text[start_index: start_index + maxlen]
    ...         generated += sentence
    ...         print('----- Generating with seed: "' + sentence + '"')
    ...         sys.stdout.write(generated)
    ...         for i in range(400):
    ...             x = np.zeros((1, maxlen, len(characters)))
    ...             for t, char in enumerate(sentence):
    ...                 x[0, t, char2indices[char]] = 1.
    ...             preds = model.predict(x, verbose=0)[0]
    ...             next_index = pred_indices(preds, diversity)
    ...             pred_char = indices2char[next_index]
    ...             generated += pred_char
    ...             sentence = sentence[1:] + pred_char
    ...             sys.stdout.write(pred_char)
    ...             sys.stdout.flush()
    ...         print("\nOne combination completed \n")

The results are shown in the next screenshot to compare the first iteration (Iteration 1) and the final iteration (Iteration 29). It is apparent that with enough training, the generated text is much better than in Iteration 1. Text generation after Iteration 29 is shown in this image.

Though the text generation seems magical, we have generated text using Shakespeare's writings, which suggests that with the right training and handling, we can imitate the writing style of a particular writer.

If you found this post useful, you may check out the book Natural Language Processing with Python Cookbook to analyze sentence structure and master lexical analysis, syntactic and semantic analysis, pragmatic analysis, and other NLP techniques.
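Two steps in this recipe are easy to misread: how the sliding window produces overlapping training pairs, and what the diversity value does to the predicted distribution. The sketch below restates both in plain Python without Keras or NumPy. The function names windows and rescale, and the toy inputs, are assumptions made here for illustration; rescale mirrors the log, divide, exponentiate, normalize steps of pred_indices but leaves out the random sampling.

```python
import math


def windows(text, maxlen, step):
    """Return (input sequence, next character) pairs, as in the recipe's loop."""
    return [(text[i:i + maxlen], text[i + maxlen])
            for i in range(0, len(text) - maxlen, step)]


def rescale(probabilities, diversity=1.0):
    """Sharpen (diversity < 1) or flatten (diversity > 1) a distribution.

    Assumes every probability is greater than zero.
    """
    logs = [math.log(p) / diversity for p in probabilities]
    peak = max(logs)                       # subtract the max for stability
    weights = [math.exp(value - peak) for value in logs]
    total = sum(weights)
    return [weight / total for weight in weights]


# With maxlen=4 and step=3, consecutive windows share one character.
print(windows("abcdefghij", maxlen=4, step=3))

# Lower diversity concentrates probability on the most likely character.
predictions = [0.5, 0.3, 0.2]
for diversity in (0.2, 0.7, 1.2):
    print(diversity, [round(p, 3) for p in rescale(predictions, diversity)])
```

A diversity of 1.0 leaves the distribution unchanged, which is why values below and above 1.0 are the interesting ones to compare.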

Sugandha Lahoti
16 Feb 2018
13 min read

4 ways to implement feature selection in Python for machine learning

[box type="note" align="" class="" width=""]This article is an excerpt from Ensemble Machine Learning. This book serves as a beginner's guide to combining powerful machine learning algorithms to build optimized models.[/box] In this article, we will look at different methods to select features from the dataset; and discuss types of feature selection algorithms with their implementation in Python using the Scikit-learn (sklearn) library: Univariate selection Recursive Feature Elimination (RFE) Principle Component Analysis (PCA) Choosing important features (feature importance) We have explained first three algorithms and their implementation in short. Further we will discuss Choosing important features (feature importance) part in detail as it is widely used technique in the data science community. Univariate selection Statistical tests can be used to select those features that have the strongest relationships with the output variable. The scikit-learn library provides the SelectKBest class, which can be used with a suite of different statistical tests to select a specific number of features. 
The following example uses the chi squared (chi^2) statistical test for non-negative features to select four of the best features from the Pima Indians onset of diabetes dataset: #Feature Extraction with Univariate Statistical Tests (Chi-squared for classification) #Import the required packages #Import pandas to read csv import pandas #Import numpy for array related operations import numpy #Import sklearn's feature selection algorithm from sklearn.feature_selection import SelectKBest #Import chi2 for performing chi square test from sklearn.feature_selection import chi2 #URL for loading the dataset url ="https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians diabetes/pima-indians-diabetes.data" #Define the attribute names names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class'] #Create pandas data frame by loading the data from URL dataframe = pandas.read_csv(url, names=names) #Create array from data values array = dataframe.values #Split the data into input and target X = array[:,0:8] Y = array[:,8] #We will select the features using chi square test = SelectKBest(score_func=chi2, k=4) #Fit the function for ranking the features by score fit = test.fit(X, Y) #Summarize scores numpy.set_printoptions(precision=3) print(fit.scores_) #Apply the transformation on to dataset features = fit.transform(X) #Summarize selected features print(features[0:5,:]) You can see the scores for each attribute and the four attributes chosen (those with the highest scores): plas, test, mass, and age. Scores for each feature: [111.52   1411.887 17.605 53.108  2175.565   127.669 5.393 181.304] Selected Features: [[148. 0. 33.6 50. ] [85. 0. 26.6 31. ] [183. 0. 23.3 32. ] [89. 94. 28.1 21. ] [137. 168. 43.1 33. ]] Recursive Feature Elimination RFE works by recursively removing attributes and building a model on attributes that remain. 
It uses model accuracy to identify which attributes (and combinations of attributes) contribute the most to predicting the target attribute. You can learn more about the RFE class in the scikit-learn documentation. The following example uses RFE with the logistic regression algorithm to select the top three features. The choice of algorithm does not matter too much as long as it is skillful and consistent: #Import the required packages #Import pandas to read csv import pandas #Import numpy for array related operations import numpy #Import sklearn's feature selection algorithm from sklearn.feature_selection import RFE #Import LogisticRegression for performing chi square test from sklearn.linear_model import LogisticRegression #URL for loading the dataset url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-dia betes/pima-indians-diabetes.data" #Define the attribute names names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class'] #Create pandas data frame by loading the data from URL dataframe = pandas.read_csv(url, names=names) #Create array from data values array = dataframe.values #Split the data into input and target X = array[:,0:8] Y = array[:,8] #Feature extraction model = LogisticRegression() rfe = RFE(model, 3) fit = rfe.fit(X, Y) print("Num Features: %d"% fit.n_features_) print("Selected Features: %s"% fit.support_) print("Feature Ranking: %s"% fit.ranking_) After execution, we will get: Num Features: 3 Selected Features: [ True False False False False   True  True False] Feature Ranking: [1 2 3 5 6 1 1 4] You can see that RFE chose the the top three features as preg, mass, and pedi. These are marked True in the support_ array and marked with a choice 1 in the ranking_ array. Principle Component Analysis PCA uses linear algebra to transform the dataset into a compressed form. Generally, it is considered a data reduction technique. 
A property of PCA is that you can choose the number of dimensions or principal components in the transformed result. In the following example, we use PCA and select three principal components:

    # Import pandas to read the CSV file
    import pandas
    # Import numpy for array-related operations
    import numpy
    # Import sklearn's PCA algorithm
    from sklearn.decomposition import PCA
    # URL for loading the dataset
    url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
    # Define the attribute names
    names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
    dataframe = pandas.read_csv(url, names=names)
    # Create an array from the data values
    array = dataframe.values
    # Split the data into input and target
    X = array[:,0:8]
    Y = array[:,8]
    # Feature extraction
    pca = PCA(n_components=3)
    fit = pca.fit(X)
    # Summarize components (the % formatting must happen inside print())
    print("Explained Variance: %s" % fit.explained_variance_ratio_)
    print(fit.components_)

You can see that the transformed dataset (three principal components) bears little resemblance to the source data:

    Explained Variance: [ 0.88854663  0.06159078  0.02579012]
    [[ -2.02176587e-03   9.78115765e-02   1.60930503e-02   6.07566861e-02
        9.93110844e-01   1.40108085e-02   5.37167919e-04  -3.56474430e-03]
     [ -2.26488861e-02  -9.72210040e-01  -1.41909330e-01   5.78614699e-02
        9.46266913e-02  -4.69729766e-02  -8.16804621e-04  -1.40168181e-01]
     [ -2.24649003e-02   1.43428710e-01  -9.22467192e-01  -3.07013055e-01
        2.09773019e-02  -1.32444542e-01  -6.39983017e-04  -1.25454310e-01]]

Choosing important features (feature importance)

Feature importance is the technique used to select features using a trained supervised classifier. When we train a classifier such as a decision tree, we evaluate each attribute to create splits; we can use this measure as a feature selector. Let's understand it in detail.
Random forests are among the most popular machine learning methods thanks to their relatively good accuracy, robustness, and ease of use. They also provide two straightforward methods for feature selection: mean decrease impurity and mean decrease accuracy.

A random forest consists of a number of decision trees. Every node in a decision tree is a condition on a single feature, designed to split the dataset into two so that similar response values end up in the same set. The measure based on which the (locally) optimal condition is chosen is known as impurity. For classification, it is typically either the Gini impurity or information gain/entropy, and for regression trees, it is the variance. Thus, when training a tree, we can compute how much each feature decreases the weighted impurity in that tree. For a forest, the impurity decrease from each feature can be averaged over the trees, and the features are ranked according to this measure.

Let's see how to do feature selection using a random forest classifier and evaluate the accuracy of the classifier before and after feature selection. We will use the Otto dataset. This dataset is available for free from Kaggle (you will need to sign up to Kaggle to be able to download it). You can download the training dataset, train.csv.zip, from https://www.kaggle.com/c/otto-group-product-classification-challenge/data and place the unzipped train.csv file in your working directory.

This dataset describes 93 obfuscated details of more than 61,000 products grouped into nine product categories (for example, fashion, electronics, and so on). Input attributes are the counts of different events of some kind. The goal is to make predictions for new products as an array of probabilities for each of the nine categories, and models are evaluated using multiclass logarithmic loss (also called cross entropy).
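To make the impurity-based ranking described above concrete, here is a minimal sketch, assuming Gini impurity as the measure and a tiny made-up set of class labels. The function names gini and impurity_decrease are illustrative and are not part of scikit-learn.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def impurity_decrease(parent, left, right):
    """Weighted impurity decrease obtained by splitting parent into left and right."""
    n = len(parent)
    weighted_children = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted_children

# Hypothetical labels at one node and two candidate splits of that node
parent = ["a", "a", "a", "b", "b", "b"]
clean_split = impurity_decrease(parent, ["a", "a", "a"], ["b", "b", "b"])
poor_split = impurity_decrease(parent, ["a", "a", "b"], ["a", "b", "b"])
print(clean_split, poor_split)
```

A feature whose condition produces the clean split earns a larger impurity decrease than one that produces the poor split; averaging such decreases over every node and tree that uses a feature is what gives the ranking.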
We will start by importing all of the libraries:

    # Import pandas to load the dataset from the CSV file
    from pandas import read_csv
    # Import numpy for array-based operations and calculations
    import numpy as np
    # Import the random forest classifier class from sklearn
    from sklearn.ensemble import RandomForestClassifier
    # Import sklearn's SelectFromModel feature selector class
    from sklearn.feature_selection import SelectFromModel
    np.random.seed(1)

Let's define a function to split our dataset into training and testing data; we will train the model on the training part, and the testing part will be used to evaluate the trained model:

    # Function to create train and test sets from the original dataset
    def getTrainTestData(dataset, split):
        np.random.seed(0)
        training = []
        testing = []
        np.random.shuffle(dataset)
        shape = np.shape(dataset)
        # A plain int avoids overflow for datasets with more than 65,535 rows
        trainlength = int(np.floor(split * shape[0]))
        for i in range(trainlength):
            training.append(dataset[i])
        for i in range(trainlength, shape[0]):
            testing.append(dataset[i])
        training = np.array(training)
        testing = np.array(testing)
        return training, testing

We also need a function to evaluate the accuracy of the model; it takes the predicted and actual outputs and returns the fraction of correct predictions:

    # Function to evaluate model performance
    def getAccuracy(pre, ytest):
        count = 0
        for i in range(len(ytest)):
            if ytest[i] == pre[i]:
                count += 1
        acc = float(count) / len(ytest)
        return acc

Now it is time to load the dataset. We will load the train.csv file; this file contains more than 61,000 training instances.
We will use 50,000 instances for our example, of which 35,000 will be used to train the classifier and 15,000 to test its performance:

    # Load the dataset as a pandas data frame
    data = read_csv('train.csv')
    # Extract attribute names from the data frame
    feat = data.keys()
    # .values replaces get_values(), which was removed from recent pandas versions
    feat_labels = feat.values
    # Extract data values from the data frame
    dataset = data.values
    # Shuffle the dataset
    np.random.shuffle(dataset)
    # We will work with 50,000 instances
    inst = 50000
    # Extract 50,000 instances from the dataset
    dataset = dataset[0:inst,:]
    # Create training and testing data for performance evaluation
    train, test = getTrainTestData(dataset, 0.7)
    # Split data into input and output variables
    Xtrain = train[:,0:94]
    ytrain = train[:,94]
    shape = np.shape(Xtrain)
    print("Shape of the dataset ", shape)
    # Print the size of the data in MB
    print("Size of Data set before feature selection: %.2f MB" % (Xtrain.nbytes/1e6))

Let's take note of the data size here. As our training set contains 35,000 instances with 94 attributes, it is quite large. Let's see:

    Shape of the dataset (35000, 94)
    Size of Data set before feature selection: 26.32 MB

As you can see, we have 35,000 rows and 94 columns in our dataset, which is more than 26 MB of data. In the next code block, we will configure our random forest classifier; we will use 250 trees with a maximum depth of 30, and the number of random features will be 7.
Other hyperparameters will keep their sklearn defaults:

    # Select the test data for model evaluation
    Xtest = test[:,0:94]
    ytest = test[:,94]
    # Create a random forest classifier with the following parameters
    trees = 250
    max_feat = 7
    max_depth = 30
    min_sample = 2
    clf = RandomForestClassifier(n_estimators=trees,
                                 max_features=max_feat,
                                 max_depth=max_depth,
                                 min_samples_split=min_sample,
                                 random_state=0,
                                 n_jobs=-1)
    # Train the classifier and measure the training time
    import time
    start = time.time()
    clf.fit(Xtrain, ytrain)
    end = time.time()
    # Note down the model training time
    print("Execution time for building the Tree is: %f" % (float(end) - float(start)))
    pre = clf.predict(Xtest)

Let's see how much time is required to train the model on the training dataset:

    Execution time for building the Tree is: 2.913641

    # Evaluate the model performance on the test data
    acc = getAccuracy(pre, ytest)
    print("Accuracy of model before feature selection is %.2f" % (100*acc))

The accuracy of our model is:

    Accuracy of model before feature selection is 98.82

As you can see, we are getting very good accuracy, as we are classifying almost 99% of the test data into the correct categories. This means we are classifying about 14,823 instances out of 15,000 in correct classes.

So, now my question is: should we go for further improvement? Well, why not? We should definitely go for more improvements if we can; here, we will use feature importance to select features. As you know, in the tree building process, we use an impurity measurement for node selection. The attribute whose split gives the lowest impurity is chosen as the node in the tree. We can use similar criteria for feature selection. We can give more importance to features that reduce impurity the most, and this can be done using the feature_importances_ attribute of a fitted sklearn classifier.
Let's find out the importance of each feature:

    # Once we have trained the model, we will rank all the features
    for feature in zip(feat_labels, clf.feature_importances_):
        print(feature)

    ('id', 0.33346650420175183)
    ('feat_1', 0.0036186958628801214)
    ('feat_2', 0.0037243050888530957)
    ('feat_3', 0.011579217472062748)
    ('feat_4', 0.010297382675187445)
    ('feat_5', 0.0010359139416194116)
    ('feat_6', 0.00038171336038056165)
    ('feat_7', 0.0024867672489765021)
    ('feat_8', 0.0096689721610546085)
    ('feat_9', 0.007906150362995093)
    ('feat_10', 0.0022342480802130366)

As you can see here, each feature has a different importance based on its contribution to the final prediction. We will use these importance scores to rank our features; in the following part, we will select those features that have a feature importance of more than 0.01 for model training:

    # Select features which have a higher contribution to the final prediction
    sfm = SelectFromModel(clf, threshold=0.01)
    sfm.fit(Xtrain, ytrain)

Here, we will transform the input dataset according to the selected feature attributes. In the next code block, we will transform the dataset. Then, we will check the size and shape of the new dataset:

    # Transform the input dataset
    Xtrain_1 = sfm.transform(Xtrain)
    Xtest_1 = sfm.transform(Xtest)
    # Let's see the size and shape of the new dataset
    print("Size of Data set after feature selection: %.2f MB" % (Xtrain_1.nbytes/1e6))
    shape = np.shape(Xtrain_1)
    print("Shape of the dataset ", shape)

    Size of Data set after feature selection: 5.60 MB
    Shape of the dataset (35000, 20)

Do you see the shape of the dataset? We are left with only 20 features after the feature selection process, which reduces the size of the dataset from 26.32 MB to 5.60 MB. That's a reduction of roughly 79% from the original dataset. In the next code block, we will train a new random forest classifier with the same hyperparameters as earlier and test it on the testing dataset.
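The reported sizes can be checked by hand. The sketch below assumes each array element occupies 8 bytes, which is consistent with the 26.32 MB figure above but is an assumption about how NumPy stores this particular array; the helper name size_mb is illustrative.

```python
def size_mb(rows, cols, bytes_per_element=8):
    """Size in MB (10**6 bytes) of a dense rows x cols array."""
    return rows * cols * bytes_per_element / 1e6

before = size_mb(35000, 94)   # all 94 input columns
after = size_mb(35000, 20)    # the 20 columns kept by the selector
reduction = 100 * (1 - after / before)
print("%.2f MB -> %.2f MB (%.1f%% smaller)" % (before, after, reduction))
```

Because the number of rows is unchanged, the reduction depends only on the column counts: 1 - 20/94, or a little under 79 percent.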
Let's see what accuracy we get after modifying the training set:

    # Model training time
    start = time.time()
    clf.fit(Xtrain_1, ytrain)
    end = time.time()
    print("Execution time for building the Tree is: %f" % (float(end) - float(start)))
    # Let's evaluate the model on the test data
    pre = clf.predict(Xtest_1)
    acc2 = getAccuracy(pre, ytest)
    print("Accuracy after feature selection %.2f" % (100*acc2))

    Execution time for building the Tree is: 1.711518
    Accuracy after feature selection 99.97

Can you see that? We have got 99.97 percent accuracy with the modified dataset, which means we are classifying 14,996 instances in correct classes, while previously we were classifying only 14,823 instances correctly. This is a huge improvement from the feature selection process; we can summarize all the results in the following table:

    Evaluation criteria | Before feature selection | After feature selection
    Number of features  | 94                       | 20
    Size of dataset     | 26.32 MB                 | 5.60 MB
    Training time       | 2.91 seconds             | 1.71 seconds
    Accuracy            | 98.82 percent            | 99.97 percent

The preceding table shows the practical advantages of feature selection. You can see that we have reduced the number of features significantly, which reduces the model complexity and the dimensions of the dataset. Training takes less time after the reduction in dimensions, and accuracy on the test data is higher than before, which suggests that the smaller feature set also reduces overfitting.

To summarize the article, we explored four ways of doing feature selection in machine learning. If you found this post useful, do check out the book Ensemble Machine Learning to know more about stacking generalization among other techniques.

Amey Varangaonkar
15 Feb 2018
8 min read

6 Popular Regression Techniques you must know

[box type="note" align="" class="" width=""]The following excerpt is taken from the book Statistics for Data Science, authored by IBM expert James D. Miller. This book gives a statistical view of building smart data models to help you get unique insights from the data.[/box]

In this article, we introduce you to the concept of regression analysis, one of the most popular techniques in machine learning. You will learn what regression analysis is, the different types of regression, and how to choose the right regression technique to build your data model.

What is Regression Analysis?

For starters, regression analysis or statistical regression is a process for estimating the relationships among variables. This process encompasses numerous techniques for modeling and analyzing variables, focusing on the relationship between a dependent variable and one (or more) independent variables (or predictors).

Regression analysis is the work done to identify and understand how the (best representative) value of a dependent variable (a variable that depends on other factors) changes when any one of the independent variables (a variable that stands alone and isn't changed by the other variables) is changed while the other independent variables stay the same.

A simple example might be how the total dollars spent on marketing (an independent variable) impacts the total sales dollars (a dependent variable) over a period of time (is it really as simple as more marketing equates to higher sales?). Or perhaps there is a correlation between the total marketing dollars spent (an independent variable), discounting a product's price (another independent variable), and the amount of sales (a dependent variable)?

[box type="info" align="" class="" width=""]Keep in mind this key point: regression analysis is used to understand which among the independent variables are related to the dependent variable(s), not just the relationship of these variables.
Also, the inference of causal relationships (between the independent and dependent variables) is an important objective. However, this can lead to illusions or false relationships, so caution is recommended![/box]

Overall, regression analysis can be thought of as estimating the conditional expectation of the value of the dependent variable, given the independent variables being observed, that is, endeavoring to predict the average value of the dependent variable when the independent variables are set to certain values. I call this the lever effect, meaning that when one increases or decreases the value of one component, it directly affects the value of at least one other (variable).

An alternate objective of the process of regression analysis is the establishment of location parameters or the quantile of a distribution. In other words, the idea is to determine values that may serve as a cutoff, dividing the range of a probability distribution.

You'll find that regression analysis can be a great tool for prediction and forecasting (not just complex machine learning applications). We'll explore some real-world examples later, but for now, let's look at some techniques for the process.

Popular regression techniques and approaches

You'll find that various techniques for carrying out regression analysis have been developed and accepted. These are:

  • Linear
  • Logistic
  • Polynomial
  • Stepwise
  • Ridge
  • Lasso

Linear regression

Linear regression is the most basic type of regression and is commonly used for predictive analysis projects. In fact, when you are working with a single predictor (variable), we call it simple linear regression, and if there are multiple predictor variables, we call it multiple linear regression. Simply put, linear regression uses linear predictor functions whose values are estimated from the data in the model.

Logistic regression

Logistic regression is a regression model where the dependent variable is a categorical variable.
In its basic (binary) form, this means that the variable has only two possible values, for example, pass/fail, win/lose, alive/dead, or healthy/sick. If the dependent variable has more than two possible values, one can use various modified logistic regression techniques, such as multinomial logistic regression, ordinal logistic regression, and so on.

Polynomial regression

When we speak of polynomial regression, the focus of this technique is on modeling the relationship between the independent variable and the dependent variable as an nth degree polynomial. Polynomial regression is considered to be a special case of multiple linear regression. The predictors resulting from the polynomial expansion of the baseline predictors are known as interaction features.

Stepwise regression

Stepwise regression is a technique that uses an automated procedure to repeatedly execute a step of logic: during each step, a variable is considered for addition to or subtraction from the set of independent variables based on some prespecified criterion.

Ridge regression

Often, predictor variables are identified as being interrelated. When this occurs, the regression coefficient of any one variable depends on which other predictor variables are included in the model and which ones are left out. Ridge regression is a technique where a small bias factor is added to the selected variables in order to improve this situation. Therefore, ridge regression is actually considered a remedial measure to alleviate multicollinearity amongst predictor variables.

Lasso regression

Lasso (Least Absolute Shrinkage and Selection Operator) regression is a technique where both predictor variable selection and regularization are performed in order to improve the prediction accuracy and interpretability of the result it produces.

Which technique should I choose?

In addition to the aforementioned regression techniques, there are numerous others to consider with, most likely, more to come.
With so many options, it's important to choose the technique that is right for your data and your project. Rather than selecting the right regression approach, it is more about selecting the most effective regression approach.

Typically, you use the data to identify the regression approach you'll use. You start by establishing statistics or a profile for your data. With this effort, you need to identify and understand the importance of the different variables, their relationships, coefficient signs, and their effect.

Overall, here's some generally good advice for choosing the right regression approach for your project:

  • Copy what others have done and had success with. Do the research. Incorporate the results of other projects into yours. Don't reinvent the wheel. Also, even if an observed approach doesn't quite fit as it was used, perhaps some simple adjustments would make it a good choice.
  • Keep your approach as simple as possible. Many studies show that simpler models generally produce better predictions. Start simple, and only make the model more complex as needed. The more complex you make your model, the more likely it is that you are tailoring the model to your dataset specifically, and generalizability suffers.
  • Check your work. As you evaluate methods, check the residual plots (more on this in the next section of this chapter) because they can help you avoid inadequate models and adjust your model for better results.
  • Use your subject matter expertise. No statistical method can understand the underlying process or subject area the way you do. Your knowledge is a crucial part and, most likely, the most reliable way of determining the best regression approach for your project.

Does it fit?

After selecting a model that you feel is appropriate for use with your data (also known as determining that the approach is the best fit), you need to validate your selection, that is, determine its fit.
A well-fitting regression model results in predicted values close to the observed data values. The mean model (which uses the mean for every predicted value) would generally be used if there were no informative predictor variables. The fit of a proposed regression model should, therefore, be better than the fit of the mean model.

As a data scientist, you will need to scrutinize the coefficient of determination, measure the standard error of the estimate, and analyze the significance of the regression parameters and their confidence intervals.

[box type="info" align="" class="" width=""]Remember that the better the fit of a regression model, the more precise, or simply better, the results are likely to be.[/box]

Finally, simpler models often produce more accurate results on new data. Keep this in mind when selecting an approach or a technique: even when the problem is complex, it is not always necessary to adopt a complex regression approach. Choosing the right technique, though, goes a long way in developing an accurate model.

If you found this excerpt useful, make sure to check out the book Statistics for Data Science for tips on building effective data models by leveraging the power of statistical tools and techniques.
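The comparison with the mean model can be made concrete with the coefficient of determination, R^2. The following is a minimal Python sketch, assuming a one-predictor least-squares fit and made-up marketing and sales numbers; the function names fit_simple_linear and r_squared are illustrative and do not come from the book.

```python
def fit_simple_linear(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def r_squared(y, predictions):
    """1 - SS_res / SS_tot; the mean model scores 0 by construction."""
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, predictions))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot

# Hypothetical data: marketing spend (x) and sales (y)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

intercept, slope = fit_simple_linear(x, y)
model_predictions = [intercept + slope * xi for xi in x]
mean_predictions = [sum(y) / len(y)] * len(y)

print("R^2 of the linear model:", r_squared(y, model_predictions))
print("R^2 of the mean model:", r_squared(y, mean_predictions))
```

An R^2 close to 1 means the model explains most of the variation that the mean model leaves unexplained, while a value near 0 means it does no better than always predicting the mean.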

Amey Varangaonkar
15 Feb 2018
8 min read

How to use R to boost your Data Model

[box type="note" align="" class="" width=""]The following book excerpt is taken from the title Statistics for Data Science, written by James D. Miller. This book is a comprehensive primer on the basic concepts of statistics and their application in different data science tasks.[/box]

In this article, we explain the implementation of boosting, a popular technique used to improve the performance of a data model, using the popular R programming language.

We will take a high-level look at a thought-provoking prediction problem drawn from Mastering Predictive Analytics with R, Second Edition, by James D. Miller and Rui Miguel Forte. Here, patterns made by radiation on a telescope camera are analyzed in an attempt to predict whether a certain pattern came from gamma rays leaking into the atmosphere or from regular background radiation. Gamma rays leave distinctive elliptical patterns, and so we can create a set of features to describe these. The dataset used is the MAGIC Gamma Telescope Data Set, hosted by the UCI Machine Learning Repository at http://archive.ics.uci.edu/ml/datasets/MAGIC+Gamma+Telescope

This data consists of 19,020 observations, each described by ten features and a class label.

Prepping the data

First, various steps need to be performed on our example data.
The data is first loaded into an R data frame object named magic, recoding the CLASS output variable to use classes 1 and -1 for gamma rays and background radiation respectively:

    > magic <- read.csv("magic04.data", header = FALSE)
    > names(magic) <- c("FLENGTH", "FWIDTH", "FSIZE", "FCONC", "FCONC1",
        "FASYM", "FM3LONG", "FM3TRANS", "FALPHA", "FDIST", "CLASS")
    > magic$CLASS <- as.factor(ifelse(magic$CLASS == 'g', 1, -1))

Next, the data is split into two data frames, one for training and one for testing, using an 80-20 split:

    > library(caret)
    > set.seed(33711209)
    > magic_sampling_vector <- createDataPartition(magic$CLASS,
        p = 0.80, list = FALSE)
    > magic_train <- magic[magic_sampling_vector, 1:10]
    > magic_train_output <- magic[magic_sampling_vector, 11]
    > magic_test <- magic[-magic_sampling_vector, 1:10]
    > magic_test_output <- magic[-magic_sampling_vector, 11]

The model used for boosting is a simple multilayer perceptron with a single hidden layer, built with R's nnet package. Neural networks often produce higher accuracy when inputs are normalized, so in this example, this preprocessing is performed before training any models:

    > magic_pp <- preProcess(magic_train, method = c("center",
        "scale"))
    > magic_train_pp <- predict(magic_pp, magic_train)
    > magic_train_df_pp <- cbind(magic_train_pp,
        CLASS = magic_train_output)
    > magic_test_pp <- predict(magic_pp, magic_test)

Training

Boosting is designed to work best with weak learners, so a very small number of hidden neurons is used in the model's hidden layer. Concretely, we will begin with the simplest possible multilayer perceptron, which uses a single hidden neuron. To understand the effect of using boosting, a baseline performance is established by training a single neural network (and measuring its performance).
This is accomplished as follows:

    > library(nnet)
    > n_model <- nnet(CLASS ~ ., data = magic_train_df_pp, size = 1)
    > n_test_predictions <- predict(n_model, magic_test_pp,
        type = "class")
    > (n_test_accuracy <- mean(n_test_predictions ==
        magic_test_output))
    [1] 0.7948988

This establishes that we have a baseline accuracy of around 79.5 percent. Not too bad, but can boosting improve upon this score?

To that end, the function AdaBoostNN(), which is shown as follows, is used. This function takes as input a data frame, the name of the output variable, the number of single hidden layer neural network models to be built, and finally, the number of hidden units these neural networks will have. The function will then implement the AdaBoost algorithm and return a list of models with their corresponding weights. Here is the function:

    AdaBoostNN <- function(training_data, output_column, M, hidden_units) {
      require("nnet")
      models <- list()
      alphas <- list()
      n <- nrow(training_data)
      model_formula <- as.formula(paste(output_column, '~ .', sep = ''))
      # Start with equal observation weights
      w <- rep((1/n), n)
      for (m in 1:M) {
        model <- nnet(model_formula, data = training_data,
                      size = hidden_units, weights = w)
        models[[m]] <- model
        predictions <- as.numeric(predict(model,
                         training_data[, -which(names(training_data) == output_column)],
                         type = "class"))
        errors <- predictions != training_data[, output_column]
        # Weighted error rate and the resulting model weight (alpha)
        error_rate <- sum(w * as.numeric(errors)) / sum(w)
        alpha <- 0.5 * log((1 - error_rate) / error_rate)
        alphas[[m]] <- alpha
        # Increase weights of misclassified observations, decrease the rest
        temp_w <- mapply(function(x, y) if (y) { x * exp(alpha) }
                         else { x * exp(-alpha) }, w, errors)
        w <- temp_w / sum(temp_w)
      }
      return(list(models = models, alphas = unlist(alphas)))
    }

The preceding function uses the following logic: First, initialize empty lists of models and model weights (alphas). Compute the number of observations in the training data, storing this in the variable n.
The name of the output column provided is then used to create a formula that describes the neural network that will be built. In the dataset used, this formula will be CLASS ~ ., meaning that the neural network will compute CLASS as a function of all the other columns as input features.

Next, initialize the weights vector and define a loop that will run for M iterations in order to build M models. In every iteration, the first step is to use the current setting of the weights vector to train a neural network using as many hidden units as specified in the input, hidden_units. Then, compute a vector of predictions that the model generates on the training data using the predict() function. By comparing these predictions to the output column of the training data, calculate the errors that the current model makes on the training data. This then allows the computation of the weighted error rate.

The error rate is used to compute the weight of the current model (alpha, half the log of the ratio of correct to incorrect weight) and, finally, the observation weights to be used in the next iteration of the loop are updated according to whether each observation was correctly classified. The weight vector is then normalized, and we are ready to begin the next iteration! After completing M iterations, output a list of models and their corresponding model weights.

Ready for boosting

There is now a function able to train our ensemble classifier using AdaBoost, but we also need a function to make the actual predictions. This function will take in the output list produced by our training function, AdaBoostNN(), along with a test dataset.
This function is AdaBoostNN.predict() and it is shown as follows:

    AdaBoostNN.predict <- function(ada_model, test_data) {
      models <- ada_model$models
      alphas <- ada_model$alphas
      # One column of class predictions per model
      prediction_matrix <- sapply(models, function (x)
        as.numeric(predict(x, test_data, type = "class")))
      # Scale each model's predictions by its weight
      weighted_predictions <- t(apply(prediction_matrix, 1,
        function(x) mapply(function(y, z) y * z, x, alphas)))
      # The sign of the weighted sum is the ensemble prediction
      final_predictions <- apply(weighted_predictions, 1,
        function(x) sign(sum(x)))
      return(final_predictions)
    }

This function first extracts the models and the model weights (from the list produced by the previous function). A matrix of predictions is created, where each column corresponds to the vector of predictions made by a particular model. Thus, there will be as many columns in this matrix as the models that we used for boosting.

We then multiply the predictions produced by each model by their corresponding model weight. For example, every prediction from the first model is in the first column of the prediction matrix and will have its value multiplied by the first model weight α1. Lastly, the matrix of weighted predictions is reduced into a single vector of predictions by summing the weighted predictions for each observation and taking the sign of the result. This vector of predictions is then returned by the function.

As an experiment, we will train ten neural network models with a single hidden unit and see if boosting improves accuracy. Note that AdaBoostNN.predict() takes only the model list and the test data:

    > ada_model <- AdaBoostNN(magic_train_df_pp, 'CLASS', 10, 1)
    > predictions <- AdaBoostNN.predict(ada_model, magic_test_pp)
    > mean(predictions == magic_test_output)
    [1] 0.804365

We see in this example that boosting ten models shows a marginal improvement in accuracy, but perhaps training more models might make more of a difference.
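The weight arithmetic used by the two functions above can be traced by hand. The following is a minimal sketch in Python rather than R, assuming a made-up set of five observations and a single weak learner that misclassifies one of them. The variable names mirror those in AdaBoostNN(), but the data and the helper weighted_vote are hypothetical.

```python
import math

# Hypothetical true labels and one weak learner's predictions (classes 1 and -1)
actual = [1, 1, -1, -1, 1]
predicted = [1, 1, -1, -1, -1]   # the last observation is misclassified

n = len(actual)
w = [1.0 / n] * n                # initial observation weights, all equal

# Weighted error rate and the resulting model weight (alpha)
errors = [p != a for p, a in zip(predicted, actual)]
error_rate = sum(wi for wi, e in zip(w, errors) if e) / sum(w)
alpha = 0.5 * math.log((1 - error_rate) / error_rate)

# Misclassified observations are up-weighted, the rest are down-weighted
temp_w = [wi * math.exp(alpha if e else -alpha) for wi, e in zip(w, errors)]
w = [wi / sum(temp_w) for wi in temp_w]

print("error rate:", error_rate)
print("alpha:", alpha)
print("updated weights:", w)

def weighted_vote(predictions, alphas):
    """Ensemble prediction for one observation: sign of the alpha-weighted sum."""
    total = sum(p * a for p, a in zip(predictions, alphas))
    return 1 if total > 0 else -1 if total < 0 else 0
```

After this one round, the single misclassified observation carries half of the total weight, so the next weak learner is pushed toward classifying it correctly. In the vote, a model with a large alpha can outweigh several weaker models that disagree with it.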
What we learned

From the preceding example, you may conclude that, for the neural networks with one hidden unit, as the number of boosting models increases, we see an improvement in accuracy, but after 100 models, this tapers off and is actually slightly less for 200 models. The improvement over the baseline of a single model is substantial for these networks. When we increase the complexity of our learner by having a hidden layer with three hidden neurons, we get a much smaller improvement in performance. At 200 models, both ensembles perform at a similar level, indicating that, at this point, our accuracy is being limited by the type of model trained.

If you found this article useful, make sure to check out the book Statistics for Data Science for interesting statistical techniques and their implementation in R.
Fatema Patrawala
15 Feb 2018
9 min read

Why choose R for your data mining project

[box type="note" align="" class="" width=""]Our article is an excerpt taken from the book R Data Mining, written by Andrea Cirillo. If you are a budding data scientist or a data analyst with basic knowledge of R, and you want to get into the intricacies of data mining in a practical manner, be sure to check out this book.[/box]

In today’s post, we will analyze R's strengths, and understand why it is a savvy idea to learn this programming language for data mining.

R's strengths

You know that R is really popular, but why? R is not the only data analysis language out there, and neither is it the oldest one; so why is it so popular? Looking at the root causes of R's popularity, we definitely have to mention these three:

  • Open source inside
  • Plugin ready
  • Data visualization friendly

Open source inside

One of the main reasons the adoption of R is spreading is its open source nature. R's source code is available for everyone to download, modify, and share back again (only in an open source way). Technically, R is released under the GNU General Public License, meaning that you can take it and use it for whatever purpose, but any derivative you distribute must also be released under the GNU General Public License. These attributes fit well for almost every target user of a statistical analysis language:

  • Academic user: Knowledge sharing is a must for an academic environment, and having the ability to share work without the worry of copyright and license questions makes R very practical for academic research purposes
  • Business user: Companies are always worried about budget constraints; having professional statistical analysis software at their disposal for free sounds like a dream come true
  • Private user: This user merges together both of the benefits already mentioned, because they will find it great to have a free instrument with which to learn and share their own statistical analyses

Plugin ready

You could imagine the R language as an expandable board game.
You know, games like 7 Wonders or Carcassonne, with a base set of characters and places and further optional places and characters, increasing the choices at your disposal and maximizing the fun. The R language can be compared to this kind of game. There is a base version of R, containing a group of default packages that are delivered along with the standard version of the software (you can skip to the Installing R and writing R code section for more on how to obtain and install it). The functionalities available through the base version are mainly related to file system manipulation, statistical analysis, and data visualization. While this base version is regularly maintained and updated by the R core team, virtually every R user can add further functionalities to those available within the base version by developing and sharing custom packages. This is basically how the package development and sharing flow works:

  1. The R user develops a new package, for example a package introducing a new machine learning algorithm presented in a freshly published academic paper.
  2. The user submits the package to the CRAN repository or a similar repository. The Comprehensive R Archive Network (CRAN) is the official repository for R-related documents and packages.
  3. Every R user can gain access to the additional features introduced with any given package by installing and loading it into their R environment.

If the package has been submitted to CRAN, installing and loading it takes just the following two lines of R code (similar commands are available for alternative repositories such as Bioconductor):

    install.packages("ggplot2")
    library(ggplot2)

As you can see, this is a really convenient and effective way to expand R functionalities, and you will soon see how wide the range of functionalities added through additional packages developed by R users is.
More than 9,000 packages are available on CRAN, and this number is sure to increase further, making more and more additional features available to the R community.

Data visualization friendly

As a discipline, data visualization encompasses all of the principles and techniques employable to effectively display the information and messages contained within a set of data. Since we are living in an information-heavy age, the ability to effectively and concisely communicate articulated and complex messages through data visualization is a core asset for any professional. This is exactly why R is experiencing a great response in academic and professional fields: the data visualization capabilities of R place it at the cutting edge of these fields. R has been noticed for its amazing data visualization features right from its beginning; when some of its peers still drew x axes by stringing together + signs, R was already able to produce astonishing 3D plots.

Nevertheless, a major improvement of R as a data visualization tool came when Auckland's Hadley Wickham developed the well-known ggplot2 package based on The Grammar of Graphics, introducing into the R world an organic framework for data visualization tasks. This package alone introduced the R community to a highly flexible way of producing almost every kind of data visualization. It was also designed as an expandable tool, so that new data visualization techniques can be incorporated as soon as they emerge. Finally, ggplot2 gives you the ability to highly customize your plot, adding every kind of graphical or textual annotation to it.

Nowadays, R is being used by the biggest tech companies, such as Facebook and Google, and by widely circulated publications such as the Economist and the New York Times to visualize their data and convey their information to their stakeholders and readers.

To sum all this up: should you invest your precious time learning R?
If you are a professional or a student who could gain advantages from knowing effective and cutting-edge techniques to manipulate, model, and present data, I can only give you a positive opinion: yes. You should definitely learn R, and consider it a long-term investment, since the points of strength we have seen place it in a great position to further expand its influence in the coming years in every industry and academic field.

Engaging with the community to learn R

Now that we are aware of R’s popularity, we need to engage with the community to take advantage of it. We will look at alternative and non-exclusive ways of engaging with the community:

  • Employing community-driven learning material
  • Asking for help from the community
  • Staying ahead of language developments

Employing community-driven learning material: There are two main kinds of R learning materials developed by the community:

  • Papers, manuals, and books
  • Online interactive courses

Papers, manuals, and books: The first kind is for sure the more traditional one, but you shouldn't neglect it, since those kinds of learning materials are always able to give you a more organic and systematic understanding of the topics they treat. You can find a lot of free material online in the form of papers, manuals, and books. Let me point out the most useful ones:

  • Advanced R
  • R for Data Science
  • Introduction to Statistical Learning
  • OpenIntro Statistics
  • The R Journal

Online interactive courses: This is probably the most common learning material nowadays. You can find different platforms delivering good content on the R language, the most famous of which are probably DataCamp, Udemy, and Packt itself. What all of them share is a practical and interactive approach that lets you learn the topic directly, applying it through exercises rather than passively watching someone explain theory.
Asking for help from the community: As soon as you start writing your first lines of R code, and perhaps before you even start writing it, you will come up with some questions related to your work. The best thing you can do when this happens is to turn to the community to solve those questions. You will probably not be the first one to come up with that question, so you should first of all look online for previous answers to it. Where should you look for answers? You can look everywhere, but most of the time you will find the answer you are looking for in one of the following places (listed by the probability of finding the answer there):

  • Stack Overflow
  • R-help mailing list
  • R packages documentation

I wouldn't suggest you look for answers on Twitter, G+, and similar networks, since they were not conceived to handle these kinds of processes, and you will expose yourself to the peril of reading answers that are out of date, or simply incorrect, because no review system is in place. If it turns out that you are asking an innovative question never previously asked by anyone, first of all, congratulations! In that happy circumstance, you can ask your question in the same places where you previously looked for answers.

Staying ahead of language developments: The R language landscape is constantly changing, thanks to the contributions of many enthusiastic users who take it a step further every day. How can you stay ahead of those changes? This is where social networks come in handy. Following the #rstats hashtag on Twitter, Google+ groups, and similar places will give you the pulse of the language. Moreover, you will find the R-bloggers aggregator really useful: it delivers a daily newsletter made up of the R-related blog posts that were published the previous day.
Finally, annual R conferences and similar occasions constitute a great opportunity to get in touch with the most renowned R experts, gaining from them useful insights and inspiring speeches about the future of the language.

To summarize, we looked at why to choose R as your programming language for data mining and how we can engage with the R community. If you think this post is useful, you may further check out the book R Data Mining to leverage data mining techniques across many different industries, including finance, medicine, scientific research, and more.

Aaron Lazar
15 Feb 2018
7 min read

How to make efficient data-driven decisions

[box type="note" align="" class="" width=""]This article is an extract from the book Predictive Analytics with TensorFlow, authored by Md. Rezaul Karim. The book will help you build, tune, and deploy predictive data models with TensorFlow.[/box]

Today we’ll learn to make data-driven decisions with the help of a few examples. The growing demand for data is a key challenge. Decision support teams such as institutional research and business intelligence often cannot make the right decisions on how to expand their business and research outcomes from a huge collection of data. Although data plays an important role in driving the decision, in reality, making the right decision at the right time is the goal. In other words, the goal is decision support, not data support. This can be achieved through an advanced use of data management and analytics.

Data value chain for making decisions

The following diagram in figure 1 (source: H. Gilbert Miller and Peter Mork, From Data to Decisions: A Value Chain for Big Data, Proc. Of IT Professional, Volume: 15, Issue: 1, Jan.-Feb. 2013, DOI: 10.1109/MITP.2013.11) shows the data chain towards making actual decisions, which is the goal. The value chain starts with the data discovery stage, consisting of several steps such as data collection and annotation, data preparation, and then organizing the data in a logical order with the desired flow. Then comes data integration, which establishes a common representation of the data. Since the target is to make the right decision, having the appropriate provenance of the data (that is, where it comes from) is important for future reference.

Now that your data is integrated into a presentable format, it's time for the data exploration stage, which consists of several steps such as analyzing the integrated data and visualizing it before deciding which actions to take on the basis of the interpreted results. However, is this enough before making the right decision?
Probably not! The reason is that it lacks enough analytics, which is what eventually helps to make the decision with an actionable insight. Predictive analytics comes in here to fill the gap. Let's see an example of how in the following section.

From disaster to decision: Titanic survival example

Here is the challenge, Titanic: Machine Learning from Disaster, from Kaggle (https://www.kaggle.com/c/titanic):

"The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships. One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class. In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy"

Before going deeper, we need to know about the data of the passengers travelling on the Titanic during the disaster, so that we can develop a predictive model that can be used for survival analysis. The dataset can be downloaded from the preceding URL. Table 1 here shows the metadata about the Titanic survival dataset:

A snapshot of the dataset can be seen as follows:

The ultimate target of using this dataset is to predict what kind of people survived the Titanic disaster. However, a bit of exploratory analysis of the dataset is a must.
First, we need to import the necessary packages and libraries:

    import pandas as pd
    import matplotlib.pyplot as plt
    import numpy as np

Now read the dataset and create a pandas DataFrame:

    df = pd.read_csv('/home/asif/titanic_data.csv')

Before drawing the distribution of the dataset, let's specify the parameters for the graph:

    # One figure holds all the subplots; a modest dpi keeps it manageable on screen
    fig = plt.figure(figsize=(18,6), dpi=100)
    alpha = alpha_scatterplot = 0.2
    alpha_bar_chart = 0.55

Draw a bar diagram showing who survived versus who did not:

    ax1 = plt.subplot2grid((2,3),(0,0))
    ax1.set_xlim(-1, 2)
    df.Survived.value_counts().plot(kind='bar', alpha=alpha_bar_chart)
    plt.title("Survival distribution: 1 = survived")

Plot a graph showing survival by Age:

    plt.subplot2grid((2,3),(0,1))
    plt.scatter(df.Survived, df.Age, alpha=alpha_scatterplot)
    plt.ylabel("Age")
    # Pass the on/off flag positionally; the old b= keyword is gone in recent matplotlib
    plt.grid(True, which='major', axis='y')
    plt.title("Survival by Age: 1 = survived")

Plot a graph showing the distribution of the passenger classes:

    ax3 = plt.subplot2grid((2,3),(0,2))
    df.Pclass.value_counts().plot(kind="barh", alpha=alpha_bar_chart)
    ax3.set_ylim(-1, len(df.Pclass.value_counts()))
    plt.title("Class dist. of the passengers")

Plot a kernel density estimate of the passengers' age within each class:

    plt.subplot2grid((2,3),(1,0), colspan=2)
    df.Age[df.Pclass == 1].plot(kind='kde')
    df.Age[df.Pclass == 2].plot(kind='kde')
    df.Age[df.Pclass == 3].plot(kind='kde')
    plt.xlabel("Age")
    plt.title("Age dist. within class")
    plt.legend(('1st Class', '2nd Class','3rd Class'),loc='best')

Plot a graph showing passengers per boarding location:

    ax5 = plt.subplot2grid((2,3),(1,2))
    df.Embarked.value_counts().plot(kind='bar', alpha=alpha_bar_chart)
    ax5.set_xlim(-1, len(df.Embarked.value_counts()))
    plt.title("Passengers per boarding location")

Finally, we show all the subplots together:

    plt.show()

The figure shows the survival distribution, survival by age, age distribution, and the passengers per boarding location:

However, to execute the preceding code, you need to install several packages such as matplotlib, pandas, and scipy. They are listed as follows:

Installing pandas: Pandas is a Python package for data manipulation. It can be installed as follows:

    $ sudo pip3 install pandas
    # For Python 2.7, use the following:
    $ sudo pip install pandas

Installing matplotlib: In the preceding code, matplotlib is a plotting library for mathematical objects. It can be installed as follows:

    $ sudo apt-get install python-matplotlib   # for Python 2.7
    $ sudo apt-get install python3-matplotlib  # for Python 3.x

Installing scipy: Scipy is a Python package for scientific computing. Installing blas, lapack, and gfortran is a prerequisite for this one. Now just execute the following commands on your terminal:

    $ sudo apt-get install libblas-dev liblapack-dev
    $ sudo apt-get install gfortran
    $ sudo pip3 install scipy  # for Python 3.x
    $ sudo pip install scipy   # for Python 2.7

For Mac, use the following commands to install the above modules (libblas-dev and liblapack-dev are Debian/Ubuntu package names rather than pip packages, so they are left out here):

    $ sudo easy_install pip
    $ sudo pip install pandas
    $ sudo pip install matplotlib
    $ sudo pip install scipy

For Windows, I am assuming that Python 2.7 is already installed at C:\Python27. Then open the command prompt and type the following commands:

    C:\Users\admin-karim>cd C:/Python27
    C:\Python27> python -m pip install <package_name> # provide package name accordingly.
For Python3, issue the following commands:

    C:\Users\admin-karim>cd C:\Users\admin-karim\AppData\Local\Programs\Python\Python35\Scripts
    C:\Users\admin-karim\AppData\Local\Programs\Python\Python35\Scripts>python3 -m pip install <package_name>

Well, we have seen the data. Now it's your turn to do some analytics on top of it, say, predicting what kinds of people survived the disaster. We have enough information about the passengers, but how could we do the predictive modeling so that we can draw some fairly straightforward conclusions from this data? For example, being a woman, being in 1st class, and being a child were all factors that could boost a passenger's chances of survival during this disaster.

In a brute-force approach, for example, using if/else statements with some sort of weighted scoring system, you could write a program to predict whether a given passenger would survive the disaster. However, does writing such a program in Python make much sense? Naturally, it would be very tedious to write, difficult to generalize, and would require extensive fine-tuning for each variable and sample (that is, passenger). This is where predictive analytics with machine learning algorithms and emerging tools comes in, so that you could build a program that learns from sample data to predict whether a given passenger would survive.

If you found this post useful and would like to explore more, head over to grab the book, Predictive Analytics with TensorFlow written by Md. Rezaul Karim.
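To make the brute-force idea mentioned above concrete, here is a rough sketch of what an if/else scorer with hand-picked weights might look like. The weights, the threshold, the age cut-off of 16, and the sample passengers are all assumptions made up for illustration; none of them is derived from the Titanic dataset or taken from the book.

```python
# Hypothetical hand-tuned weights and threshold; not derived from the data.
WEIGHTS = {"female": 2.0, "first_class": 1.0, "child": 1.5}
THRESHOLD = 2.0
CHILD_AGE = 16  # assumed cut-off for "child"


def predict_survival(sex, pclass, age):
    """Return 1 (predicted to survive) or 0, using hand-written rules only."""
    score = 0.0
    if sex == "female":
        score += WEIGHTS["female"]
    if pclass == 1:
        score += WEIGHTS["first_class"]
    if age is not None and age < CHILD_AGE:  # age may be missing
        score += WEIGHTS["child"]
    return 1 if score >= THRESHOLD else 0


# Made-up passengers as (sex, pclass, age), for illustration only
passengers = [
    ("female", 1, 29.0),
    ("male", 3, 35.0),
    ("male", 1, 8.0),
    ("male", 1, None),
]
predictions = [predict_survival(*p) for p in passengers]
print(predictions)
```

Every new variable means another branch and another weight to tune by hand, which is the tedium the text refers to; a learned model estimates such weights from the sample data instead.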

Savia Lobo
15 Feb 2018
8 min read

Build a generative chatbot using recurrent neural networks (LSTM RNNs)

In today’s tutorial, we will learn to build a generative chatbot using recurrent neural networks. The RNN used here is Long Short-Term Memory (LSTM). Generative chatbots are very difficult to build and operate. Even today, most workable chatbots are retrieval-based; they retrieve the best response for the given question based on semantic similarity, intent, and so on. For further reading, refer to the paper Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation by Kyunghyun Cho et al. (https://arxiv.org/pdf/1406.1078.pdf).

[box type="note" align="" class="" width=""]This article is an excerpt from a book written by Krishna Bhavsar, Naresh Kumar, and Pratap Dangeti, titled Natural Language Processing with Python Cookbook. In this book you will come across various recipes covering natural language understanding, Natural Language Processing, and syntactic analysis.[/box]

Getting ready...

The bot.aiml file from the A.L.I.C.E Artificial Intelligence Foundation dataset has been used to train the model. It is written in Artificial Intelligence Markup Language (AIML), a customized syntax similar to XML. In this file, questions and answers are mapped: for each question, there is a particular answer. Complete .aiml files are available at aiml-en-us-foundation-alice.v1-9 from https://code.google.com/archive/p/aiml-en-us-foundation-alice/downloads. Unzip the folder to see the bot.aiml file and open it using Notepad. Save it as bot.txt to read in Python:

    >>> import os
    # First change the following directory to where all the input files are
    >>> os.chdir("C:\\Users\\prata\\Documents\\book_codes\\NLP_DL")
    >>> import numpy as np
    >>> import pandas as pd
    # File reading
    >>> with open('bot.txt', 'r') as content_file:
    ...     botdata = content_file.read()
    >>> Questions = []
    >>> Answers = []

AIML files have unique syntax, similar to XML. The pattern word is used to represent the question and the template word for the answer.
Hence, we extract each of them in turn:

    >>> for line in botdata.split("</pattern>"):
    ...     if "<pattern>" in line:
    ...         Quesn = line[line.find("<pattern>")+len("<pattern>"):]
    ...         Questions.append(Quesn.lower())
    >>> for line in botdata.split("</template>"):
    ...     if "<template>" in line:
    ...         Ans = line[line.find("<template>")+len("<template>"):]
    ...         Answers.append(Ans.lower())
    >>> QnAdata = pd.DataFrame(np.column_stack([Questions,Answers]),columns = ["Questions","Answers"])
    >>> QnAdata["QnAcomb"] = QnAdata["Questions"]+" "+QnAdata["Answers"]
    >>> print(QnAdata.head())

The questions and answers are joined to extract the total vocabulary used in the modeling, as we need to convert all words/characters into a numeric representation. The reason is the same as mentioned before: deep learning models can't read English, and everything is in numbers for the model.

How to do it...

After extracting the question-and-answer pairs, the following steps are needed to process the data and produce the results:

  1. Preprocessing: Convert the question-and-answer pairs into vectorized format, which will be utilized in model training.
  2. Model building and validation: Develop deep learning models and validate the data.
  3. Prediction of answers from trained model: The trained model will be used to predict answers for given questions.

How it works...

The questions and answers are utilized to create the word-to-index vocabulary mapping, which will be used for converting words into vectors:

    # Creating Vocabulary
    >>> import nltk
    >>> import collections
    >>> counter = collections.Counter()
    >>> for i in range(len(QnAdata)):
    ...     # column 2 is the combined QnAcomb text
    ...     for word in nltk.word_tokenize(QnAdata.iloc[i, 2]):
    ...         counter[word]+=1
    >>> word2idx = {w:(i+1) for i,(w,_) in enumerate(counter.most_common())}
    >>> idx2word = {v:k for k,v in word2idx.items()}
    >>> idx2word[0] = "PAD"
    >>> vocab_size = len(word2idx)+1
    >>> print (vocab_size)

Encoding and decoding functions are used to convert text to indices and indices to text respectively. As we know, deep learning models work on numeric values rather than text or character data:

    >>> def encode(sentence, maxlen,vocab_size):
    ...     indices = np.zeros((maxlen, vocab_size))
    ...     for i, w in enumerate(nltk.word_tokenize(sentence)):
    ...         if i == maxlen: break
    ...         indices[i, word2idx[w]] = 1
    ...     return indices
    >>> def decode(indices, calc_argmax=True):
    ...     if calc_argmax:
    ...         indices = np.argmax(indices, axis=-1)
    ...     return ' '.join(idx2word[x] for x in indices)

The following code is used to vectorize the questions and answers with the given maximum length for each. The two lengths may differ. In some pieces of data, the question is longer than the answer, and in a few cases, it is shorter. Ideally, the question should be long enough to capture the right answer. Unfortunately, in this case, the question length is much less than the answer length, which makes this a very bad example for developing generative models:

    >>> question_maxlen = 10
    >>> answer_maxlen = 20
    >>> def create_questions(question_maxlen,vocab_size):
    ...     question_idx = np.zeros(shape=(len(Questions),question_maxlen, vocab_size))
    ...     for q in range(len(Questions)):
    ...         question = encode(Questions[q],question_maxlen,vocab_size)
    ...         question_idx[q] = question
    ...     return question_idx
    >>> quesns_train = create_questions(question_maxlen=question_maxlen, vocab_size=vocab_size)
    >>> def create_answers(answer_maxlen,vocab_size):
    ...     answer_idx = np.zeros(shape=(len(Answers),answer_maxlen, vocab_size))
    ...     for q in range(len(Answers)):
    ...         answer = encode(Answers[q],answer_maxlen,vocab_size)
    ...         answer_idx[q] = answer
    ...     return answer_idx
    >>> answs_train = create_answers(answer_maxlen=answer_maxlen,vocab_size= vocab_size)
    # All layers are imported from keras.layers, which works across Keras versions
    >>> from keras.layers import Input,Dense,Dropout,Activation
    >>> from keras.models import Model
    >>> from keras.layers import LSTM, Bidirectional
    >>> from keras.layers import RepeatVector, TimeDistributed, ActivityRegularization

The following code is an important part of the chatbot. Here we have used recurrent networks, a repeat vector, and time-distributed networks. The repeat vector is used to match the dimensions of the input to the output values, whereas the time-distributed network is used to change the column vector to the output dimension's vocabulary size:

    >>> n_hidden = 128
    >>> question_layer = Input(shape=(question_maxlen,vocab_size))
    >>> encoder_rnn = LSTM(n_hidden,dropout=0.2,recurrent_dropout=0.2)(question_layer)
    >>> repeat_encode = RepeatVector(answer_maxlen)(encoder_rnn)
    >>> dense_layer = TimeDistributed(Dense(vocab_size))(repeat_encode)
    >>> regularized_layer = ActivityRegularization(l2=1)(dense_layer)
    >>> softmax_layer = Activation('softmax')(regularized_layer)
    >>> model = Model([question_layer],[softmax_layer])
    >>> model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    >>> print (model.summary())

The following model summary describes how the tensor shape changes across the layers of the model. The input layer matches the question's dimension and the output matches the answer's dimension:

    # Model Training
    >>> quesns_train_2 = quesns_train.astype('float32')
    >>> answs_train_2 = answs_train.astype('float32')
    >>> model.fit(quesns_train_2, answs_train_2,batch_size=32,epochs=30, validation_split=0.05)

The results in the following screenshot are misleading, even though the accuracy looks high. The chatbot model might produce complete nonsense, as most of the words are padding here. The reason? The number of words in this data is small:

    # Model prediction
    >>> ans_pred = model.predict(quesns_train_2[0:3])
    >>> print (decode(ans_pred[0]))
    >>> print (decode(ans_pred[1]))

The following screenshot depicts the sample output on test data. The output does not seem to make sense, which is an issue with generative models:

Our model did not work well in this case, but there are still some areas of improvement going forward with generative chatbot models. Readers can give them a try:

  • Have a dataset with lengthy questions and answers to catch signals well
  • Create a larger deep learning architecture and train it over more iterations
  • Make question-and-answer pairs more generic rather than factoid-based, such as retrieving knowledge and so on, where generative models fail miserably

Here, you saw how to build chatbots using LSTM. You can go ahead and try building one of your own generative chatbots using the example above. If you found this post useful, do check out the book Natural Language Processing with Python Cookbook to efficiently use NLTK and implement text classification, identify parts of speech, tag words, and more.
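As a small, self-contained illustration of the encode/decode step described above, the sketch below runs a one-hot round trip on a toy corpus. The three sentences are made up, and a plain whitespace split stands in for nltk.word_tokenize so that the example needs only NumPy; it is an approximation of the recipe's functions, not the book's code.

```python
import numpy as np

# Toy corpus, made up for illustration only
corpus = ["hello there", "how are you", "hello how are you"]

# Index 0 is reserved for padding, so real words start at 1
words = sorted({w for sentence in corpus for w in sentence.split()})
word2idx = {w: i + 1 for i, w in enumerate(words)}
idx2word = {i: w for w, i in word2idx.items()}
idx2word[0] = "PAD"
vocab_size = len(word2idx) + 1


def encode(sentence, maxlen):
    """One-hot encode up to maxlen words; unused rows stay all zeros."""
    one_hot = np.zeros((maxlen, vocab_size))
    for i, w in enumerate(sentence.split()[:maxlen]):
        one_hot[i, word2idx[w]] = 1
    return one_hot


def decode(one_hot):
    """Pick the highest-scoring index per row and map it back to a word."""
    return " ".join(idx2word[i] for i in np.argmax(one_hot, axis=-1))


encoded = encode("how are you", maxlen=5)
print(encoded.shape)
print(decode(encoded))
```

The all-zero rows decode to PAD because argmax of a zero row is index 0. With short answers and a fixed answer length, most target positions behave this way, which is one plausible reading of why the reported accuracy looks high while the generated text is poor.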
Vijin Boricha
14 Feb 2018
7 min read

How to deploy RethinkDB using Docker

[box type="note" align="" class="" width=""]This article is an excerpt from a book written by Shahid Shaikh titled Mastering RethinkDB. This book will help you develop efficient and real-time applications in RethinkDB with ease.[/box]

In today’s tutorial, we will learn to install Docker, create a Docker image, and deploy RethinkDB using Docker.

Your code is not working in production? But it's working on the QA (quality assurance) server! I am sure you have heard statements like these in your team during the deployment phase. Well, no more of that: Dockerize everything and forget about the infrastructure of different environments, say, QA, staging, and production, because your code is going to run in a Docker container, not directly on those machines; hence write once, run everywhere. In this section, we will learn how to use Docker to deploy a RethinkDB server or PaaS services. I am going to cover a few Docker basics too; if you are already aware of them, please skip to the next section.

Installing Docker

Docker is available for all major platforms, such as Linux-based distributions, Mac, and Windows. Visit the official website at https://www.docker.com/ and download the package suitable for your platform. We are installing Docker on our machine to create a new Docker image. Docker images are independent of platform and should not be confused with Docker for Mac or Docker for Windows, which are also referred to as Docker clients. Once you have installed Docker, you need to start the daemon process first; I am using a Mac, so I can view this in the Launchpad, as shown here:

Upon clicking that, it will open up a nice console showing the official Docker logo and an indication that Docker has booted successfully, as shown in the following screenshot:

Now we can begin creating our Docker image, which in turn will run RethinkDB.

Creating a Docker image

To install RethinkDB on Ubuntu inside Docker, we first need the Ubuntu operating system.
Run the following command to install an Ubuntu image from the official Docker Hub repository:

    docker pull ubuntu

This will download and install the Ubuntu image on our system. We will later use this Ubuntu image and install our RethinkDB instance; you can choose different operating systems as well. Before going to the Docker configuration code, I would like to point out the steps we require to install RethinkDB on a fresh Ubuntu installation:

  • Update the system
  • Add the RethinkDB repository to the known repository list
  • Install RethinkDB
  • Set the data folder
  • Expose the port

We are going to do this using Docker. To create a Docker image, we require a Dockerfile. Create a file called Dockerfile with no extension and add the code shown here:

    FROM ubuntu:latest

    # Install RethinkDB.
    RUN apt-get update && \
        echo "deb http://download.rethinkdb.com/apt `lsb_release -cs` main" > /etc/apt/sources.list.d/rethinkdb.list && \
        apt-get install -y wget && \
        wget -O- http://download.rethinkdb.com/apt/pubkey.gpg | apt-key add - && \
        apt-get update && \
        apt-get install -y rethinkdb python-pip && \
        rm -rf /var/lib/apt/lists/*

    # Install python driver for rethinkdb
    RUN pip install rethinkdb

    # Define mountable directories.
    VOLUME ["/data"]

    # Define working directory.
    WORKDIR /data

    # Define default command.
    CMD ["rethinkdb", "--bind", "all"]

    # Expose ports.
    # - 8080: web UI
    # - 28015: process
    # - 29015: cluster
    EXPOSE 8080
    EXPOSE 28015
    EXPOSE 29015

The first line is our entry point to the Ubuntu operating system. Then we perform an update of the system and use the installation commands recommended by RethinkDB here: https://www.rethinkdb.com/docs/install/ubuntu/. Once the installation is complete, we install the rethinkdb Python driver to perform the import/export operation. The next two commands mount a new volume in Ubuntu and tell RethinkDB to use that volume.
The next command runs rethinkdb bound to all network interfaces (--bind all), and the EXPOSE lines declare the ports to be used by the client driver, the cluster, and the web console. In order to make this a Docker image, save the file and run the following command within the project directory (note the trailing dot, which is the build context):

    docker build -t docker-rethinkdb .

Here, we are building our Docker image and giving it the name docker-rethinkdb; upon running this command, Docker will execute the Dockerfile and you're good to go. The representation of the previous steps is shown here:

Once everything works, and I am sure it will, you will see a success message in the console, as shown here:

Congratulations! You have successfully created a Docker image for RethinkDB. If you want to see your image and its properties, run the following command:

    docker images

This will list all the Docker images, as shown in the following screenshot:

Awesome! Now let's run it. To access the web portal, we need to run our Docker image and bind port 8080 of the container to some port of our machine; here is the command to do so:

    docker run -p 3000:8080 -d docker-rethinkdb

In the command above, -p specifies the port binding: the first port (3000) is the port on the host machine and the second (8080) is the port inside the container. -d runs the container detached, that is, in the background. To extract more information about this process, we need to run the following command:

    docker ps

This will list all the running containers, along with their information, as shown in the following screenshot:

You can also check the logs of a specific container using the following command:

    docker logs <container id>

Now, in order to access the RethinkDB web console from our machine, we need to find out the IP address on which the Docker machine is running. To get that, we need to run the following command:

    docker-machine ip default

This will print out the IP.
Copy the IP and open IP:3000 in the browser to view the RethinkDB web console. We now have RethinkDB running in Docker and accessible from the browser.

In order to import and export data, we need to open a shell inside the running container. To do that, run the following command:

    docker exec -i -t <container-id> /bin/bash

This logs you in to the container running Ubuntu. You can now run the rethinkdb command to import data into the existing RethinkDB cluster.

Deploying the Docker image

Almost every PaaS service we have covered in earlier sections provides support for Docker. You can commit your Dockerfile to Git and clone it wherever you want to build the Docker image. Alternatively, you can push the whole Docker image (not just the Dockerfile) to Docker Hub and pull it directly using the docker pull command. This is the easier way, because you work directly with a ready-built image on the server.

We covered RethinkDB deployment using Docker and learned how to create our own RethinkDB image. You can learn more about the RethinkDB Query Language and performance tuning in RethinkDB from the book Mastering RethinkDB.

Vijin Boricha
14 Feb 2018
5 min read

How to perform Data exploration with RethinkDB

Note: This article is an excerpt from a book written by Shahid Shaikh titled Mastering RethinkDB. This book will let you master the capabilities of RethinkDB and implement them to develop efficient real-time web applications.

In this article, we will learn to do data exploration in RethinkDB with the help of a few use cases.

Executing data exploration use cases

We have imported our database, that is, our mock data, into our RethinkDB instance. Now it's time to run use case queries against it. Before we do so, we need to make one alteration to the data. While generating the mock data we made a mistake (on purpose, actually): there is a $ sign before the ctc value, which makes salary-level queries difficult. Before we move ahead, we need to get rid of the $ sign and store ctc as an integer instead of a string. To do this, we need to perform the following operations:

1. Traverse each document in the table
2. Split the ctc string into two parts: the $ sign and the numeric value
3. Update the ctc value in the document with the new data type and value

Since this requires chaining queries, I have written a small snippet in Node.js to achieve it:

    var rethinkdb = require('rethinkdb');
    var connection = null;

    // Connect to the local RethinkDB instance on the default driver port.
    rethinkdb.connect({host: 'localhost', port: 28015}, function(err, conn) {
      if (err) {
        throw new Error('Connection error');
      }
      connection = conn;
      // Fetch every document in the employees table.
      rethinkdb.db("company").table("employees")
        .run(connection, function(err, cursor) {
          if (err) {
            throw new Error(err);
          }
          // Visit the documents one at a time.
          cursor.each(function(err, data) {
            if (err) {
              throw new Error(err);
            }
            // "$12345" -> 12345: take the part after the $ and parse it as base 10.
            data.ctc = parseInt(data.ctc.split("$")[1], 10);
            // Write the integer value back to the same document.
            rethinkdb.db("company").table("employees")
              .get(data.id)
              .update({ctc: data.ctc})
              .run(connection, function(err, response) {
                if (err) {
                  throw new Error(err);
                }
                console.log(response);
              });
          });
        });
    });

As you can see in the preceding code, we first fetch all the documents and traverse them using a cursor, one document at a time. We use the split() method with $ as the separator and convert the result, which is the salary, into an integer using the parseInt() method. We then update one document at a time, using the id value of the document.

After selecting all the documents again, we can see that the ctc value is now an integer.

This is a practical example of the kind of data manipulation you often need to perform before moving ahead with complex queries. Similarly, you can look for errors such as blank spaces in a specific field or duplicate elements in your records.

Finding duplicate elements

We can use distinct() to find out whether there are any duplicate elements in the table. Say you have 1,000 rows and 10 of them are duplicates. To determine that, we just need to count the unique rows (excluding the id key, of course, as that is unique by nature). Here is the query:

    r.db("company").table('employees').without('id').distinct().count()

This query returns the count of unique rows, which should be 1,000 if there are no duplicates. That is the result we get, which implies that our records contain no duplicate documents.

Finding the list of countries

We can find all the countries in our records by selecting just the country field and using distinct again. Here is the query:

    r.db("company").table('employees')("country").distinct()

The result shows that we have 124 countries in our records.

Finding the top 10 employees with the highest salary

In this use case, we need to evaluate all the records and find the top 10 employees, ordered from highest to lowest pay. Here is the query:

    r.db("company").table("employees").orderBy(r.desc("ctc")).limit(10)

Here we are using orderBy, which by default sorts the records in ascending order. To get the highest pay in the first document, we need descending order, which we get using the desc() ReQL command.
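To make the logic of the duplicate check and the top-10 query above concrete, here is a minimal in-memory analogue in plain JavaScript. It is not ReQL and does not talk to RethinkDB; the sample records and the helper names `countDistinctWithoutId` and `topByCtc` are made up for illustration.

```javascript
// In-memory analogue of two of the queries above; plain JavaScript, not ReQL.
// The sample records are invented for illustration.
const employees = [
  { id: 1, first_name: "John", country: "Sweden", ctc: 52000 },
  { id: 2, first_name: "Asha", country: "India", ctc: 91000 },
  { id: 3, first_name: "John", country: "Sweden", ctc: 52000 }, // same as id 1 apart from the id
  { id: 4, first_name: "Lena", country: "Norway", ctc: 78000 },
];

// Mirrors without('id').distinct().count(): drop the id, then count unique rows.
function countDistinctWithoutId(rows) {
  const seen = new Set();
  for (const row of rows) {
    const { id, ...rest } = row;
    // Sort the keys so that field order does not affect the comparison.
    const key = JSON.stringify(Object.keys(rest).sort().map((k) => [k, rest[k]]));
    seen.add(key);
  }
  return seen.size;
}

// Mirrors orderBy(r.desc('ctc')).limit(n): sort a copy descending, keep the first n.
function topByCtc(rows, n) {
  return [...rows].sort((a, b) => b.ctc - a.ctc).slice(0, n);
}

// A count lower than employees.length means duplicates exist.
console.log(countDistinctWithoutId(employees), "unique of", employees.length);
console.log(topByCtc(employees, 2).map((e) => e.first_name));
```

The same comparison applies to the real table: if the distinct count is lower than the total row count, some documents differ only by their id.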
The top-10 query returns 10 rows. You can modify the same query to get the highest-paid employee by limiting the number of results to one.

Displaying employee records with a specific name and location

To extract such records from our table, we again perform a filter, this time on the "first_name" and "country" fields. Here is the query to return those records:

    r.db("company").table('employees').filter({"first_name" : "John","country" : "Sweden"})

We are just performing a basic filter that compares both fields. ReQL's chaining feature makes queries like this easy to write. Executing the preceding query returns the matching employee records.

To summarize, we looked at a few use cases where we had to alter and filter records to complete an exploration task, such as stripping the $ sign from ctc, or converting base 256 IP addresses into base 10 values and then performing a query on them. We also covered some general use cases to get a practical feel for ReQL.

If you are interested in learning about the RethinkDB Query Language, extending RethinkDB, and more, you may check out the book Mastering RethinkDB.
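The two cleanup tasks mentioned in the summary can be sketched as small helper functions. This is only an illustration under stated assumptions: the helper names `parseCtc` and `ipToInt` are hypothetical, ctc is assumed to be plain digits after an optional $ sign, and IP addresses are assumed to be dotted IPv4 strings.

```javascript
// Hypothetical cleanup helpers; the field formats are assumptions.

// Turn a salary string such as "$12345" into the integer 12345.
// Returns null when no number can be read, rather than storing NaN.
// Values that are already numbers pass through unchanged.
function parseCtc(raw) {
  const digits = String(raw).replace(/^\s*\$/, "").trim();
  const value = Number.parseInt(digits, 10);
  return Number.isNaN(value) ? null : value;
}

// Treat a dotted IPv4 address as four base-256 digits and return its
// base-10 value.
function ipToInt(ip) {
  const octets = ip.split(".").map((part) => Number.parseInt(part, 10));
  if (octets.length !== 4 || octets.some((o) => Number.isNaN(o) || o < 0 || o > 255)) {
    throw new Error("Not a dotted IPv4 address: " + ip);
  }
  // Multiplication rather than bit shifts keeps the result non-negative.
  return octets.reduce((acc, octet) => acc * 256 + octet, 0);
}

console.log(parseCtc("$12345"));
console.log(ipToInt("192.168.1.1"));
```

Returning null for unreadable values makes bad records easy to find afterwards, and accepting numbers as well as strings means the cleanup can be run again without failing on documents that were already converted.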