
How-To Tutorials - Data


SQL Tuning Enhancements in Oracle 12c

Packt
13 Dec 2016
13 min read
Background

Performance tuning is one of the most critical areas of Oracle database administration, and a good knowledge of SQL tuning helps DBAs tune production databases on a daily basis. Over the years, the Oracle optimizer has gone through several enhancements, and each release improves on the one before it. Oracle 12c is no different: Oracle has improved the optimizer and added new features to make this release better than the previous one. In this article, we are going to look at some of the notable new features of the Oracle optimizer that help us tune our queries.

Objective

In this article, Advait Deo and Indira Karnati, authors of the book OCP Upgrade 1Z0-060 Exam Guide, discuss the new features of the Oracle 12c optimizer and how they help improve SQL plans. They also discuss some of the limitations of the optimizer in previous releases and how Oracle has overcome those limitations in this release. Specifically, we are going to discuss dynamic plans and how they work.

(For more resources related to this topic, see here.)

SQL Tuning

Before we go into the details of each of these new features, let us rewind and check what we used to have in Oracle 11g.

Behavior in Oracle 11g R1

Whenever a SQL statement is executed for the first time, the optimizer generates an execution plan for it based on the statistics available for the different objects used in the plan. If statistics are not available, if the optimizer thinks that the existing statistics are of low quality, or if the SQL uses complex predicates for which the optimizer cannot estimate the cardinality, the optimizer may choose to use dynamic sampling for those tables. Based on these statistics, the optimizer generates the plan and executes the SQL. There are two problems with this approach:

Statistics generated by dynamic sampling may not be of good quality, because they are generated in limited time and from a limited sample size; a trade-off is made to minimize the impact while still approaching a reasonable level of accuracy.

The plan generated this way may not be accurate, because the estimated cardinality may differ greatly from the actual cardinality. The next time the query executes, it goes through a soft parse and picks up the same plan.

Behavior in Oracle 11g R2

To overcome these drawbacks, Oracle enhanced dynamic sampling further in Oracle 11g Release 2. In the 11.2 release, Oracle automatically enables dynamic sampling when a query is run if statistics are missing, or if the optimizer thinks that the current statistics are not up to the mark. The optimizer also decides the level of dynamic sampling, provided the user has not set a non-default value for the OPTIMIZER_DYNAMIC_SAMPLING parameter (the default value is 2). So, if this parameter is left at its default value in Oracle 11g R2, the optimizer decides when to use dynamic sampling in a query and at what level.
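As a quick aside (this snippet is not part of the original article), you can inspect the dynamic sampling setting from SQL*Plus, and an individual query can also request a specific level with the DYNAMIC_SAMPLING hint. The parameter and hint names are standard, but the query below is only a hypothetical example built on the SALES table used later in this article.

SQL> SHOW PARAMETER optimizer_dynamic_sampling

-- Request level-4 sampling for a single query (illustrative only)
SELECT /*+ DYNAMIC_SAMPLING(a 4) */ COUNT(*)
FROM   sales a
WHERE  a.sales_rep = 'SMITH';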
Oracle also introduced a new feature in Oracle 11g R2 called cardinality feedback. Its purpose was to further improve the performance of SQL statements that are executed repeatedly and for which the optimizer does not have the correct cardinality, perhaps because of missing statistics, complex predicate conditions, or some other reason. In such cases, cardinality feedback is very useful. Cardinality feedback works as follows: during the first execution, the plan for the SQL is generated using the traditional method, without cardinality feedback. However, during the optimization stage of that first execution, the optimizer notes down all the estimates that are of low quality (due to missing statistics, complex predicates, or some other reason), and monitoring is enabled for the cursor that is created. If this monitoring is enabled during the optimization stage, then, at the end of the first execution, some cardinality estimates in the plan are compared with the actual values to understand how significant the variation is. If the estimates vary significantly, the actual values for such predicates are stored along with the cursor, and these values are used directly for the next execution instead of being discarded and calculated again. So when the query executes the next time, it is optimized again (a hard parse happens), but this time it uses the actual statistics or predicate cardinalities that were saved during the first execution, and the optimizer comes up with a better plan. Even with these improvements, there are drawbacks:

With cardinality feedback, any missing or corrected estimates are available only for subsequent executions, not for the first one, so the first execution can still perform poorly.

The dynamic sampling improvements (that is, the optimizer deciding whether dynamic sampling should be used and at what level) apply only to parallel queries; they do not apply to queries that are not running in parallel.

Dynamic sampling does not consider joins and GROUP BY columns.

Oracle 12c provides new improvements that eliminate these drawbacks of Oracle 11g R2.

Adaptive execution plans – dynamic plans

The Oracle optimizer chooses the best execution plan for a query based on all the information available to it. Sometimes, the optimizer may not have sufficient statistics, or statistics of good enough quality, which makes it difficult to generate optimal plans. In Oracle 12c, the optimizer has been enhanced to adapt a poorly performing execution plan at run time and prevent a poor plan from being chosen on subsequent executions. An adaptive plan can change the execution plan in the current run when the optimizer's estimates prove to be wrong. This is made possible by collecting statistics at critical places in a plan while the query is executing. A query is internally split into multiple steps, and the optimizer generates multiple sub-plans for each step. Based on the statistics collected at the critical points, the optimizer compares the collected statistics with the estimated cardinality and, if it finds a deviation beyond the set threshold, it picks a different sub-plan for those steps. This improves the ability of the query-processing engine to generate better execution plans.

What happens in adaptive plan execution?

In Oracle 12c, the optimizer generates dynamic plans. A dynamic plan is an execution plan that has many built-in sub-plans; a sub-plan is a portion of the plan that the optimizer can switch to as an alternative at run time. When the first execution starts, the optimizer observes statistics at various critical stages in the plan and makes the final decision about a sub-plan based on the observations made during the execution up to that point. Going deeper into the logic of the dynamic plan, the optimizer actually places statistics collectors at various critical stages in the plan.
These critical stages are the places in the plan where the optimizer has to join two tables or has to decide upon the optimal degree of parallelism. During the execution of the plan, the statistics collector buffers a portion of the rows. The portion of the plan preceding the statistics collector can have alternative sub-plans, each of which is valid for a subset of the possible values returned by the collector; in other words, each sub-plan covers a different range of the threshold value. Based on the data returned by the statistics collector, the sub-plan that falls within the required threshold is chosen. For example, during the query plan building phase, the optimizer can insert code to collect statistics before joining two tables, and it can have multiple sub-plans based on the type of join it can perform between them. If the number of rows returned by the statistics collector on the first table is below the threshold value, the optimizer might go with the sub-plan containing the nested loop join; if the number of rows is above the threshold, it might choose the second sub-plan and go with the hash join. After the optimizer chooses a sub-plan, buffering is disabled and the statistics collector stops collecting rows and simply passes them through. On subsequent executions of the same SQL, buffering is skipped and the optimizer keeps the plan it settled on. With dynamic plans, the optimizer adapts to poor plan choices, and correct decisions are made at various steps during runtime. Instead of using predetermined execution plans, adaptive plans enable the optimizer to postpone the final plan decision until statement execution time. Consider the following simple query:

SELECT a.sales_rep, b.product, sum(a.amt)
FROM sales a, product b
WHERE a.product_id = b.product_id
GROUP BY a.sales_rep, b.product

When this query plan is built initially, the optimizer places a statistics collector in front of the join. It scans the first table (SALES) and, based on the number of rows returned, it can then decide on the correct type of join.

Enabling adaptive execution plans

To enable adaptive execution plans, you need to fulfill the following conditions:

optimizer_features_enable should be set to a minimum of 12.1.0.1
optimizer_adaptive_reporting_only should be set to FALSE (the default)

If you set the OPTIMIZER_ADAPTIVE_REPORTING_ONLY parameter to TRUE, the adaptive execution plan feature runs in reporting-only mode: it collects the information needed for adaptive optimization, but doesn't actually use this information to change the execution plans. You can find out whether the final plan chosen was the default plan by looking at the IS_RESOLVED_ADAPTIVE_PLAN column in the V$SQL view. Join methods and parallel distribution methods are the two areas where adaptive plans have been implemented in Oracle 12c.
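As another small, illustrative check (again, not from the original text), the following statements show one way to confirm these two settings and to toggle reporting-only mode for the current session. Both parameter names are standard in 12c, but verify the values and the scope of the change against your own database before altering anything.

SQL> SHOW PARAMETER optimizer_features_enable
SQL> SHOW PARAMETER optimizer_adaptive_reporting_only

-- Reporting-only mode: gather adaptive information without changing plans
ALTER SESSION SET optimizer_adaptive_reporting_only = TRUE;

-- Normal adaptive behavior (the default)
ALTER SESSION SET optimizer_adaptive_reporting_only = FALSE;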
Adaptive execution plans and join methods

Here is an example that shows how an adaptive execution plan looks. Instead of simulating a new query in the database and checking whether the adaptive plan has worked, I used one of the queries in the database that is already using an adaptive plan. You can find many such queries if you check V$SQL for is_resolved_adaptive_plan = 'Y'. The following query lists all SQL statements that are using adaptive plans:

Select sql_id from v$sql where is_resolved_adaptive_plan = 'Y';

While evaluating the plan, the optimizer uses the cardinality of the join to select the superior join method. The statistics collector starts buffering the rows from the first table; if the number of rows exceeds the threshold value, the optimizer chooses a hash join, but if the rows are fewer than the threshold value, the optimizer goes for a nested loop join. The following is the resulting plan:

SQL> SELECT * FROM TABLE(DBMS_XPLAN.display_cursor(sql_id=>'dhpn35zupm8ck',cursor_child_no=>0));

Plan hash value: 3790265618

-------------------------------------------------------------------------------------------------
| Id  | Operation                              | Name    | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                       |         |       |       |   445 (100)|          |
|   1 |  SORT ORDER BY                         |         |     1 |    73 |   445   (1)| 00:00:01 |
|   2 |   NESTED LOOPS                         |         |     1 |    73 |   444   (0)| 00:00:01 |
|   3 |    NESTED LOOPS                        |         |   151 |    73 |   444   (0)| 00:00:01 |
|*  4 |     TABLE ACCESS BY INDEX ROWID BATCHED| OBJ$    |   151 |  7701 |   293   (0)| 00:00:01 |
|*  5 |      INDEX FULL SCAN                   | I_OBJ3  |     1 |       |    20   (0)| 00:00:01 |
|*  6 |     INDEX UNIQUE SCAN                  | I_TYPE2 |     1 |       |     0   (0)|          |
|*  7 |    TABLE ACCESS BY INDEX ROWID         | TYPE$   |     1 |    22 |     1   (0)| 00:00:01 |
-------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   4 - filter(SYSDATE@!-"O"."CTIME">.0007)
   5 - filter("O"."OID$" IS NOT NULL)
   6 - access("O"."OID$"="T"."TVOID")
   7 - filter(BITAND("T"."PROPERTIES",8388608)=8388608)

Note
-----
   - this is an adaptive plan

If we check this plan, we can see from the Note section that this is an adaptive plan. The optimizer must have started with some default plan, based on the statistics of the tables and indexes, and during run-time execution it changed the join method for a sub-plan. You can check which step the optimizer has changed and at what point it has collected the statistics.
You can display this using the new 'adaptive' format of DBMS_XPLAN.DISPLAY_CURSOR (format => 'adaptive'), resulting in the following:

DEO> SELECT * FROM TABLE(DBMS_XPLAN.display_cursor(sql_id=>'dhpn35zupm8ck',cursor_child_no=>0,format=>'adaptive'));

Plan hash value: 3790265618

------------------------------------------------------------------------------------------------------
|   Id  | Operation                                 | Name    | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------------------------------
|     0 | SELECT STATEMENT                          |         |       |       |   445 (100)|          |
|     1 |  SORT ORDER BY                            |         |     1 |    73 |   445   (1)| 00:00:01 |
|- *  2 |   HASH JOIN                               |         |     1 |    73 |   444   (0)| 00:00:01 |
|     3 |    NESTED LOOPS                           |         |     1 |    73 |   444   (0)| 00:00:01 |
|     4 |     NESTED LOOPS                          |         |   151 |    73 |   444   (0)| 00:00:01 |
|-    5 |      STATISTICS COLLECTOR                 |         |       |       |            |          |
|  *  6 |       TABLE ACCESS BY INDEX ROWID BATCHED | OBJ$    |   151 |  7701 |   293   (0)| 00:00:01 |
|  *  7 |        INDEX FULL SCAN                    | I_OBJ3  |     1 |       |    20   (0)| 00:00:01 |
|  *  8 |     INDEX UNIQUE SCAN                     | I_TYPE2 |     1 |       |     0   (0)|          |
|  *  9 |    TABLE ACCESS BY INDEX ROWID            | TYPE$   |     1 |    22 |     1   (0)| 00:00:01 |
|- * 10 |    TABLE ACCESS FULL                      | TYPE$   |     1 |    22 |     1   (0)| 00:00:01 |
------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("O"."OID$"="T"."TVOID")
   6 - filter(SYSDATE@!-"O"."CTIME">.0007)
   7 - filter("O"."OID$" IS NOT NULL)
   8 - access("O"."OID$"="T"."TVOID")
   9 - filter(BITAND("T"."PROPERTIES",8388608)=8388608)
  10 - filter(BITAND("T"."PROPERTIES",8388608)=8388608)

Note
-----
   - this is an adaptive plan (rows marked '-' are inactive)

In this output, you can see three extra steps: Steps 2, 5, and 10. These steps were present in the original plan when the query started. Initially, the optimizer generated a plan with a hash join between the outer tables. During runtime, the optimizer started collecting the rows returned from the OBJ$ table (Step 6), as we can see from the STATISTICS COLLECTOR at Step 5. Once the rows were buffered, the optimizer learned that the number of rows returned by the OBJ$ table was below the threshold, so it could go for a nested loop join instead of a hash join. The rows marked with - at the beginning belong to the original plan and are removed from the final plan. In their place, three new steps are added: Steps 3, 8, and 9. The full table scan of the TYPE$ table at Step 10 is changed to an index unique scan of I_TYPE2 at Step 8, followed by the table access by index rowid at Step 9.

Adaptive plans and parallel distribution methods

Adaptive plans are also useful for moving away from bad distribution methods when running SQL in parallel. Parallel execution often requires data redistribution to perform parallel sorts, joins, and aggregates. The database can choose from among multiple data distribution methods to perform these operations. The number of rows to be distributed, along with the number of parallel server processes, determines the data distribution method. If many parallel server processes distribute only a few rows, the database chooses a broadcast distribution method and sends the entire result set to all the parallel server processes. On the other hand, if a few processes distribute many rows, the database distributes the rows equally among the parallel server processes by choosing a hash distribution method. In adaptive plans, the optimizer does not commit to a specific distribution method up front. Instead, it starts with an adaptive parallel data distribution technique called hybrid data distribution. It places a statistics collector to buffer the rows returned by the table, and based on the number of rows returned, it decides on the distribution method: if the number of rows is below the threshold, the distribution method switches to broadcast distribution, and if it is above the threshold, the distribution method switches to hash distribution.
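As a rough sketch (not taken from the original article), you could observe adaptive parallel distribution with the earlier SALES/PRODUCT example by running the join in parallel and then displaying its cursor; with hybrid distribution you would typically see a PX SEND HYBRID HASH step with a statistics collector above it, although the exact plan depends entirely on your data, statistics, and parallel settings.

SELECT /*+ PARALLEL(4) */ a.sales_rep, b.product, SUM(a.amt)
FROM   sales a, product b
WHERE  a.product_id = b.product_id
GROUP  BY a.sales_rep, b.product;

-- Display the plan of the last statement executed in this session
SELECT * FROM TABLE(DBMS_XPLAN.display_cursor(format => 'adaptive'));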
Summary

In this article, we learned about the new features of the Oracle optimizer that help us tune our queries.

Resources for Article:

Further resources on this subject:

Oracle Essbase System 9 Components [article]
Oracle E-Business Suite: Adjusting Items in Inventory and Classifying Items [article]
Oracle Business Intelligence: Getting Business Information from Data [article]


Tableau Data Extract Best Practices

Packt
12 Dec 2016
11 min read
In this article by Jenny Zhang, author of the book Tableau 10.0 Best Practices, you will learn best practices for Tableau data extracts. We will look into the different ways of creating Tableau data extracts and the technical details of how a Tableau data extract works. We will learn how to create extracts from large volumes of data efficiently, and then how to upload and manage Tableau data extracts in Tableau Online. We will also look at refreshing Tableau data extracts, which is useful for keeping your data up to date automatically. Finally, we will take a look at using the Tableau web connector to create data extracts.

(For more resources related to this topic, see here.)

Different ways of creating Tableau data extracts

Tableau provides a few ways to create extracts.

Direct connect to original data sources

Creating an extract by connecting to the original data source (databases, Salesforce, Google Analytics, and so on) will maintain the connection to the original data source. You can right-click the extract to edit it and to refresh it from the original data source.

Duplicate of an extract

If you create a duplicate of the extract by right-clicking the data extract and choosing Duplicate, it will create a new .tde file and still maintain the connection to the original data source. If you refresh the duplicated data extract, it will not refresh the original data extract that the duplicate was created from.

Connect to a Tableau extract file

If you create a data extract by connecting to a Tableau extract file (.tde), you will not have the connection to the original data source that the extract was created from, since you are just connecting to a local .tde file. You cannot edit or refresh the data from the original data source. Duplicating this extract, with its connection to the local .tde file, will NOT create a new .tde file; the duplicate will still point to the same local .tde file. You can right-click and choose Extract Data to create an extract out of an extract, but we do not normally do that.

Technical details of how a Tableau data extract works

Tableau data extract's design principles

A Tableau extract (.tde) file is a compressed snapshot of data extracted from a large variety of original data sources (Excel, databases, Salesforce, NoSQL, and so on). It is stored on disk and loaded into memory as required to create a Tableau viz. Two design principles of the Tableau extract make it ideal for data analytics.

The first principle is that a Tableau extract is a columnar store. Columnar databases store column values rather than row values. The benefit is that the input/output time required to access and aggregate the values in a column is significantly reduced, which is why the Tableau extract is great for data analytics.

The second principle is how a Tableau extract is structured to make the best use of your computer's memory. This affects how it is loaded into memory and used by Tableau. To understand this principle better, we need to understand how a Tableau extract is created and then used as the data source for visualizations. When Tableau creates a data extract, it defines the structure of the .tde file and creates separate files for each column in the original data source. When Tableau retrieves data from the original data source, it sorts, compresses, and adds the values for each column to their own file. After that, the individual column files are combined with metadata to form a single file with as many individual memory-mapped files as there are columns in the original data source.
Because a Tableau data extract file is a memory-mapped file, when Tableau requests data from a .tde file, the data is loaded directly into memory by the operating system. Tableau does not have to open, process, or decompress the file. If needed, the operating system continues to move data in and out of RAM to ensure that all of the requested data is made available to Tableau. This means that Tableau can query data that is bigger than the RAM on the computer.

Benefits of using Tableau data extracts

Following are the seven main benefits of using Tableau data extracts:

Performance: Using a Tableau data extract can increase performance when the underlying data source is slow. It can also speed up custom SQL.

Reduced load: Using a Tableau data extract instead of a live connection to databases reduces the load on the database that can result from heavy traffic.

Portability: A Tableau data extract can be bundled with the visualizations in a packaged workbook for sharing with others.

Pre-aggregation: When creating an extract, you can choose to aggregate your data for certain dimensions. An aggregated extract has a smaller size and contains only aggregated data. Accessing the values of aggregations in a visualization is very fast, since all of the work to derive the values has already been done. You can choose the level of aggregation; for example, you can aggregate your measures to month, quarter, or year.

Materialized calculated fields: When you choose to optimize the extract, all of the calculated fields that have been defined are converted to static values upon the next full refresh. They become additional data fields that can be accessed and aggregated as quickly as any other field in the extract. The improvement in performance can be significant, especially for string calculations, since string calculations are much slower than numeric or date calculations.

Publishing to Tableau Public and Tableau Online: Tableau Public only supports Tableau extract files. Though Tableau Online can connect to some cloud-based data sources, Tableau data extracts are most commonly used.

Support for certain functions not available with live connections: Certain functions, such as count distinct, are only available when using a Tableau data extract.

How to create extracts from large volumes of data efficiently

Load very large Excel files into Tableau

If you have an Excel file with lots of data and lots of formulas, it can take a long time to load into Tableau. The best practice is to save the Excel file as a .csv file and remove all the formulas.

Aggregate the values to a higher dimension

If you do not need the values down to the level of detail of the underlying data source, aggregating to a higher dimension will significantly reduce the extract size and improve performance.

Use a data source filter

Add a data source filter by right-clicking the data source and choosing Edit Data Source Filter to remove the data you do not need before creating the extract.

Hide unused fields

Hiding unused fields before creating a data extract can speed up extract creation and also save storage space.

Upload and manage Tableau data extracts in Tableau Online

Create a workbook just for extracts

One way to create extracts is to create them in different workbooks. The advantage is that you can create extracts on the fly when you need them. The disadvantage is that once you have created many extracts, it becomes very difficult to manage them, and you can hardly remember which dashboard has which extracts.
A better solution is to use one workbook just to create data extracts and then upload the extracts to Tableau Online. When you need to create visualizations, you can use the extracts in Tableau Online. If you want to organize the extracts further, you can use different workbooks for different types of data sources: for example, one workbook for Excel files, one workbook for local databases, one workbook for web-based data, and so on.

Upload data extracts to the default project

The default project in Tableau Online is a good place to store your data extracts. The reason is that the default project cannot be deleted. Another benefit is that when you use the command line to refresh the data extracts, you do not need to specify a project name if they are in the default project.

Make sure Tableau Online/Server has enough space

In Tableau Online/Server, it's important to make sure that the backgrounder has enough disk space to store the existing Tableau data extracts as well as to refresh them and create new ones. A good rule of thumb is that the size of the disk available to the backgrounder should be two to three times the size of the data extracts expected to be stored on it.

Refresh Tableau data extracts

Local refresh of a published extract:

Download a local copy of the data source from Tableau Online: go to the Data Sources tab, click on the name of the extract you want to download, and click Download.

Refresh the local copy: open the extract file in Tableau Desktop, right-click on the data source, and choose Extract | Refresh.

Publish the refreshed extract to Tableau Online: right-click the extract and click Publish to Server. You will be asked whether you wish to overwrite the file with the same name; click Yes.

NOTE 1: If you need to make changes to any metadata, do it before publishing to the server.

NOTE 2: If you use the data extract in Tableau Online to create visualizations for multiple workbooks (which I believe you do, since that is the benefit of using a shared data source in Tableau Online), please be very careful when making any changes to the calculated fields, groups, or other metadata. If you have other calculations created in the local workbook with the same name as the calculations in the data extract in Tableau Online, the Tableau Online version of the calculation will overwrite what you created in the local workbook. So make sure you have the correct calculations in the data extract that will be published to Tableau Online.

Schedule data extract refreshes in Tableau Online

Only cloud-based data sources (for example, Salesforce and Google Analytics) can be refreshed using scheduled jobs in Tableau Online. One option for non-cloud-based data sources is to use the Tableau Desktop command line to refresh them in Tableau Online; the Windows scheduler can then automate the refresh jobs that update extracts via the Tableau Desktop command line, as sketched below. Another option is to use the sync application, or to refresh the extracts manually using Tableau Desktop.

NOTE: If using the command line to refresh the extract, + cannot be used in the data extract name.
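As a rough, illustrative sketch of the command-line option (this is not from the original article), a batch file scheduled with the Windows Task Scheduler could call Tableau Desktop's refreshextract utility along these lines; the install path, credentials, data source name, and even the exact option names should all be treated as placeholders to verify against your own Tableau Desktop version's documentation.

REM refresh_extract.bat - run on a schedule by the Windows Task Scheduler (illustrative only)
REM The path, account, and data source name below are placeholders.
cd "C:\Program Files\Tableau\Tableau 10.0\bin"
tableau refreshextract --server https://online.tableau.com ^
    --username you@example.com --password yourpassword ^
    --datasource "Sales Extract"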
Tips for incremental refreshes

Following are some tips for incremental refreshes:

Incremental extracts retrieve only new records from the underlying data source, which reduces the amount of time required to refresh the data extract.

If there are no new records to add during an incremental extract, the processes associated with performing an incremental extract still execute.

The performance of incremental refreshes decreases over time. This is because incremental extracts only grow in size, and as a result, the amount of data and the areas of memory that must be accessed in order to satisfy requests grow as well. In addition, larger files are more likely to be fragmented on disk than smaller ones.

When performing an incremental refresh of an extract, records are not replaced. Therefore, using a date field such as "Last Updated" in an incremental refresh could result in duplicate rows in the extract.

Incremental refreshes are not possible after an additional file has been appended to a file-based data source, because the extract has multiple sources at that point.

Use the Tableau web connector to create data extracts

What is the Tableau web connector?

The Tableau Web Data Connector is an API for people who want to write some code to connect to certain web-based data, such as a web page. The connectors are written in JavaScript. It seems that these web connectors can only connect to web pages, web services, and so on; they can also connect to local files.

How to use the Tableau web connector

Click on Data | New Data Source | Web Data Connector.

Is the Tableau web connection live?

The data is pulled when the connection is built, and Tableau stores the data locally in a Tableau extract. You can still refresh the data manually or via scheduled jobs.

Are there any Tableau web connectors available?

Here is a list of web connectors from around the Tableau community:

Alteryx: http://data.theinformationlab.co.uk/alteryx.html
Facebook: http://tableaujunkie.com/post/123558558693/facebook-web-data-connector

You can check the Tableau community for more web connectors.

Summary

In summary, be sure to keep in mind the following best practices for data extracts:

Use full refreshes when possible.
Fully refresh incrementally refreshed extracts on a regular basis.
Publish data extracts to Tableau Online/Server to avoid duplicates.
Hide unused fields and use filters before creating extracts to improve performance and save storage space.
Make sure there is enough contiguous disk space for the largest extract file; a good way is to use SSD drives.

Resources for Article:

Further resources on this subject:

Getting Started with Tableau Public [article]
Introduction to Practical Business Intelligence [article]
Splunk's Input Methods and Data Feeds [article]


What’s New in SQL Server 2016 Reporting Services

Packt
09 Dec 2016
4 min read
In this article by Robert C. Cain, coauthor of the book SQL Server 2016 Reporting Services Cookbook, we'll take a brief tour of the new features in SQL Server 2016 Reporting Services. SQL Server 2016 Reporting Services is a true evolution in reporting technology. After making few changes to SSRS over the last several releases, Microsoft unveiled a virtual cornucopia of new features.

(For more resources related to this topic, see here.)

Report Portal

The old Report Manager has received a complete facelift, along with many new features. With the facelift came a rename: it is now known as the Report Portal.

KPIs

KPIs are the first feature you'll notice. The Report Portal can display key performance indicators directly, meaning your users can get important metrics at a glance without needing to open reports. In addition, these KPIs can be linked to other report items, such as reports and dashboards, so that a user can simply click on them to find more information.

Mobile Reporting

Microsoft recognized that the users in your organization no longer use just a computer to retrieve their information; mobile devices, such as phones and tablets, are now commonplace. You could, of course, design individual reports for each platform, but that would cause a lot of repetitive work and limit reuse. To solve this, Microsoft has incorporated a new tool, Mobile Reports. It allows you to create an attractive dashboard that can be displayed in any web browser. In addition, you can easily rearrange the dashboard layout to optimize it for both phones and tablets. This means you can create your report once and use it on multiple platforms: the same mobile report renders appropriately in a web browser, on a tablet, and on a phone.

Paginated reports

Traditional SSRS reports have been renamed Paginated Reports and are still a critical element of reporting; they provide the detailed information needed for day-to-day activities in your company. Paginated reports have received several enhancements. First, there are two new chart types, Sunburst and TreeMap. Reports may now be exported to a new format, PowerPoint. Additionally, all reports are now rendered in HTML 5 format, which makes them accessible to any browser, including those running on tablets or other platforms such as Linux or the Mac.

PowerBI

PowerBI Desktop reports may now be housed within the Report Portal. Currently, opening one will launch the PowerBI Desktop application. However, Microsoft has announced that in an upcoming update to SSRS 2016, PowerBI reports will be displayed directly within the Report Portal, without the need to open the external app.

Reporting applications

Speaking of apps, Report Builder has received a facelift, updating it to a more modern user interface with a color scheme that matches the Report Portal. Report Builder has also been decoupled from the installation of SQL Server. In previous versions, Report Builder was part of the SQL Server install or was available as a separate download. With SQL Server 2016, both Report Builder and the Mobile Reporting tool are separate downloads, making it easier to stay current as new versions are released. The Report Portal now contains links to download these tools.

Excel

Excel workbooks, often used as reporting tools in their own right, may now be housed within the Report Portal. Opening them will launch Excel, similar to the way PowerBI reports currently work.
Summary

This article summarizes just some of the many new enhancements to SQL Server 2016 Reporting Services. With this release, Microsoft has worked toward meeting the needs of many users in the corporate environment, including the need for mobile reporting, dashboards, and enhanced paginated reports. For more details about these and many more features, see the book SQL Server 2016 Reporting Services Cookbook, by Dinesh Priyankara and Robert C. Cain.

Resources for Article:

Further resources on this subject:

Getting Started with Pentaho Data Integration [article]
Where Is My Data and How Do I Get to It? [article]
Configuring and Managing the Mailbox Server Role [article]


Event detection from the news headlines in Hadoop

Packt
08 Dec 2016
13 min read
In this article by Anurag Shrivastava, author of Hadoop Blueprints, we will learn how to build a text analytics system that detects specific events in random news headlines. The Internet has become the main source of news in the world. Thousands of websites constantly publish and update news stories from around the world. Not every news item is relevant for everyone, but some news items are very critical for certain people or businesses. For example, if you were a major car manufacturer based in Germany with suppliers located in India, you would be interested in news from that region that could affect your supply chain.

(For more resources related to this topic, see here.)

Road accidents in India are a major social and economic problem. Road accidents leave a large number of fatalities behind and result in the loss of capital. In this example, we will build a system that detects whether a news item refers to a road accident event. Let us define what we mean by that in the next paragraph.

A road accident event may or may not result in fatal injuries. One or more vehicles and pedestrians may be involved in the accident. A non-road-accident event news item is everything else that cannot be categorized as a road accident event. It could be a trend analysis related to road accidents, or something totally unrelated.

Technology stack

To build this system, we will use the following technologies:

Data storage: HDFS
Data processing: Hadoop MapReduce
Query engine: Hive and Hive UDF
Data ingestion: curl and HDFS copy
Event detection: OpenNLP

The event detection system is a machine learning based natural language processing system. The natural language processing part brings the intelligence to detect the events in the random headline sentences from the news items.

OpenNLP

The Open Source Natural Language Processing framework (OpenNLP) comes from the Apache Software Foundation. You can download version 1.6.0 from https://opennlp.apache.org/ to run the examples in this blog. It is capable of detecting entities, document categories, parts of speech, and so on in text written by humans. We will use the document categorization feature of OpenNLP in our system. Document categorization requires you to train the OpenNLP model with the help of sample text; as a result of the training, we get a model, and this resulting model is used to categorize new text. Our training data looks as follows:

r 1.46 lakh lives lost on Indian roads last year - The Hindu.
r Indian road accident data | OpenGovernmentData (OGD) platform...
r 400 people die everyday in road accidents in India: Report - India TV.
n Top Indian female biker dies in road accident during country-wide tour.
n Thirty die in road accidents in north India mountains—World—Dunya...
n India's top woman biker Veenu Paliwal dies in road accident: India...
r Accidents on India's deadly roads cost the economy over $8 billion...
n Thirty die in road accidents in north India mountains (The Express)

The first column can take two values:

n indicates that the news item is a road accident event
r indicates that the news item is not a road accident event, that is, everything else

This training set has 200 lines in total. Please note that OpenNLP requires at least 15,000 lines in the training set to deliver good results. Because we do not have that much training data, we will start with a small set but remain aware of the limitations of our model. You will see that even with a small training dataset, this model works reasonably well. Let us train and build our model:

$ opennlp DoccatTrainer -model en-doccat.bin -lang en -data roadaccident.train.prn -encoding UTF-8

Here the file roadaccident.train.prn contains the training data. The output file en-doccat.bin contains the model, which we will use in our data pipeline. We have built our model using the command-line utility, but it is also possible to build the model programmatically, as sketched below.
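For illustration only (this code is not part of the original article), a programmatic equivalent of the DoccatTrainer call could look roughly like the following against the OpenNLP 1.6.0 API; the class names are OpenNLP's own, but treat the exact constructors and the train(...) overload as assumptions to check against the 1.6.0 Javadoc.

import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

import opennlp.tools.doccat.DoccatFactory;
import opennlp.tools.doccat.DoccatModel;
import opennlp.tools.doccat.DocumentCategorizerME;
import opennlp.tools.doccat.DocumentSample;
import opennlp.tools.doccat.DocumentSampleStream;
import opennlp.tools.util.MarkableFileInputStreamFactory;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;

public class TrainDoccatModel {
    public static void main(String[] args) throws Exception {
        // Read the training file line by line; each line is "<category> <headline>"
        ObjectStream<String> lines = new PlainTextByLineStream(
                new MarkableFileInputStreamFactory(new File("roadaccident.train.prn")),
                StandardCharsets.UTF_8);
        ObjectStream<DocumentSample> samples = new DocumentSampleStream(lines);

        // Train the categorizer (the same job the DoccatTrainer command performs)
        DoccatModel model = DocumentCategorizerME.train(
                "en", samples, TrainingParameters.defaultParams(), new DoccatFactory());

        // Persist the model so that it can be loaded later as en-doccat.bin
        try (OutputStream out = new FileOutputStream("en-doccat.bin")) {
            model.serialize(out);
        }
    }
}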
The training data file is a plain text file, which you can expand with a bigger corpus of knowledge to make the model smarter. Next, we will build the data pipeline as follows.

Fetch RSS feeds

This component fetches RSS news feeds from popular news websites. In this case, we will just use one news feed, from Google. We can always add more sites after our first RSS feed has been integrated. The whole RSS feed can be downloaded using the following command:

$ curl "https://news.google.com/news?cf=all&hl=en&ned=in&topic=n&output=rss"

The previous command downloads the news headlines for India. You can customize the RSS feed for your region by visiting the Google News site at https://news.google.com.

Scheduler

Our scheduler will fetch the RSS feed once every 6 hours. Let us assume that in a 6-hour interval we have a good likelihood of fetching fresh news items. We will wrap our feed-fetching script in a shell file and invoke it using cron. The script is as follows:

$ cat feedfetch.sh
NAME="newsfeed-"`date +%Y-%m-%dT%H.%M.%S`
curl "https://news.google.com/news?cf=all&hl=en&ned=in&topic=n&output=rss" > $NAME
hadoop fs -put $NAME /xml/rss/newsfeeds

The cron job setup line will be as follows:

0 */6 * * * /home/hduser/mycommand

Please edit your cron job table using the following command and add the setup line to it:

$ crontab -e

Loading data into HDFS

To load data into HDFS, we will use the HDFS put command, which copies the downloaded RSS feed into a directory in HDFS. Let us make the directory in HDFS where our feed fetcher script will store the RSS feeds:

$ hadoop fs -mkdir /xml/rss/newsfeeds

Query using Hive

First we will create an external table in Hive for the new RSS feed. Using XPath-based SELECT queries, we will extract the news headlines from the RSS feeds.
These headlines will be passed to the UDF to detect their categories:

CREATE EXTERNAL TABLE IF NOT EXISTS rssnews(
  document STRING)
COMMENT 'RSS Feeds from media'
STORED AS TEXTFILE
location '/xml/rss/newsfeeds';

The following command parses the XML to retrieve the titles (the headlines) and explodes them into a single-column table:

SELECT explode(xpath(document, '//item/title/text()')) FROM rssnews;

The sample output of the above command on my system is as follows:

hive> select explode(xpath(document, '//item/title/text()')) from rssnews;
Query ID = hduser_20161010134407_dcbcfd1c-53ac-4c87-976e-275a61ac3e8d
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1475744961620_0016, Tracking URL = http://localhost:8088/proxy/application_1475744961620_0016/
Kill Command = /home/hduser/hadoop-2.7.1/bin/hadoop job -kill job_1475744961620_0016
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2016-10-10 14:46:14,022 Stage-1 map = 0%, reduce = 0%
2016-10-10 14:46:20,464 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 4.69 sec
MapReduce Total cumulative CPU time: 4 seconds 690 msec
Ended Job = job_1475744961620_0016
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 4.69 sec HDFS Read: 120671 HDFS Write: 1713 SUCCESS
Total MapReduce CPU Time Spent: 4 seconds 690 msec
OK
China dispels hopes of early breakthrough on NSG, sticks to its guns on Azhar - The Hindu
Pampore attack: Militants holed up inside govt building; combing operations intensify - Firstpost
CPI(M) worker hacked to death in Kannur - The Hindu
Akhilesh Yadav's comment on PM Modi's Lucknow visit shows Samajwadi Party's insecurity: BJP - The Indian Express
PMO maintains no data about petitions personally read by PM - Daily News & Analysis
AIADMK launches social media campaign to put an end to rumours regarding Amma's health - Times of India
Pakistan, India using us to play politics: Former Baloch CM - Times of India
Indian soldier, who recited patriotic poem against Pakistan, gets death threat - Zee News
This Dussehra effigies of 'terrorism' to go up in flames - Business Standard
'Personal reasons behind Rohith's suicide': Read commission's report - Hindustan Times
Time taken: 5.56 seconds, Fetched: 10 row(s)

Hive UDF

Our Hive user-defined function (UDF) categorizeDoc takes a news headline and suggests whether or not it is a road accident event, as we explained earlier.
This function is as follows:

package com.mycompany.app;

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import opennlp.tools.doccat.DoccatModel;
import opennlp.tools.doccat.DocumentCategorizerME;
import opennlp.tools.util.InvalidFormatException;

import org.apache.hadoop.hive.ql.exec.Description;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

@Description(
    name = "getCategory",
    value = "_FUNC_(string) - gets the category of a document")
public final class MyUDF extends UDF {

    public Text evaluate(Text input) {
        if (input == null) return null;
        try {
            return new Text(categorizeDoc(input.toString()));
        } catch (Exception ex) {
            ex.printStackTrace();
            return new Text("Sorry Failed: >> " + input.toString());
        }
    }

    public String categorizeDoc(String doc) throws InvalidFormatException, IOException {
        // Load the trained categorization model from the local directory
        InputStream is = new FileInputStream("./en-doccat.bin");
        DoccatModel model = new DoccatModel(is);
        is.close();
        // Score the headline against each category and return the best one ('n' or 'r')
        DocumentCategorizerME classificationME = new DocumentCategorizerME(model);
        String documentContent = doc;
        double[] classDistribution = classificationME.categorize(documentContent);
        String predictedCategory = classificationME.getBestCategory(classDistribution);
        return predictedCategory;
    }
}

The function categorizeDoc takes a single string as input. It loads the model that we created earlier from the file en-doccat.bin in the local directory, and then calls the classifier, which returns the result to the calling function. The calling function MyUDF extends the Hive UDF class and calls categorizeDoc for each input line item. If the call succeeds, the value is returned to the calling program; otherwise, a message is returned indicating that category detection has failed.
The pom.xml file to build the above class is as follows:

$ cat pom.xml
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.mycompany</groupId>
  <artifactId>app</artifactId>
  <version>1.0</version>
  <packaging>jar</packaging>
  <properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <maven.compiler.source>1.7</maven.compiler.source>
    <maven.compiler.target>1.7</maven.compiler.target>
  </properties>
  <dependencies>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>4.12</version>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client</artifactId>
      <version>2.7.1</version>
      <type>jar</type>
    </dependency>
    <dependency>
      <groupId>org.apache.hive</groupId>
      <artifactId>hive-exec</artifactId>
      <version>2.0.0</version>
      <type>jar</type>
    </dependency>
    <dependency>
      <groupId>org.apache.opennlp</groupId>
      <artifactId>opennlp-tools</artifactId>
      <version>1.6.0</version>
    </dependency>
  </dependencies>
  <build>
    <pluginManagement>
      <plugins>
        <plugin>
          <groupId>org.apache.maven.plugins</groupId>
          <artifactId>maven-surefire-plugin</artifactId>
          <version>2.8</version>
        </plugin>
        <plugin>
          <artifactId>maven-assembly-plugin</artifactId>
          <configuration>
            <archive>
              <manifest>
                <mainClass>com.mycompany.app.App</mainClass>
              </manifest>
            </archive>
            <descriptorRefs>
              <descriptorRef>jar-with-dependencies</descriptorRef>
            </descriptorRefs>
          </configuration>
        </plugin>
      </plugins>
    </pluginManagement>
  </build>
</project>

You can build the jar with all the dependencies in it using the following command:

$ mvn clean compile assembly:single

The resulting jar file, app-1.0-jar-with-dependencies.jar, can be found in the target directory. Let us use this jar file in Hive to categorize the news headlines as follows.

Copy the jar file to the bin subdirectory under the Hive root:

$ cp app-1.0-jar-with-dependencies.jar $HIVE_ROOT/bin

Copy the trained model to the bin subdirectory under the Hive root:

$ cp en-doccat.bin $HIVE_ROOT/bin

Run the categorization queries. Start Hive:

$ hive

Add the jar file in Hive:

hive> ADD JAR ./app-1.0-jar-with-dependencies.jar;

Create a temporary categorization function catDoc:

hive> CREATE TEMPORARY FUNCTION catDoc as 'com.mycompany.app.MyUDF';

Create a table headlines to hold the headlines extracted from the RSS feed:

hive> create table headlines( headline string);

Insert the extracted headlines into the table headlines:

hive> insert overwrite table headlines select explode(xpath(document, '//item/title/text()')) from rssnews;

Let's test our UDF by manually passing it a real news headline from a newspaper website:

hive> select catDoc("8 die as SUV falls into river while crossing bridge in Ghazipur");
OK
N

The output is N, which means this is indeed a headline about a road accident incident.
This is reasonably good, so now let us run the function for all the headlines:

hive> select headline, catDoc(*) from headlines;
OK
China dispels hopes of early breakthrough on NSG, sticks to its guns on Azhar - The Hindu   r
Pampore attack: Militants holed up inside govt building; combing operations intensify - Firstpost   r
Akhilesh Yadav Backs Rahul Gandhi's 'Dalali' Remark - NDTV   r
PMO maintains no data about petitions personally read by PM Narendra Modi - Economic Times   n
Mobile Internet Services Suspended In Protest-Hit Nashik - NDTV   n
Pakistan, India using us to play politics: Former Baloch CM - Times of India   r
CBI arrests Central Excise superintendent for taking bribe - Economic Times   n
Be extra vigilant during festivals: Centre's advisory to states - Times of India   r
CPI-M worker killed in Kerala - Business Standard   n
Burqa-clad VHP activist thrashed for sneaking into Muslim women gathering - The Hindu   r
Time taken: 0.121 seconds, Fetched: 10 row(s)

You can see that our headline detection function works and outputs r or n. In this example, we see several false positives, where a headline has been incorrectly identified as a road accident. Better training of our model would improve the quality of the results.

Further reading

The book Hadoop Blueprints covers several case studies where we can apply Hadoop, HDFS, data ingestion tools such as Flume and Sqoop, query and visualization tools such as Hive and Zeppelin, and machine learning tools such as BigML and Spark to build solutions. You will discover, for example, how to build a fraud detection system using Hadoop, or how to build a data lake.

Summary

In this article, we learned how to build a text analytics system that detects specific events in random news headlines, and how to apply Hadoop, HDFS, and several other tools along the way.

Resources for Article:

Further resources on this subject:

Spark for Beginners [article]
Hive Security [article]
Customizing heat maps (Intermediate) [article]


Build a Chatbot

Packt
07 Dec 2016
23 min read
In this article written by Alexander T. Combs, author of the book Python Machine Learning Blueprints, we are going to learn how to construct a chatbot from scratch. Along the way, we'll learn more about the history of the field and its future prospects. We'll cover the following topics in this article:

The Turing test
The history of chatbots
The design of chatbots
Building a chatbot

(For more resources related to this topic, see here.)

The Turing test

In 1950, Alan Turing posed the question "Can machines think?" in one of his more famous papers. Being the practical genius that he was, he decided not to tackle this question head on, but to pose it in the framework of the "problem of other minds" instead. This problem poses the question, "How do we truly know that other people have minds like our own?" As we can only observe their behavior and not the inner workings of their minds, we must take it on faith that they are like us. Turing proposed that if a machine behaved as if it was intelligent, then in fact we should view it as such. This, in a nutshell, is the Turing test. Passing this test means convincing humans that a machine is a fellow human.

You may wonder at this point whether a program has ever passed the Turing test successfully. The answer is that there really is no official Turing test. The closest thing to an official event is the Loebner Prize, a contest held annually that awards modest prizes to the chatbots that appear most convincingly human to a panel of judges. If this sounds rigorous and academic, it has been described as anything but; Loebner's sponsorship of the contest is said to have been a product of his total disdain for labor of any kind. Hopefully, by this point, you have realized that any claims about "passing the Turing test" are more spectacle than actual science. However, despite this, Turing's original concept is a useful construct.

The history of chatbots

I still remember the day my father sat me down in front of his IBM XT with its black and green flickering monitor and asked me to talk to "it". The conversation went something like the following:

>Hello? How are you feeling today?
>Hi. Who is this?
>Hi there… how are you today?
>I'm fine. You didn't tell me who you are.
>We should be discussing you, not me.
>We should? Why?
>Let's change focus a bit… Tell me about your family.
>You are impossible to have a conversation with.
>Can you elaborate on that?
>You are killing me.
>Perhaps you would like me to be killing you.
>Ok, you are freaking me out.
>Perhaps you're really talking about yourself?
>What?? That makes no sense.
>How would an answer to that help you?
>END
>Very Interesting.
>QUIT
>Thank you for talking with me.
>#$###$#@@….

The "it" that I talked to was a program called ELIZA. Written in the mid-1960s at MIT's Artificial Intelligence Lab by Joseph Weizenbaum, ELIZA was created to mimic the responses of a Rogerian psychotherapist. Though nearly comical when examined in any depth, the program was capable of convincing some users that they were chatting with an actual human, which was a remarkable feat considering it was a scant 200 lines of code that used randomization and regular expressions to parrot back responses. Even today, this simple program remains a staple of popular culture: if you ask Siri who ELIZA is, she will tell you she is a friend and a brilliant psychiatrist.

If ELIZA was an early example of chatbots, what have we seen since then?
In recent years, there has been an explosion of new chatbots; the most notable of these is Cleverbot. Cleverbot was released to the world via the web in 1997. Since then, this bot has racked up hundreds of millions of conversations, and unlike early chatbots, Cleverbot (as the name suggests) appears to become more intelligent with each conversation. Though the exact details of the workings of the algorithm are difficult to find, it is said to work by recording all conversations in a database and finding the most appropriate response by identifying the most similar questions and responses in the database. When I made up a nonsensical question, Cleverbot found something similar to the object of my question in terms of a string match, and when I persisted, I again got something…similar? You'll also notice that topics can persist across a conversation: in response to one of my answers, I was asked to go into more detail and justify it. This is one of the things that appears to make Cleverbot, well, clever.

While chatbots that learn from humans can be quite amusing, they can also have a darker side. Just this past year, Microsoft released a chatbot named Tay on Twitter. People were invited to ask questions of Tay, and Tay would respond in accordance with her "personality". Microsoft had apparently programmed the bot to appear to be a 19-year-old American girl. She was intended to be your virtual "bestie"; the only problem was that she started sounding like she would rather hang with the Nazi youth than with you. As a result of a series of unbelievably inflammatory tweets, Microsoft was forced to pull Tay off Twitter and issue an apology:

"As many of you know by now, on Wednesday we launched a chatbot called Tay. We are deeply sorry for the unintended offensive and hurtful tweets from Tay, which do not represent who we are or what we stand for, nor how we designed Tay. Tay is now offline and we'll look to bring Tay back only when we are confident we can better anticipate malicious intent that conflicts with our principles and values."
-March 25, 2016, Official Microsoft Blog

Clearly, brands that want to release chatbots into the wild in the future should take a lesson from this debacle. There is no doubt that brands are embracing chatbots: everyone from Facebook to Taco Bell is getting in on the game. Witness TacoBot. Yes, this is a real thing, and despite stumbles such as Tay, there is a good chance the future of UI looks a lot like TacoBot. One last example might even help explain why.

Quartz recently launched an app that turns news into a conversation. Rather than laying out the day's stories as a flat list, it engages you in a chat as if you were getting the news from a friend. David Gasca, a PM at Twitter, describes his experience using the app in a post on Medium. He describes how the conversational nature invoked feelings that were normally only triggered in human relationships. This is his take on how he felt when he encountered an ad in the app:

"Unlike a simple display ad, in a conversational relationship with my app, I feel like I owe something to it: I want to click. At the most subconscious level, I feel the need to reciprocate and not let the app down: The app has given me this content. It's been very nice so far and I enjoyed the GIFs.
I should probably click since it's asking nicely.”

If this experience is universal—and I expect that it is—this could be the next big thing in advertising, and I have no doubt that advertising profits will drive UI design:

“The more the bot acts like a human, the more it will be treated like a human.”
-Mat Webb, technologist and co-author of Mind Hacks

At this point, you are probably dying to know how these things work, so let's get on with it!

The design of chatbots
The original ELIZA application was two-hundred odd lines of code. The Python NLTK implementation is similarly short. An excerpt can be seen at the following link from NLTK's website (http://www.nltk.org/_modules/nltk/chat/eliza.html). I have also reproduced an excerpt below:

# Natural Language Toolkit: Eliza
#
# Copyright (C) 2001-2016 NLTK Project
# Authors: Steven Bird <stevenbird1@gmail.com>
#          Edward Loper <edloper@gmail.com>
# URL: <http://nltk.org/>
# For license information, see LICENSE.TXT

# Based on an Eliza implementation by Joe Strout <joe@strout.net>,
# Jeff Epler <jepler@inetnebr.com> and Jez Higgins <mailto:jez@jezuk.co.uk>.

# a translation table used to convert things you say into things the
# computer says back, e.g. "I am" --> "you are"

from __future__ import print_function
from nltk.chat.util import Chat, reflections  # provides the Chat class and the reflections table used below

# a table of response pairs, where each pair consists of a
# regular expression, and a list of possible responses,
# with group-macros labelled as %1, %2.

pairs = (
    (r'I need (.*)',
     ("Why do you need %1?",
      "Would it really help you to get %1?",
      "Are you sure you need %1?")),
    (r'Why don\'t you (.*)',
     ("Do you really think I don't %1?",
      "Perhaps eventually I will %1.",
      "Do you really want me to %1?")),
    [snip]
    (r'(.*)\?',
     ("Why do you ask that?",
      "Please consider whether you can answer your own question.",
      "Perhaps the answer lies within yourself?",
      "Why don't you tell me?")),
    (r'quit',
     ("Thank you for talking with me.",
      "Good-bye.",
      "Thank you, that will be $150. Have a good day!")),
    (r'(.*)',
     ("Please tell me more.",
      "Let's change focus a bit... Tell me about your family.",
      "Can you elaborate on that?",
      "Why do you say that %1?",
      "I see.",
      "Very interesting.",
      "%1.",
      "I see. And what does that tell you?",
      "How does that make you feel?",
      "How do you feel when you say that?"))
)

eliza_chatbot = Chat(pairs, reflections)

def eliza_chat():
    print("Therapist\n---------")
    print("Talk to the program by typing in plain English, using normal upper-")
    print('and lower-case letters and punctuation. Enter "quit" when done.')
    print('=' * 72)
    print("Hello. How are you feeling today?")
    eliza_chatbot.converse()

def demo():
    eliza_chat()

if __name__ == "__main__":
    demo()

As you can see from this code, input text was parsed and then matched against a series of regular expressions. Once the input was matched, a randomized response (that sometimes echoed back a portion of the input) was returned. So, something such as I need a taco would trigger a response of Would it really help you to get a taco? Obviously, the answer is yes, and fortunately, we have advanced to the point that technology can provide one to you (bless you, TacoBot), but this was still in the early days. Shockingly, some people did actually believe ELIZA was a real human. However, what about more advanced bots? How are they constructed? Surprisingly, most of the chatbots that you're likely to encounter don't even use machine learning; they use what's known as retrieval-based models. This means responses are predefined according to the question and the context.
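To make the pattern-and-response mechanism a little more concrete before moving on, here is a minimal standalone sketch of the same idea in plain Python. This is not the NLTK implementation quoted above, just an illustration of how a couple of regular-expression rules plus a random choice can produce ELIZA-like replies; the patterns and canned responses here are invented for the example.

import random
import re

# Each rule pairs a regex with candidate responses; %1 echoes the first
# captured group back at the user, as in the ELIZA excerpt above.
rules = [
    (r'I need (.*)', ["Why do you need %1?", "Would it really help you to get %1?"]),
    (r'(.*)', ["Please tell me more.", "Why do you say that %1?"]),
]

def respond(text):
    for pattern, responses in rules:
        match = re.match(pattern, text, re.IGNORECASE)
        if match:
            reply = random.choice(responses)
            # drop trailing punctuation from the echoed fragment before substituting it
            return reply.replace('%1', match.group(1).rstrip('.!?'))

print(respond('I need a taco'))

Because the first matching rule wins and the last pattern matches anything, the function always returns something, which is exactly the trick the original ELIZA relied on.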
The most common architecture for these bots is something called Artificial Intelligence Markup Language (AIML). AIML is an XML-based schema to represent how the bot should interact with the user's input. It's really just a more advanced version of how ELIZA works. Let's take a look at how responses are generated using AIML. First, all inputs are preprocessed to normalize them. This means when you input “Waaazzup???”, it is mapped to “WHAT IS UP”. This preprocessing step funnels down the myriad ways of saying the same thing into one input that can run against a single rule. Punctuation and other extraneous inputs are removed as well at this point. Once this is complete, the input is matched against the appropriate rule. The following is a sample template:

<category>
  <pattern>WHAT IS UP</pattern>
  <template>The sky, duh. Pfft. Humans...</template>
</category>

This is the basic setup, but you can also layer in wildcards, randomization, and prioritization schemes. For example, the following pattern uses wildcard matching:

<category>
  <pattern>* FOR ME</pattern>
  <template>I'm a bot. I don't <star/>. Ever.</template>
</category>

Here, the * wildcard matches one or more words before FOR ME and then repeats these back in the output template. If the user were to type in Dance for me!, the response would be, I'm a bot. I don't dance. Ever. As you can see, these rules don't make for anything that approximates any type of real intelligence, but there are a few tricks that strengthen the illusion. One of the better ones is the ability to generate responses conditioned to a topic. For example, here is a rule that invokes a topic:

<category>
  <pattern>I LIKE TURTLES</pattern>
  <template>I feel like this whole <set name="topic">turtle</set> thing could be a problem. What do you like about them?</template>
</category>

Once the topic is set, then the rules specific to that context can be matched:

<topic name="turtles">
  <category>
    <pattern>* SHELL IS *</pattern>
    <template>I dislike turtles primarily because of their shells. What other creepy things do you like about turtles?</template>
  </category>
  <category>
    <pattern>* HIDE *</pattern>
    <template>I wish like a turtle that I could hide from this conversation.</template>
  </category>
</topic>

Let's see what this interaction looks like:

>I like turtles!
>I feel like this whole turtle thing could be a problem. What do you like about them?
>I like how they hide in their shell
>I wish like a turtle I could hide from this conversation.

You can see that the continuity across the conversation adds a measure of realism. You probably think that this can't be state-of-the-art in this age of deep learning, and you're right. While most bots are rule-based, the next generation of chatbots are emerging, and they are based on neural networks. In 2015, Oriol Vinyals and Quoc Le of Google published a paper (http://arxiv.org/pdf/1506.05869v1.pdf), which described the construction of a neural network, based on sequence-to-sequence models. This type of model maps an input sequence, such as “ABC”, to an output sequence, such as “XYZ”. These inputs and outputs can be translations from one language to another, for example. However, in the case of their work here, the training data was not language translation, but rather tech support transcripts and movie dialog. While the results from both models are interesting, it was the interactions based on the movie model that stole the headlines.
The following are sample interactions taken from the paper: None of this was explicitly encoded by humans or present in a training set as asked, and yet, looking at this is, it is frighteningly like speaking with a human. However, let's see more… Note that the model responds with what appears to be knowledge of gender (he, she), of place (England), and career (player). Even questions of meaning, ethics, and morality are fair game: The conversation continues: If this transcript doesn't give you a slight chill of fear for the future, there's a chance you may already be some sort of AI. I wholeheartedly recommend reading the entire paper. It isn't overly technical, and it will definitely give you a glimpse of where this technology is headed. We talked a lot about the history, types, and design of chatbots, but let's now move on to building our own! Building a chatbot Now, having seen what is possible in terms of chatbots, you most likely want to build the best, most state-of-the-art, Google-level bot out there, right? Well, just put that out of your mind right now because we will do just the opposite! We will build the best, most awful bot ever! Let me tell you why. Building a chatbot comparable to what Google built takes some serious hardware and time. You aren't going to whip up a model on your MacBook Pro that takes anything less than a month or two to run with any type of real training set. This means that you will have to rent some time on an AWS box, and not just any box. This box will need to have some heavy-duty specs and preferably be GPU-enabled. You are more than welcome to attempt such a thing. However, if your goal is just to build something very cool and engaging, I have you covered here. I should also warn you in advance, although Cleverbot is no Tay, the conversations can get a bit salty. If you are easily offended, you may want to find a different training set. Ok, let's get started! First, as always, we need training data. Again, as always, this is the most challenging step in the process. Fortunately, I have come across an amazing repository of conversational data. The notsocleverbot.com site has people submit the most absurd conversations they have with Cleverbot. How can you ask for a better training set? Let's take a look at a sample conversation between Cleverbot and a user from the site: So, this is where we'll begin. We'll need to download the transcripts from the site to get started: You'll just need to paste the link into the form on the page. The format will be like the following: http://www.notsocleverbot.com/index.php?page=1. Once this is submitted, the site will process the request and return a page back that looks like the following: From here, if everything looks right, click on the pink Done button near the top right. The site will process the page and then bring you to the following page: Next, click on the Show URL Generator button in the middle: Next, you can set the range of numbers that you'd like to download from. For example, 1-20, by 1 step. Obviously, the more pages you capture, the better this model will be. However, remember that you are taxing the server, so please be considerate. Once this is done, click on Add to list and hit Return in the text box, and you should be able to click on Save. It will begin running, and when it is complete, you will be able to download the data as a CSV file. Next, we'll use our Jupyter notebook to examine and process the data. We'll first import pandasand the Python regular expressions library, re. 
We will also set the option in pandas to widen our column width so that we can see the data better:

import pandas as pd
import re
pd.set_option('display.max_colwidth', 200)

Now, we'll load in our data:

df = pd.read_csv('/Users/alexcombs/Downloads/nscb.csv')
df

The preceding code will result in the following output: As we're only interested in the first column, the conversation data, we'll parse this out:

convo = df.iloc[:,0]
convo

The preceding code will result in the following output: You should be able to make out that we have interactions between User and Cleverbot, and that either can initiate the conversation. To get the data in the format that we need, we'll have to parse it into question and response pairs. We aren't necessarily concerned with who says what, but we are concerned with matching up each response to each question. You'll see why in a bit. Let's now perform a bit of regular expression magic on the text:

clist = []
def qa_pairs(x):
    cpairs = re.findall(": (.*?)(?:$|\n)", x)
    clist.extend(list(zip(cpairs, cpairs[1:])))

convo.map(qa_pairs);
convo_frame = pd.Series(dict(clist)).to_frame().reset_index()
convo_frame.columns = ['q', 'a']

The preceding code results in the following output: Okay, there's a lot of code there. What just happened? We first created a list to hold our question and response tuples. We then passed our conversations through a function to split them into these pairs using regular expressions. Finally, we set it all into a pandas DataFrame with columns labelled q and a. We will now apply a bit of algorithm magic to match up the closest question to the one a user inputs:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

vectorizer = TfidfVectorizer(ngram_range=(1,3))
vec = vectorizer.fit_transform(convo_frame['q'])

What we did in the preceding code was to import our TfidfVectorizer library and the cosine similarity library. We then used our training data to create a tf-idf matrix. We can now use this to transform our own new questions and measure the similarity to existing questions in our training set. We covered cosine similarity and tf-idf algorithms in detail, so flip back there if you want to understand how these work under the hood. Let's now get our similarity scores:

my_q = vectorizer.transform(['Hi. My name is Alex.'])
cs = cosine_similarity(my_q, vec)
rs = pd.Series(cs[0]).sort_values(ascending=0)
top5 = rs.iloc[0:5]
top5

The preceding code results in the following output: What are we looking at here? This is the cosine similarity between the question I asked and the top five closest questions. To the left is the index and on the right is the cosine similarity. Let's take a look at these:

convo_frame.iloc[top5.index]['q']

This results in the following output: As you can see, nothing is exactly the same, but there are definitely some similarities. Let's now take a look at the response:

rsi = rs.index[0]
rsi
convo_frame.iloc[rsi]['a']

The preceding code results in the following output: Okay, so our bot seems to have an attitude already. Let's push further.
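Before wrapping this lookup in a convenience function, it may help to see the same tf-idf plus cosine-similarity matching in isolation on a tiny, made-up corpus. The snippet below is a sketch for illustration only and is not part of the original notebook; the three example questions are invented.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# A tiny stand-in for convo_frame['q']: three known questions
questions = ['What is your name?', 'Do you like tacos?', 'Where do you live?']

vectorizer = TfidfVectorizer(ngram_range=(1, 3))
vec = vectorizer.fit_transform(questions)

# Project a new question into the same tf-idf space and score it
new_q = vectorizer.transform(['What is your favourite name?'])
scores = cosine_similarity(new_q, vec)[0]

# The highest score marks the closest stored question, which is how the
# bot decides which stored answer to hand back
print(scores)
print(questions[scores.argmax()])

The logic is the same as what we just ran against the Cleverbot transcripts, only small enough to inspect by hand.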
We'll create a handy function so that we can test a number of statements easily:

def get_response(q):
    my_q = vectorizer.transform([q])
    cs = cosine_similarity(my_q, vec)
    rs = pd.Series(cs[0]).sort_values(ascending=0)
    rsi = rs.index[0]
    return convo_frame.iloc[rsi]['a']

get_response('Yes, I am clearly more clever than you will ever be!')

This results in the following output: We have clearly created a monster, so we'll continue:

get_response('You are a stupid machine. Why must I prove anything to you?')

This results in the following output: I'm enjoying this. Let's keep rolling with it:

get_response('My spirit animal is a menacing cat. What is yours?')

To which I responded:

get_response("I mean I didn't actually name it.")

This results in the following output: Continuing:

get_response('Do you have a name suggestion?')

This results in the following output: To which I respond:

get_response('I think it might be a bit aggressive for a kitten')

This results in the following output: I attempt to calm the situation:

get_response('No need to involve the police.')

This results in the following output: And finally,

get_response('And I you, Cleverbot')

This results in the following output: Remarkably, this may be one of the best conversations I've had in a while: bot or no bot. Now that we have created this cake-based intelligence, let's set it up so that we can actually chat with it via text message. We'll need a few things to make this work. The first is a twilio account. They will give you a free account that lets you send and receive text messages. Go to http://www.twilio.com and click to sign up for a free developer API key. You'll set up some login credentials and they will text your phone to confirm your number. Once this is set up, you'll be able to find the details in their Quickstart documentation. Make sure that you select Python from the drop-down menu in the upper left-hand corner. Sending messages from Python code is a breeze, but you will need to request a twilio number. This is the number that you will use to send and receive messages in your code. The receiving bit is a little more complicated because it requires that you have a web server running. The documentation is succinct, so you shouldn't have that hard a time getting it set up. You will need to paste a public-facing Flask server's URL in under the area where you manage your twilio numbers. Just click on the number and it will bring you to the spot to paste in your URL. Once this is all set up, you will just need to make sure that you have your Flask web server up and running.
I have condensed all the code here for you to use on your Flask app:

from flask import Flask, request, redirect
import twilio.twiml
import pandas as pd
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

app = Flask(__name__)

PATH_TO_CSV = 'your/path/here.csv'
df = pd.read_csv(PATH_TO_CSV)
convo = df.iloc[:,0]
clist = []

def qa_pairs(x):
    cpairs = re.findall(": (.*?)(?:$|\n)", x)
    clist.extend(list(zip(cpairs, cpairs[1:])))

convo.map(qa_pairs);
convo_frame = pd.Series(dict(clist)).to_frame().reset_index()
convo_frame.columns = ['q', 'a']

vectorizer = TfidfVectorizer(ngram_range=(1,3))
vec = vectorizer.fit_transform(convo_frame['q'])

@app.route("/", methods=['GET', 'POST'])
def get_response():
    input_str = request.values.get('Body')

    def get_response(q):
        # the inner helper intentionally reuses the notebook function's name;
        # it looks up the closest stored question and returns its answer
        my_q = vectorizer.transform([input_str])
        cs = cosine_similarity(my_q, vec)
        rs = pd.Series(cs[0]).sort_values(ascending=0)
        rsi = rs.index[0]
        return convo_frame.iloc[rsi]['a']

    resp = twilio.twiml.Response()
    if input_str:
        resp.message(get_response(input_str))
        return str(resp)
    else:
        resp.message('Something bad happened here.')
        return str(resp)

It looks like there is a lot going on, but essentially we use the same code that we used before, only now we grab the POST data that twilio sends—the text body specifically—rather than the data we hand-entered before into our get_response function. If all goes as planned, you should have your very own weirdo bestie that you can text anytime, and what could be better than that!

Summary
In this article, we had a full tour of the chatbot landscape. It is clear that we are just on the cusp of an explosion of these sorts of applications. The Conversational UI revolution is just about to begin. Hopefully, this article has inspired you to create your own bot, but if not, at least perhaps you have a much richer understanding of how these applications work and how they will shape our future. I'll let the app say the final words:

get_response("Say goodbye, Clevercake")

Resources for Article: Further resources on this subject: Supervised Machine Learning [article] Unsupervised Learning [article] Specialized Machine Learning Topics [article]
Define the Necessary Connections

Packt
02 Dec 2016
5 min read
In this article by Robert van Mölken and Phil Wilkins, the author of the book Implementing Oracle Integration Cloud Service, where we will see creating connections which is one of the core components of an integration we can easily navigate to the Designer Portal and start creating connections. (For more resources related to this topic, see here.) On the home page, click the Create link of the Connection tile as given in the following screenshot: Because we click on this link the Connections page is loaded, which lists of all created connections, a modal dialogue automatically opens on top of the list. This pop-up shows all the adapter types we can create. For our first integration we define two technology adapter connections, an inbound SOAP connection and an outbound REST connection. Inbound SOAP connection In the pop-up we can scroll down the list and find the SOAP adapter, but the modal dialogue also includes a search field. Just search on SOAP and the list will show the adapters matching the search criteria: Find your adapter by searching on the name or change the appearance from card to list view to show more adapters at ones. Click Select to open the New Connection page. Before we can setup any adapter specific configurations every creation starts with choosing a name and an optional description: Create the connection with the following details: Connection Name FlightAirlinesSOAP_Ch2 Identifier This will be proposed based on the connection name and there is no need to change unless you'd like an alternate name. It is usually the name in all CAPITALS and without spaces and has a max length of 32 characters. Connection Role Trigger The role chosen restricts the connection to be used only in selected role(s). Description This receives in Airline objects as a SOAP service. Click the Create button to accept the details. This will bring us to the specific adapter configuration page where we can add and modify the necessary properties. The one thing all the adapters have in common is the optional Email Address under Connection Administration. This email address is used to send notification to when problems or changes occur in the connection. A SOAP connection consists of three sections; Connection Properties, Security, and an optional Agent Group. On the right side of each section we can find a button to configure its properties.Let's configure each section using the following steps: Click the Configure Connectivity button. Instead of entering in an URL we are uploading the WSDL file. Check the box in the Upload File column. Click the newly shown Upload button. Upload the file ICSBook-Ch2-FlightAirlines-Source WSDL. Click OK to save the properties. Click the Configure Credentials button. In the pop-up that is shown we can configure the security credentials. We have the choice for Basic authentication, Username Password Token, or No Security Policy. Because we use it for our inbound connection we don't have to configure this. Select No Security Policy from the dropdown list. This removes the username and password fields. Click OK to save the properties. We leave the Agent Group section untouched. We can attach an Agent Group if we want to use it as an outbound connection to an on-premises web service. Click Test to check if the connection is working (otherwise it can't be used). For SOAP and REST it simply pings the given domain to check the connectivity, but others for example the Oracle SaaS adapters also authenticate and collect metadata. 
Click the Save button at the top of the page to persist our changes. Click Exit Connection to return to the list from where we started. Outbound REST connection Now that the inbound connection is created we can create our REST adapter. Click the Create New Connection button to show the Create Connection pop-up again and select the REST adapter. Create the connection with the following details: Connection Name FlightAirlinesREST_Ch2 Identifier This will be proposed based on the connection name Connection Role Invoke Description This returns the Airline objects as a REST/JSON service Email Address Your email address to use to send notifications to Let’s configure the connection properties using the following steps: Click the Configure Connectivity button. Select REST API Base URL for the Connection Type. Enter the URL were your Apiary mock is running on: http://private-xxxx-yourapidomain.apiary-mock.com. Click OK to save the values. Next configure the security credentials using the following steps: Click the Configure Credentials button. Select No Security Policy for the Security Policy. This removes the username and password fields. Click the OK button to save out choice. Click Test at the top to check if the connection is working. Click the Save button at the top of the page to persist our changes. Click Exit Connection to return to the list from where we started. Troubleshooting If the test fails for one of these connections check if the correct WSDL is used or that the connection URL for the REST adapter exists or is reachable. Summary In this article we looked at the processes of creating and testing the necessary connections and the creation of the integration itself. We have seen an inbound SOAP connection and an outbound REST connection. In demonstrating the integration we have also seen how to use Apiary to document and mock our backend REST service. Resources for Article: Further resources on this subject: Getting Started with a Cloud-Only Scenario [article] Extending Oracle VM Management [article] Docker Hosts [article]
Introducing Algorithm Design Paradigms

Packt
18 Nov 2016
10 min read
In this article by David Julian and Benjamin Baka, authors of the book Python Data Structures and Algorithm, we will discern three broad approaches to algorithm design. They are as follows:
- Divide and conquer
- Greedy algorithms
- Dynamic programming

(For more resources related to this topic, see here.)

As the name suggests, the divide and conquer paradigm involves breaking a problem into smaller subproblems, and then in some way combining the results to obtain a global solution. This is a very common and natural problem solving technique and is, arguably, the most used approach to algorithm design. Greedy algorithms often involve optimization and combinatorial problems; the classic example is applying it to the traveling salesperson problem, where a greedy approach always chooses the closest destination first. This shortest path strategy involves finding the best solution to a local problem in the hope that this will lead to a global solution. The dynamic programming approach is useful when our subproblems overlap. This is different from divide and conquer. Rather than breaking our problem into independent subproblems, with dynamic programming, intermediate results are cached and can be used in subsequent operations. Like divide and conquer, it uses recursion. However, dynamic programming allows us to compare results at different stages. This can have a performance advantage over divide and conquer for some problems because it is often quicker to retrieve a previously calculated result from memory rather than having to recalculate it.

Recursion and backtracking
Recursion is particularly useful for divide and conquer problems; however, it can be difficult to understand exactly what is happening, since each recursive call is itself spinning off other recursive calls. At the core of a recursive function are two types of cases: base cases, which tell the recursion when to terminate, and recursive cases, which call the function they are in. A simple problem that naturally lends itself to a recursive solution is calculating factorials. The recursive factorial algorithm defines two cases—the base case, when n is zero, and the recursive case, when n is greater than zero. A typical implementation is shown in the following code:

def factorial(n):
    # test for a base case
    if n == 0:
        return 1
    # make a calculation and a recursive call
    f = n * factorial(n-1)
    print(f)
    return(f)

factorial(4)

This code prints out the values 1, 2, 6, 24. To calculate 4!, we require four recursive calls plus the initial parent call. On each recursion, a copy of the method's variables is stored in memory. Once the method returns, it is removed from memory. Here is a way to visualize this process: It may not necessarily be clear if recursion or iteration is a better solution to a particular problem; after all, they both repeat a series of operations and both are very well suited to divide and conquer approaches to algorithm design. An iteration churns away until the problem is done. Recursion breaks the problem down into smaller chunks and then combines the results. Iteration is often easier for programmers because the control stays local to a loop, whereas recursion can more closely represent mathematical concepts such as factorials. Recursive calls are stored in memory, whereas iterations are not. This creates a tradeoff between processor cycles and memory usage, so choosing which one to use may depend on whether the task is processor or memory intensive. The following table outlines the key differences between recursion and iteration.
Recursion versus iteration:
- Recursion terminates when a base case is reached; iteration terminates when a defined condition is met.
- Each recursive call requires space in memory; an iteration is not stored in memory.
- An infinite recursion results in a stack overflow error; an infinite iteration will run while the hardware is powered.
- Some problems are naturally better suited to recursive solutions; iterative solutions may not always be obvious.

Backtracking
Backtracking is a form of recursion that is particularly useful for types of problems such as traversing tree structures, where we are presented with a number of options at each node, from which we must choose one. Subsequently, we are presented with a different set of options, and depending on the series of choices made, either a goal state or a dead end is reached. If it is the latter, we must backtrack to a previous node and traverse a different branch. Backtracking is a divide and conquer method for exhaustive search. Importantly, backtracking prunes branches that cannot give a result. An example of backtracking is given by the following. Here, we have used a recursive approach to generating all the possible permutations of a given string, s, of a given length n:

def bitStr(n, s):
    if n == 1:
        return s
    return [digit + bits for digit in bitStr(1, s) for bits in bitStr(n - 1, s)]

print(bitStr(3, 'abc'))

This generates the following output: Note the double list comprehension and the two recursive calls within it. This recursively concatenates each element of the initial sequence, returned when n = 1, with each element of the string generated in the previous recursive call. In this sense, it is backtracking to uncover previously ungenerated combinations. The final string that is returned is all n letter combinations of the initial string.

Divide and conquer – long multiplication
For recursion to be more than just a clever trick, we need to understand how to compare it to other approaches, such as iteration, and to understand when its use will lead to a faster algorithm. An iterative algorithm that we are all familiar with is the procedure you learned in primary math classes, which was used to multiply two large numbers, that is, long multiplication. If you remember, long multiplication involved iterative multiplying and carry operations followed by a shifting and addition operation. Our aim here is to examine ways to measure how efficient this procedure is and attempt to answer the question, is this the most efficient procedure we can use for multiplying two large numbers together? Multiplying two 4-digit numbers together in this way requires 16 multiplication operations, and we can generalize to say that multiplying two n-digit numbers requires, approximately, n² multiplication operations. This method of analyzing algorithms, in terms of number of computational primitives such as multiplication and addition, is important because it can give a way to understand the relationship between the time it takes to complete a certain computation and the size of the input to that computation. In particular, we want to know what happens when the input, the number of digits, n, is very large. Can we do better?

A recursive approach
It turns out that in the case of long multiplication, the answer is yes, there are in fact several algorithms for multiplying large numbers that require fewer operations. One of the most well-known alternatives to long multiplication is the Karatsuba algorithm, published in 1962.
This takes a fundamentally different approach: rather than iteratively multiplying single digit numbers, it recursively carries out multiplication operations on progressively smaller inputs. Recursive programs call themselves on smaller subsets of the input. The first step in building a recursive algorithm is to decompose a large number into several smaller numbers. The most natural way to do this is to simply split the number into halves: the first half comprising the most significant digits and the second half comprising the least significant digits. For example, our four-digit number, 2345, becomes a pair of two digit numbers, 23 and 45. We can write a more general decomposition of any two n-digit numbers x and y using the following, where m is any positive integer less than n.

For the x-digit number:
x = 10^m * a + b

For the y-digit number:
y = 10^m * c + d

So, we can now rewrite our multiplication problem x and y as follows:
x * y = (10^m * a + b) * (10^m * c + d)

When we expand and gather like terms we get the following:
x * y = 10^(2m) * ac + 10^m * (ad + bc) + bd

More conveniently, we can write it like this (equation 1):
x * y = 10^(2m) * z2 + 10^m * z1 + z0

Here, z2 = ac, z1 = ad + bc, and z0 = bd.

It should be pointed out that this suggests a recursive approach to multiplying two numbers since this procedure itself involves multiplication. Specifically, the products ac, ad, bc, and bd all involve numbers smaller than the input number, and so it is conceivable that we could apply the same operation as a partial solution to the overall problem. This algorithm, so far, consists of four recursive multiplication steps and it is not immediately clear if it will be faster than the classic long multiplication approach. What we have discussed so far in regards to the recursive approach to multiplication was well known to mathematicians since the late 19th century. The Karatsuba algorithm improves on this by making the following observation. We really only need to know three quantities, z2 = ac, z1 = ad + bc, and z0 = bd, to solve equation 1. We need to know the values of a, b, c, and d as they contribute to the overall sums and products involved in calculating the quantities z2, z1, and z0. This suggests the possibility that perhaps we can reduce the number of recursive steps. It turns out that this is indeed the situation. Since the products ac and bd are already in their simplest form, it seems unlikely that we can eliminate these calculations. We can, however, make the following observation:

(a + b) * (c + d) = ac + ad + bc + bd

When we subtract the quantities ac and bd, which we have calculated in the previous recursive step, we get the quantity we need, namely ad + bc:

(a + b) * (c + d) - ac - bd = ad + bc

This shows that we can indeed compute the sum of ad and bc without separately computing each of the individual quantities. In summary, we can improve on equation 1 by reducing from four recursive steps to three. These three steps are as follows:
- Recursively calculate ac.
- Recursively calculate bd.
- Recursively calculate (a + b)(c + d) and subtract ac and bd.
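As a quick sanity check of these three steps, here is a small worked example of our own (it is not part of the original text). Take x = 2345 and y = 6789 with m = 2, so that a = 23, b = 45, c = 67, and d = 89.

# Steps 1 and 2: the two straightforward products
ac = 23 * 67                                 # 1541
bd = 45 * 89                                 # 4005

# Step 3: one product of the sums, minus ac and bd, gives ad + bc
z1 = (23 + 45) * (67 + 89) - ac - bd         # 68 * 156 - 1541 - 4005 = 5062

# Recombine according to equation 1 with m = 2
result = 10**4 * ac + 10**2 * z1 + bd
print(result, result == 2345 * 6789)         # 15920205 True

Three smaller multiplications replace the four that the naive decomposition would need, which is the source of Karatsuba's saving.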
The following code shows a Python implementation of the Karatsuba algorithm:

from math import log10, ceil

def karatsuba(x, y):
    # The base case for recursion
    if x < 10 or y < 10:
        return x * y
    # sets n, the number of digits in the highest input number
    n = max(int(log10(x) + 1), int(log10(y) + 1))
    # rounds up n/2
    n_2 = int(ceil(n / 2.0))
    # adds 1 if n is uneven
    n = n if n % 2 == 0 else n + 1
    # splits the input numbers
    a, b = divmod(x, 10**n_2)
    c, d = divmod(y, 10**n_2)
    # applies the three recursive steps
    ac = karatsuba(a, c)
    bd = karatsuba(b, d)
    ad_bc = karatsuba((a + b), (c + d)) - ac - bd
    # recombines the partial results
    return (((10**n) * ac) + bd + ((10**n_2) * (ad_bc)))

To satisfy ourselves that this does indeed work, we can run the following test function:

import random

def test():
    for i in range(1000):
        x = random.randint(1, 10**5)
        y = random.randint(1, 10**5)
        expected = x * y
        result = karatsuba(x, y)
        if result != expected:
            return("failed")
    return('ok')

Summary
In this article, we looked at a way to recursively multiply large numbers and also a recursive approach for merge sort. We saw how to use backtracking for exhaustive search and generating strings.

Resources for Article: Further resources on this subject: Python Data Structures [article] How is Python code organized [article] Algorithm Analysis [article]
Suggesters for Improving User Search Experience

Packt
18 Nov 2016
11 min read
In this article by Bharvi Dixit, the author of the book Mastering ElasticSearch 5.0 - Third Edition, we will focus on improving the user search experience using suggesters, which allow you to correct user query spelling mistakes and build efficient autocomplete mechanisms. First, let's look at the query possibilities and the responses returned by Elasticsearch. We will try to show you the general principles, and then we will get into more details about each of the available suggesters. (For more resources related to this topic, see here.)

Using the suggester under search
Before Elasticsearch 5.0, there was a possibility to get suggestions for a given text by using a dedicated _suggest REST endpoint. But in Elasticsearch 5.0, this dedicated _suggest endpoint has been deprecated in favor of using the suggest API. In this release, suggest-only search requests have been optimized for performance reasons, and we can now execute suggestions through the _search endpoint. Similar to the query object, we can use a suggest object, and what we need to provide inside the suggest object is the text to analyze and the type of used suggester (term or phrase). So if we would like to get suggestions for the words chrimes in wordl (note that we've misspelled them on purpose), we would run the following query:

curl -XPOST "http://localhost:9200/wikinews/_search?pretty" -d'
{
  "suggest": {
    "first_suggestion": {
      "text": "chrimes in wordl",
      "term": {
        "field": "title"
      }
    }
  }
}'

The dedicated endpoint _suggest has been deprecated in Elasticsearch version 5.0 and might be removed in future releases, so be advised to use suggestion requests under the _search endpoint. All the examples covered in this article use the same _search endpoint for the suggest request.

As you can see, the suggestion request is wrapped inside the suggest object and is sent to Elasticsearch in its own object with the name we chose (in the preceding case, it is first_suggestion). Next, we specify the text for which we want the suggestion to be returned using the text parameter. Finally, we add the suggester object, which is either term or phrase. The suggester object contains its configuration, which, for the term suggester used in the preceding command, is the field we want to use for suggestions (the field property). We can also send more than one suggestion at a time by adding multiple suggestion names. For example, if in addition to the preceding suggestion, we would also include a suggestion for the word arest, we would use the following command:

curl -XPOST "http://localhost:9200/wikinews/_search?pretty" -d'
{
  "suggest": {
    "first_suggestion": {
      "text": "chrimes in wordl",
      "term": {
        "field": "title"
      }
    },
    "second_suggestion": {
      "text": "arest",
      "term": {
        "field": "text"
      }
    }
  }
}'

Understanding the suggester response
Let's now look at the example response for the suggestion query we have executed.
Although the response will differ for each suggester type, let's look at the response returned by Elasticsearch for the first command we've sent in the preceding code that used the term suggester: { "took" : 5, "timed_out" : false, "_shards" : { "total" : 5, "successful" : 5, "failed" : 0 }, "hits" : { "total" : 0, "max_score" : 0.0, "hits" : [ ] }, "suggest" : { "first_suggestion" : [ { "text" : "chrimes", "offset" : 0, "length" : 7, "options" : [ { "text" : "crimes", "score" : 0.8333333, "freq" : 36 }, { "text" : "choices", "score" : 0.71428573, "freq" : 2 }, { "text" : "chrome", "score" : 0.6666666, "freq" : 2 }, { "text" : "chimps", "score" : 0.6666666, "freq" : 1 }, { "text" : "crimea", "score" : 0.6666666, "freq" : 1 } ] }, { "text" : "in", "offset" : 8, "length" : 2, "options" : [ ] }, { "text" : "wordl", "offset" : 11, "length" : 5, "options" : [ { "text" : "world", "score" : 0.8, "freq" : 436 }, { "text" : "words", "score" : 0.8, "freq" : 6 }, { "text" : "word", "score" : 0.75, "freq" : 9 }, { "text" : "worth", "score" : 0.6, "freq" : 21 }, { "text" : "worst", "score" : 0.6, "freq" : 16 } ] } ] } } As you can see in the preceding response, the term suggester returns a list of possible suggestions for each term that was present in the text parameter of our first_suggestion section. For each term, the term suggester will return an array of possible suggestions with additional information. Looking at the data returned for the wordl term, we can see the original word (the text parameter), its offset in the original text parameter (the offset parameter), and its length (the length parameter). The options array contains suggestions for the given word and will be empty if Elasticsearch doesn't find any suggestions. Each entry in this array is a suggestion and is characterized by the following properties: text: This is the text of the suggestion. score: This is the suggestion score; the higher the score, the better the suggestion will be. freq: This is the frequency of the suggestion. The frequency represents how many times the word appears in documents in the index we are running the suggestion query against. The higher the frequency, the more documents will have the suggested word in its fields and the higher the chance that the suggestion is the one we are looking for. Please remember that the phrase suggester response will differ from the one returned by the terms suggester, The term suggester The term suggester works on the basis of the edit distance, which means that the suggestion with fewer characters that needs to be changed or removed to make the suggestion look like the original word is the best one. For example, let's take the words worl and work. In order to change the worl term to work, we need to change the l letter to k, so it means a distance of one. Of course, the text provided to the suggester is analyzed and then terms are chosen to be suggested. The phrase suggester The term suggester provides a great way to correct user spelling mistakes on a per-term basis. However, if we would like to get back phrases, it is not possible to do that when using this suggester. This is why the phrase suggester was introduced. It is built on top of the term suggester and adds additional phrase calculation logic to it so that whole phrases can be returned instead of individual terms. It uses N-gram based language models to calculate how good the suggestion is and will probably be a better choice to suggest whole phrases instead of the term suggester. 
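The original text does not show a request for the phrase suggester, so purely as an illustration, a phrase suggestion request against the same index might look roughly like the following. The structure mirrors the term suggester request, with a phrase object in place of term; the field and text values are just placeholders reused from the earlier example.

curl -XPOST "http://localhost:9200/wikinews/_search?pretty" -d'
{
  "suggest": {
    "phrase_suggestion": {
      "text": "chrimes in wordl",
      "phrase": {
        "field": "title"
      }
    }
  }
}'

The response then contains whole corrected phrases in the options array rather than per-term corrections.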
The N-gram approach divides terms in the index into grams—word fragments built of one or more letters. For example, if we would like to divide the word mastering into bi-grams (a two letter N-gram), it would look like this: ma as st te er ri in ng. The completion suggester Till now we read about term suggester and phrase suggester which are used for providing suggestions but completion suggester is completely different and it is used for as a prefix-based suggester for allowing us to create the autocomplete (search as you type) functionality in a very performance-effective way because of storing complicated structures in the index instead of calculating them during query time. This suggester is not about correcting user spelling mistakes. In Elasticsearch 5.0, Completion suggester has gone through complete rewrite. Both the syntax and data structures of completion type field have been changed and so is the response structure. There are many new exciting features and speed optimizations have been introduced in the completion suggester. One of these features is making completion suggester near real time which allows deleted suggestions to omit from suggestion results as soon as they are deleted. The logic behind the completion suggester The prefix suggester is based on the data structure called Finite State Transducer (FST) ( For more information refer, http://en.wikipedia.org/wiki/Finite_state_transducer). Although it is highly efficient, it may require significant resources to build on systems with large amounts of data in them: systems that Elasticsearch is perfectly suitable for. If we would like to build such a structure on the nodes after each restart or cluster state change, we may lose performance. Because of this, the Elasticsearch creators decided to use an FST-like structure during index time and store it in the index so that it can be loaded into the memory when needed. Using the completion suggester To use a prefix-based suggester we need to properly index our data with a dedicated field type called completion. It stores the FST-like structure in the index. In order to illustrate how to use this suggester, let's assume that we want to create an autocomplete feature to allow us to show book authors, which we store in an additional index. In addition to author's names, we want to return the identifiers of the books they wrote in order to search for them with an additional query. We start with creating the authors index by running the following command: curl -XPUT "http://localhost:9200/authors" -d' { "mappings": { "author": { "properties": { "name": { "type": "keyword" }, "suggest": { "type": "completion" } } } } }' Our index will contain a single type called author. Each document will have two fields: the name field, which is the name of the author, and the suggest field, which is the field we will use for autocomplete. The suggest field is the one we are interested in; we've defined it using the completion type, which will result in storing the FST-like structure in the index. Implementing your own autocompletion Completion suggester has been designed to be a powerful and easily implemented solution for autocomplete but it supports only prefix query. Most of the time autocomplete need only work as a prefix query for example, If I type elastic then I expect elasticsearch as a suggestion, not nonelastic. There are some use cases, when one wants to implement more general partial word completion. Completion suggester fails to fulfill this requirement. 
The second limitation of completion suggester is, it does not allow advance queries and filters searched. To get rid of both these limitations we are going to implement a custom autocomplete feature based on N-gram, which works in almost all the scenarios. Creating index Lets create an index location-suggestion with following settings and mappings: curl -XPUT "http://localhost:9200/location-suggestion" -d' { "settings": { "index": { "analysis": { "filter": { "nGram_filter": { "token_chars": [ "letter", "digit", "punctuation", "symbol", "whitespace" ], "min_gram": "2", "type": "nGram", "max_gram": "20" } }, "analyzer": { "nGram_analyzer": { "filter": [ "lowercase", "asciifolding", "nGram_filter" ], "type": "custom", "tokenizer": "whitespace" }, "whitespace_analyzer": { "filter": [ "lowercase", "asciifolding" ], "type": "custom", "tokenizer": "whitespace" } } } } }, "mappings": { "locations": { "properties": { "name": { "type": "text", "analyzer": "nGram_analyzer", "search_analyzer": "whitespace_analyzer" }, "country": { "type": "keyword" } } } } }' Understanding the parameters If you look carefully in preceding curl request for creating the index, it contains both settings and the mappings. We will see them now in detail one by one. Configuring settings Our settings contains two custom analyzers: nGram_analyzer and whitespace_analyzer. We have made custom whitespace_analyzer using whitespace tokenizer just for making due that all the tokens are indexed in lowercase and ascifolded form. Our main interest is nGram_analyzer, which contains a custom filter nGram_filter consisting following parameters: type: Specifies type of token filters which is nGram in our case. token_chars: Specifies what kind of characters are allowed in the generated tokens. Punctuations and special characters are generally removed from the token streams but in our example, we have intended to keep them. We have kept whitespace also so that a text which contains United States and a user searches for u s, United States still appears in the suggestion. min_gram and max_gram: These two attributes set the minimum and maximum length of substrings that will generated and added to the lookup table. For example, according to our settings for the index, the token India will generate following tokens: [ "di", "dia", "ia", "in", "ind", "indi", "india", "nd", "ndi", "ndia" ] Configuring mappings The document type of our index is locations and it has two fields, name and country. The most important thing to see is the way analyzers has been defined for name field which will be used for autosuggestion. For this field we have set index analyzer to our custom nGram_analyzer where the search analyzer is set to whitespace_analyzer. The index_analyzer parameter is no more supported from Elasticsearch version 5.0 onward. Also, if you want to configure search_analyzer property for a field, then you must configure analyzer property too the way we have shown in the preceding example. Summary In this article we focused on improving user search experience. We started with term and phrase suggesters and then covered search as you type that is, autocompletion feature which is implemented using completion suggester. We also saw the limitations of completion suggester in handling advanced queries and partial matching which further solved by implementing our custom completion using N-gram. 
Resources for Article: Further resources on this subject: Searching Your Data [article] Understanding Mesos Internals [article] Big Data Analysis (R and Hadoop) [article]
Manual and Automated Testing

Packt
15 Nov 2016
10 min read
In this article by Claus Führer, the author of the book Scientific Computing with Python 3, we focus on two aspects of testing for scientific programming: manual and automated testing. Manual testing is what is done by every programmer to quickly check that an implementation is working. Automated testing is the refined, automated variant of that idea. We will introduce some tools available for automatic testing in general, with a view on the particular case of scientific computing. (For more resources related to this topic, see here.)

Manual Testing
During the development of code you do a lot of small tests in order to test its functionality. This could be called Manual Testing. Typically, you would test that a given function does what it is supposed to do, by manually testing the function in an interactive environment. For instance, suppose that you implement the Bisection algorithm. It is an algorithm that finds a zero (root) of a scalar nonlinear function. To start the algorithm an interval has to be given with the property that the function takes different signs on the interval boundaries. You would then test an implementation of that algorithm typically by checking:
- That a solution is found when the function has opposite signs at the interval boundaries
- That an exception is raised when the function has the same sign at the interval boundaries

Manual testing, as necessary as it may seem to be, is unsatisfactory. Once you have convinced yourself that the code does what it is supposed to do, you formulate a relatively small number of demonstration examples to convince others of the quality of the code. At that stage one often loses interest in the tests made during development and they are forgotten or even deleted. As soon as you change a detail and things no longer work correctly you might regret that your earlier tests are no longer available.

Automatic Testing
The correct way to develop any piece of code is to use automatic testing. The advantages are as follows:
- The automated repetition of a large number of tests after every code refactoring and before new versions are launched
- A silent documentation of the use of the code
- A documentation of the test coverage of your code: did things work before a change or was a certain aspect never tested?

We suggest to develop tests in parallel to the code. Good design of tests is an art of its own and there is rarely an investment which guarantees such a good pay-off in development time savings as the investment in good tests. Now we will go through the implementation of a simple algorithm with the automated testing methods in mind.

Testing the bisection algorithm
Let us examine automated testing for the bisection algorithm. With this algorithm a zero of a real valued function is found. An implementation of the algorithm can have the following form:

def bisect(f, a, b, tol=1.e-8):
    """
    Implementation of the bisection algorithm
    f    real valued function
    a,b  interval boundaries (float) with the property f(a) * f(b) <= 0
    tol  tolerance (float)
    """
    if f(a) * f(b) > 0:
        raise ValueError("Incorrect initial interval [a,b]")
    for i in range(100):
        c = (a + b) / 2.
        if f(a) * f(c) <= 0:
            b = c
        else:
            a = c
        if abs(a - b) < tol:
            return (a + b) / 2
    raise Exception('No root found within the given tolerance {}'.format(tol))

We assume this to be stored in a file bisection.py. As a first test case we test that the zero of the function F(x) = x is found:

from numpy import allclose  # allclose compares floats up to a tolerance

def test_identity():
    result = bisect(lambda x: x, -1., 1.)
    expected = 0.
    assert allclose(result, expected), 'expected zero not found'

test_identity()

In this code you meet the Python keyword assert for the first time. It raises an exception AssertionError if its first argument returns the value False. Its optional second argument is a string with additional information. We use the function allclose in order to test for equality for float. Let us comment on some of the features of the test function. We use an assertion to make sure that an exception will be raised if the code does not behave as expected. We have to manually run the test in the line test_identity(). There are many tools to automate this kind of call.

Let us now set up a test that checks if bisect raises an exception when the function has the same sign on both ends of the interval. For now, we will suppose that the exception raised is a ValueError exception. Example: Checking the sign for the bisection algorithm.

def test_badinput():
    try:
        bisect(lambda x: x, 0.5, 1)
    except ValueError:
        pass
    else:
        raise AssertionError()

test_badinput()

In this case an AssertionError is raised if the exception is not of type ValueError. There are tools to simplify the above construction to check that an exception is raised. Another useful kind of test is the edge case test. Here we test arguments or user input which is likely to create mathematically undefined situations or states of the program not foreseen by the programmer. For instance, what happens if both bounds are equal? What happens if a>b? We can easily set up such tests, for instance:

def test_equal_boundaries():
    result = bisect(lambda x: x, 1., 1.)
    expected = 0.
    assert allclose(result, expected), 'test equal interval bounds failed'

def test_reverse_boundaries():
    result = bisect(lambda x: x, 1., -1.)
    expected = 0.
    assert allclose(result, expected), 'test reverse interval bounds failed'

test_equal_boundaries()
test_reverse_boundaries()

Using unittest
The standard Python package unittest greatly facilitates automated testing. That package requires that we rewrite our tests a little to be compatible. The first test would have to be rewritten in a class, as follows:

from bisection import bisect
import unittest

class TestIdentity(unittest.TestCase):
    def test(self):
        result = bisect(lambda x: x, -1.2, 1., tol=1.e-8)
        expected = 0.
        self.assertAlmostEqual(result, expected)

if __name__ == '__main__':
    unittest.main()

Let us examine the differences from the previous implementation. First, the test is now a method and a part of a class. The class must inherit from unittest.TestCase. The test method's name must start with test. Note that we may now use one of the assertion tools of the package, namely assertAlmostEqual. Finally, the tests are run using unittest.main. We recommend writing the tests in a file separate from the code to be tested. That's why it starts with an import.
The test passes and returns Ran 1 test in 0.002s OK If we would have run it with a loose tolerance parameter, e.g., 1.e-3, a failure of the test would have been reported: F ========================================================== FAIL: test (__main__.TestIdentity) ---------------------------------------------------------------------- Traceback (most recent call last): File “<ipython-input-11-e44778304d6f>“, line 5, in test self.assertAlmostEqual(result, expected) AssertionError: 0.00017089843750002018 != 0.0 within 7 places --------------------------------------------------------------------- Ran 1 test in 0.004s FAILED (failures=1) Tests can and should be grouped together as methods of a test class: Example: import unittest from bisection import bisect class TestIdentity(unittest.TestCase): def identity_fcn(self,x): return x def test_functionality(self): result = bisect(self.identity_fcn, -1.2, 1.,tol=1.e-8) expected = 0. self.assertAlmostEqual(result, expected) def test_reverse_boundaries(self): result = bisect(self.identity_fcn, 1., -1.) expected = 0. self.assertAlmostEqual(result, expected) def test_exceeded_tolerance(self): tol=1.e-80 self.assertRaises(Exception, bisect, self.identity_fcn, -1.2, 1.,tol) if __name__==‘__main__’: unittest.main() Here, the last test needs some comments: We used the method unittest.TestCase.assertRaises. It tests whether an exception is correctly raised. Its first parameter is the exception type, for example,ValueError, Exception, and its second argument is a the name of the function, which is expected to raise the exception. The remaining arguments are the arguments for this function. The command unittest.main() creates an instance of the class TestIdentity and executes those methods starting by test. Test setUp and tearDown The class unittest.TestCase provides two special methods, setUp and tearDown, which are run before and after every call to a test method. This is needed when testing generators, which are exhausted after every test. We demonstrate this here by testing a program which checks in which line in a file a given string occurs for the first time: class NotFoundError(Exception): pass def find_string(file, string): for i,lines in enumerate(file.readlines()): if string in lines: return i raise NotFoundError(‘String {} not found in File {}‘. format(string,file.name)) We assume, that this code is saved in a file find_string.py. A test has to prepare a file and open it and remove it after the test: import unittest import os # used for, e.g., deleting files from find_in_file import find_string, NotFoundError class TestFindInFile(unittest.TestCase): def setUp(self): file = open(‘test_file.txt’, ‘w’) file.write(‘aha’) file.close() self.file = open(‘test_file.txt’, ‘r’) def tearDown(self): os.remove(self.file.name) def test_exists(self): line_no=find_string(self.file, ‘aha’) self.assertEqual(line_no, 0) def test_not_exists(self): self.assertRaises(NotFoundError, find_string,self.file, ‘bha’) if __name__==‘__main__’: unittest.main() Before each test setUp is run and afterwards tearDown is executed. Parametrizing Tests One frequently wants to repeat the same test set-up with different data sets. When using the functionalities of unittests this requires to automatically generate test cases with the corresponding methods injected: To this end we first construct a test case with one or several methods that will be used, when we later set up test methods. 
Let us consider the bisection method again and check whether the values it returns really are zeros of the given function. We first build the test case and the method which we will use for the tests:

class Tests(unittest.TestCase):
    def checkifzero(self, fcn_with_zero, interval):
        result = bisect(fcn_with_zero, *interval, tol=1.e-8)
        function_value = fcn_with_zero(result)
        expected = 0.
        self.assertAlmostEqual(function_value, expected)

Then we dynamically create test functions as attributes of this class:

test_data = [
    {'name': 'identity', 'function': lambda x: x, 'interval': [-1.2, 1.]},
    {'name': 'parabola', 'function': lambda x: x**2 - 1, 'interval': [0, 10.]},
    {'name': 'cubic', 'function': lambda x: x**3 - 2*x**2, 'interval': [0.1, 5.]},
]

def make_test_function(dic):
    return lambda self: self.checkifzero(dic['function'], dic['interval'])

for data in test_data:
    setattr(Tests, "test_{name}".format(name=data['name']), make_test_function(data))

if __name__ == '__main__':
    unittest.main()

In this example the data is provided as a list of dictionaries. The function make_test_function dynamically generates a test function which uses a particular data dictionary to perform the test with the previously defined method checkifzero. Each of these test functions is made a method of the TestCase class by using the Python command setattr (a shorter alternative using unittest's subTest is sketched at the end of this article).

Summary

No program development without testing! In this article we showed the importance of well organized and documented tests. Some professionals even start development by first specifying tests. A useful tool for automatic testing is unittest, which we explained in detail. While testing improves the reliability of a code, profiling is needed to improve the performance. Alternative ways to code may result in large performance differences. We showed how to measure computation time and how to localize bottlenecks in your code.

Resources for Article:

Further resources on this subject:

Python Data Analysis Utilities [article]
Machine Learning with R [article]
Storage Scalability [article]
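As promised in the parametrization section above, here is a hedged alternative that is not part of the original article: Python 3.4 and later provide unittest.TestCase.subTest, a context manager that repeats the same check over several data sets inside a single test method. The functions and intervals below simply reuse the examples above, and the module name bisection is again assumed.

import unittest
from bisection import bisect  # assumption: same module as in the examples above

class TestsWithSubTest(unittest.TestCase):
    def test_zeros(self):
        test_data = [
            {'name': 'identity', 'function': lambda x: x, 'interval': [-1.2, 1.]},
            {'name': 'parabola', 'function': lambda x: x**2 - 1, 'interval': [0, 10.]},
            {'name': 'cubic', 'function': lambda x: x**3 - 2*x**2, 'interval': [0.1, 5.]},
        ]
        for data in test_data:
            # Each subTest is reported individually on failure,
            # so one failing data set does not hide the others.
            with self.subTest(name=data['name']):
                result = bisect(data['function'], *data['interval'], tol=1.e-8)
                self.assertAlmostEqual(data['function'](result), 0.)

if __name__ == '__main__':
    unittest.main()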
The TensorFlow Toolbox

Packt
14 Nov 2016
6 min read
In this article by Saif Ahmed, author of the book Machine Learning with TensorFlow, we learned how most machine learning platforms are focused toward scientists and practitioners in academic or industrial settings. Accordingly, while quite powerful, they are often rough around the edges and have few user-experience features. (For more resources related to this topic, see here.) Quite a bit of effort goes into peeking at the model at various stages and viewing and aggregating performance across models and runs. Even viewing the neural network can involve far more effort than expected. While this was acceptable when neural networks were simple and only a few layers deep, today's networks are far deeper. In 2015, Microsoft won the annual ImageNet competition using a deep network with 152 layers. Visualizing such networks can be difficult, and peeking at weights and biases can be overwhelming. Practitioners started using home-built visualizers and bootstrapped tools to analyze their networks and run performance. TensorFlow changed this by releasing TensorBoard directly alongside their overall platform release. TensorBoard runs out of box with no additional installations or setup. Users just need to instrument their code according to what they wish to capture. It features plotting of events, learning rate and loss over time; histograms, for weights and biases; and images. The Graph Explorer allows interactive reviews of the neural network. A quick preview You can follow along with the code here: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/models/image/cifar10/cifar10_train.py The example uses the CIFAR-10 image set. The CIFAR-10 dataset consists of 60,000 images in ten classes compiled by Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. The dataset has become one of several standard learning tools and benchmarks for machine learning efforts. Let's start with the Graph Explorer. We can immediately see a convolutional network being used. This is not surprising as we're trying to classify images here. This is just one possible view of the graph. You can try the Graph Explorer as well. It allows deep dives into individual components. Our next stop on the quick preview is the EVENTS tab. This tab shows scalar data over time. The different statistics are grouped into individual tabs on the right-hand side. The following screenshot shows a number of popular scalar statistics, such as loss, learning rate, cross entropy, and sparsity across multiple parts of the network. The HISTOGRAMS tab is a close cousin as it shows tensor data over time. Despite the name, as of TensorFlow v0.7, it does not actually display histograms. Rather, it shows summaries of tensor data using percentiles. The summary view is shown in the following figure. Just like with the EVENTS tab, the data is grouped into tabs on the right-hand side. Different runs can be toggled on and off and runs can be shown overlaid, allowing interesting comparisons. It features three runs, which we can see on the left side, and we'll look at just the softmax function and associated parameters. For now, don't worry too much about what these mean, we're just looking at what we can achieve for our own classifiers. However, the summary view does not do justice to the utility of the HISTOGRAMS tab. Instead, we will zoom into a single graph to observe what is going on. This is shown in the following figure: Notice that each histogram chart shows a time series of nine lines. 
The top line is the maximum, the middle the median, and the bottom the minimum. The three lines directly above and below the median are the one-and-a-half, one, and half standard deviation marks. Obviously, this does not truly represent multimodal distributions, as it is not a histogram; however, it does provide a quick gist of what would otherwise be a mountain of data to sift through.

A couple of things to note are how data can be collected and segregated by runs, how different data streams can be collected, how we can enlarge the views, and how we can zoom into each of the graphs. Enough of graphics, let's jump into code so we can run this for ourselves!

Installing TensorBoard

TensorFlow comes prepackaged with TensorBoard, so it will already be installed. It runs as a locally served web application accessible via the browser at http://0.0.0.0:6006. Conveniently, there is no server-side code or configuration required. Depending on where your paths are, you may be able to run it directly, as follows:

tensorboard --logdir=/tmp/tensorlogs

If your paths are not correct, you may need to prefix the application accordingly, as shown in the following command line:

tf_install_dir/tensorflow/tensorboard --logdir=/tmp/tensorlogs

On Linux, you can run it in the background and just let it keep running, as follows:

nohup tensorboard --logdir=/tmp/tensorlogs &

Some thought should be put into the directory structure, though. The Runs list on the left side of the dashboard is driven by subdirectories in the logdir location. The following image shows two runs: MNIST_Run1 and MNIST_Run2. Having an organized runs folder will allow plotting successive runs side by side to see differences. When initializing the writer, you will pass in the log_location as the first parameter, as follows:

writer = tf.train.SummaryWriter(log_location, sess.graph_def)

Consider saving a base location and appending run-specific subdirectories for each run. This will help organize outputs without expending more thought on it. We'll discuss more about this later.

Incorporating hooks into our code

The best way to get started with TensorBoard is by taking existing working examples and instrumenting them with the code required for TensorBoard. We will do this for several common training scripts; a minimal sketch of such instrumentation follows at the end of this article.

Summary

In this article, we covered the major areas of TensorBoard: EVENTS, HISTOGRAMS, and viewing GRAPH. We modified popular models to see the exact changes required before TensorBoard could be up and running. This should have demonstrated the fairly minimal effort required to get started with TensorBoard.

Resources for Article:

Further resources on this subject:

Supervised Machine Learning [article]
Implementing Artificial Neural Networks with TensorFlow [article]
Why we need Design Patterns? [article]
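As referenced in the Incorporating hooks section above, the following is a minimal, hedged sketch of such instrumentation. It assumes the TensorFlow 0.x summary functions (tf.scalar_summary, tf.histogram_summary, tf.merge_all_summaries) that match the tf.train.SummaryWriter call shown earlier; later TensorFlow releases renamed these. The weights and loss tensors and the log_location path below are stand-ins, not anything from the article.

import tensorflow as tf

# Stand-in tensors so the sketch is self-contained; in practice these would be
# your model's real weight variables and loss.
weights = tf.Variable(tf.random_normal([5, 5]), name='weights')
loss = tf.reduce_mean(tf.square(weights))

log_location = '/tmp/tensorlogs/demo_run1'   # one subdirectory per run

# Attach summary ops: scalars feed the EVENTS tab, histograms the HISTOGRAMS tab.
tf.scalar_summary('loss', loss)
tf.histogram_summary('weights', weights)
merged = tf.merge_all_summaries()

with tf.Session() as sess:
    writer = tf.train.SummaryWriter(log_location, sess.graph_def)
    sess.run(tf.initialize_all_variables())
    for step in range(100):
        # ...normally your training op would run here...
        summary_str = sess.run(merged)
        writer.add_summary(summary_str, step)
    writer.close()

Pointing tensorboard --logdir at /tmp/tensorlogs would then list demo_run1 in the Runs panel alongside any other run subdirectories.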
Data Clustering

Packt
10 Nov 2016
6 min read
In this article by Rodolfo Bonnin, the author of the book Building Machine Learning Projects with TensorFlow, we will start applying data transforming operations. We will begin finding interesting patterns in some given information, discovering groups of data, or clusters, and using clustering techniques.

(For more resources related to this topic, see here.)

In this process we'll also gain two new tools: the ability to generate synthetic sample sets from a collection of representative data structures via the scikit-learn library, and the ability to graphically plot our data and model results, this time via the matplotlib library.

The topics we will cover in this article are as follows:

Getting an idea of how clustering works, and comparing it to alternative existing classification techniques
Using scikit-learn and matplotlib to enrich the possibilities of dataset choices, and to get professional-looking graphical representations of the data
Implementing the K-means clustering algorithm
Testing some variations of the K-means method to improve the fit and/or the convergence rate

Three types of learning from data

Based on how we approach the supervision of the samples, we can extract three types of learning:

Unsupervised learning: The fully unsupervised approach directly takes a number of undetermined elements and builds a classification of them, looking at different properties that could determine their class
Semi-supervised learning: The semi-supervised approach has a number of known classified items and then applies techniques to discover the class of the remaining items
Supervised learning: In supervised learning, we start from a population of samples which have a known type beforehand, and then build a model from it

Normally there are three sample populations: one from which the model grows, called the training set, one that is used to test the model, called the test set, and then there are the samples for which we will be doing classification.

Types of data learning based on supervision: unsupervised, semi-supervised, and supervised

Unsupervised data clustering

One of the simplest operations that can be initially applied to an unknown dataset is to try to understand the possible grouping or common features that the dataset members have. To do so, we could try to find representative points that summarize a balance of the parameters of the members of each group. This value could be, for example, the mean or the median of all the cluster members. This also leads to the idea of defining a notion of distance between members: all the members of a group should obviously be at short distances from each other and from the representative point, shorter than from the central points of the other groups. In the following image, we can see the results of a typical clustering algorithm and the representation of the cluster centers:

Sample clustering algorithm output

K-means

K-means is a very well-known clustering algorithm that can be easily implemented. It is very straightforward and can lead (depending on the data layout) to a good initial understanding of the provided information.

Mechanics of K-means

K-means tries to divide a set of samples into K disjoint groups or clusters, using as its main indicator the mean value (be it 1D, 2D, and so on) of the members. This point is normally called the centroid, referring to the arithmetic entity with the same name.
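The mechanics just described can be captured in a few lines of NumPy. The following is a hedged sketch, an illustration only and not code from the book: it performs one assignment step (each sample goes to its nearest centroid) and one update step (each centroid becomes the mean of its assigned members).

import numpy as np

def kmeans_step(samples, centroids):
    # samples has shape (n, d); centroids has shape (k, d).
    # Distance of every sample to every centroid, shape (n, k).
    distances = np.linalg.norm(samples[:, np.newaxis, :] - centroids[np.newaxis, :, :], axis=2)
    # Index of the nearest centroid for each sample.
    assignments = distances.argmin(axis=1)
    # New centroid = mean of its assigned samples (left unchanged if a cluster is empty).
    new_centroids = np.array([samples[assignments == k].mean(axis=0)
                              if np.any(assignments == k) else centroids[k]
                              for k in range(centroids.shape[0])])
    return assignments, new_centroids

# Tiny usage example with two obvious groups and a naive initialization
# (the first K samples serve as starting centroids).
samples = np.array([[0., 0.], [0., 1.], [10., 10.], [10., 11.]])
centroids = samples[:2].copy()
for _ in range(10):
    assignments, centroids = kmeans_step(samples, centroids)
print(assignments)   # expected: [0 0 1 1]
print(centroids)     # expected: [[ 0.   0.5] [10.  10.5]]

Iterating this pair of steps until the assignments stop changing is the whole algorithm, which the breakdown below walks through in more detail.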
One important characteristic of K-means is that K should be provided beforehand, and so some previous knowledge of the data is needed to avoid a non-representative result.

Algorithm iteration criterion

The criterion and goal of this method is to minimize the sum of squared distances from each cluster member to the centroid of the cluster it belongs to, summed over all samples. This is also known as minimization of inertia.

Error minimization criteria for K-means

K-means algorithm breakdown

The mechanism of the K-means algorithm can be summarized in the following graphic:

Simplified flow chart of the K-means process

And this is a simplified summary of the algorithm:

We start with the unclassified samples and take K elements as the starting centroids. There are also possible simplifications of this algorithm that, for the sake of brevity, simply take the first K elements in the element list.
We then calculate the distances between the samples and the chosen starting points, and so we get the first calculated centroids (or other representative values). You can see the centroids in the illustration moving toward a more representative position.
After the centroids change, their displacement will cause the individual distances to change, and so the cluster membership can change.
This is the time when we recalculate the centroids and repeat the first steps, in case the stop condition isn't met.

The stopping conditions could be of various types:

After n iterations: if the centroids do not reach very stable means, it could be that we either chose too large a number (and we will have unnecessary rounds of computing) or the process converges slowly (and we will get very unconvincing results). This stop condition can also be used as a last resort if we have a really long iterative process.
Referring to the previous mean result, a possibly better criterion for the convergence of the iterations is to look at the changes of the centroids, be it in total displacement or in the total number of cluster element switches. The latter is normally employed, so we will stop the process once there are no more elements changing clusters:

K-means simplified graphic

Pros and cons of K-means

The advantages of this method are:

It scales very well (most of the calculations can be run in parallel)
It has been used in a very large range of applications

But its simplicity also comes at a price (no silver-bullet rule applies):

It requires a priori knowledge (the number of possible clusters should be known beforehand)
Outlier values can pull the centroids away, as they carry the same weight as any other sample
As we assume that the clusters are convex and isotropic, it doesn't work very well with non-circle-like delimited clusters

Summary

In this article, we got a simple overview of some of the most basic models we can implement, but we tried to be as detailed in the explanation as possible. From now on, we are able to generate synthetic datasets, allowing us to rapidly test the adequacy of a model for different data configurations and so evaluate their advantages and shortcomings without having to load models with a greater number of unknown characteristics.
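As a concrete illustration of that last point, the following hedged sketch (not code from the book) uses scikit-learn's make_blobs to generate a synthetic sample set, fits scikit-learn's own KMeans implementation with K provided beforehand, and plots the clusters and centroids with matplotlib. The file name and the Agg backend are assumptions made so the example runs without a display.

import matplotlib
matplotlib.use('Agg')          # assumption: render to a file rather than a GUI window
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# Generate a synthetic sample set with four well-separated groups.
samples, true_labels = make_blobs(n_samples=200, centers=4, cluster_std=0.8, random_state=42)

# K must be provided beforehand, as noted above.
kmeans = KMeans(n_clusters=4, random_state=42)
predicted = kmeans.fit_predict(samples)

# Plot the samples colored by predicted cluster, plus the centroids.
plt.scatter(samples[:, 0], samples[:, 1], c=predicted, s=20)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            marker='x', c='red', s=100)
plt.title('K-means on a synthetic sample set')
plt.savefig('kmeans_blobs.png')

Varying make_blobs parameters such as centers or cluster_std is a quick way to see how K-means behaves on differently shaped data, which is exactly the kind of rapid model testing the summary above refers to.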
You can also refer to the following books on the similar topics: Getting Started with TensorFlow: https://www.packtpub.com/big-data-and-business-intelligence/getting-started-tensorflow R Machine Learning Essentials: https://www.packtpub.com/big-data-and-business-intelligence/r-machine-learning-essentials Building Machine Learning Systems with Python - Second Edition: https://www.packtpub.com/big-data-and-business-intelligence/building-machine-learning-systems-python-second-edition Resources for Article: Further resources on this subject: Supervised Machine Learning [article] Unsupervised Learning [article] Preprocessing the Data [article]
Introduction to Practical Business Intelligence

Packt
10 Nov 2016
20 min read
In this article by Ahmed Sherif, author of the book Practical Business Intelligence, we are going to explain what business intelligence is. Before answering this question, I want to pose and answer another question: what isn't business intelligence? It is not spreadsheet analysis done on transactional data with thousands of rows. One of the goals of Business Intelligence, or BI, is to shield the users of the data from the intelligent logic lurking behind the scenes of the application that is delivering the data to them. If the integrity of the data is compromised in any way by an individual not intimately familiar with the data source, then there cannot, by definition, be intelligence in the business decisions made with that same data. The single source of truth is the key for any Business Intelligence operation, whether it is a mom and pop soda shop or a Fortune 500 company. Any report, dashboard, or analytical application that delivers information to a user through a BI tool, but whose numbers cannot be tied back to the original source, will break the trust between the user and the data and will defeat the purpose of Business Intelligence.

(For more resources related to this topic, see here.)

In my opinion, the most successful tools used for business intelligence directly shield the business user from the query logic used for displaying that same data in some visual manner. Business Intelligence has taken many forms in terms of labels over the years. Business Intelligence is the process of delivering actionable business decisions from analytical manipulation and presentation of data within the confines of a business environment. The delivery process mentioned in this definition is what we will focus our attention on. The beauty of BI is that it is not owned by any one particular tool that is proprietary to a specific industry or company. Business Intelligence can be delivered using many different tools, some of which were not originally intended to be used for BI. The tool itself should not be the source where the query logic is applied to generate the business logic of the data. The tool should primarily serve as the delivery mechanism of the query that is generated by the data warehouse that houses both the data and the logic.

In this chapter we will cover the following topics:

Understanding the Kimball method
Understanding business intelligence
Data and SQL
Working with data and SQL
Working with business intelligence tools
Downloading and installing MS SQL Server 2014
Downloading and installing AdventureWorks

Understanding the Kimball method

As we discuss the data warehouse where our data is being housed, we would be remiss not to bring up Ralph Kimball, one of the original architects of the data warehouse. Kimball's methodology incorporated dimensional modeling, which has become the standard for modeling a data warehouse for Business Intelligence purposes. Dimensional modeling incorporates joining tables that have detail data and tables that have lookup data.

A detail table is known as a fact table in dimensional modeling. An example of a fact table would be a table holding thousands of rows of transactional sales from a retail store. The table will house several IDs affiliated with the product, the salesperson, the purchase date, and the purchaser, just to name a few. Additionally, the fact table will store numeric data for each individual transaction, such as sales quantity or sales amount, to name a few examples. These numeric values will be referred to as measures.
While there is usually one fact table, there will also be several lookup or dimensional tables, with one table for each ID that is used in the fact table. So, for example, there would be one dimensional table for the product name affiliated with a product ID. There would be one dimensional table for the month, week, day, and year of the ID affiliated with the date. These dimensional tables are also referred to as lookup tables, because they look up the name or attribute that a dimension ID is affiliated with. Usually, you would find as many dimensional tables as there are IDs in the fact table. The dimensional tables are all joined to the one fact table, creating something of a 'star' look. Hence, this type of table join is known as a star schema, which is represented diagrammatically in the following figure.

It is customary that the fact table will be the largest table in a data warehouse, while the lookup tables will all be quite small in rows, some as small as one row. The tables are joined by IDs, also known as surrogate keys. Surrogate keys allow for the most efficient join between a fact table and a dimensional table, as they are usually of an integer data type. As more and more detail is added to a dimensional table, each new dimension is just given the next number in line, usually starting with 1. Query performance between table joins suffers when we introduce non-numeric characters into the join or, worse, symbols (although most databases will not allow that).

Understanding business intelligence architecture

I will continuously hammer home the point that the various tools utilized to deliver the visual and graphical BI components should not house any internal logic to filter data out of the tool, nor should they be the source of any built-in calculations. The tools themselves should not house this logic, as they will be utilized by many different users. If each user who develops a BI app off of the tool incorporates different internal filters within the tool, the single source of truth tying back to the data warehouse will become multiple sources of truth. Any logic applied to the data to filter out a specific dimension or to calculate a specific measure should be applied in the data warehouse and then pulled into the tool. For example, if the requirement for a BI dashboard was to show current year and prior year sales for US regions only, the filter for region code would ideally be applied in the data warehouse as opposed to inside the tool.

The following is a query written in SQL, joining two tables from the AdventureWorks database, that highlights the difference between dimensions and measures. The 'Region' column is a dimension column, and 'SalesYTD' and 'SalesPY' are measure columns. In this example, 'TerritoryID' serves as the key join between 'SalesTerritory' and 'SalesPerson'. Since the measures come from the 'SalesPerson' table, that table will serve as the fact table and 'SalesPerson.TerritoryID' will serve as the fact ID. Since the Region column is dimensional and comes from the 'SalesTerritory' table, that table will serve as the dimensional or lookup table and 'SalesTerritory.TerritoryID' will serve as the dimension ID. In a finely tuned data warehouse, both the fact ID and dimension ID would be indexed to allow for efficient query performance.
This performance is obtained by sorting the IDs numerically, so that a row from one table that is being joined to another table does not have to be searched against the entire table but only a subset of it. When the table is only a few hundred rows, it may not seem necessary to index columns, but when the table grows to a few hundred million rows, it may become necessary.

Select region.Name as Region
,round(sum(sales.SalesYTD),2) as SalesYTD
,round(sum(sales.SalesLastYear),2) as SalesPY
FROM [AdventureWorks2014].[Sales].[SalesTerritory] region
left outer join [AdventureWorks2014].[Sales].[SalesPerson] sales
on sales.TerritoryID = region.TerritoryID
where region.CountryRegionCode = 'US'
Group by region.Name
order by region.Name asc

There are several reasons why applying the logic at the database level is considered a best practice. Most of the time, these requests for filtering data or manipulating calculations are done at the BI tool level because it is easier for the developer than going to the source. However, if these filters are being applied due to data quality issues, then applying logic at the reporting level only masks an underlying data issue that needs to be addressed across the entire data warehouse. You would be doing yourself a disservice in the long run, as you would be establishing a precedent that data quality is handled by the report developer as opposed to the database administrator. You are just adding additional work onto your plate.

Ideal BI tools will quickly connect to the data source and then allow for slicing and dicing of your dimensions and measures in a manner that will quickly inform the business of useful and practical information. Ultimately, the choice of a BI tool by an individual or an organization will come down to the ease of use of the tool as well as the flexibility to showcase the data through various components such as graphs, charts, widgets, and infographics.

Management

If you are a Business Intelligence manager looking to establish a department with a variety of tools to help flesh out your requirements, this book could serve as a good source of interview questions to weed out unqualified candidates. A manager could use it to distinguish some of the nuances between these different skillsets and prioritize hiring based on immediate needs.

Data Scientist

The term Data Scientist has been misused in the BI industry, in my humble opinion. It has been lumped in with Data Analyst as well as BI Developer. Unfortunately, these three positions have separate skillsets, and you will do yourself a disservice by assuming one person can do multiple positions successfully. A Data Scientist will be able to apply statistical algorithms to the data that is being extracted from the BI tools and make predictions about what will happen in the future with that same data set. Due to this skillset, a Data Scientist may find the chapters focusing on R and Python to be of particular importance because of their ability to leverage predictive capabilities within their BI delivery mechanisms.

Data Analyst

A Data Analyst is probably the second most misused position behind a Data Scientist. Typically, a Data Analyst should be analyzing the data that is coming out of the BI tools that are connected to the data warehouse. Most Data Analysts are comfortable working with Microsoft Excel. Oftentimes they are asked to take on additional roles in developing dashboards that require additional programming skills.
This is where they would find some comfort using a tool like Power BI, Tableau, or QlikView. These tools would allow for a Data Analyst to quickly develop a storyboard or visualization that would allow for quick analysis with minimal programming skills. Visualization Developer A 'dataviz' developer is someone who can create complex visualizations out of data and showcase interesting interactions between different measures inside of a dataset that cannot necessarily be seen with a traditional chart or graph. More often than not these developers possess some programming background such as JavaScript, HTML, or CSS. These developers are also used to developing applications directly for the web and therefore would find D3.js a comfortable environment to program in. Working with Data and SQL The examples and exercises that will come from the AdventureWorks database.  The AdventureWorks database has a comprehensive list of tables that mimics an actual bicycle retailor. The examples will draw on different tables from the database to highlight BI reporting from the various segments appropriate for the AdventureWorks Company. These segments include Human Resources, Manufacturing, Sales, Purchasing, and Contact Management. A different segment of the data will be highlighted in each chapter utilizing a specific set of tools. A cursory understanding of SQL (structured query language) will be helpful to get a grasp of how data is being aggregated with dimensions and measures. Additionally, an understanding of the SQL statements used will help with the validation process to ensure a single source of truth between the source data and the output inside of the BI tool of choice. For more information about learning SQL, visit the following website: www.sqlfordummies.com Working with business intelligence tools Over the course of the last 20 years, there have been a growing number of software products released that were geared towards Business Intelligence. In addition, there have been a number of software products and programming languages that were not initially built for BI but later on became a staple for the industry. The tools used were chosen based on the fact that they were either built off of open source technology or they were products from companies that provided free versions of their software for development purposes. Many companies from the big enterprise firms have their own BI tools and they are quite popular. However, unless you have a license with them, it is unlikely that you will be able to use their tool without having to shell out a small fortune. Power BI and Excel Power BI is one of the more relatively newer BI tools from Microsoft.  It is known as a self-service solution and integrates seamlessly with other data sources such as Microsoft Excel and Microsoft SQL Server.  Our primary purpose in using Power BI will be to generate interactive dashboards, reports, and datasets for users. In addition to using Power BI we will also focus on utilizing Microsoft Excel to assist with some data analysis and validation of results that are being pulled from our data warehouse.  Pivot tables are very popular within MS Excel and will be used to validate aggregation done inside of the data warehouse. D3.js D3.js, also known as data-driven documents, is a JavaScript library known for delivery beautiful visualizations by manipulating documents based on data. Since D3 is rooted in JavaScript, all visualizations make a seamless transition to the web. 
D3 allows for major customization to any part of visualization and because of this flexibility, it will require a steeper learning curve that probably any other software program. D3 can consume data easily as a .json or a .csv file.  Additionally, the data can also be imbedded directly within the JavaScript code that renders the visualization on the web. R R is a free and open source statistical programming language that produces beautiful graphics. The R language has been widely used among the statistical community and more recently in the data science and machine learning community as well. Due to this fact, it has picked up steam in recent years as a platform for displaying and delivering effective and practical BI. In addition to visualizing BI, R has the ability to also visualize predictive analysis with algorithms and forecasts. While R is a bit raw in its interface, there have been some IDE's (Integrated Development Environment) that have been developed to ease the user experience. RStudio will be used to deliver the visualisations developed within R. Python Python is considered the most traditional programming language of all the different languages. It is a widely used general purpose programming language with several modules that are very powerful in analysing and visualizing data. Similar to R, Python is a bit raw in its own form for delivering beautiful graphics as a BI tool; however, with the incorporation of an IDE the user interface becomes much more of a pleasurable development experience. PyCharm will be the IDE used to develop BI with Python. PyCharm is free to use and allows creation of the iPython notebook which delivers seamless integration between Python and the powerful modules that will assist with BI. As a note, all code in Python will be developed using the Python 3 syntax. QlikView QlikView is a software company specializing in delivering business intelligence solutions using their desktop tool. QlikView is one of the leaders in delivering quick visualizations based on data and queries through their desktop application. They advertise themselves to be self-service BI for business users. While they do offer solutions that target more enterprise organizations, they also offer a free version of their tool for personal use. Tableau is probably the closest competitor in terms of delivering similar BI solutions. Tableau Tableau is a software company specializing in delivering business intelligence solutions using their desktop tool. If this sounds familiar to QlikView, it's probably because it's true. Both are leaders in the field of establishing a delivery mechanism with easy installation, setup, and connectivity to the available data. Tableau has a free version of their desktop tool. Again, Tableau excels at delivering both beautiful visualizations quickly as well as self-service data discovery to more advanced business users. Microsoft SQL Server Microsoft SQL will serve as the data warehouse for the examples that we will with the BI Tools. Microsoft SQL Server is relatively simple to install and set up as well it is free to download. Additionally, there are example databases that configure seamlessly with it, such as the AdventureWorks database. Downloading and Installing MS SQL Server 2014 First things first. We will need to get started with getting our database and data warehouse up and running so that we can begin to develop our BI environment. We will visit the Microsoft website below to start the download selection process. 
https://www.microsoft.com/en-us/download/details.aspx?id=42299

Select the language that is applicable to you and also select the 64-bit edition of MS SQL Server Express with Advanced features, as shown in the following screenshot. Ideally you'll want to be working with a 64-bit edition when dealing with servers. After selecting the file, the download process should begin. Depending on your connection speed it could take some time, as the file is slightly larger than 1 GB.

The next step in the process is selecting a new stand-alone instance of SQL Server 2014, unless you already have a version and wish to upgrade instead, as shown in the following screenshot. After accepting the license terms, continue through the steps in the Global Rules as well as the Product Updates to get to the setup installation files. For the feature selection tab, make sure the following features are selected for your installation, as shown in the following screenshot.

Our preference is to give the named instance of this database a name related to the work we are doing. Since this will be used for Business Intelligence, I went ahead and named this instance 'SQLBI', as shown in the following screenshot.

The default Server Configuration settings are sufficient for now; there is no need to change anything under that section, as shown in the following screenshot. Unless you are required to do so within your company or organization, for personal use it is sufficient to just go with Windows Authentication mode for sign-on, as shown in the following screenshot. We will not need to do any configuring of Reporting Services, so it is sufficient for our purposes to just go with installing Reporting Services Native mode without any need for configuration at this time.

At this point the installation will proceed and may take anywhere between 20 and 30 minutes, depending on the CPU resources. If you continue to have issues with your installation, you can visit the following website from Microsoft for additional help:

http://social.technet.microsoft.com/wiki/contents/articles/23878.installing-sql-server-2014-step-by-step-tutorial.aspx

Ultimately, if everything with the installation is successful, you'll want to see all portions of the installation with a green check mark next to their name and labeled 'Successful', as shown in the following screenshot.

Downloading and Installing AdventureWorks

We are almost finished with getting our business intelligence data warehouse complete. We are now at the stage where we will extract and load data into our data warehouse. The last part is to download and install the AdventureWorks database from Microsoft. The zipped file for AdventureWorks 2014 is located at the following website from Microsoft:

https://msftdbprodsamples.codeplex.com/downloads/get/880661

Once the file is downloaded and unzipped, you will find a file named the following:

AdventureWorks2014.bak

Copy that file and paste it in the following folder, where it will be incorporated with your Microsoft SQL Server 2014 Express Edition:

C:\Program Files\Microsoft SQL Server\MSSQL12.SQLBI\MSSQL\Backup

Also note that the MSSQL12.SQLBI subfolder will vary user by user, depending on what you named your SQL instance when you were installing MS SQL Server 2014.
Once that has been copied over, we can fire up Management Studio for SQL Server 2014 and start a blank new query by going to File | New | Query with Current Connection. Once you have a blank query set up, copy and paste the following code into it and execute it:

use [master]
Restore database AdventureWorks2014
from disk = 'C:\Program Files\Microsoft SQL Server\MSSQL12.SQLBI\MSSQL\Backup\AdventureWorks2014.bak'
with move 'AdventureWorks2014_data'
to 'C:\Program Files\Microsoft SQL Server\MSSQL12.SQLBI\MSSQL\DATA\AdventureWorks2014.mdf',
Move 'AdventureWorks2014_log'
to 'C:\Program Files\Microsoft SQL Server\MSSQL12.SQLBI\MSSQL\DATA\AdventureWorks2014.ldf'
, replace

Once again, please note that the MSSQL12.SQLBI subfolder will vary user by user, depending on what you named your SQL instance when you were installing MS SQL Server 2014. At this point you should have received a message within the database saying that Microsoft SQL Server has processed 24248 pages for database 'AdventureWorks2014'. Once you have refreshed your database tab in the upper left hand corner of SQL Server, the AdventureWorks database will become visible, as well as all of the appropriate tables, as shown in the following screenshot.

One final step is to verify that your login account has all of the appropriate server settings. When you right-click on the SQL Server name in the upper left hand portion of Management Studio, select Properties. Select Permissions inside Properties. Find your username and check all of the rights under the Grant column, as shown in the following screenshot.

Finally, we also need to ensure that the folder that houses Microsoft SQL Server 2014 has the appropriate rights enabled for your current user. That specific folder is located under C:\Program Files\Microsoft SQL Server. For the purposes of our exercises, we will assign all rights for the SQL Server user to the following folder, as shown in the following screenshot.

We are now ready to begin connecting our BI tools to our data! A short Python connectivity sketch follows at the end of this article.

Summary

The emphasis will be placed on implementing Business Intelligence best practices within the various tools that will be used, based on the different levels of data that is provided within the AdventureWorks database. In the next chapter we will cover extracting additional data from the web that will be joined to the AdventureWorks database. This process is known as web scraping and can be performed with great success using tools such as Python and R. In addition to collecting the data, we will focus on transforming the collected data for optimal query performance.

Resources for Article:

Further resources on this subject:

LabVIEW Basics [article]
Thinking Probabilistically [article]
Clustering Methods [article]
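As promised above, here is a hedged sketch of connecting to the newly restored database from Python, one of the BI tools listed earlier. It is an illustration only: it assumes the pyodbc package, an installed SQL Server ODBC driver (the driver name below is a placeholder), Windows authentication, and the SQLBI named instance chosen during installation; adjust these to your own setup. The query reuses the region and sales example shown earlier in this article.

import pyodbc

# Assumptions: local named instance SQLBI, Windows authentication, ODBC driver installed.
connection = pyodbc.connect(
    r'DRIVER={ODBC Driver 11 for SQL Server};'
    r'SERVER=localhost\SQLBI;'
    r'DATABASE=AdventureWorks2014;'
    r'Trusted_Connection=yes;'
)

query = """
SELECT region.Name AS Region,
       ROUND(SUM(sales.SalesYTD), 2) AS SalesYTD,
       ROUND(SUM(sales.SalesLastYear), 2) AS SalesPY
FROM Sales.SalesTerritory region
LEFT OUTER JOIN Sales.SalesPerson sales
       ON sales.TerritoryID = region.TerritoryID
WHERE region.CountryRegionCode = 'US'
GROUP BY region.Name
ORDER BY region.Name ASC
"""

cursor = connection.cursor()
for row in cursor.execute(query):
    # Each row exposes the selected columns by name.
    print(row.Region, row.SalesYTD, row.SalesPY)

connection.close()

Validating that these figures match what the BI tool displays is one way to preserve the single source of truth discussed at the start of this article.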
Introduction to R Programming Language and Statistical Environment

Packt
09 Nov 2016
34 min read
In this article by Simon Walkowiak author of the book Big Data Analytics with R, we will have the opportunity to learn some most important R functions from base R installation and well-known third party packages used for data crunching, transformation, and analysis. More specifically in this article you will learn to: Understand the landscape of available R data structures Be guided through a number of R operations allowing you to import data from standard and proprietary data formats Carry out essential data cleaning and processing activities such as subsetting, aggregating, creating contingency tables, and so on Inspect the data by implementing a selection of Exploratory Data Analysis techniques such as descriptive statistics Apply basic statistical methods to estimate correlation parameters between two (Pearson's r) or more variables (multiple regressions) or find the differences between means for two (t-tests) or more groups Analysis of variance (ANOVA) Be introduced to more advanced data modeling tasks like logistic and Poisson regressions (For more resources related to this topic, see here.) Learning R This book assumes that you have been previously exposed to R programming language, and this article would serve more as a revision, and an overview, of the most essential operations, rather than a very thorough handbook on R. The goal of this work is to present you with specific R applications related to Big Data and the way you can combine R with your existing Big Data analytics workflows instead of teaching you basics of data processing in R. There is a substantial number of great introductory and beginner-level books on R available at IT specialized bookstores or online, directly from Packt Publishing, and other respected publishers, as well as on the Amazon store. Some of the recommendations include the following: R in Action: Data Analysis and Graphics with R by Robert Kabacoff (2015), 2nd edition, Manning Publications R Cookbook by Paul Teetor (2011), O'Reilly Discovering Statistics Using R by Andy Field, Jeremy Miles, and Zoe Field (2012), SAGE Publications R for Data Science by Dan Toomey (2014), Packt Publishing An alternative route to the acquisition of good practical R skills is through a large number of online resources, or more traditional tutor-led in-class training courses. The first option offers you an almost limitless choice of websites, blogs, and online guides. A good starting point is the main and previously mentioned Comprehensive R Archive Network (CRAN) page (https://cran.r-project.org/), which, apart from the R core software, contains several well-maintained manuals and Task Views—community run indexes of R packages dealing with specific statistical or data management issues. R-bloggers on the other hand (http://www.r-bloggers.com/) deliver regular news on R in the form of R-related blog posts or tutorials prepared by R enthusiasts and data scientists. 
Other interesting online sources, which you will probably find yourself using quite often, are as follows: http://www.inside-r.org/—news and information from and by R community http://www.rdocumentation.org/—a useful search engine of R packages and functions http://blog.rstudio.org/—a blog run and edited by RStudio engineers http://www.statmethods.net/—a very informative tutorial-laden website based on the popular R book R in Action by Rob Kabacoff However, it is very likely that after some initial reading, and several months of playing with R, your most frequent destinations to seek further R-related information and obtain help on more complex use cases for specific functions will become StackOverflow(http://stackoverflow.com/) and, even better, StackExchange (http://stackexchange.com/). StackExchange is in fact a network of support and question-and-answer community-run websites, which address many problems related to statistical, mathematical, biological, and other methods or concepts, whereas StackOverflow, which is currently one of the sub-sites under the StackExchange label, focuses more on applied programming issues and provides users with coding hints and solutions in most (if not all) programming languages known to developers. Both tend to be very popular amongst R users, and as of late December 2015, there were almost 120,000 R-tagged questions asked on StackOverflow. The http://stackoverflow.com/tags/r/info page also contains numerous links and further references to free interactive R learning resources, online books and manuals and many other. Another good idea is to start your R adventure from user-friendly online training courses available through online-learning providers like Coursera (https://www.coursera.org), DataCamp (https://www.datacamp.com), edX (https://www.edx.org), or CodeSchool (https://www.codeschool.com). Of course, owing to the nature of such courses, a successful acquisition of R skills is somewhat subjective, however, in recent years, they have grown in popularity enormously, and they have also gained rather positive reviews from employers and recruiters alike. Online courses may then be very suitable, especially for those who, for various reasons, cannot attend a traditional university degree with R components, or just prefer to learn R at their own leisure or around their working hours. Before we move on to the practical part, whichever strategy you are going to use to learn R, please do not be discouraged by the first difficulties. R, like any other programming language, or should I say, like any other language (including foreign languages), needs time, patience, long hours of practice, and a large number of varied exercises to let you explore many different dimensions and complexities of its syntax and rich libraries of functions. If you are still struggling with your R skills, however, I am sure the next section will get them off the ground. Revisiting R basics In the following section we will present a short revision of the most useful and frequently applied R functions and statements. We will start from a quick R and RStudio installation guide and then proceed to creating R data structures, data manipulation, and transformation techniques, and basic methods used in the Exploratory Data Analysis (EDA). Although the R codes listed in this book have been tested extensively, as always in such cases, please make sure that your equipment is not faulty and that you will be running all the following scripts at your own risk. 
Getting R and RStudio ready Depending on your operating system (whether Mac OS X, Windows, or Linux) you can download and install specific base R files directly from https://cran.r-project.org/. If you prefer to use RStudio IDE you still need to install R core available from CRAN website first and then download and run installers of the most recent version of RStudio IDE specific for your platform from https://www.rstudio.com/products/rstudio/download/. Personally I prefer to use RStudio, owing to its practical add-ons such as code highlighting and more user-friendly GUI, however, there is no particular reason why you can't use just the simple R core installation if you want to. Having said that, in this book we will be using RStudio in most of the examples. All code snippets have been executed and run on a MacBook Pro laptop with Mac OS X (Yosemite) operating system, 2.3 GHz Intel Core i5 processor, 1TB solid-state hard drive and 16GB of RAM memory, but you should also be fine with a much weaker configuration. In this article we won't be using any large data, and even in the remaining parts of this book the data sets used are limited to approximately 100MB to 130MB in size each. You are also provided with links and references to full Big Data whenever possible. If you would like to follow the practical parts of this book you are advised to download and unzip the R code and data for each article from the web page created for this book by Packt Publishing. If you use this book in PDF format it is not advisable to copy the code and paste it into the R console. When printed, some characters (like quotation marks " ") may be encoded differently than in R and the execution of such commands may result in errors being returned by the R console. Once you have downloaded both R core and RStudio installation files, follow the on-screen instructions for each installer. When you have finished installing them, open your RStudio software. Upon initialization of the RStudio you should see its GUI with a number of windows distributed on the screen. The largest one is the console in which you input and execute the code, line by line. You can also invoke the editor panel (it is recommended) by clicking on the white empty file icon in the top left corner of the RStudio software or alternatively by navigating to File | New File | R Script. If you have downloaded the R code from the book page of the Packt Publishing website, you may also just click on the Open an existing file (Ctrl + O) (a yellow open folder icon) and locate the downloaded R code on your computer's hard drive (or navigate to File | Open File…). Now your RStudio session is open and we can adjust some most essential settings. First, you need to set your working directory to the location on your hard drive where your data files are. If you know the specific location you can just type the setwd() command with a full and exact path to the location of your data as follows: > setwd("/Users/simonwalkowiak/Desktop/data") Of course your actual path will differ from mine, shown in the preceding code, however please mind that if you copy the path from the Windows Explorer address bar you will need to change the backslashes to forward slashes / (or to double backslashes \). Also, the path needs to be kept within the quotation marks "…". Alternatively you can set your working directory by navigating to Session | Set Working Directory | Choose Directory… to manually select the folder in which you store the data for this session. 
Apart from the ones we have already described, there are other ways to set your working directory correctly. In fact most of the operations, and even more complex data analysis and processing activities, can be achieved in R in numerous ways. For obvious reasons, we won't be presenting all of them, but we will just focus on the frequently used methods and some tips and hints applicable to special or difficult scenarios. You can check whether your working directory has been set correctly by invoking the following line: > getwd() [1] "/Users/simonwalkowiak/Desktop/data" From what you can see, the getwd() function returned the correct destination for my previously defined working directory. Setting the URLs to R repositories It is always good practice to check whether your R repositories are set correctly. R repositories are servers located at various institutes and organizations around the world, which store recent updates and new versions of third-party R packages. It is recommended that you set the URL of your default repository to the CRAN server and choose a mirror that is located relatively close to you. To set the repositories you may use the following code: > setRepositories(addURLs = c(CRAN = "https://cran.r-project.org/")) You can check your current, or default, repository URLs by invoking the following function: > getOption("repos") The output will confirm your URL selection:               CRAN "https://cran.r-project.org/" You will be able to choose specific mirrors when you install a new package for the first time during the session, or you may navigate to Tools | Global Options… | Packages. In the Package management section of the window you can alter the default CRAN mirror location—click on Change… button to adjust. Once your repository URLs and working directory are set, you can go on to create data structures that are typical for R programming language. R data structures The concept of data structures in various programming languages is extremely important and cannot be overlooked. Similarly in R, available data structures allow you to hold any type of data and use them for further processing and analysis. The kind of data structure which you use, puts certain constraints on how you can access and process data stored in this structure, and what manipulation techniques you can use. This section will briefly guide you through a number of basic data structures available in R language. Vectors Whenever I teach statistical computing courses, I always start by introducing R learners to vectors as the first data structure they should get familiar with. Vectors are one-dimensional structures that can hold any type of data that is numeric, character, or logical. In simple terms, a vector is a sequence of some sort of values (for example numeric, character, logical, and many more) of specified length. The most important thing that you need to remember is that an atomic vector may contain only one type of data. Let's then create a vector with 10 random deviates from a standard normal distribution, and store all its elements in an object which we will call vector1. In your RStudio console (or its editor) type the following: > vector1 <- rnorm(10) Let's now see the contents of our newly created vector1: > vector1 [1] -0.37758383 -2.30857701 2.97803059 -0.03848892 1.38250714 [6] 0.13337065 -0.51647388 -0.81756661 0.75457226 -0.01954176 As we drew random values, your vector most likely contains different elements to the ones shown in the preceding example. 
Let's then make sure that my new vector (vector2) is the same as yours. In order to do this we need to set a seed from which we will be drawing the values: > set.seed(123) > vector2 <- rnorm(10, mean=3, sd=2) > vector2 [1] 1.8790487 2.5396450 6.1174166 3.1410168 3.2585755 6.4301300 [7] 3.9218324 0.4698775 1.6262943 2.1086761 In the preceding code we've set the seed to an arbitrary number (123) in order to allow you to replicate the values of elements stored in vector2 and we've also used some optional parameters of the rnorm() function, which enabled us to specify two characteristics of our data, that is the arithmetic mean (set to 3) and standard deviation (set to 2). If you wish to inspect all available arguments of the rnorm() function, its default settings, and examples of how to use it in practice, type ?rnorm to view help and information on that specific function. However, probably the most common way in which you will be creating a vector of data is by using the c() function (c stands for concatenate) and then explicitly passing the values of each element of the vector: > vector3 <- c(6, 8, 7, 1, 2, 3, 9, 6, 7, 6) > vector3 [1] 6 8 7 1 2 3 9 6 7 6 In the preceding example we've created vector3 with 10 numeric elements. You can use the length() function of any data structure to inspect the number of elements: > length(vector3) [1] 10 The class() and mode() functions allow you to determine how to handle the elements of vector3 and how the data are stored in vector3 respectively. > class(vector3) [1] "numeric" > mode(vector3) [1] "numeric" The subtle difference between both functions becomes clearer if we create a vector that holds levels of categorical variable (known as a factor in R) with character values: > vector4 <- c("poor", "good", "good", "average", "average", "good", "poor", "good", "average", "good") > vector4 [1] "poor" "good" "good" "average" "average" "good" "poor" [8] "good" "average" "good" > class(vector4) [1] "character" > mode(vector4) [1] "character" > levels(vector4) NULL In the preceding example, both the class() and mode() outputs of our character vector are the same, as we still haven't set it to be treated as a categorical variable, and we haven't defined its levels (the contents of the levels() function is empty—NULL). In the following code we will explicitly set the vector to be recognized as categorical with three levels: > vector4 <- factor(vector4, levels = c("poor", "average", "good")) > vector4 [1] poor good good average average good poor good [8] average good Levels: poor average good The sequence of levels doesn't imply that our vector is ordered. We can order the levels of factors in R using the ordered() command. For example, you may want to arrange the levels of vector4 in reverse order, starting from "good": > vector4.ord <- ordered(vector4, levels = c("good", "average", "poor")) > vector4.ord [1] poor good good average average good poor good [8] average good Levels: good < average < poor You can see from the output that R has now properly recognized the order of our levels, which we had defined. We can now apply class() and mode() functions on the vector4.ord object: > class(vector4.ord) [1] "ordered" "factor" > mode(vector4.ord) [1] "numeric" You may very likely be wondering why the mode() function returned "numeric" type instead of "character". The answer is simple. 
By setting the levels of our factor, R has assigned values 1, 2, and 3 to "good", "average" and "poor" respectively, exactly in the same order as we had defined them in the ordered() function. You can check this using levels() and str() functions: > levels(vector4.ord) [1] "good" "average" "poor" > str(vector4.ord) Ord.factor w/ 3 levels "good"<"average"<..: 3 1 1 2 2 1 3 1 2 1 Just to finalize the subject of vectors, let's create a logical vector, which contains only TRUE and FALSE values: > vector5 <- c(TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE) > vector5 [1] TRUE FALSE TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE Similarly, for all other vectors already presented, feel free to check their structure, class, mode, and length using appropriate functions shown in this section. What outputs did those commands return? Scalars The reason why I always start from vectors is that scalars just seem trivial when they follow vectors. To simplify things even more, think of scalars as one-element vectors which are traditionally used to hold some constant values for example: > a1 <- 5 > a1 [1] 5 Of course you may use scalars in computations and also assign any one-element outputs of mathematical or statistical operations to another, arbitrary named scalar for example: > a2 <- 4 > a3 <- a1 + a2 > a3 [1] 9 In order to complete this short subsection on scalars, create two separate scalars which will hold a character and a logical value. Matrices A matrix is a two-dimensional R data structure in which each of its elements must be of the same type; that is numeric, character, or logical. As matrices consist of rows and columns, their shape resembles tables. In fact, when creating a matrix, you can specify how you want to distribute values across its rows and columns for example: > y <- matrix(1:20, nrow=5, ncol=4) > y [,1] [,2] [,3] [,4] [1,] 1 6 11 16 [2,] 2 7 12 17 [3,] 3 8 13 18 [4,] 4 9 14 19 [5,] 5 10 15 20 In the preceding example we have allocated a sequence of 20 values (from 1 to 20) into five rows and four columns, and by default they have been distributed by column. We may now create another matrix in which we will distribute the values by rows and give names to rows and columns using the dimnames argument (dimnames stands for names of dimensions) in the matrix() function: > rows <- c("R1", "R2", "R3", "R4", "R5") > columns <- c("C1", "C2", "C3", "C4") > z <- matrix(1:20, nrow=5, ncol=4, byrow=TRUE, dimnames=list(rows, columns)) > z C1 C2 C3 C4 R1 1 2 3 4 R2 5 6 7 8 R3 9 10 11 12 R4 13 14 15 16 R5 17 18 19 20 As we are talking about matrices it's hard not to mention anything about how to extract specific elements stored in a matrix. This skill will actually turn out to be very useful when we get to subsetting real data sets. Looking at the matrix y, for which we didn't define any names of its rows and columns, notice how R denotes them. The row numbers come in the format [r, ], where r is a consecutive number of a row, whereas the column are identified by [ ,c], where c is a consecutive number of a column. If you then wished to extract a value stored in the fourth row of the second column of our matrix y, you could use the following code to do so: > y[4,2] [1] 9 In case you wanted to extract the whole column number three from our matrix y, you could type the following: > y[,3] [1] 11 12 13 14 15 As you can see, we don't even need to allow an empty space before the comma in order for this short script to work. 
Let's now imagine you would like to extract three values stored in the second, third, and fifth rows of the first column of our matrix z with named rows and columns. In this case you may still use the previously shown numeric notation; you do not need to refer explicitly to the names of the dimensions of matrix z. Additionally, notice that in order to extract several values we have to specify their row locations as a vector, hence we will put their row coordinates inside the c() function which we had previously used to create vectors:
> z[c(2, 3, 5), 1]
R2 R3 R5
 5  9 17
Similar rules of extracting data will apply to other data structures in R such as arrays, lists, and data frames, which we are going to present next.
Arrays
Arrays are very similar to matrices with only one exception: they may contain more than two dimensions. However, just like matrices or vectors, they may only hold one type of data. In R language, arrays are created using the array() function:
> array1 <- array(1:20, dim=c(2,2,5))
> array1
, , 1
     [,1] [,2]
[1,]    1    3
[2,]    2    4
, , 2
     [,1] [,2]
[1,]    5    7
[2,]    6    8
, , 3
     [,1] [,2]
[1,]    9   11
[2,]   10   12
, , 4
     [,1] [,2]
[1,]   13   15
[2,]   14   16
, , 5
     [,1] [,2]
[1,]   17   19
[2,]   18   20
The dim argument, which was used within the array() function, specifies how many dimensions you want to distribute your data across. As we had 20 values (from 1 to 20) we had to make sure that our array could hold all 20 elements, therefore we decided to assign them to two rows, two columns, and five layers of the third dimension (2 x 2 x 5 = 20). You can check the dimensionality of your multi-dimensional R objects with the dim() command:
> dim(array1)
[1] 2 2 5
As with matrices, you can use standard rules for extracting specific elements from your arrays. The only difference is that now you have additional dimensions to take care of. Let's assume you want to extract a specific value located in the second row of the first column in the fourth layer of our array1:
> array1[2, 1, 4]
[1] 14
Also, if you need to find the location of a specific value, for example 11, within the array, you can simply type the following line:
> which(array1==11, arr.ind=TRUE)
     dim1 dim2 dim3
[1,]    1    2    3
Here, the which() function returns the indices of the array (arr.ind=TRUE) where the sought value equals 11 (hence ==). As we had only one instance of value 11 in our array, there is only one row specifying its location in the output. If we had more instances of 11, additional rows would be returned, indicating the indices for each element equal to 11.
Data frames
The following two short subsections concern two of probably the most widely used R data structures. Data frames are very similar to matrices, but they may contain different types of data. Here you might have suddenly thought of a typical rectangular data set with rows and columns, or observations and variables. In fact you are correct. Most data sets are indeed imported into R as data frames.
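For instance, a single call to the built-in read.csv() function is usually all it takes to import a comma-separated file as a data frame. The file name used below (survey.csv) is purely hypothetical and serves only as an illustration; substitute the path to a real file on your machine:
> survey <- read.csv("survey.csv", header = TRUE)
> class(survey)
[1] "data.frame"
Note that in R versions prior to 4.0.0, read.csv() converts character columns to factors by default (stringsAsFactors = TRUE), while newer versions keep them as plain character vectors.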
You can also create a simple data frame manually with the data.frame() function, but as each column in the data frame may be of a different type, we must first create vectors which will hold data for specific columns: > subjectID <- c(1:10) > age <- c(37,23,42,25,22,25,48,19,22,38) > gender <- c("male", "male", "male", "male", "male", "female", "female", "female", "female", "female") > lifesat <- c(9,7,8,10,4,10,8,7,8,9) > health <- c("good", "average", "average", "good", "poor", "average", "good", "poor", "average", "good") > paid <- c(T, F, F, T, T, T, F, F, F, T) > dataset <- data.frame(subjectID, age, gender, lifesat, health, paid) > dataset subjectID age gender lifesat health paid 1 1 37 male 9 good TRUE 2 2 23 male 7 average FALSE 3 3 42 male 8 average FALSE 4 4 25 male 10 good TRUE 5 5 22 male 4 poor TRUE 6 6 25 female 10 average TRUE 7 7 48 female 8 good FALSE 8 8 19 female 7 poor FALSE 9 9 22 female 8 average FALSE 10 10 38 female 9 good TRUE The preceding example presents a simple data frame which contains some dummy imaginary data, possibly a sample from a basic psychological experiment, which measured subjects' life satisfaction (lifesat) and their health status (health) and also collected other socio-demographic information such as age and gender, and whether the participant was a paid subject or a volunteer. As we deal with various types of data, the elements for each column had to be amalgamated into a single structure of a data frame using the data.frame() command, and specifying the names of objects (vectors) in which we stored all values. You can inspect the structure of this data frame with the previously mentioned str() function: > str(dataset) 'data.frame': 10 obs. of 6 variables: $ subjectID: int 1 2 3 4 5 6 7 8 9 10 $ age : num 37 23 42 25 22 25 48 19 22 38 $ gender : Factor w/ 2 levels "female","male": 2 2 2 2 2 1 1 1 1 1 $ lifesat : num 9 7 8 10 4 10 8 7 8 9 $ health : Factor w/ 3 levels "average","good",..: 2 1 1 2 3 1 2 3 1 2 $ paid : logi TRUE FALSE FALSE TRUE TRUE TRUE ... The output of str() gives you some basic insights into the shape and format of your data in the dataset object, for example, number of observations and variables, names of variables, types of data they hold, and examples of values for each variable. While discussing data frames, it may also be useful to introduce you to another way of creating subsets. As presented earlier, you may apply standard extraction rules to subset data of your interest. For example, suppose you want to print only those columns which contain age, gender, and life satisfaction information from our dataset data frame. You may use the following two alternatives (the output not shown to save space, but feel free to run it): > dataset[,2:4] #or > dataset[, c("age", "gender", "lifesat")] Both lines of code will produce exactly the same results. The subset() function however gives you additional capabilities of defining conditional statements which will filter the data, based on the output of logical operators. You can replicate the preceding output using subset() in the following way: > subset(dataset[c("age", "gender", "lifesat")]) Assume now that you want to create a subset with all subjects who are over 30 years old, and with a score of greater than or equal to eight on the life satisfaction scale (lifesat). 
The subset() function comes in very handy:
> subset(dataset, age > 30 & lifesat >= 8)
   subjectID age gender lifesat  health  paid
1          1  37   male       9    good  TRUE
3          3  42   male       8 average FALSE
7          7  48 female       8    good FALSE
10        10  38 female       9    good  TRUE
Or suppose you want to produce an output with the two socio-demographic variables of age and gender, for only those subjects who were paid to participate in this experiment:
> subset(dataset, paid==TRUE, select=c("age", "gender"))
   age gender
1   37   male
4   25   male
5   22   male
6   25 female
10  38 female
We will perform much more thorough and complex data transformations on real data frames in the second part of this article.
Lists
A list in R is a data structure which is a collection of other objects. For example, in a list you can store vectors, scalars, matrices, arrays, data frames, and even other lists. In fact, lists in R are vectors, but unlike the atomic vectors introduced earlier in this section, lists can hold many different types of data. In the following example, we will construct a simple list (using the list() function) which will include a variety of other data structures:
> simple.vector1 <- c(1, 29, 21, 3, 4, 55)
> simple.matrix <- matrix(1:24, nrow=4, ncol=6, byrow=TRUE)
> simple.scalar1 <- 5
> simple.scalar2 <- "The List"
> simple.vector2 <- c("easy", "moderate", "difficult")
> simple.list <- list(name=simple.scalar2, matrix=simple.matrix, vector=simple.vector1, scalar=simple.scalar1, difficulty=simple.vector2)
> simple.list
$name
[1] "The List"
$matrix
     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]    1    2    3    4    5    6
[2,]    7    8    9   10   11   12
[3,]   13   14   15   16   17   18
[4,]   19   20   21   22   23   24
$vector
[1]  1 29 21  3  4 55
$scalar
[1] 5
$difficulty
[1] "easy"      "moderate"  "difficult"
> str(simple.list)
List of 5
 $ name      : chr "The List"
 $ matrix    : int [1:4, 1:6] 1 7 13 19 2 8 14 20 3 9 ...
 $ vector    : num [1:6] 1 29 21 3 4 55
 $ scalar    : num 5
 $ difficulty: chr [1:3] "easy" "moderate" "difficult"
Looking at the preceding output, you can see that we have assigned names to each component of our list, and the str() function prints them as if they were variables of a standard rectangular data set. In order to extract specific elements from a list, you first need to use the double square bracket notation [[x]] to identify a component x within the list. For example, assuming you want to print the element stored in the first row and the third column of the second component, you may use the following line in R:
> simple.list[[2]][1,3]
[1] 3
Owing to their flexibility, lists are commonly used as the preferred data structure in the outputs of statistical functions. It is therefore important for you to know how to deal with lists and what sort of methods you can apply to extract and process data stored in them. Once you are familiar with the basic features of the data structures available in R, you may wish to visit Hadley Wickham's online book at http://adv-r.had.co.nz/ in which he explains various more advanced concepts related to each native data structure in the R language, and different techniques of subsetting data, depending on the way they are stored.
Exporting R data objects
In the previous section we created numerous objects, which you can inspect in the Environment tab window in RStudio.
Alternatively, you may use the ls() function to list all objects stored in your global environment: > ls() If you've followed the article along, and run the script for this book line-by-line, the output of the ls() function should hopefully return 27 objects: [1] "a1" "a2" "a3" [4] "age" "array1" "columns" [7] "dataset" "gender" "health" [10] "lifesat" "paid" "rows" [13] "simple.list" "simple.matrix" "simple.scalar1" [16] "simple.scalar2" "simple.vector1" "simple.vector2" [19] "subjectID" "vector1" "vector2" [22] "vector3" "vector4" "vector4.ord" [25] "vector5" "y" "z" In this section we will present various methods of saving the created objects to your local drive and exporting their contents to a number of the most commonly used file formats. Sometimes, for various reasons, it may happen that you need to leave your project and exit RStudio or shut your PC down. If you do not save your created objects, you will lose all of them, the moment you close RStudio. Remember that R stores created data objects in the RAM of your machine, and whenever these objects are not in use any longer, R frees them from the memory, which simply means that they get deleted. Of course this might turn out to be quite costly, especially if you had not saved your original R script, which would have enabled you to replicate all the steps of your data processing activities when you start a new session in R. In order to prevent the objects from being deleted, you can save all or selected ones as .RData files on your hard drive. In the first case, you may use the save.image() function which saves your whole current workspace with all objects to your current working directory: > save.image(file = "workspace.RData") If you are dealing with large objects, first make sure you have enough storage space available on your drive (this is normally not a problem any longer), or alternatively you can reduce the size of the saved objects using one of the compression methods available. For example, the above workspace.RData file was 3,751 bytes in size without compression, but when xz compression was applied the size of the resulting file decreased to 3,568 bytes. > save.image(file = "workspace2.RData", compress = "xz") Of course, the difference in sizes in the presented example is minuscule, as we are dealing with very small objects, however it gets much more significant for bigger data structures. The trade-off of applying one of the compression methods is the time it takes for R to save and load .RData files. If you prefer to save only chosen objects (for example dataset data frame and simple.list list) you can achieve this with the save() function: > save(dataset, simple.list, file = "two_objects.RData") You may now test whether the above solutions worked by cleaning your global environment of all objects, and then loading one of the created files, for example: > rm(list=ls()) > load("workspace2.RData") As an additional exercise, feel free to explore other functions which allow you to write text representations of R objects, for example dump() or dput(). More specifically, run the following commands and compare the returned outputs: > dump(ls(), file = "dump.R", append = FALSE) > dput(dataset, file = "dput.txt") The save.image() and save() functions only create images of your workspace or selected objects on the hard drive. 
It is an entirely different story if you want to export some of the objects to data files of specified formats, for example comma-separated, tab-delimited, or proprietary formats like Microsoft Excel, SPSS, or Stata. The easiest way to export R objects to generic file formats like CSV, TXT, or TAB is through the cat() function, but it only works on atomic vectors:
> cat(age, file="age.txt", sep=",", fill=TRUE, labels=NULL, append=TRUE)
> cat(age, file="age.csv", sep=",", fill=TRUE, labels=NULL, append=TRUE)
The preceding code creates two files, one as a text file and the other in a comma-separated format, both of which contain the values from the age vector that we had previously created for the dataset data frame. The sep argument is a character vector of strings to append after each element, the fill option is a logical argument which controls whether the output is automatically broken into lines (if set to TRUE), the labels parameter allows you to add a character vector of labels for each printed line of data in the file, and the append logical argument enables you to append the output of the call to an already existing file with the same name. In order to export vectors and matrices to TXT, CSV, or TAB formats you can use the write() function, which writes out a matrix or a vector in a specified number of columns, for example:
> write(age, file="agedata.csv", ncolumns=2, append=TRUE, sep=",")
> write(y, file="matrix_y.tab", ncolumns=2, append=FALSE, sep="\t")
Another method of exporting matrices is provided by the MASS package (make sure to install it with the install.packages("MASS") function) through the write.matrix() command:
> library(MASS)
> write.matrix(y, file="ymatrix.txt", sep=",")
For large matrices, the write.matrix() function allows users to specify the size of the blocks in which the data are written, through the blocksize argument. Probably the most common R data structure that you are going to export to different file formats will be a data frame. The generic write.table() function gives you the option to save your processed data frame objects to standard data formats, for example TAB, TXT, or CSV:
> write.table(dataset, file="dataset1.txt", append=TRUE, sep=",", na="NA", col.names=TRUE, row.names=FALSE, dec=".")
The append and sep arguments should already be clear to you as they were explained earlier. In the na option you may specify an arbitrary string to use for missing values in the data. The logical parameter col.names allows users to append the names of columns to the output file, and the dec parameter sets the string used for decimal points and must be a single character. In the example, we used row.names set to FALSE, as the names of the rows in the data are the same as the values of the subjectID column. However, it is very likely that in other data sets the ID variable may differ from the names (or numbers) of rows, so you may want to control it depending on the characteristics of your data. Two similar functions, write.csv() and write.csv2(), are just convenience wrappers for saving CSV files, and they only differ from the generic write.table() function in the default settings of some of their parameters, for example sep and dec. Feel free to explore these subtle differences at your leisure. To complete this section of the article we need to present how to export your R data frames to third-party formats. Amongst several frequently used methods, at least four of them are worth mentioning here.
First, if you wish to write a data frame to a proprietary Microsoft Excel format, such as XLS or XLSX, you should probably use the WriteXLS package (please use install.packages("WriteXLS") if you have not done it yet) and its WriteXLS() function: > library(WriteXLS) > WriteXLS("dataset", "dataset1.xlsx", SheetNames=NULL, row.names=FALSE, col.names=TRUE, AdjWidth=TRUE, envir=parent.frame()) The WriteXLS() command offers users a number of interesting options, for instance you can set the names of the worksheets (SheetNames argument), adjust the widths of columns depending on the number of characters of the longest value (AdjWidth), or even freeze rows and columns just as you do it in Excel (FreezeRow and FreezeCol parameters). Please note that in order for the WriteXLS package to work, you need to have Perl installed on your machine. The package creates Excel files using Perl scripts called WriteXLS.pl for Excel 2003 (XLS) files, and WriteXLSX.pl for Excel 2007 and later version (XLSX) files. If Perl is not present on your system, please make sure to download and install it from https://www.perl.org/get.html. After the Perl installation, you may have to restart your R session and load the WriteXLS package again to apply the changes. For solutions to common Perl issues please visit the following websites: https://www.perl.org/docs.html, http://www.ahinea.com/en/tech/perl-unicode-struggle.html, and http://www.perl.com/pub/2012/04/perlunicook-standard-preamble.html or search StackOverflow and similar websites for R and Perl related specific problems. Another very useful way of writing R objects to the XLSX format is provided by the openxlsx package through the write.xlsx() function, which, apart from data frames, also allows lists to be easily written to Excel spreadsheets. Please note that Windows users may need to install the Rtools package in order to use openxlsx functionalities. The write.xlsx() function gives you a large choice of possible options to set, including a custom style to apply to column names (through headerStyle argument), the color of cell borders (borderColour), or even its line style (borderStyle). The following example utilizes only the most common and minimal arguments required to write a list to the XLSX file, but be encouraged to explore other options offered by this very flexible function: > write.xlsx(simple.list, file = "simple_list.xlsx") A third-party package called foreign makes it possible to write data frames to other formats used by well-known statistical tools such as SPSS, Stata, or SAS. When creating files, the write.foreign() function requires users to specify the names of both the data and code files. Data files hold raw data, whereas code files contain scripts with the data structure and metadata (value and variable labels, variable formats, and so on) written in the proprietary syntax. In the following example, the code writes the dataset data frame to the SPSS format: > library(foreign) > write.foreign(dataset, "datafile.txt", "codefile.txt", package="SPSS") Finally, another package called rio contains only three functions, allowing users to quickly import(), export() and convert() data between a large array of file formats, (for example TSV, CSV, RDS, RData, JSON, DTA, SAV, and many more). The package, in fact, is dependent on a number of other R libraries, some of which, for example foreign and openxlsx, have already been presented in this article. 
The rio package does not introduce any new functionalities apart from the default arguments characteristic for underlying export functions, so you still need to be familiar with the original functions and their parameters if you require more advanced exporting capabilities. But, if you are only looking for a no-fuss general export function, the rio package is definitely a good shortcut to take: > export(dataset, format = "stata") > export(dataset, "dataset1.csv", col.names = TRUE, na = "NA") Summary In this article, we have provided you with quite a bit of theory, and hopefully a lot of practical examples of data structures available to R users. You've created several objects of different types, and you've become familiar with a variety of data and file formats to offer. We then showed you how to save R objects held in your R workspace to external files on your hard drive, or to export them to various standard and proprietary file formats. Resources for Article: Further resources on this subject: Fast Data Manipulation with R [article] The Data Science Venn Diagram [article] Deployment and DevOps [article]
Machine Learning Technique: Supervised Learning

Packt
09 Nov 2016
7 min read
In this article by Andrea Isoni, author of the book Machine Learning for the Web, we will discuss the most relevant regression and classification techniques. All of these algorithms share the same background procedure, and usually the name of the algorithm refers to both a classification and a regression method. The linear regression algorithms, Naive Bayes, decision tree, and support vector machine are going to be discussed in the following sections. To understand how to employ the techniques, a classification and a regression problem will be solved using the mentioned methods. Essentially, a labeled training dataset will be used to train the models, which means to find the values of the parameters, as we discussed in the introduction. As usual, the code is available in my GitHub folder at https://github.com/ai2010/machine_learning_for_the_web/tree/master/chapter_3/. (For more resources related to this topic, see here.) We will conclude the article with an extra algorithm that may be used for classification, although it is not specifically designed for this purpose (the hidden Markov model). We will now begin to explain the general causes of error in the methods when predicting the true labels associated with a dataset.
Model error estimation
We said that the trained model is used to predict the labels of new data, and the quality of the prediction depends on the ability of the model to generalize, that is, to correctly predict cases not present in the training data. This is a well-known problem in the literature, and it is related to two concepts: the bias and the variance of the outputs. The bias is the error due to a wrong assumption in the algorithm. Given a point $x^{(t)}$ with label $y_t$, the model is biased if, when trained with different training sets, the predicted label $y_t^{pred}$ will always be different from $y_t$. The variance error instead refers to the range of differently (and wrongly) predicted labels for the given point $x^{(t)}$. A classic example to explain the concepts is to consider a circle with the true value at the center (the true label), as in the figure captioned Variance and bias example. The closer the predicted labels are to the center, the more unbiased the model and the lower the variance (the top-left case in that figure); the other three cases are the remaining combinations of high and low bias and variance. A model with low variance and low bias errors will have the predicted labels (the blue dots in the figure) concentrated on the red center (the true label). A high bias error occurs when the predictions are far away from the true label, while high variance appears when the predictions lie in a wide range of values. We have already seen that labels can be continuous or discrete, corresponding to regression and classification problems respectively. Most of the models are suitable for solving both problems, and we are going to use the words regression and classification while referring to the same model. More formally, given a set of N data points and corresponding labels, a model with a set of estimated parameters $\hat{\theta}$, where $\theta$ denotes the true parameter values, will have a mean square error (MSE) equal to:
$MSE = E[(\hat{\theta}-\theta)^2] = (E[\hat{\theta}]-\theta)^2 + E[(\hat{\theta}-E[\hat{\theta}])^2] = bias^2 + variance$
We will use the MSE as a measure to evaluate the methods discussed in this article. Now we will start describing the generalized linear methods.
Generalized linear models
The generalized linear model is a group of models that try to find the M parameters $\theta_j$ that form a linear relationship between the labels $y_i$ and the feature vectors $x^{(i)}$, that is:
$y_i = \sum_{j=0}^{M-1}\theta_j x_j^{(i)} + \epsilon_i$
Here, $\epsilon_i$ are the errors of the model.
The algorithm for finding the parameters tries to minimize the total error of the model, defined by the cost function J:
$J(\theta) = \frac{1}{2}\sum_{i=1}^{N}\left(h_\theta(x^{(i)}) - y_i\right)^2$
Here, $h_\theta(x^{(i)})$ is the value predicted by the model for the point $x^{(i)}$. The minimization of J is achieved using an iterative algorithm called batch gradient descent, which repeats the following update for every parameter j until convergence:
$\theta_j := \theta_j - \alpha\sum_{i=1}^{N}\left(h_\theta(x^{(i)}) - y_i\right)x_j^{(i)}$
Here, $\alpha$ is called the learning rate, and it is a trade-off between convergence speed and convergence precision. An alternative algorithm, called stochastic gradient descent, loops over the training examples $i = 1, \dots, N$ and applies the update:
$\theta_j := \theta_j - \alpha\left(h_\theta(x^{(i)}) - y_i\right)x_j^{(i)}$
The $\theta_j$ is updated for each training example i instead of waiting to sum over the entire training set. The last algorithm converges near the minimum of J, typically faster than batch gradient descent, but the final solution may oscillate around the real values of the parameters. The following paragraphs describe the most common models and the corresponding cost function, J.
Linear regression
Linear regression is the simplest algorithm and is based on the model:
$h_\theta(x) = \sum_{j=0}^{M-1}\theta_j x_j = \theta^T x$
The cost function and update rule are the ones given above:
$J(\theta) = \frac{1}{2}\sum_{i=1}^{N}\left(\theta^T x^{(i)} - y_i\right)^2 \quad\text{and}\quad \theta_j := \theta_j - \alpha\sum_{i=1}^{N}\left(\theta^T x^{(i)} - y_i\right)x_j^{(i)}$
Ridge regression
Ridge regression, also known as Tikhonov regularization, adds a term to the cost function J such that:
$J_{ridge}(\theta) = \frac{1}{2}\sum_{i=1}^{N}\left(h_\theta(x^{(i)}) - y_i\right)^2 + \lambda\sum_{j}\theta_j^2$
where $\lambda$ is the regularization parameter. The additional term has the function of preferring a certain set of parameters over all the possible solutions, penalizing all the parameters $\theta_j$ different from 0. The final set of $\theta_j$ is shrunk around 0, lowering the variance of the parameters but introducing a bias error. Indicating with the superscript linear the parameters from the linear regression, the ridge regression parameters are related to them (in the simplest case of orthonormal features) by the following formula:
$\theta_j^{ridge} = \frac{\theta_j^{linear}}{1+\lambda}$
This clearly shows that the larger the $\lambda$ value, the more the ridge parameters are shrunk around 0.
Lasso regression
Lasso regression is an algorithm similar to ridge regression, the only difference being that the regularization term is the sum of the absolute values of the parameters:
$J_{lasso}(\theta) = \frac{1}{2}\sum_{i=1}^{N}\left(h_\theta(x^{(i)}) - y_i\right)^2 + \lambda\sum_{j}|\theta_j|$
Logistic regression
Despite the name, this algorithm is used for (binary) classification problems, so we define the labels $y_i \in \{0, 1\}$. The model is given by the so-called logistic function, expressed by:
$h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}$
In this case, the cost function is defined as follows:
$J(\theta) = -\sum_{i=1}^{N}\left[y_i\log h_\theta(x^{(i)}) + (1 - y_i)\log\left(1 - h_\theta(x^{(i)})\right)\right]$
From this, the update rule is formally the same as for linear regression (but the model definition, $h_\theta(x)$, is different):
$\theta_j := \theta_j - \alpha\sum_{i=1}^{N}\left(h_\theta(x^{(i)}) - y_i\right)x_j^{(i)}$
Note that the prediction for a point p, $h_\theta(x^{(p)})$, is a continuous value between 0 and 1. So usually, to estimate the class label, we set a threshold at $h_\theta(x^{(p)}) = 0.5$ such that:
$y_p^{pred} = 1 \text{ if } h_\theta(x^{(p)}) \geq 0.5, \qquad y_p^{pred} = 0 \text{ otherwise}$
The logistic regression algorithm is applicable to multiple-label problems using the techniques one versus all or one versus one. Using the first method, a problem with K classes is solved by training K logistic regression models, each one assuming the labels of the considered class j as +1 and all the rest as 0. The second approach consists of training a model for each pair of labels ($\frac{K(K-1)}{2}$ trained models).
Probabilistic interpretation of generalized linear models
Now that we have seen the generalized linear model, let's find the parameters $\theta_j$ that satisfy the relationship:
$y_i = \theta^T x^{(i)} + \epsilon_i$
In the case of linear regression, we can assume $\epsilon_i$ to be normally distributed with mean 0 and variance $\sigma^2$, such that the probability $p(y_i|x^{(i)};\theta)$ is equivalent to:
$p(y_i|x^{(i)};\theta) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{(y_i - \theta^T x^{(i)})^2}{2\sigma^2}\right)$
Therefore, the total likelihood of the system can be expressed as follows:
$L(\theta) = \prod_{i=1}^{N} p(y_i|x^{(i)};\theta)$
In the case of the logistic regression algorithm, we are assuming that the logistic function itself is the probability:
$p(y=1|x;\theta) = h_\theta(x), \qquad p(y=0|x;\theta) = 1 - h_\theta(x)$
Then the likelihood can be expressed by:
$L(\theta) = \prod_{i=1}^{N} h_\theta(x^{(i)})^{y_i}\left(1 - h_\theta(x^{(i)})\right)^{1-y_i}$
In both cases, it can be shown that maximizing the likelihood is equivalent to minimizing the cost function, so the gradient descent will be the same.
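The book implements these models in Python with sklearn and pandas; purely as an illustration of the batch gradient descent update written above (and not as the author's implementation), here is a minimal sketch in R, the language used elsewhere in this compilation, with made-up data whose true parameters are known:
> set.seed(1)
> N <- 100
> x <- cbind(1, runif(N))                      # column of 1s for the intercept, plus one feature
> y <- 2 + 3 * x[, 2] + rnorm(N, sd = 0.5)     # true parameters are (2, 3), plus Gaussian noise
> theta <- c(0, 0)                             # initial parameter values
> alpha <- 0.05                                # learning rate
> for (step in 1:2000) {
+   h <- x %*% theta                           # model predictions h_theta(x)
+   theta <- theta - alpha * t(x) %*% (h - y) / N   # batch update; dividing by N only rescales alpha
+ }
> round(theta, 2)                              # should end up close to the true values 2 and 3
Replacing the body of the loop with an update that uses one randomly chosen training example per step would turn this into the stochastic variant described above.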
k-nearest neighbours (KNN)
This is a very simple classification (or regression) method in which, given a set of feature vectors $x^{(i)}$ with corresponding labels $y_i$, a test point $x^{(t)}$ is assigned the label value with the majority of label occurrences among the K nearest neighbours found, using a distance measure such as the following:
Euclidean: $d(x^{(t)}, x^{(i)}) = \sqrt{\sum_j \left(x_j^{(t)} - x_j^{(i)}\right)^2}$
Manhattan: $d(x^{(t)}, x^{(i)}) = \sum_j \left|x_j^{(t)} - x_j^{(i)}\right|$
Minkowski: $d(x^{(t)}, x^{(i)}) = \left(\sum_j \left|x_j^{(t)} - x_j^{(i)}\right|^q\right)^{1/q}$ (if q=2, this reduces to the Euclidean distance)
In the case of regression, the value $y_t$ is calculated by replacing the majority of occurrences with the average of the labels of the K nearest neighbours. The simplest average (or the majority of occurrences) uses uniform weights, so each point has the same importance regardless of its actual distance from $x^{(t)}$. However, a weighted average, with weights equal to the inverse of the distance from $x^{(t)}$, may be used.
Summary
In this article, the major classification and regression algorithms, together with the techniques to implement them, were discussed. You should now be able to understand in which situation each method can be used and how to implement it using Python and its libraries (sklearn and pandas).
Resources for Article:
Further resources on this subject: Supervised Machine Learning [article] Unsupervised Learning [article] Specialized Machine Learning Topics [article]
LabVIEW Basics

Packt
02 Nov 2016
8 min read
In this article by Behzad Ehsani, author of the book Data Acquisition using LabVIEW, after a brief introduction and a short note on installation, we will go over the most widely used palettes and objects of the icon toolbar in a standard installation of LabVIEW, with a brief explanation of what each object does. (For more resources related to this topic, see here.)
Introduction to LabVIEW
LabVIEW is a graphical development and testing environment unlike any other test and development tool available in the industry. LabVIEW sets itself apart from traditional programming environments by its completely graphical approach to programming. As an example, while the representation of a while loop in a text-based language such as C consists of several predefined, extremely compact and sometimes extremely cryptic lines of text, a while loop in LabVIEW is actually a graphical loop. The environment is extremely intuitive and powerful, which makes for a short learning curve for the beginner. LabVIEW is based on what is called the G language, but there are still other languages, especially C, under the hood. However, the ease of use and power of LabVIEW can be somewhat deceiving to a novice user. Many people have attempted to start projects in LabVIEW only because, at first glance, the graphical nature of the interface and the concept of drag and drop used in LabVIEW appear to do away with the required basics of programming concepts and a classical education in programming science and engineering. This is far from the reality of using LabVIEW as the predominant development environment. While it is true that in many higher-level development and testing environments, especially when using complicated test equipment and complex mathematical calculations, or even when creating embedded software, LabVIEW's approach will be much more time-efficient and bug-free than an approach that would otherwise require many lines of code in a traditional text-based programming environment, one must be aware of LabVIEW's strengths and possible weaknesses. LabVIEW does not completely replace the need for traditional text-based languages and, depending on the overall nature of a project, LabVIEW or another traditional text-based language such as C may be the most suitable programming or test environment.
Installing LabVIEW
Installation of LabVIEW is very simple and just as routine as any modern-day program installation; that is, insert DVD 1 and follow the on-screen guided installation steps. LabVIEW comes on one DVD for the Mac and Linux versions but on four or more DVDs for the Windows edition (depending on the additional software, licensing, and additional libraries and packages purchased). In this article we will use the LabVIEW 2013 Professional Development version for Windows. Given the target audience of this article, we assume the user is well capable of installing the program. Installation is also well documented by National Instruments, and the mandatory one-year support purchase with each copy of LabVIEW is a valuable source of live and email help. Also, the NI website (www.ni.com) has many user support groups that are a great source of support, example code, discussion groups, and local group events and meetings of fellow LabVIEW developers. One worthy note for those who are new to the installation of LabVIEW is that the installation DVDs include much more than what an average user would need and pay for. We do strongly suggest that you install additional software (beyond what has been purchased and licensed or is immediately needed!).
These additional software packages are fully functional (in demo mode for 7 days), and the demo period may be extended to about a month with online registration. This is a very good opportunity to gain hands-on experience with even more of the power and functionality that LabVIEW is capable of offering. The additional information gained by installing other available software on the DVDs may help in the further development of a given project. Just imagine: if the current development of a robot only encompasses mechanical movements and sensors today, optical recognition is probably going to follow sooner than one may think. If data acquisition using expensive hardware and software is only possible in one location, the need for web sharing and remote control of the setup is just around the corner. It is very helpful to at least be aware of what packages are currently available, and to be able to install and test them prior to a full purchase and implementation. The following screenshot shows what may be installed if almost all software on all DVDs is selected: When installing a fresh version of LabVIEW, if you do decide to follow the advice above, make sure to click on the + sign next to each package you decide to install, and prevent any installation of LabWindows/CVI... and Measurement Studio... for Visual Studio. LabWindows, according to National Instruments, is an ANSI C integrated development environment. Also note that by default NI device drivers are not selected to be installed. Device drivers are an essential part of any data acquisition setup, and appropriate drivers for communications and instrument control must be installed before LabVIEW can interact with external equipment. Also, note that device drivers (on Windows installations) come on a separate DVD, which means that one does not have to install device drivers at the same time that the main application and other modules are installed; they can be installed at any time later on. Almost all well-established vendors package their products with LabVIEW drivers and example code. If a driver is not readily available, National Instruments has programmers who can write one, but this would come at a cost to the user. VI Package Manager, now installed as a part of the standard installation, is also a must these days. National Instruments distributes third-party software, drivers, and public domain packages via VI Package Manager; appropriate software and drivers for third-party hardware such as microcontrollers are also installed via VI Package Manager. You can install many public domain packages that add further useful LabVIEW toolkits to a LabVIEW installation and can be used just like those that are delivered professionally by National Instruments. Finally, note that the more modules, packages, and software you select to install, the longer it will take to complete the installation. This may sound like an obvious point but, surprisingly enough, installation of all the software on the three DVDs (for Windows) took over five hours on the standard laptop PC we used. Obviously, a more powerful PC (such as one with a solid state hard drive) may not take such a long time:
LabVIEW Basics
Once the LabVIEW application is launched, by default two blank windows open simultaneously, a Front Panel and a Block Diagram window, and a VI is created: VIs, or Virtual Instruments, are the heart and soul of LabVIEW. They are what separate LabVIEW from all other text-based development environments. In LabVIEW everything is an object, which is represented graphically.
A VI may consist of only a few objects, or of hundreds of objects embedded in many subVIs. These are graphical representations of things: be it a simple while loop, a complex mathematical concept such as polynomial interpolation, or simply a Boolean constant, all are represented graphically. To use an object, right-click inside the Block Diagram or Front Panel window and a palette list appears. Follow the arrow and pick an object from the list of objects in the subsequent palette, and place it on the appropriate window. The selected object can now be dragged and placed at a different location on the appropriate window, and is ready to be wired. Depending on what kind of object is selected, a graphical representation of the object appears on both windows. Of course, there are many exceptions to this rule. For example, a while loop can only be selected in the Block Diagram and, by itself, a while loop does not have a graphical representation on the Front Panel window. Needless to say, LabVIEW also has keyboard combinations that expedite selecting and placing any given toolkit object onto the appropriate window. Each object has one (or several) wire connections going into it as input(s) and coming out of it as output(s). A VI becomes functional when the minimum number of wires is appropriately connected to the inputs and outputs of one or more objects. Later, we will use an example to illustrate how a basic LabVIEW VI is created and executed.
Highlights
LabVIEW is a complete object-oriented development and test environment based on the G language. As such, it is a very powerful and complex environment. In this article we went through an introduction to LabVIEW and the main functionality of each of its toolbar icons by way of an actual user-interactive example. Accompanied by appropriate hardware (both NI products as well as many industry-standard test, measurement, and development hardware products), LabVIEW is capable of covering everything from developing embedded systems to fuzzy logic, and almost everything in between!
Summary
In this article we covered the basics of LabVIEW, from installation to an in-depth explanation of each and every element in the toolbar.
Resources for Article:
Further resources on this subject: Python Data Analysis Utilities [article] Data mining [article] PostgreSQL in Action [article]