The Hunt for Data

In this article by Nishant Shukla, author of the book Haskell Data Analysis Cookbook, we will learn how to work with local data in several file formats and how to download data from the Internet using Haskell code.


Examining a JSON file with the aeson package

JavaScript Object Notation (JSON) is a way to represent key-value pairs in plain text. The format is described extensively in RFC 4627 (http://www.ietf.org/rfc/rfc4627).

In this recipe, we will parse a JSON description about a person. We often encounter JSON in APIs from web applications.

Getting ready

Install the aeson library from Hackage using Cabal:
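$ cabal install aeson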

Prepare an input.json file representing data about a mathematician, such as the one in the following code snippet:

$ cat input.json
{"name":"Gauss", "nationality":"German", "born":1777, "died":1855}

We will be parsing this JSON and representing it as a usable data type in Haskell.

How to do it...

  1. Use the OverloadedStrings language extension so that string literals can be treated as the string-like types that aeson expects, as shown in the following line of code:

    {-# LANGUAGE OverloadedStrings #-}

  2. Import aeson as well as some helper functions as follows:

    import Data.Aeson
    import Control.Applicative
    import qualified Data.ByteString.Lazy as B

  3. Create the data type corresponding to the JSON structure, as shown in the following code:

    data Mathematician = Mathematician { name :: String
                                       , nationality :: String
                                       , born :: Int
                                       , died :: Maybe Int
                                       }

  4. Provide a FromJSON instance by implementing the parseJSON function, as shown in the following code snippet:

    instance FromJSON Mathematician where
      parseJSON (Object v) = Mathematician
        <$> (v .: "name")
        <*> (v .: "nationality")
        <*> (v .: "born")
        <*> (v .:? "died")
      parseJSON _ = empty   -- fail gracefully on non-object JSON values

  5. Define and implement main as follows:

    main :: IO ()
    main = do

  6. Read the input and decode the JSON, as shown in the following code snippet:

      input <- B.readFile "input.json"
      let mm = decode input :: Maybe Mathematician
      case mm of
        Nothing -> print "error parsing JSON"
        Just m  -> (putStrLn . greet) m

  7. Now we will do something interesting with the data as follows:

    greet m = (show . name) m ++ " was born in the year " ++ (show . born) m

  8. We can run the code to see the following output:

    $ runhaskell Main.hs
    "Gauss" was born in the year 1777

How it works...

Aeson takes care of the complications of representing JSON, creating natively usable data out of structured text. In this recipe, we use the .: and .:? functions provided by the Data.Aeson module.

As the Aeson package works with string-like types such as ByteString and Text rather than String, it is very helpful to tell the compiler that characters between quotation marks should be treated as the proper data type. This is done in the first line of the code, which enables the OverloadedStrings language extension.

We use the decode function provided by Aeson to transform a string into a data type. It has the type FromJSON a => B.ByteString -> Maybe a. Our Mathematician data type must implement an instance of the FromJSON typeclass to properly use this function. Fortunately, the only required function for implementing FromJSON is parseJSON. The syntax used in this recipe for implementing parseJSON may look a little strange, but only because we're leveraging applicative functors (<$> and <*>), a more advanced Haskell topic.
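If we also want to know why a parse failed, aeson provides eitherDecode, which returns the error message in a Left value rather than collapsing all failures to Nothing. The following is a minimal drop-in replacement for the recipe's main:

    main :: IO ()
    main = do
      input <- B.readFile "input.json"
      case (eitherDecode input :: Either String Mathematician) of
        Left err -> putStrLn ("error parsing JSON: " ++ err)
        Right m  -> (putStrLn . greet) m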

The .: function takes two arguments, an Object and a Text key, and returns a Parser a data type. As per the documentation, it retrieves the value associated with the given key of an object. This function is used if the key and the value exist in the JSON document. The .:? function also retrieves the associated value from the given key of an object, but the existence of the key and value are not mandatory. So, we use .:? for optional key-value pairs in a JSON document.
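Aeson also offers the .!= combinator for supplying a default value when an optional key is missing. As a hypothetical sketch (a variation, not this recipe's actual type), if died were a plain Int that should default to 0 whenever the key is absent, the instance would read:

    -- hypothetical variation: here died :: Int instead of Maybe Int
    instance FromJSON Mathematician where
      parseJSON (Object v) = Mathematician
        <$> (v .: "name")
        <*> (v .: "nationality")
        <*> (v .: "born")
        <*> (v .:? "died" .!= 0)   -- default to 0 when "died" is missing
      parseJSON _ = empty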

There's more…

If the implementation of the FromJSON typeclass is too involved, we can easily let GHC automatically fill it out using the DeriveGeneric language extension. The following is a simpler rewrite of the code:

{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE DeriveGeneric #-}

import Data.Aeson
import qualified Data.ByteString.Lazy as B
import GHC.Generics

data Mathematician = Mathematician { name :: String
                                   , nationality :: String
                                   , born :: Int
                                   , died :: Maybe Int
                                   } deriving Generic

instance FromJSON Mathematician

main = do
  input <- B.readFile "input.json"
  let mm = decode input :: Maybe Mathematician
  case mm of
    Nothing -> print "error parsing JSON"
    Just m  -> (putStrLn . greet) m

greet m = (show . name) m ++ " was born in the year " ++ (show . born) m

Although Aeson is powerful and generalizable, it may be overkill for some simple JSON interactions. Alternatively, if we wish to use a very minimal JSON parser and printer, we can use Yocto, which can be downloaded from http://hackage.haskell.org/package/yocto.
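Aeson can also go the other way. As a brief standalone sketch, deriving Generic and declaring an empty ToJSON instance lets us serialize a record back out with the encode function:

    {-# LANGUAGE DeriveGeneric #-}

    import Data.Aeson
    import qualified Data.ByteString.Lazy.Char8 as C
    import GHC.Generics

    -- a trimmed-down record just for this sketch
    data Mathematician = Mathematician { name :: String, born :: Int } deriving Generic

    instance ToJSON Mathematician

    main :: IO ()
    main = C.putStrLn $ encode (Mathematician "Gauss" 1777)
    -- prints something like {"name":"Gauss","born":1777}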

Reading an XML file using the HXT package

Extensible Markup Language (XML) is an encoding of plain text to provide machine-readable annotations on a document. The standard is specified by W3C (http://www.w3.org/TR/2008/REC-xml-20081126/).

In this recipe, we will parse an XML document representing an e-mail conversation and extract all the dates.

Getting ready

First, set up an XML file called input.xml representing an e-mail exchange between Databender and Princess in December 2014:

$ cat input.xml
<thread>
  <email>
    <to>Databender</to>
    <from>Princess</from>
    <date>Thu Dec 18 15:03:23 EST 2014</date>
    <subject>Joke</subject>
    <body>Why did you divide sin by tan?</body>
  </email>
  <email>
    <to>Princess</to>
    <from>Databender</from>
    <date>Fri Dec 19 3:12:00 EST 2014</date>
    <subject>RE: Joke</subject>
    <body>Just cos.</body>
  </email>
</thread>

Using Cabal, install the HXT library, which we will use for manipulating XML documents:

$ cabal install hxt

How to do it...

  1. We need only one import, for parsing XML, as shown in the following line of code:

    import Text.XML.HXT.Core

  2. Define and implement main and specify the XML location. For this recipe, the file is retrieved from input.xml. Refer to the following code:

    main :: IO ()
    main = do
      input <- readFile "input.xml"

  3. Apply the readString function to the input and extract all the date elements. We filter items with a specific name using the hasName :: String -> a XmlTree XmlTree function, and extract the text using the getText :: a XmlTree String function, as shown in the following code snippet:

      dates <- runX $ readString [withValidate no] input //> hasName "date" //> getText

  4. We can now use the list of extracted dates as follows:

      print dates

  5. By running the code, we print the following output:

    $ runhaskell Main.hs
    ["Thu Dec 18 15:03:23 EST 2014","Fri Dec 19 3:12:00 EST 2014"]

How it works...

The library function, runX, takes in an Arrow. Think of an Arrow as a more powerful version of a Monad. Arrows allow for stateful global XML processing. Specifically, the runX function in this recipe takes in IOSArrow XmlTree String and returns an IO action of the String type. We generate this IOSArrow object using the readString function, which performs a series of operations to the XML data.

For a deep search into the XML document, //> should be used, whereas /> only looks at the current level. We use the //> function to look up the date elements and extract all the associated text.
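To see the difference concretely, here is a small sketch against this recipe's input.xml; the shallow query comes back empty because <date> is not an immediate child of the document root:

    import Text.XML.HXT.Core

    main :: IO ()
    main = do
      input <- readFile "input.xml"
      shallow <- runX $ readString [withValidate no] input /> hasName "date" //> getText
      deep    <- runX $ readString [withValidate no] input //> hasName "date" //> getText
      print shallow   -- []
      print deep      -- ["Thu Dec 18 15:03:23 EST 2014","Fri Dec 19 3:12:00 EST 2014"]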

As defined in the documentation, the hasName function tests whether a node has a specific name, and the getText function selects the text of a text node. Some other functions include the following:

  • isText: This is used to test for text nodes

  • isAttr: This is used to test for an attribute tree

  • hasAttr: This is used to test whether an element node has an attribute node with a specific name

  • getElemName: This is used to select the name of an element node

All the Arrow functions can be found on the Text.XML.HXT.Arrow.XmlArrow documentation at http://hackage.haskell.org/package/hxt/docs/Text-XML-HXT-Arrow-XmlArrow.html.
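As a quick sketch of how these arrows combine (again using this recipe's input.xml), the following lists every element name in the document; multi, unlike deep, keeps searching below a match, and localPart strips any namespace prefix from the QName that getElemName returns:

    import Text.XML.HXT.Core

    main :: IO ()
    main = do
      input <- readFile "input.xml"
      -- collect the local name of every element node in the tree
      names <- runX $ readString [withValidate no] input
                      >>> multi isElem >>> getElemName >>> arr localPart
      print names   -- ["thread","email","to","from","date","subject","body",...]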

Capturing table rows from an HTML page

Mining Hypertext Markup Language (HTML) is often a feat of identifying and parsing only its structured segments. Not all text in an HTML file may be useful, so we find ourselves only focusing on a specific subset. For instance, HTML tables and lists provide a strong and commonly used structure to extract data whereas a paragraph in an article may be too unstructured and complicated to process.

In this recipe, we will find a table on a web page and gather all rows to be used in the program.

Getting ready

We will be extracting the values from an HTML table, so start by creating an input.html file containing the following table:

$ cat input.html
<!DOCTYPE html>
<html>
<body>
  <h1>Course Listing</h1>
  <table>
    <tr>
      <th>Course</th>
      <th>Time</th>
      <th>Capacity</th>
    </tr>
    <tr>
      <td>CS 1501</td>
      <td>17:00</td>
      <td>60</td>
    </tr>
    <tr>
      <td>MATH 7600</td>
      <td>14:00</td>
      <td>25</td>
    </tr>
    <tr>
      <td>PHIL 1000</td>
      <td>9:30</td>
      <td>120</td>
    </tr>
  </table>
</body>
</html>

If not already installed, use Cabal to set up the HXT library and the split library, as shown in the following command lines:

$ cabal install hxt
$ cabal install split

How to do it...

  1. We will need the hxt package for HTML/XML manipulation and the chunksOf function from the split package, as presented in the following code snippet:

    import Text.XML.HXT.Core
    import Data.List.Split (chunksOf)

  2. Define and implement main to read the input.html file:

    main :: IO ()
    main = do
      input <- readFile "input.html"

  3. Feed the HTML data into readString, setting withParseHTML to yes and optionally turning off warnings. Extract all the td tags and obtain the remaining text, as shown in the following code:

      texts <- runX $ readString [withParseHTML yes, withWarnings no] input //> hasName "td" //> getText

  4. The data is now usable as a list of strings. It can be converted into a list of lists, similar to the CSV representation in a previous recipe, as shown in the following code:

      let rows = chunksOf 3 texts
      print $ findBiggest rows

  5. By folding through the data, identify the course with the largest capacity using the following code snippet:

    findBiggest :: [[String]] -> [String]
    findBiggest [] = []
    findBiggest items = foldl1 (\a x -> if capacity x > capacity a then x else a) items

    capacity [_, _, c] = toInt c
    capacity _         = -1

    toInt :: String -> Int
    toInt = read

  6. Running the code will display the class with the largest capacity as follows:

    $ runhaskell Main.hs
    ["PHIL 1000","9:30","120"]

How it works...

This is very similar to XML parsing, except we adjust the options of readString to [withParseHTML yes, withWarnings no].
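One fragile spot in this recipe is toInt = read, which throws a runtime exception if a capacity cell is not numeric. A more defensive sketch replaces it with readMaybe from Text.Read:

    import Text.Read (readMaybe)
    import Data.Maybe (fromMaybe)

    -- treat unparseable capacity cells as -1 instead of crashing the fold
    capacity :: [String] -> Int
    capacity [_, _, c] = fromMaybe (-1) (readMaybe c)
    capacity _         = -1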


Understanding how to perform HTTP GET requests

One of the most resourceful places to find good data is online. GET requests are common methods of communicating with an HTTP web server. In this recipe, we will grab all the links from a Wikipedia article and print them to the terminal. To easily grab all the links, we will use a helpful library called HandsomeSoup, which lets us manipulate and traverse a web page through CSS selectors.

Getting ready

We will be collecting all links from a Wikipedia web page. Make sure to have an Internet connection before running this recipe.

Install the HandsomeSoup CSS selector package, and also install the HXT library if it is not already installed. To do this, use the following commands:

$ cabal install HandsomeSoup $ cabal install hxt

How to do it...

  1. This recipe requires hxt for parsing HTML and HandsomeSoup for its easy-to-use CSS selectors, as shown in the following code snippet:

    import Text.XML.HXT.Core
    import Text.HandsomeSoup

  2. Define and implement main as follows:

    main :: IO ()
    main = do

  3. Pass in the URL as a string to HandsomeSoup's fromUrl function:

      let doc = fromUrl "http://en.wikipedia.org/wiki/Narwhal"

  4. Select all links within the bodyContent field of the Wikipedia page as follows:

      links <- runX $ doc >>> css "#bodyContent a" ! "href"
      print links

How it works…

The HandsomeSoup package makes CSS selectors easy to use. In this recipe, we run the #bodyContent a selector on a Wikipedia article web page, which finds all link tags that are descendants of the element with the bodyContent ID.
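The same pattern extends to any other selector. For instance, here is a sketch that collects every image source on the page instead of its links; img and src are ordinary HTML, while css and ! are the same HandsomeSoup combinators used above:

    import Text.XML.HXT.Core
    import Text.HandsomeSoup

    main :: IO ()
    main = do
      let doc = fromUrl "http://en.wikipedia.org/wiki/Narwhal"
      -- grab the src attribute of every <img> element on the page
      imgs <- runX $ doc >>> css "img" ! "src"
      print imgs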

See also…

Another common way to obtain data online is through POST requests. To find out more, refer to the Learning how to perform HTTP POST requests recipe.

Learning how to perform HTTP POST requests

A POST request is another very common HTTP server request used by many APIs. We will be mining the University of Virginia directory search. When sending a POST request for a search query, the Lightweight Directory Access Protocol (LDAP) server replies with a web page of search results.

Getting ready

For this recipe, access to the Internet is necessary.

Install the HandsomeSoup CSS selector package, and also install the HXT library if it is not already installed:

$ cabal install HandsomeSoup $ cabal install hxt

How to do it...

  1. Import the following libraries:

    import Network.HTTP
    import Network.URI (parseURI)
    import Text.XML.HXT.Core
    import Text.HandsomeSoup
    import Data.Maybe (fromJust)

  2. Define the POST request specified by the directory search website. The details of the request will differ depending on the server. Refer to the following code snippet:

    myRequestURL = "http://www.virginia.edu/cgi-local/ldapweb"

    myRequest :: String -> Request_String
    myRequest query = Request
      { rqURI     = fromJust $ parseURI myRequestURL
      , rqMethod  = POST
      , rqHeaders = [ mkHeader HdrContentType "text/html"
                    , mkHeader HdrContentLength $ show $ length body ]
      , rqBody    = body
      }
      where body = "whitepages=" ++ query

  3. Define and implement main to run the POST request on a query as follows:

    main :: IO ()
    main = do
      response <- simpleHTTP $ myRequest "poon"

  4. Gather the HTML and parse it:

      html <- getResponseBody response
      let doc = readString [withParseHTML yes, withWarnings no] html

  5. Find the table rows and print them out using the following:

      rows <- runX $ doc >>> css "td" //> getText
      print rows

Running the code will display all search results relating to "poon", such as "Poonam" or "Witherspoon".

How it works...

A POST request needs a URI, headers, and a body. By filling out a Request data type, we can establish a server request.
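Note that the recipe splices the query into the body verbatim, so a search term containing spaces, ampersands, or equals signs would corrupt the form encoding. A small sketch that escapes the term with urlEncode from Network.HTTP (formBody is a hypothetical helper standing in for the recipe's where clause):

    import Network.HTTP (urlEncode)

    -- percent-encode the search term before embedding it in the POST body
    formBody :: String -> String
    formBody query = "whitepages=" ++ urlEncode query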

See also

Refer to the Understanding how to perform HTTP GET requests recipe for details on how to perform a GET request instead.

Summary

In this article, we learned how to work with local data in several file formats and how to download data from the Internet using Haskell code.


About the Author

Nishant Shukla

Nishant Shukla is a computer scientist with a passion for mathematics. Throughout the years, he has worked for a handful of start-ups and large corporations including WillowTree Apps, Microsoft, Facebook, and Foursquare.

Stepping into the world of Haskell was his excuse for better understanding Category Theory at first, but eventually, he found himself immersed in the language. His semester-long introductory Haskell course in the engineering school at the University of Virginia (http://shuklan.com/haskell) has been accessed by individuals from over 154 countries around the world, gathering over 45,000 unique visitors.

Besides Haskell, he is a proponent of the decentralized Internet and open source software. His academic research in the fields of machine learning, neural networks, and computer vision aims to make a fundamental contribution to the world of computing.
