Haskell Data Analysis Cookbook

By Nishant Shukla
About this book
Publication date:
June 2014
Publisher
Packt
Pages
334
ISBN
9781783286331

 

Chapter 1. The Hunt for Data

In this chapter, we will cover the following recipes:

  • Harnessing data from various sources

  • Accumulating text data from a file path

  • Catching I/O code faults

  • Keeping and representing data from a CSV file

  • Examining a JSON file with the aeson package

  • Reading an XML file using the HXT package

  • Capturing table rows from an HTML page

  • Understanding how to perform HTTP GET requests

  • Learning how to perform HTTP POST requests

  • Traversing online directories for data

  • Using MongoDB queries in Haskell

  • Reading from a remote MongoDB server

  • Exploring data from a SQLite database

 

Introduction


Data is everywhere, logging is cheap, and analysis is inevitable. This chapter is built around one fundamental task: gathering useful data. After building a large collection of usable text, which we call the corpus, we must learn to represent this content in code. The primary focus will be first on obtaining data and later on enumerating ways of representing it.

Gathering data is arguably as important as analyzing it to extrapolate results and form valid generalizable claims. It is a scientific pursuit; therefore, great care must and will be taken to ensure unbiased and representative sampling. We recommend following along closely in this chapter because the remainder of the book depends on having a source of data to work with. Without data, there isn't much to analyze, so we should carefully observe the techniques laid out to build our own formidable corpus.

The first recipe enumerates various sources to start gathering data online. The next few recipes deal with using local data of different file formats. We then learn how to download data from the Internet using our Haskell code. Finally, we finish this chapter with a couple of recipes on using databases in Haskell.

 

Harnessing data from various sources


Information can be described as structured, unstructured, or sometimes a mix of the two—semi-structured.

In a very general sense, structured data is anything that can be parsed by an algorithm. Common examples include JSON, CSV, and XML. If given structured data, we can design a piece of code to dissect the underlying format and easily produce useful results. As mining structured data is a deterministic process, it allows us to automate the parsing. This in effect lets us gather more input to feed our data analysis algorithms.

Unstructured data is everything else. It is data not defined in a specified manner. Written languages such as English are often regarded as unstructured because of the difficulty in parsing a data model out of a natural sentence.

In our search for good data, we will often find a mix of structured and unstructured text. This is called semi-structured text.
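
The recipe itself involves no code, but the claim that structured data can be dissected by a deterministic piece of code is easy to illustrate. The following is a minimal sketch, assuming a deliberately naive comma-separated format with no quoted or escaped fields. The names `splitOn` and `parseCsv` are hypothetical helpers written for this illustration; they do not come from a library or from the CSV recipe later in this chapter.

```haskell
-- A naive illustration of parsing structured text.
-- Assumption: fields never contain commas, quotes, or embedded newlines.

-- Split a string on a delimiter character.
splitOn :: Char -> String -> [String]
splitOn delim s = case break (== delim) s of
  (field, [])       -> [field]                     -- no delimiter left: last field
  (field, _ : rest) -> field : splitOn delim rest  -- drop the delimiter and continue

-- Turn a small CSV document into rows of fields, one row per line.
parseCsv :: String -> [[String]]
parseCsv = map (splitOn ',') . lines
```

In GHCi, `parseCsv "name,age\nAlex,22\n"` evaluates to `[["name","age"],["Alex","22"]]`. Because the format is fixed, the same function works on any input that follows it, which is what makes automated gathering possible. A natural-language sentence offers no such fixed format to split on.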

This recipe will primarily focus on obtaining structured and semi-structured data from the following sources.

Tip

Unlike most recipes in this book, this recipe does not contain any code. The best way to read this book is by skipping around to the recipes that interest you.

How to do it...

We will browse through the links provided in the following sections to build up a list of sources to harness interesting data in usable formats. However, this list is not at all exhaustive.

Some of these sources have an Application Programming Interface (API) that allows more sophisticated access to interesting data. An API specifies the interactions and defines how data is communicated.

News

The New York Times has some of the most polished API documentation, covering anything from real-estate data to article search results. This documentation can be found at http://developer.nytimes.com.

The Guardian also supports a massive datastore with over a million articles at http://www.theguardian.com/data.

USA TODAY provides some interesting resources on books, movies, and music reviews. The technical documentation can be found at http://developer.usatoday.com.

The BBC features some interesting API endpoints, including information on BBC programs and music, documented at http://www.bbc.co.uk/developer/technology/apis.html.

Private

Facebook, Twitter, Instagram, Foursquare, Tumblr, SoundCloud, Meetup, and many other social networking sites support APIs to access some degree of social information.

For specific APIs such as weather or sports, Mashape is a centralized search engine that helps narrow down the search to some lesser-known sources. Mashape is located at https://www.mashape.com/.

Most data sources can be visualized using the Google Public Data search located at http://www.google.com/publicdata.

For a list of all countries with names in various data formats, refer to the repository located at https://github.com/umpirsky/country-list.

Academic

Some data sources are hosted openly by universities around the world for research purposes.

To analyze health care data, the University of Washington hosts the Institute for Health Metrics and Evaluation (IHME), which collects rigorous and comparable measurements of the world's most important health problems. Navigate to http://www.healthdata.org for more information.

The MNIST database of handwritten digits from NYU, Google Labs, and Microsoft Research is a training set of normalized and centered samples for handwritten digits. Download the data from http://yann.lecun.com/exdb/mnist.

Nonprofits

Human Development Reports publishes annual updates ranging from international data about adult literacy to the number of people owning personal computers. It describes itself as drawing on a variety of public international sources and representing the most current statistics available for those indicators. More information is available at http://hdr.undp.org/en/statistics.

The World Bank is the source for poverty and world development data. It regards itself as a free source that enables open access to data about development in countries around the globe. Find more information at http://data.worldbank.org/.

The World Health Organization provides data and analyses for monitoring the global health situation. See more information at http://www.who.int/research/en.

UNICEF also releases interesting statistics. The UNICEF database contains statistical tables for child mortality, diseases, water sanitation, and more vitals. According to its website, UNICEF plays a central role in monitoring the situation of children and women by assisting countries in collecting and analyzing data, helping them develop methodologies and indicators, maintaining global databases, and disseminating and publishing data. Find the resources at http://www.unicef.org/statistics.

The United Nations hosts interesting publicly available political statistics at http://www.un.org/en/databases.

The United States government

If we feel the urge to discover patterns in the United States (U.S.) government like Nicolas Cage did in the feature film National Treasure (2004), then http://www.data.gov/ is our go-to source. It is the U.S. government's active effort to provide useful data, described as a place to increase "public access to high-value, machine-readable datasets generated by the executive branch of the Federal Government".

The United States Census Bureau releases population counts, housing statistics, area measurements, and more. These can be found at http://www.census.gov.

                       
About the Author
  • Nishant Shukla

    Nishant Shukla is a computer scientist with a passion for mathematics. Throughout the years, he has worked for a handful of start-ups and large corporations including WillowTree Apps, Microsoft, Facebook, and Foursquare. Stepping into the world of Haskell was at first his excuse for better understanding Category Theory, but eventually he found himself immersed in the language. His semester-long introductory Haskell course in the engineering school at the University of Virginia (http://shuklan.com/haskell) has been accessed by individuals from over 154 countries around the world, gathering over 45,000 unique visitors. Besides Haskell, he is a proponent of decentralized Internet and open source software. His academic research in the fields of Machine Learning, Neural Networks, and Computer Vision aims to supply a fundamental contribution to the world of computing.
