Learning Haskell Data Analysis

By James Church

About this book

Haskell is trending in the field of data science by providing a powerful platform for robust data science practices. This book provides you with the skills to handle large amounts of data, even if that data is in a less than perfect state. Each chapter in the book helps to build a small library of code that will be used to solve a problem for that chapter. The book starts with creating databases out of existing datasets, cleaning that data, and interacting with databases within Haskell in order to produce charts for publications. It then moves towards more theoretical concepts that are fundamental to introductory data analysis, but in a context of a real-world problem with real-world data. As you progress in the book, you will be relying on code from previous chapters in order to help create new solutions quickly. By the end of the book, you will be able to manipulate, find, and analyze large and small sets of data using your own Haskell libraries.

Publication date: May 2015
Publisher: Packt
Pages: 198
ISBN: 9781784394707

 

Chapter 1. Tools of the Trade

Data analysis is the craft of sifting through data for the purpose of learning or decision making. To ease the difficulties of sifting through data, we rely on databases and our knowledge of programming. For nuts-and-bolts coding, this text uses Haskell. For storage, plotting, and computations on large datasets, we will use SQLite3, gnuplot, and LAPACK respectively. These four pieces of software are a powerful combination that allows us to solve some difficult problems. In this chapter, we will discuss these tools of the trade and recommend a few more.

In this chapter, we will cover the following:

  • Why we should consider Haskell for our next data analysis project

  • Installing and configuring Haskell, the GHCi (short for Glasgow Haskell Compiler interactive) environment, and cabal

  • The software packages needed in addition to Haskell: SQLite3, gnuplot, and LAPACK

  • The nearly essential software packages that you should consider: Git and Tmux

  • Our first program: computing the median of a list of values

  • An introduction to the command-line environment

 

Welcome to Haskell and data analysis!


This book is about solving problems related to data. In each chapter, we will present a problem or a question that needs answering. The only way to get this answer is through an understanding of the data. Data analysis is not only a practice that helps us glean insight from information, but also an academic pursuit that combines knowledge of the basics of computer programming, statistics, machine learning, and linear algebra. The theory behind data analysis comes from statistics.

The concepts of summary statistics, sampling, and empirical testing are gifts from the statistical community. Computer science is a craft that helps us convert statistical procedures into formal algorithms that can be interpreted by a computer. Rarely will our questions about data be an end in themselves. Once the data has been analyzed, the analysis should inform better decision making. The field of machine learning is an attempt to create algorithms that are capable of making their own decisions based on the results of the analysis of a dataset. Finally, we will sometimes need to use linear algebra for complicated datasets. Linear algebra is the study of vector spaces and matrices, which the data analyst can think of as multidimensional datasets with rows and columns. However, the most important skill of data analysts is their ability to communicate their findings through a combination of written descriptions and graphs. Data science is a challenging field that requires a blend of the computer science, mathematics, and statistics disciplines.

In this first chapter, the real-world problem is getting our environment ready. Many languages are suitable for data analysis, but this book tackles data problems using Haskell and assumes, from Chapter 2, Getting Our Feet Wet onwards, that you have a background in the Haskell language. If not, we encourage you to pick up a book on Haskell development. You can refer to Learn You a Haskell for Great Good: A Beginner's Guide, Miran Lipovaca, No Starch Press, and Real World Haskell, Bryan O'Sullivan, John Goerzen, Donald Bruce Stewart, O'Reilly Media, which are excellent texts if you want to learn programming in Haskell. Learn You a Haskell for Great Good: A Beginner's Guide can be read online at http://learnyouahaskell.com/. Real World Haskell can also be read online at http://book.realworldhaskell.org/. The former book is an introduction to the language, while the latter is a text on professional Haskell programming. Once you wade through these books (as well as Learning Haskell Data Analysis), we encourage you to read the book Haskell Data Analysis Cookbook, Nishant Shukla, Packt Publishing. This cookbook provides snippets of code for working with a wide variety of data formats, databases, visualization tools, data structures, and clustering algorithms. We also recommend Notes on Functional Programming with Haskell by Dr. Conrad Cunningham.

Besides Haskell, we will discuss open source data formats, databases, and graphing software in the following manner:

  • We will limit ourselves to working with two data serialization file formats: JSON and CSV. CSV is perhaps the most common data serialization format for uncompressed data with the weakness of not being an explicit standard. In a later chapter, we will examine data from the Twitter web service, which exports data in the JSON format. By limiting ourselves to two data formats, we will focus our efforts on problem solving instead of prolonged discussions of data formats.

  • We will use SQLite3 for our database backend application. SQLite3 is a lightweight database software that can store large amounts of data. Using a wrapper module, we can pull data directly from a SQLite3 database into the Haskell command line for analysis.

  • We will use the EasyPlot Haskell wrapper module for gnuplot, which is a popular open source tool that is used to create publication-ready graphics. The EasyPlot wrapper provides access to a subset of features in gnuplot, but we shall see that this subset is more than sufficient for the creation of compelling graphs.

 

Why Haskell?


Since this is an introductory text to data analysis, we will focus on commonly used practices within the context of the Haskell language. Haskell is a general-purpose, purely functional programming language with strong static typing. Among the many features of Haskell are lazy evaluation, type inference, Lisp-like support for lists and tuples, and the Hackage repository. Here are the reasons why we like Haskell:

  • Haskell has features that are similar to Lisp. These features are used to process lists of data (minus the syntax of Lisp). Higher-order functions such as map, foldr, foldl, and filter provide us with a standard interface for applying functions to lists. In Haskell, any function can be passed as a parameter to another function, allowing the programmer to seamlessly manipulate data on the fly through anonymous functions known as lambda expressions. The map function is used frequently in this book because of the ease with which it converts a list of elements of one type into a list of another type. A short sketch following this list illustrates these functions and Haskell's inferred types in action.

  • Haskell is a purely functional programming language and stateless in nature. This means that the only information known to a function is the information that is either passed into that function or returned from other function calls. The so-called variables are named in the mathematical sense of the word rather than the conventional computer programming sense. Variables are not allowed to change; instead, they are bindings to expressions. Because of these limitations, functions are easier to test for correctness than in stateful languages.

  • Haskell can handle datasets as large as your system memory allows, which should be sufficient for most medium-sized data problems. Big data can be defined as any dataset that is so big that it needs to be broken up into pieces and aggregated in a secondary step, or sampled prior to the analysis. A step down from big data is medium data, which can be defined as a dataset that can be processed in its entirety without having to break it into parts. There is no set number for when a dataset grows from medium to big, since ever-increasing hardware capabilities continuously redefine how much a computer can do. An informal definition of small data is a dataset that can be easily grasped in its entirety by a human, which amounts to a few numbers at best. All of the problems considered in this book were tested on a computer with 2 GB of RAM. The smallest dataset examined in this chapter is 16 values and the largest dataset is about 7 MB in size. Each of the problems presented in this text should scale in size to the definition of medium data.

  • Haskell uses lazy evaluation. Lazy evaluation allows Haskell to delay the execution of a procedure until it is actually needed. For example, throughout this book, we will be setting up calculations over the course of several steps. In most strict languages, once these calculations are encountered by the language, they are immediately executed and the results are stored in memory. In lazy languages, the instructions are compiled and stored, but not yet executed. If a calculation step is never used, it never gets evaluated, thus saving execution time. Only when the result is required (for example, when we need to see it displayed on the screen) will a lazy language evaluate the steps of our algorithm.

  • Haskell supports type inference. Type inference allows Haskell to be strictly typed without requiring type declarations as the code is being written. For example, consider the following myFunc function annotation:

    myFunc :: a -> a -> Integer

    This function requires two parameters, and it returns an Integer. The parameter type is left ambiguous, and it will be inferred when the function is used. Because both parameters have the type a, Haskell will use static type checking to ensure that the data type of the first parameter matches the data type of the second. If we wish to allow the first parameter to have a type that is different from the second, we can introduce a second inferred type named b. (Specific types begin with an uppercase letter; generic types must begin with a lowercase letter.)

  • Using the cabal tool, a Haskell programmer has access to several thousand libraries that can be quickly downloaded to a system. These libraries provide analysts with most of the common data analysis procedures. While many libraries exist within the cabal repository, sometimes we may opt not to use them in favor of an explicit description of the math and code behind a particular algorithm.
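As promised earlier in this list, here is a small sketch that shows higher-order functions, a lambda expression, and inferred generic types in action. The function names are our own inventions for illustration, not part of any library:

-- count ignores its two arguments, but the signature still forces
-- both arguments to share the same (inferred) type a.
count :: a -> a -> Integer
count _ _ = 2

-- map applies a function (here, a lambda expression) to every element.
doubleAll :: [Double] -> [Double]
doubleAll = map (\x -> x * 2)

-- filter keeps only the elements that satisfy a predicate.
keepPositive :: [Double] -> [Double]
keepPositive = filter (> 0)

main :: IO ()
main = do
  print (doubleAll [1.0, 2.5, 3.0])      -- [2.0,5.0,6.0]
  print (keepPositive [-1.0, 2.0, 3.5])  -- [2.0,3.5]
  print (count "apple" "orange")         -- 2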

 

Getting ready


You will need to install the Haskell platform, which is available on all three major operating systems: Windows, Mac, and Linux. I primarily work with Debian Linux. Linux has the benefit of being equipped with a versatile command line, which can facilitate almost everything that is essential to the data analysis process. From the command line, we can download software, install Haskell libraries, download datasets, write files, and view raw datasets. One essential activity the command line cannot perform for us is rendering graphics in sufficient detail to inspect the charts produced by our analyses.

Installing the Haskell platform on Linux

On Ubuntu- and Debian-based systems, you can install the Haskell platform using apt-get, as follows:

$ sudo apt-get install haskell-platform 

This single command will install everything that is needed to get started, including the compiler (ghc), interactive command line (ghci), and the library install tool (cabal). Take a moment to test the following commands:

$ ghc --version 
The Glorious Glasgow Haskell Compilation System, version 7.4.1 
$ ghci --version 
The Glorious Glasgow Haskell Compilation System, version 7.4.1 

If you get back the version numbers for the Haskell compiler and the Haskell interactive prompt, you should be all set. However, we do need to perform some housekeeping with regard to cabal. We will use cabal throughout this book, and it requires an update immediately. We will update the cabal tool through cabal itself.

First, we will update the Haskell package list from Hackage with the update directive:

$ cabal update

Next, we will download an updated cabal by installing the cabal-install package. This command will not overwrite the existing cabal program. Instead, it will place the updated cabal in your home folder, where it can be found at ~/.cabal/bin/cabal:

$ cabal install cabal-install

Your system now has two versions of cabal on it. We will create an alias to make sure that we only use the updated version of cabal. The alias shown next is temporary; to make it permanent, add the line to one of the shell configuration files in your home directory. (We added ours to ~/.bash_aliases and reloaded the aliases with source ~/.bash_aliases.)

$ alias cabal='~/.cabal/bin/cabal'

If all goes according to plan, you will have an updated version of cabal on your system. Here is the version of cabal used at the time of writing this book:

$ cabal --version
cabal-install version 1.22.0.0
using version 1.22.0.0 of the Cabal library

If you use cabal long enough, you may run into problems. Rather than going into a prolonged discussion on how to manage Haskell packages, it is easier to start over with a clean slate. Your packages are downloaded to a folder under ~/.cabal, and they are registered with the Haskell environment under the ~/.ghc/ directory. If you find that a package has not been installed because of a conflicting dependency, you can spend an evening reading the package documentation to figure out which packages need to be removed or installed. Alternatively, you can use the following command and wipe the slate clean:

$ rm -rf ~/.ghc

The preceding command wipes out all your installed Haskell packages. We can promise that you will not have conflicting packages if you have no packages. We call this the Break Glass In Case of Emergency solution. It is obviously not the best solution, but it is one that gets your necessary packages installed. You have more important things to do than wrestle with cabal. While it may take an hour or so to download and install packages with this approach, it is less stressful than picking through package version numbers.

The software used in addition to Haskell

There are three open source software packages used in this book that work alongside the Haskell platform. If you are using Debian or Ubuntu, you will be able to download each of these packages using the apt-get command-line tool. The instructions on how to download and install these packages will be introduced when the software is needed. If you are using Windows or Mac, you will have to consult the documentation for these software packages for an installation on your system.

SQLite3

SQLite3 (for more information, refer to https://sqlite.org/) is a standalone Structured Query Language (SQL) database engine. We use SQLite3 to filter and organize large amounts of data. It requires no configuration, does not use a background server process, and each database is self-contained in a single file (the extension is up to you; .db and .sqlite3 are common choices). The software is portable, has many of the features found in server-based SQL database engines, and can support large databases. We will introduce SQLite3 in Chapter 2, Getting Our Feet Wet and use it extensively in the rest of the book.
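As a preview of the kind of interaction we will build in Chapter 2, Getting Our Feet Wet, here is a minimal sketch of pulling rows out of a SQLite3 database from Haskell. It assumes the HDBC and HDBC-sqlite3 packages (one common way to talk to SQLite3 from Haskell, not necessarily the exact wrapper used later in this book) have been installed with cabal; the database file example.db and the table records are hypothetical names used only for illustration:

import Database.HDBC (disconnect, fromSql, quickQuery')
import Database.HDBC.Sqlite3 (connectSqlite3)

main :: IO ()
main = do
  -- Open a connection to a self-contained database file.
  conn <- connectSqlite3 "example.db"
  -- Run a query; every cell comes back wrapped as a SqlValue.
  rows <- quickQuery' conn "SELECT value FROM records" []
  -- Convert the first column of each row to a Double before analysis.
  let values = map (fromSql . head) rows :: [Double]
  print values
  disconnect conn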

Gnuplot

Gnuplot (for more information refer to: http://www.gnuplot.info/) is a command-line tool that can be used to create charts and graphs for academic publications. It supports many features related to 2D and 3D plotting as well as a number of output and interactive formats. We will use gnuplot in conjunction with the EasyPlot Haskell wrapper module. EasyPlot gives us access to a subset of the features of gnuplot (which means that even though our charts are being piped through gnuplot, we will not be able to utilize the full power of gnuplot from within this library). Every chart presented in this book was created using EasyPlot and gnuplot. We will introduce EasyPlot and gnuplot in Chapter 4, Plotting.
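To give a flavor of what is coming in Chapter 4, Plotting, here is a minimal EasyPlot sketch. It assumes that the easyplot package has been installed with cabal and that gnuplot is on the system path; the output filename, plot options, and data are our own choices for illustration:

import Graphics.EasyPlot

main :: IO ()
main = do
  -- Write a scatter plot of (x, x^2) points to a PNG file via gnuplot.
  success <- plot (PNG "squares.png") $
    Data2D [Title "Squares", Style Points] []
           ([(x, x * x) | x <- [1..10]] :: [(Double, Double)])
  print success  -- True if gnuplot ran without complaint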

LAPACK

LAPACK (short for Linear Algebra PACKage; for more information, refer to http://www.netlib.org/lapack/) has been under constant development since the early 1990s. To this day, the library is written in FORTRAN. Because it is so vital to science, it is funded through the United States National Science Foundation (NSF). The library supports routines for linear algebra operations such as matrix multiplication, matrix inversion, and eigenvalue decomposition. We will use the hmatrix wrapper for LAPACK in Chapter 8, Building a Recommendation Engine to write our own Principal Component Analysis (PCA) function and create a recommendation engine. We will also use LAPACK to avoid the messiness of trying to write an eigensolver ourselves.
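As a small taste of Chapter 8, Building a Recommendation Engine, here is a sketch of asking LAPACK for eigenvalues through the hmatrix wrapper. It assumes that the hmatrix package and the LAPACK libraries are installed; the matrix itself is an arbitrary example of ours, not data from the book:

import Numeric.LinearAlgebra

main :: IO ()
main = do
  -- Build a small 2x2 matrix, row by row.
  let m = (2><2) [ 2, 1
                 , 1, 2 ] :: Matrix Double
  -- eig hands the matrix to LAPACK and returns the (possibly complex)
  -- eigenvalues along with the matrix of eigenvectors.
  let (eigenvalues, eigenvectors) = eig m
  print eigenvalues
  print eigenvectors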

 

Nearly essential tools of the trade


This section is about the tools used in the preparation of this book. They aren't essential to Haskell or data analysis, but they deserve a mention.

Version control software – Git

If you have ever been in a situation where you needed to update an old file while keeping that old file, you may have been tempted to name the files MyFileVersion1 and MyFileVersion2. In this instance, you used manual version control. Instead, you should use version control software.

Git is a distributed version control software that allows teams of programmers to work on a single project, track their changes, branch a project, merge project branches, and roll back mistakes if necessary. Git will scale from a team of 1 to hundreds of members.

If you already have a favorite software package for version control, we encourage you to use it while working through the examples in this book. If not, we will quickly demonstrate how to use Git.

First, you need to install Git by using the following code:

$ sudo apt-get install git

Git requires you to set up a repository in your working directory. Navigate to your folder for your Haskell project and create a repository:

$ git init

Once your repository is created, we can add files to it. Create a file called LearningDataAnalysis01.hs. At this point, the file should be blank. Let's add the blank file to our repository:

$ git add LearningDataAnalysis01.hs

Now, we'll commit the change:

$ git commit -m 'Add chapter 1 file'

Take a moment to revisit the LearningDataAnalysis01.hs file and make a change to damage the file. We can do this via the following command line:

$ echo "It was a mistake to add this line." >> LearningDataAnalysis01.hs

This added line represents work that you contributed to a file but later realized was a mistake. The program will no longer compile with this change. You may wish that you could remember the contents of the original file. You are in luck: everything that you have committed to version control is stored in the repository. Rename your damaged file to LearningDataAnalysis01Damaged.hs. We will restore the file to the last commit:

$ git checkout -- LearningDataAnalysis01.hs

The blank LearningDataAnalysis01.hs file will be added back to your folder. When you inspect it, you will see that the changes are gone and the original file is restored. Hurray!

If you have a project consisting of at least one file, you should use version control. Here is the general workflow for branchless version control:

  1. Think.

  2. Write some code.

  3. Test that code.

  4. Commit that code.

  5. Go to step 1.

It doesn't take long to see the benefits of version control. Mistakes happen, and version control is there to save you. This version control workflow will be sufficient for small projects. Though we will not keep reminding you to use version control, you should make a practice of committing your code at least after each chapter (and probably more frequently than that).

Tmux

Tmux is an application that is used to run multiple terminals within a single terminal. A collection of terminals can be detached and reattached to other terminal connections, programs can be kept running in the background so that their progress can be monitored, and the user can jump back and forth between terminals. For example, while writing this book, we typically kept tmux running with the following terminals:

  • A terminal for the interactive Haskell command line

  • A terminal running our favorite text editor while working on the code for a chapter

  • A terminal running a text editor with mental notes to ourselves and snippets of code

  • A terminal running a text editor containing the text of the chapter we were currently writing

  • A terminal running the terminal web browser elinks in order to read the Haskell documentation

The prized feature (in our opinion) of tmux is its ability to detach from a terminal (even one that has lost its connection) and reattach to the currently connected terminal. Our work environment is a remote virtual private server running Debian Linux. With tmux, we can log in to our server from any computer with an Internet connection and an ssh client, reattach to the current tmux session, and return to testing and writing code.

We will begin by installing tmux:

$ sudo apt-get install tmux

Now, let's start tmux:

$ tmux

You will see the screen refresh with a new terminal. You are now inside a pseudoterminal. While in this terminal, start the interactive Haskell compiler (ghci). At the prompt, perform a calculation. Let's add 2 and 2 using prefix notation rather than the typical infix notation (all operators in Haskell are functions that allow infix evaluation; here, we call addition as a function):

$ ghci
GHCi, version 7.4.1: http://www.haskell.org/ghc/ :? for help
Loading package ghc-prim ... linking ... done.
Loading package integer-gmp ... linking ... done.
Loading package base ... linking ... done.
> (+) 2 2
4

The interactive Haskell compiler runs continuously. On your keyboard, type Ctrl + B, followed by C (for create). This command creates a new terminal. You can cycle forward through the chain of open terminals by using the Ctrl + B command, followed by N (for next). You now have two terminals running on the same connection.

Imagine that this is being viewed on a remote server. On your keyboard, type Ctrl + B followed by D. The screen will return to what it was just prior to you calling tmux. The word [detached] will now be seen on the screen. You will no longer be able to see the interactive Haskell compiler, but it will still be running in the background of your computer. You can reattach the session to this terminal window by using the following command:

$ tmux attach -d

Your windows will be restored, with all of your applications running and the content on the screen the same as it was when you left it. Cycle through the terminals until you find the Haskell interactive command line (Ctrl + B followed by P cycles to the previous terminal). The application never stopped running. Once you are finished with your multiplexed session, close each command line in the manner that you normally would (either by using Ctrl + D or by typing exit). Every terminal that is closed will return you to another open terminal. The tmux service will stop once the last terminal opened within tmux is closed.

 

Our first Haskell program


Though this is a book about data analysis using Haskell, it isn't meant to teach you the syntax or features of the Haskell language. What we would like to do for the remainder of the chapter is familiarize you with some of the language features that are used repeatedly in this book, in case you aren't already familiar with the language. Consider this a crash course in the Haskell language.

The median of a dataset is the value that is present in the middle of the dataset when the dataset is sorted. If there are an even number of elements, the median is the average of the two values closest to the middle of the sorted dataset. Based on this, we can plan an algorithm to compute a median. First, we will sort the numbers. Second, we will determine whether there are an even number of elements or an odd number. Finally, we will return the appropriate middle value.

Create a new folder on your computer where your Haskell files will be stored. You should put all your files in a directory called ~/projects/LearningHaskellDataAnalysis. Inside this directory, using an editor of your choice, create a file called LearningDataAnalysis01.hs (hopefully, you created this file earlier in our demonstration of Git). We will create a module file to store our algorithm to compute the median of a dataset. It will begin with the following lines:

module LearningDataAnalysis01 where
import Data.List

The first line tells Haskell that this is a module file that contains functions for general usage. The second line tells Haskell that we need the Data.List library, which is part of the Haskell platform. This library contains several versatile functions for working with lists, and we will take full advantage of it.

We will begin by crafting the header of our function:

median :: [Double] -> Double

The preceding statement states that we have a function named median that requires a parameter consisting of a list of floating-point values. It will return a single floating-point value. Now, consider the following code snippet of the median function:

median :: [Double] -> Double 
median [] = 0 
median xs = if oddInLength then 
              middleValue 
            else 
              (middleValue + beforeMiddleValue) / 2 
  where
    sortedList = sort xs 
    oddInLength = 1 == mod (genericLength xs) 2 
    middle = floor $ (genericLength xs) / 2 
    middleValue = genericIndex sortedList middle 
    beforeMiddleValue = genericIndex sortedList (middle-1)

Tip

Downloading the example code

You can download the example code files from your account at http://www.packtpub.com for all the Packt Publishing books you have purchased. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

Haskell interprets our arguments with the help of pattern matching. The two equations below the type signature are called patterns. Haskell will compare our data to each of these patterns. The first pattern is [] (an empty list). If the input list to this function is empty, we return the value 0. The second pattern is xs, which matches a nonempty list. We then evaluate whether this list has an odd or even number of elements and return the appropriate value.

The bulk of the work of this function happens under the where clause. It is a mistake to think of these statements as a sequential program. These expressions are evaluated only as they are needed to complete the task of the main function. Under the where clause, we have five expressions, each of which binds a result to a name. We will go over each of them. Consider the first clause:

sortedList = sort xs

The first clause sorts our list of values and binds the result to sortedList. Here, we utilize the sort function, which is found in Data.List. Consider the second clause:

oddInLength = 1 == mod (genericLength xs) 2

The second clause determines whether the list has an odd number of elements. We do this by computing the modulus (using mod) of the length of the list (using genericLength) with the number 2. We compare this result to the number 1, which yields either True or False. Consider the third clause:

middle = floor $ (genericLength xs) / 2

The third clause takes the length of the list, divides it by 2, and then computes the mathematical floor of the result. See the $ operator? This operator tells Haskell that everything on the rest of the line should be treated as a single expression. We could have written this line as middle = floor ((genericLength xs) / 2) and it would be valid; the $ saves us from having to look at an extra set of parentheses. We could take this a step further and use middle = floor $ genericLength xs / 2 with no parentheses at all. Readability takes priority over character counting. Consider the fourth and fifth clauses:

middleValue = genericIndex sortedList middle 
beforeMiddleValue = genericIndex sortedList (middle-1) 

The fourth and fifth clauses use genericIndex to pull a specific value from the sortedList variable (the first pulls the value at the computed middle index and the second pulls the element immediately before the middle). The fifth clause has a potential problem: on a list with one element, the middle index is 0 and the index before the middle is -1.

If you recall our earlier discussion on lazy evaluation, none of these clauses are evaluated unless needed. Back in the main portion of the function, we have the body of median, repeated here:

median xs = if oddInLength then 
              middleValue 
            else
              (middleValue + beforeMiddleValue) / 2

Consider an example of a list with one element. The first where clause that is encountered will be oddInLength (since it is evaluated in the if condition). The oddInLength expression evaluates to True, so we take the true branch of the conditional expression. The middleValue expression requires a call to genericIndex on sortedList, which forces the evaluation of the sortedList and middle clauses. In this example, beforeMiddleValue is never evaluated, so the troublesome index -1 is never requested.

We will build a wrapper program that utilizes our median function call. Create a second file called median.hs, which will serve as our wrapper to the module:

import System.Environment (getArgs) 
import LearningDataAnalysis01 
main :: IO () 
main = do 
  values <- getArgs 
  print . median $ map read values 

The last line of the file says: take all the values from the command line, read them as Double values, pass them to median, and print the result. You might ask yourself how Haskell knows to read these values as Double values and not anything else. This is where the magic of type inference happens. Because our median function specifies that it requires a list of Double values, this information is passed on to read to make sure that the input is interpreted as the Double type. The map function makes sure that read is applied to every element in values. Finally, the print function prints the result.

Let's compile our new program from the command line so that we are ready to test:

$ ghc median.hs -o median 

Great. We now have a new executable in our directory called median. From the command line, test a few values:

$ ./median 
0.0 
$ ./median 1 
1.0 
$ ./median 1 2 
1.5 
$ ./median 2 1 
1.5 
$ ./median 1 2 3 
2.0 
$ ./median 2 3 1 
2.0 
$ ./median 3 4 5 1 2 
3.0 

From this small sample of inputs, we believe that our function is working correctly. We can use this function on later datasets to find the median of our samples.

 

Interactive Haskell


This section will familiarize you with the Haskell interactive command line. Before we begin, we will mention the optional configuration file ~/.ghci that you can create in your home folder. We have configured ours with the following code:

:set prompt "> "

The preceding code tells the interactive command line to display a single > as the prompt. You can start the interactive command line using the ghci command. Here is what you will see when the command line is started:

$ ghci
GHCi, version 7.4.1: http://www.haskell.org/ghc/  :? for help
Loading package ghc-prim ... linking ... done.
Loading package integer-gmp ... linking ... done.
Loading package base ... linking ... done.
>

You can execute simple equations using either the familiar infix notation, or the functional notation:

> 2 + 2
4
> (+) 2 2
4
> 2 + 4 * 5
22
> (+) 2 $ (*) 4 5
22

Note that we need to use $ here in order to tell Haskell that the (*) 4 5 multiplication portion is an argument to the (+) 2 addition portion.

An introductory problem

This introductory problem will serve as a way of explaining the features of the Haskell language that are used repeatedly in this book. The problem is that we wish to know the positions of the vowel characters in a given word. For example, in the word apple, there are two vowels (the first and fifth letters). Given the string apple, we should return the list [1, 5]. We will go through the thought process of solving this problem and turn our solution into a function. You could use the elemIndices function found in the Data.List module, but we will choose not to do so for teaching purposes.

First, we will declare a variable to store our word. In this example, we will use the word apple:

> let word = "apple"

We will assign a number to each letter in our word using zip and an infinite list of numbers. The zip function performs a pair-wise merge of two lists to create a list of tuples. A tuple is a fixed-size structure whose elements can have different types. In the following code, we combine integer and character values:

> zip [1..] word
[(1,'a'),(2,'p'),(3,'p'),(4,'l'),(5,'e')]

The expression [1..] is an infinite list of numbers. If you type this in the interactive command line, numbers will appear until you decide to stop it. By using it in conjunction with zip, we only take what we need. There are five letters in apple. So, we only take five elements from our infinite list. This is an example of lazy evaluation at work.
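We can see this laziness directly by asking for only a few elements of the infinite list ourselves (take is a standard Prelude function):

> take 5 [1..]
[1,2,3,4,5]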

Next, we will filter our list to remove anything that is not a vowel character. We will do this with the help of the filter function, which requires us to pass a lambda function with the rule that defines what is allowed in the list of values:

> filter (\(_, letter) -> elem letter "aeiouAEIOU") $ zip [1..] word
[(1,'a'),(5,'e')]

Let's take a closer look at the lambda expression, which sits within the parentheses beginning with (\ and ending with ). Using the :t option, we can inspect how Haskell interprets this function:

> :t \(_, letter) -> elem letter "aeiouAEIOU"
\(_, letter) -> elem letter "aeiouAEIOU" :: (t, Char) -> Bool

The function requires a pair of values. The first value in the pair is matched by _, a wildcard that ignores the value; you can see that Haskell assigns it the generic type t. The second value in the pair is identified by letter, which represents a character in our string. We never declared letter to be a Char, but Haskell was able to use type inference to see that we search for this value among a list of characters, so it must be a character. This lambda expression calls the elem function, which is part of the Data.List module. The elem function returns a Bool, so the return type Bool is also inferred. The elem function returns True if a value exists in a list; otherwise, it returns False.

We need to remove the letters from our list of values and return a list of only the numbers:

> map fst . filter (\(_, letter) -> elem letter "aeiouAEIOU") $ zip [1..] word
[1,5]

The map function, like filter, requires a function and a list. Here, the function is fst and the list is the value returned by the call to filter. Typically, tuples consist of two values (but this is not always the case). The fst and snd functions extract the first and second values of a pair, as follows:

> :t fst
fst :: (a, b) -> a
> :t snd
snd :: (a, b) -> b
> fst (1, 'a')
1
> snd (1, 'a')
'a'

We will add our newly crafted expression to the LearningDataAnalysis01 module. Now, open the file and add the new function towards the end of this file using the following code:

-- Finds the indices of every vowel in a word.
vowelIndices :: String -> [Integer]
vowelIndices word = 
  map fst $ filter (\(_, letter) -> elem letter "aeiouAEIOU") $ zip [1..] word

Then, return to the Haskell command line and load the module using :l:

> :l LearningDataAnalysis01
[1 of 1] Compiling LearningDataAnalysis01 ( LearningDataAnalysis01.hs, interpreted )
Ok, modules loaded: LearningDataAnalysis01.

In the next few chapters, we will clip the output of the load command. Your functions are now loaded and ready for use on the command line:

> vowelIndices "apple"
[1,5]
> vowelIndices "orange"
[1,3,6]
> vowelIndices "grapes"
[3,5]
> vowelIndices "supercalifragilisticexpialidocious"
[2,4,7,9,12,14,16,19,21,24,25,27,29,31,32,33]
> vowelIndices "why"
[]

You can also use the median function that we used earlier. In the following code, we will pass every integer returned by vowelIndices through fromIntegral to convert it to a Double type:

> median . map fromIntegral $ vowelIndices "supercalifragilisticexpialidocious"
20.0

If you make changes to your module, you can quickly reload it in the interactive command line by using :r. This advice comes with a warning: every time you load or reload a library in Haskell, the entire environment (and all your delicately typed expressions) will be reset. You will lose everything when you do this. This is typically countered by keeping a separate text editor open where you can type out all your Haskell commands and paste them into the GHCi interpreter.

 

Summary


This chapter looked at Haskell from the perspective of a data analyst. We looked at Haskell's feature set (functional, type-inferred, and lazy) and saw how each of these features benefits a data analyst. We also spent some time getting acquainted with our environment, which included setting up Haskell, cabal, Git, and tmux. We ended the chapter with a simple program that computes the median of a list of values and a function that finds the vowel positions in a string.

About the Author

  • James Church

    James Church lives in Clarksville, Tennessee, United States, where he enjoys teaching, programming, and playing board games with his wife, Michelle. He is an assistant professor of computer science at Austin Peay State University. He has consulted for various companies and a chemical laboratory, performing data analysis work. James is the author of Learning Haskell Data Analysis.
