You're reading from Getting Started with Haskell Data Analysis

Product typeBook

Published inOct 2018

Reading LevelBeginner

PublisherPackt

ISBN-139781789802863

Edition1st Edition

Languages

Haskell

Concepts

Data Analysis

Author (1)

James Church

The CSV library – working with CSV files

In this section, we're going to cover the basics of the CSV library and how to work with CSV files. To do this, we will be taking a closer look at the structure of a CSV file; how to install the Text.CSV Haskell library; and how to retrieve data from a CSV file from within Haskell.

Now to begin, we need a CSV file. So, I'm going to tab over to my Haskell environment, which is just a Debian Linux virtual machine running on my computer, and I'm going to go to the website at retrosheet.org. This is a website for baseball statistics, and we are going to use them to demonstrate the CSV library. Find the link for Data Downloads and click Game Logs, as follows:

Now, scroll down just a little bit and you should see game logs for every single season, going all the way back to 1871. For now, I would like to stick with the most recent complete season, which is 2015:

So, go ahead and click the 2015 link. We will have the option to download a ZIP file, so go ahead and click OK. Now, I'm going to tab over to my Terminal:

Let's go into the Downloads folder, and if we hit ls, we see that there's our ZIP file. Let's unzip that file and see what we have. Let's open up GL2015.TXT. This is a CSV file, and will display something like the following:

A CSV file is a file of comma-separated values. So, you'll see that we have a file divided up, where each line in this file is a record, and each record represents a single game of baseball in the 2015 season; and inside every single record is a listing of values, separated by a comma. So, the very first game in this dataset is a game between the St. Louis Cardinals—that's SLN—and the Chicago Cubs—that's CHN—and this game took place on March 5th 2015. The final score of this first game was 3-0, and every line in this file is a different game.

Now, CSV isn't a standard, but there are a few properties of a CSV file which I consider to be safe. Consider the following as my suggestions. A CSV file should keep one record per line. The first line should be a description of each column. In a future section, I'm going to tell you that we need to remove the header line; and you'll see that this particular file doesn't have this header line. I still like to see the description line for each column of values. If a field in a record includes a comma, then that field should be surrounded by double quote marks. Now we don't see an example of this—at least, not on this first line—but we do see examples of many values having quote marks surrounding the file, such as the very first value in the file, the date:

In a CSV file, if a field is surrounded by quote marks, then it is optional, unless it has a comma inside that value. While we're here, I would like to make a note of the tenth column in this file, which contains the number 3 on this particular row. This represents the away-team score in every single record of this file. Make a note that our first value on the tenth column is a 3—we're going to come back to that later on.

Our next task is installing the Text.CSV library; we do this using the Cabal tool, which connects with the Hackage repository and downloads the Text.CSV library:

The command that we use to start the install, shown in the first line of the preceding screenshot, is cabal install csv. It takes a moment to download the file, but it should download and install the Text.CSV library in our home folder. Now, let me describe what I currently have in my home folder:

I like to create a directory for my code called Code; and inside here, I have a directory called HaskellDataAnalysis. And inside HaskellDataAnalysis, I have two directories, called analysis and data. In the analysis folder, I would like to store my notebooks. In the data folder, I would like to store my datasets.

That way, I can keep a clear distinction between analysis files and data files. That means I need to move the data file, just downloaded, into my data folder. So, copy GL2015.TXT from our Downloads folder into our data folder. If I do an ls on my data folder, I'll see that I've got my file. Now, I'm going to go into my analysis folder, which currently contains nothing, and I'm going to start the Jupyter Notebook as follows:

Type in jupyter notebook, which will start a web server on your computer. You can use your web browser in order to interact with Haskell:

The address for the Jupyter Notebook is the localhost, on port 8888. Now I'm going to create a new Haskell notebook. To do this, I click on the New drop-down button on the right side of the screen, and I find Haskell:

Let's begin by renaming our notebook Baseball, because we're going to be looking at baseball statistics:

I need to import the Text.CSV file that we just installed. Now, if your cursor is sitting in a text field and you hit Enter, you'll just be making that text field larger, as shown in the following screenshot. Instead, in order to submit expressions to the Jupyter environment, you have to hit hit Shift + Enter on the keyboard:

So, now that we've imported Text.CSV, let's create our Baseball dataset and parse the dataset. The command for this is parseCSVFromFile, after which we pass in the location of our text file:

Great. If you didn't get a File Not Found error at this point, then that means you have successfully parsed the data from the CSV file. Now, let's explore the type of baseball data. To do this, we enter type and baseball, which is what we just created, and we see that we have either a parsing error or a CSV file:

I've already done this, so I know that there aren't any parsing errors in our CSV file, but if there were, they would be represented by ParseError. So I can promise you that if you've gotten this far, you know that we have a working CSV file. Now, I'll be honest: I don't know why the CSV library does this, but the last element in every CSV data is a single empty list, and I call this empty list the "empty row". What I would like to do is to create a quick function, called noEmptyRows, that removes any row of data that doesn't have at least two pieces of information in it:

So, if we have a parsing error, we're just going to return back an empty list, and if we actually have data, we're going to filter out any row that does not have at least two pieces of information in that row. Now, let's apply our noEmptyRows to our Baseball dataset:

I'm going to call this baseballList. Then we can do a quick check to see the length of the baseballList, and we should have 2,429 rows representing 2,429 games played in the 2015 season.

Now let's look at the type of baseballList, and we see that we have a list of fields:

Now, you may be asking yourself: What's a field? We can explore a field using info, and doing so will bring up a window from the bottom of the screen:

It says type Field = String, and it's defined in this Text.CSV library. So, just remember that a field is just a string.

Now, because every value is a field that is also a string, that means that if I do math on strings, it's going to produce an error message, as shown in the following screenshot:

So what I need to do is to parse that information from a string to something else that I can use, such as an int or a double, and I do that with the read command. Let's look at an example:

So if I say read "1", it will be parsed as an Integer, or, if I say read "1.5", then it will be parsed as a Double.

So, armed with this knowledge of parsing data from strings, we can parse a whole column of data. Create a readIndex function, and let's say that, in our case, each value is a cell:

So for each cell in our dataset, we're going to pass in our original Baseball dataset—this is an Either; and we're going to say that we need an Int index position in our list; and we are going to return a list of cells. This requires two arguments: the csv, and the index position that we need. And we are going to map over each record, and we're going to read whatever exists at the specified index position. We also need the noEmptyRows function that we discussed earlier.

Now, if you recall earlier, I said that the away-team scores in our CSV file exist on column 10, and because Haskell is a zero-based index file, that means we need to pass in index 9 to our readIndex function:

Here, we parse this list that's returned as a list of integers, and we are returned a listing of every single away-team score in Major League Baseball. The very first element in our list is a 3, because that is the first record of the file.

In this section, you learned about the structure of a CSV file, how to install the Text.CSV library, and how to pull a little bit of information out of that CSV file using the CSV library. In the next section, we're going to discuss how to create our own module for descriptive statistics, and how to write a function for the range of a dataset.

You have been reading a chapter from

Getting Started with Haskell Data Analysis

Published in: Oct 2018Publisher: PacktISBN-13: 9781789802863

A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.

undefined

Unlock this book and the full library FREE for 7 days

Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of

Start free trial

Renews at $15.99/month. Cancel anytime

Author (1)

James Church

James Church lives in Clarksville, Tennessee, United States, where he enjoys teaching, programming, and playing board games with his wife, Michelle. He is an assistant professor of computer science at Austin Peay State University. He has consulted for various companies and a chemical laboratory for the purpose of performing data analysis work. James is the author of Learning Haskell Data Analysis.
Read more about James Church

Personalised recommendations for you

Based on your interests and search pattern

Et al.

Ever wonder why speech recognition systems don't understand the Scottish accent, or what would happen if an astronaut only ate mac 'n' cheese, or other spurious reflections you'd have at a bar? We did, then collated those deliberations into absurd research articles with fake figures and methodologies inspired by even more fictionally absurd studies.

BookAug 2023230 pages5

Generative AI with LangChain

This book is a comprehensive introduction to LLMs and LangChain, demystifying the basic mechanics of LangChain, its functionalities, and the myriad of applications it can be integrated into.

BookDec 2023360 pages4

Generative AI with LangChain

This book is a comprehensive introduction to LLMs and LangChain, demystifying the basic mechanics of LangChain, its functionalities, and the myriad of applications it can be integrated into.

BookDec 2023360 pages5

Generative AI with LangChain

This book is a comprehensive introduction to LLMs and LangChain, demystifying the basic mechanics of LangChain, its functionalities, and the myriad of applications it can be integrated into.

BookDec 2023360 pages1

Generative AI with LangChain

This book is a comprehensive introduction to LLMs and LangChain, demystifying the basic mechanics of LangChain, its functionalities, and the myriad of applications it can be integrated into.

BookDec 2023360 pages5

Mastering Tableau 2023

This book is a comprehensive resource to mastering your Tableau skills and becoming a BI expert. As you progress, you will learn how to build advanced dashboards and improve your storytelling to derive key business insight, as well as make you well-versed with advanced functionalities of Tableau in the business intelligence domain.

BookAug 2023684 pages

Building AI Applications with ChatGPT APIs

This guide covers all ChatGPT API features for effortless creation of robust AI powered apps. With its help, you’ll be able to leverage ChatGPT’s cutting-edge NLP models to take your app development skills to the next level. You’ll also work on ten exciting projects that will give you the practical know-how that you can apply to your existing applications.

BookSep 2023258 pages5

Building AI Applications with ChatGPT APIs

This guide covers all ChatGPT API features for effortless creation of robust AI powered apps. With its help, you’ll be able to leverage ChatGPT’s cutting-edge NLP models to take your app development skills to the next level. You’ll also work on ten exciting projects that will give you the practical know-how that you can apply to your existing applications.

BookSep 2023258 pages2

Data Engineering with AWS

Embark on a journey to master data engineering pipelines on AWS! Our book offers a hands-on experience of AWS services for ingesting, transforming, and consuming data. Whether you're an absolute beginner or someone with basic data engineering experience, this guide is an indispensable resource.

BookOct 2023636 pages5

Modern Data Architecture on AWS

Every organization wants an agile, performant, and cost-effective data platform that meets all their current and future business needs. Purpose-built AWS analytics services and their features play a big part in building such a modern data platform. This book brings to you all the design and architectural patterns that’ll help you achieve this goal.

BookAug 2023420 pages5

Practical Guide to Applied Conformal Prediction in Python

Discover the power of Conformal Prediction with the "Practical Guide to Applied Conformal Prediction in Python." Master the latest techniques to quantify uncertainty in machine learning and computer vision models, and seamlessly apply them to your industry applications.

BookDec 2023240 pages

TinyML Cookbook

With over 70 project-based recipes, the TinyML Cookbook is a practical guide that will help you to get the most out of your microcontrollers. It provides a comprehensive understanding of the theoretical foundations while giving you hands-on experience training ML models for deployment on Arduino Nano 33 BLE Sense, Raspberry Pi Pico, and SparkFun RedBoard Artemis Nano microcontrollers.

BookNov 2023664 pages