You're reading from  Getting Started with Haskell Data Analysis

Product type: Book
Published in: Oct 2018
Reading level: Beginner
Publisher: Packt
ISBN-13: 9781789802863
Edition: 1st Edition
Author: James Church

James Church lives in Clarksville, Tennessee, United States, where he enjoys teaching, programming, and playing board games with his wife, Michelle. He is an assistant professor of computer science at Austin Peay State University. He has consulted for various companies and a chemical laboratory for the purpose of performing data analysis work. James is the author of Learning Haskell Data Analysis.

Regular Expressions

In this chapter, we will learn what regular expressions are. A regular expression represents a pattern that can be identified within text data. In the context of data analysis, regular expressions have two important uses:

  • To validate fields to make sure that all values within a particular column adhere to a particular format
  • To search fields based on a particular pattern

Word processors and editing applications have a Find and Replace feature. You submit a piece of text to find within a larger body of text, along with the desired replacement, and the application replaces every occurrence of the found text with the desired text. Many of these applications now include regular expression support. Rather than submitting an exact sequence of characters to find, we submit a pattern. This pattern defines...

Dots and pipes

In this section, we cover two basic pieces of regular expression syntax: dots and pipes. We begin by installing Haskell's regular expression library, which can be done by running the cabal install regex-posix command in the Terminal.

Once the library is installed, let's create a new notebook named RegexLearning and dive in. We need to import the Text.Regex.Posix library so that we can access the =~ operator, which tests a string against a regular expression. Let's define a couple of strings to get us started:

As you can see, str1 is "one fish two fish red fish blue fish", the title...
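The dot and pipe syntax described in this section can be sketched as follows. This is a minimal illustration, assuming the regex-posix package is installed; str1 comes from the text above, while the patterns and the names hasFish, firstTwo, firstColor, and allColors are hypothetical choices made for this example and are not necessarily the ones used in the book.

```haskell
import Text.Regex.Posix ((=~), getAllTextMatches)

-- The sample string from the text.
str1 :: String
str1 = "one fish two fish red fish blue fish"

-- The dot matches any single character, so "f.sh" matches "fish".
-- Asking for a Bool result tells us only whether a match exists.
hasFish :: Bool
hasFish = str1 =~ "f.sh"

-- Asking for a String result returns the first matching substring.
firstTwo :: String
firstTwo = str1 =~ "t.o"

-- The pipe matches the expression on its left or the one on its right.
firstColor :: String
firstColor = str1 =~ "red|blue"

-- getAllTextMatches collects every non-overlapping match, left to right.
allColors :: [String]
allColors = getAllTextMatches (str1 =~ "red|blue")

main :: IO ()
main = do
  print hasFish     -- True
  print firstTwo    -- "two"
  print firstColor  -- "red"
  print allColors   -- ["red","blue"]
```

The result type requested from =~ decides what comes back, which is why each definition carries an explicit type signature.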

Atom and Atom modifiers

In this section, we will expand our knowledge of regular expressions by discussing the atom. An atom is a single expression: a character, a dot, an expression grouped with parentheses, or (as we will see in a later section) a character class. We will also introduce atom modifiers. The idea is that you can take any atom and change how many times it may match by attaching a modifier to it. Now, let's go back to our RegexLearning notebook and continue from where we left off in the last section.

Imagine that you have a string representing a date in year-month-day format, separated by dashes, such as 1969-07-20, and you wish to verify that this date is in either the 1900s or the 2000s:

Well, we crafted...
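As a hedged sketch of the ideas in this section, the following example checks the century of a date using a parenthesized atom, and then shows the common POSIX modifiers ?, *, and + applied to a single-character atom. The pattern, the ^ and $ anchors (which tie the match to the start and end of the string), and the function name inTargetCenturies are assumptions made for illustration; the book's own expression may differ.

```haskell
import Text.Regex.Posix ((=~))

-- (19|20) is one atom built with parentheses and a pipe.
-- Each dot stands in for one character of the remaining date fields.
inTargetCenturies :: String -> Bool
inTargetCenturies date = date =~ "^(19|20)..-..-..$"

-- Modifiers apply to the atom immediately before them:
--   ?  zero or one occurrence
--   *  zero or more occurrences
--   +  one or more occurrences
optionalU :: String -> Bool
optionalU word = word =~ "^colou?r$"

zeroOrMoreO :: String -> Bool
zeroOrMoreO word = word =~ "^go*al$"

oneOrMoreO :: String -> Bool
oneOrMoreO word = word =~ "^go+al$"

main :: IO ()
main = do
  print (inTargetCenturies "1969-07-20")  -- True
  print (inTargetCenturies "1869-07-20")  -- False
  print (optionalU "color")               -- True
  print (optionalU "colour")              -- True
  print (zeroOrMoreO "gal")               -- True
  print (oneOrMoreO "gal")                -- False
  print (oneOrMoreO "gooooal")            -- True
```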

Character classes

Character classes are a way of combining characters with common traits into a single classification, such as characters that represent numbers, letters, vowels, or hexadecimal digits. In this section, we introduce the basics of character classes, expand on them with character class ranges and character class negations, and then write a full regular expression that matches dates.

Our introduction to character classes begins with vowels. The vowels are the letters A, E, I, O, and U, and almost every word has a vowel in it. Let's see if we can write a character class that matches a vowel:

So, here we have the word "dog" and, to begin a character class, we use square brackets. Inside the square brackets we have...
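The character class features named in this section (a plain class, a range, a negation, and a full date expression) might look like the sketch below. The function names and the exact patterns are assumptions made for illustration, and the date pattern checks only the shape of the string, not whether the month or day is valid.

```haskell
import Text.Regex.Posix ((=~))

-- A character class matches any one of the characters listed in brackets.
hasVowel :: String -> Bool
hasVowel word = word =~ "[aeiou]"

-- Ranges abbreviate runs of characters: digits plus a-f in either case.
isHex :: String -> Bool
isHex text = text =~ "^[0-9a-fA-F]+$"

-- A leading ^ inside the brackets negates the class.
-- This matches words made up entirely of characters other than lowercase vowels.
hasNoVowels :: String -> Bool
hasNoVowels word = word =~ "^[^aeiou]+$"

-- Four digits, a dash, two digits, a dash, two digits.
-- {n} is a modifier meaning "exactly n occurrences" of the preceding atom.
looksLikeDate :: String -> Bool
looksLikeDate text = text =~ "^[0-9]{4}-[0-9]{2}-[0-9]{2}$"

main :: IO ()
main = do
  print (hasVowel "dog")             -- True
  print (hasVowel "rhythm")          -- False
  print (isHex "1aF")                -- True
  print (hasNoVowels "rhythm")       -- True
  print (looksLikeDate "1969-07-20") -- True
  print (looksLikeDate "1969-7-20")  -- False
```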

Regular expressions in CSV files

Regular expressions are useful across file formats such as CSV and SQLite3. In this section, we will cover the CSV format by examining a question about one of our past datasets. Using our Baseball dataset, let's find the average number of runs scored by away teams in the month of March. To do this, we'll need our CSV file of data, which has the dates in the first column but is not organized by month.

To solve this, we will craft a regular expression to match a field in the CSV file; in this case, the first column of dates. We will pair that field with another column, the runs scored by away teams. Then, we're going to filter that information to get...
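The pair-then-filter approach described above can be sketched without the CSV file itself. The example below assumes the rows have already been parsed into (date, away runs) pairs and that dates are written as YYYY-MM-DD; both are assumptions, since the actual layout of the Baseball dataset is not shown here. The sample rows are hypothetical values invented purely to exercise the code, and the names sampleRows, isMarch, and averageAwayRunsInMarch are illustrative.

```haskell
import Text.Regex.Posix ((=~))

-- Hypothetical rows: (date from the first column, runs scored by the away team).
-- These values are made up for illustration and are not from the real dataset.
sampleRows :: [(String, Int)]
sampleRows =
  [ ("2015-03-30", 4)
  , ("2015-03-31", 2)
  , ("2015-04-01", 7)
  , ("2015-05-12", 3)
  ]

-- Match any date whose month field is 03 (assumes a YYYY-MM-DD layout).
isMarch :: String -> Bool
isMarch date = date =~ "^[0-9]{4}-03-[0-9]{2}$"

-- Keep only the March rows, then average their away runs.
-- Returns Nothing when no row matches, which avoids dividing by zero.
averageAwayRunsInMarch :: [(String, Int)] -> Maybe Double
averageAwayRunsInMarch rows
  | null marchRuns = Nothing
  | otherwise =
      Just (fromIntegral (sum marchRuns) / fromIntegral (length marchRuns))
  where
    marchRuns = [runs | (date, runs) <- rows, isMarch date]

main :: IO ()
main = print (averageAwayRunsInMarch sampleRows)  -- Just 3.0
```

If the real file stores dates in another layout, only the pattern inside isMarch needs to change; the filtering and averaging steps stay the same.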

SQLite3 and regular expressions

Working with regular expressions in our SQLite3 database is no different from working with a CSV file. In this section, we will filter the timestamp data from an SQLite3 database using regular expressions, in a similar manner to the last section. We will load the data from the SQLite3 database, sift through it using a regular expression, and analyze the values that the expression extracts. The problem we will try to solve is to determine how many earthquakes occur in each hour of the day in our 7-day database. Let's create a new Haskell notebook named RegexLearning-SQLite3. Let's first import our libraries:

We won't be using any descriptive statistics in this section, so there's no need to load the descriptive statistics...
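The counting step of this problem can be sketched independently of the database. The example below assumes the timestamps have already been read into a list of strings and that they follow an ISO 8601 style layout such as 2015-04-01T13:05:10.500Z; that layout is an assumption, as are the sample values and the names extractHour and countByHour. It uses a parenthesized group to capture the hour and Data.Map from the containers package to tally the results.

```haskell
import qualified Data.Map.Strict as Map
import Text.Regex.Posix ((=~))

-- Hypothetical timestamps invented for illustration; the last entry is
-- deliberately malformed to show that non-matching values are skipped.
sampleTimestamps :: [String]
sampleTimestamps =
  [ "2015-04-01T00:12:45.000Z"
  , "2015-04-01T00:47:03.120Z"
  , "2015-04-01T13:05:10.500Z"
  , "not a timestamp"
  ]

-- The 4-tuple result is (text before, matched text, text after, groups).
-- The single parenthesized group captures the two-digit hour after the T.
extractHour :: String -> Maybe String
extractHour ts =
  case (ts =~ "T([0-9]{2}):" :: (String, String, String, [String])) of
    (_, _, _, [hour]) -> Just hour
    _                 -> Nothing

-- Tally how many timestamps fall in each hour.
countByHour :: [String] -> Map.Map String Int
countByHour timestamps =
  Map.fromListWith (+) [(hour, 1) | Just hour <- map extractHour timestamps]

main :: IO ()
main = print (countByHour sampleTimestamps)
-- fromList [("00",2),("13",1)]
```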

Summary

In this chapter, we began by installing the regular expression library, and we covered some regular expression syntax, such as how the dot matches any one character and how the pipe matches either the expression to its left or the expression to its right. We talked about atoms and atom modifiers, and we discussed character classes at length. We used regular expressions with a CSV file and an SQLite3 database. You should always test your regular expressions thoroughly, as they tend to be difficult to debug. In the next chapter, we will discuss data visualization.
