Chapter 1. Extracting and Handling Data

In this chapter, we will cover the following recipes:

Why should we use Julia for data science?
Handling data with CSV files
Handling data with TSV files
Working with databases in Julia
Interacting with the Web

Introduction

This chapter deals with the importance of the Julia programming language for data science and its applications. It also serves as a guide to handling data in the most common formats and shows how to crawl and scrape data from the Internet.

Data science pipelines that are used for production purposes need to be robust and highly fault-tolerant; without these properties, teams would be exposed to highly error-prone models. These pipelines therefore contain a subprocess called Extract-Transform-Load (ETL), in which the Extract step pulls the data from a source, the Transform step applies the transformations that cleanse the dataset, and the Load step loads the now clean data into the local databases for use in production. This chapter will also teach you how to interact with websites by sending and receiving data through HTTP requests. Ingesting data is the first step in any data science and analytics pipeline, so this chapter covers some of the methods through which data can be brought into the pipeline from various data sources.
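
The three ETL steps described above can be sketched in a few lines of plain Julia. This is only a minimal illustration, assuming a recent Julia (1.x); the sample data and the function names extract, transform, and load are hypothetical and not part of any library:

```julia
# Made-up raw input: one row has a missing score.
raw = """
name,score
alice,90
bob,
carol,75
"""

# Extract: split the raw text into rows of fields, skipping the header row.
function extract(text)
    lines = split(strip(text), '\n')
    return [split(line, ',') for line in lines[2:end]]
end

# Transform: drop rows with an empty score and parse the remaining scores.
function transform(rows)
    return [(name = String(r[1]), score = parse(Int, r[2])) for r in rows if !isempty(r[2])]
end

# Load: write the clean records to an output stream
# (in production, this would be a database or a file).
function load(io, records)
    println(io, "name,score")
    for rec in records
        println(io, rec.name, ",", rec.score)
    end
end

records = transform(extract(raw))
buffer = IOBuffer()
load(buffer, records)
output = String(take!(buffer))
print(output)
```

Keeping each step in its own function makes it easy to swap the source or the destination without touching the cleansing logic.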

Why should we use Julia for data science?

Data Science is simply doing science with data. It applies to a surprisingly wide range of domains, such as engineering, business, marketing, and automotive, owing to the availability of a large amount of data in all these industries from which valuable insights can be extracted and understood.

With the growth of industries, the speed, volume, and variety of the data being produced are increasing drastically. The tools that deal with this data are continuously being adapted, which has led to the emergence of more evolved, powerful tools such as Julia.

Julia has been growing steadily as a powerful alternative to the current data science tools. Julia's diverse range of statistical packages along with its powerful compiler features make it a very strong competitor to the current top two programming languages of data science: R and Python. However, advanced users of R and Python can use Julia alongside each of them to reap the maximum benefits from the features of both.

Julia compiles code that looks and reads like Python into machine code that performs like C, and it has shown a lot of promise in generating efficient code through type inference. It is also interesting to note that even the core mathematical library of Julia is written in Julia itself. Because it supports distributed parallel execution, numerical accuracy, and powerful type inference, with a syntax as readable as Python's and a range of statistical packages as diverse as R's, Julia is a very powerful programming language for the rapidly evolving domain of data science.
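
As a small illustration of type inference, the following sketch defines a generic function without any type annotations and asks the compiler which return type it infers for two concrete argument types. It assumes a recent Julia (1.x); the function name sum_of_squares is made up for this example, and Base.return_types is an introspection helper rather than an API you would normally call in production code:

```julia
# A generic function with no type annotations.
function sum_of_squares(xs)
    total = zero(eltype(xs))
    for x in xs
        total += x * x
    end
    return total
end

# Ask the compiler which return type it infers for each argument type.
float_ret = Base.return_types(sum_of_squares, (Vector{Float64},))
int_ret = Base.return_types(sum_of_squares, (Vector{Int},))
println(float_ret, " ", int_ret)
```

The same source code is specialized separately for each argument type, which is what allows the generated machine code to avoid dynamic type checks inside the loop.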

Installing and starting Julia is straightforward:

Download the Julia package suited to your operating system from http://julialang.org/downloads/ .
Then, start Julia's interactive session, which is also called the REPL (read-eval-print loop). The terminal displays the Julia startup banner followed by the julia> prompt.

Now, you are all set up to learn and experience Julia for Data Science.

Handling data with CSV files

In this section, we will explain ways in which you can handle files with the Comma-separated Values (CSV) file format.

Getting ready

Install the DataFrames package, which is the Julia package for working with data arrays and dataframes. The command for adding the DataFrames package is as follows:

Pkg.add("DataFrames")

Make sure that all the installed packages are up-to-date: Pkg.update()

How to do it...

CSV files, as the name suggests, are files whose contents are separated by commas. CSV files can be accessed and read into the REPL process by executing the following steps:

Assign the local path of the file to a variable:
```
s = "/Users/username/dir/iris.csv"
```
The readtable() command is used to read the data from the source. The data is read in the form of a Julia DataFrame:
```
iris = readtable(s)
```

Data can be written to CSV files from a Julia DataFrame using the following steps:

Create a data structure with some data inside it. For example, let's create a two-dimensional dataframe to better illustrate the process of writing files of different formats using DataFrames:
```
df = DataFrame(A = 1:10, B = 11:20)
```
- The preceding command creates a two-dimensional dataframe with columns named A and B.
Now, the dataframe created in Step 1 can be exported to an external CSV file by using the following command:
```
writetable("data.csv", df)
```
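
The readtable() and writetable() functions belong to the DataFrames release that this recipe targets. As a dependency-free cross-check, the following sketch performs a similar CSV round trip with the DelimitedFiles standard library. It assumes a Julia 1.x installation; the matrix layout, the temporary path, and the variable names are illustrative assumptions, not part of the recipe:

```julia
using DelimitedFiles   # standard library in Julia 1.x (assumption: not the DataFrames API)

# The same two columns as the dataframe above, stored as a plain matrix.
table = hcat(1:10, 11:20)
header = ["A" "B"]                     # 1x2 row of column names

path = joinpath(mktempdir(), "data.csv")

# Write the header row followed by the data, separated by commas.
open(path, "w") do io
    writedlm(io, header, ',')
    writedlm(io, table, ',')
end

# Read the file back; header = true returns a (data, header_row) tuple.
data, names_read = readdlm(path, ',', Int; header = true)
```

Reading the file back and comparing it with the original matrix is a quick way to confirm that nothing was lost in the export.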

Handling data with TSV files

In this section, we will explain how to handle Tab Separated Values (TSV) files.

Getting ready

The DataFrames package is needed to deal with TSV files. So, as it is already installed as instructed in the previous section, we can move ahead and make sure that all the packages are up-to-date with the following command:

Pkg.update()

How to do it...

TSV files, as the name suggests, are files whose contents are separated by tabs. TSV files can be accessed and read into the REPL process by the following method:

Assign the local path of the file to a variable:
```
s = "/Users/username/dir/data.tsv"
```
The readtable() command is used to read the data from the source. The data is read in the form of a Julia DataFrame:
```
data = readtable(s)
```

Data can be written to TSV files from a Julia DataFrame using the following steps:

Create a data structure with some data inside it. For example, let's create a two-dimensional dataframe like the one we created in the previous example:
```
using DataFrames
df = DataFrame(A = 1:10, B = 11:20)
```
Now, the dataframe, which we created in Step 1, can be exported to an external TSV file using the following command:
```
writetable("data.tsv", df)
```

The writetable() command is clever enough to make out the format of the file from the filename extension.
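
To make the idea of choosing the format from the filename extension concrete, here is a minimal sketch of how such a dispatch could look. It is not how writetable() is implemented; the helpers delimiter_for and write_delimited are hypothetical, and the example assumes Julia 1.x with the DelimitedFiles standard library:

```julia
using DelimitedFiles   # standard library in Julia 1.x

# Hypothetical helper: choose a separator from the filename extension.
function delimiter_for(filename)
    ext = lowercase(splitext(filename)[2])
    ext == ".csv" && return ','
    ext == ".tsv" && return '\t'
    error("unsupported extension: $ext")
end

# Hypothetical helper: write a matrix using the separator implied by the filename.
write_delimited(path, table) = writedlm(path, table, delimiter_for(path))

dir = mktempdir()
table = hcat(1:3, 11:13)
tsv_path = joinpath(dir, "data.tsv")
write_delimited(tsv_path, table)

# Inspect the raw file contents to confirm that tabs were used.
tsv_text = read(tsv_path, String)
```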

Working with databases in Julia

In this section, we will explain ways to handle data stored in databases: MySQL, PostgreSQL, and SQLite.

Getting ready

MySQL is an open source relational database. To be able to interact with your MySQL databases from Julia, the database server (along with the relevant Julia package) needs to be installed. Assuming that the database is already set up and the MySQL session is already up and running, install the MySQL bindings for Julia by directly cloning the repository:

Pkg.clone("https://github.com/JuliaComputing/MySQL.jl")

PostgreSQL is an open source object relational database. Similar to the MySQL setup, the server of the PostgreSQL database should be up and running with a session. Now, install the PostgreSQL bindings for Julia by following the given instructions:

Install the DBI package. The DBI package is a database-independent API that complies with almost all database drivers.
The DBI package from Julia can be installed by directly cloning it from its repository using the following statement:
```
Pkg.clone("https://github.com/JuliaDB/DBI.jl")
```
Then, install the PostgreSQL library by directly cloning the library's repository using the following statement:
```
Pkg.clone("https://github.com/JuliaDB/PostgreSQL.jl")
```
SQLite is a light, serverless, self-contained, transactional SQL database engine. To interact with data in SQLite databases, first install SQLite and make sure that its command-line shell starts and displays a prompt.
Now, the SQLite bindings for Julia can be installed by adding the SQLite Julia package with the following standard package installation command:
```
Pkg.add("SQLite")
```

How to do it...

Here, you will learn about connecting to databases and executing queries to manipulate and analyze data. You will also learn about the various protocols and libraries in Julia that will help you interact with databases.

MySQL

A MySQL database can be connected by a simple command that takes in the host, username, password, and database name as parameters. Let's take a look at the following steps:

First, import the MySQL package:
```
using MySQL
```
Set up the connection to a MySQL database by including all the required parameters to establish a connection:
```
conn = mysql_connect(host, user_name, password, dbname)
```

Now, let's write and run a basic table creation query:

Assign the query statement to a variable.

query = """ CREATE TABLE Student
                 (
                     ID INT NOT NULL AUTO_INCREMENT,
                     Name VARCHAR(255),
                     Attendance FLOAT,
                     JoinDate DATE,
                     Enrolments INT,
                     PRIMARY KEY (ID)
                 );"""

Now, to check whether the query executed successfully, capture the response returned by the connection.
```
response = mysql_query(conn, query)
```

Check whether the query succeeded through a conditional statement:

if response == 0
    println("Query successful. Table created.")
else
    println("Query failed. Table not created.")
end

Queries on the database can be executed by the execute_query() command, which takes the connection variable and the query as parameters. A sample SELECT query can be executed through the following steps:
```
query = """SELECT * FROM Student;"""
data = execute_query(conn, query)
```
To get the query results in the form of a Julia array, an extra parameter called opformat should be specified:
```
data_array = execute_query(conn, query, opformat = MYSQL_ARRAY)
```

Finally, to execute multiple queries at once, use the mysql_execute_multi_query() command:

query = """INSERT INTO Student (Name) VALUES ('');
UPDATE Student SET JoinDate = '08-07-15' WHERE LENGTH(Name) > 5;"""
rows = mysql_execute_multi_query(conn, query)
println("Rows updated by the query: $rows")

PostgreSQL

Data handling within a PostgreSQL database can be done by connecting to the database. Firstly, make sure that the database server is up and running. Now, the data in the database can be handled through the following procedure:

Firstly, import the requisite packages, which are the DBI and PostgreSQL packages:
```
using DBI
using PostgreSQL
```
In addition, the required packages for the PostgreSQL library are as follows:
- DataFrames.jl: This has already been installed previously.
- DataArrays.jl: This can be installed by running the statement Pkg.add("DataArrays").
Make a connection to a PostgreSQL database of your choice. It is done through the connect function, which takes in the type of database, the host, the username, the password, the database name, and the port number as input parameters. So, the connection can be established using the following statement:
```
conn = connect(Postgres, "localhost", "username", "password", "testdb", 5432)
```

If the connection is successful, a message similar to this appears on the screen:

PostgreSQL.PostgresDatabaseHandle(Ptr{Void} @0x00007fa8a559f160,0x00000000,false)

Now, prepare the query and tag it to the connection we prepared in the previous step. This can be done using the prepare function, which takes the connection and the query as parameters. So, the execution statement looks something like this:
```
query = prepare(conn, "SELECT 1::int, 2.0::double precision, 'name'::character varying, " *
        "'name'::character(20);")
```
As the query is prepared, let's now execute it, just like we did for MySQL. To do this, we have to enter the query variable, which we created in the previous step, into the execute function. It is done as follows:
```
result = execute(query)
```
Now that the query execution is over, the connection can be disconnected using the finish and disconnect functions, which take the query and the connection variables as the input parameters, respectively. The statements can be executed as follows:
```
finish(query)
disconnect(conn)
```
Now, the results of the query are in the result variable, which can be used for analytics by converting it into a dataframe or any other data structure of your choice. The same method can be used for all operations on PostgreSQL databases, which include inserting, updating, and deleting data.

SQLite

Import the SQLite package into the current session. Because SQLite is serverless, no database server needs to be running. The package can be imported by running the following command:
```
using SQLite
```
Now, a connection to any database can be made through the SQLiteDB() function in Julia version 0.3 and the SQLite.DB() function in Julia version 0.4.
The connection can be made in Julia version 0.4 as follows:
```
db = SQLite.DB("dbname.sqlite")
```
The connection can be made in Julia version 0.3 as follows:
```
db = SQLiteDB("dbname.sqlite")
```
Now that the connection is made, queries can be executed using the query() function in version 0.3 and the SQLite.query() function in version 0.4.
- In version 0.3:
```
query(db, "A SQL query")
```
- In version 0.4:
```
SQLite.query(db, "A SQL query")
```

The SQLite.jl package also allows the user to use macros and registers for manipulating and using data. However, the concepts are beyond the scope of this chapter.

So, these are some of the ways through which data can be handled in Julia. There are many databases whose connectors plug directly into DBI, such as SQLite, MySQL, and so on, and their queries can be prepared and executed as shown in the PostgreSQL section. Similarly, data can be scraped from the Internet and used for analytics, which can be achieved through a combination of Julia libraries, but that is beyond the scope of this book.

There's more...

MySQL

The following resource helps you learn more about its advanced features and provides information about the MySQL.jl library of Julia. This includes performance benchmarks and details, as well as information on CRUD and testing:

https://github.com/JuliaDB/MySQL.jl

PostgreSQL

Visit https://github.com/JuliaDB/DBI.jl to better understand the DBI, which we use to connect to local PostgreSQL databases.

Visit https://github.com/JuliaDB/PostgreSQL.jl for extended and in-depth documentation on the PostgreSQL.jl library, which includes dealing with Amazon web services, and so on.

SQLite

Now, as you have learned the ways in which data can be extracted, manipulated, and worked on from various external sources, there are some more interesting things that the database drivers of Julia can do apart from just executing queries. You can find those at https://github.com/JuliaDB/SQLite.jl/blob/master/OLD_README.md#custom-scalar-functions .

Interacting with the Web

In this section, you will learn how to interact with the Web through HTTP requests, both for getting data from and posting data to the Web. You will learn about sending requests to websites, receiving their responses, and analyzing those responses.

Getting ready

Start by downloading and installing the Requests.jl package of Julia, which can be installed with Pkg.add("Requests").

Make sure that you have an active Internet connection while reading and using the code in the recipe, as it deals with interacting with live websites on the Web. You can experiment with this recipe on the website http://httpbin.org , as it is designed especially for such experiments and tutorials.

This is how you load the Requests.jl package and import the required functions:

Start by loading the package:
```
using Requests
```
Next, import the necessary functions from the package for quick use. The functions that will be used in this recipe are get, post, put, and delete. So, this is how to import them:
```
import Requests: get, post, put, delete
```

How to do it...

Here, you will learn how to interact with the Web through the HTTP protocol and requests. You will also learn how to send and receive data, and autofill forms on the Internet, through HTTP requests.

GET request

The GET request is used to request data from a specified web resource. So, this is how we send the GET request to a website:
```
get("url of the website")
```
To request a specific web page inside the website, the query parameter of the GET command can be used to specify the web page. This is how you do it:
```
get("url of the website"; query = Dict("title" => 
        "page number/page name"))
```
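To show what a query dictionary amounts to on the wire, the following sketch builds the URL that such a request would typically target. It is a simplified illustration, assuming a recent Julia (1.x), and not the code Requests.jl uses internally; the helpers percent_encode and with_query are hypothetical:

```julia
# Hypothetical helper: percent-encode every byte except the unreserved
# characters (letters, digits, '-', '.', '_', '~').
function percent_encode(s::AbstractString)
    io = IOBuffer()
    for byte in codeunits(s)
        c = Char(byte)
        if byte < 0x80 && (isletter(c) || isdigit(c) || c in "-._~")
            write(io, byte)
        else
            print(io, '%', uppercase(string(byte, base = 16, pad = 2)))
        end
    end
    return String(take!(io))
end

# Hypothetical helper: append key=value pairs to a base URL.
# The pairs are sorted so that the result does not depend on Dict ordering.
function with_query(url, query)
    parts = [percent_encode(k) * "=" * percent_encode(v) for (k, v) in query]
    return url * "?" * join(sort(parts), "&")
end

url = with_query("http://httpbin.org/get", Dict("title" => "page 2", "lang" => "en"))
println(url)
```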
Timeouts can also be set for the GET requests. This is useful for identifying unresponsive websites or web pages. The timeout parameter takes a numeric value, in seconds, as the timeout threshold; if the server does not return any data within this time, a timeout error is thrown. This is how you set it:
```
get("url of the website"; timeout = 0.5)
```
- Here, 0.5 means 0.5 seconds, that is, 500 ms.
Some websites redirect users to different web pages or sometimes to different websites. So, to avoid getting your request repeatedly redirected, you can set the max_redirects and allow_redirects parameters in the GET request. This is how they can be set:
```
get("url of the website"; max_redirects = 4)
```
To prevent the site from redirecting your GET requests at all, set the allow_redirects parameter to false:
```
get("url of the website"; allow_redirects = false)
```
- This would not allow the website to redirect your GET request. If a redirect is triggered, it throws an error.
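
The effect of these two parameters can be sketched without any network access. The following example simulates redirects with an in-memory table. It assumes a recent Julia (1.x) and assumes that exceeding max_redirects raises an error; the REDIRECTS table and the fetch_with_redirects function are hypothetical and only mirror the parameter names used above:

```julia
# Hypothetical in-memory "web": each key redirects to its value;
# any URL that is not a key serves content directly.
const REDIRECTS = Dict("/a" => "/b", "/b" => "/c")

# Sketch of a client that follows redirects up to a limit.
function fetch_with_redirects(url; max_redirects = 5, allow_redirects = true)
    hops = 0
    while haskey(REDIRECTS, url)
        allow_redirects || error("redirect from $url not allowed")
        hops += 1
        hops > max_redirects && error("more than $max_redirects redirects")
        url = REDIRECTS[url]
    end
    return (final_url = url, hops = hops)
end

result = fetch_with_redirects("/a"; max_redirects = 4)
println(result)
```

With allow_redirects = false, the very first redirect raises an error, which matches the behavior described above.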

POST request

The POST request submits data to a specific web resource. So, this is how to send a POST request to a website:
```
post("url of the website")
```
Data can be sent to a web resource through the POST request by adding it into the data parameter in the POST request statement:
```
post("url of the website"; data = "Data to be sent")
```
Data for filling forms on the Web also can be sent through the POST request through the same data parameter, but the data should now be sent in the form of a Julia dictionary data structure:
```
post("url of the website"; data = Dict("First_Name" => "abc",
        "Last_Name" => "xyz"))
```
Data such as session cookies can also be sent through the POST request by including the session details inside a Julia Dictionary and including it in the POST request as the cookies parameter:
```
post("url of the website"; cookies = Dict("sessionkey" => "key"))
```
Files can also be sent to web resources through the POST requests. This can be done by including the files in the files parameter of the POST request:
```
file = "xyz.jl"
# Open the file and pass its content type, field name, and filename to FileParam
post("url of the website"; files = [FileParam(open(file, "r"), "text/julia",
        "file_name", "file_name.jl")])
```

There's more...

There are more HTTP requests with which you can interact with web resources, such as the PUT and DELETE requests. All of them can be studied in detail in the documentation for the Requests.jl package, which is available at https://github.com/JuliaWeb/Requests.jl .