Machine Learning With Go

By Daniel Whitenack

About this book

The mission of this book is to turn readers into productive, innovative data analysts who leverage Go to build robust and valuable applications. To this end, the book clearly introduces the technical aspects of building predictive models in Go, but it also helps the reader understand how machine learning workflows are being applied in real-world scenarios.

Machine Learning with Go shows readers how to be productive in machine learning while also producing applications that maintain a high level of integrity. It also gives readers patterns to overcome challenges that are often encountered when trying to integrate machine learning in an engineering organization.

Readers will begin by gaining a solid understanding of how to gather, organize, and parse real-world data from a variety of sources. They will then develop a solid statistical toolkit that will allow them to quickly understand and gain intuition about the content of a dataset. Next, readers will gain hands-on experience implementing essential machine learning techniques (regression, classification, clustering, and so on) with the relevant Go packages.

By the end of the book, the reader will have a solid machine learning mindset and a powerful Go toolkit of techniques, packages, and example implementations.

Publication date:
September 2017
Publisher
Packt
Pages
304
ISBN
9781785882104

 

Chapter 1. Gathering and Organizing Data

Polls have shown that 90% or more of a data scientist's time is spent gathering data, organizing it, and cleaning it, not training/tuning their sophisticated machine learning models. Why is this? Isn't the machine learning part the fun part? Why do we need to care so much about the state of our data? Firstly, without data, our machine learning models can't learn. This might seem obvious. However, we need to realize that part of the strength of the models that we build is in the data that we feed them. As the common phrase goes, garbage in, garbage out. We need to make sure that we gather relevant, clean data to power our machine learning models, such that they can operate on the data as expected and produce valuable results.

Not all types of data are appropriate when using certain types of models. For example, certain models do not perform well when we have high-dimensional data (for example, text data), and other models assume that variables are normally distributed, which is definitely not always the case. Thus, we must take care in gathering data that fits our use case and make sure that we understand how our data and models will interact.

Another reason why gathering and organizing data consumes so much of a data scientist's time is that data is often messy and hard to aggregate. In most organizations, data might be housed in various systems and formats, and have various access control policies. We can't assume that supplying a training set to our model will be as easy as specifying a file path; this is often not the case.

To form a training/test set or to supply variables to a model for predictions, we will likely need to deal with various formats of data, such as CSV, JSON, database tables, and so on, and we will likely need to transform individual values. Common transformations include parsing date times, converting categorical data to numerical data, normalizing values, and applying some function across values. However, we can't always assume that all values of a certain variable are present or able to be parsed in a similar manner.

Often data includes missing values, mixed types, or corrupted values. How we handle each of these scenarios will directly influence the quality of the models that we build, and thus, we have to be willing to carefully gather, organize, and understand our data.
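
As a rough illustration of the transformations mentioned above, the following is a minimal sketch using only the standard library. The function names (parseDate, encodeCategories, and minMaxNormalize) and the YYYY-MM-DD date layout are assumptions made for this example. Note that each function surfaces an error rather than silently producing a value:

```go
package main

import (
	"errors"
	"fmt"
	"log"
	"time"
)

// parseDate parses a date string in the (assumed) YYYY-MM-DD layout.
func parseDate(s string) (time.Time, error) {
	return time.Parse("2006-01-02", s)
}

// encodeCategories maps each distinct label to an integer code, assigned
// in order of first appearance, and returns the codes with the mapping.
func encodeCategories(labels []string) ([]int, map[string]int) {
	mapping := make(map[string]int)
	codes := make([]int, len(labels))
	for i, label := range labels {
		code, ok := mapping[label]
		if !ok {
			code = len(mapping)
			mapping[label] = code
		}
		codes[i] = code
	}
	return codes, mapping
}

// minMaxNormalize rescales values to the range [0, 1]. It returns an
// error for empty input or when all of the values are equal.
func minMaxNormalize(vals []float64) ([]float64, error) {
	if len(vals) == 0 {
		return nil, errors.New("no values to normalize")
	}
	lo, hi := vals[0], vals[0]
	for _, v := range vals {
		if v < lo {
			lo = v
		}
		if v > hi {
			hi = v
		}
	}
	if hi == lo {
		return nil, errors.New("all values are equal")
	}
	out := make([]float64, len(vals))
	for i, v := range vals {
		out[i] = (v - lo) / (hi - lo)
	}
	return out, nil
}

func main() {
	d, err := parseDate("2017-09-01")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(d.Year(), d.Month())

	codes, mapping := encodeCategories([]string{"Iris-setosa", "Iris-versicolor", "Iris-setosa"})
	fmt.Println(codes, mapping)

	norm, err := minMaxNormalize([]float64{1, 2, 3})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(norm)
}
```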

Even though much of this book will be focused on various modeling techniques, you should always consider data gathering, parsing, and organization as a (or maybe the) key component of a successful data science project. If this part of your project is not carefully developed with a high level of integrity, you are setting yourself up for trouble in the long run.

 

Handling data - Gopher style


In comparison to many other languages that are used for data science/analysis, Go provides a very strong foundation for data manipulation and parsing. Although other languages (for example, Python or R) may allow users to quickly explore data interactively, they often promote integrity-breaking convenience, that is, dynamic and interactive data exploration often results in code that behaves strangely when applied more generally.

Take, for instance, this simple CSV file:

1,blah1
2,blah2
3,blah3

It is true that, very quickly, we can write some Python code to parse this CSV and output the maximum value from the integer column without even knowing what types are in the data:

import pandas as pd

# Define column names.
cols = [
 'integercolumn',
 'stringcolumn'
 ]

# Read in the CSV with pandas.
data = pd.read_csv('myfile.csv', names=cols)

# Print out the maximum value in the integer column.
print(data['integercolumn'].max())

This simple program will print the correct result:

$ python myprogram.py
3

We now remove one of the integer values to produce a missing value, as shown here:

1,blah1
2,blah2
,blah3

The Python program consequently has a complete breakdown in integrity; specifically, the program still runs, doesn't tell us that anything went differently, still produces a value, and produces a value of a different type:

$ python myprogram.py
2.0

This is unacceptable. All but one of our integer values could disappear, and we wouldn't have any insight into the changes. This could produce profound changes in our modeling, but they would be extremely hard to track down. Generally, when we opt for the conveniences of dynamic types and abstraction, we are accepting this sort of variability in behavior.

Note

The important thing here is not that you cannot handle such behavior in Python; Python experts will quickly recognize that you can properly handle such behavior. The point is that such conveniences do not promote integrity by default, and thus, it is very easy to shoot yourself in the foot.

On the other hand, we can leverage Go's static typing and explicit error handling to ensure that our data is parsed as expected. In this small example, we can also write some Go code, without too much trouble, to parse our CSV (don't worry about the details right now):

// Open the CSV.
f, err := os.Open("myfile.csv")
if err != nil {
    log.Fatal(err)
}
defer f.Close()

// Read in the CSV records.
r := csv.NewReader(f)
records, err := r.ReadAll()
if err != nil {
    log.Fatal(err)
}

// Get the maximum value in the integer column.
var intMax int
for _, record := range records {

    // Parse the integer value.
    intVal, err := strconv.Atoi(record[0])
    if err != nil {
        log.Fatal(err)
    }

    // Replace the maximum value if appropriate.
    if intVal > intMax {
        intMax = intVal
    }
}

// Print the maximum value.
fmt.Println(intMax)

This will produce the same correct result for the CSV file with all the integer values present:

$ go build
$ ./myprogram
3

But in contrast to our previous Python code, our Go code will inform us when we encounter something that we don't expect in the input CSV (for the case when we remove the value 3):

$ go build
$ ./myprogram
2017/04/29 12:29:45 strconv.ParseInt: parsing "": invalid syntax

Here, we have maintained integrity, and we can ensure that we can handle missing values in a manner that is appropriate for our use case.
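
What "appropriate for our use case" means will vary. As one possibility, the following minimal sketch skips empty values in the first column but counts them, so that the caller always knows how many values were missing. The function name maxIgnoringMissing and the policy of treating only empty strings as missing are assumptions made for this example; any other unparseable value is still reported as an error:

```go
package main

import (
	"errors"
	"fmt"
	"log"
	"strconv"
)

// maxIgnoringMissing returns the maximum integer found in the first
// column of records, along with the number of rows whose first column
// was empty. Values that are present but not integers cause an error.
func maxIgnoringMissing(records [][]string) (int, int, error) {
	var (
		maxVal  int
		found   bool
		missing int
	)
	for i, record := range records {

		// Count missing values explicitly instead of hiding them.
		if len(record) == 0 || record[0] == "" {
			missing++
			continue
		}

		// Parse the integer value.
		v, err := strconv.Atoi(record[0])
		if err != nil {
			return 0, missing, fmt.Errorf("row %d: %w", i, err)
		}

		// Replace the maximum value if appropriate.
		if !found || v > maxVal {
			maxVal, found = v, true
		}
	}
	if !found {
		return 0, missing, errors.New("no integer values present")
	}
	return maxVal, missing, nil
}

func main() {
	records := [][]string{
		{"1", "blah1"},
		{"2", "blah2"},
		{"", "blah3"},
	}
	maxVal, missing, err := maxIgnoringMissing(records)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("max = %d, missing = %d\n", maxVal, missing)
}
```

Tracking whether a value has been found, rather than starting from zero, also keeps the result correct when all of the integers are negative.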

 

Best practices for gathering and organizing data with Go


As you can see in the preceding section, Go itself provides us with an opportunity to maintain high levels of integrity in our data gathering, parsing, and organization. We want to ensure that we leverage Go's unique properties whenever we are preparing our data for machine learning workflows.

Generally, Go data scientists/analysts should follow these best practices when gathering and organizing data. They are meant to help you maintain integrity in your applications and enable you to reproduce any analysis:

  1. Check for and enforce expected types: This might seem obvious, but it is too often overlooked when using dynamically typed languages. Although it is slightly verbose, explicitly parsing data into expected types and handling related errors can save you big headaches down the road.
  2. Standardize and simplify your data ingress/egress: There are many third-party packages for handling certain types of data or interactions with certain sources of data (some of which we will cover in this book). However, if you standardize the ways you are interacting with data sources, particularly centered around the use of stdlib, you can develop predictable patterns and maintain consistency within your team. A good example of this is a choice to utilize database/sql for database interactions rather than using various third-party APIs and DSLs.
  3. Version your data: Machine learning models produce extremely different results depending on the training data you use, your choice of parameters, and input data. Thus, it is impossible to reproduce results without versioning both your code and data. We will discuss the appropriate techniques for data versioning later in this chapter.

Note

If you start to stray from these general principles, you should stop immediately. You are likely to sacrifice integrity for the sake of convenience, which is a dangerous road. We will let these principles guide us through the book and as we consider various data formats/sources in the following section.

 

CSV files


CSV files might not be a go-to format for big data, but as a data scientist or developer working in machine learning, you are sure to encounter this format. You might need a mapping of zip codes to latitude/longitude and find this as a CSV file on the internet, or you may be given sales figures from your sales team in a CSV format. In any event, we need to understand how to parse these files.

The main package that we will utilize in parsing CSV files is encoding/csv from Go's standard library. However, we will also discuss a couple of packages that allow us to quickly manipulate or transform CSV data: github.com/kniren/gota/dataframe and go-hep.org/x/hep/csvutil.

Reading in CSV data from a file

Let's consider a simple CSV file, which we will return to later, named iris.csv (available here: https://archive.ics.uci.edu/ml/datasets/iris). This CSV file includes four float columns of flower measurements and a string column with the corresponding flower species:

$ head iris.csv
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
5.0,3.6,1.4,0.2,Iris-setosa
5.4,3.9,1.7,0.4,Iris-setosa
4.6,3.4,1.4,0.3,Iris-setosa
5.0,3.4,1.5,0.2,Iris-setosa
4.4,2.9,1.4,0.2,Iris-setosa
4.9,3.1,1.5,0.1,Iris-setosa

With encoding/csv imported, we first open the CSV file and create a CSV reader value:

// Open the iris dataset file.
f, err := os.Open("../data/iris.csv")
if err != nil {
    log.Fatal(err)
}
defer f.Close()

// Create a new CSV reader reading from the opened file.
reader := csv.NewReader(f)

Then we can read in all of the records (corresponding to rows) of the CSV file. These records are imported as [][]string:

// Assume we don't know the number of fields per line. By setting
// FieldsPerRecord negative, each row may have a variable
// number of fields.
reader.FieldsPerRecord = -1

// Read in all of the CSV records.
rawCSVData, err := reader.ReadAll()
if err != nil {
    log.Fatal(err)
}

We can also read in records one at a time in an infinite loop. Just make sure that you check for the end of the file (io.EOF) so that the loop ends after reading in all of your data:

// Create a new CSV reader reading from the opened file.
reader := csv.NewReader(f)
reader.FieldsPerRecord = -1

// rawCSVData will hold our successfully parsed rows.
var rawCSVData [][]string

// Read in the records one by one.
for {

    // Read in a row. Check if we are at the end of the file.
    record, err := reader.Read()
    if err == io.EOF {
        break
    }

    // Stop on any other read error rather than appending a bad row.
    if err != nil {
        log.Fatal(err)
    }

    // Append the record to our dataset.
    rawCSVData = append(rawCSVData, record)
}

Note

If your CSV file is not delimited by commas and/or if your CSV file contains commented rows, you can utilize the csv.Reader.Comma and csv.Reader.Comment fields to properly handle uniquely formatted CSV files. In cases where the fields in your CSV file are single-quoted, you may need to add in a helper function to trim the single quotes and parse the values.
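
To make this concrete, here is a minimal sketch that reads semicolon-delimited data with # comment lines and single-quoted fields. The helper name readSemicolonCSV and the sample data are assumptions made for this example. Comma and Comment are fields of csv.Reader, while the quote trimming is done by hand because encoding/csv only treats double quotes specially:

```go
package main

import (
	"encoding/csv"
	"fmt"
	"log"
	"strings"
)

// readSemicolonCSV parses semicolon-delimited data, ignores lines that
// start with '#', and trims single quotes surrounding each field.
func readSemicolonCSV(data string) ([][]string, error) {
	reader := csv.NewReader(strings.NewReader(data))
	reader.Comma = ';'
	reader.Comment = '#'

	records, err := reader.ReadAll()
	if err != nil {
		return nil, err
	}

	// Trim leading and trailing single quotes from every field.
	for _, record := range records {
		for i, field := range record {
			record[i] = strings.Trim(field, "'")
		}
	}
	return records, nil
}

func main() {
	data := "# sepal length; species\n'5.1';'Iris-setosa'\n'4.9';'Iris-setosa'\n"

	records, err := readSemicolonCSV(data)
	if err != nil {
		log.Fatal(err)
	}
	for _, record := range records {
		fmt.Println(record)
	}
}
```

Note that strings.Trim removes every leading and trailing single quote, so this simple approach would not be appropriate for values that legitimately begin or end with a quote character.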

Handling unexpected fields

The preceding methods work fine with clean CSV data, but, in general, we don't encounter clean data. We have to parse messy data. For example, you might find unexpected fields or numbers of fields in your CSV records. This is why reader.FieldsPerRecord exists. This field of the reader value lets us easily handle messy data, as follows:

4.3,3.0,1.1,0.1,Iris-setosa
5.8,4.0,1.2,0.2,Iris-setosa
5.7,4.4,1.5,0.4,Iris-setosa
5.4,3.9,1.3,0.4,blah,Iris-setosa
5.1,3.5,1.4,0.3,Iris-setosa
5.7,3.8,1.7,0.3,Iris-setosa
5.1,3.8,1.5,0.3,Iris-setosa

This version of the iris.csv file has an extra field in one of the rows. We know that each record should have five fields, so let's set our reader.FieldsPerRecord value to 5:

// We should have 5 fields per line. By setting
// FieldsPerRecord to 5, we can validate that each of the
// rows in our CSV has the correct number of fields.
reader.FieldsPerRecord = 5

Then as we are reading in records from the CSV file, we can check for unexpected fields and maintain the integrity of our data:

// rawCSVData will hold our successfully parsed rows.
var rawCSVData [][]string

// Read in the records looking for unexpected numbers of fields.
for {

    // Read in a row. Check if we are at the end of the file.
    record, err := reader.Read()
    if err == io.EOF {
        break
    }

    // If we had a parsing error, log the error and move on.
    if err != nil {
        log.Println(err)
        continue
    }

    // Append the record to our dataset, if it has the expected
    // number of fields.
    rawCSVData = append(rawCSVData, record)
}

Here, we have chosen to handle the error by logging the error, and we only collect successfully parsed records into rawCSVData. The reader will note that this error could be handled in many different ways. The important thing is that we are forcing ourselves to check for an expected property of the data and increasing the integrity of our application.
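
As one example of an alternative, the following minimal sketch skips and counts only the rows with an unexpected number of fields, and aborts on any other parsing error. The function name readValidRows is an assumption made for this example. It relies on csv.ErrFieldCount, which encoding/csv wraps in a *csv.ParseError, so errors.Is (available in Go 1.13 and later) can detect it:

```go
package main

import (
	"encoding/csv"
	"errors"
	"fmt"
	"io"
	"log"
	"strings"
)

// readValidRows returns the rows that have exactly the given number of
// fields, along with a count of the rows skipped for having the wrong
// number of fields. Any other parsing error aborts the read.
func readValidRows(data string, fields int) ([][]string, int, error) {
	reader := csv.NewReader(strings.NewReader(data))
	reader.FieldsPerRecord = fields

	var rows [][]string
	skipped := 0
	for {
		record, err := reader.Read()
		if err == io.EOF {
			break
		}

		// Skip rows with an unexpected number of fields, but keep count.
		if errors.Is(err, csv.ErrFieldCount) {
			skipped++
			continue
		}

		// Treat every other error as fatal to the read.
		if err != nil {
			return nil, skipped, err
		}
		rows = append(rows, record)
	}
	return rows, skipped, nil
}

func main() {
	data := "4.3,3.0,1.1,0.1,Iris-setosa\n" +
		"5.4,3.9,1.3,0.4,blah,Iris-setosa\n" +
		"5.1,3.5,1.4,0.3,Iris-setosa\n"

	rows, skipped, err := readValidRows(data, 5)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("kept %d rows, skipped %d\n", len(rows), skipped)
}
```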

Handling unexpected types

We just saw that CSV data is read into Go as [][]string. However, Go is statically typed, which allows us to enforce strict checks for each of the CSV fields. We can do this as we parse each field for further processing. Consider some messy data that has random fields that don't match the type of the other values in a column:

4.6,3.1,1.5,0.2,Iris-setosa
5.0,string,1.4,0.2,Iris-setosa
5.4,3.9,1.7,0.4,Iris-setosa
5.3,3.7,1.5,0.2,Iris-setosa
5.0,3.3,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.4,3.2,4.5,1.5,
6.9,3.1,4.9,1.5,Iris-versicolor
5.5,2.3,4.0,1.3,Iris-versicolor
4.9,3.1,1.5,0.1,Iris-setosa
5.0,3.2,1.2,string,Iris-setosa
5.5,3.5,1.3,0.2,Iris-setosa
4.9,3.1,1.5,0.1,Iris-setosa
4.4,3.0,1.3,0.2,Iris-setosa

To check the types of the fields in our CSV records, let's create a struct type to hold successfully parsed values:

// CSVRecord contains a successfully parsed row of the CSV file.
type CSVRecord struct {
    SepalLength  float64
    SepalWidth   float64
    PetalLength  float64
    PetalWidth   float64
    Species      string
    ParseError   error
}

Then, before we loop over the records, let's initialize a slice of these values:

// Create a slice value that will hold all of the successfully parsed
// records from the CSV.
var csvData []CSVRecord

Now as we loop over the records, we can parse into the relevant type for that record, catch any errors, and log as needed:

// Read in the records looking for unexpected types.
for {

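    // Read in a row. Check if we are at the end of the file.
    record, err := reader.Read()
    if err == io.EOF {
        break
    }

    // If the row itself could not be read, log the error and move on.
    if err != nil {
        log.Println(err)
        continue
    }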

    // Create a CSVRecord value for the row.
    var csvRecord CSVRecord

    // Parse each of the values in the record based on an expected type.
    for idx, value := range record {

        // Parse the value in the record as a string for the string column.
        if idx == 4 {

            // Validate that the value is not an empty string. If the
            // value is an empty string break the parsing loop.
            if value == "" {
                log.Printf("Unexpected type in column %d\n", idx)
                csvRecord.ParseError = fmt.Errorf("Empty string value")
                break
            }

            // Add the string value to the CSVRecord.
            csvRecord.Species = value
            continue
        }

        // Otherwise, parse the value in the record as a float64.
        var floatValue float64

        // If the value can not be parsed as a float, log and break the
        // parsing loop.
        if floatValue, err = strconv.ParseFloat(value, 64); err != nil {
            log.Printf("Unexpected type in column %d\n", idx)
            csvRecord.ParseError = fmt.Errorf("Could not parse float")
            break
        }

        // Add the float value to the respective field in the CSVRecord.
        switch idx {
        case 0:
            csvRecord.SepalLength = floatValue
        case 1:
            csvRecord.SepalWidth = floatValue
        case 2:
            csvRecord.PetalLength = floatValue
        case 3:
            csvRecord.PetalWidth = floatValue
        }
    }

    // Append successfully parsed records to the slice defined above.
    if csvRecord.ParseError == nil {
        csvData = append(csvData, csvRecord)
    }
}
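
The same checks can also be factored into a small function that is easier to test in isolation. The following is a minimal sketch of that idea, not a replacement for the preceding loop. The names irisRecord and parseIrisRecord are assumptions made for this example, and the function returns the error to the caller instead of storing it in the struct:

```go
package main

import (
	"errors"
	"fmt"
	"log"
	"strconv"
)

// irisRecord holds one successfully parsed row of the iris data.
type irisRecord struct {
	SepalLength float64
	SepalWidth  float64
	PetalLength float64
	PetalWidth  float64
	Species     string
}

// parseIrisRecord converts a raw CSV record into an irisRecord,
// returning an error that names the first column that failed.
func parseIrisRecord(record []string) (irisRecord, error) {
	var rec irisRecord
	if len(record) != 5 {
		return rec, fmt.Errorf("expected 5 fields, got %d", len(record))
	}

	// The first four columns must parse as float64 values.
	floats := []*float64{&rec.SepalLength, &rec.SepalWidth, &rec.PetalLength, &rec.PetalWidth}
	for idx, dst := range floats {
		v, err := strconv.ParseFloat(record[idx], 64)
		if err != nil {
			return irisRecord{}, fmt.Errorf("column %d: %w", idx, err)
		}
		*dst = v
	}

	// The last column must be a non-empty species name.
	if record[4] == "" {
		return irisRecord{}, errors.New("column 4: empty species")
	}
	rec.Species = record[4]
	return rec, nil
}

func main() {
	rows := [][]string{
		{"4.6", "3.1", "1.5", "0.2", "Iris-setosa"},
		{"5.0", "string", "1.4", "0.2", "Iris-setosa"},
		{"6.4", "3.2", "4.5", "1.5", ""},
	}
	for i, row := range rows {
		rec, err := parseIrisRecord(row)
		if err != nil {
			log.Printf("row %d: %v", i, err)
			continue
		}
		fmt.Printf("%+v\n", rec)
	}
}
```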

Manipulating CSV data with data frames

As you can see, manually parsing many different fields and performing row-by-row operations can be rather verbose and tedious. This is definitely not an excuse to increase complexity and import a bunch of non-standard functionality. You should still default to the use of encoding/csv in most cases.

However, manipulation of data frames has proven to be a successful and somewhat standardized way (in the data science community) of dealing with tabular data. Thus, in some cases, it is worth employing some third-party functionality to manipulate tabular data, such as CSV data. For example, data frames and the corresponding functionality can be very useful when you are trying to filter, subset, and select portions of tabular datasets. In this section, we will introduce github.com/kniren/gota/dataframe, a wonderful dataframe package for Go:

import "github.com/kniren/gota/dataframe" 

To create a data frame from a CSV file, we open a file with os.Open() and then supply the returned pointer to the dataframe.ReadCSV() function:

// Open the CSV file.
irisFile, err := os.Open("iris.csv")
if err != nil {
    log.Fatal(err)
}
defer irisFile.Close()

// Create a dataframe from the CSV file.
// The types of the columns will be inferred.
irisDF := dataframe.ReadCSV(irisFile)

// As a sanity check, display the records to stdout.
// Gota will format the dataframe for pretty printing.
fmt.Println(irisDF)

If we compile and run this Go program, we will see a nice, pretty-printed version of our data with the types that were inferred during parsing:

$ go build
$ ./myprogram
[150x5] DataFrame

 sepal_length sepal_width petal_length petal_width species 
 0: 5.100000 3.500000 1.400000 0.200000 Iris-setosa
 1: 4.900000 3.000000 1.400000 0.200000 Iris-setosa
 2: 4.700000 3.200000 1.300000 0.200000 Iris-setosa
 3: 4.600000 3.100000 1.500000 0.200000 Iris-setosa
 4: 5.000000 3.600000 1.400000 0.200000 Iris-setosa
 5: 5.400000 3.900000 1.700000 0.400000 Iris-setosa
 6: 4.600000 3.400000 1.400000 0.300000 Iris-setosa
 7: 5.000000 3.400000 1.500000 0.200000 Iris-setosa
 8: 4.400000 2.900000 1.400000 0.200000 Iris-setosa
 9: 4.900000 3.100000 1.500000 0.100000 Iris-setosa
 ... ... ... ... ... 
 <float> <float> <float> <float> <string>

Once we have the data parsed into a dataframe, we can filter, subset, and select our data easily:

// Create a filter for the dataframe.
filter := dataframe.F{
    Colname: "species",
    Comparator: "==",
    Comparando: "Iris-versicolor",
}

// Filter the dataframe to see only the rows where
// the iris species is "Iris-versicolor".
versicolorDF := irisDF.Filter(filter)
if versicolorDF.Err != nil {
    log.Fatal(versicolorDF.Err)
}

// Filter the dataframe again, but only select out the
// sepal_width and species columns.
versicolorDF = irisDF.Filter(filter).Select([]string{"sepal_width", "species"})

// Filter and select the dataframe again, but only display
// the first three results.
versicolorDF = irisDF.Filter(filter).Select([]string{"sepal_width", "species"}).Subset([]int{0, 1, 2})

This is really only scratching the surface of the github.com/kniren/gota/dataframe package. You can merge datasets, output to other formats, and even process JSON data. For more information about this package, you should visit the autogenerated GoDocs at https://godoc.org/github.com/kniren/gota/dataframe. Consulting the GoDocs is good practice, in general, for any packages we discuss in the book.

 

JSON


In a world in which the majority of data is accessed via the web, and most engineering organizations implement some number of microservices, we are going to encounter data in JSON format fairly frequently. We may only need to deal with it when pulling some random data from an API, or it might actually be the primary data format that drives our analytics and machine learning workflows.

Typically, JSON is used when ease of use is the primary goal of data interchange. Since JSON is human readable, it is easy to debug if something breaks. Remember that we want to maintain the integrity of our data handling as we process data with Go, and part of that process is ensuring that, when possible, our data is interpretable and readable. JSON turns out to be very useful in achieving these goals (which is why it is also used for logging, in many cases).

Go offers really great JSON functionality in its standard library with encoding/json. We will utilize this standard library functionality throughout the book.

Parsing JSON

To understand how to parse (that is, unmarshal) JSON data in Go, we will be using some data from the Citi Bike API (https://www.citibikenyc.com/system-data), a bike-sharing service operating in New York City. Citi Bike provides frequently updated operational information about its network of bike-sharing stations in JSON format at https://gbfs.citibikenyc.com/gbfs/en/station_status.json:

{
  "last_updated": 1495252868,
  "ttl": 10,
  "data": {
    "stations": [
      {
        "station_id": "72",
        "num_bikes_available": 10,
        "num_bikes_disabled": 3,
        "num_docks_available": 26,
        "num_docks_disabled": 0,
        "is_installed": 1,
        "is_renting": 1,
        "is_returning": 1,
        "last_reported": 1495249679,
        "eightd_has_available_keys": false
      },
      {
        "station_id": "79",
        "num_bikes_available": 0,
        "num_bikes_disabled": 0,
        "num_docks_available": 33,
        "num_docks_disabled": 0,
        "is_installed": 1,
        "is_renting": 1,
        "is_returning": 1,
        "last_reported": 1495248017,
        "eightd_has_available_keys": false
      },

      etc...

      {
        "station_id": "3464",
        "num_bikes_available": 1,
        "num_bikes_disabled": 3,
        "num_docks_available": 53,
        "num_docks_disabled": 0,
        "is_installed": 1,
        "is_renting": 1,
        "is_returning": 1,
        "last_reported": 1495250340,
        "eightd_has_available_keys": false
      }
    ]
  }
}

To import and parse this type of data in Go, we first need to import encoding/json (along with a couple of other things from the standard library, such as net/http, because we are going to pull this data off of the previously mentioned website). We will also define struct types that mimic the structure of the JSON shown in the preceding code:

import (
    "encoding/json"
    "fmt"
    "io/ioutil"
    "log"
    "net/http"
)

// citiBikeURL provides the station statuses of CitiBike bike sharing stations.
const citiBikeURL = "https://gbfs.citibikenyc.com/gbfs/en/station_status.json"

// stationData is used to unmarshal the JSON document returned from citiBikeURL.
type stationData struct {
    LastUpdated int `json:"last_updated"`
    TTL int `json:"ttl"`
    Data struct {
        Stations []station `json:"stations"`
    } `json:"data"`
}

// station is used to unmarshal each of the station documents in stationData.
type station struct {
    ID string `json:"station_id"`
    NumBikesAvailable int `json:"num_bikes_available"`
    NumBikesDisabled int `json:"num_bikes_disabled"`
    NumDocksAvailable int `json:"num_docks_available"`
    NumDocksDisabled int `json:"num_docks_disabled"`
    IsInstalled int `json:"is_installed"`
    IsRenting int `json:"is_renting"`
    IsReturning int `json:"is_returning"`
    LastReported int `json:"last_reported"`
    HasAvailableKeys bool `json:"eightd_has_available_keys"`
}

Note a couple of things here: (i) we have followed Go idioms by avoiding underscores in the struct field names, and (ii) we have utilized the json struct tags to label the struct fields with the corresponding expected fields in the JSON data.

Note

Note that, to properly parse JSON data, the struct fields need to be exported fields. That is, the fields need to begin with a capital letter. encoding/json cannot access fields via reflect unless they are exported.
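
The following minimal sketch illustrates this behavior. The type exampleStation, the helper decodeExample, and the JSON document are assumptions made for this example. The unexported bikes field is left at its zero value even though the JSON document contains a key with exactly that name, and no error is reported:

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
)

// exampleStation has one exported and one unexported field.
type exampleStation struct {
	ID    string `json:"station_id"`
	bikes int    // unexported, so encoding/json never fills it
}

// decodeExample unmarshals doc and returns the values of both fields.
func decodeExample(doc string) (string, int, error) {
	var s exampleStation
	if err := json.Unmarshal([]byte(doc), &s); err != nil {
		return "", 0, err
	}
	return s.ID, s.bikes, nil
}

func main() {
	id, bikes, err := decodeExample(`{"station_id": "72", "bikes": 10}`)
	if err != nil {
		log.Fatal(err)
	}

	// The ID is populated, but bikes is still 0.
	fmt.Printf("ID = %s, bikes = %d\n", id, bikes)
}
```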

Now we can get the JSON data from the URL and unmarshal it into a new stationData value. This will produce a struct variable with the respective fields filled with the data in the tagged JSON data fields. We can check it by printing out some data associated with one of the stations:

// Get the JSON response from the URL.
response, err := http.Get(citiBikeURL)
if err != nil {
    log.Fatal(err)
}
defer response.Body.Close()

// Read the body of the response into []byte.
body, err := ioutil.ReadAll(response.Body)
if err != nil {
    log.Fatal(err)
}

// Declare a variable of type stationData.
var sd stationData

// Unmarshal the JSON data into the variable.
if err := json.Unmarshal(body, &sd); err != nil {
    log.Fatal(err)
}

// Print the first station.
fmt.Printf("%+v\n\n", sd.Data.Stations[0])

When we run this, we can see that our struct contains the parsed data from the URL:

$ go build
$ ./myprogram
{ID:72 NumBikesAvailable:11 NumBikesDisabled:0 NumDocksAvailable:25 NumDocksDisabled:0 IsInstalled:1 IsRenting:1 IsReturning:1 LastReported:1495252934 HasAvailableKeys:false}

JSON output

Now let's say that we have the Citi Bike station data in our stationData struct value and we want to save that data out to a file. We can do this with json.Marshal:

// Marshal the data.
outputData, err := json.Marshal(sd)
if err != nil {
    log.Fatal(err)
}

// Save the marshalled data to a file.
if err := ioutil.WriteFile("citibike.json", outputData, 0644); err != nil {
    log.Fatal(err)
}
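
Because readability is one of the reasons for choosing JSON, it can be useful to write indented output when a person is expected to read the file. The following is a minimal sketch using json.MarshalIndent from the standard library. The type stationSummary and the helper prettyJSON are assumptions made for this example:

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
)

// stationSummary is a reduced, illustrative version of a station.
type stationSummary struct {
	ID                string `json:"station_id"`
	NumBikesAvailable int    `json:"num_bikes_available"`
}

// prettyJSON marshals v with two-space indentation.
func prettyJSON(v interface{}) (string, error) {
	out, err := json.MarshalIndent(v, "", "  ")
	if err != nil {
		return "", err
	}
	return string(out), nil
}

func main() {
	out, err := prettyJSON(stationSummary{ID: "72", NumBikesAvailable: 10})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(out)
}
```

The compact output of json.Marshal is usually preferable when the data is only consumed by other programs, since it is smaller.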
 

SQL-like databases


Although there is a good bit of hype around interesting NoSQL databases and key-value stores, SQL-like databases are still ubiquitous. Every data scientist will, at some point, be processing data from an SQL-like database, such as Postgres, MySQL, or SQLite.

For example, we may be required to query one or more tables in a Postgres database to generate a set of features for model training. After using that model to make predictions or identify anomalies, we may send results to another database table that drives a dashboard or other reporting tool.

Go, of course, interacts nicely with all the popular data stores, such as SQL, NoSQL, key-value, and so on, but here, we will focus on SQL-like interactions. We will utilize database/sql for these interactions throughout the book.

Connecting to an SQL database

The first thing we need to do before connecting to an SQL-like database is identify the particular database that we will be interacting with and import a corresponding driver. In the following examples, we will be connecting to a Postgres database and will utilize the github.com/lib/pq database driver for database/sql. This driver can be loaded via an empty import (with a corresponding comment):

import (
    "database/sql"
    "fmt"
    "log"
    "os"

    // pq is the library that allows us to connect
    // to postgres with database/sql.
    _ "github.com/lib/pq"
)

Now let's assume that you have exported the Postgres connection string to an environment variable named PGURL. We can easily create an sql.DB value for our connection via the following code:

// Get the postgres connection URL. I have it stored in
// an environment variable.
pgURL := os.Getenv("PGURL")
if pgURL == "" {
    log.Fatal("PGURL empty")
}

// Open a database value. Specify the postgres driver
// for database/sql.
db, err := sql.Open("postgres", pgURL)
if err != nil {
    log.Fatal(err)
}
defer db.Close()

Note that we need to defer the close method on this value. Also, note that creating this value does not mean that you have made a successful connection to the database. This is merely a value used by database/sql to connect to the database when triggered to do so by certain operations (such as a query).

To ensure that we can make a successful connection to the database, we can use the Ping method:

if err := db.Ping(); err != nil {
    log.Fatal(err)
}

Querying the database

Now that we know how to connect to the database, let's see how we can get data out of the database. We won't cover the specifics of SQL queries and statements in this book. If you are not familiar with SQL, I would highly recommend that you learn how to query, insert, and so on, but for our purposes here, you should know that there are basically two types of operations we want to perform as related to SQL databases:

  • A Query operation selects, groups, or aggregates data in the database and returns rows of data to us
  • An Exec operation updates, inserts, or otherwise modifies the state of the database without an expectation that portions of the data stored in the database should be returned

As you might expect, to get data out of our database, we will use a Query operation. To do this, we need to query the database with an SQL statement string. For example, if we have a database storing a bunch of iris flower measurements (petal length, petal width, and so on), we could query some of that data related to a particular iris species as follows:

// Query the database.
rows, err := db.Query(`
    SELECT 
        sepal_length as sLength, 
        sepal_width as sWidth, 
        petal_length as pLength, 
        petal_width as pWidth 
    FROM iris
    WHERE species = $1`, "Iris-setosa")
if err != nil {
    log.Fatal(err)
}
defer rows.Close()

Note that this returns a pointer to an sql.Rows value, and we need to defer the closing of this rows value. Then we can loop over our rows and parse the data into values of expected type. We utilize the Scan method on rows to parse out the columns returned by the SQL query and print them to standard out:

// Iterate over the rows, sending the results to
// standard out.
for rows.Next() {

    var (
        sLength float64
        sWidth float64
        pLength float64
        pWidth float64
    )

    if err := rows.Scan(&sLength, &sWidth, &pLength, &pWidth); err != nil {
        log.Fatal(err)
    }

    fmt.Printf("%.2f, %.2f, %.2f, %.2f\n", sLength, sWidth, pLength, pWidth)
}

Finally, we need to check for any errors that might have occurred while processing our rows. We want to maintain the integrity of our data handling, and we cannot assume that we looped over all the rows without encountering an error:

// Check for errors after we are done iterating over rows.
if err := rows.Err(); err != nil {
    log.Fatal(err)
}
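In practice, we usually want to keep the rows rather than print them. The following is a minimal sketch of that idea, assuming we collect the results into a slice of structs. The Iris struct, the rowIterator interface, and the fakeRows type are all illustrative names that are not part of the original example. rowIterator lists only the three methods of *sql.Rows used above, so a real rows value can be passed in directly, while fakeRows lets the sketch run without a database:

```go
package main

import (
	"errors"
	"fmt"
)

// Iris holds one row of the query result. The struct and its field
// names are illustrative assumptions.
type Iris struct {
	SepalLength, SepalWidth, PetalLength, PetalWidth float64
}

// rowIterator lists the methods of *sql.Rows used in the loop above,
// so that a *sql.Rows value satisfies it.
type rowIterator interface {
	Next() bool
	Scan(dest ...interface{}) error
	Err() error
}

// collectIris scans every row into an Iris value and reports any
// error met while scanning or iterating.
func collectIris(rows rowIterator) ([]Iris, error) {
	var out []Iris
	for rows.Next() {
		var r Iris
		if err := rows.Scan(&r.SepalLength, &r.SepalWidth, &r.PetalLength, &r.PetalWidth); err != nil {
			return nil, err
		}
		out = append(out, r)
	}
	if err := rows.Err(); err != nil {
		return nil, err
	}
	return out, nil
}

// fakeRows is an in-memory stand-in for *sql.Rows, used only so this
// sketch runs without a database.
type fakeRows struct {
	data [][4]float64
	pos  int
}

func (f *fakeRows) Next() bool { f.pos++; return f.pos <= len(f.data) }
func (f *fakeRows) Err() error { return nil }
func (f *fakeRows) Scan(dest ...interface{}) error {
	if len(dest) != 4 {
		return errors.New("expected 4 destinations")
	}
	for i, d := range dest {
		p, ok := d.(*float64)
		if !ok {
			return errors.New("destination is not *float64")
		}
		*p = f.data[f.pos-1][i]
	}
	return nil
}

func main() {
	rows := &fakeRows{data: [][4]float64{{5.1, 3.5, 1.4, 0.2}, {4.9, 3.0, 1.4, 0.2}}}
	flowers, err := collectIris(rows)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(len(flowers), flowers[0].SepalLength)
}
```

With a real database, you would pass the rows value returned by db.Query to collectIris and still defer rows.Close() as shown earlier.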

Modifying the database

As mentioned earlier, there is another flavor of interaction with the database called Exec. With these types of statements, we are concerned with updating, adding to, or otherwise modifying the state of one or more tables in the database. We use the same type of database connection, but instead of calling db.Query, we will call db.Exec.

For example, let's say we want to update some of the values in our iris database table:

// Update some values.
res, err := db.Exec("UPDATE iris SET species = 'setosa' WHERE species = 'Iris-setosa'")
if err != nil {
    log.Fatal(err)
}

But how do we know whether we were successful and changed something? Well, the res value returned here (an sql.Result) allows us to see how many rows of our table were affected by our update:

// See how many rows were updated.
rowCount, err := res.RowsAffected()
if err != nil {
    log.Fatal(err)
}

// Output the number of rows to standard out.
log.Printf("affected = %d\n", rowCount)
 

Caching


Sometimes, our machine learning algorithms will be trained on, or given prediction input from, external data sources (for example, APIs), that is, data that isn't local to the application running our modeling or analysis. Further, we might have various sets of data that are accessed frequently, may be accessed again soon, or need to remain available while the application is running.

In at least some of these cases, it might make sense to cache data in memory or embed the data locally where the application is running. For example, if you are reaching out to a government API (typically having high latency) for census data frequently, you may consider maintaining a local or in-memory cache of the census data being used so that you can avoid constantly reaching out to the API.
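As a rough illustration of the census API scenario above, the following sketch wraps a slow lookup in a small in-memory cache so that repeated requests for the same key reach the source only once. The readThrough type and the fetchCensus function are hypothetical names used only for this example, and the "API" is simulated by a local function:

```go
package main

import (
	"fmt"
	"sync"
)

// readThrough wraps a slow lookup function with an in-memory map so
// that each key is fetched from the source at most once.
type readThrough struct {
	mu    sync.Mutex
	data  map[string]string
	fetch func(key string) (string, error)
}

func newReadThrough(fetch func(string) (string, error)) *readThrough {
	return &readThrough{data: make(map[string]string), fetch: fetch}
}

// Get returns the cached value for key, calling fetch only on a miss.
func (r *readThrough) Get(key string) (string, error) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if v, ok := r.data[key]; ok {
		return v, nil
	}
	v, err := r.fetch(key)
	if err != nil {
		return "", err
	}
	r.data[key] = v
	return v, nil
}

func main() {
	calls := 0
	// fetchCensus is a hypothetical stand-in for a slow API request.
	fetchCensus := func(region string) (string, error) {
		calls++
		return "population data for " + region, nil
	}
	c := newReadThrough(fetchCensus)
	for i := 0; i < 3; i++ {
		if _, err := c.Get("indiana"); err != nil {
			fmt.Println("error:", err)
			return
		}
	}
	fmt.Println("API calls made:", calls)
}
```

This version holds its lock while fetching, which keeps the code short but serializes all lookups. It also never expires entries, so it only suits data that does not change while the application is running.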

Caching data in memory

To cache a series of values in memory, we will use github.com/patrickmn/go-cache. With this package, we can create an in-memory cache of keys and corresponding values. We can even specify settings, such as a time to live, for specific key-value pairs in the cache.

To create a new in-memory cache and set a key-value pair in the cache, we do the following:

// Create a cache with a default expiration time of 5 minutes, and which
// purges expired items every 30 seconds
c := cache.New(5*time.Minute, 30*time.Second)

// Put a key and value into the cache.
c.Set("mykey", "myvalue", cache.DefaultExpiration)

To then retrieve the value for mykey out of the cache, we just need to use the Get method:

v, found := c.Get("mykey")
if found {
    fmt.Printf("key: mykey, value: %s\n", v)
}
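To build some intuition for what a time to live means, here is a toy cache written with only the standard library. It is not how go-cache is implemented. The ttlCache type, its methods, and the injectable now function are assumptions made for this sketch, which only shows that an entry stops being returned once its expiry time has passed:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// entry pairs a cached value with the time at which it expires.
type entry struct {
	value   string
	expires time.Time
}

// ttlCache is a toy time-to-live cache that only illustrates expiry.
type ttlCache struct {
	mu    sync.Mutex
	items map[string]entry
	now   func() time.Time // injectable clock, so expiry can be simulated
}

func newTTLCache() *ttlCache {
	return &ttlCache{items: make(map[string]entry), now: time.Now}
}

// Set stores value under key and marks it as expiring after ttl.
func (c *ttlCache) Set(key, value string, ttl time.Duration) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.items[key] = entry{value: value, expires: c.now().Add(ttl)}
}

// Get returns the value for key unless it is missing or has expired.
// Expired entries are deleted when they are looked up.
func (c *ttlCache) Get(key string) (string, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	e, ok := c.items[key]
	if !ok {
		return "", false
	}
	if !c.now().Before(e.expires) {
		delete(c.items, key)
		return "", false
	}
	return e.value, true
}

func main() {
	c := newTTLCache()
	clock := time.Now()
	c.now = func() time.Time { return clock }

	c.Set("mykey", "myvalue", 5*time.Minute)
	_, found := c.Get("mykey")
	fmt.Println("found before expiry:", found)

	// Move the simulated clock past the time to live.
	clock = clock.Add(6 * time.Minute)
	_, found = c.Get("mykey")
	fmt.Println("found after expiry:", found)
}
```

Unlike this sketch, which only removes an expired entry when it is requested, the cache created above with cache.New also purges expired items periodically (every 30 seconds in that example).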

Caching data locally on disk

The caching we just saw is in memory. That is, the cached data exists and is accessible while your application is running, but as soon as your application exits, your data disappears. In some cases, you may want your cached data to stick around when your application restarts or exits. You may also want to back up your cache such that you don't have to start applications from scratch without a cache of relevant data.

In these scenarios, you may consider using a local, embedded cache, such as github.com/boltdb/bolt. BoltDB, as it is referred to, is a very popular project for these sorts of applications, and basically consists of a local key-value store. To initialize one of these local key-value stores, do the following:

// Open an embedded.db data file in your current directory.
// It will be created if it doesn't exist.
db, err := bolt.Open("embedded.db", 0600, nil)
if err != nil {
    log.Fatal(err)
}
defer db.Close()

// Create a "bucket" in the boltdb file for our data,
// if it doesn't already exist.
if err := db.Update(func(tx *bolt.Tx) error {
    _, err := tx.CreateBucketIfNotExists([]byte("MyBucket"))
    if err != nil {
        return fmt.Errorf("create bucket: %s", err)
    }
    return nil
}); err != nil {
    log.Fatal(err)
}

You can, of course, have multiple different buckets of data in your BoltDB and use a filename other than embedded.db.

Next, let's say you had a map of string values in memory that you need to cache in BoltDB. To do this, you would range over the keys and values in the map, updating your BoltDB:

// Put the map keys and values into the BoltDB file.
// myMap is assumed to be a map[string]string defined elsewhere.
if err := db.Update(func(tx *bolt.Tx) error {
    b := tx.Bucket([]byte("MyBucket"))
    for k, v := range myMap {
        if err := b.Put([]byte(k), []byte(v)); err != nil {
            return err
        }
    }
    return nil
}); err != nil {
    log.Fatal(err)
}

Then, to get values out of BoltDB, you can view your data:

// Output the keys and values in the embedded
// BoltDB file to standard out.
if err := db.View(func(tx *bolt.Tx) error {
    b := tx.Bucket([]byte("MyBucket"))
    c := b.Cursor()
    for k, v := c.First(); k != nil; k, v = c.Next() {
        fmt.Printf("key: %s, value: %s\n", k, v)
    }
    return nil
}); err != nil {
    log.Fatal(err)
}
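BoltDB keys and values are byte slices, so anything other than a string has to be converted to bytes before it can be stored. One possible approach, sketched below under the assumption that JSON is an acceptable encoding, is to marshal a struct before calling Put and unmarshal it after reading. The measurement type and the encode and decode helpers are hypothetical names for this example, which does not touch BoltDB itself:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// measurement is a hypothetical record we might want to cache.
type measurement struct {
	Species     string  `json:"species"`
	PetalLength float64 `json:"petal_length"`
}

// encode turns a measurement into bytes suitable for a key-value store.
func encode(m measurement) ([]byte, error) {
	return json.Marshal(m)
}

// decode reverses encode.
func decode(b []byte) (measurement, error) {
	var m measurement
	err := json.Unmarshal(b, &m)
	return m, err
}

func main() {
	raw, err := encode(measurement{Species: "Iris-setosa", PetalLength: 1.4})
	if err != nil {
		fmt.Println("encode error:", err)
		return
	}
	fmt.Printf("stored bytes: %s\n", raw)

	back, err := decode(raw)
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	fmt.Printf("decoded value: %+v\n", back)
}
```

The bytes returned by encode could be passed as the value argument to b.Put, and the bytes read back from the bucket could be passed to decode.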
 

Data versioning


As mentioned, machine learning models produce extremely different results depending on the training data you use, the choices of parameters, and the input data. It is essential to be able to reproduce results for collaborative, creative, and compliance reasons:

  • Collaboration: Despite what you see on social media, there are no data science and machine learning unicorns (that is, people with knowledge and capabilities in every area of data science and machine learning). We need our colleagues to review and improve on our work, and this is impossible if they aren't able to reproduce our model results and analyses.
  • Creativity: I don't know about you, but I have trouble remembering even what I did yesterday. We can't trust ourselves to always remember our reasoning and logic, especially when we are dealing with machine learning workflows. We need to track exactly what data we are using, what results we created, and how we created them. This is the only way we will be able to continually improve our models and techniques.
  • Compliance: Finally, we may soon have no choice regarding data versioning and reproducibility in machine learning. Laws are being passed around the world (for example, the General Data Protection Regulation (GDPR) in the European Union) that give users a right to an explanation for algorithmically made decisions. We simply cannot hope to comply with these regulations if we don't have a robust way of tracking what data we are processing and what results we are producing.

There are multiple open source data versioning projects. Some of these are focused on security and peer-to-peer distributed storage of data. Others are focused on data science workflows. In this book, we will focus on and utilize Pachyderm (http://pachyderm.io/), an open source framework for data versioning and data pipelining. Some of the reasons for this will be clear later in the book when we talk about production deploys and managing ML pipelines. For now, I will just summarize some of the features of Pachyderm that make it an attractive choice for data versioning in Go-based (and other) ML projects:

  • A convenient Go client, github.com/pachyderm/pachyderm/src/client
  • The ability to version any type and format of data
  • A flexible object store backing for the versioned data
  • Integration with a data pipelining system for driving versioned ML workflows

Pachyderm jargon

Think about versioning data in Pachyderm kind of like versioning code in Git. The primitives are similar:

  • Repositories: These are versioned collections of data, similar to having versioned collections of code in Git repositories
  • Commits: Data is versioned in Pachyderm by making commits of that data into data repositories
  • Branches: These are lightweight pointers to certain commits or sets of commits (for example, master points to the latest HEAD commit)
  • Files: Data is versioned at the file level in Pachyderm, and Pachyderm automatically employs strategies, such as de-duplication, to keep your versioned data space efficient
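To make the idea of de-duplication concrete, here is a toy sketch of one common strategy, content addressing, in which identical content is stored once and file names simply point at it. This is not Pachyderm's actual implementation. The store type and its methods are hypothetical and exist only to illustrate why versioning many copies of the same data need not multiply the space used:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// store keeps each distinct piece of content once, keyed by its hash,
// and maps file names to those hashes.
type store struct {
	blobs map[string][]byte // content hash -> content
	files map[string]string // file name -> content hash
}

func newStore() *store {
	return &store{blobs: make(map[string][]byte), files: make(map[string]string)}
}

// Put records name as pointing at content, storing the bytes only if
// no identical content is already present.
func (s *store) Put(name string, content []byte) {
	sum := sha256.Sum256(content)
	key := hex.EncodeToString(sum[:])
	if _, ok := s.blobs[key]; !ok {
		s.blobs[key] = append([]byte(nil), content...)
	}
	s.files[name] = key
}

// Get returns the content that name points at, if any.
func (s *store) Get(name string) ([]byte, bool) {
	key, ok := s.files[name]
	if !ok {
		return nil, false
	}
	return s.blobs[key], true
}

func main() {
	s := newStore()
	s.Put("a.csv", []byte("1,2,3\n"))
	s.Put("copy-of-a.csv", []byte("1,2,3\n"))
	s.Put("b.csv", []byte("4,5,6\n"))
	fmt.Printf("files: %d, stored blobs: %d\n", len(s.files), len(s.blobs))
}
```

Here three file names are tracked, but only two distinct pieces of content are kept, because two of the files are byte-for-byte identical.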

Note

Even though versioning data with Pachyderm feels similar to versioning code with Git, there are some major differences. For example, merging data doesn't exactly make sense. If there are merge conflicts on petabytes of data, no human could resolve these. Furthermore, the Git protocol would not be space efficient in general for large sets of data. Pachyderm uses its own internal logic to perform the versioning and work with versioned data, and the logic is both space efficient and processing efficient in terms of caching.

Deploying/installing Pachyderm

We will be using Pachyderm in various other places in the book to both version data and create distributed ML workflows. Pachyderm itself is an app that runs on top of Kubernetes (https://kubernetes.io/), and is backed by an object store of your choice. For the purposes of this book, development, and experimentation, you can easily install and run Pachyderm locally. It should take 5-10 minutes to install and doesn't require much effort. The instructions for the local installation can be found in the Pachyderm documentation at http://docs.pachyderm.io.

When you are ready to run your workflows in production or deploy your model, you can easily deploy a production-ready Pachyderm cluster that will behave exactly the same way as your local installation. Pachyderm can be deployed to any cloud, or even on premises.

As mentioned, Pachyderm is an open source project and has an active group of users. If you have questions or need help, you can join the public Pachyderm Slack channel by visiting http://slack.pachyderm.io/. The active Pachyderm users and the Pachyderm team itself will be able to respond very quickly to your questions there.

Creating data repositories for data versioning

If you followed the local installation of Pachyderm specified in the Pachyderm documentation, you should have the following:

  • Kubernetes running in a Minikube VM on your machine
  • The pachctl command line tool installed and connected to your Pachyderm cluster

Of course, if you have a production cluster running in a cloud, the following steps still apply. Your pachctl would just be connected to the remote cluster.

Note

We will be demonstrating data versioning functionality with the pachctl command-line interface (CLI) tool below (which is a Go program). However, as mentioned above, Pachyderm has a full-fledged Go client. You can create repositories, commit data, and much more directly from your Go programs. This functionality will be demonstrated later in Chapter 9, Deploying and Distributing Analyses and Models.

To create a repository of data called myrepo, you can run this command:

$ pachctl create-repo myrepo

You can then confirm that the repository exists with list-repo:

$ pachctl list-repo
NAME CREATED SIZE 
myrepo 2 seconds ago 0 B

This myrepo repository is a collection of data that we have defined and is ready to house versioned data. Right now, there is no data in the repository, because we haven't put any data there yet.

Putting data into data repositories

Let's say that we have a simple text file:

$ cat blah.txt 
This is an example file.

If this file is part of the data we are utilizing in our ML workflow, we should version it. To version this file in our repository, myrepo, we just need to commit it into that repository:

$ pachctl put-file myrepo master -c -f blah.txt

The -c flag specifies that we want Pachyderm to open a new commit, insert the file we are referencing, and close the commit all in one shot. The -f flag specifies that we are providing a file.

Note that we are committing a single file to the master branch of a single repository here. However, the Pachyderm API is incredibly flexible. We can commit, delete, or otherwise modify many versioned files in a single commit or over multiple commits. Further, these files could be versioned via a URL, object store link, database dump, and so on.

As a sanity check, we can confirm that our file was versioned in the repository:

$ pachctl list-repo
NAME CREATED SIZE 
myrepo 10 minutes ago 25 B 
$ pachctl list-file myrepo master
NAME TYPE SIZE 
blah.txt file 25 B

Getting data out of versioned data repositories

Now that we have versioned data in Pachyderm, we probably want to know how to interact with that data. The primary way is via Pachyderm data pipelines (which will be discussed later in this book). When using pipelines, the mechanism for interacting with versioned data is simple file I/O.

However, if we want to manually pull certain sets of versioned data out of Pachyderm and analyze them interactively, we can use the pachctl CLI to get data:

$ pachctl get-file myrepo master blah.txt
This is an example file.
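Returning to the pipeline case mentioned above, the following sketch suggests what "simple file I/O" can look like from a Go program's point of view: read every file in an input directory and write a result into an output directory. Inside a Pachyderm pipeline, these directories would typically be /pfs/myrepo and /pfs/out, but that detail and the upperCaseFiles and demo functions are assumptions made for this example. Temporary directories are used here so that the sketch runs anywhere:

```go
package main

import (
	"bytes"
	"fmt"
	"os"
	"path/filepath"
)

// upperCaseFiles reads every regular file in inDir and writes an
// upper-cased copy with the same name into outDir.
func upperCaseFiles(inDir, outDir string) error {
	entries, err := os.ReadDir(inDir)
	if err != nil {
		return err
	}
	for _, e := range entries {
		if e.IsDir() {
			continue
		}
		data, err := os.ReadFile(filepath.Join(inDir, e.Name()))
		if err != nil {
			return err
		}
		out := filepath.Join(outDir, e.Name())
		if err := os.WriteFile(out, bytes.ToUpper(data), 0644); err != nil {
			return err
		}
	}
	return nil
}

// demo sets up temporary input and output directories, runs
// upperCaseFiles, and returns the content of the processed file.
func demo() (string, error) {
	in, err := os.MkdirTemp("", "input")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(in)
	out, err := os.MkdirTemp("", "output")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(out)

	src := filepath.Join(in, "blah.txt")
	if err := os.WriteFile(src, []byte("This is an example file.\n"), 0644); err != nil {
		return "", err
	}
	if err := upperCaseFiles(in, out); err != nil {
		return "", err
	}
	result, err := os.ReadFile(filepath.Join(out, "blah.txt"))
	return string(result), err
}

func main() {
	result, err := demo()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Print(result)
}
```

The processing code needs no Pachyderm-specific calls; it only reads and writes files, which is what makes the same program easy to test locally.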
 


Summary


In this chapter, you learned how to gather, organize, and parse data. This is the first step, and one of the most important steps, in developing machine learning models, but having data does not get us very far if we do not gain some intuition about our data and put it into a standard form for processing. Next, we will tackle some techniques for further structuring our data (matrices) and for understanding our data (statistics and probability).

About the Author

  • Daniel Whitenack

    Daniel Whitenack is a trained PhD data scientist with over 10 years' experience working on data-intensive applications in industry and academia. Recently, Daniel has focused his development efforts on open source projects related to running machine learning (ML) and artificial intelligence (AI) in cloud-native infrastructure (Kubernetes, for instance), maintaining reproducibility and provenance for complex data pipelines, and implementing ML/AI methods in new languages such as Go. Daniel co-hosts the Practical AI podcast, teaches data science/engineering at Ardan Labs and Purdue University, and has spoken at conferences around the world (including ODSC, PyCon, DataEngConf, QCon, GopherCon, Spark Summit, and Applied ML Days, among others).
