Using the AWS web interface to manage and run your projects is time-consuming. We will, therefore, start running our projects via the command line with the AWS Command Line Interface (AWS CLI). With just one tool to download and configure, you can control multiple AWS services from the command line and automate them through scripts.
This article is an excerpt from a book written by Alexis Perrier titled Effective Amazon Machine Learning.
Building a well-performing predictive model from raw data requires many trials and errors: creating new features, cleaning up data, and trying out new parameters for the model, with constant back and forth between the data, the models, and the evaluations. Scripting this workflow via the AWS CLI will let us speed up the create, test, select loop.
In order to set up your CLI credentials, you need your access key ID and your secret access key. You can simply create them from the IAM console (https://console.aws.amazon.com/iam).
Navigate to Users, select your IAM user name, and click on the Security credentials tab. Choose Create access key and download the CSV file. Store the keys in a secure location; we will need them in a few minutes to set up the AWS CLI. But first, we need to install it.
There is no need to rewrite the AWS documentation on how to install the AWS CLI. It is complete and up to date, and available at http://docs.aws.amazon.com/cli/latest/userguide/installing.html. In a nutshell, installing the CLI requires you to have Python and pip already installed.
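If you are unsure whether Python and pip are present on your machine, you can check their versions first:
$ python --version
$ pip --version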
Then, run the following:
$ pip install --upgrade --user awscli
Add AWS to your $PATH:
$ export PATH=~/.local/bin:$PATH
Reload the bash configuration file (this is for OSX):
$ source ~/.bash_profile
Check that everything works with the following command:
$ aws --version
You should see something similar to the following output:
aws-cli/1.11.47 Python/3.5.2 Darwin/15.6.0 botocore/1.5.10
Once installed, we need to configure the AWS CLI. Type the following:
$ aws configure
Now input the access keys you just created:
AWS Access Key ID [None]: ABCDEF_THISISANEXAMPLE
AWS Secret Access Key [None]: abcdefghijk_THISISANEXAMPLE
Default region name [None]: us-west-2
Default output format [None]: json
Choose the region that is closest to you and the format you prefer (JSON, text, or table). JSON is the default format.
The aws configure command creates two files: a config file and a credentials file. On OSX, the files are ~/.aws/config and ~/.aws/credentials. You can directly edit these files to change your access or configuration. You will need to create different profiles if you need to access multiple AWS accounts. You can do so via the aws configure command:
$ aws configure --profile user2
You can also do so directly in the config and credential files:
~/.aws/config
[default]
output = json
region = us-east-1
[profile user2]
output = text
region = us-west-2
You can edit the credentials file as follows:
~/.aws/credentials
[default]
aws_access_key_id = ABCDEF_THISISANEXAMPLE
aws_secret_access_key = abcdefghijk_THISISANEXAMPLE
[user2]
aws_access_key_id = ABCDEF_ANOTHERKEY
aws_secret_access_key = abcdefghijk_ANOTHERKEY
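To run a command under a specific profile, pass the --profile option; without it, the default profile is used. For example:
$ aws s3 ls --profile user2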
Refer to the AWS CLI setup page for more in-depth information:
http://docs.aws.amazon.com/cli/latest/userguide/cli-chap-getting-started.html
The overall format of any AWS CLI command is as follows:
$ aws <service> [options] <command> <subcommand> [parameters]
Here, <service> is the name of the AWS service (s3, ec2, machinelearning, and so on), [options] are global options such as --region, --profile, or --output, <command> and <subcommand> specify the operation to perform, and [parameters] are the arguments of that operation.
A simple example will help you understand the syntax better. To list the content of an S3 bucket named aml.packt, the command is as follows:
$ aws s3 ls aml.packt
Here, s3 is the service, ls is the command, and aml.packt is the parameter. The aws help command will output a list of all available services.
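Help is available at every level of the command hierarchy; for example, the following display the documentation for the s3 service and for its cp command:
$ aws s3 help
$ aws s3 cp help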
There are many more examples and explanations on the AWS documentation available at
http://docs.aws.amazon.com/cli/latest/userguide/cli-chap-using.html.
For some services and commands, the list of parameters can become long and difficult to check and maintain.
For instance, in order to create an Amazon ML model via the CLI, you need to specify at least seven different elements: the model ID, name, and type, the model's parameters, the ID of the training datasource, and the recipe or its URI (see aws machinelearning create-ml-model help).
When possible, we will use the CLI's ability to read parameters from a JSON file instead of specifying them on the command line. The AWS CLI also offers a way to generate a JSON template, which you can then fill in with the right parameters. To generate that JSON parameter template (the JSON skeleton), simply add --generate-cli-skeleton after the command name. For instance, to generate the JSON skeleton for the create-ml-model command of the machine learning service, write the following:
$ aws machinelearning create-ml-model --generate-cli-skeleton
This will give the following output:
{ "MLModelId": "", "MLModelName": "", "MLModelType": "", "Parameters": { "KeyName": "" }, "TrainingDataSourceId": "", "Recipe": "", "RecipeUri": "" }
You can then configure this to your liking.
To have the skeleton command generate a JSON file and not simply output the skeleton in the terminal, add > filename.json:
$ aws machinelearning create-ml-model --generate-cli-skeleton > filename.json
This will create a filename.json file with the JSON template. Once all the required parameters are specified, you create the model with the command (assuming the filename.json is in the current folder):
$ aws machinelearning create-ml-model file://filename.json
Before we dive further into the machine learning workflow via the CLI, we need to introduce the dataset we will be using in this chapter.
We will use the Ames Housing dataset that was compiled by Dean De Cock for use in data science education. It is a great alternative to the popular but older Boston Housing dataset. The Ames Housing dataset is used in the Advanced Regression Techniques challenge on the Kaggle website: https://www.kaggle.com/c/house-prices-advanced-regression-techniques/. The original version of the dataset is available at http://www.amstat.org/publications/jse/v19n3/decock/AmesHousing.xls and in the GitHub repository for this chapter.
For more information on the genesis of this dataset and an in-depth explanation of the different variables, read the paper by Dean De Cock available in PDF at https://ww2.amstat.org/publications/jse/v19n3/decock.pdf.
We will start by splitting the dataset into a train and a validate set and build a model on the train set. Both train and validate sets are available in the GitHub repository as ames_housing_training.csv and ames_housing_validate.csv. The entire dataset is in the ames_housing.csv file.
We will use shell commands to shuffle the Ames Housing dataset and split it into training and validation subsets (gshuf is the GNU shuf command, available on macOS via the coreutils package; on Linux, use shuf instead):
$ head -n 1 ames_housing.csv > ames_housing_header.csv
$ tail -n +2 ames_housing.csv > ames_housing_nohead.csv
$ gshuf ames_housing_nohead.csv -o ames_housing_nohead.csv
$ head -n 2050 ames_housing_nohead.csv > ames_housing_training.csv
$ tail -n 880 ames_housing_nohead.csv > ames_housing_validate.csv
$ cat ames_housing_header.csv ames_housing_training.csv > tmp.csv
$ mv tmp.csv ames_housing_training.csv
$ cat ames_housing_header.csv ames_housing_validate.csv > tmp.csv
$ mv tmp.csv ames_housing_validate.csv
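As a quick sanity check, wc -l should report 2,051 lines for the training file (2,050 rows plus the header) and 881 for the validation file:
$ wc -l ames_housing_training.csv ames_housing_validate.csv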
We are now ready to execute a simple Amazon ML workflow using the CLI. This includes the following: uploading the datasets to S3, creating the training and validation datasources, creating and training the model, and evaluating it on the validation set.
Let's start by uploading the training and validation files to S3. In the following lines, replace the bucket name aml.packt with your own bucket name.
To upload the files to the S3 location s3://aml.packt/data/ch8/, run the following command lines:
$ aws s3 cp ./ames_housing_training.csv s3://aml.packt/data/ch8/
upload: ./ames_housing_training.csv to s3://aml.packt/data/ch8/ames_housing_training.csv
$ aws s3 cp ./ames_housing_validate.csv s3://aml.packt/data/ch8/
upload: ./ames_housing_validate.csv to s3://aml.packt/data/ch8/ames_housing_validate.csv
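You can verify that both files are in place by listing the prefix:
$ aws s3 ls s3://aml.packt/data/ch8/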
That's it for the S3 part. Now let's explore the CLI for Amazon's machine learning service.
All Amazon ML CLI commands are available at http://docs.aws.amazon.com/cli/latest/reference/machinelearning/. There are 30 commands, which can be grouped by object and action.
You can create, get, describe, update, and delete the following elements: datasources, ML models, evaluations, and batch predictions. You can also create and delete real-time prediction endpoints.
You can also handle tags and set waiting times.
Note that the AWS CLI gives you the ability to create datasources from S3, Redshift, and RDS, while the web interface only allows datasources from S3 and Redshift.
We will start by creating the datasource. Let's first see what parameters are needed by generating the following skeleton:
$ aws machinelearning create-data-source-from-s3 --generate-cli-skeleton
This generates the following JSON object:
{ "DataSourceId": "", "DataSourceName": "", "DataSpec": { "DataLocationS3": "", "DataRearrangement": "", "DataSchema": "", "DataSchemaLocationS3": "" }, "ComputeStatistics": true }
The different parameters are mostly self-explanatory and further information can be found on the AWS documentation at http://docs.aws.amazon.com/cli/latest/reference/machinelearning/create-data-source-from-s3.html.
A word on the schema: when creating a datasource from the web interface, you can use a wizard that guides you through the creation of the schema. The wizard facilitates the process by guessing the type of each variable and proposing a default schema that you can then modify.
There is no default schema available via the AWS CLI. You have to define the entire schema yourself, either in a JSON format in the DataSchema field or by uploading a schema file to S3 and specifying its location, in the DataSchemaLocationS3 field.
Since our dataset has many variables (79), we cheated and used the wizard to create a default schema that we then uploaded to S3. Throughout the rest of the chapter, we will specify the schema location, not its JSON definition.
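For reference, here is a sketch of what such a schema file looks like. This is an abbreviated, illustrative example (only a few of the 79 attributes are shown, and the attribute names and types are assumptions based on the Ames Housing columns), not the exact file generated by the wizard:
{
    "version": "1.0",
    "targetAttributeName": "SalePrice",
    "dataFormat": "CSV",
    "dataFileContainsHeader": true,
    "attributes": [
        { "attributeName": "Lot Area", "attributeType": "NUMERIC" },
        { "attributeName": "Neighborhood", "attributeType": "CATEGORICAL" },
        { "attributeName": "Year Built", "attributeType": "NUMERIC" },
        { "attributeName": "SalePrice", "attributeType": "NUMERIC" }
    ]
}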
In this example, we will create the following datasource parameter file, dsrc_ames_housing_001.json:
{ "DataSourceId": "ch8_ames_housing_001", "DataSourceName": "[DS] Ames Housing 001", "DataSpec": { "DataLocationS3": "s3://aml.packt/data/ch8/ames_housing_training.csv", "DataSchemaLocationS3": "s3://aml.packt/data/ch8/ames_housing.csv.schema" }, "ComputeStatistics": true }
For the validation subset (save to dsrc_ames_housing_002.json):
{ "DataSourceId": "ch8_ames_housing_002", "DataSourceName": "[DS] Ames Housing 002", "DataSpec": { "DataLocationS3": "s3://aml.packt/data/ch8/ames_housing_validate.csv", "DataSchemaLocationS3": "s3://aml.packt/data/ch8/ames_housing.csv.schema" }, "ComputeStatistics": true }
Since we have already split our data into training and validation sets, there's no need to specify the DataRearrangement field.
Alternatively, we could have avoided splitting our dataset and specified the following DataRearrangement on the original dataset, assuming it had already been shuffled (save to dsrc_ames_housing_003.json):
{ "DataSourceId": "ch8_ames_housing_003", "DataSourceName": "[DS] Ames Housing training 003", "DataSpec": { "DataLocationS3": "s3://aml.packt/data/ch8/ames_housing_shuffled.csv", "DataRearrangement": "{"splitting":{"percentBegin":0,"percentEnd":70}}", "DataSchemaLocationS3": "s3://aml.packt/data/ch8/ames_housing.csv.schema" }, "ComputeStatistics": true }
For the validation set (save to dsrc_ames_housing_004.json):
{ "DataSourceId": "ch8_ames_housing_004", "DataSourceName": "[DS] Ames Housing validation 004", "DataSpec": { "DataLocationS3": "s3://aml.packt/data/ch8/ames_housing_shuffled.csv", "DataRearrangement": "{"splitting":{"percentBegin":70,"percentEnd":100}}", }, "ComputeStatistics": true }
Here, the ames_housing.csv file has previously been shuffled using the gshuf command and uploaded to S3 as ames_housing_shuffled.csv:
$ gshuf ames_housing_nohead.csv -o ames_housing_nohead.csv
$ cat ames_housing_header.csv ames_housing_nohead.csv > tmp.csv
$ mv tmp.csv ames_housing_shuffled.csv
$ aws s3 cp ./ames_housing_shuffled.csv s3://aml.packt/data/ch8/
Note that we don't need to create these four datasources; these are just examples of alternative ways to create datasources.
We then create the training datasource by running the following:
$ aws machinelearning create-data-source-from-s3 --cli-input-json file://dsrc_ames_housing_001.json
The datasource creation is asynchronous; the command returns immediately with the datasource ID we had specified:
{ "DataSourceId": "ch8_ames_housing_001" }
We can then check the status of the datasource creation and obtain information about it with the following:
$ aws machinelearning get-data-source --data-source-id ch8_ames_housing_001
This returns the following:
{ "Status": "COMPLETED", "NumberOfFiles": 1, "CreatedByIamUser": "arn:aws:iam::178277xxxxxxx:user/alexperrier", "LastUpdatedAt": 1486834110.483, "DataLocationS3": "s3://aml.packt/data/ch8/ames_housing_training.csv", "ComputeStatistics": true, "StartedAt": 1486833867.707, "LogUri": "https://eml-prod-emr.s3.amazonaws.com/178277513911-ds-ch8_ames_housing_001/.....", "DataSourceId": "ch8_ames_housing_001", "CreatedAt": 1486030865.965, "ComputeTime": 880000, "DataSizeInBytes": 648150, "FinishedAt": 1486834110.483, "Name": "[DS] Ames Housing 001" }
Note that we have access to the operation log URI, which could be useful to analyze the model training later on.
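The validation datasource is created in the same way, pointing to the second parameter file:
$ aws machinelearning create-data-source-from-s3 --cli-input-json file://dsrc_ames_housing_002.json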
Creating the model with the create-ml-model command follows the same steps:
$ aws machinelearning create-ml-model --generate-cli-skeleton > mdl_ames_housing_001.json
{ "MLModelId": "ch8_ames_housing_001", "MLModelName": "[MDL] Ames Housing 001", "MLModelType": "REGRESSION", "Parameters": { "sgd.shuffleType": "auto", "sgd.l2RegularizationAmount": "1.0E-06", "sgd.maxPasses": "100" }, "TrainingDataSourceId": "ch8_ames_housing_001", "RecipeUri": "s3://aml.packt/data/ch8 /recipe_ames_housing_001.json" }
Note the parameters of the algorithm: here, we use mild L2 regularization (1.0E-06) and 100 passes over the data. We create and train the model by running the following:
$ aws machinelearning create-ml-model --cli-input-json file://mdl_ames_housing_001.json
{ "MLModelId": "ch8_ames_housing_001" }
We can follow the model creation and training with the get-ml-model command:
$ aws machinelearning get-ml-model --ml-model-id ch8_ames_housing_001
To poll the status automatically, wrap the command in watch:
$ watch -n 10 aws machinelearning get-ml-model --ml-model-id ch8_ames_housing_001
The output of get-ml-model will be refreshed every 10 seconds until you stop it (Ctrl + C).
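If you only care about the status field, you can also filter the output with the CLI's --query option (a JMESPath expression supported by all AWS CLI commands):
$ aws machinelearning get-ml-model --ml-model-id ch8_ames_housing_001 --query "Status"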
Our model is now trained, and we would like to evaluate it on the validation subset. For that, we will use the create-evaluation CLI command:
$ aws machinelearning create-evaluation --generate-cli-skeleton > eval_ames_housing_001.json
{ "EvaluationId": "ch8_ames_housing_001", "EvaluationName": "[EVL] Ames Housing 001", "MLModelId": "ch8_ames_housing_001", "EvaluationDataSourceId": "ch8_ames_housing_002" }
We create the evaluation and check on its status with the following commands:
$ aws machinelearning create-evaluation --cli-input-json file://eval_ames_housing_001.json
$ aws machinelearning get-evaluation --evaluation-id ch8_ames_housing_001
"PerformanceMetrics": { "Properties": { "RegressionRMSE": "29853.250469108018" } }
The value may seem large, but it is relative to the range of the SalePrice variable, which has a mean of 181,300 and a standard deviation of 79,886.7. An RMSE of 29,853.2 is roughly 16% of the mean price and well below one standard deviation, so it is a decent score.
At this point, we have a trained and evaluated model.
In this tutorial, we covered the detailed steps to get started with the AWS CLI and implemented a simple Amazon ML project to get comfortable with it.
To understand how to leverage Amazon's powerful platform for your predictive analytics needs, check out the book Effective Amazon Machine Learning.