You're reading from Fast Data Processing with Spark 2 - Third Edition
The Spark shell is an excellent tool for rapid prototyping with Spark. It works with both Scala and Python, and it lets you interact with the Spark cluster, putting the full API at your command. It is great for debugging, trying things out, or interactively exploring new Datasets and approaches.
The previous chapter should have gotten you to the point of having a Spark instance running; now all you need to do is start your Spark shell and point it at your running instance with the appropriate command.
For local mode, Spark starts an instance when you invoke the Spark shell or launch a Spark program from an IDE, so a local installation on a Mac or Linux PC/laptop is sufficient to start exploring the Spark shell. Not having to spin up a real cluster for prototyping is an important and useful feature of Spark. The Quick Start guide at http://spark.apache.org/docs/latest/quick-start.html is a good reference.
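When launching from an IDE rather than the shell, a local-mode session can be created programmatically. The following is a minimal sketch, assuming Spark is on the classpath; the application name is an arbitrary placeholder:

```scala
import org.apache.spark.sql.SparkSession

// Build a local-mode session; "local[*]" uses all available cores.
val spark = SparkSession.builder()
  .appName("shell-experiments")   // hypothetical app name
  .master("local[*]")
  .getOrCreate()

// The SparkContext plays the role of the `sc` value the shell provides.
val sc = spark.sparkContext

// ... use sc / spark exactly as you would in the shell ...

spark.stop()
```

The shell sets up `spark` and `sc` for you automatically; this sketch only matters when you are outside the shell.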
Let's download a Dataset and do some experimentation. One of the best (if not the best) books on machine learning is The Elements of Statistical Learning by Trevor Hastie, Robert Tibshirani, and Jerome H. Friedman, Springer. The book's site has an interesting set of Datasets. Let's grab the spam Dataset using the following command:
wget http://www-stat.stanford.edu/~tibs/ElemStatLearn/datasets/spam.data
Alternatively, you can find the spam Dataset from the GitHub link at https://github.com/xsankar/fdps-v3.
Note
All the examples assume that you have downloaded the repository into the fdps-v3 directory in your home folder, that is, ~/fdps-v3/. Please adjust the directory name if you have downloaded it somewhere else.
Now, load it as a text file into Spark with the following commands inside your Spark shell:
scala> val inFile = sc.textFile("data/spam.data")
scala> inFile.count()
This loads the spam.data file into Spark, with each line being a separate entry in the Resilient Distributed Dataset (RDD).
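Each line of spam.data is a space-separated row of numeric features. A natural next step in the shell is to parse those lines into doubles; the sketch below shows the parsing logic on a plain Scala string (the sample row is a made-up, shortened example), with the corresponding RDD call, which assumes the `inFile` value from the snippet above, left as a comment:

```scala
// Parse one space-separated line of spam.data into numeric features.
// This helper is plain Scala, so it works with or without Spark.
def parseLine(line: String): Array[Double] =
  line.trim.split(' ').map(_.toDouble)

val sample = "0.0 0.64 0.64 1"   // a hypothetical shortened row
val features = parseLine(sample)
println(features.length)          // 4
println(features(1))              // 0.64

// Inside the Spark shell, the same function maps over the whole RDD:
// val nums = inFile.map(parseLine)   // RDD[Array[Double]]
// nums.first()
```

Keeping the parsing in a named function like this makes it easy to test on a single string before applying it across the whole Dataset.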
Now let's try another exercise with the Spark shell. As part of Amazon's EMR Spark support, Amazon has handily provided some sample data of Wikipedia traffic statistics in S3, in a format that Spark can use. To access the data, you first need to set your AWS access credentials as shell parameters. For instructions on signing up for EC2 and setting up the shell parameters, see the Running Spark on EC2 with the scripts section in Chapter 1, Installing Spark and Setting Up Your Cluster (S3 access requires additional keys, such as fs.s3n.awsAccessKeyId/awsSecretAccessKey, or the use of the s3n://user:pw@ syntax). You can also set the shell parameters as AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY. We will leave the AWS configuration out of this discussion, but it needs to be completed.
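The fs.s3n.* keys mentioned above can also be set on the SparkContext's Hadoop configuration from inside the shell. The following is a minimal sketch, assuming the two environment variables are already exported; the bucket path is a placeholder, not the real EMR sample location:

```scala
// A sketch of wiring S3 credentials inside the Spark shell; `sc` is the
// SparkContext the shell already provides.
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", sys.env("AWS_ACCESS_KEY_ID"))
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", sys.env("AWS_SECRET_ACCESS_KEY"))

// With the keys in place, an S3 path can be read like any other text file:
// val stats = sc.textFile("s3n://some-bucket/path/to/wikistats")
```

Reading credentials from the environment, rather than pasting them into the shell, keeps keys out of your shell history and any saved session transcripts.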
Tip
This is a slightly advanced topic and needs a few S3 configurations (which we won't cover here). Stack Overflow has two good links on this, namely http://stackoverflow.com/questions...