Scala Data Analysis Cookbook

Product type: Book
Published: October 2015
Publisher: Packt Publishing
ISBN-13: 9781784396749
Pages: 254
Edition: 1st
Author: Arun Manivannan

Table of Contents (14 chapters)

Scala Data Analysis Cookbook
Credits
About the Author
About the Reviewers
www.PacktPub.com
Preface
1. Getting Started with Breeze
2. Getting Started with Apache Spark DataFrames
3. Loading and Preparing Data – DataFrame
4. Data Visualization
5. Learning from Data
6. Scaling Up
7. Going Further
Index

Index

A

  • Apache Spark
    • about / Introduction
    • tools / Introduction
    • URL / Getting Apache Spark
    • obtaining / How to do it...
  • apply method / Constructing a vector from values
  • arbitrary transformations
    • URL / How to do it...
  • ATLAS
    • URL / The org.scalanlp.breeze-natives package
  • Avro data model
    • using, in Parquet / Using the Avro data model in Parquet, How to do it…
    • URL / Using the Avro data model in Parquet
    • creating / Creation of the Avro model
    • schema_complex, URL / Creation of the Avro model
    • schema_primitive, URL / Creation of the Avro model
    • Avro objects, generating with sbt-avro plugin / Generation of Avro objects using the sbt-avro plugin
    • RDD of generated object, constructing from Students.csv / Constructing an RDD of our generated object from Students.csv

B

  • binary classification
    • LogisticRegression, using with Pipeline API / Binary classification using LogisticRegression with Pipeline API
  • binary classification, with LogisticRegression and SVM
    • about / Binary classification using LogisticRegression and SVM
  • Bokeh-Scala
    • URL / Introduction
    • used, for creating scatter plots / Creating scatter plots with Bokeh-Scala, How to do it...
    • glyph / How to do it...
    • plot / How to do it...
    • document / How to do it...
    • used, for creating time series MultiPlot / Creating a time series MultiPlot with Bokeh-Scala, How to do it...
  • Breeze
    • URL / Getting Breeze – the linear algebra library
    • about / Getting Breeze – the linear algebra library
    • breeze dependencies / How to do it...
    • breeze-native dependencies / How to do it...
    • obtaining / How to do it...
    • org.scalanlp.breeze dependency / The org.scalanlp.breeze dependency
    • org.scalanlp.breeze-natives package / The org.scalanlp.breeze-natives package
    • org.scalanlp.breeze-natives package, URL / The org.scalanlp.breeze-natives package
  • breeze-viz / Creating scatter plots with Bokeh-Scala
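
The Breeze entries above center on wiring the library into an SBT project. A minimal sketch of the two dependencies they describe (the version shown is illustrative; the book targets the 0.11.x line):

```scala
// build.sbt – breeze plus the breeze-natives package for BLAS-accelerated math
libraryDependencies ++= Seq(
  "org.scalanlp" %% "breeze" % "0.11.2",
  "org.scalanlp" %% "breeze-natives" % "0.11.2"
)
```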

C

  • chill library
    • reference link / How to do it...
  • classes
    • more than 22 features, loading / Loading more than 22 features into classes, How to do it..., How it works...
  • clustering
    • about / Clustering using K-means
    • K-means, using / Clustering using K-means, How to do it...
  • continuous values
    • predicting, with linear regression / Predicting continuous values using linear regression, How to do it...
  • CSV
    • DataFrame, creating from / Creating a DataFrame from CSV, How to do it..., There's more…
  • CSV files
    • reading / Reading and writing CSV files, How it works...
    • writing / Reading and writing CSV files, How it works...
  • csvread function / How it works...
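
For the csvread entry above, a minimal sketch of Breeze's CSV round trip (the file names are hypothetical):

```scala
import java.io.File
import breeze.linalg.{DenseMatrix, csvread, csvwrite}

// csvread parses a numeric CSV file into a DenseMatrix[Double]
val matrix: DenseMatrix[Double] = csvread(new File("data.csv"), separator = ',')

// csvwrite does the reverse, writing a matrix back out as CSV
csvwrite(new File("out.csv"), matrix, separator = ',')
```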

D

  • data
    • preparing, in Dataframes / Preparing data in Dataframes, How to do it...
    • pulling, from ElasticSearch / How to do it...
  • DataFrame
    • creating, from CSV / Creating a DataFrame from CSV, How to do it..., There's more…
    • URL / Creating a DataFrame from CSV, Manipulating DataFrames, Creating a DataFrame from Scala case classes
    • manipulating / Manipulating DataFrames, How to do it...
    • schema, printing / Printing the schema of the DataFrame
    • data, sampling / Sampling the data in the DataFrame
    • columns, selecting / Selecting DataFrame columns
    • data by condition, filtering / Filtering data by condition
    • data, sorting in frame / Sorting data in the frame
    • columns, renaming / Renaming columns
    • treating, as relational table / Treating the DataFrame as a relational table
    • two DataFrame, joining / Joining two DataFrames
    • inner join / Inner join
    • right outer join / Right outer join
    • left outer join / Left outer join
    • saving, as file / Saving the DataFrame as a file
    • creating, from Scala case classes / Creating a DataFrame from Scala case classes, How to do it..., How it works...
    • JSON, loading / Loading JSON into DataFrames, How to do it…
    • JSON file, reading with SQLContext.jsonFile / Reading a JSON file using SQLContext.jsonFile
    • text file, converting to JSON RDD / Reading a text file and converting it to JSON RDD
    • text file, reading / Reading a text file and converting it to JSON RDD
    • schema, explicitly specifying / Explicitly specifying your schema, There's more…
    • data, preparing / Preparing data in Dataframes, How to do it...
  • Directed Acyclic Graph (DAG) / Submitting jobs to the Spark cluster (local)
  • Dow Jones Index Data Set
    • URL / Creating a time series MultiPlot with Bokeh-Scala
  • Driver program / There's more…
  • DStreams
    • about / Using Spark Streaming to subscribe to a Twitter stream
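
The DataFrame entries above map onto the Spark 1.x API. A condensed sketch, assuming the spark-csv package is on the classpath and a hypothetical students.csv with a header row:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("DataFrames").setMaster("local[2]"))
val sqlContext = new SQLContext(sc)

// Creating a DataFrame from CSV via the com.databricks:spark-csv package
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("students.csv")

df.printSchema()                                   // printing the schema
df.select("name").show()                           // selecting columns
df.filter(df("age") > 20).show()                   // filtering by condition
df.sort(df("age").desc).show()                     // sorting
val renamed = df.withColumnRenamed("age", "years") // renaming a column
```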

E

  • EC2
    • Spark Standalone cluster, running / Running the Spark Standalone cluster on EC2
  • ElasticSearch
    • URL / Using Spark Streaming to subscribe to a Twitter stream
    • URL, for downloading installable / How to do it...
    • data, pulling from / How to do it...
  • ETL tool
    • Spark, using as / Using Spark as an ETL tool, How to do it...
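
For the ElasticSearch entries, a minimal sketch using the elasticsearch-spark connector's saveToEs (the index/type name and documents are hypothetical):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark._   // brings saveToEs into scope

val conf = new SparkConf()
  .setAppName("EsSketch")
  .setMaster("local[2]")
  .set("es.nodes", "localhost:9200")   // where ElasticSearch listens

val sc = new SparkContext(conf)

// Each Map becomes one document in the given index/type
val docs = sc.makeRDD(Seq(Map("user" -> "alice"), Map("user" -> "bob")))
docs.saveToEs("spark/docs")
```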

F

  • feature reduction
    • PCA, using / Feature reduction using principal component analysis
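
A compact sketch of PCA via Spark MLlib's RowMatrix, assuming an existing SparkContext sc and toy data:

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

val rows = sc.parallelize(Seq(
  Vectors.dense(1.0, 2.0, 3.0),
  Vectors.dense(4.0, 5.0, 6.0),
  Vectors.dense(7.0, 8.0, 10.0)
))

val mat = new RowMatrix(rows)

// Reduce three features down to two principal components
val pc = mat.computePrincipalComponents(2)
val projected = mat.multiply(pc)   // project the rows onto the components
```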

G

  • gradient descent
    • about / Gradient descent
  • Graphviz
    • URL / Transitive dependency stated explicitly in the SBT dependency
  • GraphX
    • about / Using GraphX to analyze Twitter data
    • used, for analyzing Twitter data / How to do it...
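
For the GraphX entries, a tiny follower-graph sketch (the vertices and edges are made up; assumes an existing SparkContext sc):

```scala
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD

// Vertices are (id, attribute) pairs; edges connect two vertex ids
val users: RDD[(VertexId, String)] =
  sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
val follows: RDD[Edge[String]] =
  sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(3L, 2L, "follows")))

val graph = Graph(users, follows)

// inDegrees gives a per-vertex follower count
graph.inDegrees.join(users).collect().foreach {
  case (_, (degree, name)) => println(s"$name has $degree follower(s)")
}
```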

H

  • Hadoop cluster
    • URL / Installing the Hadoop cluster
  • Hadoop Distributed File System (HDFS)
    • about / Loading JSON into DataFrames
    • URL / Loading JSON into DataFrames
  • HDFS
    • data, pushing / Pushing data into HDFS
  • head function / Using the tools to inspect the Parquet file
  • Hive table
    • URL / Save it as a Parquet file

I

  • instance-type
    • URL / Running the launch script
  • iris data
    • URL / How to do it...

J

  • Joda-Time API / Preparing our data

K

  • K-means
    • used, for clustering / Clustering using K-means
    • about / Clustering using K-means, How to do it...
    • KMeans.RANDOM / KMeans.RANDOM
    • KMeans.PARALLEL / KMeans.PARALLEL
    • max iterations / Max iterations
    • epsilon / Epsilon
    • data, importing / Importing the data and converting it into a vector
    • data, converting into vector / Importing the data and converting it into a vector
    • data, feature scaling / Feature scaling the data
    • number of clusters, deriving / Deriving the number of clusters
    • model, constructing / Constructing the model
    • model, evaluating / Evaluating the model
  • Kafka
    • setting up / How to do it...
  • Kafka server
    • starting / How to do it...
  • Kafka topic
    • creating / How to do it...
  • Kafka version 0.8.2.1, for Scala 2.10
    • URL / How to do it...
  • KMeans.PARALLEL
    • about / KMeans.PARALLEL
    • K-means++ / K-means++
    • K-means|| / K-means||
  • Kryo / Saving RDD[StudentAvro] in a Parquet file
  • KryoSerializer
    • about / Using Spark as an ETL tool
    • used, for publishing data to Kafka / How to do it...
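
The K-means entries follow Spark MLlib's KMeans closely. A minimal sketch, assuming an existing SparkContext sc and a hypothetical comma-separated input file:

```scala
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Importing the data and converting it into a vector
val data = sc.textFile("kmeans_data.txt")
  .map(line => Vectors.dense(line.split(',').map(_.toDouble)))
  .cache()

// train defaults to K_MEANS_PARALLEL (K-means||) initialization;
// KMeans.RANDOM is the alternative named in the entries above
val model = KMeans.train(data, 3 /* clusters */, 20 /* max iterations */)

// Within-set sum of squared errors is one way to evaluate the model
println(s"WSSSE = ${model.computeCost(data)}")
```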

L

  • legends property / Adding a legend to the plot
  • Lempel-Ziv-Oberhumer (LZO) / Enable compression for the Parquet file
  • linear regression
    • used, for predicting continuous values / Predicting continuous values using linear regression, How to do it...
    • data, importing / Importing the data
    • each instance, converting into LabeledPoint / Converting each instance into a LabeledPoint
    • training, preparing / Preparing the training and test data
    • test data, preparing / Preparing the training and test data
    • features, scaling / Scaling the features
    • model, training / Training the model
    • test data, predicting against / Predicting against test data
    • model, evaluating / Evaluating the model
    • parameters, regularizing / Regularizing the parameters
    • mini batching / Mini batching
  • LogisticRegression
    • used, for binary classification with Pipeline API / Binary classification using LogisticRegression with Pipeline API
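
For the linear regression entries, a condensed MLlib sketch, assuming an existing SparkContext sc and a hypothetical file of "label,f1 f2 f3" rows (the recipe also scales features before training, omitted here):

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}

// Converting each instance into a LabeledPoint
val parsed = sc.textFile("lpsa.data").map { line =>
  val Array(label, features) = line.split(',')
  LabeledPoint(label.toDouble, Vectors.dense(features.split(' ').map(_.toDouble)))
}.cache()

// Preparing the training and test data
val Array(training, test) = parsed.randomSplit(Array(0.8, 0.2))

// Training the model
val model = LinearRegressionWithSGD.train(training, 100 /* iterations */)

// Predicting against test data and evaluating with mean squared error
val mse = test.map { p =>
  val err = model.predict(p.features) - p.label
  err * err
}.mean()
println(s"MSE = $mse")
```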

M

  • matrices
    • working with / Working with matrices, How to do it...
    • creating / Creating matrices
    • creating, from values / Creating a matrix from values
    • zero matrix, creating / Creating a zero matrix
    • creating, out of function / Creating a matrix out of a function
    • identity matrix, creating / Creating an identity matrix
    • creating, from random numbers / Creating a matrix from random numbers
    • Scala collection, creating / Creating from a Scala collection
    • appending / Appending and conversion
    • concatenating / Concatenating matrices – vertically
    • concatenating, vertcat function / Concatenating matrices – vertically
    • concatenating, horzcat function / Concatenating matrices – horizontally
    • data manipulation operations / Data manipulation operations
    • basic statistics, computing / Computing basic statistics
    • mean and variance / Mean and variance
    • standard deviation / Standard deviation
    • working / How it works...
    • with randomly distributed values / Vectors and matrices with randomly distributed values, How it works...
  • matrix
    • column vectors, obtaining / Getting column vectors out of the matrix
    • row vectors, obtaining / Getting row vectors out of the matrix
    • inside values, obtaining / Getting values inside the matrix
    • inverse, obtaining / Getting the inverse and transpose of a matrix
    • transpose, obtaining / Getting the inverse and transpose of a matrix
    • largest value, finding / Finding the largest value in a matrix
    • sum, finding / Finding the sum, square root and log of all the values in the matrix
    • square root, finding / Finding the sum, square root and log of all the values in the matrix
    • log of all values, finding / Finding the sum, square root and log of all the values in the matrix
    • sqrt function / Finding the sum, square root and log of all the values in the matrix
    • log function / Finding the sum, square root and log of all the values in the matrix
    • eigenvectors, calculating / Calculating the eigenvectors and eigenvalues of a matrix
    • eigenvalues, calculating / Calculating the eigenvectors and eigenvalues of a matrix
    • with uniformly random values, creating / Creating a matrix with uniformly random values
    • with normally distributed random values, creating / Creating a matrix with normally distributed random values
    • with random values with Poisson distribution, creating / Creating a matrix with random values that has a Poisson distribution
  • matrix arithmetic
    • about / Matrix arithmetic
    • addition / Addition
    • multiplication / Multiplication
  • matrix of Int
    • converting, into matrix of Double / Converting a matrix of Int to a matrix of Double
  • Mesos
    • Spark job, running / How to do it...
    • installing / Installing Mesos
    • URL / Installing Mesos
    • master and slave, starting / Starting the Mesos master and slave
    • Spark binary package, uploading to HDFS / Uploading the Spark binary package and the dataset to HDFS
    • dataset, uploading to HDFS / Uploading the Spark binary package and the dataset to HDFS
  • micro-batching
    • about / Using Spark Streaming to subscribe to a Twitter stream
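
A quick tour of the Breeze matrix operations indexed above, as a sketch:

```scala
import breeze.linalg.{DenseMatrix, horzcat, inv, max, sum, vertcat}

val a     = DenseMatrix((1.0, 2.0), (3.0, 4.0))  // matrix from values
val zeros = DenseMatrix.zeros[Double](2, 2)      // zero matrix
val eye   = DenseMatrix.eye[Double](2)           // identity matrix
val rand  = DenseMatrix.rand(2, 2)               // uniformly random values

val tall = vertcat(a, zeros)   // concatenating vertically
val wide = horzcat(a, eye)     // concatenating horizontally

println(a + eye)   // addition
println(a * eye)   // multiplication
println(a.t)       // transpose
println(inv(a))    // inverse
println(sum(a))    // sum of all values
println(max(a))    // largest value
```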

N

  • NumPy
    • URL / Getting Breeze – the linear algebra library

O

  • OpenBLAS
    • URL / The org.scalanlp.breeze-natives package

P

  • PairRDD
    • URL / Saving RDD[StudentAvro] in a Parquet file
  • Parquet
    • URL / Storing data as Parquet files
    • parquet-tools, URL / Install Parquet tools
    • Avro data model, using / Using the Avro data model in Parquet, How to do it…
  • Parquet-MR project
    • URL / Storing data as Parquet files
  • Parquet files
    • data, storing as / Storing data as Parquet files, How to do it…, Load a simple CSV file, convert it to case classes, and create a DataFrame from it, Save it as a Parquet file
    • inspecting, with tools / Using the tools to inspect the Parquet file
    • Snappy compression of data, enabling / Enable compression for the Parquet file
    • RDD[StudentAvro], saving / Saving RDD[StudentAvro] in a Parquet file
    • file back, reading for verification / Reading the file back for verification
  • Parquet tools
    • installing / Install Parquet tools
    • using, for verification / Using Parquet tools for verification
  • PCA
    • used, for feature reduction / Feature reduction using principal component analysis, How to do it...
    • about / Feature reduction using principal component analysis
    • dimensionality reduction, of data for supervised learning / Dimensionality reduction of data for supervised learning
    • training data, mean-normalizing / Mean-normalizing the training data, Mean-normalizing the training data
    • principal components, extracting / Extracting the principal components, Extracting the principal components
    • labeled data, preparing / Preparing the labeled data
    • test data, preparing / Preparing the test data
    • metrics, classifying / Classify and evaluate the metrics
    • metrics, evaluating / Classify and evaluate the metrics, Evaluating the metrics
    • data, dimensionality reduction / Dimensionality reduction of data for unsupervised learning
    • number of components / Arriving at the number of components
  • pem key
    • URL / Creating the AccessKey and pem file
  • Pipeline API, used for solving binary classification
    • data, importing as test / Importing and splitting data as test and training sets
    • data, importing as training sets / Importing and splitting data as test and training sets
    • data, splitting as training sets / Importing and splitting data as test and training sets
    • data, splitting as test / Importing and splitting data as test and training sets
    • participants, constructing / Construct the participants of the Pipeline
    • pipeline, preparing / Preparing a pipeline and training a model
    • model, training / Preparing a pipeline and training a model
    • test data, predicting against / Predicting against test data
    • model, evaluating without cross-validation / Evaluating a model without cross-validation
    • parameters for cross-validation, constructing / Constructing parameters for cross-validation
    • cross-validator, constructing / Constructing cross-validator and fit the best model
    • model, evaluating with cross-validation / Evaluating the model with cross-validation
  • Pipeline API, used for solving binary classification problem
    • about / Binary classification using LogisticRegression with Pipeline API
  • prerequisite, for running ElasticSearch instance on machine
    • Elasticsearch, running / How to do it...
    • Twitter app, creating / How to do it...
    • Spark Streaming, adding / How to do it...
    • Twitter dependency, adding / How to do it...
    • Twitter stream, creating / How to do it...
    • stream, saving to ElasticSearch / How to do it...
  • Principal Component Analysis (PCA) / Gradient descent
  • Privacy Enhanced Mail (PEM) / How to do it...
  • Product
    • API docs, URL / How to do it...
  • pseudo-clustered mode
    • HDFS, running / Running HDFS on Pseudo-clustered mode
    • URL / Running HDFS on Pseudo-clustered mode
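
For the Parquet entries, a Spark 1.4-style sketch of saving and re-reading a DataFrame with Snappy compression, assuming an existing SQLContext sqlContext and a DataFrame df (the path is hypothetical):

```scala
// Enable compression for the Parquet file
sqlContext.setConf("spark.sql.parquet.compression.codec", "snappy")

// Save the DataFrame as a Parquet file...
df.write.parquet("students.parquet")

// ...and read the file back for verification
val restored = sqlContext.read.parquet("students.parquet")
restored.show()
```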

R

  • RDBMS
    • loading / Loading from RDBMS, How to do it…
  • reduceByKey function / How to do it...
  • Resilient Distributed Dataset (RDD) / How it works...
  • RowGroups / Storing data as Parquet files
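
For the RDBMS entry, a sketch of Spark 1.4's JDBC data source, with hypothetical connection details and a MySQL driver assumed on the classpath:

```scala
// Loading a table from an RDBMS into a DataFrame
val students = sqlContext.read.format("jdbc").options(Map(
  "url"     -> "jdbc:mysql://localhost:3306/school",
  "dbtable" -> "students",
  "driver"  -> "com.mysql.jdbc.Driver"
)).load()

students.show()
```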

S

  • save method / Save it as a Parquet file
  • sbt-avro plugin
    • URL / Generation of Avro objects using the sbt-avro plugin
  • sbt-dependency-graph plugin
    • URL / How to do it...
  • SBT assembly plugin
    • URL / How to do it...
  • sbteclipse plugin
    • URL / How to do it...
  • Scala bindings
    • URL / Introduction
  • Scala Build Tool (SBT) / Getting Breeze – the linear algebra library, Getting Apache Spark
  • Scala case classes
    • DataFrame, creating from / Creating a DataFrame from Scala case classes, How to do it..., How it works...
  • scatter plots, creating with Bokeh-Scala
    • about / Creating scatter plots with Bokeh-Scala, How to do it...
    • data, preparing / Preparing our data
    • Plot and Document objects, creating / Creating Plot and Document objects
    • marker object, creating / Creating a marker object
    • x and y axes’ data ranges, setting for plot / Setting the X and Y axes' data range for the plot
    • x and y axes, drawing / Drawing the x and the y axes
    • flower species with varying colors, viewing / Viewing flower species with varying colors
    • grid lines, adding / Adding grid lines
    • legend, adding to plot / Adding a legend to the plot
    • URL / Adding a legend to the plot
  • Sense plugin
    • URL / How to do it...
  • Snappy
    • URL / Enable compression for the Parquet file
  • Snappy compression / Enable compression for the Parquet file
  • Spark
    • downloading / Downloading Spark
    • URL, for download / Downloading Spark
    • using, as ETL tool / Using Spark as an ETL tool, How to do it...
  • spark.driver.extraClassPath property
    • URL / Building the Uber JAR
  • Spark 1.4
    • URL / Submitting jobs to the Spark cluster (local)
  • Spark application
    • building / Introduction
    • submitting, on cluster / Submitting the Spark application on the cluster
  • Spark cluster
    • jobs, submitting to / Submitting jobs to the Spark cluster (local)
  • Spark job
    • submitting, to Spark cluster / Submitting jobs to the Spark cluster (local)
    • running, on Mesos / Running the Spark Job on Mesos (local), Running the job
    • running, on YARN / Running the Spark Job on YARN (local), How to do it...
    • running, in yarn-client mode / Running a Spark job in yarn-client mode
    • running, in yarn-cluster mode / Running Spark job in yarn-cluster mode
  • Spark job, installing on YARN
    • about / Running the Spark Job on YARN (local), How to do it...
    • Hadoop cluster, installing / Installing the Hadoop cluster
    • HDFS, starting / Starting HDFS and YARN
    • Spark assembly, pushing to HDFS / Pushing Spark assembly and dataset to HDFS
    • dataset, pushing to HDFS / Pushing Spark assembly and dataset to HDFS
  • Spark master and slave
    • running / Running the Spark master and slave locally
  • Spark Standalone cluster
    • running, on EC2 / Running the Spark Standalone cluster on EC2
    • AccessKey, creating / Creating the AccessKey and pem file
    • pem file, creating / Creating the AccessKey and pem file
    • environment variables, setting / Setting the environment variables
    • launch script, running / Running the launch script
    • installation, verifying / Verifying installation
    • changes, making to code / Making changes to the code
    • data, transferring / Transferring the data and job files
    • job files, transferring / Transferring the data and job files
    • dataset, loading into HDFS / Loading the dataset into HDFS
    • job, running / Running the job
    • destroying / Destroying the cluster
  • Spark Streaming
    • used, for subscribing to Twitter stream / Using Spark Streaming to subscribe to a Twitter stream
  • Stochastic Gradient Descent (SGD) / Gradient descent
  • StreamingLogisticRegression, used for classifying Twitter stream
    • about / Using StreamingLogisticRegression to classify a Twitter stream using Kafka as a training stream
    • subscription, to Kafka stream / How to do it...
    • classification model, training / How to do it...
    • live Twitter stream, classifying / How to do it...
  • Student dataset
    • URL / Loading more than 22 features into classes
  • supervised learning / Supervised and unsupervised learning
  • Support Vector Machine (SVM)
    • about / Binary classification using LogisticRegression and SVM
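
For the Scala case classes entries, a minimal sketch of building a DataFrame from case classes and querying it as a relational table, assuming an existing SparkContext sc:

```scala
import org.apache.spark.sql.SQLContext

case class Student(id: Int, name: String, score: Double)

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._   // brings toDF() into scope

val df = sc.parallelize(Seq(
  Student(1, "Ann", 92.5),
  Student(2, "Ben", 88.0)
)).toDF()

// Treating the DataFrame as a relational table
df.registerTempTable("students")
sqlContext.sql("SELECT name FROM students WHERE score > 90").show()
```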

T

  • time series MultiPlot, creating with Bokeh-Scala
    • about / Creating a time series MultiPlot with Bokeh-Scala, How to do it...
    • data, preparing / Preparing our data
    • Plot, creating / Creating a plot
    • line joining to all data points, creating / Creating a line that joins all the data points
    • x and y axes’ data ranges for plot, setting / Setting the x and y axes' data range for the plot
    • axes, drawing / Drawing the axes and the grids
    • grids, drawing / Drawing the axes and the grids
    • tools, adding / Adding tools
    • legend, adding to plot / Adding a legend to the plot
    • multiple plots, creating in document / Multiple plots in the document
    • URL / Multiple plots in the document
  • toDF() function / How to do it...
  • twitter-chill project
    • URL / Saving RDD[StudentAvro] in a Parquet file
  • twitter4j library
    • URL / How to do it...
  • Twitter app
    • URL / How to do it...
  • Twitter data
    • analyzing, with GraphX / How to do it...
  • Twitter stream
    • subscribing to / Using Spark Streaming to subscribe to a Twitter stream
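
For the Twitter stream entries, a skeletal Spark Streaming subscription (the OAuth placeholders must be replaced with real credentials from a Twitter app):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.twitter.TwitterUtils

// twitter4j reads the OAuth credentials from system properties
System.setProperty("twitter4j.oauth.consumerKey", "<consumerKey>")
System.setProperty("twitter4j.oauth.consumerSecret", "<consumerSecret>")
System.setProperty("twitter4j.oauth.accessToken", "<accessToken>")
System.setProperty("twitter4j.oauth.accessTokenSecret", "<accessTokenSecret>")

val conf = new SparkConf().setAppName("TwitterSketch").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(10))  // 10-second micro-batches

// Each DStream batch holds the statuses received in that window
val tweets = TwitterUtils.createStream(ssc, None)
tweets.map(_.getText).print()

ssc.start()
ssc.awaitTermination()
```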

U

  • Uber JAR
    • building / Building the Uber JAR, How to do it...
    • transitive dependency stated explicitly, in SBT dependency / Transitive dependency stated explicitly in the SBT dependency
    • different libraries dependency, on same library / Two different libraries depend on the same external library
  • unsupervised learning / Supervised and unsupervised learning
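
For the Uber JAR entries, a sketch of the usual sbt-assembly setup (the plugin version is illustrative):

```scala
// project/assembly.sbt
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.1")
```

```scala
// build.sbt – when two libraries drag in the same file,
// pick a merge strategy for the duplicates
assemblyMergeStrategy in assembly := {
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard
  case _                             => MergeStrategy.first
}
```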

V

  • vector concatenation
    • about / Concatenating two vectors
    • vector of Int, converting to vector of Double / Converting a vector of Int to a vector of Double
    • basic statistics, computing / Computing basic statistics
    • mean, calculating / Mean and variance
    • variance, calculating / Mean and variance
  • vectors
    • working with / Working with vectors, Getting ready
    • creating / Creating vectors
    • constructing, from values / Constructing a vector from values
    • zero vector, creating / Creating a zero vector
    • creating, out of function / Creating a vector out of a function
    • vector of linearly spaced values, creating / Creating a vector of linearly spaced values
    • vector with values, creating in specific range / Creating a vector with values in a specific range
    • entire vector with single value, creating / Creating an entire vector with a single value
    • sub-vector, slicing from bigger vector / Slicing a sub-vector from a bigger vector
    • Breeze vector, creating from Scala vector / Creating a Breeze Vector from a Scala Vector
    • arithmetic / Vector arithmetic
    • scalar operations / Scalar operations
    • dot product of two vectors, creating / Calculating the dot product of two vectors
    • creating, by adding two vectors / Creating a new vector by adding two vectors together
    • appending / Appending vectors and converting a vector of one type to another
    • converting from one type to another / Appending vectors and converting a vector of one type to another
    • concatenating / Concatenating two vectors
    • standard deviation / Standard deviation
    • largest value, finding / Find the largest value in a vector
    • sum, finding / Finding the sum, square root and log of all the values in the vector
    • log, finding / Finding the sum, square root and log of all the values in the vector
    • square root, finding / Finding the sum, square root and log of all the values in the vector
    • sqrt function / Finding the sum, square root and log of all the values in the vector
    • log function / Finding the sum, square root and log of all the values in the vector
    • with randomly distributed values / Vectors and matrices with randomly distributed values, How it works...
    • with uniformly distributed random values, creating / Creating vectors with uniformly distributed random values
    • with normally distributed random values, creating / Creating vectors with normally distributed random values
    • with random values with Poisson distribution, creating / Creating vectors with random values that have a Poisson distribution
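
A compact sketch of the Breeze vector operations indexed above:

```scala
import breeze.linalg.{DenseVector, linspace}

val v      = DenseVector(1.0, 2.0, 3.0, 4.0)   // constructing from values
val zeros  = DenseVector.zeros[Double](4)      // zero vector
val filled = DenseVector.fill(4)(0.5)          // entire vector with a single value
val spaced = linspace(0.0, 1.0, 5)             // linearly spaced values

val slice  = v(1 to 2)        // slicing a sub-vector
val scaled = v :* 2.0         // scalar multiplication
val dotted = v dot v          // dot product of two vectors
val added  = v + v            // adding two vectors
val asInts = v.map(_.toInt)   // converting element type
```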

W

  • Worker nodes / There's more…

Y

  • YARN
    • Spark job, running / Running the Spark Job on YARN (local), How to do it...

Z

  • Zeppelin
    • used, for visualizing / Visualizing using Zeppelin
    • URL / Installing Zeppelin
    • installing / Installing Zeppelin
    • server, customizing / Customizing Zeppelin's server and websocket port
    • websocket port, customizing / Customizing Zeppelin's server and websocket port
    • data, visualizing on HDFS / Visualizing data on HDFS – parameterizing inputs
    • inputs, parameterizing / Visualizing data on HDFS – parameterizing inputs
    • custom functions, running / Running custom functions
    • external dependencies, adding / Adding external dependencies to Zeppelin
    • external Spark cluster, pointing to / Pointing to an external Spark cluster
  • Zookeeper
    • starting / How to do it...
    • URL / How to do it...