Index
A
- A/B testing
- models, iterating / Iterating on models through A/B testing
- experimental allocation / Experimental allocation – assigning customers to experiments
- sample size, deciding / Deciding a sample size
- multiple hypothesis testing / Multiple hypothesis testing
- adjacency matrix / Where agglomerative clustering fails
- affinity propagation
- cluster numbers, selecting automatically / Affinity propagation – automatically choosing cluster numbers
- agglomerative clustering
- about / Agglomerative clustering
- failures / Where agglomerative clustering fails
- Alternating Least Squares (ALS) / Case Study: Training a Recommender System in PySpark
- Amazon Web Services (AWS) / Working in the cloud
- analytic pipeline
- data splitting / Modeling layer
- parameter tuning / Modeling layer
- model performance / Modeling layer
- model persistence / Modeling layer
- analytic solution, advanced
- designing / Designing an advanced analytic solution
- data layer / Data layer: warehouses, lakes, and streams
- modeling layer / Modeling layer
- deployment layer / Deployment layer
- reporting layer / Reporting layer
- application layer / Deployment layer
- Area Under Curve (AUC) / Evaluating changes in model performance
- area under curve (AUC)
- about / Evaluating classification models
- auto-regressive moving average (ARMA) / Time series data
B
- back-propagation
- boosting
- broker / Persisting information with database systems
C
- categorical data
- similarity metrics / Similarity metrics for categorical data
- normalizing / Similarity metrics for categorical data
- Celery library
- URL / The web application
- Classification and Regression Trees (CART) algorithm / Decision trees
- classification models
- evaluating / Evaluating classification models
- improving / Strategies for improving classification models
- client layer / Deployment layer
- client requests
- handling / Clients and making requests
- GET requests, implementing / The GET requests
- POST request, implementing / The POST request
- HEAD request, implementing / The HEAD request
- PUT request, implementing / The PUT request
- DELETE request, implementing / The DELETE request
- communication
- guidelines / Guidelines for communication
- terms, translating to business values / Translate terms to business values
- results, visualizing / Visualizing results
- convexity
- convolutional network
- about / Convolutional networks and rectified units
- input layer / Convolutional networks and rectified units
- convolutional layer / Convolutional networks and rectified units
- rectifying layer / Convolutional networks and rectified units
- downsampling layer / Convolutional networks and rectified units
- fully connected layer / Convolutional networks and rectified units
- correlation similarity metrics
- covariance / Correlation similarity metrics and time series
- curl command
D
- database systems
- data layer / Designing an advanced analytic solution
- decision trees
- about / Decision trees
- dendrograms / Agglomerative clustering
- deployment layer / Deployment layer
- digit recognition / The TensorFlow library and digit recognition
- distance metrics
- about / Similarity and distance metrics
- numerical distance metrics / Numerical distance metrics
- time series / Correlation similarity metrics and time series
- blending / Similarity metrics for categorical data
- Dow Jones Industrial Average (DJIA) / Correlation similarity metrics and time series
- Driver / Creating the SparkContext
- Dynamic Time Warping (DTW) / Correlation similarity metrics and time series
E
- e-mail campaigns, case study
- about / Case study: targeted e-mail campaigns
- data input and transformation / Data input and transformation
- sanity checking / Sanity checking
- model development / Model development
- scoring / Scoring
- visualization and reporting / Visualization and reporting
- Executors / Creating the SparkContext
F
- false positive rate (FPR)
- about / Evaluating classification models
- familywise error rate (FWER) / Multiple hypothesis testing
- Flask
G
- Gaussian kernel
- Gauss-Markov theorem / Linear regression
- generalized linear models
- about / Generalized linear models
- Generalized Linear Models (GLMs) / Logistic regression
- Generalized Estimating Equations (GEE)
- about / Generalize estimating equations
- geospatial data
- about / Working with geospatial data
- loading / Loading geospatial data
- cloud, working in / Working in the cloud
- gradient boosted decision trees
- about / Gradient boosted decision trees
- versus, support vector machines and logistic regression / Comparing classification methods
- gradient boosted machine (GBM) / Evaluating changes in model performance
- graphical user interface (GUI) / Cleaning textual data
- graphics processing unit (GPU) / The TensorFlow library and digit recognition
H
- H2O
- Hadoop distributed file system (HDFS) / Creating an RDD
- hierarchical clustering / Agglomerative clustering
- hinge loss
- horizontal scaling / Server – the web traffic controller
- HTTP Status Codes / The GET requests
- hypertext transfer protocol (HTTP)
I
- images
- about / Images
- image data, cleaning / Cleaning image data
- thresholding, for highlighting objects / Thresholding images to highlight objects
- dimensionality reduction, for image analysis / Dimensionality reduction for image analysis
- Indicator Function / Extracting features from textual data
- Internet Movie Database
- IPython notebook
- about / Exploring categorical and numerical data in IPython
- installing / Installing IPython notebook
- interface / The notebook interface
- data, loading / Loading and inspecting data
- data, inspecting / Loading and inspecting data
- basic manipulations / Basic manipulations – grouping, filtering, mapping, and pivoting
- Matplotlib, charting with / Charting with Matplotlib
- iteratively reweighted least squares (IRLS)
K
- K-means++ / K-means clustering
- K-means clustering
- about / K-means clustering
- k-medoids
- about / k-medoids
- kernel function
L
- Labeled RDD / Streaming clustering in Spark
- Latent Dirichlet Allocation (LDA)
- about / Latent Dirichlet Allocation
- Latent Semantic Indexing (LSI) / Principal component analysis
- linear regression
- about / Linear regression
- data, preparing / Data preparation
- evaluation / Model fitting and evaluation
- model, fitting / Model fitting and evaluation
- statistical significance / Statistical significance of regression outputs
- Generalized Estimating Equations (GEE) / Generalize estimating equations
- mixed effects models / Mixed effects models
- time series data / Time series data
- generalized linear models / Generalized linear models
- regularization, applying to linear models / Applying regularization to linear models
- linkage metric / Where agglomerative clustering fails
- link functions
- Logit / Generalized linear models
- Poisson / Generalized linear models
- Exponential / Generalized linear models
- logistic regression
- about / Logistic regression
- multiclass logistic classifiers / Multiclass logistic classifiers: multinomial regression
- dataset, formatting for classification problems / Formatting a dataset for classification problems
- stochastic gradient descent (SGD) / Learning pointwise updates with stochastic gradient descent
- parameters, optimizing with second-order methods / Jointly optimizing all parameters with second-order methods
- model, fitting / Fitting the model
- versus, support vector machines and gradient boosted decision trees / Comparing classification methods
- logistic regression service
- as case study / Case study – logistic regression service
- database, setting up / Setting up the database
- web server, setting up / The web server
- web application, setting up / The web application
- model, training / The flow of a prediction service – training a model
- on-demand and bulk prediction, obtaining / On-demand and bulk prediction
- Long Short Term Memory Networks (LSTM) / Optimizing the learning rate
M
- Matplotlib
- charting with / Charting with Matplotlib
- message passing / Affinity propagation – automatically choosing cluster numbers
- Modified National Institute of Standards and Technology (MNIST) database / The MNIST data
- modeling layer / Modeling layer
- model performance
- checking, with diagnostic / Checking the health of models with diagnostics
- changes, evaluating / Evaluating changes in model performance
- changes in feature importance, evaluating / Changes in feature importance
- unsupervised model performance, changes / Changes in unsupervised model performance
- models
- iterating, through A/B testing / Iterating on models through A/B testing
- multiclass logistic classifiers
- multidimensional scaling (MDS) / Numerical distance metrics
- multinomial regression / Multiclass logistic classifiers: multinomial regression
N
- natural language toolkit (NLTK) library / Cleaning textual data
- neural networks
- patterns, learning with / Learning patterns with neural networks
- perceptron / A network of one – the perceptron
- perceptrons, combining / Combining perceptrons – a single-layer neural network
- single-layer neural network / Combining perceptrons – a single-layer neural network
- parameter fitting, with back-propagation / Parameter fitting with back-propagation
- discriminative, versus generative models / Discriminative versus generative models
- gradients, vanishing / Vanishing gradients and explaining away
- belief networks, pretraining / Pretraining belief networks
- regularizing, dropout used / Using dropout to regularize networks
- convolutional networks / Convolutional networks and rectified units
- rectified units / Convolutional networks and rectified units
- data compressing, with autoencoder networks / Compressing Data with autoencoder networks
- learning rate, optimizing / Optimizing the learning rate
- neurons / Combining perceptrons – a single-layer neural network
- Newton methods
- non-relational database / Persisting information with database systems
- numerical distance metrics
- about / Numerical distance metrics
O
- Ordinary Least Squares (OLS) / Linear regression
P
- prediction service
- architecture / The architecture of a prediction service
- server, using / Server – the web traffic controller
- application, setting up / Application – the engine of the predictive services
- information, persisting with database systems / Persisting information with database systems
- Principal Component Analysis (PCA)
- about / Principal component analysis
- Latent Dirichlet Allocation (LDA) / Latent Dirichlet Allocation
- dimensionality reduction, using in predictive modeling / Using dimensionality reduction in predictive modeling
- pseudo-residuals / Gradient boosted decision trees
- pyspark
- classifier models, implementing / Case study: fitting classifier models in pyspark
- PySpark
- URL / Joining signals and correlation, Introduction to PySpark
- about / Introduction to PySpark, Scaling out with PySpark – predicting year of song release
- SparkContext, creating / Creating the SparkContext
- RDD, creating / Creating an RDD
- Spark DataFrame, creating / Creating a Spark DataFrame
- example / Scaling out with PySpark – predicting year of song release
- Python requests library
- URL / The GET requests
R
- RabbitMQ
- URL / The web application
- random forest
- about / Random forest
- RDD
- creating / Creating an RDD
- Receiver-Operator-Characteristic (ROC) / Evaluating changes in model performance
- receiver operator characteristic (ROC) / Logistic regression
- Receiver Operator Characteristic (ROC) curve
- about / Evaluating classification models
- recommender system training, in PySpark
- case study / Case Study: Training a Recommender System in PySpark
- Rectified Linear Unit (ReLU) / Convolutional networks and rectified units
- Recurrent Neural Networks (RNNs) / Optimizing the learning rate
- Redis
- URL / Setting up the database
- relational database / Persisting information with database systems
- reporting layer / Reporting layer
- reporting service
- about / Case Study: building a reporting service
- report server, setting up / The report server
- report application, setting up / The report application
- visualization layer, using / The visualization layer
- Resilient Distributed Dataset (RDD) / Streaming clustering in Spark
- Resilient Distributed Datasets (RDDs) / Introduction to PySpark
S
- second-order methods
- about / Formatting a dataset for classification problems
- parameters, optimizing / Jointly optimizing all parameters with second-order methods
- server
- used, for communicating with external systems / Server – the web traffic controller
- similarity metrics
- about / Similarity and distance metrics
- correlation similarity metrics / Correlation similarity metrics and time series
- for categorical data / Similarity metrics for categorical data
- Singular Value Decomposition (SVD) / Numerical distance metrics, Principal component analysis
- social media feeds, case study
- about / Case study: sentiment analysis of social media feeds
- data input and transformation / Data input and transformation
- sanity checking / Sanity checking
- model development / Model development
- scoring / Scoring
- visualization and reporting / Visualization and reporting
- soft-margin formulation / Separating Nonlinear boundaries with Support vector machines
- Spark
- streaming clustering / Streaming clustering in Spark
- SparkContext
- creating / Creating the SparkContext
- Spark DataFrame
- creating / Creating a Spark DataFrame
- spectral clustering / Where agglomerative clustering fails
- statsmodels
- stochastic gradient descent
- stochastic gradient descent (SGD)
- streaming clustering
- about / Streaming clustering in Spark
- support-vector networks
- support vector machine (SVM)
- nonlinear boundaries, separating / Separating Nonlinear boundaries with Support vector machines
- implementing, to census data / Fitting an SVM to the census data
- boosting / Boosting – combining small models to improve accuracy
- versus, logistic regression and gradient boosted decision trees / Comparing classification methods
T
- TensorFlow library
- about / The TensorFlow library and digit recognition
- MNIST data / The MNIST data
- network, constructing / Constructing the network
- term-frequency-inverse document frequency (tf-idf) / Extracting features from textual data
- textual data
- working with / Working with textual data
- cleaning / Cleaning textual data
- features, extracting from / Extracting features from textual data
- dimensionality reduction, used for simplifying datasets / Using dimensionality reduction to simplify datasets
- time series
- time series analysis
- about / Time series analysis
- cleaning and converting / Cleaning and converting
- time series diagnostics / Time series diagnostics
- signals and correlation, joining / Joining signals and correlation
- transformations and operations
- URL / Creating an RDD
- tree methods
- about / Tree methods
- decision trees / Decision trees
- random forest / Random forest
- true positive rate (TPR)
- about / Evaluating classification models
U
- units / Combining perceptrons – a single-layer neural network
- Unweighted Pair Group Method with Arithmetic Mean (UPGMA) / Agglomerative clustering
V
- vertical scaling / Server – the web traffic controller
W
- Web Server Gateway Interface (WSGI)
X
- XGBoost