Spark Machine Learning

Course Code

BD74

Duration

3 Days

Participants should have general knowledge of data stores, advanced math, and analytics, plus a working knowledge of the Python language.
This Spark Machine Learning class will help you master Machine Learning using Scala, Python and R.
Students learn the basic theoretical principles, algorithms, and applications of machine learning using the Spark ML libraries. The course covers supervised learning (classification and regression), unsupervised learning (clustering and dimensionality reduction), and recommendation algorithms using collaborative filtering.

Students will gain hands-on experience applying these principles using Spark ML pipelines and model evaluation techniques.
This course is designed for data engineers, analysts, architects, software engineers, IT operations and technical managers interested in a thorough, hands-on course covering Spark Machine Learning.

In this course, participants will:

  • Understand machine learning concepts
  • Install Spark and Anaconda
  • Prepare data for analysis
  • Learn the difference between supervised and unsupervised learning
  • Apply ML algorithms to LIBSVM data
  • Articulate and implement typical use cases for machine learning
  • Understand and improve model performance
  • Build data pipelines with SparkSQL and DataFrames
  • Analyze Spark jobs using the Spark UI and logs
  • Understand and use Classification
  • Prepare and visualize ML data
  • Understand and use Regression
  • Use Dimensionality Reduction and Principal Component Analysis

1. Overview
Spark
Big Data
History
Spark Ecosystem
Clusters
Machine Learning
Spark ML Libraries
Types of Learning
Data Preparation
Algorithms
Iterative Processes
Scalability
Ensemble Modeling
Catalyst Optimizer
Tungsten
WholeStageCodegen
Lab

2. Installation
Platforms
Prerequisites
Download
Windows
Mac OS
Homebrew
Linux
spark-shell
Hello Spark
Anaconda
Anaconda Installation
Anaconda Spark
Lab

3. Spark Basics
Spark Shell
Scala
PySpark
Python
SparkR
R
RDDs
Datasets
DataFrames
Spark UI
Spark SQL
SQL Temp Views
Lab

4. Machine Learning
Machine Learning
Spark ML Libraries
Types of Learning
Supervised Learning
Unsupervised Learning
Recommendation Learning
Algorithms
Data Types
Sparse Vectors
LIBSVM
LibSVM Training Set
Machine Learning Flow
Naive Bayes Classifier
Anaconda
Lab

5. Predicting Titanic Survival
The Titanic Data Set
Building Good Training Sets
Spark SQL
Feature Engineering
Age and Gender
Family Size
Class and Fare
Category Table
Survival Analysis
Naive Bayes Prediction
Naive Bayes in R
Classification
Lab

6. Classification Models
Data Sets
Classification
Naive Bayes
Logistic Regression
Y Intercept
Decision Tree Classifier
Overfitting
Gini Impurity
Ensemble
Random Forest Classifier
Gradient-Boosted Tree
Additional Classifiers
Lab

7. ML Pipelines
DataFrame API
ML Pipelines
DataFrames
Data Types
Vectors
Transformers
Estimators
Parameters
Pipelines
Pipeline.fit()
Model.transform()
Pipeline Examples
Saving and Loading
Lab

8. Extracting Features
Nomenclature
Feature Selection
Dimensionality
Feature Reduction
Tokenizer
StopWordsRemover
Term Frequency (TF)
Inverse Document Frequency (IDF)
HashingTF and IDF
StringIndexer
VectorIndexer
Word2Vec
Lab

9. Regression Models
Regression
Linear Regression
Regularization
Lasso Model
Ridge Regression
Generalized Linear Regression
Decision Tree Regression
Random Forest Regression
Gradient-Boosted Tree Regression
Survival Regression
Lab

10. Evaluating Predictions
Model Evaluation
Classification Evaluation
Confusion Matrix
Binary Classification
Receiver Operating Characteristic (ROC)
Area Under ROC
Threshold Tuning
Multiclass Classification
Label Based Metrics
Multilabel Classification
Regression Evaluation
Lab

11. Clustering Models
Unsupervised Learning
Clustering
K-Means
K-means Clustering
K-means Cluster Processing
Within Set Sum of Squared Error
Latent Dirichlet Allocation
LDA Pipeline
LDA Example
LDA LibSVM
Bisecting K-means
Gaussian Mixture Model
K-Means in SparkR
Lab

12. Recommendation Models
Collaborative Filtering
Alternating Least Squares
Feedback
MovieLens Dataset
MovieLens Dataset Analysis
Alternating Least Squares ML
Frequent Pattern Mining
FP-Growth Algorithm
Association Rules
Sequential Pattern Mining
PrefixSpan
Lab

13. Model Selection & Tuning
Hyperparameter Tuning
Model Selection
Cross-Validation
Cross-Validation Example
Train-Validation Split
Train-Validation Split Example
Stochastic Gradient Descent
LinearRegressionWithSGD
Broyden–Fletcher–Goldfarb–Shanno
Limited-memory BFGS
LogisticRegressionWithLBFGS
Lab

14. Deep Learning
Machine versus Deep Learning
Multilayer Perceptron Classifier
Feedforward Artificial Neural Network
Hidden Layers
Scala Example
Python Example
R Example
Iris Dataset
Popular DNN Frameworks
TensorFlow
Spark with TensorFlow
Lab

15. Business Applications of ML
Checklist
Marketing Use Cases
Healthcare Use Cases
Expedia
Expedia Scratchpad
Datacenter Network Traffic
Cisco ML Applications
Tetration Analytics
Stealthwatch Learning Network
Machine Learning Model Factory
Propensity to Buy
Companies Using ML Spark
Lab