Machine Learning with Apache Spark

3 Days

No machine learning knowledge is assumed.

If students are new to Apache Spark, we can offer one day of ‘Introduction to Spark’ training.

We recommend participants have:

  • a programming background
  • familiarity with Python (a plus, but not required)

This course teaches machine learning at scale with the popular Apache Spark framework.

This course is intended for data scientists and software engineers. We assume no previous knowledge of machine learning; popular machine learning algorithms are taught from scratch.

For each machine learning concept, we first discuss its foundations, applicability, and limitations. We then explain its implementation and use, along with specific use cases. The course is roughly 50% lecture and 50% lab work.

This course is taught using Spark & Python.
This course is designed for data scientists and software engineers.

In this course, participants will:

  • Learn popular machine learning algorithms, their applicability, and limitations
  • Practice the application of these methods in the Spark machine learning environment
  • Learn practical use cases for these algorithms

Section 1: Machine Learning (ML) Overview
Machine Learning landscape
Machine Learning applications
Understanding ML algorithms & models

Section 2: ML in Python and Spark
Spark ML Overview
Introduction to Jupyter notebooks
Lab: Working with Jupyter + Python + Spark
Lab: Spark ML utilities

Section 3: Machine Learning Concepts
Statistics Primer
Covariance, Correlation, Covariance Matrix
Errors, Residuals
Overfitting / Underfitting
Cross-validation, bootstrapping
Confusion Matrix
ROC curve, Area Under Curve (AUC)
Lab: Basic stats
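
To illustrate the confusion matrix topic listed above, here is a minimal sketch in plain Python. It is not taken from the course labs and does not use the Spark ML API; the labels and the helper names (confusion_matrix, precision_recall) are made up for illustration.

```python
from collections import Counter


def confusion_matrix(actual, predicted):
    """Count outcomes for binary labels, where 1 is the positive class."""
    counts = Counter(zip(actual, predicted))
    return {
        "tp": counts[(1, 1)],  # actual 1, predicted 1
        "fp": counts[(0, 1)],  # actual 0, predicted 1
        "fn": counts[(1, 0)],  # actual 1, predicted 0
        "tn": counts[(0, 0)],  # actual 0, predicted 0
    }


def precision_recall(cm):
    """Derive precision and recall from the four confusion-matrix counts."""
    tp, fp, fn = cm["tp"], cm["fp"], cm["fn"]
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical labels, invented for this example only.
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]

cm = confusion_matrix(actual, predicted)
precision, recall = precision_recall(cm)
print(cm, precision, recall)
```

Precision measures how many predicted positives were correct; recall measures how many actual positives were found.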

Section 4: Feature Engineering (FE)
Preparing data for ML
Extracting features, enhancing data
Data cleanup
Visualizing Data
Lab: data cleanup
Lab: visualizing data

Section 5: Linear regression
Simple Linear Regression
Multiple Linear Regression
Running LR
Evaluating LR model performance
Use case: House price estimates
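
As a rough illustration of simple linear regression and its evaluation, the sketch below fits a line by ordinary least squares in plain Python. It is an assumption-laden example, not the course's lab code and not the Spark ML API: the house sizes and prices are invented, and the function names are hypothetical.

```python
def fit_simple_linear_regression(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = sum of co-deviations of x and y / sum of squared deviations of x
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


def r_squared(xs, ys, slope, intercept):
    """Fraction of the variance in y that the fitted line explains."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


# Invented data: house size in square metres, price in thousands.
sizes = [50, 80, 100, 120, 150]
prices = [150, 240, 300, 360, 450]

slope, intercept = fit_simple_linear_regression(sizes, prices)
r2 = r_squared(sizes, prices, slope, intercept)
estimate = intercept + slope * 110  # estimated price for a 110 m² house
print(slope, intercept, r2, estimate)
```

The invented data lies exactly on a line, so the fit is perfect; real data would leave residuals and an R² below 1.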

Section 6: Logistic Regression
Understanding Logistic Regression
Calculating Logistic Regression
Evaluating model performance
Use case: credit card application, college admissions

Section 7: Classification: SVM (Support Vector Machines)
SVM concepts and theory
SVM with kernel
Use case: Customer churn data

Section 8: Classification: Decision Trees & Random Forests
Theory behind trees
Classification and Regression Trees (CART)
Random Forest concepts
Use case: predicting loan defaults, estimating election contributions

Section 9: Classification: Naive Bayes
Use case: spam filtering

Section 10: Clustering (K-Means)
Theory behind K-Means
Running K-Means algorithm
Estimating the performance
Use case: grouping cars data, grouping shopping data

Section 11: Principal Component Analysis (PCA)
Understanding PCA concepts
PCA applications
Running a PCA algorithm
Evaluating results
Use case: analyzing retail shopping data

Section 12: Recommendations (Collaborative filtering)
Recommender systems overview
Collaborative Filtering concepts
Use case: movie recommendations, music recommendations

Section 13: Performance
Best practices for scaling and optimizing Apache Spark
Memory caching
Testing and validation

Section 14: Final workshop (time permitting)
Students will analyze a couple of datasets and run ML algorithms.
This is done as a group exercise. Each group will present their findings to the class.