Introduction to Spark

Course Code

BD72

Duration

2 Days

Prerequisites

General knowledge of data stores, advanced math, and analytics, plus a working knowledge of the Python language.

This course will help you master the Apache Spark platform. Spark lets participants build complete, unified big data applications that combine batch, streaming, and interactive analytics over all their data. With Spark, developers can write sophisticated parallel applications that deliver faster, better decisions and real-time actions across a wide variety of use cases, architectures, and industries. The course covers the core Spark APIs, the platform's fundamental mechanisms, and its basic internals. Students will use the Spark shell, Python, Scala, R, and SQL to access and transform data. We also cover Jupyter Notebooks, Anaconda, and Apache Parquet.

This course is designed for data engineers, analysts, architects, software engineers, IT operations staff, and technical managers who want a thorough, hands-on introduction to Apache Spark.

In this course, participants will:

  • Understand the motivation for non-relational data stores
  • Install Spark
  • Describe Spark’s fundamental architecture
  • Use RDDs, Datasets & DataFrames
  • Use Jupyter notebooks
  • Use the core Spark APIs to operate on data
  • Articulate and implement typical use cases for Spark
  • Understand and improve Spark performance
  • Build data pipelines with SparkSQL and DataFrames
  • Analyze Spark jobs using the Spark UI and logs
  • Prepare and visualize data
  • Explore Spark Streaming
  • Use GraphX and GraphFrames
Day 1 Foundations
1. Overview

Spark
Big Data
History
Hadoop
MapReduce
Word Count
Spark Ecosystem
RDDs
Datasets
DataFrames
Clusters
Use Cases
Databricks
Resources

2. Installation
Platforms
Prerequisites
Download
Windows
macOS
Homebrew
Linux
Building Spark
Maven
build/mvn
spark-shell
Hello Spark
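
The "Hello Spark" exercise can be as small as the sketch below: a self-contained PySpark program, assuming a local installation with pyspark importable from Python (the app name and data are illustrative only).

    # hello_spark.py -- a minimal first program (names are illustrative).
    from pyspark.sql import SparkSession

    # "local[*]" runs Spark in-process using all available cores.
    spark = SparkSession.builder.master("local[*]").appName("HelloSpark").getOrCreate()

    # Parallelize a tiny collection into an RDD and run two actions on it.
    words = spark.sparkContext.parallelize(["hello", "spark", "hello", "world"])
    print(words.count())             # 4
    print(words.distinct().count())  # 3

    spark.stop()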

3. Spark Shells
Spark Shell
Spark Context
Options
Scala
PySpark
Options
Python
SparkR
R
Spark UI
Spark Jobs
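
As a sketch of the first minutes in the PySpark shell: the shell pre-creates a SparkSession (spark) and a SparkContext (sc), and every job it runs appears in the Spark UI (http://localhost:4040 by default).

    # Entered at the pyspark prompt; `spark` and `sc` already exist.
    sc.master              # e.g. 'local[*]' -- where this shell is running
    spark.version          # the Spark version string
    spark.range(5).show()  # a quick sanity check; triggers a small Spark job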

4. RDDs, Datasets & DataFrames
RDD Overview
Creating
Parallelized Collections
External Datasets
RDD Operations
Transformations
Actions
Persistence
Datasets
Create Dataset
DataFrames
Create DataFrame
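
A minimal PySpark sketch of these abstractions side by side. Note that the typed Dataset API exists only in Scala and Java; from Python you work with RDDs and DataFrames (the data values below are illustrative).

    from pyspark.sql import Row, SparkSession

    spark = SparkSession.builder.master("local[*]").appName("CoreAPIs").getOrCreate()
    sc = spark.sparkContext

    # RDD from a parallelized collection: transformations are lazy,
    # actions actually run the job.
    nums = sc.parallelize([1, 2, 3, 4, 5])
    squares = nums.map(lambda x: x * x)   # transformation (lazy)
    print(squares.collect())              # action: [1, 4, 9, 16, 25]
    squares.cache()                       # persist for reuse

    # DataFrame from Row objects (an external dataset would use spark.read.*).
    df = spark.createDataFrame([Row(name="Ada", age=36), Row(name="Alan", age=41)])
    df.filter(df.age > 40).show()

    spark.stop()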

5. Cluster Architecture
Ecosystem
Cluster
Cluster Manager Types
Standalone
Apache Mesos
Hadoop YARN
Spark Application
Spark Context
DAG
DAG Scheduler
Task Scheduler
Backend Scheduler
Running Spark on EC2
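
The master URL is what selects the cluster manager; a sketch of the options, with hypothetical host names and ports:

    from pyspark.sql import SparkSession

    # The master URL picks the cluster manager:
    #   local[*]                  -- run in-process
    #   spark://master-host:7077  -- Spark standalone (hypothetical host)
    #   mesos://master-host:5050  -- Apache Mesos (hypothetical host)
    #   yarn                      -- Hadoop YARN (uses HADOOP_CONF_DIR)
    spark = (SparkSession.builder
             .master("spark://master-host:7077")
             .appName("ClusterDemo")
             .config("spark.executor.memory", "2g")
             .getOrCreate())

    # Each action becomes a job; the DAG scheduler splits it into stages and
    # the task scheduler ships tasks to executors on the worker nodes.
    print(spark.range(1_000_000).selectExpr("sum(id) AS total").collect())
    spark.stop()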

6. Spark Streaming
Why Streaming
Basics
Unified Spark
Processing
DStreams
StreamingContext
DStreams Processing
Structured Streaming
DataFrame Operations
SQL Operations
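
The canonical Structured Streaming example is a running word count over a socket; a sketch, assuming something like `nc -lk 9999` is feeding lines to localhost:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import explode, split

    spark = SparkSession.builder.master("local[*]").appName("StreamingWC").getOrCreate()

    # Treat the socket as an unbounded table of lines.
    lines = (spark.readStream.format("socket")
             .option("host", "localhost").option("port", 9999).load())

    # Ordinary DataFrame operations apply to the stream.
    words = lines.select(explode(split(lines.value, " ")).alias("word"))
    counts = words.groupBy("word").count()

    # Print the full, continuously updated result table on each trigger.
    query = counts.writeStream.outputMode("complete").format("console").start()
    query.awaitTermination()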

Day 2 Programming Techniques
7. RDDs

MapReduce
Mining Console Logs
Closure
Accumulators
Key-value Pairs
Word Count
Transformations
Actions
GroupByKey
ReduceByKey
Joins
Data Partitioning
Broadcast Variables
Map-Side Join
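
A sketch of two of the patterns above in PySpark: word count with reduceByKey (which pre-aggregates on the map side, unlike groupByKey) and a map-side join using a broadcast variable (all data values are illustrative).

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[*]").appName("PairRDDs").getOrCreate()
    sc = spark.sparkContext

    # Word count over key-value pairs; reduceByKey combines on the map side
    # before shuffling, so prefer it to groupByKey for aggregations.
    lines = sc.parallelize(["to be or not to be", "to do is to be"])
    counts = (lines.flatMap(str.split)
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))
    print(sorted(counts.collect()))

    # Map-side join: broadcast a small lookup table to every executor and
    # join with map(), avoiding a shuffle of the large RDD.
    lookup = sc.broadcast({"us": "United States", "jp": "Japan"})
    events = sc.parallelize([("us", 3), ("jp", 5), ("us", 1)])
    print(events.map(lambda kv: (lookup.value.get(kv[0], "?"), kv[1])).collect())

    spark.stop()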

8. RDDs & Datasets
RDD Attributes
Low Level RDD Operations
Timing RDDs
RDDs in Spark UI
RDD.toDebugString
RDD Storage & Memory
Datasets Attributes
Timing Datasets
Datasets in Spark UI
Dataset Storage
Dataset Memory
Catalyst Optimizer
Tungsten
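
A sketch of the low-level inspection techniques in this module: printing an RDD's lineage with toDebugString, and a rough before/after-cache timing (in PySpark, toDebugString returns bytes, hence the decode).

    import time
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[*]").appName("Internals").getOrCreate()
    sc = spark.sparkContext

    rdd = (sc.parallelize(range(1_000_000))
             .map(lambda x: x * 2)
             .filter(lambda x: x % 3 == 0))

    # Lineage: how Spark would recompute this RDD from its parents.
    print(rdd.toDebugString().decode())

    # Crude timing of the same action cold, then cached.
    t0 = time.time(); rdd.count(); print("cold:  ", time.time() - t0)
    rdd.cache(); rdd.count()    # first pass after cache() populates storage
    t0 = time.time(); rdd.count(); print("cached:", time.time() - t0)

    spark.stop()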

9. DataFrames
DataFrames
SparkSession
Creating DataFrames
DataFrame Operations
SQL Temp Views
Data Sources
Read DataFrame
CSV Files
Parquet Files
JSON Files
JDBC Access
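
A sketch of the data-source round trip covered here, with hypothetical file paths and column names: read CSV, write and re-read Parquet, and expose the result to SQL via a temp view.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[*]").appName("Sources").getOrCreate()

    # Read a CSV file into a DataFrame (path and columns are hypothetical).
    people = (spark.read.option("header", True)
              .option("inferSchema", True)
              .csv("people.csv"))
    people.printSchema()

    # Round-trip through Parquet, Spark's preferred columnar format.
    people.write.mode("overwrite").parquet("people.parquet")
    back = spark.read.parquet("people.parquet")

    # A temp view makes the same data queryable from SQL.
    back.createOrReplaceTempView("people")
    spark.sql("SELECT name, age FROM people WHERE age > 21").show()

    spark.stop()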

10. Spark SQL
Spark SQL
SQL Temp Views
SQL Commands
Spark SQL CLI
spark-sql
Persistent Tables
Create Table
Spark Warehouse
SQL Catalog
Data Types
Select
Joins
Case Expressions
Subqueries
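
A compact sketch combining several of the SQL features above: temp views, a join, a CASE expression, and a subquery (tables and values are illustrative).

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[*]").appName("SparkSQL").getOrCreate()

    spark.createDataFrame(
        [(1, "Ada", 10), (2, "Alan", 20), (3, "Edsger", 10)],
        ["id", "name", "dept_id"]).createOrReplaceTempView("emp")
    spark.createDataFrame(
        [(10, "Research"), (20, "Ops")],
        ["dept_id", "dept"]).createOrReplaceTempView("dept")

    spark.sql("""
        SELECT e.name, d.dept,
               CASE WHEN d.dept = 'Research' THEN 'lab' ELSE 'office' END AS site
        FROM emp e
        JOIN dept d ON e.dept_id = d.dept_id
        WHERE e.dept_id IN (SELECT dept_id FROM dept)
    """).show()

    spark.stop()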

11. Spark Performance
Query Processing
Catalyst Optimizer
Analysis
Logical Plan
Explain
Explain EXTENDED
Joins
Broadcast Hash-Join
SortMergeJoin
Cost-Based Optimizer
Spark 2.2 Statistics
ANALYZE TABLE
Explain COST
WholeStageCodegen
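
A sketch of inspecting Catalyst's work with explain(): joining a small DataFrame to a large one typically yields a broadcast hash join (the small side is under the broadcast threshold), while two large inputs fall back to a sort-merge join.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[*]").appName("Plans").getOrCreate()

    small = spark.range(100).withColumnRenamed("id", "k")
    big = spark.range(1_000_000).withColumnRenamed("id", "k")
    joined = big.join(small, "k")

    # Physical plan only; expect BroadcastHashJoin since `small` is tiny.
    joined.explain()

    # Parsed, analyzed, optimized (Catalyst), and physical plans together.
    joined.explain(extended=True)

    spark.stop()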

12. Notebooks
Notebooks
Interactive, Sharable Notebook
Apache Zeppelin
Anaconda
IPython
Jupyter Notebook
Spark and Jupyter
PySpark Notebook
Create Notebook
Notebook Code
RStudio
Shiny