Data Science for Solution Architects

Duration: 4 Days

Participants should have a general knowledge of statistics and programming.
This intensive training course introduces the theoretical and technical aspects of Data Science and Business Analytics. It covers both fundamental and advanced concepts and methods for deriving business insights from raw data using cost-effective data processing solutions. Hands-on labs reinforce the material presented.
This course is designed for Enterprise Architects, Solution Architects, Information Technology Architects, Business Analysts, Senior Developers, and Team Leads.
Upon completion of this course, participants will have learned about:
  • Applied Data Science and Business Analytics
  • Algorithms, Techniques and Common Analytical Methods
  • NoSQL and Big Data Systems Overview
  • MapReduce
  • Big Data Business Intelligence and Analytics
  • Visualizing and Reporting Processed Results
  • Data Analysis with R
  • Hadoop Programming Ecosystem
Chapter 1. Applied Data Science
What is Data Science?
Data Science Ecosystem
Data Mining vs. Data Science
Business Analytics vs. Data Science
Who is a Data Scientist?
Data Science Skill Sets Venn Diagram
Data Scientists at Work
Examples of Data Science Projects
An Example of a Data Product
Applied Data Science at Google
Data Science Gotchas

Chapter 2. Data Science Algorithms and Analytical Methods
Supervised vs Unsupervised Machine Learning
Supervised Machine Learning Algorithms
Unsupervised Machine Learning Algorithms
Choose the Right Algorithm
Life-cycles of Machine Learning Development
Classifying with k-Nearest Neighbors (SL)
k-Nearest Neighbors Algorithm
Decision Trees (SL)
Naive Bayes Classifier (SL)
Naive Bayesian Probabilistic Model in a Nutshell
Unsupervised Learning Type: Clustering
K-Means Clustering (UL)
K-Means Clustering in a Nutshell
Time-Series Analysis
Decomposing Time-Series
Monte-Carlo Simulation (Method)
Who Uses Monte-Carlo Simulation?
Monte-Carlo Simulation in a Nutshell

Chapter 3. Introduction to R
Positioning of R in the Data Science Arena
R Integrated Development Environments
General Notes on R Commands and Statements
R Data Structures
R Objects and Workspace
Assignment Operators
Assignment Example
Arithmetic Operators
Logical Operators
System Date and Time
User-defined Functions
User-defined Function Example
R Code Example
Control Statements
Conditional Execution
Repetitive Execution
Built-in Functions
Reading Data from Files into Vectors
Example of Reading Data from a File
Writing Data to a File
Example of Writing Data to a File
Matrix Data Structure
Creating Matrices
Working with Data Frames
Matrices vs Data Frames
A Data Frame Sample
Accessing Data Cells
Getting Info About a Data Frame
Selecting Columns in Data Frames
Selecting Rows in Data Frames
Getting a Subset of a Data Frame
Sorting (ordering) Data in Data Frames by Attribute(s)
Applying Functions to Matrices and Data Frames
Using the apply() Function
Example of Using apply()
Listing Objects in Workspace
Saving Your Workspace
Loading Your Workspace
Batch (Unattended) Processing
Importing Data into R
Exporting Data from R
Standard R Packages
Extending R

Chapter 4. R Statistical Computing Features
Statistical Computing Features
Descriptive Statistics
Basic Statistical Functions
Examples of Using Basic Statistical Functions
Non-uniformity of a Probability Distribution
Writing Your Own skew and kurtosis Functions
Generating Normally Distributed Random Numbers
Generating Uniformly Distributed Random Numbers
Using the summary() Function
Math Functions Used in Data Analysis
Examples of Using Math Functions
Correlation Example
Testing Correlation Coefficient for Significance
The cor.test() Function
The cor.test() Example
Regression Analysis
Types of Regression
Simple Linear Regression Model
Least-Squares Method (LSM)
LSM Assumptions
Fitting Linear Regression Models in R
Example of Using lm()
Confidence Intervals for Model Parameters
Example of Using lm() with a Data Frame
Regression Models in Excel
Multiple Regression Analysis
Finding the Best-Fitting Regression Model
Comparing Regression Models

Chapter 5. Defining Big Data
Transforming Data into Business Information
Quality of Data
Gartner's Definition of Big Data
More Definitions of Big Data
Processing Big Data
Challenges Posed by Big Data
The Cloud and Big Data
The Business Value of Big Data
Big Data: Hype or Reality?
Big Data Quiz
Big Data Quiz Answers

Chapter 6. What is NoSQL?
Limitations of Relational Databases
Limitations of Relational Databases (Cont'd)
Defining NoSQL
What are NoSQL (Not Only SQL) Databases?
The Past and Present of the NoSQL World
NoSQL Database Properties
NoSQL Benefits
NoSQL Database Storage Types
The CAP Theorem
Mechanisms to Guarantee a Single CAP Property
Limitations of NoSQL Databases
Big Data Sharding
Sharding Example
Quiz Answers

Chapter 7. MapReduce Overview
MapReduce Defined
Google's MapReduce
The Map Phase of MapReduce
The Reduce Phase of MapReduce
MapReduce Explained
MapReduce Word Count Job
MapReduce Shared-Nothing Architecture
Similarity with SQL Aggregation Operations
Example of Map & Reduce Operations using JavaScript
Problems Suitable for Solving with MapReduce
Typical MapReduce Jobs
Fault-tolerance of MapReduce
Distributed Computing Economics
MapReduce Systems

Chapter 8. Introduction to MongoDB
MongoDB Features
MongoDB's Logo
Positioning of MongoDB
MongoDB Limitations
MongoDB Operational Intelligence
MongoDB Use Cases
MongoDB Data Model
The _id Primary Key Field Considerations
MongoDB Data Model
Data Modeling in RDBMS
Data Modeling in MongoDB
MongoDB Data Modeling
A Sample JSON Document Matching the Schema
Data Lifecycle Management
Data Lifecycle Management: TTL
Data Lifecycle Management: Capped Collections
MongoDB Query Language (QL)
The find and findOne Methods
A MongoDB QL Example
Data Inserts
Creating an Index
MongoDB vs Apache CouchDB

Chapter 9. Hadoop Overview
Apache Hadoop
Apache Hadoop Logo
Typical Hadoop Applications
Hadoop Clusters
Hadoop Design Principles
Hadoop's Core Components
Hadoop Simple Definition
High-Level Hadoop Architecture
Hadoop-based Systems for Data Analysis
Hadoop Caveats

Chapter 10. Hadoop Distributed File System Overview
Hadoop Distributed File System
Data Blocks
Data Block Replication Example
HDFS NameNode Directory Diagram
Accessing HDFS
Examples of HDFS Commands
Client Interactions with HDFS for the Read Operation
Read Operation Sequence Diagram
Client Interactions with HDFS for the Write Operation
Communication inside HDFS

Chapter 11. MapReduce with Hadoop
Hadoop's MapReduce
MapReduce v1 ("Classic MapReduce")
JobTracker and TaskTracker
YARN (MapReduce v2)
MapReduce Programming Options
Java MapReduce API
The Structure of a Java MapReduce Program
The Mapper Class
The Reducer Class
The Driver Class
Compiling Classes
Running the MapReduce Job
The Structure of a Single MapReduce Program
Combiner Pass (Optional)
Hadoop's Streaming MapReduce
Python Word Count Mapper Program Example
Python Word Count Reducer Program Example
Setting up Java Classpath for Streaming Support
Streaming Use Cases
The Streaming API vs Java MapReduce API
Amazon Elastic MapReduce

Chapter 12. Apache Pig Scripting Platform
What is Pig?
Pig Latin
Apache Pig Logo
Pig Execution Modes
Local Execution Mode
MapReduce Execution Mode
Running Pig
Running Pig in Batch Mode
What is Grunt?
Pig Latin Statements
Pig Programs
Pig Latin Script Example
SQL Equivalent
Differences between Pig and SQL
Statement Processing in Pig
Comments in Pig
Supported Simple Data Types
Supported Complex Data Types
Defining a Relation's Schema
The bytearray Generic Type
Using Field Delimiters
Referencing Fields in Relations

Chapter 13. Apache Pig Relational and Eval Operators
Pig Relational Operators
Example of Using the JOIN Operator
Example of Using the Order By Operator
Caveats of Using Relational Operators
Pig Eval Functions
Caveats of Using Eval Functions (Operators)
Example of Using Single-column Eval Operations
Example of Using Eval Operators For Global Operations

Chapter 14. Apache Pig Performance
Apache Pig Performance
Performance Enhancer - Use the Right Schema Type
Performance Enhancer - Apply Data Filters
Use the PARALLEL Clause
Examples of the PARALLEL Clause
Performance Enhancer - Limiting the Data Sets
Displaying Execution Plan

Chapter 15. Hive
What is Hive?
Apache Hive Logo
Hive's Value Proposition
Who uses Hive?
Hive's Main Sub-Systems
Hive Features
Hive Architecture
Where are the Hive Tables Located?
Hive Command-line Interface (CLI)

Chapter 16. Hive Command-line Interface
Hive Command-line Interface (CLI)
The Hive Interactive Shell
Running Host OS Commands from the Hive Shell
Interfacing with HDFS from the Hive Shell
Hive in Unattended Mode
The Hive CLI Integration with the OS Shell
Executing HiveQL Scripts
Comments in Hive Scripts
Variables and Properties in Hive CLI
Setting Properties in CLI
Example of Setting Properties in CLI
Hive Namespaces
Using the SET Command
Setting Properties in the Shell
Setting Properties for the New Shell Session

Chapter 17. Hive Data Definition Language
Hive Data Definition Language
Creating Databases in Hive
Using Databases
Creating Tables in Hive
Supported Data Type Categories
Common Primitive Types
Example of the CREATE TABLE Statement
Table Partitioning
Table Partitioning on Multiple Columns
Viewing Table Partitions
Row Format
Data Serializers / Deserializers
File Format Storage
More on File Formats
The EXTERNAL DDL Parameter
Example of Using EXTERNAL
Creating an Empty Table
Dropping a Table
Table / Partition(s) Truncation
Alter Table/Partition/Column
Create View Statement
Why Use Views?
Restricting Amount of Viewable Data
Examples of Restricting Amount of Viewable Data
Creating and Dropping Indexes
Describing Data

Chapter 18. Hive Select Statement
The SELECT Statement Syntax
The WHERE Clause
Examples of the WHERE Clause
Partition-based Queries
Example of an Efficient SELECT Statement
Supported Numeric Operators
Built-in Mathematical Functions
Built-in Aggregate Functions
Built-in Statistical Functions
Other Useful Built-in Functions
The GROUP BY Clause
The HAVING Clause
The LIMIT Clause
The ORDER BY Clause
The JOIN Clause
The CASE … Clause
Example of CASE … Clause

Chapter 19. Apache Sqoop
What is Sqoop?
Apache Sqoop Logo
Sqoop Import / Export
Sqoop Help
Examples of Using Sqoop Commands
Data Import Example
Fine-tuning Data Import
Controlling the Number of Import Processes
Data Splitting
Helping Sqoop Out
Example of Executing Sqoop Load in Parallel
A Word of Caution: Avoid Complex Free-Form Queries
Using Direct Export from Databases
Example of Using Direct Export from MySQL
More on Direct Mode Import
Changing Data Types
Example of Default Types Overriding
File Formats
The Apache Avro Serialization System
Binary vs Text
More on the SequenceFile Binary Format
Generating the Java Table Record Source Code
Data Export from HDFS
Export Tool Common Arguments
Data Export Control Arguments
Data Export Example
Using a Staging Table
INSERT and UPDATE Statements
INSERT Operations
UPDATE Operations
Example of the Update Operation
Failed Exports

Chapter 20. Apache HBase
What is HBase?
HBase Design
HBase Features
The Write-Ahead Log (WAL) and MemStore
HBase vs RDBMS
HBase vs Apache Cassandra
Interfacing with HBase
HBase Thrift And REST Gateway
HBase Table Design
Column Families
A Cell's Value Versioning
Accessing Cells
HBase Table Design Digest
Table Horizontal Partitioning with Regions
HBase Compaction
Loading Data in HBase
HBase Shell
HBase Shell Command Groups
Creating and Populating a Table in HBase Shell
Getting a Cell's Value
Counting Rows in an HBase Table