Basic Linux command-line skills are valuable but not required. Each participant must be able to run a 64-bit virtual machine (provided with the course).
This course will teach participants how to use Apache Hadoop and Apache Spark to solve sophisticated data science problems, producing valuable insights in a wide range of scenarios.
Day one focuses on data science basics, including data acquisition, scrubbing and manipulation, along with a general overview of data science applications and the analytics and machine learning processes typically employed. A number of practical use cases are examined during class and lab sessions.
Day two focuses on Apache Hadoop and its ecosystem along with the types of data science applications typically handled by the Hadoop platform. The course outlines the statistical methods used to produce actionable business insights with MapReduce, Python, Pig, Mahout and other tools.
Day three begins with an overview of the Apache Spark platform and its machine learning library, MLlib.
Participants will learn how to perform entity ranking, implement recommendation engines and carry out other common data science tasks using Spark's batch, streaming, graph and machine learning capabilities.
This course is designed for application developers, analysts and data scientists.
Upon completion of this course, participants will be able to:
- Describe data science, its typical use cases and how data science is performed using a range of tools in the Apache open source ecosystem
Data Science Overview
Structured and Unstructured Data
Data Acquisition and Transformation
Data Analysis and Machine Learning
Common Hadoop Use Cases
Machine Learning with Mahout
NLTK and Natural Language Processing
Apache Spark Overview
Working with MLlib
Moving Applications to Production