Participants should be familiar with Java, SQL, and general operating system concepts.
This course provides Java programmers with a deep dive into Hadoop application development. Participants will learn how to design and develop MapReduce applications for Hadoop and how to manipulate, analyze, and perform computations on Big Data.
This course is designed for developers and engineers with programming experience.
Upon completion of this course, participants will be able to:
- Describe the features of Hadoop-based software, HDFS, and MapReduce programs.
- Use Cloudera to access Hadoop-based software, support, and services.
- Build a MapReduce program.
- Manage data using Hive.
- Perform queries using Impala.
- Manipulate data using Pig.
- Import data into Hadoop using Sqoop.
- Create a workflow using Oozie.
- Work with data using Flume.
Module 1: Getting Started with Hadoop-based Software
Motivation for Hadoop
History of Hadoop
Big Data Use Cases
What is Hadoop?
Components of Hadoop
The Hadoop Distributed File System (HDFS)
Hadoop Daemons and Ports
The HDFS Model
Class Discussion: Nodes and Associated Config Files in HDFS
HDFS Java API
Exercise: Developing Java Code and Using the HDFS Java API
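The HDFS Java API exercise centers on the `org.apache.hadoop.fs.FileSystem` class. A minimal sketch of the create/read/delete cycle follows; it requires `hadoop-client` on the classpath and a reachable NameNode, and the URI and paths are hypothetical.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch of the HDFS Java API exercise. Requires hadoop-client on the
// classpath and a running cluster; the NameNode URI and paths are
// hypothetical.
public class HdfsTour {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020");
        try (FileSystem fs = FileSystem.get(conf)) {
            Path dir = new Path("/user/student/demo");
            fs.mkdirs(dir);

            // Write a small file into HDFS
            Path file = new Path(dir, "hello.txt");
            try (FSDataOutputStream out = fs.create(file)) {
                out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
            }

            // Read it back line by line
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
                System.out.println(in.readLine());
            }

            fs.delete(dir, true); // recursive delete
        }
    }
}
```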
Module 2: Using Distributors to Access Hadoop-based Software, Support, and Services
Open Source Hadoop
Benefits of Using a Distributor
Distributors of Hadoop
The Apache Ambari System
Components of Cloudera Hadoop
Architecture of Cloudera Manager
Installing Cloudera CDH4 and Cloudera Manager
Cloudera Manager: Stopping and Restarting
Module 3: Building a MapReduce Program
MapReduce and YARN
MapReduce Programming in Java
Compile and Run the MapReduce Program
Job Submission in MRv1
NextGen MapReduce Using YARN
Job Submission in YARN
Exercise: Implement MapReduce Jobs
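Hadoop's real API uses `Mapper` and `Reducer` subclasses and a cluster-managed shuffle, but the map-and-reduce model behind this exercise can be sketched in plain Java with no Hadoop dependencies. The class and method names below are illustrative, not Hadoop's API; the shuffle is simulated with an in-memory map.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java sketch of the word-count MapReduce pattern.
// Names are illustrative, not Hadoop's API.
public class WordCountSketch {
    // map: emit a (word, 1) pair for every token in a line
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String token : line.toLowerCase().split("\\s+")) {
            if (!token.isEmpty()) {
                pairs.add(Map.entry(token, 1));
            }
        }
        return pairs;
    }

    // reduce: sum all values grouped under one key
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) sum += v;
        return sum;
    }

    // driver: map over all lines, group by key (the "shuffle"), then reduce
    static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> pair : map(line)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                       .add(pair.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        grouped.forEach((k, v) -> result.put(k, reduce(v)));
        return result;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts =
            run(List.of("the quick brown fox", "the lazy dog the end"));
        System.out.println(counts.get("the")); // 3
    }
}
```

In real Hadoop code the grouping step is performed by the framework between the map and reduce phases; only the two functions are written by the developer.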
Module 4: Managing Data Using Hive
The Components of Hive
Interfacing with Hive
Starting Beeline CLI and Connecting
Operators and Functions
Creating and Dropping Databases
Browsing, Altering, and Dropping Tables and Partitions
Exercise: Basic Commands for Working with Hive Tables
Exercise: Partition a Table
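The partitioning exercise can be sketched in HiveQL; the database, table, and column names below are hypothetical, and the statements assume a running Hive service.

```sql
-- Hypothetical names; a sketch of the table-partitioning exercise.
CREATE DATABASE IF NOT EXISTS sales_db;
USE sales_db;

-- Table partitioned by year: each year's data files land in their own
-- HDFS subdirectory, so queries filtered on `yr` prune partitions
-- instead of scanning the whole table.
CREATE TABLE orders (
  order_id INT,
  amount   DOUBLE
)
PARTITIONED BY (yr INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- Add a partition explicitly, browse the partitions, then drop one.
ALTER TABLE orders ADD PARTITION (yr = 2024);
SHOW PARTITIONS orders;
ALTER TABLE orders DROP PARTITION (yr = 2024);
```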
Module 5: Performing Queries Using Impala
The Impala Daemon
The Impala Statestore
The Impala Catalog Service
How Impala Works with Hive
Impala SQL Dialect
Class Discussion: Impala vs. Hive
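Because Impala reads the same metastore as Hive, a table created or loaded through Hive can be queried from Impala once its metadata is refreshed. A small sketch, reusing the hypothetical `orders` table name:

```sql
-- After Hive changes the metastore, reload its state into Impala,
-- then query with Impala's SQL dialect. Table name is hypothetical.
INVALIDATE METADATA;

SELECT yr, COUNT(*) AS num_orders
FROM   orders
GROUP  BY yr
ORDER  BY num_orders DESC
LIMIT  5;
```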
Module 6: Manipulating Data Using Pig
Simple and Complex Data Types
Nulls, Operators, and Functions
Starting the Pig Shell (Grunt)
Exercise: Executing Simple Pig Statements and Viewing the Output
Exercise: Group by Display from Table Records
Exercise: Tokenize Text of a File Using Pig and Grunt
JOIN (Inner and Outer)
Exercise: Performing an Inner Join on Two Data Sets Using Pig
Exercise: Join Two Sets of Data
Exercise: Finding First 5 Max Occurring Tokens Using Pig
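The final exercise, finding the five most frequent tokens, can be sketched in Pig Latin from the Grunt shell; the input path and alias names are hypothetical.

```pig
-- Hypothetical path and aliases; a sketch of the top-5 tokens exercise.
lines   = LOAD '/user/student/input.txt' AS (line:chararray);
tokens  = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS token;
grouped = GROUP tokens BY token;
counts  = FOREACH grouped GENERATE group AS token, COUNT(tokens) AS n;
sorted  = ORDER counts BY n DESC;
top5    = LIMIT sorted 5;
DUMP top5;
```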
Module 7: Importing Data into Hadoop Using Sqoop
Syntax to Use Sqoop Commands
Controlling the Import
Exercise: Importing to HDFS
Exercise: Importing to HDFS Directory
Exercise: Importing a Subset of Rows
Exercise: Encoding Database NULL Values While Importing
Exercise: Importing Tables Using One Command
Exercise: Using Sqoop’s Incremental Import Feature
Sqoop’s Export Methodology
Export Control Arguments
Exercise: Export Table Back to MySQL
Exercise: Modifying and Exporting Rows
Exercise: Adding Rows and Making Changes before Exporting
Exercise: Overriding the NULL Substitution Characters
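The import and export exercises above can be sketched as Sqoop command lines. The connection string, credentials, table names, and paths are hypothetical, and the commands assume a running cluster and MySQL database.

```shell
# Import a subset of rows into an HDFS directory, encoding SQL NULLs.
sqoop import \
  --connect jdbc:mysql://dbhost/shop \
  --username student --password-file /user/student/.pw \
  --table orders \
  --where "amount > 100" \
  --null-string '\\N' --null-non-string '\\N' \
  --target-dir /user/student/orders

# Incremental import: only rows with order_id beyond the last run.
sqoop import \
  --connect jdbc:mysql://dbhost/shop \
  --username student --password-file /user/student/.pw \
  --table orders \
  --incremental append --check-column order_id --last-value 1000

# Export the (possibly modified) HDFS data back to MySQL.
sqoop export \
  --connect jdbc:mysql://dbhost/shop \
  --username student --password-file /user/student/.pw \
  --table orders_copy \
  --export-dir /user/student/orders
```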
Module 8: Creating a Workflow Using Oozie
Introduction to Oozie
Features of Oozie
Creating a MapReduce Workflow
Start, End, and Error Nodes
Parallel Fork and Join Nodes
Workflow Jobs Lifecycle
Creating and Running a Workflow
Exercise: Create an Oozie Workflow from Terminal
Exercise: Create an Oozie Workflow Using Java API
Oozie Coordinator Components, Variables, and Parameters
Exercise: Create an Oozie Workflow from HUE
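An Oozie workflow is defined in XML with start, action, and end nodes, as covered above. A minimal sketch of a one-action MapReduce workflow follows; the application name, paths, and property values are hypothetical.

```xml
<!-- Minimal one-action MapReduce workflow sketch; names, paths, and
     property values are hypothetical. -->
<workflow-app name="wordcount-wf" xmlns="uri:oozie:workflow:0.5">
  <start to="wordcount"/>
  <action name="wordcount">
    <map-reduce>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <configuration>
        <property>
          <name>mapred.input.dir</name>
          <value>/user/student/input</value>
        </property>
        <property>
          <name>mapred.output.dir</name>
          <value>/user/student/output</value>
        </property>
      </configuration>
    </map-reduce>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail">
    <message>Job failed: ${wf:errorMessage(wf:lastErrorNode())}</message>
  </kill>
  <end name="end"/>
</workflow-app>
```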
Module 9: Working with Data Using Flume
Anatomy of a Flume Agent
Exercise: Flume with Avro as Source and Terminal as Sink
Exercise: Flume with Avro as Source and HDFS as Sink
Exercise: Flume with Netcat as Source and HDFS as Sink
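A Flume agent is wired together in a properties file: a source, a channel, and a sink. A minimal sketch of the netcat-to-HDFS exercise follows; the agent name, port, and HDFS path are hypothetical.

```properties
# Minimal sketch of the netcat-to-HDFS exercise; agent name, port,
# and paths are hypothetical.
a1.sources  = r1
a1.channels = c1
a1.sinks    = k1

# Netcat source: listen for lines on localhost:44444
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Memory channel buffers events between source and sink
a1.channels.c1.type = memory

# HDFS sink: write events under a date-stamped directory
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /user/student/flume/%Y-%m-%d
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.useLocalTimeStamp = true

# Wire the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel    = c1
```

The agent would then be launched with `flume-ng agent --name a1 --conf-file netcat-hdfs.properties`, and lines typed into `nc localhost 44444` would appear as files in the HDFS path.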