Hadoop Developer Essentials – Hortonworks

Course Code: BD68
Duration: 4 Days

This course provides Java programmers with a deep dive into Hadoop application development. Participants learn how to design and develop MapReduce applications for Hadoop and to manipulate, analyze, and perform computations on Big Data.
This course is designed for developers and engineers with programming experience; familiarity with Java, SQL, and general operating system concepts is beneficial.

In this course, participants will:

  • Describe the features of Hadoop-based software, HDFS, and MapReduce programs
  • Use Hortonworks to access Hadoop-based software, support, and services
  • Build a MapReduce program
  • Manage data using Hive
  • Manipulate data using Pig
  • Extract data into Hadoop using Sqoop
  • Create a workflow using Oozie
  • Work with data using Flume
  • Understand Spark and Kafka components

Module 1: Getting Started with Hadoop-based Software
Motivation for Hadoop
History of Hadoop
Big Data Use Cases
What is Hadoop?
Components of Hadoop
The Hadoop Distributed File System (HDFS)
Knowledge Check
MapReduce
Hadoop Architecture
Hadoop Daemons and Ports
The HDFS Model
Class Discussion: Nodes and Associated Config Files in HDFS
HDFS Java API
Lab Walkthrough: HDFS Commands Using the Console
Exercise: Developing Java Code and Using the HDFS Java API
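
To give a flavor of the HDFS Java API exercise above, here is a minimal sketch that writes a small file to HDFS and lists its parent directory. It assumes the Hadoop client configuration (core-site.xml/hdfs-site.xml) is on the classpath; the path and file contents are illustrative, not part of the course materials.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Minimal sketch: write a file to HDFS, then list its parent directory.
public class HdfsApiSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();       // picks up fs.defaultFS from the classpath
        FileSystem fs = FileSystem.get(conf);

        // Create (or overwrite) a small file; the path is illustrative.
        Path file = new Path("/user/student/hello.txt");
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeUTF("Hello, HDFS");
        }

        // List the parent directory, printing path and size for each entry.
        for (FileStatus status : fs.listStatus(file.getParent())) {
            System.out.println(status.getPath() + "\t" + status.getLen() + " bytes");
        }
        fs.close();
    }
}
```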

Module 2: Using Distributors to Access Hadoop-based Software, Support, and Services
Open Source Hadoop
Benefits of Using a Distributor
Distributors of Hadoop
Hortonworks Hadoop
The Apache Ambari System
Components of Hortonworks Hadoop
Architecture of Hortonworks
Installing Hortonworks
Stopping and Restarting Services

Module 3: Distributed Data Processing Engines
MapReduce and YARN
MapReduce Phases
MapReduce Programming in Java
Mapper Function
Reducer Function
Driver Function
Compile and Run the MapReduce Program
Job Submission in MRv1
NextGen MapReduce Using YARN
ResourceManager
ApplicationMaster
NodeManager
Job Submission in YARN
Exercise: Implement MapReduce Jobs (two hands-on lab sessions with examples; see the word-count sketch after this module)
Overview of Spark
Spark Architecture
Introduction to Scala
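
The mapper, reducer, and driver functions above are easiest to see together in the classic word-count program, sketched below using the standard org.apache.hadoop.mapreduce API. Input and output paths come from the command line; everything else is the stock pattern.

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Classic word count: the mapper emits (word, 1), the reducer sums the counts.
public class WordCount {

    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);    // emit (word, 1) for each token
            }
        }
    }

    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            context.write(key, new IntWritable(sum));   // emit (word, total)
        }
    }

    // Driver: wires mapper, reducer, and I/O paths, then submits the job to YARN.
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // local pre-aggregation on each mapper
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```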

Module 4: Managing Data Using Hive
Hive
HiveQL (HQL)
The Components of Hive
Query Flows
Interfacing with Hive
Hive Commands
Starting Beeline CLI and Connecting
Hive Data
Data Types
Operators and Functions
Overview of the Hue UI
Walkthrough: Hue Using an Example
Creating and Dropping Databases
Hive Tables
Hive Views
ORDER BY vs SORT BY vs DISTRIBUTE BY vs CLUSTER BY
Hive Partitions
Browsing, Altering, and Dropping Tables and Partitions
Bucketing in Hive
Bucketing Advantages
Sampling in Hive
Loading Data
Exercise: Basic Commands for Working with Hive Tables
Exercise: Partition a Table
Exercise: Bucketing in Hive
Exercise: View in Hive
Hive Indexes
Hive UDF
Exercise: Custom Hive UDF
Running Hive in Script Mode
Best Practices in Hive
Optimization Techniques
Demonstrate Performance Improvements Using a Sample
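
Beeline, Hue, and Hive scripts all ultimately talk to HiveServer2, which Java programs can also reach over JDBC. Below is a minimal sketch; the host, port, credentials, and table schema are illustrative, not part of the course materials.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Connect to HiveServer2 over JDBC (the same endpoint Beeline uses) and run HQL.
public class HiveJdbcSketch {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection con = DriverManager.getConnection(
                "jdbc:hive2://localhost:10000/default", "student", "");
             Statement stmt = con.createStatement()) {

            // DDL: a simple managed table (hypothetical schema).
            stmt.execute("CREATE TABLE IF NOT EXISTS employees "
                    + "(id INT, name STRING, dept STRING) "
                    + "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','");

            // Query: count rows per department.
            try (ResultSet rs = stmt.executeQuery(
                    "SELECT dept, COUNT(*) FROM employees GROUP BY dept")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
                }
            }
        }
    }
}
```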

Module 5: Manipulating Data Using Pig
Pig Relations
Simple and Complex Data Types
Nulls, Operators, and Functions
Expressions
Schemas
How to Start Pig Shell (grunt)
Pig Functions
Exercise: Executing Simple Pig Statements and Viewing the Output
FOREACH Function
GROUP
Exercise: Group by Display from Table Records
Exercise: Tokenize Text of a File Using Pig and Grunt
JOIN (Inner) and (Outer)
COGROUP
Exercise: Performing an Inner Join on Two Data Sets Using Pig
Exercise: Join Two Sets of Data
FILTER
Exercise: Finding First 5 Max Occurring Tokens Using Pig
Sample Operator
User Defined Function (UDF) in Pig
Exercise: UDF Example
Built-In UDF Examples
Best Practices in Pig
Optimization Techniques
Demonstrate Performance Improvements Using a Sample
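
Pig Latin can also be driven from Java through the PigServer API instead of the grunt shell. The sketch below mirrors the tokenize-and-count exercises above; the input path is illustrative.

```java
import java.util.Iterator;
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;
import org.apache.pig.data.Tuple;

// Run Pig Latin from Java: the same statements you would type at grunt.
public class PigSketch {
    public static void main(String[] args) throws Exception {
        PigServer pig = new PigServer(ExecType.MAPREDUCE);

        // LOAD the file, split each line into words, then GROUP and COUNT.
        pig.registerQuery("lines = LOAD '/user/student/access.log' AS (line:chararray);");
        pig.registerQuery("words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;");
        pig.registerQuery("grouped = GROUP words BY word;");
        pig.registerQuery("counts = FOREACH grouped GENERATE group, COUNT(words);");

        // Pull the result tuples back to the client and print them.
        Iterator<Tuple> it = pig.openIterator("counts");
        while (it.hasNext()) {
            System.out.println(it.next());
        }
        pig.shutdown();
    }
}
```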

Module 6: Extracting Data into Hadoop Using Sqoop
Sqoop
Syntax to Use Sqoop Commands
Sqoop Import
Controlling the Import
Exercise: Importing to HDFS
Exercise: Importing to HDFS Directory
Exercise: Importing a Subset of Rows
Exercise: Encoding Database NULL Values While Importing
Exercise: Importing Tables Using One Command
Exercise: Using Sqoop’s Incremental Import Feature
Sqoop Export
Sqoop’s Export Methodology
Export Control Arguments
Exercise: Export Table Back to MySQL
Exercise: Modifying and Exporting Rows
Exercise: Adding Rows and Making Changes before Exporting
Exercise: Overriding the NULL Substitution Characters
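
Sqoop is normally run from the command line, but Sqoop 1 also exposes a Java entry point, org.apache.sqoop.Sqoop.runTool, that accepts the same arguments as the CLI. The sketch below mirrors a basic import; the JDBC URL, credentials, table, and target directory are all illustrative.

```java
import org.apache.sqoop.Sqoop;

// Invoke a Sqoop import programmatically; the argument array mirrors the CLI.
public class SqoopImportSketch {
    public static void main(String[] args) {
        String[] importArgs = {
            "import",
            "--connect", "jdbc:mysql://dbhost:3306/retail",   // illustrative source DB
            "--username", "student",
            "--password", "secret",
            "--table", "orders",                              // table to import
            "--target-dir", "/user/student/orders",           // HDFS destination
            "--num-mappers", "1"                              // single mapper for a small table
        };
        int exitCode = Sqoop.runTool(importArgs);
        System.exit(exitCode);
    }
}
```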

Module 7: Create a Workflow Using Oozie
Introduction to Oozie
Features of Oozie
Oozie Workflow
Creating a MapReduce Workflow
Start, End, and Error Nodes
Parallel Fork and Join Nodes
Workflow Jobs Lifecycle
Workflow Notifications
Workflow Manager
Creating and Running a Workflow
Exercise: Create an Oozie Workflow from Terminal
Exercise: Create an Oozie Workflow Using Java API
Oozie Coordinator Sub-groups
Oozie Coordinator Components, Variables, and Parameters
Exercise: Create an Oozie Workflow from Hue
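
The Java API exercise above relies on the Oozie client library. Here is a minimal submission sketch; the Oozie URL, HDFS application path, and job properties are illustrative.

```java
import java.util.Properties;
import org.apache.oozie.client.OozieClient;
import org.apache.oozie.client.WorkflowJob;

// Submit a deployed workflow application through the Oozie Java API.
public class OozieSubmitSketch {
    public static void main(String[] args) throws Exception {
        OozieClient client = new OozieClient("http://oozie-host:11000/oozie");

        Properties conf = client.createConfiguration();
        conf.setProperty(OozieClient.APP_PATH, "hdfs://namenode:8020/user/student/wf-app");
        conf.setProperty("nameNode", "hdfs://namenode:8020");
        conf.setProperty("jobTracker", "resourcemanager:8032");

        // run() submits and starts the workflow, returning its job id.
        String jobId = client.run(conf);
        System.out.println("Submitted workflow " + jobId);

        // Poll until the workflow leaves the RUNNING state.
        while (client.getJobInfo(jobId).getStatus() == WorkflowJob.Status.RUNNING) {
            Thread.sleep(5_000);
        }
        System.out.println("Final status: " + client.getJobInfo(jobId).getStatus());
    }
}
```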

Module 8: Working with Data Using Flume
Overview of Real-Time Ingestion into Hadoop (Flume, Kafka, and Spark Streaming)
Flume
Anatomy of a Flume Agent
Flume Sources
Flume Channels
Flume Sinks
Running Flume
Exercise: Flume with Avro as Source and Terminal as Sink
Exercise: Flume with Avro as Source and HDFS as Sink
Exercise: Flume with Netcat as Source and HDFS as Sink
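
Besides netcat and Avro sources fed from the shell, applications can push events into a running agent's Avro source through Flume's client SDK. The sketch below sends a few events; the host and port are illustrative and must match the agent's Avro source configuration.

```java
import java.nio.charset.StandardCharsets;
import org.apache.flume.Event;
import org.apache.flume.api.RpcClient;
import org.apache.flume.api.RpcClientFactory;
import org.apache.flume.event.EventBuilder;

// Send events from Java to a Flume agent's Avro source.
public class FlumeClientSketch {
    public static void main(String[] args) throws Exception {
        RpcClient client = RpcClientFactory.getDefaultInstance("flume-host", 41414);
        try {
            for (int i = 0; i < 10; i++) {
                Event event = EventBuilder.withBody("event " + i, StandardCharsets.UTF_8);
                client.append(event);   // delivered to the agent's channel, then its sink
            }
        } finally {
            client.close();
        }
    }
}
```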

Module 9: Mini Project
Discuss the Problem Statement
Demonstrate End-to-End Implementation Methodology