Introduction to Hadoop Development

Course:  HDPDEV
Duration:  5 Days
Level:  II
Course Summary

You will learn how to use Apache Hadoop and write MapReduce programs. You will begin with a quick overview of installing Hadoop and setting it up in a cluster, and then proceed to writing data analytics programs. The course presents the basic concepts of MapReduce applications developed using Hadoop, including a close look at framework components, the use of Hadoop for a variety of data analysis tasks, and numerous examples of Hadoop in action. The course then examines related technologies such as Hive, Pig, and Apache Accumulo. Hive is data warehouse software for querying and managing large datasets. Pig is a platform that takes advantage of parallelization when running data analysis. Apache Accumulo is a highly scalable structured store based on Google's BigTable; it is written in Java and operates over the Hadoop Distributed File System (HDFS). Finally, you will see how Hadoop works in and supports cloud computing, and explore examples with Amazon Web Services along with case studies.

This class is focused on the Hadoop 2.0 (pre-)release.

Topics Covered In This Course

What is Hadoop?

  • Understanding distributed systems and Hadoop
  • Comparing SQL databases and Hadoop
  • Understanding MapReduce
  • Counting words with Hadoop: running your first program
  • History of Hadoop

Starting Hadoop

  • The building blocks of Hadoop
  • Setting up SSH for a Hadoop cluster
  • Running Hadoop
  • Web-based cluster UI

Components of Hadoop

  • Working with files in HDFS
  • Anatomy of a MapReduce program
  • Reading and writing

Writing basic MapReduce programs

  • Constructing the basic template of a MapReduce program
  • Counting things
  • Adapting for Hadoop's API changes
  • Streaming in Hadoop
  • Improving performance with combiners
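
As a rough illustration of the word-counting and combiner topics listed above, the following is a minimal plain-Java sketch of the map, combine, and reduce phases. It deliberately uses no Hadoop classes, and all names in it (WordCountSketch, map, combine, reduce, run) are hypothetical, chosen only for this example; course labs would use the actual Hadoop API.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration only: no Hadoop classes are used, and every
// name here is hypothetical.
class WordCountSketch {

    // Map phase: emit a (word, 1) pair for every word in one input line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) {
                out.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return out;
    }

    // Combine phase: pre-sum the pairs produced by a single mapper so that
    // fewer pairs would have to be shipped to the reducers.
    static Map<String, Integer> combine(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> partial = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            partial.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return partial;
    }

    // Reduce phase: sum the partial counts from all mappers for each word.
    static Map<String, Integer> reduce(List<Map<String, Integer>> partials) {
        Map<String, Integer> totals = new TreeMap<>();
        for (Map<String, Integer> partial : partials) {
            partial.forEach((word, n) -> totals.merge(word, n, Integer::sum));
        }
        return totals;
    }

    // Treat each input line as the work of one mapper, then reduce.
    static Map<String, Integer> run(List<String> lines) {
        List<Map<String, Integer>> partials = new ArrayList<>();
        for (String line : lines) {
            partials.add(combine(map(line)));
        }
        return reduce(partials);
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList(
                "the quick brown fox",
                "the lazy dog and the fox")));
    }
}
```

Because summing is associative and commutative, the same logic can serve as both combiner and reducer here; that property is what makes a combiner safe for a counting job.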

Advanced MapReduce

  • Chaining MapReduce jobs
  • Joining data from different sources
  • Creating a Bloom filter
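
For readers unfamiliar with the Bloom filter named in the last topic, here is a minimal plain-Java sketch of the data structure. It is an assumption-laden illustration, not the course's lab code: the class name BloomFilterSketch, the double-hashing scheme built on String.hashCode, and the sizes used are all hypothetical choices made for brevity.

```java
import java.util.BitSet;

// Hypothetical sketch of a Bloom filter: a compact set summary that can
// report false positives but never false negatives.
class BloomFilterSketch {
    private final BitSet bits;
    private final int size;
    private final int numHashes;

    BloomFilterSketch(int size, int numHashes) {
        if (size <= 0 || numHashes <= 0) {
            throw new IllegalArgumentException("size and numHashes must be positive");
        }
        this.bits = new BitSet(size);
        this.size = size;
        this.numHashes = numHashes;
    }

    // Derive the i-th bit position from two base hashes (double hashing).
    // This simple scheme is an assumption made for illustration.
    private int position(String key, int i) {
        int h1 = key.hashCode();
        int h2 = (h1 >>> 16) | 1; // force an odd step so positions vary with i
        return Math.floorMod(h1 + i * h2, size);
    }

    // Record a key by setting each of its bit positions.
    void add(String key) {
        for (int i = 0; i < numHashes; i++) {
            bits.set(position(key, i));
        }
    }

    // Returns false only if the key was definitely never added;
    // true means "possibly present".
    boolean mightContain(String key) {
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(position(key, i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        BloomFilterSketch filter = new BloomFilterSketch(1024, 3);
        filter.add("patent-1");
        System.out.println(filter.mightContain("patent-1"));
    }
}
```

A filter like this is small enough to hand to every task, which is why it can be useful for discarding records that cannot match before a more expensive join step.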

Programming Practices

  • Developing MapReduce programs
  • Monitoring and debugging on a production cluster
  • Tuning for performance


  • Passing job-specific parameters to your tasks
  • Probing for task-specific information
  • Partitioning into multiple output files
  • Inputting from and outputting to a database
  • Keeping all output in sorted order

Managing Hadoop

  • Setting up parameter values for practical use
  • Checking the system's health
  • Setting permissions
  • Managing quotas
  • Enabling trash
  • Removing DataNodes
  • Adding DataNodes
  • Managing NameNode and Secondary NameNode
  • Recovering from a failed NameNode
  • Designing network layout and rack awareness
  • Scheduling jobs from multiple users

Running Hadoop in the cloud

  • Introducing Amazon Web Services
  • Setting up AWS
  • Setting up Hadoop on EC2
  • Running MapReduce programs on EC2
  • Cleaning up and shutting down your EC2 instances
  • Amazon Elastic MapReduce and other AWS services

Programming with Pig

  • Installing Pig
  • Running Pig
  • Learning Pig Latin through Grunt
  • Speaking Pig Latin
  • Working with user-defined functions
  • Working with scripts
  • Seeing Pig in action: an example of computing similar patents

Overview of Hadoop-Related Technologies

  • Hive
  • Apache Accumulo
  • NoSQL
  • Mahout

Recommended Prerequisites

Attendees should have solid Java development experience, including use of Eclipse or a similar IDE, as well as experience with JPA and data access. Exposure to a UNIX/Linux shell such as bash or tcsh is also helpful.

Training Style

This course is approximately 40% lecture and 60% hands-on labs.

Related Courses
Course Title                                                Duration
Introduction to NoSQL                                       3 Days
Apache Hadoop                                               5 Days
Hadoop Administration                                       3 Days
Real World Hadoop in the Enterprise                         5 Days
Developing Data-driven Applications with Apache Accumulo    3 Days

Every student attending a Verhoef Training class will receive a certificate good for $100 toward their next public class taken within a year.

You can also buy "Verhoef Vouchers" to get a discounted rate for a single student in any of our public or web-based classes. Contact your account manager or our sales office for details.

Schedule For This Course
There are currently no public sessions scheduled for this course. We can schedule a private class for your organization just a couple of weeks from now. Or we can let you know the next time we do schedule a public session.
Can't find the course you want?
Call us at 800.533.3893, or
email us at [email protected]