Apache Hadoop

Course: HDOOP
Duration: 5 Days
Level: II

Course Summary

Apache Hadoop is an open-source framework for building reliable, distributed compute clusters. Credited with a role in IBM Watson's Jeopardy! win in 2011, Hadoop can be used (with other related frameworks) to process large unstructured or semi-structured data sets from multiple sources: dissecting and classifying the data, learning from it, and making suggestions for business analytics, decision support, and other advanced forms of machine intelligence.

This course goes well beyond the "Hello World" word-count example into practical, applied uses of Hadoop in large-scale real-world scenarios, including fraud detection, algorithmic trading, and data mining. Students will develop in an environment architected for a dynamically changing, business-rule-driven infrastructure with multiple disparate data sources and large-scale data sets, on a real Hadoop/Drools cluster.

Topics Covered In This Course


  • Map/Reduce
  • Hadoop
  • NoSQL
  • Mahout
  • Alternate Frameworks

Hadoop Architecture

  • Hadoop Map/Reduce
  • HDFS
  • Cassandra
  • HBase
  • Hive
  • Pig

Retrieving and Localizing Data

  • Using JPA in Map/Reduce: Pros and Cons
  • HDFS
  • NoSQL
  • HBase
  • Cassandra
  • Neo4J
  • Sqoop
  • Flume
  • Caching with JBoss Infinispan
  • Caching with OpenTerracotta
  • Using Spring Data

Feeding Hadoop in the Enterprise

  • Apache UIMA
  • Spring Integration
  • Apache Camel
  • Spring Batch

Machine Learning with Mahout

  • Artificial Intelligence Overview
  • Fuzzy Logic
  • K-Means
  • Pattern Mining
  • Bayesian Classifiers
  • Analytics
  • Random Forests
  • Decision Support with Mahout and Hadoop

Applying Business Rules with Drools

  • Drools Overview
  • Integrating a Rules-based Approach with Hadoop
  • Decision Making with Drools and Hadoop
  • Integrating Drools, Mahout, and Hadoop

Pig and Pig Pipelines

  • Pig Latin
  • Pig Pipelines
  • Pig UDFs (User Defined Functions)

Working with the Hive

  • Hive and HDFS
  • Metadata and indexing
  • Hive UDFs (User Defined Functions)
  • Hive and Amazon S3
  • HiveQL (HQL)

Testing, Performance and Troubleshooting

  • TDD with MRUnit
  • TDD with other Unit Testing Frameworks
  • Bottleneck discovery
  • Monitoring
  • Join Framework Optimization
  • Troubleshooting
  • Hadoop and Virtualization
  • Hadoop in the Cloud
  • Hadoop and Amazon EC2

Recommended Prerequisites

Students are expected to have experience with Java development in Eclipse, the JPA API for data persistence and access, and the UNIX shell.

Training Style

40% lecture and 60% hands-on labs.

Related Courses

  • Introduction to Hadoop Development (5 Days)
  • Hadoop Administration (3 Days)
  • Real World Hadoop in the Enterprise (5 Days)
  • Developing Data-driven Applications with Apache Accumulo (3 Days)

Every student attending a Verhoef Training class will receive a certificate good for $100 toward their next public class taken within a year.

You can also buy "Verhoef Vouchers" to get a discounted rate for a single student in any of our public or web-based classes. Contact your account manager or our sales office for details.

Schedule For This Course
There are currently no public sessions scheduled for this course. We can schedule a private class for your organization just a couple of weeks from now. Or we can let you know the next time we do schedule a public session.
Can't find the course you want? Call us at 800.533.3893.