Big Data Training & Consulting

Get training & advice from experts

Hadoop Training

Our Hadoop training equips individuals with the knowledge and skills required to work effectively with Hadoop, a powerful framework for processing and analyzing big data. Hadoop has gained immense popularity thanks to its ability to handle large volumes of data, its distributed processing model, fault tolerance, and scalability. Our comprehensive curriculum covers the Hadoop ecosystem, the Hadoop Distributed File System (HDFS), MapReduce programming, data ingestion techniques, data processing with Apache Hive and Apache Pig, and cluster administration and monitoring. The training combines hands-on exercises, real-world use cases, and practical examples to give learners a solid understanding of Hadoop's core concepts and their applications in the big data landscape. By completing Hadoop training, you will acquire the expertise needed to use Hadoop's capabilities effectively and to contribute to the efficient management and analysis of large-scale data across diverse industries.

Hadoop & Big Data Training Course Outline

Big Data Overview

  • Necessity of Big Data in the industry
  • Paradigm shift - why the industry is shifting to Big Data tools
  • Different dimensions of Big Data
  • Data explosion in the industry
  • Various implementations of Big Data
  • Different technologies to handle Big Data
  • Traditional systems and associated problems
  • Future of Big Data in the IT industry

Hadoop Introduction

  • Why Hadoop is at the heart of every Big Data solution
  • Introduction to the Hadoop framework
  • Hadoop architecture and design principles
  • Ingredients of Hadoop
  • Hadoop characteristics and data-flow
  • Components of the Hadoop ecosystem

Hadoop Installation & Configuration

  • Hadoop environment setup and pre-requisites
  • Installation and configuration of Hadoop
  • Working with Hadoop in pseudo-distributed mode
  • Troubleshooting common installation problems
  • Setup and Installation of Hadoop multi-node cluster
  • Configuration of masters and slaves on the cluster

Hadoop Storage - HDFS

  • What is HDFS (Hadoop Distributed File System)
  • HDFS daemons and architecture
  • HDFS data flow and storage mechanism
  • Hadoop HDFS characteristics and design principles
  • Responsibility of HDFS Master – NameNode
  • Storage mechanism of Hadoop meta-data
  • Work of HDFS Slaves – DataNodes
  • Data Blocks and distributed storage
  • Replication of blocks, reliability, and high availability
  • Rack-awareness, scalability, and other features
  • Different HDFS APIs and terminologies
  • Commissioning of nodes and addition of more nodes
  • Expanding clusters in real-time
  • Hadoop HDFS Web UI and HDFS explorer
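
The block-and-replica storage model covered above can be sketched in a few lines of Python. This is a toy illustration with hypothetical helper names, not the Hadoop API: real HDFS blocks default to 128 MB, and the NameNode's actual placement policy is rack-aware.

```python
# Conceptual sketch of HDFS storage: a file is split into fixed-size blocks,
# and each block is replicated across distinct DataNodes. Illustrative only.

BLOCK_SIZE = 4          # bytes, tiny for demonstration (HDFS defaults to 128 MB)
REPLICATION = 3         # HDFS's default replication factor

def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Split raw file data into fixed-size blocks, as the HDFS client does."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(block_id: int, datanodes: list, replication: int = REPLICATION):
    """Choose `replication` distinct DataNodes for a block. Round-robin here;
    the real NameNode also considers rack awareness and free space."""
    return [datanodes[(block_id + r) % len(datanodes)] for r in range(replication)]

datanodes = ["dn1", "dn2", "dn3", "dn4"]
blocks = split_into_blocks(b"hello hdfs world!")
placement = {i: place_replicas(i, datanodes) for i in range(len(blocks))}
```

Losing any single DataNode leaves two live replicas of every block, which is the basis of the reliability and high-availability properties listed above.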

MapReduce Introduction

  • What is MapReduce, the processing layer of Hadoop
  • The need for a distributed processing framework
  • Issues before MapReduce and its evolution
  • List processing concepts
  • Components of MapReduce – Mapper and Reducer
  • MapReduce terminologies - keys, values, lists, and more
  • Hadoop MapReduce execution flow
  • Mapping and reducing data based on keys
  • MapReduce word-count example to understand the flow
  • Execution of Map and Reduce together
  • Controlling the flow of mappers and reducers
  • Optimization of MapReduce Jobs
  • Fault-tolerance and data locality
  • Working with map-only jobs
  • Introduction to Combiners in MapReduce
  • How MR jobs can be optimized using combiners
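
The word-count flow above (map, combine, shuffle/sort, reduce) can be simulated in a single Python process. This is purely a teaching sketch; real Hadoop jobs run these phases distributed across a cluster with Java Mapper and Reducer classes.

```python
# Single-process simulation of the MapReduce word-count flow:
# map -> combine -> shuffle/sort -> reduce. Illustrative only.
from collections import defaultdict
from itertools import groupby

def mapper(line):
    """Emit a (word, 1) pair for every word in an input line."""
    for word in line.split():
        yield (word.lower(), 1)

def combiner(pairs):
    """Pre-aggregate counts on the map side to cut shuffle traffic."""
    acc = defaultdict(int)
    for key, value in pairs:
        acc[key] += value
    return acc.items()

def shuffle_sort(all_pairs):
    """Group values by key, as the framework does between map and reduce."""
    for key, group in groupby(sorted(all_pairs), key=lambda kv: kv[0]):
        yield key, [v for _, v in group]

def reducer(key, values):
    """Sum the per-key counts emitted by the mappers."""
    return key, sum(values)

lines = ["the quick brown fox", "the lazy dog", "the fox"]
mapped = [pair for line in lines for pair in combiner(mapper(line))]
counts = dict(reducer(k, vs) for k, vs in shuffle_sort(mapped))
```

Note how the combiner step collapses duplicate keys before shuffling, which is exactly the optimization the last bullet above refers to.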

MapReduce Advanced Concepts

  • Anatomy of MapReduce
  • Hadoop MapReduce data types
  • Developing custom data types using Writable & WritableComparable
  • InputFormat in MapReduce
  • InputSplit as a unit of work
  • How Partitioners partition data
  • Customization of RecordReader
  • Moving data from mapper to reducer – shuffling & sorting
  • Distributed cache and job chaining
  • Different Hadoop case-studies to customize each component
  • Job scheduling in MapReduce
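
The partitioning step from this module can be sketched as follows. Hadoop's default HashPartitioner takes the key's hashCode modulo the number of reducers; the helper below is an illustrative stand-in, not the Hadoop API.

```python
# Sketch of how a MapReduce Partitioner routes keys to reducers: every
# occurrence of a key must land on the same reducer so that reducer sees
# the complete group of values. Illustrative only.

def partition(key: str, num_reducers: int) -> int:
    """Assign a key to a reducer deterministically."""
    # A stable hand-rolled hash so the example is reproducible across runs
    # (Python's built-in hash() is salted per process for strings).
    h = sum(ord(c) * 31 ** i for i, c in enumerate(key))
    return h % num_reducers

keys = ["apple", "banana", "apple", "cherry"]
routing = {k: partition(k, num_reducers=3) for k in keys}
```

Custom partitioners follow the same contract and are used when the default hash distribution would skew load across reducers.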

Big Data Tools - Hive

  • The need for an ad-hoc SQL based solution – Apache Hive
  • Introduction to and architecture of Hadoop Hive
  • Playing with the Hive shell and running HQL queries
  • Hive DDL and DML operations
  • Hive execution flow
  • Schema design and other Hive operations
  • Schema-on-Read vs Schema-on-Write in Hive
  • Meta-store management and the need for RDBMS
  • Limitations of the default meta-store
  • Using SerDe to handle different types of data
  • Optimization of performance using partitioning
  • Different Hive applications and use cases
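
The schema-on-read idea from this module can be shown in miniature: raw delimited data is stored untouched, and the column schema is applied only when the data is read. The helper name below is hypothetical; Hive itself does this through table metadata in the metastore and SerDes.

```python
# Schema-on-read in miniature: raw rows are loaded without validation, and
# types are applied at query time. Illustrative only, not the Hive API.
import csv
import io

raw_storage = "1,alice,30\n2,bob,25\n"   # loaded as-is, no schema enforced

def read_with_schema(raw: str, schema):
    """Parse raw rows into typed records at read time."""
    rows = []
    for fields in csv.reader(io.StringIO(raw)):
        rows.append({name: cast(value)
                     for (name, cast), value in zip(schema, fields)})
    return rows

schema = [("id", int), ("name", str), ("age", int)]
records = read_with_schema(raw_storage, schema)
```

By contrast, a schema-on-write system (a traditional RDBMS) would reject or coerce the data at load time, before it ever reaches storage.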

Big Data Tools - Pig

  • The need for a high-level query language - Apache Pig
  • How Pig complements Hadoop with a scripting language
  • What is Pig
  • Pig execution flow
  • Different Pig operations like filter and join
  • Compilation of Pig code into MapReduce
  • Comparison - Pig vs MapReduce

HBase - NoSQL Columnar Data Store

  • NoSQL databases and their need in the industry
  • Introduction to Apache HBase
  • Internals of the HBase architecture
  • The HBase Master and Slave Model
  • Column-oriented, 3-dimensional, schema-less datastores
  • Data modeling in Hadoop HBase
  • Storing multiple versions of data
  • Data high-availability and reliability
  • Comparison - HBase vs HDFS
  • Comparison - HBase vs RDBMS
  • Data access mechanisms
  • Working with HBase using the shell
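
HBase's versioned, column-family cell layout can be modeled with a small toy class. This is an illustration of the data model only; the real client API is Java's Table.put/get, and the shell commands are `put`/`get`.

```python
# Toy model of HBase's data layout: each cell is addressed by
# (row key, "family:qualifier") and keeps multiple timestamped versions,
# newest first. Illustrative only, not the HBase API.
from collections import defaultdict

class VersionedStore:
    def __init__(self, max_versions: int = 3):
        self.max_versions = max_versions
        self.cells = defaultdict(list)   # (row, column) -> [(ts, value), ...]

    def put(self, row, column, value, ts):
        """Write a new version of a cell and prune versions beyond the limit."""
        versions = self.cells[(row, column)]
        versions.append((ts, value))
        versions.sort(reverse=True)      # newest version first
        del versions[self.max_versions:]

    def get(self, row, column):
        """Return the latest version of a cell, as HBase does by default."""
        versions = self.cells.get((row, column))
        return versions[0][1] if versions else None

store = VersionedStore()
store.put("user1", "info:city", "Cape Town", ts=1)
store.put("user1", "info:city", "Johannesburg", ts=2)
```

Reading a cell returns the newest version by default, while older versions remain retrievable up to the configured version limit — the "storing multiple versions of data" bullet above.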

Data Ingestion - Sqoop & Flume

  • Introduction and working of Sqoop
  • Importing data from RDBMS to HDFS
  • Exporting data to RDBMS from HDFS
  • Conversion of data import/export queries into MapReduce jobs
  • What is Apache Flume
  • Flume architecture and aggregation flow
  • Understanding Flume components like data Sources and Sinks
  • Flume channels to buffer events
  • Reliable & scalable data collection tools
  • Aggregating streams using Fan-in
  • Separating streams using Fan-out
  • Internals of the agent architecture
  • Production architecture of Flume
  • Collecting data from different sources to Hadoop HDFS
  • Multi-tier Flume flow for collection of volumes of data using Avro
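
Flume's source → channel → sink pipeline, including fan-in, can be sketched with a simple queue. All names here are illustrative stand-ins, not the Flume Java API: a real channel is a durable, transactional buffer, and a real sink writes to a terminal store such as HDFS.

```python
# Sketch of Flume's aggregation flow: multiple sources fan in to one
# buffering channel, and a single sink drains it. Illustrative only.
from queue import Queue

channel = Queue()                  # buffers events between sources and sink

def source(events, channel):
    """A source pushes events (e.g. log lines) into the channel."""
    for event in events:
        channel.put(event)

def sink(channel):
    """The sink drains buffered events toward the terminal store."""
    drained = []
    while not channel.empty():
        drained.append(channel.get())
    return drained

# Fan-in: two independent sources aggregate into the same channel.
source(["web-01: GET /", "web-01: GET /login"], channel)
source(["web-02: POST /api"], channel)
collected = sink(channel)
```

Fan-out is the mirror image: one source replicating or multiplexing its events across several channels, each drained by its own sink.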

Apache YARN

  • The need for and the evolution of YARN
  • YARN and its eco-system
  • YARN daemon architecture
  • Master of YARN – Resource Manager
  • Slave of YARN – Node Manager
  • Requesting resources from the application master
  • Dynamic slots (containers)
  • Application execution flow
  • MapReduce version 2 applications over YARN
  • Hadoop Federation and NameNode HA

Contact Us

Please contact us for any queries via phone or our contact form. We will be happy to answer your questions.

3 Appian Place, 373 Kent Ave
2194 South Africa
Tel: +2711-781 8014 (Johannesburg)
     +2721-020-0111 (Cape Town)

Contact Form

