Big Data Consulting & Training Services
Big Data Training - Hadoop & Spark
Hadoop and Spark are by far the most widely deployed components in corporate Big Data architectures. Together they form the bedrock of the Big Data ecosystem, providing scalability and ease of use and enabling companies to mine real value from their ever-growing volumes of data.
Hadoop & Big Data Training Course Outline
Big Data Overview
- Necessity of Big Data in the industry
- Paradigm shift - why the industry is shifting to Big Data tools
- Different dimensions of Big Data
- Data explosion in the industry
- Various implementations of Big Data
- Different technologies to handle Big Data
- Traditional systems and associated problems
- Future of Big Data in the IT industry
Hadoop Introduction
- Why Hadoop is at the heart of every Big Data solution
- Introduction to the Hadoop framework
- Hadoop architecture and design principles
- Ingredients of Hadoop
- Hadoop characteristics and data-flow
- Components of the Hadoop ecosystem
Hadoop Installation & Configuration
- Hadoop environment setup and pre-requisites
- Installation and configuration of Hadoop
- Working with Hadoop in pseudo-distributed mode
- Troubleshooting encountered problems
- Setup and Installation of Hadoop multi-node cluster
- Configuration of masters and slaves on the cluster
Hadoop Storage - HDFS
- What is HDFS (Hadoop Distributed File System)
- HDFS daemons and architecture
- HDFS data flow and storage mechanism
- Hadoop HDFS characteristics and design principles
- Responsibility of HDFS Master – NameNode
- Storage mechanism of Hadoop meta-data
- Work of HDFS Slaves – DataNodes
- Data Blocks and distributed storage
- Replication of blocks, reliability, and high availability
- Rack-awareness, scalability, and other features
- Different HDFS APIs and terminologies (see the API sketch after this list)
- Commissioning of nodes and addition of more nodes
- Expanding clusters in real-time
- Hadoop HDFS Web UI and HDFS explorer
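The HDFS APIs referenced in this module can be exercised programmatically as well as through the web UI. Below is a minimal sketch written in Scala (the same language used in the Spark modules of this outline) against the standard Hadoop FileSystem API; the paths are placeholders and the configuration is assumed to come from a core-site.xml on the classpath:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object HdfsApiDemo {
  def main(args: Array[String]): Unit = {
    // fs.defaultFS and other settings are read from core-site.xml on the classpath
    val conf = new Configuration()
    val fs   = FileSystem.get(conf)

    // List the root directory, showing per-file replication factor and block size
    fs.listStatus(new Path("/")).foreach { status =>
      println(s"${status.getPath}  replication=${status.getReplication}  blockSize=${status.getBlockSize}")
    }

    // Write a small file (placeholder path used only for illustration)
    val out = fs.create(new Path("/tmp/hdfs-api-demo.txt"))
    out.writeBytes("written through the HDFS FileSystem API\n")
    out.close()

    fs.close()
  }
}
```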
MapReduce Introduction
- What is MapReduce, the processing layer of Hadoop
- The need for a distributed processing framework
- Issues before MapReduce and its evolution
- List processing concepts
- Components of MapReduce – Mapper and Reducer
- MapReduce terminologies - keys, values, lists, and more
- Hadoop MapReduce execution flow
- Mapping and reducing data based on keys
- MapReduce word-count example to understand the flow (sketched in Scala after this list)
- Execution of Map and Reduce together
- Controlling the flow of mappers and reducers
- Optimization of MapReduce Jobs
- Fault-tolerance and data locality
- Working with map-only jobs
- Introduction to Combiners in MapReduce
- How MR jobs can be optimized using combiners
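To make the word-count flow concrete before touching the Hadoop API itself, here is a minimal sketch of the same map → shuffle/sort → reduce idea using plain Scala collections; it illustrates the key/value data flow only and is not a Hadoop MapReduce job:

```scala
object WordCountFlow {
  def main(args: Array[String]): Unit = {
    val lines = Seq("big data tools", "big data training", "hadoop and spark")

    // Map phase: emit a (word, 1) pair for every word in every line
    val mapped: Seq[(String, Int)] =
      lines.flatMap(_.split("\\s+")).map(word => (word, 1))

    // Shuffle & sort, then reduce: group the pairs by key and sum the counts per key
    val reduced: Map[String, Int] =
      mapped.groupBy(_._1).map { case (word, pairs) => (word, pairs.map(_._2).sum) }

    reduced.toSeq.sortBy(_._1).foreach { case (word, count) => println(s"$word\t$count") }
  }
}
```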
MapReduce Advanced Concepts
- Anatomy of MapReduce
- Hadoop MapReduce data types
- Developing custom data types using Writable & WritableComparable
- InputFormat in MapReduce
- InputSplit as a unit of work
- How Partitioners partition data (see the custom Partitioner sketch after this list)
- Customization of RecordReader
- Moving data from mapper to reducer – shuffling & sorting
- Distributed cache and job chaining
- Hadoop case studies on customizing each of these components
- Job scheduling in MapReduce
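As one concrete example of the customization points above, here is a minimal, hypothetical Partitioner written in Scala against the org.apache.hadoop.mapreduce API; it routes keys to reducers by their first letter and is a sketch for illustration rather than course material:

```scala
import org.apache.hadoop.io.{IntWritable, Text}
import org.apache.hadoop.mapreduce.Partitioner

// Hypothetical partitioner: all keys starting with the same letter go to the same reducer
class FirstLetterPartitioner extends Partitioner[Text, IntWritable] {
  override def getPartition(key: Text, value: IntWritable, numPartitions: Int): Int = {
    val first = key.toString.headOption.getOrElse('_').toLower
    (first.toInt & Integer.MAX_VALUE) % numPartitions
  }
}
```

It would be registered on the job with job.setPartitionerClass(classOf[FirstLetterPartitioner]).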
Big Data Tools - Hive
- The need for an ad-hoc SQL-based solution – Apache Hive
- Introduction to and architecture of Hadoop Hive
- Playing with the Hive shell and running HQL queries
- Hive DDL and DML operations
- Hive execution flow
- Schema design and other Hive operations
- Schema-on-Read vs Schema-on-Write in Hive
- Meta-store management and the need for RDBMS
- Limitations of the default meta-store
- Using SerDe to handle different types of data
- Optimization of performance using partitioning
- Different Hive applications and use cases
Big Data Tools - Pig
- The need for a high-level query language - Apache Pig
- How Pig complements Hadoop with a scripting language
- What is Pig
- Pig execution flow
- Different Pig operations like filter and join
- Compilation of Pig code into MapReduce
- Comparison - Pig vs MapReduce
HBase - NoSQL Columnar Data Store
- NoSQL databases and their need in the industry
- Introduction to Apache HBase
- Internals of the HBase architecture
- The HBase Master and Slave Model
- Column-oriented, 3-dimensional, schema-less datastores
- Data modeling in Hadoop HBase
- Storing multiple versions of data
- Data high-availability and reliability
- Comparison - HBase vs HDFS
- Comparison - HBase vs RDBMS
- Data access mechanisms (see the client API sketch after this list)
- Working with HBase using the shell
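The data access mechanisms above include the Java client API as well as the shell. Below is a minimal Scala sketch against the standard HBase client API; the table name "user_events" and column family "info" are hypothetical, and connection settings are assumed to come from an hbase-site.xml on the classpath:

```scala
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Get, Put}
import org.apache.hadoop.hbase.util.Bytes

object HBaseClientDemo {
  def main(args: Array[String]): Unit = {
    // ZooKeeper quorum and other settings come from hbase-site.xml on the classpath
    val conf       = HBaseConfiguration.create()
    val connection = ConnectionFactory.createConnection(conf)
    val table      = connection.getTable(TableName.valueOf("user_events")) // hypothetical table

    // Write one cell: row key, column family, qualifier, value
    val put = new Put(Bytes.toBytes("user-42"))
    put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("last_login"), Bytes.toBytes("2024-01-01"))
    table.put(put)

    // Read it back; HBase can keep multiple timestamped versions of a cell
    val result = table.get(new Get(Bytes.toBytes("user-42")))
    val value  = Bytes.toString(result.getValue(Bytes.toBytes("info"), Bytes.toBytes("last_login")))
    println(s"last_login = $value")

    table.close()
    connection.close()
  }
}
```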
Data Ingestion - Flume & Sqoop
- Introduction and working of Sqoop
- Importing data from RDBMS to HDFS
- Exporting data to RDBMS from HDFS
- Conversion of data import/export queries into MapReduce jobs
- What is Apache Flume
- Flume architecture and aggregation flow
- Understanding Flume components like data Sources and Sinks
- Flume channels to buffer events
- Reliable & scalable data collection tools
- Aggregating streams using Fan-in
- Separating streams using Fan-out
- Internals of the agent architecture
- Production architecture of Flume
- Collecting data from different sources to Hadoop HDFS
- Multi-tier Flume flow for collecting large volumes of data using Avro
Apache YARN
- The need for and the evolution of YARN
- YARN and its ecosystem
- YARN daemon architecture
- Master of YARN – Resource Manager
- Slave of YARN – Node Manager
- Requesting resources from the application master
- Dynamic slots (containers)
- Application execution flow
- Running MapReduce version 2 applications over YARN
HDFS Federation and NameNode HA
Spark Training Course Outline
Spark Introduction
- What is Apache Spark?
- Components of Spark architecture
- Apache Spark design principles
- Spark features and characteristics
- Apache Spark ecosystem components and their insights
Scala Introduction
- What is Scala
- Setup and configuration of Scala
- Developing and running basic Scala Programs
- Scala operations
- Functions and procedures in Scala
- Different Scala APIs for common operations
- Loops and collections - Arrays, Maps, Lists, Tuples
- Pattern matching for advanced operations (see the sketch after this list)
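A minimal, self-contained sketch of the building blocks above: a function, the common collections, a loop, and pattern matching:

```scala
object ScalaBasics {
  // A simple function with typed parameters and return type
  def square(x: Int): Int = x * x

  def main(args: Array[String]): Unit = {
    // Common collections: a List, a Map, and a Tuple
    val numbers = List(1, 2, 3, 4, 5)
    val squares = numbers.map(square)                    // List(1, 4, 9, 16, 25)
    val tools   = Map("storage" -> "HDFS", "processing" -> "Spark")
    val pair    = ("spark", 3)

    // A for loop over a collection
    for (n <- squares) println(n)

    // Pattern matching on the tuple
    pair match {
      case ("spark", v) => println(s"matched spark with value $v")
      case _            => println("no match")
    }

    println(tools.getOrElse("ingestion", "unknown"))
  }
}
```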
Deploying Spark
- Setting up the Spark Environment
- Installing and configuring prerequisites
- Installing Apache Spark in local mode
- Working with Spark in local mode
- Troubleshooting encountered problems in Spark
- Installing Spark in standalone mode
- Installing Spark in YARN mode
- Installing & configuring Spark on a real multi-node cluster
- Playing with Spark in cluster mode
Spark Shell
- Playing with the Spark shell
- Executing Scala and Java statements in the shell
- Understanding the Spark context and driver
- Reading data from the local filesystem
- Integrating Spark with HDFS
- Caching the data in memory for further use (see the shell sketch after this list)
- Distributed persistence
- Testing and troubleshooting
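The statements below are the kind typed directly into the Spark shell for the items above; in the shell, the SparkContext (`sc`) and SparkSession (`spark`) are pre-created by the driver, and the file paths here are placeholders for whatever exists in your environment:

```scala
// Read from the local filesystem (placeholder path)
val local = sc.textFile("file:///tmp/sample.txt")

// Read the same kind of data from HDFS (placeholder path)
val fromHdfs = sc.textFile("hdfs:///data/sample.txt")

// Cache the dataset in memory so later actions reuse it instead of re-reading the file
fromHdfs.cache()

// Actions trigger execution against the cached data
println(fromHdfs.count())
println(fromHdfs.first())
```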
RDD & Spark
- What is an RDD in Spark
- How do RDDs make Spark a feature-rich framework
- Transformations in Apache Spark RDDs
- Spark RDD actions and persistence (see the sketch after this list)
- Spark Lazy Operations - Transformation and Caching
- Fault tolerance in Spark
- Loading data and creating RDD in Spark
- Persist RDD in memory or disk
- Pair operations and key-value in Spark
- Spark integration with Hadoop
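A minimal standalone sketch covering the RDD operations above: lazy transformations, actions, persistence, and pair (key-value) operations. The application name and input path are placeholders:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object RddBasics {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("RddBasics").master("local[*]").getOrCreate()
    val sc    = spark.sparkContext

    // Create an RDD from a file (placeholder path); transformations are lazy, so nothing runs yet
    val lines  = sc.textFile("hdfs:///data/input.txt")
    val words  = lines.flatMap(_.split("\\s+"))          // transformation
    val pairs  = words.map(word => (word, 1))            // pair (key-value) RDD
    val counts = pairs.reduceByKey(_ + _)                // key-based transformation

    // Persist in memory, spilling to disk if it does not fit
    counts.persist(StorageLevel.MEMORY_AND_DISK)

    // Actions trigger the actual computation
    counts.take(10).foreach { case (word, n) => println(s"$word -> $n") }
    println(s"distinct words: ${counts.count()}")

    spark.stop()
  }
}
```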
Spark Streaming
- The need for stream analytics
- Real-time data processing using Spark streaming
- Fault tolerance and check-pointing
- Stateful stream processing
- DStream and window operations (see the sketch after this list)
- Spark Streaming execution flow
- Connection to various source systems
- Performance optimizations in Spark
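A minimal sketch of the classic DStream-based Spark Streaming API for the items above, using a socket source and a sliding window; the host, port, intervals, and checkpoint directory are placeholders:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("StreamingWordCount").setMaster("local[2]")
    val ssc  = new StreamingContext(conf, Seconds(5))    // 5-second micro-batches
    ssc.checkpoint("/tmp/spark-checkpoint")              // checkpoint directory for fault tolerance

    // Placeholder socket source, e.g. fed by `nc -lk 9999`
    val lines = ssc.socketTextStream("localhost", 9999)

    // Word counts over a 30-second window, recomputed every 10 seconds
    val counts = lines
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))

    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```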
Spark MLlib and GraphX
- Why Machine Learning is needed
- What is Spark Machine Learning
- Various Spark ML libraries
- Algorithms for clustering, statistical analytics, classification, etc. (see the clustering sketch after this list)
- What is GraphX
- The need for different graph processing engines
- Graph handling using Apache Spark
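A minimal clustering sketch with the DataFrame-based spark.ml API (KMeans) for the items above, using a tiny in-memory dataset purely for illustration:

```scala
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

object KMeansDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("KMeansDemo").master("local[*]").getOrCreate()

    // Tiny in-memory dataset; each row carries a feature vector
    val points = Seq(
      Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1),
      Vectors.dense(9.0, 9.0), Vectors.dense(9.2, 8.8)
    ).map(Tuple1.apply)
    val df = spark.createDataFrame(points).toDF("features")

    // Fit a 2-cluster model and assign each point to a cluster
    val model       = new KMeans().setK(2).setSeed(1L).fit(df)
    val predictions = model.transform(df)

    predictions.show()
    model.clusterCenters.foreach(println)

    spark.stop()
  }
}
```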
Spark SQL
- What is Spark SQL
- Apache Spark SQL features and data flow
- Spark SQL architecture and components
- Hive and Spark SQL together
- Working with DataFrames and Datasets (see the sketch after this list)
- Data loading techniques in Spark
- Hive queries through Spark
- Performance tuning in Spark
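A minimal Spark SQL sketch tying the items above together: building a DataFrame, registering it as a temporary view, and (with Hive support enabled) running HiveQL through Spark. The data, table, and database names are placeholders:

```scala
import org.apache.spark.sql.SparkSession

object SparkSqlDemo {
  def main(args: Array[String]): Unit = {
    // enableHiveSupport() lets Spark use the Hive metastore and run HiveQL, provided Hive is configured
    val spark = SparkSession.builder()
      .appName("SparkSqlDemo")
      .master("local[*]")
      .enableHiveSupport()
      .getOrCreate()
    import spark.implicits._

    // Build a DataFrame from in-memory data and expose it to SQL as a temporary view
    val employees = Seq(("alice", "analytics", 95000), ("bob", "platform", 88000))
      .toDF("name", "team", "salary")
    employees.createOrReplaceTempView("employees")

    spark.sql("SELECT team, avg(salary) AS avg_salary FROM employees GROUP BY team").show()

    // With Hive support, existing Hive tables (placeholder names) can be queried the same way:
    // spark.sql("SELECT * FROM warehouse_db.sales LIMIT 10").show()

    spark.stop()
  }
}
```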