Apache Spark Training

Apache Spark is a key component of many organisations' data pipelines, so understanding how it works and how to leverage its APIs to process data is essential. In this course you will learn how to install and maintain an Apache Spark cluster, how to perform streaming analytics, and how to use Spark's MLlib and GraphX components. Once you have completed this course, you will know how to leverage Apache Spark for Big Data processing.

Apache Spark Course Outline

Spark Introduction

What is Apache Spark?
Components of Spark architecture
Apache Spark design principles
Spark features and characteristics
Apache Spark ecosystem components and what each one offers

Scala Introduction

What is Scala?
Setup and configuration of Scala
Developing and running basic Scala Programs
Scala operations
Functions and procedures in Scala
Different Scala APIs for common operations
Loops and collections: Array, Map, List, Tuple
Pattern matching for advanced operations

Deploying Spark

Setting up the Spark Environment
Installing and configuring prerequisites
Installing Apache Spark in local mode
Working with Spark in local mode
Troubleshooting common problems in Spark
Setup and configuration of Scala
Installing Spark in standalone mode
Installing Spark in YARN mode
Installing & configuring Spark on a real multi-node cluster
Playing with Spark in cluster mode

Spark Shell

Playing with the Spark shell
Executing Scala and Java statements in the shell
Understanding the Spark context and driver
Reading data from the local filesystem
Integrating Spark with HDFS
Caching the data in memory for further use
Distributed persistence
Testing and troubleshooting

RDD & Spark

What is an RDD in Spark?
How RDDs make Spark a feature-rich framework
Transformations in Apache Spark RDDs
Spark RDD actions and persistence
Lazy operations in Spark: transformations and caching
Fault tolerance in Spark
Loading data and creating RDDs in Spark
Persisting RDDs in memory or on disk
Key-value pair operations in Spark
Spark integration with Hadoop

Spark Streaming

The need for stream analytics
Real-time data processing using Spark Streaming
Fault tolerance and checkpointing
Stateful stream processing
DStream and window operations
Spark Streaming execution flow
Connection to various source systems
Performance optimizations in Spark

Spark MLlib and GraphX

Why machine learning is needed
What is Spark machine learning?
Various Spark ML libraries
Algorithms for clustering, statistical analytics, classification, etc.
What is GraphX?
The need for different graph processing engines
Graph handling using Apache Spark

Spark SQL

What is Spark SQL?
Apache Spark SQL features and data flow
Spark SQL architecture and components
Hive and Spark SQL together
Working with DataFrames and Datasets
Data loading techniques in Spark
Hive queries through Spark
Performance tuning in Spark
