Big Data Training & Consulting

Get training & advice from experts


Apache Spark Training

Apache Spark is a key component of many organisations' data pipelines, and understanding how it works and how to leverage its APIs to process data is essential. In this course you will learn how to install and maintain an Apache Spark cluster, how to perform streaming analytics, and how to use Spark's MLlib and GraphX components. Once you have completed this course you will know how to leverage Apache Spark for Big Data processing.

Apache Spark Course Outline

Spark Introduction

  • What is Apache Spark?
  • Components of Spark architecture
  • Apache Spark design principles
  • Spark features and characteristics
  • Apache Spark ecosystem components and how they fit together

Scala Introduction

  • What is Scala?
  • Setup and configuration of Scala
  • Developing and running basic Scala Programs
  • Scala operations
  • Functions and procedures in Scala
  • Different Scala APIs for common operations
  • Loops and collections: Arrays, Maps, Lists, Tuples
  • Pattern matching for advanced operations
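The collections and pattern matching covered above can be sketched in a few lines of plain Scala; the values and the `describe` helper here are illustrative only:

```scala
// Minimal sketch of Scala collections, loops and pattern matching.
object CollectionsDemo {
  // pattern matching over types, guards and tuples
  def describe(x: Any): String = x match {
    case 0               => "zero"
    case n: Int if n > 0 => "positive int"
    case s: String       => s"a string of length ${s.length}"
    case (a, b)          => s"a pair: $a and $b"
    case _               => "something else"
  }

  def main(args: Array[String]): Unit = {
    val nums  = Array(1, 2, 3)                    // fixed-size array
    val names = List("spark", "scala")            // immutable linked list
    val ages  = Map("alice" -> 30, "bob" -> 25)   // immutable key-value map
    val pair  = ("spark", 3)                      // tuple

    for (n <- nums) print(n + " ")                // loop over an Array
    println(names.map(_.toUpperCase))             // List(SPARK, SCALA)
    println(ages.getOrElse("carol", 0))           // 0 (missing key default)
    println(describe(pair))                       // a pair: spark and 3
  }
}
```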

Deploying Spark

  • Setting up the Spark Environment
  • Installing and configuring prerequisites
  • Installing Apache Spark in local mode
  • Working with Spark in local mode
  • Troubleshooting encountered problems in Spark
  • Installing Spark in standalone mode
  • Installing Spark in YARN mode
  • Installing & configuring Spark on a real multi-node cluster
  • Playing with Spark in cluster mode
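As a sketch of how these deployment modes surface in application code (assuming the Spark dependencies are on the classpath; the master URL shown for standalone mode is a placeholder):

```scala
// Sketch: selecting a deployment mode via the SparkSession builder.
import org.apache.spark.sql.SparkSession

object DeployModes {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("deploy-demo")
      // local mode: all executors run inside this JVM, using all cores
      .master("local[*]")
      // standalone mode would instead point at the cluster master, e.g.
      //   .master("spark://master-host:7077")
      // under YARN, the master is usually set by spark-submit (--master yarn)
      .getOrCreate()

    println(spark.sparkContext.master)  // confirms the active master
    spark.stop()
  }
}
```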

Spark Shell

  • Playing with the Spark shell
  • Executing Scala statements in the shell
  • Understanding the Spark context and driver
  • Reading data from the local filesystem
  • Integrating Spark with HDFS
  • Caching the data in memory for further use
  • Distributed persistence
  • Testing and troubleshooting
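A short spark-shell session covering the points above might look like this (the shell provides `sc` and `spark`; the file paths and the HDFS host/port are placeholders for your environment):

```scala
// Sketch of a spark-shell session: local files, HDFS, and in-memory caching.

// read from the local filesystem
val local = sc.textFile("file:///tmp/sample.txt")

// read from HDFS (namenode host and port are assumptions for your cluster)
val fromHdfs = sc.textFile("hdfs://namenode:9000/data/sample.txt")

// cache the data in memory for further use
local.cache()
println(local.count())  // first action: reads the file and populates the cache
println(local.count())  // second action: served from the in-memory cache
```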

RDD & Spark

  • What is an RDD in Spark?
  • How do RDDs make Spark a feature-rich framework
  • Transformations in Apache Spark RDDs
  • Spark RDD action and persistence
  • Spark Lazy Operations - Transformation and Caching
  • Fault tolerance in Spark
  • Loading data and creating RDD in Spark
  • Persist RDD in memory or disk
  • Key-value pair operations in Spark
  • Spark integration with Hadoop
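The RDD concepts above (lazy transformations, actions, persistence, and pair operations) can be sketched from the shell as follows, using toy data for illustration:

```scala
// Sketch: RDD transformations, actions, persistence and pair operations.
import org.apache.spark.storage.StorageLevel

val rdd = sc.parallelize(1 to 10)

// transformations are lazy: nothing executes yet
val evens   = rdd.filter(_ % 2 == 0)
val doubled = evens.map(_ * 2)

// persist in memory, spilling to disk if it does not fit
doubled.persist(StorageLevel.MEMORY_AND_DISK)

// actions trigger execution
println(doubled.collect().toList)  // List(4, 8, 12, 16, 20)

// key-value (pair RDD) operations
val pairs  = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
val summed = pairs.reduceByKey(_ + _)
println(summed.collect().toMap)    // Map(a -> 4, b -> 2)
```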

Spark Streaming

  • The need for stream analytics
  • Real-time data processing using Spark streaming
  • Fault tolerance and check-pointing
  • Stateful stream processing
  • DStream and window operations
  • Spark Streaming execution flow
  • Connection to various source systems
  • Performance optimizations in Spark
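A minimal DStream sketch tying these topics together, assuming a text stream on localhost:9999 (e.g. started with `nc -lk 9999`) and an illustrative checkpoint path:

```scala
// Sketch: windowed word count with DStreams, including checkpointing.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("streaming-demo").setMaster("local[2]")
    val ssc  = new StreamingContext(conf, Seconds(5))  // 5-second micro-batches
    ssc.checkpoint("/tmp/spark-checkpoint")            // required for stateful ops

    val lines  = ssc.socketTextStream("localhost", 9999)
    val counts = lines.flatMap(_.split("\\s+"))
                      .map((_, 1))
                      // 30-second window, sliding every 10 seconds
                      .reduceByKeyAndWindow(_ + _, Seconds(30), Seconds(10))

    counts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
```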

Spark MLlib and GraphX

  • Why Machine Learning is needed
  • What is Spark Machine Learning?
  • Various Spark ML libraries
  • Algorithms for clustering, statistical analytics, classification etc.
  • What is GraphX?
  • The need for different graph processing engines
  • Graph handling using Apache Spark
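As one example of the clustering algorithms mentioned above, here is a hedged sketch using MLlib's DataFrame-based API on toy 2-D points (the data and parameters are illustrative only):

```scala
// Sketch: k-means clustering with Spark MLlib (spark.ml API).
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

object KMeansDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("kmeans-demo").master("local[*]").getOrCreate()
    import spark.implicits._

    // toy 2-D points forming two rough clusters
    val data = Seq(
      Vectors.dense(0.0, 0.1), Vectors.dense(0.2, 0.0),
      Vectors.dense(9.0, 9.1), Vectors.dense(9.2, 9.0)
    ).map(Tuple1.apply).toDF("features")

    val model = new KMeans().setK(2).setSeed(1L).fit(data)
    model.clusterCenters.foreach(println)  // one centre near each cluster
    spark.stop()
  }
}
```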

Spark SQL

  • What is Spark SQL?
  • Apache Spark SQL features and data flow
  • Spark SQL architecture and components
  • Hive and Spark SQL together
  • Working with DataFrames and Datasets
  • Data loading techniques in Spark
  • Hive queries through Spark
  • Performance tuning in Spark
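The DataFrame and SQL query paths above can be sketched side by side; the sample data and temp-view name are illustrative:

```scala
// Sketch: the same query via the DataFrame API and via SQL.
import org.apache.spark.sql.SparkSession

object SqlDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("sql-demo").master("local[*]").getOrCreate()
    import spark.implicits._

    val people = Seq(("alice", 30), ("bob", 25)).toDF("name", "age")
    people.createOrReplaceTempView("people")  // make it queryable by name

    // DataFrame API
    people.filter($"age" > 26).show()

    // equivalent SQL against the temp view
    spark.sql("SELECT name FROM people WHERE age > 26").show()

    spark.stop()
  }
}
```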

Contact Us

Please contact us for any queries via phone or our contact form. We will be happy to answer your questions.

3 Appian Place, 373 Kent Ave
2194 South Africa
Tel: +2711-781 8014 (Johannesburg)
  +2721-020-0111 (Cape Town)

Contact Form
