Big Data Consulting & Training Services

Jumping Bean can equip your company with Big Data infrastructure and processes. We will engineer the Big Data solution that's right for you. We can assist with developing data extraction, loading and transformation processes, performing data analysis, writing reports and applying AI/ML.

Our team of experts can assist with developing a big data architecture and data pipeline that is efficient, effective and affordable, and that meets your business requirements. Whether it's AWS, GCP, Azure or on-premises, we can assist.

Our training division can also assist in the upskilling of your knowledge workers and technical staff via our Big Data training programmes.

Big Data Engineering

Whether it's on-premises, multi-cloud or hybrid cloud, we can assist with the setup and configuration of big data infrastructure that fits your strategic objectives and risk profile. From the Hadoop ecosystem with Hive, Spark and Pig, and standard services such as Kafka clusters and HDFS or Ceph storage clusters, to cloud-hosted solutions with Google Bigtable, BigQuery and Dataflow or AWS's Redshift and Elastic MapReduce (EMR), we can build an infrastructure solution that will scale as your data and analytical requirements grow.

Big Data Processing

Need assistance with the creation of your Big Data pipeline? Our engineers can assist in the creation of efficient processes that scale with your data growth. Need to integrate and consolidate data sources into your data lake or a specialist data-warehouse-like store? Whether it's on-premises or in the cloud, we will help you select the best components to meet your budget and policy requirements. Our developers can assist with writing MapReduce jobs and developing Hive, Spark or Pig scripts to transform and enrich your data.

Data Scientists and Data Analysis

Looking for data scientists to assist with the interrogation of your data? Our pool of expert resources can be made available to you. Part mathematician, part computer scientist and part trend-spotter, our data scientists can sift through your data, panning for the gold nuggets that it contains.

Big Data Training

After performing a skills-gap analysis, our training division will sit with your management team and develop a unique training plan for your staff. Whether it's data analysts, infrastructure and cloud engineers, or developers who require retraining or refresher courses, our training team is up to the task.

Continual training and skill development are the new norm, and any corporate looking to keep up with the latest technological developments and with its competitors will need to integrate a long-term training and skill acquisition programme into its yearly planning cycles. Jumping Bean is the team to speak to for the best results.

Hadoop Training

Our Hadoop training equips individuals with the knowledge and skills required to effectively work with Hadoop, a powerful framework used for processing and analyzing big data. Hadoop has gained immense popularity due to its ability to handle large volumes of data, distributed processing capabilities, fault tolerance, and scalability. Our comprehensive Hadoop training curriculum covers various aspects, including understanding the Hadoop ecosystem, working with Hadoop Distributed File System (HDFS), MapReduce programming, data ingestion techniques, data processing with Apache Hive and Apache Pig, and cluster administration and monitoring. Our training involves hands-on exercises, real-world use cases, and practical examples to provide learners with a solid understanding of Hadoop's core concepts and its applications in the big data landscape. By completing Hadoop training you will acquire the expertise needed to effectively utilize Hadoop's capabilities and contribute to the efficient management and analysis of large-scale data in diverse industries.

Hadoop & Big Data Training Course Outline

Big Data Overview

  • Necessity of Big Data in the industry
  • Paradigm shift - why the industry is shifting to Big Data tools
  • Different dimensions of Big Data
  • Data explosion in the industry
  • Various implementations of Big Data
  • Different technologies to handle Big Data
  • Traditional systems and associated problems
  • Future of Big Data in the IT industry

Hadoop Introduction

  • Why Hadoop is at the heart of every Big Data solution
  • Introduction to the Hadoop framework
  • Hadoop architecture and design principles
  • Ingredients of Hadoop
  • Hadoop characteristics and data-flow
  • Components of the Hadoop ecosystem

Hadoop Installation & Configuration

  • Hadoop environment setup and pre-requisites
  • Installation and configuration of Hadoop
  • Working with Hadoop in pseudo-distributed mode
  • Troubleshooting encountered problems
  • Setup and installation of a Hadoop multi-node cluster
  • Configuration of masters and slaves on the cluster

Hadoop Storage - HDFS

  • What is HDFS (Hadoop Distributed File System)
  • HDFS daemons and architecture
  • HDFS data flow and storage mechanism
  • Hadoop HDFS characteristics and design principles
  • Responsibility of HDFS Master – NameNode
  • Storage mechanism of Hadoop meta-data
  • Work of HDFS Slaves – DataNodes
  • Data Blocks and distributed storage
  • Replication of blocks, reliability, and high availability
  • Rack-awareness, scalability, and other features
  • Different HDFS APIs and terminologies
  • Commissioning of nodes and addition of more nodes
  • Expanding clusters in real-time
  • Hadoop HDFS Web UI and HDFS explorer

MapReduce Introduction

  • What is MapReduce, the processing layer of Hadoop
  • The need for a distributed processing framework
  • Issues before MapReduce and its evolution
  • List processing concepts
  • Components of MapReduce – Mapper and Reducer
  • MapReduce terminologies – keys, values, lists, and more
  • Hadoop MapReduce execution flow
  • Mapping and reducing data based on keys
  • MapReduce word-count example to understand the flow
  • Execution of Map and Reduce together
  • Controlling the flow of mappers and reducers
  • Optimization of MapReduce Jobs
  • Fault-tolerance and data locality
  • Working with map-only jobs
  • Introduction to Combiners in MapReduce
  • How MR jobs can be optimized using combiners
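
The word-count flow and the effect of a combiner listed above can be illustrated with a small in-memory sketch. This is not Hadoop code: it is plain Python, and every function name in it (map_words, combine, shuffle, reduce_counts, word_count) is hypothetical and chosen for illustration only. It assumes each inner list in splits stands for the input handled by one mapper.

```python
from collections import Counter, defaultdict


def map_words(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def combine(pairs):
    """Combiner: pre-aggregate one mapper's output before the shuffle."""
    local = Counter()
    for word, count in pairs:
        local[word] += count
    return list(local.items())


def shuffle(all_pairs):
    """Shuffle and sort: group every value under its key, ordered by key."""
    grouped = defaultdict(list)
    for word, count in all_pairs:
        grouped[word].append(count)
    return dict(sorted(grouped.items()))


def reduce_counts(word, counts):
    """Reducer: sum the partial counts for one key."""
    return word, sum(counts)


def word_count(splits, use_combiner=True):
    """Run the whole flow; return (result, number of pairs shuffled)."""
    shuffled = []
    for split in splits:  # each split is handled by one hypothetical mapper
        pairs = [pair for line in split for pair in map_words(line)]
        if use_combiner:
            pairs = combine(pairs)
        shuffled.extend(pairs)
    result = dict(reduce_counts(w, c) for w, c in shuffle(shuffled).items())
    return result, len(shuffled)


splits = [["big data big value"], ["big data tools"]]
counts, with_combiner = word_count(splits, use_combiner=True)
_, without_combiner = word_count(splits, use_combiner=False)
print(counts)
print(with_combiner, without_combiner)
```

In this toy input the combiner reduces the number of pairs passed through the shuffle from 7 to 6 while the final counts stay the same; on real data with many repeated words per mapper the saving is usually much larger.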

MapReduce Advanced Concepts

  • Anatomy of MapReduce
  • Hadoop MapReduce data types
  • Developing custom data types using Writable & WritableComparable
  • InputFormat in MapReduce
  • InputSplit as a unit of work
  • How Partitioners partition data
  • Customization of RecordReader
  • Moving data from mapper to reducer – shuffling & sorting
  • Distributed cache and job chaining
  • Different Hadoop case-studies to customize each component
  • Job scheduling in MapReduce

Big Data Tools - Hive

  • The need for an ad-hoc SQL based solution – Apache Hive
  • Introduction to and architecture of Hadoop Hive
  • Playing with the Hive shell and running HQL queries
  • Hive DDL and DML operations
  • Hive execution flow
  • Schema design and other Hive operations
  • Schema-on-Read vs Schema-on-Write in Hive
  • Meta-store management and the need for RDBMS
  • Limitations of the default meta-store
  • Using SerDe to handle different types of data
  • Optimization of performance using partitioning
  • Different Hive applications and use cases

Big Data Tools - Pig

  • The need for a high-level query language - Apache Pig
  • How Pig complements Hadoop with a scripting language
  • What is Pig
  • Pig execution flow
  • Different Pig operations like filter and join
  • Compilation of Pig code into MapReduce
  • Comparison - Pig vs MapReduce

HBase - NoSQL Columnar Data Store

  • NoSQL databases and their need in the industry
  • Introduction to Apache HBase
  • Internals of the HBase architecture
  • The HBase Master and Slave Model
  • Column-oriented, 3-dimensional, schema-less datastores
  • Data modeling in Hadoop HBase
  • Storing multiple versions of data
  • Data high-availability and reliability
  • Comparison - HBase vs HDFS
  • Comparison - HBase vs RDBMS
  • Data access mechanisms
  • Working with HBase using the shell

Data Ingestion - Flume & Sqoop

  • Introduction and working of Sqoop
  • Importing data from RDBMS to HDFS
  • Exporting data to RDBMS from HDFS
  • Conversion of data import/export queries into MapReduce jobs
  • What is Apache Flume
  • Flume architecture and aggregation flow
  • Understanding Flume components like data Sources and Sinks
  • Flume channels to buffer events
  • Reliable & scalable data collection tools
  • Aggregating streams using Fan-in
  • Separating streams using Fan-out
  • Internals of the agent architecture
  • Production architecture of Flume
  • Collecting data from different sources to Hadoop HDFS
  • Multi-tier Flume flow for collection of large volumes of data using Avro

Apache YARN

  • The need for and the evolution of YARN
  • YARN and its eco-system
  • YARN daemon architecture
  • Master of YARN – Resource Manager
  • Slave of YARN – Node Manager
  • Requesting resources from the application master
  • Dynamic slots (containers)
  • Application execution flow
  • MapReduce version 2 application over YARN
  • Hadoop Federation and NameNode HA

Apache Spark Training

Apache Spark is a key component of many organisations' data pipelines, and understanding how it works and how to leverage its APIs to process data is essential. In this course you will learn how to install and maintain an Apache Spark cluster, how to perform streaming analytics, and how to use Spark's MLlib and GraphX components. Once you have completed this course you will know how to use Apache Spark for Big Data processing.

Apache Spark Course Outline

Spark Introduction

  • What is Apache Spark?
  • Components of Spark architecture
  • Apache Spark design principles
  • Spark features and characteristics
  • Apache Spark ecosystem components and their insights

Scala Introduction

  • What is Scala
  • Setup and configuration of Scala
  • Developing and running basic Scala Programs
  • Scala operations
  • Functions and procedures in Scala
  • Different Scala APIs for common operations
  • Loops and collections – Array, Map, Lists, Tuples
  • Pattern matching for advanced operations

Deploying Spark

  • Setting up the Spark Environment
  • Installing and configuring prerequisites
  • Installing Apache Spark in local mode
  • Working with Spark in local mode
  • Troubleshooting encountered problems in Spark
  • Setup and configuration of Scala
  • Installing Spark in standalone mode
  • Installing Spark in YARN mode
  • Installing & configuring Spark on a real multi-node cluster
  • Playing with Spark in cluster mode

Spark Shell

  • Playing with the Spark shell
  • Executing Scala and Java statements in the shell
  • Understanding the Spark context and driver
  • Reading data from the local filesystem
  • Integrating Spark with HDFS
  • Caching the data in memory for further use
  • Distributed persistence
  • Testing and troubleshooting

RDD & Spark

  • What is an RDD in Spark
  • How do RDDs make Spark a feature-rich framework
  • Transformations in Apache Spark RDDs
  • Spark RDD action and persistence
  • Spark Lazy Operations - Transformation and Caching
  • Fault tolerance in Spark
  • Loading data and creating RDD in Spark
  • Persist RDD in memory or disk
  • Pair operations and key-value in Spark
  • Spark integration with Hadoop

Spark Streaming

  • The need for stream analytics
  • Real-time data processing using Spark streaming
  • Fault tolerance and check-pointing
  • Stateful stream processing
  • DStream and window operations
  • Spark Stream execution flow
  • Connection to various source systems
  • Performance optimizations in Spark

Spark MLlib and GraphX

  • Why Machine Learning is needed
  • What is Spark Machine Learning
  • Various Spark ML libraries
  • Algorithms for clustering, statistical analytics, classification etc.
  • What is GraphX
  • The need for different graph processing engines
  • Graph handling using Apache Spark

Spark SQL

  • What is Spark SQL
  • Apache Spark SQL features and data flow
  • Spark SQL architecture and components
  • Hive and Spark SQL together
  • Playing with DataFrames and Datasets
  • Data loading techniques in Spark
  • Hive queries through Spark
  • Performance tuning in Spark

Gold In Them Big Data Hills - What and Why of Big Data

Big Data is an all-embracing label for the evolution of business intelligence practices, methodologies, and technical infrastructure that has qualitatively transformed business reporting and data analysis. Powerful tools and algorithms are now within the reach of even small and medium enterprises.

Big Data Democratization

The amount of technical and analytical skill required to configure complex infrastructure and perform advanced data manipulation and analysis has been reduced to such an extent that users of reports can now generate them themselves and gain valuable insights.

The cloud revolution means that niche experts can spin up the required infrastructure in minutes and apply machine learning algorithms at the click of a button. All you need to bring is the data: the data that is accumulating at a rapid rate on your storage systems.

Big Data = Big Value

Data is the new gold rush. "There is gold in them mountains of data." This is what drives the value of Google, Facebook and Amazon. We are here to assist you in your big data strategy, implementation, and analysis.

Our Business Partners

Google Cloud Partners
Pearson Vue Authorised Testing Centre
Linux Professional Institute Partners
Linux Foundation Partners
EC Council Partners
PSI Authorised Testing Center
OpenEDG Partners
Kryterion Authorised Testing Centre
Python Institute Partners
CompTIA Partners

Kafka Training

Our Kafka course will teach administrators how to secure and administer a Kafka installation and how to engineer and maintain solutions for your streaming data needs. Application developers will learn how to develop robust applications that leverage Kafka's API and implement best practices for application development. Learn how to build flexible data pipelines that scale to handle your Big Data processing requirements.

Kafka Course Outline

Kafka Introduction

  • Architecture
  • Overview of key concepts
  • Overview of ZooKeeper
  • Cluster, Nodes, Kafka Brokers
  • Consumers, Producers, Logs, Partitions, Records, Keys
  • Partitions for write throughput
  • Partitions for Consumer parallelism (multi-threaded consumers)
  • Replicas, Followers, Leaders
  • How to scale writes
  • Disaster recovery
  • Performance profile of Kafka
  • Consumer Groups, “High Water Mark”, what do consumers see
  • Consumer load balancing and fail-over
  • Working with Partitions for parallel processing and resiliency
  • Brief Overview of Kafka Streams, Kafka Connectors, Kafka REST
  • Create a topic
  • Produce and consume messages from the command line
  • Configure and set up three servers
  • Create a topic with replication and partitions
  • Produce and consume messages from the command line


Kafka Producer & Consumer Basics

  • Introduction to Producer Java API and basic configuration
  • Create topic from command line
  • View topic layout of partitions topology from command line
  • View log details
  • Use ./ to verify replication is correct
  • Introduction to Consumer Java API and basic configuration
  • View how far behind the consumer is from the command line
  • Force failover and verify new leaders are chosen

Kafka Architecture

  • Motivation: focus on high throughput
  • Embrace file system / OS caches and how this impacts OS setup and usage
  • File structure on disk and how data is written
  • Kafka Producer load balancing details
  • Producer Record batching by size and time
  • Producer async commit and commit (flush, close)
  • Push vs pull and backpressure
  • Compressions via message batches (unified compression to server, disk and consumer)
  • Consumer poll batching, long poll
  • Consumer Trade-offs of requesting larger batches
  • Consumer Liveness and fail over redux
  • Managing consumer position (auto-commit, async commit and sync commit)
  • Messaging At most once, At least once, Exactly once
  • Performance trade-offs of message delivery semantics
  • Performance trade-offs of poll size
  • Replication, Quorums, ISRs, committed records
  • Failover and leadership election
  • Log compaction by key
  • Failure scenarios

Advanced Kafka Producers

  • Using batching (time/size)
  • Using compression
  • Async producers and sync producers
  • Commit and async commit
  • Default partitioning (round-robin no key, partition on key if key)
  • Controlling which partition records are written to (custom partitioning)
  • Message routing to a particular partition (use cases for this)
  • Advanced Producer configuration
  • Use message batching and compression
  • Use round-robin partition
  • Use a custom message routing scheme
  • Embrace file system / OS caches and how this impacts OS setup and usage
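
The default partitioning rule named above (round-robin when a record has no key, partition on the key when it has one) can be sketched in a few lines. This is a plain Python illustration rather than the Kafka client: the class SketchPartitioner and its method are hypothetical, and zlib.crc32 is used only as a stand-in hash because it ships with the standard library. Real Kafka clients use their own hash function and, depending on the version, may handle keyless records differently.

```python
import itertools
import zlib


class SketchPartitioner:
    """Illustrative partitioner; a sketch, not Kafka's implementation."""

    def __init__(self, num_partitions):
        self.num_partitions = num_partitions
        self._round_robin = itertools.cycle(range(num_partitions))

    def partition(self, key=None):
        if key is None:
            # No key: spread records evenly across all partitions.
            return next(self._round_robin)
        # Key present: a stable hash keeps every record with the same key
        # on the same partition, which preserves per-key ordering.
        # crc32 is an assumption made for this sketch only.
        return zlib.crc32(key.encode("utf-8")) % self.num_partitions


partitioner = SketchPartitioner(num_partitions=3)
keyless = [partitioner.partition() for _ in range(6)]
keyed = [partitioner.partition("customer-42") for _ in range(3)]
print(keyless)
print(keyed)
```

Keyless records cycle through partitions 0, 1 and 2, while all records keyed "customer-42" land on a single partition. A custom routing scheme would replace the body of partition with its own rule.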

Advanced Kafka Consumers

  • Adjusting poll read size
  • Implementing at most once message semantics using Java API
  • Implementing at least once message semantics using Java API
  • Implementing as close as we can get to exactly once Java API
  • Re-consume messages that are already consumed
  • Using ConsumerRebalanceListener to start consuming from a certain offset (*)
  • Assigning a consumer a specific partition (use cases for this)
  • Write Java Advanced Consumer
  • Adjusting poll read size
  • Implementing at most once message semantics using Java API
  • Implementing at least once message semantics using Java API
  • Implementing as close as we can get to exactly once Java API
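
The difference between at-most-once and at-least-once semantics comes down to whether the consumer commits its offset before or after processing a message. The sketch below simulates that ordering in plain Python with a single injected crash. It does not use a Kafka client, and the function consume and its parameters are hypothetical names chosen for illustration; the course itself works with the Java API.

```python
def consume(messages, commit_first, crash_at):
    """Simulate one consumer that crashes once and then restarts.

    commit_first=True  -> commit the offset, then process (at most once)
    commit_first=False -> process, then commit the offset (at least once)
    crash_at is the offset at which the single simulated crash happens,
    between the two steps.
    """
    committed = 0      # next offset to read after a restart
    processed = []     # messages the application actually handled
    crashed = False
    while committed < len(messages):
        offset = committed
        try:
            if commit_first:
                committed = offset + 1
                if offset == crash_at and not crashed:
                    crashed = True
                    raise RuntimeError("crash before processing")
                processed.append(messages[offset])
            else:
                processed.append(messages[offset])
                if offset == crash_at and not crashed:
                    crashed = True
                    raise RuntimeError("crash before commit")
                committed = offset + 1
        except RuntimeError:
            continue  # restart: resume from the last committed offset
    return processed


messages = ["a", "b", "c"]
at_most_once = consume(messages, commit_first=True, crash_at=1)
at_least_once = consume(messages, commit_first=False, crash_at=1)
print(at_most_once)
print(at_least_once)
```

With commit-first ordering the message at the crash offset is lost; with process-first ordering it is handled twice. Getting close to exactly once therefore usually means at-least-once delivery combined with idempotent or transactional processing.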

Schema Management in Kafka

  • Avro overview
  • Avro Schemas
  • Flexible Schemas with JSON and defensive programming
  • Using Kafka’s Schema Registry
  • Topic Schema management
  • Validation of schema
  • Prevent producers that don’t align with topic schema
  • Validation of schema
  • Prevent Consumer from accepting unexpected schema / defensive programming
  • Prevent producers from sending messages that don’t align with schema registry

Kafka Security

  • SSL for Encrypting transport and Authentication
  • Setting up keys
  • Using SSL for authentication instead of username/password
  • Setup keystore for transport encryption
  • Setup truststore for authentication
  • Producer to server encryption
  • Consumer to server encryption
  • Kafka broker to Kafka broker encryption
  • SASL for Authentication
  • Overview of SASL
  • Integrating SASL with Active Directory

Kafka Admin

  • OS config and hardware selection
  • Monitoring Kafka KPIs
  • Monitoring Consumer Lag (consumer group inspection)
  • Log Retention and Compaction
  • Fault tolerant Cluster
  • Growing your cluster
  • Reassign partitions
  • Broker configuration details
  • Topic configuration details
  • Producer configuration details
  • Consumer configuration details
  • ZooKeeper configuration details
  • Tools for managing ZooKeeper
  • Accessing JMX from command line
  • Using dump-log segment from command line
  • Replaying a log (replay log producer)
  • Re-consume messages that are already consumed
  • Setting Consumer Group Offset from command line
  • Kafka Migration Tool − migrate a broker from one version to another
  • Mirror Maker − Mirroring one Kafka cluster to another (one DC/region to another DC/region)

Kafka Streams

  • Architecture Overview
  • Concepts
  • Design Patterns
  • Examples

Hive Training

Learn how to query your data and get the answers you need with Apache Hive. Use your existing SQL knowledge to query your data warehouse in a familiar manner. This course is a great start for decision makers and data scientists looking to query data quickly and efficiently.

Hive Course Outline

Hive Overview

  • Architecture and design
  • Data types
  • SQL support in Hive
  • Creating Hive tables and querying
  • Partitions
  • Joins
  • Text processing
  • Labs: various labs on processing data with Hive

DQL (Data Query Language) in Detail

  • SELECT clause
  • Column aliases
  • Table aliases
  • Date types and Date functions
  • Group function
  • Table joins
  • JOIN clause
  • UNION operator
  • Nested queries
  • Correlated sub-queries

Contact Us

Please contact us for any queries via phone or our contact form. We will be happy to answer your questions.

3 Appian Place, 373 Kent Ave
2194 South Africa
Tel: +2711-781 8014 (Johannesburg)
  +2721-020-0111 (Cape Town)

Contact Form