
Hadoop & Apache Spark





As data volumes have grown, so have the rate at which data must be processed and the complexity of the demands made of it. Traditional tools can no longer store and process data at this scale: a single computer does not suffice because of I/O, CPU, and RAM limitations. This is where a new generation of tools that run across multiple computers is required.

This is a very hands-on course that will take you from the basics to an advanced level in Big Data analysis and stream processing using Apache Spark. Apache Spark is among the fastest and most efficient distributed computing tools available. We will start with the basics of Big Data, understand the architecture of Apache Spark, and solve real problems.

What you'll learn

  • How to build and maintain reliable, scalable, distributed systems with Apache Hadoop.
  • How to write Map-Reduce based Applications
  • How to design and build MongoDB based Big data Applications and learn MongoDB query language
  • How to apply tips and tricks for Big Data use cases and solutions.

Course Content

1. Introduction to BIG DATA and Its characteristics

  • 4 V's of BIG DATA (IBM definition of BIG DATA)
  • What is Hadoop?
  • Why Hadoop?
  • Core Components of Hadoop
  • Intro to HDFS and its Architecture
  • Difference between Code Locality and Data Locality
  • HDFS commands
  • Name Node’s Safe Mode
  • Different Modes of Hadoop
  • Intro to MAPREDUCE
  • Versions of HADOOP
  • What is a Daemon?
  • Hadoop Daemons
  • What is Name Node?
  • What is Data Node?
  • What is Secondary name Node?
  • What is Job Tracker?
  • What is Task Tracker?
  • What is an Edge node in a Hadoop cluster and its role?
  • Read/Write operations in HDFS
  • Complete overview of Hadoop 1.x and its architecture
  • Rack awareness
  • Introduction to Block size
  • Introduction to Replication Factor(R.F)
  • Introduction to HeartBeat Signal/Pulse
  • Introduction to Block report
  • MAPREDUCE Architecture
  • What is Mapper phase?
  • What is shuffle and sort phase?
  • What is Reducer phase?
  • What is split?
  • Difference between Block and split
  • Intro to first Word Count program using MAPREDUCE
  • Different classes for running MAPREDUCE program using Java
  • Mapper class
  • Reducer Class and Its role
  • Driver class
  • Submitting the Word Count MAPREDUCE program
  • Going through the Jobs system output
  • Intro to Partitioner with example
  • Intro to Combiner with example
  • Intro to Counters and its types
  • Different types of counters
  • Different types of input/output formats in HADOOP
  • Use cases for HDFS & MapReduce programs using Java
  • Single Node cluster Installation
  • Multi Node cluster Installation
  • Introduction to configuration files in Hadoop and their importance
  • Complete overview of Hadoop 2.x and its architecture
  • Introduction to YARN
  • Resource Manager
  • Node Manager
  • Application Master(AM)
  • Applications Manager(AsM)
  • Journal Nodes
  • Difference between Hadoop 1.x and Hadoop 2.x
  • High Availability(HA)
  • Hadoop Federation
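
The word-count program listed above is the canonical first MapReduce job. As a rough single-machine illustration of the three phases covered here (Mapper, shuffle-and-sort, Reducer), the flow can be sketched in plain Python — a real job runs these phases in parallel across a Hadoop cluster:

```python
from collections import defaultdict

def mapper(line):
    # Map phase: emit a (word, 1) pair for every word in the input split.
    for word in line.split():
        yield (word.lower(), 1)

def shuffle_sort(pairs):
    # Shuffle-and-sort phase: group all emitted values by key,
    # then order the keys, as the framework does between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reducer(key, values):
    # Reduce phase: sum the counts collected for each word.
    return (key, sum(values))

lines = ["hadoop spark hadoop", "spark streaming"]
mapped = [pair for line in lines for pair in mapper(line)]
result = dict(reducer(k, v) for k, v in shuffle_sort(mapped))
print(result)  # {'hadoop': 2, 'spark': 2, 'streaming': 1}
```

A Combiner (also covered above) would apply the same summing logic on each mapper's local output before the shuffle, reducing network traffic.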

2. PIG

  • The difference between MAPREDUCE and PIG
  • When to go with MAPREDUCE?
  • When to go with PIG?
  • PIG data types
  • What is field in PIG?
  • What is tuple in PIG?
  • What is Bag in PIG?
  • Intro to the Grunt shell
  • Different modes in PIG
  • Local Mode
  • MAPREDUCE mode
  • Running PIG programs
  • PIG Script
  • Intro to PIG UDFs
  • Writing PIG UDF using Java
  • Registering PIG UDF
  • Running PIG UDF
  • Different types of UDFs in PIG
  • Word Count program using PIG script
  • Use cases for PIG scripts
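
For reference, the word-count exercise listed above typically takes only a few lines of Pig Latin (the input and output paths here are placeholders):

```pig
lines  = LOAD 'input.txt' AS (line:chararray);
words  = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grpd   = GROUP words BY word;
counts = FOREACH grpd GENERATE group, COUNT(words);
STORE counts INTO 'wordcount_output';
```

Run in local mode with `pig -x local wordcount.pig`, or in MapReduce mode by omitting `-x local`.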

3. HIVE

  • Intro to HIVE
  • Why HIVE?
  • History of HIVE
  • Difference between PIG and HIVE
  • HIVE data types
  • Complex data types
  • What is Metastore and its importance?
  • Different types of tables in HIVE
  • Managed tables
  • External tables
  • Running HIVE queries
  • Intro to HIVE partitions
  • Intro to HIVE Buckets
  • How to perform the JOINS using HIVE queries
  • Intro to HIVE UDFs
  • Different types of UDFs in HIVE
  • Running HIVE queries for Word Count example
  • Use cases for HIVE
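
The same word-count exercise in HiveQL uses `split` and a `LATERAL VIEW` with `explode`, as covered above (`docs` is an assumed table with a single STRING column `line`):

```sql
-- Word count over a table of text lines
SELECT word, COUNT(*) AS cnt
FROM docs
LATERAL VIEW explode(split(line, ' ')) t AS word
GROUP BY word;
```
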

4. HBASE

  • Intro to HBASE
  • Intro to NoSQL database
  • Sparse and dense Concept in RDBMS
  • Intro to columnar/column oriented database
  • Core architecture of HBase
  • Why HBase?
  • HDFS vs HBase
  • Intro to Regions, Region Servers, and HMaster
  • Limitations of HBase
  • Integrating HBase with Hive
  • HBase commands
  • Use cases for HBASE
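
A taste of the HBase shell commands covered here (table and column-family names are placeholders):

```
create 'users', 'info'                   # table with one column family 'info'
put 'users', 'u1', 'info:name', 'Ada'    # write one cell in row 'u1'
get 'users', 'u1'                        # read a single row
scan 'users'                             # read all rows
disable 'users'                          # required before dropping
drop 'users'
```
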

5. Flume

  • Intro to Flume
  • Intro to Sink, Source, Flume Master and Flume agents
  • Importance of Flume agents
  • Live Demo on copying LOG DATA into HDFS
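
A minimal Flume agent configuration sketch for the log-copying demo above, tailing a log file into HDFS (the agent name `a1` and all paths are placeholders):

```
a1.sources  = r1
a1.channels = c1
a1.sinks    = k1

# Source: tail a log file
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/app.log
a1.sources.r1.channels = c1

# Channel: buffer events in memory
a1.channels.c1.type = memory

# Sink: write events into HDFS
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://namenode:8020/flume/logs
a1.sinks.k1.channel = c1
```
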

6. Sqoop

  • Intro to Sqoop
  • Importing data from an RDBMS into HDFS and exporting it back
  • Intro to incremental imports and its types
  • Use cases: importing MySQL data into HDFS
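
A sketch of the MySQL-to-HDFS import covered above, including an incremental import (host, database, table, and paths are placeholders):

```
# Append-mode incremental import: only rows whose id exceeds --last-value
sqoop import \
  --connect jdbc:mysql://dbhost/shop \
  --username dbuser --password-file /user/hadoop/.dbpass \
  --table orders \
  --target-dir /data/orders \
  --incremental append --check-column id --last-value 0
```
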

7. Zookeeper

  • Intro to Zookeeper
  • Zookeeper operations
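
The basic znode operations covered here, as run from the ZooKeeper CLI (`zkCli.sh`); the znode path and data are placeholders:

```
ls /                     # list znodes under the root
create /app "config-v1"  # create a znode with data
get /app                 # read the znode's data
set /app "config-v2"     # update it
delete /app              # remove it
```
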

8. Oozie

  • Intro to Oozie
  • What is job.properties?
  • What is workflow.xml?
  • Scheduling the jobs in Oozie
  • Scheduling MapReduce, HIVE, and PIG jobs/programs using Oozie
  • Setting up the VMware for Hadoop
  • Installing all Hadoop Components
  • Intro to Hadoop Distributions
  • Intro to Cloudera and its major components
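
An Oozie job is defined by a `workflow.xml` (the action graph) plus a `job.properties` file supplying variables such as `${nameNode}`. A minimal sketch of a workflow with one MapReduce action (names and paths are placeholders, and a real `map-reduce` action also needs a `<configuration>` block naming the mapper and reducer classes):

```xml
<workflow-app name="wordcount-wf" xmlns="uri:oozie:workflow:0.5">
    <start to="wordcount"/>
    <action name="wordcount">
        <map-reduce>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
        </map-reduce>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Word count job failed</message>
    </kill>
    <end name="end"/>
</workflow-app>
```

The matching `job.properties` would set `nameNode`, `jobTracker`, and `oozie.wf.application.path`.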

9. Spark

  • What is the Spark Ecosystem?
  • What is Scala and its utility in Spark?
  • What is SparkContext?
  • How to work on RDD in Spark
  • How to run a Spark Cluster
  • Comparison of MapReduce vs Spark
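
As a preview of working with RDDs, here is the Spark counterpart of the MapReduce word count from earlier in the course, in Scala. This assumes a `SparkContext` named `sc` (as provided by `spark-shell`); the HDFS paths are placeholders:

```scala
val lines  = sc.textFile("hdfs://namenode:8020/data/input.txt")
val counts = lines
  .flatMap(_.split(" "))   // one record per word
  .map(word => (word, 1))  // pair each word with a count of 1
  .reduceByKey(_ + _)      // sum counts per word, shuffling by key
counts.saveAsTextFile("hdfs://namenode:8020/data/wordcount")
```

The whole map-shuffle-reduce pipeline is a few method calls, which is central to the MapReduce-vs-Spark comparison above.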

10. Tableau

  • Tableau Fundamentals
  • Tableau Analytics
  • Visual Analytics
  • Connecting
  • Hadoop Integration with Tableau

About us

Bluebell Research Australia Pty. Ltd. is an Australian Startup engaged in training and placing candidates across Australia.

Contact Us

Email: training@bluebellresearch.com.au

Ph: +61 457298829/416769207

Address: 79 George Street, Parramatta, NSW, Australia
