Data is everywhere. People upload videos, take pictures, use apps on their phones, and search the web; machines, too, generate and store ever-growing volumes of data. Traditional single-machine tools struggle to process data sets of this scale, and Hadoop and large-scale distributed data processing in general are rapidly becoming an important skill set for many programmers. Hadoop is an open-source framework for writing and running distributed applications that process large amounts of data. This course introduces Hadoop both as a distributed system and as a data processing system. It begins with an overview of the MapReduce programming model, using a simple word-counting example to highlight the challenges of processing data at scale, then implements that example in Hadoop to demonstrate the framework's simplicity.
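The word-counting example mentioned above can be sketched in plain Python to show the three phases of the MapReduce model (map, shuffle, reduce) without a running Hadoop cluster. This is a minimal illustration of the programming model only; the function names are illustrative and are not part of the Hadoop API.

```python
from collections import defaultdict

def mapper(line):
    # Map phase: emit a (word, 1) pair for every word in an input line.
    for word in line.split():
        yield (word.lower(), 1)

def reducer(word, counts):
    # Reduce phase: sum all counts emitted for a single word.
    return (word, sum(counts))

def run_word_count(lines):
    # Shuffle phase: group mapper output by key, as the framework
    # would do between the map and reduce stages.
    grouped = defaultdict(list)
    for line in lines:
        for word, count in mapper(line):
            grouped[word].append(count)
    return dict(reducer(w, c) for w, c in grouped.items())

print(run_word_count(["the quick brown fox", "the lazy dog"]))
```

In a real Hadoop job the mapper and reducer run in parallel on different nodes, and the framework handles the shuffle, fault tolerance, and data locality; the logic of each phase, however, is exactly this simple.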
Skills covered
Techniques for big data analytics using Hadoop
The importance of distributed data storage systems
Course Syllabus
Introduction to Hadoop
Introduction to Big Data / Hadoop
Hadoop distributed file system (HDFS)
Introduction to ETL
Distributed computing
MapReduce abstraction
Programming MapReduce jobs
Introduction to Oozie and HDFS processing
Hadoop cluster and ecosystem
Input/Output formats and conversion between different formats
MapReduce features
Troubleshooting MapReduce jobs
YARN (Hadoop 2.0)