Data is everywhere. People upload videos, take pictures, use apps on their phones, search the web, and more. Machines, too, generate and store ever-growing amounts of data. Traditional tools struggle to process data sets of this size, so Hadoop, and large-scale distributed data processing in general, is rapidly becoming an important skill set for many programmers. Hadoop is an open-source framework for writing and running distributed applications that process large amounts of data. This course introduces Hadoop both as a distributed system and as a data processing system. You will get an overview of the MapReduce programming model through a simple word-counting example, see why existing tools fall short when processing data at large scale, and then dig deeper by implementing the same example in Hadoop to gain an appreciation of its simplicity.
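The word-counting example mentioned above can be sketched outside Hadoop in a few lines of Python. The map, shuffle, and reduce functions below are illustrative stand-ins for the phases that the Hadoop framework distributes across a cluster; the names are not Hadoop APIs.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input document."""
    for word in document.lower().split():
        yield (word, 1)

def shuffle(pairs):
    """Shuffle: group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: sum the counts collected for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}

def word_count(documents):
    """Run the three phases sequentially on a list of documents."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return reduce_phase(shuffle(pairs))
```

In Hadoop, the same map and reduce logic would be written as `Mapper` and `Reducer` classes, and the framework would handle the shuffle, fault tolerance, and distribution of work across machines.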

Skills covered

  • Apply different techniques of big data analytics using Hadoop
  • Understand the importance of distributed data storage systems

Course Syllabus

Introduction to Hadoop

  • Introduction to Big Data / Hadoop
  • Hadoop distributed file system (HDFS)
  • Introduction to ETL
  • Distributed computing
  • MapReduce abstraction
  • Programming MapReduce jobs
  • Introduction to Oozie and HDFS processing
  • Hadoop cluster and ecosystem
  • Input/output formats and conversion between different formats
  • MapReduce features
  • Troubleshooting MapReduce jobs
  • YARN (Hadoop 2.0)

