You, This Course and Us
Pig and the Hadoop ecosystem; Install and set up; How does Pig compare with Hive?; Pig Latin as a data flow language; Pig with HBase
Operating modes, running a Pig script, the Grunt shell; Loading data and creating our first relation; Scalar data types; Complex data types - The Tuple, Bag and Map; Partial schema specification for relations; Displaying and storing relations - The dump and store commands
Selecting fields from a relation; Built-in functions; Evaluation functions; Using the distinct, limit and order by keywords; Filtering records based on a predicate
Group by and aggregate transformations; Combining datasets using Join; Concatenating datasets using Union; Generating multiple records by flattening complex fields; Using Co-Group, Semi-Join and Sampling records; The nested Foreach command; Debug Pig scripts using Explain and Illustrate
Parallelize operations using the Parallel keyword; Join Optimizations: Multiple relations join, large and small relation join; Join Optimizations: Skew join and sort-merge join; Common sense optimizations
Parsing server logs; Summarizing error logs
Hadoop Install Modes; Hadoop Standalone mode Install; Hadoop Pseudo-Distributed mode Install
[For Linux/Mac OS Shell Newbies] Path and other Environment Variables; Setup a Virtual Linux Instance (For Windows users)
What will I learn?
- Work with unstructured data to extract information, transform it and store it in a usable form
- Write intermediate level Pig scripts to munge data
- Optimize Pig operations which work on large data sets
About the course
This course is taught by a team that includes two Stanford-educated ex-Googlers and two ex-Flipkart lead analysts. The team has decades of practical experience with large-scale data processing jobs.
Pig is aptly named: it is omnivorous, will consume any data that you throw at it, and brings home the bacon!
Let's parse that
omnivorous: Pig works with unstructured data. Many of its operations are very SQL-like, but Pig can perform them on data sets which have no fixed schema. Pig is great at wrestling data into a form which is clean and can be stored in a data warehouse for reporting and analysis.
bring home the bacon: Pig allows you to transform data in a way that makes it structured, predictable and useful, ready for consumption.
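To make the "no fixed schema" point concrete, here is a minimal Pig Latin sketch. The file name `server.log`, the output directory `errors_out`, and the assumption that the third field holds a log level are all hypothetical, chosen only for illustration.

```pig
-- Load a tab-separated file without declaring any schema.
-- 'server.log' is a hypothetical input path.
raw = LOAD 'server.log' USING PigStorage('\t');

-- With no schema, fields are referenced by position ($0, $1, ...)
-- and default to bytearray, so we cast before comparing.
-- Assumption: the third field ($2) holds the log level.
errors = FILTER raw BY (chararray)$2 == 'ERROR';

-- Give the fields we keep names and types, producing a structured relation.
slim = FOREACH errors GENERATE (chararray)$0 AS ts, (chararray)$3 AS message;

-- Write the cleaned-up records out as comma-separated text.
STORE slim INTO 'errors_out' USING PigStorage(',');
```

The script starts from untyped, positional fields and ends with a named, typed relation, which is the kind of clean-up the description above refers to.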
Pig Basics: Scalar and Complex data types (Bags, Maps, Tuples), basic transformations such as Filter, Foreach, Load, Dump, Store, Distinct, Limit, Order by and other built-in functions.
Advanced Data Transformations and Optimizations: The mind-bending Nested Foreach, Joins and their optimizations using "parallel", "merge", "replicated" and other keywords, Co-groups and Semi-joins, debugging using Explain and Illustrate commands
Real-world example: Clean up server logs using Pig
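As a rough illustration of the join optimizations named above, the sketch below combines a large relation with a small one using the "replicated" keyword and sets reducer parallelism with "parallel". The relation names, file paths, field layouts and the reducer count of 10 are assumptions made for this example, not values taken from the course.

```pig
-- Hypothetical inputs: a large click log and a small user lookup table.
logs  = LOAD 'logs'  AS (user_id:int, url:chararray);
users = LOAD 'users' AS (user_id:int, name:chararray);

-- Replicated join: the small relation is listed last and is loaded
-- into memory, so the join can be done on the map side.
joined = JOIN logs BY user_id, users BY user_id USING 'replicated';

-- PARALLEL sets the number of reducers for this grouping step.
grouped = GROUP joined BY users::name PARALLEL 10;

-- Count the log records per user name.
counts = FOREACH grouped GENERATE group AS name, COUNT(joined) AS hits;

DUMP counts;
```

A replicated join only makes sense when the last relation is small enough to fit in memory; for two large sorted inputs or heavily skewed keys, the 'merge' and 'skewed' variants are the usual alternatives.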
Who should take the course?
- Yep! Analysts who want to wrangle large, unstructured data into shape
- Yep! Engineers who want to parse and extract useful information from large datasets
Pre-requisites & Requirements
- Working with Pig requires some basic knowledge of the SQL query language and a basic familiarity with the Hadoop eco-system and MapReduce
- A basic understanding of SQL and working with data
- A basic understanding of the Hadoop eco-system and MapReduce tasks