A deep dive into learning Pig.

One of the biggest reasons Hadoop's popularity has skyrocketed in recent times is that tools like Pig and Hive run on top of it, giving non-programmers access to functionality that was previously exclusive to Java programmers. These tools were a consequence of the growing demand for Hadoop professionals. Other tools used by Hadoop professionals from non-Java backgrounds include Flume, Sqoop, HBase and Oozie.

To understand why you do not need Java to learn Hadoop, do check out this blog.

[Figure 1: Pig history]

Let’s understand how these features work.

[Figure 2: features]

We all know that programming knowledge is a necessity for writing MapReduce code. But what if there were a tool that could do the coding if you just provided the details? That is where Pig shows its muscle. Pig uses a language called Pig Latin that abstracts the programming away from the Java MapReduce idiom into a notation that makes MapReduce programming high level, much as SQL does for RDBMS systems. Scripts written in Pig Latin are automatically converted into equivalent MapReduce jobs. Isn’t that awesome? Another mind-blowing fact is that only about 10 lines of Pig are needed to replace 200 lines of Java.

10 lines of Pig = 200 lines of Java

This not only means that non-Java professionals can use Hadoop, but also points to the underlying fact that Pig is used just as much by technical developers.

[Figure 3: Pig]

Additionally, if you want to write your own MapReduce code, you can do that in languages like Perl, Python, Ruby or C. Some basic operations that we can perform on any data set using Pig are Group, Join, Filter and Sort. These operations can be performed on structured, unstructured and semi-structured data. They provide an ad hoc way of creating and executing MapReduce jobs on very large data sets.
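To get a feel for what a grouping operation turns into underneath, here is a minimal Python sketch of the map, shuffle and reduce steps for a count per country ID. It only illustrates the MapReduce idea and is not what Pig actually generates; the sample records and function names are made up for this example.

```python
from collections import defaultdict

# Hypothetical sample records: (country_id, url). Not taken from the real data set.
records = [(1, "/home"), (2, "/cart"), (1, "/search"), (1, "/home")]

def map_phase(rows):
    """Emit a (key, 1) pair for every input row, keyed by country ID."""
    for country_id, _url in rows:
        yield country_id, 1

def shuffle_phase(pairs):
    """Collect all values that share a key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Sum the values for each key to get a count per country ID."""
    return {key: sum(values) for key, values in grouped.items()}

counts = reduce_phase(shuffle_phase(map_phase(records)))
print(counts)  # {1: 3, 2: 1}
```

Writing these three steps by hand for every query is the boilerplate that a higher-level notation is meant to spare you.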

Next up, let’s understand Hive. It is an open-source, petabyte-scale data warehousing framework built on Hadoop for data summarization, query and analysis. Hive provides an SQL-like interface for Hadoop. You can use Hive to read and write files on Hadoop and run your reports from a BI tool. Some typical applications of Hive are:

[Figure 4: Hive applications]

Let me show you a demo using Pig on a Clickstream data set.
We will use this Clickstream data to perform transformations, joins and groupings.

[Figure 5: Clickstream data set]

A clickstream is the series of mouse clicks a user makes while browsing the Internet, especially when it is monitored to assess that person’s interests for marketing purposes. It is mainly used by online retail websites like Flipkart and Amazon, which track your activity to generate recommendations. The Clickstream data set that we have used has the following fields:

1. Type of language supported by the web application
2. Browser type
3. Connection type
4. Country ID
5. Time stamp
6. URL
7. User status
8. Type of user

[Figure 6: data set fields]

With the appropriate fields, the data looks like this:

[Figure 7: data set fields]

Below is the list of browser types that people have used to visit a particular website. The list includes browsers like Internet Explorer, Google Chrome, Lynx and so on.

[Figure 8: browser types]

The Internet connection type can be LAN, modem or Wi-Fi. See the image below for the complete list:

[Figure 9: Internet connection types]

In the next image, you will find the list of countries from which the website has attracted an audience, along with their IDs.

[Figure 10: website countries]

Once we have gathered all the data sets, we have to launch Pig’s Grunt shell, which is where we run Pig commands.

The first thing we have to do after launching the Grunt shell is load the Clickstream data into a Pig relation. A relation is nothing but a table. Below is the command that we use to load a file residing in HDFS into a Pig relation.

[Figure 11: load command]
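Since the original command was only shown as an image, here is a rough Python sketch of what the load step does conceptually: read delimited rows and attach a field name to every column. The field names, the tab delimiter and the two sample rows are all assumptions made for illustration, not the actual schema or data.

```python
import csv
import io

# Assumed field names, in the order the article lists the Clickstream fields.
FIELDS = ["language", "browser_type", "connection_type", "country_id",
          "timestamp", "url", "user_status", "user_type"]

# Two made-up, tab-delimited rows standing in for the file in HDFS.
SAMPLE = (
    "1\t2\t1\t91\t2014-01-01 10:00:00\t/home\tactive\tguest\n"
    "1\t3\t2\t1\t2014-01-01 10:05:00\t/cart\tactive\tadmin\n"
)

def load(stream, fields):
    """Read tab-delimited rows and attach a field name to every column."""
    reader = csv.reader(stream, delimiter="\t")
    return [dict(zip(fields, row)) for row in reader]

click_stream = load(io.StringIO(SAMPLE), FIELDS)
print(click_stream[0]["url"])          # URL of the first row
print(list(click_stream[0].keys()))    # the field names attached to each row
```

Every value is still a string at this point; any casting to numbers would be a separate step.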

We can verify the schema of the relation with the command describe click_stream.

[Figure 12: click_stream schema]

We now need to add the reference files which will contain details about the list of countries with their IDs and the different browser types along with their IDs.

[Figure 13: reference files with IDs]

We now have two reference files, but they need to be connected to form a relation.
We run a connection_ref command to indicate the type of connection.

Now that we have a working connection and an established relation, we will show you how we can transform that data.
For each record in Clickstream, we will generate a new record in a different format, i.e., the transformed data. The new format will include fields like time stamp, browser type, country ID and a few more.

[Figure 14: transformed fields]
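The per-record transformation described above can be sketched in Python roughly as follows. The field names and the sample record are assumptions, and the choice of which fields to keep is only an example of "a few more".

```python
# One made-up Clickstream record; the field names are assumptions.
click_stream = [
    {"language": "1", "browser_type": "2", "connection_type": "1",
     "country_id": "91", "timestamp": "2014-01-01 10:00:00",
     "url": "/home", "user_status": "active", "user_type": "guest"},
]

def transform(record):
    """Build a new record that keeps a few fields and casts the IDs to integers."""
    return {
        "timestamp": record["timestamp"],
        "browser_type": int(record["browser_type"]),
        "country_id": int(record["country_id"]),
        "url": record["url"],
    }

# One output record is generated for each input record.
transformed = [transform(record) for record in click_stream]
print(transformed[0])
```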

We can perform a Filter operation to trim down the data. The different types of users are Administrators, Guests and Bots. In our demo, I have filtered the list for the Guests.
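As a small illustration of the filter step, the Python sketch below keeps only the guest records. The records, the field name user_type and the label "guest" are hypothetical stand-ins for whatever the real data set uses.

```python
# Made-up records; only the fields needed for the example are included.
records = [
    {"url": "/home", "user_type": "guest"},
    {"url": "/admin", "user_type": "administrator"},
    {"url": "/robots.txt", "user_type": "bot"},
    {"url": "/cart", "user_type": "guest"},
]

def filter_by_user_type(rows, wanted):
    """Keep only the rows whose user_type equals the wanted value."""
    return [row for row in rows if row["user_type"] == wanted]

guests = filter_by_user_type(records, "guest")
print([row["url"] for row in guests])  # ['/home', '/cart']
```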

[Figure 15: guests]

If you remember, the country ID is present in the Clickstream data, and we loaded a country_ref file containing the names of the countries along with their IDs. We can thus perform a Join operation between the two and merge the data to derive insights.

[Figure 16: join operation]
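Conceptually, the join matches each click record with the reference row that has the same country ID. Here is a minimal Python sketch of an inner join done that way; the IDs, the country names and the function name are invented for the example and do not reflect the real reference file.

```python
# Made-up click records and reference rows keyed by a hypothetical country_id.
clicks = [
    {"url": "/home", "country_id": 91},
    {"url": "/cart", "country_id": 1},
    {"url": "/search", "country_id": 999},  # no matching reference row
]
country_ref = [
    {"country_id": 91, "country_name": "India"},
    {"country_id": 1, "country_name": "United States"},
]

def inner_join(left, right, key):
    """Hash join: index the right side by key, then look up each left row."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    joined = []
    for row in left:
        for match in index.get(row[key], []):
            joined.append({**row, **match})
    return joined

joined = inner_join(clicks, country_ref, "country_id")
for row in joined:
    print(row["url"], row["country_name"])
```

Because this is an inner join, the click with the unknown country ID is dropped from the result.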

Once we have joined the data, we can find out which countries the users come from by Grouping. With the grouped data, we can perform a Count operation to identify the number of users from a particular country.

[Figure 17: count operation]
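Grouping followed by a count can be sketched in a few lines of Python. The joined records below are made up, and country_name is an assumed field name.

```python
from collections import Counter

# Made-up output of the join step; only the fields needed here are included.
joined = [
    {"url": "/home", "country_name": "India"},
    {"url": "/cart", "country_name": "United States"},
    {"url": "/search", "country_name": "India"},
]

def count_by(rows, field):
    """Group rows by the given field and count how many rows fall in each group."""
    return Counter(row[field] for row in rows)

records_per_country = count_by(joined, "country_name")
for country, total in records_per_country.most_common():
    print(country, total)
```

Note that this counts records per country. If one user can appear in several records, counting distinct users would need a user identifier, which the field list above does not obviously include.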

It is not rocket science to derive insights from Big Data. These are just some of the many features that I have implemented, and with tools like Hive, HBase, Oozie, Sqoop and Flume there is a treasure trove of data yet to be explored. So for those of you who are holding yourselves back from learning Hadoop, it’s time to change.
