Pig Use Case – The Daily Show Data Analysis Part – I

In this post, we look at a use case built on guest data from The Daily Show. We will work through some problem statements and solve them using Pig scripts.

We have historical data on The Daily Show guests from 1999 to 2004. You can download this dataset from here.

Please find the dataset description below.

Dataset Description:

YEAR – The year the episode aired.

GoogleKnowlege_Occupation – The guest's occupation or office according to Google’s Knowledge Graph or, if the guest is not listed there, how Stewart introduced them on the program.

Show – Air date of the episode. Not unique, as some shows had more than one guest.

Group – A larger group designation for the occupation. For instance, U.S. senators, U.S. presidents, and former presidents are all under “politicians”.

Raw_Guest_List – The person or list of people who appeared on the show, according to Wikipedia. The GoogleKnowlege_Occupation only refers to one of them in a given row.

Problem Statement 1:

Find the five most common GoogleKnowlege_Occupation values among guests on the show in a particular time period.

Source Code:
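The listing for this problem is not included in this copy of the post. A minimal Pig Latin sketch that matches statements A through H as described below might look like the following. The input path `dailyshow.csv`, the field names, the absence of a header row, the `M/d/yy` date pattern, and the inclusive range bounds are illustrative assumptions, not the original code.

```pig
-- A: load the comma-separated file with a schema
--    (path and field names are assumptions; assumes the file has no header row)
A = LOAD 'dailyshow.csv' USING PigStorage(',')
    AS (year:chararray, occupation:chararray, show_date:chararray,
        guest_group:chararray, guest_list:chararray);

-- B: keep only the occupation and the air date
B = FOREACH A GENERATE occupation, show_date;

-- C: convert the date string to a datetime (assumes month/day/two-digit year)
C = FOREACH B GENERATE occupation, ToDate(show_date, 'M/d/yy') AS show_date;

-- D: keep rows inside the date range (inclusive bounds are an assumption)
D = FILTER C BY (show_date >= ToDate('1/11/99', 'M/d/yy'))
            AND (show_date <= ToDate('6/11/99', 'M/d/yy'));

-- E: group the filtered rows by occupation
E = GROUP D BY occupation;

-- F: emit each occupation with the number of rows in its group
F = FOREACH E GENERATE group AS occupation, COUNT(D) AS guest_count;

-- G: sort by the count, largest first
G = ORDER F BY guest_count DESC;

-- H: keep the first five rows
H = LIMIT G 5;

DUMP H;
```

Saved to a file (the name `top_occupations.pig` is hypothetical), a script like this can be tried in local mode with `pig -x local top_occupations.pig`.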

In statement A, we are loading the dataset using PigStorage along with the schema of the file.

In statement B, we are extracting the required columns, i.e., occupation and date.

In statement C, we are converting the date from a string to a datetime value using Pig's ToDate function.

In statement D, we are filtering the rows to a specific date range. Here, we have given the range 1/11/99 to 6/11/99. Read as month/day/year, that is January 11 to June 11, 1999, a span of about five months.

In statement E, we are grouping relation D by occupation.

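If you describe relation E, you will see that its schema has two fields: group, which holds the occupation, and a bag named D, which holds the tuples of relation D that share that occupation.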

In statement F, we are generating the group key and the count of tuples in each group. This gives each occupation and the number of times a guest with that occupation appeared on the show within the selected date range.

In statement G, we are ordering relation F by count in descending order.

In statement H, we are limiting the records of relation G to 5.

With this, we get the five most frequent GoogleKnowlege_Occupation values among guests on the show in the chosen period.

Dumping relation H prints these five occupations along with their counts.

Problem Statement 2:

Find the number of politicians who appeared on the show each year.

Source Code:
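As with the first problem, the listing is not included in this copy of the post. A minimal Pig Latin sketch that matches statements A through F as described below might look like the following. The input path, the field names, the exact label `'Politician'`, and ordering by the count are assumptions made for illustration.

```pig
-- A: load the comma-separated file with a schema
--    (path and field names are assumptions; assumes the file has no header row)
A = LOAD 'dailyshow.csv' USING PigStorage(',')
    AS (year:chararray, occupation:chararray, show_date:chararray,
        guest_group:chararray, guest_list:chararray);

-- B: keep only the year and the group
B = FOREACH A GENERATE year, guest_group;

-- C: keep only politicians (the exact label in the data is an assumption)
C = FILTER B BY guest_group == 'Politician';

-- D: group the politician rows by year
D = GROUP C BY year;

-- E: emit each year with the number of politician rows in it
E = FOREACH D GENERATE group AS year, COUNT(C) AS politician_count;

-- F: sort by the count, largest first (the sort key is an assumption)
F = ORDER E BY politician_count DESC;

DUMP F;
```

Because the filter runs before the grouping, a year with no politician guests does not appear in the output at all, rather than appearing with a count of zero.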

In statement A, we are loading the dataset using PigStorage along with the schema of the file.

In statement B, we are extracting the required columns, i.e., year and group.

In statement C, we are filtering the rows to keep only those whose group is Politician.

In statement D, we are grouping the relation C by year.

If you describe relation D, you will see that its schema has two fields: group, which holds the year, and a bag named C, which holds the tuples of relation C for that year.

In statement E, we are generating the group and the Count of values in the relation C.

In statement F, we are ordering relation E by count in descending order.

When we dump relation F, we get the number of politicians who were guests on the show each year.

We hope this post has been helpful in understanding how to perform analysis using Apache Pig. If you have any questions, feel free to comment below and we will get back to you as soon as possible.
