Pig Use Case – The Daily Show Data Analysis Part – II

In this post, we will look at The Daily Show use case. We will work on some problem statements and come up with solutions using Pig.

In our previous blog, Pig Use Case – The Daily Show Data Analysis Part – I, we worked on two problem statements: the top five kinds of GoogleKnowlege_Occupation among the guests on the show in a particular time period, and the number of politicians who came on the show as guests every year.

We have historical data of The Daily Show guests from 1999 to 2004. You can download this dataset from here.

Please find the dataset description below.

Dataset Description:

YEAR – The year the episode aired.

GoogleKnowlege_Occupation – Their occupation or office, according to Google’s Knowledge Graph. If they are not in there, how Stewart introduced them on the program.

Show – Air date of the episode. Not unique, as some shows had more than one guest.

Group – A larger group designation for the occupation. For instance, U.S. senators, U.S. presidents, and former presidents are all under “politicians”.

Raw_Guest_List – The person or list of people who appeared on the show, according to Wikipedia. The GoogleKnowlege_Occupation only refers to one of them in a given row.

Problem Statement 1:

Find the number of GoogleKnowlege_Occupation types in each group who have been guests on the show.

Source Code:

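-- Load the comma-separated dataset and declare its schema
A = load '/home/kiran/dialy_shows' using PigStorage(',') AS (year:chararray,occupation:chararray,date:chararray,grp:chararray,gusetlist:chararray);
-- Keep only the occupation and group columns
B = foreach A generate occupation,grp;
-- Group the rows by grp and count the rows in each group
C = group B by grp;
D = foreach C generate group, COUNT(B) as cnt;
-- Sort by count, largest first, and print the result
E = order D by cnt desc;
dump E;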

In statement A, we are loading the dataset using PigStorage along with the schema of the file.

In statement B, we are extracting the required columns i.e., occupation and the grp.

In statement C, we are grouping the relation B by the grp.

If you describe relation C, you can see its schema: a field named group that holds the grp value, followed by a bag named B that holds the matching (occupation, grp) tuples.
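A minimal sketch of that check, assuming the same input path and schema as the source code above. It only adds Pig's describe operator after the grouping step:

```pig
A = load '/home/kiran/dialy_shows' using PigStorage(',') AS (year:chararray,occupation:chararray,date:chararray,grp:chararray,gusetlist:chararray);
B = foreach A generate occupation,grp;
C = group B by grp;
-- Print the schema of C: a field named group (the grp value)
-- and a bag named B holding the (occupation, grp) tuples for that value
describe C;
```

Because describe only inspects the schema, it does not launch a job over the data, so it is a cheap way to confirm the shape of a relation before counting.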

In statement D, we are generating the group and the count of tuples in relation B for that group.

In statement E, we are ordering the counts in descending order. Dumping E displays the count of GoogleKnowlege_Occupation types in each group who have been guests on the show, largest first.

 

Problem Statement 2:

To verify Problem Statement 1, we will find the combinations of group and GoogleKnowlege_Occupation type that have been guests on the show.

Source Code:
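A minimal sketch of the script, reconstructed from the statement-by-statement description in this section. It assumes the same input path and schema as Problem Statement 1, and the descending sort in E is also an assumption:

```pig
-- Sketch only: path, schema and sort order are assumed from Problem Statement 1
A = load '/home/kiran/dialy_shows' using PigStorage(',') AS (year:chararray,occupation:chararray,date:chararray,grp:chararray,gusetlist:chararray);
B = foreach A generate occupation,grp;
-- Group on both columns, so each key is a (grp, occupation) tuple
C = group B by (grp, occupation);
-- Count the rows that share each (grp, occupation) combination
D = foreach C generate group, COUNT(B) as cnt;
E = order D by cnt desc;
dump E;
```

The only structural change from Problem Statement 1 is the composite key in statement C.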

In statement A, we are loading the dataset using PigStorage along with the schema of the file.

In statement B, we are extracting the required columns, i.e., occupation and grp.

In statement C, we are grouping relation B by both grp and occupation.

If you describe relation C, you can see that the group key is now a tuple of grp and occupation, followed by a bag of the matching tuples from B.

In statement D, we are generating the group and the count of tuples in relation B for each combination.

In statement E, we are displaying the count for each combination of group and GoogleKnowlege_Occupation type that has been a guest on the show.

If you add up the counts of all the combinations for Acting, you get a total of 930, which is the value displayed for Acting in Problem Statement 1.
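The same check can be done in Pig instead of by hand. The following is a sketch under the same path and schema assumptions as above. The relation names F and G and the alias total are hypothetical, chosen only for this illustration:

```pig
A = load '/home/kiran/dialy_shows' using PigStorage(',') AS (year:chararray,occupation:chararray,date:chararray,grp:chararray,gusetlist:chararray);
B = foreach A generate occupation,grp;
C = group B by (grp, occupation);
-- Flatten the (grp, occupation) key back into two plain columns next to the count
D = foreach C generate flatten(group) as (grp, occupation), COUNT(B) as cnt;
-- Regroup the per-combination counts by grp and add them up
F = group D by grp;
G = foreach F generate group as grp, SUM(D.cnt) as total;
dump G;
```

Each total in G should equal the count that Problem Statement 1 reports for the same group, since both are counting the same rows.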

We hope this post has been helpful in understanding how to perform analysis using Apache Pig. If you have any queries, feel free to comment below and we will get back to you at the earliest.
