Apache Pig Use Case – Electrical Bulb Testing

In this blog, we work through a use case involving electric bulbs and use the date and time functions in Pig.

In this instance, Pig runs in local mode and loads data from the local file system. The same script can also run in MapReduce mode against data stored in HDFS, whichever is more convenient.

In the research center of bulb manufacturing companies, the longevity of bulbs is tested by subjecting them to adverse conditions.

The dataset used in this case is a sample from a light bulb production house where bulbs are tested at random intervals of time. The first column, StartDate, is the date and time when testing of the bulb started; the second column, EndDate, is the date and time when testing ended.

StartDate            EndDate
30-Jun-2018 23:42    04-Jul-2018 15:10
30-Jun-2018 23:37
30-Jun-2018 23:13    30-Jun-2019 23:34m

A few fields may be empty, which indicates that the data is not available for some reason. As developers, we need not worry about the missing data: with data filtering, we can remove the incomplete rows.

Loading Data into the Pig environment

Since Pig's default loader, PigStorage, expects tab-delimited (\t) data, it is not mandatory to state USING PigStorage('\t') when loading; nevertheless, it is good practice to write it. If the dataset uses a different delimiter, pass that delimiter to PigStorage instead.

Now that the data is loaded into Pig, the first step is to filter the column we are working on.
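A minimal sketch of the load statement is shown here. The file name bulb_test.txt, the relation name bulbs, and the choice to load both columns as chararray are assumptions for illustration and are not taken from the original code.

```pig
-- Hypothetical file name: replace it with the path to your dataset.
-- Both columns are loaded as chararray so they can be cleaned before conversion.
bulbs = LOAD 'bulb_test.txt' USING PigStorage('\t')
        AS (StartDate:chararray, EndDate:chararray);

-- Print the schema to confirm the load definition.
DESCRIBE bulbs;
```

When Pig is started with pig -x local, the path is resolved against the local file system rather than HDFS.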

Here we remove all the rows with null data.

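In this step, we must also filter out every row whose EndDate contains a stray symbol (for example, the trailing m in the third sample row), because such values cannot be converted to a datetime.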
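The two filtering steps above could be sketched as follows. The file and relation names are hypothetical, and the regular expression assumes that every valid value looks like 30-Jun-2018 23:42; adjust it if your data differs.

```pig
-- Hypothetical file and relation names (assumptions, not from the original post).
bulbs = LOAD 'bulb_test.txt' USING PigStorage('\t')
        AS (StartDate:chararray, EndDate:chararray);

-- Step 1: drop rows where either timestamp is missing.
not_null = FILTER bulbs BY StartDate IS NOT NULL AND EndDate IS NOT NULL;

-- Step 2: keep only rows whose EndDate matches the full pattern
-- "two digits-three letters-four digits, space, HH:mm".
-- MATCHES must match the whole string, so "30-Jun-2019 23:34m" is rejected.
clean = FILTER not_null BY
        EndDate MATCHES '[0-9]{2}-[A-Za-z]{3}-[0-9]{4} [0-9]{2}:[0-9]{2}';
```

Matching against the expected shape, rather than listing the unwanted symbols, also rejects any other malformed value, including values with trailing spaces.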

We have to convert the data loaded in Pig into datetime format in order to work with it.

Here, we use two predefined functions:

ToDate()

MinutesBetween()

The first converts a character array into a datetime value that Pig can interpret; the second returns the difference, in minutes, between the two datetime values provided.

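The ToDate function accepts a format pattern that describes the layout of the year, month and day. The pattern letters follow Java date-format conventions and are case-sensitive: lowercase yyyy and dd for the year and the day of the month, uppercase MM for the month. Some examples are as follows:

yyyy-MM-dd

dd/MM/yyyy

dd-yy-MM

We choose the pattern that matches the structure of the dataset provided. For the sample rows above (for example, 30-Jun-2018 23:42), that pattern is dd-MMM-yyyy HH:mm.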
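As a sketch of the conversion step, the snippet below applies ToDate() and MinutesBetween() to the cleaned rows. All file, relation and field names are hypothetical, and the pattern dd-MMM-yyyy HH:mm is an assumption based on the sample rows shown earlier.

```pig
-- Hypothetical names throughout; the filter repeats the cleaning steps described earlier.
bulbs = LOAD 'bulb_test.txt' USING PigStorage('\t')
        AS (StartDate:chararray, EndDate:chararray);
clean = FILTER bulbs BY StartDate IS NOT NULL AND EndDate IS NOT NULL
        AND EndDate MATCHES '[0-9]{2}-[A-Za-z]{3}-[0-9]{4} [0-9]{2}:[0-9]{2}';

-- Convert both chararray columns to Pig's datetime type.
dated = FOREACH clean GENERATE
        ToDate(StartDate, 'dd-MMM-yyyy HH:mm') AS start_dt,
        ToDate(EndDate,   'dd-MMM-yyyy HH:mm') AS end_dt;

-- Pass the later datetime first so the difference comes out positive.
durations = FOREACH dated GENERATE
        start_dt, end_dt,
        MinutesBetween(end_dt, start_dt) AS minutes_on;

DUMP durations;
```

This sketch validates only EndDate, as in the text; if StartDate can also be malformed, the same MATCHES test can be applied to it before conversion.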

After simple filtering and conversion of the character array data to datetime format, we have determined, in minutes, how long each bulb stayed ON during testing.

We can view the results with the DUMP command, which displays the duration in minutes for each bulb.

Once we have this, we can analyze the result, for example to find the maximum or minimum time a bulb stayed ON.

The following command displays the average time the bulbs were ON during the testing phase.

Dump Avg_ALL;
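The original post does not show how Avg_ALL is built, so the following is only one plausible sketch of the aggregation, covering the average as well as the maximum and minimum mentioned above. Apart from Avg_ALL, every file, relation and field name is hypothetical.

```pig
-- Hypothetical names; the first three statements repeat the earlier steps.
bulbs = LOAD 'bulb_test.txt' USING PigStorage('\t')
        AS (StartDate:chararray, EndDate:chararray);
clean = FILTER bulbs BY StartDate IS NOT NULL AND EndDate IS NOT NULL
        AND EndDate MATCHES '[0-9]{2}-[A-Za-z]{3}-[0-9]{4} [0-9]{2}:[0-9]{2}';
durations = FOREACH clean GENERATE
        MinutesBetween(ToDate(EndDate,   'dd-MMM-yyyy HH:mm'),
                       ToDate(StartDate, 'dd-MMM-yyyy HH:mm')) AS minutes_on;

-- GROUP ... ALL collects every row into a single group so that
-- aggregate functions run over the whole dataset.
grouped = GROUP durations ALL;

-- Average ON time in minutes.
Avg_ALL = FOREACH grouped GENERATE AVG(durations.minutes_on) AS avg_minutes;

-- Longest and shortest ON time in minutes.
extremes = FOREACH grouped GENERATE
        MAX(durations.minutes_on) AS max_minutes,
        MIN(durations.minutes_on) AS min_minutes;
```

Running DUMP extremes; would print the maximum and minimum in the same way that DUMP Avg_ALL; prints the average.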

In this way, we can analyze the filtered result with Pig and get answers from a large dataset in a matter of minutes.

The dataset and code for practice are available at: https://drive.google.com/open?id=0B2nmxAJLHEE8Tnp3SmEyLUhnMkE

HAPPY LEARNING !

ANAND PANDEY
