PIG – Relational Operators

In this instructional post, we will explore and understand few important relational operators in Pig which is widely used in big data industry. Before we understand relational operators, let us see what Pig is.

Apache Pig, developed by Yahoo! helps in analyzing large datasets and spend less time in writing mapper and reducer programs. Pig enables users to write complex data analysis code without prior knowledge of Java. Pig’s simple SQL-like scripting language is called Pig Latin and has its own Pig runtime environment where PigLatin programs are executed. F

Below are two datasets that will be used in this post.

Employee_details.txt

This data set have 4 columns i.e.

Emp_id: unique id for each employee

Name: name of the employee

Salary: salary of an employee

Ratings: Rating of an employee.

Employee_expenses.txt

This data set have 2 columns i.e.

Emp_id: id of an employee

Expense: expenses made by an employee

As you have got the idea of datasets, let us proceed and perform some relational operations using this data.

NOTE: All the analysis is performed in local mode. To work in local mode, you need to start pig grunt shell using “pig -x local” command. 

Relational Operators:

Load

To load the data either from local filesystem or Hadoop filesystem.

Syntax:

LOAD ‘path_of_data’ [USING function] [AS schema];

Where;

path_of_data : file/directory name in single quotes.

USING : is the keyword.

function : If you choose to omit this, default load function PigStorage() is used.

AS : is the keyword

schema : schema of your data along with data type.

Eg:

The file named employee_details.txt is comma separated file and we are going to load it from local file system.

A = LOAD ‘/home/anand/pig/employee_details.txt’ USING PigStorage(‘,’) AS (id:int, name:chararray, salary:int, ratings:int);

To check the result, you can use DUMP command.

Similarly, you can load another data into another relation, say ‘B’

B = LOAD ‘/home/anand/pig/employee_expenses.txt’ USING PigStorage(‘\t’) AS (id:int, expenses:int);

As the fields in this file are tab separated, you need to use ‘\t’

NOTE: If you load this dataset in relation A, the earlier dataset will not be accessible.

Limit

Used to limit the number of outputs to the desired number.

Syntax:

Alias = LIMIT alias n;

Where;

alias : name of the relation.

n : number of tuples to be displayed.

Ex:

We will be limiting the result of relation A (described above) to 5.

limited_val = LIMIT A 5;

NOTE: there is no guarantee which 5 tuples will be the output.

Order

Sorts a relation based on single or multiple fields.

Syntax:

alias = ORDER alias BY {field_name [ASC | DESC]

Where;

alias : is the relation

ORDER : is the keyword.

BY : is the keyword.

field_name : column on which you want to sort the relation.

ASC : sort in ascending order

DESC : sort in descending order.

Eg:

We will sort the relation A based on the ratings field and get top 3 employee details with highest ratings.

Sorted = ORDER A by ratings DESC;

Result = LIMIT Sorted 3;

You can also Order the relation based on multiple fields. Let’s order the relation A based on Descending ‘ratings’ and Ascending ‘names’ and generate top 3 result.

Ex:

Double_sorted = ORDER A by ratings DESC, name ASC;

Final_result = LIMIT Double_sorted 3;

Now, compare the ‘Result’ and ‘Final_result’ relation.

Group

Groups the data based on one or multiple fields. It groups together tuples that have the same group key (key field). The key field will be a tuple if the group key has more than one field, otherwise it will be the same type as that of the group key.

Syntax:

alias = GROUP alias {ALL | BY field};

Where;

alias : is the relation

GROUP : is the keyword

ALL : keyword. Use ALL if you want all tuples to go to a single group

BY : keyword

Field : field name on which you want to group your data.

Ex:

We will group our relation A based on ratings.

Grouped = GROUP A BY ratings;

You can see that the output is a tuple based on ‘ratings’ with multiple bags. You can also group the data based on more than one fields. For example,

Multi_group = GROUP A BY (ratings, salary);

Foreach

It generates data transformations based on desired columns of data.

Syntax:

alias = FOREACH alias GENERATE {expression | field};

Where;

alias : is the relation

FOREACH : is the keyword

GENERATE : is the keyword

Ex:

In our previous example we saw how to group a relation. Now, using FOREACH, we will generate the count of employees belonging to a particular group.

Result = FOREACH Grouped GENERATE group,COUNT(A.ratings);

From the result, we can conclude that there are 4 employees who got 1 as their rating. It is indicated by the first row. Basically, if want to operate at column level, you can use Foreach.

Filter

Filters a relation based on certain condition.

Syntax:

alias = FILTER alias BY expression;

Where;

alias : is the relation

FILTER : is the keyword

BY : is the keyword

expression : condition on which filter will be performed.

Ex:

We will filter our data (relation A) based on ratings greater than or equal to 4.

Filtered = FILTER A BY ratings >= 4;

Use the FILTER operator to work with tuples or rows of data (if you want to work with columns of data, use the FOREACH …GENERATE operation).

FILTER is commonly used to select the data that you want; or, conversely, to filter out (remove) the data you don’t want.

We can also use multiple conditions to filter data at one go.

Eg:

Multi_condition = FILTER A BY (ratings >= 4) AND (salary >1000)

This will produce results that follows in the category of ratings greater than equals 4 and salary greater than 1000.

Store

Stores and saves the data into a filesystem.

Syntax:

STORE alias INTO ‘directory’ [USING function]

Where;

STORE : is a keyword

alias : is the relation which you want to store.

INTO : is the keyword

directory : name of directory where you want to store your result.

NOTE: If directory already exists, you will receive an error and STORE operation will fail.

function : the store function. By default, PigStorage is the default storage function and hence it is not mandatory to mention this explicitly.

Ex:

We will store our result named “Multi_condition” (achieved in previous operation) into local file system.

Only catch here is, I have specified PigStorage(‘|’) , means, I will loading result into local fs and the delimiter will be pipeline i.e. ‘|’

Let’s check the result.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s