PIG LATIN – OPERATORS

This post is about the operators in Apache Pig. Let’s take a quick look at what Pig and Pig Latin are, and the different modes in which Pig can be run, before heading on to the operators.

What is Apache Pig?

Apache Pig is a high-level platform for querying large data sets using Hadoop and MapReduce; its scripts are written in a procedural data-flow language. It is a Java package, so the scripts can be executed from any language implementation running on the JVM. It is widely used in iterative processes.

Apache Pig simplifies the use of Hadoop by allowing SQL-like queries on a distributed dataset, and makes it possible to create complex tasks that process large volumes of data quickly and effectively. One of Pig’s best features is that it supports many relational operations such as join, group and aggregate.

I know Pig sounds a lot like an ETL tool, and it does share many features with ETL tools. But the advantage of Pig over ETL tools is that it can run on many servers simultaneously.

What is Apache Pig Latin?

Apache Pig creates a simpler procedural language abstraction over MapReduce, called Pig Latin, to expose a more SQL-like interface for Hadoop applications. So instead of writing a separate MapReduce application, you can write a single script in Pig Latin that is automatically parallelized and distributed across a cluster. In simple words, a Pig Latin script is a sequence of simple statements, each taking an input and producing an output. The input and output data are composed of bags, maps, tuples and scalars.

Apache Pig Execution Modes:

Apache Pig has two execution modes:

  • Local Mode

In ‘Local Mode’, the source data is read from a local directory on your computer. Local mode can be specified using the ‘pig -x local’ command.


  • MapReduce Mode: 

To run Pig in MapReduce mode, you need access to a Hadoop cluster and an HDFS installation. MapReduce mode is the default, so it can be specified using the plain ‘pig’ command (or explicitly with ‘pig -x mapreduce’).


Apache Pig Operators:

A Pig Latin statement is an operator that takes a relation as input and produces another relation as output. These operators are the main tools Pig Latin provides to operate on data. They allow you to transform it by sorting, grouping, joining, projecting, and filtering.

Let’s create two files to run the commands:

We have two files named ‘first’ and ‘second’. The first file contains three fields: user, url and id.

[Screenshot: contents of the file ‘first’]

The second file contains two fields: url and rating. Both files are CSV files.

[Screenshot: contents of the file ‘second’]

The Apache Pig operators can be classified as: Relational and Diagnostic.

Relational Operators:

Relational operators are the main tools Pig Latin provides to operate on the data. They allow you to transform the data by sorting, grouping, joining, projecting and filtering. This section covers the basic relational operators.

LOAD:

The LOAD operator is used to load data from the local file system or from HDFS into a Pig relation.

In this example, the LOAD operator loads data from the file ‘first’ to form the relation ‘loading1’. The field names are user, url and id.
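Since the original screenshot is not available, here is a minimal sketch of what such a LOAD statement might look like. The comma delimiter follows from the files being CSV; the field types (chararray, int) are assumptions, as the post only names the fields.

```pig
-- Load the comma-separated file 'first' into the relation loading1.
-- Field types are assumed; only the field names come from the post.
loading1 = LOAD 'first' USING PigStorage(',')
           AS (user:chararray, url:chararray, id:int);

-- Print the relation to the console to check what was loaded.
DUMP loading1;
```

If the AS clause is left out, the fields have no names and would have to be referenced by position ($0, $1, $2).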


FOREACH:

This operator generates data transformations based on columns of data. It is used to add or remove fields from a relation. Use the FOREACH ... GENERATE operation to work with columns of data.
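As a rough illustration of the description above, the sketch below keeps only two of the three fields of ‘loading1’. The choice of fields, the relation name ‘foreach1’ and the field types are assumptions made for the example, not taken from the original screenshots.

```pig
-- Assumed schema for the file 'first' (types are not given in the post).
loading1 = LOAD 'first' USING PigStorage(',')
           AS (user:chararray, url:chararray, id:int);

-- Project each tuple down to the url and id fields, dropping user.
-- 'foreach1' is a hypothetical relation name.
foreach1 = FOREACH loading1 GENERATE url, id;

DUMP foreach1;
```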


FOREACH Result:

[Screenshot: output of the FOREACH example]

FILTER:

This operator selects tuples from a relation based on a condition.

In this example, we keep only the records from ‘loading1’ whose ‘id’ is greater than 8.
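A minimal sketch of that filter, assuming ‘id’ was loaded as an int (the post does not state the type) and using the hypothetical relation name ‘filter1’:

```pig
-- Assumed schema for the file 'first'.
loading1 = LOAD 'first' USING PigStorage(',')
           AS (user:chararray, url:chararray, id:int);

-- Keep only the tuples whose id is greater than 8.
-- 'filter1' is a hypothetical relation name.
filter1 = FILTER loading1 BY id > 8;

DUMP filter1;
```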


FILTER Result:

[Screenshot: output of the FILTER example]

JOIN:

The JOIN operator is used to perform an equijoin of two or more relations based on common field values. By default JOIN performs an inner join; outer joins are available through the LEFT, RIGHT and FULL keywords. Inner joins ignore null keys, so it makes sense to filter them out before the join.

In this example, we join the two relations ‘loading1’ and ‘loading2’ on the column ‘url’.
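The join described above could be sketched roughly as follows. The field types and the relation names ‘clean1’, ‘clean2’ and ‘join1’ are assumptions; the null-key filters are optional and simply apply the advice given earlier.

```pig
-- Assumed schemas for the two CSV files.
loading1 = LOAD 'first'  USING PigStorage(',')
           AS (user:chararray, url:chararray, id:int);
loading2 = LOAD 'second' USING PigStorage(',')
           AS (url:chararray, rating:int);

-- Optional: drop tuples with a null join key before joining.
clean1 = FILTER loading1 BY url IS NOT NULL;
clean2 = FILTER loading2 BY url IS NOT NULL;

-- Inner equijoin on the url field of both relations.
join1 = JOIN clean1 BY url, clean2 BY url;

DUMP join1;
```

In the joined relation both url columns are kept, so they are referred to as clean1::url and clean2::url to avoid ambiguity.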


JOIN Result:

[Screenshot: output of the JOIN example]

ORDER BY:

ORDER BY is used to sort a relation based on one or more fields. You can sort in ascending or descending order using the ASC and DESC keywords.

In this example, we sort the data in loading2 in ascending order on the rating field.
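A minimal sketch of that sort, assuming ‘rating’ was loaded as an int and using the hypothetical relation name ‘orderby1’:

```pig
-- Assumed schema for the file 'second'.
loading2 = LOAD 'second' USING PigStorage(',')
           AS (url:chararray, rating:int);

-- Sort by rating, lowest first. ASC is the default and could be omitted;
-- use DESC for the reverse order.
orderby1 = ORDER loading2 BY rating ASC;

DUMP orderby1;
```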


ORDER BY Result:

[Screenshot: output of the ORDER BY example]

DISTINCT:

DISTINCT removes duplicate tuples from a relation. Consider an input file in which the records amr,crap,8 and amr,myblog,10 each appear twice. When we apply DISTINCT to the data in this file, the duplicate entries are removed.
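The idea can be sketched as below. The file name ‘duplicates’, the relation names and the field types are all hypothetical; the post does not name the input file used for this example.

```pig
-- 'duplicates' is a hypothetical file containing repeated records,
-- e.g. amr,crap,8 and amr,myblog,10 each appearing twice.
withdups = LOAD 'duplicates' USING PigStorage(',')
           AS (user:chararray, url:chararray, id:int);

-- Remove tuples that are identical in every field.
distinct1 = DISTINCT withdups;

DUMP distinct1;
```

DISTINCT compares whole tuples, so to de-duplicate on a subset of fields you would first project those fields with FOREACH.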


DISTINCT Result:

[Screenshot: output of the DISTINCT example]

STORE:

STORE is used to save results to the file system.

Here we are saving the loading3 data into an output location named ‘storing’ on HDFS. Pig creates this location as a directory containing the output part files.
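A rough sketch of such a STORE statement follows. The post does not show how ‘loading3’ was built, so the join used here to produce it is purely an assumption, as are the field types and the comma delimiter for the output.

```pig
-- Assumed schemas for the two CSV files.
loading1 = LOAD 'first'  USING PigStorage(',')
           AS (user:chararray, url:chararray, id:int);
loading2 = LOAD 'second' USING PigStorage(',')
           AS (url:chararray, rating:int);

-- Assumption: loading3 is taken to be the join of the two relations.
loading3 = JOIN loading1 BY url, loading2 BY url;

-- Write loading3 to the output location 'storing' as comma-separated text.
STORE loading3 INTO 'storing' USING PigStorage(',');
```

The STORE will fail if the output location already exists, so it has to be removed or renamed before re-running the script.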


STORE Result:

[Screenshots: output of the STORE example]

GROUP:

The GROUP operator groups together the tuples with the same group key (key field). The key field will be a tuple if the group key has more than one field, otherwise it will be the same type as that of the group key. The result of a GROUP operation is a relation that includes one tuple per group.

In this example, we group the relation ‘loading1’ by the column url.
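A minimal sketch of this grouping, with assumed field types and the hypothetical relation names ‘group1’ and ‘counts’. The final FOREACH is an extra illustration of how a grouped relation is typically consumed and is not part of the original example.

```pig
-- Assumed schema for the file 'first'.
loading1 = LOAD 'first' USING PigStorage(',')
           AS (user:chararray, url:chararray, id:int);

-- One output tuple per distinct url: (group, {bag of matching tuples}).
group1 = GROUP loading1 BY url;

-- Show the schema of the grouped relation.
DESCRIBE group1;

-- Illustration only: count how many tuples fall into each group.
counts = FOREACH group1 GENERATE group AS url, COUNT(loading1) AS hits;

DUMP counts;
```

The first field of the result is always named group, and the bag takes the name of the input relation.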

GROUP Result:

[Screenshot: output of the GROUP example]

COGROUP:

COGROUP is the same as the GROUP operator. For readability, programmers usually use GROUP when only one relation is involved and COGROUP when multiple relations are involved.

In this example, we group ‘loading1’ and ‘loading2’ by the url field in both relations.
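That cogroup might be written roughly as follows; the field types and the relation name ‘cogroup1’ are assumptions.

```pig
-- Assumed schemas for the two CSV files.
loading1 = LOAD 'first'  USING PigStorage(',')
           AS (user:chararray, url:chararray, id:int);
loading2 = LOAD 'second' USING PigStorage(',')
           AS (url:chararray, rating:int);

-- One output tuple per url: (group, {tuples from loading1}, {tuples from loading2}).
cogroup1 = COGROUP loading1 BY url, loading2 BY url;

DUMP cogroup1;
```

Unlike an inner JOIN, a url that appears in only one of the relations still produces an output tuple, with an empty bag for the other relation.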


COGROUP Result:

[Screenshot: output of the COGROUP example]

CROSS:

The CROSS operator is used to compute the cross product (Cartesian product) of two or more relations.

Here we apply the cross product to loading1 and loading2.
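A minimal sketch, with assumed field types and the hypothetical relation name ‘cross1’:

```pig
-- Assumed schemas for the two CSV files.
loading1 = LOAD 'first'  USING PigStorage(',')
           AS (user:chararray, url:chararray, id:int);
loading2 = LOAD 'second' USING PigStorage(',')
           AS (url:chararray, rating:int);

-- Pair every tuple of loading1 with every tuple of loading2.
cross1 = CROSS loading1, loading2;

DUMP cross1;
```

The output has as many tuples as the product of the two input sizes, so CROSS can get expensive quickly on large relations.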


CROSS Result:

[Screenshot: output of the CROSS example]

LIMIT:

The LIMIT operator is used to limit the number of output tuples. If the specified number of output tuples is equal to or exceeds the number of tuples in the relation, the output will include all tuples in the relation.
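As a small sketch of the operator, the snippet below keeps at most three tuples of ‘loading1’. The limit of 3, the relation name ‘limit1’ and the field types are assumptions; the original screenshot is not available.

```pig
-- Assumed schema for the file 'first'.
loading1 = LOAD 'first' USING PigStorage(',')
           AS (user:chararray, url:chararray, id:int);

-- Keep at most 3 tuples (the number is chosen only for illustration).
limit1 = LIMIT loading1 3;

DUMP limit1;
```

Which tuples are returned is not guaranteed unless the relation has been sorted with ORDER BY immediately before the LIMIT.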


LIMIT Result:

[Screenshot: output of the LIMIT example]

SPLIT:

The SPLIT operator is used to partition the contents of a relation into two or more relations based on some expression. Depending on the conditions stated in the expression, a tuple may be assigned to one relation, to more than one relation, or to none.

Here we split loading2 into two relations, x and y. Relation x contains the tuples whose rating is greater than 8, and relation y contains the tuples whose rating is less than or equal to 8.
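A minimal sketch of that split, assuming ‘rating’ was loaded as an int:

```pig
-- Assumed schema for the file 'second'.
loading2 = LOAD 'second' USING PigStorage(',')
           AS (url:chararray, rating:int);

-- Route each tuple to x or y depending on its rating.
SPLIT loading2 INTO x IF rating > 8, y IF rating <= 8;

DUMP x;
DUMP y;
```

Tuples with a null rating satisfy neither condition and therefore end up in neither relation.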


Got a question for us? Please mention it in the comments section and we will get back to you.
