Hadoop Tutorial : Joins in Hive

In this Hadoop tutorial post, we will understand various types of JOIN operation that Hive supports. Hive, like alternative SQL database, permits users to join various tables. However, hive doesn’t support JOIN clauses that don’t seem to have the Equality Conditions because it is extremely troublesome to express such conditions as a MapReduce job.

Syntax for Disallowed Joins (Not allowed in hive)

SELECT a.* FROM a JOIN b ON (a.id <> b.id)

Now, let us see different types of JOIN operations supported by Hive.

  • JOIN
  • OUTER JOIN
    • LEFT OUTER JOIN
    • RIGHT OUTER JOIN
    • FULL OUTER JOIN
  • LEFT SEMI JOIN
  • CROSS JOIN

Download sample dataset to practice: Employee.txt  Expense.txt

JOIN

Similar to OUTER JOIN in SQL, join in hive is used to combine records based on common fields from multiple tables. Let us understand the concept by creating two tables and Joining them. Creating a table ’emp’ and loading the data into it as shown below:

 

 

Creating a table named ‘expense’ and loading the data into it is shown below:

 

Performing the JOIN operation.

Command:

SELECT c.ID, c.NAME, c.AGE, o.AMOUNT FROM emp c JOIN expense o ON (c.ID = o.emp_ID);

The above JOIN is also known as INNER JOIN as the value which is available is both the tables gets joined. Missing values in either table gets discarded. From the figure above we can understand that the values which are common in both the tables get joined and hence the name Inner Join. Check below output screenshot and you will find that few id’s from tables did not appear after performing join operation.

OUTER JOINS

Outer Joins can be further sub-categorized into:

  • Left Outer Join
  • Right Outer Join
  • Full Outer Join

Let us see each of them closely.

LEFT OUTER JOIN

If you have a requirement where in you want to keep all the rows of the table a (present in left side) + the common rows of the table a & b, you can use left outer join.

The syntax “FROM a LEFT OUTER join b” should be written in one line so as to grasp how it works: a is to the LEFT of b in this query, then all rows from a are kept.

The below image in this hadoop tutorial depicts the same.

In the above picture, if you consider ‘O’ as a left-hand side table and ‘E’ as a right-hand side table, you will get an idea that all the rows in the left table are present and the common rows of O+E are present (Shown by the yellow portion in this hadoop tutorial).

Syntax:

SELECT a.val, b.val FROM a LEFT OUTER JOIN b ON (a.key=b.key);

RIGHT OUTER JOIN: 

This join will return all the rows from right hand side table along with the common rows present in both left and right table. In case there is no match, join operation will still return the row but with NULL values.

Below image in this hadoop tutorial shows the right outer join.

Image missing.

 

Command:

FULL OUTER JOIN:

Full outer join combines all the records present in both left as well as right table based on the condition.In brief, you will not lose the data after performing the join operation.

Command

Below is the picture that depicts the same.

Consider ‘O’ as a left side table and ‘E’ as right side table. You can see that all the data is available (shown by yellow areas) after join operation.

 

 

LEFT SEMI JOIN

LEFT SEMI JOIN implements the related IN/EXISTS subquery semantics in an efficient manner. Since Hive presently doesn’t support IN/EXISTS subqueries, you’ll be able to rewrite your queries using the LEFT SEMI join. The restrictions of using the LEFT SEMI join is that the right-hand-side table ought to solely be referenced within the JOIN condition (the ON-clause), however not in the WHERE- or SELECT-clauses, etc.

This type of syntax query:
SELECT a.key, a.value
FROM a
WHERE a.key in
(SELECT b.key
FROM B);
Can be written as Syntax:
SELECT a.key, a.val
FROM a LEFT SEMI JOIN b on (a.key = b.key)
Example:

 

CROSS JOIN

The SQL CROSS JOIN produces a result set that is the number of rows within the 1st table multiplied by the number of rows within the second table if no WHEREVER clause is used alongside the CROSS join. This kind of a result’s called as a Cartesian product.

If where clause is used with CROSS join, it functions like an INNER join.

An alternative manner for achieving the same results to use column names separated by commas after selecting and mentioning the table names involved, this comes after a FROM clause.

Example:

 

 

Hope this In this hadoop tutorial helped you in understanding the different types of JOINS available in Hive using Map-Reduce. Feel free to comment below for your queries

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

w

Connecting to %s