Working With Hive Complex Data Types

In this blog, we will discuss the working of complex data types in Hive. Before we move ahead you can go through the below link blogs to gain more knowledge on Hive and its working.

Beginners Guide For Hive

Perform Word Count Job Using Hive

Pokemon Data Analysis Using Hive

Bucketing in Hive – Let’s Start.

What are Complex Data Types?

Complex Data types are also known as nested types which allow storing multiple values within a single row/column position in a Hive table. Generally, in Hive and other databases, we have more experience on working with primitive data types like:

Numeric Types

  • TINYINT (1-byte signed integer, from -128 to 127)
  • SMALLINT (2-byte signed integer, from -32,768 to 32,767)
  • INT (4-byte signed integer, from -2,147,483,648 to 2,147,483,647)
  • BIGINT (8-byte signed integer, from -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807)
  • FLOAT (4-byte single precision floating point number)
  • DOUBLE (8-byte double precision floating point number)
  • DECIMAL (Hive 0.13.0 introduced user definable precision and scale)

String Types

  • STRING
  • VARCHAR
  • CHAR

Date/Time Types

  • TIMESTAMP
  • DATE

Misc Types

  • BOOLEAN
  • BINARY

Apart from these primitive data types, Hive offers some complex data types which are as listed below:
Complex Data Types

  • arrays: ARRAY<data_type>
  • maps: MAP<primitive_type, data_type>
  • structs: STRUCT<col_name : data_type [COMMENT col_comment], …>

Thus, let us know what are these complex data types and their working in Hive.

So, in our example, we will be using our Hive default database to store the complex data type tables.

Array:

The first complex type is an array. It is nothing but a collection of items of similar data type. i.e, an array can contain one or more values of the same data type.

In our Array example, we will be using the dataset Temperature.txt where the fields are delimited by tab and the complex data type Array values are delimited by the comma.

Dataset_Temperature

Dataset Description

Date

State

Temperature [Depending on their district level wise]

So, let us create a table to store the above values of dataset Temperature using below code.

create table Temperature(date string,city string,MyTemp array<double>) row format delimited fields terminated by ‘\t’ collection items terminated by ‘,’;

We can observe, in the above code, we are creating an array named Myarray which will hold only the double type values.

To check the schema of the table Temperature we can use the command describe Temperature;

Now, let us load our input dataset Temperature.txt using the below command.

load data local inpath ‘/home/anand/Desktop/Temperature.txt’ into table Temperature;
We can view the contents of the table using the command select * from Temperature;

We can observe, from the above image all the array values are collected and stored in a single column.

To select a column and a value from the table we can use the below command.

select city,MyTemp[0] from Temperature;

We can observe from the above image; we are selecting the column state and the array MyTemp Zeroth position values.

select date,city,MyTemp[3] from Temperature;

We can observe from the above image; we are selecting the column state and the array MyTemp 4th position values.

So, we can follow the above steps to work with complex data type array values in Hive.

Now, let us know what is Map and its working.

Map:

Map is a collection of key-value pairs where fields are accessed using array notation of keys

Eg: [‘Key’]

In our Map example we will be using the dataset Schools.txt where the fields are delimited by tab, the complex type Map values are delimited by the comma.

Dataset_School_Data

Dataset Description

School Type

State

Gender

Total

So, let us create a table to store the above values of dataset Schools using the below code.

create table MySchools(schooltype string,state string,gender string, total map<int,int>) row format delimited fields terminated by ‘\t’ collection items terminated by ‘,’ map keys terminated by ‘:’;

We can observe, in the above code, we are creating a collection named total which will hold the Values of type int and int.

To check the schema of the table MySchools we can use the command describe MySchools;

Now, let us load our input dataset School_Data.txt using the below command.

load data local inpath ‘/home/anandDesktop/School_Data.txt’ into table MySchools;

We can view the contents of the table using the command select * from MySchools;

We can observe, from the above image each Map total column contains to data i.e, key and value.

Here year represents the key and the subsequent data represents the value.

To select a column and its key value from the table we can use the below command.

select total[2016] from MySchools where state=’Assam’;

We can observe from the above image, we are selecting the column state = ‘Assam’ to find the male and female strength in the year 2016.

select total[2017] from MySchools where state=’Chhattisgarh’ and gender=’Female’;

We can observe from the above image; we are selecting the column state = ‘chhattisgarh’ to find the female strength in the year 2017.

So, we can follow the above steps to work on a collection Map Key value pairs in Hive.

Now, let us know what is Struct and its working.

Struct:

Struct is a record type which encapsulates a set of named fields that can be any primitive data type. An element in STRUCT type can be accessed using the DOT (.) notation.

In our Struct example, we will be using the dataset Bikes.txt where the fields are delimited by tab and the complex type Array values are delimited by the comma.

Dataset_Bikes

Dataset Description

Name

BikeFeatures struct<EngineType,cc,power,gears>

So, let us create a table to store the above values of dataset Bikes using below code.

create table MyBikes(name string, BikeFeatures struct<EngineType:string,cc:float,power:float,gears:int>) row format delimited fields terminated by ‘\t’ collection items terminated by ‘,’;

We can observe, in the above code, we are creating a collection Struct named BikeFeatures which will hold string, float, float, and int value types.

To check the schema of the table MySchools we can use the command describe MyBikes;

Now, let us load our input dataset Bikes.txt using the below command.

load data local inpath ‘/home/anand/Desktop/Bikes.txt’ into table MyBikes;

We can view the contents of the table using the command select * from MyBikes;

We can observe, from the above image the Struct BikeFeatures column contains multiple values of different types.

To select a Struct column values from the table we can use the below command.

select BikeFeatures.EngineType from MyBikes;

To select a column and its Struct column values from the table we can use the below command.

We can observe from the above image we are using Dot (.) operator to access Struct Bike Features Engine Types column values from the table.

select BikeFeatures.EngineType from MyBikes where name=’Suzuki Swish’;

We can observe from the above image; we are selecting the column name where Bike name = ‘Suzuki Swish’ to find its EngineType.

So, we can follow the above steps to work on collection Struct values in Hive.

We hope this post has clarified the concept of Bucketing in Hive.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s