Introduction to HBase Filters

In this post, we will discuss why HBase needs Filters and how they work.

Before looking at HBase Filters, we should first understand why HBase came into the picture and how it overtook the RDBMS architecture for handling Big Data.

Limitations of RDBMS Architecture

First, the size of data has increased tremendously, well into the range of petabytes, and an RDBMS finds it challenging to handle such huge volumes. The usual response has been to scale up vertically by adding more central processing units and more memory to the database management system.

Second, the majority of the data comes in a semi-structured or unstructured format from social media, audio, video, texts, and emails. Unstructured data is outside the scope of an RDBMS because relational databases just can’t categorize it. They are designed to accommodate structured data, such as weblog, sensor, and financial data.

Third, “Big Data” is generated at a very high velocity, and an RDBMS lags here because it is designed for steady data retention rather than rapid growth. Even where an RDBMS can handle and store “Big Data,” doing so turns out to be very expensive.

As a result, the inability of relational databases to handle “Big Data” led to the emergence of new technologies. Google came up with a solution in 2004-2005: a NoSQL, distributed, column-oriented database known as BigTable, which allows users to perform random, real-time read/write access to data stored across a distributed cluster. Apache HBase followed from this design. HBase is modeled after Google’s BigTable and provides BigTable-like capabilities on top of Hadoop and HDFS.

Why do we need Filters?

HBase can query data very quickly on demand, but specific use cases may need only a subset of the scan results. Instead of returning the entire dataset to the client only to keep a subset of it, we can use Filters to get the data closer to what we need in less time.

For this purpose, HBase provides a set of predefined Filters and also lets us write custom filters, both of which we can use with Scan and Get to retrieve filtered results from an HBase table.

How HBase Filters Work

There are two prominent ways to read data from HBase.

  • Get is simply a Scan limited by the API to one row.

  • Scan fetches zero or more rows of a table. By default, a Scan reads the entire table from start to end. We can limit our Scan results in several different ways, which affect the Scan’s load in terms of IO, network, or both, as well as processing load on the client side.
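To make the difference between the two reads concrete, here is a minimal sketch in plain Java. It is a toy in-memory model and not the HBase client API: the class ToyTable and its methods are hypothetical names, a TreeMap stands in for HBase's row-key ordering, and the stored values are made up. It assumes the default Scan behaviour in which the start row is inclusive and the stop row is exclusive.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

// Toy stand-in for an HBase table: rows are kept sorted by row key.
// NOT the HBase API; every name in this class is hypothetical.
class ToyTable {
    private final NavigableMap<String, String> rows = new TreeMap<>();

    void put(String rowKey, String value) {
        rows.put(rowKey, value);
    }

    // "Get": reads exactly one row; returns null if the row does not exist.
    String get(String rowKey) {
        return rows.get(rowKey);
    }

    // "Scan" with no limits: every row key, in sorted order.
    List<String> scanAll() {
        return new ArrayList<>(rows.keySet());
    }

    // "Scan" limited to a key range: startRow inclusive, stopRow exclusive.
    List<String> scan(String startRow, String stopRow) {
        return new ArrayList<>(rows.subMap(startRow, true, stopRow, false).keySet());
    }

    public static void main(String[] args) {
        ToyTable customer = new ToyTable();
        customer.put("Prateek", "value-3");   // made-up values
        customer.put("Kiran", "value-1");
        customer.put("Manjunath", "value-2");

        System.out.println(customer.get("Kiran"));               // value-1
        System.out.println(customer.scanAll());                  // [Kiran, Manjunath, Prateek]
        System.out.println(customer.scan("Kiran", "Manjunath")); // [Kiran]
    }
}
```

Narrowing the key range is one way to limit a Scan; a Filter is another, and the two can be combined.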

When reading data from HBase using Get or Scan operations, we can use filters to return only a subset of the results to the client. Because the filter is applied on the server side, it reduces both the network bandwidth used and the amount of data the client needs to process.

Filters are generally used through the Java API. When a filter is written as a filter string, for example in the HBase shell, it takes zero or more arguments in parentheses, and a string argument is surrounded by single quotes (‘string’).

Now let’s look at how HBase Filters work with the help of the example below.
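The saving can be illustrated with a small plain-Java sketch. This is a toy model under stated assumptions, not HBase code: ToyServer, its methods, and the counter are hypothetical, the "network" is just a counter of rows handed back to the caller, and the assignment of values to rows is invented for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Toy model of a server that holds rows; NOT the HBase API.
// All names here are hypothetical.
class ToyServer {
    private final Map<String, String> rows = new LinkedHashMap<>();
    private int rowsSent = 0; // rows that crossed the simulated network

    void put(String rowKey, String value) {
        rows.put(rowKey, value);
    }

    // The filter runs on the "server": only matching rows are sent back.
    List<String> scan(Predicate<String> serverSideFilter) {
        List<String> sent = new ArrayList<>();
        for (Map.Entry<String, String> row : rows.entrySet()) {
            if (serverSideFilter.test(row.getValue())) {
                sent.add(row.getKey());
                rowsSent++;
            }
        }
        return sent;
    }

    int rowsSent() {
        return rowsSent;
    }

    // Made-up sample data: three rows, one of which holds "ACD-15".
    static ToyServer sample() {
        ToyServer server = new ToyServer();
        server.put("Kiran", "ACD-15");
        server.put("Manjunath", "ACD-16");
        server.put("Prateek", "ACD-17");
        return server;
    }

    public static void main(String[] args) {
        ToyServer unfiltered = sample();
        unfiltered.scan(value -> true); // no filter: the client receives every row
        System.out.println("rows sent without a filter: " + unfiltered.rowsSent());

        ToyServer filtered = sample();
        List<String> hits = filtered.scan("ACD-15"::equals);
        System.out.println("rows sent with a filter: " + filtered.rowsSent() + " " + hits);
    }
}
```

In both cases the server still looks at every row; what changes is how many rows are shipped to the client.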

 

Problem Statement:

To find a value stored in a column family of an existing HBase table, using Filters.

Table:

For the example below, we will use an existing table named “customer” in the HBase default namespace. The image below shows the HBase shell “list” command listing the tables present in the default namespace.

Contents of table “customer”:

As shown in the image below, the table “customer” consists of three rows, namely Kiran, Manjunath, and Prateek, with a single column family named “order” and a column qualifier named “number”.

Expected Output:

We can refer to the screenshot below to see what the expected output will be.

Source Code:

package com.acadgild.hbase;

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.util.Bytes;

Here’s the explanation of each line of code:

  • In line 1, we are declaring a class named Filter_ColumnValue.

  • In line 3, the Configuration object conf is loaded with the HBase configuration resources by the create() method of the HBaseConfiguration class.

  • In line 4, the HTable instance “table” lets us communicate with a single HBase table; its constructor accepts the configuration object and the table name as parameters.

  • In line 5, we are using the class SingleColumnValueFilter to filter the cells based on their value. It takes a CompareFilter.CompareOp operator (equal, greater, not equal, etc.) and either a byte[] value or a ByteArrayComparable.

Here, “order” is the column family name, “number” is its column qualifier name, and “ACD-15” is the value in the table “customer”. We are using the CompareOp.EQUAL operator to check whether the value “ACD-15” is present in the column qualifier “number”.

  • In line 6, to prevent the entire row from being emitted if the column is not found on a row, we are using the setFilterIfMissing(boolean) method. If the column is found, the entire row will be emitted only if the value passes the comparison. If the value fails, the row will be filtered out.

  • In line 7, we are creating a Scan instance “scan” to perform the Scan operation.

  • In line 8, we are using the setFilter method to attach the column value filter to the scan.

  • In line 9, we are declaring the ResultScanner instance “result”, which returns a scanner on the current table “customer” as specified by the Scan object.

  • In line 10, a for-each loop iterates over each row of the “customer” table that the result scanner returns.

  • In line 11, we are storing the value “ACD-15”, if it is found in the table “customer”, in the variable val.

  • In line 12, we are printing the value of the variable together with the string “Row-value Found”.

  • In line 14, we are closing the table.
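The row decision made by SingleColumnValueFilter together with setFilterIfMissing(boolean) can be sketched in plain Java. This is only a model of the documented behaviour, not the HBase implementation: the class ColumnValueFilterModel and its methods are hypothetical, only equality is modelled, and the sample data is invented (in particular, Prateek is given no “number” column here purely to exercise the missing-column case).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java model of the row decision; NOT the HBase API.
// All names here are hypothetical.
class ColumnValueFilterModel {

    // cellValue is the row's value in the filtered column,
    // or null when the row does not have that column at all.
    static boolean emitRow(String cellValue, String expected, boolean filterIfMissing) {
        if (cellValue == null) {
            return !filterIfMissing;       // column missing: emitted unless filterIfMissing is set
        }
        return cellValue.equals(expected); // column present: emitted only if the value matches
    }

    // Returns the row keys that would be emitted, in table order.
    static List<String> scan(Map<String, String> table, String expected, boolean filterIfMissing) {
        List<String> emitted = new ArrayList<>();
        for (Map.Entry<String, String> row : table.entrySet()) {
            if (emitRow(row.getValue(), expected, filterIfMissing)) {
                emitted.add(row.getKey());
            }
        }
        return emitted;
    }

    // Invented sample data: row key -> value of order:number (null = column absent).
    static Map<String, String> sampleTable() {
        Map<String, String> customer = new LinkedHashMap<>();
        customer.put("Kiran", "ACD-15");
        customer.put("Manjunath", "ACD-16");
        customer.put("Prateek", null);
        return customer;
    }

    public static void main(String[] args) {
        Map<String, String> customer = sampleTable();
        System.out.println(scan(customer, "ACD-15", false)); // [Kiran, Prateek]
        System.out.println(scan(customer, "ACD-15", true));  // [Kiran]
    }
}
```

With the flag left at false, a row that lacks the column slips through the filter, which is why the program sets it to true before scanning.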

Output:

From the screenshot below, we can see that the program prints the message “Row-value Found: ACD-15”.

From the above steps, we can see how the SingleColumnValueFilter helped us retrieve a value from a particular column family and qualifier, passed as arguments in the program, so that only the matching row was returned to the client instead of the whole table.

We hope this post has been helpful in understanding the concept of Filters in HBase for retrieving results from an HBase database. In case of any queries, feel free to comment below and we will get back to you at the earliest.

Keep visiting our website for more posts on Big Data and other technologies.
