Apache Sqoop: Beginner's Guide

Introduction

Traditional databases have long been the top choice of businesses and enterprises for data storage. But with the rapidly increasing volume and variety of data, Hadoop and NoSQL databases are gaining momentum, and many enterprises have started moving their huge datasets into Hadoop. While Pig and Hive are used to query data already in Hadoop, Apache Sqoop is used to transfer data between traditional databases and Hadoop.

What is Apache Sqoop?

Apache Sqoop is a general-purpose tool for bulk transfer of data from databases like Oracle and MySQL to Hadoop. Sqoop supports a number of traditional databases, including Oracle, MySQL, Postgres, and Teradata. It can also transfer data from Hadoop back to traditional data stores. Transferring data from an enterprise data store through a hand-written MapReduce application can be challenging if done without Sqoop.

Working

Sqoop can import data into and export data from Hadoop. It transfers data very efficiently from an RDBMS to HDFS and vice versa. It provides APIs for building connectors, which can be added to Sqoop to connect to different databases, and it comes with built-in connectors for many popular databases. Sqoop is implemented on top of MapReduce, which gives it the ability to handle bulk data transfers between different data stores.
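Before running any import, the JDBC-based connectors can be exercised directly to check connectivity. The commands below are a minimal sketch; the MySQL host, database name, and user are assumptions made purely for illustration.

# List the databases visible to this user (hypothetical host and credentials)
sqoop list-databases --connect jdbc:mysql://db.example.com/ --username retail_user -P

# List the tables in one database before importing from it
sqoop list-tables --connect jdbc:mysql://db.example.com/retail --username retail_user -P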

Sqoop Import

Sqoop imports individual tables from a traditional database (RDBMS) into Hadoop storage (HDFS). Sqoop divides a table into roughly equal ranges using its primary key, creates a mapper for each range, and these mappers write to HDFS in parallel. Each mapper performs its transfer within transactions to ensure no data is lost while keeping resource usage low. Sqoop also allows incremental imports, which pick up only the rows added to the source table since the previous import.
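To make this concrete, the sketch below maps the behaviour described above onto command-line options: --split-by picks the column used to divide the table, -m sets the number of parallel mappers, and --incremental append with --check-column and --last-value pulls only new rows. The JDBC URL, table, and column names are hypothetical; the generic syntax follows in the next section.

# Hypothetical example: split the orders table on its primary key, run 4 mappers
# in parallel, and append only rows added since the last import (order_id > 0).
sqoop import \
  --connect jdbc:mysql://db.example.com/retail \
  --username retail_user -P \
  --table orders \
  --split-by order_id \
  -m 4 \
  --incremental append \
  --check-column order_id \
  --last-value 0 \
  --target-dir /user/hadoop/orders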


Syntax

sqoop import (generic-args) (import-args)

Importing a table

sqoop import --connect host_name --username user_name --table table_name -m 1
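As a hedged, concrete variant of the command above: --connect takes a full JDBC URL rather than just a host name, and --columns and --where can restrict the import to a subset of the table. The database, table, and column names are illustrative assumptions.

# Import only selected columns of US customers (hypothetical names throughout)
sqoop import \
  --connect jdbc:mysql://db.example.com/retail \
  --username retail_user -P \
  --table customers \
  --columns "id,name,email" \
  --where "country = 'US'" \
  -m 1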

Importing into a directory

sqoop import --connect host_name --username user_name --table table_name -m 1 --target-dir directory_name

Importing all tables

sqoop import-all-tables --connect host_name --username user_name
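A fuller import-all-tables sketch, assuming the same hypothetical database as earlier: --warehouse-dir sets the parent HDFS directory (one sub-directory per table) and --exclude-tables skips tables that should not be copied.

# Import every table except the ones listed; a single mapper avoids the need
# for a primary key on every table.
sqoop import-all-tables \
  --connect jdbc:mysql://db.example.com/retail \
  --username retail_user -P \
  --warehouse-dir /user/hadoop/retail \
  --exclude-tables audit_log,tmp_staging \
  -m 1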

Sqoop Export

When exporting data from HDFS, Sqoop uses MapReduce to do bulk inserts into the traditional database table. The target table must already exist in the database. Sqoop generates INSERT statements that store the data in the database. It also has an UPDATE mode, in which it generates UPDATE statements to update existing records in the table.


Syntax

sqoop export (generic-args) (export-args)

sqoop export --connect host_name --username user_name --table table_name --export-dir directory_name
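The sketch below illustrates the UPDATE mode described above: --update-key names the column used to match existing rows, and --update-mode allowinsert turns the export into an upsert (insert new rows, update matching ones); allowinsert support depends on the connector. All names and paths here are hypothetical.

# Export comma-delimited files from HDFS, updating rows that already exist
sqoop export \
  --connect jdbc:mysql://db.example.com/retail \
  --username retail_user -P \
  --table daily_summary \
  --export-dir /user/hadoop/summary \
  --input-fields-terminated-by ',' \
  --update-key summary_date \
  --update-mode allowinsert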

 

Sqoop Jobs

These individual commands can be saved as Sqoop jobs and reused.

sqoop job <option> [job-id] -- <sqoop command>

Create Sqoop Import Job

sqoop job --create job_name -- import --connect host_name --username user_name --table table_name -m 1

Verify Sqoop Job

To verify the already saved Sqoop jobs, use --list

sqoop job --list

Inspect Sqoop Job

To inspect or verify a specific job and its details, use --show

sqoop job --show job_name

Execute Sqoop Job

To execute a Sqoop job, use --exec

sqoop job --exec job_name
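Putting the job commands together, the hypothetical workflow below saves an incremental import as a job and reruns it. A saved job stores the last imported value in the Sqoop metastore, so repeated executions fetch only new rows. Connection details, names, and paths are assumptions for illustration.

# Save an incremental import of the orders table as a reusable job
sqoop job --create daily_orders_import -- import \
  --connect jdbc:mysql://db.example.com/retail \
  --username retail_user -P \
  --table orders \
  --incremental append \
  --check-column order_id \
  --last-value 0 \
  --target-dir /user/hadoop/orders

sqoop job --exec daily_orders_import     # first run imports rows with order_id > 0
sqoop job --exec daily_orders_import     # later runs resume from the stored last value
sqoop job --delete daily_orders_import   # remove the saved job when no longer needed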

Features of Apache Sqoop

Apache Sqoop is easy to use, supports multiple data platforms, provides strong security, and integrates well with other Hadoop ecosystem components.

  • Integrates with various Hadoop components, such as the Hive query view and Oozie, to schedule import/export jobs
  • Automatically generates client-side code for its jobs (a codegen sketch follows this list)
  • Parallelizes data transfer for fast performance
  • JDBC-based implementation for connecting to databases
  • Connectors allow data to be transferred efficiently while ensuring data integrity and consistency, and new connectors can easily be built to support more databases
  • Sqoop is enterprise-ready, offering improved error handling and handling of temporary tables
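
As a sketch of the code generation mentioned in the list above, sqoop codegen produces the Java class Sqoop uses to serialize a table's records, without running an import. The connection details and output directory below are hypothetical.

# Generate customers.java (the record class for the customers table)
sqoop codegen \
  --connect jdbc:mysql://db.example.com/retail \
  --username retail_user -P \
  --table customers \
  --outdir /tmp/sqoop_codegen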

End Notes

Apache Sqoop takes the pain out of transferring huge datasets from traditional databases to HDFS. Enterprises like the Apollo Group use Sqoop to bring data in and out of Hadoop, and the online marketing company Coupons.com uses Sqoop to transfer data between Hadoop and its IBM Netezza data warehouse.

 

For any queries:

anandpandey04@live.com
