Introduction
Traditional databases have long been the top choice of businesses and enterprises for data storage. But with the rapidly increasing volume and variety of data, Hadoop and NoSQL databases are gaining momentum. Many enterprises have started moving to Hadoop to store their huge datasets. While Pig and Hive are used to query the data in Hadoop, Apache Sqoop is used to transfer data between traditional databases and Hadoop.
What is Apache Sqoop?
Apache Sqoop is a general-purpose tool for bulk transfer of data between relational databases and Hadoop. Sqoop supports a number of traditional databases such as Oracle, MySQL, Postgres, and Teradata. It can also transfer data from Hadoop back to these traditional data stores. Writing a MapReduce application by hand to move data out of an enterprise data store can be challenging; Sqoop automates this work.
Working
Sqoop can import data into and export data from Hadoop. It transfers data very efficiently from an RDBMS to HDFS and vice versa. It provides connector APIs for building new connectors, which can be added to Sqoop to connect to different databases, and it ships with built-in connectors for many popular databases. Sqoop is implemented on top of MapReduce, which gives it the ability to handle bulk data transfers between different data stores in parallel.
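As a sketch of how this parallelism is controlled in practice, the number of parallel mappers and the column used to partition the table can be set on the command line. The JDBC URL, credentials, and table/column names below are illustrative placeholders, not values from this article.

```shell
# Import a table using 4 parallel mappers, splitting the work on the id column.
# Connection string, user, table, and column names are illustrative.
sqoop import \
  --connect jdbc:mysql://dbserver/sales \
  --username sqoop_user -P \
  --table orders \
  --split-by id \
  -m 4
```

Each of the 4 mappers receives one range of `id` values and writes its slice of the table to HDFS independently.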
Sqoop Import
Sqoop imports individual tables from a traditional database (RDBMS) into Hadoop storage (HDFS). Sqoop divides each table into equal ranges using the table's primary key, creates a mapper for each range, and these mappers write to HDFS in parallel. Each mapper transfers data in transactions to ensure no data is lost while keeping resource usage low. Sqoop also supports incremental imports, which pick up only the data added to the traditional data store since the last import.
Syntax
sqoop import (generic-args) (import-args)
Importing a table
sqoop import --connect host_name --username user_name --table table_name -m 1
Importing into directory
sqoop import --connect host_name --username user_name --table table_name -m 1 --target-dir directory_name
Importing all tables
sqoop import-all-tables --connect host_name --username user_name
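The incremental imports mentioned above can be sketched as follows: `--check-column` names the column Sqoop watches for new rows, and `--last-value` is the highest value already imported. The connection string and names below are illustrative placeholders.

```shell
# Append-mode incremental import: fetch only rows whose id is greater
# than the last value imported in a previous run (values are illustrative).
sqoop import \
  --connect jdbc:mysql://dbserver/sales \
  --username sqoop_user -P \
  --table orders \
  --incremental append \
  --check-column id \
  --last-value 1000
```

At the end of the run, Sqoop prints the new last value to use for the next incremental import.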
Sqoop Export
When exporting data from HDFS, Sqoop uses MapReduce to do bulk inserts into the traditional database table. The target table must already exist in the database. Sqoop generates INSERT statements that store the data into the database. It also has an UPDATE mode, in which it generates UPDATE statements to modify existing records in the table.
Syntax
sqoop export (generic-args) (export-args)
sqoop export --connect host_name --username user_name --table table_name --export-dir directory_name
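The UPDATE mode described above is selected with `--update-key`, which names the column used to match existing rows. A sketch, with the host, table, and key column as illustrative placeholders:

```shell
# Export in update mode: Sqoop generates UPDATE statements keyed on the
# id column instead of plain INSERTs (names below are illustrative).
sqoop export \
  --connect jdbc:mysql://dbserver/sales \
  --username sqoop_user -P \
  --table orders \
  --export-dir /user/hadoop/orders \
  --update-key id
```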
Sqoop Jobs
These individual commands can be saved as Sqoop jobs and reused.
sqoop job <option> [jobId] -- <sqoop command>
Create Sqoop Import Job
sqoop job --create job_name -- import --connect host_name --username user_name --table table_name -m 1
Verify Sqoop Job
To list the already saved Sqoop jobs, use --list
sqoop job --list
Inspect Sqoop Job
To inspect the details of a specific saved job, use --show
sqoop job --show job_name
Execute Sqoop Job
To execute a saved Sqoop job, use --exec
sqoop job --exec job_name
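Saved jobs pair naturally with the incremental imports described earlier, since Sqoop records the last imported value between runs of a job. A sketch, with the job name, connection string, and column names as illustrative placeholders:

```shell
# Save an incremental import as a reusable job; each --exec run resumes
# from the last value Sqoop stored after the previous run (names are illustrative).
sqoop job --create daily_orders -- import \
  --connect jdbc:mysql://dbserver/sales \
  --username sqoop_user -P \
  --table orders \
  --incremental append \
  --check-column id \
  --last-value 0

# Run the saved job; repeat on a schedule (e.g. via Oozie or cron).
sqoop job --exec daily_orders
```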
Features of Apache Sqoop
Apache Sqoop is easy to use, supports multiple data platforms, provides strong security, and integrates well with other Hadoop ecosystem components.
- Integrates with various Hadoop components, such as the Hive Query view and Oozie, to schedule import-export jobs
- Automatically generates client-side code for various jobs
- Parallelizes data transfer for fast performance
- JDBC based implementation for connecting to databases
- Connectors allow data to be transferred efficiently while ensuring data integrity and consistency, and new connectors can easily be built to support more databases
- Sqoop is enterprise-ready, offering improved error handling and support for temporary (staging) tables
End Notes
Apache Sqoop takes the pain away when transferring huge datasets between traditional databases and HDFS. Enterprises like the Apollo Group use Sqoop to bring data in and out of Hadoop, and the online marketing company Coupons.com uses Sqoop to transfer data between Hadoop and its IBM Netezza data warehouse.
For any queries: anandpandey04@live.com