Sqoop - Data splitting

Solution 1:

There is some gap in your understanding.

First of all, the degree of parallelism is controlled by -m <n> or --num-mappers <n>. By default value of --num-mappers is 4.

Second, --split-by <column-name>, will split your task on the basis of column-name.

Third, $CONDITIONS, it is used internally by sqoop to achieve this splitting task.

Example, You fired a query:

sqoop import --connect jdbc:mysql://myserver:1202/ --username u1 --password p1 --query 'select * from emp where $CONDITIONS' --split-by empId --target-dir /temp/emp -m 4

Say, my empId is uniformly distributed from 1- 100.

Now, sqoop will take --split-by column and find its max and min value using query:

SELECT MIN(empId), MAX(empId) FROM (Select * From emp WHERE (1 = 1) ) t1

See it replaced $CONDITIONS with (1 = 1).

In our case, min, max values are 1 and 100.

As number of mappers are 4, sqoop will divide my query in 4 parts.

Creating input split with lower bound 'empId >= 1' and upper bound 'empId < 25'

Creating input split with lower bound 'empId >= 25' and upper bound 'empId < 50'

Creating input split with lower bound 'empId >= 50' and upper bound 'empId < 75'

Creating input split with lower bound 'empId >= 75' and upper bound 'empId <= 100'

Now $CONDITIONS will again come into the picture. It is replaced by above range queries.

First mapper will fire query like this:

Select * From emp WHERE empId >= 25' AND 'empId < 50

and so on for other 3 mappers.

Results from all the mappers is aggregated and written to a final HDFS directory.

Regarding your query :

select deptid, avg(salary) from emp group by deptid

you will specify

--query 'select deptid, avg(salary) from emp group by deptid where $CONDITIONS'

It will be first converted to

select deptid, avg(salary) from emp group by deptid where (1 = 0)

to fetch column metadata.

I believe this query won't run in RDBMS. Try above query(having Where (1 = 0)) directly in Mysql.

So you will not be able to use this query to fetch data using Sqoop.

Sqoop is used for simpler SQL queries.

Solution 2:

Your query, 'Select * from emp where empid< 1123 and $CONDITIONS', works perfectly well to import data from RDBMS to HDFS.

Even though $CONDITIONS first evaluates to 1=0 in where clause, if you see the Sqoop import log on console, you will actually see another SQL Val boundary query that is replacing $CONDITIONS with 1=1 which will pass the query and hence the data can be imported.

Note that Sqoop CAN import even medium complex SQL queries like joins. I am not sure if it can support highly complex SQL queries though as I have not tested it myself.