Sqoop - Data splitting

Solution 1:

There is some gap in your understanding.

First of all, the degree of parallelism is controlled by -m <n> or --num-mappers <n>. By default value of --num-mappers is 4.

Second, --split-by <column-name>, will split your task on the basis of column-name.

Third, $CONDITIONS, it is used internally by sqoop to achieve this splitting task.

Example, You fired a query:

sqoop import --connect jdbc:mysql://myserver:1202/ --username u1 --password p1 --query 'select * from emp where $CONDITIONS' --split-by empId --target-dir /temp/emp -m 4

Say, my empId is uniformly distributed from 1- 100.

Now, sqoop will take --split-by column and find its max and min value using query:

SELECT MIN(empId), MAX(empId) FROM (Select * From emp WHERE (1 = 1) ) t1

See it replaced $CONDITIONS with (1 = 1).

In our case, min, max values are 1 and 100.

As number of mappers are 4, sqoop will divide my query in 4 parts.

Creating input split with lower bound 'empId >= 1' and upper bound 'empId < 25'

Creating input split with lower bound 'empId >= 25' and upper bound 'empId < 50'

Creating input split with lower bound 'empId >= 50' and upper bound 'empId < 75'

Creating input split with lower bound 'empId >= 75' and upper bound 'empId <= 100'

Now $CONDITIONS will again come into the picture. It is replaced by above range queries.

First mapper will fire query like this:

Select * From emp WHERE empId >= 25' AND 'empId < 50

and so on for other 3 mappers.

Results from all the mappers is aggregated and written to a final HDFS directory.

Regarding your query :

select deptid, avg(salary) from emp group by deptid

you will specify

--query 'select deptid, avg(salary) from emp group by deptid where $CONDITIONS'

It will be first converted to

select deptid, avg(salary) from emp group by deptid where (1 = 0)

to fetch column metadata.

I believe this query won't run in RDBMS. Try above query(having Where (1 = 0)) directly in Mysql.

So you will not be able to use this query to fetch data using Sqoop.

Sqoop is used for simpler SQL queries.

Solution 2:

Your query, 'Select * from emp where empid< 1123 and $CONDITIONS', works perfectly well to import data from RDBMS to HDFS.

Even though $CONDITIONS first evaluates to 1=0 in where clause, if you see the Sqoop import log on console, you will actually see another SQL Val boundary query that is replacing $CONDITIONS with 1=1 which will pass the query and hence the data can be imported.

Note that Sqoop CAN import even medium complex SQL queries like joins. I am not sure if it can support highly complex SQL queries though as I have not tested it myself.

Error while import tensorflow module

Calculate percentage of each category in each group in R [duplicate]

What input voltages does the 85W Magsafe 2 power adapter accept?

Can Mail fetch email signature set using Google Mail?

How can I run Mojave on an iMac that won’t run Mojave?

How can I extend the iPhone call log to have more than 100 entries? [duplicate]

How to remove an iTunes Wi-Fi Sync host

Macbook Pro 2010 crashes after waking up and usually when launching a Google Search in Chrome

Is it possible to execute the package (located in FTP) from terminal of Mac

Setting arrangement of new Finder Window via AppleScript

iOS mount point in macOS - Is there any way to access the iOS files in macOS directly?

how can I diagnose the root of the problem when my mac starts freezing for 10 sec each time I switch to an app?