Hive unable to manually set number of reducers
I have the following hive query:
select count(distinct id) as total from mytable;
which automatically spawns:
1408 Mappers
1 Reducer
I need to manually set the number of reducers and I have tried the following:
set mapred.reduce.tasks=50
set hive.exec.reducers.max=50
but none of these settings seem to be honored. The query takes forever to run. Is there a way to manually set the reducers or maybe rewrite the query so it can result in more reducers? Thanks!
Writing the query in Hive like this:
SELECT COUNT(DISTINCT id) ....
will always result in only one reducer, because a single global distinct count has to be computed by one reducer. You should:
- use this command to set the desired number of reducers:
set mapred.reduce.tasks=50
- rewrite the query as follows:
SELECT COUNT(*) FROM ( SELECT DISTINCT id FROM ... ) t;
This results in two MapReduce jobs instead of one, but the performance gain is substantial.
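Putting both pieces together, a full Hive session might look like the sketch below. The table and column names come from the question; the reducer count of 50 is just an example value:

```sql
-- Allow the inner DISTINCT stage to fan out across 50 reducers
set mapred.reduce.tasks=50;

-- The inner query spreads the DISTINCT work over the reducers;
-- the outer COUNT(*) then runs as a cheap second job.
SELECT COUNT(*) AS total
FROM (SELECT DISTINCT id FROM mytable) t;
```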
The number of reducers also depends on the size of the input. By default Hive allots 1 GB (1000000000 bytes) of input per reducer. You can change that by setting the property hive.exec.reducers.bytes.per.reducer:
- either in hive-site.xml:
<property> <name>hive.exec.reducers.bytes.per.reducer</name> <value>1000000</value> </property>
- or using set:
$ hive -e "set hive.exec.reducers.bytes.per.reducer=1000000"
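To see how this property translates into a reducer count, here is a sketch of the estimate Hive makes: roughly ceil(input_bytes / bytes_per_reducer), capped by hive.exec.reducers.max. The input size below is a hypothetical figure, assuming the question's 1408 mappers each read a 128 MB split:

```shell
# Hypothetical input size: 1408 splits x 128 MiB each (~176 GiB)
input_bytes=$((1408 * 128 * 1024 * 1024))
# Default hive.exec.reducers.bytes.per.reducer: 1 GB
bytes_per_reducer=1000000000
# Ceiling division, as Hive does when estimating reducer count
reducers=$(( (input_bytes + bytes_per_reducer - 1) / bytes_per_reducer ))
echo "$reducers"   # → 189
```

So with the default setting, a job this size would already be offered far more than one reducer; it is the COUNT(DISTINCT ...) shape of the query, not the input size, that forces it down to one.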
You can cap the number of reduce tasks that run concurrently on each node in the conf/mapred-site.xml
config file. See here: http://hadoop.apache.org/common/docs/r0.20.0/cluster_setup.html.
In particular, you need to set this property:
mapred.tasktracker.reduce.tasks.maximum
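For example, the entry in conf/mapred-site.xml would look like this (the value 4 is only illustrative; tune it to the cores available per node):

```xml
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>4</value>
  <description>Maximum number of reduce tasks a TaskTracker runs simultaneously.</description>
</property>
```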
The number of mappers depends entirely on the number of input splits. A split is nothing but a logical division of the input data. Example: my file size is 150 MB and the HDFS default block size is 128 MB. The file occupies two blocks, giving two splits, so two mappers will be assigned to the job.
Important note: suppose I specify a split size of 50 MB instead; then 3 mappers will start, because the count depends entirely on the number of splits.
Important note: if you expect 10 TB of input data and have a block size of 128 MB, you'll end up with about 82,000 maps, unless Configuration.set(MRJobConfig.NUM_MAPS, int) (which only provides a hint to the framework) is used to set it even higher.
Note: if you haven't specified a split size, the default HDFS block size is used as the split size.
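The split arithmetic above can be sketched as ceiling division over the file size (sizes in MB for simplicity, using the numbers from the example):

```shell
file_mb=150     # file size from the example
block_mb=128    # default HDFS block size
split_mb=50     # explicitly specified split size

# Ceiling division: splits (and therefore mappers) with the default block size
splits_default=$(( (file_mb + block_mb - 1) / block_mb ))
# Splits with the explicit 50 MB split size
splits_custom=$(( (file_mb + split_mb - 1) / split_mb ))
echo "$splits_default $splits_custom"   # → 2 3
```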
Reducer has 3 primary phases: shuffle, sort and reduce.
Commands:
1] Set map tasks: -D mapred.map.tasks=4
2] Set reduce tasks: -D mapred.reduce.tasks=2