Why is the final reduce step extremely slow in this MapReduce? (HiveQL, HDFS MapReduce)

Solution 1:

If the final reducer is a join, this looks like skew in the join key. First, check two things:

Check that the join key b.f1 has no duplicates:

select b.f1, count(*) cnt from B b 
 group by b.f1 
having count(*)>1 order by cnt desc;

Check the distribution of a.f1:

select a.f1, count(*) cnt from A a
 group by a.f1  
order by cnt desc
limit 10;

This query shows the skewed keys: they appear at the top of the result with disproportionately large counts.
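To judge how severe the skew is, it can help to see each top key's share of all rows rather than only the raw count. A minimal sketch, assuming your Hive version supports windowing functions; `pct_of_rows` is just an illustrative alias, and A and f1 are the names used above:

```sql
-- Share of all rows in A held by each of the 10 most frequent keys.
-- A NULL key forms its own group here, so NULL-heavy data shows up too.
select t.f1,
       t.cnt,
       round(100.0 * t.cnt / sum(t.cnt) over (), 2) as pct_of_rows
  from (select a.f1, count(*) cnt from A a group by a.f1) t
 order by t.cnt desc
 limit 10;
```

If one key holds a large share of the rows, the reducer that receives that key is likely the slow one.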

If there is skew (too many rows with the same key value), join the skewed keys separately and combine the results with UNION ALL:

SELECT a.f1, f2, ..., fn
  FROM ( select * from A where f1 = skewed_value) as a --skewed
  LEFT JOIN B as b
  ON a.f1 = b.f1
WHERE {PARTITION_FILTER}
UNION ALL
SELECT a.f1, f2, ..., fn
  FROM ( select * from A where f1 != skewed_value or f1 is null) as a --all other keys; NULLs are kept explicitly because != filters them out
  LEFT JOIN B as b
  ON a.f1 = b.f1
WHERE {PARTITION_FILTER}

Finally, if there are no issues with skew or duplicates, try increasing reducer parallelism. First, get the current bytes-per-reducer setting:

set hive.exec.reducers.bytes.per.reducer;

Typically this returns a value of about 1 GB. Try a smaller value, for example half of the current one, set it before your query, then check how many reducers start and how the query performs. The change is a success if more reducers start and performance improves. For example, to set 64 MB (67108864 bytes) per reducer:

set hive.exec.reducers.bytes.per.reducer=67108864;

The fewer bytes per reducer, the more reducers are started, which increases parallelism.
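As a rough illustration of the effect, Hive estimates the reducer count as approximately the input size divided by the bytes-per-reducer value, capped by hive.exec.reducers.max. The sketch below assumes about 10 GB of input, which is a made-up figure for the arithmetic only, and assumes the reducer count has not been fixed explicitly through mapreduce.job.reduces:

```sql
-- Rough estimate (assumption: about 10 GB of input):
--   1 GB per reducer  -> about 10 reducers
--   64 MB per reducer -> about 160 reducers

-- Print the cap on the estimated reducer count.
set hive.exec.reducers.max;

-- Print the explicit reducer count; -1 means Hive estimates it.
set mapreduce.job.reduces;

set hive.exec.reducers.bytes.per.reducer=67108864;
-- Run your query here, then compare the reducer count reported in the job log.
```

If the reducer count does not change after lowering the value, one of these two settings is probably limiting it.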

UPDATE: Try enabling map join. Your second table is small enough to fit in memory, so the map join will run without reducers at all and skew on the reducers will not be a problem.

How to enable mapjoin: https://stackoverflow.com/a/49154414/2700344
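For reference, a minimal sketch of the settings usually involved in automatic map-join conversion. The threshold value is illustrative only and should be tuned to the actual size of B and the memory available to your mappers:

```sql
-- Let Hive convert the join to a map join automatically.
set hive.auto.convert.join=true;

-- Size threshold in bytes below which a table counts as "small".
-- 25000000 is an illustrative value, not a recommendation.
set hive.mapjoin.smalltable.filesize=25000000;
```

After setting these, the query plan from EXPLAIN should show a map join operator instead of a reduce-side join if the conversion took effect.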