A single map task takes a long time and fails in Hive MapReduce

I am running a simple query of the following form:

INSERT OVERWRITE table TABLE2
PARTITION(COLUMN)
SELECT *
FROM TABLE1

There is nothing wrong with the query syntax.

TABLE2 is empty, and the total size of TABLE1 in HDFS is 2 GB (stored as Parquet with Snappy compression).

When I run the query in Hive, I see that 17 map tasks and 0 reduce tasks are launched.

Most of the map tasks complete within a minute, but one map task takes a very long time, as if all the data in the table were going to that one task.

The whole query eventually fails with a container physical memory limit error.

Why is this happening, or why might it happen?


Solution 1:

This may happen because one partition is much bigger than the others.

Try to trigger a reduce stage by adding DISTRIBUTE BY:
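One way to check for this is to count the rows per partition value in the source table. This is only a sketch: COLUMN is the placeholder partition column from the question, and the LIMIT of 20 is an arbitrary choice.

```sql
-- Rows per future partition in TABLE1, largest first.
-- One value with far more rows than the rest points to skew.
SELECT COLUMN, COUNT(*) AS row_cnt
FROM TABLE1
GROUP BY COLUMN
ORDER BY row_cnt DESC
LIMIT 20;
```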

INSERT OVERWRITE table TABLE2
PARTITION(COLUMN)
SELECT *
FROM TABLE1
DISTRIBUTE BY COLUMN

Additionally, you can add another evenly distributed, low-cardinality column to the DISTRIBUTE BY to increase parallelism:

DISTRIBUTE BY COLUMN, COLUMN2

If COLUMN2 has high cardinality, this will produce too many files in each partition. If its values are not evenly distributed (skewed), the reducers will be skewed as well. So it is important to use a low-cardinality, evenly distributed column, or a deterministic function with the same properties, such as substr().
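As a rough illustration of the deterministic-function variant, the sketch below buckets COLUMN2 with Hive's built-in hash() and pmod() functions. The bucket count of 4 is an assumed value, not a recommendation; it caps the number of reducers (and therefore files) per partition at 4, and it only helps if the hash of COLUMN2 spreads rows evenly.

```sql
INSERT OVERWRITE TABLE TABLE2
PARTITION(COLUMN)
SELECT *
FROM TABLE1
-- pmod(hash(COLUMN2), 4) is deterministic and always in 0..3,
-- so each partition's rows are spread over at most 4 reducers.
DISTRIBUTE BY COLUMN, pmod(hash(COLUMN2), 4);
```

Being deterministic, the expression sends the same row to the same reducer if a task is retried.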

Alternatively, try increasing mapper parallelism and check whether it helps: https://stackoverflow.com/a/48487306/2700344
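Mapper parallelism is usually raised by lowering the maximum input split size, so that the same 2 GB is divided among more map tasks. The settings below are a sketch with illustrative values (32 MB and 16 MB are assumptions to be tuned); which pair applies depends on whether the query runs on MapReduce or Tez.

```sql
-- MapReduce engine: smaller max split size gives more mappers.
SET mapreduce.input.fileinputformat.split.minsize=16777216;  -- 16 MB (assumed value)
SET mapreduce.input.fileinputformat.split.maxsize=33554432;  -- 32 MB (assumed value)

-- Tez engine: the equivalent split grouping bounds.
SET tez.grouping.min-size=16777216;  -- 16 MB (assumed value)
SET tez.grouping.max-size=33554432;  -- 32 MB (assumed value)
```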