How to skip CSV header in Hive External Table?
As of Hive v0.13.0, you can use skip.header.line.count
table property:
create external table testtable (name string, message string)
row format delimited
fields terminated by '\t'
lines terminated by '\n'
location '/testtable'
TBLPROPERTIES ("skip.header.line.count"="1");
Use ALTER TABLE
for an existing table:
ALTER TABLE tablename
SET TBLPROPERTIES ("skip.header.line.count"="1");
Please note that while it works it comes with its own issues. When there is more than one output file generated i.e. reducers are greater than 1, it skips the first record for each and every file which might not necessarily be the desired behaviour.
While you have your answer from Daniel, here are some customizations possible using OpenCSVSerde
:
CREATE EXTERNAL TABLE `mydb`.`mytable`(
`product_name` string,
`brand_id` string,
`brand` string,
`color` string,
`description` string,
`sale_price` string)
PARTITIONED BY (
`seller_id` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
'separatorChar' = '\t',
'quoteChar' = '"',
'escapeChar' = '\\')
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
'hdfs://namenode.com:port/data/mydb/mytable'
TBLPROPERTIES (
'serialization.null.format' = '',
'skip.header.line.count' = '1')
With this, you have total control over the separator, quote character, escape character, null handling and header handling.
Look here and here.
Just append below property in your query and the first header or line int the record will not load or it will be skipped.
Try this
tblproperties ("skip.header.line.count"="1");
skip.header.line.count
will skip the header line.
However, if you have some external tool accessing accessing the table, it will still see that actual data without skipping those lines
create external table table_name(
Year int,
Month int,
column_name data_type )
row format delimited fields terminated by ','
location '/user/user_name/example_data' TBLPROPERTIES('serialization.null.format'='', 'skip.header.line.count'='1');