Hadoop/Hive : Loading data from .csv on a local machine
As this is coming from a newbie...
I had Hadoop and Hive set up for me, so I can run Hive queries from my computer against data on an AWS cluster. Can I run Hive queries on .csv data stored on my computer, like I did with MS SQL Server?
How do I load .csv data into Hive, then? What does it have to do with Hadoop, and which mode should I run it in?
What settings should I care about so that, if I do something wrong, I can always go back and run queries on Amazon without compromising what was set up for me earlier?
Let me walk you through the following simple steps:
Steps:
First, create a table in Hive using the field names in your csv file. Let's say, for example, that your csv file contains three fields (id, name, salary) and you want to create a table in Hive called "Staff". Use the code below to create the table.
hive> CREATE TABLE Staff (id int, name string, salary double) row format delimited fields terminated by ',';
Second, now that your table is created in Hive, load the data from your csv file into the "Staff" table. Note that OVERWRITE replaces any data already in the table.
hive> LOAD DATA LOCAL INPATH '/home/yourcsvfile.csv' OVERWRITE INTO TABLE Staff;
Lastly, display the contents of your "Staff" table in Hive to check that the data were loaded successfully.
hive> SELECT * FROM Staff;
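To address the worry about not disturbing the existing setup, here is a minimal sketch of the same three steps run inside a separate database. The database name csv_sandbox is hypothetical, and the skip.header.line.count property assumes that your csv file starts with a header row and that your Hive version supports that property (0.13 or later); drop that line if the file has no header, otherwise the first data row is skipped.

```sql
-- Hypothetical sandbox database, so existing tables are not touched.
CREATE DATABASE IF NOT EXISTS csv_sandbox;
USE csv_sandbox;

-- Same columns as the Staff example above.
-- Assumption: the csv file has one header line to skip.
CREATE TABLE IF NOT EXISTS staff (id INT, name STRING, salary DOUBLE)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
TBLPROPERTIES ("skip.header.line.count"="1");

-- LOCAL copies the file from the local file system; the original file stays in place.
LOAD DATA LOCAL INPATH '/home/yourcsvfile.csv' OVERWRITE INTO TABLE staff;

-- Quick check on a few rows.
SELECT * FROM staff LIMIT 10;
```

If something goes wrong, dropping only the sandbox table (DROP TABLE csv_sandbox.staff;) should leave the tables that were set up earlier as they were.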
Thanks.
If you have Hive set up, you can load the local dataset directly into HDFS/S3 using the Hive LOAD command.
You will need to use the LOCAL keyword when writing your load command.
Syntax for the Hive LOAD command:
LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)]
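As a concrete instance of the syntax above, the following sketch loads a local csv file into one partition of a table. The table name staff_by_dept, the partition column dept, and the value 'sales' are all hypothetical, chosen only for illustration.

```sql
-- Hypothetical partitioned table; "dept" is an assumed partition column.
CREATE TABLE IF NOT EXISTS staff_by_dept (id INT, name STRING, salary DOUBLE)
PARTITIONED BY (dept STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- LOCAL reads from the local file system rather than HDFS.
-- Without OVERWRITE, the file is added to whatever the partition already holds.
LOAD DATA LOCAL INPATH '/home/yourcsvfile.csv'
INTO TABLE staff_by_dept PARTITION (dept='sales');

-- Read back only the partition that was just loaded.
SELECT * FROM staff_by_dept WHERE dept = 'sales';
```

Without the LOCAL keyword, the path is resolved on HDFS (or the configured default file system) and the file is moved into the table location rather than copied.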
Refer to the following link for more detailed information: https://cwiki.apache.org/confluence/display/Hive/LanguageManual%20DML#LanguageManualDML-Loadingfilesintotables