How to load a local file in sc.textFile, instead of HDFS
I'm following the great Spark tutorial, and at 46m:00s I try to load the README.md, but it fails. What I'm doing is this:
$ sudo docker run -i -t -h sandbox sequenceiq/spark:1.1.0 /etc/bootstrap.sh -bash
bash-4.1# cd /usr/local/spark-1.1.0-bin-hadoop2.4
bash-4.1# ls README.md
README.md
bash-4.1# ./bin/spark-shell
scala> val f = sc.textFile("README.md")
14/12/04 12:11:14 INFO storage.MemoryStore: ensureFreeSpace(164073) called with curMem=0, maxMem=278302556
14/12/04 12:11:14 INFO storage.MemoryStore: Block broadcast_0 stored as values in memory (estimated size 160.2 KB, free 265.3 MB)
f: org.apache.spark.rdd.RDD[String] = README.md MappedRDD[1] at textFile at <console>:12
scala> val wc = f.flatMap(l => l.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://sandbox:9000/user/root/README.md
at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:285)
How can I load that README.md?
Solution 1:
Try explicitly specifying the scheme: sc.textFile("file:///path to the file/"). The error occurs when a Hadoop environment is configured.
SparkContext.textFile internally calls org.apache.hadoop.mapred.FileInputFormat.getSplits, which in turn uses org.apache.hadoop.fs.getDefaultUri if the scheme is absent. That method reads the "fs.defaultFS" parameter of the Hadoop configuration. If the HADOOP_CONF_DIR environment variable is set, this parameter is usually "hdfs://..."; otherwise it is "file://".
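For example, with the layout from the question (the README.md path is the one shown in the session above; adjust it to your install):
scala> // no scheme: the path resolves against fs.defaultFS, i.e. hdfs://sandbox:9000/user/root/README.md here
scala> sc.textFile("README.md")
scala> // explicit file:// scheme: Spark reads from the local filesystem instead
scala> val f = sc.textFile("file:///usr/local/spark-1.1.0-bin-hadoop2.4/README.md")
scala> f.count()   // textFile is lazy, so any path error only surfaces at the first action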
Solution 2:
gonbe's answer is excellent. But I still want to mention that file:/// refers to the root of the filesystem (/), not to $SPARK_HOME. Hope this saves some time for newbies like me.
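In other words, the path after file:// must be an absolute path from the root, e.g. (path taken from the question):
scala> // "file:///README.md" would look for /README.md, not $SPARK_HOME/README.md
scala> val f = sc.textFile("file:///usr/local/spark-1.1.0-bin-hadoop2.4/README.md")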
Solution 3:
While Spark supports loading files from the local filesystem, it requires that the files are available at the same path on all nodes in your cluster.
Some network filesystems, like NFS, AFS, and MapR’s NFS layer, are exposed to the user as a regular filesystem.
If your data is already in one of these systems, then you can use it as an input by just specifying a file:// path; Spark will handle it as long as the filesystem is mounted at the same path on each node. Every node needs to have the same path:
rdd = sc.textFile("file:///path/to/file")
If your file isn't already on all nodes in the cluster, you can load it locally on the driver without going through Spark and then call parallelize to distribute the contents to the workers; see the sketch after the note below.
Take care to put file:// in front and to use "/" or "\" according to your OS.
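A minimal sketch of that driver-side load (the file path is a placeholder, not a path from the question):
scala> // read the file on the driver with plain Scala I/O, then distribute the lines with parallelize
scala> val lines = scala.io.Source.fromFile("/path/to/file").getLines().toList
scala> val rdd = sc.parallelize(lines)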
Solution 4:
Attention:
Make sure that you run Spark in local mode when you load data from the local filesystem (sc.textFile("file:///path to the file/")), or you will get an error like Caused by: java.io.FileNotFoundException: File file:/data/sparkjob/config2.properties does not exist, because executors running on different workers will not find the file in their own local path.
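For example (a sketch; the placeholder path is the one used above):
$ spark-shell --master local[*]
scala> sc.master   // local[*]: driver and executors run on the same machine
scala> sc.textFile("file:///path to the file/").count()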
Solution 5:
If the file is located on your Spark master node (e.g., when using AWS EMR), first launch spark-shell in local mode.
$ spark-shell --master=local
scala> val df = spark.read.json("file:///usr/lib/spark/examples/src/main/resources/people.json")
df: org.apache.spark.sql.DataFrame = [age: bigint, name: string]
scala> df.show()
+----+-------+
| age| name|
+----+-------+
|null|Michael|
| 30| Andy|
| 19| Justin|
+----+-------+
Alternatively, you can first copy the file from the local file system to HDFS and then launch Spark in its default mode (e.g., YARN in the case of AWS EMR) to read the file directly.
$ hdfs dfs -mkdir -p /hdfs/spark/examples
$ hadoop fs -put /usr/lib/spark/examples/src/main/resources/people.json /hdfs/spark/examples
$ hadoop fs -ls /hdfs/spark/examples
Found 1 items
-rw-r--r-- 1 hadoop hadoop 73 2017-05-01 00:49 /hdfs/spark/examples/people.json
$ spark-shell
scala> val df = spark.read.json("/hdfs/spark/examples/people.json")
df: org.apache.spark.sql.DataFrame = [age: bigint, name: string]
scala> df.show()
+----+-------+
| age| name|
+----+-------+
|null|Michael|
| 30| Andy|
| 19| Justin|
+----+-------+