How to read input from S3 in a Spark Streaming EC2 cluster application

Odd. Try also doing a .set on the SparkContext's hadoopConfiguration. Also try exporting the environment variables before you start the application:

export AWS_ACCESS_KEY_ID=<your access>
export AWS_SECRET_ACCESS_KEY=<your secret>

^^this is how we do it.

UPDATE: According to @tribbloid the above broke in Spark 1.3.0. Now you have to faff around for ages and ages with hdfs-site.xml, or you can do this (it works in a spark-shell):

val hadoopConf = sc.hadoopConfiguration
hadoopConf.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
hadoopConf.set("fs.s3.awsAccessKeyId", myAccessKey)
hadoopConf.set("fs.s3.awsSecretAccessKey", mySecretKey)
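
The snippet above leaves myAccessKey and mySecretKey undefined. One way to fill them in without hardcoding the keys is to read the same environment variables exported earlier. This is only a sketch: the helper name requiredEnv is made up for illustration, and it assumes the variables are exported in the shell that launches the driver. Define these before the hadoopConf.set calls:

```scala
// Hypothetical helper: look up an environment variable and fail fast
// with a readable message if it has not been exported.
def requiredEnv(name: String): String =
  sys.env.getOrElse(name, sys.error(s"$name is not set"))

// Same variable names as in the export lines earlier in this answer.
val myAccessKey = requiredEnv("AWS_ACCESS_KEY_ID")
val mySecretKey = requiredEnv("AWS_SECRET_ACCESS_KEY")
```

This keeps the credentials out of the source file; the values are read on the driver only, at the point where the Hadoop configuration is set.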

The following configuration works for me. Make sure you also set "fs.s3.impl":

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("Simple Application").setMaster("local")
val sc = new SparkContext(conf)
val hadoopConf = sc.hadoopConfiguration
// Map the s3:// scheme to the native S3 filesystem, then supply the credentials.
hadoopConf.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
hadoopConf.set("fs.s3.awsAccessKeyId", myAccessKey)
hadoopConf.set("fs.s3.awsSecretAccessKey", mySecretKey)
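
Since the question is about Spark Streaming, here is a rough sketch of how the same settings might be applied to a StreamingContext. The bucket name, prefix, object name, and 30-second batch interval are placeholders I made up, and the credentials are assumed to be available as environment variables on the driver. It also assumes the spark-streaming dependency is on the classpath and that the master is supplied by spark-submit:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object S3StreamingSketch {
  def main(args: Array[String]): Unit = {
    // No setMaster here: the master is expected to come from spark-submit.
    val conf = new SparkConf().setAppName("S3 Streaming Sketch")
    val ssc = new StreamingContext(conf, Seconds(30)) // placeholder batch interval

    // Same three settings as above, applied to the underlying SparkContext.
    val hadoopConf = ssc.sparkContext.hadoopConfiguration
    hadoopConf.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
    hadoopConf.set("fs.s3.awsAccessKeyId", sys.env("AWS_ACCESS_KEY_ID"))
    hadoopConf.set("fs.s3.awsSecretAccessKey", sys.env("AWS_SECRET_ACCESS_KEY"))

    // Placeholder location: each batch contains lines from files that
    // newly appeared under this prefix.
    val lines = ssc.textFileStream("s3://my-bucket/incoming/")
    lines.count().print() // print the number of new lines per batch

    ssc.start()
    ssc.awaitTermination()
  }
}
```

Note that textFileStream only picks up files that appear in the directory after the stream has started, so objects already sitting under the prefix will not be read.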

On AWS EMR the above suggestions did not work. Instead, I set the following properties in conf/core-site.xml to my S3 credentials:

fs.s3n.awsAccessKeyId and fs.s3n.awsSecretAccessKey

For those using EMR, use the Spark build as described at https://github.com/awslabs/emr-bootstrap-actions/tree/master/spark and just reference S3 with the s3:// URI. There is no need to set the S3 implementation or any additional configuration, because credentials are provided by IAM or the instance role.
