How to import a text file on AWS S3 into pandas without writing to disk
I have a text file saved on S3 which is a tab delimited table. I want to load it into pandas but cannot save it first because I am running on a heroku server. Here is what I have so far.
import io
import boto3
import os
import pandas as pd
os.environ["AWS_ACCESS_KEY_ID"] = "xxxxxxxx"
os.environ["AWS_SECRET_ACCESS_KEY"] = "xxxxxxxx"
s3_client = boto3.client('s3')
response = s3_client.get_object(Bucket="my_bucket",Key="filename.txt")
file = response["Body"]
pd.read_csv(file, header=14, delimiter="\t", low_memory=False)
the error is
OSError: Expected file path name or file-like object, got <class 'bytes'> type
How do I convert the response body into a format pandas will accept?
pd.read_csv(io.StringIO(file), header=14, delimiter="\t", low_memory=False)
returns
TypeError: initial_value must be str or None, not StreamingBody
pd.read_csv(io.BytesIO(file), header=14, delimiter="\t", low_memory=False)
returns
TypeError: 'StreamingBody' does not support the buffer interface
UPDATE - Using the following worked
file = response["Body"].read()
and
pd.read_csv(io.BytesIO(file), header=14, delimiter="\t", low_memory=False)
Solution 1:
pandas
uses boto
for read_csv
, so you should be able to:
import boto
data = pd.read_csv('s3://bucket....csv')
If you need boto3
because you are on python3.4+
, you can
import boto3
import io
s3 = boto3.client('s3')
obj = s3.get_object(Bucket='bucket', Key='key')
df = pd.read_csv(io.BytesIO(obj['Body'].read()))
Since version 0.20.1 pandas
uses s3fs
, see answer below.
Solution 2:
Now pandas can handle S3 URLs. You could simply do:
import pandas as pd
import s3fs
df = pd.read_csv('s3://bucket-name/file.csv')
You need to install s3fs
if you don't have it. pip install s3fs
Authentication
If your S3 bucket is private and requires authentication, you have two options:
1- Add access credentials to your ~/.aws/credentials
config file
[default]
aws_access_key_id=AKIAIOSFODNN7EXAMPLE
aws_secret_access_key=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
Or
2- Set the following environment variables with their proper values:
aws_access_key_id
aws_secret_access_key
aws_session_token
Solution 3:
This is now supported in latest pandas. See
http://pandas.pydata.org/pandas-docs/stable/io.html#reading-remote-files
eg.,
df = pd.read_csv('s3://pandas-test/tips.csv')