Read a zipped file as a pandas DataFrame
Solution 1:
If you want to read a zipped or a tar.gz file into pandas dataframe, the read_csv
methods includes this particular implementation.
df = pd.read_csv('filename.zip')
Or the long form:
df = pd.read_csv('filename.zip', compression='zip', header=0, sep=',', quotechar='"')
Description of the compression argument from the docs:
compression : {‘infer’, ‘gzip’, ‘bz2’, ‘zip’, ‘xz’, None}, default ‘infer’ For on-the-fly decompression of on-disk data. If ‘infer’ and filepath_or_buffer is path-like, then detect compression from the following extensions: ‘.gz’, ‘.bz2’, ‘.zip’, or ‘.xz’ (otherwise no decompression). If using ‘zip’, the ZIP file must contain only one data file to be read in. Set to None for no decompression.
New in version 0.18.1: support for ‘zip’ and ‘xz’ compression.
Solution 2:
I think you want to open
the ZipFile, which returns a file-like object, rather than read
:
In [11]: crime2013 = pd.read_csv(z.open('crime_incidents_2013_CSV.csv'))
In [12]: crime2013
Out[12]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 24567 entries, 0 to 24566
Data columns (total 15 columns):
CCN 24567 non-null values
REPORTDATETIME 24567 non-null values
SHIFT 24567 non-null values
OFFENSE 24567 non-null values
METHOD 24567 non-null values
LASTMODIFIEDDATE 24567 non-null values
BLOCKSITEADDRESS 24567 non-null values
BLOCKXCOORD 24567 non-null values
BLOCKYCOORD 24567 non-null values
WARD 24563 non-null values
ANC 24567 non-null values
DISTRICT 24567 non-null values
PSA 24567 non-null values
NEIGHBORHOODCLUSTER 24263 non-null values
BUSINESSIMPROVEMENTDISTRICT 3613 non-null values
dtypes: float64(4), int64(1), object(10)
Solution 3:
It seems you don't even have to specify the compression any more. The following snippet loads the data from filename.zip into df.
import pandas as pd
df = pd.read_csv('filename.zip')
(Of course you will need to specify separator, header, etc. if they are different from the defaults.)
Solution 4:
For "zip" files, you can use import zipfile
and your code will be working simply with these lines:
import zipfile
import pandas as pd
with zipfile.ZipFile("Crime_Incidents_in_2013.zip") as z:
with z.open("Crime_Incidents_in_2013.csv") as f:
train = pd.read_csv(f, header=0, delimiter="\t")
print(train.head()) # print the first 5 rows
And the result will be:
X,Y,CCN,REPORT_DAT,SHIFT,METHOD,OFFENSE,BLOCK,XBLOCK,YBLOCK,WARD,ANC,DISTRICT,PSA,NEIGHBORHOOD_CLUSTER,BLOCK_GROUP,CENSUS_TRACT,VOTING_PRECINCT,XCOORD,YCOORD,LATITUDE,LONGITUDE,BID,START_DATE,END_DATE,OBJECTID
0 -77.054968548763071,38.899775938598317,0925135...
1 -76.967309569035052,38.872119553647011,1003352...
2 -76.996184958456539,38.927921847721443,1101010...
3 -76.943077541353617,38.883686046653935,1104551...
4 -76.939209158039446,38.892278093281632,1125028...