Pandas read_csv dtype leading zeros

So I'm reading in a station codes csv file from NOAA which looks like this:

"USAF","WBAN","STATION NAME","CTRY","FIPS","STATE","CALL","LAT","LON","ELEV(.1M)","BEGIN","END"
"006852","99999","SENT","SW","SZ","","","+46817","+010350","+14200","",""
"007005","99999","CWOS 07005","","","","","-99999","-999999","-99999","20120127","20120127"

The first two columns contain codes for weather stations, and sometimes they have leading zeros. When pandas imports them without a specified dtype they turn into integers. It's not really that big of a deal, because I can loop through the dataframe index and replace them with something like "%06d" % i since they are always six digits, but you know... that's the lazy man's way.
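Something like this is what I mean by the lazy way (a rough sketch, assuming the USAF column comes in as plain integers):

import pandas as pd

# rough sketch of the zero-padding workaround: let pandas read the codes as
# integers, then format them back into six-character strings
df = pd.read_csv("Station Codes.csv")
df['USAF'] = df['USAF'].map(lambda i: "%06d" % i)  # e.g. 6852 -> "006852"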

The csv is obtained using this code:

import urllib

# download the station list from the NOAA FTP server and save it locally
file = urllib.urlopen(r"ftp://ftp.ncdc.noaa.gov/pub/data/inventories/ISH-HISTORY.CSV")
output = open('Station Codes.csv', 'wb')
output.write(file.read())
output.close()

which is all well and good, but when I go and try to read it using this:

import pandas as pd
import numpy as np
df = pd.io.parsers.read_csv("Station Codes.csv", dtype={'USAF': np.str, 'WBAN': np.str})

or

import pandas as pd
df = pd.io.parsers.read_csv("Station Codes.csv", dtype={'USAF': str, 'WBAN': str})

I get a nasty error message:

File "C:\Python27\lib\site-packages\pandas-0.11.0-py2.7-win32.egg\pandas\io\parsers.py", line 401, in parser
_f
    return _read(filepath_or_buffer, kwds)
  File "C:\Python27\lib\site-packages\pandas-0.11.0-py2.7-win32.egg\pandas\io\parsers.py", line 216, in _read
    return parser.read()
  File "C:\Python27\lib\site-packages\pandas-0.11.0-py2.7-win32.egg\pandas\io\parsers.py", line 633, in read
    ret = self._engine.read(nrows)
  File "C:\Python27\lib\site-packages\pandas-0.11.0-py2.7-win32.egg\pandas\io\parsers.py", line 957, in read
    data = self._reader.read(nrows)
  File "parser.pyx", line 654, in pandas._parser.TextReader.read (pandas\src\parser.c:5931)
  File "parser.pyx", line 676, in pandas._parser.TextReader._read_low_memory (pandas\src\parser.c:6148)
  File "parser.pyx", line 752, in pandas._parser.TextReader._read_rows (pandas\src\parser.c:6962)
  File "parser.pyx", line 837, in pandas._parser.TextReader._convert_column_data (pandas\src\parser.c:7898)
  File "parser.pyx", line 887, in pandas._parser.TextReader._convert_tokens (pandas\src\parser.c:8483)
  File "parser.pyx", line 953, in pandas._parser.TextReader._convert_with_dtype (pandas\src\parser.c:9535)
  File "parser.pyx", line 1283, in pandas._parser._to_fw_string (pandas\src\parser.c:14616)
TypeError: data type not understood

It's a pretty big csv (31k rows), so maybe that has something to do with it?


Solution 1:

This is an issue of pandas dtype guessing.

Pandas sees numbers and guesses you want them to be numbers.

To make pandas not doubt your intentions, you should set the dtype you want: object

pd.read_csv('filename.csv', dtype={'leading_zero_column_name': object})

That will do the trick.
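For example, with the station file from the question (a sketch, using the column names from the header in the question):

import pandas as pd

df = pd.read_csv("Station Codes.csv", dtype={'USAF': object, 'WBAN': object})
print(df['USAF'].head())  # leading zeros preserved, e.g. "006852"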

Update as it helps others:

To have all columns as str, one can do this (from the comment):

pd.read_csv('sample.csv', dtype=str)

To have most or selective columns as str, one can do this:

# list of column names which need to be strings
lst_str_cols = ['prefix', 'serial']
# use a dictionary comprehension to build a dict of dtypes
dict_dtypes = {x: str for x in lst_str_cols}
# pass the dict to dtype
pd.read_csv('sample.csv', dtype=dict_dtypes)

Solution 2:

This problem caused me all sorts of headaches when parsing a file with serial numbers. For unknown reasons 00794 and 000794 are two distinct serial numbers. I eventually came up with

converters = {'serial_number': str}
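Passed into read_csv, that looks something like this (a sketch; the file name is a placeholder). Converters run on the raw cell text for that column, so the leading zeros never get a chance to disappear:

import pandas as pd

df = pd.read_csv('serials.csv', converters={'serial_number': str})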

Solution 3:

It looks like you have to specify the length of the string if you don't want it to be an object.
For example:

dtype={'USAF': '|S6'}

I can't find the reference for this, but I seem to recall Wes discussing this very issue (perhaps in a talk). He suggested that numpy doesn't allow "proper" variable-length strings (see this question/answer), and that padding every entry in the array to the maximum length is more often than not incredibly space inefficient (even a short string takes up as much space as the longest one).

As @Wes points out, this is also a case where:

dtype={'USAF': object}

works just as well.
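A quick sketch of both variants on the question's file (note that '|S6' is numpy's fixed-width byte-string dtype, so anything longer than six characters would be truncated):

import pandas as pd

# fixed-width byte strings: every value occupies exactly six bytes
df_fixed = pd.read_csv("Station Codes.csv", dtype={'USAF': '|S6'})

# plain Python object strings: variable length, usually the simpler choice
df_obj = pd.read_csv("Station Codes.csv", dtype={'USAF': object})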