Ignore the first space in CSV
I have a CSV file like this:
Time Latitude Longitude
2021-09-12 23:13 44.63 -63.56
2021-09-14 23:13 43.78 -62
2021-09-16 23:14 44.83 -54.6
2021-09-12 23:13
is under Time
column.
I would like to open it using pandas. But there is a problem with the first column. It contains a space. If I open it using:
import pandas as pd
points = pd.read_csv("test.csv", delim_whitespace=True)
I get
Time | Latitude | Longitude | |
---|---|---|---|
2021-09-12 | 23:13 | 44.630 | -63.560 |
2021-09-14 | 23:13 | 43.780 | -62.000 |
2021-09-16 | 23:14 | 44.830 | -54.600 |
But I would like to skip the space in the first column in CSV (2021-09-12 23:13
should be under Time
column) like:
Time | Latitude | Longitude | |
---|---|---|---|
0 | 2021-09-12 23:13 | 44.630 | -63.560 |
1 | 2021-09-14 23:13 | 43.780 | -62.000 |
2 | 2021-09-16 23:14 | 44.830 | -54.600 |
How can I ignore the first space when using pd.read_csv
?
Please do not stick to this csv file. This is a general question to skip (not to consider as a delimiter) the first space(s) in the first column. Because everyone knows that the first space is part of the time value, not a delimiter.
Solution 1:
Try to fix the column and the index after load the file:
import pandas as pd
points = pd.read_csv('test.csv', delim_whitespace=True)
points = points.assign(Time=pd.to_datetime(df.index + ' ' + df['Time'])) \
.reset_index(drop=True)
Output:
>>> points
Time Latitude Longitude
0 2021-09-12 23:13:00 44.63 -63.56
1 2021-09-14 23:13:00 43.78 -62.00
2 2021-09-16 23:14:00 44.83 -54.60
Solution 2:
Your data is in 2 different kind of formats:
- your headerrow has a single space between
'Latitude'
and'Longitude'
. - the "data" rows look to be separated by multiple spaces.
You can either edit your data and add a second space between lat & long or trick it by supplying the column headers separately:
Create file:
with open("test.csv","w") as f:
f.write("""Time Latitude Longitude
2021-09-12 23:13 44.63 -63.56
2021-09-14 23:13 43.78 -62
2021-09-16 23:14 44.83 -54.6""")
Parse file:
import pandas as pd
# ignore files headers, supply own, use multiple spaces as seperator
df = pd.read_csv("test.csv", delimiter = " ",
header=0, names = ["Time","Latitude","Longitude"])
print (df)
Output:
Time Latitude Longitude
0 2021-09-12 23:13 44.63 -63.56
1 2021-09-14 23:13 43.78 -62.00
2 2021-09-16 23:14 44.83 -54.60
Solution 3:
What you have shown is not a csv file. Full stop. Pandas read_csv
is indeed versatile enough to possibly find workaround allowing to process it. But it is actually a fixed width fields file and should be read with pd.read_fwf
:
pd.read_fwf(file_name, [(0,16), (16,26), (26, 40)])
directly gives:
Time Latitude Longitude
0 2021-09-12 23:13 44.63 -63.56
1 2021-09-14 23:13 43.78 -62.00
2 2021-09-16 23:14 44.83 -54.60
From your edit, you only want to tell read_csv
to consider the first white space as a non delimiter character. I know no simple way to do that. The hard way is to read the file, replace the first space in each and every line with a different character. Then you submit that changed file to read_csv
with a custom converter for the first column to change the special character back to a space:
with open('test.csv') as fdin, open('test2.csv', 'w') as fdout):
fdout.write(next(fdin) # do not process the header line
for line in fdin:
fdout.write(line.replace(' ', '_', 1)
df = pd.read_csv('test2.csv', delim_whitespace=True,
converters = {'Time': lambda x: x.replace('_', ' ')})
It gives too:
Time Latitude Longitude
0 2021-09-12 23:13 44.63 -63.56
1 2021-09-14 23:13 43.78 -62.00
2 2021-09-16 23:14 44.83 -54.60
Solution 4:
Ideally you should be parsing the first two parts as a datetime. By using a space as a delimiter, it would imply the header has three columns. The space after the date though is being seen as an extra column.
A workaround is to skip the header entirely and supply your own column names. The parse_dates
parameter can be used to tell Pandas to parse the first two columns as a single combined datetime object.
For example:
import pandas as pd
points = pd.read_csv("test.csv", delimiter=" ",
skipinitialspace=True, skiprows=1, index_col=None,
parse_dates=[[0, 1]], names=["Date", "Time", "Latitude", "Longitude"])
print(points)
Should give you the following dataframe:
Date_Time Latitude Longitude
0 2021-09-12 23:13:00 44.63 -63.56
1 2021-09-14 23:13:00 43.78 -62.00
2 2021-09-16 23:14:00 44.83 -54.60