Ignore the first space in CSV

I have a CSV file like this:

Time              Latitude Longitude
2021-09-12 23:13    44.63     -63.56
2021-09-14 23:13    43.78     -62
2021-09-16 23:14    44.83     -54.6

2021-09-12 23:13 is under Time column.

I would like to open it using pandas. But there is a problem with the first column. It contains a space. If I open it using:

import pandas as pd
points = pd.read_csv("test.csv", delim_whitespace=True) 

I get

Time Latitude Longitude
2021-09-12 23:13 44.630 -63.560
2021-09-14 23:13 43.780 -62.000
2021-09-16 23:14 44.830 -54.600

But I would like to skip the space in the first column in CSV (2021-09-12 23:13 should be under Time column) like:

Time Latitude Longitude
0 2021-09-12 23:13 44.630 -63.560
1 2021-09-14 23:13 43.780 -62.000
2 2021-09-16 23:14 44.830 -54.600

How can I ignore the first space when using pd.read_csv?

Please do not stick to this csv file. This is a general question to skip (not to consider as a delimiter) the first space(s) in the first column. Because everyone knows that the first space is part of the time value, not a delimiter.


Solution 1:

Try to fix the column and the index after load the file:

import pandas as pd

points = pd.read_csv('test.csv', delim_whitespace=True)

points = points.assign(Time=pd.to_datetime(df.index + ' ' + df['Time'])) \
               .reset_index(drop=True)

Output:

>>> points
                 Time  Latitude  Longitude
0 2021-09-12 23:13:00     44.63     -63.56
1 2021-09-14 23:13:00     43.78     -62.00
2 2021-09-16 23:14:00     44.83     -54.60

Solution 2:

Your data is in 2 different kind of formats:

  • your headerrow has a single space between 'Latitude' and 'Longitude'.
  • the "data" rows look to be separated by multiple spaces.

You can either edit your data and add a second space between lat & long or trick it by supplying the column headers separately:

Create file:

with open("test.csv","w") as f:
    f.write("""Time              Latitude Longitude
2021-09-12 23:13    44.63     -63.56
2021-09-14 23:13    43.78     -62
2021-09-16 23:14    44.83     -54.6""")

Parse file:

import pandas as pd

# ignore files headers, supply own, use multiple spaces as seperator
df = pd.read_csv("test.csv", delimiter = "   ", 
                 header=0, names = ["Time","Latitude","Longitude"])

print (df)

Output:

               Time  Latitude  Longitude
0  2021-09-12 23:13     44.63     -63.56
1  2021-09-14 23:13     43.78     -62.00
2  2021-09-16 23:14     44.83     -54.60

Solution 3:

What you have shown is not a csv file. Full stop. Pandas read_csv is indeed versatile enough to possibly find workaround allowing to process it. But it is actually a fixed width fields file and should be read with pd.read_fwf:

pd.read_fwf(file_name, [(0,16), (16,26), (26, 40)])

directly gives:

               Time  Latitude  Longitude
0  2021-09-12 23:13     44.63     -63.56
1  2021-09-14 23:13     43.78     -62.00
2  2021-09-16 23:14     44.83     -54.60

From your edit, you only want to tell read_csv to consider the first white space as a non delimiter character. I know no simple way to do that. The hard way is to read the file, replace the first space in each and every line with a different character. Then you submit that changed file to read_csv with a custom converter for the first column to change the special character back to a space:

with open('test.csv') as fdin, open('test2.csv', 'w') as fdout):
    fdout.write(next(fdin)    # do not process the header line
    for line in fdin:
        fdout.write(line.replace(' ', '_', 1)

df = pd.read_csv('test2.csv', delim_whitespace=True,
                 converters = {'Time': lambda x: x.replace('_', ' ')})

It gives too:

               Time  Latitude  Longitude
0  2021-09-12 23:13     44.63     -63.56
1  2021-09-14 23:13     43.78     -62.00
2  2021-09-16 23:14     44.83     -54.60

Solution 4:

Ideally you should be parsing the first two parts as a datetime. By using a space as a delimiter, it would imply the header has three columns. The space after the date though is being seen as an extra column.

A workaround is to skip the header entirely and supply your own column names. The parse_dates parameter can be used to tell Pandas to parse the first two columns as a single combined datetime object.

For example:

import pandas as pd

points = pd.read_csv("test.csv", delimiter=" ", 
    skipinitialspace=True, skiprows=1, index_col=None, 
    parse_dates=[[0, 1]], names=["Date", "Time", "Latitude", "Longitude"])

print(points)

Should give you the following dataframe:

            Date_Time  Latitude  Longitude
0 2021-09-12 23:13:00     44.63     -63.56
1 2021-09-14 23:13:00     43.78     -62.00
2 2021-09-16 23:14:00     44.83     -54.60