How to detect comma vs semicolon separation in a file before reading it? [duplicate]
I have to read in a lot of CSV files automatically. Some have a comma as a delimiter, then I use the command read.csv()
.
Some have a semicolon as a delimiter, then I use read.csv2()
.
I want to write a piece of code that recognizes if the CSV file has a comma or a semicolon as a a delimiter (before I read it) so that I don´t have to change the code every time.
My approach would be something like this:
try to read.csv("xyz")
if error
read.csv2("xyz")
Is something like that possible? Has somebody done this before? How can I check if there was an error without actually seeing it?
Solution 1:
Here are a few approaches assuming that the only difference among the format of the files is whether the separator is semicolon and the decimal is a comma or the separator is a comma and the decimal is a point.
1) fread As mentioned in the comments fread
in data.table package will automatically detect the separator for common separators and then read the file in using the separator it detected. This can also handle certain other changes in format such as automatically detecting whether the file has a header.
2) grepl Look at the first line and see if it has a comma or semicolon and then re-read the file:
L <- readLines("myfile", n = 1)
if (grepl(";", L)) read.csv2("myfile") else read.csv("myfile")
3) count.fields We can assume semicolon and then count the fields in the first line. If there is one field then it is comma separated and if not then it is semicolon separated.
L <- readLines("myfile", n = 1)
numfields <- count.fields(textConnection(L), sep = ";")
if (numfields == 1) read.csv("myfile") else read.csv2("myfile")
Update Added (3) and made improvements to all three.
Solution 2:
A word of caution. read.csv2() is designed to handle commas as decimal point and semicolons as separators (default values). If by any chance, your csv files have semicolons as separators AND points as decimal point, you may get problems because of dec = "," setting. If this is the case and you indeed have separator as the ONLY difference between the files, it is better to change the "sep" option directly using read.table()