Remove every line breaks following carriage return (^M) and join the lines
edit: Now that I got answers, I marked one by @KamilMaciorowski that fits better to the title as an answer, but this answer by @oliv actually were better suited to my actual need to my primary purpose. (To process csv file with breaks consistently on awk.)
So in the case if you were looking for awking in the similar circumstance, I recommend checking that first!
Please help me preparing a few thousands of csv file ready for awk
to process! Some of the field has line breaks inside the field and that's causing awk
to process them as a multiple records.
However those problematic line breaks only happens where ^M is inserted, so I just need to remove ^M and line-break altogether from all of them.
*These ^M
s are indeed line break character, not literal caret & letter M string. This file is generated for .net to parse and process, but I haven't worked on developing apps on neither file producing/reading sides, so I don't really know how it's successfully parsed. It's exclusively used for fields in certain columns with multiple-lined strings (comments).
So how do you make this (csv with 1 header and 2 records. Some field has line breaks in it preceeded by ^M):
"header_1", "header_2", "header_3"
"1-1", "1-2", "1-3"
"2-1", "2-2_a^M
2-2_b^M
2-2_c", "2-3"
like this? (csv with 1 header and 2 records without line breaks within each of them.):
"header_1", "header_2", "header_3"
"1-1", "1-2", "1-3"
"2-1", "2-2_a2-2_b2-2_c", "2-3"
I tried removing them with sed
but I heard there's no way to process, and I didn't quite get the reason why.
for file in *.csv; do
sed -e "s/^M//" $file > sedded/$file;
done
Anyhow, I get this:
"header_1", "header_2", "header_3"
"1-1", "1-2", "1-3"
"2-1", "2-2_a
2-2_b
2-2_c", "2-3"
I tried to go for something like "s/^M\n/"
, and it doesn't work as I suspected. Should I use completely different tool like vim
? As long as it works for thousands of files at once (each containing ~500 lines, and I don't really care the time it takes to process) I'm fine with any sort of resolution. Just thought sed
was the way. (I'm ok to use DOS command/powershell if it's easier or more straight forward!)
If these ^M
-s are indeed line break characters, not literal caret & letter M strings, then they are what we denote \r
, CR
or 0x0d
(compare this answer of mine, the beginning of it).
Your command
sed -e "s/^M//"
doesn't remove \r
; it doesn't even remove literal ^M
. The command means "take a line, search for a letter M
that is at the very beginning of the line (^
, see this), replace it with nothing.
Note sed
understands \r
. Still sed -e 's/\r//'
is not exactly what you need. It removes \r
but you need to remove the following \n
as well. You may want to try sed -e 's/\r\n//'
, this will also fail. The problem is sed
is a text tool and it treats \n
as a separator. Excerpt from info sed
(emphasis mine):
sed
operates by performing the following cycle on each lines of input: first,sed
reads one line from the input stream, removes any trailing newline, and places it in the pattern space. Then commands are executed; […].
This means normally \n
doesn't belong to any string processed with s/…
(or another sed
command). For this reason concatenating few lines is not easy. Still it can be done. This is the command you need:
sed -e ': start; /\r$/{ s/\r$//; N; s/\n// }; /\r$/b start'
Explanation:
-
: start
is a label. - If the line contains
\r
(i.e.^M
,0x0d
character) at the very end ($
), execute the{}
block which is:- replace
\r
at the very end with nothing, - append an additional line from the input (
N
), - replace
\n
that separates the additional line from the previous data.
- replace
- If the outcome contains
\r
at the very end (meaning the additional line brought it, so we need to add yet another line), jump tostart
.
Assuming there are 3 fields in each row, and there isn't any double quote inside any value, your could use this GNU awk script:
awk -v FPAT='"[^"]*"' '{while(NF!=3){p=$0;getline;gsub("^",p)}; p=""}1' file
FPAT
defines how a field should look like, i.e anything surrounding with double quotes.
The awk statement builds a record by getting lines from the file until there are 3 fields.