How to make the 'cut' command treat the same sequential delimiters as one?
I'm trying to extract a certain (the fourth) field from a column-based, space-aligned text stream. I'm trying to use the cut command in the following manner:
cat text.txt | cut -d " " -f 4
Unfortunately, cut doesn't treat several spaces as one delimiter. I could have piped through awk
awk '{ print $4 }'
or sed
sed -E "s/[[:space:]]+/ /g"
to collapse the spaces, but I'd like to know if there is any way to make cut handle several delimiters natively?
Solution 1:
Try:
tr -s ' ' <text.txt | cut -d ' ' -f4
From the tr man page:
-s, --squeeze-repeats replace each input sequence of a repeated character that is listed in SET1 with a single occurrence of that character
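To see the difference, here is a minimal sketch on a made-up two-line sample. The sample text and the function names (sample, cut_only, squeeze_then_cut) are only for illustration and are not from the question.

```shell
# Sample input with runs of spaces between columns (made up for illustration).
sample() {
    printf '%s\n' 'this   is line     1 more text' \
                  'this      is line    2     more text'
}

# cut alone: every single space is a delimiter, so a run of spaces
# produces empty fields and field 4 is not the fourth word.
cut_only() { sample | cut -d ' ' -f 4; }

# tr -s squeezes each run of spaces into one space before cut sees it.
squeeze_then_cut() { sample | tr -s ' ' | cut -d ' ' -f 4; }

cut_only
squeeze_then_cut
```

On this sample, cut_only prints "is" and then an empty line, while squeeze_then_cut prints 1 and 2.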
Solution 2:
As you comment in your question, awk is really the way to go. Using cut is also possible when combined with tr -s to squeeze the spaces, as kev's answer shows.
Let me, however, go through all the possible combinations for future readers. Explanations are in the Tests section.
tr | cut
tr -s ' ' < file | cut -d' ' -f4
awk
awk '{print $4}' file
bash
while read -r _ _ _ myfield _
do
echo "fourth field: $myfield"
done < file
sed
sed -r 's/^([^ ]*[ ]*){3}([^ ]*).*/\2/' file
Tests
Given this file, let's test the commands:
$ cat a
this is line 1 more text
this is line 2 more text
this is line 3 more text
this is line 4 more text
tr | cut
$ cut -d' ' -f4 a
is
# it does not show what we want!
$ tr -s ' ' < a | cut -d' ' -f4
1
2 # this makes it!
3
4
$
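One caveat with tr | cut: tr -s leaves a single leading space on indented lines, so cut sees an empty first field and every index shifts by one. A possible workaround, sketched here with a hypothetical helper name squeeze_cut, is to strip leading spaces first. It only handles spaces, not tabs.

```shell
# Print field N (first argument) of stdin, treating runs of spaces as one
# delimiter and ignoring leading spaces. squeeze_cut is a made-up name.
squeeze_cut() {
    sed 's/^ *//' | tr -s ' ' | cut -d ' ' -f "$1"
}

# Made-up indented sample line; the fourth word is "here".
printf '   right-aligned   column  data here\n' | squeeze_cut 4
```

Without the sed step, the same input would give "data" for field 4, because the leading space counts as a delimiter after an empty first field.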
awk
$ awk '{print $4}' a
1
2
3
4
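awk works here because its default field splitting treats any run of spaces and tabs as one separator and ignores leading blanks. If the field number should not be hard-coded, something like the following sketch could be used (awk_field is a hypothetical wrapper name, not part of awk):

```shell
# Print field N (first argument) of each line on stdin.
# With the default FS, awk splits on runs of blanks and skips leading ones.
awk_field() {
    awk -v n="$1" '{ print $n }'
}

# Made-up sample with leading spaces, a tab and a run of spaces.
printf '  a\tb   c d\n' | awk_field 4
```

Lines with fewer than N fields produce an empty line rather than an error.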
bash
This reads the fields sequentially. Using _ as the variable name marks a throwaway ("junk") variable, so those fields are ignored. This way, the 4th field of each line ends up in the named variable ($myfield above, $a in the run below), no matter how many spaces separate the fields.
$ while read -r _ _ _ a _; do echo "4th field: $a"; done < a
4th field: 1
4th field: 2
4th field: 3
4th field: 4
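The read approach needs one placeholder per skipped field, so it gets awkward for large field numbers. As a rough sketch, assuming plain POSIX sh, the shell's own word splitting can pick an arbitrary field instead (nth_field is a hypothetical name):

```shell
# Print field N (first argument) of each line on stdin, using the shell's
# word splitting, which treats runs of blanks as a single separator.
nth_field() {
    n=$1
    while IFS= read -r line || [ -n "$line" ]; do
        set -f          # disable globbing so a word like * stays literal
        set -- $line    # unquoted on purpose: split the line into words
        set +f
        if [ "$#" -ge "$n" ]; then
            shift $((n - 1))
            printf '%s\n' "$1"
        else
            printf '\n'  # fewer than N fields: print an empty line
        fi
    done
}

printf 'this   is line     1 more text\n' | nth_field 4
```

This is slower than awk on large files, since the loop runs in the shell itself.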
sed
This matches three groups of non-space characters followed by spaces with ([^ ]*[ ]*){3}. Then it captures everything up to the next space as the 4th field, which is finally printed with \2.
$ sed -r 's/^([^ ]*[ ]*){3}([^ ]*).*/\2/' a
1
2
3
4
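The expression above only knows about spaces and prints lines with fewer than four fields unchanged. A variant, offered as a sketch rather than a drop-in replacement, uses the [[:space:]] class to cover tabs and leading blanks, and -n with the p flag to print only lines that actually have a fourth field. It uses -E, which both GNU and BSD sed accept; fourth_field is a made-up name.

```shell
# Print the 4th whitespace-separated field of each line on stdin.
# Lines with fewer than four fields are skipped because of -n ... p.
fourth_field() {
    sed -n -E 's/^[[:space:]]*([^[:space:]]+[[:space:]]+){3}([^[:space:]]+).*/\2/p'
}

# Made-up sample: the first line mixes tabs and spaces, the second is too short.
printf '  a\tb  c \t d e\nonly three fields\n' | fourth_field
```

Only the first sample line produces output here, namely its fourth field, d.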
Solution 3:
shortest/friendliest solution
After becoming frustrated with the many limitations of cut, I wrote my own replacement, which I called cuts, for "cut on steroids".
cuts provides what is likely the most minimalist solution to this and many other related cut/paste problems.
One example, out of many, addressing this particular question:
$ cat text.txt
0 1 2 3
0 1 2 3 4
$ cuts 2 text.txt
2
2
cuts supports:
- auto-detection of most common field-delimiters in files (+ ability to override defaults)
- multi-char, mixed-char, and regex matched delimiters
- extracting columns from multiple files with mixed delimiters
- offsets from end of line (using negative numbers) in addition to start of line
- automatic side-by-side pasting of columns (no need to invoke paste separately)
- support for field reordering
- a config file where users can change their personal preferences
- great emphasis on user friendliness & minimalist required typing
and much more, none of which is provided by standard cut.
See also: https://stackoverflow.com/a/24543231/1296044
Source and documentation (free software): http://arielf.github.io/cuts/
Solution 4:
This Perl one-liner shows how closely Perl is related to awk:
perl -lane 'print $F[3]' text.txt
However, the @F autosplit array starts at index $F[0], while awk fields start with $1.