Using awk to remove the Byte-order mark

Solution 1:

Using GNU sed (on Linux or Cygwin):

# Removing BOM from all text files in current directory:
sed -i '1 s/^\xef\xbb\xbf//' *.txt

On FreeBSD:

sed -i .bak '1 s/^\xef\xbb\xbf//' *.txt

Advantage of using GNU or FreeBSD sed: the -i parameter means "in place", and will update files without the need for redirections or weird tricks.

On Mac:

This awk solution in another answer works, but the sed command above does not work. At least on Mac (Sierra) sed documentation does not mention supporting hexadecimal escaping ala \xef.

A similar trick can be achieved with any program by piping to the sponge tool from moreutils:

awk '…' INFILE | sponge INFILE

Solution 2:

Try this:

awk 'NR==1{sub(/^\xef\xbb\xbf/,"")}{print}' INFILE > OUTFILE

On the first record (line), remove the BOM characters. Print every record.

Or slightly shorter, using the knowledge that the default action in awk is to print the record:

awk 'NR==1{sub(/^\xef\xbb\xbf/,"")}1' INFILE > OUTFILE

1 is the shortest condition that always evaluates to true, so each record is printed.

Enjoy!

-- ADDENDUM --

Unicode Byte Order Mark (BOM) FAQ includes the following table listing the exact BOM bytes for each encoding:

Bytes         |  Encoding Form
--------------------------------------
00 00 FE FF   |  UTF-32, big-endian
FF FE 00 00   |  UTF-32, little-endian
FE FF         |  UTF-16, big-endian
FF FE         |  UTF-16, little-endian
EF BB BF      |  UTF-8

Thus, you can see how \xef\xbb\xbf corresponds to EF BB BF UTF-8 BOM bytes from the above table.

Solution 3:

Not awk, but simpler:

tail -c +4 UTF8 > UTF8.nobom

To check for BOM:

hd -n 3 UTF8

If BOM is present you'll see: 00000000 ef bb bf ...

Solution 4:

In addition to converting CRLF line endings to LF, dos2unix also removes BOMs:

dos2unix *.txt

dos2unix also converts UTF-16 files with a BOM (but not UTF-16 files without a BOM) to UTF-8 without a BOM:

$ printf '\ufeffä\n'|iconv -f utf-8 -t utf-16be>bom-utf16be
$ printf '\ufeffä\n'|iconv -f utf-8 -t utf-16le>bom-utf16le
$ printf '\ufeffä\n'>bom-utf8
$ printf 'ä\n'|iconv -f utf-8 -t utf-16be>utf16be
$ printf 'ä\n'|iconv -f utf-8 -t utf-16le>utf16le
$ printf 'ä\n'>utf8
$ for f in *;do printf '%11s %s\n' $f $(xxd -p $f);done
bom-utf16be feff00e4000a
bom-utf16le fffee4000a00
   bom-utf8 efbbbfc3a40a
    utf16be 00e4000a
    utf16le e4000a00
       utf8 c3a40a
$ dos2unix -q *
$ for f in *;do printf '%11s %s\n' $f $(xxd -p $f);done
bom-utf16be c3a40a
bom-utf16le c3a40a
   bom-utf8 c3a40a
    utf16be 00e4000a
    utf16le e4000a00
       utf8 c3a40a

Using awk to remove the Byte-order mark

Solution 1:

Solution 2:

Solution 3:

Solution 4:

Related

Recent Posts