How do I convert UTF-8 special characters in Bash?
I am writing a script that extracts and saves JPEG attachments from emails and passes them to ImageMagick. However, I live in Germany, and special characters such as "ö", "ä", "ü" and "ß" are pretty common in email text and subjects.
I am extracting the subject with formail:
SUBJECT=$(formail -zxSubject: <"$file")
and that results in:
- =?UTF-8?Q?Meine_G=c3=bcte?=
("Meine Güte") or even worse
- =?UTF-8?B?U2Now7ZuZSBHcsO8w59lIQ==?=
("Schöne Grüße!").
I want to use part of the subject as a filename and as an ImageMagick text annotation, which obviously doesn't work with the encoded form.
How do I convert this UTF-8 text to text with special characters in bash?
Thanks in advance! Markus
Solution 1:
How do I convert this UTF-8 text to text with special characters in bash?
What you have isn't quite "UTF-8 text". You actually want plain UTF-8 text as output, as it's what Linux uses for "special characters" everywhere.
Your input, instead, is MIME (RFC 2047) encoded UTF-8. The "Q" marks Quoted-Printable mode, and "B" marks Base64 mode. Among others, Perl's Encode::MIME::Header can be used to decode both:
#!/usr/bin/env perl
use open qw(:std :utf8);
use Encode qw(decode);
while (my $line = <STDIN>) {
print decode("MIME-Header", $line);
}
One-liner (see perldoc perlrun for an explanation of the switches):
perl -CS -MEncode -ne 'print decode("MIME-Header", $_)'
It accepts a whole header line as input, with plain text and encoded words in either encoding mixed together:
$ echo "Subject: =?UTF-8?Q?Meine_G=c3=bcte?=, \
=?UTF-8?B?U2Now7ZuZSBHcsO8w59lIQ==?=" | perl ./decode.pl
Subject: Meine Güte, Schöne Grüße!
A version in Python 3:
#!/usr/bin/env python3
import email.header, sys
words = email.header.decode_header(sys.stdin.read())
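# decode_header() yields bytes for encoded input, but plain str when nothing was encoded
words = [s.decode(c or "utf-8") if isinstance(s, bytes) else s for (s, c) in words]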
print("".join(words))
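Since the question wants to reuse the decoded subject as a filename, here is a minimal Python 3 sketch of that last step. It is only an illustration: decode_subject and safe_filename are hypothetical helper names, and the rule of replacing everything except letters, digits, ".", "-" and "_" with "_" is an assumption about what counts as a safe filename, not something the question specifies.

```python
import email.header
import re


def decode_subject(raw):
    """Decode RFC 2047 encoded words in a header value to a plain str."""
    parts = []
    for chunk, charset in email.header.decode_header(raw):
        if isinstance(chunk, bytes):
            # Assume UTF-8 when no charset is given; replace undecodable bytes.
            chunk = chunk.decode(charset or "utf-8", errors="replace")
        parts.append(chunk)
    return "".join(parts)


def safe_filename(subject, max_len=40):
    """Hypothetical helper: turn runs of other characters into a single '_'."""
    name = re.sub(r"[^\w.-]+", "_", subject).strip("_.")
    return name[:max_len] or "untitled"


print(safe_filename(decode_subject("=?UTF-8?Q?Meine_G=c3=bcte?=")))
# Meine_Güte
print(safe_filename(decode_subject("=?UTF-8?B?U2Now7ZuZSBHcsO8w59lIQ==?=")))
# Schöne_Grüße
```

The umlauts are kept on purpose, because \w matches Unicode letters in Python 3. If the target filesystem or tool needs pure ASCII, the pattern would have to be tightened.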
Solution 2:
The e-mail subject is itself a header, and headers may contain only ASCII characters. This is why a UTF-8 (or any other non-ASCII charset) subject must be encoded.
This way of encoding non-ASCII characters into ASCII was first described in RFC 1342; the current specification is RFC 2047.
Basically, an encoded subject has (as you've already listed in your examples) the following format:
=?charset?encoding?encoded-text?=
Depending on the encoding value, encoded-text is decoded either as quoted-printable (Q) or as Base64 (B).
To get a human-readable form, you need to pass the encoded-text portion of the subject header value to a program that decodes it. I believe there are standalone commands for this (for example base64 -d from coreutils for the Base64 case), but I prefer to use Perl one-liners:
For quoted-printable (the Q encoding in headers also writes spaces as "_", so translate those back first):
perl -pe 'use MIME::QuotedPrint; tr/_/ /; $_=MIME::QuotedPrint::decode($_);'
and for base64:
perl -pe 'use MIME::Base64; $_=MIME::Base64::decode($_);'
Be sure to pass only the encoded-text portion and not the whole subject header value.
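To illustrate the format described above, here is a minimal Python 3 sketch that splits each encoded word into its three fields and decodes the encoded-text portion by hand. The names ENCODED_WORD, decode_word and decode_header_value are made up for this example, and it assumes well-formed input with a charset that Python knows; a ready-made header decoder is probably the safer choice for real mail.

```python
import base64
import quopri
import re

# =?charset?encoding?encoded-text?=
ENCODED_WORD = re.compile(r"=\?([^?]+)\?([QqBb])\?([^?]*)\?=")


def decode_word(match):
    """Decode one encoded word according to its encoding field."""
    charset, encoding, text = match.groups()
    if encoding.upper() == "B":
        raw = base64.b64decode(text)
    else:
        # header=True makes "_" decode to a space, as the Q encoding requires.
        raw = quopri.decodestring(text.encode("ascii"), header=True)
    return raw.decode(charset)


def decode_header_value(value):
    # Whitespace between two adjacent encoded words is not part of the text.
    value = re.sub(r"(\?=)\s+(=\?)", r"\1\2", value)
    return ENCODED_WORD.sub(decode_word, value)


print(decode_header_value("=?UTF-8?Q?Meine_G=c3=bcte?="))
# Meine Güte
print(decode_header_value("Subject: =?UTF-8?B?U2Now7ZuZSBHcsO8w59lIQ=="
                          "?="))
# Subject: Schöne Grüße!
```

Because the regular expression picks out only the encoded-text field, the surrounding "=?UTF-8?Q?" and "?=" never reach the decoder, which is the point of the warning above.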