How do I convert UTF-8 special characters in Bash?

I am writing a script that extracts and saves JPEG attachments from emails and passes them to ImageMagick. However, I live in Germany, and special characters such as "ö", "ä", "ü" and "ß" are pretty common in email text and subjects.

I am extracting the subject with formail:

    SUBJECT=$(formail -zxSubject: <"$file")

and that results in:

  • =?UTF-8?Q?Meine_G=c3=bcte?=

("Meine Güte") or even worse

  • =?UTF-8?B?U2Now7ZuZSBHcsO8w59lIQ==?=

("Schöne Grüße!").

I want to use part of the subject as a filename and as an ImageMagick text annotation, which obviously doesn't work with the encoded form.

How do I convert this UTF-8 text to text with special characters in bash?

Thanks in advance! Markus


Solution 1:

How do I convert this UTF-8 text to text with special characters in bash?

What you have isn't quite "UTF-8 text". You actually want plain UTF-8 text as output, as it's what Linux uses for "special characters" everywhere.

Your input, instead, is MIME (RFC 2047) encoded UTF-8. The "Q" marks Quoted-Printable mode, and "B" marks Base64 mode. Among others, Perl's Encode::MIME::Header can be used to decode both:

#!/usr/bin/env perl
use open qw(:std :utf8);
use Encode qw(decode);

while (my $line = <STDIN>) {
        print decode("MIME-Header", $line);
}

One-liner (see perldoc perlrun for an explanation of the switches):

perl -CS -MEncode -ne 'print decode("MIME-Header", $_)'

This handles both encodings, even when they are mixed with plain text on the same line:

$ echo "Subject: =?UTF-8?Q?Meine_G=c3=bcte?=, \
                 =?UTF-8?B?U2Now7ZuZSBHcsO8w59lIQ==?=" | perl ./decode.pl
Subject: Meine Güte, Schöne Grüße!

A version in Python 3:

#!/usr/bin/env python3
import email.header, sys

words = email.header.decode_header(sys.stdin.read())
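# Parts are bytes if the header contains encoded words, but a plain str otherwise.
words = [s.decode(c or "utf-8") if isinstance(s, bytes) else s for (s, c) in words]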
print("".join(words))
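
If Python is in the pipeline anyway, the standard library's email package can also parse the whole message and hand back the subject already decoded. The following is only a sketch under the assumption that the complete raw message is available as bytes; `subject_of` is a hypothetical helper name, not something from the question.

```python
import email
from email import policy


def subject_of(raw_message: bytes) -> str:
    """Return the decoded Subject header of a raw e-mail message."""
    # With policy.default, header lookups return decoded text, so
    # RFC 2047 encoded words (Q or B) are already resolved here.
    msg = email.message_from_bytes(raw_message, policy=policy.default)
    # A message without a Subject header yields None; map that to "".
    return str(msg["Subject"] or "")


if __name__ == "__main__":
    sample = (
        b"Subject: =?UTF-8?B?U2Now7ZuZSBHcsO8w59lIQ==?=\r\n"
        b"\r\n"
        b"body\r\n"
    )
    print(subject_of(sample))
```

In a script, the bytes would typically come from reading the mail file in binary mode, e.g. `open(path, "rb").read()`.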

Solution 2:

The e-mail subject is itself a header, and headers must contain only ASCII characters. This is why a UTF-8 (or any other non-ASCII charset) subject must be encoded.

This way of encoding non-ASCII characters into ASCII was originally described in RFC 1342; the current specification is RFC 2047.

Basically, an encoded subject has (as your examples already show) the following format:

=?charset?encoding?encoded-text?=

The encoding value determines whether encoded-text is decoded as quoted-printable (Q) or as Base64 (B).

To get a human-readable form, you need to pass the encoded-text portion of the subject header value to a program that decodes it. I believe there are some standalone commands that can do that (uudecode), but I prefer to use Perl one-liners:

For quoted-printable (the Q encoding additionally uses "_" to represent a space, hence the tr):

perl -pe 'use MIME::QuotedPrint; tr/_/ /; $_=MIME::QuotedPrint::decode($_);'

and for base64:

perl -pe 'use MIME::Base64; $_=MIME::Base64::decode($_);'

Be sure to pass only the encoded-text portion and not the whole subject header value.
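
The steps described in this answer (split the encoded word into charset, encoding and encoded-text, then decode only the encoded-text) can also be sketched in Python. This is a minimal illustration, not a full RFC 2047 parser: `ENCODED_WORD` and `decode_word` are hypothetical names, and it assumes the input is exactly one encoded word with a charset Python knows.

```python
import base64
import quopri
import re

# =?charset?encoding?encoded-text?=
ENCODED_WORD = re.compile(r"=\?([^?]+)\?([QqBb])\?([^?]*)\?=")


def decode_word(word: str) -> str:
    """Decode a single encoded word such as =?UTF-8?Q?Meine_G=c3=bcte?=."""
    match = ENCODED_WORD.fullmatch(word)
    if match is None:
        raise ValueError(f"not an encoded word: {word!r}")
    charset, encoding, text = match.groups()
    payload = text.encode("ascii")  # encoded-text is ASCII by definition
    if encoding.upper() == "B":
        raw = base64.b64decode(payload)
    else:
        # header=True also turns "_" into a space, as the Q encoding requires.
        raw = quopri.decodestring(payload, header=True)
    # The bytes are only meaningful once interpreted in the declared charset.
    return raw.decode(charset)


if __name__ == "__main__":
    print(decode_word("=?UTF-8?Q?Meine_G=c3=bcte?="))
    print(decode_word("=?UTF-8?B?U2Now7ZuZSBHcsO8w59lIQ==?="))
```

For real mail, a header can contain several encoded words mixed with plain text, so a complete decoder (such as the ones in Solution 1) is the safer choice.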