Why is base64 needed (aka why can't I just email a binary file)?

There is a good Wikipedia article on this.


The earliest iterations of NCP as used by ARPAnet were more like bit streams than byte streams, or involved attempts to negotiate a convenient byte size; the 8-bit byte was only standardized on much later. There were also several attempts at creating file transfer protocols that would work across different machines (mail was initially a function of FTP, primarily as the MAIL and MLFL commands, then split off into MTP, later SMTP). Those machines often had differing character encodings (ASCII vs. EBCDIC) or even different byte sizes: 8-bit bytes vs. 6-bit vs. ...

Therefore, mail transfer functions were initially defined for transferring relatively short messages in plain text; specifically, "NVT-ASCII". For example, RFC 772 says:

MAIL REPRESENTATION AND STORAGE

Mail is transferred from a storage device in the sending host to a storage device in the receiving host. It may be necessary to perform certain transformations on the mail because data storage representations in the two systems are different. For example, NVT-ASCII has different data storage representations in different systems. PDP-10's generally store NVT-ASCII as five 7-bit ASCII characters, left-justified in a 36-bit word. 360's store NVT-ASCII as four 8-bit EBCDIC codes in a 32-bit word. Multics stores NVT-ASCII as four 9-bit characters in a 36-bit word.

For the sake of simplicity, all data must be represented in MTP as NVT-ASCII. This means that characters must be converted into the standard NVT-ASCII representation when transmitting text, regardless of whether the sending and receiving hosts are dissimilar. The sender converts the data from its internal character representation to the standard 8-bit NVT-ASCII representation (see the TELNET specification). The receiver converts the data from the standard form to its own internal form. In accordance with this standard, the <CRLF> sequence should be used to denote the end of a line of text.

Even though eight bits were being transmitted over the wire, the 8th bit would often get discarded or mangled, since there was no requirement to preserve it; in fact, some protocols required the 8th bit to be set to zero, such as the initial SMTP RFC (RFC 821) quoted below. In other words, the software was not 8-bit clean.

Data Transfer

The TCP connection supports the transmission of 8-bit bytes. The SMTP data is 7-bit ASCII characters. Each character is transmitted as an 8-bit byte with the high-order bit cleared to zero.

This persisted for a long time even after 8-bit ISO-8859-# character encodings became widespread. Even though some servers were already 8-bit clean, many others weren't, and blindly sending 8-bit data would have resulted in mangled messages.
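To make the mangling concrete, here is a minimal Python sketch. The `strip_high_bit` function is a made-up stand-in for a relay that is not 8-bit clean (real relays misbehaved in various ways; clearing the high bit is just one of them). It shows that raw binary is altered in transit, while the same data wrapped in Base64 survives, because Base64 output uses only 7-bit ASCII characters.

```python
import base64


def strip_high_bit(data: bytes) -> bytes:
    """Hypothetical relay that clears the high-order bit of every octet."""
    return bytes(b & 0x7F for b in data)


# Arbitrary binary sample; several octets have the 8th bit set.
original = bytes([0x89, 0x50, 0x4E, 0x47, 0xFF, 0x00, 0x80])

# Sent raw: every octet >= 0x80 is changed by the relay.
raw_received = strip_high_bit(original)

# Sent as Base64: the encoded text is pure 7-bit ASCII, so the relay
# leaves it untouched and the receiver can decode the original octets.
encoded = base64.b64encode(original)
b64_received = base64.b64decode(strip_high_bit(encoded))

print(raw_received == original)  # False
print(b64_received == original)  # True
```

The price is size: Base64 turns every 3 octets into 4 characters, so the encoded data is about a third larger.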

Later, "Extended SMTP" was published, allowing mail servers to declare SMTP extensions they supported; one of them was 8BITMIME, indicating that the receiving server could safely accept 8-bit data. MIME message parts can have "Content-Transfer-Encoding: 8bit", indicating that they are not encoded in any way.

However, SMTP remained line-based: it has a 998-octet line limit, and it uses a line consisting of a single "." (octets 0D 0A 2E 0D 0A) as the "end of message" indicator. This means that even though most binary files could be sent unaltered, a file containing this octet sequence could still be interpreted as the end of the transferred message, and the rest of the file as SMTP commands, possibly causing damage. Similarly, a "line" longer than 998 octets might be cut off by the receiving server.
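As a rough illustration of those two hazards, the sketch below checks a payload for the terminator sequence and for over-long lines, then checks the same payload after Base64 encoding. The function name `line_hazards` and the sample payload are invented for this example; it is not how any particular mail server validates input.

```python
import base64
from typing import List

MAX_LINE = 998  # octets per line, not counting the CRLF terminator


def line_hazards(payload: bytes) -> List[str]:
    """Report which line-based SMTP hazards a payload would trigger."""
    problems = []
    # A line holding only "." ends the DATA phase.
    if payload.startswith(b".\r\n") or b"\r\n.\r\n" in payload:
        problems.append("contains the end-of-message sequence")
    # Anything between two CRLFs counts as one line.
    if any(len(line) > MAX_LINE for line in payload.split(b"\r\n")):
        problems.append("has a line longer than 998 octets")
    return problems


# Made-up "binary file" that happens to contain both hazards.
raw = b"abc\r\n.\r\nRSET\r\n" + b"x" * 2000

# base64.encodebytes wraps its output into 76-character lines ending in
# "\n"; they are converted to CRLF here as they would be on the wire.
encoded = base64.encodebytes(raw).replace(b"\n", b"\r\n")

print(line_hazards(raw))      # both hazards are reported
print(line_hazards(encoded))  # empty list
```

Base64 sidesteps both problems by construction: its alphabet does not include ".", and encoders conventionally wrap the output into short lines.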

In 2000, the "BINARYMIME" ESMTP extension was published as RFC 3030 (together with "CHUNKING", which it depends on), allowing transfers of raw binary data over SMTP. The message is transferred with the BDAT command in chunks of pre-indicated length, with the final chunk (which may be zero-length) marked LAST, so Base64 and similar encodings are no longer needed. Unfortunately, few SMTP servers support this extension; for example, at the time of writing, neither Postfix nor Exim4 advertises CHUNKING in reply to EHLO. To take advantage of BINARYMIME, it would have to be supported by every server in the message path, which can be more than just one or two.
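A minimal sketch of what that length-prefixed framing looks like on the wire, assuming the `BDAT <size> [LAST]` command form from RFC 3030. The helper name `bdat_frames` and the chunk size are made up for illustration, and a real client would also wait for the server's reply after each chunk, which is omitted here.

```python
from typing import Iterator


def bdat_frames(message: bytes, chunk_size: int = 4096) -> Iterator[bytes]:
    """Yield each BDAT command line followed by its raw chunk octets."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, len(message), chunk_size):
        chunk = message[start:start + chunk_size]
        # The receiver reads exactly len(chunk) octets after the CRLF,
        # so the chunk contents are never scanned for line endings or dots.
        yield b"BDAT %d\r\n" % len(chunk) + chunk
    # A zero-length chunk marked LAST terminates the message.
    yield b"BDAT 0 LAST\r\n"


# Binary data that would be unsafe under the line-based DATA command.
message = b"\x00\xff\r\n.\r\n" * 3  # 21 octets

frames = list(bdat_frames(message, chunk_size=8))
for frame in frames:
    print(frame)
```

Because the length is announced up front, the "." terminator and the line length limit no longer apply to the message body.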

See also:

  • RFC 765: File Transfer Protocol
  • RFC 772: Mail Transfer Protocol
  • RFC 821: Simple Mail Transfer Protocol
  • Wikipedia: 8-bit clean
  • Wikipedia: 8BITMIME

Some older e-mail systems and software were not 8-bit clean: the 8th bit was used for control purposes. This was enough to muck up binary files, so Base64 (or another encoding scheme) was needed.