Why isn't there a delimiter character in ASCII?
Solution 1:
Delimiters already exist in ASCII. Decimal 28-31 (hex 1C-1F) are delimiters: the file, group, record, and unit separators (FS, GS, RS, US).
I would assume we do not use them because it is easier to type keyboard characters that do not require multiple keys per character. This also allows for easier interchange between different formats: comma-separated values will work on virtually any system, ASCII-compliant or not.
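That said, the ASCII separators are perfectly usable from the shell. A minimal sketch, assuming GNU or POSIX awk with octal string escapes (`\036` is the record separator RS, `\037` is the unit separator US); the names and values are just demonstration data:

```shell
# Two records (separated by RS, octal 036), each holding two
# units/fields (separated by US, octal 037).
printf '%s\037%s\036%s\037%s' Alice 30 Bob 25 |
awk 'BEGIN { RS = "\036"; FS = "\037" } { print $1 ": " $2 }'
# prints:
# Alice: 30
# Bob: 25
```

No quoting or escaping rules are needed, because names and numbers never contain these control characters.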
Solution 2:
As already noted, ASCII includes delimiters. The problem is not that an extra key is needed during data entry - Control is no harder to use than Shift for an upper-case letter or other special printable character (e.g., !@#$). The problem is that traditionally those control characters are not directly visible. Even tab, carriage return and line feed - which produce immediate actions - do not produce visible output.
You can't tell the difference on a teletype between tabs and spaces. You can't tell the difference between a line feed and a run of spaces to end-of-line followed by a wrap to the next line. Similarly, the delimiters have no defined printable image. They may show in some (modern) text editors, and they may produce immediate actions in various devices, but they don't leave a mark.
All of this doesn't matter if data is only designed to be machine-readable - i.e., what we commonly refer to as binary files. But text for data entry and transfer between systems is often, deliberately, human-readable. If it is going to be human-readable, the delimiters need to be printable.
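The invisibility is easy to demonstrate. A sketch assuming GNU coreutils, whose `cat -A` makes control characters and line ends explicit:

```shell
# 'a', TAB, 'b', unit separator (0x1F), 'c': the controls leave no
# mark on a plain terminal, but cat -A shows TAB as ^I, US as ^_,
# and end-of-line as $.
printf 'a\tb\037c\n' | cat -A
# prints: a^Ib^_c$
```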
Solution 3:
As was mentioned in another answer, ASCII does have delimiters. Looking here [1], these are mentioned:
| code point | name |
| --- | --- |
| U+001C | File Separator |
| U+001D | Group Separator |
| U+001E | Record Separator |
| U+001F | Unit Separator |
and these are used. For example, U+001C (octal 34) is the default `SUBSEP` string for GNU AWK [2].
- [1] https://wikipedia.org/wiki/ASCII#Control_code_chart
- [2] https://gnu.org/software/gawk/manual/html_node/Multidimensional
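You can verify the default `SUBSEP` value from the shell. A small sketch; `od -An -c` dumps the byte so the invisible character becomes readable:

```shell
# SUBSEP joins the components of a multidimensional subscript like
# a["x","y"]; by default it is the file separator, octal 034.
awk 'BEGIN { printf "%s", SUBSEP }' | od -An -c
# shows the single byte 034
```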
Solution 4:
This is mainly historical.
In the early days of computing, data files were mostly fixed-width-field files, because that was the natural I/O model for languages like Fortran IV and COBOL: n characters for the first field, m for the second, and so on.
Then the C language provided a `scanf` function that split input on (runs of) white space, and people started to use free format for data files containing numbers. But that led to messy results when some fields could contain spaces (`scanf` is known as a poor man's parser). So, as the other standard splitting function was `strtok`, which used a single delimiter, most (English-speaking) people started to use the comma (`,`) as the separator, because it is easy to manually write a comma-separated-value file in a text editor.
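Both failure modes are easy to reproduce. A sketch using awk's default whitespace splitting and naive comma splitting; the sample data is made up:

```shell
# Whitespace splitting mangles fields that contain spaces:
printf 'New York 8804190\n' | awk '{ print $1 }'
# prints "New", not "New York"

# Naive comma splitting mangles quoted fields that contain commas:
printf '"Smith, John",42\n' | awk -F, '{ print NF }'
# prints 3 fields, not the intended 2
```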
Then National Language Support came into the game... In some European languages (e.g., French), the decimal separator is the comma. IT people were used to the decimal point, but less technical users were not, so French versions of Windows defined the semicolon (`;`) as the separator, to allow the comma inside decimal numbers.
Meanwhile, some realized that when fields had similar lengths, a tab character (which existed on all keyboards) provided nice vertical alignment, and that became a third de facto standard.
Finally, standardization became a reality, and RFC 4180 emerged in 2005. It defined the comma as the official separator, but since Windows had decided to play the NLS game, tools and libraries wanting to process real-world files had to adapt to various possible delimiters.
And that is the reason why in 2021, we have many possible delimiters in CSV files...
Solution 5:
It has come to pass that there is a de facto universal delimiter in ASCII: the null character. Unix and the C language showed that you can build an entire platform in which the null character is banished from character strings, serving as a terminator in their representation. Other platforms, such as Microsoft Windows, have followed suit.
Today, it's a virtually iron-clad guarantee that no textual datum contains a null byte. If a datum contains a null byte, it's binary and not text.
If you want to store a sequence of textual records or fields in a byte stream, if you separate them with nulls, you will have next to no issues. Nulls don't require any nonsense like escaping. If someone comes along and says they want to include a null byte in a text field, you can laugh them off as a comedian.
Examples of null separation in the wild:
- Microsoft allows items in the registry to be multi-strings: single items containing multiple strings. This is stored as a sequence of null-terminated strings catenated together, with an extra null byte to terminate the whole sequence, as in `"the\0quick\0brown\0fox\0\0"` to represent the list of strings `"the"`, `"quick"`, `"brown"`, `"fox"`.
- On the Linux kernel, the environment variables of each process are available via the `/proc` filesystem as `/proc/<pid>/environ`. This virtual file uses null separation, like `PATH=/bin:/usr/bin\0TERM=xterm\0...`.
- Some GNU utilities have the option to produce null-separated output, and that is precisely what allows them to be used to write much more robust scripts. GNU `find` has a `-print0` predicate for printing paths with null termination instead of newline separation. These paths can be fed to `xargs -0`, which reads null-separated strings from its standard input and turns them into command-line arguments for a specified command. This combo will cleanly pass absolutely all file names/paths, regardless of what they contain, because paths cannot contain a null byte.
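The `find`/`xargs` combo looks like this in practice. A sketch; the file name with an embedded space is just demonstration data:

```shell
# Create a file whose name contains a space, then pass it through a
# null-separated pipeline; the name survives intact, where a
# newline-separated pipeline could mangle it.
dir=$(mktemp -d)
touch "$dir/has space.txt"
find "$dir" -name '*.txt' -print0 | xargs -0 -n 1 basename
# prints: has space.txt
rm -r "$dir"
```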
Why do we play games with other separators - tabs, commas, semicolons and whatnot - rather than just using null? The problem is that we need multiple levels of separation. Okay, so nulls chop the byte stream into texts, reliably. But within those texts, there may be another level of delimitation needed. It sometimes happens that a single string has more structure inside it. A path contains slashes to separate components. A MAC address uses colons to separate bytes. That sort of thing. An e-mail address has multiple levels of nested delimitation: `local@domain` around the `@` symbol, and then the domain part separated with dots. Parentheses are allowed in there, and things like `%` and `!`. People write string-handling code to deal with these formats, and in many languages that string-handling code will not tolerate null bytes, due to the influence of C and Unix.
Demo of GNU Awk using the null byte as the field separator, processing `/proc/self/environ`:
```shell
$ awk -F'\0' \
      '{ for (i = 1; i <= NF; i++)
           printf("field[%d] = %s\n", i, $i) }' \
      /proc/self/environ
field[1] = CLUTTER_IM_MODULE=xim
field[2] = XDG_MENU_PREFIX=gnome-
field[3] = LANG=en_CA.UTF-8
field[4] = DISPLAY=:0
field[5] = OLDPWD=/home/kaz/tftproot
field[6] = GNOME_SHELL_SESSION_MODE=ubuntu
field[7] = EDITOR=vim
[ snip ... ]
field[54] = PATH=/home/kaz/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/home/kaz/bin:/home/kaz/bin
field[55] = GJS_DEBUG_TOPICS=JS ERROR;JS LOG
field[56] = SESSION_MANAGER=local/sun-go:@/tmp/.ICE-unix/1986,unix/sun-go:/tmp/.ICE-unix/1986
field[57] = GTK_IM_MODULE=ibus
field[58] = _=/usr/bin/awk
field[59] =
```
We get an extra blank field due to the null byte at the end, because Awk treats it as a field separator rather than a terminator. However, this is possible precisely because GNU Awk allows the null byte to be a constituent of character strings. The argument `-F '\0'` is not required to work, according to the POSIX specification. POSIX says, in a table entitled "Escape Sequences in awk":

> `\ddd`: A `<backslash>` character followed by the longest sequence of one, two, or three octal-digit characters (01234567). If all of the digits are 0 (that is, representation of the NUL character), the behavior is undefined.

Thus it is entirely nonportable to rely on Awk to separate fields or records on the null byte. This kind of language problem is probably one reason we don't make more use of null characters.
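A more portable way to split null-separated text in the shell is `tr`, which accepts octal escapes (including `\0`) in its operands per POSIX. A sketch with made-up environment data:

```shell
# Translate each NUL into a newline - usable where awk -F '\0'
# is undefined behavior under POSIX.
printf 'PATH=/bin\0TERM=xterm\0' | tr '\0' '\n'
# prints:
# PATH=/bin
# TERM=xterm
```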