How to transfer a file over pen and paper, with error correction
I doubt that transcribing it is going to be a problem.
Let's say you have Red, Green, Blue and Black. You can write a script that turns your data into a collection of letters from RGBK, e.g. RGBKGBRKBGBKRKKBKBRKKG (or even Red Green Blue Black Green Blue Red Black... in an Excel sheet) and back again. It's just a matter of base converting your binary data from base 2 (or hexadecimal data from base 16) to a base equal to the number of colors you use (4 in this example).
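To make that concrete, here is a minimal Python sketch of the conversion, assuming the four pens map to the letters R, G, B and K (for Black); each byte becomes exactly four base-4 "color digits", so the mapping is trivially reversible:

```python
# A minimal sketch of the base conversion described above (assumption: the
# four pens map to the letters R, G, B and K for Black).

COLORS = "RGBK"

def encode(data: bytes) -> str:
    out = []
    for byte in data:
        for shift in (6, 4, 2, 0):              # four 2-bit "color digits" per byte
            out.append(COLORS[(byte >> shift) & 0b11])
    return "".join(out)

def decode(text: str) -> bytes:
    out = bytearray()
    for i in range(0, len(text), 4):            # every 4 letters rebuild one byte
        byte = 0
        for ch in text[i:i + 4]:
            byte = (byte << 2) | COLORS.index(ch)
        out.append(byte)
    return bytes(out)

assert decode(encode(b"hello world")) == b"hello world"
```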
Now, the most logical approach would be to get yourself 16 colors. That way you need 4 times fewer dots than when drawing individual bits, which makes switching between the pens worth it. This lets you fit 4 times as much data on the paper if you need to, or lets you be 4 times less precise when placing your dots; the trade-off is up to you. I would really advise against drawing every single bit.
For instance, 5565 bytes becomes 11130 hexadecimal digits (two per byte), as opposed to 44520 bits, and those fit into a 106 x 106 grid.
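As a sanity check on those numbers, a few lines of Python reproduce the grid sizes for various alphabet sizes (assuming one symbol per grid cell and a square layout):

```python
# Rough grid sizing for the 5565-byte example, assuming one symbol per grid
# cell and a square layout.
import math

def grid_side(num_bytes: int, bits_per_symbol: int) -> int:
    symbols = math.ceil(num_bytes * 8 / bits_per_symbol)
    return math.ceil(math.sqrt(symbols))

for base, bits in [(2, 1), (4, 2), (16, 4), (32, 5)]:
    side = grid_side(5565, bits)
    print(f"base {base:>2}: {side} x {side}")
# base  2: 211 x 211
# base  4: 150 x 150
# base 16: 106 x 106
# base 32: 95 x 95
```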
Depending on the type of data, you can probably come up with some optimizations...
Hint: Attempt to pick the most distinct (most contrasting) colors...
Alternatives that can use a single pen:
- Represent the different hexadecimals by different symbols: -, /, |, \, +, ...
- Represent the different hexadecimals by a small pixel font, see my avatar.
This even makes it practical to use something like Base 32 (or Base 36). Note that the Q and 9 are the same, so you will want the top-right pixel of the Q to be white for a clear distinction. Base 32 only requires roughly a 95 x 95 grid for your example (44520 bits / 5 bits per character = 8904 characters), plus a small bit of spacing to distinguish between letters.
If you want people to be able to read and write the data, the problem with Base64 and many text encodings is that they use characters like I, l, 1, |, /, 0, O, o, and so on that people confuse with each other.
Investigate Douglas Crockford's Base32 encoding. Its alphabet was specifically chosen to avoid similar characters, and it includes error detection.
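For illustration, here is a small sketch of Crockford's Base32 applied to a non-negative integer, including the optional mod-37 check symbol the spec defines for error detection; treat the details (in particular the extra check characters) as an illustration rather than a reference implementation:

```python
# A sketch of Crockford's Base32 for a non-negative integer, with the optional
# mod-37 check symbol from the spec appended for error detection.

ALPHABET = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"   # no I, L, O or U
CHECK_SYMBOLS = ALPHABET + "*~$=U"               # 5 extra symbols for check values 32-36

def crockford_encode(n: int, with_check: bool = True) -> str:
    digits = ""
    x = n
    while True:
        digits = ALPHABET[x % 32] + digits
        x //= 32
        if x == 0:
            break
    return digits + (CHECK_SYMBOLS[n % 37] if with_check else "")

def crockford_decode(s: str, with_check: bool = True) -> int:
    if with_check:
        s, check = s[:-1], s[-1]
    n = 0
    for ch in s.upper().replace("-", ""):                    # hyphens are just padding
        ch = {"O": "0", "I": "1", "L": "1"}.get(ch, ch)      # forgiving decode
        n = n * 32 + ALPHABET.index(ch)
    if with_check and CHECK_SYMBOLS[n % 37] != check:
        raise ValueError("check symbol mismatch - possible transcription error")
    return n

value = 0xDEADBEEF
assert crockford_decode(crockford_encode(value)) == value    # round trip, check included
```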
After reading your comments, that sounds more reasonable. I just wasn't sure whether you were intending to encode megabytes of data like this.
I'd recommend, along the lines of Oliver's suggestion, that you increase your data density by borrowing a page from Bacon's cipher, which prison gangs often use to encode hidden messages in missives written in 2 different script styles--usually either upper vs. lowercase characters or print vs. cursive characters, e.g.
Hey mOM, WHAT's FOR diNNeR TODAY? = ABBBA AAAAA BAAAB BAABA AAAAA
= P A S T A
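For reference, a short sketch that decodes the dual-case example above; it assumes the classic 24-letter Bacon alphabet (I/J and U/V share a code), with uppercase read as A and lowercase as B:

```python
# Decoding the dual-case example above: uppercase reads as 'A' (bit 0),
# lowercase as 'B' (bit 1), five letters per group, using the classic
# 24-letter Bacon alphabet in which I/J and U/V share a code.

BACON_ALPHABET = "ABCDEFGHIKLMNOPQRSTUWXYZ"      # 24 letters

def bacon_decode(text: str) -> str:
    letters = [c for c in text if c.isalpha()]   # punctuation carries no data
    bits = ["0" if c.isupper() else "1" for c in letters]
    out = ""
    for i in range(0, len(bits) - 4, 5):         # one plaintext letter per group of 5
        out += BACON_ALPHABET[int("".join(bits[i:i + 5]), 2)]
    return out

print(bacon_decode("Hey mOM, WHAT's FOR diNNeR TODAY?"))     # PASTA
```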
However, since your goal isn't steganography, you would simply use this to expand your glyph set. Doing this, you could have up to 114 glyphs just using print & cursive alphanumeric characters, or 12996 code points using dual-character encoding.
However, since all glyph counts greater than 15 and less than 256 are essentially the same for a straight cipher of binary data (meaning, you'll still need 2 characters to represent each byte, giving you a data density of 4 bits per character in all cases), you can use the extra 98 glyphs / 12740 code points for error detection/correction.
Ways to do this include:
- Choose a set of the 256 most easy to read/write character combos. If any other character combo occurs, you know it's a copying error.
- Use two versions of the end character as a parity bit.
- Create 50 different 16-character glyph sets. You can then use the choice of set to encode error correction data. E.g. {set 1}{set 1} means the next 3 nibbles equal 0x000, {set 1}{set 2} equals 0x001, etc. You can use this to represent 2500+ of the 4096 possible 1.5-byte values. Similarly, you could use just 16 sets to represent all values of the following byte, giving you 100% redundancy without increasing your encoded data length.
Alternatively, you could use the extra glyphs for additional compression:
- Implement variable-width encoding by choosing 98 single-character code points. This would reduce the average encoded content size by about 20%.
- Implement something similar to run-length encoding by using different glyph sets or glyph set combinations to represent repeating nibbles/bytes, e.g. Ab = aba; aB = abab; AB = ababab... (see the sketch after this list).
- Use the extra glyphs or code points to represent "words" and "phrases" that are repeated in your data. Though pre-compressed data will likely have a high level of entropy, so I don't know how effective this would be.
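Here is a toy sketch of the case-based run-length idea; the expansion table is my own assumption, following the Ab / aB / AB example above:

```python
# Toy version of the case-based run-length idea: the case pattern of a
# two-glyph pair says how far the pair repeats when expanded (the exact
# mapping here is an assumption for illustration).

EXPANSION = {
    (False, False): "{a}{b}",        # ab -> ab      (literal pair)
    (True,  False): "{a}{b}{a}",     # Ab -> aba
    (False, True):  "{a}{b}" * 2,    # aB -> abab
    (True,  True):  "{a}{b}" * 3,    # AB -> ababab
}

def expand(encoded: str) -> str:
    out = []
    for i in range(0, len(encoded), 2):
        a, b = encoded[i], encoded[i + 1]
        template = EXPANSION[(a.isupper(), b.isupper())]
        out.append(template.format(a=a.lower(), b=b.lower()))
    return "".join(out)

print(expand("XYcd"))                # xyxyxycd
```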
To further reduce copying errors, I would display the encoded content in gridlines and copy it onto graphing paper. If you can use custom stationery that has alternating column/row colors, or a chessboard-style checkered grid with lettered columns & numbered rows for quick look-ups, that would further increase copying accuracy.
You can also combine an alternating grid layout with alternating character styles as an easy form of error detection. E.g. if odd columns are always capitalized and the transcriber finds themselves writing lowercase letters in an odd column, they know they've made an error and can backtrack to see where it happened.
Though if your main priority is accuracy, I would use a binary encoding + Hamming code. Using a (12, 8) shortened Hamming code on standard graphing paper, you might only fit 187 bytes, encoding only 124 bytes of data. But it could be transcribed very quickly (a slash for 1, nothing for 0) and provides single error correction. Tacking on an extra parity bit (13, 8) would provide SECDED (single error correction, double error detection). Using a standard Hamming code like (15, 11) or (31, 26), you get even better efficiency, with 137 and 156 bytes of data per sheet, respectively. Even higher code rates can be achieved, depending on how accurate you think your transcriber can be.
A binary encoding would also be easier to read (aloud) and OCR/OMR.
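For the Hamming option, here is a minimal sketch of a (15, 11) code: parity bits sit at (1-based) positions 1, 2, 4 and 8 of each 15-bit codeword, the other 11 positions carry data, and a nonzero syndrome points at the single flipped bit (no SECDED parity bit in this sketch):

```python
# A (15, 11) Hamming code: 11 data bits per 15-bit codeword, parity bits at
# (1-based) positions 1, 2, 4 and 8, single error correction.

def hamming15_encode(data_bits):
    """data_bits: 11 ints (0/1) -> 15-bit codeword as a list of ints."""
    code = [0] * 16                              # index 0 unused; positions 1..15
    it = iter(data_bits)
    for pos in range(1, 16):
        if pos not in (1, 2, 4, 8):              # data goes in non-parity positions
            code[pos] = next(it)
    for p in (1, 2, 4, 8):                       # each parity bit covers positions
        for pos in range(1, 16):                 # whose index has that bit set
            if pos != p and (pos & p):
                code[p] ^= code[pos]
    return code[1:]

def hamming15_decode(received):
    """received: 15 ints (0/1) -> (11 corrected data bits, error position or 0)."""
    code = [0] + list(received)
    syndrome = 0
    for p in (1, 2, 4, 8):
        parity = 0
        for pos in range(1, 16):
            if pos & p:
                parity ^= code[pos]
        if parity:
            syndrome |= p                        # failed checks spell out the error position
    if syndrome:
        code[syndrome] ^= 1                      # flip the single bad bit
    data = [code[pos] for pos in range(1, 16) if pos not in (1, 2, 4, 8)]
    return data, syndrome

msg = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0]
codeword = hamming15_encode(msg)
codeword[6] ^= 1                                 # simulate one transcription error (position 7)
decoded, err = hamming15_decode(codeword)
assert decoded == msg and err == 7
```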
We used to use S-Records for this purpose. There was a simple checksum, per line, for error detection. Normally all but the last line was fixed length, so the end-of-line marker served as a check for insertions and deletions. There was no check for missing lines though. For this we simply counted the number of lines. Mostly files were short, less than 100 lines, but I do remember at least one which had 300 lines or more. It was very tedious typing files into the system. Of course, among the first programs transferred this way was a downloader ;)
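For reference, that per-line checksum is easy to verify mechanically: it is the ones' complement of the low byte of the sum of the count, address and data bytes. A small sketch (the sample record below is constructed for illustration, not taken from a real file):

```python
# Checking the per-line S-record checksum: it is the ones' complement of the
# low byte of the sum of the count, address and data bytes.

def srec_checksum_ok(line: str) -> bool:
    body = bytes.fromhex(line[2:])               # skip the 'S1' type field
    count, payload, checksum = body[0], body[1:-1], body[-1]
    if count != len(body) - 1:                   # count covers address + data + checksum
        return False
    return checksum == (~(count + sum(payload)) & 0xFF)

# S1 record: 16 data bytes at address 0x7AF0; checksum 0x61 closes the line.
line = "S1" + "13" + "7AF0" + "0A0A0D" + "00" * 13 + "61"
assert srec_checksum_ok(line)
```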