Text inside files has squares with numbers in it

Solution 1:

The boxes mean "glyph not found"; the characters inside each box are the hexadecimal representation of the character's Unicode code point.

There are two possibilities: the character encoding is garbled, or the font you are using doesn't have a glyph for that character. This is a great overview of character encoding if you really want to understand it: http://trochee.net/2011/05/character-encoding-tutorial/
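To check which of the two it is, you can look up the code point printed inside a box. The following is only a minimal sketch using Python's standard `unicodedata` module; the helper name `describe_codepoint` and the sample values are illustrative assumptions, not something taken from your files.

```python
import unicodedata

def describe_codepoint(hex_digits):
    """Describe the character whose hex code point is printed inside a box."""
    cp = int(hex_digits, 16)          # e.g. "001F" -> 31
    ch = chr(cp)
    return {
        "codepoint": f"U+{cp:04X}",
        # General category: "Cc" is a control character, "Co" is private use.
        "category": unicodedata.category(ch),
        # Control and private-use characters have no name, so use a default.
        "name": unicodedata.name(ch, "<no name>"),
    }

# Sample values (assumed): two control characters and one real ligature.
for digits in ("001F", "001D", "FB01"):
    print(describe_codepoint(digits))
```

A category of `Cc` (control) or `Co` (private use) suggests the text itself is garbled, because no ordinary font has a glyph for those; a named letter or symbol suggests the font is simply missing that glyph.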

Curiously, U+001F and U+001D are ASCII control characters (Unit Separator and Group Separator), so they are really just glorified line breaks. It seems odd that OCR would return those.

Solution 2:

The squares (as far as I can tell) always occur in places where special typesetting characters have been used. For example, typesetting "ty" as the letter t followed by the letter y leaves extra, unwanted space between the two letters in some fonts. For that reason, many fonts used for more advanced typesetting include extra characters (ligatures) for such pairs, like the ty character in the passage that should read "...ancient beauty a temperate...". Since you don't have these extra characters (it's possible you can't even decode them, since they might not have an ASCII/UTF-8 code), you get squares.

I have no real idea how to copy the actual text (and in this case get a t and a y as separate characters), but the people at TeX, LaTeX and friends might be able to help - they're not necessarily font experts, but they're all into typesetting...
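For what it's worth, ligatures that do have their own Unicode code point (such as U+FB01, the fi ligature) can be split back into plain letters with compatibility normalization. Below is a rough sketch under stated assumptions: the function name `expand_ligatures` is made up for illustration, and the mapping from U+001F to "ty" is purely hypothetical. The real mapping would have to be worked out by comparing the extracted text with the page image.

```python
import unicodedata

# Hypothetical mapping (an assumption, not taken from any real file): which
# stray code point stands for which letter pair depends on the document.
ASSUMED_LIGATURES = {"\x1f": "ty"}

def expand_ligatures(text, extra=ASSUMED_LIGATURES):
    """Replace ligature characters with their separate letters."""
    # NFKC splits ligatures that Unicode itself defines, e.g. U+FB01 -> "fi".
    text = unicodedata.normalize("NFKC", text)
    # Stray code points with no Unicode meaning need an explicit mapping.
    for stray, letters in extra.items():
        text = text.replace(stray, letters)
    return text

print(expand_ligatures("ancient beau\x1f a temperate"))
print(expand_ligatures("\ufb01nal"))
```

Note that NFKC also rewrites other compatibility characters (full-width letters, superscript digits and so on), so it is worth checking the result if the text contains those.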