How do I read a text file's hidden characters?
I've created a text file from an application that I developed.
When I send the text file to a third-party system for validation, they say that the file is invalid: it contains three characters at the beginning of the file that are not allowed, and the special characters are not correct.
They also say I need to use either ISO 8859-1 or PC850.
Well, I'm using Notepad++, and I can't see that at all! What is the best text file reader for this kind of problem?
I also have a Mac, so I tried opening the file in TextMate ... WOW! Now I know what they are talking about!
How can I see the same thing in Windows?
Well, I'm using Notepad++, and I can't see that at all! What is the best text file reader for this kind of problem?
The problem is, a ‘good’ text editor should be able to load all text encodings transparently — even stupid broken ones like UTF-8-plus-BOM — which would prevent you from seeing the problem. Sure, a good text editor should save UTF-8 without the bogus-BOM, or at least give you the option to do so, but you won't know to re-save it if you don't see the faux-BOM there.
The reason you see the three high-bytes at the start of the file in TextMate is actually because TextMate has got it wrong and guessed the encoding as Latin-1 instead of UTF-8. This presumably reproduces the behaviour of the service you're sending to, which doesn't know about Unicode, but it's not really a desirable feature in itself. It's also why the æs and øs haven't come out.
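To make the misreading concrete, here is a small Python sketch (Python is just my choice for illustration, and the sample text is made up) that writes a BOM plus two special characters as UTF-8 and then decodes the bytes as Latin-1, roughly what a non-Unicode reader would do:

```python
# Simulate a file saved as UTF-8 with a BOM, then read back as Latin-1.
text = "\ufeffæø"                  # U+FEFF is the byte order mark
raw = text.encode("utf-8")         # the bytes that end up in the file
misread = raw.decode("latin-1")    # what a Latin-1 reader would display

print(raw.hex(" "))                # three BOM bytes, then two bytes per letter
print(misread)                     # three junk characters, then mangled letters
```

The three leading bytes are the "three characters in the beginning of the file", and each of æ and ø turns into two wrong characters because UTF-8 stores them as two bytes each.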
If you want to see every byte in the file explicitly, what you want isn't really a text editor, but a hex editor. There are lots to choose from, e.g. XVI32 on Windows.
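If you don't have a hex editor to hand, a few lines of script can do the same job. This is only a rough sketch: the helper names `hex_dump` and `has_utf8_bom` are mine, and the sample bytes are built in memory rather than read from your actual file.

```python
import codecs

def has_utf8_bom(data: bytes) -> bool:
    """Return True if the data starts with the UTF-8 BOM (EF BB BF)."""
    return data.startswith(codecs.BOM_UTF8)

def hex_dump(data: bytes, width: int = 16) -> str:
    """Format bytes as offset, hex values, and printable ASCII per row."""
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join(f"{b:02x}" for b in chunk).ljust(width * 3)
        # Non-printable and non-ASCII bytes are shown as dots.
        text_part = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        lines.append(f"{offset:08x}  {hex_part} {text_part}")
    return "\n".join(lines)

# Made-up sample: a BOM followed by UTF-8 text with special characters.
sample = codecs.BOM_UTF8 + "blåbær\r\n".encode("utf-8")
print(hex_dump(sample))
print("UTF-8 BOM present:", has_utf8_bom(sample))
```

For a real file you would pass in `open(path, "rb").read()` instead of `sample`; opening in binary mode matters, since text mode would decode the bytes before you see them.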
And then fix your application to not produce bogus BOMs; they have no place in a UTF-8 file anyway, never mind the problems they cause to non-Unicode applications. [I don't know what the application is written in, but a common cause of unwanted BOMs is using .NET's Encoding.UTF8 encoding. A new UTF8Encoding(false) would be preferable.]
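If changing the application isn't immediately possible, an already-written file can be converted after the fact. The following is a sketch in Python rather than .NET, under the assumption that the file really is UTF-8 and that the service accepts ISO 8859-1 or PC850 (code page 850) as stated in the question; the function name `reencode` is hypothetical.

```python
def reencode(utf8_data: bytes, target: str = "iso-8859-1") -> bytes:
    """Drop a leading UTF-8 BOM, if any, and re-encode the text.

    Raises UnicodeEncodeError if a character has no equivalent in the
    target encoding, which is better than silently writing garbage.
    """
    text = utf8_data.decode("utf-8-sig")  # "utf-8-sig" strips a leading BOM
    return text.encode(target)

# Made-up sample: BOM + "æø" as UTF-8.
original = b"\xef\xbb\xbf\xc3\xa6\xc3\xb8"
print(reencode(original))            # ISO 8859-1: one byte per character
print(reencode(original, "cp850"))   # PC850 uses different byte values
```

Which of the two target encodings is right is still something only the operators of the service can confirm.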
Whether the service you're sending to wants UTF-8 or some other encoding is in any case something you'll have to ask the operators of that service. If they're already describing the high-bytes for æ et al in your file as inherently ‘invalid’, you may be facing a situation where they don't support any non-ASCII characters at all, in which case you'll have to consider transliterating characters appropriately for the target language, e.g. æ -> ae.
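A transliteration step could look roughly like the sketch below. Only æ -> ae comes from the text above; the other entries in the table are my assumption of a common Danish/Norwegian convention and would need checking against what the target system and language actually expect.

```python
# Hypothetical mapping: only "æ" -> "ae" is given above, the rest is assumed.
TRANSLIT = str.maketrans({
    "æ": "ae", "ø": "oe", "å": "aa",
    "Æ": "AE", "Ø": "OE", "Å": "AA",
})

def to_ascii(text: str) -> str:
    """Replace mapped characters, then mark anything still non-ASCII as '?'."""
    replaced = text.translate(TRANSLIT)
    return replaced.encode("ascii", errors="replace").decode("ascii")

print(to_ascii("blåbærsyltetøy"))
```

Characters that are not in the table come out as `?`, which at least makes the gaps visible instead of sending bytes the service will reject.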
An easy way to view this kind of stuff in Windows is to use the "type" command.
I would do something like this:
type filename.txt | more
Frhed comes to mind; it is a very nice tool. And as Arjan pointed out, you're saving the file as a UTF-8 encoded document.