How to find out the Encoding of a File? C#

There is no reliable way to do it (since the file might be just random binary); however, the process used by Windows Notepad is detailed in Michael S. Kaplan's blog:

http://www.siao2.com/2007/04/22/2239345.aspx

1. Check the first two bytes;
   1. If there is a UTF-16 LE BOM, then treat it (and load it) as a "Unicode" file;
   2. If there is a UTF-16 BE BOM, then treat it (and load it) as a "Unicode (Big Endian)" file;
   3. If the first two bytes look like the start of a UTF-8 BOM, then check the next byte and if we have a UTF-8 BOM, then treat it (and load it) as a "UTF-8" file;

2. Check with IsTextUnicode to see if that function thinks it is BOM-less UTF-16 LE; if so, then treat it (and load it) as a "Unicode" file;

3. Check to see if it is UTF-8 using the original RFC 2279 definition from 1998, and if it is, then treat it (and load it) as a "UTF-8" file;

4. Assume an ANSI file using the default system code page of the machine.

Now note that there are some holes here, like the fact that step 2 does not do quite as well with BOM-less UTF-16 BE (there may even be a bug here, I'm not sure -- if so it's a bug in Notepad beyond any bug in IsTextUnicode).
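
The BOM checks and the UTF-8 check in the steps above can be roughly sketched in C#. This is only an approximation under several assumptions, not Notepad's actual code: the class and method names (EncodingSniffer, DetectBom, IsValidUtf8, Guess) are made up for illustration, the IsTextUnicode step is skipped because it is a Windows API that would need P/Invoke, and the UTF-8 check uses .NET's strict decoder rather than the older RFC 2279 definition.

```csharp
using System;
using System.Text;

public static class EncodingSniffer
{
    // Hypothetical helper: returns the encoding implied by a BOM, or null if no BOM is found.
    // Like the procedure quoted above, it looks only at the first two or three bytes,
    // so a UTF-32 LE BOM (FF FE 00 00) would be reported as UTF-16 LE.
    public static Encoding DetectBom(byte[] b)
    {
        if (b.Length >= 2 && b[0] == 0xFF && b[1] == 0xFE)
            return Encoding.Unicode;            // UTF-16 LE
        if (b.Length >= 2 && b[0] == 0xFE && b[1] == 0xFF)
            return Encoding.BigEndianUnicode;   // UTF-16 BE
        if (b.Length >= 3 && b[0] == 0xEF && b[1] == 0xBB && b[2] == 0xBF)
            return Encoding.UTF8;               // UTF-8 with BOM
        return null;
    }

    // Hypothetical helper: true if the bytes decode as UTF-8 without errors.
    // Assumption: a strict modern decoder stands in for the RFC 2279 check.
    public static bool IsValidUtf8(byte[] bytes)
    {
        try
        {
            // throwOnInvalidBytes: true makes the decoder throw instead of substituting U+FFFD.
            new UTF8Encoding(false, true).GetCharCount(bytes);
            return true;
        }
        catch (DecoderFallbackException)
        {
            return false;
        }
    }

    // BOM first, then the UTF-8 check, then the system default as a last resort.
    // Note: Encoding.Default is the ANSI code page on .NET Framework,
    // but it is UTF-8 on .NET Core and later.
    public static Encoding Guess(byte[] bytes)
    {
        return DetectBom(bytes)
            ?? (IsValidUtf8(bytes) ? new UTF8Encoding(false) : Encoding.Default);
    }
}
```

Something like EncodingSniffer.Guess(File.ReadAllBytes(path)) would give a best guess for a small file. Pure ASCII content passes the UTF-8 check, so it is reported as UTF-8, and any result for a file without a BOM is still only a guess.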

http://msdn.microsoft.com/en-us/netframework/aa569610.aspx#Question2

There is no great way to detect an arbitrary ANSI code page, though there have been some attempts to do this based on the probability of certain byte sequences in the middle of text. We don't try that in StreamReader. A few file formats like XML or HTML have a way of specifying the character set on the first line in the file, so Web browsers, databases, and classes like XmlTextReader can read these files correctly. But many text files don't have this type of information built in.
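
For files that do carry a BOM, StreamReader can do the detection itself. The following is a minimal sketch; the class name ReaderProbe, the method name, and the fallback parameter are illustrative assumptions, not anything from the FAQ. It opens the file with BOM detection enabled and reports which encoding the reader settled on.

```csharp
using System;
using System.IO;
using System.Text;

public static class ReaderProbe
{
    // Hypothetical helper: returns the encoding StreamReader picks for the file.
    // If the file has no BOM, the supplied fallback encoding is returned unchanged.
    public static Encoding DetectWithStreamReader(string path, Encoding fallback)
    {
        using (var reader = new StreamReader(path, fallback, detectEncodingFromByteOrderMarks: true))
        {
            // CurrentEncoding is only updated after the first read, so force one.
            reader.Peek();
            return reader.CurrentEncoding;
        }
    }
}
```

This only recognizes byte order marks. As the quote says, StreamReader does not attempt to guess an ANSI code page, so a BOM-less file simply comes back as whatever fallback was passed in.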
