How can I determine if a file is binary or text in C#?

I need to determine, correctly in about 80% of cases, whether a file is binary or text. Is there any way to do it in C#, even a quick and dirty/ugly one?

Solution 1:

One approach uses Markov chains. Scan a few sample files of each kind and, for each byte value from 0 to 255, gather statistics (essentially the probability) of each possible following byte value. This gives you a 256x256 table per kind (64 KB at one byte per entry) that you can compare your runtime files against (within a percentage threshold).

Supposedly, this is how browsers' Auto-Detect Encoding feature works.
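
The transition-table idea could be sketched roughly as follows. This is a minimal illustration, not a reference implementation: the class names ByteTransitionProfile and MarkovClassifier are made up for this example, the table stores smoothed log-probabilities as doubles (so it is larger than 64 KB), and instead of a percentage threshold it simply asks which of two profiles explains the bytes better.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper (not from the answer): a first-order byte-transition profile.
public sealed class ByteTransitionProfile
{
    // logProb[a, b] = log P(next byte == b | current byte == a), with add-one smoothing.
    private readonly double[,] logProb = new double[256, 256];

    public ByteTransitionProfile(IEnumerable<byte[]> samples)
    {
        var counts = new long[256, 256];
        var rowTotals = new long[256];
        foreach (byte[] data in samples)
        {
            for (int i = 0; i + 1 < data.Length; i++)
            {
                counts[data[i], data[i + 1]]++;
                rowTotals[data[i]]++;
            }
        }
        for (int a = 0; a < 256; a++)
            for (int b = 0; b < 256; b++)
                logProb[a, b] = Math.Log((counts[a, b] + 1.0) / (rowTotals[a] + 256.0));
    }

    // Average log-likelihood per transition; higher means "more like the training samples".
    public double Score(byte[] data)
    {
        if (data.Length < 2) return 0.0;
        double sum = 0.0;
        for (int i = 0; i + 1 < data.Length; i++)
            sum += logProb[data[i], data[i + 1]];
        return sum / (data.Length - 1);
    }
}

public static class MarkovClassifier
{
    // Assumption: call it text when the text profile scores higher than the binary profile.
    public static bool LooksLikeText(ByteTransitionProfile text, ByteTransitionProfile binary, byte[] data)
        => text.Score(data) > binary.Score(data);
}
```

You would build one profile from known text files and one from known binary files, then pass the first few kilobytes of the unknown file to LooksLikeText. How well it works depends entirely on how representative the sample files are.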

Solution 2:

I would probably look for an abundance of control characters, which are typically present in a binary file but rare in a text file. Binary files tend to use 0 often enough that just testing for many 0 bytes would probably be sufficient to catch most files. If you care about localization, you'd need to test multi-byte patterns as well.

As stated though, you can always be unlucky and get a binary file that looks like text or vice versa.
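
A quick and dirty version of this check might look like the sketch below. The name BinarySniffer, the 8 KiB sample size and the 10% threshold are all assumptions chosen for illustration, not values from this answer. Note that it treats any 0 byte as binary, so UTF-16 or UTF-32 text would be misreported; that is the localization caveat mentioned above.

```csharp
using System;
using System.IO;

// Hypothetical helper: samples the start of a file and counts suspicious bytes.
public static class BinarySniffer
{
    // Assumed default: tolerate up to 10% suspicious control bytes.
    public static bool LooksBinary(byte[] buffer, int count, double threshold = 0.10)
    {
        if (count == 0) return false; // treat empty input as text
        int suspicious = 0;
        for (int i = 0; i < count; i++)
        {
            byte b = buffer[i];
            if (b == 0) return true; // a NUL byte is a strong sign of binary (or UTF-16/32) data
            bool allowedWhitespace = b == 9 || b == 10 || b == 12 || b == 13; // tab, LF, FF, CR
            if ((b < 32 && !allowedWhitespace) || b == 127) suspicious++;
        }
        return (double)suspicious / count > threshold;
    }

    // Assumed default: inspect at most the first 8 KiB of the file.
    public static bool LooksBinary(string path, int sampleSize = 8192)
    {
        var buffer = new byte[sampleSize];
        using (FileStream fs = File.OpenRead(path))
        {
            int read = fs.Read(buffer, 0, buffer.Length);
            return LooksBinary(buffer, read);
        }
    }
}
```

Reading only a prefix keeps the check fast on large files, at the cost of missing binary content that appears later in the file.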

Solution 3:

I'm sharing my solution in the hope that it helps others, as these posts and forums have helped me.

Background

I had been researching a solution to the same problem, and I expected it to be simple, or at most slightly tricky.

However, most of the attempts, here and in other sources, are convoluted and dive into Unicode, the UTF family, BOMs, encodings, and byte orders. Along the way I also went off-road into ASCII tables and code pages.

Anyway, I have come up with a solution based on a StreamReader and a custom control-character check.

It takes into account various hints and tips from this forum and elsewhere, such as:

Check for a lot of control characters, for example multiple consecutive null characters.
Check for UTF, Unicode, encodings, BOMs, byte orders, and similar aspects.

My goals are:

It should not rely on byte orders, encodings, or other more involved, esoteric work.
It should be relatively easy to implement and easy to understand.
It should work on all types of files.

The solution presented works for me on test data that includes mp3, eml, txt, info, flv, mp4, pdf, gif, png, and jpg files. So far it gives the expected results.

How the solution works

I rely on the StreamReader(string path) constructor to handle the encoding as best it can: it decodes the file as UTF-8 by default and switches encoding if it finds a byte order mark.
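
To see what the reader does with a given file, something like the following sketch can help. ReaderDemo and ReadWithDetection are hypothetical names used only for this illustration. One thing it makes visible: with no byte order mark, bytes that are not valid UTF-8 are decoded to the replacement character U+FFFD rather than to control characters.

```csharp
using System.IO;
using System.Text;

// Hypothetical helper: shows what StreamReader(path) decodes and which encoding it settled on.
public static class ReaderDemo
{
    public static (string Text, string EncodingName) ReadWithDetection(string path)
    {
        using (var reader = new StreamReader(path)) // UTF-8 unless a BOM says otherwise
        {
            string text = reader.ReadToEnd();
            // CurrentEncoding is only meaningful after the first read.
            return (text, reader.CurrentEncoding.WebName);
        }
    }
}
```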

I wrote my own control-character check because Char.IsControl does not seem useful here. Its documentation says:

Control characters are formatting and other non-printing characters, such as ACK, BEL, CR, FF, LF, and VT. Unicode standard assigns code points from \U0000 to \U001F, \U007F, and from \U0080 to \U009F to control characters. These values are to be interpreted as control characters unless their use is otherwise defined by an application.

In other words, it treats LF and CR as control characters, among others. That makes it unsuitable on its own, since text files contain at least CR and LF.
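
A small check makes this concrete. IsControlDemo and Describe are hypothetical names for this illustration; the only library call involved is char.IsControl.

```csharp
using System;

// Hypothetical helper: reports how char.IsControl classifies a character.
public static class IsControlDemo
{
    public static string Describe(char ch)
        => $"U+{(int)ch:X4}: IsControl={char.IsControl(ch)}";
}
```

Calling Describe on '\r', '\n' and '\t' reports IsControl=True for all three, while 'A' reports IsControl=False, so a check built on Char.IsControl alone would flag nearly every text file.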

Solution

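// Required namespaces for the code below:
// using System.Collections.Generic;
// using System.IO;
// using System.Linq;
// using System.Windows.Forms; // Clipboard (needs an STA thread)

// Test harness: classifies every file under folderPath and copies the report to the clipboard.
static void testBinaryFile(string folderPath)
{
    List<string> output = new List<string>();
    foreach (string filePath in getFiles(folderPath, true))
    {
        output.Add(isBinary(filePath).ToString() + "  ----  " + filePath);
    }
    Clipboard.SetText(string.Join("\n", output), TextDataFormat.Text);
}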

public static List<string> getFiles(string path, bool recursive = false)
{
    return Directory.Exists(path) ?
        Directory.GetFiles(path, "*.*",
        recursive ? SearchOption.AllDirectories : SearchOption.TopDirectoryOnly).ToList() :
        new List<string>();
}    

public static bool isBinary(string path)
{
    // An empty file is treated as text.
    long length = new FileInfo(path).Length;
    if (length == 0) return false;

    using (StreamReader stream = new StreamReader(path))
    {
        int ch;
        while ((ch = stream.Read()) != -1)
        {
            if (isControlChar(ch))
            {
                return true;
            }
        }
    }
    return false;
}

// Flags only code points 1-7 and 14-25. NUL, BS, TAB, LF, VT, FF, CR,
// and everything from SUB (26) upward are deliberately not flagged.
public static bool isControlChar(int ch)
{
    return (ch > Chars.NUL && ch < Chars.BS)
        || (ch > Chars.CR && ch < Chars.SUB);
}

public static class Chars
{
    public const char NUL = (char)0; // Null char
    public const char BS = (char)8; // Back Space
    public const char CR = (char)13; // Carriage Return
    public const char SUB = (char)26; // Substitute
}

If you try the above solution, let me know whether or not it works for you.

Solution 4:

While this isn't foolproof, it checks whether the content contains anything that looks binary.

// Requires: using System.Linq;
public bool HasBinaryContent(string content)
{
    // char.IsControl also returns true for tab, so it is excluded along with CR and LF.
    return content.Any(ch => char.IsControl(ch) && ch != '\r' && ch != '\n' && ch != '\t');
}

If any control character exists (aside from the standard \r, \n, and tab), then it is probably not a text file.