Regex to strip line comments from C#

Solution 1:

Both of your regular expressions (for block and line comments) have bugs. If you want I can describe the bugs, but I felt it’s perhaps more productive if I write new ones, especially because I’m intending to write a single one that matches both.

The thing is, every time you have /* and // and literal strings “interfering” with each other, it is always the one that starts first that takes precedence. That’s very convenient because that’s exactly how regular expressions work: find the first match first.

So let’s define a regular expression that matches each of those four tokens:

var blockComments = @"/\*(.*?)\*/";
var lineComments = @"//(.*?)\r?\n";
var strings = @"""((\\[^\n]|[^""\n])*)""";
var verbatimStrings = @"@(""[^""]*"")+";

To answer the question in the title (strip comments), we need to:

  • Replace the block comments with nothing
  • Replace the line comments with a newline (because the regex eats the newline)
  • Keep the literal strings where they are.

Regex.Replace can do this easily using a MatchEvaluator function:

string noComments = Regex.Replace(input,
    blockComments + "|" + lineComments + "|" + strings + "|" + verbatimStrings,
    me => {
        if (me.Value.StartsWith("/*") || me.Value.StartsWith("//"))
            return me.Value.StartsWith("//") ? Environment.NewLine : "";
        // Keep the literal strings
        return me.Value;
    },
    RegexOptions.Singleline);

I ran this code on all the examples that Holystream provided and various other cases that I could think of, and it works like a charm. If you can provide an example where it fails, I am happy to adjust the code for you.

Solution 2:

You could tokenize the code with an expression like:

@(?:"[^"]*")+|"(?:[^"\n\\]+|\\.)*"|'(?:[^'\n\\]+|\\.)*'|//.*|/\*(?s:.*?)\*/

It would also match some invalid escapes/structures (eg. 'foo'), but will probably match all valid tokens of interest (unless I forgot something), thus working well for valid code.

Using it in a replace and capturing the parts you want to keep will give you the desired result. I.e:

static string StripComments(string code)
{
    var re = @"(@(?:""[^""]*"")+|""(?:[^""\n\\]+|\\.)*""|'(?:[^'\n\\]+|\\.)*')|//.*|/\*(?s:.*?)\*/";
    return Regex.Replace(code, re, "$1");
}

Example app:

using System;
using System.Text.RegularExpressions;

namespace Regex01
{
    class Program
    {
        static string StripComments(string code)
        {
            var re = @"(@(?:""[^""]*"")+|""(?:[^""\n\\]+|\\.)*""|'(?:[^'\n\\]+|\\.)*')|//.*|/\*(?s:.*?)\*/";
            return Regex.Replace(code, re, "$1");
        }

        static void Main(string[] args)
        {
            var input = "hello /* world */ oh \" '\\\" // ha/*i*/\" and // bai";
            Console.WriteLine(input);

            var noComments = StripComments(input);
            Console.WriteLine(noComments);
        }
    }
}

Output:

hello /* world */ oh " '\" // ha/*i*/" and // bai
hello  oh " '\" // ha/*i*/" and