Are Java and C# regular expressions compatible?

Both languages claim to use Perl-style regular expressions. If I validate a regular expression in one language, will it also work in the other? Where do the two regular expression syntaxes differ?

The use case here is a C# (.NET) UI talking to an eventual Java back end implementation that will use the regex to match data.

Note that I only need to worry about matching, not about extracting portions of the matched data.


There are quite a lot of differences.

Character Class

  1. Character class subtraction [abc-[cde]]
    • .NET YES (2.0)
    • Java: Emulated via character class intersection and negation: [abc&&[^cde]]
  2. Character class intersection [abc&&[cde]]
    • .NET: Emulated via character class subtraction and negation: [abc-[^cde]]
    • Java YES
  3. \p{Alpha} POSIX character class
    • .NET NO
    • Java YES (US-ASCII)
  4. In (?x) mode (COMMENTS in Java, IgnorePatternWhitespace in .NET), a space (U+0020) inside a character class is significant.
    • .NET YES
    • Java NO
  5. Unicode Category (L, M, N, P, S, Z, C)
    • .NET YES: \p{L} form only
    • Java YES:
      • From Java 5: \pL, \p{L}, \p{IsL}
      • From Java 7: \p{general_category=L}, \p{gc=L}
  6. Unicode Category (Lu, Ll, Lt, ...)
    • .NET YES: \p{Lu} form only
    • Java YES:
      • From Java 5: \p{Lu}, \p{IsLu}
      • From Java 7: \p{general_category=Lu}, \p{gc=Lu}
  7. Unicode Block
    • .NET YES: \p{IsBasicLatin} only. (Supported Named Blocks)
    • Java YES: (name of the block is free-casing)
      • From Java 5: \p{InBasicLatin}
      • From Java 7: \p{block=BasicLatin}, \p{blk=BasicLatin}
  8. Spaces and underscores allowed in all long block names (e.g. BasicLatin can be written as Basic_Latin or Basic Latin)
    • .NET NO
    • Java YES (Java 5)
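
As a rough illustration of the Java side of the list above, here is a minimal sketch. The class name CharClassDemo and its matches helper are made up for this example; the patterns themselves are the forms listed above.

```java
import java.util.regex.Pattern;

public class CharClassDemo {
    // Java spelling of the .NET subtraction [abc-[cde]]:
    // intersect [abc] with the negation of [cde].
    static final Pattern SUBTRACTION = Pattern.compile("[abc&&[^cde]]");

    // POSIX class; by default it covers US-ASCII letters only.
    static final Pattern ALPHA = Pattern.compile("\\p{Alpha}+");

    // Unicode general category and Unicode block, in the Java 5 forms.
    static final Pattern LETTER = Pattern.compile("\\p{L}+");
    static final Pattern BASIC_LATIN = Pattern.compile("\\p{InBasicLatin}+");

    // Hypothetical helper: true when the whole input matches the pattern.
    static boolean matches(Pattern p, String input) {
        return p.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches(SUBTRACTION, "a"));      // in [abc], not in [cde]
        System.out.println(matches(SUBTRACTION, "c"));      // removed by the negated set
        System.out.println(matches(ALPHA, "\u00e9"));       // e-acute is not US-ASCII
        System.out.println(matches(LETTER, "\u00e9"));      // but it is a Unicode letter
        System.out.println(matches(BASIC_LATIN, "\u00e9")); // and lies outside Basic Latin
    }
}
```

If the same pattern has to run in both engines, \p{L} and \p{Lu} look like the safest spellings, since the list above shows both flavors accepting that form.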

Quantifier

  1. ?+, *+, ++ and {m,n}+ (possessive quantifiers)
    • .NET NO
    • Java YES
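
To show what a possessive quantifier changes, here is a small sketch (PossessiveDemo is a made-up name). The atomic group (?>...) is included because, as far as I know, both Java and .NET accept it, so it can stand in for a possessive quantifier when a pattern has to be shared.

```java
import java.util.regex.Pattern;

public class PossessiveDemo {
    // Greedy: a* first takes every 'a', then backtracks one so the final 'a' can match.
    static boolean greedy(String s) {
        return Pattern.matches("a*a", s);
    }

    // Possessive (Java only): a*+ takes every 'a' and never gives one back.
    static boolean possessive(String s) {
        return Pattern.matches("a*+a", s);
    }

    // Atomic group: same no-backtracking behaviour, written in a form
    // that does not rely on the possessive quantifier syntax.
    static boolean atomic(String s) {
        return Pattern.matches("(?>a*)a", s);
    }

    public static void main(String[] args) {
        System.out.println(greedy("aaa"));     // true
        System.out.println(possessive("aaa")); // false
        System.out.println(atomic("aaa"));     // false
    }
}
```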

Quotation

  1. \Q...\E escapes a string of metacharacters
    • .NET NO
    • Java YES
  2. \Q...\E escapes a string of character class metacharacters (in character sets)
    • .NET NO
    • Java YES
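
Since \Q...\E is Java only, a pattern meant for both engines probably has to escape metacharacters one by one. The sketch below contrasts the two approaches; QuoteDemo and escapePortably are hypothetical names, and the set of escaped characters is an assumption that covers the usual metacharacters outside a character class.

```java
import java.util.regex.Pattern;

public class QuoteDemo {
    // Hypothetical helper: put a backslash in front of each common metacharacter
    // instead of wrapping the text in \Q...\E.
    static String escapePortably(String literal) {
        StringBuilder sb = new StringBuilder();
        for (char c : literal.toCharArray()) {
            if ("\\.[]{}()*+?^$|".indexOf(c) >= 0) {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "1+1=2";
        // Unescaped, the + is a quantifier, so the literal text does not match.
        System.out.println(Pattern.matches("1+1=2", text));              // false
        // Java-only quoting.
        System.out.println(Pattern.matches("\\Q1+1=2\\E", text));        // true
        // Per-character escaping.
        System.out.println(escapePortably(text));                        // 1\+1=2
        System.out.println(Pattern.matches(escapePortably(text), text)); // true
    }
}
```

Note that Pattern.quote produces the \Q...\E form, so its output is not something to hand to .NET.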

Matching construct

  1. Conditional matching (?(?=regex)then|else), (?(regex)then|else), (?(1)then|else) or (?(group)then|else)
    • .NET YES
    • Java NO
  2. Named capturing group and named backreference
    • .NET YES:
      • Capturing group: (?<name>regex) or (?'name'regex)
      • Backreference: \k<name> or \k'name'
    • Java YES (Java 7):
      • Capturing group: (?<name>regex)
      • Backreference: \k<name>
  3. Multiple capturing groups can have the same name
    • .NET YES
    • Java NO (Java 7)
  4. Balancing group definition (?<name1-name2>regex) or (?'name1-name2'subexpression)
    • .NET YES
    • Java NO
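
A quick way to see these differences from the Java side is to try compiling each construct. This is only a sketch: NamedGroupDemo and compiles are made-up names, and it assumes Java 7 or later for the named group syntax.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class NamedGroupDemo {
    // Named group plus named backreference, in the spelling shared with .NET.
    static final Pattern DOUBLED = Pattern.compile("(?<ch>.)\\k<ch>");

    // Hypothetical helper: true when Java accepts the pattern.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(DOUBLED.matcher("aa").matches()); // true, same character twice
        System.out.println(DOUBLED.matcher("ab").matches()); // false

        // .NET-only constructs from the list above; Java rejects each of them.
        System.out.println(compiles("(?'name'x)"));       // quote-delimited group name
        System.out.println(compiles("(?<a>x)|(?<a>y)"));  // duplicate group name
        System.out.println(compiles("(a)?(?(1)b|c)"));    // conditional
        System.out.println(compiles("(?<open-close>x)")); // balancing group
    }
}
```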

Assertions

  1. (?<=text) (positive lookbehind)
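    • .NET Variable-width
    • Java Bounded width (the lookbehind must have an obvious maximum length)
  2. (?<!text) (negative lookbehind)
    • .NET Variable-width
    • Java Bounded width (the lookbehind must have an obvious maximum length)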
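
The sketch below shows lookbehinds with a bounded quantifier, which Java accepts. LookbehindDemo and tryCompile are made-up names. For the unbounded pattern at the end I deliberately print the outcome instead of stating it, because I am not certain every JDK version rejects every unbounded lookbehind.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class LookbehindDemo {
    // b{0,3} has a known maximum length, so Java can work out how far to look back.
    static final Pattern AFTER = Pattern.compile("(?<=ab{0,3})c");
    static final Pattern NOT_AFTER = Pattern.compile("(?<!ab{0,3})c");

    // Hypothetical helper: report whether Java accepts the pattern.
    static String tryCompile(String regex) {
        try {
            Pattern.compile(regex);
            return "accepted";
        } catch (PatternSyntaxException e) {
            return "rejected: " + e.getDescription();
        }
    }

    public static void main(String[] args) {
        System.out.println(AFTER.matcher("abbc").find());    // true, c is preceded by abb
        System.out.println(AFTER.matcher("xc").find());      // false
        System.out.println(NOT_AFTER.matcher("xc").find());  // true
        System.out.println(NOT_AFTER.matcher("abc").find()); // false, c is preceded by ab

        // Unbounded repetition inside a lookbehind: fine in .NET, outcome printed for Java.
        System.out.println(tryCompile("(?<=(?:ab)*)c"));
    }
}
```

For a pattern shared between the two engines, keeping every quantifier inside a lookbehind bounded seems the safest choice.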

Mode Options/Flags

  1. ExplicitCapture option (?n)
    • .NET YES
    • Java NO

Miscellaneous

  1. (?#comment) inline comments
    • .NET YES
    • Java NO
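
For the use case in the question, one option is to have the Java side check that a pattern compiles before relying on it. This is a sketch under that assumption; PortabilityCheck and compilesInJava are hypothetical names. It also shows that (?x) with # comments can replace (?#comment) in Java.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class PortabilityCheck {
    // Hypothetical helper: true when Java's Pattern accepts the regex.
    static boolean compilesInJava(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // .NET-only syntax from the two lists above.
        System.out.println(compilesInJava("(?#digits)\\d+")); // inline comment
        System.out.println(compilesInJava("(?n)(\\d+)"));     // ExplicitCapture flag

        // Comments mode: whitespace is ignored and # starts a comment.
        String commented = "(?x) \\d+ # digits";
        System.out.println(compilesInJava(commented));
        System.out.println(Pattern.matches(commented, "123"));
    }
}
```

Compiling only proves the syntax is accepted. It does not prove that both engines match the same strings, so patterns using features from the lists above still deserve a few test inputs on each side.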

References

  • regular-expressions.info - Comparison of Different Regex Flavors
  • MSDN Library Reference - .NET Framework 4.5 - Regular Expression Language
  • Pattern (Java Platform SE 7)

Check out http://www.regular-expressions.info/refflavors.html. There is plenty of regex info on that site, and there's a nice chart that details the differences between Java and .NET.


C# regex has its own convention for named groups, (?<name>). I don't know of any other differences.