How to split long regular expression rules across multiple lines in Python

Is this actually doable? I have some very long regex patterns that are hard to understand because they don't fit on the screen at once. Example:

test = re.compile('(?P<full_path>.+):\d+:\s+warning:\s+Member\s+(?P<member_name>.+)\s+\((?P<member_type>%s)\) of (class|group|namespace)\s+(?P<class_name>.+)\s+is not documented' % (self.__MEMBER_TYPES), re.IGNORECASE)

Backslash or triple quotes won't work.

EDIT: I ended up using VERBOSE mode. Here's how the regex pattern looks now:

test = re.compile('''
  (?P<full_path>                                  # Capture a group called full_path
    .+                                            #   It consists of one or more characters of any type
  )                                               # Group ends                      
  :                                               # A literal colon
  \d+                                             # One or more numbers (line number)
  :                                               # A literal colon
  \s+warning:\s+parameters\sof\smember\s+         # An almost static string
  (?P<member_name>                                # Capture a group called member_name
    [                                             #   
      ^:                                          #   Match anything but a colon (so finding a colon ends group)
    ]+                                            #   Match one or more characters
   )                                              # Group ends
   (                                              # Start an unnamed group 
     ::                                           #   Two literal colons
     (?P<function_name>                           #   Start another group called function_name
       \w+                                        #     It consists of one or more alphanumeric characters
     )                                            #   End group
   )*                                             # This group is entirely optional and does not apply to C
   \s+are\snot\s\(all\)\sdocumented''',           # And line ends with an almost static string
   re.IGNORECASE|re.VERBOSE)                      # Let's not worry about case, because it seems to differ between Doxygen versions

Solution 1:

You can split your regex pattern by quoting each segment. No backslashes needed.

test = re.compile(('(?P<full_path>.+):\d+:\s+warning:\s+Member'
                   '\s+(?P<member_name>.+)\s+\((?P<member_type>%s)\) '
                   'of (class|group|namespace)\s+(?P<class_name>.+)'
                   '\s+is not documented') % (self.__MEMBER_TYPES), re.IGNORECASE)

You can also use the raw string prefix 'r', but you'll have to put it before each segment.

See the docs.
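
As a minimal sketch of the raw-string variant: the pattern below is a shortened version of the one in the question, and the sample log line is made up for illustration.

```python
import re

# Each segment carries its own r prefix; adjacent literals are joined
# into a single pattern string before re.compile is called.
pattern = re.compile(
    r'(?P<full_path>.+):\d+:\s+warning:\s+Member\s+'
    r'(?P<member_name>.+)\s+is not documented',
    re.IGNORECASE,
)

# Hypothetical input line, for illustration only.
line = 'src/foo.h:42: warning: Member bar is not documented'
match = pattern.match(line)
print(match.group('full_path'), match.group('member_name'))
```

If one segment is missing its r prefix, only that segment has its backslash escapes processed by Python, which is easy to overlook in a long pattern.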

Solution 2:

From http://docs.python.org/reference/lexical_analysis.html#string-literal-concatenation:

Multiple adjacent string literals (delimited by whitespace), possibly using different quoting conventions, are allowed, and their meaning is the same as their concatenation. Thus, "hello" 'world' is equivalent to "helloworld". This feature can be used to reduce the number of backslashes needed, to split long strings conveniently across long lines, or even to add comments to parts of strings, for example:

re.compile("[A-Za-z_]"       # letter or underscore
           "[A-Za-z0-9_]*"   # letter, digit or underscore
          )

Note that this feature is defined at the syntactical level, but implemented at compile time. The ‘+’ operator must be used to concatenate string expressions at run time. Also note that literal concatenation can use different quoting styles for each component (even mixing raw strings and triple quoted strings).
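
A small sketch of the last two points in that quote, mixing raw and ordinary literals and falling back to + for run-time values. The variable names and the sample input are my own, not from the docs.

```python
import re

# A raw literal and an ordinary literal side by side: only the ordinary
# one has its escapes processed by Python ("\t" becomes a real tab).
joined = r"\d+" "\t" r"\w+"

number_then_word = re.compile(joined)
print(number_then_word.fullmatch("42\tanswer") is not None)

# Implicit concatenation only works between literals. Once a piece is
# held in a variable, the + operator is needed.
prefix = r"\d+"
joined_at_runtime = prefix + "\t" + r"\w+"
print(joined_at_runtime == joined)
```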

Solution 3:

Just for completeness, the missing answer here is the re.X (or re.VERBOSE) flag, which the OP eventually pointed out. Besides saving quotes, this method is also portable to other regex implementations such as Perl.

From https://docs.python.org/2/library/re.html#re.X:

re.X
re.VERBOSE

This flag allows you to write regular expressions that look nicer and are more readable by allowing you to visually separate logical sections of the pattern and add comments. Whitespace within the pattern is ignored, except when in a character class or when preceded by an unescaped backslash. When a line contains a # that is not in a character class and is not preceded by an unescaped backslash, all characters from the leftmost such # through the end of the line are ignored.

This means that the two following regular expression objects that match a decimal number are functionally equal:

a = re.compile(r"""\d +  # the integral part
                   \.    # the decimal point
                   \d *  # some fractional digits""", re.X)

 

b = re.compile(r"\d+\.\d*")
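
To convince yourself that the two objects behave the same, a quick check along these lines should do. The sample strings are arbitrary picks of mine.

```python
import re

a = re.compile(r"""\d +  # the integral part
                   \.    # the decimal point
                   \d *  # some fractional digits""", re.X)
b = re.compile(r"\d+\.\d*")

# Arbitrary sample inputs: two that should match, two that should not.
samples = ["3.14", "10.", ".5", "abc"]
results = [(bool(a.fullmatch(s)), bool(b.fullmatch(s))) for s in samples]

for text, (verbose_hit, compact_hit) in zip(samples, results):
    print(text, verbose_hit, compact_hit)
```

The pattern strings themselves differ (a.pattern still contains the whitespace and comments); only the matching behaviour is the same.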

Solution 4:

Personally, I don't use re.VERBOSE because I don't like having to escape blank spaces, and I don't want to write '\s' in place of a blank space when '\s' isn't required.
The more precisely the symbols in a regex pattern describe the character sequences to be matched, the faster the regex object runs. I almost never use '\s'.
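
To illustrate what escaping blank spaces means in verbose mode, here is a minimal sketch. The phrase is taken from the end of the question's pattern; the variable names are mine.

```python
import re

# Under re.VERBOSE a bare space is discarded, so this pattern
# really means "isnotdocumented".
bare = re.compile(r"is not documented", re.VERBOSE)

# Three ways to keep the space significant in verbose mode.
escaped = re.compile(r"is\ not\ documented", re.VERBOSE)
in_class = re.compile(r"is[ ]not[ ]documented", re.VERBOSE)
with_s = re.compile(r"is\snot\sdocumented", re.VERBOSE)

text = "is not documented"  # hypothetical input
print(bare.fullmatch(text) is None)
print(all(p.fullmatch(text) for p in (escaped, in_class, with_s)))

# \s is broader than a space: it also accepts a tab here,
# while the escaped space does not.
tabbed = "is\tnot\tdocumented"
print(with_s.fullmatch(tabbed) is not None, escaped.fullmatch(tabbed) is not None)
```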

To avoid re.VERBOSE, you can do what has already been suggested:

test = re.compile(
'(?P<full_path>.+)'
':\d+:\s+warning:\s+Member\s+' # comment
'(?P<member_name>.+)'
'\s+\('
'(?P<member_type>%s)' # comment
'\) of '
'(class|group|namespace)'
#      ^^^^^^ underlining something to point out
'\s+'
'(?P<class_name>.+)'
#      vvv overlining something important too
'\s+is not documented'
% (self.__MEMBER_TYPES),

re.IGNORECASE)

Pushing the strings to the left gives a lot of space to write comments.

But this approach is less convenient when the pattern is very long, because it isn't possible to write

test = re.compile(
'(?P<full_path>.+)'
':\d+:\s+warning:\s+Member\s+' # comment
'(?P<member_name>.+)'
'\s+\('
'(?P<member_type>%s)' % (self.__MEMBER_TYPES)  # !!!!!! INCORRECT SYNTAX !!!!!!!
'\) of '
'(class|group|namespace)'
#      ^^^^^^ underlining something to point out
'\s+'
'(?P<class_name>.+)'
#      vvv overlining something important too
'\s+is not documented',

re.IGNORECASE)

So when the pattern is very long, the number of lines between
the % (self.__MEMBER_TYPES) part at the end
and the string '(?P<member_type>%s)' to which it applies
can be large, and the pattern becomes harder to read.
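
The reason the % has to go at the very end is that adjacent literals are joined first, and only then is the % operator applied to the whole joined string. A short sketch, where member_types is a made-up stand-in for self.__MEMBER_TYPES and the pattern is cut down to three pieces:

```python
# Hypothetical stand-in for self.__MEMBER_TYPES from the question.
member_types = 'function|variable'

pattern = (
    '(?P<member_type>%s)'   # the placeholder lives here ...
    r'\) of '
    r'\s+is not documented'
    % member_types          # ... but is only filled in down here
)
print(pattern)
```

With three pieces the gap is harmless; with thirty, the placeholder and its value end up far apart.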

That's why I like to use a tuple to write a very long pattern:

pat = ''.join((
'(?P<full_path>.+)',
# you can put a comment here, you see: a very very very long comment
':\d+:\s+warning:\s+Member\s+',
'(?P<member_name>.+)',
'\s+\(',
'(?P<member_type>%s)' % (self.__MEMBER_TYPES), # comment here
'\) of ',
# comment here
'(class|group|namespace)',
#       ^^^^^^ underlining something to point out
'\s+',
'(?P<class_name>.+)',
#      vvv overlining something important too
'\s+is not documented'))

This approach also lets you define the pattern as a function:

def pat(x):

    return ''.join((
'(?P<full_path>.+)',
# you can put a comment here, you see: a very very very long comment
':\d+:\s+warning:\s+Member\s+',
'(?P<member_name>.+)',
'\s+\(',
'(?P<member_type>%s)' % x , # comment here
'\) of ',
# comment here
'(class|group|namespace)',
#       ^^^^^^ underlining something to point out
'\s+',
'(?P<class_name>.+)',
#      vvv overlining something important too
'\s+is not documented'))

test = re.compile(pat(self.__MEMBER_TYPES), re.IGNORECASE)
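
For anyone who wants to try it outside a class, here is a self-contained sketch of the same idea. The name build_pattern, the MEMBER_TYPES value and the sample log line are all invented for the example; I also used raw strings for the pieces.

```python
import re

def build_pattern(member_types):
    """Join the pattern pieces; member_types is an alternation such as 'a|b'."""
    return ''.join((
        r'(?P<full_path>.+)',
        r':\d+:\s+warning:\s+Member\s+',
        r'(?P<member_name>.+)',
        r'\s+\(',
        r'(?P<member_type>%s)' % member_types,  # formatted right where it is used
        r'\) of ',
        r'(class|group|namespace)',
        r'\s+',
        r'(?P<class_name>.+)',
        r'\s+is not documented',
    ))

# Hypothetical values standing in for self.__MEMBER_TYPES and a real log line.
MEMBER_TYPES = 'function|variable'
test = re.compile(build_pattern(MEMBER_TYPES), re.IGNORECASE)

line = 'src/foo.h:12: warning: Member bar (function) of class Foo is not documented'
m = test.match(line)
print(m.groupdict())
```

Since each piece is an ordinary tuple element, any of them can be formatted, computed or swapped out individually without touching the rest.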