What is the difference between re.search and re.match?
re.match is anchored at the beginning of the string. That has nothing to do with newlines, so it is not the same as using ^ in the pattern.
As the re.match documentation says:
If zero or more characters at the beginning of string match the regular expression pattern, return a corresponding MatchObject instance. Return None if the string does not match the pattern; note that this is different from a zero-length match.
Note: If you want to locate a match anywhere in string, use search() instead.
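The distinction between a zero-length match and no match is easy to miss, so here is a minimal sketch of it. The patterns and the sample string are made up for illustration and are not from the documentation.

```python
import re

# 'x*' may match zero characters, so it succeeds at position 0 with an empty match.
empty = re.match(r'x*', 'abc')

# 'x+' needs at least one 'x' at the start of the string, so match returns None.
missing = re.match(r'x+', 'abc')

print(empty is not None, repr(empty.group()), empty.span())
print(missing)
```

A zero-length match is still a match object, so test the result with "is not None" when the pattern is allowed to match the empty string.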
re.search searches the entire string, as the documentation says:
Scan through string looking for a location where the regular expression pattern produces a match, and return a corresponding MatchObject instance. Return None if no position in the string matches the pattern; note that this is different from finding a zero-length match at some point in the string.
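So if you need to match at the beginning of the string, or to match the entire string, use match. It can be faster, since it gives up as soon as the pattern fails at the start instead of scanning the rest of the string. Otherwise use search.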
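Note that match on its own only pins the start of the string; to match the entire string, the end has to be pinned as well. A small sketch of two ways to do that, assuming Python 3.4 or later for re.fullmatch (which the text above does not mention); the sample strings are invented:

```python
import re

text = '123abc'

# match only requires the pattern to fit at the start of the string.
prefix = re.match(r'\d+', text)

# Adding \Z forces the pattern to run to the very end of the string.
whole_anchored = re.match(r'\d+\Z', text)

# re.fullmatch (Python 3.4+) requires the whole string to match without extra anchors.
whole_full = re.fullmatch(r'\d+', text)
digits_only = re.fullmatch(r'\d+', '123')

print(prefix.group())      # the leading digits
print(whole_anchored)      # None: 'abc' is left over
print(whole_full)          # None for the same reason
print(digits_only.group())
```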
The documentation has a specific section for match vs. search that also covers multiline strings:
Python offers two different primitive operations based on regular expressions: match checks for a match only at the beginning of the string, while search checks for a match anywhere in the string (this is what Perl does by default).
Note that match may differ from search even when using a regular expression beginning with '^': '^' matches only at the start of the string, or in MULTILINE mode also immediately following a newline. The “match” operation succeeds only if the pattern matches at the start of the string regardless of mode, or at the starting position given by the optional pos argument regardless of whether a newline precedes it.
Now, enough talk. Time to see some example code:
# example code:
import re

string_with_newlines = """something
someotherthing"""

print(re.match('some', string_with_newlines))  # matches
print(re.match('someother', string_with_newlines))  # won't match
print(re.match('^someother', string_with_newlines, re.MULTILINE))  # also won't match
print(re.search('someother', string_with_newlines))  # finds something
print(re.search('^someother', string_with_newlines, re.MULTILINE))  # also finds something

m = re.compile('thing$', re.MULTILINE)
print(m.match(string_with_newlines))  # no match
print(m.match(string_with_newlines, pos=4))  # matches
# The flags were already given to re.compile; the second argument of a
# compiled pattern's search() is pos, not flags, so it is omitted here.
print(m.search(string_with_newlines))  # also matches
search ⇒ find something anywhere in the string and return a match object.
match ⇒ find something at the beginning of the string and return a match object.
match is much faster than search, so instead of doing regex.search("word") you can do regex.match((.*?)word(.*?)) and gain tons of performance if you are working with millions of samples.
This comment from @ivan_bilan under the accepted answer above got me wondering whether such a hack actually speeds anything up, so let's find out how many tons of performance you will really gain.
I prepared the following test suite:
import random
import re
import string
import time

LENGTH = 10
LIST_SIZE = 1000000

def generate_word():
    word = [random.choice(string.ascii_lowercase) for _ in range(LENGTH)]
    word = ''.join(word)
    return word

wordlist = [generate_word() for _ in range(LIST_SIZE)]

start = time.time()
[re.search('python', word) for word in wordlist]
print('search:', time.time() - start)

start = time.time()
[re.match('(.*?)python(.*?)', word) for word in wordlist]
print('match:', time.time() - start)
I made 10 measurements (1M, 2M, ..., 10M words) which gave me the following plot:
The resulting lines are surprisingly (actually not that surprisingly) straight. And the search function is (slightly) faster given this specific pattern combination. The moral of this test: Avoid overoptimizing your code.
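If you want to repeat the comparison yourself, a variant using timeit may give steadier numbers than a single time.time() measurement. This is only a sketch under a few assumptions that are not part of the test above: the patterns are precompiled, the list is much smaller, the seed is fixed, and a few words containing "python" are planted so that both calls can be checked to agree. The names run_search and run_match are made up for this example, and no particular timing result is implied.

```python
import random
import re
import string
import timeit

random.seed(0)  # fixed seed so repeated runs use the same words

# Smaller list than in the test above, to keep the run short.
WORDS = [''.join(random.choices(string.ascii_lowercase, k=10)) for _ in range(10000)]
# Plant some hits (assumption for this sketch): every 100th word contains 'python'.
for i in range(0, len(WORDS), 100):
    WORDS[i] = 'xxpythonxx'

SEARCH = re.compile('python')
MATCH = re.compile('(.*?)python(.*?)')

def run_search():
    return [SEARCH.search(word) is not None for word in WORDS]

def run_match():
    return [MATCH.match(word) is not None for word in WORDS]

# Both approaches must agree on which words contain the pattern.
assert run_search() == run_match()

# Take the best of several repeats to reduce noise from other processes.
search_time = min(timeit.repeat(run_search, number=5, repeat=3))
match_time = min(timeit.repeat(run_match, number=5, repeat=3))
print('search:', search_time)
print('match: ', match_time)
```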
re.search searches for the pattern throughout the string, whereas re.match does not search for the pattern at all; it only tries to match it at the start of the string, and returns None if it does not match there.
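One way to see this is to compare match with a search whose pattern is pinned to the start of the string with \A. The sketch below is an illustration only: the sample text and the three patterns are chosen for this example, and the \A rewrite is shown for these simple patterns rather than claimed as a general replacement.

```python
import re

text = 'something\nsomeotherthing'

results = {}
for pattern in ('some', 'someother', 'thing'):
    # match: try the pattern at the start of the string only.
    at_start = re.match(pattern, text) is not None
    # search with \A: the anchor rules out every position except the start.
    anchored = re.search(r'\A(?:%s)' % pattern, text) is not None
    # plain search: try every position in the string.
    anywhere = re.search(pattern, text) is not None
    results[pattern] = (at_start, anchored, anywhere)
    print(pattern, results[pattern])
```

For each pattern the first two values agree, while the plain search also finds 'someother' and 'thing' later in the string.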