Difference between '.', '?' and '*' in regular expressions?
Could I get an example as to how these three elements (are these called metacharacters?) differ?
I know that * means all or nothing, but I am not sure if it's the right way to think about it. On the other hand, . and ? seem the same. They match one character, right?
You may be confusing regular expressions with shell globs.
In regular expression syntax, . represents any single character (usually excluding the newline character), while * is a quantifier meaning zero or more of the preceding regex atom (character or group). The ? is a quantifier meaning zero or one instance of the preceding atom or, in regex variants that support it, a modifier that makes the preceding quantifier non-greedy.
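To make the three meanings concrete, here is a minimal sketch using Python's re module (other regex flavors behave the same way for these simple patterns). The candidate strings are made up for illustration.

```python
import re

candidates = ["ac", "abc", "abbc", "axc"]
patterns = [r"a.c", r"ab*c", r"ab?c"]

# fullmatch requires the whole string to match, so the results are easy to compare.
results = {
    p: [s for s in candidates if re.fullmatch(p, s)]
    for p in patterns
}

for p, matched in results.items():
    print(p, matched)
# a.c  ['abc', 'axc']          '.' consumes exactly one character
# ab*c ['ac', 'abc', 'abbc']   'b' may appear any number of times
# ab?c ['ac', 'abc']           'b' may appear at most once
```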
In shell globs, ? represents a single character (like the regex .), while * represents a sequence of zero or more characters (equivalent to the regex .*).
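The glob/regex correspondence can be tried out without a shell. The sketch below assumes Python's fnmatch module as a stand-in for shell globbing (it implements shell-style wildcards but is not the shell itself), and the file names are invented for the example.

```python
import fnmatch
import re

names = ["a.txt", "ab.txt", "abc.txt", "notes.md"]

# Glob '?' stands for exactly one character, like regex '.'.
glob_one = [n for n in names if fnmatch.fnmatchcase(n, "a?.txt")]
regex_one = [n for n in names if re.fullmatch(r"a.\.txt", n)]

# Glob '*' stands for any run of characters, like regex '.*'.
glob_any = [n for n in names if fnmatch.fnmatchcase(n, "a*.txt")]
regex_any = [n for n in names if re.fullmatch(r"a.*\.txt", n)]

print(glob_one, regex_one)  # ['ab.txt'] ['ab.txt']
print(glob_any, regex_any)  # both: ['a.txt', 'ab.txt', 'abc.txt']
```

Note that in the regex versions the literal dot before txt has to be escaped, because an unescaped dot would match any character.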
A couple of references you may find helpful are http://www.regular-expressions.info/quickstart.html and http://mywiki.wooledge.org/glob
Taken straight from Wikipedia:
? The question mark indicates zero or one occurrences of the preceding element. For example, colou?r matches both "color" and "colour".
* The asterisk indicates zero or more occurrences of the preceding element. For example, ab*c matches "ac", "abc", "abbc", "abbbc", and so on.
The big difference is that asterisk matches zero or more occurrences, while question mark matches zero or one occurrence. Compare these two examples:
$ printf "colour\ncolor\ncolouur\n" | egrep 'colou?r'
colour
color
$ printf "colour\ncolor\ncolouur\n" | egrep 'colou*r'
colour
color
colouur
Because the letter u (the element preceding the quantifier ?) occurs more than once in colouur, that line is not matched with ?, but it is matched with *.
Similar example:
$ printf "error\neror\ner\n" | egrep 'er?or'
eror
$ printf "error\neror\ner\n" | egrep 'er*or'
error
eror
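The same comparisons can be reproduced in Python. This is only a rough sketch: the grep helper below is a hypothetical stand-in that keeps lines containing a match, and Python's re flavor is assumed to agree with egrep for these simple patterns.

```python
import re

def grep(pattern, lines):
    """Hypothetical helper: keep the lines that contain a match anywhere."""
    return [line for line in lines if re.search(pattern, line)]

words = ["colour", "color", "colouur"]
print(grep(r"colou?r", words))  # ['colour', 'color']
print(grep(r"colou*r", words))  # ['colour', 'color', 'colouur']

errs = ["error", "eror", "er"]
print(grep(r"er?or", errs))     # ['eror']
print(grep(r"er*or", errs))     # ['error', 'eror']
```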
From the same Wikipedia page:
. Matches any single character (many applications exclude newlines, and exactly which characters are considered newlines is flavor-, character-encoding-, and platform-specific, but it is safe to assume that the line feed character is included). Within POSIX bracket expressions, the dot character matches a literal dot. For example, a.c matches "abc", etc., but [a.c] matches only "a", ".", or "c".
In our example,
$ printf "colour\ncolor\ncolouur\n" | egrep 'colo.r'
colour
$ printf "colour\ncolor\ncolouur\n" | egrep 'colou.r'
colouur
Appropriately enough, the last one reads as: match any line that contains "colou", followed by any one character, followed by the letter "r".
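The two points from the quoted passage (the dot is literal inside brackets, and it usually skips newlines) can be checked with a short sketch. It assumes Python's re module, where the re.DOTALL flag lets the dot match a newline; other tools expose this differently or not at all.

```python
import re

# Outside brackets the dot is a wildcard; inside brackets it is a literal dot.
wildcard = re.findall(r"a.c", "abc a.c a-c")
literal = re.findall(r"[a.c]", "a.b-c")
print(wildcard)  # ['abc', 'a.c', 'a-c']
print(literal)   # ['a', '.', 'c']

# By default Python's dot skips the newline; re.DOTALL lifts that restriction.
default_dot = re.search(r"a.c", "a\nc")
dotall_dot = re.search(r"a.c", "a\nc", flags=re.DOTALL)
print(default_dot)                    # None
print(dotall_dot.group() == "a\nc")   # True
```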
Conclusion
You asked: "I know that '*' means all or nothing, but I am not sure if it's the right way to think about it. On the other hand '.' and '?' seem the same." As you can see, the dot and the question mark are not the same. The dot stands for any single character occupying that specific position, while the question mark acts on the preceding element and makes it optional. The asterisk also acts on the preceding element, allowing it to occur zero or more times, which is more precise than "all or nothing".
Note: The examples below are in Python, though the concepts are the same in other regex flavors.
'.' is a matching symbol that matches any character except the newline character (this can be overridden with the re.DOTALL flag in Python). Hence it is also called a wildcard.
'*' is a quantifier (it defines how often an element can occur). It is short for {0,}.
It means "match zero or more": the group that precedes the star can occur any number of times in the text. It can be completely absent or repeated over and over again.
'?' is also a quantifier. It is short for {0,1}.
It means "match zero or one of the group preceding this question mark." In other words, the part preceding the question mark is optional.
e.g.:
import re
pattern = re.compile(r'(\d{2}-)?\d{10}')
mobile1 = pattern.search('My number is 91-9999988888')
mobile1.group()
Output: '91-9999988888'
mobile2 = pattern.search('My number is 9999988888')
mobile2.group()
Output: '9999988888'
In the above example, '?' indicates that the group preceding it (two digits and a hyphen) is optional. It may be absent or occur at most once.
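The {0,} and {0,1} shorthands mentioned above can be verified directly. This is a small sketch with made-up sample strings; it only checks that each shorthand accepts the same strings as its brace form.

```python
import re

samples = ["color", "colour", "colouur"]

# '*' and '{0,}' accept the same strings.
star = [s for s in samples if re.fullmatch(r"colou*r", s)]
braces_star = [s for s in samples if re.fullmatch(r"colou{0,}r", s)]

# '?' and '{0,1}' accept the same strings.
opt = [s for s in samples if re.fullmatch(r"colou?r", s)]
braces_opt = [s for s in samples if re.fullmatch(r"colou{0,1}r", s)]

print(star == braces_star, star)  # True ['color', 'colour', 'colouur']
print(opt == braces_opt, opt)     # True ['color', 'colour']
```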
Difference between '.' and '?':
'.' matches any single character at the position it holds in the regular expression.
e.g.:
pattern = re.compile(r'.ot')
pattern.findall('dot will identify both hot and got.')
Output: ['dot', 'bot', 'hot', 'got']
(Note that 'bot' is picked up from inside the word 'both'.)
'?' matches zero or one occurrence of the group preceding it.
See the mobile number example.
The same goes for '*': it matches zero or more occurrences of the group preceding it.
Combination:
'.*': Matches as long a sequence as possible. Greedy approach.
'.*?': Matches as short a sequence as possible, stopping at the first point where the rest of the pattern can match. Non-greedy approach.
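A short sketch of the difference, using an invented sample string with two quoted words:

```python
import re

text = '"first" and "second"'

# Greedy: '.*' grabs as much as it can, so the match runs to the last quote.
greedy = re.findall(r'".*"', text)
# Non-greedy: '.*?' stops at the earliest closing quote.
lazy = re.findall(r'".*?"', text)

print(greedy)  # ['"first" and "second"']
print(lazy)    # ['"first"', '"second"']
```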
For more info, consider reading the following two questions:
- How can I write a regex which matches non greedy?
- regex - What do lazy and greedy mean in the context of regular expressions?