Grep: The asterisk (*) doesn't always work
An asterisk in regular expressions means "match the preceding element 0 or more times".
In your particular case with grep 'This*String' file.txt, you are trying to say, "hey, grep, match me the text Thi, followed by a lowercase s zero or more times, followed by the word String". In ThisExampleString the word Example sits between This and String, and Example is not a run of lowercase s characters, hence grep ignores ThisExampleString.
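In the case of grep '*String' file.txt, there is nothing before the * for it to repeat. By default grep uses basic regular expressions (BRE), where a * at the very start of the pattern is taken literally, so you are saying "grep, match me an actual asterisk followed by the word String". Of course, that's not how ThisExampleString is supposed to be read. (With the -E flag the meaning of a leading * is left undefined by POSIX, so it can differ between implementations--but none of the possible meanings are anything like what you really want here.)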
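A quick check of the default case, as a sketch only: the sample lines are made up, and the result assumes a POSIX-conforming grep in its default BRE mode, where a * at the start of the pattern is an ordinary character.

```shell
# Hypothetical sample input: one of the lines contains a literal asterisk.
sample=$(printf '%s\n' 'ThisExampleString' '*String' 'String')

# In BRE a leading * is an ordinary character, so only the line that
# really contains "*String" is selected.
leading_star=$(printf '%s\n' "$sample" | grep '*String' || true)

printf '%s\n' "$leading_star"
```

Neither ThisExampleString nor the bare String line is selected, which is why this pattern appears to "not work".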
Knowing that . means "any single character", we could do this: grep 'This.*String' file.txt. Now the grep command will read it correctly: This, followed by any character (a letter, a digit, punctuation, and so on) repeated any number of times, followed by String.
The * metacharacter in BREs, EREs and PCREs (note 1) matches 0 or more occurrences of the preceding group (if a group precedes the * metacharacter), 0 or more occurrences of the preceding character class (if a character class precedes it), or 0 or more occurrences of the preceding character (if neither a group nor a character class precedes it).
This means that in the This*String pattern, since the * metacharacter is preceded by neither a group nor a character class, it matches 0 or more occurrences of the preceding character (in this case the s character):
% cat infile
ThisExampleString
ThisString
ThissString
% grep 'This*String' infile
ThisString
ThissString
To match 0 or more occurrences of any character, apply * to the . metacharacter, which matches any single character:
% cat infile
ThisExampleString
% grep 'This.*String' infile
ThisExampleString
The * metacharacter in BREs and EREs is always "greedy", i.e. it will match the longest possible match:
% cat infile
ThisExampleStringIsAString
% grep -o 'This.*String' infile
ThisExampleStringIsAString
This may not be the desired behavior; if it's not, you can turn on grep's PCRE engine (using the -P option, which is specific to GNU grep) and append the ? metacharacter, which when put after the * and + metacharacters makes them non-greedy, so that they match as little as possible:
% cat infile
ThisExampleStringIsAString
% grep -Po 'This.*?String' infile
ThisExampleString
1: Basic Regular Expressions, Extended Regular Expressions and Perl Compatible Regular Expressions
One explanation, found at the linked page:
Asterisk "*" does not mean the same thing in regular expressions as in wildcarding; it is a modifier that applies to the preceding single character, or expression such as [0-9]. An asterisk matches zero or more of what precedes it. Thus [A-Z]* matches any number of upper-case letters, including none, while [A-Z][A-Z]* matches one or more upper-case letters.
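The difference between the two quoted patterns can be checked with a small sketch. The sample lines are invented for illustration, -x is used so that the whole line has to match, and LC_ALL=C is an assumption made here so that [A-Z] means exactly the ASCII upper-case letters.

```shell
# Hypothetical sample lines: an empty line, an all-caps word, a mixed-case word.
sample=$(printf '%s\n' '' 'ABC' 'Abc')

# -x requires the whole line to match; -c counts the selected lines.
zero_or_more=$(printf '%s\n' "$sample" | LC_ALL=C grep -c -x '[A-Z]*')
one_or_more=$(printf '%s\n' "$sample" | LC_ALL=C grep -c -x '[A-Z][A-Z]*')

echo "[A-Z]* selects $zero_or_more lines"        # the empty line and ABC
echo "[A-Z][A-Z]* selects $one_or_more lines"    # ABC only
```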
* has a special meaning both as a shell globbing character ("wildcard") and as a regular expression metacharacter. You must take both into account, though if you quote your regular expression then you can prevent the shell from treating it specially and ensure that it passes it unchanged to grep. Although sort of similar conceptually, what * means to the shell is quite different from what it means to grep.
First the shell treats * as a wildcard.
You said:
Whether the expression is enclosed in quotes makes no difference.
That depends on what files exist in whatever directory you happen to be in when you run the command. For patterns that contain the directory separator /, it may depend on what files exist across your whole system. You should always quote regular expressions for grep--and single quotes are usually best--unless you are sure you are okay with the nine types of potentially surprising transformations the shell otherwise performs before executing the grep command.
When the shell encounters a * character that is not quoted, it takes it to mean "zero or more of any character" and replaces the word that contains it with a list of filenames that match the pattern. (Filenames that start with . are excluded--unless your pattern itself starts with . or you've configured your shell to include them anyway.) This is known as globbing--and also by the names filename expansion and pathname expansion.
The effect with grep will usually be that the first matching filename is taken as the regular expression--even if it would be quite obvious to a human reader that it is not meant as a regular expression--while all the other filenames listed automatically from your glob are taken as the files inside which to search for matches. (You do not see the list--it is passed opaquely to grep.) You virtually never want this to happen.
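The expansion can be made visible with a short sketch. The scratch directory and the two file names are hypothetical, and printf stands in for grep so that the arguments the command actually receives get printed.

```shell
# Hypothetical scratch directory and file names, for illustration only.
dir=$(mktemp -d)
touch "$dir/ThisOldString" "$dir/ThisNewString"

# Unquoted: the shell replaces the word with the matching filenames
# before the command (printf here, standing in for grep) ever sees it.
unquoted=$(cd "$dir" && printf '%s\n' This*String)

# Quoted: the word is passed through unchanged.
quoted=$(cd "$dir" && printf '%s\n' 'This*String')

printf 'unquoted:\n%s\n' "$unquoted"
printf 'quoted:\n%s\n' "$quoted"
rm -rf "$dir"
```

With grep in place of printf, the first of the two filenames would have become the pattern and the second the file to search.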
The reason this is sometimes not a problem--and in your particular case, at least so far, it wasn't--is that * will be left alone if all of the following are true:
1. There were no files whose names matched. ...Or you have disabled globbing in your shell, typically with set -f or the equivalent set -o noglob. But this is uncommon and you would probably know you did it.
2. You are using a shell whose default behavior is to leave * alone when there are no matching filenames. This is the case in Bash, which you are probably using, but not in all Bourne-style shells. (The default behavior in the popular shell Zsh, for instance, is for globs to either (a) expand or (b) produce an error.) ...Or you have changed this behavior of your shell--how that is done varies across shells.
3. You have not otherwise told your shell to allow globs to be replaced with nothing when there are no matching files, nor to fail with an error message in this situation. In Bash that would have been done by enabling the nullglob or failglob shell option, respectively.
You can sometimes rely on #2 and #3 but you can rarely rely on #1. A grep command with an unquoted pattern that works now may stop working when you have different files or when you run it from a different place. Quote your regular expression and the problem goes away.
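A minimal sketch of a command that breaks once a matching file appears. The directory, data.txt and the oddly named This.oldString file are all made up for the demonstration.

```shell
# Hypothetical scratch directory and file names, for illustration only.
dir=$(mktemp -d)
printf '%s\n' 'ThisExampleString' > "$dir/data.txt"
touch "$dir/This.oldString"   # a filename that the glob This.*String matches

# Unquoted: the shell turns the pattern into the filename This.oldString,
# so grep searches data.txt for that text and finds nothing.
unquoted=$(cd "$dir" && grep This.*String data.txt || true)

# Quoted: grep receives the regular expression itself.
quoted=$(cd "$dir" && grep 'This.*String' data.txt || true)

printf 'unquoted result: [%s]\n' "$unquoted"
printf 'quoted result:   [%s]\n' "$quoted"
rm -rf "$dir"
```

Without the This.oldString file, both commands would behave the same in Bash, which is exactly why the unquoted form can seem fine for a long time.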
Then the grep command treats * as a quantifier.
The other answers--such as those by Sergiy Kolodyazhnyy and by kos--also address this aspect of this question, in somewhat different ways. So I encourage those who haven't read them yet to do so, either before or after reading the rest of this answer.
Assuming the * does make it to grep--which quoting should ensure--grep then takes it to mean that the item that precedes it may occur any number of times, rather than having to occur exactly once. It could still occur once. Or it might not be present at all. Or it could be repeated. Text that fits with any of those possibilities will be matched.
What do I mean by "item"?
- A single character. Since b matches a literal b, b* matches zero or more bs, thus ab*c matches ac, abc, abbc, abbbc, etc.
  Similarly, since . matches any character, .* matches zero or more characters (see note 1), thus a.*c matches ac, akc, ahjglhdfjkdlgjdfkshlgc, even acccccchjckhcc, etc. Or
- A character class. Since [xy] matches x or y, [xy]* matches zero or more characters where each one is either x or y, thus p[xy]*q matches pq, pxq, pyq, pxxq, pxyq, pyxq, pyyq, pxxxq, pxxyq, etc.
  This also applies to shorthand forms of character classes like \w, \W, \s, and \S. Since \w matches any word character, \w* matches zero or more word characters. Or
- A group. Since \(bar\) matches bar, \(bar\)* matches zero or more bars, thus foo\(bar\)*baz matches foobaz, foobarbaz, foobarbarbaz, foobarbarbarbaz, etc.
  With the -E or -P options, grep treats your regular expression as an ERE or PCRE respectively, rather than as a BRE, and then groups are surrounded by ( ) instead of \( \), so then you'd use (bar) instead of \(bar\) and foo(bar)*baz instead of foo\(bar\)*baz.
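The three kinds of item can be tried out with a short sketch. The input words are hypothetical, and -x is added so that only whole-line matches are selected, which makes it easier to see exactly what each pattern accepts.

```shell
# Hypothetical sample words; -x makes grep select whole-line matches only.

# A single character repeated: b*
chars=$(printf '%s\n' ac abc abbc adc | grep -x 'ab*c')

# A character class repeated: [xy]*
classes=$(printf '%s\n' pq pxyq pzq | grep -x 'p[xy]*q')

# A group repeated: \(bar\)* in BRE, (bar)* with -E
groups_bre=$(printf '%s\n' foobaz foobarbarbaz foobabaz | grep -x 'foo\(bar\)*baz')
groups_ere=$(printf '%s\n' foobaz foobarbarbaz foobabaz | grep -E -x 'foo(bar)*baz')

printf '%s\n' "$chars" "$classes" "$groups_bre" "$groups_ere"
```

Here adc, pzq and foobabaz are rejected, because d, z and a lone ba are not repetitions of the item in front of the *.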
man grep gives a reasonably accessible explanation of BRE and ERE syntax at the end, as well as listing all the command-line options grep accepts at the beginning. I recommend that manual page as a resource, and also the GNU Grep documentation and this tutorial/reference site (which I've linked to a number of pages on, above).
For testing and learning grep, I recommend calling it with a pattern but no filename. Then it takes input from your terminal. Enter lines; the lines that are echoed back to you are the ones that contained text your pattern matched. To quit, press Ctrl+D at the beginning of a line, which signals end of input. (Or you can press Ctrl+C as with most command-line programs.) For example:
grep 'This.*String'
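The same experiment can also be run non-interactively by piping a few lines into grep. This is only a sketch, and the test lines are invented.

```shell
# Hypothetical test lines, piped in instead of typed at the terminal.
matched=$(printf '%s\n' 'ThisExampleString' 'ThatString' 'ThisString' |
  grep 'This.*String')

# Only the lines containing text the pattern matched are kept.
printf '%s\n' "$matched"
```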
If you use the --color flag, grep will highlight the specific parts of your lines that matched your regular expression, which is very useful both for figuring out what a regular expression does and for finding what you are looking for once you do. By default, Ubuntu users have a Bash alias that causes grep --color=auto to run--which is sufficient for this purpose--when you run grep from the command line, so you likely don't even need to pass --color manually.
1: Therefore .* in a regular expression means what * means in a shell glob. However, the difference is that grep automatically prints lines that contain your match anywhere in them, so it's typically unnecessary to have .* at the beginning or end of a regular expression.
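As a small sketch of that last point, with made-up sample lines, a bare pattern and the same pattern padded with .* select the same lines:

```shell
# Hypothetical sample lines.
sample=$(printf '%s\n' 'ThisExampleString' 'NoMatchHere' 'AStringInTheMiddle')

# grep selects a line if the pattern matches anywhere in it,
# so padding the pattern with .* does not change which lines are selected.
bare=$(printf '%s\n' "$sample" | grep 'String')
padded=$(printf '%s\n' "$sample" | grep '.*String.*')

[ "$bare" = "$padded" ] && echo "same lines selected"
```

The padding does matter with options such as -o or --color, which report the matched text rather than the whole line.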