Problems with "+" in grep

If you want + to mean "one or more of the preceding atom", then you have to do one of:

  1. Use -E (Extended Regular Expressions) (or -P, PCRE):

    grep -E 'data=[a-z,0-9,\"]+' file
    
  2. Escape + so that it is treated specially in the Basic Regular Expressions (BRE) that grep uses by default (\+ is a GNU extension to BRE):

    grep 'data=[a-z,0-9,"]\+' file
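To see the difference between the two options, here is a minimal sketch that assumes GNU grep and uses a made-up sample string (`data=abc123` is not from your file). It runs the same pattern three ways and keeps only the matched text with -o:

```shell
# Hypothetical sample input, not taken from the question.
sample='data=abc123'

# BRE with a bare +: the + is an ordinary character, so the pattern looks
# for a literal "+" and finds nothing. "|| true" only hides grep's
# non-zero exit status for "no match".
bre_plain=$(printf '%s\n' "$sample" | grep -o 'data=[a-z0-9]+' || true)

# ERE (-E): + means "one or more of the preceding atom".
ere=$(printf '%s\n' "$sample" | grep -oE 'data=[a-z0-9]+')

# BRE with \+ : same meaning as + in ERE (GNU extension).
bre_escaped=$(printf '%s\n' "$sample" | grep -o 'data=[a-z0-9]\+')

printf 'BRE +  : [%s]\n' "$bre_plain"
printf 'ERE +  : [%s]\n' "$ere"
printf 'BRE \\+ : [%s]\n' "$bre_escaped"
```

The first result should be empty, while the other two should both contain the full `data=abc123`.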
    

Points:

  • + is an ERE (Extended Regular Expression) token that matches one or more of the preceding token. It works as-is when grep is run with -E; in BRE (Basic Regular Expressions, i.e. plain grep) it must be written escaped, as \+

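  • The character class [a-z,0-9,\"] matches any single character in a-z or 0-9, a comma, or ". The commas do not act as separators; they are members of the class. Likewise, a backslash inside a bracket expression is literal in BRE and ERE, so \ is matched too. This may not be what you want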

  • By default grep prints the whole matching line; to print only the matched portion, use grep's -o option
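The last two points can be checked with a short sketch. The input line here is hypothetical, chosen only because it contains commas after `data=`; GNU grep is assumed:

```shell
# Hypothetical input line, not taken from the question.
line='x data=a,b,1 y'

# Commas inside the brackets are members of the class, so they match too.
with_commas=$(printf '%s\n' "$line" | grep -oE 'data=[a-z,0-9,"]+')

# Without the commas, the match stops at the first comma.
without_commas=$(printf '%s\n' "$line" | grep -oE 'data=[a-z0-9"]+')

# Without -o, grep prints the entire matching line.
whole_line=$(printf '%s\n' "$line" | grep -E 'data=[a-z0-9"]+')

printf 'with commas   : %s\n' "$with_commas"
printf 'without commas: %s\n' "$without_commas"
printf 'no -o         : %s\n' "$whole_line"
```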


Based on your example, you can do:

grep -E '\bdata=[a-z0-9"]+\b' file
  • -E enables ERE
  • \b matches a word boundary (the edge between a word character and a non-word character); it is zero width
  • data= matches data= literally
  • [a-z0-9"] matches any single character in a-z or 0-9, or "
  • + matches the previous token one or more times

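Even once your current pattern is corrected, without \b it would still produce false positives on strings such as foo fdata=2322ab (the match starts in the middle of a word) and data=12AB (the match stops at data=12, in the middle of the value).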
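A small sketch of those false positives, assuming GNU grep (\b is a GNU extension). The three input lines are made up for illustration: two that should be rejected and one that should be accepted:

```shell
# Hypothetical test lines: two false positives and one genuine match.
input='fdata=2322ab
data=12AB
data=abc123'

# Without \b the pattern matches inside "fdata=..." and stops halfway
# through "12AB".
loose=$(printf '%s\n' "$input" | grep -oE 'data=[a-z0-9"]+')

# With \b on both sides only the last line qualifies.
strict=$(printf '%s\n' "$input" | grep -oE '\bdata=[a-z0-9"]+\b')

printf 'without \\b:\n%s\n' "$loose"
printf 'with \\b:\n%s\n' "$strict"
```

The loose pattern should report a match on all three lines, the strict one only on the last.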

Example:

% grep -oE '\bdata=[a-z0-9"]+\b' <<<'<div class="node_thumbnail" data-type="file" name="GOPR0036.MP4_frame000001.jpg" data="813334c25191468c9f1c57afc99fde60" aid="133948" rel="/Files/ToolTipView?fileId=813334c25191468c9f1c57afc99fde60&pageNo=1&NoCache=101016083044" rev="topMiddle"'
data="813334c25191468c9f1c57afc99fde60