How to extract string following a pattern with grep, regex or perl [duplicate]
I have a file that looks something like this:
<table name="content_analyzer" primary-key="id">
<type="global" />
</table>
<table name="content_analyzer2" primary-key="id">
<type="global" />
</table>
<table name="content_analyzer_items" primary-key="id">
<type="global" />
</table>
I need to extract anything within the quotes that follow name=, i.e., content_analyzer, content_analyzer2 and content_analyzer_items.
I am doing this on a Linux box, so a solution using sed, perl, grep or bash is fine.
Solution 1:
Since you need to match content without including it in the result
(name=" must match, but it's not part of the desired result), some form
of zero-width matching or group capturing is required. This can be done
easily with the following tools:
Perl
With Perl you could use the -n option to loop line by line and print
the content of a capturing group if it matches:
perl -ne 'print "$1\n" if /name="(.*?)"/' filename
GNU grep
If you have an improved version of grep, such as GNU grep, you may have
the -P option available. This option enables Perl-like regex, allowing
you to use \K, which works like a shorthand lookbehind. It resets the
start of the reported match, so anything matched before it is left out
of the output.
grep -Po 'name="\K.*?(?=")' filename
The -o option makes grep print only the matched text, instead of the
whole line.
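As a rough illustration of the "shorthand lookbehind" remark, the sketch below runs the \K form from the command above next to an explicit lookbehind, (?<=name="), on one sample line from the question. It assumes a grep built with PCRE support (the -P option), which it checks first. The variable names are made up for this example.

```shell
# One line from the question, used as hypothetical test input.
sample='<table name="content_analyzer" primary-key="id">'

# Only run the comparison if this grep understands -P.
if printf 'x\n' | grep -qP 'x' 2>/dev/null; then
    # \K form: name=" must match but is dropped from the reported match.
    with_k=$(printf '%s\n' "$sample" | grep -Po 'name="\K.*?(?=")')
    # Explicit lookbehind form: name=" must precede the match.
    with_lookbehind=$(printf '%s\n' "$sample" | grep -Po '(?<=name=").*?(?=")')
    printf '%s\n%s\n' "$with_k" "$with_lookbehind"
else
    echo "this grep does not support -P" >&2
fi
```

Both forms should print the same name here. One practical difference is that a lookbehind generally has to be fixed-length in PCRE, while the part before \K can be any pattern.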
Vim - Text Editor
Another way is to use a text editor directly. With Vim, one of the
various ways of accomplishing this would be to delete lines without
name= and then extract the content from the resulting lines:
:v/.*name="\v([^"]+).*/d|%s//\1
Standard grep
If you don't have access to these tools for some reason, something similar can be achieved with standard grep. However, without lookaround the output still includes the name=" prefix and the closing quote, so it will require some cleanup afterwards:
grep -o 'name="[^"]*"' filename
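One possible way to do that cleanup, sketched here under the assumption that cut is available, is to split each match on double quotes and keep the second field. The temporary file and the variable name are hypothetical. The sample data is the input from the question.

```shell
# Hypothetical sample file reproducing the input from the question.
tmpfile=$(mktemp)
cat > "$tmpfile" <<'EOF'
<table name="content_analyzer" primary-key="id">
<type="global" />
</table>
<table name="content_analyzer2" primary-key="id">
<type="global" />
</table>
<table name="content_analyzer_items" primary-key="id">
<type="global" />
</table>
EOF

# grep -o prints name="..." for each match; cut then keeps the second
# double-quote-delimited field, i.e. the text between the quotes.
names=$(grep -o 'name="[^"]*"' "$tmpfile" | cut -d'"' -f2)
printf '%s\n' "$names"

rm -f "$tmpfile"
```

Note that the pattern name=" would also match inside a longer attribute name ending in "name", so a stricter pattern may be needed for other inputs.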
A note about saving results
In all of the commands above the results are sent to stdout. It's
important to remember that you can always save them by redirecting the
output to a file, appending:
> result
to the end of the command.
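A small sketch of that redirection, using the standard grep command from above. The temporary directory and the file names inside it are placeholders for this example only.

```shell
# Hypothetical working directory with a one-line input file.
outdir=$(mktemp -d)
printf '<table name="content_analyzer" primary-key="id">\n' > "$outdir/filename"

# "> result" writes the command's stdout to the file, replacing any
# previous content; ">> result" would append instead.
grep -o 'name="[^"]*"' "$outdir/filename" > "$outdir/result"

saved=$(cat "$outdir/result")
printf '%s\n' "$saved"

rm -rf "$outdir"
```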
Solution 2:
The regular expression would be:
.+name="([^"]+)"
The captured name is then available as \1, the first capture group.
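As one way this regular expression might be put to use, the sketch below feeds it to sed. Several things here are assumptions rather than part of the answer: the -E option (extended regex, supported by GNU and BSD sed), the leading .+ relaxed to .* so that name= may also start the line, and a trailing .* added so the whole line is replaced by the group.

```shell
# One line from the question, used as hypothetical test input.
sample='<table name="content_analyzer2" primary-key="id">'

# -n suppresses the default output; the trailing p prints only lines where
# the substitution succeeded. \1 is the text captured by ([^"]+).
captured=$(printf '%s\n' "$sample" | sed -n -E 's/.*name="([^"]+)".*/\1/p')
printf '%s\n' "$captured"
```

Because the leading .* is greedy, a line with several name= attributes would yield only the last one.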
Solution 3:
If you're using Perl, download a module to parse the XML: XML::Simple, XML::Twig, or XML::LibXML. Don't re-invent the wheel.
Solution 4:
An HTML parser should be used for this purpose rather than regular expressions. Here is a Perl program that makes use of HTML::TreeBuilder:
Program
#!/usr/bin/env perl
use strict;
use warnings;
use HTML::TreeBuilder;
my $tree = HTML::TreeBuilder->new_from_file( \*DATA );
my @elements = $tree->look_down(
sub { defined $_[0]->attr('name') }
);
for (@elements) {
print $_->attr('name'), "\n";
}
__DATA__
<table name="content_analyzer" primary-key="id">
<type="global" />
</table>
<table name="content_analyzer2" primary-key="id">
<type="global" />
</table>
<table name="content_analyzer_items" primary-key="id">
<type="global" />
</table>
Output
content_analyzer
content_analyzer2
content_analyzer_items
Solution 5:
This could do it:
perl -ne 'if(m/name="(.*?)"/){ print $1 . "\n"; }' filename