Extract the content from a file between two match patterns (Extract only HTML from a file)
We can achieve this goal with the tool sed, a stream editor for filtering and transforming text. The short answer is given under point 5 below, but I've decided to write a detailed explanation.
0. First let's create a simple file to test our commands:
$ printf '\nTop text\nSender <[email protected]>\n\n<html>\n\tThe inner text 1\n</html>\n\nMiddle text\n\n<HTML>\n\tThe inner text 2\n</HTML>\n\nBottom text\n' | tee example.file
Top text
Sender <[email protected]>
<html>
The inner text 1
</html>
Middle text
<HTML>
The inner text 2
</HTML>
Bottom text
1. We can crop everything between the tags <html> and </html>, including them, in this way:
$ sed -n -e '/<html>/,/<\/html>/p' example.file
<html>
The inner text 1
</html>
- The option -e script (--expression=script) adds a script to the commands to be executed. In this case the script that is added is '/<html>/,/<\/html>/p'. Since we have only one script, we can omit this option.
- The option -n (--quiet, --silent) suppresses automatic printing of the pattern space, so along with this option we should use some additional command(s) to tell sed what to print.
- This additional command is the print command p, added to the end of the script. If sed wasn't started with the -n option, every line would still be printed automatically and the p command would print the lines inside the range a second time.
- Finally, with the two comma-separated patterns /<html>/,/<\/html>/ we specify a range of lines. Please note that we are using \ to escape the special character /, which plays the role of the delimiter here.
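To see how -n and p work together, here is a minimal sketch that runs the same range with and without -n. The file demo.txt and its contents are made up for illustration. The last command also shows that a different delimiter, introduced with a backslash, avoids escaping the / in </html>.

```shell
# Hypothetical sample file (not part of the example above).
tmpdir=$(mktemp -d)
printf 'Top\n<html>\nInner\n</html>\nBottom\n' > "$tmpdir/demo.txt"

# With -n only the lines inside the range are printed.
with_n=$(sed -n '/<html>/,/<\/html>/p' "$tmpdir/demo.txt")

# Without -n every line is printed once automatically, and the lines
# inside the range are printed a second time by p.
without_n=$(sed '/<html>/,/<\/html>/p' "$tmpdir/demo.txt")

# Same range as the first command, but with | as the regex delimiter
# (\|regex|), so the / in </html> needs no escaping.
alt_delim=$(sed -n '\|<html>|,\|</html>|p' "$tmpdir/demo.txt")

printf '%s\n' "--- with -n ---" "$with_n" "--- without -n ---" "$without_n"
rm -r "$tmpdir"
```

The first variant prints three lines, the second prints eight (five input lines plus the three in the range repeated).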
2. If we want to crop everything between the tags <html> and </html>, without printing them, we should add some additional commands:
$ sed -n '/<html>/,/<\/html>/{ /html>/d; p }' example.file
The inner text 1
- The curly braces, { and }, are used to group the commands.
- The command d will delete each line that matches the expression html>, so the lines holding the opening and closing tags are dropped before p is reached.
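One thing to keep in mind is that the expression html> matches anywhere in a line, so an inner line that merely contains that string would be deleted too. The sketch below compares the loose pattern with a stricter, anchored one. It assumes each tag sits alone on its own line, and the sample file is hypothetical.

```shell
# Hypothetical sample: the third line contains "html>" inside ordinary text.
tmpdir=$(mktemp -d)
printf '<html>\nPlain inner line\nInner line mentioning xhtml> in text\n</html>\n' > "$tmpdir/demo.txt"

# Loose pattern: also deletes the inner line that contains "html>".
loose=$(sed -n '/<html>/,/<\/html>/{ /html>/d; p; }' "$tmpdir/demo.txt")

# Stricter pattern: delete only lines that consist of <html> or </html>.
# \{0,1\} makes the preceding / optional (POSIX basic regular expression).
strict=$(sed -n '/<html>/,/<\/html>/{ /^<\/\{0,1\}html>$/d; p; }' "$tmpdir/demo.txt")

printf '%s\n' "--- loose ---" "$loose" "--- strict ---" "$strict"
rm -r "$tmpdir"
```

The loose variant keeps only "Plain inner line", while the strict variant keeps both inner lines.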
3. But our example.file also has upper case <HTML> tags, so we should make the pattern matching case-insensitive. We can do that by adding the flag I to the regular expressions:
$ sed -n '/<html>/I,/<\/html>/I{ /html>/Id; p }' example.file
The inner text 1
The inner text 2
- The I modifier to regular-expression matching is a GNU extension which causes the REGEXP to be matched in a case-insensitive manner.
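Where the GNU I modifier is not available, a possible workaround is to spell out both cases with bracket expressions, which are part of standard regular expressions. This is only a sketch of that alternative, using a made-up sample file:

```shell
# Hypothetical sample with lower and upper case tags.
tmpdir=$(mktemp -d)
printf 'Top\n<html>\nInner 1\n</html>\nMiddle\n<HTML>\nInner 2\n</HTML>\nBottom\n' > "$tmpdir/demo.txt"

# [Hh][Tt][Mm][Ll] matches "html" in any mix of upper and lower case,
# so no I modifier is needed.
portable=$(sed -n '/<[Hh][Tt][Mm][Ll]>/,/<\/[Hh][Tt][Mm][Ll]>/{ /[Hh][Tt][Mm][Ll]>/d; p; }' "$tmpdir/demo.txt")

printf '%s\n' "$portable"
rm -r "$tmpdir"
```

This prints "Inner 1" and "Inner 2", the same lines the I modifier selects in this sample.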
4. If we want to remove all HTML tags between the <html> tags, we could add an additional command that will find and 'delete' the strings which begin with < and end with >:
sed -n '/<html>/I,/<\/html>/I{ /html>/Id; s/<[^>]*>//g; p }' example.file
- The command s will substitute the strings that match the expression /<[^>]*>/ with an empty string // - s/<old>/<new>/.
- The pattern flag g will apply the replacement to all matches of the regexp, not just the first.
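In this case we would probably want to omit the delete command, because the substitution already strips the <html> tags themselves (their lines are then printed empty):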
sed -n '/<html>/I,/<\/html>/I{ s/<[^>]*>//g; p }' example.file
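As a quick illustration of what the substitution does on its own, the sketch below applies it to two made-up inputs. The second one shows a limitation: sed works line by line, so a tag that is split across two lines is not removed.

```shell
# All tags on one line are removed, the text between them is kept.
one_line=$(printf '<p>Hello <b>world</b>!</p>\n' | sed 's/<[^>]*>//g')

# A tag split across two lines: "<a" has no closing > on its line, so it stays,
# and on the second line only the complete </a> is matched and removed.
split_tag=$(printf '<a\nhref="x">link</a>\n' | sed 's/<[^>]*>//g')

printf '%s\n' "$one_line" "$split_tag"
```

The first result is "Hello world!", while the second still contains the leftovers of the split tag.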
5. To make the changes in place and keep a backup copy of the original file, we can use the option -i with a backup suffix. Alternatively, we can create a new file from sed's output by redirecting (>) the output to the new file:
sed -n -i.bak '/<html>/I,/<\/html>/I p' example.file
sed -n '/<html>/I,/<\/html>/I p' example.file > new.file
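The following sketch checks what -i.bak leaves behind. It assumes GNU sed (both the I modifier and the attached backup suffix are used as in the commands above) and works on a throwaway file in a temporary directory, so the names demo.txt and original.txt are only illustrative.

```shell
# Work on a hypothetical copy in a temporary directory.
tmpdir=$(mktemp -d)
printf 'Top\n<html>\nInner\n</html>\nBottom\n' > "$tmpdir/demo.txt"
cp "$tmpdir/demo.txt" "$tmpdir/original.txt"

# Edit in place; the untouched original is saved as demo.txt.bak.
sed -n -i.bak '/<html>/I,/<\/html>/I p' "$tmpdir/demo.txt"

edited=$(cat "$tmpdir/demo.txt")

# Compare the backup with the copy made before editing.
if cmp -s "$tmpdir/demo.txt.bak" "$tmpdir/original.txt"; then
    backup_ok=yes
else
    backup_ok=no
fi

printf '%s\n' "$edited" "backup intact: $backup_ok"
rm -r "$tmpdir"
```

After the command, demo.txt holds only the lines from the range, and demo.txt.bak is identical to the file as it was before the edit.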
References:
- Sed - An Introduction and Tutorial by Bruce Barnett
- How to select lines between two marker patterns which may occur multiple times with awk/sed
- GNU: sed, a stream editor | Ubuntu: sed manual page
- Sed remove tags from html file