Extract the content from a file between two match patterns (Extract only HTML from a file)

We can achieve this with sed, the stream editor for filtering and transforming text. The short answer is given under point 5 below, but I've decided to write a detailed explanation.

0. First let's create a simple file to test our commands:

$ printf '\nTop text\nSender <[email protected]>\n\n<html>\n\tThe inner text 1\n</html>\n\nMiddle text\n\n<HTML>\n\tThe inner text 2\n</HTML>\n\nBottom text\n' | tee example.file

Top text
Sender <[email protected]>

<html>
        The inner text 1
</html>

Middle text

<HTML>
        The inner text 2
</HTML>

Bottom text

1. We can extract everything between the tags <html> and </html>, including the tags themselves, in this way:

$ sed -n -e '/<html>/,/<\/html>/p' example.file

<html>
        The inner text 1
</html>

  • The option -e script (--expression=script) adds a script to the commands to be executed. In this case the script is '/<html>/,/<\/html>/p'. Since we have only one script, we can omit this option.

  • The option -n (--quiet, --silent) suppresses automatic printing of the pattern space, so along with this option we must use some additional command(s) to tell sed what to print.

  • This additional command is the print command p, added to the end of the script. If sed isn't started with the -n option, every line is printed automatically and the p command prints the lines in the range a second time.

  • Finally, the two comma-separated patterns, /<html>/,/<\/html>/, specify a range. Note that we use \ to escape the special character /, which plays the role of the delimiter here.
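
As a quick check of the -n and p interplay described above, here is a minimal sketch that runs the same range with and without -n. It also shows the alternative address form \%regex%, where % replaces / as the delimiter, so the / in </html> needs no backslash. The temporary file is a hypothetical, shortened stand-in for example.file.

```shell
# Hypothetical scratch file, a shortened stand-in for example.file.
demo=$(mktemp)
printf 'Top\n<html>\n\tInner\n</html>\nBottom\n' > "$demo"

# With -n: only the three lines of the range are printed.
with_n=$(sed -n '/<html>/,/<\/html>/p' "$demo")

# Without -n: all five lines are auto-printed, and p prints the
# three range lines a second time (eight lines in total).
without_n=$(sed '/<html>/,/<\/html>/p' "$demo")

# Same range with % as the regex delimiter (written \%regex%).
alt_delim=$(sed -n '\%<html>%,\%</html>%p' "$demo")

printf '%s\n' "$with_n"
rm -f "$demo"
```

The \%regex% form changes only the delimiter; the regular expressions themselves stay the same.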

2. If we want to extract everything between the tags <html> and </html> without printing the tags themselves, we should add some additional commands:

$ sed -n '/<html>/,/<\/html>/{ /html>/d; p }' example.file

        The inner text 1

  • The curly braces, { and }, are used to group the commands.

  • The command d deletes each line that matches the expression html> and immediately starts the next cycle, so the p command is never reached for the tag lines.
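
An alternative that is often used for the same task relies on the empty regular expression //, which in sed stands for the last regular expression that was applied. Inside the range, that is the start pattern on the first line and the end pattern on the following lines, so //!p prints everything except the two tag lines. This is only a sketch of that idiom, not the method used above, and the temporary file is a hypothetical stand-in for example.file.

```shell
# Hypothetical scratch file, a shortened stand-in for example.file.
demo=$(mktemp)
printf 'Top\n<html>\n\tInner 1\n\tInner 2\n</html>\nBottom\n' > "$demo"

# Variant from step 2: delete the tag lines, print the rest of the range.
with_d=$(sed -n '/<html>/,/<\/html>/{ /html>/d; p }' "$demo")

# Alternative: print only the range lines that do NOT match the
# regular expression sed used last (the range start or end pattern).
inner=$(sed -n '/<html>/,/<\/html>/{ //!p; }' "$demo")

printf '%s\n' "$inner"
rm -f "$demo"
```

Both variants should give the same result here; the second one avoids repeating the tag name inside the braces.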

3. But our example.file also has upper-case <HTML> tags, so we should make the pattern match case-insensitive. We can do that by adding the flag I to the regular expressions:

$ sed -n '/<html>/I,/<\/html>/I{ /html>/Id; p }' example.file

        The inner text 1
        The inner text 2

  • The I modifier to regular-expression matching is a GNU extension which causes the REGEXP to be matched in a case-insensitive manner.
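
Because the I modifier is a GNU extension, a sed without it needs another way to ignore case. One possible sketch spells out both cases with bracket expressions, which are part of basic regular expressions. The temporary file and its contents are hypothetical stand-ins for example.file.

```shell
# Hypothetical scratch file with one lower-case and one upper-case block.
demo=$(mktemp)
printf '<html>\n\tone\n</html>\nmid\n<HTML>\n\ttwo\n</HTML>\n' > "$demo"

# [Hh][Tt][Mm][Ll] matches "html" in any mix of upper and lower case,
# so the I flag is not needed.
portable=$(sed -n '/<[Hh][Tt][Mm][Ll]>/,/<\/[Hh][Tt][Mm][Ll]>/{
  /[Hh][Tt][Mm][Ll]>/d
  p
}' "$demo")

printf '%s\n' "$portable"
rm -f "$demo"
```

The script is longer than the I variant, which is the trade-off for not depending on GNU sed.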

4. If we want to remove all HTML tags between the <html> tags, we can add a command that finds and 'deletes' the strings that begin with < and end with >:

$ sed -n '/<html>/I,/<\/html>/I{ /html>/Id; s/<[^>]*>//g; p }' example.file

  • The command s substitutes the strings that match the expression <[^>]*> with an empty string. Its general form is s/old/new/.

  • The pattern flag g applies the replacement to all matches of the regexp, not just the first.

We would probably want to omit the delete command in this case:

$ sed -n '/<html>/I,/<\/html>/I{ s/<[^>]*>//g; p }' example.file
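
To see what the two variants of step 4 do with nested tags, here is a small sketch on a hypothetical input with <p> and <b> tags (GNU sed assumed because of the I flag). With the delete command the <html> lines disappear entirely. Without it they remain in the range, are emptied by s, and are printed as blank lines.

```shell
# Hypothetical scratch file with nested tags inside the <html> block.
demo=$(mktemp)
printf 'Top\n<html>\n<p>Hello <b>world</b></p>\n</html>\nBottom\n' > "$demo"

# With the delete command: the <html> and </html> lines are dropped.
with_delete=$(sed -n '/<html>/I,/<\/html>/I{ /html>/Id; s/<[^>]*>//g; p }' "$demo")

# Without it: count the output lines, including the two lines that
# s has emptied (grep -c '' counts every line, empty or not).
lines_without_delete=$(sed -n '/<html>/I,/<\/html>/I{ s/<[^>]*>//g; p }' "$demo" | grep -c '')

printf '%s\n' "$with_delete"
printf '%s\n' "$lines_without_delete"
rm -f "$demo"
```

Note that <[^>]*> works line by line, so a tag that spans several lines would not be removed by this expression.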

5. To change the file in place and create a backup copy we can use the option -i, or we can create a new file from sed's output by redirecting (>) the output to the new file:

$ sed -i.bak -n '/<html>/I,/<\/html>/I p' example.file
$ sed -n '/<html>/I,/<\/html>/I p' example.file > new.file
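
The effect of -i with a backup suffix can be checked with a short sketch like the following, which assumes GNU sed and works in a temporary directory. The names demo.file and original.copy are hypothetical.

```shell
# Work in a throwaway directory so no real file is touched.
dir=$(mktemp -d)
printf 'Top\n<html>\n\tInner\n</html>\nBottom\n' > "$dir/demo.file"
cp "$dir/demo.file" "$dir/original.copy"

# GNU sed: edit demo.file in place and keep the old content in demo.file.bak.
sed -i.bak -n '/<html>/I,/<\/html>/I p' "$dir/demo.file"

# The edited file should now hold only the <html> block.
edited=$(cat "$dir/demo.file")

# cmp -s is silent and succeeds only if the two files are identical.
if cmp -s "$dir/demo.file.bak" "$dir/original.copy"; then
  backup_ok=yes
else
  backup_ok=no
fi

printf '%s\n' "$edited"
printf 'backup intact: %s\n' "$backup_ok"
rm -rf "$dir"
```

Combining -n with -i means that only the lines printed by p end up in the file, so keeping the backup is a sensible precaution.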
