Extract the content from a file between two match patterns (Extract only HTML from a file)

We can achieve this with sed, the stream editor for filtering and transforming text. The short answer is given under point 5 below, but I've decided to write a detailed explanation.

0. First let's create a simple file to test our commands:

$ printf '\nTop text\nSender <[email protected]>\n\n<html>\n\tThe inner text 1\n</html>\n\nMiddle text\n\n<HTML>\n\tThe inner text 2\n</HTML>\n\nBottom text\n' | tee example.file

Top text
Sender <[email protected]>

<html>
        The inner text 1
</html>

Middle text

<HTML>
        The inner text 2
</HTML>

Bottom text

1. We can crop everything between the tags <html> and </html>, including them, in this way:

$ sed -n -e '/<html>/,/<\/html>/p' example.file

<html>
        The inner text 1
</html>

The option -e script (--expression=script) adds a script to the commands to be executed. In this case the script is '/<html>/,/<\/html>/p'. Since we have only one script, we can omit this option.
The option -n (--quiet, --silent) suppresses automatic printing of the pattern space, so along with it we need an additional command to tell sed what to print.
That additional command is the print command p, added to the end of the script. If sed is started without the -n option, the p command prints each line in the range a second time, in addition to the automatic output.
Finally, the two comma-separated patterns, /<html>/,/<\/html>/, specify a range. Note that we use \ to escape the special character /, which plays the role of the delimiter here.
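
To make the effect of -n and the delimiter escaping concrete, here is a minimal sketch. It builds its own small sample in a temporary file (the variable names are only illustrative), and it assumes a sed that supports the \%regexp% form for choosing a custom address delimiter, which both POSIX and GNU sed describe.

```shell
# Recreate a small sample so the sketch runs on its own (the temp path is arbitrary).
sample=$(mktemp)
printf 'Top text\n<html>\n\tThe inner text 1\n</html>\nBottom text\n' > "$sample"

# Without -n, lines in the range appear twice: once from p, once from the automatic print.
without_n=$(sed '/<html>/,/<\/html>/p' "$sample")

# Same range as in step 1, but with % as the address delimiter, so / needs no escaping.
with_percent=$(sed -n '\%<html>%,\%</html>%p' "$sample")

printf '%s\n' "$with_percent"
rm -f "$sample"
```

The custom delimiter is purely a readability choice; the range it selects is the same as with the escaped slash.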

2. If we want to crop everything between the tags <html> and </html>, without printing them, we should add some additional commands:

$ sed -n '/<html>/,/<\/html>/{ /html>/d; p }' example.file

        The inner text 1

The curly braces, { and }, are used to group the commands.
The command d deletes each line that matches the expression html>, so the tag lines are dropped before p runs.
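
One caveat: /html>/d removes every line in the range that contains html>, not only the tag lines. The sketch below is a possible stricter variant, assuming each tag sits alone on its line as in example.file; the sample content with the extra inner line is made up for illustration.

```shell
sample=$(mktemp)
printf '<html>\n\tAbout the html> marker\n\tThe inner text 1\n</html>\n' > "$sample"

# The command from step 2: it also drops the inner line that merely mentions "html>".
loose=$(sed -n '/<html>/,/<\/html>/{ /html>/d; p }' "$sample")

# Anchored variant: only lines consisting of exactly the opening or closing tag are dropped.
strict=$(sed -n '/<html>/,/<\/html>/{ /^<html>$/d; /^<\/html>$/d; p }' "$sample")

printf '%s\n' "$strict"
rm -f "$sample"
```

If the tags can share a line with other text, the anchored form will not match them, so the looser pattern may still be the better fit.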

3. But our example.file also has upper-case <HTML> tags, so we should make the pattern match case-insensitively. We can do that by adding the I flag to the regular expressions:

$ sed -n '/<html>/I,/<\/html>/I{ /html>/Id; p }' example.file

        The inner text 1
        The inner text 2

The I modifier to regular-expression matching is a GNU extension which causes the REGEXP to be matched in a case-insensitive manner.
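
Because the I flag is GNU-specific, a sed without it needs another approach. One possible workaround, sketched here with a self-made sample file, is to spell out both cases of every letter with bracket expressions:

```shell
sample=$(mktemp)
printf 'Top\n<html>\n\tThe inner text 1\n</html>\nMiddle\n<HTML>\n\tThe inner text 2\n</HTML>\nBottom\n' > "$sample"

# Spell out both cases of each letter instead of relying on the GNU-only I flag.
portable=$(sed -n '/<[Hh][Tt][Mm][Ll]>/,/<\/[Hh][Tt][Mm][Ll]>/p' "$sample")

printf '%s\n' "$portable"
rm -f "$sample"
```

This is more verbose than the I flag, but it uses only basic regular expression syntax.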

4. If we want to remove all HTML tags between the <html> tags, we can add another command that 'deletes' every string that begins with < and ends with >:

$ sed -n '/<html>/I,/<\/html>/I{ /html>/Id; s/<[^>]*>//g; p }' example.file

The command s substitutes each string that matches the expression <[^>]*> with an empty string, following the form s/<old>/<new>/.
The g flag applies the replacement to all matches of the regexp on a line, not just the first.

Probably we would want to omit the delete command in this case:

$ sed -n '/<html>/I,/<\/html>/I{ s/<[^>]*>//g; p }' example.file
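
To see the substitution on its own, here is a small sketch that runs it on a single made-up line, once with the g flag and once without:

```shell
# A single line with nested inline tags (made-up sample input).
line='<p>Hello <b>world</b>, see <a href="x.html">this</a></p>'

# With g, every <...> run on the line is removed.
all_tags=$(printf '%s\n' "$line" | sed 's/<[^>]*>//g')

# Without g, only the first <...> run on the line is removed.
first_only=$(printf '%s\n' "$line" | sed 's/<[^>]*>//')

printf '%s\n%s\n' "$all_tags" "$first_only"
```

Since sed processes its input line by line, a tag that is split across two lines is not matched by this expression.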

5. To change the file in place and create a backup copy, we can use the option -i with a suffix; alternatively, we can create a new file from sed's output by redirecting it with > to the new file:

$ sed -n '/<html>/I,/<\/html>/I p' example.file -i.bak

$ sed -n '/<html>/I,/<\/html>/I p' example.file > new.file
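
The in-place variant can be tried safely in a scratch directory first. The following sketch assumes GNU sed (for both the I flag and the -i.bak form with the suffix attached); the directory and the file name original.copy are only illustrative.

```shell
workdir=$(mktemp -d)
printf 'Top\n<html>\n\tThe inner text 1\n</html>\nBottom\n' > "$workdir/example.file"
cp "$workdir/example.file" "$workdir/original.copy"

# GNU sed: the suffix is attached directly to -i; the original is saved as example.file.bak.
sed -n -i.bak '/<html>/I,/<\/html>/I p' "$workdir/example.file"

edited=$(cat "$workdir/example.file")

# cmp -s exits 0 when the backup is byte-identical to the untouched copy.
cmp -s "$workdir/example.file.bak" "$workdir/original.copy" && backup_ok=yes || backup_ok=no

printf '%s\nbackup intact: %s\n' "$edited" "$backup_ok"
rm -rf "$workdir"
```

With -n and -i together, only the printed lines are written back, so the edited file keeps just the extracted block while the .bak file preserves the original content.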
