Get content between a pair of HTML tags using Bash

Solution 1:

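Use sed, which ships with virtually every Unix-like system, so nothing extra needs to be installed. The command below prints every line from the one containing the opening tag through the one containing the closing tag, tag lines included. It only matches an opening tag written without attributes.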

tag=body
sed -n "/<$tag>/,/<\/$tag>/p" file
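
Since the command above also prints the tag lines themselves, here is a minimal sketch of one way to keep only the inner lines. It assumes the opening and closing tags sit on their own lines and the opening tag has no attributes. The temporary sample file and the variable name inner are made up for the demonstration.

```shell
# Hypothetical sample file, created only for this demonstration.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
<html>
<body>
text
<div>text2</div>
</body>
</html>
EOF

tag=body
# Within the <body>..</body> range, delete the two tag lines and print the rest.
inner=$(sed -n "/<$tag>/,/<\/$tag>/{
/<$tag>/d
/<\/$tag>/d
p
}" "$tmp")
printf '%s\n' "$inner"
rm -f "$tmp"
```

If the tags share a line with other content, this line-based approach drops that content too, which is one reason the parser-based solutions are more robust.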

Solution 2:

Plain-text processing is a poor fit for parsing HTML/XML; a real parser such as xmllint is more robust. I hope this gives you some idea:

kent$ xmllint --xpath "//body" f.html
<body>
 text
  <div>
  text2
    <div>
        text3
    </div>
  </div>
</body>
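
The output above still includes the body tags, and the command expects well-formed XML input. As a hedged sketch, the variant below switches xmllint to its HTML parser with --html and selects only the children of body with the XPath node() test. The sample file and the variable name inner are assumptions made for the demonstration, and the exact whitespace of the output can differ between libxml2 versions.

```shell
# Hypothetical sample file, created only for this demonstration.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
<html>
<head><title>t</title></head>
<body>
text
<div>text2<div>text3</div></div>
</body>
</html>
EOF

if command -v xmllint >/dev/null 2>&1; then
  # --html selects libxml2's HTML parser; node() matches every child of <body>.
  inner=$(xmllint --html --xpath '//body/node()' "$tmp" 2>/dev/null)
  printf '%s\n' "$inner"
else
  inner=
  echo "xmllint not found; skipping" >&2
fi
rm -f "$tmp"
```

Parser warnings are sent to /dev/null here so that only the selected nodes are captured.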

Solution 3:

Personally, I find the hxselect command (often together with hxclean) from the html-xml-utils package very useful. hxclean turns a (sometimes broken) HTML file into correct XML, and hxselect lets you pick the node(s) you need with CSS selectors. With the -c option, hxselect strips the surrounding tags and prints only the content. All these commands read from stdin and write to stdout, so in your case you would run:

$ hxselect -c body <<HTML
  <html>
  <head>
  </head>
  <body>
    text
    <div>
      text2
      <div>
        text3
      </div>
    </div>
  </body>
  </html>
HTML

to get what you need. Plain and simple.
