Get content between a pair of HTML tags using Bash
Solution 1:
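Use sed, which is already available in any shell/bash environment, so you don't need to install anything else. This prints every line from the one containing the opening tag through the one containing the closing tag, so it only matches a tag written without attributes: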
tag=body
sed -n "/<$tag>/,/<\/$tag>/p" file
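The same range expression can be wrapped in small functions, for example to drop the tag lines themselves. This is only a sketch: the names between_tags and inside_tags are made up for illustration, and inside_tags assumes the opening and closing tags each sit on a line of their own.

```shell
#!/bin/sh
# between_tags TAG FILE (hypothetical helper): print every line from the
# one containing <TAG> through the one containing </TAG>.
between_tags() {
    sed -n "/<$1>/,/<\/$1>/p" "$2"
}

# inside_tags TAG FILE (hypothetical helper): same range, but delete the
# lines that hold the opening and closing tags before printing.
# Assumption: each tag is alone on its line, otherwise content is lost.
inside_tags() {
    sed -n "/<$1>/,/<\/$1>/{/<$1>/d;/<\/$1>/d;p;}" "$2"
}

# Demo on a throwaway file.
sample=$(mktemp)
printf '%s\n' '<html>' '<body>' 'text' '<div>' 'text2' '</div>' '</body>' '</html>' > "$sample"
inside_tags body "$sample"
rm -f "$sample"
```

Note that the range ends at the first closing tag it sees, so this is not reliable for tags that can nest inside themselves (such as div).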
Solution 2:
Plain-text processing is a poor fit for parsing HTML/XML; a real parser such as xmllint is more robust. I hope this gives you some ideas:
kent$ xmllint --xpath "//body" f.html
<body>
text
<div>
text2
<div>
text3
</div>
</div>
</body>
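If the file is real-world HTML rather than well-formed XML, xmllint's --html switch selects the HTML parser instead. A minimal sketch, assuming xmllint (from libxml2) is installed; the function name body_of is made up, and stderr is silenced only to hide parser warnings:

```shell
#!/bin/sh
# body_of FILE (hypothetical helper): parse FILE as HTML and print the
# <body> element selected by the XPath expression //body.
body_of() {
    xmllint --html --xpath '//body' "$1" 2>/dev/null
}

# Demo, skipped when xmllint is not available on this machine.
if command -v xmllint >/dev/null 2>&1; then
    sample=$(mktemp)
    printf '%s\n' '<html>' '<head>' '</head>' '<body>' 'text' \
        '<div>' 'text2' '</div>' '</body>' '</html>' > "$sample"
    body_of "$sample"
    rm -f "$sample"
fi
```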
Solution 3:
Personally I find the hxselect command (often together with hxclean) from the html-xml-utils package very useful. hxclean turns (sometimes broken) HTML into well-formed XML, and hxselect lets you use CSS selectors to pick out the node(s) you need. With the -c option, hxselect strips the surrounding tags. All of these commands read from stdin and write to stdout, so in your case you would run:
$ hxselect -c body <<HTML
<html>
<head>
</head>
<body>
text
<div>
text2
<div>
text3
</div>
</div>
</body>
</html>
HTML
to get what you need. Plain and simple.
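For a file on disk that may not be well-formed, the two tools can be chained in a pipeline. A rough sketch, assuming html-xml-utils is installed; the function name body_content is made up for illustration:

```shell
#!/bin/sh
# body_content FILE (hypothetical helper): clean FILE up with hxclean,
# then print the content of <body> without the surrounding tags (-c).
body_content() {
    hxclean < "$1" | hxselect -c body
}

# Demo, skipped when the tools are not available on this machine.
if command -v hxclean >/dev/null 2>&1 && command -v hxselect >/dev/null 2>&1; then
    sample=$(mktemp)
    printf '%s\n' '<html>' '<head>' '</head>' '<body>' 'text' \
        '<div>' 'text2' '</div>' '</body>' '</html>' > "$sample"
    body_content "$sample"
    rm -f "$sample"
fi
```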