Sed remove tags from html file
Solution 1:
You can use one of the many HTML-to-text converters, a Perl regex such as <.+?> if Perl is available, or, if it must be sed, the pattern <[^>]*>
sed -e 's/<[^>]*>//g' file.html
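As a quick check, the command can be run against a single line of markup instead of a file. This is only a sketch: the sample string and the variable names `html` and `stripped` are made up for illustration.

```shell
# Hypothetical sample input; any single line of HTML would do.
html='<p>Hello <b>world</b></p>'

# Same substitution as above, reading from a pipe instead of file.html.
stripped=$(printf '%s\n' "$html" | sed -e 's/<[^>]*>//g')

printf '%s\n' "$stripped"   # prints: Hello world
```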
If there's no room for error, use an HTML parser instead. For example, when a tag is spread over two lines
<div
>Lorem ipsum</div>
this regular expression will not work, because sed processes its input one line at a time.
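The failure is easy to reproduce. The sketch below first runs the per-line command on the two-line example, then tries a common workaround that is not part of the original answer: joining the whole input into sed's pattern space before substituting. The variable names are illustrative, and the workaround assumes the input fits in memory. It still breaks on markup such as a > inside a quoted attribute value, so a parser remains the safer choice.

```shell
# The two-line element from the example above.
input='<div
>Lorem ipsum</div>'

# Per-line processing: "<div" has no closing > on its own line, so it survives.
per_line=$(printf '%s\n' "$input" | sed -e 's/<[^>]*>//g')

# Workaround (assumption, not from the original answer): append every
# following line to the pattern space, then substitute once over the whole text.
whole=$(printf '%s\n' "$input" |
  sed -e ':a' -e '$!{' -e 'N' -e 'ba' -e '}' -e 's/<[^>]*>//g')

printf '%s\n' "$per_line"   # prints "<div" and ">Lorem ipsum" on two lines
printf '%s\n' "$whole"      # prints: Lorem ipsum
```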
This regular expression consists of three parts: <, [^>]*, and >
- search for the opening <
- followed by zero or more characters (*) which are not the closing >. [...] is a character class; when it starts with ^, it matches characters not in the class
- and finally look for the closing >
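To see what the three parts match together, the pattern can be handed to grep with the -o option, which prints each match on its own line. This is a sketch only: -o is a common grep extension rather than a POSIX feature, and the sample string is invented.

```shell
# Hypothetical sample line containing an opening and a closing tag.
sample='<a href="x">link</a>'

# -o prints only the matched parts, one per line.
matches=$(printf '%s\n' "$sample" | grep -o '<[^>]*>')

printf '%s\n' "$matches"
# prints:
# <a href="x">
# </a>
```

Each match stops at the first > after its <, which is why the text between the tags is left alone.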
The simpler regular expression <.*> will not work, because it searches for the longest possible match, i.e. up to the last closing > in an input line. E.g., when you have more than one tag in an input line
<name>Olaf</name> answers questions.
will result in
answers questions.
instead of
Olaf answers questions.
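The two patterns can be compared side by side on that line. A minimal sketch, with variable names chosen for illustration:

```shell
line='<name>Olaf</name> answers questions.'

# Greedy: .* runs to the last > on the line, swallowing "Olaf" as well.
greedy=$(printf '%s\n' "$line" | sed -e 's/<.*>//g')

# Negated class: each match stops at the first >, so only the tags go.
fixed=$(printf '%s\n' "$line" | sed -e 's/<[^>]*>//g')

printf '[%s]\n' "$greedy"   # prints: [ answers questions.]
printf '[%s]\n' "$fixed"    # prints: [Olaf answers questions.]
```

The brackets in the output are only there to make the leading space left by the greedy version visible.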
See also Repetition with Star and Plus, especially section Watch Out for The Greediness! and following, for a detailed explanation.