Emptying HTML tags with sed
I want to empty the content of every HTML tag while keeping the structure.
From:
<h5>Holdrs <div class="tooltip" data-tooltip="Accounts with ..."></div></h5>
<div class="value">
<span class="amount">25,241</span><a class="smallnav" href="/c/token/0xB31f66AA3C1e785363F0875A1B7"><svg class="icon-s icon">
I want to get:
<>Holdrs <><><>
<>
<>25,241<><><>
From my understanding of sed this should be:
sed 's/<.*>/<>/'
but it only returns:
<>
<>
<>
(Tested here: https://sed.js.org/?gist=7af9c1c1762a6a93d582502b3d4fe22f).
What am I doing wrong? What's the correct pattern?
* is greedy, so <.*> matches everything from the first < to the last > in the line. Some tools understand *? as a non-greedy analogue of *, but sed does not.
In your case one can still go with sed. Replace . (any character) with [^>] (any character but >). You should also add the g flag because you want to replace all matches in the line, not just the first.
This should work:
sed 's/<[^>]*>/<>/g'
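To see the difference, both patterns can be run on one line of the sample input. This is only a minimal sketch; the variable names (line, greedy, per_tag) are made up for illustration.

```shell
# One line taken from the question's sample input.
line='<span class="amount">25,241</span><a class="smallnav" href="/c/token/0xB31f66AA3C1e785363F0875A1B7"><svg class="icon-s icon">'

# Greedy: .* runs from the first < to the last > on the line.
greedy=$(printf '%s\n' "$line" | sed 's/<.*>/<>/')

# Negated class: [^>]* stops at the first >, and g repeats this for every tag.
per_tag=$(printf '%s\n' "$line" | sed 's/<[^>]*>/<>/g')

printf '%s\n' "$greedy"    # <>
printf '%s\n' "$per_tag"   # <>25,241<><><>
```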
Just rename all nodes to empty strings and delete all attributes using xmlstarlet:
xml ed -r '//*' '' -d '//@*'
This will add an XML header (<?xml version="1.0"?>) and leave a slash in the closing tags (</>), which may be acceptable, or which you can remove with an additional tail/sed pass.
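The extra tail/sed pass could look roughly like this. The xml_out value is an assumed shape of the xmlstarlet output for the first sample line, written by hand and not captured from a real run, so treat it as a sketch.

```shell
# Assumed xmlstarlet output: an XML header line, then the renamed nodes.
xml_out='<?xml version="1.0"?>
<>Holdrs <></></>'

# tail -n +2 drops the header line; sed turns each </> into <>.
cleaned=$(printf '%s\n' "$xml_out" | tail -n +2 | sed 's|</>|<>|g')

printf '%s\n' "$cleaned"   # <>Holdrs <><><>
```

Using | as the sed delimiter avoids having to escape the slash in </>.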
Like others have already said, sed alone will never be able to handle all cases correctly, for example a tag that spans several lines or a > inside an attribute value.
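Two small inputs where the s/<[^>]*>/<>/g pattern goes wrong, as a sketch. Both inputs are invented for illustration and do not come from the question's page.

```shell
# A > inside an attribute value ends the match too early.
attr=$(printf '%s\n' '<a title="a > b">link</a>' | sed 's/<[^>]*>/<>/g')

# sed works line by line, so a tag split across two lines is never matched as a whole.
multi=$(printf '<div\n  class="x">text</div>\n' | sed 's/<[^>]*>/<>/g')

printf '%s\n' "$attr"    # <> b">link<>
printf '%s\n' "$multi"   # first line stays <div, second becomes:   class="x">text<>
```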