Emptying HTML tags with sed

I want to empty the content of every HTML tag while "keeping the structure".

From:

<h5>Holdrs <div class="tooltip" data-tooltip="Accounts with ..."></div></h5>
<div class="value">
  <span class="amount">25,241</span><a class="smallnav" href="/c/token/0xB31f66AA3C1e785363F0875A1B7"><svg class="icon-s icon">

I want to get:

<>Holdrs <><><>
<>
  <>25,241<><><>

From my understanding of sed this should be:

sed 's/<.*>/<>/'

but it only returns:

<>
<>
  <>

(Tested here: https://sed.js.org/?gist=7af9c1c1762a6a93d582502b3d4fe22f).

What am I doing wrong? What's the correct pattern?

* is greedy, so <.*> matches everything from the first < to the last > in the line. Some tools understand *? as a non-greedy analogue of *, but sed does not.
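To see the greedy match in isolation, here is a small sketch on a made-up line with two tags (the sample string is only an illustration, not taken from the question):

```shell
# A made-up line containing two tags.
line='<b>bold</b> text'

# Greedy: .* extends to the LAST > on the line, so one match covers
# everything from the first < up to the end of </b>.
greedy=$(printf '%s\n' "$line" | sed 's/<.*>/<>/')
printf '%s\n' "$greedy"   # prints: <> text
```

Only the text after the last > survives, which is why the output in the question collapses to a single <> per line.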

In your case you can still use sed. Replace . (any character) with [^>] (any character but >). You should also add the g flag, because you want to replace all matches in the line, not just the first.

This should work:

sed 's/<[^>]*>/<>/g'
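As a quick check, here is that command applied to the first line of the sample from the question (the variable names are just for illustration):

```shell
# First line of the sample input from the question.
html='<h5>Holdrs <div class="tooltip" data-tooltip="Accounts with ..."></div></h5>'

# [^>]* cannot cross a >, so each tag is matched separately;
# the g flag replaces every match on the line.
emptied=$(printf '%s\n' "$html" | sed 's/<[^>]*>/<>/g')
printf '%s\n' "$emptied"   # prints: <>Holdrs <><><>
```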

Just rename all nodes to empty strings and delete all attributes using xmlstarlet:

xml ed -r '//*' -v '' -d '//@*'

This will add an XML header (<?xml version="1.0"?>) and leave a slash in the closing tags (</>) which may be acceptable, or which you can remove with an additional tail/sed pass.

Like others have already said, sed alone will never be able to handle all cases correctly.
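One case where a sed pattern like s/<[^>]*>/<>/g goes wrong is an attribute value that itself contains a >. A minimal sketch, using a made-up tag that is not from the question:

```shell
# Made-up input: the title attribute contains a literal >.
tricky='<a title="x > y">link</a>'

# [^>]* stops at the > inside the attribute value, so the opening tag
# is cut short and the rest of the attribute leaks into the output.
broken=$(printf '%s\n' "$tricky" | sed 's/<[^>]*>/<>/g')
printf '%s\n' "$broken"   # prints: <> y">link<>
```

Tags that span several lines are another such case, since sed processes its input one line at a time by default.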
