How to remove an HTML tag in Java [duplicate]
Is there a regular expression that can completely remove an HTML tag? By the way, I'm using Java.
Solution 1:
There is jsoup, a Java library made for HTML manipulation. Look at the clean() method and the Whitelist object (renamed Safelist in newer jsoup releases). It's an easy-to-use solution!
Solution 2:
You should use an HTML parser instead. I like HtmlCleaner, because it gives me a pretty-printed version of the HTML.
With HtmlCleaner you can do:
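// htmlCleaner is an org.htmlcleaner.HtmlCleaner instance; stream holds the HTML input.
TagNode root = htmlCleaner.clean( stream );
// Attributes need the @ prefix in XPath.
Object[] found = root.evaluateXPath( "//div[@id='something']" );
// Check the first result, not the array itself.
if( found.length > 0 && found[0] instanceof TagNode ) {
    ((TagNode)found[0]).removeFromTree();
}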
Solution 3:
If you just need to remove tags then you can use this regular expression:
content = content.replaceAll("<[^>]+>", "");
It removes only the tags themselves, not other HTML constructs such as entities or the text inside script and style elements. For anything more complex you should use a parser.
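To make that limitation concrete, here is a minimal sketch of what the pattern does and does not touch. The class and method names (TagStripper, stripTags) are made up for illustration and are not part of the answer.

```java
import java.util.regex.Pattern;

// Hypothetical helper; the name TagStripper is illustrative only.
final class TagStripper {
    // Same pattern as in the answer: "<", one or more non-">" characters, then ">".
    private static final Pattern TAG = Pattern.compile("<[^>]+>");

    static String stripTags(String html) {
        return TAG.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        // The tags disappear, the text between them stays.
        System.out.println(stripTags("<p>Hello <b>world</b></p>"));
        // Entities are left exactly as they were; nothing is decoded.
        System.out.println(stripTags("a &amp; b<br>"));
    }
}
```

Precompiling the Pattern is only a convenience for repeated calls; String.replaceAll with the same regex behaves the same way.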
EDIT: To avoid problems with HTML comments you can do the following:
content = content.replaceAll("(?s)<!--.*?-->", "").replaceAll("<[^>]+>", "");
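Why the comment pass has to run first can be seen with a comment that contains a ">" character. The sketch below compares the two approaches. The class and method names are hypothetical, and the comment pattern here carries the (?s) flag so that comments spanning several lines match as well.

```java
// Hypothetical demo class; names are illustrative and not from the answer.
final class CommentThenTags {
    // Tag pattern alone.
    static String tagsOnly(String html) {
        return html.replaceAll("<[^>]+>", "");
    }

    // Comments first, then tags. (?s) lets '.' match line breaks too.
    static String commentsThenTags(String html) {
        return html.replaceAll("(?s)<!--.*?-->", "").replaceAll("<[^>]+>", "");
    }

    public static void main(String[] args) {
        String html = "x <!-- a > b --> y";
        // The tag pattern stops at the first '>' and leaves " b -->" behind.
        System.out.println("[" + tagsOnly(html) + "]");
        // Removing the comment first leaves no debris.
        System.out.println("[" + commentsThenTags(html) + "]");
    }
}
```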
Solution 4:
No. Regular expressions cannot, by definition, parse HTML, because HTML's nested structure is not a regular language.
You could use a regex such as s/<[^>]*>//g or something similarly naive, but it's going to be insufficient, especially if you're interested in removing the contents of tags.
As another poster said, use an actual HTML parser.
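As a rough illustration of where the naive substitution falls short, the sketch below runs it from Java on two small inputs. The class name NaiveRegexLimits and the sample strings are made up for this example.

```java
// Hypothetical demo class; the inputs are made-up examples.
final class NaiveRegexLimits {
    // Java equivalent of the naive substitution s/<[^>]*>//g.
    static String strip(String html) {
        return html.replaceAll("<[^>]*>", "");
    }

    public static void main(String[] args) {
        // A '>' inside an attribute value ends the match too early,
        // so part of the tag is left behind in the output.
        System.out.println("[" + strip("<a title=\"1 > 0\">link</a>") + "]");
        // The script tags go away, but the script body stays as plain text.
        System.out.println("[" + strip("<script>var x = 1;</script>text") + "]");
    }
}
```

A parser handles both cases because it knows where attribute values and element contents begin and end.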
Solution 5:
You don't need an HTML parser. The code below removes all HTML comments:
htmlString = htmlString.replaceAll("(?s)<!--.*?-->", "");
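The (?s) prefix matters here: it turns on DOTALL mode, so "." also matches line terminators. A small sketch of the difference, with hypothetical class and method names:

```java
// Hypothetical demo class; names are illustrative only.
final class CommentStripper {
    // Without (?s), '.' does not match line terminators.
    static String withoutDotAll(String html) {
        return html.replaceAll("<!--.*?-->", "");
    }

    // With (?s), '.' matches any character, including '\n'.
    static String withDotAll(String html) {
        return html.replaceAll("(?s)<!--.*?-->", "");
    }

    public static void main(String[] args) {
        String html = "a<!-- line 1\nline 2 -->b";
        // The multi-line comment survives without the flag.
        System.out.println(withoutDotAll(html).equals(html));
        // With the flag, the whole comment is removed.
        System.out.println(withDotAll(html));
    }
}
```

The reluctant quantifier (.*?) keeps each match from running past the first "-->", so several comments in one string are removed separately.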