How do I preserve line breaks when using jsoup to convert html to plain text?
A solution that actually preserves line breaks looks like this:
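import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.safety.Whitelist; // renamed to Safelist in newer jsoup releases

public static String br2nl(String html) {
    if (html == null)
        return html;
    Document document = Jsoup.parse(html);
    // Disable pretty-printing so html() preserves line breaks and spacing
    document.outputSettings(new Document.OutputSettings().prettyPrint(false));
    document.select("br").append("\\n");      // literal backslash-n marker, not a line feed
    document.select("p").prepend("\\n\\n");
    String s = document.html().replaceAll("\\\\n", "\n"); // turn markers into real line feeds
    return Jsoup.clean(s, "", Whitelist.none(), new Document.OutputSettings().prettyPrint(false));
}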
It satisfies the following requirements:
- If the original HTML contains a newline (\n), it is preserved.
- If the original HTML contains br or p tags, they are translated to newlines (\n).
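The escaping in br2nl is easy to misread, so here is a minimal sketch of just that part, using only the standard library (no jsoup). The class and method names are hypothetical and exist only for illustration. It shows that "\\n" in Java source is a two-character marker (backslash plus n), and that the regex "\\\\n" matches that marker while leaving real line feeds alone.

```java
class MarkerEscapeDemo {
    // "\\n" in Java source is two characters: a backslash followed by 'n'.
    static final String MARKER = "\\n";

    // Turn every literal backslash-n marker into a real line feed.
    // As a regex, "\\\\n" matches one literal backslash followed by 'n'.
    static String restoreNewlines(String s) {
        return s.replaceAll("\\\\n", "\n");
    }

    public static void main(String[] args) {
        String marked = "first" + MARKER + "second";
        System.out.println(marked);                  // still a single line
        System.out.println(restoreNewlines(marked)); // now split over two lines
    }
}
```

Because the marker is inserted as plain text, it survives document.html() untouched, and real newlines already present in the input are not affected by the replacement.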
With
Jsoup.parse("A\nB").text();
the output is
"A B"
and not
A
B
To work around this, I use:
descrizione = Jsoup.parse(html.replaceAll("(?i)<br[^>]*>", "br2n")).text();
text = descrizione.replaceAll("br2n", "\n");
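The two string steps of this placeholder approach can be tried in isolation. The sketch below uses only the standard library and leaves out the jsoup text() call that would sit between the two steps. The class and method names are hypothetical, and the token is the one used above.

```java
class BrPlaceholderDemo {
    // Token taken from the snippet above; any string unlikely to occur in the text works.
    static final String TOKEN = "br2n";

    // Step 1: swap every <br> variant (any case, with or without attributes) for the token.
    static String markBreaks(String html) {
        return html.replaceAll("(?i)<br[^>]*>", TOKEN);
    }

    // Step 2 (run on the extracted text): swap the token back to a line feed.
    static String restoreBreaks(String text) {
        return text.replace(TOKEN, "\n");
    }

    public static void main(String[] args) {
        String marked = markBreaks("A<br>B<BR/>C<br class=\"x\" />D");
        System.out.println(marked);
        System.out.println(restoreBreaks(marked));
    }
}
```

One caveat: if the original text happens to contain the token itself, step 2 turns that occurrence into a line break too, so a more distinctive token may be safer.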