HTML to UNFORMATTED plain text?

I'm looking for a way to convert a folder full of HTML files to plain text. What I want is for the text files to be as much as possible like what I'd get if I selected all the text in a web browser, copied it, and pasted the text into a plain text file.

NO, REALLY, I WANT UNFORMATTED PLAIN TEXT. All of the solutions that I'm finding produce Markdown or something that looks like it, or try to preserve layout, or use asterisks and underscores to indicate text formatting, or preserve the content of scripts in the output file, or do some clever goddam thing.

All I want is the words written by the author in the order that the author wrote them. I don't even care if the processing converts all of the list items in a list into a single paragraph, or even collapses the entire document into a single paragraph. Any of that is much better than getting anything other than the actual language contained in the document.

I'd love a terminal application or Python script, but I'll take anything I can get.


Solution 1:

html2text is a Python script that converts a page of HTML into equivalent Markdown-structured text, and it can be downloaded and run on any operating system that has Python installed. Confusingly, many Linux distributions also package an older C program with the same name; that is the version which accepts the -style option and can be run from the command line like this:

html2text -style pretty input.html  

This command not only converts the original HTML file to text, it also does a pretty good job of making the plain-text output easy to read. The headings look like headings, the lists look like lists, etc.
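If you need to convert a whole folder (as the question asks), the Python html2text package can also be used as a library and told to suppress most of its Markdown markup. The following is only a sketch under those assumptions: the folder name is a placeholder, the option names are taken from the package's configuration and may differ between versions, and headings will still come out with # prefixes:

import html2text            # the Python html2text package (pip install html2text)
from pathlib import Path

folder = Path("html_folder")   # placeholder folder name

converter = html2text.HTML2Text()
converter.ignore_links = True      # drop [text](url) link markup
converter.ignore_images = True     # drop image references
converter.ignore_emphasis = True   # drop * and _ emphasis markers
converter.body_width = 0           # don't hard-wrap lines

for page in sorted(folder.glob("*.html")):
    text = converter.handle(page.read_text(encoding="utf-8", errors="replace"))
    page.with_suffix(".txt").write_text(text, encoding="utf-8")

Each input.html gets an input.txt written next to it.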

If you're having trouble automatically converting tables from web pages to unformatted text, this can easily be done with a modern Markdown editor such as Typora or Mark Text, both GUI applications for Windows/Mac/Linux. Comparing the two, Mark Text is better than Typora at accurately capturing everything on a web page, while Typora has a more user-friendly editor, so I use both: Mark Text as a web-page grabber, and then I copy/paste the captured Markdown into Typora and edit it there.

Solution 2:

Use w3m -dump <page.html>.

It will give you a plain-text representation of the HTML file.

From the man page:

-dump  dump formatted page into stdout

Although it says formatted, the output is just plain text.
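To run this over a whole folder rather than one page at a time, a small wrapper script is enough. This is only a sketch, assuming w3m is installed and on the PATH; the folder name is a placeholder:

import subprocess
from pathlib import Path

folder = Path("html_folder")   # placeholder folder name

for page in sorted(folder.glob("*.html")):
    # w3m -dump renders the page and writes the plain-text result to stdout
    result = subprocess.run(
        ["w3m", "-dump", str(page)],
        capture_output=True, text=True, check=True,
    )
    page.with_suffix(".txt").write_text(result.stdout, encoding="utf-8")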

Solution 3:

As mentioned by Gombai Sándor, in a comment to NZD's answer:

lynx -dump -nolist -nomargins

When run from the command line with a URL, it writes the output to stdout. This seems to work very well. -nomargins may not be supported if one only has access to an older version of lynx (e.g. Lynx Version 2.8.5rel.5 (29 Oct 2005) on an old UNIX).

The output appears quite free of markup and links, with some potential exceptions (the following list may not be typical or exhaustive):

  • Extra white space does appear in tabular data; while that white space is usually helpful for extracting the tabular data, it is occasionally inconsistent in ways that complicate parsing.
  • While link URLs are not dumped, visible link text may still be output. For example, footnote references may render as asterisks, or, on a wiki, clickables may render as the equivalent plain text (without the underlying URL).
  • Some references may be expanded, outputting their alternate text.
  • Unordered lists dump with asterisks and indentation.
  • Ordered lists dump with numbers and indentation.
  • Input fields may appear as underscores.

Solution 4:

Unix.com: How to remove only HTML tags in a file provides:

sed -n '/^$/!{s/<[^>]*>//g;p;}' filename

or html2text.

CommandLineFu: Remove all HTML tags shows another sed one-liner, as well as an awk alternative.

I believe this is a fairly common operation provided by multiple programs, and that the most common name for this task is to "strip" the HTML. A quick Google search for "Linux strip html tags" shows multiple solutions, including PHP's strip_tags.
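If a pure-Python version of that tag-stripping idea is preferred over sed, the standard library's html.parser module can do it without any third-party packages. This is only an illustrative sketch (the class and function names here are made up for the example, not taken from any of the linked pages); unlike the sed one-liner, it also discards the contents of <script> and <style> elements:

from html.parser import HTMLParser

class TagStripper(HTMLParser):
    """Keep only text content; drop tags and the bodies of <script>/<style>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip = 0   # nesting depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)

def strip_html(html: str) -> str:
    stripper = TagStripper()
    stripper.feed(html)
    # Collapse all runs of whitespace; the question says even one big paragraph is fine.
    return " ".join("".join(stripper.parts).split())

Calling strip_html() on the contents of each file yields one whitespace-normalized block containing just the document's words.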