Command-line CSS selector tool

Question

What tool (preferably for Linux) can select the content of an HTML element based on its CSS path?

Example

For example, consider the following HTML document:

<html>
<body>
  <div class="header">
  <h1>Header</h1>
  </div>
  <div class="content">
    <table>
      <tbody>
      <tr><td class="data">Tabular Content 1</td></tr>
      <tr><td class="data">Tabular Content 2</td></tr>
      </tbody>
    </table>
  </div>
  <div class="footer">
  <p>Footer</p>
  </div>
</body>
</html>

What command-line program (e.g., a kind of "cssgrep") can extract values using a CSS selector? That is:

cssgrep page.html "body > div.content > table > tbody > tr > td.data"

The program would write the following to standard output:

Tabular Content 1
Tabular Content 2

Related Links

  • https://getfirebug.com/wiki/index.php/Command_Line_API#.24.24.28selector.29
  • https://stackoverflow.com/questions/7334942/is-there-something-like-a-css-selector-or-xpath-grep
  • https://github.com/keeganstreet/element-finder
  • http://www.w3.org/Tools/HTML-XML-utils/

Thank you!


Solution 1:

Use the W3C HTML-XML-utils, which parse HTML/XML and extract content using CSS selectors. For example:

hxnormalize -l 240 -x filename.html | hxselect -s '\n' -c "td.data"

This will produce the desired output:

Tabular Content 1
Tabular Content 2

Specifying a line length of 240 characters helps ensure that elements with long content are not split across multiple lines. The -x option makes hxnormalize produce a well-formed XML document, which hxselect can then process.
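
If the hx* tools are not installed, they are available on Debian/Ubuntu in the html-xml-utils package (the package name may differ on other distributions):

sudo apt-get install html-xml-utils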

Solution 2:

CSS Solution

The Element Finder tool (elfinder) will partially accomplish this task:

  • https://github.com/keeganstreet/element-finder
  • http://keegan.st/2012/06/03/find-in-files-with-css-selectors/

For example:

elfinder -j -s td.data -x "html"

This outputs the results in JSON format, from which the matched content can be extracted.
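
For example, assuming elfinder writes its JSON to standard output, the results can be pretty-printed (or filtered) with jq:

elfinder -j -s td.data -x "html" | jq '.'

The exact output schema depends on the elfinder version, so inspect the pretty-printed output before writing a more specific jq filter.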

XML Solution

The XML::Twig Perl module (sudo apt-get install xml-twig-tools) comes with a tool named xml_grep that can do just that, provided, of course, that your HTML is well-formed.

I'm sorry I'm not able to test this at the moment, but something like this should work:

xml_grep -t '*/div[@class="content"]/table/tbody/tr/td[@class="data"]' file.html
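
If the HTML is not well-formed, one option is to normalize it first with hxnormalize from Solution 1 and then run xml_grep on the result; a sketch (file.xhtml is just a temporary file name used here):

hxnormalize -x file.html > file.xhtml
xml_grep -t '*/div[@class="content"]/table/tbody/tr/td[@class="data"]' file.xhtml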

Solution 3:

pup (https://github.com/ericchiang/pup) has a CSS-based query language that closely matches your example. In fact, with your input, the following command:

pup "body > div.content > table > tbody > tr > td.data text{}"

produces:

Tabular Content 1
Tabular Content 2

The trailing text{} display function prints only the text of the matched elements, stripping the HTML tags.

One nice feature is that the full path need not be given, so again with your example:

$ pup 'td.data text{}' < input.html
Tabular Content 1
Tabular Content 2

One advantage of pup is that it uses the golang.org/x/net/html package to parse HTML5, so the input does not need to be well-formed.
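
pup also offers other display functions besides text{}; for example, json{} (assuming a reasonably recent version of pup) emits the matched elements as JSON for further processing:

$ pup 'td.data json{}' < input.html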