Grep and Sed Equivalent for XML Command Line Processing

Solution 1:

I've found xmlstarlet to be pretty good at this sort of thing.

http://xmlstar.sourceforge.net/

It should be available in most distribution repositories, too. An introductory tutorial is here:

http://www.ibm.com/developerworks/library/x-starlet.html
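
To give a feel for it, here is a minimal sketch of grep-like and sed-like use. It assumes the binary is installed as xmlstarlet (some distributions install it as xml instead), and the sample file and variable names are made up for illustration. The sel subcommand queries with XPath; the ed subcommand edits and writes the result to stdout.

```shell
#!/bin/sh
# Hypothetical sample input, created here so the sketch is self-contained.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
<?xml version="1.0"?>
<foo><textnode>ddd</textnode><textnode a="bv">dsss</textnode></foo>
EOF

if command -v xmlstarlet >/dev/null 2>&1; then
    # grep-like: print the text of every textnode whose attribute a is "bv".
    selected=$(xmlstarlet sel -t -v '//textnode[@a="bv"]' -n "$tmp")
    echo "$selected"

    # sed-like: replace the text of that node; the input file is not modified.
    edited=$(xmlstarlet ed -u '//textnode[@a="bv"]' -v 'changed' "$tmp")
    echo "$edited"
else
    echo "xmlstarlet not found; install it to run this sketch" >&2
fi

rm -f "$tmp"
```

Because the match is an XPath expression rather than a regular expression, it does not depend on how the XML happens to be broken into lines.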

Solution 2:

Some promising tools:

  • nokogiri: parses HTML/XML DOMs in Ruby using XPath and CSS selectors

  • hpricot: deprecated

  • fxgrep: Uses its own XPath-like syntax to query documents. Written in SML, so installation may be difficult.

  • LT XML: XML toolkit derived from SGML tools, including sggrep, sgsort, xmlnorm and others. Uses its own query syntax. The documentation is very formal. Written in C. LT XML 2 claims support for XPath, XInclude and other W3C standards.

  • xmlgrep2: simple and powerful searching with XPath. Written in Perl using XML::LibXML and libxml2.

  • XQSharp: Supports XQuery, a query language that extends XPath. Written for the .NET Framework.

  • xml-coreutils: Laird Breyer's toolkit equivalent to GNU coreutils. Discussed in an interesting essay on what the ideal toolkit should include.

  • xmldiff: Simple tool for comparing two xml files.

  • xmltk: doesn't seem to have a package in Debian, Ubuntu, Fedora, or MacPorts, hasn't had a release since 2007, and uses non-portable build automation.

xml-coreutils seems the best documented and most UNIX-oriented.

Solution 3:

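There is also the xml2 / 2xml pair. xml2 converts XML into a flat, line-oriented format and 2xml converts it back, so the usual string editing tools (grep, sed and so on) can be used to process XML.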

Example. q.xml:

<?xml version="1.0"?>
<foo>
    text
    more text
    <textnode>ddd</textnode><textnode a="bv">dsss</textnode>
    <![CDATA[ asfdasdsa <foo> sdfsdfdsf <bar> ]]>
</foo>

xml2 < q.xml

/foo=
/foo=   text
/foo=   more text
/foo=   
/foo/textnode=ddd
/foo/textnode
/foo/textnode/@a=bv
/foo/textnode=dsss
/foo=
/foo=    asfdasdsa <foo> sdfsdfdsf <bar> 
/foo=

xml2 < q.xml | grep textnode | sed 's!/foo!/bar/baz!' | 2xml

<bar><baz><textnode>ddd</textnode><textnode a="bv">dsss</textnode></baz></bar>
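
Two more pipelines in the same spirit, as a rough sketch: one pulls out attribute values, the other changes a text value and converts back. It assumes xml2 and 2xml are installed; the sample file and variable names are made up for illustration.

```shell
#!/bin/sh
# Hypothetical sample input, created here so the sketch is self-contained.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
<?xml version="1.0"?>
<foo><textnode>ddd</textnode><textnode a="bv">dsss</textnode></foo>
EOF

if command -v xml2 >/dev/null 2>&1 && command -v 2xml >/dev/null 2>&1; then
    # grep-like: keep only the lines for attribute a, then strip the path
    # so that just the value after the first "=" remains.
    attrs=$(xml2 < "$tmp" | grep '/@a=' | cut -d= -f2-)
    echo "$attrs"

    # sed-like: rewrite one text value in the flat form, then rebuild XML.
    rewritten=$(xml2 < "$tmp" | sed 's!^/foo/textnode=ddd$!/foo/textnode=eee!' | 2xml)
    echo "$rewritten"
else
    echo "xml2/2xml not found; install them to run this sketch" >&2
fi

rm -f "$tmp"
```

Anchoring the sed pattern with ^ and $ keeps it from touching other lines that merely contain the same text.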

P.S. There are also html2 / 2html, which do the same for HTML.