Extracting filenames from href attributes

Solution 1:

For an XML file with this simple format, you can use grep:

grep -Po 'href="\K[^"]*' file.xml > filenames.lst
  • -P Use Perl-compatible regular expressions (PCRE); this is a GNU grep option
  • -o Output only the matched part of each line
  • \K Drop everything matched up to this point (here, href=") from the reported match
  • [^"]* Match any number (*) of characters that are not (^) a double quote (").
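The same pattern can be sketched in Python for comparison. Python's built-in re module does not support \K, so this sketch uses a capture group instead; the sample string and the function name extract_hrefs are illustrative assumptions, not taken from the question.

```python
import re

# Hypothetical sample input; the real layout of file.xml is assumed, not known.
sample = '<item href="a.txt"/>\n<item href="dir/b.png"/>\n'

# Match the literal prefix href=" and capture everything up to the next
# double quote. The capture group plays the role of \K in the grep command.
HREF_RE = re.compile(r'href="([^"]*)')

def extract_hrefs(text):
    """Return every href value found in text, in order of appearance."""
    return HREF_RE.findall(text)

print(extract_hrefs(sample))
```

Like the grep one-liner, this treats the file as plain text, so it will also pick up href="..." inside comments or on unrelated elements.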

However, for more complex XML you should prefer a proper XML parser, e.g. xmlstarlet:

xmlstarlet sel -t -v '//item/@href' -n file.xml > filenames.lst

On Debian or Ubuntu, xmlstarlet can be installed via

sudo apt install xmlstarlet

As you have tagged your question with python, of course you can also use that:

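#!/usr/bin/env python3
import xml.etree.ElementTree as ET

# ET.parse() returns an ElementTree; getroot() gives the root element.
root = ET.parse('file.xml').getroot()
# Print the href attribute of every <item> element below the root.
for item in root.findall('.//item'):
    print(item.attrib['href'])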
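The script above raises a KeyError if any item element has no href attribute, and it prints to the terminal rather than writing filenames.lst. A minimal sketch of a more defensive variant follows; the sample document and the helper names hrefs_from_xml and write_list are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the structure of the real file.xml is an assumption.
SAMPLE = """<list>
  <item href="a.txt"/>
  <item/>
  <group><item href="dir/b.png"/></group>
</list>"""

def hrefs_from_xml(xml_text):
    """Return the href of every <item> descendant that has one."""
    root = ET.fromstring(xml_text)
    # The [@href] predicate skips <item> elements without that attribute.
    return [item.get("href") for item in root.findall(".//item[@href]")]

def write_list(names, path="filenames.lst"):
    """Write one name per line, mirroring the shell redirection."""
    with open(path, "w", encoding="utf-8") as fh:
        for name in names:
            fh.write(name + "\n")

print(hrefs_from_xml(SAMPLE))
```

To process a real file, read it with ET.parse('file.xml').getroot() instead of ET.fromstring, then pass the resulting list to write_list.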