extracting filenames from href elements
Solution 1:
For an XML file with this simple format, you can use grep:
grep -Po 'href="\K[^"]*' file.xml > filenames.lst
- -P: Use Perl-compatible regular expressions (PCRE)
- -o: Output only the matched part
- \K: Keep everything matched up to this point out of the output
- [^"]*: Match any number of characters (*) that are not (^) a double quote (")
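The same pattern can be tried out in Python, with one difference: Python's built-in re module does not support \K, so the sketch below uses a capture group to isolate the value instead. SAMPLE and extract_hrefs are hypothetical names used only for illustration, and the sample markup is an assumption, not the file from the question.

```python
import re

# Hypothetical sample input; the real file.xml from the question is not shown here.
SAMPLE = '<items><item href="a.txt"/><item href="dir/b.txt"/></items>'

# No \K in Python's re module, so the parentheses capture only the attribute value.
HREF_RE = re.compile(r'href="([^"]*)"')


def extract_hrefs(text):
    """Return every double-quoted href value found in text, in order."""
    return HREF_RE.findall(text)


if __name__ == "__main__":
    for name in extract_hrefs(SAMPLE):
        print(name)
```

Like the grep command, this only looks at the text, so it would also match href="..." inside comments or other elements.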
However, if you have a more complex XML file, you should prefer a proper XML parser, e.g. xmlstarlet:
xmlstarlet sel -t -v '//item/@href' -n file.xml > filenames.lst
On Debian or Ubuntu, it can be installed with:
sudo apt install xmlstarlet
As you have tagged your question with python, you can of course use that as well:
#!/usr/bin/env python3
import xml.etree.ElementTree as ET

# Parse the file and print the href attribute of every <item> element
tree = ET.parse('file.xml')
for item in tree.findall('.//item'):
    print(item.attrib['href'])
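If some item elements might lack an href attribute, item.attrib['href'] raises a KeyError. A minimal sketch of a more tolerant variant follows, assuming the same item/href layout; SAMPLE_XML and hrefs_from_xml are hypothetical names, and the sample markup is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the second <item> deliberately has no href attribute.
SAMPLE_XML = '<list><item href="a.txt"/><item/><item href="b.txt"/></list>'


def hrefs_from_xml(xml_text):
    """Return the href of every <item> that has one, in document order."""
    root = ET.fromstring(xml_text)
    hrefs = []
    # iter() walks the whole tree, including the root element itself
    for item in root.iter('item'):
        href = item.get('href')  # None instead of KeyError when missing
        if href is not None:
            hrefs.append(href)
    return hrefs


if __name__ == "__main__":
    for name in hrefs_from_xml(SAMPLE_XML):
        print(name)
```

As with the shell commands above, the output can be redirected into filenames.lst.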