XPath and namespace specification for XML documents with an explicit default namespace

I'm struggling to get the correct combination of an XPath expression and the namespace specification as required by package XML (argument namespaces) for a XML document that has an explicit xmlns namespace defined at the top element.

UPDATE

Thanks to har07 I was able to put it together:

Once you query the namespaces, the first entry of ns has no name yet and that's the problem:

nsDefs <- xmlNamespaceDefinitions(doc)
ns <- structure(sapply(nsDefs, function(x) x$uri), names = names(nsDefs))

> ns
                                             omegahat                          r 
    "http://something.org"  "http://www.omegahat.org" "http://www.r-project.org" 

So we'll just assign a name that serves as a prefix (this can be any valid R name):

names(ns)[1] <- "xmlns"

Now all we have to do is using that default namespace prefix everywhere in our XPath expressions:

getNodeSet(doc, "/xmlns:doc//xmlns:b[@omegahat:status='foo']", ns)

For those interested in alternative solutions based on name() and namespace-uri() (amongst others) might find this post helpful.


Just for the sake of reference: this was the trial-and-error code before we came to the solution:

Consider the example from ?xmlParse:

require("XML")

doc <- xmlParse(system.file("exampleData", "tagnames.xml", package = "XML"))

> doc
<?xml version="1.0"?>
<doc>
  <!-- A comment -->
  <a xmlns:omegahat="http://www.omegahat.org" xmlns:r="http://www.r-project.org">
    <b>
      <c>
        <b/>
      </c>
    </b>
    <b omegahat:status="foo">
      <r:d>
        <a status="xyz"/>
        <a/>
        <a status="1"/>
      </r:d>
    </b>
  </a>
</doc>
nsDefs <- xmlNamespaceDefinitions(getNodeSet(doc, "/doc/a")[[1]])
ns <- structure(sapply(nsDefs, function(x) x$uri), names = names(nsDefs))
getNodeSet(doc, "/doc//b[@omegahat:status='foo']", ns)[[1]]

In my document, however, the namespaces are already defined in <doc> tag, so I adapted the example XML code accordingly:

xml_source <- c(
  "<?xml version=\"1.0\"?>",
  "<doc xmlns:omegahat=\"http://www.omegahat.org\" xmlns:r=\"http://www.r-project.org\">",
  "<!-- A comment -->",
  "<a>",
  "<b>",
  "<c>",
  "<b/>",
  "</c>",
  "</b>",
  "<b omegahat:status=\"foo\">",
  "<r:d>",
  "<a status=\"xyz\"/>",
  "<a/>",
  "<a status=\"1\"/>",
  "</r:d>",
  "</b>",
  "</a>",
  "</doc>"
)
write(xml_source, file="exampleData_2.xml")  
doc <- xmlParse("exampleData_2.xml")
nsDefs <- xmlNamespaceDefinitions(doc)
ns <- structure(sapply(nsDefs, function(x) x$uri), names = names(nsDefs))    
getNodeSet(doc, "/doc", namespaces = ns)
getNodeSet(doc, "/doc//b[@omegahat:status='foo']", namespaces = ns)[[1]]  

Everything still works fine. What's more, though, is that my XML code additionally has an explicit definition of the default namespace (xmlns):

xml_source <- c(
  "<?xml version=\"1.0\"?>",
  "<doc xmlns=\"http://something.org\" xmlns:omegahat=\"http://www.omegahat.org\" xmlns:r=\"http://www.r-project.org\">",
  "<!-- A comment -->",
  "<a>",
  "<b>",
  "<c>",
  "<b/>",
  "</c>",
  "</b>",
  "<b omegahat:status=\"foo\">",
  "<r:d>",
  "<a status=\"xyz\"/>",
  "<a/>",
  "<a status=\"1\"/>",
  "</r:d>",
  "</b>",
  "</a>",
  "</doc>"  
)
write(xml_source, file="exampleData_3.xml")  
doc <- xmlParse("exampleData_3.xml")
nsDefs <- xmlNamespaceDefinitions(doc)
ns <- structure(sapply(nsDefs, function(x) x$uri), names = names(nsDefs))

What used to work fails now:

> getNodeSet(doc, "/doc", namespaces = ns)
list()
attr(,"class")
[1] "XMLNodeSet"
Warning message:
using http://something.org as prefix for default namespace http://something.org 

> getNodeSet(doc, "/xmlns:doc", namespaces = ns)
XPath error : Undefined namespace prefix
XPath error : Invalid expression
Error in xpathApply.XMLInternalDocument(doc, path, fun, ..., namespaces = namespaces,  : 
  error evaluating xpath expression /xmlns:doc
In addition: Warning message:
using http://something.org as prefix for default namespace http://something.org 
getNodeSet(doc, "/xmlns:doc", 
  namespaces = matchNamespaces(doc, namespaces="xmlns", nsDefs = nsDefs)
)

This seems to get me closer:

> getNodeSet(doc, "/xmlns:doc",
+ namespaces = matchNamespaces(doc, namespaces="xmlns", nsDefs = nsDefs)
+ )[[1]]
<doc xmlns="http://something.org" xmlns:omegahat="http://www.omegahat.org" xmlns:r="http://www.r-project.org">
  <!-- A comment -->
  <a>
    <b>
      <c>
        <b/>
      </c>
    </b>
    <b omegahat:status="foo">
      <r:d>
        <a status="xyz"/>
        <a/>
        <a status="1"/>
      </r:d>
    </b>
  </a>
</doc> 

attr(,"class")
[1] "XMLNodeSet"

Yet, now I don't know how to proceed in order to get to the children nodes:

> getNodeSet(doc, "/xmlns:doc//b[@omegahat:status='foo']", ns)[[1]]
XPath error : Undefined namespace prefix
XPath error : Invalid expression
Error in xpathApply.XMLInternalDocument(doc, path, fun, ..., namespaces = namespaces,  : 
  error evaluating xpath expression /xmlns:doc//b[@omegahat:status='foo']
In addition: Warning message:
using http://something.org as prefix for default namespace http://something.org 

> getNodeSet(doc, "/xmlns:doc//b[@omegahat:status='foo']",
+ namespaces = c(
+ matchNamespaces(doc, namespaces="xmlns", nsDefs = nsDefs),
+ matchNamespaces(doc, namespaces="omegahat", nsDefs = nsDefs)
+ )
+ )
list()
attr(,"class")
[1] "XMLNodeSet"

Solution 1:

Namespace definition without prefix (xmlns="...") is default namespace. In case of XML document having default namespace, the element where default namespace declared and all of it's descendant without prefix and without different default namespace declaration are considered in that aforementioned default namespace.

Therefore, in your case you need to use prefix registered for default namespace at the beginning of all elements in the XPath, for example :

/xmlns:doc//xmlns:b[@omegahat:status='foo']

UPDATE :

Actually I'm not a user of r, but looking at some references on net something like this may work :

getNodeSet(doc, "/ns:doc//ns:b[@omegahat:status='foo']", c(ns="http://something.org"))

Solution 2:

I think @HansHarhoff provides a very good solution.

For anybody else still searching for a solution, in my experience I think the following works more generally since a single XML document can have multiple namespaces.

doc <- xmlInternalTreeParse(xml_source)

ns <- getDefaultNamespace(doc)[[1]]$uri
names(ns)[1] <- "xmlns"

getNodeSet(doc, "//xmlns:Parameter", namespaces = ns)