Exclude results from DBpedia SPARQL query based on URI prefix
How can I exclude a group of concepts when using the DBpedia SPARQL endpoint? I'm using the following basic query to get a list of concepts:
SELECT DISTINCT ?concept
WHERE {
?x a ?concept
}
LIMIT 100
This gives me a list of 100 concepts. I want to exclude all the concepts that fall into the YAGO class/group (i.e., whose IRIs begin with http://dbpedia.org/class/yago/). I can filter out individual concepts like this:
SELECT DISTINCT ?concept
WHERE {
?x a ?concept
FILTER (?concept != <http://dbpedia.org/class/yago/1950sScienceFictionFilms>)
}
LIMIT 100
But what I can't figure out is how to exclude all YAGO subclasses from my results. I tried using a * wildcard like this, but it didn't achieve anything:
FILTER (?concept != <http://dbpedia.org/class/yago/*>)
Update:
This query with regex seems to do the trick, but it's really, really slow and ugly. I'm still looking for a better alternative.
SELECT DISTINCT ?type WHERE {
[] a ?type
FILTER( regex(str(?type), "^(?!http://dbpedia.org/class/yago/).+"))
}
ORDER BY ASC(?type)
LIMIT 10
Solution 1:
It might seem a little awkward, but your comment about casting to a string and doing some string-based checks is probably on the right track. You can do it a little more efficiently using the SPARQL 1.1 function strstarts:
SELECT DISTINCT ?concept
WHERE {
?x a ?concept
FILTER ( !strstarts(str(?concept), "http://dbpedia.org/class/yago/") )
}
LIMIT 100
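To see that the prefix test selects the same concepts as the negative-lookahead regex from the question, here is a small Python sketch. It is only an illustration: Python's re module and str.startswith stand in for SPARQL's regex and strstarts, and the sample IRIs (apart from the YAGO one quoted in the question) are chosen just for the example.

```python
import re

YAGO_PREFIX = "http://dbpedia.org/class/yago/"

# Same pattern as the regex filter in the question: a negative lookahead
# that rejects any string starting with the YAGO namespace.
NOT_YAGO = re.compile(r"^(?!http://dbpedia.org/class/yago/).+")


def keep_by_regex(iri: str) -> bool:
    """Mimics FILTER(regex(str(?type), "^(?!...yago/).+"))."""
    return NOT_YAGO.match(iri) is not None


def keep_by_prefix(iri: str) -> bool:
    """Mimics FILTER(!strstarts(str(?concept), "...yago/"))."""
    return not iri.startswith(YAGO_PREFIX)


# Illustrative IRIs only; not taken from real query results.
SAMPLE_IRIS = [
    "http://dbpedia.org/class/yago/1950sScienceFictionFilms",
    "http://dbpedia.org/ontology/Film",
    "http://www.w3.org/2002/07/owl#Thing",
]

kept = [iri for iri in SAMPLE_IRIS if keep_by_prefix(iri)]
print(kept)
```

Both predicates keep the same two non-YAGO IRIs here; the prefix version simply states the intent more directly.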
The other alternative would be to find a top-level YAGO class and exclude those concepts that are rdfs:subClassOf that top-level class. This would probably be a better solution in the long run (since it doesn't require casting to strings, and it's based on graph structure). Unfortunately, it doesn't look like there is a single top-level YAGO class comparable to owl:Thing. I just downloaded the YAGO type hierarchy from DBpedia's download page and ran this query, which asks for classes with no superclasses, against it:
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
select distinct ?root where {
[] rdfs:subClassOf ?root
filter not exists { ?root rdfs:subClassOf ?superRoot }
}
and I got these nine results:
----------------------------------------------------------------
| root |
================================================================
| <http://dbpedia.org/class/yago/YagoLegalActorGeo> |
| <http://dbpedia.org/class/yago/WaterNymph109550125> |
| <http://dbpedia.org/class/yago/PhysicalEntity100001930> |
| <http://dbpedia.org/class/yago/Abstraction100002137> |
| <http://dbpedia.org/class/yago/YagoIdentifier> |
| <http://dbpedia.org/class/yago/YagoLiteral> |
| <http://dbpedia.org/class/yago/YagoPermanentlyLocatedEntity> |
| <http://dbpedia.org/class/yago/Thing104424418> |
| <http://dbpedia.org/class/yago/Dryad109551040> |
----------------------------------------------------------------
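The logic of that root-finding query can be sketched in a few lines of Python. This is a toy model, not the YAGO data: the ex: class names and the edge list are hypothetical, with each pair standing for one rdfs:subClassOf triple.

```python
# Hypothetical (subclass, superclass) pairs standing in for
# rdfs:subClassOf triples; the ex: names are made up for illustration.
EDGES = [
    ("ex:Cat", "ex:Mammal"),
    ("ex:Mammal", "ex:Animal"),
    ("ex:Car", "ex:Vehicle"),
]


def find_roots(edges):
    """Return classes that appear as a superclass but never as a subclass."""
    subclasses = {sub for sub, _ in edges}
    superclasses = {sup for _, sup in edges}
    return superclasses - subclasses


roots = find_roots(EDGES)
print(sorted(roots))
```

A class counts as a root when something is a subclass of it and it is not itself a subclass of anything, which is what the filter not exists clause expresses.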
Given that the YAGO concepts aren't quite as structured as some of the others, it looks like the string-based approach may be the best in this case. However, if you wanted to, you could write a non-string-based query like this, which asks for 100 concepts, excluding those that have one of those nine results as a superclass:
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix yago: <http://dbpedia.org/class/yago/>
select distinct ?concept where {
[] a ?concept .
filter not exists {
?concept rdfs:subClassOf* ?super .
values ?super {
yago:YagoLegalActorGeo
yago:WaterNymph109550125
yago:PhysicalEntity100001930
yago:Abstraction100002137
yago:YagoIdentifier
yago:YagoLiteral
yago:YagoPermanentlyLocatedEntity
yago:Thing104424418
yago:Dryad109551040
}
}
}
limit 100
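As a rough model of what the rdfs:subClassOf* path does in that query, the following Python sketch walks up a toy hierarchy and drops every concept that reaches an excluded root in zero or more steps. The ex: names, the PARENTS map, and the choice of excluded root are all hypothetical.

```python
# Hypothetical class hierarchy: each key maps to its direct superclasses.
PARENTS = {
    "ex:Cat": ["ex:Mammal"],
    "ex:Mammal": ["ex:Animal"],
    "ex:Car": ["ex:Vehicle"],
}

# Stand-in for the nine YAGO roots listed in the values block.
EXCLUDED_ROOTS = {"ex:Animal"}


def is_excluded(concept, parents, excluded_roots):
    """True if concept reaches an excluded root via zero or more subclass steps."""
    seen = set()
    stack = [concept]
    while stack:
        node = stack.pop()
        if node in excluded_roots:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(parents.get(node, []))
    return False


concepts = ["ex:Cat", "ex:Animal", "ex:Car", "ex:Vehicle"]
kept = [c for c in concepts if not is_excluded(c, PARENTS, EXCLUDED_ROOTS)]
print(kept)
```

Because the path allows zero steps, the excluded roots themselves are dropped along with their descendants.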
I'm not sure which ends up being faster. The first requires a conversion to string, and strstarts, if implemented in a naïve fashion, has to consume http://dbpedia.org/class/ in each concept before it finds a mismatch. The second requires nine comparisons that, if IRIs are interned, are just object identity checks. It's an interesting question for further investigation.
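If you want to try the comparison yourself, one possible starting point is a small Python timing helper like the one below. It is only a sketch under assumptions: the endpoint URL https://dbpedia.org/sparql is assumed rather than taken from the question, the helper names are hypothetical, and timings against a public endpoint include network latency and server-side caching, so treat any numbers as rough.

```python
import time
import urllib.parse
import urllib.request

# Assumption: the public DBpedia endpoint lives at this URL.
ENDPOINT = "https://dbpedia.org/sparql"

# The strstarts query from this answer.
PREFIX_QUERY = """
SELECT DISTINCT ?concept
WHERE {
  ?x a ?concept
  FILTER ( !strstarts(str(?concept), "http://dbpedia.org/class/yago/") )
}
LIMIT 100
"""


def build_url(query, endpoint=ENDPOINT):
    """Encode the query as the 'query' parameter of a SPARQL protocol GET request."""
    return endpoint + "?" + urllib.parse.urlencode({"query": query})


def time_query(query, endpoint=ENDPOINT):
    """Send the query and return the elapsed wall-clock time in seconds."""
    request = urllib.request.Request(
        build_url(query, endpoint),
        headers={"Accept": "application/sparql-results+json"},
    )
    start = time.perf_counter()
    with urllib.request.urlopen(request, timeout=60) as response:
        response.read()  # include the time needed to download the results
    return time.perf_counter() - start
```

Calling time_query(PREFIX_QUERY) a few times, and then doing the same with the values-based query, would give a first impression of which one the endpoint handles faster.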