federated SPARQL query planning #1217
I've been exploring federated SPARQL querying in Comunica, and it's really impressive so far. I have a question about how it decides which queries to submit to SPARQL endpoints. It seems that when you have multiple SPARQL endpoint datasources, it treats them like triple pattern fragments servers, submitting single-triple-pattern queries (which makes sense). But it seems to make some unnecessary requests. The example I'm talking about is here. I added two SPARQL endpoints:
This is my query:

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT *
WHERE {
  wd:Q50265665 wdt:P683 ?chebi_id .
  BIND(IRI(CONCAT("http://purl.obolibrary.org/obo/CHEBI_", ?chebi_id)) AS ?chebi)
  ?chebi rdfs:subClassOf ?superclass .
}
LIMIT 10

The first triple pattern exists in Wikidata, and the second exists in UniProt. These are the queries submitted by the SPARQL engine:
Why does it not avoid executing query 3, which unnecessarily downloads 5 GB of data, especially since it later executes query 12? And why submit any of the SELECT queries when the corresponding COUNT query returns 0?

I tested the same query on two triple pattern fragments servers running locally, backed by these two SPARQL endpoints. The result is more like what I expect: it avoids downloading a huge amount of data from UniProt and runs more quickly.

Thanks! Comunica looks like a great platform.
Replies: 1 comment 1 reply
Hi @balhoff, thanks for the detailed findings!
You're right that if a COUNT query returns 0, no subsequent SELECT query should be initiated.
Internally, Comunica will fetch the COUNT and SELECT queries at the same time.
However, the SELECT query should be lazy, and should only be triggered once its results are actually required.
As you have indicated, something seems to be going wrong with this laziness, which is causing the query to always be triggered.
The precise location where this lazy stream is created is here: https://github.com/comunica/comunica/blob/master/packages/actor-rdf-resolve-hypermedia-sparql/lib/RdfSourceSparql.ts#L184
So either this stream is not ac…
I'll convert this discussion into an issue, as it's a bug.
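
To make the intended behaviour concrete, here is a minimal TypeScript sketch of a lazy result stream that only issues the SELECT once a consumer actually asks for rows, and skips it entirely when the COUNT is 0. This is not Comunica's code: `PatternSource`, `lazyRows`, `FakeSource`, and `collect` are hypothetical names made up for illustration, and the sketch runs the COUNT on first consumption rather than eagerly alongside the SELECT.

```typescript
// Hypothetical types and names for illustration; this is NOT Comunica's API.
type Row = Record<string, string>;

interface PatternSource {
  count(): Promise<number>; // stands in for a COUNT query on one triple pattern
  select(): Promise<Row[]>; // stands in for the potentially expensive SELECT query
}

// Lazy stream: an async generator body does not run until the first next() call,
// so creating the stream sends no request at all.
async function* lazyRows(source: PatternSource): AsyncGenerator<Row> {
  const total = await source.count();
  if (total === 0) {
    return; // nothing matches this pattern, so the SELECT is never issued
  }
  for (const row of await source.select()) {
    yield row;
  }
}

// In-memory stand-in for an endpoint that records how often each query runs.
class FakeSource implements PatternSource {
  countCalls = 0;
  selectCalls = 0;
  private readonly rows: Row[];

  constructor(rows: Row[]) {
    this.rows = rows;
  }

  async count(): Promise<number> {
    this.countCalls += 1;
    return this.rows.length;
  }

  async select(): Promise<Row[]> {
    this.selectCalls += 1;
    return this.rows;
  }
}

// Drains a stream into an array.
async function collect(stream: AsyncGenerator<Row>): Promise<Row[]> {
  const out: Row[] = [];
  for await (const row of stream) {
    out.push(row);
  }
  return out;
}
```

With an empty `FakeSource`, draining `lazyRows` calls `count()` once and never calls `select()`. Creating the stream without consuming it calls neither, which is the laziness that appears to be missing in the reported behaviour.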