federated SPARQL query planning #1217

balhoff · 2023-06-01T00:32:02Z

balhoff
Jun 1, 2023

I've been exploring federated SPARQL querying in Comunica, and it's really impressive so far. I have a question about how it decides which queries to submit to SPARQL endpoints. When there are multiple SPARQL endpoint datasources, it seems to treat them like triple pattern fragment servers, submitting single-triple queries (which makes sense). However, it also seems to make some unnecessary requests. The example I'm talking about is here.

I added two SPARQL endpoints:

https://query.wikidata.org/sparql
https://sparql.uniprot.org/sparql

This is my query:

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT *
WHERE {
  wd:Q50265665 wdt:P683 ?chebi_id .
  BIND(IRI(CONCAT("http://purl.obolibrary.org/obo/CHEBI_", ?chebi_id)) AS ?chebi)
  ?chebi rdfs:subClassOf ?superclass .
}
LIMIT 10

The first triple pattern exists in Wikidata, and the second exists in UniProt. These are the queries submitted by the SPARQL engine:

1. UniProt: SELECT (COUNT(*) AS ?count) WHERE { ?chebi <http://www.w3.org/2000/01/rdf-schema#subClassOf> ?superclass. } -> 70666020
2. UniProt: SELECT (COUNT(*) AS ?count) WHERE { <http://www.wikidata.org/entity/Q50265665> <http://www.wikidata.org/prop/direct/P683> ?chebi_id. } -> 0
3. UniProt: SELECT ?chebi ?superclass WHERE { ?chebi <http://www.w3.org/2000/01/rdf-schema#subClassOf> ?superclass. } -> More than 5 GB data
4. UniProt: SELECT ?chebi_id WHERE { <http://www.wikidata.org/entity/Q50265665> <http://www.wikidata.org/prop/direct/P683> ?chebi_id. } -> empty
5. Wikidata: SELECT (COUNT(*) AS ?count) WHERE { ?chebi <http://www.w3.org/2000/01/rdf-schema#subClassOf> ?superclass. } -> 0
6. Wikidata: SELECT (COUNT(*) AS ?count) WHERE { <http://www.wikidata.org/entity/Q50265665> <http://www.wikidata.org/prop/direct/P683> ?chebi_id. } -> 1
7. Wikidata: SELECT ?chebi ?superclass WHERE { ?chebi <http://www.w3.org/2000/01/rdf-schema#subClassOf> ?superclass. } -> empty
8. Wikidata: SELECT ?chebi_id WHERE { <http://www.wikidata.org/entity/Q50265665> <http://www.wikidata.org/prop/direct/P683> ?chebi_id. } -> one row
9. Wikidata: SELECT (COUNT(*) AS ?count) WHERE { <http://purl.obolibrary.org/obo/CHEBI_5931> <http://www.w3.org/2000/01/rdf-schema#subClassOf> ?superclass. } -> 0
10. UniProt: SELECT (COUNT(*) AS ?count) WHERE { <http://purl.obolibrary.org/obo/CHEBI_5931> <http://www.w3.org/2000/01/rdf-schema#subClassOf> ?superclass. } -> 4
11. Wikidata: SELECT ?superclass WHERE { <http://purl.obolibrary.org/obo/CHEBI_5931> <http://www.w3.org/2000/01/rdf-schema#subClassOf> ?superclass. } -> empty
12. UniProt: SELECT ?superclass WHERE { <http://purl.obolibrary.org/obo/CHEBI_5931> <http://www.w3.org/2000/01/rdf-schema#subClassOf> ?superclass. } -> 4 rows

Why does it not avoid executing query 3, which unnecessarily downloads 5 GB of data, especially since it later executes query 12? And why submit any of the SELECT queries when the corresponding COUNT query returns 0?
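
To make the expectation concrete, here is a minimal sketch (a hypothetical helper, not Comunica code). Once the Wikidata pattern has returned its single binding for ?chebi_id, the second pattern could be sent with ?chebi already substituted, which is the shape of query 12 and avoids the unbound query 3. The value "5931" is inferred from the CHEBI_5931 IRI in the log above.

```typescript
// Hypothetical helper; these names are illustrative and not part of Comunica.
const CHEBI_PREFIX = "http://purl.obolibrary.org/obo/CHEBI_";
const RDFS_SUBCLASS_OF = "http://www.w3.org/2000/01/rdf-schema#subClassOf";

/** Builds the second triple pattern's query with ?chebi already bound. */
function boundSuperclassQuery(chebiId: string): string {
  // Same IRI that BIND(IRI(CONCAT(...))) produces in the original query.
  const chebi = CHEBI_PREFIX + chebiId;
  return "SELECT ?superclass WHERE { <" + chebi + "> <" + RDFS_SUBCLASS_OF + "> ?superclass. }";
}

// Binding obtained from the Wikidata pattern (assumed value, see above).
const boundQuery = boundSuperclassQuery("5931");
```

With the binding substituted first, only the narrow query needs to reach UniProt.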

I tested the same query on two Triple Pattern Fragments servers running locally, backed by these two SPARQL endpoints. The result is closer to what I expect: it avoids downloading a huge amount of data from UniProt and runs more quickly. Thanks! Comunica looks like a great platform.

rubensworks · 2023-06-01T06:55:13Z

rubensworks
Jun 1, 2023
Maintainer

Hi @balhoff, thanks for the detailed findings!

You're right that if a COUNT query returns 0, no subsequent SELECT query should be initiated.

Internally, Comunica will fetch the COUNT and SELECT queries at the same time.
However, the SELECT query should be lazy, and should only be triggered from the moment it actually becomes required.
As you have indicated, something seems to be going wrong with this laziness, which is making the query always get triggered.

The precise location where this lazy stream is created is here: https://github.com/comunica/comunica/blob/master/packages/actor-rdf-resolve-hypermedia-sparql/lib/RdfSourceSparql.ts#L184
So either this stream is not actually lazy, or somewhere else in the codebase (e.g. the join operators), the stream is being triggered too early.
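
As an illustration of the intended behaviour only, here is a minimal sketch of a lazy result holder whose SELECT runs on the first read and is skipped entirely when the COUNT is 0. All names (`LazyRows`, `PatternResult`, `collectNonEmpty`) are hypothetical and do not come from Comunica's codebase, and the sketch is synchronous for brevity, whereas the real engine works with asynchronous streams.

```typescript
// Hypothetical sketch; none of these names come from Comunica's codebase.
type Row = Record<string, string>;

/** Defers running a query until its rows are first requested. */
class LazyRows {
  private cached: Row[] | undefined = undefined;

  constructor(private readonly execute: () => Row[]) {}

  /** True once the underlying query has actually been executed. */
  wasTriggered(): boolean {
    return this.cached !== undefined;
  }

  /** Executes the query on the first call and reuses the result afterwards. */
  read(): Row[] {
    if (this.cached === undefined) {
      this.cached = this.execute();
    }
    return this.cached;
  }
}

interface PatternResult {
  count: number; // result of the COUNT query
  rows: LazyRows; // SELECT query, not yet sent
}

/** Reads only the sources whose COUNT is non-zero. */
function collectNonEmpty(results: PatternResult[]): Row[] {
  let out: Row[] = [];
  for (const result of results) {
    if (result.count === 0) {
      continue; // read() is never called, so this SELECT is never sent
    }
    out = out.concat(result.rows.read());
  }
  return out;
}

// Usage, mirroring the first triple pattern from the question.
const sent: string[] = [];
const uniprot: PatternResult = {
  count: 0,
  rows: new LazyRows(() => { sent.push("UniProt SELECT"); return []; }),
};
const wikidata: PatternResult = {
  count: 1,
  rows: new LazyRows(() => { sent.push("Wikidata SELECT"); return [{ chebi_id: "5931" }]; }),
};
const rows = collectNonEmpty([uniprot, wikidata]);
```

In this sketch only the Wikidata SELECT is ever recorded in `sent`. If a consumer such as a join operator called `read()` on every source up front, the laziness would be lost, which is the second failure mode described above.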

I'll convert this discussion into an issue, as it's a bug.

1 reply

balhoff Jun 1, 2023
Author

Great, thank you @rubensworks.
