Caching remote discovery results #2976

Open
guruz opened this issue Mar 17, 2015 · 17 comments

Comments

@guruz
Contributor

guruz commented Mar 17, 2015

What about caching discovery results in the journal?

Normally this should not be needed as we're always syncing all files/dirs and storing the proper ETag in the journal.
However, in practice I think there are often issues like fatal sync errors, network timeouts, etc., which then mean a complete re-discovery on the next sync.

Something to chat about next week.

@ogoffart @dragotin @ckamm

@guruz guruz added this to the 1.9 - Multi-account milestone Mar 17, 2015
@MorrisJobke
Contributor

We currently notice some problems with randomly broken connections to an LDAP server, so the user backend can't properly return the home directory. To avoid randomly guessing what the correct home directory could be, we return an exception instead (owncloud/core#15332). In this case the sync client can also see such responses (which will cause an HTTP 500). If this happens during discovery, the whole discovery process is aborted and starts from scratch (and maybe never finishes because of this). Is it possible to just repeat the last (failed) request (because the error was most likely only temporary) and then resume the discovery from that point? We should maybe also improve the error message that is handed back, to say "this was just a temporary error and could be fixed with the next request" (a 503 with a Retry-After header).

That would make the client a bit more robust for these cases.
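A minimal sketch, assuming the discovery job can inspect the finished QNetworkReply, of how such a temporary 503 could be recognised and its Retry-After hint honoured so that only the failed PROPFIND is repeated instead of aborting the whole discovery. isTemporaryError() and scheduleRetry() are hypothetical names, not existing client code:

// Sketch only: recognise a "try again later" response during discovery.
// Retry-After may also be an HTTP-date; only the seconds form is handled here.
#include <QNetworkReply>
#include <QNetworkRequest>
#include <QByteArray>

static bool isTemporaryError(QNetworkReply *reply, int *retryAfterSecs)
{
    const int httpCode =
        reply->attribute(QNetworkRequest::HttpStatusCodeAttribute).toInt();
    if (httpCode == 503) {
        const QByteArray retryAfter = reply->rawHeader("Retry-After");
        *retryAfterSecs = retryAfter.isEmpty() ? 30 : retryAfter.toInt();
        return true;
    }
    return false;
}

// Hypothetical call site in the discovery job's error handler:
//   int delay = 0;
//   if (isTemporaryError(reply, &delay))
//       scheduleRetry(currentFolder, delay);  // repeat only this PROPFIND
//   else
//       abortSync();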

@guruz
Contributor Author

guruz commented Apr 4, 2017

@ogoffart @ckamm @jturcotte @PVince81 @DeepDiver1975 Now that we're thinking of doing the recursive/infinity PROPFIND I'm wondering if it makes sense to think about this again.

The infinity PROPFIND could be implemented by having it fill a local cache:
map (remote_dirName + ETag) -> contents

The issue was originally about any discovery problem. It would also help with flaky network connections, basically every time the discovery would abort.

This should be implemented in a way that doesn't use so much memory again (e.g. sqlite3?).
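A rough sketch of what such a (remote_dirName + ETag) -> contents cache could look like as a plain in-memory map; all names are made up for illustration, and a real implementation might back it with sqlite3 as mentioned above to keep memory usage bounded:

// Sketch of an in-memory discovery cache keyed by remote directory name,
// valid only while the stored ETag still matches the one seen on the server.
#include <string>
#include <unordered_map>

struct CachedFolder {
    std::string etag;      // directory ETag at the time it was listed
    std::string contents;  // e.g. the raw PROPFIND multistatus XML
};

using DiscoveryCache = std::unordered_map<std::string, CachedFolder>;

bool lookup(const DiscoveryCache &cache, const std::string &dir,
            const std::string &currentEtag, std::string *contents)
{
    const auto it = cache.find(dir);
    if (it == cache.end() || it->second.etag != currentEtag)
        return false;  // unknown or stale: fall back to a real PROPFIND
    *contents = it->second.contents;
    return true;
}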

@guruz guruz reopened this Apr 4, 2017
@guruz guruz added this to the 2.4.0 milestone Apr 4, 2017
@ogoffart
Contributor

ogoffart commented Apr 4, 2017

That cache already exists. It's called the database (sync journal).

@guruz
Contributor Author

guruz commented Apr 4, 2017 via email

@guruz
Contributor Author

guruz commented Apr 5, 2017

...which means it could be flushed after a successful sync.

@guruz
Contributor Author

guruz commented Apr 5, 2017

Similar to how it was implemented before: 980c176

@dragotin
Contributor

dragotin commented Apr 5, 2017

@ogoffart 💟
@guruz isn't it true that we only do real propfinds into a tree if the ETag of the directory has changed, and read from the client journal instead if not?

IIRC, what we discussed a long time ago was to keep the entire tree in memory so we don't have to rebuild the whole tree again; maybe that should be discussed again.

@ogoffart
Contributor

ogoffart commented Apr 5, 2017

It's true, we could have another table in the database if we want to persist the values between failed syncs.

@dragotin
Contributor

dragotin commented Apr 5, 2017

@ogoffart What would that look like? And what would make you confident that it is still valid? What would make it different from the existing table?

@guruz
Contributor Author

guruz commented Apr 5, 2017

If we want to persist it between failed syncs but for simplicity not between client restarts, we could use a sqlite3 temporary DB which gets automatically cleaned up and if small enough stays in memory: http://stackoverflow.com/a/32004907/2941

I was so far only thinking about a simple key-value store: (remote_dirName + ETag) -> dir_contents
The dir_contents could even be the plain XML returned from PROPFIND. We dice the incoming infinity PROPFIND into single PROPFINDs, or if we request a single PROPFIND, we put it in directly.
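A sketch of that key-value store on top of an sqlite3 temporary database; the table and column names are made up for illustration. Opening an empty filename gives sqlite3's private temporary database, which is deleted automatically when the connection closes and is typically kept in the page cache while it stays small:

// Sketch only: back the PROPFIND cache with an sqlite3 temporary database.
#include <sqlite3.h>

sqlite3 *openDiscoveryCache()
{
    sqlite3 *db = nullptr;
    if (sqlite3_open("", &db) != SQLITE_OK)  // "" = auto-deleted temporary DB
        return nullptr;
    sqlite3_exec(db,
        "CREATE TABLE propfind_cache ("
        "  path TEXT NOT NULL,"          // remote directory name
        "  etag TEXT NOT NULL,"          // directory ETag at listing time
        "  xml  BLOB,"                   // raw PROPFIND response for that dir
        "  PRIMARY KEY (path, etag))",
        nullptr, nullptr, nullptr);
    return db;
}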

@guruz
Contributor Author

guruz commented Apr 5, 2017

And what would make you confident that it is still valid?

Having ETag in the cache key.

@ogoffart
Contributor

ogoffart commented Apr 5, 2017

This is the current algorithm:

remote_discovery(folder) {
    folder_from_network = run_propfind(folder);
    folder_from_db = load_from_db(folder);
    if (folder_from_network.etag == folder_from_db.etag) {
        readFromDb(folder);
    } else {
        file_tree_walk(folder_from_network);  // recursively go to every folder
    }
}

The new algorithm would look like:

remote_discovery(folder) {
    folder_from_network = run_propfind(folder);
    folder_from_db = load_from_db_recursively(folder);
    if (folder_from_network.etag == folder_from_db.etag) {
        readFromDb(folder);
    } else {
        folder_from_cache = load_from_cache(folder);
        if (folder_from_network.etag == folder_from_cache.etag) {
            readFromCache(folder);
        } else {
            file_tree_walk(folder_from_network);  // recursively go to every folder
            add_to_cache(folder_from_network);
        }
    }
}

sounds pretty simple

If we want to persist it between failed syncs but for simplicity not between client restarts,

I think if it's in the db, it's just as simple to make it persist across client restarts. We'd clear the cache after every successful sync anyway.
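Continuing the illustrative propfind_cache table sketched earlier, the add_to_cache / load_from_cache helpers from the pseudocode and the clean-up after a successful sync could look roughly like this (error handling omitted, all names hypothetical):

// Sketch only: cache helpers on the illustrative propfind_cache table.
#include <sqlite3.h>
#include <string>

void addToCache(sqlite3 *db, const std::string &path,
                const std::string &etag, const std::string &xml)
{
    sqlite3_stmt *stmt = nullptr;
    sqlite3_prepare_v2(db,
        "INSERT OR REPLACE INTO propfind_cache (path, etag, xml) VALUES (?, ?, ?)",
        -1, &stmt, nullptr);
    sqlite3_bind_text(stmt, 1, path.c_str(), -1, SQLITE_TRANSIENT);
    sqlite3_bind_text(stmt, 2, etag.c_str(), -1, SQLITE_TRANSIENT);
    sqlite3_bind_blob(stmt, 3, xml.data(), int(xml.size()), SQLITE_TRANSIENT);
    sqlite3_step(stmt);
    sqlite3_finalize(stmt);
}

bool loadFromCache(sqlite3 *db, const std::string &path,
                   const std::string &etag, std::string *xml)
{
    sqlite3_stmt *stmt = nullptr;
    sqlite3_prepare_v2(db,
        "SELECT xml FROM propfind_cache WHERE path = ? AND etag = ?",
        -1, &stmt, nullptr);
    sqlite3_bind_text(stmt, 1, path.c_str(), -1, SQLITE_TRANSIENT);
    sqlite3_bind_text(stmt, 2, etag.c_str(), -1, SQLITE_TRANSIENT);
    const bool hit = sqlite3_step(stmt) == SQLITE_ROW;
    if (hit) {
        const char *blob = static_cast<const char *>(sqlite3_column_blob(stmt, 0));
        xml->assign(blob ? blob : "", blob ? sqlite3_column_bytes(stmt, 0) : 0);
    }
    sqlite3_finalize(stmt);
    return hit;  // a miss means: do the real PROPFIND and file_tree_walk
}

// Called after every successful sync, so the cache never outlives the journal.
void clearDiscoveryCache(sqlite3 *db)
{
    sqlite3_exec(db, "DELETE FROM propfind_cache", nullptr, nullptr, nullptr);
}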

@ckamm
Contributor

ckamm commented Apr 7, 2017

I agree that it makes sense to cache PROPFIND responses - even if the rest of the sync failed and the resulting changes aren't propagated yet because of it.

@ckamm
Contributor

ckamm commented Feb 15, 2018

I agree with @ogoffart - it should be relatively straightforward since our metadata table already functions as a propfind cache. Just having a second table with the same kind of staleness and retrieval logic should work. The cached data could be cleared once it arrives in the journal.

@guruz guruz modified the milestones: 2.5.0, 2.6.0 Mar 27, 2018
@ogoffart ogoffart self-assigned this Jul 4, 2018
@ckamm ckamm modified the milestones: 2.6.0, 2.7.0 Mar 25, 2019
@ckamm ckamm changed the title [Brainstorming] Discovery cache Caching remote discovery results Apr 11, 2019
@ckamm ckamm self-assigned this Apr 11, 2019
@ckamm
Contributor

ckamm commented Apr 11, 2019

I've started looking at this.

@ogoffart
Contributor

I think it would be better to do the discovery and propagation in parallel.

Now that the discovery has been refactored, it should be easier.

Then the cache wouldn't make sense anymore.

@ckamm
Contributor

ckamm commented Apr 12, 2019

@ogoffart Yeah, anything that reduces time between server query and writing to the db would do, and parallel propagation would have other advantages. Need to deal with deletes though. Let's discuss!
