Hi, we are trying to open-source our training/serving framework based on TensorFlow. It includes several patches to TensorFlow and TensorFlow Serving. We want to eliminate those patches by merging the reasonable changes into the official repos.
This PR introduces a new version policy called latest_once. A model using this policy loads the latest version once and skips all later polling. This is similar to the latest policy with file_system_poll_wait_seconds=0, except that it is a model-level setting rather than a process-level one. We cannot do this at the process level because we want to serve multiple models with different policies in a single instance.
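To make the intended behavior concrete, here is a minimal Python sketch of how latest_once differs from latest across successive file system polls. It is an illustration only, not the code in this PR: the class names, the `choose` method, and the `simulate` helper are all hypothetical, and the handling of an empty first poll (keep polling until a version appears) is an assumption.

```python
from typing import List, Optional


class LatestPolicy:
    """Illustrative stand-in for the existing latest policy."""

    def choose(self, available: List[int]) -> Optional[int]:
        # Follow the newest version on every poll; None means "no change".
        return max(available) if available else None


class LatestOncePolicy:
    """Illustrative stand-in for latest_once (hypothetical, per-model state)."""

    def __init__(self) -> None:
        self.loaded = False

    def choose(self, available: List[int]) -> Optional[int]:
        # After the first successful load, ignore every later poll.
        if self.loaded:
            return None
        # Assumption: if nothing is available yet, keep waiting.
        if not available:
            return None
        self.loaded = True
        return max(available)


def simulate(policy, polls: List[List[int]]) -> List[Optional[int]]:
    """Return the version served after each poll."""
    served: Optional[int] = None
    history: List[Optional[int]] = []
    for available in polls:
        choice = policy.choose(available)
        if choice is not None:
            served = choice
        history.append(served)
    return history


polls = [[1], [1, 2], [1, 2, 3]]
print(simulate(LatestPolicy(), polls))      # [1, 2, 3]
print(simulate(LatestOncePolicy(), polls))  # [1, 1, 1]
```

Because each model holds its own policy object in this sketch, one instance can serve some models with latest and others with latest_once, which is the point of making the setting model-level.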
Our use case is online training of giant recommendation models (>10T), which mainly consist of large sparse embedding tables. The framework mentioned above contains a dynamic embedding table, which supports serving-time insertion, deletion, and updating. When a model is published to serving, it loads the latest version and then listens for deltas of new updates. The benefits are reduced memory (only one version is loaded) and a smaller gap between serving and training (because deltas are small and fast to apply).