
Repo Lock [RHELDST-14968] #198

Merged: 1 commit merged into release-engineering:master from repo-lock on Mar 22, 2023

Conversation

amcmahon-rh (Contributor):

Some pub tasks work on mutable Pulp data, so if two such tasks run at the same time, one may overwrite the results of the other. To address this, we can create a "lock" using the repository notes field: the task submits a note and waits for it to become valid, then starts executing the actual task once it holds the oldest valid lock in the repository.

First pass attempt at implementing what Rohan suggested. I'm looking for general comments on the overall approach. There are still things to address, like changing print statements to logs and adding tests.

I imagined this being used in a with statement. So the maintenance tasks become:

with self.pulp_client.get_repo_lock("redhat-maintenance"):
    report = self.get_maintenance_report().result()
    report = self.adjust_maintenance_report(report)
    self.set_maintenance(report).result()
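
The claim/queue logic described above could be sketched roughly like this. This is an illustrative sketch only, not the actual pubtools-pulplib implementation; the `Claim` class, field names, and `oldest_valid` helper are all hypothetical:

```python
import uuid

class Claim:
    """One lock claim stored in the repo notes (illustrative shape)."""

    def __init__(self, created, valid_from, expires):
        self.id = uuid.uuid4().hex     # unique note key, e.g. "lock-claim-<id>"
        self.created = created         # when the claim was submitted
        self.valid_from = valid_from   # claim becomes valid after a short offset
        self.expires = expires         # stale claims past this point are ignored

    def is_valid(self, now):
        return self.valid_from <= now < self.expires


def oldest_valid(claims, now):
    """Return the claim currently holding the lock, or None if none is valid."""
    valid = [c for c in claims if c.is_valid(now)]
    return min(valid, key=lambda c: c.created) if valid else None
```

Each task polls the notes and proceeds only when `oldest_valid(...)` returns its own claim; expired claims drop out of contention automatically.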

@amcmahon-rh (Contributor Author):

Last Friday I tested this out with a small script that added/removed records from the QA "redhat-maintenance" repository. This worked as expected, and checking repos.json showed the modifications from all the scripts.

Here's a screenshot of the scripts in progress. I had all the commands written out and started each script within a second of the others. One thing that might stand out is that on the bottom right it jumps from 3rd to 1st in the queue. That's because the sleep between note queries is based on how many locks are ahead in the queue: while the lock was in 3rd place, the sleep delay was long enough to skip over a query while it was in 2nd place.

@rohanpm rohanpm (Member) left a comment

Looks pretty good to me. I know you were only looking for high-level comments but I put inline comments anyway to save time.

May I also suggest that the script you've been using for testing could be turned into a proper example (see the examples directory). A lot of the library features have an example script in the repo which exercises that one feature independently and also gives you a way to enable loggers at DEBUG level.

If you do that first, you might find development more efficient: you won't need to go back and delete 'print' statements later, since you can use proper logging right from the beginning (and whatever logs you need while initially developing this might be needed later when debugging it, so it's better to implement them as DEBUG logs than prints anyway).

@@ -448,6 +448,14 @@ def update_repository(self, repository):

return out

def get_repo_lock(self, repo_id):
rohanpm (Member):

I think this might be better off living on the Repository class, and also with no 'get' prefix, so the caller would write something like:

with repo.lock():
    repo.do_some_stuff(...)

Whether it lives on Client or on Repository is somewhat arbitrary, either will technically work. I feel that Repository is more consistent with the rest of the API which exists today.

Also, it forces the caller to verify the repo's existence first. That might seem a negative (as you're forced to do one extra, arguably unnecessary, request), but in fact, in many places where we first tried to make use of a repo without explicitly verifying it exists, users had trouble understanding the error messages if they passed the ID of a nonexistent repo.

It would be good also if we don't document the object returned by lock() and don't make it public API. i.e. we document that it can be used as a context manager, and that's it. This avoids us getting locked into an implementation that we might not be able to support long-term.

from more_executors.futures import f_map


LOCK_CLAIM_STR = "lock-claim-"
rohanpm (Member):

It should probably start with "pulplib-" so it's possible to trace these back to the library which created them, and also avoid clashes with anything else.

LOCK_CLAIM_STR = "lock-claim-"

LOCK_VALID_FROM_OFFSET = 10
LOCK_EXPIRATION_OFFSET = 3600
rohanpm (Member):

I'm pretty sure that the expiration should be part of the public API. This could be a default, but only the caller has an idea of how long they'd expect to be holding the lock, so it'd be best if they can set a smaller expiration in some cases.

For example in the maintenance-on/off case, it should only need the lock for a few seconds. The default would be safe but it would also be overkill.

The other magic numbers here probably don't need to be public API though we generally make them tunable by "hidden" environment variables, so that we can at least tweak them in emergency scenarios without code changes if that turns out to be needed.

LOCK_EXPIRATION_OFFSET = 3600

# How long to sleep before checking if we can activate the lock again
# This should be a sleep per locks ahead of us. If there's 3 locks ahead of
rohanpm (Member):

This should be a sleep per locks ahead of us

I feel like this part of the logic is unnecessary and might be better off being dropped.
I guess the benefit is only to reduce the number of requests to the server, but it seems negligible.

On the downside seems likely to introduce unnecessary delays. I'd expect in practice that clients often need to hold the lock for a fair bit less than 15 seconds, so this could easily introduce a half minute delay where it's just sleeping even though everyone else already released their locks.
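
Dropping the position-based scaling leaves a plain fixed-interval poll, which could be sketched like this. This is a hypothetical sketch, not the code under review; `POLL_INTERVAL`, `wait_for_lock`, and `is_lock_active` are illustrative names:

```python
import time

# Fixed interval rather than scaling the sleep by queue position, so the
# lock is picked up promptly once everyone ahead has released it.
POLL_INTERVAL = 5  # seconds; could be tunable via a hidden env var


def wait_for_lock(is_lock_active, timeout=300):
    """Poll until our claim is the active lock.

    Returns True once the lock is held, False if the timeout elapses.
    """
    deadline = time.monotonic() + timeout
    while True:
        if is_lock_active():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(POLL_INTERVAL)
```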

self._lock_claim = None

def get_repo_lock_data(self, filter_invalid=False):
print("Getting current repo lock data")
rohanpm (Member):

When you're done with these, rather than removing the prints you should probably change them to debug or info logs.

AWAIT_LOCK_ACTIVATION_SLEEP
* lock_claims.index(self._lock_claim)
)
except ValueError:
rohanpm (Member):

I'm not sure that we really need to catch this at all. This would happen if, for example, the lock notes contain corrupt data?

You can't really do anything other than raise an exception in such cases, and if you raise something other than the original exception it arguably just obscures the error. Also, if the backtrace shows more than one exception, a lot of developers will not read it properly and will randomly pick whichever of the exceptions they noticed first as being the root cause...

@amcmahon-rh amcmahon-rh force-pushed the repo-lock branch 2 times, most recently from 508133e to 0a28dd3 Compare February 9, 2023 10:30
@amcmahon-rh (Contributor Author):

@rohanpm

I've gone through and addressed your comments :)

I'm not entirely sure how I feel about having the AWAIT_ACTIVATION_SLEEP and LOCK_EXPIRATION_OFFSET values in the repo base file. I played with passing a kwargs dict instead but it "looked yucky". Though it's probably also a case of fixating too much on specific implementation details.

examples/repo_lock.py: 2 resolved comments (outdated)
Comment on lines 602 to 604
Lock the repo to show other tasks that it is being worked on.
This does not prevent modifications to the repo with the Pulp API, it
just provides a way to pause tasks to prevent race conditions.
rohanpm (Member):

After the first sentence, there should be a blank line, this is what separates a summary of the method from the more detailed description.

I think the doc string should also use the term "advisory lock" somewhere as I understand this to be the standard term for a lock which only works if everyone cooperates to check whether there's a lock. (e.g. man 2 flock)

And maybe clarify that this is a lock only recognized by pubtools-pulplib Client objects.

So, here's my attempt at writing a bit more detailed doc string which spells out the behavior:

Suggested change
Lock the repo to show other tasks that it is being worked on.
This does not prevent modifications to the repo with the Pulp API, it
just provides a way to pause tasks to prevent race conditions.
Obtain an exclusive advisory lock on this repository.
Returns a context manager representing the lock, intended to be used
via a `with` statement. When the context is entered, the caller will
wait until the lock can be acquired (or raise an exception if the lock
can't be acquired).
Only a single :class:`~pubtools.pulplib.Client` is able to hold the lock
on a repository at any given time. The lock does not prevent modifications
to the repo with the Pulp API, and does not affect other Pulp client
implementations or instances of :class:`~pubtools.pulplib.Client` not
using the `lock` method.


Args:
duration
How long the lock should remain valid for in seconds.
rohanpm (Member):

We should clarify that this is used to release the lock if the current process dies, but it would normally be released earlier.

Here's an attempt:

Suggested change
How long the lock should remain valid for in seconds.
Maximum duration of the lock, in seconds.
This value is used only if this client fails to release the lock
(for example, because the current process is killed).
In this case, the duration will be used by other clients in order
to detect and release stale locks, avoiding a deadlock.
There is no way to extend the duration of an acquired lock,
so the caller should always ensure they request a `duration`
high enough to cover the entire expected lifetime of the lock.

@@ -595,6 +597,30 @@ def sync(self, options=None):
)
)

def lock(self, duration=LOCK_EXPIRATION_OFFSET, retry_wait=AWAIT_ACTIVATION_SLEEP):
rohanpm (Member):

In the interest of keeping the API minimal, may I request that we don't make retry_wait part of the public API - just force the same value all the time (tunable by the "hidden" env var accessed elsewhere). I don't think the caller needs control of it.

Ideally, the client doesn't even know that we're implementing this by polling.

Also, about importing these constants here which you were uncomfortable with. I generally prefer to just put the default value as None, then have logic in the implementation which says "if the value is None, apply this default". That keeps the implementation details more hidden & also means that the caller can explicitly pass value=None to mean "please give me the default".
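
The None-default pattern suggested here looks roughly like this. A minimal sketch with illustrative names; the real default values and lock construction are implementation details:

```python
# Module-private default stays an implementation detail of the library.
_DEFAULT_LOCK_DURATION = 3600


def lock(duration=None):
    """Resolve the default inside the implementation.

    Callers can pass duration=None (or nothing) to mean "use the
    library default", keeping the constant out of the public API.
    """
    if duration is None:
        duration = _DEFAULT_LOCK_DURATION
    # Real code would construct and return the lock context manager here;
    # the resolved duration is returned only to keep the sketch testable.
    return duration
```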

self._activation_await_sleep = activation_await_sleep

def __enter__(self):
LOG.info("Attempting to lock %s." % self._repo_id)
rohanpm (Member):

Can you please let the logging framework do the string formatting instead, like this:

Suggested change
LOG.info("Attempting to lock %s." % self._repo_id)
LOG.info("Attempting to lock %s.", self._repo_id)

This is actually a pylint complaint that we should usually obey - https://pylint.pycqa.org/en/latest/user_guide/messages/warning/logging-format-interpolation.html. Many legacy projects ignore it but we should try to stick with it in the newer stuff.

url=os.path.join(self._client._url, url),
**kwargs
),
expiration_offset=duration,
rohanpm (Member):

I'd like to request that we clamp the caller's input to a reasonable range somewhere in here, because of the high impact in case the user provides bad values. e.g.

  • if the caller passes a too low duration, we can end up with the same silent data corruption we're trying to fix here.
  • if the caller passes a too high duration, we can end up with extreme delays/apparent deadlock in the cases where a stale lock needs to be released.

I'll make a rough guess:

  • if caller requested a duration less than 5 minutes, we should force it to 5 minutes (because basically anything could take that long if there's an outage which forces requests to be retried).
  • if caller requested a duration more than 6 hours, we should force it to 6 hours (kind of arbitrarily picked as one quarter of a day and fitting into one working day... and nothing comes to mind which should ever genuinely need the lock for longer than that.)

I prefer "clamp" rather than "validate" because if the approach is to validate, then later changes to it are potentially backwards-incompatible (e.g. if we accept 5 minutes today but later need to change it to 15 minutes, that could break someone).
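
Clamping along those lines could be sketched like this, using the rough 5-minute/6-hour guesses above. The bound values and the `clamp_duration` name are illustrative, not the merged implementation:

```python
LOCK_DURATION_MIN = 5 * 60       # 5 minutes, per the rough guess above
LOCK_DURATION_MAX = 6 * 60 * 60  # 6 hours


def clamp_duration(requested):
    """Force the requested duration into a safe range.

    Clamp silently rather than raise, so later changes to the range
    remain backwards-compatible for existing callers.
    """
    return max(LOCK_DURATION_MIN, min(requested, LOCK_DURATION_MAX))
```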

if not lock.is_valid
}
if invalid_locks_delta:
LOG.debug("Removing invalid locks from %s." % self._repo_id)
rohanpm (Member):

Given that this only happens if someone has crashed, I think this is worthy of WARNING log level, and probably mentioning every affected lock ID as well.

pubtools/pulplib/_impl/model/repository/base.py: resolved comment (outdated)
retry_wait:
How long to wait between attempting to activate the lock in
seconds.
Maximum duration of the lock, in seconds.
rohanpm (Member):

Seems like an off-by-one whitespace error here, this will probably mess with the docs rendering (should be possible to check by tox -e docs).

Suggested change
Maximum duration of the lock, in seconds.
Maximum duration of the lock, in seconds.

except HTTPError as e:
if self._pulp_exception_from_missing_lock(e, list(note_delta.keys())):
logging.error("Lock %s has already been deleted.", self._lock_claim.id)
pass
rohanpm (Member):

This line can just be deleted.



def timestamp(date):
# datetime.timestamp was added in py 3.3
rohanpm (Member):

There's no need to support any python other than 3.3.

rohanpm (Member):

...meant to say "older than 3.3" of course.

@amcmahon-rh amcmahon-rh force-pushed the repo-lock branch 5 times, most recently from 6f74a17 to 9ffc98b Compare February 20, 2023 13:16
@rohanpm rohanpm (Member) left a comment

I didn't think to check it earlier, but I've checked now and I see the new feature doesn't work at all when using the fake client ( https://release-engineering.github.io/pubtools-pulplib/api/testing.html ).

It'll crash like this:

  File "/home/rmcgover/src/pubtools-pulplib/pubtools/pulplib/_impl/model/repository/base.py", line 636, in <lambda>
    lambda url, **kwargs: self._client._request_executor.submit(
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'FakeClient' object has no attribute '_request_executor'

Can you please make it work also for the fake client?

The bare minimum of "make it work" in this case means just "doesn't crash", I wouldn't insist on implementing any actual locking, though it could be done.


LOCK_CLAIM_STR = "pulplib-lock-claim-"

LOCK_VALID_FROM_OFFSET = os.getenv("PULPLIB_LOCK_VALID_FROM_OFFSET", 10)
rohanpm (Member):

Wow, it seems like pylint these days detects the same issue, which I didn't know about. The static CI config currently complains about this:

pubtools/pulplib/_impl/model/repository/repo_lock.py:14:25: W1508: os.getenv default type is builtins.int. Expected str or None. (invalid-envvar-default)
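
The warning goes away when the `getenv` default stays a string and the conversion wraps the whole call, e.g. (sketch of the pattern, matching the constant name in the diff):

```python
import os

# Keep the getenv default a str (as pylint W1508 expects) and convert the
# result, so the value is an int whether or not the env var is set.
LOCK_VALID_FROM_OFFSET = int(os.getenv("PULPLIB_LOCK_VALID_FROM_OFFSET", "10"))
```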

@@ -0,0 +1,74 @@
from pubtools.pulplib import Client
rohanpm (Member):

Can you please fix up some consistency issues between this example and the others?

  • naming: _ vs -
  • naming: .py vs no extension
  • should be marked executable and start with a shebang line

amcmahon-rh (Contributor Author):

Hey Rohan, do you have any general advice for how to "make it work" regarding the FakeClient?

The request function lambda requires three attributes that don't exist in the FakeClient class, and a compounding issue is that one of those attributes is then used to get a method. Is it just a case of needing to set all of that up?

Looking through the repository code I can't find any special considerations for the FakeClient class, so I assume it's a case of modifying the FakeClient class to get it to work, while leaving the other code as is.

rohanpm (Member):

Firstly, though I said just "don't crash" could be the goal before, later I thought ahead to the changes needed in pubtools-pulp-maintenance-on to make use of this new feature and how it would be covered in unit tests there.

In fact it would be quite valuable if there were a way, when using the fake client, to access the repo lock history (same as today you can access the repo publish history). The way of testing the pubtools-pulp-maintenance-on change would then be as easy as, in existing test_maintenance_on test there, something like:

# It should have locked and then unlocked the repo
assert len(controller.repository_lock_history) == 2
assert controller.repository_lock_history[0].repository.id == "redhat-maintenance"
assert controller.repository_lock_history[0].event_type == "lock"
assert controller.repository_lock_history[1].repository.id == "redhat-maintenance"
assert controller.repository_lock_history[1].event_type == "unlock"

So I think making something like the above work should be the goal.

The usual approach so far has been for the real and fake clients to implement some of the same (private) API. For example, they both implement a def _publish_repository which takes care of real or fake publishing. A fair bit of refactoring would be needed here to follow that approach though.

I wouldn't expect this to be handled at the HTTP request layer, i.e. adding _request_executor and _do_request onto the fake client wouldn't be a good approach I think.

I think there are two reasonable approaches to take here:

  1. refactor it so it works similarly to the current approach already used for repo publishing (def _publish_repository) and file uploads (_do_upload). Give real and fake clients different implementations of a compatible API for locking/unlocking repos, then use it from the RepoLock class. The real client does the same kind of HTTP requests as currently prepared in RepoLock; the fake client always succeeds (as long as the repo exists) and just stores a lock/unlock event on a new attribute in the FakeState class.

  2. refactor so that RepoLock receives a reference to a client object. Then it can explicitly check if the client object is a fake client and, if so, it can jump to a different simplified implementation.

with lock:
pass

assert ctrl._state.repo_lock_history == expected_calls
rohanpm (Member):

A couple of issues with this:

  • the _state property isn't public API, so there still isn't a proper way to use this from other projects. Properties on _state which are meant to be public can be accompanied by an accessor on FakeController class, which is also where you'd write a doc string for it.
  • the format is nasty - string values. Is the user supposed to parse them as JSON? We should avoid exposing the implementation detail that locks are achieved by updating repo notes which happens to use JSON documents.

Ideally this would be consistent with publish_history, sync_history and return named tuples.
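
A flat history of named tuples, consistent with publish_history and sync_history, might look like this. An illustrative sketch; the `LockEvent` type and its field names are hypothetical, not the merged API:

```python
from collections import namedtuple

# Hypothetical event type mirroring the shape of publish_history entries.
LockEvent = namedtuple("LockEvent", ["repository", "action"])

# A flattened history: one entry per lock-related operation, so tests can
# assert on length and fields directly without nested indexing.
history = [
    LockEvent(repository="test-repo", action="lock"),
    LockEvent(repository="test-repo", action="unlock"),
]
```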

os.getenv("PULPLIB_LOCK_EXPIRATION_OFFSET_MAX", "216000")
)

AWAIT_ACTIVATION_SLEEP = os.getenv("PULPLIB_AWAIT_ACTIVATION_SLEEP", 10)
rohanpm (Member):

This still has the same bug as above used to (it will return an int if the env var is not set, and a str if the env var is set). I suppose if you set the env var, the later sleep(AWAIT_ACTIVATION_SLEEP) will crash.

def repo_lock_history(self):
"""
A list containing all lock actions carried out.
Each entry is a list itself, since several lock updates can be made
rohanpm (Member):

In which case does this happen? Is it only when the client decides to remove expired locks and there's more than one at a time?

It seems like a bit of a niche case, while having it be a list-of-list is more cumbersome to use than just a list, so I'm wondering if it'd be better to just flatten it.

By "more cumbersome to use", I mean for example where it has already been used in the tests:

    assert len(lock_history) == 2
    assert lock_history[0][0].repository == "test-repo"
    assert lock_history[0][0].action == "lock"
    assert lock_history[1][0].repository == "test-repo"
    assert lock_history[1][0].action == "unlock"

This test doesn't actually verify that there were two lock-related operations. It verifies that there were two groups of lock operations, and it's never verified that each group contains only the expected operation (e.g. there's no assert len(lock_history[0]) == 1). So that test can pass even if there are extra unexpected lock operations, and you've got to check at multiple levels of the structure if you want to avoid that. Seems annoying to use.

@rohanpm rohanpm (Member) left a comment

Looks pretty good, do you want to do final steps on this now?

  • remove WIP from the pull request message
  • ideally squash all the commits into just one
  • rebase it to fix the conflict and get a fresh CI run

os.getenv("PULPLIB_LOCK_EXPIRATION_OFFSET_MAX", "216000")
)

AWAIT_ACTIVATION_SLEEP = int(os.getenv("PULPLIB_AWAIT_ACTIVATION_SLEEP", 10))
rohanpm (Member):

This will technically work now, though the second parameter to getenv being an int rather than str is inconsistent with all the previous lines for no reason. Not sure if pylint will still complain about this too.

@amcmahon-rh amcmahon-rh changed the title Repo Lock WIP [RHELDST-14968] Repo Lock [RHELDST-14968] Mar 21, 2023
@amcmahon-rh amcmahon-rh force-pushed the repo-lock branch 2 times, most recently from f3ef622 to 3bcf93b Compare March 22, 2023 12:52
codecov bot commented Mar 22, 2023

Codecov Report

Patch coverage: 100.00% and no project coverage change.

Comparison is base (8bca412) 100.00% compared to head (33e0d6b) 100.00%.

Additional details and impacted files
@@            Coverage Diff             @@
##            master      #198    +/-   ##
==========================================
  Coverage   100.00%   100.00%            
==========================================
  Files           46        47     +1     
  Lines         3008      3153   +145     
==========================================
+ Hits          3008      3153   +145     
Impacted Files Coverage Δ
pubtools/pulplib/_impl/client/client.py 100.00% <100.00%> (ø)
pubtools/pulplib/_impl/fake/client.py 100.00% <100.00%> (ø)
pubtools/pulplib/_impl/fake/controller.py 100.00% <100.00%> (ø)
pubtools/pulplib/_impl/fake/state.py 100.00% <100.00%> (ø)
pubtools/pulplib/_impl/model/repository/base.py 100.00% <100.00%> (ø)
...btools/pulplib/_impl/model/repository/repo_lock.py 100.00% <100.00%> (ø)


There are some operations in Pulp which should not be done in parallel. Multiple instances of the same task can cause a race condition and overwrite data.

This change uses a repo's notes field to implement a locking system.
This signals to other tasks that the repo is being worked on and should
not be modified until the lock is removed.
@amcmahon-rh (Contributor Author):

I've gone through and refactored/extended tests to meet the 100% coverage requirement, as well as addressed the issues causing the static analysis to fail.

The bandit test is failing with request_without_timeout for every detected issue. We discussed this earlier in the week since it was affecting rcm-pub, and there's already a bug lodged with bandit for false positives.
PyCQA/bandit#996

@rohanpm (Member)

rohanpm commented Mar 22, 2023

The bandit test is failing with request_without_timeout for every detected issue. We discussed this earlier in the week since it was affecting rcm-pub, and there's already a bug lodged with bandit for false positives. PyCQA/bandit#996

There is also RHELDST-16791 which says that we should be pinning the bandit version here, like we do with other packages used in CI, so the CI will not flip from passing to failing when nobody made changes breaking anything.

@rohanpm rohanpm merged commit af722c8 into release-engineering:master Mar 22, 2023
7 of 8 checks passed