
Fix chord element counting #6354

Merged: 6 commits merged into celery:master on Oct 14, 2020

Conversation

@maybe-sybr (Contributor)

This change fixes chord accounting in the redis backend by ensuring that
group IDs are not overwritten by chained callbacks.

Amends the changeset from #5984

[Review comment on celery/canvas.py (outdated, resolved)]
lgtm-com bot commented Sep 18, 2020

This pull request introduces 2 alerts when merging 71f0f9e into 14a3524 - view on LGTM.com

new alerts:

  • 1 for Non-exception in 'except' clause
  • 1 for Module is imported with 'import' and 'import from'

maybe-sybr (Contributor, Author) commented Sep 18, 2020

Ah, I think the CI failures were a result of me slightly misunderstanding the base Signature type and a brain explosion around how results get passed along. Should be fixed now 🤞

maybe-sybr (Contributor, Author) commented Sep 18, 2020

Hmm, interestingly a group result in a chord doesn't behave like I might expect in redis. Not sure if it's my fault or not just yet:

    def test_nested_chord_group_chain_group_tail(self, manager):
        """
        Sanity check that a deeply nested group is completed as expected.
    
        Groups at the end of chains nested in chords have had issues and this
        simple test sanity checks that such a task structure can be completed.
        """
        try:
            manager.app.backend.ensure_chords_allowed()
        except NotImplementedError as e:
            raise pytest.skip(e.args[0])
    
        sig = chord(group(chain(
            identity.s(42),     # -> 42
            group(
                identity.s(),   # -> 42
                identity.s(),   # -> 42
            ),                  # [42, 42]
        )), identity.s())       # [[42, 42]]
        res = sig.delay()
>       assert res.get(timeout=TIMEOUT) == [[42, 42]]
E       assert [42, 42] == [[42, 42]]
E         At index 0 diff: 42 != [42, 42]
E         Left contains one more item: 42
E         Full diff:
E         - [[42, 42]]
E         ? -        -
E         + [42, 42]

It failed the other way on other backends in the previous CI pipeline because I had it asserting res.get(timeout=TIMEOUT) == [42, 42] without the encapsulating list :/

I think it might be failing based on some quirk of how chords absorb their headers but I'm not at all sure why it would be behaving differently on different backends.

@maybe-sybr (Contributor, Author)

It also looks like I may have broken a test case for the cache backend:
https://travis-ci.org/github/celery/celery/jobs/728244102

It expects a ChordError but it must not be getting raised as expected...

lgtm-com bot commented Sep 28, 2020

This pull request introduces 1 alert when merging e5d9e57 into ea37db1 - view on LGTM.com

new alerts:

  • 1 for Module is imported with 'import' and 'import from'

@auvipy (Member) left a comment

pytest is throwing the error in integration tests:

    File "/home/travis/build/celery/celery/.tox/3.8-integration-rabbitmq/lib/python3.8/site-packages/pytest_rerunfailures.py", line 7, in
        from _pytest.resultlog import ResultLog
    ModuleNotFoundError: No module named '_pytest.resultlog'

@maybe-sybr (Contributor, Author)

@auvipy - I think this got broken for master/everything (https://travis-ci.org/github/celery/celery/builds/730832672) :( perhaps some test deps got updated and broke it?

auvipy added this to the Future milestone Sep 30, 2020
@auvipy (Member) left a comment

can you try to pin some test deps to run the integration tests?

[Review comment on celery/canvas.py (outdated, resolved)]
thedrow (Member) commented Sep 30, 2020

I can't restart the build for some reason.
This does not reproduce locally for me.

auvipy closed this Sep 30, 2020
auvipy reopened this Sep 30, 2020
auvipy (Member) commented Sep 30, 2020

Redis & cache integration tests are failing now

@auvipy (Member) left a comment

can you re-check this failure?

>       assert res.get(timeout=TIMEOUT) == [[42, 42]]
E       assert [42, 42] == [[42, 42]]
E         At index 0 diff: 42 != [42, 42]
E         Left contains one more item: 42
E         Full diff:
E         - [[42, 42]]
E         ? -        -
E         + [42, 42]

t/integration/test_canvas.py:1015: AssertionError

@maybe-sybr (Contributor, Author)

Hey @auvipy. Sorry this PR hasn't gotten much love recently, I've been a bit busy!

wrt that test failure, I am now under the impression that the (AFAICT undocumented?) behaviour where groups included in groups and chords get unrolled, added as part of the fix for #1509 many years ago, might be to blame for some of this. I have added some new tests which enumerate the behaviour as I understand it and should make it clearer what we need to fix prior to merging this.

I've also found a couple of other issues in my adventures testing this PR, and I'm about to push up some new commits which fix those. One is closely related to chord counting, the other is about nested groups in chains and the promises used to fulfil them. Let me know if you'd like the promises one to be split into its own PR (it's arguably separate from this PR).

maybe-sybr force-pushed the maybe/fix-chord-counting branch 3 times, most recently from 34bd1c7 to ddb2985 on October 13, 2020 05:57
maybe-sybr (Contributor, Author) commented Oct 13, 2020

The latest build shows that the redis and key-value backends behave differently. Perhaps I've misunderstood unrolling and it doesn't happen with chords? I'm adding some more intermediate test cases locally, but if someone could chime in meanwhile to explain unrolling (if anyone knows :o ) it'd be helpful.

Edit: I think I understand why I was confused now. Groups will unroll nested groups, but obviously won't unroll chords. And chords won't unroll groups in their header either. So all my tests seem to be accurate, but the last one should have a doubly-nested list instead of a singly-nested one. That also means that the redis backend needs to be convinced to return a doubly-nested list in that case without breaking other cases. I think I know how to move forward now.
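
A rough sketch of that understanding in canvas terms (illustrative only; the trivial app and identity task below are stand-ins for the integration suite's fixtures):

    from celery import Celery, chain, chord, group

    app = Celery("sketch", broker="memory://", backend="cache+memory://")

    @app.task
    def identity(x):
        return x

    # Nested groups are unrolled into their parent group, so this behaves like
    # group(identity.s(1), identity.s(2), identity.s(3)) and yields the flat
    # result [1, 2, 3].
    flat = group(group(identity.s(1), identity.s(2)), identity.s(3))

    # A group at the end of a chain inside a chord header is *not* unrolled, so
    # the chord body should receive the doubly-nested shape [[42, 42]].
    nested = chord(
        group(chain(identity.s(42), group(identity.s(), identity.s()))),
        identity.s(),
    )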

I have some lint fixes to push but I'll save them for whenever I fix the integration tests.

lgtm-com bot commented Oct 13, 2020

This pull request introduces 1 alert when merging ddb2985 into 70cbed3 - view on LGTM.com

new alerts:

  • 1 for Module is imported with 'import' and 'import from'

auvipy (Member) commented Oct 13, 2020

sounds great. rabbit & redis integration tests are now passing while others are failing

thedrow (Member) commented Oct 14, 2020

I'm confused. Is #6411 part of this PR as well?

@maybe-sybr (Contributor, Author)

> I'm confused. Is #6411 part of this PR as well?

I'm in the middle of splitting the change up for merging in #6411 out of this PR since it's independent. I'll have a new changeset up here shortly once I sanity check a couple of things, @thedrow.

@maybe-sybr (Contributor, Author)

The change from #6411 has been stripped out of this PR now. I believe I've addressed all of the coverage commentary and have left extra commentary where requested. I'll wait for the coverage report to come back before I resolve some of those review comments.

auvipy (Member) commented Oct 14, 2020

needs rebase

lgtm-com bot commented Oct 14, 2020

This pull request fixes 1 alert when merging 2892737 into 735f167 - view on LGTM.com

fixed alerts:

  • 1 for Module is imported with 'import' and 'import from'

@maybe-sybr (Contributor, Author)

Rebased on top of master

auvipy modified the milestones: Future, 5.0.1 Oct 14, 2020
lgtm-com bot commented Oct 14, 2020

This pull request fixes 2 alerts when merging 71e40cf into 9367d36 - view on LGTM.com

fixed alerts:

  • 1 for Non-exception in 'except' clause
  • 1 for Module is imported with 'import' and 'import from'

The commit messages for the changes in this PR:

When a group task which is part of a chain was to be delayed by
`trace_task()`, it would be reconstructed from the serialized request.
Normally, this sets the `group_id` of encapsulated tasks to the ID of
the group being instantiated. However, in the specific situation of a
group that is the last task in a chain which contributes to the
completion of a chord, it is essential that the group ID of the top-most
group is used instead. This top-most group ID is used by the redis
backend to track the completions of "final elements" of a chord in the
`on_chord_part_return()` implementation. By overwriting the group ID
which was already set in the `options` dictionaries of the child tasks
being deserialized, the chord accounting done by the redis backend would
be made inaccurate and chords would never complete.

This change alters how options are overridden for signatures to ensure
that if a `group_id` has already been set, it cannot be overridden.
Since group ID should be generally opaque to users, this should not be
disruptive.
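
A rough sketch of the guard described above (an illustration of the idea only; the helper name is made up and the merged `canvas.py` code may differ):

    # Hypothetical helper: apply new options onto a signature's options dict
    # without clobbering a group_id that has already been assigned.
    def apply_options_preserving_group(sig_options, new_options):
        preserved_group_id = sig_options.get("group_id")
        sig_options.update(new_options)
        if preserved_group_id is not None:
            # Keep the group ID of the top-most group so the redis backend's
            # on_chord_part_return() accounting still lines up.
            sig_options["group_id"] = preserved_group_id
        return sig_options
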
This change amends the implementation of `chord.__length_hint__()` to
ensure that all child task types are correctly counted. Specifically:

 * all sub-tasks of a group are counted recursively
 * the final task of a chain is counted recursively
 * the body of a chord is counted recursively
 * all other simple signatures count as a single "final element"

There is also a deserialisation step if a `dict` is seen while counting
the final elements in a chord; however, this should become less important
with the merge of celery#6342 which ensures that tasks are recursively
deserialized by `.from_dict()`.
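
A rough sketch of those counting rules (illustrative only, not the merged `chord.__length_hint__()` code; `_chain` is celery's internal chain class and the helper name is made up):

    from celery.canvas import Signature, _chain, chord, group

    def count_final_elements(sig):
        """Count the "final elements" a canvas component contributes to a chord."""
        if isinstance(sig, dict):
            # Deserialise before counting, as noted above.
            sig = Signature.from_dict(sig)
        if isinstance(sig, group):
            # All sub-tasks of a group are counted recursively.
            return sum(count_final_elements(child) for child in sig.tasks)
        if isinstance(sig, _chain):
            # Only the final task of a chain is counted, recursively.
            return count_final_elements(sig.tasks[-1]) if sig.tasks else 0
        if isinstance(sig, chord):
            # The body of a chord is counted recursively.
            return count_final_elements(sig.body)
        # Any other simple signature counts as a single "final element".
        return 1
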
These tests are intended to show that group unrolling should be
respected in various ways by all backends. They should make it more
clear what behaviour we should be expecting from nested canvas
components and ensure that all the implementations (mostly relevant to
chords and `on_chord_part_return()` code) behave sensibly.

This avoids an issue where the `on_chord_part_return()` implementation
would check the length of the result of a chain ending in a nested
group. This would manifest in behaviour where a worker would be blocked
waiting for the result object it holds to complete, since it would
attempt to `.join()` the result object. In situations with plenty of
workers, this wouldn't really cause any noticeable issue apart from some
latency or unpredictable failures - but in concurrency-constrained
situations like the integration tests, it causes deadlocks.

We know from previous commits in this series that chord completion is
more complex than just waiting for a direct child, so we correct the
`size` value in `BaseKeyValueStoreBackend.on_chord_part_return()` to
respect the `chord_size` value from the request, falling back to the
length of the `deps` if that value is missing for some reason (this is
necessary to keep a number of the tests happy but it's not clear to me
if that will ever be the case in real life situations).
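
A rough sketch of that size selection (the function and variable names are assumptions drawn from the description above, not the backend's actual locals):

    def resolve_chord_size(request, deps):
        # Prefer the chord_size recorded on the request; fall back to the
        # number of direct dependencies if it is missing for some reason.
        chord_size = getattr(request, "chord_size", None)
        return chord_size if chord_size is not None else len(deps)
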
This change fixes the chord result flattening issue which manifested
when using the Redis backend due to its deliberate throwing away of
information about the header result structure. Rather than assuming that
all results which contribute to the finalisation of a chord should be
siblings, this change checks if any are complex (i.e. `GroupResult`s) and
falls back to behaviour similar to that implemented in the
`KeyValueStoreBackend` which restores the original `GroupResult` object
and `join()`s it.

We retain the original behaviour which is billed as an optimisation in
f09b041. We could behave better in the complex header result case by not
bothering to stash the results of contributing tasks under the `.j` zset
since we won't be using them, but without checking for the presence of
the complex group result on every `on_chord_part_return()` call, we
can't be sure that we won't need those stashed results later on. This
would be an opportunity for optimisation in future if we were to use an
`EVAL` to only do the `zadd()` if the group result key doesn't exist.
However, avoiding the result encoding work in `on_chord_part_return()`
would be more complicated. For now, it's not worth the brainpower.

This change also slightly refactors the redis backend unit tests to make
it easier to build fixtures and hit both the complex and simple result
structure cases.
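
A rough sketch of the fallback path described above (names, signature and flow are assumptions based on this description, not the merged redis backend code):

    from celery.result import GroupResult

    def maybe_join_complex_header(app, group_id, callback):
        """Restore and join the header GroupResult when it contains nested
        (complex) results, mirroring the KeyValueStoreBackend behaviour."""
        header_result = app.GroupResult.restore(group_id)
        if header_result is None:
            return False
        if any(isinstance(child, GroupResult) for child in header_result.results):
            # Complex header: joining preserves the nested result structure
            # instead of flattening siblings into one list.
            callback.delay(header_result.join(propagate=True))
            return True
        # Simple header: the existing flat optimisation can stay in place.
        return False
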
@maybe-sybr (Contributor, Author)

This new changeset should address the last of the missed coverage 🤞

auvipy merged commit beddbee into celery:master Oct 14, 2020
@maybe-sybr (Contributor, Author)

Thanks @auvipy
