Refactor concurrent uploading of local changes #18449

Closed · wants to merge 31 commits

Conversation

ovitrif
Contributor

@ovitrif ovitrif commented May 17, 2023

Resolves #17463

This PR adapts the logic for uploading local changes to the cloud and introduces unit tests for the batch job that uploads all local sites' posts & pages.

We've noticed in Sentry that the crash still occurs. We attempted to fix it via the mutex.withLock { … } workaround merged via PR #18067 into 22.4, but unfortunately it didn't eradicate the crash, as seen in this Sentry report for 22.4-rc-1.

Note to reviewers
I would love your input on steps 2 & 3 below. Please don't merge; I will handle the merge after collecting more input.

To Test

  1. Verify local draft posts and pages are uploaded
  2. Thoroughly review the code changes and this unit test
  3. Brainstorm what other tests / ideas I could try
  4. Verify CI unit tests job succeeded

Regression Notes

  1. Potential unintended areas of impact
    Offline posts and pages upload

  2. What I did to test those areas of impact (or what existing automated tests I relied on)
    Manual testing and unit testing

  3. What automated tests I added (or what prevented me from doing so)
    Unit tests for the updated code

PR submission checklist:

  • I have completed the Regression Notes.
  • I have considered adding accessibility improvements for my changes.
  • I have considered if this change warrants user-facing release notes and have added them to RELEASE-NOTES.txt if necessary.
  • UI Changes testing checklist: n/a

@ovitrif ovitrif changed the title from "Issue/17463 tdd fix" to "Refactor concurrent uploading of draft posts and pages" on May 17, 2023
@ovitrif ovitrif self-assigned this May 17, 2023
@ovitrif ovitrif added this to the 22.5 milestone May 17, 2023
@wpmobilebot
Contributor

wpmobilebot commented May 17, 2023

📲 You can test the changes from this Pull Request in Jetpack by scanning the QR code below to install the corresponding build.

App Name: Jetpack
Flavor: Jalapeno
Build Type: Debug
Version: pr18449-c3ab67a
Commit: c3ab67a
Direct Download: jetpack-prototype-build-pr18449-c3ab67a.apk
Note: Google Login is not supported on these builds.

@wpmobilebot
Contributor

wpmobilebot commented May 17, 2023

📲 You can test the changes from this Pull Request in WordPress by scanning the QR code below to install the corresponding build.

App Name: WordPress
Flavor: Jalapeno
Build Type: Debug
Version: pr18449-c3ab67a
Commit: c3ab67a
Direct Download: wordpress-prototype-build-pr18449-c3ab67a.apk
Note: Google Login is not supported on these builds.

@ovitrif ovitrif marked this pull request as ready for review May 17, 2023 18:23
@ovitrif ovitrif requested review from zwarm, a team and thomashorta May 17, 2023 18:23
Contributor

@thomashorta thomashorta left a comment

I was looking inside the Mutex code for the version we use (1.6.4) and the crash we see in Sentry is in a very specific situation that I wasn't able to reproduce at all.

I see you removed an unlock call from the code, which is definitely good. But the code inside Mutex makes me think something could go wrong even when just using withLock: it unlocks inside a finally, yet there is other code that also unlocks in the very specific situation where the continuation has already been cancelled right after trying to resume the LockWaiter.

Anyway, I think that Mutex implementation with LockWaiter is pretty confusing to follow, and I see it was changed in version 1.7.0 (Kotlin/kotlinx.coroutines#3020), so I wonder if that version would have this issue at all... 🤔
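
For reference, here is roughly what Mutex.withLock expands to in the 1.6.x sources (a simplified sketch from memory, not the project's code), which is why the unlock-in-finally alone may not be enough if the internal cancellation path also unlocks:

    import kotlinx.coroutines.sync.Mutex

    // Approximation of kotlinx.coroutines' Mutex.withLock: the unlock sits in a finally,
    // so it runs even when the action throws or the calling coroutine is cancelled.
    // The concern above is about a second, internal unlock path racing with this one.
    suspend inline fun <T> Mutex.withLockSketch(owner: Any? = null, action: () -> T): T {
        lock(owner) // may suspend; per its docs it can also release the lock itself on cancellation
        try {
            return action()
        } finally {
            unlock(owner)
        }
    }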

}
} catch (e: Exception) {
Contributor

I'm not quite sure this will prevent the other coroutines (for other sites) from being cancelled, since this try/catch is around the coroutine launches (which return instantly) and not around the actual code that throws the exception (the call to update). Also, this function creates a new coroutineScope which does not use a SupervisorJob (line 114), so if any job fails the cancellation will be propagated to its sibling coroutine jobs.

Also, even though we are talking about coroutine cancellation, this doesn't mean the code will actually stop running immediately, because coroutine cancellation is cooperative.

To be honest, I couldn't quite understand what exactly this class is trying to accomplish in terms of job execution order, parallel work, and structured concurrency, so it is a bit hard to understand if the proposed fixes here and previous PR would work as expected.
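
To illustrate that point, here is a minimal sketch of the pattern (hypothetical names, not the actual class): the try/catch wraps the launch calls, which return immediately, so it never sees the exception thrown inside the coroutines.

    import kotlinx.coroutines.coroutineScope
    import kotlinx.coroutines.launch

    // Hypothetical reduction of the pattern being discussed.
    suspend fun uploadAllSites(sites: List<String>, uploadSite: suspend (String) -> Unit) =
        coroutineScope {
            try {
                sites.forEach { site ->
                    launch {
                        // If this throws, the exception is NOT caught by the outer try/catch.
                        // It fails the parent scope instead, and because coroutineScope has no
                        // SupervisorJob, the sibling launches get cancelled too.
                        uploadSite(site)
                    }
                }
            } catch (e: Exception) {
                // Only failures that happen while starting the coroutines end up here.
            }
        }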

Contributor

This is a good point, and I think this actually might shed light on the original issue. I see that this same possibility exists in an earlier version of this implementation, and I believe that the underlying "double-unlock" issue could be a result of this.

One theory / possibility (after reading the backscroll, and this comment) is that the forEach is spawning multiple jobs on the ioDispatcher (with a thread pool), most of them blocking on the lock() invocation. Then, if any of the child jobs throws, the parent job cancels, and all siblings too, and then things get a little bit tricky, imo. Since lock() docs say:

This function releases the lock if it was already acquired by this function before the CancellationException was thrown.

and

Note that this function does not check for cancellation when it is not suspended.

So, although each site's uploads should be blocked and served in a FIFO manner, I wonder if the cancellation bubbling up to the parent job and then back down allows another child job to acquire the lock before it is cancelled as a sibling 🤔

Though there is a "prompt cancellation guarantee", does this guarantee that it cancels immediately when a sibling cancels? (Since those children can be on different threads, and since we catch and re-throw, it makes me wonder.)

I have not tested this hypothesis, so please take it with a grain of salt. I just wanted to suggest it as I've skimmed the back scroll, and your comment seems that it might shed some light on the issue.

Contributor

Yeah, I was actually trying to reproduce it with some custom code (though I was running it as a unit test on the JVM and not on an actual device, so I believe the thread implementation might be different).

Because of the mutex, no upload jobs will be running at the same time, though their coroutines would've been started and they're all suspended at the lock function.

I wonder if the cancellation bubbling up to the parent job and then back down allows another child job to acquire the lock before it is cancelled as a sibling 🤔

In my tests, I see this definitely happens, though it doesn't seem to be the cause of the crash, because in that scenario the unlock call throwing an exception would be either this or this, but not the one causing the Sentry crash, which is this one.

Since the crash does happen, I am guessing it is also possible that timing can make that call happen right when one of the jobs has released the lock AND while the cancellation is reaching the internal continuation in suspendCancellableCoroutineReusable from the Mutex.

This is a tricky crash, and I still think it could happen even using just withLock and not calling unlock on our side, though that extra unlock on our side definitely doesn't help, so maybe @ovitrif's changes here help solve the issue.

I feel that updating coroutines to 1.7.0 will definitely solve this crash (since the Mutex internals were completely changed), though I'm not sure what it would take to make that dependency update.

Contributor

Also, just adding more information about this part here:

To be honest, I couldn't quite understand what exactly this class is trying to accomplish in terms of job execution order, parallel work, and structured concurrency, so it is a bit hard to understand if the proposed fixes here and previous PR would work as expected.

What I mean is: I don't understand why we are starting multiple parallel jobs but suspending them all on the same mutex. In this scenario it feels to me the mutex (and concurrency) is not needed since in practice we are just running each job sequentially.

It looks like all those jobs could be called from 1 coroutine, with some exception handling, instead of relying on concurrent jobs and a syncing mechanism (mutex in this case).
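
A minimal sketch of that sequential shape, with hypothetical names and parameters (per-site error handling instead of a shared mutex):

    import kotlinx.coroutines.CancellationException
    import kotlinx.coroutines.CoroutineDispatcher
    import kotlinx.coroutines.withContext

    // Sketch only: iterate the sites in a single coroutine, so no Mutex is needed to
    // serialize the work. `uploadSite` is an illustrative stand-in for the real upload call.
    suspend fun uploadAllSitesSequentially(
        sites: List<String>,
        ioDispatcher: CoroutineDispatcher,
        uploadSite: suspend (String) -> Unit
    ) = withContext(ioDispatcher) {
        sites.forEach { site ->
            try {
                uploadSite(site) // finishes before the next site starts
            } catch (ce: CancellationException) {
                throw ce // keep cooperative cancellation intact
            } catch (e: Exception) {
                // A failure for one site no longer cancels the uploads for the other sites.
            }
        }
    }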

Contributor

Since the crash does happen, I am guessing it is also possible that timing can make that call happen right when one of the jobs has released the lock AND while the cancellation is reaching the internal continuation in suspendCancellableCoroutineReusable from the Mutex.

Yeah, this is exactly what I was thinking might be happening (and thanks for linking to the various sources where unlock is called). It is possible that the test dispatcher exhibits different behavior from the ioDispatcher actually used in production. If it is indeed a timing issue / race condition of some sort, this might explain why it is hard to reproduce in the test environment.

This is a tricky crash, and I still think it could happen even using just withLock

I also agree, since, as @ovitrif pointed out, it is basically doing the same thing we were doing (i.e. calling unlock in a finally).

In this scenario it feels to me the mutex (and concurrency) is not needed since in practice we are just running each job sequentially.

It looks like all those jobs could be called from 1 coroutine, with some exception handling, instead of relying on concurrent jobs and a syncing mechanism (mutex in this case).

I agree with this as well, especially since the jobs are blocked at the highest level, before any control flow traverses into the more granular layers of the implementation where service boundaries might offer concurrency advantages (disk vs. network performance characteristics).

Also, considering performance from a high level: even if we hypothetically implemented more granular concurrency, I have doubts that the advantages would be significant anyway, since the likely bottleneck would be the device's network bandwidth (in most circumstances). So uploading to multiple sites simultaneously likely won't be much (if at all) faster than uploading to each one sequentially, and it incurs this complexity cost, as well as an increased risk of multiple interrupted transfers instead of just one in the case of network failure.

That said, the current purpose of the mutex seems to be described in the code comment:

    /**
     * When the app comes to foreground both `queueUploadFromAllSites` and `queueUploadFromSite` are invoked.
     * The problem is that they can run in parallel and `uploadServiceFacade.isPostUploadingOrQueued(it)` might return
     * out-of-date result and a same post is added twice.
     */

so maybe more work needs to be done to make those queueUploadFrom{AllSites,Site} methods idempotent? I don't have enough context to know what problem this mutex solves here. Perhaps the "deduping" should be happening at a layer closer to the queue, with any needed synchronization there? @ParaskP7 👋 😄 I see that this code comment was from 43949f1 but it appears to originate a bit earlier (#10878), so I'm not sure if you have full context of the original reason for that comment. But I wonder wdyt about this? I.e. do you think it is possible this synchronization / deduping could be done in the service somehow?
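
Purely as an illustration of that "dedupe closer to the queue" idea (hypothetical class and names, not the existing upload service API), an enqueue gate that checks-and-adds atomically would let the two queueUploadFrom… entry points race safely without a mutex around the whole batch:

    import java.util.concurrent.ConcurrentHashMap

    // Hypothetical sketch: an idempotent "already queued" gate. Both queueUploadFromAllSites
    // and queueUploadFromSite would call tryMarkQueued() before enqueuing, so the same post
    // can't be added twice even when the two paths run in parallel.
    class UploadQueueGate {
        private val queuedPostIds = ConcurrentHashMap.newKeySet<Int>()

        /** Returns true if the caller should enqueue the post, false if it is already queued. */
        fun tryMarkQueued(postId: Int): Boolean = queuedPostIds.add(postId)

        /** Call when the upload finishes or fails, so the post can be queued again later. */
        fun markDone(postId: Int) {
            queuedPostIds.remove(postId)
        }
    }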

Contributor Author

@ovitrif ovitrif May 19, 2023

Thanks a lot for the input @thomashorta and @mkevins 🙇🏻 , you're awesome and really helpful 🙏🏻

so maybe more work needs to be done to make those queueUploadFrom{AllSites,Site} methods idempotent? I don't have enough context to know what problem this mutex solves here

If I can think outside the box on this one, this is also my view: that mutex should be replaced with a different solution that avoids the post duplication 👍🏻. To me, locking an async job to enforce synchronicity (basically manually locking and unlocking the mutex to force sequential batching) is more of a workaround than an actual solution, so we should not take the mutex as a given when solving the underlying issue it attempted to address. Replacing it would automatically fix the crash we are seeing in Sentry.

@ParaskP7
Contributor

ParaskP7 commented May 19, 2023

👋 all!

I feel that updating coroutines to 1.7.0 will definitely solve this crash (since the Mutex internals were completely changed), though I'm not sure what it would take to make that dependency update.

I don't think updating Coroutines to its latest 1.7.1 will take much effort. 🤞

FYI: Plus, you should now be able to do any such update yourselves, with confidence, with me guiding you when and if you feel the need. By that I mean you shouldn't feel the need to get my 👍 on any such updates going forward, especially after we merge this last major beast into trunk next Monday.

PS: I'll create a P2 post when this whole dependency updates work is done, just to make all that crystal clear for you all.

@ParaskP7 👋 😄 I see that this code comment was from 43949f1 but it appears to originate a bit earlier (#10878)

Yes @mkevins , you are right, I was just reverting d5b9948 at this point, adding back the previous Mutex solution and nothing more.

...so I'm not sure if you have full context of the original reason for that comment. But I wonder wdyt about this? I.e. do you think it is possible this synchronization / deduping could be done in the service somehow?

Actually no @mkevins , I had zero context of the original reason and was trying to ping Jirka on it to help figure this out (see here). 🤷

FYI: See also this origin parent PR, child PR and commit.

If I can think outside the box on this one, this is also my view: that mutex should be replaced with a different solution that avoids the post duplication 👍🏻. To me, locking an async job to enforce synchronicity (basically manually locking and unlocking the mutex to force sequential batching) is more of a workaround than an actual solution, so we should not take the mutex as a given when solving the underlying issue it attempted to address. Replacing it would automatically fix the crash we are seeing in Sentry.

I agree with @ovitrif here, and actually with all of you (to be honest). I too think it would now be prudent to step back and try to solve the actual problem (sequential batching), not its workaround (the mutex). 💯

@ovitrif
Contributor Author

ovitrif commented May 23, 2023

As some of us here expected, the crash still reproduces with the mutex.withLock patch we merged into 22.4:

(screenshot of the Sentry crash report)

Source in Sentry Discover

FYI: This issue occurred on a Pixel test device used by @SiobhyB while beta testing. In the meantime, we're trying to confirm that we really have a stable repro scenario and environment.

If we get to this milestone, we might be able to proceed much more "in the known" with fixing this, by manually testing before merging instead of relying on the Sentry reports to figure out whether the crash is fixed.

@ovitrif
Contributor Author

ovitrif commented May 23, 2023

If we get to this milestone, we might be able to proceed much more "in the known" with fixing this, by manually testing before merging instead of relying on the Sentry reports to figure out whether the crash is fixed.

Would've been nice but it doesn't seem to be the case 😅

@ovitrif ovitrif changed the title from "Refactor concurrent uploading of draft posts and pages" to "Refactor concurrent uploading of local changes" on May 25, 2023
@ovitrif
Contributor Author

ovitrif commented May 25, 2023

@thomashorta and @mkevins I might have been able to trigger the same exception we're seeing in Sentry, inside a failing unit test, please see f377b6e!

Thanks to having a unit test to repro this, we can also understand now that it could indeed result in data loss, as once the mutex exception is thrown, the queued uploads are cancelled.

@mkevins
Contributor

mkevins commented May 26, 2023

@thomashorta and @mkevins I might have been able to trigger the same exception we're seeing in Sentry, inside a failing unit test, please see f377b6e!

Thanks to having a unit test to repro this, we can also understand now that it could indeed result in data loss, as once the mutex exception is thrown, the queued uploads are cancelled.

Thanks for looking into this, Ovi! I really like the idea of creating a failing unit test to repro the issue, but I wonder if invoking mutex.unlock() in the mock is enough to fully validate the actual concurrency issue we hypothesize 🤔 .. my doubt is from the fact that the TestScope may not have the same behavior with regard to the underlying thread pools.

I wonder if there is a way (in the test environment) to simulate unlock invocation from cancellation in the ioDispatcher (with a realistic thread pool) ? It would be interesting to validate how the double unlocking is coming from a race in the job cancellation structure. I.e. could we get a similar failure by throwing some exception in one of the queueUploadFromSite jobs?
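
Something along these lines, perhaps (a rough JVM-only sketch with a real thread pool and hypothetical structure, not the production classes, and I haven't confirmed it reaches the double-unlock path from Sentry): one contending job throws while the others are suspended in lock(), and the failure cancels the siblings.

    import kotlinx.coroutines.Dispatchers
    import kotlinx.coroutines.coroutineScope
    import kotlinx.coroutines.delay
    import kotlinx.coroutines.launch
    import kotlinx.coroutines.runBlocking
    import kotlinx.coroutines.sync.Mutex
    import kotlinx.coroutines.sync.withLock

    // Rough sketch: several jobs contend on one Mutex on Dispatchers.IO, one of them throws,
    // and the siblings are cancelled while suspended in lock(). Whether the timing can ever
    // trigger the internal double unlock is exactly the open question.
    fun main() = runBlocking {
        val mutex = Mutex()
        val result = runCatching {
            coroutineScope {
                repeat(10) { index ->
                    launch(Dispatchers.IO) {
                        mutex.withLock {
                            if (index == 0) error("simulated upload failure")
                            delay(50) // stand-in for the per-site upload work
                        }
                    }
                }
            }
        }
        println("batch failed: ${result.isFailure}") // the remaining uploads never run
    }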

@ovitrif
Contributor Author

ovitrif commented May 26, 2023

Thanks for looking into this, Ovi! I really like the idea of creating a failing unit test to repro the issue, but I wonder if invoking mutex.unlock() in the mock is enough to fully validate the actual concurrency issue we hypothesize 🤔 .. my doubt is from the fact that the TestScope may not have the same behavior with regard to the underlying thread pools.

I wonder if there is a way (in the test environment) to simulate unlock invocation from cancellation in the ioDispatcher (with a realistic thread pool) ? It would be interesting to validate how the double unlocking is coming from a race in the job cancellation structure. I.e. could we get a similar failure by throwing some exception in one of the queueUploadFromSite jobs?

Thank you @mkevins for validating my approach 🙇🏻, this is what I was looking for! I also had my doubts about how useful it is, but nonetheless, imho, at least the fact that the remaining invocations don't happen (i.e. the expected number doesn't match the actual number in the assert) adds a bit of weight to the importance of finding a solution here 👍🏻

I will continue my investigation; going forward I will dive into debugging and following the call stack at runtime.

The first thing I'm after is to simulate the scenario where we would encounter duplicate uploads, basically to see more clearly what problem the mutex is actually solving here 👍🏻

@ovitrif ovitrif modified the milestones: 22.5, 22.6 May 26, 2023
@thomashorta
Contributor

Thank you @ovitrif for looking into it again. It would definitely help to have a failing test, but I agree with @mkevins that maybe handling the Mutex and doing extra stuff in the test/mock itself might not emulate the issue. 🙇

I wonder if there is a way (in the test environment) to simulate unlock invocation from cancellation in the ioDispatcher (with a realistic thread pool) ? It would be interesting to validate how the double unlocking is coming from a race in the job cancellation structure. I.e. could we get a similar failure by throwing some exception in one of the queueUploadFromSite jobs?

I thought the same and tried to create a class from scratch that does similar stuff in terms of blocking threads, suspension, and coroutine scopes, and to write something (as a unit test) to see if I could break it somehow. But I couldn't find a way to reproduce anything that would crash it, let alone the crash we got from Sentry.

The test code I made is available in the test/mutex-issues branch:
https://github.com/wordpress-mobile/WordPress-Android/compare/test/mutex-issues

I will continue my investigation; going forward I will dive into debugging and following the call stack at runtime.

The first thing I'm after is to simulate the scenario where we would encounter duplicate uploads, basically to see more clearly what problem the mutex is actually solving here 👍🏻

Nice! I think you're on the right path here. I feel like finding a way to consistently reproduce this specific race condition is gonna be pretty complicated (though I agree it would be great to do it), and I think it really boils down to timing; otherwise, I'm sure we would have this issue happening WAY MORE.

It might make more sense, in this case, to do our best to understand what could be happening (like the hypothesis we have), figure out what we want our code to do exactly, and improve our solution with those things in mind.

Count on me to help with this problem, if needed! ✋

@ovitrif ovitrif modified the milestones: 22.6, 22.7 Jun 12, 2023
@ovitrif
Contributor Author

ovitrif commented Jun 27, 2023

Closing this for now, since we're now on coroutines 1.7.0, and once again it proved too complex to solve in one run.

@ovitrif ovitrif closed this Jun 27, 2023
@ovitrif ovitrif deleted the issue/17463-tdd-fix branch June 27, 2023 09:57