Refactor concurrent uploading of local changes #18449

Closed · wants to merge 31 commits

Conversation

ovitrif
Contributor

@ovitrif ovitrif commented May 17, 2023

Resolves #17463

This PR adapts the logic for uploading local changes to the cloud and introduces unit tests for the batch job that uploads all local sites' posts & pages.

We've noticed in Sentry that the crash still occurs. We attempted to fix it via the mutex.withLock { … } workaround merged via PR #18067 into 22.4, but unfortunately it didn't eradicate the crash, as seen in this Sentry report for 22.4-rc-1.

Note to reviewers
I would love your input on steps 2 & 3 below. Please don't merge; I will handle the merge after collecting more input.

To Test

  1. Verify local draft posts and pages are uploaded
  2. Thoroughly review the code changes and this unit test
  3. Brainstorm what other tests / ideas I could try
  4. Verify CI unit tests job succeeded

Regression Notes

  1. Potential unintended areas of impact
    Offline posts and pages upload

  2. What I did to test those areas of impact (or what existing automated tests I relied on)
    Manual testing and unit testing

  3. What automated tests I added (or what prevented me from doing so)
    Unit tests for the updated code

PR submission checklist:

  • I have completed the Regression Notes.
  • I have considered adding accessibility improvements for my changes.
  • I have considered if this change warrants user-facing release notes and have added them to RELEASE-NOTES.txt if necessary.
  • UI Changes testing checklist: n/a

@ovitrif ovitrif changed the title from "Issue/17463 tdd fix" to "Refactor concurrent uploading of draft posts and pages" on May 17, 2023
@ovitrif ovitrif self-assigned this May 17, 2023
@ovitrif ovitrif added this to the 22.5 milestone May 17, 2023
@wpmobilebot
Contributor

wpmobilebot commented May 17, 2023

📲 You can test the changes from this Pull Request in Jetpack by scanning the QR code below to install the corresponding build.

App Name: Jetpack
Flavor: Jalapeno
Build Type: Debug
Version: pr18449-c3ab67a
Commit: c3ab67a
Direct Download: jetpack-prototype-build-pr18449-c3ab67a.apk
Note: Google Login is not supported on these builds.

@wpmobilebot
Contributor

wpmobilebot commented May 17, 2023

📲 You can test the changes from this Pull Request in WordPress by scanning the QR code below to install the corresponding build.

App Name: WordPress
Flavor: Jalapeno
Build Type: Debug
Version: pr18449-c3ab67a
Commit: c3ab67a
Direct Download: wordpress-prototype-build-pr18449-c3ab67a.apk
Note: Google Login is not supported on these builds.

@ovitrif ovitrif marked this pull request as ready for review May 17, 2023 18:23
@ovitrif ovitrif requested review from zwarm, a team and thomashorta May 17, 2023 18:23
Contributor

@thomashorta thomashorta left a comment

I was looking inside the Mutex code for the version we use (1.6.4) and the crash we see in Sentry is in a very specific situation that I wasn't able to reproduce at all.

I see you removed an unlock call from the code, which is definitely good. But the code inside Mutex makes me think something could go wrong even when just using withLock: it unlocks inside a finally, yet there is other code that also unlocks in the very specific situation where the continuation has already been cancelled right after trying to resume the LockWaiter.

Anyway, I think that Mutex implementation with LockWaiter is pretty confusing to follow, and I see it was changed in version 1.7.0 (Kotlin/kotlinx.coroutines#3020), so I wonder if that version would have this issue at all... 🤔
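
For reference, here is roughly what Mutex.withLock expands to in the 1.6.x sources (a simplified sketch from memory, not the project's code), which is why the unlock-in-finally alone may not be enough if the internal cancellation path also unlocks:

    import kotlinx.coroutines.sync.Mutex

    // Approximation of kotlinx.coroutines' Mutex.withLock: the unlock sits in a finally,
    // so it runs even when the action throws or the calling coroutine is cancelled.
    // The concern above is about a second, internal unlock path racing with this one.
    suspend inline fun <T> Mutex.withLockSketch(owner: Any? = null, action: () -> T): T {
        lock(owner) // may suspend; per its docs it can also release the lock itself on cancellation
        try {
            return action()
        } finally {
            unlock(owner)
        }
    }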

}
} catch (e: Exception) {
Contributor

I'm not quite sure this will prevent the other coroutines (for other sites) from being cancelled, since this try/catch is around the coroutine launches (which return instantly) and not around the actual code that throws the exception (the call to update). Also, this function creates a new coroutineScope which does not use a SupervisorJob (line 114), so if any job fails the cancellation will be propagated to its sibling coroutine jobs.

Also, even though we are talking about coroutine cancellation, this doesn't mean the code will actually stop running immediately, because coroutine cancellation is cooperative.

To be honest, I couldn't quite understand what exactly this class is trying to accomplish in terms of job execution order, parallel work, and structured concurrency, so it is a bit hard to understand if the proposed fixes here and previous PR would work as expected.
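
To illustrate that point, here is a minimal sketch of the pattern (hypothetical names, not the actual class): the try/catch wraps the launch calls, which return immediately, so it never sees the exception thrown inside the coroutines.

    import kotlinx.coroutines.coroutineScope
    import kotlinx.coroutines.launch

    // Hypothetical reduction of the pattern being discussed.
    suspend fun uploadAllSites(sites: List<String>, uploadSite: suspend (String) -> Unit) =
        coroutineScope {
            try {
                sites.forEach { site ->
                    launch {
                        // If this throws, the exception is NOT caught by the outer try/catch.
                        // It fails the parent scope instead, and because coroutineScope has no
                        // SupervisorJob, the sibling launches get cancelled too.
                        uploadSite(site)
                    }
                }
            } catch (e: Exception) {
                // Only failures that happen while starting the coroutines end up here.
            }
        }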

Contributor

This is a good point, and I think this actually might shed light on the original issue. I see that this same possibility exists in an earlier version of this implementation, and I believe that the underlying "double-unlock" issue could be a result of this.

One theory / possibility (after reading the backscroll, and this comment) is that the forEach is spawning multiple jobs on the ioDispatcher (with a thread pool), most of them blocking on the lock() invocation. Then, if any of the child jobs throws, the parent job cancels, and all siblings too, and then things get a little bit tricky, imo. Since lock() docs say:

This function releases the lock if it was already acquired by this function before the CancellationException was thrown.

and

Note that this function does not check for cancellation when it is not suspended.

So, although each site's uploads should be blocked and served in a FIFO manner, I wonder if the cancellation bubbling up to the parent job and then back down allows another child job to acquire the lock before it is cancelled as a sibling 🤔

Though there is a "prompt cancellation guarantee", does this guarantee that it cancels immediately when a sibling cancels? (Since those children can be on different threads, and since we catch and re-throw, it makes me wonder.)

I have not tested this hypothesis, so please take it with a grain of salt. I just wanted to suggest it as I've skimmed the back scroll, and your comment seems that it might shed some light on the issue.

Contributor

Yeah, I was actually trying to reproduce it with some custom code (though I was running it as a unit test on the JVM and not on an actual device, so I believe the thread implementation might be different).

Because of the mutex, no upload jobs will be running at the same time, though their coroutines would've been started and they're all suspended at the lock function.

I wonder if the cancellation bubbling up to the parent job and then back down allows another child job to acquire the lock before it is cancelled as a sibling 🤔

In my tests, I see this definitely happens, though it doesn't seem to be the cause of the crash, because in that scenario the unlock call throwing an exception would be either this or this, but not the one causing the Sentry crash, which is this one.

Since the crash does happen, I am guessing it is also possible that timing can make that call happen right when one of the jobs has released the lock AND while the cancellation is reaching the internal continuation in suspendCancellableCoroutineReusable from the Mutex.

This is a tricky crash, and I still think it could happen even using just withLock and not calling unlock on our side, though that extra unlock on our side definitely doesn't help, so maybe @ovitrif's changes here help solve the issue.

I feel that updating coroutines to 1.7.0 will definitely solve this crash (since the Mutex internals were completely changed), though I'm not sure what it would take to make that dependency update.

Contributor

Also, just adding more information about this part here:

To be honest, I couldn't quite understand what exactly this class is trying to accomplish in terms of job execution order, parallel work, and structured concurrency, so it is a bit hard to understand if the proposed fixes here and previous PR would work as expected.

What I mean is: I don't understand why we are starting multiple parallel jobs but suspending them all on the same mutex. In this scenario it feels to me the mutex (and concurrency) is not needed since in practice we are just running each job sequentially.

It looks like all those jobs could be called from 1 coroutine, with some exception handling, instead of relying on concurrent jobs and a syncing mechanism (mutex in this case).
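
A minimal sketch of that sequential shape, with hypothetical names and parameters (per-site error handling instead of a shared mutex):

    import kotlinx.coroutines.CancellationException
    import kotlinx.coroutines.CoroutineDispatcher
    import kotlinx.coroutines.withContext

    // Sketch only: iterate the sites in a single coroutine, so no Mutex is needed to
    // serialize the work. `uploadSite` is an illustrative stand-in for the real upload call.
    suspend fun uploadAllSitesSequentially(
        sites: List<String>,
        ioDispatcher: CoroutineDispatcher,
        uploadSite: suspend (String) -> Unit
    ) = withContext(ioDispatcher) {
        sites.forEach { site ->
            try {
                uploadSite(site) // finishes before the next site starts
            } catch (ce: CancellationException) {
                throw ce // keep cooperative cancellation intact
            } catch (e: Exception) {
                // A failure for one site no longer cancels the uploads for the other sites.
            }
        }
    }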

Contributor

Since the crash does happen, I am guessing it is also possible that timing can make that call happen right when one of the jobs has released the lock AND while the cancellation is reaching the internal continuation in suspendCancellableCoroutineReusable from the Mutex.

Yeah, this is exactly what I was thinking might be happening (and thanks for linking to the various sources where unlock is called). It is possible that the test dispatcher exhibits different behavior from the ioDispatcher actually used in production. If it is indeed a timing issue / race condition of some sort, this might explain why it is hard to reproduce in the test environment.

This is a tricky crash, and I still think it could happen even using just withLock

I also agree, since, as @ovitrif pointed out, it is basically doing the same thing we were doing (i.e. calling unlock in a finally).

In this scenario it feels to me the mutex (and concurrency) is not needed since in practice we are just running each job sequentially.

It looks like all those jobs could be called from 1 coroutine, with some exception handling, instead of relying on concurrent jobs and a syncing mechanism (mutex in this case).

I agree with this as well, especially since the jobs are blocked at the highest level, before any control flow traverses into the more granular layers of the implementation where service boundaries might offer concurrency advantages (disk vs. network performance characteristics).

Also, considering performance from a high level: even if we hypothetically implemented more granular concurrency, I have doubts that the advantages would be significant anyway, since the likely bottleneck would be the device's network bandwidth (in most circumstances). So uploading to multiple sites simultaneously likely won't be much (if at all) faster than uploading to each one sequentially, and it incurs this complexity cost, as well as an increased risk of multiple interrupted transfers instead of just one in the case of network failure.

That said, the current purpose of the mutex seems to be described in the code comment:

    /**
     * When the app comes to foreground both `queueUploadFromAllSites` and `queueUploadFromSite` are invoked.
     * The problem is that they can run in parallel and `uploadServiceFacade.isPostUploadingOrQueued(it)` might return
     * out-of-date result and a same post is added twice.
     */

so maybe more work needs to be done to make those queueUploadFrom{AllSites,Site} methods idempotent? I don't have enough context to know what problem this mutex solves here. Perhaps the "deduping" should be happening at a layer closer to the queue, with any needed synchronization there? @ParaskP7 👋 😄 I see that this code comment was from 43949f1 but it appears to originate a bit earlier (#10878), so I'm not sure if you have full context of the original reason for that comment. But I wonder wdyt about this? I.e. do you think it is possible this synchronization / deduping could be done in the service somehow?
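
Purely as an illustration of that "dedupe closer to the queue" idea (hypothetical class and names, not the existing upload service API), an enqueue gate that checks-and-adds atomically would let the two queueUploadFrom… entry points race safely without a mutex around the whole batch:

    import java.util.concurrent.ConcurrentHashMap

    // Hypothetical sketch: an idempotent "already queued" gate. Both queueUploadFromAllSites
    // and queueUploadFromSite would call tryMarkQueued() before enqueuing, so the same post
    // can't be added twice even when the two paths run in parallel.
    class UploadQueueGate {
        private val queuedPostIds = ConcurrentHashMap.newKeySet<Int>()

        /** Returns true if the caller should enqueue the post, false if it is already queued. */
        fun tryMarkQueued(postId: Int): Boolean = queuedPostIds.add(postId)

        /** Call when the upload finishes or fails, so the post can be queued again later. */
        fun markDone(postId: Int) {
            queuedPostIds.remove(postId)
        }
    }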

Contributor Author

@ovitrif ovitrif May 19, 2023

Thanks a lot for the input @thomashorta and @mkevins 🙇🏻 , you're awesome and really helpful 🙏🏻

so maybe more work needs to be done to make those queueUploadFrom{AllSites,Site} methods idempotent? I don't have enough context to know what problem this mutex solves here

If I can think outside the box on this one, this is also my view: that mutex should be replaced with a different solution that avoids the post duplication 👍🏻. To me, locking an async job to enforce synchronicity (basically manually locking and unlocking the mutex to force sequential batching) is more of a workaround than an actual solution, so we should not take the mutex as a given when solving the underlying issue it attempted to address. Replacing it would automatically fix the crash we are seeing in Sentry.

@ParaskP7
Contributor

ParaskP7 commented May 19, 2023

👋 all!

I feel that updating coroutines to 1.7.0 will definitely solve this crash (since the Mutex internals were completely changed), though I'm not sure what it would take to make that dependency update.

I don't think updating Coroutines to its latest 1.7.1 will take much effort. 🤞

FYI: Plus, you should now be able to do any such update yourselves, with confidence, with me guiding you when and if you feel the need. By that I mean you shouldn't feel the need to get my 👍 on any such updates going forward, especially after we merge this last major beast into trunk next Monday.

PS: I'll create a P2 post when this whole dependency updates work is done, just to make all that crystal clear for you all.

@ParaskP7 👋 😄 I see that this code comment was from 43949f1 but it appears to originate a bit earlier (#10878)

Yes @mkevins , you are right, I was just reverting d5b9948 at this point, adding back the previous Mutex solution and nothing more.

...so I'm not sure if you have full context of the original reason for that comment. But I wonder wdyt about this? I.e. do you think it is possible this synchronization / deduping could be done in the service somehow?

Actually no @mkevins , I had zero context of the original reason and was trying to ping Jirka on it to help figure this out (see here). 🤷

FYI: See also this origin parent PR, child PR and commit.

If I can think outside the box on this one, this is also my view: that mutex should be replaced with a different solution that avoids the post duplication 👍🏻. To me, locking an async job to enforce synchronicity (basically manually locking and unlocking the mutex to force sequential batching) is more of a workaround than an actual solution, so we should not take the mutex as a given when solving the underlying issue it attempted to address. Replacing it would automatically fix the crash we are seeing in Sentry.

I agree with @ovitrif here, and actually with all of you (to be honest). I too think it would now be prudent to step back and try to solve the actual problem (sequential batching), not its workaround (the mutex). 💯

@ovitrif
Contributor Author

ovitrif commented May 23, 2023

As some of us here expected, the crash still reproduces with the mutex.withLock patch we merged into 22.4:

(screenshot of the Sentry crash report)

Source in Sentry Discover

FYI: This issue occurred on a Pixel test device used by @SiobhyB while beta testing. In the meantime, we're trying to confirm that we really have a stable repro scenario and environment.

If we get to this milestone, we might be able to proceed much more "in the known" with fixing this, by manually testing before merging instead of relying on the Sentry reports to figure out whether the crash is fixed.

@ovitrif
Contributor Author

ovitrif commented May 23, 2023

If we get to this milestone, we might be able to proceed much more "in the known" with fixing this, by manually testing before merging instead of relying on the Sentry reports to figure out whether the crash is fixed.

Would've been nice but it doesn't seem to be the case 😅

@ovitrif ovitrif changed the title from "Refactor concurrent uploading of draft posts and pages" to "Refactor concurrent uploading of local changes" on May 25, 2023
@ovitrif
Contributor Author

ovitrif commented May 25, 2023

@thomashorta and @mkevins I might have been able to trigger the same exception we're seeing in Sentry, inside a failing unit test, please see f377b6e!

Thanks to having a unit test to repro this, we can also understand now that it could indeed result in data loss, as once the mutex exception is thrown, the queued uploads are cancelled.

@mkevins
Contributor

mkevins commented May 26, 2023

@thomashorta and @mkevins I might have been able to trigger the same exception we're seeing in Sentry, inside a failing unit test, please see f377b6e!

Thanks to having a unit test to repro this, we can also understand now that it could indeed result in data loss, as once the mutex exception is thrown, the queued uploads are cancelled.

Thanks for looking into this, Ovi! I really like the idea of creating a failing unit test to repro the issue, but I wonder if invoking mutex.unlock() in the mock is enough to fully validate the actual concurrency issue we hypothesize 🤔 .. my doubt is from the fact that the TestScope may not have the same behavior with regard to the underlying thread pools.

I wonder if there is a way (in the test environment) to simulate unlock invocation from cancellation in the ioDispatcher (with a realistic thread pool) ? It would be interesting to validate how the double unlocking is coming from a race in the job cancellation structure. I.e. could we get a similar failure by throwing some exception in one of the queueUploadFromSite jobs?
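
Something along these lines, perhaps (a rough JVM-only sketch with a real thread pool and hypothetical structure, not the production classes, and I haven't confirmed it reaches the double-unlock path from Sentry): one contending job throws while the others are suspended in lock(), and the failure cancels the siblings.

    import kotlinx.coroutines.Dispatchers
    import kotlinx.coroutines.coroutineScope
    import kotlinx.coroutines.delay
    import kotlinx.coroutines.launch
    import kotlinx.coroutines.runBlocking
    import kotlinx.coroutines.sync.Mutex
    import kotlinx.coroutines.sync.withLock

    // Rough sketch: several jobs contend on one Mutex on Dispatchers.IO, one of them throws,
    // and the siblings are cancelled while suspended in lock(). Whether the timing can ever
    // trigger the internal double unlock is exactly the open question.
    fun main() = runBlocking {
        val mutex = Mutex()
        val result = runCatching {
            coroutineScope {
                repeat(10) { index ->
                    launch(Dispatchers.IO) {
                        mutex.withLock {
                            if (index == 0) error("simulated upload failure")
                            delay(50) // stand-in for the per-site upload work
                        }
                    }
                }
            }
        }
        println("batch failed: ${result.isFailure}") // the remaining uploads never run
    }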

@ovitrif
Contributor Author

ovitrif commented May 26, 2023

Thanks for looking into this, Ovi! I really like the idea of creating a failing unit test to repro the issue, but I wonder if invoking mutex.unlock() in the mock is enough to fully validate the actual concurrency issue we hypothesize 🤔 .. my doubt is from the fact that the TestScope may not have the same behavior with regard to the underlying thread pools.

I wonder if there is a way (in the test environment) to simulate unlock invocation from cancellation in the ioDispatcher (with a realistic thread pool) ? It would be interesting to validate how the double unlocking is coming from a race in the job cancellation structure. I.e. could we get a similar failure by throwing some exception in one of the queueUploadFromSite jobs?

Thank you @mkevins for validating my approach 🙇🏻, this is what I was looking for! I also had my doubts about how useful it is, but nonetheless, imho, at least the fact that the remaining invocations don't happen (i.e. the expected number doesn't match the actual number in the assert) adds a bit of weight to the importance of finding a solution here 👍🏻

I will continue my investigation; going forward I will dive into debugging and following the call stack at runtime.

The first thing I'm after is to simulate the scenario where we would encounter duplicate uploads, basically to see more clearly what problem the mutex is actually solving here 👍🏻

@ovitrif ovitrif modified the milestones: 22.5, 22.6 May 26, 2023
@thomashorta
Contributor

Thank you @ovitrif for looking into it again. It would definitely help to have a failing test, but I agree with @mkevins that maybe handling the Mutex and doing extra stuff in the test/mock itself might not emulate the issue. 🙇

I wonder if there is a way (in the test environment) to simulate unlock invocation from cancellation in the ioDispatcher (with a realistic thread pool) ? It would be interesting to validate how the double unlocking is coming from a race in the job cancellation structure. I.e. could we get a similar failure by throwing some exception in one of the queueUploadFromSite jobs?

I thought the same and tried to create a class from scratch that does similar stuff in terms of blocking threads, suspension, and coroutine scopes, and to write something (as a unit test) to see if I could break it somehow. But I couldn't find a way to reproduce anything that would crash it, let alone the crash we got from Sentry.

The test code I made is available in the test/mutex-issues branch:
https://github.com/wordpress-mobile/WordPress-Android/compare/test/mutex-issues

I will continue my investigation; going forward I will dive into debugging and following the call stack at runtime.

The first thing I'm after is to simulate the scenario where we would encounter duplicate uploads, basically to see more clearly what problem the mutex is actually solving here 👍🏻

Nice! I think you're on the right path here. I feel like finding a way to consistently reproduce this specific race condition is gonna be pretty complicated (though I agree it would be great to do it), and I think it really boils down to timing; otherwise, I'm sure we would have this issue happening WAY MORE.

It might make more sense, in this case, to do our best to understand what could be happening (like the hypothesis we have), figure out what we want our code to do exactly, and improve our solution with those things in mind.

Count on me to help with this problem, if needed! ✋

@ovitrif ovitrif modified the milestones: 22.6, 22.7 Jun 12, 2023
@ovitrif
Contributor Author

ovitrif commented Jun 27, 2023

Closing this for now, since we're now on coroutines 1.7.0, and once again it proved too complex to solve in one run.

@ovitrif ovitrif closed this Jun 27, 2023
@ovitrif ovitrif deleted the issue/17463-tdd-fix branch June 27, 2023 09:57