internal/driver: fix rendering freeze in mobile #2473
Merged
Description:
This is a non-trivial fix, so I describe my debugging process below to explain why it fixes issue #950.
More importantly, this fix is similar to the one we made in #2406.
In the mobile drivers, GL draw calls are executed in the `drawloop`, which iOS invokes periodically:
fyne/internal/driver/mobile/app/darwin_ios.go
Lines 222 to 238 in eff859f
On Android, the draw loop is executed in `mainUI` whenever `workAvailable` receives an event.
fyne/internal/driver/mobile/app/android.go
Lines 472 to 473 in eff859f
The `workAvailable` channel receives events from a `paint.Event`, which calls many GLES functions. On mobile systems, GL calls are done in a `DoWork` call, which batches cgo calls for performance reasons. However, if a GL call allocates a resource, that function cannot be called asynchronously. Hence, `DoWork` synchronizes with `paintWindow()` in a `paint.Event`.
From the debugging trace, the rendering pause occurs because `paintWindow()` blocks forever and never returns. Therefore, any newer `paint.Event` is never processed.
In the minimized reproducer (see below), after days of tracing (the freeze is sometimes very difficult to reproduce when more logging is added), the logs indicate that the blocking always seems to happen in `func (p *glPainter) drawRectangle`, especially in the GL calls made inside `drawTextureWithDetails` (which calls `func (p *glPainter) getTexture` -> `func (p *glPainter) newTexture` -> ...). More importantly, a blocking GL call may be stuck on a channel at
fyne/internal/driver/mobile/gl/work.go
Lines 105 to 106 in eff859f
or
fyne/internal/driver/mobile/gl/work.go
Line 114 in eff859f
forever and never continue.
The following log trace shows how this happened:
G1 and G2 are two goroutines in the above log; G1 sends GL calls to G2, and G2 processes them. We can deduce:
=> If a send to workAvailable succeeds twice, then DoWork must return twice. So (a,b) and (d,e) are associated; (f-g) and (i-j) are associated;
=> Hence, (h) and (k) are the relevant steps that cause the rendering freeze, and (h) happens before (l).
=> The timeout at (l) occurs because workAvailable is not receiving. Thus (k) is the cause of the timeout.
=> (k)'s send to workAvailable failed, and after the last DoWork return (j), DoWork is never called again.
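The failed send in (k) can be illustrated in isolation. The following is a minimal sketch, not the driver's actual code: it assumes the notification is sent with a non-blocking `select`, and the helper name `trySignal` and the buffer size of 1 are hypothetical. It shows how a signal is silently lost once the buffer is full and nobody is receiving.

```go
package main

import "fmt"

// trySignal is a hypothetical helper that mimics a non-blocking
// "work available" notification. It reports whether the signal was
// queued; false means the buffer was full and the signal was dropped.
func trySignal(ch chan struct{}) bool {
	select {
	case ch <- struct{}{}:
		return true
	default:
		// Buffer full and no receiver ready: the event is lost.
		return false
	}
}

func main() {
	// Assumed buffer size of 1, chosen only for illustration.
	workAvailable := make(chan struct{}, 1)

	fmt.Println(trySignal(workAvailable)) // true: queued in the buffer
	fmt.Println(trySignal(workAvailable)) // false: dropped, receiver never learns of it
}
```

If the receiver only runs `DoWork` in response to such signals, a dropped signal means the queued work is never picked up, which matches the freeze described above.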
As a fix, one might think of making workAvailable's buffer a little bigger. But that would not fix the issue for good, because a workAvailable event may still be dropped whenever the channel's buffer is full. For this reason, we should never skip a `workAvailable` event, which requires the channel's internal buffer to be able to grow to an arbitrary size.
Thankfully, #2406 provided an unbounded channel facility, so the fix is simple:
Use an unbounded channel for `workAvailable`, so that no workAvailable event can be missed.
Fixes #950
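For readers unfamiliar with the idea, an unbounded signal channel can be sketched as follows. This is only an illustration under stated assumptions, not the facility added in #2406: the type `unboundedSignal` and its fields are hypothetical names, and the sketch omits shutdown handling. A helper goroutine counts pending signals, so a sender never has to drop an event even when the consumer is not receiving.

```go
package main

import "fmt"

// unboundedSignal is a hypothetical unbounded channel of empty signals.
// Senders write to in; the consumer reads from out.
type unboundedSignal struct {
	in  chan struct{}
	out chan struct{}
}

func newUnboundedSignal() *unboundedSignal {
	u := &unboundedSignal{in: make(chan struct{}), out: make(chan struct{})}
	go func() {
		pending := 0 // signals accepted but not yet delivered
		for {
			// A nil channel is never selected, so delivery is only
			// attempted while at least one signal is pending.
			var out chan struct{}
			if pending > 0 {
				out = u.out
			}
			select {
			case <-u.in:
				pending++
			case out <- struct{}{}:
				pending--
			}
		}
	}()
	return u
}

func main() {
	u := newUnboundedSignal()

	// No consumer is receiving yet, but every send is still accepted.
	for i := 0; i < 3; i++ {
		u.in <- struct{}{}
	}

	delivered := 0
	for i := 0; i < 3; i++ {
		<-u.out
		delivered++
	}
	fmt.Println("delivered:", delivered) // delivered: 3
}
```

The trade-off is that the backlog can grow without limit if the consumer stalls; since each signal carries no payload, the backlog here is just a counter.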
@andydotxyz @Jacalz
Checklist:
Reproducer
I simplified the button reproducer proposed in #950 as follows (automatic refreshing):