Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

internal/driver: fix rendering freeze in mobile #2473

Merged
merged 1 commit into from
Sep 22, 2021

Conversation

changkun
Copy link
Member

@changkun changkun commented Sep 15, 2021

Description:

This is a non-trivial fix. I have to describe my debugging process to formulate why this is a fix of issue 950.
More importantly, this is a similar fix as we have done in #2406.

In the mobile drivers, GL draw calls are executed in the drawloop, which is invoked by the iOS periodically:

//export drawloop
func drawloop() {
runtime.LockOSThread()
defer runtime.UnlockOSThread()
for workAvailable := theApp.worker.WorkAvailable(); ; {
select {
case <-workAvailable:
theApp.worker.DoWork()
case <-theApp.publish:
theApp.publishResult <- PublishResult{}
return
case <-time.After(100 * time.Millisecond): // incase the method blocked!!
return
}
}
}

On Android, the draw loop is executed in the mainUI whenever workAvailable receives an event.

case <-workAvailable:
theApp.worker.DoWork()

The workAvailable receives events from a paint.Event, which calls many GLES functions. On mobile systems, GL calls are done in a DoWork call which batches cgo calls for performance reasons. However, if a GL call allocates a resource, then that function cannot be called asynchronously. Hence, DoWork will sync the pace with the paintWindow() in a paint.Event.

From debugging trace, the rendering pause is caused by the paintWindow() pauses forever and never returns. Therefore, a newer paint.Event will never be processed anymore.

In the minimized reproducer (see below), after days of tracing (sometimes very difficult to reproduce if there are more logs), logs indicates that the blocking seems to always happen in func (p *glPainter) drawRectangle, especially the GL calls happens inside drawTextureWithDetails (where calls func (p *glPainter) getTexture -> func (p *glPainter) newTexture -> ...). More importantly, a blocking GL call may be stuck on a channel at

func (ctx *context) enqueue(c call) uintptr {
ctx.work <- c

or

return uintptr(<-ctx.retvalue)

forever and never continue.

The following log traces shows how this happened:

G1 workAvaliable success: glfnUseProgram, non-block       (a)
G1 workAvaliable success: glfnBlendFunc, non-block        (b)
G1 workAvaliable failed: glfnActiveTexture, non-block     (c)
G2 DoWork returned                                        (d)
G2 DoWork returned                                        (e)
G1 workAvaliable success: glfnBindTexture, non-block      (f)
G1 workAvaliable success: glfnDrawArrays, non-block       (g)
G2 drawloop called                                        (h)
G2 DoWork returned                                        (i)
G2 DoWork returned                                        (j)
G1 workAvaliable failed: glfnGenTexture, block            (k)
G2 drawloop timeout, return                               (l)
G2 drawloop timeout, return
G2 drawloop timeout, return
G2 drawloop timeout, return
G2 drawloop timeout, return
G2 drawloop timeout, return
...
DoWork is never called again.

G1 and G2 are two goroutines in the above log, and G1 sends GL calls to G2, and G2 processes it. We can have:

=> If workAvailable success twice, then DoWork must return twice. So (a,b) and (d,e) are associated; (f-g) and (i-j) are associated;
=> Hence, (h) and (k) are the relevant process to cause the rendering freeze, and (h) is before (l).
=> The timeout of (l) is because workAvaliable is not receiving. Thus (k) is the cause of the timeout.
=> (k)'s send to workAvaliable failed, and last DoWork return (j) leads to DoWork is never called again.

For a fix, we may quickly think about making the workAvailable's buffer a little bit bigger. But it won't fix the issue forever, this is because the workAvailable may still be ignored if the channel's buffer is full. For this reason, we should never skip a workAvailable event, which makes the channel's internal buffer to become an arbitrary size.

Thankfully, in #2406, we have provided the unbounded channel facility, and therefore the fix would be simple:

Use an unbounded channel for the workAvailable, so that no more workAvailable event can be missed.

Fixes #950

@andydotxyz @Jacalz

Checklist:

  • Tests included.
  • Lint and formatter run with no errors.
  • Tests all pass.

Reproducer

I simplified the button reproducer proposed in #950 as follows (automatic refreshing):

package main

import (
	"image/color"
	"log"
	"time"

	"fyne.io/fyne/v2"
	"fyne.io/fyne/v2/app"
	"fyne.io/fyne/v2/canvas"
	"fyne.io/fyne/v2/container"
	"fyne.io/fyne/v2/layout"
)

func main() {
	win := app.New().NewWindow("Some Window")
	w, h := 90, 25
	cells := make([]fyne.CanvasObject, 0, w*h)
	for i := 0; i < w*h; i++ {
		cells = append(cells, canvas.NewRectangle(color.RGBA{R: 255, G: 255, B: 255, A: 128}))
	}
	container := container.New(layout.NewGridLayout(h), cells...)
	colors := [3]color.RGBA{
		{R: 20, G: 30, B: 100, A: 128},
		{R: 100, G: 20, B: 30, A: 128},
		{R: 30, G: 100, B: 20, A: 128},
	}

	go func() {
		c := 0
		tk := time.NewTicker(100 * time.Millisecond)
		for range tk.C {
			for i := 0; i < w; i++ {
				for j := 0; j < h; j++ {
					obj := container.Objects[i*h+j].(*canvas.Rectangle)
					obj.FillColor = colors[c]
					obj.Refresh()
				}
			}
			log.Println("still refreshing")
			c = (c + 1) % 3
		}
	}()
	win.SetContent(container)
	win.ShowAndRun()
}

@changkun changkun changed the title internal/driver: fix dead lock in mobile render loop internal/driver: fix rendering freeze in mobile Sep 17, 2021
@changkun
Copy link
Member Author

@andydotxyz ping

@andydotxyz
Copy link
Member

The code looks good, and it does stop the hang.
However I do get new error messages, are they related and/or a problem?

default	11:47:26.659791+0100	main	Hang detected: 0.35s (always-on hang reporting)

After about 20 of these (which appear a few seconds apart), the demo exits hard.

@changkun
Copy link
Member Author

After about 20 of these (which appear a few seconds apart), the demo exits hard.

I haven't encountered such a problem yet, could you describe it in more detail? such as what does it mean by "exits hard"? Is it run out of memory? or something else? "exists hard" triggers no panics?

@andydotxyz
Copy link
Member

Is it run out of memory? or something else? "exists hard" triggers no panics?

Apologies, since Go 1.16 I don't get panic on iOS logs in the Console - I will downgrade and see what I can gather.

@andydotxyz
Copy link
Member

It seems to be not crashing but being closed. due to:

Assertion did invalidate due to timeout: 48552-48552-46637 (target:[application<io.fyne.demo>:52123])

It looks like the app was just not responding due to high CPU.
Maybe I cranked up the demo hitting it's a ability to process the refreshes...

@andydotxyz
Copy link
Member

Yeah, concluding that the issue is separate to the one this solves. If I dial it back to fewer refreshes the messages may appear but the app doesn't crash - I must have pushed the CPU too hard so the heartbeat fails to respond.

@andydotxyz andydotxyz merged commit d6848f2 into fyne-io:develop Sep 22, 2021
@andydotxyz
Copy link
Member

Thanks, very excited indeed to have this fixed!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants