[Merged by Bors] - Ignore `Timeout` errors on Linux AMD & Intel #5957

nicopap · 2022-09-12T12:13:37Z

Objective

Solution

When running on a Linux machine with some AMD or Intel device, when calling
surface.get_current_texture(), ignore wgpu::SurfaceError::Timeout errors.

Alternative

An alternative solution found in the wgpu examples is:

let frame = surface
    .get_current_texture()
    .or_else(|_| {
        render_device.configure_surface(surface, &swap_chain_descriptor);
        surface.get_current_texture()
    })
    .expect("Error reconfiguring surface");
window.swap_chain_texture = Some(TextureView::from(frame));

See: https://github.com/gfx-rs/wgpu/blob/94ce76391b560a66e36df1300bd684321e57511a/wgpu/examples/framework.rs#L362-L370

Veloren handles the Timeout error the way this PR proposes to handle it.

The reason I went with this PR's solution is that configure_surface seems to be quite an expensive operation, and it would run every frame with the wgpu framework solution, despite the fact it works perfectly fine without configure_surface.

I know this looks super hacky with the linux-specific line and the AMD check, but my understanding is that the Timeout occurrence is specific to a quirk of some AMD drivers on linux, and if otherwise met should be considered a bug.

crates/bevy_render/src/view/window.rs

nicopap · 2022-09-12T12:23:06Z

@mdickopp @tirithen @alexpyattaev @bobhenkel @chiboreache @paullouisageneau @popojan @etam You seem to have hit this bug, could a benevolent soul try out this patch?

heavyrain266 · 2022-09-12T14:09:35Z

Bevy seems to no longer crash on Wayland for me, finally can disable XWayland in orichalcum.

Ixentus · 2022-09-13T08:41:36Z

Tried the method from 4579, but didn't get any errors. Is there a reliable minimal way to reproduce? Running X11 on AMD XT6700 with 0.9.0-dev.

popojan · 2022-09-13T13:18:14Z

I am sorry, in the meantime I switched to Wayland/sway and I cannot easily reproduce the problem anymore.

When I tried to use nicopap's revision as dependency and recompile I got an error stating cannot provide explicit generic arguments when impl Trait is used in argument position related to ::<SystemStage> usage in mod.rs, but it may well be my fault, I am new to Rust.

mdickopp · 2022-09-13T16:43:19Z

This fixes both #3380 and #4579 for me. Thanks!

(Since I do not use Wayland, I cannot test #3606).

mdickopp · 2022-09-13T17:21:07Z

Please note that I can reproduce both #3380 and #4579 on an Intel device, so I do not think they are specific to AMD devices.

2022-09-13T17:14:48.220831Z  INFO winit::platform_impl::platform::x11::window: Guessed window scale factor: 1.75    
2022-09-13T17:14:48.260219Z  INFO bevy_render::renderer: AdapterInfo { name: "Intel(R) HD Graphics 5500 (BDW GT2)", vendor: 32902, device: 5654, device_type: IntegratedGpu, backend: Vulkan }

nicopap · 2022-09-24T12:58:48Z

Draft until workaround expanded to intel devices. I also happen to have a intel GPU handy, so I might be able to test as well.

nicopap · 2022-09-26T11:28:30Z

@mdickopp

Please note that I can reproduce both #3380 and #4579 on an Intel device

Hmm, can't reproduce on my Whiskey Lake intel iGPU. Looks like you are using a Broadwell, which is very common. I've a Broadwell CPU somewhere, but, at the moment I can't test on it, as the motherboard is pretty much in a cardboard box without peripherals.

ksf · 2022-09-27T19:22:22Z

I wonder if it wouldn't be better to ignore timeouts for all cards and drivers, but only for a specific time. That is, ignore intermittent ones:

That way we can still catch degraded application/driver state because when the driver is returning timeouts for a whole, say, second or two something very much looks amiss, at the same time delaying panic on non-problematic configurations seems benign. We can even make the timeout configurable in case some valiant gamer tries to run things on a potato or something.

Bonus: A message like "Graphics driver returned timeouts for X seconds" points end-users squarely at the issue.

alice-i-cecile · 2022-09-27T20:44:01Z

I wonder if it wouldn't be better to ignore timeouts for all cards and drivers, but only for a specific time. That is, ignore intermittent ones:

I'm interested in this solution; the less hardcoded special-casing the better.

nicopap · 2022-09-28T17:58:16Z

@ksf For me personally, the timeout happens every frame, despite the frame clearly drawing in less than the actual timeout, so your proposed solution wouldn't work (frame draws in well below 16ms, timeout is a full second)

Fifo present mode occasionally returns VK_TIMEOUT without blocking in wgpu-rs examples on certain systems gfx-rs/wgpu#1218 Says it happens "every few seconds".
When running the examples on Vulkan on an AMD 6900 XT, surface.get_current_texture() times out the first time each frame gfx-rs/wgpu#2941 logs seem to imply it happens every frame.

It might be possible to not special-case it, which is, from what I understand, how Veloren does it. I guess I was worried that I would break other assumptions. As far as I know, it shouldn't break anything, but "if it works on Linux it works on Windows" is not a sentence I've heard many times… So I kept conservative to exactly what I changed. I'd be happy to remove the #[cfg(target_os = "linux")] if someone else can confirm it works. But I'm worried another contributor will chime in soon and say "if this bug is limited to Linux, why not add #[cfg(target_os = "linux")]?" And then I'll have to write another lengthy reply justifying my decisions.

Anyway, maybe the fact we log on debug! is misleading and is what led you to believe it was intermittent? Should we entirely skip logging or log on trace! level?

Remember please that this workaround fixes a bug that prevents people from using bevy at all, so getting it in at all should be a priority, getting fancy with it can wait IMO (maybe open an issue once this is merged?)

alice-i-cecile

Remember please that this workaround fixes a bug that prevents people from using bevy at all, so getting it in at all should be a priority, getting fancy with it can wait IMO (maybe open an issue once this is merged?)

Yep, you're right there. I'm on board with this fix as a solution, even if it is temporary.

ksf · 2022-09-29T05:31:07Z

@nicopap In my case the timeout happens during configuring events, things like resizing, that's why I assumed it was an intermittent issue. But on hindsight, if the user is drag-resizing the window for a second and every request times out and thus the "last successful" time can't get reset my solution would still panic. I'm not sure whether all requests time out in my case, would have to investigate (currently, at the state of development of my code (early) I said "meh" and hard-coded the present mode to AutoNoVSync).

I definitely agree that your solution is better than just crashing, and even when things get more well-behaved upstream we'll have to support older drivers (though at some point I'd say it becomes sensible to tell people to upgrade or disable VSync/eat the performance hit)

mdickopp · 2022-09-29T16:09:23Z

@nicopap

Hmm, can't reproduce on my Whiskey Lake intel iGPU. Looks like you are using a Broadwell, which is very common. I've a Broadwell CPU somewhere, but, at the moment I can't test on it, as the motherboard is pretty much in a cardboard box without peripherals.

I re-tested your latest commit (a14a344) on my Intel system, and can confirm that it fixes the bugs for me.

superdump · 2022-09-29T17:07:11Z

This looks controversial. I hear the reasoning. I guess ideally it would be fixed in mesa but even then it will take time for new drivers to roll out. Still, I’d like to understand it before ignoring the error.

nicopap · 2022-09-30T11:14:01Z

100% Agree. I really dislike the fact I basically don't understand what I'm doing here.

mdickopp · 2022-09-30T13:29:32Z

After some digging I found out that the Vulkan backend of wgpu calls vkAcquireNextImageKHR with a timeout of 1 second.

A comment in (an older version of) the source code of Chromium hints at a bug in X11 and how they worked around it: https://chromium.googlesource.com/chromium/src/+/8ec9935d64c1fcc72d09c2d44ac1dfc0a29514f3/gpu/vulkan/x/vulkan_surface_x11.cc#62
For one thing, they use a 2 seconds timeout. I'll do some more testing.

meisme-dev · 2022-10-24T01:33:44Z

I can reproduce this on an NVIDIA GPU (Linux)

nicopap · 2022-10-24T09:12:04Z

@meisme-dev Can you give more precise system specs, notably kernel version, distro, card model etc? The fact that it doesn't work with both Fifo and immediate mode tells me it might be an unrelated issue.

See the issue template for how to get the specs https://github.com/bevyengine/bevy/blob/main/.github/ISSUE_TEMPLATE/bug_report.md

meisme-dev · 2022-10-24T14:02:04Z

@meisme-dev Can you give more precise system specs, notably kernel version, distro, card model etc? The fact that it doesn't work with both Fifo and immediate mode tells me it might be an unrelated issue.

See the issue template for how to get the specs https://github.com/bevyengine/bevy/blob/main/.github/ISSUE_TEMPLATE/bug_report.md

Kernel: 5.15.74
Distro: NixOS Unstable (Raccoon)
Card model: RTX 2070 Super
Driver version: 520.56.06
wgpu-info: { name: "NVIDIA GeForce RTX 2070 SUPER", vendor: 4318, device: 7812, device_type: DiscreteGpu, driver: "NVIDIA", driver_info: "520.56.06", backend: Vulkan }

cart · 2022-10-30T19:39:16Z

On the one hand: I think its worth hacking around this quirk if we can, in the interest of getting Bevy running on more computers. Panicking at startup (or intermittently) is a high priority bug fix. We have multiple people saying that this works, and Veloren successfully using it is a reasonable indicator that it works. On the other hand, it feels important to understand what is happening here. Theres a chance that doing the wrong thing here will introduce ghosts in the system, hard-to-debug issues, unnecessary screen "flashing" as timeouts occur, etc.

cart · 2022-10-30T19:41:28Z

I'm going to re-add this to the 0.9 milestone, just so we can make a final call on this if the conversation progresses.

mdickopp · 2022-10-31T11:35:35Z

To gain a better understanding of the situation, I could investigate further how other projects deal with the timeout in vkAcquireNextImageKHR. However, I'll be pretty busy in November, so I'll not be able to do this before December.

If you close this issue for 0.9 and open a followup issue, kindly mention me in the new one, so that I don't forget it.

Alternative: An alternative solution found in the `wgpu` examples is: ```rust let frame = surface .get_current_texture() .or_else(|_| { render_device.configure_surface(surface, &swap_chain_descriptor); surface.get_current_texture() }) .expect("Error reconfiguring surface"); window.swap_chain_texture = Some(TextureView::from(frame)); ``` See: <https://github.com/gfx-rs/wgpu/blob/94ce76391b560a66e36df1300bd684321e57511a/wgpu/examples/framework.rs#L362-L370> The reason I went with this PR's solution is that `configure_surface` seems to be quite an expensive operation, and it would run every frame with the wgpu framework solution, despite the fact it works perfectly fine without `configure_surface`. I know this looks super hacky with the linux-specific line and the AMD check, but my understanding is that the `Timeout` occurence is specific to a quirk of some AMD drivers on linux, and if otherwise met should be considered a bug.

nicopap · 2022-11-01T14:06:11Z

I've resolved the conflicts with main now.

I limit the change strictly to linux and hackishly restrict it to AMD/intel GPUs so that the risk of causing unexpected issues is strictly limited to a smaller subset of users.

Maybe we could limit this to X11 users, since it seems to be limited to X11. This would require adding a x11 feature to bevy_render though which is why I left it out. At this point we are fairly close to fully capture the subset of people who just couldn't run bevy before, so even random graphical bugs would be a net win for them. Linux user are also notorious for pretty good bug reporting and being helpful in diagnosing.

I'm also keeping a close look at the wgpu issue, (gfx-rs/wgpu#1218 and gfx-rs/wgpu#2941) and make sure to revert when it is fixed.

nicopap · 2022-11-05T07:26:32Z

ETA on this? Will it get merged before 0.9? A shame to have something ready that makes bevy usable for a few more people and just overlook it.

coreh · 2022-11-05T21:46:59Z

If we're not entirely sure on the correctness of this, or if other engines take the same approach, could we make it a cargo feature, e.g. ignore-amd-intel-swapchain-timeout, or maybe an environment variable, e.g. BEVY_IGNORE_SWAPCHAIN_TIMEOUT? That way either developers or users can make the call, and we don't postpone the fix until 0.10?

alice-i-cecile · 2022-11-06T01:41:36Z

It's in the milestone: just needs a final call from Cart :)

AxiomaticSemantics · 2022-11-10T21:58:34Z

I've hit this on v0.8.1 Intel Raptor Lake IGP X11+Vulkan while looking into #6417
cargo r --release & sleep 2 && xset dpms force standby makes it easily reproducible for me.
https://gitlab.freedesktop.org/mesa/mesa/-/issues/2849 seems to be the same root issue.

cart · 2022-11-12T08:06:40Z

bors r+

# Objective - Fix #3606 - Fix #4579 - Fix #3380 ## Solution When running on a Linux machine with some AMD or Intel device, when calling `surface.get_current_texture()`, ignore `wgpu::SurfaceError::Timeout` errors. ## Alternative An alternative solution found in the `wgpu` examples is: ```rust let frame = surface .get_current_texture() .or_else(|_| { render_device.configure_surface(surface, &swap_chain_descriptor); surface.get_current_texture() }) .expect("Error reconfiguring surface"); window.swap_chain_texture = Some(TextureView::from(frame)); ``` See: <https://github.com/gfx-rs/wgpu/blob/94ce76391b560a66e36df1300bd684321e57511a/wgpu/examples/framework.rs#L362-L370> Veloren [handles the Timeout error the way this PR proposes to handle it](gfx-rs/wgpu#1218 (comment)). The reason I went with this PR's solution is that `configure_surface` seems to be quite an expensive operation, and it would run every frame with the wgpu framework solution, despite the fact it works perfectly fine without `configure_surface`. I know this looks super hacky with the linux-specific line and the AMD check, but my understanding is that the `Timeout` occurrence is specific to a quirk of some AMD drivers on linux, and if otherwise met should be considered a bug. Co-authored-by: Carter Anderson <mcanders1@gmail.com>

bors · 2022-11-12T08:20:48Z

Pull request successfully merged into main.

Build succeeded:

# Objective - Fix bevyengine#3606 - Fix bevyengine#4579 - Fix bevyengine#3380 ## Solution When running on a Linux machine with some AMD or Intel device, when calling `surface.get_current_texture()`, ignore `wgpu::SurfaceError::Timeout` errors. ## Alternative An alternative solution found in the `wgpu` examples is: ```rust let frame = surface .get_current_texture() .or_else(|_| { render_device.configure_surface(surface, &swap_chain_descriptor); surface.get_current_texture() }) .expect("Error reconfiguring surface"); window.swap_chain_texture = Some(TextureView::from(frame)); ``` See: <https://github.com/gfx-rs/wgpu/blob/94ce76391b560a66e36df1300bd684321e57511a/wgpu/examples/framework.rs#L362-L370> Veloren [handles the Timeout error the way this PR proposes to handle it](gfx-rs/wgpu#1218 (comment)). The reason I went with this PR's solution is that `configure_surface` seems to be quite an expensive operation, and it would run every frame with the wgpu framework solution, despite the fact it works perfectly fine without `configure_surface`. I know this looks super hacky with the linux-specific line and the AMD check, but my understanding is that the `Timeout` occurrence is specific to a quirk of some AMD drivers on linux, and if otherwise met should be considered a bug. Co-authored-by: Carter Anderson <mcanders1@gmail.com>

nicopap added C-Bug An unexpected or incorrect behavior A-Rendering Drawing game state to the screen C-Crash A sudden unexpected crash O-Linux Specific to the Linux desktop operating system labels Sep 12, 2022

alice-i-cecile requested a review from superdump September 12, 2022 12:15

alice-i-cecile added this to the Bevy 0.9 milestone Sep 12, 2022

alice-i-cecile reviewed Sep 12, 2022

View reviewed changes

crates/bevy_render/src/view/window.rs Show resolved Hide resolved

nicopap marked this pull request as draft September 24, 2022 12:57

nicopap mentioned this pull request Sep 26, 2022

When running the examples on Vulkan on an AMD 6900 XT, surface.get_current_texture() times out the first time each frame gfx-rs/wgpu#2941

Open

nicopap changed the title ~~Ignore Timeout errors on Linux AMD~~ Ignore Timeout errors on Linux AMD & Intel Sep 26, 2022

nicopap force-pushed the fix-swapchain-timeout-crash branch from 5896bc2 to a14a344 Compare September 26, 2022 12:00

nicopap marked this pull request as ready for review September 26, 2022 12:11

alice-i-cecile approved these changes Sep 28, 2022

View reviewed changes

superdump added the S-Controversial This requires a heightened standard of review before it can be merged label Sep 29, 2022

alice-i-cecile removed this from the Bevy 0.9 milestone Oct 30, 2022

cart added this to the Bevy 0.9 milestone Oct 30, 2022

nicopap added 4 commits November 1, 2022 12:55

Document quirky behavior

69572e0

Workaround timeout bug on intel devices as well

5b3b454

Add different error message for configuration

fb04948

nicopap force-pushed the fix-swapchain-timeout-crash branch from a14a344 to b1698fe Compare November 1, 2022 14:04

Make error message a trace

587b33f

nicopap force-pushed the fix-swapchain-timeout-crash branch from b1698fe to 587b33f Compare November 2, 2022 08:38

improve variables

a988909

bors bot changed the title ~~Ignore Timeout errors on Linux AMD & Intel~~ [Merged by Bors] - Ignore Timeout errors on Linux AMD & Intel Nov 12, 2022

bors bot closed this Nov 12, 2022

connerebbinghaus mentioned this pull request Dec 27, 2022

Fix panic due to missing swapchain texture. mvlabat/bevy_egui#141

Merged

nicopap mentioned this pull request Apr 16, 2023

thread 'main' panicked at 'Failed to acquire next swap chain texture!' #512

Open

nicopap deleted the fix-swapchain-timeout-crash branch August 30, 2023 13:50

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Merged by Bors] - Ignore `Timeout` errors on Linux AMD & Intel #5957

[Merged by Bors] - Ignore `Timeout` errors on Linux AMD & Intel #5957

nicopap commented Sep 12, 2022 •

edited

nicopap commented Sep 12, 2022

heavyrain266 commented Sep 12, 2022

Ixentus commented Sep 13, 2022

popojan commented Sep 13, 2022

mdickopp commented Sep 13, 2022

mdickopp commented Sep 13, 2022

nicopap commented Sep 24, 2022

nicopap commented Sep 26, 2022

ksf commented Sep 27, 2022 •

edited

alice-i-cecile commented Sep 27, 2022

nicopap commented Sep 28, 2022

alice-i-cecile left a comment

ksf commented Sep 29, 2022

mdickopp commented Sep 29, 2022

superdump commented Sep 29, 2022

nicopap commented Sep 30, 2022

mdickopp commented Sep 30, 2022

meisme-dev commented Oct 24, 2022

nicopap commented Oct 24, 2022

meisme-dev commented Oct 24, 2022

cart commented Oct 30, 2022

cart commented Oct 30, 2022

mdickopp commented Oct 31, 2022

nicopap commented Nov 1, 2022

nicopap commented Nov 5, 2022

coreh commented Nov 5, 2022 •

edited

alice-i-cecile commented Nov 6, 2022

AxiomaticSemantics commented Nov 10, 2022 •

edited

cart commented Nov 12, 2022

bors bot commented Nov 12, 2022

[Merged by Bors] - Ignore Timeout errors on Linux AMD & Intel #5957

[Merged by Bors] - Ignore Timeout errors on Linux AMD & Intel #5957

Conversation

nicopap commented Sep 12, 2022 • edited

Objective

Solution

Alternative

nicopap commented Sep 12, 2022

heavyrain266 commented Sep 12, 2022

Ixentus commented Sep 13, 2022

popojan commented Sep 13, 2022

mdickopp commented Sep 13, 2022

mdickopp commented Sep 13, 2022

nicopap commented Sep 24, 2022

nicopap commented Sep 26, 2022

ksf commented Sep 27, 2022 • edited

alice-i-cecile commented Sep 27, 2022

nicopap commented Sep 28, 2022

alice-i-cecile left a comment

Choose a reason for hiding this comment

ksf commented Sep 29, 2022

mdickopp commented Sep 29, 2022

superdump commented Sep 29, 2022

nicopap commented Sep 30, 2022

mdickopp commented Sep 30, 2022

meisme-dev commented Oct 24, 2022

nicopap commented Oct 24, 2022

meisme-dev commented Oct 24, 2022

cart commented Oct 30, 2022

cart commented Oct 30, 2022

mdickopp commented Oct 31, 2022

nicopap commented Nov 1, 2022

nicopap commented Nov 5, 2022

coreh commented Nov 5, 2022 • edited

alice-i-cecile commented Nov 6, 2022

AxiomaticSemantics commented Nov 10, 2022 • edited

cart commented Nov 12, 2022

bors bot commented Nov 12, 2022

[Merged by Bors] - Ignore `Timeout` errors on Linux AMD & Intel #5957

[Merged by Bors] - Ignore `Timeout` errors on Linux AMD & Intel #5957

nicopap commented Sep 12, 2022 •

edited

ksf commented Sep 27, 2022 •

edited

coreh commented Nov 5, 2022 •

edited

AxiomaticSemantics commented Nov 10, 2022 •

edited