
The SPI sharing utilities are broken for fallible chipselect pins #573

Open
GrantM11235 opened this issue Jan 23, 2024 · 3 comments · May be fixed by #574


@GrantM11235
Contributor

This affects all the SPI sharing utilities in some way, but I would like to highlight the worst-case scenario:

  • You are using the mutex or critical-section impl
  • In thread 1 you do a transaction to device A, but the CS pin fails to deassert
  • Before thread 1 even has a chance to react to the error, it is preempted by thread 2
  • Thread 2 does a transaction to device B while the CS for device A is still active
  • This leads to data corruption at best, and physical damage at worst

Option 1: Infallible only

CS: OutputPin<Error = Infallible>

This is the simplest option. In fact it is even simpler than the existing impls because it allows us to remove the DeviceError enum.

You can still use fallible chipselect pins if you wrap them in some sort of adapter that panics on failure. We could even provide such an adapter.
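For example, here is a minimal sketch of such an adapter, assuming embedded-hal 1.0's digital traits (the `PanicOnError` type is hypothetical, not something embedded-hal-bus currently provides):

```rust
use core::convert::Infallible;
use embedded_hal::digital::{ErrorType, OutputPin};

/// Hypothetical adapter: wraps a fallible `OutputPin` and panics if setting
/// the pin level ever fails, so the wrapper can expose `Error = Infallible`.
pub struct PanicOnError<P>(pub P);

impl<P: OutputPin> ErrorType for PanicOnError<P> {
    type Error = Infallible;
}

impl<P: OutputPin> OutputPin for PanicOnError<P> {
    fn set_low(&mut self) -> Result<(), Self::Error> {
        self.0.set_low().expect("chipselect pin failed to go low");
        Ok(())
    }

    fn set_high(&mut self) -> Result<(), Self::Error> {
        self.0.set_high().expect("chipselect pin failed to go high");
        Ok(())
    }
}
```

A wrapped pin then satisfies the `OutputPin<Error = Infallible>` bound above, and a failed pin write becomes an immediate panic instead of an error that another thread could race past.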

Option 2: track "poisoned" chipselects

I have some idea of how to implement this: it would involve adding a shared flag to each bus, as well as another flag for each SpiDevice (roughly sketched below). The impl would refuse to do any SPI transactions until the offending chipselect is fixed.

This would add a lot of extra overhead and complexity that is completely unnecessary for 99.9% of users, and I don't think the compiler would be able to remove that overhead when it isn't needed.
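To illustrate where that extra state would live, here is a rough sketch of the shared bus-level flag (the per-device flag is omitted); none of these types exist in embedded-hal-bus:

```rust
use core::cell::{Cell, RefCell};
use embedded_hal::digital::OutputPin;

/// Hypothetical shared-bus wrapper carrying a poison flag alongside the bus.
pub struct PoisonableBus<BUS> {
    bus: RefCell<BUS>,
    poisoned: Cell<bool>,
}

/// Hypothetical per-device handle; a real impl would also implement SpiDevice.
pub struct PoisonableDevice<'a, BUS, CS> {
    bus: &'a PoisonableBus<BUS>,
    cs: CS,
}

#[derive(Debug)]
pub struct BusPoisoned;

impl<'a, BUS, CS: OutputPin> PoisonableDevice<'a, BUS, CS> {
    /// Every transaction would start by checking the shared flag...
    fn check_poisoned(&self) -> Result<(), BusPoisoned> {
        if self.bus.poisoned.get() {
            Err(BusPoisoned)
        } else {
            Ok(())
        }
    }

    /// ...and end by deasserting CS, poisoning the whole bus on failure so
    /// that no other device can start a transaction while this CS may still
    /// be asserted.
    fn deassert_cs(&mut self) -> Result<(), BusPoisoned> {
        if self.cs.set_high().is_err() {
            self.bus.poisoned.set(true);
            return Err(BusPoisoned);
        }
        Ok(())
    }
}
```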

Option 3: do nothing

Document the broken error handling, how to work around it, and where it is impossible to work around.

If we want to be thorough, the warnings and disclaimers will probably outweigh the actual code:

  • how to detect a failed chipselect (see the sketch after this list)
    • DeviceError::Cs, obviously, but also
    • DeviceError::Spi, because it can "hide" chipselect errors
    • basically any SpiDevice error
    • downstream errors such as DisplayError::BusWriteError from display_interface (but ironically not DisplayError::CSError)
  • how to properly recover from a chipselect error
    • before using any device on the bus:
      • get access to the SpiDevice by destroying whichever driver was using it
      • get access to the chipselect by destroying the SpiDevice
      • do whatever you need to do to fix and deassert the chipselect pin
    • in multithreaded/concurrent code, it is impossible to ensure that this is done before another thread uses a device on the bus, so a chipselect failure cannot be handled correctly in this case
  • what happens if you use a device on the bus before fixing the chipselect error
    • this is a problem (albeit a minor one) even for ExclusiveDevice, because the framing will be messed up
    • if two chipselects are active at once
      • the other device will receive garbage data
      • both devices will fight to drive MISO, causing data corruption and possibly even physical damage
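
To give a flavour of what that documentation would have to spell out just for detection, here is a sketch against the existing `DeviceError` enum (with its `Spi` and `Cs` variants); `shut_down_spi` is a placeholder for whatever recovery the application can actually perform:

```rust
use embedded_hal_bus::spi::DeviceError;

/// Sketch: decide whether an SpiDevice error may have left CS asserted.
fn handle_device_error<BUS, CS>(err: DeviceError<BUS, CS>) {
    match err {
        // The obvious case: asserting or deasserting CS failed outright.
        DeviceError::Cs(_) => shut_down_spi(),
        // Less obvious: a bus error can hide a chipselect failure, because
        // the CS deassert attempted after a failed transfer is not reported
        // separately. To be safe, treat it the same way.
        DeviceError::Spi(_) => shut_down_spi(),
    }
}

/// Placeholder for application-specific recovery: disable the SPI peripheral,
/// tear down drivers to get the CS pin back, fix and deassert it, etc.
fn shut_down_spi() {}
```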

Conclusions

If you only have infallible chipselect pins, or if you want to panic on chipselect failure, option 1 is the best.

If you actually want to try to gracefully recover from chipselect errors, option 2 is the only real option.

Option 3 isn't good for either group: it is slightly inconvenient for the first group, and almost impossible to use correctly for the second.

My recommendation is option 1 because it is the best option for most people. If someone really needs option 2, they can write it for themselves; it doesn't need to be in our crate.

@Sh3Rm4n
Contributor

Sh3Rm4n commented Jan 23, 2024

If I understand it correctly, asserting CS in hardware somewhat resembles taking a software lock on the SPI device, so if the CS deassert fails, the software lock should not be released either, or at least this should not be allowed to happen:

  • Thread 2 does a transaction to device B while the CS for device A is still active

I'm not sure if and how the situation can be recovered. Panicking should be avoided; reporting the chipselect error can be useful. The user is then at least able to shut down the SPI bus to avoid any further damage to the hardware (if it is at risk).

GrantM11235 linked a pull request (#574) on Jan 23, 2024 that will close this issue
@ryankurte
Contributor

I'm not super convinced that making CS pins infallible / wrapping them in panics is the right way to handle this... I can't think of any situation with fallible IOs in which you're likely to be able to recover from a CS failure, particularly because if a write fails you're not going to be able to tell what state the pin is in, so you'd basically need to tear down and re-init the bus.

The problem with wrapping in a panic is that it cuts off the error-handling path that would let you at least tear things down neatly. It seems to me that a CS failure should poison the whole shared bus (within the critical section).

@GrantM11235
Contributor Author

> I'm not super convinced that making CS pins infallible / wrapping them in panics is the right way to handle this

It's not always the right way to handle it, but I think it is the best way to handle it for the vast majority of cases.

I think #574 strikes a good balance. It is useful to most people, has no unnecessary overhead, and is honest about its limitations.

Anyone who does need to do more complex error handling is free to write their own SpiDevice impl. After all, there is nothing special about the impls in embedded-hal-bus; they are just some utilities that we think will be useful to other people.

If there is enough interest, we could even add another set of SpiDevice impls to embedded-hal-bus that do poisoning, but I don't think enough people would use it to justify the mild maintenance burden.
