Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Roll algorithm back to v2.5.3, then add back features to avoid a breaking change #130

Closed
wants to merge 16 commits into from

Conversation

notgull
Copy link
Member

@notgull notgull commented Apr 11, 2024

I have to apologize. I know it sounds self-deprecating, but really, I've been over the line on this particular issue. This should hopefully fix it for now.

The new implementation for the event listener, #51, has been plagued with bugs from its inception. Frankly, I've gotten sick of papering over these bugs, and I know that prod customers are almost certainly impacted but haven't reported it.

This PR reverts the implementation of event-listener back to version 2.5.3; before I made any of the sweeping architectural changes to the crate. Then I re-add new changes to this crate in a "path of least resistance" fashion. The goal is to change the internal algorithm as little as possible, as we know this algorithm is stable and robust.

The only real diversion from this path is my adding of a no_std implementation. I've added a no_std implementation based on an atomic linked list built on an amortized atomically expanded vector. Since it doesn't use concurrent-queue it fixes #109.

Let me run a "crater test" on the smol ecosystem before this is merged. Expect a "post mortem" at some point in the near future.

@notgull notgull changed the title The rollback Roll algorithm back to v2.5.3, then add back features to avoid a breaking change Apr 14, 2024
@notgull notgull marked this pull request as ready for review April 14, 2024 04:57
@notgull
Copy link
Member Author

notgull commented Apr 14, 2024

Paging @smol-rs/admins for reviewing this PR. Also paging @mamaicode, @jbr and @beckend as I've re-written some of your commits here.

benches/bench.rs Outdated Show resolved Hide resolved
src/lib.rs Show resolved Hide resolved
src/lib.rs Outdated Show resolved Hide resolved
src/notify.rs Show resolved Hide resolved
tests/counter.rs Show resolved Hide resolved
@zeenix
Copy link
Member

zeenix commented Apr 15, 2024

While I understand why you did it the way you did (to be able to better test/bisect the changes) but it's very hard to review a big bunch of commits like this. Also since you'll likely be squashing, can you already squash all the reverting commits into one (you can add in the description if you only reverted specific commits or it's a hard reset) and then add the new features on top (with each commit describing the change and why it's different now). After that, please do not squash merge when it's time to merge, to maintain the history.

I'm sorry for asking you to do all that but I've very limited time these days for smol (as you might have noticed). :(

notgull and others added 9 commits April 15, 2024 19:58
I'm man enough to know when I've been wrong.

The intrusive-list-based list implementation is riddled with bugs, fails
MIRI, and overall increases the complexity of this crate. I've invested
a lot of time and energy into improving it; effort which I feel has been
wasted. So I figure I'll cut my losses and revert.

This PR undoes every commit made between now and the release of
event-listener version 2.5.3. The next commits will be replaying newer
features in order to avoid a breaking change.

Signed-off-by: John Nunley <dev@notgullnet>
@taiki-e's commits since v2.5.3 aren't the problem, as they only change
CI/framing details. This commit brings back all work by @taiki-e.

Co-authored-by: John Nunley <dev@notgull.net>
Signed-off-by: John Nunley <dev@notgull.net>
This commit adds back the <T> parameter on Event and EventListener. So
far nothing is done with the generic parameter. However it will
eventually be used to bring back tagged events.

Signed-off-by: John Nunley <dev@notgull.net>
This commit re-adds the "notify.rs" file pretty much verbatim. This
allows for tagged notifications to be used, i.e. convey additional
control information along with the notification. The tags are stored in
the links with the notification.

This also adds some tests for the new functionality.

Signed-off-by: John Nunley <dev@notgull.net>
This commit adds back a feature that allows one to use portable-atomic
instead of regular atomics.

Kind of useless without no-std, but it's here again.

Signed-off-by: John Nunley <dev@notgull.net>
The Listener trait was previously used to abstract over heap and stack
based event listeners. However at this point we only have heap based
event listeners. Still, to avoid a breaking change we need to have this
trait.

Other minor fixups include:

- Make the locking ignore poisoning.
- Reduce formatting code

Signed-off-by: John Nunley <dev@notgull.net>
For now it just creates a listener on the heap. We can rework this
later, maybe to include stack listeners. For now it's best to keep it
this way, I feel.

Signed-off-by: John Nunley <dev@notgull.net>
This way added earlier as a way of telling how many listeners were woken
up by a notify() call. It's very simple to implement with the mutex
strategy.

Signed-off-by: John Nunley <dev@notgull.net>
This also makes sure that MIRI can run here.

Signed-off-by: John Nunley <dev@notgull.net>
@notgull
Copy link
Member Author

notgull commented Apr 16, 2024

While I understand why you did it the way you did (to be able to better test/bisect the changes) but it's very hard to review a big bunch of commits like this. Also since you'll likely be squashing, can you already squash all the reverting commits into one (you can add in the description if you only reverted specific commits or it's a hard reset)

The reverts are all squashed into 6368046. The rest of the commits are re-adding the functionality.

then add the new features on top (with each commit describing the change and why it's different now).

Done.

After that, please do not squash merge when it's time to merge, to maintain the history.

Sounds good!

I'm sorry for asking you to do all that but I've very limited time these days for smol (as you might have noticed). :(

All good, no worries

@notgull notgull force-pushed the notgull/revert branch 3 times, most recently from 46d4236 to aa63e03 Compare April 17, 2024 00:03
notgull and others added 6 commits April 17, 2024 17:58
Replay of #31 with minor modifications

Signed-off-by: John Nunley <dev@notgull.net>
- Update Cargo.toml and CHANGELOG.md
- Add back the smol-rs logo (#63)

Signed-off-by: John Nunley <dev@notgull.net>
cc #48 and #59

Signed-off-by: John Nunley <dev@notgull.net>
cc #86, it's harder to get this info on no_std so I've ignored it for
now

Signed-off-by: John Nunley <dev@notgull.net>
Mirror of #114

Signed-off-by: John Nunley <dev@notgull.net>
.github/workflows/ci.yml Outdated Show resolved Hide resolved
Cargo.toml Outdated Show resolved Hide resolved
@notgull notgull force-pushed the notgull/revert branch 2 times, most recently from e9c1f04 to 62ce8fe Compare April 19, 2024 02:27
Co-authored-by: Jacob Rothstein <hi@jbr.me>
Signed-off-by: John Nunley <dev@notgull.net>
@beckend
Copy link
Contributor

beckend commented Apr 30, 2024

@notgull what stops you from merging this?

@notgull
Copy link
Member Author

notgull commented May 1, 2024

@notgull what stops you from merging this?

I'm waiting for reviews from other smol-rs members. I'm aware it's a big PR and that they don't have much time to review it.

@notgull
Copy link
Member Author

notgull commented May 1, 2024

Actually, come to think of it this PR might be a little much.

@smol-rs/admins Would it be preferred if I split this up into smaller PRs, starting with an initial rollback followed by adding on more features?

@zeenix
Copy link
Member

zeenix commented May 1, 2024

@smol-rs/admins Would it be preferred if I split this up into smaller PRs, starting with an initial rollback followed by adding on more features?

I can't say for others but what makes this PR hard to review for me, is it's hard to see what really changed in the end. I understand why you did it the way you did, to be able to figure out where the regression(s) happened but now that you know, I think it would be much better if we don't go back in history, but instead add the needed fixes on top of the current main branch (i-e not doing rollbacks, revert of specific commits is of course good). WDYT?

@notgull
Copy link
Member Author

notgull commented May 4, 2024

I understand why you did it the way you did, to be able to figure out where the regression(s) happened but now that you know, I think it would be much better if we don't go back in history, but instead add the needed fixes on top of the current main branch (i-e not doing rollbacks, revert of specific commits is of course good). WDYT?

The issue is that I'm not 100% sure what the current regressions are caused by, and the current advantages of the current implementation are not worth the effort of my part to investigate them.

@beckend
Copy link
Contributor

beckend commented May 4, 2024

If you have proven stability by adding all these tests, why not call it good enough of a change and merge it.

@zeenix
Copy link
Member

zeenix commented May 5, 2024

The issue is that I'm not 100% sure what the current regressions are caused by, and the current advantages of the current implementation are not worth the effort of my part to investigate them.

Fair enough but then how do we ensure that the regressions weren't re-added by the "new" commits? For all we know, there could be a timing issue and tests don't catch the issue(s) still present. Finding out the issue and fixing it, would be the only certain way of actually fixing the problems. I would hate for us to have to break API again and again.

@zeenix
Copy link
Member

zeenix commented May 5, 2024

Moreover, if there are still regressions/issue, merging this PR as is, would make it super hard to bisect the history to figure out the culprit commits/changes later on.

In any case, since I'm not that involved in the project anymore and you're doing the work, I'd leave it to you how you want to proceed. I would request that you consider my advice here.

@notgull
Copy link
Member Author

notgull commented May 5, 2024

Fair enough but then how do we ensure that the regressions weren't re-added by the "new" commits? For all we know, there could be a timing issue and tests don't catch the issue(s) still present.

I'm using the async-lock rwlock contention tests as a benchmark here. This current code passes this test and master does not. I'll admit that it's not a perfect benchmark but I think it generally works as an indicator.

But, fair enough. I'll do a deep dive into the current code and see if I can figure out what the issue is.

@zeenix
Copy link
Member

zeenix commented May 6, 2024

But, fair enough. I'll do a deep dive into the current code and see if I can figure out what the issue is.

I appreciate your very constructive attitude to my rather blunt feedback. 👍

@notgull
Copy link
Member Author

notgull commented May 16, 2024

cc #133

notgull added a commit that referenced this pull request May 27, 2024
The notify() function contains two optimizations:

- If the `inner` field is not yet initialized (i.e. no listeners have been
  created yet), it returns `0` from the function early.
- If the atomic `notified` field indicates that the current notification would
  do nothing, it returns `0` early as well.

`loom` testing has exposed race conditions in these optimizations that can cause
notifications to be missed, leading to deadlocks. This commit removes these
optimizations in the hope of preventing these deadlocks. Ideally we can restore
them in the future.

Closes #138
cc #130 and #133

Signed-off-by: John Nunley <dev@notgull.net>
@notgull
Copy link
Member Author

notgull commented May 27, 2024

Obsoleted by #139

@notgull notgull closed this May 27, 2024
notgull added a commit that referenced this pull request May 28, 2024
The notify() function contains two optimizations:

- If the `inner` field is not yet initialized (i.e. no listeners have been
  created yet), it returns `0` from the function early.
- If the atomic `notified` field indicates that the current notification would
  do nothing, it returns `0` early as well.

`loom` testing has exposed race conditions in these optimizations that can cause
notifications to be missed, leading to deadlocks. This commit removes these
optimizations in the hope of preventing these deadlocks. Ideally we can restore
them in the future.

Closes #138
cc #130 and #133

Signed-off-by: John Nunley <dev@notgull.net>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

Remove concurrent-queue dependency
6 participants