Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Solaris FEN using EventPorts from x/sys/unix #371

Merged
merged 61 commits into from Oct 13, 2022
Merged

Conversation

nshalman
Copy link
Contributor

@nshalman nshalman commented Jul 8, 2021

What does this pull request do?

This pull request adds fsnotify support for Solaris and illumos via Solaris FEN. It is a cleanup of #263 by @isaacdavis which in turn was heavily based on #196, authored by @gfrey. I have kept both of them in the AUTHORS file and added myself as well.

The biggest change from that iteration was moving all of the cgo and unsafe code out into x/sys/unix (see 324630). Additionally, in the process of that work, I wrote a wrapper for port_getn(3c) which I have used here to allow a single call to retrieve multiple events.

Supersedes #263
Fixes #12

Where should the reviewer start?

I squashed and rebased #263 and have preserved my development branch.
To see how this code compares to that implementation, this URL will render the diff:
nshalman/fsnotify@rebased-solaris-fen...nshalman:illumos-fen-safer

This isn't really recognizable compared to the original any more. Hopefully fen.go having been stripped of all the cgo
is now readable without really requiring event ports / FEN expertise.

The heart of the cgo bits now live on in x/sys/unix

How should this be manually tested?

All of my testing was run on SmartOS, but the underlying event ports code includes tests that successfully ran under CI on a Solaris host. This PR also includes a GitHub action that runs an OmniOS VM to run the tests for CI.

Unfortunately, I've seen intermittent test failures and I haven't be able to fully identify what is happening with those.
I believe I've finally found all the bugs. This code is sensitive to the short sleeps in some of the integration tests, and thus depends on #422.
I've been stress testing with echo {1..72} | xargs -n 1 | xargs -t -I {} -P 24 go test -count={} and the VM tests which used to be flaky are now reliable (and are set to run 10 times rather than just once)

I have also done a cursory test of compiling hugo (v0.92.0) against this branch and running the tests and resulting binary.

hugo $ go test --count=1 github.com/gohugoio/hugo/watcher/filenotify
ok      github.com/gohugoio/hugo/watcher/filenotify     2.818s

Running the resulting hugo binary under truss reveals port_getn being called with arguments revealing that it's the call in fsnotify.

Copy link
Contributor Author

@nshalman nshalman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the README also has a table indicating support and will need to be updated as well.

Is there a preference for keeping the changes all as a single commit, or should some of these changes be split into separate commits?

go.mod Outdated Show resolved Hide resolved
CHANGELOG.md Outdated
@@ -1,5 +1,9 @@
# Changelog

## v1.5.0 / 2021-08-28
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This date is made up

@nathany
Copy link
Contributor

nathany commented Jul 31, 2021

Thanks for working on this.

Should #263 be closed?

@nshalman
Copy link
Contributor Author

nshalman commented Aug 2, 2021

Should #263 be closed?

That is what I would do.

While I have your attention, any thoughts on my question:

Is there a preference for keeping the changes all as a single commit, or should some of these changes be split into separate commits?

Thanks!

@nathany
Copy link
Contributor

nathany commented Aug 18, 2021

Is there a preference for keeping the changes all as a single commit, or should some of these changes be split into separate commits?

Fewer commits (or squash merge) is nice, as you've done here.

.github/workflows/test.yml Outdated Show resolved Hide resolved
@nathany
Copy link
Contributor

nathany commented Aug 18, 2021

@nshalman Is there someone with FEN/Solaris experience that can provide your code review and approval?

fen.go Outdated Show resolved Hide resolved
Copy link
Contributor Author

@nshalman nshalman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll look into getting this into shape to land after #289, and I'll see if I can rustle up a reviewer with an eye to the Event Ports bits. (Maybe @tklauser??)

.github/workflows/test.yml Outdated Show resolved Hide resolved
fen.go Outdated Show resolved Hide resolved
Copy link
Contributor Author

@nshalman nshalman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did the testing and found the need to switch the call in AddRaw from os.Stat to os.Lstat but there are two other places where I call out to os.Stat and I'm not 100% clear on the semantics in those situations (in Remove and at the end of handleEvent)

I'm not totally convinced that the integration tests are exercising those code paths both with and without symlinks and might need to write some additional tests to figure out for sure if they need to be changed as well or not.

My gut says that they need to be changed too, and that the passing tests when only changing the call in AddRaw indicates a gap in the testing.

fen.go Outdated Show resolved Hide resolved
@nathany
Copy link
Contributor

nathany commented Jan 14, 2022

@nshalman It may be better to build this out as a separate FEN project -- e.g. fsnotify/fen

I can't see myself maintaining or even doing code reviews on this -- I have enough trouble keeping up with the existing platforms.

What do you think?

@nshalman
Copy link
Contributor Author

@nshalman It may be better to build this out as a separate FEN project -- e.g. fsnotify/fen

I can't see myself maintaining or even doing code reviews on this -- I have enough trouble keeping up with the existing platforms.

What do you think?

Hey, I haven't had free time to work on this code myself. I've still seen odd test failures when I stress test it and I'm not sure if it's bug(s) in my code, bugs in the tests, or what.

I have no objection moving the code somewhere else rather than landing it into the tree as is, but my question is how that would enable other software to consume it. Ideally existing software that uses fsnotify that currently doesn't work on Solaris and illumos would be able to start working when a new release of fsnotify comes out that uses it.

Do you mean that I could keep the code in a separate package that fsnotify could pull in?

Thank you for keeping this conversation alive. I'm happy to work toward whatever makes the most sense when I finally have time to pick it up again.

@nathany
Copy link
Contributor

nathany commented Jan 14, 2022

I think there is the potential to eventually make fsnotify utilize an external FEN library.

In the interim, people who really want Solaris support could wrap fsnotify + fen libraries in some way. For example, Hugo uses both fsnotify + polling (I admit I haven't looked too closely at how they are doing it).
https://github.com/gohugoio/hugo/tree/master/watcher/filenotify

The idea of making fsnotify a very light wrapper around platform-specific libraries is something I've been thinking about for a long time. A large part of that is to have small teams that own each platform-specific library and can work freely on them.

The cross-platform aspect makes it really tricky for contributors, so I'd like to find a way to alleviate those troubles if possible.

@jclulow
Copy link

jclulow commented Jan 14, 2022

The cross-platform aspect makes it really tricky for contributors, so I'd like to find a way to alleviate those troubles if possible.

At it's heart, fsnotify is a cross-platform abstraction layer, so it feels like that's what contributors are signing up for when they contribute. Indeed, people have been trying to add support for illumos to this library for years now:

When PR Link
5th February 2017 Add support for solaris FEN #196
28th August 2018 add Solaris FEN support #263
7th July 2021 Add Solaris FEN using EventPorts from x/sys/unix #371

The idea of making fsnotify a very light wrapper around platform-specific libraries is something I've been thinking about for a long time. A large part of that is to have small teams that own each platform-specific library and can work freely on them.

That sounds great, but it's not the status quo today. I think it would be unfortunate to put something that might never happen on the critical path to getting this change integrated so people can make use of it.

In the interim, people who really want Solaris support could wrap fsnotify + fen libraries in some way. For example, Hugo uses both fsnotify + polling (I admit I haven't looked too closely at how they are doing it). https://github.com/gohugoio/hugo/tree/master/watcher/filenotify

This would create a lot of extra technical and social work: rather than amend the cross-platform abstraction library, you are asking just one platform to go and reimplement that abstraction in any number of other bodies of consuming software. It seems like fsnotify is popular; that could be tens or hundreds of other bodies of software, each requiring rework and negotiation with maintainers.

I agree that we, and you, are all already resource constrained enough, which is why people have been trying to do the work just once, in the right place: here.

It isn't clear how much or what sort of additional code review is necessary before a merge. Can you outline your concerns, and what you feel remains outstanding? People in the community can likely assist with maintenance and debugging, say¸ as issues arise in future -- but it really needs to get merged so that people are actually using it in downstream software. Right now it's just sitting in a branch.

@nshalman
Copy link
Contributor Author

nshalman commented Sep 8, 2022

With the closing of golang/go#54363 we should be clear to pull in the latest commit of x/sys/unix and get this work unblocked again!

@arp242
Copy link
Member

arp242 commented Sep 8, 2022

Updated it in the main branch; didn't merge it here.

So if I understand things correctly people will need a fairly recent version of illumos or it will panic? Not sure which versions of illumos distributions contain this change yet (if any?) Need to update the CI and my local test environment, as well as communicate this clearly to users.

@Toasterson
Copy link

Right now I can find the git hash abb88ab1b9516b1ca12094db7f2cfb5d91e0a135 in omnios bloody, OpenIndiana hipster (upstream illumos-gate), and the opensolaris porting project. I can't find it in illumos-joyent but there it might have a different hash as last gate merge was 5 hours ago, so unless it was specifically skipped it's in all upstrem tracking distros. In whcih release it made it already would be a question for @danmcd and @citrus-it as I do not know how to track it beyond scrubbing for it in github.

@danmcd
Copy link

danmcd commented Sep 8, 2022

It's in illumos-omnios:master, which will eventually land in a bloody build:

nowhere(~/ws/illumos-omnios)[0]% git branch
* master
  nfs-zone
  r151030
  r151032
  r151034
  r151036
  r151038
  r151042
nowhere(~/ws/illumos-omnios)[0]% git log -1 abb88ab1b9516b1ca12094db7f2cfb5d91e0a135
commit abb88ab1b9516b1ca12094db7f2cfb5d91e0a135
Author: Robert Mustacchi <rm@fingolfin.org>
Date:   Sun Aug 7 20:20:53 2022 +0000

    14898 port_associate PORT_SOURCE_FILE doesn't update user obj properly
    Reviewed by: Nahum Shalman <nahamu+illumos@gmail.com>
    Reviewed by: Andy Fiddaman <illumos@fiddaman.net>
    Approved by: Richard Lowe <richlowe@richlowe.net>
nowhere(~/ws/illumos-omnios)[0]% git branch -a --contains abb88ab1b9516b1ca12094db7f2cfb5d91e0a135
* master
  remotes/origin/HEAD -> origin/master
  remotes/origin/master
nowhere(~/ws/illumos-omnios)[0]% 

@danmcd
Copy link

danmcd commented Sep 8, 2022

And FWIW, you need a fairly recent SmartOS build for this fix:

kebe(~/ws/ij-cr)[0]% git branch -a --contains abb88ab1b9516b1ca12094db7f2cfb5d91e0a135
* master
  remotes/origin/HEAD -> origin/master
  remotes/origin/OS-8403
  remotes/origin/master
  remotes/origin/release-20220825
  remotes/origin/release-20220908
kebe(~/ws/ij-cr)[0]% 

@nshalman
Copy link
Contributor Author

nshalman commented Sep 8, 2022

So if I understand things correctly people will need a fairly recent version of illumos or it will panic? Not sure which versions of illumos distributions contain this change yet (if any?) Need to update the CI and my local test environment, as well as communicate this clearly to users.

It might panic the golang application (not the operating system.) For most uses cases it probably won't. It took a fair amount of stress testing for me to trip the bug.... If it does panic, the error message points explicitly at golang/go#54254 which will guide users to look into an OS update.

@papertigers did ping me on IRC about the need for an update for the CI OS image. Might take a little bit for that to be ready.

@arp242
Copy link
Member

arp242 commented Sep 8, 2022

So if I understand things correctly people will need a fairly recent version of illumos or it will panic? Not sure which versions of illumos distributions contain this change yet (if any?) Need to update the CI and my local test environment, as well as communicate this clearly to users.

It might panic the golang application (not the operating system.) For most uses cases it probably won't. It took a fair amount of stress testing for me to trip the bug.... If it does panic, the error message points explicitly at golang/go#54254 which will guide users to look into an OS update.

papertigers did ping me on IRC about the need for an update for the CI OS image. Might take a little bit for that to be ready.

Ah right, I see your latest commit improved things a bit; I didn't look at the details before and just assumed it was 100% fixed on illumos side.

I think this is mostly good to merge then, right? We still have the issue of the CI not running here; I looked in that a few weeks ago, and it's not running because I changed:

on: ['push', 'pull_request']

to:

on:
  push:
    paths: ['**.go', 'go.mod', '.github/workflows/*']

It seems that it will only run "on" a push to this repo, but not to forks. Makes sense, but behaves different than I expected.

These "double" CI runs were causing me real grief because GitHub macOS runners are very limited, and we're spinning off 7 of them now for the various tests; I was waiting almost an hour at times for the CI to finish, so I'd rather not "just" add it back. Even with just push it's already pretty long waits at times.

What we need is something like:

on: ['push_from_anywhere_including_fork']

But I couldn't really figure out how to do that :-/

Maybe I should just change it back for now...

Replace deprecated use of ioutil
During even processing, use LStat, but Stat explicitly watched symlinks
Make event bit checking more succinct
@nshalman
Copy link
Contributor Author

I think this is mostly good to merge then, right? We still have the issue of the CI not running here; I looked in that a few weeks ago, and it's not running because I changed:

The behavior on symlinks you noted worries me; I'd prefer to get it fixed before landing if possible.

The TestWatch/watch_a_symlink_to_a_file I added in #498 (and merged here too) seems to generate a lot of Write events:

--- FAIL: TestWatch/watch_a_symlink_to_a_file (0.76s)
    fsnotify_test.go:214: 
        have:
        	WRITE                "/link"
        	WRITE                "/link"
        	[.. 16578 more elided ..]
        	WRITE                "/link"
        	WRITE                "/link"
        want:
        	WRITE                "/link"

The watch_a_symlink_to_a_dir test seems fine.

The symlink behavior quirks also remind me about #387 and all the other levels of confusion that symlinks can cause relative to fsnotify semantics...

We might want to test and document our current behavior on symlinks on Linux in a bunch of different situations and make some improved test cases around them to give me more to work with while hunting it down.

I was able to trivially make the tests pass by incorrectly changing the call to reassociate a path that had fired using lstat info, but I'm pretty sure that would fail a more robust test case of multiple write events needing to properly fire.

I'm hoping to poke around today to see if I can figure out the correct fix and will update accordingly.

@nshalman
Copy link
Contributor Author

When I remove the calls to skip the symlink tests, sometimes I see this:

warning: GOPATH set to GOROOT (/home/nshalman/go) has no effect
--- FAIL: TestWatch (0.00s)
    --- FAIL: TestWatch/watch_a_symlink_to_a_dir (1.65s)
        helpers_test.go:384: this EventPort is already closed
        helpers_test.go:354: event stream was not closed after 1s
    --- FAIL: TestWatch/watch_a_symlink_to_a_file (1.70s)
        helpers_test.go:384: this EventPort is already closed
        helpers_test.go:354: event stream was not closed after 1s

@nshalman
Copy link
Contributor Author

Okay, I now understand the symlink issue (passing unix.FILE_NOFOLLOW as part of the events and a symlink path, but the FileInfo from Stat rather than Lstat seems to basically cause the kernel to fire an event immediately and it's not wrong...)

I have a fix in place, the test skips are removed, and the tests pass!

@nshalman nshalman marked this pull request as ready for review September 11, 2022 20:57
@nshalman
Copy link
Contributor Author

I have a reproducible test failure on illumos with go 1.19 when run as root:

--- FAIL: TestAdd (0.00s)
    --- FAIL: TestAdd/permission_denied (0.20s)
        fsnotify_test.go:889: error is nil

The test passes cleanly when run as a non-root user.
I'm guessing that on illumos the root user isn't stopped by mere permissions in that test case. 🤔

@arp242
Copy link
Member

arp242 commented Sep 15, 2022

The behavior on symlinks you noted worries me; I'd prefer to get it fixed before landing if possible.

Ah yeah, I forgot about that. So many comments and hard to keep track of stuff.

I'll finish up the last few things today or tomorrow and do a new release; as far as I'm concerned it's okay to merge it after that. It doesn't need to be 100% perfect and we can create new issues and improve things before the next release. Personally I'd say that's a tad easier to keep track of things, but whatever works for you is fine with me so it's your call.

@nshalman
Copy link
Contributor Author

I'll finish up the last few things today or tomorrow and do a new release; as far as I'm concerned it's okay to merge it after that. It doesn't need to be 100% perfect and we can create new issues and improve things before the next release. Personally I'd say that's a tad easier to keep track of things, but whatever works for you is fine with me so it's your call.

I agree that this is in a good enough state to merge.

@nshalman
Copy link
Contributor Author

nshalman commented Oct 9, 2022

@arp242 I imagine you're pretty busy, but it's been a few weeks so I figured I'd check in. Is there anything I can do to help get the next release cut so that this can be merged after that? I did fix the symlink issue, so I'm feeling very good about this code. Thanks!!

@arp242
Copy link
Member

arp242 commented Oct 13, 2022

Yeah, sorry; new job and some other issues I had to deal with 😅

I released v1.6.0 so we have a bit of time to shake out some potential issues before a new release.

@arp242 arp242 merged commit 2b19046 into fsnotify:main Oct 13, 2022
@shogo82148 shogo82148 mentioned this pull request Mar 6, 2024
25 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

File Events Notifications on Solaris
8 participants