
Feature: symlink support revised #381

Open
e2 opened this issue Apr 27, 2016 · 6 comments
e2 commented Apr 27, 2016

Intro

The goal of Listen is portability. This is tricky, because each backend behaves differently. This means things like symlink support and TCP support had to be put aside for now.

But, with a better architecture, it should be possible to extend Listen without all the pitfalls of doing so.

Milestone: A new architecture.

Currently it goes something like this:

  1. Backend adapter (Linux, Darwin, Windows, BSD, Polling) creates some kind of event. ("Some kind" is the issue here.)
  2. That event is translated into "something changed".
  3. Listen does complex gymnastics to work out what really did change.
  4. Changes are organized and optimized to avoid spamming (too many identical events) and for editor support (moves/renames caused by just saving a file).
  5. Changes are queued for a period of time before the callback is called.

The new architecture would work something like this:

  1. Backend adapter generates a "directories which may have changed" event with a list of directories containing changed/added/removed files.
  2. Listen keeps scanning directories and building a symlink "database" until things are no longer changing (or a timeout happens).
  3. Listen reports how long the scanning took (or if it timed out).
  4. Listen then compares the changes with the "filesystem snapshot" (generated on startup and before every callback with changes). That's how it generates a list of changed files and "changed symlinks". (All possible "combinations" are reported, so pattern matching can work reliably.)
  5. Listen now has a list of "changed files". It then reduces, optimizes and "reinterprets" that list of changes (for editor support) and this is also where pattern ignoring happens (no sooner, because the ignore filters are needed for editor support).
  6. The "filesystem snapshot" is updated and the user's callback is called with the changes.
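The snapshot comparison in step 4 could be sketched roughly like this (hypothetical names and data shapes, not Listen's actual API):

```ruby
# Minimal sketch of diffing a previous "filesystem snapshot" against a fresh
# scan. A snapshot here is just { path => [size, mtime] }; real metadata
# would be richer.
def diff_snapshots(old_snap, new_snap)
  added    = new_snap.keys - old_snap.keys
  removed  = old_snap.keys - new_snap.keys
  modified = (old_snap.keys & new_snap.keys).select do |path|
    old_snap[path] != new_snap[path] # any metadata difference counts as a change
  end
  { added: added, removed: removed, modified: modified }
end

old_snap = { "foo.rb" => [120, 100], "bar.rb" => [10, 90] }
new_snap = { "foo.rb" => [125, 105], "baz.rb" => [5, 104] }
changes  = diff_snapshots(old_snap, new_snap)
# => {added: ["baz.rb"], removed: ["bar.rb"], modified: ["foo.rb"]}
```

Because the diff is a pure function of two snapshots, it can be unit tested without ever touching a real filesystem, which is the whole point of the rework.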

Why

Current "architecture mistakes" of Listen:

  1. Backends produce different kinds of events. (Hard to maintain/refactor without breaking things).
  2. Backends are not "directory oriented". (Listen is insanely complex because of this).
  3. Not "transactional" - handling changes "as they come" is just impossible to unit test effectively.
  4. No concept of "filesystem snapshot" due to "optimizing too much". This made development hard.
  5. Too much dependency on timing. (Makes debugging too difficult and complex).
  6. Scanning the filesystem and optimizing results in the same classes. (Very difficult to mock/stub and test, gets in the way of network support).
  7. Listen is not split into independent components. (Too much of the overall architecture is actually a workaround/optimization for OSX and Polling, which is confusing).

First step

Just getting every backend to return a list of changed directories. If a backend can "guarantee" that all files are reported properly (Linux and Windows backends only), then those files can serve as "optimizer hints".

Otherwise, full "directory scanning" should be the default. E.g. if Linux reports that foo.rb was modified, then it reports a change in the directory containing foo.rb - and maybe hinting that only foo.rb was changed in that directory.
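As a sketch (illustrative names only, not Listen's real event types), such a directory-oriented backend event might carry just the directory plus an optional per-file hint:

```ruby
# A directory-oriented backend event: the directory is authoritative, the
# file name is only an optional optimizer hint. (Hypothetical shape.)
DirEvent = Struct.new(:dir, :hint)

# Linux/Windows backends, which report files reliably, can include the hint:
hinted   = DirEvent.new("/src", "foo.rb")

# Other backends just name the directory, forcing a full scan of it:
unhinted = DirEvent.new("/src", nil)
```

Dropping the hint never changes correctness, only speed, which is what makes the optimization safe to disable.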

Correctness is more important than speed, so it should be possible to disable any "optimization" at any time. (To keep the codebase simple.)

Full directory scanning seems like overkill, but it's absolutely necessary for Listen to be maintainable, testable and "correct". This may slow things down on Linux a tiny bit, but that shouldn't be noticeable. (Especially since new optimizations would be possible).

Also, "full directory scanning" is the only sane way to provide support for symlinks.

If anyone is interested in tearing up the codebase to make this happen, I'd be more than happy to support you.

@e2 added the ✨ Feature label Apr 27, 2016

driskell commented May 1, 2016

I did continue work on the rework in my fork and I'm currently using it. I did strip out all the symbolic link management, though, and it works fairly well, although I've found a few racy data accesses.

I'm unsure what you mean by a symlink database, though. This, I guess, is going to be the entire reason for the architecture change, as symlinks do add a huge headache. It'd be useful to know your thoughts on structure and handling, and whether it needs to interact with multiple backends.

I agree with most of what you say, and I do see being able to reuse a lot of what is already there. I just don't want to make assumptions, as I'm unsure about the symlink database's inner workings.

On a side note, I found some fundamental issues with rb-fsevent, e.g. it ignores kFSEventStreamEventFlagMustScanSubDirs, which of course isn't ideal. I've not seen any real-world occurrence as of yet, though! Just thought I'd add it here while we discuss futures :)


e2 commented May 1, 2016

It'd be useful to know your thoughts on structure and handling, and whether it needs to interact with multiple backends.

A symlink database would basically handle reporting changes. E.g. if `symlinked/foo.rb` changes, then you'll get two changes reported to the end user: `real/foo.rb` (but only if the user is also explicitly watching the `real` directory) and `symlinked/foo.rb` (because the user is watching `symlinked`, and matching patterns against it).

So the symlink database is just to later translate changes into symlink targets changed. It would be separate from the existing record/filesystem database for easier testing, etc. It's just a "last step". Most of Listen will work on real paths. So symlink -> real happens at the beginning (to remember what the user really cares about watching) and real -> symlinks happens at the end (to translate real changes into all the possible combinations of symlinks to report them as changed). So this is "outside" the normal processing. It could even be added right now, but I'd improve the architecture first.
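A rough sketch of both translation steps (all names and data shapes here are made up for illustration):

```ruby
# symlink root => real root, recorded when watching starts (hypothetical shape)
SYMLINKS = { "symlinked" => "real" }

# Beginning: symlink -> real (what is really being watched)
def to_real(path)
  root, rest = path.split("/", 2)
  [SYMLINKS.fetch(root, root), rest].compact.join("/")
end

# End: real -> every symlinked spelling the user may be watching
def to_reported(real_path, watched_roots)
  aliases = SYMLINKS.select { |_link, real| real_path.start_with?("#{real}/") }
                    .map    { |link, real| real_path.sub(real, link) }
  ([real_path] + aliases).select { |p| watched_roots.any? { |r| p.start_with?(r) } }
end

to_real("symlinked/foo.rb")                    # => "real/foo.rb"
to_reported("real/foo.rb", %w[real symlinked]) # => ["real/foo.rb", "symlinked/foo.rb"]
```

Everything between those two calls operates on real paths only, which keeps the symlink handling "outside" the normal processing, as described above.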

On a side note, I found some fundamental issues with rb-fsevents

I wouldn't know. rb-fsevent isn't optimized (it has its own wait loop), but it could be integrated more tightly with Listen for better optimization - given a better architecture. The idea is: the faster rb-fsevent can respond, and the more "hints" it can give about changes, the more it can be optimized.

And the more Listen's parts can be reused.

E.g. file size may be a faster and more reliable "first check" rather than mtime.
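That cheap first check could look something like this (a sketch of the idea, not Listen's code):

```ruby
# Size differs => definitely changed; only fall back to mtime when sizes match.
# (mtime can have 1-second resolution on HFS/FAT, so it's the weaker signal.)
def maybe_changed?(old_meta, new_meta)
  return true if old_meta[:size] != new_meta[:size]
  old_meta[:mtime] != new_meta[:mtime]
end

maybe_changed?({ size: 10, mtime: 1 }, { size: 12, mtime: 1 }) # => true
maybe_changed?({ size: 10, mtime: 1 }, { size: 10, mtime: 1 }) # => false
```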

I think the overall "philosophy" behind the new architecture is: "directory diff based". So it's about comparing directory snapshots over time, which is very different from storing an "active" state of the filesystem in the Record structures.

Scanning and diffing directories seems expensive, but it's what actually happens in the Polling and Darwin adapters (they share most of the same code paths). My earlier mistake was over-optimizing Record access instead of turning it into 3 separate databases:

  1. A flat "watched directory" list, mapping symlinks to real watched directories. (Currently this is more of a "configuration" option, and not a real internal structure).
  2. A flat "symlink database" for turning a real event into all the possible combinations of "symlinks changed"
  3. A flat list of directory snapshots (where each snapshot stores last time scan started, last time scan ended, all the data of each entry, including mod, size, mtime of all subdirectories and files within).
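In Ruby terms, the three flat databases might look something like this (the shapes are illustrative assumptions, not actual code):

```ruby
# 1. Watched directories: symlink root => real watched directory
watched = { "symlinked" => "real" }

# 2. Symlink database: real path => every symlinked spelling of it
symlinks = { "real/foo.rb" => ["symlinked/foo.rb"] }

# 3. Directory snapshots: dir => scan times plus per-entry metadata
snapshots = {
  "real" => {
    scan_started: 1000,
    scan_ended:   1001,
    entries: { "foo.rb" => { mode: 0o644, size: 120, mtime: 999 } }
  }
}
```

Each structure is a flat key-value mapping, which is exactly what makes a SQLite backing (one table per database) a drop-in alternative.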

And these could actually live in SQLite anyway. Tree structures are too mind-boggling to work with; they'd have to make heavy use of classes to stay maintainable.


driskell commented May 1, 2016

Oh I see. That sounds great.

Regarding the third database, the flat list of directory snapshots - isn't that what Record already is? And Record is currently a really simple tree structure. Maybe it would be useful to make Record an interface; then you could have a memory-based tree if you wanted and didn't care about offline changes, and also a SQLite option (or even whatever your application uses as a backend), all under one interface. I'm not really convinced the tree structure for Record is mind-boggling, though, or in need of replacing. It's a DirectoryPath->State mapping and that's it. And any SQLite implementation would be the same, surely? Unless you're referring to some other parts?

In my fork of the rework I stripped symlink support and made Directory support two scan modes, recursive and shallow. I use it to do incremental syncs to Vagrant machines for developers and it's blazing fast. I've now got a need to add symlink support back in, as we are using symlinks more and more with "composer" libraries. I don't mind drawing that work out to help start some improvements, as I know what you mean about difficult to mock/test - I've somewhat neglected TDD. And it would be great to see it made acceptable for wider consumption instead of hidden away :)


e2 commented May 1, 2016

Regarding the third database, flat list of directory snapshots, isn't that what Record already is?

No, it's still a tree, see this TODO: https://github.com/guard/listen/blob/e21066c/lib/listen/record.rb#L7

Also, while the tests are much better than they used to be, they're still hard to maintain: https://github.com/guard/listen/blob/e21066c/spec/lib/listen/record_spec.rb#L285

It's best to switch to objects representing snapshots rather than a "build-as-you-go" structure.

The idea here is to also avoid any filesystem operations outside scanning. Ideally only the backend would touch the filesystem, report an updated directory snapshot, and then the filesystem is never accessed until the next event.

The biggest problem with understanding, maintenance and testing right now is that the filesystem is accessed multiple times between the event and the callback. This makes most of the codebase non-deterministic.

All under one interface. I'm not really convinced the tree structure for Record is mind-boggling though or in need of replacing.

I worked hard to keep it simple, fast, properly covered and memory efficient. It still makes too many "hidden" assumptions, and edge cases are almost impossible to map out in reasonable time. E.g. the HFS filesystem uses 1-second mtime resolution (like FAT). This currently has to be "kept in mind" when changing anything, and it's hard to simulate in tests, too.

The Record simply has too many responsibilities for a single structure.

It's DirectoryPath->State mapping and that's it.

One word: symlinks. It's better to think in terms of inode numbers. E.g. you have a directory name (a possible symlink) that resolves to an inode number. And the inode is basically what you're watching, not the "directory name". So even if a directory is renamed, you're still watching the same inode.

So a better "structure" would be: { dir_inode_number => DirInfo } and DirInfo would contain DirEntries, e.g. { entry_inode_number => EntryInfo }. (e.g. this may mean 2 sqlite tables).

If you rename a directory, the directory itself doesn't change; what changes is that directory's entry in its parent. (This makes detecting moves/renames more robust, while avoiding unnecessary dir scanning and file stat/md5.)
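A toy illustration of that rename behavior (hypothetical inode-style ids and shapes):

```ruby
# Directories keyed by inode-like ids; names live only in the parent's entries.
dirs = {
  100 => { entries: { "project" => 200 } }, # parent dir, id 100
  200 => { entries: { "foo.rb" => 300 } }   # watched dir, id 200
}

# Renaming "project" to "app" touches only the parent's entry...
dirs[100][:entries]["app"] = dirs[100][:entries].delete("project")

# ...while id 200's contents are untouched, so no rescan or re-stat is needed.
dirs[200][:entries] # => {"foo.rb" => 300}
```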

It seems more complex, but you're dealing with much simpler classes/objects: much easier to test, fewer edge cases to support, and it's all deterministic and uninfluenced by timing. Not to mention it's much easier to find code bottlenecks and e.g. split the work into threads (for faster response times).

Again, this would allow reusing components for things like filesystem syncing - so creating a Ruby implementation of rsync would become trivial.

I've somewhat neglected TDD

I'd say that's the result of the current codebase, so I don't hold that against you ;)

Also, the "snapshot" feature would make it easier to create use cases. E.g. someone on OSX could just dump the structures - and I could reproduce an OSX-specific timing-related bug in pure code.

This would also allow a "record and replay" option. Kind of like Wireshark supports. And that would also speed up integration tests. (And they'd be backend-independent anyway).

You'd no longer need to test any backend, either - except to the extent that they all report events in a unified way. (Basically - just returning current snapshots of changed directories + optimizer hints).

And if those directory snapshots could be sent over TCP, you'd end up with a very low latency solution. (Sure, the packets would be large, but all the processing and optimizing could be done on the remote side). So you could implement a high-performing "reverse-shared-folder" on VMs, etc.

So tons of issues could be solved and new options could become available.


driskell commented May 4, 2016

One word: symlinks. It's better to think in terms of inode numbers. E.g. you have a directory name (a possible symlink) that resolves to an inode number. And the inode is basically what you're watching, not the "directory name". So even if a directory is renamed, you're still watching the same inode.

So a better "structure" would be: { dir_inode_number => DirInfo } and DirInfo would contain DirEntries, e.g. { entry_inode_number => EntryInfo }. (e.g. this may mean 2 sqlite tables).

If you rename a directory, the directory doesn't change. What changes is that directory's entry in their parent. (This makes detecting moves/renames more robust, while avoiding unnecessary dir scanning and file stat/md5).

Symlinks, underneath, are just files containing a text path to the target. If the target is renamed, the symlink becomes broken. Inode numbers are only really involved with hard links, which complicates matters further, and to be honest I'm unsure how the various file change notification systems in the various OSes react to hard links.

Also, when you look at issues with Filebeat by Elastic, it becomes clear that with some log rotations (and likely in other circumstances too) inodes, especially for files, get reused very quickly. So unless the file change notifications contain them, it's likely safer to keep md5 signature scanning - especially if the desire is to convert file change notifications into directory snapshots, as that might make keeping inode information difficult.

Using file paths instead of inodes is better too, I think, as a dump of a snapshot would be more readable. And it's good enough for open() and CreateFile() etc.

You also get into more cross-platform wilderness, as Windows uses a device serial number and two index numbers per file, and depending on the Linux variant you have either unsigned or signed values, etc. Not sure how Ruby would expose any of it either.

This all sounds pretty awesome though and well thought out!

Generating directory snapshots first and then comparing them to previous ones is a great idea, and I can see it would make testing far, far easier. Depending on the Record layout it might even be faster, as each call is a bulk one for the snapshot rather than a trickle of calls.


e2 commented May 4, 2016

Yes. I'd ignore hard links at first - I don't think they really matter. As for inode numbers, I mostly meant that as a concept, not exactly using inodes. More like a column index in a database. (Good point about them being reused.)

@ioquatix ioquatix self-assigned this Jun 17, 2021