Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: Change diff crate #102

Closed

Conversation

vincentdephily
Copy link

This explores a fix for #94.

There are two replacement crates in this PR, but it only makes sense to merge one. Neither crate can be used as an exact replacement of the unmaintained crate, because the diffing algorithms don't match : the calculated "distance" is different and there is no way to configure a "split character".

Nevertheless, I used the two crates to showcase two possibilities:

  • dissimilar retains the same API; programs will compile with only a function name change, but if they used to configure a split or a distance, the results may be different.
  • similar ignores API stability, and exposes more features (even more could be exposed if we wanted)

Both crates have their own feature flag, and I resisted bikeshedding the name of the constructors.

IMHO the second one is more tempting, except if we want to avoid tying ourselves into a particular diffing crate and want to to the most common denominator features.

We could make a release that deprecates the difference feature and adds the similar feature, or we could make an immediate replacement (simplifying the bikeshedding and cosing the rustsec advisory faster).

TODO:

  • Choose a path (this is my first PR here, so I won't dare)
  • Update changelog and docs
  • Change default feature
  • Remove a diff crate or two
  • Any other similar feature we want to expose ?

This is a PoC to replace the unmaintained `difference` crate. It's feature-gated and lives alongside
the original feature for easy comparison.

* As expected we had to give up on the `split` functionality, but it might not be needed at all
  because this algorithm should auto-detect relevant chunks, and give better outputs by default.
* The `distance` metric is probably subtly different, this needs proper investigation.
* `dissimilar` returns `&str` which we need to convert to `String` when using reflection, not sure
  if there a non-API-breaking way around that. The `difference` crate always allocated strings, so
  we're not losing anything.
Since all we need to prodive is a Display'able type, convert to String directly. Moved code into
`impl DissimilarPredicate` as one could imagine different display options.
This is just a convenience method to use in place of products().

Incidentally, there doesn't seem to be any case of duplicated product name, so the the API seems
needlessly cumbersome. We could probably expose a `map<String, Product>` instead, which would be
more practical.
Another feature-gated PoC to replace the `difference` crate.

This commit exposes the same interface as the `dissimilar` crate but using the more feature-packed
`similar` crate. The "difference" metric is a bit different due to `similar` supporting a `Replaced`
op in addition to `Same/Add/Remove`.

In the next commit I'm going to expose more `similar` features, such as algorithm selection and diff
ratio.

The doc-based unittests feel a bit spartan for the purpose of comparring diff crates, might need to
expand them a bit or move to standalone tests.
…ethod

Just a stub to not break the API, though it does break the behaviour.
Algorithm is pretty straightforward, but adding ratio() while keeping distance() involved some
refactorings and API experiments.
@vincentdephily
Copy link
Author

Looks like easy CI fixes, I'll do that during the week.

Interestingly, looks like pretty-assertions selected diff as a replacement after looking at similar. For some reason I hadn't looked at diff, not sure if it's any better for our usecase.

Copy link
Contributor

@epage epage left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

similar seems more complicated than we need (timeouts and so on).

The main question is what would be the impact of a breaking change and if we can get away with it.

@asomers in my quick glance, you seem like one of the few crate authors that re-export predicates. Any thoughts on this?

Comment on lines +79 to +80
dissimilar::Chunk::Delete(ref s) => write!(f, "\x1b[92m{}\x1b[0m", s),
dissimilar::Chunk::Insert(ref s) => write!(f, "\x1b[91m{}\x1b[0m", s),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unsure if it needs to be in this PR or not but I'd like to switch to a crate for handling color

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That crossed my mind too, but this doesn't seem to be the right PR for this. Also, if we do that we'll want to provide a no-color operation, and it's not clear how to do that with diffs that aren't line-based.

Comment on lines +45 to +46
/// No longer has any effect. The diff algorithm should find the best split automatically.
#[deprecated(since="1.1.0" note="No longer needed, diff algorithm should find the best split")]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm curious what led to the decision to make this a no-op

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added that as a way to keep src compatibility, but don't like it TBH. The problem is that neither dissimilar nor similar provide an equivalent feature, because they use a more modern diff algorithm which find boundaries automatically. On the other hand, similar offers 3 different algorithms to choose from.

The diff crate (not in this PR) has APIs to diff by "chars", "line", or "slice", which might fit 90% of our user's need for .split().

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem is that neither dissimilar nor similar provide an equivalent feature, because they use a more modern diff algorithm which find boundaries automatically.

similar does have this, but the switch is at a function level, rather than an argument level

  • from_lines
  • from_words / from_unicode_words
  • from_chars / from_graphemes

We could represent these as functions publicly and store it as an enum internally for when we do call the functions.

I think every diff algorithm require the implementer to determine what the atoms are; its just some algorithms hard code the atoms.

don't like it TBH

Setting a value like this depends on what level a user expects differences to be made. At the word or character level, you get fine tuned output but it can be noisy. At the line level, you get less noise.

I believe I've actually used this part of the API but haven't been able to confirm because the function name is too generic; I'm not finding good search terms to find uses of it among my crates :/

Comment on lines +125 to +128
pub fn algorithm(mut self, alg: similar::Algorithm) -> Self {
self.algorithm = alg;
self
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

imo algorithm should be cosmetic and we probably don't need it

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Each algorithm finds different chunks, which influences the distance and ratio measurements. Since we were already offering some fine-tuning via .split(), it seemed right to offer the choice of algorithm.

IMHO it would be ok to declare both .algorithm() and .split() as overkill for our usecase. diff's char/lne/auto settings seem like a good middleground.

let ratio = similar::get_diff_ratio(&chunks, self.old.len(), new.len());
(self.op.eval(ratio >= r), chunks, Ratio(ratio))
}
Changes(d) => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we make a breaking change, I'd be tempted to remove changes.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm tempted too, don't want to bloat the API. On the other hand:

  • number of changes seems like a more useful metric than percentage of change
  • I could see myself wanting a "additions are ok but removals are not" kind of logic sometimes

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The challenge with changes / ratio is you don't know what was diverged, if it mattered or not. Instead, we should be encouraging new line normalization and other methods for bringing the two sides in alignment.

I'm somewhat tempted to pull the API just so someone complains so we have a clear justification for it, rather than speculation.

let chunks =
similar::capture_diff_slices(self.algorithm, self.old.as_bytes(), new.as_bytes());
match self.limit {
Ratio(r) => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any thoughts on usefulness of ratio? Particularly, since I'm unsure of changes being in

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hum, if we have neither ratio nor changes then this predicate becomes the same as a basic string equality predicate except with a nice diff output ? If that's all we want, we should merge the diff functionality into the basic equality predicate. It can still be feature-gated, just compute and format the diff only if strings are not equal.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, still need to discuss split :)

There is a part of me that would want them distinct because I feel like they are different use cases. A one word string probably wouldn't want to use similar, the output would just get too noisy. Granted, I can see having a helper to turn a string predicate into a diff predicate.

@asomers
Copy link
Contributor

asomers commented May 10, 2021

Which predicates are changing exactly? Is it just similar or are there others too? I would say that you should do what you have to, and I'll note it in Mockall's CHANGELOG .

@epage
Copy link
Contributor

epage commented May 10, 2021

Which predicates are changing exactly? Is it just similar or are there others too? I would say that you should do what you have to, and I'll note it in Mockall's CHANGELOG .

We might be changing the [settings for the DifferencePredicate](https://docs.rs/predicates/1.0.8/predicates/str/struct.DifferencePredicate.html#method.split) which is created via predicates::str::similar. This would require a major version bump for predicates(but notpredicates-corewhere the Traits live) and any crate that re-exportspredicates` would need to do similar.

If we do a breaking change, my question is if we need either of these functions (choose the split, allow edit distance besides 0)

@vincentdephily
Copy link
Author

vincentdephily commented May 10, 2021

We might be changing the settings for the DifferencePredicate which is created via predicates::str::similar.

Also, even if we keep the .distance() API intact, the measured distance value will no longer be the same. So if a user has a test with a fine-tuned distance value, it's likely no longer correct and implies a major version bump too.

@epage
Copy link
Contributor

epage commented May 29, 2021

Thanks again for all of this research. I've pulled it together in #107. I ended up using a different crate just because another of my crates was using it. We can always swap out and add features as there is demand.

@epage epage closed this May 29, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants