WIP: Change diff crate #102

vincentdephily · 2021-05-09T20:32:33Z

This explores a fix for #94.

There are two replacement crates in this PR, but it only makes sense to merge one. Neither crate can be used as an exact replacement of the unmaintained crate, because the diffing algorithms don't match : the calculated "distance" is different and there is no way to configure a "split character".

Nevertheless, I used the two crates to showcase two possibilities:

dissimilar retains the same API; programs will compile with only a function name change, but if they used to configure a split or a distance, the results may be different.
similar ignores API stability, and exposes more features (even more could be exposed if we wanted)

Both crates have their own feature flag, and I resisted bikeshedding the name of the constructors.

IMHO the second one is more tempting, except if we want to avoid tying ourselves into a particular diffing crate and want to to the most common denominator features.

We could make a release that deprecates the difference feature and adds the similar feature, or we could make an immediate replacement (simplifying the bikeshedding and cosing the rustsec advisory faster).

TODO:

Choose a path (this is my first PR here, so I won't dare)
Update changelog and docs
Change default feature
Remove a diff crate or two
Any other similar feature we want to expose ?

This is a PoC to replace the unmaintained `difference` crate. It's feature-gated and lives alongside the original feature for easy comparison. * As expected we had to give up on the `split` functionality, but it might not be needed at all because this algorithm should auto-detect relevant chunks, and give better outputs by default. * The `distance` metric is probably subtly different, this needs proper investigation. * `dissimilar` returns `&str` which we need to convert to `String` when using reflection, not sure if there a non-API-breaking way around that. The `difference` crate always allocated strings, so we're not losing anything.

Since all we need to prodive is a Display'able type, convert to String directly. Moved code into `impl DissimilarPredicate` as one could imagine different display options.

This is just a convenience method to use in place of products(). Incidentally, there doesn't seem to be any case of duplicated product name, so the the API seems needlessly cumbersome. We could probably expose a `map<String, Product>` instead, which would be more practical.

Another feature-gated PoC to replace the `difference` crate. This commit exposes the same interface as the `dissimilar` crate but using the more feature-packed `similar` crate. The "difference" metric is a bit different due to `similar` supporting a `Replaced` op in addition to `Same/Add/Remove`. In the next commit I'm going to expose more `similar` features, such as algorithm selection and diff ratio. The doc-based unittests feel a bit spartan for the purpose of comparring diff crates, might need to expand them a bit or move to standalone tests.

…ethod Just a stub to not break the API, though it does break the behaviour.

Algorithm is pretty straightforward, but adding ratio() while keeping distance() involved some refactorings and API experiments.

vincentdephily · 2021-05-09T20:49:01Z

Looks like easy CI fixes, I'll do that during the week.

Interestingly, looks like pretty-assertions selected diff as a replacement after looking at similar. For some reason I hadn't looked at diff, not sure if it's any better for our usecase.

epage

similar seems more complicated than we need (timeouts and so on).

The main question is what would be the impact of a breaking change and if we can get away with it.

@asomers in my quick glance, you seem like one of the few crate authors that re-export predicates. Any thoughts on this?

epage · 2021-05-10T20:31:10Z

src/str/dissimilar.rs

+                dissimilar::Chunk::Delete(ref s) => write!(f, "\x1b[92m{}\x1b[0m", s),
+                dissimilar::Chunk::Insert(ref s) => write!(f, "\x1b[91m{}\x1b[0m", s),


Unsure if it needs to be in this PR or not but I'd like to switch to a crate for handling color

That crossed my mind too, but this doesn't seem to be the right PR for this. Also, if we do that we'll want to provide a no-color operation, and it's not clear how to do that with diffs that aren't line-based.

epage · 2021-05-10T20:31:49Z

src/str/dissimilar.rs

+    /// No longer has any effect. The diff algorithm should find the best split automatically.
+    #[deprecated(since="1.1.0" note="No longer needed, diff algorithm should find the best split")]


I'm curious what led to the decision to make this a no-op

I added that as a way to keep src compatibility, but don't like it TBH. The problem is that neither dissimilar nor similar provide an equivalent feature, because they use a more modern diff algorithm which find boundaries automatically. On the other hand, similar offers 3 different algorithms to choose from.

The diff crate (not in this PR) has APIs to diff by "chars", "line", or "slice", which might fit 90% of our user's need for .split().

The problem is that neither dissimilar nor similar provide an equivalent feature, because they use a more modern diff algorithm which find boundaries automatically.

similar does have this, but the switch is at a function level, rather than an argument level

from_lines

from_words / from_unicode_words

from_chars / from_graphemes

We could represent these as functions publicly and store it as an enum internally for when we do call the functions.

I think every diff algorithm require the implementer to determine what the atoms are; its just some algorithms hard code the atoms.

don't like it TBH

Setting a value like this depends on what level a user expects differences to be made. At the word or character level, you get fine tuned output but it can be noisy. At the line level, you get less noise.

I believe I've actually used this part of the API but haven't been able to confirm because the function name is too generic; I'm not finding good search terms to find uses of it among my crates :/

epage · 2021-05-10T20:32:40Z

src/str/similar.rs

+    pub fn algorithm(mut self, alg: similar::Algorithm) -> Self {
+        self.algorithm = alg;
+        self
+    }


imo algorithm should be cosmetic and we probably don't need it

Each algorithm finds different chunks, which influences the distance and ratio measurements. Since we were already offering some fine-tuning via .split(), it seemed right to offer the choice of algorithm.

IMHO it would be ok to declare both .algorithm() and .split() as overkill for our usecase. diff's char/lne/auto settings seem like a good middleground.

epage · 2021-05-10T20:34:34Z

src/str/similar.rs

+                let ratio = similar::get_diff_ratio(&chunks, self.old.len(), new.len());
+                (self.op.eval(ratio >= r), chunks, Ratio(ratio))
+            }
+            Changes(d) => {


If we make a breaking change, I'd be tempted to remove changes.

I'm tempted too, don't want to bloat the API. On the other hand:

number of changes seems like a more useful metric than percentage of change

I could see myself wanting a "additions are ok but removals are not" kind of logic sometimes

The challenge with changes / ratio is you don't know what was diverged, if it mattered or not. Instead, we should be encouraging new line normalization and other methods for bringing the two sides in alignment.

I'm somewhat tempted to pull the API just so someone complains so we have a clear justification for it, rather than speculation.

epage · 2021-05-10T20:34:39Z

src/str/similar.rs

+        let chunks =
+            similar::capture_diff_slices(self.algorithm, self.old.as_bytes(), new.as_bytes());
+        match self.limit {
+            Ratio(r) => {


Any thoughts on usefulness of ratio? Particularly, since I'm unsure of changes being in

Hum, if we have neither ratio nor changes then this predicate becomes the same as a basic string equality predicate except with a nice diff output ? If that's all we want, we should merge the diff functionality into the basic equality predicate. It can still be feature-gated, just compute and format the diff only if strings are not equal.

Well, still need to discuss split :)

There is a part of me that would want them distinct because I feel like they are different use cases. A one word string probably wouldn't want to use similar, the output would just get too noisy. Granted, I can see having a helper to turn a string predicate into a diff predicate.

asomers · 2021-05-10T20:50:29Z

Which predicates are changing exactly? Is it just similar or are there others too? I would say that you should do what you have to, and I'll note it in Mockall's CHANGELOG .

epage · 2021-05-10T21:11:40Z

Which predicates are changing exactly? Is it just similar or are there others too? I would say that you should do what you have to, and I'll note it in Mockall's CHANGELOG .

We might be changing the [settings for the DifferencePredicate](https://docs.rs/predicates/1.0.8/predicates/str/struct.DifferencePredicate.html#method.split) which is created via predicates::str::similar. This would require a major version bump for predicates(but notpredicates-corewhere the Traits live) and any crate that re-exportspredicates` would need to do similar.

If we do a breaking change, my question is if we need either of these functions (choose the split, allow edit distance besides 0)

vincentdephily · 2021-05-10T22:07:49Z

We might be changing the settings for the DifferencePredicate which is created via predicates::str::similar.

Also, even if we keep the .distance() API intact, the measured distance value will no longer be the same. So if a user has a test with a fine-tuned distance value, it's likely no longer correct and implies a major version bump too.

epage · 2021-05-29T17:46:42Z

Thanks again for all of this research. I've pulled it together in #107. I ended up using a different crate just because another of my crates was using it. We can always swap out and add features as there is demand.

vincentdephily added 9 commits April 7, 2021 12:46

refactor: Don't reallocate chunks in DissimilarPredicate.find_case

b36ad32

Since all we need to prodive is a Display'able type, convert to String directly. Moved code into `impl DissimilarPredicate` as one could imagine different display options.

test: fix dissimilar doctests

87b232b

feat: Add back DissimilarPredicate.split() backward-compatibility m…

51addcb

…ethod Just a stub to not break the API, though it does break the behaviour.

fix: Proper diff format in SimilarPredicate

83af711

feat: Add ratio() and algorithm() methods to SimilarPredicate

8cc3b11

Algorithm is pretty straightforward, but adding ratio() while keeping distance() involved some refactorings and API experiments.

chore: update copyright year

118b8c4

epage requested changes May 10, 2021

View reviewed changes

epage mentioned this pull request May 10, 2021

How do you feel abut a predicates breaking change? asomers/mockall#280

Closed

epage mentioned this pull request May 12, 2021

Predicates for string line count #103

Closed

epage closed this May 29, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

WIP: Change diff crate #102

WIP: Change diff crate #102

vincentdephily commented May 9, 2021

vincentdephily commented May 9, 2021

epage left a comment

epage May 10, 2021

vincentdephily May 10, 2021

epage May 10, 2021

vincentdephily May 10, 2021 •

edited

epage May 11, 2021

epage May 10, 2021

vincentdephily May 10, 2021

epage May 10, 2021

vincentdephily May 10, 2021

epage May 11, 2021

epage May 10, 2021

vincentdephily May 10, 2021

epage May 11, 2021

asomers commented May 10, 2021

epage commented May 10, 2021

vincentdephily commented May 10, 2021 •

edited

epage commented May 29, 2021

		dissimilar::Chunk::Delete(ref s) => write!(f, "\x1b[92m{}\x1b[0m", s),
		dissimilar::Chunk::Insert(ref s) => write!(f, "\x1b[91m{}\x1b[0m", s),

		/// No longer has any effect. The diff algorithm should find the best split automatically.
		#[deprecated(since="1.1.0" note="No longer needed, diff algorithm should find the best split")]

WIP: Change diff crate #102

WIP: Change diff crate #102

Conversation

vincentdephily commented May 9, 2021

vincentdephily commented May 9, 2021

epage left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

vincentdephily May 10, 2021 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

asomers commented May 10, 2021

epage commented May 10, 2021

vincentdephily commented May 10, 2021 • edited

epage commented May 29, 2021

vincentdephily May 10, 2021 •

edited

vincentdephily commented May 10, 2021 •

edited