New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
WIP: Change diff crate #102
Conversation
This is a PoC to replace the unmaintained `difference` crate. It's feature-gated and lives alongside the original feature for easy comparison. * As expected we had to give up on the `split` functionality, but it might not be needed at all because this algorithm should auto-detect relevant chunks, and give better outputs by default. * The `distance` metric is probably subtly different, this needs proper investigation. * `dissimilar` returns `&str` which we need to convert to `String` when using reflection, not sure if there a non-API-breaking way around that. The `difference` crate always allocated strings, so we're not losing anything.
Since all we need to prodive is a Display'able type, convert to String directly. Moved code into `impl DissimilarPredicate` as one could imagine different display options.
This is just a convenience method to use in place of products(). Incidentally, there doesn't seem to be any case of duplicated product name, so the the API seems needlessly cumbersome. We could probably expose a `map<String, Product>` instead, which would be more practical.
Another feature-gated PoC to replace the `difference` crate. This commit exposes the same interface as the `dissimilar` crate but using the more feature-packed `similar` crate. The "difference" metric is a bit different due to `similar` supporting a `Replaced` op in addition to `Same/Add/Remove`. In the next commit I'm going to expose more `similar` features, such as algorithm selection and diff ratio. The doc-based unittests feel a bit spartan for the purpose of comparring diff crates, might need to expand them a bit or move to standalone tests.
…ethod Just a stub to not break the API, though it does break the behaviour.
Algorithm is pretty straightforward, but adding ratio() while keeping distance() involved some refactorings and API experiments.
Looks like easy CI fixes, I'll do that during the week. Interestingly, looks like pretty-assertions selected diff as a replacement after looking at |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
similar
seems more complicated than we need (timeouts and so on).
The main question is what would be the impact of a breaking change and if we can get away with it.
@asomers in my quick glance, you seem like one of the few crate authors that re-export predicates. Any thoughts on this?
dissimilar::Chunk::Delete(ref s) => write!(f, "\x1b[92m{}\x1b[0m", s), | ||
dissimilar::Chunk::Insert(ref s) => write!(f, "\x1b[91m{}\x1b[0m", s), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unsure if it needs to be in this PR or not but I'd like to switch to a crate for handling color
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That crossed my mind too, but this doesn't seem to be the right PR for this. Also, if we do that we'll want to provide a no-color operation, and it's not clear how to do that with diffs that aren't line-based.
/// No longer has any effect. The diff algorithm should find the best split automatically. | ||
#[deprecated(since="1.1.0" note="No longer needed, diff algorithm should find the best split")] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm curious what led to the decision to make this a no-op
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added that as a way to keep src compatibility, but don't like it TBH. The problem is that neither dissimilar
nor similar
provide an equivalent feature, because they use a more modern diff algorithm which find boundaries automatically. On the other hand, similar
offers 3 different algorithms to choose from.
The diff
crate (not in this PR) has APIs to diff by "chars", "line", or "slice", which might fit 90% of our user's need for .split()
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem is that neither dissimilar nor similar provide an equivalent feature, because they use a more modern diff algorithm which find boundaries automatically.
similar
does have this, but the switch is at a function level, rather than an argument level
- from_lines
- from_words / from_unicode_words
- from_chars / from_graphemes
We could represent these as functions publicly and store it as an enum
internally for when we do call the functions.
I think every diff algorithm require the implementer to determine what the atoms are; its just some algorithms hard code the atoms.
don't like it TBH
Setting a value like this depends on what level a user expects differences to be made. At the word or character level, you get fine tuned output but it can be noisy. At the line level, you get less noise.
I believe I've actually used this part of the API but haven't been able to confirm because the function name is too generic; I'm not finding good search terms to find uses of it among my crates :/
pub fn algorithm(mut self, alg: similar::Algorithm) -> Self { | ||
self.algorithm = alg; | ||
self | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
imo algorithm should be cosmetic and we probably don't need it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Each algorithm finds different chunks, which influences the distance and ratio measurements. Since we were already offering some fine-tuning via .split()
, it seemed right to offer the choice of algorithm.
IMHO it would be ok to declare both .algorithm()
and .split()
as overkill for our usecase. diff
's char/lne/auto settings seem like a good middleground.
let ratio = similar::get_diff_ratio(&chunks, self.old.len(), new.len()); | ||
(self.op.eval(ratio >= r), chunks, Ratio(ratio)) | ||
} | ||
Changes(d) => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we make a breaking change, I'd be tempted to remove changes
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm tempted too, don't want to bloat the API. On the other hand:
- number of changes seems like a more useful metric than percentage of change
- I could see myself wanting a "additions are ok but removals are not" kind of logic sometimes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The challenge with changes
/ ratio
is you don't know what was diverged, if it mattered or not. Instead, we should be encouraging new line normalization and other methods for bringing the two sides in alignment.
I'm somewhat tempted to pull the API just so someone complains so we have a clear justification for it, rather than speculation.
let chunks = | ||
similar::capture_diff_slices(self.algorithm, self.old.as_bytes(), new.as_bytes()); | ||
match self.limit { | ||
Ratio(r) => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any thoughts on usefulness of ratio
? Particularly, since I'm unsure of changes
being in
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hum, if we have neither ratio
nor changes
then this predicate becomes the same as a basic string equality predicate except with a nice diff output ? If that's all we want, we should merge the diff functionality into the basic equality predicate. It can still be feature-gated, just compute and format the diff only if strings are not equal.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, still need to discuss split
:)
There is a part of me that would want them distinct because I feel like they are different use cases. A one word string probably wouldn't want to use similar
, the output would just get too noisy. Granted, I can see having a helper to turn a string predicate into a diff predicate.
Which predicates are changing exactly? Is it just |
We might be changing the [settings for the If we do a breaking change, my question is if we need either of these functions (choose the split, allow edit distance besides 0) |
Also, even if we keep the .distance() API intact, the measured distance value will no longer be the same. So if a user has a test with a fine-tuned distance value, it's likely no longer correct and implies a major version bump too. |
Thanks again for all of this research. I've pulled it together in #107. I ended up using a different crate just because another of my crates was using it. We can always swap out and add features as there is demand. |
This explores a fix for #94.
There are two replacement crates in this PR, but it only makes sense to merge one. Neither crate can be used as an exact replacement of the unmaintained crate, because the diffing algorithms don't match : the calculated "distance" is different and there is no way to configure a "split character".
Nevertheless, I used the two crates to showcase two possibilities:
dissimilar
retains the same API; programs will compile with only a function name change, but if they used to configure a split or a distance, the results may be different.similar
ignores API stability, and exposes more features (even more could be exposed if we wanted)Both crates have their own feature flag, and I resisted bikeshedding the name of the constructors.
IMHO the second one is more tempting, except if we want to avoid tying ourselves into a particular diffing crate and want to to the most common denominator features.
We could make a release that deprecates the
difference
feature and adds thesimilar
feature, or we could make an immediate replacement (simplifying the bikeshedding and cosing the rustsec advisory faster).TODO:
similar
feature we want to expose ?