
Suboptimal (O(D^2)) memory usage #396

Open · quark-zju opened this issue Mar 3, 2023 · 10 comments

quark-zju commented Mar 3, 2023

It seems https://github.com/kpdecker/jsdiff/blob/3b654c2ed7d5262ed9946de841ad8dae990286c7/src/diff/base.js does not implement "4b. A Linear Space Refinement" from Myers' paper, so the memory usage is O(D^2). From the code, the complexity looks like O(len(bestPath) * len(components in bestPath)).
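For illustration, here is a sketch of the allocation pattern (paraphrasing the shape of the code, not the literal source):

    // Each candidate path carries its own copy of the component list.
    function clonePath(path) {
      return { newPos: path.newPos, components: path.components.slice() };
    }
    // With up to 2D + 1 live candidate paths, each holding up to D
    // components, peak memory is O(D^2), and every edit-distance step
    // allocates a fresh batch of arrays for the GC to collect.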

This makes jsdiff practically too slow to complete a diff calculation when there are many changes. For example, comparing the prettified Hearthstone cards.json in different languages does not complete in 20 minutes using jsdiff:

Reproduce:

# Prepare test data
wget https://api.hearthstonejson.com/v1/168129/enUS/cards.json -O 1
wget https://api.hearthstonejson.com/v1/168129/zhCN/cards.json -O 2
python3 -m json.tool < 1 > a
python3 -m json.tool < 2 > b
// a.js - Calculate diff
const fs = require('fs')
const diff = require('diff');

const a = fs.readFileSync('a', {encoding: 'utf-8'});
const b = fs.readFileSync('b', {encoding: 'utf-8'});

console.log(`pid: ${process.pid}`);
console.log(diff.diffLines(a, b).length);

The profiler shows that Node.js spent ~70% of the time on GC:

[profiler screenshot: ~70% of time spent in GC]

As a comparison, it seems diff-match-patch implements the linear space refinement and can complete the above test case in ~24s. (For reference, git diff --stat --minimal completes in ~0.3s.)

facebook-github-bot pushed a commit to facebook/sapling that referenced this issue Mar 7, 2023
Summary:
Before this change, there were three Myers diff implementations in the dependency tree:
- diff-match-patch (1.0.5)
- diff (4.0.1)
- diff-sequences (via jest -> jest-diff -> diff-sequences)

We'd like to simplify the dependency tree. The short answer is:
- Use `diff-sequences`, or `jest-diff`, which uses `diff-sequences` internally.

For best performance, do:
- Strip common prefix and suffix.
- Make line comparison O(1): avoid `line1 === line2`, which can be O(line
  length); intern lines to integer ids instead (see the sketch below).
- Consider skipping "cleanup" in `jest-diff` for long input.
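
A minimal sketch of the line-interning idea (hypothetical helper; `internLines` is not part of any of these libraries):

    function internLines(text, table) {
      // Map each distinct line to a small integer id; comparing ids is O(1)
      // regardless of line length, and there is no 65k unique-line limit.
      return text.split('\n').map((line) => {
        if (!table.has(line)) table.set(line, table.size);
        return table.get(line);
      });
    }

    // Usage: intern both sides with a shared table, then diff the id arrays.
    const table = new Map();
    const aIds = internLines('x\ny\n', table);
    const bIds = internLines('x\nz\n', table);
    // isCommon(ai, bi) becomes aIds[ai] === bIds[bi] — constant time.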

----

Long answer of picking a diff library:

I wrote a benchmark script to get some idea about their performance:

    const fs = require('fs')
    const dmp = new (require('diff-match-patch').diff_match_patch)();
    const diff = require('diff');
    const ds = require('diff-sequences').default;
    const jd = require('jest-diff');

    dmp.Diff_Timeout = 120;

    // Diff functions. Output format: Chunk[]
    // Chunk is one of:
    // [0, n]: n common lines (same on both side)
    // [-1, n]: n left-side-only lines
    // [1, n]: n right-side-only lines

    function diff1(chars1, chars2) {
      return dmp.diff_main(chars1, chars2).map(v => [v[0], v[1].length]);
    }

    function diff1a(chars1, chars2) {
      return dmp.diff_main(chars1, chars2, false).map(v => [v[0], v[1].length]);
    }

    function diff2(chars1, chars2) {
      return diff.diffChars(chars1, chars2).map(v => {
        const d = v.added ? 1 : (v.removed ? -1 : 0);
        return [d, v.count];
      });
    }

    function diff3(chars1, chars2) {
      function isCommon(ai, bi) {
        return chars1[ai] == chars2[bi];
      }
      const r = [];
      let lastA = 0, lastB = 0;
      function foundSequence(n, na, nb) {
        if (na > lastA) {
          r.push([-1, na - lastA]);
          lastA = na;
        }
        if (nb > lastB) {
          r.push([1, nb - lastB]);
          lastB = nb;
        }
        if (n > 0) {
          r.push([0, n]);
          lastA += n;
          lastB += n;
        }
      }
      ds(chars1.length, chars2.length, isCommon, foundSequence);
      foundSequence(0, chars1.length, chars2.length);
      return r;
    }

    function diff3a(chars1, chars2) {
      return jd.diffStringsRaw(chars1, chars2, false).map((d) => [d[0], d[1].length]);
    }

    function diff3b(chars1, chars2) {
      return jd.diffStringsRaw(chars1, chars2, true).map((d) => [d[0], d[1].length]);
    }

    function bench(a, b) {
      const {chars1, chars2} = dmp.diff_linesToChars_(a, b);
      function stringify(obj) {
        if (obj.length > 20) {
          return `${obj.length} items`;
        } else {
          return JSON.stringify(obj);
        }
      }

      [
        ['diff-match-patch', diff1],
        ['diff-match-patch (checklines=false)', diff1a],
        ['diff-sequences', diff3],
        ['jest-diff (diff-sequences), no cleanup', diff3a],
        ['jest-diff (diff-sequences), with cleanup', diff3b],
        ['jsdiff', diff2],
      ].forEach(([name, diffFunc]) => {
        // node --expose_gc
        if (global.gc) {
          gc();
        }
        const label = `  ${name}`;
        console.time(label);
        console.log(' ', stringify(diffFunc(chars1, chars2)));
        console.timeEnd(label);
      });
    }

    let a, b;

    console.log('\nwith common prefix and suffix 1');
    a = 'aaaaaaa\n'.repeat(50000) + 'bbbb\n' + 'dddd\n'.repeat(50000);
    b = 'aaaaaaa\n'.repeat(50000) + 'cccc\n' + 'dddd\n'.repeat(50000);
    bench(a, b);

    console.log('\nwith common prefix and suffix 2');
    a = 'aaaaaaa\n'.repeat(50000) + 'bbbbbbb\n' + 'dddd\n'.repeat(50000);
    b = 'aaaaaaa\n'.repeat(50100) + 'cccc\n' + 'dddd\n'.repeat(49900);
    bench(a, b);

    console.log('\nwithout common prefix or suffix 1');
    a = 'c\n' + 'aaaaaaa\n'.repeat(50000) + 'dddd\n'.repeat(50000);
    b = 'aaaaaaa\n'.repeat(50000) + 'dddd\n'.repeat(50100) + 'z\n';
    bench(a, b);

    console.log('\nwithout common prefix or suffix 2');
    a = 'cccc\n' + 'aaaaaaa\n'.repeat(50000) + 'bbbbbbb\n' + 'dddd\n'.repeat(50000) + 'z\n';
    b = 'aaaaaaa\n'.repeat(50100) + 'cccc\n' + 'dddd\n'.repeat(49900) + 'z\ny\n';
    bench(a, b);

    // Hearthstone cards.json in different languages.
    // This is somewhat challenging since many lines are changed.
    // wget https://api.hearthstonejson.com/v1/168129/enUS/cards.json -O 1
    // wget https://api.hearthstonejson.com/v1/168129/zhCN/cards.json -O 2
    // python3 -m json.tool < 1 > 1.json
    // python3 -m json.tool < 2 > 2.json
    console.log('\ncards.json with different languages');
    a = fs.readFileSync('1.json', {encoding: 'utf-8'});
    b = fs.readFileSync('2.json', {encoding: 'utf-8'});
    bench(a, b);

The output looks like:

    with common prefix and suffix 1
      [[0,50000],[-1,1],[1,1],[0,50000]]
      diff-match-patch: 5.073ms
      [[0,50000],[-1,1],[1,1],[0,50000]]
      diff-match-patch (checklines=false): 0.481ms
      [[0,50000],[-1,1],[1,1],[0,50000]]
      diff-sequences: 7.589ms
      [[0,50000],[-1,1],[1,1],[0,50000]]
      jest-diff (diff-sequences), no cleanup: 10.915ms
      [[0,50000],[-1,1],[1,1],[0,50000]]
      jest-diff (diff-sequences), with cleanup: 10.588ms
      [[0,50000],[-1,1],[1,1],[0,50000]]
      jsdiff: 22.664ms

    with common prefix and suffix 2
      [[0,50000],[-1,101],[1,101],[0,49900]]
      diff-match-patch: 10.688ms
      [[0,50000],[-1,101],[1,101],[0,49900]]
      diff-match-patch (checklines=false): 2.619ms
      [[0,50000],[-1,101],[1,101],[0,49900]]
      diff-sequences: 12.687ms
      [[0,50000],[-1,101],[1,101],[0,49900]]
      jest-diff (diff-sequences), no cleanup: 11.055ms
      [[0,50000],[-1,101],[1,101],[0,49900]]
      jest-diff (diff-sequences), with cleanup: 4.356ms
      [[0,50000],[-1,1],[1,101],[0,49900],[-1,100]]
      jsdiff: 59.359ms

    without common prefix or suffix 1
      [[-1,1],[0,100000],[1,101]]
      diff-match-patch: 632.863ms
      [[-1,1],[0,100000],[1,101]]
      diff-match-patch (checklines=false): 607.796ms
      [[-1,1],[0,50000],[1,51],[0,50000],[1,50]]
      diff-sequences: 12.366ms
      [[-1,1],[0,50000],[1,51],[0,50000],[1,50]]
      jest-diff (diff-sequences), no cleanup: 11.096ms
      [[-1,1],[0,100000],[1,51],[1,50]]
      jest-diff (diff-sequences), with cleanup: 1.029s
      [[-1,1],[0,100000],[1,101]]
      jsdiff: 13.163ms

    without common prefix or suffix 2
      [[-1,1],[0,50000],[-1,101],[1,101],[0,49901],[1,1]]
      diff-match-patch: 2.773s
      [[-1,1],[0,50000],[-1,101],[1,101],[0,49901],[1,1]]
      diff-match-patch (checklines=false): 1.402s
      [[-1,1],[0,50000],[-1,101],[1,101],[0,49901],[1,1]]
      diff-sequences: 22.216ms
      [[-1,1],[0,50000],[-1,101],[1,101],[0,49901],[1,1]]
      jest-diff (diff-sequences), no cleanup: 20.546ms
      [[-1,1],[0,50000],[-1,101],[1,101],[0,49901],[1,1]]
      jest-diff (diff-sequences), with cleanup: 19.222ms
      [[-1,1],[0,50000],[-1,1],[1,101],[0,49900],[-1,100],[0,1],[1,1]]
      jsdiff: 33.82ms

    cards.json with different languages
      67781 items
      diff-match-patch: 1:04.122 (m:ss.mmm)
      57514 items
      diff-match-patch (checklines=false): 2:00.283 (m:ss.mmm)
      67781 items
      diff-sequences: 1:09.486 (m:ss.mmm)
      67781 items
      jest-diff (diff-sequences), no cleanup: 1:06.452 (m:ss.mmm)
      52937 items
      jest-diff (diff-sequences), with cleanup: 1:09.118 (m:ss.mmm)
      ...
      (jsdiff cannot complete this test case in 20+ minutes)

Observations:
- In the last test case, `jsdiff` does not implement the O(D^2) -> O(D) space
  optimization, so it is practically unusable (reported as kpdecker/jsdiff#396).
  `diff-match-patch` and `jest-diff` both implement the linear space
  optimization and have similar performance.
- `diff-match-patch` strips common prefix and suffix, which makes it faster
  than `jest-diff` in "common prefix and suffix" test cases.
- Both `diff-match-patch` and `jest-diff` can take a long time on "cleanup".
  See the "without common prefix or suffix 1" test case. We probably want
  to only enable cleanup for smaller input.
- `diff-match-patch` performs visibly worse on the "without common prefix
  or suffix 2" test case. From the code, it looks like `diff-match-patch` uses
  heuristics intended to speed things up that end up slowing it down here.
- Without cleanup, `jest-diff` might output `[1,51],[1,50]`, which can be
  "obviously" merged into `[1,101]`. A lightweight cleanup pass (see the
  sketch after this list) would handle that.
- Reading the code, `diff-match-patch` maps lines to char codes, so it cannot
  handle more than 65,535 unique lines.
  (https://github.com/google/diff-match-patch/blob/62f2e689f498f9c92dbc588c58750addec9b1654/javascript/diff_match_patch_uncompressed.js#L503)
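
A minimal sketch of such a lightweight cleanup pass (hypothetical code), using the `[sign, count]` chunk format from the benchmark script above:

    function mergeAdjacent(chunks) {
      const out = [];
      for (const [sign, n] of chunks) {
        const last = out[out.length - 1];
        if (last && last[0] === sign) {
          last[1] += n; // same sign as the previous chunk: extend it in place
        } else {
          out.push([sign, n]);
        }
      }
      return out;
    }

    // mergeAdjacent([[-1,1],[0,100000],[1,51],[1,50]])
    // => [[-1,1],[0,100000],[1,101]]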

Conclusions:
- `jest-diff` (and `diff-sequences` under the hood) is overall the best choice.
  It has the expected time and space complexities, provides the flexibility to
  skip the potentially slow "cleanup", and supports more than 65k unique lines.
- `jest-diff` misses the "skip common prefix / suffix" optimization that
  `diff-match-patch` has, which seems practically important (when editing a
  line in the editor, all lines except the edited one are common prefix or
  suffix). The optimization is not hard to implement (see the sketch after
  this list); this diff implements it.
- For certain use-cases (e.g. linelog) where the diff content is not needed
  (at least for the left / "a" side), use `diff-sequences` directly to avoid
  the overhead of preparing the diff content.
  - `jest-diff`'s `diffLines` outputs one line per `Diff`, but we want
    one chunk per `Diff`.
  - `jest-diff`'s `diffStringsRaw` produces one `Diff` per chunk, and because
    [`string.slice` is O(1) in V8](https://stackoverflow.com/a/72545403), it has
    acceptable performance. But mapping lines to chars would reintroduce the
    undesirable 65,535-unique-line limit.
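
A minimal sketch of the prefix/suffix stripping (hypothetical helper; it works on strings and interned id arrays alike):

    function stripCommonEnds(a, b) {
      // Count the shared prefix.
      let prefix = 0;
      while (prefix < a.length && prefix < b.length && a[prefix] === b[prefix]) {
        prefix++;
      }
      // Count the shared suffix, without overlapping the prefix.
      let suffix = 0;
      while (
        suffix < a.length - prefix &&
        suffix < b.length - prefix &&
        a[a.length - 1 - suffix] === b[b.length - 1 - suffix]
      ) {
        suffix++;
      }
      // Diff only the middle; re-attach prefix/suffix as common chunks.
      return {
        prefix,
        suffix,
        a: a.slice(prefix, a.length - suffix),
        b: b.slice(prefix, b.length - suffix),
      };
    }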

Reviewed By: evangrayk

Differential Revision: D43857949

fbshipit-source-id: 9a3d85ebf10c9b82da8ab5cba4e14e519bbf264d
tomByrer commented Apr 9, 2023

I'm curious to see a better algorithm...

ExplodingCabbage (Collaborator) commented Jan 11, 2024

Hmmmmm...

I definitely think it makes sense to implement a version of the algorithm based on the linear space refinement in which we explore the edit graph in both directions (from the top-left, and from the bottom-right) until they join up in the middle. This will have the practical effect of making jsdiff faster and reducing its memory usage (though without, I think, reducing its worst-case time or space complexity).

I think the actual linear space refinement proposed by Myers is different to (and cleverer than) this, though, and I'm not convinced of the wisdom of fully implementing it; certainly I think it shouldn't be on by default. If I'm understanding it correctly - and it took me a while to get this - the key idea of that refinement is that you can navigate through the edit graph from both sides without keeping track of the paths you've taken, only how far each has got in each diagonal, until the two paths hit each other somewhere in the middle of the edit graph. The snake they hit each other on is then guaranteed to be the middle snake of the optimal diff. You compensate for not having kept a record of the paths you took by recursively running the algorithm again on the regions before and after the middle snake to find the middle snakes of those, and so on until all the middle snakes you've found together amount to a full path through the edit graph. Effectively you're repeating lots of work, roughly doubling the time taken in some cases, to compensate for not keeping a record of the work you've done.

From a computer scientist's perspective, judging performance only in terms of time and space complexity, this is a brilliant trick that gets the space complexity down without worsening the time complexity. But pragmatically speaking, I'm not convinced there's any point? I have encountered situations where jsdiff basically hung "forever" trying to compute a large diff, but never any situation where it ran out of memory or even used a nontrivial amount of my computer's memory. Any tradeoff that involves making jsdiff slower in order to reduce its memory requirements intuitively feels like a bad tradeoff to me (though I am open to being persuaded otherwise). I'd be hesitant to accept a contribution adding this as an optional off-by-default feature without some persuasion that it's worth it.
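
For intuition, the same repeat-work-to-save-space trick appears in its simplest classical form in Hirschberg's linear-space LCS algorithm. A sketch of that (not Myers' refinement itself): compute O(N) forward and reverse scores to find a split point, recurse on the two halves, and recompute whatever wasn't stored:

    // One row of the LCS DP table: LCS lengths of all of `a` vs each
    // prefix of `b`, kept in O(len(b)) space.
    function lcsRow(a, b) {
      let prev = new Array(b.length + 1).fill(0);
      for (let i = 0; i < a.length; i++) {
        const cur = [0];
        for (let j = 0; j < b.length; j++) {
          cur.push(a[i] === b[j] ? prev[j] + 1 : Math.max(cur[j], prev[j + 1]));
        }
        prev = cur;
      }
      return prev;
    }

    function hirschberg(a, b) {
      if (a.length === 0) return '';
      if (a.length === 1) return b.includes(a[0]) ? a : '';
      const mid = Math.floor(a.length / 2);
      const rev = (s) => [...s].reverse().join('');
      // Forward scores for the first half, reverse scores for the second.
      const l = lcsRow(a.slice(0, mid), b);
      const r = lcsRow(rev(a.slice(mid)), rev(b));
      // Pick the split of `b` that maximizes the combined LCS length.
      let best = -1, split = 0;
      for (let j = 0; j <= b.length; j++) {
        const score = l[j] + r[b.length - j];
        if (score > best) { best = score; split = j; }
      }
      // The scores are discarded; the recursion recomputes them per half.
      return hirschberg(a.slice(0, mid), b.slice(0, split)) +
             hirschberg(a.slice(mid), b.slice(split));
    }

    console.log(hirschberg('A', 'AAA')); // 'A'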

(Do note that the perf issues around clonePath that you highlight are already basically fixed by a separate change, #411.)

ExplodingCabbage (Collaborator) commented:

I'm gonna label this as planned for next release with the intent that I'll implement the idea of navigating the edit graph from both ends, but not actually implement the linear space algorithm.

quark-zju (Author) commented:

Part of the space complexity can manifest as time complexity via the GC. The hung-"forever" cases might actually be caused by the space complexity. I didn't observe an "out of memory" error either.

The Hearthstone cards.json example is a simplified version of a slow production diff case we observed in the past. Both are large generated JSON files. For that specific production case, I cannot share the file contents, but I can share some numbers: even git diff without --minimal took 5 seconds.

ExplodingCabbage (Collaborator) commented Jan 15, 2024

Part of the space complexity can manifest as time complexity via the GC

No, I don't think so - or at least, I don't think GC can possibly increase the time complexity in a way you wouldn't expect if you analyzed the time complexity without regard for GC. Every change-object creation is already accounted for in the time complexity, and there are only as many object garbage collections as there are object creations.

Again, do also note that #411 (merged to master but not yet released to npm) has drastically reduced the amount of garbage collection needed and completely eliminated clonePath.

quark-zju (Author) commented:

I don't think GC can possibly increase the time complexity in a way you wouldn't expect if you analyzed the time complexity without regard for GC

For example, appending N elements to a linked list is O(N) time, since each append is O(1). However, if the GC pauses O(N) times and has to scan the whole linked list each time, GC turns this into O(N^2) total time.
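
A hypothetical microbenchmark makes this concrete (run with `node --expose-gc`; the forced collections stand in for GC pressure):

    // Forcing a full collection every 100 appends makes each pause scan the
    // whole live list, so total GC work grows quadratically with N.
    let head = null;
    const N = 200000;
    console.time('append with forced GC');
    for (let i = 0; i < N; i++) {
      head = { value: i, next: head }; // O(1) append at the head
      if (global.gc && i % 100 === 0) global.gc(); // O(i) scan of live nodes
    }
    console.timeEnd('append with forced GC');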

ExplodingCabbage (Collaborator) commented:

Ah, interesting. Yeah, that makes sense; I was assuming the cost of garbage collection would be proportional to the number of objects in need of collecting, but of course that's not the case because of the cost of finding the items that can be collected, which I was completely ignoring. Hmm...

ExplodingCabbage (Collaborator) commented Jan 16, 2024

I think I'm persuaded that it'd make sense from a performance perspective to make this change, but one thing still gives me pause, which is that the linear space refinement from the paper gives different results to the core Myers algorithm. Consider diffing old text A against new text AAA, for instance. With the core Myers diff algorithm, we should keep the first A and then insert two As after it. But now consider following this algorithm for finding the middle snake:

[figure: the middle-snake procedure from Myers' paper]

Here's our edit graph:

[figure: the edit graph for A vs AAA]

At D=0, the forward and reverse paths both just preserve one A (i.e. move diagonally):

[figure: D=0 — the forward and reverse paths each move diagonally through one A]

Then when D=1, we do an insertion on the forward path and similarly a vertical move on the reverse path. (I'm omitting paths that go off the edge of the edit graph, although technically we also do those if we're following the algorithm faithfully.)

[figure: D=1 — an insertion on the forward path, a vertical move on the reverse path]

Now the forward path is at (x, y) = (1, 2) and the reverse path is at (u, v) = (0, 1), satisfying the definition of the paths "overlapping" in diagonal k=-1. (Note we use D = 2 when we apply this definition, even though the variable D in the algorithm is 1 at this point; they're not the same D.)

[figure: the forward and reverse paths overlapping in diagonal k=-1]

But now, per the algorithm, "The last snake of the reverse path is the middle snake." - i.e. the empty snake from (0, 1) to (0, 1). We're thus now forced to build a 2-path that goes through (0, 1), even though the core Myers algorithm would not go through that point.
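
For reference, the core algorithm's answer is easy to check against jsdiff directly (a quick sketch; the exact change-object fields vary by jsdiff version):

    const { diffChars } = require('diff');
    console.log(diffChars('A', 'AAA'));
    // Core Myers keeps the first A and inserts two after it, i.e. roughly:
    // [ { count: 1, value: 'A' }, { count: 2, added: true, value: 'AA' } ]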

I was slightly doubting my own interpretation, but the interactive visualisation at https://blog.robertelder.org/diff-algorithm/ agrees:

[screenshot: the visualisation's middle snake passing through (0, 1)]

So if we made this change we'd be modifying the results jsdiff returns. Do you see a tweak to the published refinement that would let us avoid this?

quark-zju (Author) commented Jan 16, 2024

Git's xdiff does some post-processing (e.g. shifting diff chunks up or down, merging adjacent chunks) to "normalize" the results.
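
For illustration, a minimal sketch of the "shift" half of that idea (hypothetical code, not xdiff's actual implementation): a pure-insertion hunk covering new-file lines [start, start + len) can slide up one line whenever the line just above it equals its own last line, because both positions describe the same change.

    function slideInsertionUp(lines, start, len) {
      // Slide the hunk to its topmost equivalent position.
      while (start > 0 && lines[start - 1] === lines[start + len - 1]) {
        start -= 1;
      }
      return start;
    }

    // Inserting a second "A" into ["A"] can be described at index 0 or 1;
    // sliding normalizes it to index 0, matching the core algorithm's answer.
    console.log(slideInsertionUp(['A', 'A'], 1, 1)); // 0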

ExplodingCabbage (Collaborator) commented:

Hmm. I wonder if that consistently achieves results identical to the core Myers diff algorithm, though? (It's not immediately obvious to me whether there's any reasonable way to achieve that at all.) Would need to scrutinise the post-processing logic carefully...
