Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

diff.Text OOMs with large different inputs #26

Open
myitcv opened this issue Mar 29, 2021 · 10 comments
Open

diff.Text OOMs with large different inputs #26

myitcv opened this issue Mar 29, 2021 · 10 comments

Comments

@myitcv
Copy link

myitcv commented Mar 29, 2021

$ go version
go version devel +dec3d00b28 Thu Mar 25 04:21:29 2021 +0000 linux/amd64
$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/home/myitcv/.cache/go-build"
GOENV="/home/myitcv/.config/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOINSECURE=""
GOMODCACHE="/home/myitcv/gostuff/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/home/myitcv/gostuff"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/home/myitcv/gos"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/home/myitcv/gos/pkg/tool/linux_amd64"
GOVCS=""
GOVERSION="devel +dec3d00b28 Thu Mar 25 04:21:29 2021 +0000"
GCCGO="gccgo"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD="/tmp/tmp.uHtLe5z1Zd/go.mod"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build267893791=/tmp/go-build -gno-record-gcc-switches"

To reproduce:

cd $(mktemp -d)
curl -sL https://github.com/pkg/diff/files/6221280/repro.zip -o repro.zip
unzip repro.zip
go run main.go

Above, repro.zip is a reference to: repro.zip

@mvdan
Copy link
Contributor

mvdan commented Mar 29, 2021

I added some quick debug printing, and it looks like you're just butting heads with the double loop in myers.Diff. It isn't surprising that, for an input of many megabytes, O(n*n) results in OOM. I run out of memory at about line 38k with 16GiB of system memory, and there are a total of 130k lines. That seems to suggest you'd need at least 64GiB to finish :)

I think the right solution is to drop the quadratic behavior. This is already mentioned in the godoc:

// Diff calculates an edit.Script for ab using the Myers diff algorithm.
// This implementation uses the algorithm described in the first half
// of Myers' paper, which requires quadratric space.
// (An implementation of the linear space version is forthcoming.)
//
// Because diff calculation can be expensive, Myers supports cancellation via ctx.
func Diff(ctx context.Context, ab Pair) edit.Script {

Russ also hints at that here: golang/go#45200 (comment)

I'm not sure if @josharian plans to work on this soon (I imagine not) or how soon Russ's will be ready, since he said "next month".

Until either of those, I'd suggest to avoid diffing if the input is large (e.g. beyond 10k lines or 1MiB) or to add a timeout via a context. I'd probably go with the former, because one could imagine a very fast CPU with not a lot of spare memory still OOMing with a few-second timeout.

@myitcv
Copy link
Author

myitcv commented Mar 29, 2021

I'd suggest to avoid diffing if the input is large (e.g. beyond 10k lines or 1MiB) or to add a timeout via a context. I'd probably go with the former, because one could imagine a very fast CPU with not a lot of spare memory still OOMing with a few-second timeout.

Then we need to fix testscript 😄

https://github.com/rogpeppe/go-internal/blob/8ef127344c684f0eb23e5eea40f303b74de097b2/testscript/cmd.go#L145

@myitcv
Copy link
Author

myitcv commented Nov 23, 2021

Just wondering if anyone is aware of any progress on the issues/topics related to this one?

@josharian
Copy link
Contributor

Not that I'm aware of. If you're interested in implementing the linear time diff, I'm happy to point you in the right direction.

@myitcv
Copy link
Author

myitcv commented Nov 23, 2021

Not that I'm aware of. If you're interested in implementing the linear time diff, I'm happy to point you in the right direction.

Sadly I don't see myself having the time. But will tweet out in case someone else is interested. A fun little project.

@thepudds
Copy link

thepudds commented Feb 18, 2022

Hi there 👋. FYI, Russ has a new diff implementation in CL 384255. It is an implementation of patience diff. It runs in O(n log n) time, and I think his version might be O(n + m) space.

I copied it here, and added a trivial cmd to be able run it on files:

https://github.com/thepudds/patience-diff

For the example from this issue, it runs in 0.07 sec and uses 38 MB RSS, vs. pkg/diff runs in 19 sec and 18 GB RSS (roughly ~250x slower and ~500x more RAM).

patience-diff:

$ go install github.com/thepudds/patience-diff/cmd/patience-diff@latest

$ /usr/bin/time -v patience-diff got want > patience-diff.out
        [...]
        User time (seconds): 0.06
        System time (seconds): 0.01
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.07
        Maximum resident set size (kbytes): 38876

pkg/diff repro:

$ /usr/bin/time -v ./mod.com got want > pkg-diff.out
        [...]
        User time (seconds): 10.79
        System time (seconds): 8.90
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:19.38
        Maximum resident set size (kbytes): 18085500

The patience-diff version is not a minimal diff:

$ wc -l *.out
96935 patience-diff.out
56391 pkg-diff.out

...but at least in theory, it is often friendlier to humans.

A better location for this might be rogpeppe/go-internal (which is an accurate name for where this came from, and would eliminate an extra dependency from rogpeppe/go-internal, and the more reasons there are to go to rogpeppe/go-internal then the more testscripts probably spreads, which is a good thing...). I'd be interested in sending a PR there if there is interest.

Alternatively, pkg/diff could be updated to use this (perhaps as a parallel package to the meyers package), but I think that might be more work?

And of course, people could use the copy at thepudds/patience-diff, but that might mean random gophers having to wonder whether or not they can trust this "thepudds" character 😅.

Finally, I should mention that Russ' CL hasn't been merged yet.

@thepudds
Copy link

All that said, there is also golang/go#45200 -- presumably that will be a time and space efficient algorithm. If that ends up happening soon, then perhaps not much benefit having this elsewhere.

@josharian
Copy link
Contributor

I'm mostly AFK for the next week. But at a high level, I'd be absolutely delighted to add a patience diff implementation here, as long as Russ is OK with it. I've been meaning to write one myself, but it never got to the top of the stack. Ideally it would be just like the myers diff one, that is, in its own package, with a similar API, etc. And then update the default to be patience instead of myers.

random gophers having to wonder whether or not they can trust this "thepudds" character

Those in the know know that they can. :)

@thepudds
Copy link

Hi @josharian, I’ll take a closer look at the API that the myers package presents here, and see if I can send a PR. No guarantees though. And agreed it makes sense to confirm with Russ before getting too far.

mvdan added a commit to mvdan/go-internal that referenced this issue Mar 22, 2023
The main reason to prefer a copy of Go's internal/diff over pkg/diff is
that internal/diff is much more efficient in both time and memory usage.
In particular, pkg/diff required quadratic space in memory,
which could easily cause "out of memory" errors in Go tests
per pkg/diff#26.

Beyond making the `cmp` command better able to handle large files,
this also moves us back to having zero external dependencies,
which is always a nice to have.

The long_diff test still appears to work well;
the output is changed since the new package produces a shorter,
but still entirely correct, diff.

It also seems like the new package includes a leading "diff" line to
show the two filenames. That seems like a harmless change.
mvdan added a commit to rogpeppe/go-internal that referenced this issue Mar 22, 2023
The main reason to prefer a copy of Go's internal/diff over pkg/diff is
that internal/diff is much more efficient in both time and memory usage.
In particular, pkg/diff required quadratic space in memory,
which could easily cause "out of memory" errors in Go tests
per pkg/diff#26.

Beyond making the `cmp` command better able to handle large files,
this also moves us back to having zero external dependencies,
which is always a nice to have.

The long_diff test still appears to work well;
the output is changed since the new package produces a shorter,
but still entirely correct, diff.

It also seems like the new package includes a leading "diff" line to
show the two filenames. That seems like a harmless change.
@mvdan
Copy link
Contributor

mvdan commented Mar 22, 2023

For those following along, https://pkg.go.dev/github.com/rogpeppe/go-internal/diff@master exists now :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants