Optimize a couple of nested loops in the scanner. #537

cjcullen · 2019-11-04T17:25:51Z

There are two places in scannerc.go where we iterate through the full simple_keys stack that I think we can avoid:

when determining whether a key is possible
in the staleness check.

I've included a couple of test cases that show the bad behavior, even with the recent depth limits :

$ git checkout c9e8ea1b90fe07c9447c7cbcc212e927d6988388 && go test -run=NONE -bench=. -benchmem ./... > old.txt
$ git checkout 3717a577a65196747809a63ffe76c5ca6b0d88e8 && go test -run=NONE -bench=. -benchmem ./... > new.txt
$ benchcmp old.txt new.txt 
benchmark                                old ns/op       new ns/op      delta
Benchmark1000KB100Aliases-6              1056925022      1094599057     +3.56%
Benchmark1000KBDeeplyNestedSlices-6      282869888       7835556        -97.23%
Benchmark1000KBDeeplyNestedMaps-6        266289230       7375439        -97.23%
Benchmark1000KBDeeplyNestedIndents-6     4360666         4514662        +3.53%
Benchmark1000KB1000IndentLines-6         449900306       451543002      +0.37%
Benchmark1KBMaps-6                       599228          620644         +3.57%
Benchmark10KBMaps-6                      5801815         5892398        +1.56%
Benchmark100KBMaps-6                     53699570        57544454       +7.16%
Benchmark1000KBMaps-6                    513662800       538354297      +4.81%
BenchmarkDeepSlice-6                     46967064873     448771530      -99.04%
BenchmarkDeepFlow-6                      45121229024     648750899      -98.56%

benchmark                                old allocs     new allocs     delta
Benchmark1000KB100Aliases-6              5832441        5832438        -0.00%
Benchmark1000KBDeeplyNestedSlices-6      9068           9249           +2.00%
Benchmark1000KBDeeplyNestedMaps-6        9072           9252           +1.98%
Benchmark1000KBDeeplyNestedIndents-6     10082          10084          +0.02%
Benchmark1000KB1000IndentLines-6         4093516        4093517        +0.00%
Benchmark1KBMaps-6                       3126           3128           +0.06%
Benchmark10KBMaps-6                      30780          30782          +0.01%
Benchmark100KBMaps-6                     307269         307271         +0.00%
Benchmark1000KBMaps-6                    3072079        3072082        +0.00%
BenchmarkDeepSlice-6                     2048121        2048304        +0.01%
BenchmarkDeepFlow-6                      3334112        3334113        +0.00%

benchmark                                old bytes     new bytes     delta
Benchmark1000KB100Aliases-6              393120640     393118752     -0.00%
Benchmark1000KBDeeplyNestedSlices-6      4689806       4748760       +1.26%
Benchmark1000KBDeeplyNestedMaps-6        4690070       4749212       +1.26%
Benchmark1000KBDeeplyNestedIndents-6     2969933       2970002       +0.00%
Benchmark1000KB1000IndentLines-6         143718544     143718560     +0.00%
Benchmark1KBMaps-6                       218984        219112        +0.06%
Benchmark10KBMaps-6                      2178160       2178284       +0.01%
Benchmark100KBMaps-6                     22002813      22002938      +0.00%
Benchmark1000KBMaps-6                    220560496     220560768     +0.00%
BenchmarkDeepSlice-6                     120114824     120174234     +0.05%
BenchmarkDeepFlow-6                      190656952     190656984     +0.00%

cjcullen · 2019-11-04T17:30:33Z

@niemeyer @liggitt

liggitt · 2019-11-04T17:34:22Z

this looks good to me

laverya · 2019-11-05T18:09:04Z

Should this be a v3 pr? (is the same problem present in v3?)

liggitt · 2019-11-05T18:17:36Z

v2 is the default branch. v3 is in prerelease.

niemeyer · 2019-11-05T19:21:13Z

Let's get the issue fixed in v2 first and then we can easily port it forward to v3.

I will have a deeper look into the issue and the suggested solution and get back to you soon.

…ent lookup in fetch.

cjcullen · 2019-11-06T01:54:30Z

First attempt had a bug dealing w/ complex keys added to the stack immediately after another key was "removed" (keys are not actually taken out of the stack, but do get overwritten, and the new key might not be a possible simple_key, so min_possible needs to account for that).

$ git checkout c292eee44f0f38e8507803c26868dc8b889e2550 && go test -run=NONE -bench=. -benchmem ./... > old.txt
$ git checkout 01c0521edf7a0d83650270898a38972a8d8264b4 && go test -run=NONE -bench=. -benchmem ./... > new.txt
$ benchcmp old.txt new.txt
benchmark                                old ns/op       new ns/op     delta
Benchmark1000KB100Aliases-6              933118408       924264884     -0.95%
Benchmark1000KBDeeplyNestedSlices-6      292079720       8114551       -97.22%
Benchmark1000KBDeeplyNestedMaps-6        276815596       8199085       -97.04%
Benchmark1000KBDeeplyNestedIndents-6     4605536         4406578       -4.32%
Benchmark1000KB1000IndentLines-6         446018719       447012500     +0.22%
Benchmark1KBMaps-6                       606299          597170        -1.51%
Benchmark10KBMaps-6                      5848238         6323868       +8.13%
Benchmark100KBMaps-6                     55045361        57967332      +5.31%
Benchmark1000KBMaps-6                    502323739       533391064     +6.18%
BenchmarkDeepSlice-6                     46958325534     439644015     -99.06%
BenchmarkDeepFlow-6                      44440589216     601371522     -98.65%

benchmark                                old allocs     new allocs     delta
Benchmark1000KB100Aliases-6              5832424        5832427        +0.00%
Benchmark1000KBDeeplyNestedSlices-6      9066           9248           +2.01%
Benchmark1000KBDeeplyNestedMaps-6        9072           9251           +1.97%
Benchmark1000KBDeeplyNestedIndents-6     10083          10084          +0.01%
Benchmark1000KB1000IndentLines-6         4093516        4093517        +0.00%
Benchmark1KBMaps-6                       3126           3128           +0.06%
Benchmark10KBMaps-6                      30780          30782          +0.01%
Benchmark100KBMaps-6                     307269         307271         +0.00%
Benchmark1000KBMaps-6                    3072079        3072081        +0.00%
BenchmarkDeepSlice-6                     2048122        2048307        +0.01%
BenchmarkDeepFlow-6                      3334111        3334113        +0.00%

benchmark                                old bytes     new bytes     delta
Benchmark1000KB100Aliases-6              393118196     393119108     +0.00%
Benchmark1000KBDeeplyNestedSlices-6      4689654       4748743       +1.26%
Benchmark1000KBDeeplyNestedMaps-6        4690070       4748824       +1.25%
Benchmark1000KBDeeplyNestedIndents-6     2969955       2969997       +0.00%
Benchmark1000KB1000IndentLines-6         143718512     143718560     +0.00%
Benchmark1KBMaps-6                       218984        219112        +0.06%
Benchmark10KBMaps-6                      2178160       2178281       +0.01%
Benchmark100KBMaps-6                     22002906      22002931      +0.00%
Benchmark1000KBMaps-6                    220560496     220560672     +0.00%
BenchmarkDeepSlice-6                     120114936     120173917     +0.05%
BenchmarkDeepFlow-6                      190656856     190656976     +0.00%

niemeyer · 2019-11-06T09:20:15Z

It's good to avoid force pushing in those cases, so that we can see what changed between the revisions.

Either way, as we chatted over email I hadn't touched this logic from the ported C library so far as it seemed to do its job, but I'd like to try taking a step back before adding more complexity to see if there's an alternative approach that could get rid of the issues found while simplifying the implementation instead of making it harder to understand and maintain.

I believe this is the case here. I will push an alternative algorithm to v3 that is both simpler and faster so you can have a look.

This should simplify the logic and significantly improve performance in edge cases as found and reported on #537 by CJ Cullen.

niemeyer · 2019-11-06T09:28:40Z

Please have a look at e228e37 and let me know what you think.

cjcullen · 2019-11-07T01:05:31Z

TL;DR: I think your patch fixes the issue, but I got fairly confused along the way :).
Do you want me to open a PR for your patch on v2? And maybe a separate one to add the two testcases that it fixes?
...

I haven't looked to much at the code on v3, but it appears to be more strict on some sketchy (but I think valid) YAML. E.g.

{a:{}}

produces: unexpected error: yaml: did not find expected ',' or '}'
And

{a,a}

produces: line 1: mapping key "a" already defined at line 1 (I think because d.uniqueKeys defaults to true now?)

The DeepFlow benchmark I added would now fail because of the reasons above, but runs out of memory on my machine before it can fail. That test looks something like:

{a,b:
 {a,b:{a,b:{a,b:{1,1,1,1,1,1,1,1}}}}}

Where the maps are nested to max depth, with ~1000kB of crap (~500,000 tokens) at the bottom.

I made the bottom map in the test smaller to gather some numbers.

100 tokens:
BenchmarkDeepFlow-6   	      99	  12571431 ns/op	 7661837 B/op	   40271 allocs/op
1000 tokens:
BenchmarkDeepFlow-6   	       5	 218142322 ns/op	101838585 B/op	 2022118 allocs/op
10000 tokens:
BenchmarkDeepFlow-6   	       1	18689774500 ns/op	9762953496 B/op	200040215 allocs/op

I think this is because we are generating per-pair error messages for each re-defined key here: https://github.com/go-yaml/yaml/blob/v3/decode.go#L749-L760. That is probably fairly easy to clean up.

When I flip d.uniqueKeys to false, I still see a notable performance degradation from the fixed v2 to the fixed v3

$ git checkout cjv3 && go test -run=NONE -bench=. -benchmem ./... > v3.txt
$ git checkout deepflow && go test -run=NONE -bench=. -benchmem ./... > v2.txt
$ benchcmp v2.txt v3.txt
benchmark                                old ns/op     new ns/op      delta
Benchmark1000KB100Aliases-6              922020576     1720174932     +86.57%
Benchmark1000KBDeeplyNestedSlices-6      8025403       60725443       +656.67%
Benchmark1000KBDeeplyNestedMaps-6        8060659       59798584       +641.86%
Benchmark1000KBDeeplyNestedIndents-6     4701986       15094606       +221.03%
Benchmark1000KB1000IndentLines-6         451119036     808144598      +79.14%
Benchmark1KBMaps-6                       630891        1199927        +90.20%
Benchmark10KBMaps-6                      6462013       15317658       +137.04%
Benchmark100KBMaps-6                     54991397      124880821      +127.09%
Benchmark1000KBMaps-6                    516057186     1114736076     +116.01%
BenchmarkDeepSlice-6                     439363971     1091913514     +148.52%
BenchmarkDeepFlow-6                      614039880     1259611292     +105.14%

benchmark                                old allocs     new allocs     delta
Benchmark1000KB100Aliases-6              5832426        6344290        +8.78%
Benchmark1000KBDeeplyNestedSlices-6      9247           9090           -1.70%
Benchmark1000KBDeeplyNestedMaps-6        9252           9095           -1.70%
Benchmark1000KBDeeplyNestedIndents-6     10084          10107          +0.23%
Benchmark1000KB1000IndentLines-6         4093517        4093555        +0.00%
Benchmark1KBMaps-6                       3128           3651           +16.72%
Benchmark10KBMaps-6                      30782          35923          +16.70%
Benchmark100KBMaps-6                     307271         358502         +16.67%
Benchmark1000KBMaps-6                    3072081        3584125        +16.67%
BenchmarkDeepSlice-6                     2048293        2540164        +24.01%
BenchmarkDeepFlow-6                      3334113        3816152        +14.46%

benchmark                                old bytes     new bytes      delta
Benchmark1000KB100Aliases-6              393118424     1470212888     +273.99%
Benchmark1000KBDeeplyNestedSlices-6      4748518       13930102       +193.36%
Benchmark1000KBDeeplyNestedMaps-6        4749088       13930332       +193.33%
Benchmark1000KBDeeplyNestedIndents-6     2969992       12260117       +312.80%
Benchmark1000KB1000IndentLines-6         143718560     697303904      +385.19%
Benchmark1KBMaps-6                       219112        649889         +196.60%
Benchmark10KBMaps-6                      2178283       11417515       +424.15%
Benchmark100KBMaps-6                     22002941      114370482      +419.80%
Benchmark1000KBMaps-6                    220560624     1297664104     +488.35%
BenchmarkDeepSlice-6                     120172712     1182242296     +883.79%
BenchmarkDeepFlow-6                      190657024     1067076568     +459.68%

I think most of that is just v3-specific stuff (I don't think I'm running into this, but I could be), because comparing it to v2 w/ just your patch applied gets much better numbers:

$ benchcmp v2_cj.txt v2_gustavo.txt
benchmark                                old ns/op     new ns/op     delta
Benchmark1000KB100Aliases-6              922020576     928285149     +0.68%
Benchmark1000KBDeeplyNestedSlices-6      8025403       46426178      +478.49%
Benchmark1000KBDeeplyNestedMaps-6        8060659       45873584      +469.10%
Benchmark1000KBDeeplyNestedIndents-6     4701986       4353733       -7.41%
Benchmark1000KB1000IndentLines-6         451119036     437655646     -2.98%
Benchmark1KBMaps-6                       630891        591242        -6.28%
Benchmark10KBMaps-6                      6462013       5891734       -8.83%
Benchmark100KBMaps-6                     54991397      52778245      -4.02%
Benchmark1000KBMaps-6                    516057186     487526584     -5.53%
BenchmarkDeepSlice-6                     439363971     454282346     +3.40%
BenchmarkDeepFlow-6                      614039880     595380818     -3.04%

benchmark                                old allocs     new allocs     delta
Benchmark1000KB100Aliases-6              5832426        5832427        +0.00%
Benchmark1000KBDeeplyNestedSlices-6      9247           9066           -1.96%
Benchmark1000KBDeeplyNestedMaps-6        9252           9070           -1.97%
Benchmark1000KBDeeplyNestedIndents-6     10084          10082          -0.02%
Benchmark1000KB1000IndentLines-6         4093517        4093516        -0.00%
Benchmark1KBMaps-6                       3128           3126           -0.06%
Benchmark10KBMaps-6                      30782          30780          -0.01%
Benchmark100KBMaps-6                     307271         307269         -0.00%
Benchmark1000KBMaps-6                    3072081        3072079        -0.00%
BenchmarkDeepSlice-6                     2048293        2048121        -0.01%
BenchmarkDeepFlow-6                      3334113        3334111        -0.00%

benchmark                                old bytes     new bytes     delta
Benchmark1000KB100Aliases-6              393118424     393119352     +0.00%
Benchmark1000KBDeeplyNestedSlices-6      4748518       4689528       -1.24%
Benchmark1000KBDeeplyNestedMaps-6        4749088       4689814       -1.25%
Benchmark1000KBDeeplyNestedIndents-6     2969992       2969939       -0.00%
Benchmark1000KB1000IndentLines-6         143718560     143718512     -0.00%
Benchmark1KBMaps-6                       219112        218984        -0.06%
Benchmark10KBMaps-6                      2178283       2178155       -0.01%
Benchmark100KBMaps-6                     22002941      22002804      -0.00%
Benchmark1000KBMaps-6                    220560624     220560496     -0.00%
BenchmarkDeepSlice-6                     120172712     120114866     -0.05%
BenchmarkDeepFlow-6                      190657024     190656896     -0.00%

niemeyer · 2019-11-07T18:15:48Z

Do you want me to open a PR for your patch on v2? And maybe a separate one to add the two testcases that it fixes?

If you're happy with it as a solution, yes, that would be great, thanks!

I haven't looked to much at the code on v3, but it appears to be more strict on some sketchy (but I think valid) YAML.

It is more strict indeed, and the prevention of duplicated keys by default is intended. It's not supported by the specification, so v3 makes this harder to overlook.

That said, the {a:{}} snippet is invalid yaml and doesn't work in v2 either:

https://play.golang.org/p/OpSmn7UKjmS

It violates this part of the spec:

Normally, YAML insists the “:” mapping value indicator be separated from the value by white space. A benefit of this restriction is that the “:” character can be used inside plain scalars, as long as it is not followed by white space. This allows for unquoted URLs and timestamps. It is also a potential source for confusion as “a:1” is a plain scalar and not a key: value pair.

For that same reason, note that {12:34} actually means {"12:34": null}, not {"12": 34}. Please don't blame me.. I didn't write that spec. :-)

Would be nice to fix the proposed tests so they don't depend on invalid yaml.

The DeepFlow benchmark I added would now fail because of the reasons above, but runs out of memory on my machine before it can fail. That test looks something like:
...
I think this is because we are generating per-pair error messages for each re-defined key here: https://github.com/go-yaml/yaml/blob/v3/decode.go#L749-L760. That is probably fairly easy to clean up.

Ah, good catch! Let's fix that as well then.

This should simplify the logic and significantly improve performance in edge cases as found and reported on go-yaml#537 by CJ Cullen.

cjcullen · 2019-11-07T19:44:02Z

Opened #543 for the port to v2 (with tests)

niemeyer · 2019-11-08T12:39:07Z

One detail I forgot commenting on your earlier post, and which is certainly due out of respect for the proper research done, is the fact the suggested simpler implementation is slower than the original one on this PR. This is true, but assuming the simpler version solves the abuse well enough (when I tested originally it was about 95% improvement according to benchcmp), I think it's worth the simpler version on the basis that the more complex version won't really affect normal uses at all. That assumes most yaml is trivially nested, which sounds fair.

If that's incorrect somehow, please do let me know.

cjcullen · 2019-11-09T01:34:02Z

In my testing, your patch ported onto v2 was not significantly slower, except in a couple of the max-depth error cases, and even then, we're talking about 50ms vs. 1ms.

$ benchcmp cj.txt gus.txt
benchmark                                old ns/op      new ns/op      delta
Benchmark1000KB100Aliases-6              1043562949     1050274409     +0.64%
Benchmark1000KBDeeplyNestedSlices-6      8077086        48635551       +502.14%
Benchmark1000KBDeeplyNestedMaps-6        8580521        50397244       +487.34%
Benchmark1000KBDeeplyNestedIndents-6     4508057        4559411        +1.14%
Benchmark1000KB1000IndentLines-6         455249763      436821141      -4.05%
Benchmark1KBMaps-6                       626294         593838         -5.18%
Benchmark10KBMaps-6                      5938881        6193108        +4.28%
Benchmark100KBMaps-6                     55397757       54890679       -0.92%
Benchmark1000KBMaps-6                    532778194      493524839      -7.37%
BenchmarkDeepSlice-6                     448975265      456474390      +1.67%
BenchmarkDeepFlow-6                      631368987      403562717      -36.08%

benchmark                                old allocs     new allocs     delta
Benchmark1000KB100Aliases-6              5832440        5832440        +0.00%
Benchmark1000KBDeeplyNestedSlices-6      9249           9066           -1.98%
Benchmark1000KBDeeplyNestedMaps-6        9252           9071           -1.96%
Benchmark1000KBDeeplyNestedIndents-6     10083          10083          +0.00%
Benchmark1000KB1000IndentLines-6         4093517        4093516        -0.00%
Benchmark1KBMaps-6                       3128           3126           -0.06%
Benchmark10KBMaps-6                      30782          30780          -0.01%
Benchmark100KBMaps-6                     307271         307269         -0.00%
Benchmark1000KBMaps-6                    3072081        3072079        -0.00%
BenchmarkDeepSlice-6                     2048297        2048121        -0.01%
BenchmarkDeepFlow-6                      3334113        1978105        -40.67%

benchmark                                old bytes     new bytes     delta
Benchmark1000KB100Aliases-6              393119144     393120544     +0.00%
Benchmark1000KBDeeplyNestedSlices-6      4748894       4689568       -1.25%
Benchmark1000KBDeeplyNestedMaps-6        4748926       4689851       -1.24%
Benchmark1000KBDeeplyNestedIndents-6     2969957       2969951       -0.00%
Benchmark1000KB1000IndentLines-6         143718592     143718512     -0.00%
Benchmark1KBMaps-6                       219112        218984        -0.06%
Benchmark10KBMaps-6                      2178284       2178156       -0.01%
Benchmark100KBMaps-6                     22002935      22002808      -0.00%
Benchmark1000KBMaps-6                    220560624     220560496     -0.00%
BenchmarkDeepSlice-6                     120173610     120114824     -0.05%
BenchmarkDeepFlow-6                      190656976     115058064     -39.65%

This should simplify the logic and significantly improve performance in edge cases as found and reported on #537 by CJ Cullen.

niemeyer · 2019-11-19T11:56:22Z

This PR has been obsoleted by the follow ups, per comments above.

Thank you very much for the collaboration here, @cjcullen.

cjcullen force-pushed the deepflow branch from 3717a57 to 04b5057 Compare November 4, 2019 21:13

cjcullen added 3 commits November 5, 2019 17:46

Add tests for deeply nested large sets of tokens.

c292eee

Optimize the staleness scan through the simple_keys stack.

ca9f163

Track which simple_keys are still possible by token_number for effici…

01c0521

…ent lookup in fetch.

cjcullen force-pushed the deepflow branch from 04b5057 to 01c0521 Compare November 6, 2019 01:47

niemeyer added a commit that referenced this pull request Nov 6, 2019

Drop scanner iteration over stale simple keys

e228e37

This should simplify the logic and significantly improve performance in edge cases as found and reported on #537 by CJ Cullen.

cjcullen pushed a commit to cjcullen/yaml that referenced this pull request Nov 7, 2019

Drop scanner iteration over stale simple keys

4ccf7e2

This should simplify the logic and significantly improve performance in edge cases as found and reported on go-yaml#537 by CJ Cullen.

cjcullen pushed a commit to cjcullen/yaml that referenced this pull request Nov 7, 2019

Drop scanner iteration over stale simple keys

4ce73f1

This should simplify the logic and significantly improve performance in edge cases as found and reported on go-yaml#537 by CJ Cullen.

cjcullen mentioned this pull request Nov 7, 2019

Port stale simple_keys fix to v2 #543

Merged

niemeyer pushed a commit that referenced this pull request Nov 19, 2019

Port stale simple_keys fix to v2 (#543)

36babc3

This should simplify the logic and significantly improve performance in edge cases as found and reported on #537 by CJ Cullen.

niemeyer closed this Nov 19, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Optimize a couple of nested loops in the scanner. #537

Optimize a couple of nested loops in the scanner. #537

cjcullen commented Nov 4, 2019 •

edited

cjcullen commented Nov 4, 2019

liggitt commented Nov 4, 2019

laverya commented Nov 5, 2019

liggitt commented Nov 5, 2019

niemeyer commented Nov 5, 2019

cjcullen commented Nov 6, 2019

niemeyer commented Nov 6, 2019

niemeyer commented Nov 6, 2019

cjcullen commented Nov 7, 2019

niemeyer commented Nov 7, 2019

cjcullen commented Nov 7, 2019

niemeyer commented Nov 8, 2019

cjcullen commented Nov 9, 2019

niemeyer commented Nov 19, 2019

Optimize a couple of nested loops in the scanner. #537

Optimize a couple of nested loops in the scanner. #537

Conversation

cjcullen commented Nov 4, 2019 • edited

cjcullen commented Nov 4, 2019

liggitt commented Nov 4, 2019

laverya commented Nov 5, 2019

liggitt commented Nov 5, 2019

niemeyer commented Nov 5, 2019

cjcullen commented Nov 6, 2019

niemeyer commented Nov 6, 2019

niemeyer commented Nov 6, 2019

cjcullen commented Nov 7, 2019

niemeyer commented Nov 7, 2019

cjcullen commented Nov 7, 2019

niemeyer commented Nov 8, 2019

cjcullen commented Nov 9, 2019

niemeyer commented Nov 19, 2019

cjcullen commented Nov 4, 2019 •

edited