
Deduplicate strings in binlogs #6017

Merged: 6 commits merged into master from dev/kirillo/dedup on Feb 8, 2021

Conversation

@KirillOsenkov (Member)

When writing out a binary log, we now deduplicate strings and dictionaries. This yields a significant performance increase and binlog size reduction: the performance increase is about 2x on average, and the size reduction is about 4x on average, but up to 30x for large binlogs.

We add two new record kinds: String and NameValueList. A String record is written the first time we encounter a string we need to serialize; the next time we see the same string, we write only its index in the running list of strings.

Similarly, a NameValueList is a list of name/value string pairs, used for properties, environment variables, and item metadata. The first time we write out a given list we write a full record; subsequent times we write just the index.

This keeps the binlog format streaming, so if the logging is interrupted midway, we will be able to read everything up to that point.

We do not hold on to the strings we have encountered. Instead we hash them and preserve only the hash code. We rely on the hash function producing no collisions; otherwise a random string in a binlog could be substituted for another. The hashtables do not add significantly to MSBuild memory usage (20-30 MB for our largest binlogs).
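
Roughly, the write-side bookkeeping looks like this (a sketch with placeholder names — WriteStringRecord, WriteStringIndex, and FnvHash64 are illustrative, not the actual BuildEventArgsWriter API); NameValueList records are deduplicated the same way:

```csharp
// Illustrative fragment: only the 64-bit hash is retained, mapped to the
// index the string received when its record was first written.
private readonly Dictionary<ulong, int> hashToIndex = new Dictionary<ulong, int>();
private int nextStringIndex;

private void WriteDeduplicatedString(string text)
{
    ulong hash = FnvHash64.Hash(text); // see the FNV-1a sketch below

    if (hashToIndex.TryGetValue(hash, out int index))
    {
        // Seen before: write only the index into the running list of strings.
        WriteStringIndex(index);       // hypothetical helper
    }
    else
    {
        // First occurrence: emit a full String record and remember its index.
        hashToIndex[hash] = nextStringIndex;
        WriteStringRecord(text);       // hypothetical helper
        nextStringIndex++;
    }
}
```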

FNV-1a (with a 64-bit hash) was chosen carefully for its simplicity, performance, and lack of collisions on all binlogs tested so far. A 32-bit hash (such as string.GetHashCode()) was not sufficient and resulted in ~800 collisions for our largest binlog with 2.7 million strings.
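
For reference, a minimal 64-bit FNV-1a over a string's UTF-16 code units (standard offset basis and prime; the exact byte handling in the real implementation may differ):

```csharp
internal static class FnvHash64
{
    private const ulong OffsetBasis = 14695981039346656037UL;
    private const ulong Prime = 1099511628211UL;

    public static ulong Hash(string text)
    {
        ulong hash = OffsetBasis;
        foreach (char ch in text)
        {
            hash = (hash ^ (byte)ch) * Prime;         // low byte of the code unit
            hash = (hash ^ (byte)(ch >> 8)) * Prime;  // high byte of the code unit
        }
        return hash;
    }
}
```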

This change includes other improvements, such as wrapping the stream we're reading or writing in a BufferedStream, which by itself yields a significant performance gain.
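
A sketch of the stream layering, assuming the buffer sits between the BinaryWriter and the compressed stream; the exact ordering and buffer size used in the logger may differ:

```csharp
using System.IO;
using System.IO.Compression;

static BinaryWriter CreateBinlogWriter(string filePath)
{
    Stream stream = new FileStream(filePath, FileMode.Create);
    stream = new GZipStream(stream, CompressionLevel.Optimal);
    stream = new BufferedStream(stream, bufferSize: 32768); // buffer size is an assumption
    return new BinaryWriter(stream);
}
```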

We introduce a new StringStorage data structure in the binlog reader that stores strings on disk instead of reading them all into memory; strings are loaded into memory on demand. This prevents OOM in 32-bit MSBuild processes when playing back large binlogs and keeps memory usage relatively flat while reading.
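
The idea, as a minimal sketch (DiskBackedStringStore is an illustrative stand-in, not the shipped StringStorage class): append each string's bytes to a temporary file, keep only offsets and lengths in memory, and read a string back only when it's requested.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

internal sealed class DiskBackedStringStore : IDisposable
{
    private readonly FileStream pageFile = new FileStream(
        Path.GetTempFileName(), FileMode.Create, FileAccess.ReadWrite);
    private readonly List<(long Offset, int Length)> locations = new List<(long, int)>();

    // Appends the string's UTF-8 bytes to the page file and returns its index.
    public int Add(string text)
    {
        byte[] bytes = Encoding.UTF8.GetBytes(text);
        long offset = pageFile.Seek(0, SeekOrigin.End);
        pageFile.Write(bytes, 0, bytes.Length);
        locations.Add((offset, bytes.Length));
        return locations.Count - 1;
    }

    // Seeks back to the recorded offset and materializes the string on demand.
    public string Get(int index)
    {
        (long offset, int length) = locations[index];
        byte[] bytes = new byte[length];
        pageFile.Seek(offset, SeekOrigin.Begin);
        pageFile.Read(bytes, 0, length);
        return Encoding.UTF8.GetString(bytes);
    }

    public void Dispose() => pageFile.Dispose();
}
```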

Testing

I've done extensive testing for correctness, performance, and hash collisions.

Hash function selection is documented here:
https://github.com/KirillOsenkov/MSBuildStructuredLog/wiki/String-Hashing
In summary, FNV-1a produces no collisions with a 64-bit hash size and has very decent performance (11 seconds to hash all the strings in our largest binlog, 2.7 million strings). Additionally, the implementation is simple and minimal.

Playback:
I've tested various scenarios for playback:

  1. Binlog format
  • playing back a legacy binlog (format version < 10), with no String or NameValueList records
  • playing back a version 10 binlog
  2. Process bitness
  • on 32-bit we need to use the StringStorage helper class when reading new binlogs, to avoid OOM
  • on 32-bit we're fine when reading legacy binlogs, as they're fully streaming
  • on 64-bit we just store all strings in memory, which results in much faster performance

The data looks good (screenshot omitted).

Binlog size reduction:

| Strings | Binlog | Legacy binlog size (KB) | New binlog size (KB) |
| ---: | --- | ---: | ---: |
| 2,744,490 | vswin.binlog | 703,863 | 274,714 |
| 2,711,714 | 790.binlog | 791,488 | 283,364 |
| 1,633,734 | 23GB.binlog | 22,558,820 | 172,617 |
| 681,346 | mmitche.binlog | 590,938 | 157,915 |
| 302,685 | roslyn.binlog | 194,354 | 49,277 |
| 288,136 | dotnet.binlog | 64,577 | 24,599 |
| 220,177 | vsmac.binlog | 143,178 | 35,095 |
| 42,658 | Ide.binlog | 20,753 | 12,334 |

Playback of binlog into binlog, perf improvement:
Playback of legacy vswin.binlog by legacy: 6:54
Playback of legacy vswin.binlog by new: 2:56
Playback of new vswin.binlog on 32-bit: 3:17
Playback of new vswin.binlog on 64-bit: 2:13

In summary, I expect builds to be faster (up to 4 minutes of improvement for really large builds) and binlogs to be smaller (up to ~50% for small binlogs, 2-4x for medium-sized ones, and 131x, from 23 GB to 172 MB, for our most extreme binlog).

@KirillOsenkov (Member, Author)

The latest viewer version (2.1.300) has already been updated to read the binlog format version 10.

@KirillOsenkov (Member, Author)

Hey @mmitche your binlog goes down in size from 590 MB to 157 MB.

@aolszowka I'm probably too late with this, but your binlog goes down from 23 GB to 172 MB! You get the first prize!

@drewnoakes (Member) left a comment

Amazing PR. Just some small things I saw while reading through.

Review threads (resolved): src/Build/Logging/BinaryLogger/BuildEventArgsWriter.cs, src/Build/Microsoft.Build.csproj
@aolszowka

> @aolszowka I'm probably too late with this, but your binlog goes down from 23 GB to 172 MB! You get the first prize!

This is a HUGE win, I'll make sure to pass it along and keep it in my toolbox :) Thank you!

@rainersigwald (Member) left a comment

I'm so excited about this! 🙌🏻

Should we add some test cases around deliberately malformed logs (misordered, unused indexes, that sort of thing)?

A few nits, but looking great. My only big question is tactical: should we guard this behind a change wave? If we fork now and put this in at the very beginning of 16.10, we can probably just make it unconditional, since making the write side conditional would be a bit inconvenient.

@KirillOsenkov (Member, Author)

I'm personally not in a rush for 16.9, so I'd rather wait until we fork for 16.9 and then merge into master, which gives us lots of runway to dogfood this for 16.10. That way, if we have to tweak the file format, we'll have more time.

@Forgind (Member) left a comment

Love it!

@KirillOsenkov (Member, Author)

I actually discovered that the CLSCompliant warnings have disappeared. Maybe in early iterations the types I added were public and that's why I added the NoWarns? But just removing them now seems to be fine.

@benvillalobos (Member) left a comment

Very quick skim, submitting small nits. The results look great 😀

@ladipro (Member) left a comment

Awesome wins, and shame on GZipStream for leaving so much redundant data in the compressed stream.

Not having encountered hash code collisions in test data doesn't mean they don't exist. I understand that this is "just logging" and if, with an extremely low probability, the log lies, we are kind of fine with that. Still, isn't it fundamentally a bug? It's unusual for a component to have this probabilistic correctness characteristic in an otherwise precise system. Do you happen to know how much of the wins would be lost if the string lookup were implemented "correctly"? Thank you!

Review thread (resolved): src/Build/Logging/BinaryLogger/BuildEventArgsReader.cs
@KirillOsenkov (Member, Author)

@ladipro I totally agree with your point about hash collisions. It's a very valid concern.

Without relying on hashes, though, we'd have to hold on to all the strings we saw, because that's the only way to be absolutely sure the strings match. Holding on to strings is not an option in a 32-bit process (we have gigabytes of them), even if we chose a single string per hashcode as a "sample". Writing them to a page file like I do in the reader is an option, but you'd need to constantly seek to read a string back from disk for every new string you encounter. It would kill perf and defeat the purpose.

I'm afraid that if we want the perf wins, we have to allow a small possibility of a collision, where in a random binlog (1) a random string you happened to be looking at (2) is switched to another random string. You only look at a tiny fraction of all the strings in a binlog (there are millions), and the chance that you happened to look at the corrupted one is infinitesimally small (given that I haven't seen any collisions in our top 10 largest binlogs). I say it's worth the risk!
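
As a rough back-of-the-envelope check (treating the 64-bit hash as uniform), the birthday bound for ~2.7 million distinct strings gives:

```math
P(\text{collision}) \approx \frac{n(n-1)}{2 \cdot 2^{64}} \approx \frac{(2.7 \times 10^{6})^{2}}{2 \times 1.8 \times 10^{19}} \approx 2 \times 10^{-7}
```

i.e., roughly one binlog in five million of that size would be expected to contain even a single colliding pair.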

@KirillOsenkov (Member, Author)

I guess we could also store the string length and the first character (in addition to hash), to further drastically reduce the chance of collision. But honestly I don't think it's even worth it.

@cdmihai (Contributor) commented Jan 14, 2021

Regarding the hash problem and the large memory required to store the strings, wouldn't using MSBuild's string interner (maybe @ladipro's rewritten version of it) fix the problem? The interner also dedupes strings, so the in-memory size should be the same as the deduped string size from the binlog.

@aolszowka

> Writing them to a page file like I do in the reader is an option, but you'd need to constantly seek to read a string back from disk for every new string you encounter. It would kill perf and defeat the purpose.

If there were a big concern about a collision, what about a post-process step? You opt in to "do it fast with a hash" but allow a slower pass that captures all the strings and performs this verification; at that point, if there were a collision you could bark, and if there were none, you'd reap the gains when sharing the binlog.

That does nothing for the perf, though, which I assume was driving a lot of this?

I suspect the big concern is trying to capture this as a one-off; realistically, though, by the time we're digging through the binlogs it's probably reproducible on our end, so it's not like we're totally out of luck.

@ladipro (Member) commented Jan 14, 2021

> Regarding the hash problem and the large memory required to store the strings, wouldn't using MSBuild's string interner (maybe @ladipro's rewritten version of it) fix the problem?

The interner holds on to strings only as long as something else is keeping them alive so it wouldn't guarantee correctness either. It's basically a cache of recently seen and not yet collected strings, not of all strings.

@ladipro (Member) commented Jan 14, 2021

> I guess we could also store the string length and the first character (in addition to hash), to further drastically reduce the chance of collision. But honestly I don't think it's even worth it.

Agreed that it wouldn't be worth it. It's basically like using a wider hashcode but with poorer collision characteristics.

> I'm afraid that if we want the perf wins, we have to allow a small possibility of a collision, where in a random binlog (1) a random string you happened to be looking at (2) is switched to another random string. You only look at a tiny fraction of all the strings in a binlog (there are millions), and the chance that you happened to look at the corrupted one is infinitesimally small (given that I haven't seen any collisions in our top 10 largest binlogs). I say it's worth the risk!

Let's do it!

@ladipro (Member) left a comment

Left a few comments, mostly nits.

Commits added:
• Reduce maximum strings allocated in memory to 2GB (1 billion chars).
• Avoid manually switching from currentRecordWriter to originalBinaryWriter in three different places; it's also easier this way to find the places where the switch happens.
@KirillOsenkov (Member, Author)

I think I've answered or resolved all comments, but let me know if you have more or disagree with some of my resolutions. Meanwhile I'll do some more final testing on it.

@ladipro (Member) left a comment

Thank you!

@Forgind added the merge-when-branch-open label ("PRs that are approved, except that there is a problem that means we are not merging stuff right now") on Jan 22, 2021
@KirillOsenkov (Member, Author) commented Jan 26, 2021

I'm happy to report a nice performance improvement from this PR.
With /bl, master is 32s and this PR is 26s.
Pure MSBuild without any logging is 15s; with a diagnostic text file log it's 55s.

| master | #6017 | no /bl | /flp1:v=diag |
| ---: | ---: | ---: | ---: |
| 32.377 | 27.737 | 15.476 | 55.61 |
| 31.758 | 26.561 | 15.416 | 56.145 |
| 31.481 | 26.486 | 15.536 | 55.72 |

(All builds single-core.) We see that /bl makes the build about twice as slow, while a diag text log makes it almost 4x as slow. I think we can reduce the binlog overhead further.

@KirillOsenkov (Member, Author)

Shall we merge this?

@Forgind merged commit 62f5140 into master on Feb 8, 2021
@Forgind deleted the dev/kirillo/dedup branch on February 8, 2021 17:00
@Forgind (Member) commented Feb 8, 2021

Yep!

I'm still trying to merge things in a gradual way, but I'll get to your other PR in the next few days.

Labels: merge-when-branch-open, performance