Deduplicate strings in binlogs #6017

Merged
merged 6 commits into from Feb 8, 2021

Commits on Jan 11, 2021

  1. Deduplicate strings in binlogs

    When writing out a binary log, we now deduplicate strings and dictionaries. This speeds up logging by about 2x on average and shrinks binlogs by about 4x on average, and by up to 30x for large binlogs.
    
    Add two new record kinds: String and NameValueList. A String record is written the first time we encounter a string we need to serialize. Every later occurrence of that string is written as just its index in the running list of strings.
    
    Similarly, NameValueList is a list of key and value strings, used for properties, environment variables and item metadata. The first time we write out a given list we write a full record; on subsequent occurrences we write just its index.
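
The write-once-then-index scheme described above can be sketched roughly as follows. This is an illustration only, not the MSBuild implementation: the record tags, the `DedupWriter` class, and the byte layout are all hypothetical, and the sketch keys its tables by the strings themselves rather than by hash code.

```python
import io
import struct

# Hypothetical record-kind tags; the real binlog format defines its own.
RECORD_STRING = 1
RECORD_NAME_VALUE_LIST = 2
RECORD_EVENT = 3


class DedupWriter:
    """Writes each distinct string and name/value list once, then refers to it by index."""

    def __init__(self, stream):
        self.stream = stream
        self.string_index = {}  # string -> index in the running list of strings
        self.list_index = {}    # tuple of (name index, value index) -> list index

    def _write_int(self, value):
        self.stream.write(struct.pack("<i", value))

    def intern_string(self, text):
        index = self.string_index.get(text)
        if index is None:
            # First occurrence: emit a String record and remember its index.
            index = len(self.string_index)
            self.string_index[text] = index
            data = text.encode("utf-8")
            self._write_int(RECORD_STRING)
            self._write_int(len(data))
            self.stream.write(data)
        return index

    def intern_name_value_list(self, pairs):
        key = tuple((self.intern_string(n), self.intern_string(v)) for n, v in pairs)
        index = self.list_index.get(key)
        if index is None:
            # First occurrence: emit a NameValueList record made of string indices.
            index = len(self.list_index)
            self.list_index[key] = index
            self._write_int(RECORD_NAME_VALUE_LIST)
            self._write_int(len(key))
            for name_idx, value_idx in key:
                self._write_int(name_idx)
                self._write_int(value_idx)
        return index

    def write_event(self, message, properties):
        message_idx = self.intern_string(message)
        properties_idx = self.intern_name_value_list(properties)
        # The event itself carries only indices.
        self._write_int(RECORD_EVENT)
        self._write_int(message_idx)
        self._write_int(properties_idx)


buffer = io.BytesIO()
writer = DedupWriter(buffer)
env = [("Configuration", "Debug"), ("Platform", "x64")]
writer.write_event("Build started", env)
first_size = buffer.tell()
writer.write_event("Build started", env)
second_size = buffer.tell() - first_size
```

In this sketch the second, identical event costs only three integers, because every string and the list it refers to were already emitted by the first event.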
    
    This keeps the binlog format streaming, so if the logging is interrupted midway, we will be able to read everything up to that point.
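
A minimal sketch of why this holds, assuming a made-up record layout (an 8-byte header of kind and value, not the real binlog format): because a string is always defined before anything refers to it, a reader can decode any prefix of the stream and simply stop at the first incomplete record.

```python
import io
import struct

# Hypothetical tags for this sketch only.
RECORD_STRING = 1   # value = byte length, followed by UTF-8 bytes
RECORD_MESSAGE = 2  # value = index of a previously defined string


def encode(messages):
    """Encode messages, writing each distinct string once before its first use."""
    out = io.BytesIO()
    seen = {}
    for message in messages:
        if message not in seen:
            seen[message] = len(seen)
            data = message.encode("utf-8")
            out.write(struct.pack("<ii", RECORD_STRING, len(data)))
            out.write(data)
        out.write(struct.pack("<ii", RECORD_MESSAGE, seen[message]))
    return out.getvalue()


def read_records(stream):
    """Yield messages until the stream ends or a record turns out to be cut off."""
    strings = []
    while True:
        header = stream.read(8)
        if len(header) < 8:
            return  # clean end of stream, or a truncated header
        kind, value = struct.unpack("<ii", header)
        if kind == RECORD_STRING:
            data = stream.read(value)
            if len(data) < value:
                return  # string payload was cut off
            strings.append(data.decode("utf-8"))
        elif kind == RECORD_MESSAGE:
            yield strings[value]
        else:
            return  # unknown record kind: stop rather than guess


log = encode(["restore", "compile", "restore", "link"])
complete = list(read_records(io.BytesIO(log)))
# Simulate an interrupted build by chopping off the tail of the log.
recovered = list(read_records(io.BytesIO(log[:-5])))
```

Here `recovered` holds the first three messages; only the record that was cut off is lost.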
    
    We do not hold on to the strings we encounter. Instead we hash them and keep only the hash code. This relies on the hash function producing no collisions: if two different strings collided, one would be silently substituted for the other in the binlog. The hashtables do not significantly add to MSBuild memory usage (20-30 MB for our largest binlogs).
    
    FNV-1a (64-bit hash size) was carefully chosen as the hash function for its simplicity, performance, and lack of collisions on all binlogs tested so far. A 32-bit hash size (such as string.GetHashCode()) was not sufficient and resulted in ~800 collisions for our largest binlog with 2.7 million strings.
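
For reference, a minimal sketch of 64-bit FNV-1a using the standard published offset basis and prime. It hashes raw bytes; which byte or character encoding the MSBuild implementation feeds into the hash is not stated here, so treat that as an assumption. The second function is a rough birthday-bound estimate (also an addition, not from the commit) of how many colliding pairs to expect for a given hash width.

```python
FNV_OFFSET_BASIS_64 = 0xCBF29CE484222325
FNV_PRIME_64 = 0x100000001B3
MASK_64 = 0xFFFFFFFFFFFFFFFF


def fnv1a_64(data: bytes) -> int:
    """64-bit FNV-1a: XOR in each byte, then multiply by the prime (mod 2**64)."""
    h = FNV_OFFSET_BASIS_64
    for byte in data:
        h ^= byte
        h = (h * FNV_PRIME_64) & MASK_64
    return h


def expected_colliding_pairs(count: int, bits: int) -> float:
    """Birthday-bound estimate: number of pairs divided by number of hash values."""
    return count * (count - 1) / 2 / 2 ** bits


pairs_32 = expected_colliding_pairs(2_700_000, 32)
pairs_64 = expected_colliding_pairs(2_700_000, 64)
```

For 2.7 million strings the estimate gives roughly 850 colliding pairs at 32 bits, in line with the ~800 observed, and well under one in a million at 64 bits.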
    
    This change includes other performance improvements, such as inserting a BufferedStream around the stream we're reading or writing, which significantly improves performance.
    
    We introduce a new StringStorage data structure in the binlog reader that stores strings on disk instead of reading them all into memory, loading them on demand. This keeps memory usage relatively flat when reading and prevents OOM in 32-bit MSBuild processes when playing back large binlogs.
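
The disk-backed idea can be sketched as below. This is a hypothetical illustration, not the actual StringStorage class: the `DiskStringStore` name, its methods, and the UTF-8 temp-file layout are assumptions. Only an (offset, length) pair per string stays in memory; the text itself is read back from the file when requested.

```python
import tempfile


class DiskStringStore:
    """Keeps string contents in a temporary file and only small spans in memory."""

    def __init__(self):
        self._file = tempfile.TemporaryFile()  # binary read/write, removed on close
        self._spans = []                       # (offset, byte length) per stored string

    def add(self, text):
        data = text.encode("utf-8")
        self._file.seek(0, 2)                  # always append at the end of the file
        offset = self._file.tell()
        self._file.write(data)
        self._spans.append((offset, len(data)))
        return len(self._spans) - 1            # handle the caller keeps instead of the string

    def get(self, index):
        offset, length = self._spans[index]
        self._file.seek(offset)
        return self._file.read(length).decode("utf-8")

    def close(self):
        self._file.close()


store = DiskStringStore()
first = store.add("Target CoreCompile")
second = store.add("Đường dẫn: C:\\src\\proj")
loaded_second = store.get(second)
loaded_first = store.get(first)
store.close()
```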
    KirillOsenkov committed Jan 11, 2021
    aa6fbab

Commits on Jan 15, 2021

  1. Add some comments.

    KirillOsenkov committed Jan 15, 2021
    b324bad
  2. e6ee9a5

Commits on Jan 17, 2021

  1. Make some fields readonly.

    Reduce maximum strings allocated in memory to 2GB (1 billion chars).
    KirillOsenkov committed Jan 17, 2021
    9d8f857
  2. Introduce a RedirectionScope in BuildEventArgsWriter

    This avoids manually switching from currentRecordWriter to originalBinaryWriter in three different places. It's also easier this way to find the places where the switch happens.
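
A scope object of this kind can be sketched with a context manager. Everything here is hypothetical (the `EventWriter` class, the `redirect_to` method, and the toy record format); it only illustrates the pattern of switching the current writer in one place and restoring it automatically on exit.

```python
import io
from contextlib import contextmanager


class EventWriter:
    def __init__(self, output):
        self.original_writer = output  # the real output stream
        self.current_writer = output   # where write() currently sends bytes

    @contextmanager
    def redirect_to(self, target):
        """Temporarily send writes to `target`; restore the previous writer on exit."""
        previous = self.current_writer
        self.current_writer = target
        try:
            yield target
        finally:
            self.current_writer = previous

    def write(self, data):
        self.current_writer.write(data)

    def write_string_record(self, text):
        # Goes straight to the real output, even while an event is being buffered.
        with self.redirect_to(self.original_writer):
            self.write(b"S:" + text.encode("utf-8") + b"\n")


output = io.BytesIO()
writer = EventWriter(output)
record = io.BytesIO()
with writer.redirect_to(record):
    writer.write(b"E:event body\n")      # buffered in the temporary record
    writer.write_string_record("hello")  # bypasses the buffer
writer.write(record.getvalue())          # flush the buffered event afterwards
result = output.getvalue()
```

Because the restore happens in a `finally` block, the writer is switched back even if serializing the event raises.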
    KirillOsenkov committed Jan 17, 2021
    9ba9c9e
  3. 77a4ff3