Skip to content

Commit

Permalink
Deduplicate strings in binlogs
Browse files Browse the repository at this point in the history
When writing out a binary log we now deduplicate strings and dictionaries. This results in a significant performance increase and binlog size reduction. Performance increase is about 2x on average, size reduction is about 4x on average, but up to 30x for large binlogs.

Add two new record kinds: String and NameValueList. A string record is written the first time we encounter a string we need to serialize. The next time we see the string we only write its index in the total list of strings.

Similarly, NameValueList is a list of key and value strings, used for Properties, environment variables and Item metadata. The first time we're writing out a list we write a record, and subsequent times just the index.

This keeps the binlog format streaming, so if the logging is interrupted midway, we will be able to read everything up to that point.

We do not hold on to strings we encountered. Instead we hash them and only preserve the hash code. We rely on the hash function resulting in no collisions, otherwise it could happen that a random string in a binlog would be substituted instead of another one. The hashtables do not significantly add to MSBuild memory usage (20-30 MB for our largest binlogs).

FNV-1a (64-bit hash size) was a carefully chosen hash function for its simplicity, performance, and lack of collisions on all binlogs tested so far. 32-bit hash size (such as string.GetHashCode() was not sufficient and resulted in ~800 collisions for our largest binlog with 2.7 million strings.

This change includes other performance improvements such as inserting a BufferedStream around the stream we're reading or writing. This results in a significant performance improvement.

We introduce a new StringStorage data structure in the binlog reader, for storing the strings on disk instead of reading them all into memory. Strings are loaded into memory on demand. This prevents OOM in 32-bit MSBuild processes when playing back large binlogs. This keeps the memory usage relatively flat when reading.
  • Loading branch information
KirillOsenkov committed Jan 11, 2021
1 parent 70d3538 commit aa6fbab
Show file tree
Hide file tree
Showing 8 changed files with 821 additions and 184 deletions.
3 changes: 2 additions & 1 deletion ref/Microsoft.Build/net/Microsoft.Build.cs
Expand Up @@ -1591,9 +1591,10 @@ public sealed partial class BinaryLogReplayEventSource : Microsoft.Build.Logging
public void Replay(string sourceFilePath) { }
public void Replay(string sourceFilePath, System.Threading.CancellationToken cancellationToken) { }
}
public partial class BuildEventArgsReader
public partial class BuildEventArgsReader : System.IDisposable
{
public BuildEventArgsReader(System.IO.BinaryReader binaryReader, int fileFormatVersion) { }
public void Dispose() { }
public Microsoft.Build.Framework.BuildEventArgs Read() { throw null; }
}
public delegate void ColorResetter();
Expand Down
3 changes: 2 additions & 1 deletion ref/Microsoft.Build/netstandard/Microsoft.Build.cs
Expand Up @@ -1585,9 +1585,10 @@ public sealed partial class BinaryLogReplayEventSource : Microsoft.Build.Logging
public void Replay(string sourceFilePath) { }
public void Replay(string sourceFilePath, System.Threading.CancellationToken cancellationToken) { }
}
public partial class BuildEventArgsReader
public partial class BuildEventArgsReader : System.IDisposable
{
public BuildEventArgsReader(System.IO.BinaryReader binaryReader, int fileFormatVersion) { }
public void Dispose() { }
public Microsoft.Build.Framework.BuildEventArgs Read() { throw null; }
}
public delegate void ColorResetter();
Expand Down
4 changes: 2 additions & 2 deletions src/Build.UnitTests/BuildEventArgsSerialization_Tests.cs
Expand Up @@ -513,7 +513,7 @@ public void ReadingCorruptedStreamThrows()
memoryStream.Position = 0;

var binaryReader = new BinaryReader(memoryStream);
var buildEventArgsReader = new BuildEventArgsReader(binaryReader, BinaryLogger.FileFormatVersion);
using var buildEventArgsReader = new BuildEventArgsReader(binaryReader, BinaryLogger.FileFormatVersion);

Assert.Throws<EndOfStreamException>(() => buildEventArgsReader.Read());
}
Expand Down Expand Up @@ -570,7 +570,7 @@ private void Roundtrip<T>(T args, params Func<T, string>[] fieldsToCompare)
memoryStream.Position = 0;

var binaryReader = new BinaryReader(memoryStream);
var buildEventArgsReader = new BuildEventArgsReader(binaryReader, BinaryLogger.FileFormatVersion);
using var buildEventArgsReader = new BuildEventArgsReader(binaryReader, BinaryLogger.FileFormatVersion);
var deserializedArgs = (T)buildEventArgsReader.Read();

Assert.Equal(length, memoryStream.Position);
Expand Down
2 changes: 2 additions & 0 deletions src/Build/Logging/BinaryLogger/BinaryLogRecordKind.cs
Expand Up @@ -25,5 +25,7 @@ internal enum BinaryLogRecordKind
UninitializedPropertyRead,
EnvironmentVariableRead,
PropertyInitialValueSet,
NameValueList,
String,
}
}
9 changes: 7 additions & 2 deletions src/Build/Logging/BinaryLogger/BinaryLogReplayEventSource.cs
Expand Up @@ -33,7 +33,12 @@ public void Replay(string sourceFilePath, CancellationToken cancellationToken)
using (var stream = new FileStream(sourceFilePath, FileMode.Open, FileAccess.Read, FileShare.Read))
{
var gzipStream = new GZipStream(stream, CompressionMode.Decompress, leaveOpen: true);
var binaryReader = new BinaryReader(gzipStream);

// wrapping the GZipStream in a buffered stream significantly improves performance
// and the max throughput is reached with a 32K buffer. See details here:
// https://github.com/dotnet/runtime/issues/39233#issuecomment-745598847
var bufferedStream = new BufferedStream(gzipStream, 32768);
var binaryReader = new BinaryReader(bufferedStream);

int fileFormatVersion = binaryReader.ReadInt32();

Expand All @@ -45,7 +50,7 @@ public void Replay(string sourceFilePath, CancellationToken cancellationToken)
throw new NotSupportedException(text);
}

var reader = new BuildEventArgsReader(binaryReader, fileFormatVersion);
using var reader = new BuildEventArgsReader(binaryReader, fileFormatVersion);
while (true)
{
if (cancellationToken.IsCancellationRequested)
Expand Down
11 changes: 9 additions & 2 deletions src/Build/Logging/BinaryLogger/BinaryLogger.cs
Expand Up @@ -35,7 +35,13 @@ public sealed class BinaryLogger : ILogger
// - This was used in a now-reverted change but is the same as 9.
// version 9:
// - new record kinds: EnvironmentVariableRead, PropertyReassignment, UninitializedPropertyRead
internal const int FileFormatVersion = 9;
// version 10:
// - new record kinds:
// * String - deduplicate strings by hashing and write a string record before it's used
// * NameValueList - deduplicate arrays of name-value pairs such as properties, items and metadata
// in a separate record and refer to those records from regular records
// where a list used to be written in-place
internal const int FileFormatVersion = 10;

private Stream stream;
private BinaryWriter binaryWriter;
Expand Down Expand Up @@ -137,6 +143,7 @@ public void Initialize(IEventSource eventSource)
}

stream = new GZipStream(stream, CompressionLevel.Optimal);
stream = new BufferedStream(stream, bufferSize: 32768);
binaryWriter = new BinaryWriter(stream);
eventArgsWriter = new BuildEventArgsWriter(binaryWriter);

Expand Down Expand Up @@ -175,8 +182,8 @@ public void Shutdown()
{
eventArgsWriter.WriteBlob(BinaryLogRecordKind.ProjectImportArchive, projectImportsCollector.GetAllBytes());
}
projectImportsCollector.Close();

projectImportsCollector.Close();
projectImportsCollector = null;
}

Expand Down

0 comments on commit aa6fbab

Please sign in to comment.