
[API Proposal]: Introduce logging sampling #5123

Open
evgenyfedorov2 opened this issue Apr 23, 2024 · 7 comments
Labels
api-suggestion Early API idea and discussion, it is NOT ready for implementation untriaged

Comments

@evgenyfedorov2

Background and motivation

As a follow-up to this comment and this issue, I am proposing a more generalized solution below.

Note: the new optional State carrying log record timestamps, which is mentioned in the Risks section, is not part of this proposal. It will be added later, once we agree on everything else included in this API proposal.

Background info provided by @geeknoid:

We propose augmenting the .NET logging feature set to expose mechanisms designed specifically to enable application developers to intelligently control the amount of log data pushed by their application to telemetry backends. By eliding redundant log state early in the log pipeline, application developers can dramatically reduce their telemetry COGS.

Log Record Matching

The mechanisms described below depend on the idea of log record matching. As the application produces log records, log record matching is used to determine the specific treatment to apply to each record.
Through configuration, the application specifies an ordered list of matcher functions which are invoked in sequence to determine whether the log record is a match or not. Applications can have custom matchers with whatever sophistication they want, and the platform includes a simple matcher to address the 99% use case. The simple matcher uses configuration state of the form:

class LogRecordPattern
{
    string? Category { get; set; }
    EventId? EventId { get; set; }
    LogLevel? LogLevel { get; set; }
    KeyValuePair<string, string>[]? Tags { get; set; }   
}

These properties are all optional; they are evaluated in order and have AND semantics, so all defined properties must match. When a match occurs, the corresponding configuration state describes what to do with the log record.
Whereas category, event id, and log level are used to match against specific types of log records produced by the application, the tags are used to match against runtime constraints. For example, using tags you can match log records produced from a particular region or cluster, or records that mention a specific artifact or path, etc. For the initial go at this, we propose to do strict equality matching. In other words, for a record to match, it would need to contain all the listed tags with the specific values. Thus, no support for wildcards, regexes, Boolean combinations, etc. This is not intended to be a SQL query, it’s meant to be relatively fast and simple.
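The matching rules above can be sketched as follows. The `LogRecordPattern` shape is copied from this proposal; the `SimpleMatch` class and its `Matches` signature are hypothetical illustrations, not proposed API:

```csharp
using System;
using System.Collections.Generic;
using Microsoft.Extensions.Logging;

public class LogRecordPattern
{
    public string? Category { get; set; }
    public EventId? EventId { get; set; }
    public LogLevel? LogLevel { get; set; }
    public KeyValuePair<string, string>[]? Tags { get; set; }
}

internal static class SimpleMatch
{
    // AND semantics: every property set on the pattern must match the record;
    // unset (null) properties are ignored.
    public static bool Matches(
        LogRecordPattern pattern,
        string category,
        EventId eventId,
        LogLevel logLevel,
        IReadOnlyDictionary<string, string> tags)
    {
        if (pattern.Category is not null && pattern.Category != category)
            return false;

        if (pattern.EventId is not null && pattern.EventId.Value != eventId)
            return false;

        if (pattern.LogLevel is not null && pattern.LogLevel.Value != logLevel)
            return false;

        // Strict equality for tags: no wildcards, regexes, or Boolean
        // combinations. The record must contain every listed tag with the
        // exact value.
        if (pattern.Tags is not null)
        {
            foreach (var tag in pattern.Tags)
            {
                if (!tags.TryGetValue(tag.Key, out var actual) || actual != tag.Value)
                    return false;
            }
        }

        return true;
    }
}
```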

Global Controls

These controls apply at the level of the whole application, independent of the context in which logging takes place.

Filtering

Matching records can be filtered. Filtering is implemented via a simple callback model that gives a yes/no response to discard or keep a record. Customers can use a built-in statistical sampler to implement this filter or can roll their own.
The statistical sampler we provide works at the application level. It will disable or enable logging of matching records globally for the life of the application based on a random number chosen at startup. This allows a population of application instances to emit the matching records, while the rest of the application instances won’t emit those records.
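A minimal sketch of the "decided once at startup" behavior described above; the type name and constructor parameter are illustrative, not part of the proposal:

```csharp
using System;

// One random draw for the life of the process: this instance either emits
// all matching records or none, so across a fleet of instances roughly
// `probability` of them will emit the matching records.
internal sealed class StartupRandomFilter
{
    private readonly bool _enabled;

    public StartupRandomFilter(double probability)
    {
        _enabled = Random.Shared.NextDouble() < probability;
    }

    // The yes/no filter callback: true means keep the record.
    public bool ShouldEmit() => _enabled;
}
```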

Buffering

Matching records can be buffered in a named circular buffer. The size and number of circular buffers is determined through configuration.
Buffered records are normally not emitted by the application and just get discarded. To emit buffered records, the application makes an explicit API call, supplying the name of the buffer to flush. Through configuration, the application can control whether to temporarily suspend buffering after a flush has occurred. This makes it possible for an application to obtain log records N seconds before and N seconds after an incident.
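The circular overwrite and the suspend-after-flush window can be sketched like this. The `CircularLogBuffer` type is a hypothetical illustration of the behavior, not proposed API:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

internal sealed class CircularLogBuffer
{
    private readonly Queue<string> _records = new();
    private readonly int _capacity;
    private readonly TimeSpan _suspendAfterFlush;
    private DateTimeOffset _suspendedUntil = DateTimeOffset.MinValue;

    public CircularLogBuffer(int capacity, TimeSpan suspendAfterFlush)
    {
        _capacity = capacity;
        _suspendAfterFlush = suspendAfterFlush;
    }

    // Returns false while buffering is suspended, so the caller emits the
    // record normally; this is what captures the "N seconds after" an incident.
    public bool TryAdd(string serializedRecord)
    {
        if (DateTimeOffset.UtcNow < _suspendedUntil)
            return false;

        if (_records.Count == _capacity)
            _records.Dequeue(); // circular: the oldest record is discarded

        _records.Enqueue(serializedRecord);
        return true;
    }

    // Explicit flush: emit the buffered records (the "N seconds before") and
    // suspend buffering for the configured duration.
    public string[] Flush()
    {
        var emitted = _records.ToArray();
        _records.Clear();
        _suspendedUntil = DateTimeOffset.UtcNow + _suspendAfterFlush;
        return emitted;
    }
}
```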

Request-Oriented Controls

These controls are aware of the notion of ASP.NET requests and provide the ability to moderate log output within the scope of individual requests.

Filtering

Matching records are filtered or not within the scope of a single request. Like in the global case above, this filtering can use custom application logic or a simple statistical sampler to control which request’s logging records will be filtered out.
In addition to statistical sampling, or perhaps in combination with it, we can factor in hints delivered through distributed tracing to determine whether to filter out matching records.

Buffering

This is like the global buffering described above, except that the buffers being used are not circular and they are private to an individual request’s lifetime. The buffer expands to capture all matching log records for the duration of a request. The buffer’s contents are discarded at the end of a request’s lifetime, unless the application has signaled explicitly that it should be flushed instead.

How It Works

Here’s the flow of things:
• Configuration contains an ordered list of log record match conditions.
• When a record arrives, it is matched against the conditions until a match is found or the end of the list is reached.
• If a match is not found, then the log record is dispatched normally.
• If a match is found, then the configuration state associated with the match is used to determine how to handle the record. One of four things can happen:
o Global Filtering. The configuration state holds a list of filters to invoke, including statistical samplers. If any of the filters returns false, then the log record is discarded and processing ends.
o Global Buffering. The configuration state holds the name of the global buffer the record should be directed to. The log record is serialized and inserted into the named circular buffer and processing ends.
o Request-Level Filtering. The configuration state holds a list of request-oriented filters to invoke, including statistical samplers. If any of the filters returns false, then the log record is discarded and processing ends.
o Request-Level Buffering. The log record is serialized and inserted in the current request’s buffer.

API Proposal

/// <summary>
/// A pattern to match log records against.
/// </summary>
public class LogRecordPattern
{
    /// <summary>
    /// Gets or sets log record category.
    /// </summary>
    public string? Category { get; set; }

    /// <summary>
    /// Gets or sets log record event ID.
    /// </summary>
    public EventId? EventId { get; set; }

    /// <summary>
    /// Gets or sets log record log level.
    /// </summary>
    public LogLevel? LogLevel { get; set; }

    /// <summary>
    /// Gets or sets log records state tags.
    /// </summary>
    public KeyValuePair<string, string>[]? Tags { get; set; }
}

/// <summary>
/// Enumerates actions, one of which can be executed on a matching log record.
/// </summary>
public enum ControlAction
{
    /// <summary>
    /// Filter log records globally.
    /// </summary>
    GlobalFilter,

    /// <summary>
    /// Buffer log records globally.
    /// </summary>
    GlobalBuffer,

    /// <summary>
    /// Filter log records within an HTTP request flow.
    /// </summary>
    RequestFilter,

    /// <summary>
    /// Buffer log records for the duration of an HTTP request flow.
    /// </summary>
    RequestBuffer
}

/// <summary>
/// Lets you register logging samplers in a dependency injection container.
/// </summary>
public static class LoggingSamplingExtensions
{
    /// <summary>
    /// Enable logging sampling.
    /// </summary>
    /// <param name="builder">An instance of <see cref="ILoggingBuilder"/> to enable sampling in.</param>
    /// <param name="configure">A delegate to fine-tune the sampling.</param>
    /// <returns>The value of <paramref name="builder"/>.</returns>
    public static ILoggingBuilder EnableSampling(this ILoggingBuilder builder, Action<ILoggingSamplingBuilder> configure)
    {
...
    }

    /// <summary>
    /// Set the built-in simple sampler.
    /// </summary>
    /// <param name="builder">An instance of <see cref="ILoggingSamplingBuilder"/> to set the simple sampler in.</param>
    /// <param name="configure">A delegate to fine-tune the sampling.</param>
    /// <returns>The value of <paramref name="builder"/>.</returns>
    public static ILoggingSamplingBuilder SetSimpleSampler(this ILoggingSamplingBuilder builder, Action<LogSamplingOptions> configure)
    {
...
    }

    /// <summary>
    /// Set a logging sampler.
    /// </summary>
    /// <typeparam name="T">A sampler type</typeparam>
    /// <param name="builder">An instance of <see cref="ILoggingSamplingBuilder"/> to set the logging sampler in.</param>
    /// <param name="configure">A delegate to fine-tune the sampling.</param>
    /// <returns>The value of <paramref name="builder"/>.</returns>
    public static ILoggingSamplingBuilder SetSampler<T>(this ILoggingSamplingBuilder builder, Action<LogSamplingOptions> configure)
        where T : class, ILoggingSampler
    {
...
    }
}

/// <summary>
/// Options to configure log sampling.
/// </summary>
public class LogSamplingOptions
{
    /// <summary>
    /// Gets or sets a list of log pattern matchers.
    /// </summary>
    public List<Matcher> Matchers { get; set; }

    /// <summary>
    /// Gets or sets a list of log buffers.
    /// </summary>
    public ISet<LogBuffer> Buffers { get; set; }
}

/// <summary>
/// Represents a component that samples log records.
/// </summary>
public interface ILoggingSampler
{
    /// <summary>
    /// Sample a log record if it matches the <paramref name="logRecordPattern"/>.
    /// </summary>
    /// <param name="logRecordPattern">A log record pattern to match against.</param>
    /// <returns>True, if the log record was sampled. False otherwise.</returns>
    public bool Sample(LogRecordPattern logRecordPattern);
}

/// <summary>
/// An interface for configuring logging sampling.
/// </summary>
public interface ILoggingSamplingBuilder
{
    /// <summary>
    /// Gets the <see cref="IServiceCollection"/> where logging sampling services are configured.
    /// </summary>
    public IServiceCollection Services { get; }
}

/// <summary>
/// Represents a circular log buffer configuration.
/// </summary>
public class LogBuffer
{
    /// <summary>
    /// Gets or sets log buffer name.
    /// </summary>
    public string Name { get; set; }

    /// <summary>
    /// Gets or sets duration to suspend buffering after the flush operation occurred.
    /// </summary>
    public TimeSpan? SuspendAfterFlushDuration { get; set; }

    /// <summary>
    /// Gets or sets a circular buffer duration.
    /// </summary>
    public TimeSpan? BufferingDuration { get; set; }

    /// <summary>
    /// Gets or sets buffer size.
    /// </summary>
    public long? BufferSize { get; set; }
}

/// <summary>
/// A log pattern matcher.
/// </summary>
public class Matcher
{
    /// <summary>
    /// Gets a filtering delegate.
    /// </summary>
    public Func<LogRecordPattern, bool>? Filter { get; }

    /// <summary>
    /// Gets a buffering delegate.
    /// </summary>
    public Action<BufferingTool, LogRecordPattern>? Buffer { get; }

    /// <summary>
    /// Gets a control action to perform in case there is a match.
    /// </summary>
    public ControlAction ControlAction { get; }

    /// <summary>
    /// Initializes a new instance of the <see cref="Matcher"/> class.
    /// </summary>
    /// <param name="pattern">A log record pattern to match.</param>
    /// <param name="filter">A global filtering delegate.</param>
    public Matcher(LogRecordPattern pattern, Func<LogRecordPattern, bool> filter)
    {
        ControlAction = ControlAction.GlobalFilter;
        ...
    }

    /// <summary>
    /// Initializes a new instance of the <see cref="Matcher"/> class.
    /// </summary>
    /// <param name="pattern">A log record pattern to match.</param>
    /// <param name="buffer">A global buffering delegate.</param>
    public Matcher(LogRecordPattern pattern, Action<BufferingTool, LogRecordPattern> buffer)
    {
        ControlAction = ControlAction.GlobalBuffer;
        ...
    }

    /// <summary>
    /// Matches the log record pattern against the supplied <paramref name="pattern"/>.
    /// </summary>
    /// <param name="pattern">A log record pattern to match against.</param>
    /// <returns>True if there is a match. False otherwise.</returns>
    public bool Match(LogRecordPattern pattern)
    {
...
    }
}

API Usage

// implementation example in LoggerConfig.cs:
internal sealed class LoggerConfig
{
    public LoggerConfig(..., ILoggingSampler[] samplers)
    {
        Samplers = samplers;
        ...
    }

    public ILoggingSampler[] Samplers { get; }
}

// implementation example in ExtendedLogger.cs:
    public void Log<TState>(LogLevel logLevel, EventId eventId, TState state, Exception? exception, Func<TState, Exception?, string> formatter)
    {
...
        var pattern = new LogRecordPattern
        {
            Category = _categoryName,  // hypothetical field holding the logger's category
            Tags = GetTags(state),     // hypothetical helper extracting key/value tags from TState
            EventId = eventId,
            LogLevel = logLevel,
        };

        foreach (var sampler in _factory.Config.Samplers)
        {
            if (sampler.Sample(pattern))
            {
                return; // the record was sampled, hence we don't log it.
            }
        }
    }

// an example of the built-in simple sampler implementation:
internal class SimpleSampler : ILoggingSampler
{
    private readonly List<Matcher> _matchers;
    private readonly BufferingTool _bufferingTool;

    public SimpleSampler(IOptions<LogSamplingOptions> options, BufferingTool bufferingTool)
    {
        _matchers = options.Value.Matchers;
        _bufferingTool = bufferingTool;
    }

    public bool Sample(LogRecordPattern logRecordPattern)
    {
        foreach (var matcher in _matchers)
        {
            if (matcher.Match(logRecordPattern))
            {
                switch (matcher.ControlAction)
                {
                    case ControlAction.GlobalFilter:
                        if (!matcher.Filter(logRecordPattern))
                        {
                            return true;
                        }

                        break;
                    case ControlAction.GlobalBuffer:
                        matcher.Buffer(_bufferingTool, logRecordPattern);
                        return true;
                    case ControlAction.RequestFilter:
                        break;
                    case ControlAction.RequestBuffer:
                        break;
                }
            }
        }

        return false;
    }
}

// usage example:
LoggerFactory.Create(builder =>
{
    _ = builder.EnableSampling(samplingBuilder => samplingBuilder

            // add the built-in simple sampler:
            .SetSimpleSampler(o =>
            {
                o.Matchers = new List<Matcher>
                {
                    new Matcher(
                        new LogRecordPattern
                        {
                            LogLevel = LogLevel.Information,
                            Category = "Microsoft.Extensions.Hosting",
                        },

                        // drop 99% of logs of Hosting category:
                        (pattern) => Random.Shared.NextDouble() < 0.01),
                    new Matcher(
                        new LogRecordPattern
                        {
                            LogLevel = LogLevel.Error,
                        },

                       // buffer all error logs:
                        (tool, pattern) => tool.Buffer("MyBuffer")),
                };
                o.Buffers = new HashSet<LogBuffer>
                {
                    new LogBuffer
                    {
                        Name = "MyBuffer",
                        SuspendAfterFlushDuration = TimeSpan.FromSeconds(10),
                        BufferingDuration = TimeSpan.FromSeconds(10),
                        BufferSize = 1_000_000,
                    },
                };
            }));
});

Alternative Designs

Previous proposal is here

Risks

Provided by @geeknoid:

Buffering Challenges

The existing logging infrastructure was not designed to support buffering. In particular, if log records are buffered early in the infrastructure, the records will not have been timestamped, and would get timestamped only when they were emitted from the process. This is not acceptable.
To support the model, we will augment the logging infrastructure to pass new optional state down with every log record flowing through the system, which will include the timestamp at which the record was captured. Log processing engines, like OpenTelemetry, will need to be retrofitted to recognize this state and use the supplied timestamp instead of capturing a timestamp themselves.
If a log provider hasn’t been upgraded to recognize the new optional state, the log records flowing through that provider will be timestamped with an incorrect value. Developers will be made aware to watch for this issue when first enabling buffering and to upgrade their log provider if needed.
A second challenge with buffering is that the TState value supplied by the application to the logging infrastructure is expected to be fully consumed before the Log method returns to the application. If we simply held on to the TState instances during buffering, we could easily end up with invalid state by the time the buffer is flushed. To avoid this, we need to serialize the TState to a neutral form before returning to the application.
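The challenges above can be sketched together: capture a timestamp when the record enters the pipeline and copy the `TState` into a neutral form before `Log` returns. The `BufferedRecord` and `StateSerializer` names are hypothetical illustrations, not proposed API:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using Microsoft.Extensions.Logging;

// Neutral, immutable form of a buffered record, stamped at capture time.
internal sealed record BufferedRecord(
    DateTimeOffset Timestamp,
    LogLevel LogLevel,
    EventId EventId,
    string Message,
    KeyValuePair<string, string>[] Tags);

internal static class StateSerializer
{
    public static BufferedRecord Capture<TState>(
        LogLevel logLevel,
        EventId eventId,
        TState state,
        Exception? exception,
        Func<TState, Exception?, string> formatter)
    {
        // Format the message and copy the tags eagerly: TState may be a
        // pooled or mutable object that is only valid until Log() returns.
        var tags = state is IReadOnlyList<KeyValuePair<string, object?>> list
            ? list.Select(kvp => new KeyValuePair<string, string>(kvp.Key, kvp.Value?.ToString() ?? ""))
                  .ToArray()
            : Array.Empty<KeyValuePair<string, string>>();

        return new BufferedRecord(
            DateTimeOffset.UtcNow, // timestamp at capture, not at eventual emission
            logLevel,
            eventId,
            formatter(state, exception),
            tags);
    }
}
```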

Performance Considerations

The performance profile of an application will be impacted by filtering and buffering behavior. In particular, buffering can lead to potentially substantial spikes in log output when problems occur.
This trade-off is unavoidable; application developers need to be made aware of it so that they can make informed choices. We need to make it easy to turn off all these mechanisms to enable application developers to experience/measure the worst-case scenario.
It’s interesting to consider the effects of global vs request-oriented controls:
• With global controls, an individual application process either outputs matching log records for its lifetime or it doesn’t. So some processes will have a higher computational burden than others. We could consider reevaluating whether a process should output those log records on a regular basis (say every hour), but it’s not clear whether this flexibility is needed/desirable.
• With request-level controls, the performance of individual requests processed by every application process varies. Some requests have more overhead than others.

@evgenyfedorov2 evgenyfedorov2 added untriaged api-suggestion Early API idea and discussion, it is NOT ready for implementation labels Apr 23, 2024
@geeknoid
Member

@noahfalk This is as we discussed a little while ago.

@noahfalk
Member

Cool! I've got this flagged to come and take a closer look but it might be a little bit.

@noahfalk
Member

Thanks for putting this together!

A couple thoughts here:

  • For the timestamping issue, this sounds like something we'd want to firm up prior to doing major work. I think my main concern is that if we expect the commonly used ILoggerProviders (say ConsoleProvider and OTel's provider at least) to test the TState for some pre-existing timestamp information, that is going to impose a performance overhead on every log message being processed, regardless whether the application is using the features proposed here. If we are asking all .NET logging users to take a perf regression so that a small fraction of those users can use new features we should do due diligence to ensure that perf overhead is very small. If the mechanism to recognize log messages that are pre-timestamped requires any new shared APIs in M.E.L.A then we should also ensure we are going to be able to ship that in .NET 9.

    Specific things to firm up:

    1. How do we expect ILoggerProviders to recognize and treat buffered data?
    2. What is the performance overhead for checking a TState to determine it isn't buffered?
    3. What APIs (if any) need to be added to M.E.L.A to make this work?
  • How do we expect the sampling API interacts with the filtering API?
    Is the idea that filters happen first, then sampling happens on what remains?
    If I want some data for some new category to be buffered, do I have to both add a filter rule to include it and a buffer rule to get it buffered, or is it sufficient to just add the buffering rule?

  • From an API design perspective, filtering and sampling seem related so could we make their usage appear more unified? For example if I wanted to include only Microsoft.Extensions.Hosting informational and all errors I could write this:

builder // ILoggingBuilder
        .AddFilter("Microsoft.Extensions.Hosting", LogLevel.Informational)
        .AddFilter(null, LogLevel.Error);

If I wanted to select that same set of logs to be buffered and sampled could I do something like this?

builder // ILoggingBuilder
        .AddBuffer("MyBuffer", durationSecs:10, size: 1_000_000)
        .AddBufferedFilter("MyBuffer", "Microsoft.Extensions.Hosting", LogLevel.Informational,
             samplingRate: 0.01)
        .AddBufferedFilter("MyBuffer", null, LogLevel.Error);               

I'm not tied to that approach at all - it's just an off-the-cuff example of what I am guessing would be possible. My broader hopes from the API design perspective:

  1. Can we make the buffering and sampling appear more unified with the filtering that is already there? Or if you think making them appear unified will create confusion, let's discuss why.
  2. Can we make scenarios we expect to be common more compact? Reducing the number of lines of code, types, and method calls users need to interact with should make it easier to understand and author. This shouldn't preclude having verbose APIs available to handle more complex cases.

There is probably more fine-grained API feedback but I think we'd want to look at the high level stuff before the smaller stuff.

@tarekgh @samsp-msft - you guys also probably want to look at this.

@tarekgh
Member

tarekgh commented May 1, 2024

I second @noahfalk on all the points he raised. One more thing: why are we combining filtering and buffering together? Buffering may be used in more scenarios than just this one. And should we follow the pattern we used in redaction? I mean something like:

    builder.EnableRedaction(...)
               .EnableFiltering(....) // EnableSampler?
               .EnableBuffering(....)

And create a global filter class which can easily be registered to support global filtering. For request filtering, we can have a local filter, and a setting to enable it in ASP.NET so anyone handling HTTP requests can use it? This could simplify the APIs and remove ControlAction, I guess. Just thoughts!

@samsp-msft
Member

I have a number of thoughts, but I think this also needs to be fit into the larger picture with OpenTelemetry.

Filtering based on EventID

I have heard the need from customers that the granularity of the logging levels is not sufficient, and that they would like to be able to collect specific log messages from a lower log level (debug ==1, Critical ==5), or filter out higher level messages as they occur too frequently.

This should happen regardless of whether the log message is associated with a trace, and the frequency of its usage. Ideally is handled through configuration along with the rest of the log levels, so it provides an override of the more granular defaults.

Head-based sampling of per-request telemetry.

If your service is handling a large number of requests, collecting tracing and log messages from all of them uses up a lot of data. Rather than sampling each source of data independently, you want to sample them using the same algorithm based on the traceID. Each source can be given a percentage for what to keep. If the same hash algorithm is used to decide what to keep, then requests that fall under the lowest common denominator will have all their records, while others will be filtered out as applicable.

Ideally the same processing can happen to log messages with a traceid as the distributed tracing so both the trace and log messages will be emitted together. Because this is a head-based sampling algorithm, it doesn't depend on buffering and waiting for completion to be useful.
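Consistent trace-keyed sampling might look like the sketch below: every signal derives the same stable value from the trace ID, so logs and traces for the same request are kept or dropped together. The hashing scheme here is illustrative; OpenTelemetry SDKs define their own:

```csharp
using System;
using System.Diagnostics;

internal static class TraceIdSampler
{
    public static bool ShouldKeep(ActivityTraceId traceId, double samplingRate)
    {
        // Derive a stable value in [0, 1) from the trace id bytes.
        Span<byte> bytes = stackalloc byte[16];
        traceId.CopyTo(bytes);
        ulong hash = BitConverter.ToUInt64(bytes);
        double position = hash / (double)ulong.MaxValue;

        // Every signal that uses the same hash and a rate >= samplingRate
        // keeps this request, so requests under the lowest common rate are
        // retained completely across sources.
        return position < samplingRate;
    }
}
```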

Rate limiting

The classic Application Insights SDK has a sampling algorithm that supports rate-based limiting, so you can cap the tracing at 5 requests/sec (as an example). I believe it's a semi-adaptive algorithm, so it's not just the first 5 requests that will be emitted; it uses an average over a timespan to determine what to retain.

I think we need the same algorithm to be implemented as a processor for OpenTelemetry so it's not dependent on being used with a specific exporter. When Aspire apps are deployed to ACA, this mechanism could be used to ensure that the dashboard is not overloaded with results.
The great thing about this kind of algorithm is that it's unlikely to filter out records when doing local debugging (unless load testing), so you are not left wondering where your telemetry went.

  1. Hard limits
    The adaptive algorithm can be augmented with additional safeguards to ensure that you don't get too much data or too little data. For example, you could have overrides of a min sampling percentage, so you get 0.1% of requests regardless of the rate limit.

It could also have max limits to reduce the chance of coding/configuration errors causing high bills. For example have a max-log-messages-per-traceid, max-spans-per-traceid and max-events-per-span which would be calculated and limit the data per request, per process. That should be doable on a per-process basis without needing to do tail sampling as long as there is an event for when the request is completed, and therefore the count stats can be removed.

In the case that the limits are reached, ideally an additional log message is emitted that details that messages have been suppressed, and what their counts were. This can then be used to understand the patterns that have caused the suppression to occur and the extent of the problem.
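A deliberately simplified per-second cap like the one described above might look like this; the real Application Insights algorithm is adaptive and averages over a timespan, and this single-threaded sketch ignores that sophistication:

```csharp
using System;

// Minimal fixed-window rate limiter: keeps at most `maxPerSecond` records
// per one-second window and drops the rest. Not thread-safe; a real
// implementation would use Interlocked operations or a lock.
internal sealed class FixedRateSampler
{
    private readonly int _maxPerSecond;
    private long _windowStartTicks;
    private int _countInWindow;

    public FixedRateSampler(int maxPerSecond) => _maxPerSecond = maxPerSecond;

    public bool ShouldKeep()
    {
        long now = Environment.TickCount64;
        if (now - _windowStartTicks >= 1000)
        {
            // A new one-second window has started: reset the counter.
            _windowStartTicks = now;
            _countInWindow = 0;
        }

        return ++_countInWindow <= _maxPerSecond;
    }
}
```

An adaptive variant would replace the hard window with a moving average and, as suggested above, emit a summary record stating how many messages were suppressed when the cap is hit.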

Log messages not associated with a trace

Not all log messages are going to be associated with a request, and therefore have the context of a traceid. These should have the same concepts of fixed rate and time-based sampling applied to them. The values for those rates should be independent from those for trace-based sampling.

Processing order

  • I suspect the EventID based filtering should be applied first, and be performed regardless of the other options. If a message is elevated, or suppressed, it should then go through the rest of the algorithm to determine what happens.

  • Ideally the decision for whether a trace is sampled is made at the time the traceid is retrieved from the request headers, and the Recorded bit is set. There should be an option as to whether to:

    • always honor the Recorded bit from the incoming request
    • ignore the incoming Recorded bit and use the algorithm to set the value
    • only use the algorithm if the incoming request doesn't have the recorded bit

    This should then set the Recorded bit on the Activity so that code can use that bit to determine whether to write telemetry or not.

    • Ideally the IsEnabled() method for ILogger should use the Activity Recorded bit to determine if the message should be written. If no activity is present, it short-circuits that check.

    • In the output processing for OpenTelemetry an additional filter is applied that will remove Traces and Logs that are trace specific, so that the above is an optimization: bad logging code may have wasted CPU, but it will not affect the output.

Tail-based sampling

All of the above is applied on a per-process basis and is decided up front, without knowledge of the final success/failure of the request. In most high-performance systems, this is the most cost-effective way to make decisions.
An alternative would be to use an out-of-process, tail-based sampler, typically implemented in an agent such as the OTel Collector or Strato. As the implementation of that is external to the process, it's not specific to .NET and so is not part of this feature set.

@tarekgh
Member

tarekgh commented May 9, 2024

I'm just checking in to see if we're moving forward with this. The reason I ask is because we have the tracking issue dotnet/runtime#82465, and we need to determine whether we should close it if we're going ahead with the proposal mentioned here. Naturally, we can collaborate to decide what needs to be incorporated into the runtime, if anything.

@evgenyfedorov2
Author

Yes, we will be moving forward. I've been busy with other things recently, and I still am, which is why there's been a delay.
