The boto3sqs instrumentation is broken and can't be fixed #2509

Dr-Emann · 2024-05-08T15:59:30Z

This is a strong statement, but I believe it to be true: the boto3sqs instrumentation is fundamentally flawed, and cannot be fixed.

The Boto3-sqs instrumentation attempts to automatically instrument code which attempts to use SQS messages. The way it does this is:

Messages are recieved in a batch via the recieve_message boto3 call, which retrieves up to 10 messages from the queue
It starts a span for every message received in the batch
When the message is accessed (via indexing the returned list of messages, or iterating over the returned list of messages), the span created for that message is activated
When the delete_message boto3 function is called, it ends the corresponding span if it finds one for that message

Issues

This leads to some issues (partly copied from #1599 (comment)):

All spans for a batch of messages are started at the same time, rather than starting when processing for that individual message occcurs
To determine the end of a span the instrumentation relies that user code calls sqs.delete_message API. In case of an exception in the user code delete_message might never be called, leaving behind an unended span. Not calling delete_message should be fine. It may be that this application sends the message elsewhere, and that other component is in charge of delete the message, for instance. In this case, memory use will grow without bound!
There are many ways for user code to not match the expected use of boto3 which will result in various levels of misbehavior. For example:
1. Using multiprocessing to handle each message in parallel: a totally different process will end up calling delete_message
2. As discussed above, not sending the delete_message call will lead to unbounded memory growth
3. Iterating the list of messages multiple times, or creating a new dict/list from the list passed, slicing the returned list
A single global current_context_token is used, if multiple threads are performing SQS operations, they will overwrite the current token, threads will end each others spans, etc.

The otel context stuff is built heavily on the assumption that any context operations will be fully nested, not overlapping. Any context operations (like wrapping in a new span) around the delete_message call will severely break
this assumption

response = client.receive_message(
    QueueUrl="http://sqs.us-east-1.host.docker.internal:4566/000000000000/fpc-events-local.fifo",
    MaxNumberOfMessages=10,
)
messages = response.get("Messages", [])
# Indexing into the messages attaches a new context (1) with a new span as current, and remembers the previous context (0)
message = messages[0]
# start_as_current_span attaches a new context (2) with a new span as current and remembers the previous context (1)
with tracer.start_as_current_span("inner_span"):
    # On message delete, the current context is reset to before the indexing (0)
    client.delete_message(
        QueueUrl="http://sqs.us-east-1.host.docker.internal:4566/000000000000/fpc-events-local.fifo",
        ReceiptHandle=message["ReceiptHandle"],
    )
# On exit of the with block, the current context is reset to before the start_as_current_span (1)
# The current context now has the span for the `messages[0]` as the current span, all future spans
# will use that as their parent

Note: This is a fundamental flaw of the fact that the boto3sqs instrumentation attempts to use spans which are not lexically scoped, around arbitrary calling code.

Some of these issues are just fixable bugs, but some of them (especially the last) I believe are fundamental, insurmountable issues with the very concept of attempting to automatically instrument this process.

The text was updated successfully, but these errors were encountered:

Dr-Emann · 2024-05-08T17:27:57Z

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

The boto3sqs instrumentation is broken and can't be fixed #2509

The boto3sqs instrumentation is broken and can't be fixed #2509

Dr-Emann commented May 8, 2024 •

edited

Dr-Emann commented May 8, 2024

xrmx commented Jun 3, 2024

Dr-Emann commented Jun 3, 2024

The boto3sqs instrumentation is broken and can't be fixed #2509

The boto3sqs instrumentation is broken and can't be fixed #2509

Comments

Dr-Emann commented May 8, 2024 • edited

Issues

Dr-Emann commented May 8, 2024

xrmx commented Jun 3, 2024

Dr-Emann commented Jun 3, 2024

Dr-Emann commented May 8, 2024 •

edited