Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

The boto3sqs instrumentation is broken and can't be fixed #2509

Open
Dr-Emann opened this issue May 8, 2024 · 3 comments
Open

The boto3sqs instrumentation is broken and can't be fixed #2509

Dr-Emann opened this issue May 8, 2024 · 3 comments

Comments

@Dr-Emann
Copy link

Dr-Emann commented May 8, 2024

This is a strong statement, but I believe it to be true: the boto3sqs instrumentation is fundamentally flawed, and cannot be fixed.

The Boto3-sqs instrumentation attempts to automatically instrument code which attempts to use SQS messages. The way it does this is:

  1. Messages are recieved in a batch via the recieve_message boto3 call, which retrieves up to 10 messages from the queue
  2. It starts a span for every message received in the batch
  3. When the message is accessed (via indexing the returned list of messages, or iterating over the returned list of messages), the span created for that message is activated
  4. When the delete_message boto3 function is called, it ends the corresponding span if it finds one for that message

Issues

This leads to some issues (partly copied from #1599 (comment)):

  1. All spans for a batch of messages are started at the same time, rather than starting when processing for that individual message occcurs

  2. To determine the end of a span the instrumentation relies that user code calls sqs.delete_message API. In case of an exception in the user code delete_message might never be called, leaving behind an unended span. Not calling delete_message should be fine. It may be that this application sends the message elsewhere, and that other component is in charge of delete the message, for instance. In this case, memory use will grow without bound!

  3. There are many ways for user code to not match the expected use of boto3 which will result in various levels of misbehavior. For example:

    1. Using multiprocessing to handle each message in parallel: a totally different process will end up calling delete_message
    2. As discussed above, not sending the delete_message call will lead to unbounded memory growth
    3. Iterating the list of messages multiple times, or creating a new dict/list from the list passed, slicing the returned list
  4. A single global current_context_token is used, if multiple threads are performing SQS operations, they will overwrite the current token, threads will end each others spans, etc.

  5. The otel context stuff is built heavily on the assumption that any context operations will be fully nested, not overlapping. Any context operations (like wrapping in a new span) around the delete_message call will severely break
    this assumption

    response = client.receive_message(
        QueueUrl="http://sqs.us-east-1.host.docker.internal:4566/000000000000/fpc-events-local.fifo",
        MaxNumberOfMessages=10,
    )
    messages = response.get("Messages", [])
    # Indexing into the messages attaches a new context (1) with a new span as current, and remembers the previous context (0)
    message = messages[0]
    # start_as_current_span attaches a new context (2) with a new span as current and remembers the previous context (1)
    with tracer.start_as_current_span("inner_span"):
        # On message delete, the current context is reset to before the indexing (0)
        client.delete_message(
            QueueUrl="http://sqs.us-east-1.host.docker.internal:4566/000000000000/fpc-events-local.fifo",
            ReceiptHandle=message["ReceiptHandle"],
        )
    # On exit of the with block, the current context is reset to before the start_as_current_span (1)
    # The current context now has the span for the `messages[0]` as the current span, all future spans
    # will use that as their parent

    Note: This is a fundamental flaw of the fact that the boto3sqs instrumentation attempts to use spans which are not lexically scoped, around arbitrary calling code.

Some of these issues are just fixable bugs, but some of them (especially the last) I believe are fundamental, insurmountable issues with the very concept of attempting to automatically instrument this process.

@Dr-Emann
Copy link
Author

Dr-Emann commented May 8, 2024

See also:

#1702
#1703
#1704

@xrmx
Copy link
Contributor

xrmx commented Jun 3, 2024

@Dr-Emann Do you think is the instrumentation fixable or the service is untraceable?

@Dr-Emann
Copy link
Author

Dr-Emann commented Jun 3, 2024

Kinda neither: I believe the idea of automatic instrumentation of "the application-level processing of a message received from SQS" is not fixable, but I think the service is traceable, just not automatically at that level.

I think automatic instrumentation attempting to capture time spent handling messages from SQS impossible to get right in the general case, and downsides of doing it wrong are unavoidable, and disastrous (not "oops we just can't auto instrument that span", but "oops we corrupt all future spans emitted").

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants