Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(metrics): fix race when accessing metric registry #2409

Merged
merged 1 commit into from
Jan 10, 2023

Conversation

vincentbernat
Copy link
Contributor

A race condition was introduced in
5b04c98 (feat(metrics): track consumer-fetch-response-size) when passing the metric registry around to get additional metrics. Notably, handleResponsePromise() could access the registry after the broker has been closed and is tentatively being reopened. This triggers a data race because b.metricRegistry is being set during Open() (as it is part of the configuration).

We fix this by reverting the addition of handleResponsePromise() as a method to Broker. Instead, we provide it with the metric registry as an argument. An alternative would have been to get the metric registry before the select call. However, removing it as a method make it clearer than this function is not allowed to access the broker internals as they are not protected by the lock and the broker may not be alive any more.

All the following calls to b.metricRegistry are done while the lock is held:

  • inside Open(), the lock is held, including inside the goroutine
  • inside Close(), the lock is held
  • AsyncProduce() has a contract that it must be called while the broker is open, we keep a copy of the metric registry to use inside the callback
  • sendInternal(), has a contract that the lock should be held
  • authenticateViaSASLv1() is called from Open() and sendWithPromise(), both of them holding the lock
  • sendAndReceiveSASLHandshake() is called from
  • authenticateViaSASLv0/v1(), which are called from Open() and sendWithPromise()

I am unsure about responseReceiver(), however, it is also calling b.readFull() which accesses b.conn, so I suppose it is safe.

This leaves sendAndReceive() which is calling send(), which is calling sendWithPromise() which puts a lock. We move the lock to sendAndReceive() instead. send() is only called from sendAndReceiver() and we put a lock for sendWithPromise() other caller.

The test has been stolen from #2393 from @samuelhewitt. #2393 is an alternative proposal using a RW lock to protect b.metricRegistry.

Fix #2320

A race condition was introduced in
5b04c98 (feat(metrics): track
consumer-fetch-response-size) when passing the metric registry around to
get additional metrics. Notably, `handleResponsePromise()` could access
the registry after the broker has been closed and is tentatively being
reopened. This triggers a data race because `b.metricRegistry` is being
set during `Open()` (as it is part of the configuration).

We fix this by reverting the addition of `handleResponsePromise()` as a
method to `Broker`. Instead, we provide it with the metric registry as
an argument. An alternative would have been to get the metric registry
before the `select` call. However, removing it as a method make it
clearer than this function is not allowed to access the broker internals
as they are not protected by the lock and the broker may not be alive
any more.

All the following calls to `b.metricRegistry` are done while the lock is
held:

- inside `Open()`, the lock is held, including inside the goroutine
- inside `Close()`, the lock is held
- `AsyncProduce()` has a contract that it must be called while the broker
  is open, we keep a copy of the metric registry to use inside the callback
- `sendInternal()`, has a contract that the lock should be held
- `authenticateViaSASLv1()` is called from `Open()` and
  `sendWithPromise()`, both of them holding the lock
- `sendAndReceiveSASLHandshake()` is called from
- `authenticateViaSASLv0/v1()`, which are called from `Open()` and
  `sendWithPromise()`

I am unsure about `responseReceiver()`, however, it is also calling
`b.readFull()` which accesses `b.conn`, so I suppose it is safe.

This leaves `sendAndReceive()` which is calling `send()`, which is
calling `sendWithPromise()` which puts a lock. We move the lock to
`sendAndReceive()` instead. `send()` is only called from
`sendAndReceiver()` and we put a lock for `sendWithPromise()` other
caller.

The test has been stolen from IBM#2393 from @samuelhewitt. IBM#2393 is an
alternative proposal using a RW lock to protect `b.metricRegistry`.

Fix IBM#2320
@samuelhewitt
Copy link

@dnwe This seems like a more appropriate fix to #2320, I like the classification of handleResponsePromise as not part of broker, since, as @vincentbernat points out, the method could have been acting on a broker that is not alive.

If everyone agrees, would be great to see this merged and I can close PR #2393

@dnwe dnwe added the fix label Jan 10, 2023
Copy link
Collaborator

@dnwe dnwe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @vincentbernat — great analysis and fix 👍🏻

@dnwe
Copy link
Collaborator

dnwe commented Jan 10, 2023

@samuelhewitt yep agreed, looks like this PR should cover the necessary changes — thanks for all your work on this as well

@dnwe dnwe merged commit b0eda59 into IBM:main Jan 10, 2023
vincentbernat added a commit to vincentbernat/sarama that referenced this pull request Jan 26, 2023
On failure, `Broker.Open()` can be called several times while a producer
is running. In IBM#2409, it was assumed that AsyncProduce can only be
called with an open broker, however, it should be read that a user
should call it after opening the broker. The broker could be
disconnected and in progress of being reconnected. This is not hard to
fix as we already have a lock protecting the creation of the registry:
just don't create a new metric registry when attempting to reopen the
broker.

Fix IBM#2320 (again)
dnwe pushed a commit that referenced this pull request Jan 26, 2023
)

On failure, `Broker.Open()` can be called several times while a producer
is running. In #2409, it was assumed that AsyncProduce can only be
called with an open broker, however, it should be read that a user
should call it after opening the broker. The broker could be
disconnected and in progress of being reconnected. This is not hard to
fix as we already have a lock protecting the creation of the registry:
just don't create a new metric registry when attempting to reopen the
broker.

Fix #2320 (again)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Data race in broker v1.36.0
3 participants