Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove MasterNodeRequest#TRAPPY_IMPLICIT_DEFAULT_MASTER_NODE_TIMEOUT #107984

Open
DaveCTurner opened this issue Apr 28, 2024 · 1 comment
Open
Labels
>bug :Distributed/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. Meta Team:Distributed Meta label for distributed team >tech debt

Comments

@DaveCTurner
Copy link
Contributor

DaveCTurner commented Apr 28, 2024

Today we quietly set a default of 30s for MasterNodeRequest#masterNodeTimeout, relying on callers to override it where appropriate. However it is essentially always appropriate to override this timeout in production code. Forgetting to do so leads to subtle but critical bugs that only arise when a cluster is struggling to process cluster state updates as fast as normal, in which there is no way to lengthen these timeouts at runtime to bring the cluster back to health. There are several such bugs in Elasticsearch today (e.g. #107857, but I've found others too).

Instead in production code we should move away from having any quiet default for this timeout and instead require callers to specify an appropriate value when creating the request instance in the first place. For example:

  • Instances that relate to a REST request should always set this timeout according to the ?master_timeout request parameter, whose default of RestUtils#REST_MASTER_TIMEOUT_DEFAULT (30s) is explicitly documented and conceptually separate from MasterNodeRequest#DEFAULT_MASTER_NODE_TIMEOUT (also coincidentally 30s).
  • Instances that relate to internal activities should probably set this timeout very long since it's normally better to wait patiently instead of failing sooner, especially if the failure simply triggers a retry. Alternatively, they could expose the relevant timeout using a setting.
  • In test code it doesn't matter so much, we can default to 30s unless the test specifically needs a different value, but there is no particular need to avoid passing this value on request construction either.

It's not a trivial change since there are currently over 200 implementations of MasterNodeRequest, all of which currently rely on the default behaviour, so it's going to take a few steps to complete this work. Therefore I'm opening this meta-issue to track the work needed in this area.

@DaveCTurner DaveCTurner added >bug Meta :Distributed/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. >tech debt labels Apr 28, 2024
@elasticsearchmachine elasticsearchmachine added the Team:Distributed Meta label for distributed team label Apr 28, 2024
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue Apr 28, 2024
The values `30s` and `1m` are used as defaults in various places in ES,
there's no need to create a new `TimeValue` instance each time they
appear. Moreover we already have constants for `0` and `-1`, but we
don't use these constants when reading the values off the wire.

This commit adds constants for `30s` and `1m` and adjusts the
deserialization code to avoid unnecessary allocation for common
`TimeValue` instances.

Relates elastic#107984
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue Apr 28, 2024
Many APIs accept a `?master_timeout` parameter, but reading this
parameter requires a little unnecessary boilerplate to specify the
literal parameter name and default value. Moreover, today's convention
is to construct a `MasterNodeRequest` and then read the default master
timeout from the freshly-created request. In practice this results in a
default of 30s, but we specify in the docs that this default is _always_
30s, and in principle one could create a transport request with a
different initial value which would deviate from the documented
behaviour.

This commit introduces a utility method for reading this parameter in a
fashion which is completely consistent with the documented behaviour.

Relates elastic#107984
DaveCTurner added a commit that referenced this issue Apr 29, 2024
Many APIs accept a `?master_timeout` parameter, but reading this
parameter requires a little unnecessary boilerplate to specify the
literal parameter name and default value. Moreover, today's convention
is to construct a `MasterNodeRequest` and then read the default master
timeout from the freshly-created request. In practice this results in a
default of 30s, but we specify in the docs that this default is _always_
30s, and in principle one could create a transport request with a
different initial value which would deviate from the documented
behaviour.

This commit introduces a utility method for reading this parameter in a
fashion which is completely consistent with the documented behaviour.

Relates #107984
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue Apr 29, 2024
There's a couple of APIs whose `?timeout` parameter also sets the
`?master_timeout` timeout unless the latter is specified separately.
Today this logic happens within the relevant requests' `timeout()`
setters, but it'd be enormously preferable to implement this
REST-layer-specific logic in the REST layer instead, and configure the
timeouts in the transport-layer requests explicitly. This commit does
so.

Relates elastic#107984
elasticsearchmachine pushed a commit that referenced this issue Apr 29, 2024
There's a couple of APIs whose `?timeout` parameter also sets the
`?master_timeout` timeout unless the latter is specified separately.
Today this logic happens within the relevant requests' `timeout()`
setters, but it'd be enormously preferable to implement this
REST-layer-specific logic in the REST layer instead, and configure the
timeouts in the transport-layer requests explicitly. This commit does
so.

Relates #107984
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue Apr 29, 2024
There's no good reason for this field to have `protected` visibility,
and we definitely don't want subclasses to be able to set it to `null`.
This commit makes it `private`.

Relates elastic#107984
elasticsearchmachine pushed a commit that referenced this issue Apr 29, 2024
The values `30s` and `1m` are used as defaults in various places in ES,
there's no need to create a new `TimeValue` instance each time they
appear. Moreover we already have constants for `0` and `-1`, but we
don't use these constants when reading the values off the wire.

This commit adds constants for `30s` and `1m` and adjusts the
deserialization code to avoid unnecessary allocation for common
`TimeValue` instances.

Relates #107984
DaveCTurner added a commit that referenced this issue Apr 30, 2024
There's no good reason for this field to have `protected` visibility,
and we definitely don't want subclasses to be able to set it to `null`.
This commit makes it `private`.

Relates #107984
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue May 8, 2024
Removes the default constructors for `MasterRequest`,
`MasterReadRequest` and `AcknowledgedRequest` in favour of constructors
which require subclasses to specify the relevant timeouts. This will
avoid bugs like elastic#107857 which are caused by a missing `super()` call.

Also deprecates and renames the default to make it clear it should not
be used in new code.

Relates elastic#107984
elasticsearchmachine pushed a commit that referenced this issue May 13, 2024
Removes the default constructors for `MasterRequest`,
`MasterReadRequest` and `AcknowledgedRequest` in favour of constructors
which require subclasses to specify the relevant timeouts. This will
avoid bugs like #107857 which are caused by a missing `super()` call.

Also deprecates and renames the default to make it clear it should not
be used in new code.

Relates #107984
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue May 14, 2024
These overloads are no longer used so this commit removes them.

Relates elastic#107984
elasticsearchmachine pushed a commit that referenced this issue May 14, 2024
These overloads are no longer used so this commit removes them.

Relates #107984
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue May 14, 2024
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue May 16, 2024
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue May 16, 2024
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue May 16, 2024
DaveCTurner added a commit that referenced this issue May 16, 2024
nicktindall pushed a commit to nicktindall/elasticsearch that referenced this issue May 17, 2024
nicktindall pushed a commit to nicktindall/elasticsearch that referenced this issue May 17, 2024
@DaveCTurner DaveCTurner changed the title Remove trappy implicit default MasterNodeRequest#masterNodeTimeout Remove MasterNodeRequest#TRAPPY_IMPLICIT_DEFAULT_MASTER_NODE_TIMEOUT May 17, 2024
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue May 17, 2024
parkertimmins pushed a commit to parkertimmins/elasticsearch that referenced this issue May 17, 2024
parkertimmins pushed a commit to parkertimmins/elasticsearch that referenced this issue May 17, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>bug :Distributed/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. Meta Team:Distributed Meta label for distributed team >tech debt
Projects
None yet
Development

No branches or pull requests

2 participants