Remove `MasterNodeRequest#TRAPPY_IMPLICIT_DEFAULT_MASTER_NODE_TIMEOUT` #107984

DaveCTurner · 2024-04-28T09:50:11Z

Today we quietly set a default of 30s for MasterNodeRequest#masterNodeTimeout, relying on callers to override it where appropriate. However it is essentially always appropriate to override this timeout in production code. Forgetting to do so leads to subtle but critical bugs that only arise when a cluster is struggling to process cluster state updates as fast as normal, in which there is no way to lengthen these timeouts at runtime to bring the cluster back to health. There are several such bugs in Elasticsearch today (e.g. #107857, but I've found others too).

Instead in production code we should move away from having any quiet default for this timeout and instead require callers to specify an appropriate value when creating the request instance in the first place. For example:

Instances that relate to a REST request should always set this timeout according to the ?master_timeout request parameter, whose default of RestUtils#REST_MASTER_TIMEOUT_DEFAULT (30s) is explicitly documented and conceptually separate from MasterNodeRequest#DEFAULT_MASTER_NODE_TIMEOUT (also coincidentally 30s).
Instances that relate to internal activities should probably set this timeout very long since it's normally better to wait patiently instead of failing sooner, especially if the failure simply triggers a retry. Alternatively, they could expose the relevant timeout using a setting.
In test code it doesn't matter so much, we can default to 30s unless the test specifically needs a different value, but there is no particular need to avoid passing this value on request construction either.

It's not a trivial change since there are currently over 200 implementations of MasterNodeRequest, all of which currently rely on the default behaviour, so it's going to take a few steps to complete this work. Therefore I'm opening this meta-issue to track the work needed in this area.

The text was updated successfully, but these errors were encountered:

elasticsearchmachine · 2024-04-28T09:50:35Z

Pinging @elastic/es-distributed (Team:Distributed)

The values `30s` and `1m` are used as defaults in various places in ES, there's no need to create a new `TimeValue` instance each time they appear. Moreover we already have constants for `0` and `-1`, but we don't use these constants when reading the values off the wire. This commit adds constants for `30s` and `1m` and adjusts the deserialization code to avoid unnecessary allocation for common `TimeValue` instances. Relates elastic#107984

Many APIs accept a `?master_timeout` parameter, but reading this parameter requires a little unnecessary boilerplate to specify the literal parameter name and default value. Moreover, today's convention is to construct a `MasterNodeRequest` and then read the default master timeout from the freshly-created request. In practice this results in a default of 30s, but we specify in the docs that this default is _always_ 30s, and in principle one could create a transport request with a different initial value which would deviate from the documented behaviour. This commit introduces a utility method for reading this parameter in a fashion which is completely consistent with the documented behaviour. Relates elastic#107984

Many APIs accept a `?master_timeout` parameter, but reading this parameter requires a little unnecessary boilerplate to specify the literal parameter name and default value. Moreover, today's convention is to construct a `MasterNodeRequest` and then read the default master timeout from the freshly-created request. In practice this results in a default of 30s, but we specify in the docs that this default is _always_ 30s, and in principle one could create a transport request with a different initial value which would deviate from the documented behaviour. This commit introduces a utility method for reading this parameter in a fashion which is completely consistent with the documented behaviour. Relates #107984

There's a couple of APIs whose `?timeout` parameter also sets the `?master_timeout` timeout unless the latter is specified separately. Today this logic happens within the relevant requests' `timeout()` setters, but it'd be enormously preferable to implement this REST-layer-specific logic in the REST layer instead, and configure the timeouts in the transport-layer requests explicitly. This commit does so. Relates elastic#107984

There's a couple of APIs whose `?timeout` parameter also sets the `?master_timeout` timeout unless the latter is specified separately. Today this logic happens within the relevant requests' `timeout()` setters, but it'd be enormously preferable to implement this REST-layer-specific logic in the REST layer instead, and configure the timeouts in the transport-layer requests explicitly. This commit does so. Relates #107984

There's no good reason for this field to have `protected` visibility, and we definitely don't want subclasses to be able to set it to `null`. This commit makes it `private`. Relates elastic#107984

The values `30s` and `1m` are used as defaults in various places in ES, there's no need to create a new `TimeValue` instance each time they appear. Moreover we already have constants for `0` and `-1`, but we don't use these constants when reading the values off the wire. This commit adds constants for `30s` and `1m` and adjusts the deserialization code to avoid unnecessary allocation for common `TimeValue` instances. Relates #107984

There's no good reason for this field to have `protected` visibility, and we definitely don't want subclasses to be able to set it to `null`. This commit makes it `private`. Relates #107984

Removes the default constructors for `MasterRequest`, `MasterReadRequest` and `AcknowledgedRequest` in favour of constructors which require subclasses to specify the relevant timeouts. This will avoid bugs like elastic#107857 which are caused by a missing `super()` call. Also deprecates and renames the default to make it clear it should not be used in new code. Relates elastic#107984

Removes the default constructors for `MasterRequest`, `MasterReadRequest` and `AcknowledgedRequest` in favour of constructors which require subclasses to specify the relevant timeouts. This will avoid bugs like #107857 which are caused by a missing `super()` call. Also deprecates and renames the default to make it clear it should not be used in new code. Relates #107984

These overloads are no longer used so this commit removes them. Relates elastic#107984

These overloads are no longer used so this commit removes them. Relates #107984

Relates elastic#107984

Relates #107984

Relates elastic#107984.

Relates elastic#107984

Relates #107984

Relates elastic#107984

…8714) Relates elastic#107984

Relates elastic#107984

…8714) Relates elastic#107984

DaveCTurner added >bug Meta :Distributed/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. >tech debt labels Apr 28, 2024

elasticsearchmachine added the Team:Distributed Meta label for distributed team label Apr 28, 2024

DaveCTurner mentioned this issue Apr 28, 2024

Intern common TimeValue constants #107985

Merged

DaveCTurner mentioned this issue Apr 28, 2024

Introduce RestUtils#getMasterNodeTimeout #107986

Merged

DaveCTurner mentioned this issue Apr 29, 2024

Fix timeouts which override master-node timeouts #107992

Merged

DaveCTurner mentioned this issue Apr 29, 2024

Encapsulate MasterNodeRequest#masterNodeTimeout #107999

Merged

DaveCTurner mentioned this issue Apr 30, 2024

Remove pointless AbstractClient indirection #108002

Closed

DaveCTurner mentioned this issue May 8, 2024

Make master-node timeout less implicit #108414

Merged

DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue May 14, 2024

Remove unused ?master_node_timeout compat shims

2a02128

These overloads are no longer used so this commit removes them. Relates elastic#107984

DaveCTurner mentioned this issue May 14, 2024

Remove unused ?master_node_timeout compat shims #108601

Merged

elasticsearchmachine pushed a commit that referenced this issue May 14, 2024

Remove unused ?master_node_timeout compat shims (#108601)

3fe4785

These overloads are no longer used so this commit removes them. Relates #107984

DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue May 14, 2024

Fix timeouts in Datastream Lifecycle module

d0e81e1

Relates elastic#107984

DaveCTurner mentioned this issue May 14, 2024

Fix timeouts in Datastream Lifecycle module #108621

Merged

nicktindall mentioned this issue May 15, 2024

Populate timeout parameters from REST requests #108595

Open

elasticsearchmachine pushed a commit that referenced this issue May 15, 2024

Fix timeouts in Datastream Lifecycle module (#108621)

6922441

Relates #107984

DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue May 16, 2024

Add ?master_node_timeout to RestDeprecationInfoAction

8d9345d

Relates elastic#107984.

DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue May 16, 2024

Add ?master_node_timeout to RestDeprecationInfoAction

eca95d1

Relates elastic#107984.

DaveCTurner mentioned this issue May 16, 2024

Add ?master_timeout to RestDeprecationInfoAction #108714

Merged

DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue May 16, 2024

Add ?master_timeout to ent search APIs

fc58993

Relates elastic#107984

DaveCTurner mentioned this issue May 16, 2024

Add ?master_timeout to ent search APIs #108719

Merged

DaveCTurner added a commit that referenced this issue May 16, 2024

Add ?master_timeout to ent search APIs (#108719)

56fb4ac

Relates #107984

DaveCTurner added a commit that referenced this issue May 16, 2024

Add ?master_node_timeout to RestDeprecationInfoAction (#108714)

e454165

Relates #107984

nicktindall pushed a commit to nicktindall/elasticsearch that referenced this issue May 17, 2024

Add ?master_timeout to ent search APIs (elastic#108719)

f43ef02

Relates elastic#107984

nicktindall pushed a commit to nicktindall/elasticsearch that referenced this issue May 17, 2024

Add ?master_node_timeout to RestDeprecationInfoAction (elastic#10…

8b406b9

…8714) Relates elastic#107984

DaveCTurner changed the title ~~Remove trappy implicit default MasterNodeRequest#masterNodeTimeout~~ Remove MasterNodeRequest#TRAPPY_IMPLICIT_DEFAULT_MASTER_NODE_TIMEOUT May 17, 2024

DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue May 17, 2024

Expose ?master_timeout in autoscaling APIs

5e32b6f

Relates elastic#107984

DaveCTurner mentioned this issue May 17, 2024

Expose ?master_timeout in autoscaling APIs #108759

Open

parkertimmins pushed a commit to parkertimmins/elasticsearch that referenced this issue May 17, 2024

Add ?master_timeout to ent search APIs (elastic#108719)

9599846

Relates elastic#107984

parkertimmins pushed a commit to parkertimmins/elasticsearch that referenced this issue May 17, 2024

Add ?master_node_timeout to RestDeprecationInfoAction (elastic#10…

0cf9a76

…8714) Relates elastic#107984

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Remove `MasterNodeRequest#TRAPPY_IMPLICIT_DEFAULT_MASTER_NODE_TIMEOUT` #107984

Remove `MasterNodeRequest#TRAPPY_IMPLICIT_DEFAULT_MASTER_NODE_TIMEOUT` #107984

DaveCTurner commented Apr 28, 2024 •

edited

elasticsearchmachine commented Apr 28, 2024

Remove MasterNodeRequest#TRAPPY_IMPLICIT_DEFAULT_MASTER_NODE_TIMEOUT #107984

Remove MasterNodeRequest#TRAPPY_IMPLICIT_DEFAULT_MASTER_NODE_TIMEOUT #107984

Comments

DaveCTurner commented Apr 28, 2024 • edited

elasticsearchmachine commented Apr 28, 2024

Remove `MasterNodeRequest#TRAPPY_IMPLICIT_DEFAULT_MASTER_NODE_TIMEOUT` #107984

Remove `MasterNodeRequest#TRAPPY_IMPLICIT_DEFAULT_MASTER_NODE_TIMEOUT` #107984

DaveCTurner commented Apr 28, 2024 •

edited