shard has exceeded the maximum number of retries [1] #1630

Open
kazukiyashiro opened this issue Nov 26, 2021 · 0 comments

Hello!

For usage questions and help, see also the Discuss thread: https://discuss.elastic.co/t/curator-shard-has-exceeded-the-maximum-number-of-retries-1/290059

When Curator tries to allocate a replica shard of the shrunken index, I get this error:

{
  "index" : "example-index-2021-09-29-shrink",
  "shard" : 0,
  "primary" : false,
  "current_state" : "unassigned",
  "unassigned_info" : {
    "reason" : "ALLOCATION_FAILED",
    "at" : "2021-11-23T12:26:19.515Z",
    "failed_allocation_attempts" : 1,
    "details" : "failed shard on node [8r_zhRD4RDm2peWnDun_3w]: failed recovery, failure RecoveryFailedException[[example-index-2021-09-29-shrink][0]: Recovery failed from {node15}{nWOPSov3TFKUunoiooVxMQ}{PSAfiXvZQx-NLyKpnXGs1A}{192.168.0.164}{192.168.0.164:9300}{ml.machine_memory=135291469824, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true} into {node13}{8r_zhRD4RDm2peWnDun_3w}{KU0HhEPMQ_ilSV3RCe4XNw}{192.168.0.162}{192.168.0.162:9300}{ml.machine_memory=135291469824, xpack.installed=true, ml.max_open_jobs=20, ml.enabled=true}]; nested: RemoteTransportException[[node15][172.17.0.3:9300][internal:index/shard/recovery/start_recovery]]; nested: RecoveryEngineException[Phase[1] phase1 failed]; nested: RecoverFilesRecoveryException[Failed to transfer [85] files with total size of [24.8gb]]; nested: ReceiveTimeoutTransportException[[node13][192.168.0.162:9300][internal:index/shard/recovery/file_chunk] request_id [1586168734] timed out after [899897ms]]; ",
    "last_allocation_status" : "no_attempt"
  },
  "can_allocate" : "no",
  "allocate_explanation" : "cannot allocate because allocation is not permitted to any of the nodes",
  "node_allocation_decisions" : [
    {
      "node_id" : "8r_zhRD4RDm2peWnDun_3w",
      "node_name" : "node13",
      "transport_address" : "192.168.0.162:9300",
      "node_attributes" : {
        "ml.machine_memory" : "135291469824",
        "xpack.installed" : "true",
        "ml.max_open_jobs" : "20",
        "ml.enabled" : "true"
      },
      "node_decision" : "no",
      "deciders" : [
        {
          "decider" : "max_retry",
          "decision" : "NO",
          "explanation" : "shard has exceeded the maximum number of retries [1] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2021-11-23T12:26:19.515Z], failed_attempts[1], delayed=false, details[failed shard on node [8r_zhRD4RDm2peWnDun_3w]: failed recovery, failure RecoveryFailedException[[example-index-2021-09-29-shrink][0]: Recovery failed from {node15}{nWOPSov3TFKUunoiooVxMQ}{PSAfiXvZQx-NLyKpnXGs1A}{192.168.0.164}{192.168.0.164:9300}{ml.machine_memory=135291469824, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true} into {node13}{8r_zhRD4RDm2peWnDun_3w}{KU0HhEPMQ_ilSV3RCe4XNw}{192.168.0.162}{192.168.0.162:9300}{ml.machine_memory=135291469824, xpack.installed=true, ml.max_open_jobs=20, ml.enabled=true}]; nested: RemoteTransportException[[node15][172.17.0.3:9300][internal:index/shard/recovery/start_recovery]]; nested: RecoveryEngineException[Phase[1] phase1 failed]; nested: RecoverFilesRecoveryException[Failed to transfer [85] files with total size of [24.8gb]]; nested: ReceiveTimeoutTransportException[[node13][192.168.0.162:9300][internal:index/shard/recovery/file_chunk] request_id [1586168734] timed out after [899897ms]]; ], allocation_status[no_attempt]]]"
        }
      ]
    }
  ]
}

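For reference, the max_retry explanation above already names the manual workaround: the standard cluster reroute API with retry_failed. Something like the following should re-attempt the failed allocation:

POST /_cluster/reroute?retry_failed=true

But that is a manual step after every failure, which is why I'd like to raise the retry limit itself.
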
Is there a way to increase "index.allocation.max_retries" in the Curator settings?

Action file:

actions:
  1:
    action: shrink
    description: >-
      Shrink selected indices on the node with the most available space.
      Delete source index after successful shrink, then reroute the shrunk
      index with the provided parameters.
    options:
      ignore_empty_list: True
      shrink_node: DETERMINISTIC
      node_filters:
        permit_masters: True
      number_of_shards: 1
      number_of_replicas: ${REPLICA_COUNT:1}
      shrink_prefix:
      shrink_suffix: '-shrink'
      copy_aliases: True
      delete_after: True
      wait_for_active_shards: 1
      extra_settings:
        settings:
          index.codec: best_compression
      wait_for_completion: True
      wait_for_rebalance: True
      wait_interval: 9
      max_wait: -1
    filters:
     - filtertype: pattern
       kind: prefix
       value: ${INDEX_PREFIX}
     - filtertype: age
       source: name
       direction: older
       timestring: ${TIMESTAMP:'%Y-%m-%d'}
       unit: ${PERIOD:days}
       unit_count: ${PERIOD_COUNT}
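
One idea, regarding the question above: pass the setting through the action's extra_settings block, the same way index.codec is already applied. This is an untested sketch on my side, assuming Curator merges extra_settings into the shrunken target index's settings:

      extra_settings:
        settings:
          index.codec: best_compression
          index.allocation.max_retries: 5

If Curator does apply these to the target index, that would cover the replica allocation retries without any manual step.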

Curator version: 5.8.4
OS: CentOS 7

I've tried to create a template:

"shrink" : {
    "order" : 0,
    "index_patterns" : [
      "*-shrink"
    ],
    "settings" : {
      "index" : {
        "allocation" : {
          "max_retries" : "5"
        }
      }
    }
}

But it doesn't help.
Here are the index settings after a successful shrink:

GET /example-index-shrink/_settings

{
  "example-index-shrink" : {
    "settings" : {
      "index" : {
        "allocation" : {
          "max_retries" : "1"
        },
        "shrink" : {
          "source" : {
            "name" : "example-index",
            "uuid" : "mecKKzDDTzu77ViMv5N3EA"
          }
        },
        "blocks" : {
          "write" : null
        },
        "provided_name" : "example-index-shrink",
        "creation_date" : "1637751350836",
        "number_of_replicas" : "1",
        "uuid" : "MI_wbW35R8ubkYZOySfp1g",
        "version" : {
          "created" : "6080899",
          "upgraded" : "6080899"
        },
        "codec" : "best_compression",
        "routing" : {
          "allocation" : {
            "initial_recovery" : {
              "_id" : "nWOPSov3TFKUunoiooVxMQ"
            },
            "require" : {
              "_name" : null
            }
          }
        },
        "number_of_shards" : "1",
        "routing_partition_size" : "1",
        "resize" : {
          "source" : {
            "name" : "example-index",
            "uuid" : "mecKKzDDTzu77ViMv5N3EA"
          }
        }
      }
    }
  }
}
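
Since index.allocation.max_retries is a dynamic index setting, I assume it can at least be raised on the already-shrunken index by hand, e.g.:

PUT /example-index-shrink/_settings
{
  "index.allocation.max_retries": 5
}

But I'd prefer Curator to set it at shrink time, so the replica allocation survives slow recoveries without manual intervention.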

Thanks in advance
