(Doc+) Flush out Data Tiers #107981

Open
wants to merge 9 commits into
base: main
133 changes: 94 additions & 39 deletions docs/reference/datatiers.asciidoc
@@ -2,40 +2,66 @@
[[data-tiers]]
== Data tiers

A _data tier_ is a collection of nodes with the same data role that
typically share the same hardware profile:

* <<content-tier, Content tier>> nodes handle the indexing and query load for content such as a product catalog.
* <<hot-tier, Hot tier>> nodes handle the indexing load for time series data such as logs or metrics
and hold your most recent, most-frequently-accessed data.
* <<warm-tier, Warm tier>> nodes hold time series data that is accessed less-frequently
A _data tier_ is a collection of <<modules-node,nodes>> within a cluster that share the same
<<node-roles,data node role>>. Elastic recommends that these nodes also share the same
hardware profile to avoid <<hotspotting,hot spotting>>. Data tier usage generally splits along
<<data-management,data categories>> for _content_ and _time series_ data. {es} offers the following
Member

I think "content data" is a little ambiguous here, perhaps … for time series and non time series data?

data tiers:

* <<content-tier,Content tier>> nodes handle the indexing and query load for content
indices, such as a <<system-indices,system index>> or a product catalog.
Comment on lines +15 to +16
Member

System indices and data streams can also be time series data, so I don't think we should use it as an example here. I think we should stick with a timeseries/non-timeseries distinction.

Contributor Author

This might be another 😕 point for me then, if we can discuss:

  1. Lower down on the existing page under Content header already says "System indices and other indices that aren’t part of a data stream are automatically allocated to the content tier." which is why I didn't realize I might be misunderstanding.
  2. Support encourages users to keep all system indices on hot/content. Does Dev agree?
  3. AFAIK (and it's an ongoing discussion / definition-problem) system indices are the indices which report from the snapshot's feature states. So from the unofficial list I wrote for Support we later learned e.g. .ilm-history and .kibana-event-log don't qualify as system indices. So e.g. only (A) qualify as system indices and AFAICT that subset doesn't have time series data (at least no indices which'd rollover. EDIT: other than the ML ones if that's what you were referencing?).
(A)
{
  "feature_states": [
    {
      "feature_name": "security",
      "indices": [".security-tokens-7",".security-7",".security-profile-8"]
    },
    {
      "feature_name": "geoip",
      "indices": [".geoip_databases"]
    },
    {
      "feature_name": "async_search",
      "indices": [".async-search"]
    },
    {
      "feature_name": "machine_learning",
      "indices": [".ml-inference-native-000002",".ml-inference-000005",".ml-config"]
    },
    {
      "feature_name": "transform",
      "indices": [".transform-internal-007"]
    },
    {
      "feature_name": "kibana",
      "indices": [
        ".kibana_analytics_8.12.2_001",
        ".kibana_task_manager_8.12.2_001",
        ".kibana_ingest_8.12.2_001",
        ".apm-custom-link",
        ".apm-agent-configuration",
        ".kibana_8.12.2_001",
        ".kibana_security_session_1",
        ".kibana_security_solution_8.12.2_001",
        ".kibana_alerting_cases_8.12.2_001"
      ]
    },
    {
      "feature_name": "tasks",
      "indices": [".tasks"]
    },
    {
      "feature_name": "fleet",
      "indices": [
        ".fleet-agents-7",
        ".fleet-enrollment-api-keys-7",
        ".fleet-actions-7",
        ".fleet-policies-7",
        ".fleet-servers-7",
        ".fleet-policies-leader-7"
      ]
    }
  ]
}

Member

> Support encourages users to keep all system indices on hot/content.

Users cannot control these at all. System indices cannot be configured apart from specialized APIs. Generally, we shouldn't be talking about system indices with users (if at all), since they are meant to be used only for system usage.

Contributor Author

Is that a recent change? Users sending .kibana* or .reporting* system indices (or .alert*, if those count as system indices) to the warm/cold tier is an ongoing low-key concern.

Member

What is the concern with them moving the indices?

Inspecting the code here, we differentiate between internal and external system indices, and those managed by ES and unmanaged by ES (I stand corrected, sorry about my confusion here). It looks like we do have a concept of a system index that a user could influence somewhat (through setting an appropriate origin).

* <<hot-tier,Hot tier>> nodes handle the indexing load for time series data,
such as logs or metrics. They hold your most recent, most frequently accessed data.
* <<warm-tier,Warm tier>> nodes hold time series data that is accessed less frequently
and rarely needs to be updated.
* <<cold-tier,Cold tier>> nodes hold time series data that is accessed
infrequently and not normally updated. To save space, you can keep
<<fully-mounted,fully mounted indices>> of
<<ilm-searchable-snapshot,{search-snaps}>> on the cold tier. These fully mounted
indices eliminate the need for replicas, reducing required disk space by
approximately 50% compared to the regular indices.
* <<frozen-tier, Frozen tier>> nodes hold time series data that is accessed
* <<frozen-tier,Frozen tier>> nodes hold time series data that is accessed
rarely and never updated. The frozen tier stores <<partially-mounted,partially
mounted indices>> of <<ilm-searchable-snapshot,{search-snaps}>> exclusively.
This extends the storage capacity even further, by up to 20 times compared to
the warm tier.

IMPORTANT: {es} generally expects nodes within a data tier to share the same
hardware profile. Variations not following this recommendation should be
carefully architected to avoid <<hotspotting,hot spotting>>.

When you index documents directly to a specific index, they remain on content tier nodes indefinitely.
Content data remains on the <<content-tier,content tier>> for its entire
data lifecycle. You can configure your time series data to progress through the
descending temperature data tiers (hot, warm, cold, and frozen) according to your
performance, resiliency, and data retention requirements. Elastic recommends
automating these lifecycle transitions with <<index-lifecycle-management,{ilm}>>,
ideally together with <<data-streams,data streams>>.

[TIP]
====
A data tier's performance depends heavily on its backing hardware profile.
See {cloud}/ec-configure-deployment-settings.html#ec-hardware-profiles[{ecloud}'s
hardware profiles] for example {cloud}/ec-reference-hardware.html[hardware configurations].

{es} itself does not require it, but Elastic generally assumes (for example, in {ecloud}
deployment configurations) that each colder data tier provides more data storage per unit
of CPU and/or heap, so that later data tiers gain storage capacity at the cost of slower
response times.

Under this assumption, as a general architecture baseline, access would be distributed
across the descending temperature data tiers roughly as follows: searches
hitting 85% hot, 10% warm, 5% cold, and 1% frozen, and ingest targeting
95% hot, 4% warm, 1% cold, and 0% frozen, as checked via
<<cat-thread-pool,CAT thread pool>>. {es} does not require these proportions,
although they encourage stable and highly responsive clusters. They are intended
only as a general baseline to adapt to your specific
use case, hardware profiles, and architecture, per Elastic's
https://www.elastic.co/blog/it-depends[It Depends] philosophy.
====

When you index documents to a data stream, they initially reside on hot tier nodes.
You can configure <<index-lifecycle-management, {ilm}>> ({ilm-init}) policies
to automatically transition your time series data through the hot, warm, and cold tiers
according to your performance, resiliency and data retention requirements.
[discrete]
[[available-tier]]
=== Available data tiers

[discrete]
[[content-tier]]
=== Content tier
==== Content tier

// tag::content-tier[]
Data stored in the content tier is generally a collection of items such as a product catalog or article archive.
@@ -50,13 +76,14 @@ While they are also responsible for indexing, content data is generally not inge
as time series data such as logs and metrics. From a resiliency perspective the indices in this
tier should be configured to use one or more replicas.

The content tier is required. System indices and other indices that aren't part
of a data stream are automatically allocated to the content tier.
The content tier is required and is frequently deployed within the same node
grouping as the hot tier. System indices and other indices that aren't part
of a data stream are automatically allocated to the content tier.
// end::content-tier[]

[discrete]
[[hot-tier]]
=== Hot tier
==== Hot tier

// tag::hot-tier[]
The hot tier is the {es} entry point for time series data and holds your most-recent,
@@ -71,7 +98,7 @@ data stream>> are automatically allocated to the hot tier.

[discrete]
[[warm-tier]]
=== Warm tier
==== Warm tier

// tag::warm-tier[]
Time series data can move to the warm tier once it is being queried less frequently
@@ -84,7 +111,7 @@ For resiliency, indices in the warm tier should be configured to use one or more

[discrete]
[[cold-tier]]
=== Cold tier
==== Cold tier

// tag::cold-tier[]
When you no longer need to search time series data regularly, it can move from
@@ -106,7 +133,7 @@ but doesn't reduce required disk space compared to the warm tier.

[discrete]
[[frozen-tier]]
=== Frozen tier
==== Frozen tier

// tag::frozen-tier[]
Once data is no longer being queried, or being queried rarely, it may move from
@@ -120,9 +147,13 @@ sometimes fetch frozen data from the snapshot repository, searches on the frozen
tier are typically slower than on the cold tier.
// end::frozen-tier[]

[discrete]
[[configure-data-tiers]]
=== Configure data tiers

[discrete]
[[configure-data-tiers-cloud]]
=== Configure data tiers on {ess} or {ece}
==== On {ess} or {ece}

The default configuration for an {ecloud} deployment includes a shared tier for
hot and content data. This tier is required and can't be removed.
@@ -156,7 +187,7 @@ tier].

[discrete]
[[configure-data-tiers-on-premise]]
=== Configure data tiers for self-managed deployments
==== On self-managed deployments

For self-managed deployments, each node's <<data-node,data role>> is configured
in `elasticsearch.yml`. For example, the highest-performance nodes in a cluster
@@ -174,25 +205,49 @@ tier.
[[data-tier-allocation]]
=== Data tier index allocation

When you create an index, by default {es} sets
<<tier-preference-allocation-filter, `index.routing.allocation.include._tier_preference`>>
to `data_content` to automatically allocate the index shards to the content tier.

When {es} creates an index as part of a <<data-streams, data stream>>,
by default {es} sets
<<tier-preference-allocation-filter, `index.routing.allocation.include._tier_preference`>>
to `data_hot` to automatically allocate the index shards to the hot tier.

You can explicitly set `index.routing.allocation.include._tier_preference`
to opt out of the default tier-based allocation.
You can check an existing index's data tier preference by <<indices-get-settings,retrieving its
settings>> for <<tier-preference-allocation-filter,`index.routing.allocation.include._tier_preference`>>:

[source,console]
--------------------------------------------------
GET /my-index-000001/_settings?filter_path=*.settings.index.routing.allocation.include._tier_preference
--------------------------------------------------

The `_tier_preference` setting can hold a comma-separated list of tiers in descending order
of preference. For example, an index in the <<cold-tier,cold tier>> would typically state
`data_cold,data_warm,data_hot`. See <<ilm-migrate,ILM Migrate>> for more context.
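
As a rough illustration, for an index that has already moved to the cold tier, the
settings request above might return something like the following. The index name
`my-index-000001` and the returned value are assumptions for the example; your own
indices may report a different list.

[source,console-result]
--------------------------------------------------
{
  "my-index-000001": {
    "settings": {
      "index": {
        "routing": {
          "allocation": {
            "include": {
              "_tier_preference": "data_cold,data_warm,data_hot"
            }
          }
        }
      }
    }
  }
}
--------------------------------------------------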

{es} attempts to <<index-modules-allocation,allocate>> the index's shards
according to this setting. The setting does not take precedence over other
allocation settings and may conflict with them, which can prevent a shard from
allocating. Historically, this has occurred when a cluster has not yet been, or has
only partially been, <<troubleshoot-migrate-to-tiers,migrated to data tiers>>. The
setting does not unallocate a currently allocated shard, but it may, for example,
prevent the shard from migrating from its current location to its designated
data tier. To troubleshoot, run <<cluster-allocation-explain,Allocation Explain>>
against the suspected problematic shard.
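
As a minimal sketch, the following request asks Allocation Explain about one specific
shard. The index name `my-index-000001`, the shard number `0`, and the choice of the
primary copy are placeholder assumptions; substitute the shard you suspect.

[source,console]
--------------------------------------------------
GET _cluster/allocation/explain
{
  "index": "my-index-000001",
  "shard": 0,
  "primary": true
}
--------------------------------------------------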

By default, a newly created index sets `_tier_preference` to `data_content`, which
allocates the index's shards to the content tier. A <<data-streams,data stream>>
instead sets its backing indices to `data_hot`, so they allocate to the
hot tier by default. You can override these defaults at index creation by explicitly setting
the preferred value either in an <<index-templates,index template>> (see
<<getting-started-index-lifecycle-management,bootstrapping ILM>>) or in the
<<indices-create-index,create index>> request body. You can also override the
setting at any time by <<indices-update-settings,updating index settings>> to the preferred
value.
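
For example, the following sketch updates an existing index so that its shards prefer
the warm tier and fall back to the hot tier. The index name `my-index-000001` and the
chosen tier list are assumptions for illustration only.

[source,console]
--------------------------------------------------
PUT /my-index-000001/_settings
{
  "index.routing.allocation.include._tier_preference": "data_warm,data_hot"
}
--------------------------------------------------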

You can set `_tier_preference` to `null` to remove the data tier preference. This
allows the index to allocate to any data node within the cluster; it does not
reset the setting to its creation-time default. Be aware that if the index is
managed, <<ilm-migrate,ILM Migrate>> may apply a value again at a later point.
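
A minimal sketch of removing the preference, again assuming a placeholder index named
`my-index-000001`:

[source,console]
--------------------------------------------------
PUT /my-index-000001/_settings
{
  "index.routing.allocation.include._tier_preference": null
}
--------------------------------------------------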

[discrete]
[[data-tier-migration]]
=== Automatic data tier migration
==== Automatic data tier migration

{ilm-init} automatically transitions managed
indices through the available data tiers using the <<ilm-migrate, migrate>> action.
By default, this action is automatically injected in every phase.
You can explicitly specify the migrate action with `"enabled": false` to disable automatic migration,
You can explicitly specify the migrate action with `"enabled": false` to <<ilm-disable-migrate-ex,disable automatic migration>>,
for example, if you're using the <<ilm-allocate, allocate action>> to manually
specify allocation rules.
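
As a minimal sketch, a policy along these lines disables the injected migrate action in
the warm phase and uses the allocate action instead. The policy name `my-policy` and the
custom node attribute `rack_id` with its values are hypothetical and only serve the
example.

[source,console]
--------------------------------------------------
PUT _ilm/policy/my-policy
{
  "policy": {
    "phases": {
      "warm": {
        "actions": {
          "migrate": {
            "enabled": false
          },
          "allocate": {
            "include": {
              "rack_id": "one,two"
            }
          }
        }
      }
    }
  }
}
--------------------------------------------------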