(Doc+) Flush out Data Tiers #107981
base: main
Changes from 3 commits
94d72a8
362201b
9eed70d
24035b3
955650a
e6388c2
b71b016
0b9ca75
c72d632
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -2,40 +2,71 @@ | |
[[data-tiers]] | ||
== Data tiers | ||
|
||
A _data tier_ is a collection of nodes with the same data role that | ||
typically share the same hardware profile: | ||
|
||
* <<content-tier, Content tier>> nodes handle the indexing and query load for content such as a product catalog. | ||
* <<hot-tier, Hot tier>> nodes handle the indexing load for time series data such as logs or metrics | ||
and hold your most recent, most-frequently-accessed data. | ||
* <<warm-tier, Warm tier>> nodes hold time series data that is accessed less-frequently | ||
A _data tier_ is a collection of <<modules-node,nodes>> within a cluster that share the same | ||
<<node-roles,data node role>>. Elastic recommends this collection of nodes also shares the same | ||
hardware profile to avoid <<hotspotting,hot spotting>>. Data tiers' usage generally splits along | ||
<<data-management,data categories>> for _content_ and _time series_ data. {es} offers the following | ||
data tiers: | ||
stefnestor marked this conversation as resolved.
|
||
|
||
* <<content-tier,Content tier>> nodes handle the indexing and query load for content | ||
indices, such as a <<system-indices,system index>> or a product catalog. | ||
Comment on lines +15 to +16
System indices and data streams can also be time series data, so I don't think we should use it as an example here. I think we should stick with a timeseries/non-timeseries distinction.
This might be another point for me then if we can discuss:
(A)
Users cannot control these at all. System indices cannot be configured apart from specialized APIs. Generally, we shouldn't be talking about system indices with users (if at all), since they are meant to be used only for system usage.
Is that a recent change? Because users sending
What is the concern with them moving the indices? Inspecting the code here, we differentiate between internal and external system indices, and those managed by ES and unmanaged by ES (I stand corrected, sorry about my confusion here). It looks like we do have a concept of a system index that a user could influence somewhat (through setting an appropriate origin). |
||
* <<hot-tier,Hot tier>> nodes handle the indexing load for time series data, | ||
stefnestor marked this conversation as resolved.
|
||
such as logs or metrics. They hold your most recent, most-frequently-accessed data. | ||
* <<warm-tier,Warm tier>> nodes hold time series data that is accessed less-frequently | ||
and rarely needs to be updated. | ||
* <<cold-tier,Cold tier>> nodes hold time series data that is accessed | ||
infrequently and not normally updated. To save space, you can keep | ||
<<fully-mounted,fully mounted indices>> of | ||
<<ilm-searchable-snapshot,{search-snaps}>> on the cold tier. These fully mounted | ||
indices eliminate the need for replicas, reducing required disk space by | ||
approximately 50% compared to the regular indices. | ||
* <<frozen-tier, Frozen tier>> nodes hold time series data that is accessed | ||
* <<frozen-tier,Frozen tier>> nodes hold time series data that is accessed | ||
rarely and never updated. The frozen tier stores <<partially-mounted,partially | ||
mounted indices>> of <<ilm-searchable-snapshot,{search-snaps}>> exclusively. | ||
This extends the storage capacity even further, by up to 20 times compared to | ||
the warm tier. | ||
|
||
stefnestor marked this conversation as resolved.
|
||
IMPORTANT: {es} generally expects nodes within a data tier to share the same | ||
hardware profile. Variations not following this recommendation should be | ||
IMPORTANT: {es} generally expects nodes within a data tier to share the same | ||
hardware profile. Variations that don't follow this recommendation should be | ||
carefully architected to avoid <<hotspotting,hot spotting>>. | ||
shainaraskas marked this conversation as resolved.
|
||
Content data will remain on the <<content-tier,content tier>> for its entire | ||
data lifecycle. You can configure your time series data to progress through the | ||
descending temperature data tiers hot, warm, cold, and frozen according to your | ||
performance, resiliency, and data retention requirements. Elastic recommends | ||
automating these lifecycle transitions via <<index-lifecycle-management,{ilm}>>, | ||
specifically also using <<data-streams,Data Streams>>. | ||
stefnestor marked this conversation as resolved.
|
||
|
||
[TIP] | ||
==== | ||
A data tier's performance depends on its backing hardware profile. | ||
See {cloud}/ec-configure-deployment-settings.html#ec-hardware-profiles[{ecloud}'s | ||
hardware profiles] for example {cloud}/ec-reference-hardware.html[hardware configurations]. | ||
stefnestor marked this conversation as resolved.
|
||
|
||
{es} itself does not require but Elastic generally assumes, for example in {ecloud} | ||
Deployment configurations, that descending temperature data tiers have an increasing | ||
multiplier of cpu and/or heap resources to their data storage ratio, so that later data | ||
tiers can gain more space for data storage at the cost of slower response times. | ||
stefnestor marked this conversation as resolved.
|
||
|
||
Under this assumption for a general architecture baseline, the above outline of | ||
descending temperature data tier access proportionalities would reflect as searches | ||
hitting 85% hot, 10% warm, 5% cold, and 1% frozen and ingest targeting | ||
95% hot, 4% warm, 1% cold, and 0% frozen as checked via | ||
<<cat-thread-pool,CAT Threadpools>>. These proportions are not required by {es} | ||
although they encourage stable and highly responsive clusters. They're only intended | ||
to serve as a general architecture baseline to then be applied to your specific | ||
use case, hardware profiles, and architecture per Elastic's | ||
https://www.elastic.co/blog/it-depends[It Depends] philosophy. | ||
stefnestor marked this conversation as resolved.
|
||
==== | ||
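As a rough way to observe how search and ingest load is spread across tiers, the following sketch first lists each node's roles and then the `search` and `write` thread pool counters per node. The column selection is an assumption chosen for illustration, not a prescribed method, and grouping nodes into tiers by role is left to the reader.

[source,console]
--------------------------------------------------
# Illustrative only: list node names with their role letters
# (for example h = hot, w = warm, c = cold, f = frozen, s = content).
GET /_cat/nodes?v=true&h=name,node.role

# Illustrative only: per-node counters for the search and write thread pools.
GET /_cat/thread_pool/search,write?v=true&h=node_name,name,active,queue,rejected,completed
--------------------------------------------------

The `completed` counter is cumulative since each node started, so estimating proportions over a time window means comparing two samples rather than reading a single response.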
|
||
When you index documents directly to a specific index, they remain on content tier nodes indefinitely. | ||
[discrete] | ||
[[available-tier]] | ||
=== Available data tiers | ||
stefnestor marked this conversation as resolved.
|
||
|
||
When you index documents to a data stream, they initially reside on hot tier nodes. | ||
You can configure <<index-lifecycle-management, {ilm}>> ({ilm-init}) policies | ||
to automatically transition your time series data through the hot, warm, and cold tiers | ||
according to your performance, resiliency and data retention requirements. | ||
Learn more about each data tier, including when and how it should be used. | ||
|
||
[discrete] | ||
[[content-tier]] | ||
=== Content tier | ||
==== Content tier | ||
|
||
// tag::content-tier[] | ||
Data stored in the content tier is generally a collection of items such as a product catalog or article archive. | ||
|
@@ -50,13 +81,14 @@ While they are also responsible for indexing, content data is generally not inge | |
as time series data such as logs and metrics. From a resiliency perspective the indices in this | ||
tier should be configured to use one or more replicas. | ||
|
||
The content tier is required. System indices and other indices that aren't part | ||
of a data stream are automatically allocated to the content tier. | ||
The content tier is required and is often deployed within the same node | ||
grouping as the hot tier. System indices and other indices that aren't part | ||
of a data stream are automatically allocated to the content tier. | ||
// end::content-tier[] | ||
|
||
[discrete] | ||
[[hot-tier]] | ||
=== Hot tier | ||
==== Hot tier | ||
|
||
// tag::hot-tier[] | ||
The hot tier is the {es} entry point for time series data and holds your most-recent, | ||
|
@@ -71,7 +103,7 @@ data stream>> are automatically allocated to the hot tier. | |
|
||
[discrete] | ||
[[warm-tier]] | ||
=== Warm tier | ||
==== Warm tier | ||
|
||
// tag::warm-tier[] | ||
Time series data can move to the warm tier once it is being queried less frequently | ||
|
@@ -84,7 +116,7 @@ For resiliency, indices in the warm tier should be configured to use one or more | |
|
||
[discrete] | ||
[[cold-tier]] | ||
=== Cold tier | ||
==== Cold tier | ||
|
||
// tag::cold-tier[] | ||
When you no longer need to search time series data regularly, it can move from | ||
|
@@ -106,7 +138,7 @@ but doesn't reduce required disk space compared to the warm tier. | |
|
||
[discrete] | ||
[[frozen-tier]] | ||
=== Frozen tier | ||
==== Frozen tier | ||
|
||
// tag::frozen-tier[] | ||
Once data is no longer being queried, or being queried rarely, it may move from | ||
|
@@ -120,9 +152,15 @@ sometimes fetch frozen data from the snapshot repository, searches on the frozen | |
tier are typically slower than on the cold tier. | ||
// end::frozen-tier[] | ||
|
||
[discrete] | ||
[[configure-data-tiers]] | ||
=== Configure data tiers | ||
stefnestor marked this conversation as resolved.
|
||
|
||
Follow the instructions for your deployment type to configure data tiers. | ||
|
||
[discrete] | ||
[[configure-data-tiers-cloud]] | ||
=== Configure data tiers on {ess} or {ece} | ||
==== {ess} or {ece} | ||
|
||
The default configuration for an {ecloud} deployment includes a shared tier for | ||
hot and content data. This tier is required and can't be removed. | ||
|
@@ -156,7 +194,7 @@ tier]. | |
|
||
[discrete] | ||
[[configure-data-tiers-on-premise]] | ||
=== Configure data tiers for self-managed deployments | ||
==== Self-managed deployments | ||
|
||
For self-managed deployments, each node's <<data-node,data role>> is configured | ||
in `elasticsearch.yml`. For example, the highest-performance nodes in a cluster | ||
|
@@ -174,25 +212,58 @@ tier. | |
[[data-tier-allocation]] | ||
=== Data tier index allocation | ||
|
||
When you create an index, by default {es} sets | ||
<<tier-preference-allocation-filter, `index.routing.allocation.include._tier_preference`>> | ||
The <<tier-preference-allocation-filter, `index.routing.allocation.include._tier_preference`>> setting determines the tier index shards should be allocated to. | ||
stefnestor marked this conversation as resolved.
|
||
|
||
When you create an index, by default {es} sets the `_tier_preference` | ||
stefnestor marked this conversation as resolved.
|
||
to `data_content` to automatically allocate the index shards to the content tier. | ||
|
||
When {es} creates an index as part of a <<data-streams, data stream>>, | ||
by default {es} sets | ||
<<tier-preference-allocation-filter, `index.routing.allocation.include._tier_preference`>> | ||
by default {es} sets the `_tier_preference` | ||
to `data_hot` to automatically allocate the index shards to the hot tier. | ||
|
||
You can explicitly set `index.routing.allocation.include._tier_preference` | ||
to opt out of the default tier-based allocation. | ||
At the time of index creation, you can override the default setting by explicitly setting | ||
the preferred value in one of two ways: | ||
|
||
- By using an <<index-templates,index template>>. Refer to <<getting-started-index-lifecycle-management,Automate rollover with ILM>> for details. | ||
- From within the <<indices-create-index,create index>> request body. | ||
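As a minimal sketch of the second option, the request below creates an index with an explicit tier preference in the request body. The index name `my-index-000001` is a placeholder, and `data_hot` is only an example value; substitute whichever tier suits your data.

[source,console]
--------------------------------------------------
# Placeholder index name; data_hot is an example value, not a recommendation.
PUT /my-index-000001
{
  "settings": {
    "index.routing.allocation.include._tier_preference": "data_hot"
  }
}
--------------------------------------------------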
stefnestor marked this conversation as resolved.
|
||
|
||
You can override this | ||
setting after index creation by <<indices-update-settings,updating the index setting>> to the preferred | ||
value. | ||
|
||
In this setting, you can provide multiple tiers in order of preference to prevent indices from remaining unallocated if no nodes are available in the preferred tier. | ||
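For illustration, the following sketch updates an existing index to prefer the warm tier and fall back to the hot tier if no warm nodes are available. The index name is a placeholder and the tier order is an assumed example.

[source,console]
--------------------------------------------------
# Placeholder index name; tiers are listed in order of preference.
PUT /my-index-000001/_settings
{
  "index.routing.allocation.include._tier_preference": "data_warm,data_hot"
}
--------------------------------------------------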
stefnestor marked this conversation as resolved.
|
||
|
||
To remove the data tier preference | ||
setting, set the `_tier_preference` value to `null`. This allows the index to allocate to any data node within the cluster. Setting the `_tier_preference` to `null` does not restore the default value. Note that, in the case of managed indices, a <<ilm-migrate,migrate>> action might apply a new value in its place. | ||
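A minimal sketch of removing the preference from an existing index, assuming a placeholder index named `my-index-000001`:

[source,console]
--------------------------------------------------
# Placeholder index name; null removes the tier preference
# and does not restore the default value.
PUT /my-index-000001/_settings
{
  "index.routing.allocation.include._tier_preference": null
}
--------------------------------------------------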
|
||
[discrete] | ||
[[data-tier-allocation-value]] | ||
==== Determine the current data tier preference | ||
|
||
You can check an existing index's data tier preference by <<indices-get-settings,polling its | ||
settings>> for `index.routing.allocation.include._tier_preference`: | ||
|
||
[source,console] | ||
-------------------------------------------------- | ||
GET /my-index-000001/_settings?filter_path=*.settings.index.routing.allocation.include._tier_preference | ||
-------------------------------------------------- | ||
shainaraskas marked this conversation as resolved.
|
||
|
||
[discrete] | ||
[[data-tier-allocation-troubleshooting]] | ||
==== Troubleshooting | ||
|
||
The `_tier_preference` setting might conflict with other allocation settings. This conflict might prevent the shard from allocating. A conflict might occur when a cluster has not yet been completely <<troubleshoot-migrate-to-tiers,migrated | ||
to data tiers>>. | ||
|
||
This setting will not unallocate a currently allocated shard, but might prevent it from migrating from its current location to its designated data tier. To troubleshoot, call the <<cluster-allocation-explain,cluster allocation explain API>> and specify the suspected problematic shard. | ||
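As a hedged example of that troubleshooting call, the request below asks for an allocation explanation for one shard. The index name, shard number, and `primary` flag are placeholder values; replace them with the shard you suspect.

[source,console]
--------------------------------------------------
# Placeholder values: adjust index, shard, and primary to the shard in question.
GET /_cluster/allocation/explain
{
  "index": "my-index-000001",
  "shard": 0,
  "primary": true
}
--------------------------------------------------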
|
||
[discrete] | ||
[[data-tier-migration]] | ||
=== Automatic data tier migration | ||
==== Automatic data tier migration | ||
|
||
{ilm-init} automatically transitions managed | ||
indices through the available data tiers using the <<ilm-migrate, migrate>> action. | ||
By default, this action is automatically injected in every phase. | ||
You can explicitly specify the migrate action with `"enabled": false` to disable automatic migration, | ||
You can explicitly specify the migrate action with `"enabled": false` to <<ilm-disable-migrate-ex,disable automatic migration>>, | ||
for example, if you're using the <<ilm-allocate, allocate action>> to manually | ||
specify allocation rules. |
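The following is a sketch of such a policy, not a recommended configuration. The policy name `my_policy` and the custom node attribute `rack_id` with its values are hypothetical and stand in for whatever allocation rules you define.

[source,console]
--------------------------------------------------
# Hypothetical policy name and node attribute, shown for illustration only.
PUT /_ilm/policy/my_policy
{
  "policy": {
    "phases": {
      "warm": {
        "actions": {
          "migrate": {
            "enabled": false
          },
          "allocate": {
            "include": {
              "rack_id": "one,two"
            }
          }
        }
      }
    }
  }
}
--------------------------------------------------

With the migrate action disabled in the warm phase, {ilm-init} does not change the tier preference there, and the allocate action's rules apply instead.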
I think "content data" is a little ambiguous here, perhaps
… for time series and non time series data.
?