Data tiers

A data tier is a collection of nodes within a cluster that share the same data node role. Elastic recommends that the nodes in a tier also share the same hardware profile to avoid hot spotting. Data tier usage generally splits along two data categories: content data and time series data. {es} offers the following data tiers:

  • Content tier nodes handle the indexing and query load for content indices, such as a system index or a product catalog.

  • Hot tier nodes handle the indexing load for time series, such as logs or metrics. They hold your most recent, most-frequently-accessed data.

  • Warm tier nodes hold time series data that is accessed less frequently and rarely needs to be updated.

  • Cold tier nodes hold time series data that is accessed infrequently and not normally updated. To save space, you can keep fully mounted indices of {search-snaps} on the cold tier. These fully mounted indices eliminate the need for replicas, reducing required disk space by approximately 50% compared to regular indices.

  • Frozen tier nodes hold time series data that is accessed rarely and never updated. The frozen tier stores partially mounted indices of {search-snaps} exclusively. This extends the storage capacity even further — by up to 20 times compared to the warm tier.

Content data remains on the content tier for its entire data lifecycle. You can configure your time series data to progress through the descending temperature data tiers (hot, warm, cold, and frozen) according to your performance, resiliency, and data retention requirements. Elastic recommends automating these lifecycle transitions with {ilm}, ideally in combination with data streams.
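For example, a minimal sketch of an {ilm} policy that walks time series data through the hot, warm, cold, and frozen tiers could look like the following. The policy name, timings, and snapshot repository name are illustrative placeholders, not recommendations:

PUT _ilm/policy/my-timeseries-policy
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": { "max_primary_shard_size": "50gb", "max_age": "30d" }
        }
      },
      "warm": {
        "min_age": "7d",
        "actions": {}
      },
      "cold": {
        "min_age": "30d",
        "actions": {
          "searchable_snapshot": { "snapshot_repository": "my-snapshot-repo" }
        }
      },
      "frozen": {
        "min_age": "90d",
        "actions": {
          "searchable_snapshot": { "snapshot_repository": "my-snapshot-repo" }
        }
      },
      "delete": {
        "min_age": "365d",
        "actions": { "delete": {} }
      }
    }
  }
}

{ilm-init} injects the migrate action into each phase automatically, so the warm phase above moves data to the warm tier without any explicit allocation rules.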

Tip

A data tier's performance depends heavily on its backing hardware profile. See {cloud}/ec-configure-deployment-settings.html#ec-hardware-profiles[{ecloud}'s hardware profiles] for example {cloud}/ec-reference-hardware.html[hardware configurations].

{es} itself does not require it, but Elastic generally assumes, for example in {ecloud} deployment configurations, that descending temperature data tiers have an increasing ratio of data storage to CPU and/or heap resources, so that colder data tiers gain more space for data storage at the cost of slower response times.

Under this assumption, as a general architecture baseline, the descending temperature access pattern outlined above would show searches hitting roughly 85% hot, 10% warm, 5% cold, and 1% frozen, and ingest targeting roughly 95% hot, 4% warm, 1% cold, and 0% frozen, as checked via CAT Thread Pools (see the example below). {es} does not require these proportions, but they encourage stable and highly responsive clusters. They are only intended to serve as a general architecture baseline to be applied to your specific use case, hardware profiles, and architecture per Elastic's It Depends philosophy.
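For example, a rough way to observe how search and ingest load actually distributes across the nodes in each tier is to check the search and write thread pools:

GET _cat/thread_pool/search,write?v=true&h=node_name,name,active,queue,rejected&s=node_name

Sustained high active or queue counts, or any rejections, on a given tier's nodes suggest that tier is taking more load than its hardware profile comfortably supports.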

Available data tiers

Content tier

Data stored in the content tier is generally a collection of items such as a product catalog or article archive. Unlike time series data, the value of the content remains relatively constant over time, so it doesn’t make sense to move it to a tier with different performance characteristics as it ages. Content data typically has long data retention requirements, and you want to be able to retrieve items quickly regardless of how old they are.

Content tier nodes are usually optimized for query performance: they prioritize processing power over IO throughput so they can process complex searches and aggregations and return results quickly. While they are also responsible for indexing, content data is generally not ingested at as high a rate as time series data such as logs and metrics. From a resiliency perspective, the indices in this tier should be configured to use one or more replicas.

The content tier is required and is frequently seen deployed within the same node grouping as the hot tier. System indices and other indices that aren’t part of a data stream are automatically allocated to the content tier.
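For example, a hypothetical content index such as a product catalog could be created with one replica:

PUT /my-product-catalog
{
  "settings": {
    "number_of_replicas": 1
  }
}

Because this index is not part of a data stream, its shards are automatically allocated to the content tier.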

Hot tier

The hot tier is the {es} entry point for time series data and holds your most-recent, most-frequently-searched time series data. Nodes in the hot tier need to be fast for both reads and writes, which requires more hardware resources and faster storage (SSDs). For resiliency, indices in the hot tier should be configured to use one or more replicas.

The hot tier is required. New indices that are part of a data stream are automatically allocated to the hot tier.
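For example, a data stream backed by a minimal, hypothetical index template allocates its backing indices to the hot tier without any explicit allocation settings:

PUT _index_template/my-logs-template
{
  "index_patterns": ["my-logs-*"],
  "data_stream": {}
}

PUT _data_stream/my-logs-example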

Warm tier

Time series data can move to the warm tier once it is being queried less frequently than the recently-indexed data in the hot tier. The warm tier typically holds data from recent weeks. Updates are still allowed, but likely infrequent. Nodes in the warm tier generally don’t need to be as fast as those in the hot tier. For resiliency, indices in the warm tier should be configured to use one or more replicas.

Cold tier

When you no longer need to search time series data regularly, it can move from the warm tier to the cold tier. While still searchable, this tier is typically optimized for lower storage costs rather than search speed.

For better storage savings, you can keep fully mounted indices of {search-snaps} on the cold tier. Unlike regular indices, these fully mounted indices don’t require replicas for reliability. In the event of a failure, they can recover data from the underlying snapshot instead. This potentially halves the local storage needed for the data. A snapshot repository is required to use fully mounted indices in the cold tier. Fully mounted indices are read-only.
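As a sketch, an index can be mounted as a fully mounted index with the mount API, assuming a registered snapshot repository and an existing snapshot (the repository, snapshot, and index names here are hypothetical). In practice, the {ilm-init} searchable_snapshot action in the cold phase does this for you:

POST _snapshot/my-repository/my-snapshot/_mount?storage=full_copy
{
  "index": "my-index-000001"
}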

Alternatively, you can use the cold tier to store regular indices with replicas instead of using {search-snaps}. This lets you store older data on less expensive hardware but doesn’t reduce required disk space compared to the warm tier.

Frozen tier

Once data is no longer being queried, or being queried rarely, it may move from the cold tier to the frozen tier where it stays for the rest of its life.

The frozen tier requires a snapshot repository. The frozen tier uses partially mounted indices to store and load data from a snapshot repository. This reduces local storage and operating costs while still letting you search frozen data. Because {es} must sometimes fetch frozen data from the snapshot repository, searches on the frozen tier are typically slower than on the cold tier.
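As a sketch, the same mount API with storage=shared_cache creates a partially mounted index (again using hypothetical repository, snapshot, and index names); in practice, the {ilm-init} searchable_snapshot action in the frozen phase handles this:

POST _snapshot/my-repository/my-snapshot/_mount?storage=shared_cache
{
  "index": "my-index-000001"
}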

Configure data tiers

On {ess} or {ece}

The default configuration for an {ecloud} deployment includes a shared tier for hot and content data. This tier is required and can’t be removed.

To add a warm, cold, or frozen tier when you create a deployment:

  1. On the Create deployment page, click Advanced Settings.

  2. Click + Add capacity for any data tiers to add.

  3. Click Create deployment at the bottom of the page to save your changes.

{ecloud}'s deployment Advanced configuration page

To add a data tier to an existing deployment:

  1. Log in to the {ess-console}[{ecloud} console].

  2. On the Deployments page, select your deployment.

  3. In your deployment menu, select Edit.

  4. Click + Add capacity for any data tiers to add.

  5. Click Save at the bottom of the page to save your changes.

To remove a data tier, refer to {cloud}/ec-disable-data-tier.html[Disable a data tier].

On self-managed deployments

For self-managed deployments, each node’s data role is configured in elasticsearch.yml. For example, the highest-performance nodes in a cluster might be assigned to both the hot and content tiers:

node.roles: ["data_hot", "data_content"]
Note
We recommend you use dedicated nodes in the frozen tier.
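For example, a dedicated frozen tier node could be configured in elasticsearch.yml as follows. The shared cache setting is optional; on dedicated frozen nodes it defaults to 90% of available disk space:

# Dedicated frozen tier node
node.roles: ["data_frozen"]
# Optional: size of the on-disk cache for partially mounted indices
xpack.searchable.snapshot.shared_cache.size: 90%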

Data tier index allocation

You can check an existing index’s data tier by polling its settings for index.routing.allocation.include._tier_preference:

GET /my-index-000001/_settings?filter_path=*.settings.index.routing.allocation.include._tier_preference

This _tier_preference setting contains a descending preference list for colder data tiers; for example, an index in the cold tier would show data_cold,data_warm,data_hot. See ILM Migrate for more context.
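For a hypothetical index that has reached the cold tier, the filtered response would look something like:

{
  "my-index-000001": {
    "settings": {
      "index": {
        "routing": {
          "allocation": {
            "include": {
              "_tier_preference": "data_cold,data_warm,data_hot"
            }
          }
        }
      }
    }
  }
}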

{es} attempts to allocate the index's shards according to this setting. The setting does not override other allocation settings, and it may conflict with them, preventing the shard from allocating; this has historically occurred when a cluster has not been migrated, or has only been partially migrated, to data tiers. The setting will not unallocate a currently allocated shard, but it may, for example, prevent the shard from migrating from its current location to its designated data tier. To troubleshoot, run Allocation Explain against the suspected problematic shard.
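For example, to explain the allocation of a suspect primary shard of a hypothetical index:

GET _cluster/allocation/explain
{
  "index": "my-index-000001",
  "shard": 0,
  "primary": true
}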

A newly created index defaults its _tier_preference setting to data_content, which allocates the index's shards to the content tier. A data stream overrides this default on its backing indices with data_hot, allocating them to the hot tier instead. You can override these defaults upon index creation by explicitly setting the preferred value, either via an index template (see bootstrapping ILM) or within the create index request body itself. You can also override this setting at any time by updating the index settings to the preferred value.
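For example, a sketch of setting the tier preference explicitly, first at creation time and then later via an index settings update (index names and values are illustrative):

PUT /my-index-000002
{
  "settings": {
    "index.routing.allocation.include._tier_preference": "data_warm,data_hot"
  }
}

PUT /my-index-000002/_settings
{
  "index.routing.allocation.include._tier_preference": "data_hot"
}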

You can set the _tier_preference value to null to remove the data tier preference setting. This allows the index to allocate to any data node within the cluster and does not reset the setting back to its upon-creation default. Be aware that if the index is managed by {ilm-init}, the ILM Migrate action may apply a value again at a later point.
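For example:

PUT /my-index-000001/_settings
{
  "index.routing.allocation.include._tier_preference": null
}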

Automatic data tier migration

{ilm-init} automatically transitions managed indices through the available data tiers using the migrate action. By default, this action is automatically injected in every phase. You can explicitly specify the migrate action with "enabled": false to disable automatic migration, for example, if you’re using the allocate action to manually specify allocation rules.
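For example, a sketch of a warm phase that disables the injected migrate action and instead allocates shards by a hypothetical custom node attribute (node.attr.data: warm on the target nodes):

PUT _ilm/policy/my-custom-allocation-policy
{
  "policy": {
    "phases": {
      "warm": {
        "min_age": "7d",
        "actions": {
          "migrate": { "enabled": false },
          "allocate": {
            "require": { "data": "warm" }
          }
        }
      }
    }
  }
}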