(Doc+) Flush out Data Tiers #107981
base: main

Conversation
👋🏽 howdy, team! I highly value the content on this [Data Tiers](https://www.elastic.co/guide/en/elasticsearch/reference/current/data-tiers.html) page. Thanks for writing it! In my experience, some users may become slightly confused by its golden nuggets due to its brevity. This PR attempts to flesh out common questions while remaining concise.

The main changes are in the first and second-to-last sections; however, I also attempt some heading restructuring to make the TOC idea-groupings clearer for easier scan-throughs.

The specific clarifications I'd like to push, in order of appearance:

- There's the content tier (for the "content" data category, as we've dubbed it on the higher-level page) and the data temperature tiers (for time series data). That the temperature tiers group together is technically not stated, so users end up asking when they'd go hot>warm vs content>warm, etc. I suspect this confusion arises only because users come straight to this page instead of starting at the hierarchy-parent page, so I have linked that up.
- (Main) Frozen being accessed/searched "rarely" should imply, well, rarely. I wrote 1% in the PR's `[TIP]` guideline section as a discussion starting point. Frequently we see users not understanding either that they actually have been, or that they shouldn't have, ≥25% of all searches hitting the frozen tier. This comes up because of architecture bugs (e.g. frozen indices with future timestamps) but also just happenstance (e.g. 01605242, where searches hit majority hot and ~5% cold, but then again hit 75% frozen).
- There's a slew of "how do I check that?", "how do I change that (at creation/later)?", and "what if I set it to null?" questions we get about `_tier_preference`, so I just extended the existing section about it.

TIA! 🙏
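The `_tier_preference` questions mentioned above could be illustrated with a minimal Console sketch. The setting name `index.routing.allocation.include._tier_preference` is the real index setting; the index name `my-index` and the chosen tier values are hypothetical placeholders:

```console
# How is my index's tier preference currently set? ("my-index" is a placeholder)
GET my-index/_settings?include_defaults=true&filter_path=**._tier_preference

# Change it after creation, e.g. prefer warm and fall back to hot
PUT my-index/_settings
{
  "index.routing.allocation.include._tier_preference": "data_warm,data_hot"
}
```

At creation time the same setting can be supplied in the create-index request or an index template; the "what if I set it to null?" behavior is version-dependent, which is exactly why the doc section should spell it out.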
Documentation preview:

@stefnestor please enable the option "Allow edits and access to secrets by maintainers" on your PR. For more information, see the documentation.

Pinging @elastic/es-docs (Team:Docs)
🔥 you added so many great details in this PR!
I've reviewed and provided some feedback/edits from an organization and clarity POV. There are some nuances around tier hardware profiles that I didn't completely understand, so I apologize for any inaccuracies I injected with my edits and for any feedback that doesn't exactly align with your goals.
Co-authored-by: shainaraskas <58563081+shainaraskas@users.noreply.github.com>
👋🏽 @shainaraskas, thanks for hanging out! Apologies for the delay; I work weekends, so today's my Monday. Your edits are also 🔥, cheers! I accepted all grammar and most rewordings; I've left comments on what remains, because I agree it matters to get these parts right to avoid confusion.
just working through your comments on the index allocation section but thought I'd throw these comments your way :)
Co-authored-by: shainaraskas <58563081+shainaraskas@users.noreply.github.com>
looking so good! left a couple of comments that are up to your preference.
I think we're basically ready to go, but I'm not sure why the tests are failing. looking into it now.
edit: this looks like it's maybe the same error as your other PR, so I'm going to rebase this one too.
edit 2: after it's green and you check out my comments, feel free to merge (unless you're waiting on an engineering review).
we can also probably target
Co-authored-by: shainaraskas <58563081+shainaraskas@users.noreply.github.com>
- Search: 85% hot, 10% warm, 5% cold, and 1% frozen
- Ingest: 95% hot, 4% warm, 1% cold, and 0% frozen
👋🏽 @dakrone will you kindly review these proportional percentages per data tier for Dev sign-off? I believe the rest of this PR consolidates content from existing doc pages for clarity, but this call out uniquely makes a new claim.
Where did we get these numbers? I don't think we can make generalizations for these kinds of percentages. For example, it's perfectly valid to have a "search" load that's hot and frozen, where the searches hit each tier 50% of the time (again, the performance requirements aren't something we can supply; they have to come from the user).

On the ingestion side, I wouldn't expect any indexing at all on the warm and cold tiers, so how did we arrive at the 4% and 1% numbers respectively?
> Where did we get these numbers?

In the PR description I highlighted that I guesstimated/made up these numbers. Please consider them only placeholders.

> for example, it's perfectly valid to have a "search" load that's hot and frozen, where the searches hit each tier 50% of the time (again, the performance requirements aren't something we can supply, they have to come from the user).

From Support, I may only deal with the situations where searches 50% hitting frozen break the cluster. The age-old example is the frozen tier having future dates taking down the entire cluster. I do want to highlight, though, that the existing doc already says "Frozen tier nodes hold time series data that is accessed rarely and never updated." I may be missing the intended interpretation, but "accessed rarely" does not sound like 50% to me; it sounds a lot more like the 1% I guesstimated.

> On the ingestion side, I wouldn't expect any indexing at all on the warm and cold tiers, how did we arrive at the 4% and 1% numbers respectively?

Again, guesstimated from the existing doc saying "Warm tier nodes hold time series data that is accessed less-frequently and rarely needs to be updated. ... Cold tier nodes hold time series data that is accessed infrequently and not normally updated." I don't know what these numbers should be, which is why I requested your feedback 🙏.

I'm on board if in general we're concerned about explicit percentages, but at least from what I see, users feel unguided and don't realize that, for the performance they desire, they haven't architected in a way that'd get them there. That's the need I'm hoping to fill better, but I'm not tied to how we do it. So if the wording needs to change, or we need an "it depends" blog instead and just link to it from here, all that's fine by me. But I would like to advocate for something more concrete to point users to for base-level architecture / expectation setting.
I left some comments for this change.
I also have concerns that we give a false sense of specificity with giving hard recommendations for percentages in these docs. My preference would be to teach the reader to weigh the values of cost, performance, and configuration complexity rather than giving hard numbers that are likely to mislead a user. I'm curious what your thoughts about this are.
docs/reference/datatiers.asciidoc
Outdated
A _data tier_ is a collection of <<modules-node,nodes>> within a cluster which share the same
<<node-roles,data node role>>. Elastic recommends this collection of nodes also shares the same
hardware profile to avoid <<hotspotting,hot spotting>>. Data tiers' usage generally splits along
<<data-management,data categories>> for _content_ and _time series_ data. {es} available
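The "nodes within a cluster which share the same data node role" definition above can be sketched as node configuration. This is a hedged illustration: `node.roles` and the `data_warm` role name are real Elasticsearch settings, while the idea of a single dedicated warm node is a hypothetical example:

```yaml
# elasticsearch.yml for a hypothetical dedicated warm-tier data node;
# every node carrying data_warm together forms the cluster's warm tier
node.roles: [ data_warm ]
```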
I think "content data" is a little ambiguous here, perhaps "… for time series and non time series data"?
* <<content-tier,Content tier>> nodes handle the indexing and query load for content
indices, such as a <<system-indices,system index>> or a product catalog.
System indices and data streams can also be time series data, so I don't think we should use it as an example here. I think we should stick with a timeseries/non-timeseries distinction.
This might be another discussion point for me then, if we can dig in:

- Lower down on the existing page, under the Content header, it already says "System indices and other indices that aren't part of a data stream are automatically allocated to the content tier.", which is why I didn't realize I might be misunderstanding.
- Support encourages users to keep all system indices on hot/content. Does Dev agree?
- AFAIK (and it's an ongoing discussion / definition problem) system indices are the indices which report from the snapshot's feature states. From the unofficial list I wrote for Support, we later learned that e.g. `.ilm-history` and `.kibana-event-log` don't qualify as system indices. So e.g. only (A) qualify as system indices, and AFAICT that subset doesn't have time series data (at least no indices which'd roll over. EDIT: other than the ML ones, if that's what you were referencing?).
(A)

```json
{
  "feature_states": [
    {
      "feature_name": "security",
      "indices": [".security-tokens-7", ".security-7", ".security-profile-8"]
    },
    {
      "feature_name": "geoip",
      "indices": [".geoip_databases"]
    },
    {
      "feature_name": "async_search",
      "indices": [".async-search"]
    },
    {
      "feature_name": "machine_learning",
      "indices": [".ml-inference-native-000002", ".ml-inference-000005", ".ml-config"]
    },
    {
      "feature_name": "transform",
      "indices": [".transform-internal-007"]
    },
    {
      "feature_name": "kibana",
      "indices": [
        ".kibana_analytics_8.12.2_001",
        ".kibana_task_manager_8.12.2_001",
        ".kibana_ingest_8.12.2_001",
        ".apm-custom-link",
        ".apm-agent-configuration",
        ".kibana_8.12.2_001",
        ".kibana_security_session_1",
        ".kibana_security_solution_8.12.2_001",
        ".kibana_alerting_cases_8.12.2_001"
      ]
    },
    {
      "feature_name": "tasks",
      "indices": [".tasks"]
    },
    {
      "feature_name": "fleet",
      "indices": [
        ".fleet-agents-7",
        ".fleet-enrollment-api-keys-7",
        ".fleet-actions-7",
        ".fleet-policies-7",
        ".fleet-servers-7",
        ".fleet-policies-leader-7"
      ]
    }
  ]
}
```
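For context, a feature list like the one above can be enumerated on a live cluster via the Get features API (a real endpoint; the exact response shape may differ by version, so treat this as a sketch):

```console
# List the features whose system indices are captured as snapshot feature states
GET /_features
```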
You can check how your access requests are distributed among your data tiers using the <<cat-thread-pool,CAT thread pools>> API. If your lower temperature tiers are being accessed at higher proportions, then your cluster performance might be impacted.
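The check described in the snippet above might look like the following sketch (`_cat/thread_pool` and the listed column names are real; mapping the per-node counts back to tiers is left to the reader):

```console
# Search thread pool stats per node; compare the `completed` counts between
# nodes in different tiers to approximate the search distribution
GET _cat/thread_pool/search?v=true&h=node_name,name,active,queue,rejected,completed
```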
I think looking through the cat threadpool API is a big request for an end user. It would be fairly easy to misunderstand, and since it's non-persistent it may give a very skewed view of a workload.
That's fair! I'm curious what alternative investigation you'd recommend, since it's a current user need?

(I again may be ignorant of better ways. For the limited view I have: IME there are currently only hodge-podge answers like this outlined API; that would be a design-improvement takeaway, but it shouldn't stop us from telling users the best way they can introspect right now. A possible alternative would be enabling Monitoring and then comparing node ingest rates; would that be better?)
These proportions are intended to serve as a general baseline that you can apply to your specific
use case, hardware profiles, and architecture.
How would these actually be applied? You mention above "your requests should be distributed to data tiers in the following approximate proportions", but that's not something prescriptive a user can actually do.
We don't want them to try and route queries to different tiers based on ratios, but rather to size things accordingly. Again, I'm worried that we simplify the problem here, it's not only a performance trade-off but also one of cost (for which this does not account).
This is fair 🤔.

I did not list it (my miss) but expected the answer to line up with Support's: (A) hold data in higher tiers longer, probably by updating an ILM policy; (B) where possible, filter searches by time range to avoid load on lower tiers; or (C) review performance vs. billing needs via the currently listed "apply to your specific use case, hardware profiles, and architecture". Plus (D): we recommend searchable snapshots to reduce billing while extending data retention.
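Options (A) and (B) above could be sketched roughly as follows. The `_ilm/policy` endpoint and range queries are real Elasticsearch APIs, but the policy name `my-policy`, the target `my-data-stream`, and the `90d`/`7d` ages are hypothetical placeholders:

```console
# (A) Keep data in the hot tier longer by raising the warm phase's min_age
PUT _ilm/policy/my-policy
{
  "policy": {
    "phases": {
      "hot":  { "actions": {} },
      "warm": { "min_age": "90d", "actions": {} }
    }
  }
}

# (B) Filter searches by time range so they avoid the lower (colder) tiers
GET my-data-stream/_search
{
  "query": {
    "range": { "@timestamp": { "gte": "now-7d/d" } }
  }
}
```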
Co-authored-by: Lee Hinman <dakrone@users.noreply.github.com>
cc: @dakrone @bytebilly