(Doc+) Flush out Data Tiers #107981

Open
wants to merge 9 commits into
base: main

stefnestor
Contributor

πŸ‘‹πŸ½ howdy, team!

I highly value the content on this Data Tiers page. Thanks for writing it! In my experience, some users may become slightly confused by its golden nuggets due to its brevity. This PR attempts to flush out common questions while remaining concise.

The main changes are in the first and second-to-last sections; however, I do attempt some heading restructuring to make the TOC idea-groupings more clear for easier scan-throughs.

The specific clarifications I'd like to push in order of appearance:

  • There's content tier (for "data category" > "content" as we've dubbed it on the higher page) and the data temperature tiers (for time series). That the temperature tiers group together is technically not stated so users end up asking about when they'd go hot>warm vs content>warm, etc. I suspect this confusion is only because users come straight to this page instead of starting at the hierarchy-parent page so have linked up.
  • Frozen being accessed/searched "rarely" should imply, well rarely. I wrote 1% in the PR [TIP] guideline section as a discussion starting point. Frequently we see users not understanding either that they actually have been or that they shouldn't have β‰₯25% of all searches hitting frozen tier. This comes up because of architecture bugs (e.g. frozen indices with future timestamps) but also just happenstance (e.g. 01605242 where of searches they hit majority hot, ~5% cold, but then again hit 75% frozen).
  • There's a slew of "how do I check that?", "how do I change that (at creation/later)?", "what if I set it null?" questions we get about _tier_preference so just extended the existing section already about it.

TIA! πŸ™ cc: @dakrone @bytebilly

πŸ‘‹πŸ½  howdy, team!

I highly value the content on this [Data Tiers](https://www.elastic.co/guide/en/elasticsearch/reference/current/data-tiers.html) page. Thanks for writing it! In my experience, some users may become slightly confused by its golden nuggets due to its brevity. This PR attempts to flush out common questions while remaining concise. 

The main changes are in the first and second-to-last sections; however, I do attempt some heading restructuring to make the TOC idea-groupings more clear for easier scan-throughs. 

The specific clarifications I'd like to push in order of appearance:

- There's content tier (for "data category" > "content" as we've dubbed it on the higher page) and the data temperature tiers (for time series). That the temperature tiers group together is technically not stated so users end up asking about when they'd go hot>warm vs content>warm, etc. I suspect this confusion is only because users come straight to this page instead of starting at the hierarchy-parent page so have linked up. 
- (Main) Frozen being accessed/searched "rarely" should imply, well rarely. I wrote 1% in the PR `[TIP]` guideline section as a discussion starting point. Frequently we see users not understanding either that they actually have been or that they shouldn't have β‰₯25% of all searches hitting frozen tier. This comes up because of architecture bugs (e.g. frozen indices with future timestamps) but also just happenstance (e.g. 01605242 where of searches they hit majority hot, ~5% cold, but then again hit 75% frozen).
- There's a slew of "how do I check that?", "how do I change that (at creation/later)?", "what if I set it null?" questions we get about `_tier_preference` so just extended the existing section already about it. 

TIA! πŸ™
@stefnestor added the >enhancement, >docs, Team:Data Management, Team:Docs, and Supportability labels Apr 27, 2024

@elasticsearchmachine
Collaborator

@stefnestor please enable the option "Allow edits and access to secrets by maintainers" on your PR. For more information, see the documentation.

@elasticsearchmachine added the v8.15.0 and external-contributor labels Apr 27, 2024
@elasticsearchmachine
Collaborator

Pinging @elastic/es-docs (Team:Docs)

@elasticsearchmachine removed the Team:Data Management label Apr 27, 2024
@stefnestor added the Team:Data Management label Apr 27, 2024
@elasticsearchmachine removed the Team:Data Management label Apr 27, 2024
@shainaraskas self-requested a review April 29, 2024 15:33
Contributor

@shainaraskas shainaraskas left a comment

🔥 you added so many great details in this PR!

I've reviewed and provided some feedback/edits from an organization and clarity POV. There are some nuances around tier hardware profiles that I didn't completely understand, so I apologize for any inaccuracies I injected with my edits and for any feedback that doesn't exactly align with your goals.

stefnestor and others added 2 commits May 2, 2024 11:05
Co-authored-by: shainaraskas <58563081+shainaraskas@users.noreply.github.com>
Co-authored-by: shainaraskas <58563081+shainaraskas@users.noreply.github.com>
@stefnestor
Contributor Author

πŸ‘‹πŸ½ @shainaraskas , thanks for hanging out! Apologies for the delay, I work weekends so today's my Monday.

Your edits are also πŸ”₯ , cheers! I accepted all grammar and most rewordings; I've left comments on what remains because I agree it matters to get these parts right to avoid confusion.

Contributor

@shainaraskas shainaraskas left a comment

just working through your comments on the index allocation section but thought I'd throw these comments your way :)

Co-authored-by: shainaraskas <58563081+shainaraskas@users.noreply.github.com>
Contributor

@shainaraskas shainaraskas left a comment

looking so good! left a couple of comments that are up to your preference.

I think we're basically ready to go, but I'm not sure why the tests are failing. looking into it now. 👍

edit: this looks like it's maybe the same error as your other PR, so I'm going to rebase this one too.

edit 2: after it's green and you check out my comments, feel free to merge (unless you're waiting on an engineering review).

@shainaraskas
Contributor

we can also probably target 8.14.0, 8.13.3, and 8.13.4 with this so the docs are available asap.

Co-authored-by: shainaraskas <58563081+shainaraskas@users.noreply.github.com>

- Search: 85% hot, 10% warm, 5% cold, and 1% frozen
- Ingest: 95% hot, 4% warm, 1% cold, and 0% frozen
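
As a quick arithmetic check on the two proposed splits above, the sketch below sums each row. The dictionary simply restates the quoted numbers; `totals` is a hypothetical helper name.

```python
# Proposed proportions as quoted in the PR diff (percent of requests per tier).
PROPOSED = {
    "search": {"hot": 85, "warm": 10, "cold": 5, "frozen": 1},
    "ingest": {"hot": 95, "warm": 4, "cold": 1, "frozen": 0},
}


def totals(proportions):
    """Sum the per-tier percentages for each kind of request."""
    return {kind: sum(split.values()) for kind, split in proportions.items()}


print(totals(PROPOSED))  # {'search': 101, 'ingest': 100}
```

As written, the search row adds up to 101%, so one of its figures would need adjusting if the percentages are kept.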

Contributor Author

πŸ‘‹πŸ½ @dakrone will you kindly review these proportional percentages per data tier for Dev sign-off? I believe the rest of this PR consolidates content from existing doc pages for clarity, but this call out uniquely makes a new claim.

Member

Where did we get these numbers? I don't think we can make generalizations for these kinds of percentages, for example, it's perfectly valid to have a "search" load that's hot and frozen, where the searches hit each tier 50% of the time (again, the performance requirements aren't something we can supply, they have to come from the user).

On the ingestion side, I wouldn't expect any indexing at all on the warm and cold tiers, how did we arrive at the 4% and 1% numbers respectively?

Contributor Author

> Where did we get these numbers?

In the PR description I highlighted that I guesstimated/made up these numbers. Please consider them placeholders only.

> for example, it's perfectly valid to have a "search" load that's hot and frozen, where the searches hit each tier 50% of the time (again, the performance requirements aren't something we can supply, they have to come from the user).

From Support, I may only deal with the situations where 50% of searches hitting frozen breaks the cluster. The age-old example is the frozen tier having future dates, which takes down the entire cluster. I do want to highlight, though, that the existing doc already says "Frozen tier nodes hold time series data that is accessed rarely and never updated." I may be missing the intended interpretation, but "accessed rarely" does not sound like 50% to me; it sounds a lot more like the 1% I guesstimated.

> On the ingestion side, I wouldn't expect any indexing at all on the warm and cold tiers, how did we arrive at the 4% and 1% numbers respectively?

Again, guesstimated from the existing doc saying "Warm tier nodes hold time series data that is accessed less-frequently and rarely needs to be updated. ... Cold tier nodes hold time series data that is accessed infrequently and not normally updated." I don't know what these numbers should be, which is why I requested your feedback 🙂.

I'm on board if in general we're concerned about explicit percentages, but at least from what I see, users feel unguided and don't realize that, for the performance they desire, they haven't architected in a way that would get them there. That's the need I'm hoping to fill better, but I'm not tied to how we do that. So if the wording needs to change, or we need an "it depends" blog instead and just link to it from here, all of that is fine by me. But I would like to advocate for something more concrete to point users to for base-level architecture / expectation setting.

Member

@dakrone dakrone left a comment

I left some comments for this change.

I also have concerns that we give a false sense of specificity with giving hard recommendations for percentages in these docs. My preference would be to teach the reader to weigh the values of cost, performance, and configuration complexity rather than giving hard numbers that are likely to mislead a user. I'm curious what your thoughts about this are.

A _data tier_ is a collection of <<modules-node,nodes>> within a cluster which share the same
<<node-roles,data node role>>. Elastic recommends this collection of nodes also shares the same
hardware profile to avoid <<hotspotting,hot spotting>>. Data tiers' usage generally splits along
<<data-management,data categories>> for _content_ and _time series_ data. {es} available
Member

I think "content data" is a little ambiguous here, perhaps "… for time series and non time series data."?

Comment on lines +11 to +12
* <<content-tier,Content tier>> nodes handle the indexing and query load for content
indices, such as a <<system-indices,system index>> or a product catalog.
Member

System indices and data streams can also be time series data, so I don't think we should use it as an example here. I think we should stick with a timeseries/non-timeseries distinction.

Contributor Author

This might be another 😕 point for me then, if we can discuss:

  1. Lower down on the existing page, the section under the Content header already says "System indices and other indices that aren't part of a data stream are automatically allocated to the content tier.", which is why I didn't realize I might be misunderstanding.
  2. Support encourages users to keep all system indices on hot/content. Does Dev agree?
  3. AFAIK (and it's an ongoing discussion / definition problem), system indices are the indices reported in a snapshot's feature states. From the unofficial list I wrote for Support, we later learned that e.g. .ilm-history and .kibana-event-log don't qualify as system indices. So only the indices in (A) qualify as system indices, and AFAICT that subset doesn't hold time series data (at least no indices which would roll over. EDIT: other than the ML ones, if that's what you were referencing?).
(A)
{
  "feature_states": [
    {
      "feature_name": "security",
      "indices": [".security-tokens-7",".security-7",".security-profile-8"]
    },
    {
      "feature_name": "geoip",
      "indices": [".geoip_databases"]
    },
    {
      "feature_name": "async_search",
      "indices": [".async-search"]
    },
    {
      "feature_name": "machine_learning",
      "indices": [".ml-inference-native-000002",".ml-inference-000005",".ml-config"]
    },
    {
      "feature_name": "transform",
      "indices": [".transform-internal-007"]
    },
    {
      "feature_name": "kibana",
      "indices": [
        ".kibana_analytics_8.12.2_001",
        ".kibana_task_manager_8.12.2_001",
        ".kibana_ingest_8.12.2_001",
        ".apm-custom-link",
        ".apm-agent-configuration",
        ".kibana_8.12.2_001",
        ".kibana_security_session_1",
        ".kibana_security_solution_8.12.2_001",
        ".kibana_alerting_cases_8.12.2_001"
      ]
    },
    {
      "feature_name": "tasks",
      "indices": [".tasks"]
    },
    {
      "feature_name": "fleet",
      "indices": [
        ".fleet-agents-7",
        ".fleet-enrollment-api-keys-7",
        ".fleet-actions-7",
        ".fleet-policies-7",
        ".fleet-servers-7",
        ".fleet-policies-leader-7"
      ]
    }
  ]
}

- Search: 85% hot, 10% warm, 5% cold, and 1% frozen
- Ingest: 95% hot, 4% warm, 1% cold, and 0% frozen

You can check how your access requests are distributed among your data tiers using the <<cat-thread-pool,CAT thread pools>> API. If your lower temperature tiers are being accessed at higher proportions, then your cluster performance might be impacted.
Member

I think looking through the cat threadpool API is a big request for an end user. It would be fairly easy to misunderstand, and since it's non-persistent it may give a very skewed view of a workload.

Contributor Author

That's fair! I'm curious what alternative investigation you'd recommend since it's a current user need?

(Again, I may be ignorant of better ways. From the limited view I have: IME there are only hodge-podge answers like the API outlined here. That would be a design-improvement takeaway, but it shouldn't stop us from telling users the best way they can introspect right now. A possible alternative might be enabling Monitoring and then comparing node ingest rates; would that be better?)
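
To make the thread pool idea discussed in this thread concrete, here is a rough sketch of how per-tier shares could be derived from `GET _cat/nodes?h=name,node.role` and `GET _cat/thread_pool/search,write?format=json&h=node_name,name,completed`. It is an illustration only: `tier_shares` is a hypothetical helper, it assumes each data node holds exactly one tier role, and it assumes the named pool is where that tier's work actually runs. The counters reset when a node restarts, which is the skew concern raised above.

```python
from collections import defaultdict

# Single-letter codes from the `node.role` column of `_cat/nodes`, mapped to tiers.
TIER_BY_ROLE = {"h": "hot", "w": "warm", "c": "cold", "f": "frozen", "s": "content"}


def tier_shares(node_roles, pool_rows, pool="search"):
    """Return each tier's share (0..1) of completed tasks for one thread pool.

    node_roles: {node_name: role letters}
    pool_rows:  dicts with node_name, name, and completed (cat JSON uses strings)
    """
    completed = defaultdict(int)
    for row in pool_rows:
        if row["name"] != pool:
            continue
        letters = node_roles.get(row["node_name"], "")
        tiers = [TIER_BY_ROLE[r] for r in letters if r in TIER_BY_ROLE]
        if len(tiers) != 1:  # skip non-data nodes and nodes holding several tiers
            continue
        completed[tiers[0]] += int(row["completed"])
    total = sum(completed.values())
    return {tier: count / total for tier, count in completed.items()} if total else {}


if __name__ == "__main__":
    # Hypothetical sample data, not taken from a real cluster.
    roles = {"n-hot": "h", "n-warm": "w", "n-frozen": "f", "n-master": "m"}
    rows = [
        {"node_name": "n-hot", "name": "search", "completed": "850"},
        {"node_name": "n-warm", "name": "search", "completed": "100"},
        {"node_name": "n-frozen", "name": "search", "completed": "50"},
        {"node_name": "n-hot", "name": "write", "completed": "9000"},
    ]
    print(tier_shares(roles, rows))
```

Nodes that combine tiers (for example hot plus content) are skipped here because their counters cannot be attributed to a single tier.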

Comment on lines +61 to +62
These proportions are intended to serve as a general baseline that you can apply to your specific
use case, hardware profiles, and architecture.
Member

How would these actually be applied? You mention above "your requests should be distributed to data tiers in the following approximate proportions", but that's not something prescriptive a user can actually do.

We don't want them to try and route queries to different tiers based on ratios, but rather to size things accordingly. Again, I'm worried that we simplify the problem here, it's not only a performance trade-off but also one of cost (for which this does not account).

Contributor Author

This is fair 🤔.

I did not list this (my miss), but I expected the answer to line up with Support's: (A) hold data in higher tiers longer, probably by updating an ILM policy; (B) where possible, filter searches by time range to avoid load on lower tiers; or (C) review performance vs. billing needs via the currently listed "apply to your specific use case, hardware profiles, and architecture". Plus (D): we recommend Searchable Snapshots to reduce billing while extending data retention.
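
Option (B) can be sketched as a search body that always carries a time range filter. This is an illustration rather than a recommendation from the thread: the helper name, the `@timestamp` field, and the `now-7d` default are assumptions.

```python
def recent_only_query(inner_query, since="now-7d", time_field="@timestamp"):
    """Wrap a query so it only matches documents at or after `since`."""
    return {
        "query": {
            "bool": {
                "must": [inner_query],
                # The range clause runs in filter context, so it does not affect scoring.
                "filter": [{"range": {time_field: {"gte": since}}}],
            }
        }
    }


if __name__ == "__main__":
    # Hypothetical inner query for illustration.
    print(recent_only_query({"match": {"message": "timeout"}}))
```

The intent is that indices whose data is entirely older than the range, which on a tiered cluster tend to sit in the colder tiers, contribute no matching documents to the search.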

Co-authored-by: Lee Hinman <dakrone@users.noreply.github.com>
Labels
>docs, >enhancement, external-contributor, Supportability, Team:Docs, v8.13.3, v8.13.5, v8.14.0, v8.14.1, v8.15.0
4 participants