
[Dataset quality] Adjust degraded docs query #179227

Closed
yngrdyn opened this issue Mar 22, 2024 · 13 comments · Fixed by #183183
Labels: Feature:Dataset Health · Team:obs-ux-logs (Observability Logs User Experience Team)

Comments

@yngrdyn
Contributor

yngrdyn commented Mar 22, 2024

📓 Summary

When elastic/elasticsearch#59946 lands, we will need to improve the query that retrieves degraded logs.

✔️ Acceptance criteria

  • Perform a field caps request to check whether any data streams in the selected timeframe contain indices that don't support aggregating on _ignored (see the sketch after this list).
  • If some data streams don't support the aggregation, let users know about it. We could use a banner with a popup that lists the affected data streams and recommends how to fix them (through manual rollovers).
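For reference, a minimal sketch of what that check could look like (index names are made up and the response is abbreviated; when aggregatability differs across backing indices, field caps reports aggregatable: false together with non_aggregatable_indices):

GET /logs-*/_field_caps?fields=_ignored

### abbreviated response
{
  "indices": [".ds-logs-synth-default-000001", ".ds-logs-synth-default-000002"],
  "fields": {
    "_ignored": {
      "_ignored": {
        "metadata_field": true,
        "searchable": true,
        "aggregatable": false,
        "non_aggregatable_indices": [".ds-logs-synth-default-000001"]
      }
    }
  }
}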

❓ Open questions

  • Changing the query will result in partial failures for old indices. How can we handle such cases? Options that I see:
  1. We let them fail partially (for the shards of the old indices) and just show an indicator that the data is incomplete because indices X, Y, Z couldn't be checked.
  2. We let them fail but retry for those indices with another mechanism (runtime field? old query?).
  3. We perform an initial check using the field caps API. If an index doesn't support aggregating on _ignored, we let users know that this will take more time than expected, run the query with the aggregation on the indices that support it, and find an alternative (related to 2) to get the information for the indices that don't.
@yngrdyn added the Team:obs-ux-logs and Feature:Dataset Health labels on Mar 22, 2024
@elasticmachine
Contributor

Pinging @elastic/obs-ux-logs-team (Team:obs-ux-logs)

@ruflin
Member

ruflin commented Mar 26, 2024

Engineering-effort-wise, what is the least effort? I'm thinking of starting with 1 and then following up with 3. Or, if 3 is as simple as 1, jumping directly to 3.

One thing we also have to consider is how much load we put on the cluster. I assume it should be doable for the default of 24h, but for longer periods it could take a long time and force Elasticsearch to read all the data.

@yngrdyn
Contributor Author

yngrdyn commented Mar 26, 2024

Thanks for your input @ruflin. I have another question: is there a way to move the indices that don't support the aggregation to a state where they support it?

@ruflin
Member

ruflin commented Mar 26, 2024

is there a way to move the indices that don't support the aggregation to a state where they support it?

For data streams, my understanding is that as soon as a rollover is triggered and the new index is created, it will be available for all the new data. @salvatore-campagna Can you confirm?
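For example, a manual rollover of a hypothetical logs-synth-default data stream would create a new write index with the current mappings, so newly ingested documents support the aggregation:

POST /logs-synth-default/_rollover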

@yngrdyn
Contributor Author

yngrdyn commented Mar 26, 2024

For data streams, my understanding is that as soon as a rollover is triggered and the new index is created

We could guide users to do this, so that next time they can fully benefit from the aggregation.

@yngrdyn
Contributor Author

yngrdyn commented May 6, 2024

@mdbirnstiehl could you help us here with some communications?

  1. The request might take longer because data streams X, Y, Z don't support the _ignored aggregation.
  2. To support the aggregation for future data, and consequently improve performance on the page, we recommend that users perform manual rollovers on the aforementioned data streams.

@mdbirnstiehl
Contributor

@mdbirnstiehl could you help us here with some communications?

  1. The request might take longer because data streams X, Y, Z don't support the _ignored aggregation.
  2. To support the aggregation for future data, and consequently improve performance on the page, we recommend that users perform manual rollovers on the aforementioned data streams.

How about:

Your request may take longer to complete because data streams X, Y, Z don't support the _ignored aggregation. Manually [rollover](https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-rollover-index.html) these data streams to prevent future delays.

My concern is whether or not "supporting the _ignored aggregation" is meaningful to users, but I'm not sure of a more meaningful way to put it without making the message too long.

@felixbarny
Member

Just my 2c here, but I'd go with option 1: warn users about partial results and encourage them to roll over the data stream. I'd rather not give users a foot-gun that can affect their cluster's performance. If we still choose to do a fallback query for indices that don't support aggregations on _ignored, I'd implement it as an opt-in, i.e. the user needs to click a button to trigger the slow aggregation.

@yngrdyn
Contributor Author

yngrdyn commented May 7, 2024

Hey Felix, initially I thought we would need to change the query to something like

{
  "size": 0,
  "aggs": {
    "degraded": {
      "terms": {
        "field": "_ignored"
      }
    }
  }
}

With that type of query we will get failures and partial results. The problem is that it will only tell us how many documents contain a value for a specific _ignored property. A bucket in the response will look like this:

{
   "key": {
     "dataset": "synth",
     "namespace": "default"
   },
   "doc_count": 34260,
   "degraded": {
     "doc_count_error_upper_bound": 0,
     "sum_other_doc_count": 0,
     "buckets": [
       {
         "key": "cloud.availability_zone",
         "doc_count": 3460
       },
       {
         "key": "log.level",
         "doc_count": 260
       }
     ]
   }
}

The problem with that type of response is that we are not able to tell how many documents in that dataset actually have _ignored properties: a document could have only cloud.availability_zone ignored, only log.level, or both cloud.availability_zone and log.level.

Maybe I'm missing something, but I still feel we would need to make use of exists, as in the original query, at least at this stage.

@felixbarny
Member

I think I've found a simple solution to get the ratio of degraded documents only for indices that have doc_values for the _ignored field. Instead of doing an exists query, we could use a missing aggregation like so:

{
  "size": 0,
  "aggs": {
    "non_degraded": {
      "missing": {
        "field": "_ignored"
      }
    }
  }
}

Based on some quick testing, the aggregation fails on shards where the field doesn't have doc_values but succeeds on other shards.

Example:
PUT /test-missing-agg
{
  "mappings": {
    "properties": {
      "ignored":{
        "type": "keyword",
        "doc_values": false,
        "store": true
      }
    }
  }
}

PUT /test-missing-agg2
{
  "mappings": {
    "properties": {
      "ignored":{
        "type": "keyword"
      }
    }
  }
}


POST test-missing-agg/_doc?refresh
Authorization: Basic {{username}} {{password}}
Content-Type: application/json
{
  "foo": "bar",
  "ignored": ["foo", "bar"]
}


POST test-missing-agg2/_doc?refresh
Authorization: Basic {{username}} {{password}}
Content-Type: application/json
{
  "foo": "bar",
  "ignored": ["foo", "bar"]
}

POST /test-missing-agg/_doc?refresh
{
  "foo": "bar"
}


POST /test-missing-agg2/_doc?refresh
{
  "foo": "bar"
}

POST /test-missing-agg/_search
{
  "size": 0,
  "aggs": {
    "non_degraded": {
      "missing": {
        "field": "ignored"
      }
    }
  }
}


### response

{
  "took": 6,
  "timed_out": false,
  "_shards": {
    "total": 2,
    "successful": 1,
    "skipped": 0,
    "failed": 1,
    "failures": [
      {
        "shard": 0,
        "index": "test-missing-agg",
        "node": "0G1taQ10S6KjBFKEnwdOkA",
        "reason": {
          "type": "illegal_argument_exception",
          "reason": "Can't load fielddata on [ignored] because fielddata is unsupported on fields of type [keyword]. Use doc values instead."
        }
      }
    ]
  },
  "hits": {
    "total": {
      "value": 2,
      "relation": "eq"
    },
    "max_score": null,
    "hits": []
  },
  "aggregations": {
    "non_degraded": {
      "doc_count": 1
    }
  }
}
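For completeness, on shards where the aggregation succeeds the degraded count falls out by subtraction: in the response above, hits.total.value = 2 and non_degraded.doc_count = 1, so 2 − 1 = 1 document would be counted as degraded (a 50% ratio), and the failures array identifies exactly which indices could not be aggregated, which is what a warning banner would need.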

@yngrdyn
Contributor Author

yngrdyn commented May 7, 2024

Thanks @felixbarny, I did some quick testing in edge-lite.
The following tables contain the query used, the time each of the 10 runs took, and the average, all in ms.

For 5d, including non-aggregatable indices:

| query | 1st | 2nd | 3rd | 4th | 5th | 6th | 7th | 8th | 9th | 10th | avg (ms) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| exists | 87990 | 57915 | 98406 | 59884 | 92024 | 48997 | 73397 | 79090 | 74474 | 94948 | 76712 |
| missing* | 75216 | 68595 | 94793 | 49559 | 90212 | 55243 | 96095 | 92396 | 64222 | 87188 | 77351 |

*missing does not return complete data; instead it returns failures for the indices that don't support the _ignored aggregation.

For the latest 24h, all indices supporting the aggregation:

| query | 1st | 2nd | 3rd | 4th | 5th | 6th | 7th | 8th | 9th | 10th | avg (ms) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| exists | 31388 | 49289 | 32990 | 24559 | 43409 | 56830 | 32525 | 55583 | 60163 | 69522 | 45625 |
| missing | 63874 | 33511 | 45171 | 34414 | 35407 | 36441 | 51615 | 57189 | 39663 | 63487 | 46077 |

Average performance is very similar; do you think that for larger amounts of data we could conclude that using missing is more beneficial?

I was also trying to split the query, because I think that would speed things up in most cases if we could run the two queries in parallel (I expect the number of documents with the _ignored property to be small compared to the total).

Queries
##### Number of documents with ignored property per dataset
GET /logs-*/_search
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        {
          "range": {
            "@timestamp": {
              "gte": "now-24h",
              "lte": "now",
              "format": "epoch_millis"
            }
          }
        },
        { "term": { "data_stream.type": "logs" } }
      ],
      "must": {
        "exists": {
          "field": "_ignored"
        }
      }
    }
  },
  "aggs": {
    "datasets": {
      "composite": {
        "size": 10000,
        "sources": [
          { "dataset": { "terms": { "field": "data_stream.dataset" } } },
          { "namespace": { "terms": { "field": "data_stream.namespace" } } }
        ]
      }
    }
  }
}

##### Total documents per dataset
GET /logs-*/_search
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        {
          "range": {
            "@timestamp": {
              "gte": "now-24h",
              "lte": "now",
              "format": "epoch_millis"
            }
          }
        },
        { "term": { "data_stream.type": "logs" } }
      ]
    }
  },
  "aggs": {
    "datasets": {
      "composite": {
        "size": 10000,
        "sources": [
          { "dataset": { "terms": { "field": "data_stream.dataset" } } },
          { "namespace": { "terms": { "field": "data_stream.namespace" } } }
        ]
      }
    }
  }
}

Doing that, I got the following numbers.

For 5d, including non-aggregatable indices:

| query | 1st | 2nd | 3rd | 4th | 5th | 6th | 7th | 8th | 9th | 10th | avg (ms) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| documents with ignored | 4622 | 1535 | 2437 | 36 | 118 | 88 | 744 | 42 | 24 | 30 | 967 |
| total documents | 20698 | 33684 | 41278 | 40503 | 27794 | 36216 | 22184 | 22100 | 32606 | 34497 | 31156 |

For the latest 24h, all indices supporting the aggregation:

| query | 1st | 2nd | 3rd | 4th | 5th | 6th | 7th | 8th | 9th | 10th | avg (ms) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| documents with ignored | 204 | 22 | 31 | 173 | 66 | 19 | 45 | 3339 | 12 | 22 | 393 |
| total documents | 30441 | 17446 | 12658 | 22563 | 17267 | 21037 | 13381 | 24503 | 26429 | 15642 | 20136 |

If we could do that, i.e. parallelise the two queries, we might lower the time taken by Elasticsearch to roughly half, because the critical path will be the longest query. wdyt?
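As a sketch (not necessarily how Kibana would implement it), the two queries above could be sent in a single round trip with the multi search API, letting Elasticsearch work on them concurrently; the bodies below are the same two queries compacted to NDJSON, with the redundant format parameter dropped:

GET /logs-*/_msearch
{}
{"size":0,"query":{"bool":{"filter":[{"range":{"@timestamp":{"gte":"now-24h","lte":"now"}}},{"term":{"data_stream.type":"logs"}}],"must":{"exists":{"field":"_ignored"}}},"aggs":{"datasets":{"composite":{"size":10000,"sources":[{"dataset":{"terms":{"field":"data_stream.dataset"}}},{"namespace":{"terms":{"field":"data_stream.namespace"}}}]}}}}
{}
{"size":0,"query":{"bool":{"filter":[{"range":{"@timestamp":{"gte":"now-24h","lte":"now"}}},{"term":{"data_stream.type":"logs"}}]},"aggs":{"datasets":{"composite":{"size":10000,"sources":[{"dataset":{"terms":{"field":"data_stream.dataset"}}},{"namespace":{"terms":{"field":"data_stream.namespace"}}}]}}}}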

yngrdyn added a commit that referenced this issue May 9, 2024
Relates to #179227.

After gathering some numbers around possible tweaks of the current
degradedDocs query ([more
information](#179227 (comment))),
I decided to move forward and split the query to reduce the time
Elasticsearch spends aggregating on data streams.

This PR contains the following changes:
- `mSearch` method was added to `DatasetQualityESClient` to allow the use of [multi search](https://www.elastic.co/guide/en/elasticsearch/reference/current/search-multi-search.html).
- `degradedDocsRt` was changed to include not only the number of degraded docs but also the total docs for the data streams within the selected time range.

Nothing visible has changed in terms of functionality.



https://github.com/elastic/kibana/assets/1313018/1d836446-7d35-4ff2-8356-16a8087ab505
yngrdyn added a commit that referenced this issue May 14, 2024
…ation (#183183)

Closes #179227.

## 📝  Summary

This PR adds a warning to the main page and flyout whenever a data stream has indices that don't support the `_ignored` aggregation.

## 🎥 Demo


https://github.com/elastic/kibana/assets/1313018/c6d5fd81-d9a9-4fcc-92d0-8e65b996df9c

#### Flyout


https://github.com/elastic/kibana/assets/1313018/2972e104-8b10-413a-a430-3efc5252b757
@salvatore-campagna

I think we should run the query only for indices that have doc values on _ignored, discovering them using the field caps API. For the other indices, the slow option can be triggered explicitly by the user.

@ruflin
Member

ruflin commented May 21, 2024

@salvatore-campagna What you propose above adds a lot of complexity on the UI side for a "temporary" problem for our users. We opted to go with a banner at the top.
