Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

search with nested sort results in 0 results #2480

Closed
cjdoumas opened this issue Mar 21, 2024 · 8 comments
Closed

search with nested sort results in 0 results #2480

cjdoumas opened this issue Mar 21, 2024 · 8 comments

Comments

@cjdoumas
Copy link

Elasticsearch version (7.17):

elasticsearch-py version (8.12.1):

Description of the problem including expected versus actual behavior:
I cannot retrieve results when using a nested field in the sort argument of the search function.

I want to pull a large amount of data from Elasticsearch. Based on my understanding of the documentation, the most recent recommendation is to use the search function, sort by a unique field, then update the search_after argument after each pull to set up the next search until complete. This appears to work when using the _id field. However, this also results in a warning that says sorting by _id is deprecated and will go away, and to use a unique identifier or combination of identifiers within the body. There is a unique combination of fields that are nested within a parent. However, when I try to use a nested sort option, I receive 0 results.

I understand there is likely a by including this in the body, but I wanted to use the function as intended instead.

Steps to reproduce:
This is how I am sorting by id:

import elasticsearch as es

username_query_model = {
    "range": {"common.eventTime": {"gte": "now-2h", "lte": "now"}},
}

index_data = client.search(
    index="my_index",
    query=username_query_model,
    sort=['_id'],
    size=1000
)

This get results and the following warning:

ElasticsearchWarning: Loading the fielddata on the _id field is deprecated and will be removed in future versions. If you require sorting or aggregating on this field you should also include the id in the body of your documents, and map this field as a keyword field that has [doc_values] enabled

Here I am trying to sort by a nested field:

index_data = client.search(
    index="my_index",
    query=username_query_model,
    sort='_parent.child',
    size=1000
)

And that has 0 results. Note that there is no error, just no results. I have tried a bunch of different fields, but none of the nested ones work, only the parents.

What I would eventually like to build to is something like the following:

index_data = client.search(
    index="my_index",
    query=username_query_model,
    sort=['_parent.child1', '_parent.child2', '_parent.child3'],
    size=1000,
    search_after=[x, y, z]
)
@pquentin
Copy link
Member

pquentin commented Mar 22, 2024

Hello @cjdoumas, and thanks for the report.

Sorting within nested objects requires special syntax, please see https://www.elastic.co/guide/en/elasticsearch/reference/7.17/sort-search-results.html#nested-sorting. Do you get results when using nested in sort? If not, I'd appreciate a more complete example, including mappings and actual docs that reproduce the issue.

Also note that using version 8.x of the client against a 7.x cluster can cause issues. (But probably not in this specific case.) Clients are forward-compatible, though, so the opposite is OK.

@cjdoumas
Copy link
Author

So, I tried that syntax, and have had either empty or not sorted results. Honestly, though I am pretty sure I am not using it correctly.

So, here are 4 records:

{'_index': 'my_index', '_type': '_doc', '_id': '55tgr45s-ffh6-f8ik-aae1-llkf345shdu5', '_common': {'eventTime': '2024-03-22 10:31:45.125', 'entryId': 'tfg4435', 'entryGroup': 505, 'entryNumber': 58, 'name': 'ecd_1234'}},

{'_index': 'my_index', '_type': '_doc', '_id': 'sdfkjyu5-554h-fft4-eg35-llkfhut758hf', '_common': {'eventTime': '2024-03-22 09:26:33.122', 'entryId': 'tfg4435', 'entryGroup': 215, 'entryNumber': 7, 'name': 'tmx1234'}},

{'_index': 'my_index', '_type': '_doc', '_id': 'xvnmjdue-44gd-dddf-57fh-lprtuericd3s', '_common': {'eventTime': '2024-03-22 09:55:16.654', 'entryId': 'tfg4435', 'entryGroup': 215, 'entryNumber': 11, 'name': 'ecd_1234'}}, 

{'_index': 'my_index', '_type': '_doc', '_id': 'asdfe553-0gj4-998f-fd34-vnbmc643hfd2', '_common': {'eventTime': '2024-03-22 08:59:03.258', 'entryId': 'l57fhvg', 'entryGroup': 77, 'entryNumber': 3, 'name': 'ecd_1234'}}

So, for the end result, I want to sort by 3 vars in the following order: _common.entryID, _common.entryGroup, and _common.entryNumber. However, first, I am just trying to sort on _common.entryNumber to figure out the nested syntax. The code below does not throw an error, but results in 0 results. I am certain I am misunderstanding something.

# Set the query
query_model = {
    "range": {"common.eventTime": {"gte": "now-2h", "lte": "now"}},
}

# Set the sort instructions
sort_instructions = '{" _common.entryNumber": {"order": "asc", "nested": {"path": "_common"}}}'

index_data = client.search(
    index="my_index",
    query=query_model,
    sort=sort_instructions,
    size=10
)

So, first step is getting the nested to work for just one var, then to get it to work with all three in that order. (Afterwards, I'd need to then incorporate the search_after argument to be able datasets beyond 10,000 entries)

Am I providing what you need?

@cjdoumas
Copy link
Author

Note, I have also formatted the sort_instructions without the all encompassing quotes, and that also returned 0 results, as such:

sort_instructions = {"_common.entryNumber": {"order": "asc", "nested": {"path": "_common"}}}

@cjdoumas
Copy link
Author

cjdoumas commented Apr 1, 2024

Any updates on this issue?

I'm also not familiar with submitting issues in Github, so if just let me know if I need to be patient or there is something else I should be doing.

@pquentin
Copy link
Member

Sorry for the delay, you've been more than patient enough. Your code wasn't really complete enough as I did not get the mapping and had to guess about them. I believe the issue was that sort instruction should not be a string and that your query also needs to take into account that _common is nested. Here's a working example.

We begin by setting up the connection to the cluster. I tested this both with Elasticsearch 8.x and Elasticsearch 7.17, but only with the latest version of the client.

from elasticsearch import Elasticsearch

client = Elasticsearch("http://localhost:9200")
print(client.info()["version"])

Next we define the mappings and index your 4 sample docs:

client.options(ignore_status=[404]).indices.delete(index="my_index")
client.indices.create(
    index="my_index",
    mappings={
        "properties": {
            "_common": {
                "type": "nested",
                "properties": {
                    "eventTime": {"type": "date", "format": "yyyy-MM-dd HH:mm:ss.SSS"},
                    "entryId": {"type": "keyword"},
                    "entryGroup": {"type": "integer"},
                    "entryNumber": {"type": "integer"},
                    "name": {"type": "keyword"},
                },
            }
        }
    },
)


for doc in [
    {
        "_index": "my_index",
        "_type": "_doc",
        "_id": "55tgr45s-ffh6-f8ik-aae1-llkf345shdu5",
        "_source": {
            "_common": {
                "eventTime": "2024-03-22 10:31:45.125",
                "entryId": "tfg4435",
                "entryGroup": 505,
                "entryNumber": 58,
                "name": "ecd_1234",
            }
        },
    },
    {
        "_index": "my_index",
        "_type": "_doc",
        "_id": "sdfkjyu5-554h-fft4-eg35-llkfhut758hf",
        "_source": {
            "_common": {
                "eventTime": "2024-03-22 09:26:33.122",
                "entryId": "tfg4435",
                "entryGroup": 215,
                "entryNumber": 7,
                "name": "tmx1234",
            }
        },
    },
    {
        "_index": "my_index",
        "_type": "_doc",
        "_id": "xvnmjdue-44gd-dddf-57fh-lprtuericd3s",
        "_source": {
            "_common": {
                "eventTime": "2024-03-22 09:55:16.654",
                "entryId": "tfg4435",
                "entryGroup": 215,
                "entryNumber": 11,
                "name": "ecd_1234",
            }
        },
    },
    {
        "_index": "my_index",
        "_type": "_doc",
        "_id": "asdfe553-0gj4-998f-fd34-vnbmc643hfd2",
        "_source": {
            "_common": {
                "eventTime": "2024-03-22 08:59:03.258",
                "entryId": "l57fhvg",
                "entryGroup": 77,
                "entryNumber": 3,
                "name": "ecd_1234",
            }
        },
    },
]:
    client.create(index=doc["_index"], id=doc["_id"], body=doc["_source"])

And finally we execute the query, modifying query_model and sort_instructions:

import json

query_model = {
    "nested": {
        "path": "_common",
        "query": {
            "range": {
                "_common.eventTime": {
                    "gte": "2024-03-22 09:00:00.000",
                    "lte": "now",
                }
            }
        },
    }
}


# Set the sort instructions
sort_instructions = {
    "_common.entryNumber": {
        "order": "asc",
        "nested": {
            "path": "_common",
        },
    }
}

index_data = client.search(
    index="my_index", query=query_model, sort=sort_instructions, size=10
)
print(json.dumps(index_data.body, indent=2))

Which prints 3 out of 4 docs, sorted by entryNumber:

{
  "took": 79,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 3,
      "relation": "eq"
    },
    "max_score": null,
    "hits": [
      {
        "_index": "my_index",
        "_id": "sdfkjyu5-554h-fft4-eg35-llkfhut758hf",
        "_score": null,
        "_source": {
          "_common": {
            "eventTime": "2024-03-22 09:26:33.122",
            "entryId": "tfg4435",
            "entryGroup": 215,
            "entryNumber": 7,
            "name": "tmx1234"
          }
        },
        "sort": [
          7
        ]
      },
      {
        "_index": "my_index",
        "_id": "xvnmjdue-44gd-dddf-57fh-lprtuericd3s",
        "_score": null,
        "_source": {
          "_common": {
            "eventTime": "2024-03-22 09:55:16.654",
            "entryId": "tfg4435",
            "entryGroup": 215,
            "entryNumber": 11,
            "name": "ecd_1234"
          }
        },
        "sort": [
          11
        ]
      },
      {
        "_index": "my_index",
        "_id": "55tgr45s-ffh6-f8ik-aae1-llkf345shdu5",
        "_score": null,
        "_source": {
          "_common": {
            "eventTime": "2024-03-22 10:31:45.125",
            "entryId": "tfg4435",
            "entryGroup": 505,
            "entryNumber": 58,
            "name": "ecd_1234"
          }
        },
        "sort": [
          58
        ]
      }
    ]
  }
}

Now you can take that code and use search_after with the point-in-time API if you need a large amount of results: https://www.elastic.co/guide/en/elasticsearch/reference/current/paginate-search-results.html. Note that as mentioned in that page, sorting can slow this operation down, and only use it if you actually need it.

@cjdoumas
Copy link
Author

cjdoumas commented May 3, 2024

So, maybe I just don't understand the mapping or know how to find the mapping. I tried your syntax, however that was returning 0 values still. However, by just simply removing the nested argument, it seemed to work.

sort_instructions = { "_common.entryNumber": { "order": "asc", } }

I thought that the period and structure of the record meant that entryNumber is nested under _common. However, does the fact that the above code worked imply that is is not nested, but something else?

@pquentin
Copy link
Member

pquentin commented May 4, 2024

Oh, when you mentioned nested I assumed you meant the nested data type: https://www.elastic.co/guide/en/elasticsearch/reference/current/nested.html. Sorry for the misleading response. I will try to take a look later taking this new information into account.

@pquentin
Copy link
Member

pquentin commented May 6, 2024

I thought that the period and structure of the record meant that entryNumber is nested under _common. However, does the fact that the above code worked imply that is is not nested, but something else?

So, as mentioned in the previous comment, "nested" is a very specific Elasticsearch term. In the context, of Elasticsearch, what you have is not "nested", it's simply an object field, which is the default for arrays if you don't specify mappings: https://opster.com/guides/elasticsearch/data-architecture/elasticsearch-nested-field-object-field/

Here's the complete working example from above with object fields.

We begin by setting up the connection to the cluster. I tested this both with Elasticsearch 8.x and Elasticsearch 7.17, but only with the latest version of elasticsearch-py, the client.

from elasticsearch import Elasticsearch

client = Elasticsearch("http://localhost:9200")
print(client.info()["version"])

Let's setup the index from scratch, define mappings to be explicit, and then index the docs you sent me:

client.options(ignore_status=[404]).indices.delete(index="my_index")
client.indices.create(
    index="my_index",
    mappings={
        "properties": {
            "_common": {
                "properties": {
                    "eventTime": {"type": "date", "format": "yyyy-MM-dd HH:mm:ss.SSS"},
                    "entryId": {"type": "keyword"},
                    "entryGroup": {"type": "integer"},
                    "entryNumber": {"type": "integer"},
                    "name": {"type": "keyword"},
                },
            }
        }
    },
)


for doc in [
    {
        "_index": "my_index",
        "_type": "_doc",
        "_id": "55tgr45s-ffh6-f8ik-aae1-llkf345shdu5",
        "_source": {
            "_common": {
                "eventTime": "2024-03-22 10:31:45.125",
                "entryId": "tfg4435",
                "entryGroup": 505,
                "entryNumber": 58,
                "name": "ecd_1234",
            }
        },
    },
    {
        "_index": "my_index",
        "_type": "_doc",
        "_id": "sdfkjyu5-554h-fft4-eg35-llkfhut758hf",
        "_source": {
            "_common": {
                "eventTime": "2024-03-22 09:26:33.122",
                "entryId": "tfg4435",
                "entryGroup": 215,
                "entryNumber": 7,
                "name": "tmx1234",
            }
        },
    },
    {
        "_index": "my_index",
        "_type": "_doc",
        "_id": "xvnmjdue-44gd-dddf-57fh-lprtuericd3s",
        "_source": {
            "_common": {
                "eventTime": "2024-03-22 09:55:16.654",
                "entryId": "tfg4435",
                "entryGroup": 215,
                "entryNumber": 11,
                "name": "ecd_1234",
            }
        },
    },
    {
        "_index": "my_index",
        "_type": "_doc",
        "_id": "asdfe553-0gj4-998f-fd34-vnbmc643hfd2",
        "_source": {
            "_common": {
                "eventTime": "2024-03-22 08:59:03.258",
                "entryId": "l57fhvg",
                "entryGroup": 77,
                "entryNumber": 3,
                "name": "ecd_1234",
            }
        },
    },
]:
    client.create(index=doc["_index"], id=doc["_id"], body=doc["_source"])

And, finally, let's make the query:

import json

query_model = {
    "range": {
        "_common.eventTime": {
            "gte": "2024-03-22 09:00:00.000",
            "lte": "now",
        }
    }
}


# Set the sort instructions
sort_instructions = {
    "_common.entryNumber": {
        "order": "asc",
    }
}

index_data = client.search(
    index="my_index", query=query_model, sort=sort_instructions, size=10
)
print(json.dumps(index_data.body, indent=2))

Which still prints 3 out of 4 docs, sorted by entryNumber:

{
  "took": 8,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 3,
      "relation": "eq"
    },
    "max_score": null,
    "hits": [
      {
        "_index": "my_index",
        "_type": "_doc",
        "_id": "sdfkjyu5-554h-fft4-eg35-llkfhut758hf",
        "_score": null,
        "_source": {
          "_common": {
            "eventTime": "2024-03-22 09:26:33.122",
            "entryId": "tfg4435",
            "entryGroup": 215,
            "entryNumber": 7,
            "name": "tmx1234"
          }
        },
        "sort": [
          7
        ]
      },
      {
        "_index": "my_index",
        "_type": "_doc",
        "_id": "xvnmjdue-44gd-dddf-57fh-lprtuericd3s",
        "_score": null,
        "_source": {
          "_common": {
            "eventTime": "2024-03-22 09:55:16.654",
            "entryId": "tfg4435",
            "entryGroup": 215,
            "entryNumber": 11,
            "name": "ecd_1234"
          }
        },
        "sort": [
          11
        ]
      },
      {
        "_index": "my_index",
        "_type": "_doc",
        "_id": "55tgr45s-ffh6-f8ik-aae1-llkf345shdu5",
        "_score": null,
        "_source": {
          "_common": {
            "eventTime": "2024-03-22 10:31:45.125",
            "entryId": "tfg4435",
            "entryGroup": 505,
            "entryNumber": 58,
            "name": "ecd_1234"
          }
        },
        "sort": [
          58
        ]
      }
    ]
  }
}

Again, you can take that code and use search_after with the point-in-time API if you need a large amount of results: https://www.elastic.co/guide/en/elasticsearch/reference/current/paginate-search-results.html. As mentioned in that page, sorting can slow this operation down: only use it if you actually need it.

@pquentin pquentin closed this as completed May 6, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants