
[Bug]: Missing IDs when filtering for all IDs + count #33108

Closed
bostrt opened this issue May 16, 2024 · 10 comments
Assignees
Labels
kind/bug: Issues or changes related to a bug
triage/accepted: Indicates an issue or PR is ready to be actively worked on.
Milestone

Comments


bostrt commented May 16, 2024

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version: 2.4.1
- Deployment mode(standalone or cluster): standalone
- MQ type(rocksmq, pulsar or kafka): rocksmq
- SDK version(e.g. pymilvus v2.0.0rc2): 2.4.2
- OS(Ubuntu or CentOS): Fedora 38 (podman compose)
- CPU/Memory: 8 CPU, 32GB
- GPU: None

Current Behavior

When I insert around 1,000,000 records in batches of 1,000 and then query to count the final records, some are missing when filtering on the primary ID like this:

>>> from pymilvus import MilvusClient
>>> c = MilvusClient()
>>> c.query(collection_name='test', output_fields=["count(*)"])
[{'count(*)': 999001}]
>>> c.query(collection_name='test', filter='id > 0', output_fields=["count(*)"])
[{'count(*)': 972604}]

This one returns accurate results:

>>> c.query(collection_name='test', filter='id != 0', output_fields=["count(*)"])
[{'count(*)': 999001}]

More troubleshooting:

>>> good_results = c.query(collection_name='test', filter='id != 0', output_fields=["id"])
>>> bad_results = c.query(collection_name='test', filter='id > 0', output_fields=["id"])
>>> good_ids = [x['id'] for x in good_results]
>>> bad_ids = [x['id'] for x in bad_results]

# 1. Remove good ids from bad ids
# 2. Remove bad ids from good ids
>>> unseen = [list(set(bad_ids) - set(good_ids)), list(set(good_ids) - set(bad_ids))]

# All IDs from bad are in good
>>> len(unseen[0])
0
# Good ids have many more than bad 
>>> len(unseen[1])
41444


# Querying with a "bad" id still returns a result
>>> unseen[1][0]
449811121647790610
>>> c.query(collection_name='test', filter='id == 449811121647790610', output_fields=["id"])
[{'id': 449811121647722892}]

# I would have expected this one to fail because of id > 0 given the bug report
>>> c.query(collection_name='test', filter='id > 0 && id == 449811121647790610', output_fields=["id"])
[{'id': 449811121647722892}]
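The ad-hoc set comparison above can be captured in a small pure-Python helper. This is just a sketch of the diagnostic; `compare_id_sets` is an illustrative name, not part of pymilvus:

```python
def compare_id_sets(good_ids, bad_ids):
    """Return the IDs unique to each result set, mirroring the manual check above."""
    good, bad = set(good_ids), set(bad_ids)
    return {
        "only_in_bad": sorted(bad - good),    # IDs the broken filter returned but the good one did not
        "only_in_good": sorted(good - bad),   # IDs the broken filter is missing
    }

# Small example with stand-in IDs: the "bad" query dropped 1 and 4.
diff = compare_id_sets([1, 2, 3, 4], [2, 3])
print(diff)  # {'only_in_bad': [], 'only_in_good': [1, 4]}
```

In the report above, `only_in_bad` was empty and `only_in_good` held 41,444 IDs, i.e. `id > 0` returned a strict subset of `id != 0`.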

Expected Behavior

I expect any comparison filter that matches all IDs to actually return all IDs.

Steps To Reproduce

1. Start milvus in standalone mode
2. Run `build_db.py` (below) or insert lots of records (1M seems enough to reproduce the problem)
3. Query for all IDs greater than zero and count results

Milvus Log

No response

Anything else?

Reproducer script `build_db.py` is below. We also tried inserting data sequentially (no batching) and hit the same issue, and inserting with multi-threading (10 workers) behaved the same. The number of "missing" IDs is not the same every time; a co-worker and I saw different numbers on each run, but it was always reproducible.

cc @rdmullett

from pymilvus import MilvusClient, DataType
import uuid, random

client = MilvusClient()

schema = MilvusClient.create_schema()

schema.add_field("id", datatype=DataType.INT64, is_primary=True, auto_id=True)
schema.add_field(field_name="vector", datatype=DataType.FLOAT_VECTOR, dim=384)
schema.add_field(field_name="color", datatype=DataType.VARCHAR, max_length=256)

# 3.3. Prepare index parameters
index_params = client.prepare_index_params()

# 3.4. Add indexes
index_params.add_index(
    field_name="id"
)

index_params.add_index(
    field_name="vector", 
    index_type="AUTOINDEX",
    metric_type="IP"
)

# 3.5. Create a collection
client.create_collection(
    collection_name="test",
    schema=schema,
    index_params=index_params
)

# Create dataset
colors = ["green", "blue", "yellow", "red", "black", "white", "purple", "pink", "orange", "brown", "grey"]
data = [ {
    "vector": [ random.uniform(-1, 1) for _ in range(384) ], 
    "color": f"{random.choice(colors)}_{str(uuid.uuid4())}" 
} for _ in range(1_000_000) ]

print('data is ready to insert')

done = 0
insert_cache = []

# Insert data in batches of 1,000.
# NOTE: the final partial batch is never flushed here, so slightly
# fewer than 1,000,000 records end up in the collection.
for d in data:
    insert_cache.append(d)
    if done % 1000 == 0:
        print('COMPLETE ' + str(done))
        client.insert(collection_name='test', data=insert_cache)
        insert_cache.clear()
    done += 1
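Incidentally, the loop above explains the baseline count of 999,001: it inserts one record at `done == 0`, then 999 full batches of 1,000 (1 + 999 × 1,000 = 999,001), and the last 999 records left in `insert_cache` are never flushed. A batching pattern that also flushes the remainder, sketched with a stand-in `insert` callback in place of `client.insert(...)`:

```python
def insert_in_batches(records, insert, batch_size=1000):
    """Flush full batches as they fill, plus the final partial batch."""
    batch = []
    for r in records:
        batch.append(r)
        if len(batch) == batch_size:
            insert(batch)
            batch = []
    if batch:  # the remainder the original loop drops
        insert(batch)

# Demo with a list-collecting callback instead of a Milvus client.
batches = []
insert_in_batches(range(2500), batches.append, batch_size=1000)
print([len(b) for b in batches])  # [1000, 1000, 500]
```

This does not affect the bug itself (999,001 records are present either way, and `id > 0` still undercounts them), but it removes one confounder when comparing counts.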
@bostrt bostrt added kind/bug Issues or changes related a bug needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels May 16, 2024

bostrt commented May 16, 2024

After waiting about 10 minutes, I ran another query and noticed the count had dropped further, from 972604 to:

>>> c.query(collection_name='test', filter='id > 0', output_fields=["count(*)"])
[{'count(*)': 701303}]

EDIT: and again:

>>> c.query(collection_name='test', filter='id > 0', output_fields=["count(*)"])
[{'count(*)': 581652}]


yanliang567 commented May 17, 2024

Reproduced on Milvus 2.4.1 with pymilvus 2.4.2 after a few attempts; printed more info in one run.
The first occurrence was at round 104:
[screenshot: output at round 104]
The final results count:
[screenshot: final count]

@yanliang567 commented:

@congqixia please help to take a look, my cluster
yanliang-241-milvus-standalone-5bc7b96ff6-96vnz 1/1 Running 0 2m58s 10.104.5.184 4am-node12 <none> <none>

/assign @congqixia

@yanliang567 yanliang567 added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels May 17, 2024
@yanliang567 yanliang567 added this to the 2.4.2 milestone May 17, 2024
@longjiquan commented:

maybe already fixed by #32858

@yanliang567 commented:

> maybe already fixed by #32858

You are right, not reproduced on v2.4.2-20240515-b2d83d33. @bostrt could you please also try on this build?

@yanliang567 commented:

/assign @bostrt
/unassign

@sre-ci-robot sre-ci-robot assigned bostrt and unassigned yanliang567 May 17, 2024

bostrt commented May 17, 2024

Yes, I just tested, and v2.4.2-20240515-b2d83d33 works for me.


bostrt commented May 17, 2024

/assign @yanliang567
/unassign

@sre-ci-robot sre-ci-robot assigned yanliang567 and unassigned bostrt May 17, 2024
@yanliang567 commented:

Good to know. I'd close this issue once Milvus releases a new version.

@yanliang567 yanliang567 modified the milestones: 2.4.2, 2.4.3 May 24, 2024
@yanliang567 commented:

Milvus 2.4.3 was released; I'll close this issue. Thank you all.
