Objects not returned from S3Store.query() #786

Open
xperrylinn opened this issue May 14, 2023 · 3 comments

@xperrylinn

Howdy,

I'm learning how to use the S3Store and I've written an example script pictured below to test it out.

[Screenshot: the example script]

The issue I'm facing is that when I call S3Store.query(criteria=None) on a bucket that contains a document named hello_world.txt, nothing is returned. I've stepped through the source code with a debugger and can confirm the document is in the S3 bucket; however, when the query runs, the S3Store passes it to its index store, which finds nothing.

[Screenshot: stepping through the query in the debugger]

I think I'm missing something about how the index attribute of the S3Store works. Could you help me understand what is happening here?

@munrojm
Member

munrojm commented May 15, 2023

Every object in your bucket needs a corresponding entry in the index store that contains its key plus any additional metadata defined in searchable_fields. Is that the case here?
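
For example, something along these lines (a rough sketch, assuming an index store configured with key="blob_uuid" and an S3Store created with searchable_fields=["some_field"], where "some_field" is just an illustrative name):

# Each object in the bucket needs a matching index document carrying
# its key plus the searchable metadata.
index.update(
    docs=[{"blob_uuid": "hello_world.txt", "some_field": "some value"}],
    key="blob_uuid",
)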

@xperrylinn
Author

Thank you for the quick reply! That is not the case here: I had created the index MemoryStore and the S3Store without adding any entries to the index. After looking at the s3store test fixture, I see what you mean. I've updated my example script, pasted below, and I can now see the S3Store finding my existing document in S3:

from maggma.stores.aws import S3Store
from maggma.stores import MemoryStore

# Index store: one entry per object in the bucket, keyed by blob_uuid
index = MemoryStore(collection_name="index", key="blob_uuid")

index.connect()

# Register the pre-existing hello_world.txt object in the index
index.update(
    docs={"blob_uuid": "hello_world.txt"},
    key="blob_uuid",
)

s3_store = S3Store(
    index=index,
    bucket="atomate2-openmm",
    endpoint_url="https://s3.us-west-1.amazonaws.com",
    s3_profile="atomate2-openmm-dev",
    key="blob_uuid",
    s3_workers=1,
    unpack_data=False,  # return raw object bytes rather than deserialized docs
)

s3_store.connect()

# Write a new document through the S3Store, which also updates the index
s3_store.update(
    docs={"blob_uuid": "my_fancy_uid", "message": "hello world!"},
    key="blob_uuid",
)

print(list(s3_store.query()))

outputs:

[b'HELLO WORLD!\n\n', b'\x82\xa9blob_uuid\xacmy_fancy_uid\xa7message\xachello world!']

Process finished with exit code 0
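
The second entry is the document I wrote through S3Store.update, which I believe comes back msgpack-serialized because I set unpack_data=False. As a quick sanity check (assuming the msgpack package is installed), decoding it by hand recovers the original dict:

import msgpack

raw = b'\x82\xa9blob_uuid\xacmy_fancy_uid\xa7message\xachello world!'
print(msgpack.unpackb(raw, raw=False))
# {'blob_uuid': 'my_fancy_uid', 'message': 'hello world!'}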

I think I'm starting to understand why an index is needed, which isn't obvious in a simple example like this. The use case I'm working toward is one where I have a MongoDB collection of documents, each with a blob that I want to store in S3. If I understand correctly, the MongoDB collection is the index in this case, essentially pointing to the blobs stored in the S3 bucket. Am I thinking about this correctly?
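
If that's right, I imagine wiring it up would look roughly like this (just a sketch with placeholder database, collection, and bucket names; I haven't tested it):

from maggma.stores import MongoStore
from maggma.stores.aws import S3Store

# MongoDB collection acting as the index: one document per blob,
# keyed by blob_uuid, holding the metadata I want to query on
index = MongoStore(
    database="my_database",        # placeholder
    collection_name="blob_index",  # placeholder
    key="blob_uuid",
)

blob_store = S3Store(
    index=index,
    bucket="my-blob-bucket",       # placeholder
    key="blob_uuid",
)

blob_store.connect()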

@munrojm
Member

munrojm commented May 22, 2023

@xperrylinn, sorry for the late reply. Yes, you are thinking about it correctly.
