
MemoryStore: replace mongomock with pymongo-inmemory #846

Open · wants to merge 36 commits into base: main
Conversation

@rkingsbury (Collaborator) commented Aug 25, 2023

Summary

Major changes:

pymongo-inmemory creates a thin wrapper around MongoClient() using a downloaded instance of mongod, so it is guaranteed to work and perform exactly like "real" pymongo, unlike mongomock, which has some limitations and inconsistencies.

Performance impact

I did a basic timing test for connecting to a FileStore (400 files) on a system with slow (HDD) storage. On first connection, pymongo-inmemory was slower by about 50% (19s vs. 13s), presumably because it has to download and spin up the mongod instance. On subsequent fs.connect calls, pymongo-inmemory was at least as fast as mongomock (12s vs. 13s).

On a system with faster storage (SSD) and 800 files, pymongo-inmemory and mongomock had comparable speed (26-30s for pymongo-inmemory vs. 27s for mongomock).

Todos

If this is work in progress, what else needs to be done?

  • Pending response to issue, move configuration settings to pyproject.toml - (I need to open a PR for this; not necessary to complete before merging this PR).
  • Resolve remaining CI problems for which I need assistance. See below.

Remaining test failures / CI problems

I need some help (please!) from @munrojm and maybe @gpetretto to resolve the remaining test failures. I'll break them down one by one.

test_jsonstore_orjson_options (@gpetretto )

The changes here prevent the JSONEncodeError from being raised, for reasons I haven't yet identified.

    def test_jsonstore_orjson_options(test_dir):
        class SubFloat(float):
            pass
    
        with ScratchDir("."):
            jsonstore = JSONStore("d.json", read_only=False)
            jsonstore.connect()
            with pytest.raises(orjson.JSONEncodeError):
>               jsonstore.update({"wrong_field": SubFloat(1.1), "task_id": 3})
E               Failed: DID NOT RAISE <class 'TypeError'>

tests/stores/test_mongolike.py:498: Failed

By updating MemoryStore to use "real" Mongo, I was able to inherit the update method from MongoStore rather than use the custom update method that had previously been used. That custom update is now inside MontyStore. I suspect that a subtle difference between the two is causing this test to fail, but I haven't been able to pin it down. I would appreciate a 2nd set of eyes. I suggest comparing the two update methods side by side as a starting point.

test_submission_patch (@munrojm )

For some reason the API is now returning a 400 (Bad Request) response instead of a 200. This must somehow be related to owner_store, which is a MemoryStore, but I'm not familiar enough with the API to figure out the connection.

    def test_submission_patch(owner_store, post_query_op, patch_query_op):
        endpoint = SubmissionResource(
            store=owner_store,
            get_query_operators=[PaginationQuery()],
            post_query_operators=[post_query_op],
            patch_query_operators=[patch_query_op],
            calculate_submission_id=True,
            model=Owner,
        )
        app = FastAPI()
        app.include_router(endpoint.router)
    
        client = TestClient(app)
        update = json.dumps({"last_updated": "2023-06-22T17:32:11.645713"})
    
        assert client.get("/").status_code == 200
>       assert client.patch(f"/?name=PersonAge9&update={update}").status_code == 200
E       assert 400 == 200
E        +  where 400 = <Response [400]>.status_code
E        +    where <Response [400]> = <bound method Session.patch of <starlette.testclient.TestClient object at 0x7f75aa0f4f10>>('/?name=PersonAge9&update={"last_updated": "2023-06-22T17:32:11.645713"}')
E        +      where <bound method Session.patch of <starlette.testclient.TestClient object at 0x7f75aa0f4f10>> = <starlette.testclient.TestClient object at 0x7f75aa0f4f10>.patch

tests/api/test_submission_resource.py:118: AssertionError

ServerSelectionTimeout Errors

Occasionally some tests error out because the MemoryStore times out. I suspect this occurs because pymongo-inmemory is downloading and spinning up its own mongod before it can run the tests. It looks like this wasn't a problem after the most recent update, so perhaps it will take care of itself. If the issue persists, I will experiment with having pymongo-inmemory use the built-in mongod instance provided on the GH runner instead of downloading its own.
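For reference, pymongo-inmemory reads its settings from `PYMONGOIM__*` environment variables (or a `[pymongo_inmemory]` config section); assuming the keys documented in its README (`mongo_version`, `mongod_port` are assumptions here, not taken from this PR), a CI mitigation could look roughly like:

```shell
# Hypothetical CI step; variable names assumed from the pymongo-inmemory README.
# Pinning the mongod version lets the downloaded binary be cached between runs
# instead of being re-fetched (a likely cause of the ServerSelectionTimeout).
export PYMONGOIM__MONGO_VERSION=6.0
# A fixed port avoids racing other mongod instances on the runner.
export PYMONGOIM__MONGOD_PORT=27019
```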

Checklist

  • Google format doc strings added.
  • Code linted with ruff. (For guidance in fixing rule violations, see rule list)
  • Type annotations included. Check with mypy.
  • Tests added for new features/fixes.
  • I have run the tests locally and they passed.

@mkhorton (Member)

+1 I’ve observed some serious differences between mongomock and pymongo that can result in some tedious debugging! I think the idea of using real Mongo is much preferred.

@@ -538,14 +535,50 @@
     assert jsonstore.last_updated > start_time
 
 
-def test_eq(mongostore, memorystore, jsonstore):
+def test_eq(mongostore, memorystore, jsonstore, montystore):
+    assert montystore == montystore

Check warning

Code scanning / CodeQL

Comparison of identical values Warning

Comparison of identical values; use cmath.isnan() if testing for not-a-number.
src/maggma/stores/mongolike.py (Fixed)
Comment on lines +639 to +644
# def __del__(self):
#     """
#     Ensure collection is dropped from memory on object destruction, even if .close() has not been called.
#     """
#     if self._coll is not None:
#         self._collection.drop()

Check notice

Code scanning / CodeQL

Commented-out code Note

This comment appears to contain commented-out code.
@rkingsbury (Collaborator, Author)

@munrojm @gpetretto I updated the first post on this PR with a specific request for help from each of you on one of the test failures. I'd appreciate any thoughts you have

(adding this comment b/c I'm not sure if editing the post would trigger an email notification).

@munrojm (Member) commented Aug 29, 2023

> @munrojm @gpetretto I updated the first post on this PR with a specific request for help from each of you on one of the test failures. I'd appreciate any thoughts you have
>
> (adding this comment b/c I'm not sure if editing the post would trigger an email notification).

Just committed a fix for the patch test. It was an issue with how the body data was being passed to the patch method of the test client.

@gpetretto (Contributor)

Thanks @rkingsbury for updating the MemoryStore. The limited number of features supported by mongomock was indeed an issue.
Concerning the failing test, I ran a few experiments and figured out what is happening. I introduced those features and that test because the test was failing when the document contained a FloatWithUnit. I added that option since orjson does not handle subclasses of native Python types by default, but it allows passing a function to deal with values that it cannot serialize. With the mongomock implementation, when passing something like

jsonstore.update({"wrong_field": SubFloat(1.1), "task_id": 3})

the dictionary stored internally still contained an instance of SubFloat, so when update_json_file was executed, orjson needed to know how to convert it. It seems that MongoDB now stores the SubFloat instance as a plain float, so when it is queried back to perform update_json_file it is never an instance of SubFloat. In the new implementation I can't find a way to trigger that error anymore, since MongoDB already deals with those cases before the value reaches orjson. I am not sure whether it would be better to remove the serialization_default option at this point.

@rkingsbury (Collaborator, Author)

> I added that option since orjson does not handle subclasses of native Python types by default, but it allows passing a function to deal with values that it cannot serialize. With the mongomock implementation, when passing something like
>
>     jsonstore.update({"wrong_field": SubFloat(1.1), "task_id": 3})
>
> the dictionary stored internally still contained an instance of SubFloat, so when update_json_file was executed, orjson needed to know how to convert it. It seems that MongoDB now stores the SubFloat instance as a plain float, so when it is queried back to perform update_json_file it is never an instance of SubFloat. In the new implementation I can't find a way to trigger that error anymore, since MongoDB already deals with those cases before the value reaches orjson.

Thanks for investigating! So if I understand correctly, the original problem that prompted #791 is resolved by this PR.

> I am not sure whether it would be better to remove the serialization_default option at this point.

But it sounds like you're saying that if I were to do this:

a = SubFloat(1.1)
jsonstore.update({"wrong_field": a, "task_id": 3})
jsonstore.query({"task_id": 3})

I would get back a float, not a SubFloat, right? Is that a problem? It seems to me there could still be value in the options you added for users that want to be able to serialize / deserialize custom types.

@gpetretto (Contributor)

> Thanks for investigating! So if I understand correctly, the original problem that prompted #791 is resolved by this PR.

Indeed, I think this would have avoided the need for #791. I will try a test with the new version to confirm there is no issue with that.

> But it sounds like you're saying that if I were to do this:
>
>     a = SubFloat(1.1)
>     jsonstore.update({"wrong_field": a, "task_id": 3})
>     jsonstore.query({"task_id": 3})
>
> I would get back a float, not a SubFloat, right? Is that a problem? It seems to me there could still be value in the options you added for users that want to be able to serialize / deserialize custom types.

Yes, you will get back a float. A simple example is:

class SubFloat(float):
    pass

ms = MemoryStore()
ms.connect()

d = {"test": SubFloat(1.1), "normal": 1.1, "task_id": 3}
ms.update(d)

out_d = ms.query_one()
print(out_d["test"].__class__, out_d["normal"].__class__)

The output with the old version is:

<class '__main__.SubFloat'> <class 'float'>

while with the new version it is:

<class 'float'> <class 'float'>

In the end it is not really a problem, since this makes the MemoryStore more similar to the MongoStore. However, any object that is not handled by jsanitize (e.g. MSONable, dataclasses, ...) will probably be converted to a simple type during update() (a string in the worst case). So I don't know if there is any corner case where the object survives insertion into MongoDB but cannot be handled by orjson by default. I suppose leaving the option will not do much harm (at least it will not force me to make a backward-incompatible change to use the latest version of maggma); however, I couldn't find a way to create a test example to replace test_jsonstore_orjson_options.

Successfully merging this pull request may close these issues.

  • Enhancement: more performant MemoryStore
  • MemoryStore __eq__ does not behave as expected

4 participants