DSEGOG-316 Switch between parallel and sequential ingestion for Echo ingest script #120
base: DSEGOG-305-reingest-data
Conversation
…to sequential ingestion
- From 6.0, `mongosh` replaced `mongo`
- This fixes a failure in one of the export tests that occurred when the ingestion script was run in parallel (meaning files weren't ingested in chronological order)
- This is useful to prevent a race condition when the S3 bucket is empty and the script might ingest the channel manifest file before dropping that database collection
Codecov Report: All modified and coverable lines are covered by tests ✅

@@             Coverage Diff             @@
##   DSEGOG-305-reingest-data   #120   +/-   ##
=========================================================
  Coverage     94.99%   94.99%
=========================================================
  Files            50       50
  Lines          2837     2838     +1
  Branches        297      297
=========================================================
+ Hits           2695     2696     +1
  Misses          105      105
  Partials         37       37
=========================================================

☔ View full report in Codecov by Sentry.
…ay through - This will be useful for ingestion onto the dev server
Force-pushed from cabc074 to 315bae2
Making this PR ready to review as I've finished the ingestion and no further changes were made.
DSEGOG-319 Expose Header
One comment on the possible use of Enums in Pydantic; the rest are less requested changes and more due to me not being confident I've understood the ingest code. It might be that things work as they are or there's a reason why it needs to be like this (in which case maybe a comment or two might help me or someone else understand this).
    ingest_mode: str
    file_to_restart_ingestion: Optional[str]

    @field_validator("ingest_mode")
    @classmethod
    def check_ingest_mode(cls, v: str) -> str:
        if v not in ["parallel", "sequential"]:
            sys.exit(
                f"ingest_mode must contain 'parallel' or 'sequential', found: '{v}'",
            )
        else:
            return v
Could we use Enum here to avoid needing to write a custom validator (I would suggest StrEnum but I think that's only Python 3.11+)? There's an example in the Pydantic docs: https://docs.pydantic.dev/2.7/api/standard_library_types/#enum
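A minimal sketch of what this suggestion could look like: subclassing `str` and `Enum` keeps plain string config values working, while Pydantic rejects anything that isn't a member value. The `ScriptOptions` model name here is a hypothetical stand-in for the script's real config model.

```python
from enum import Enum

from pydantic import BaseModel, ValidationError


class IngestMode(str, Enum):
    # str mixin means existing "parallel"/"sequential" config strings
    # are accepted and compare equal to plain strings
    parallel = "parallel"
    sequential = "sequential"


class ScriptOptions(BaseModel):
    # Hypothetical stand-in for the script's options model
    ingest_mode: IngestMode


# Valid values are coerced to the Enum member
opts = ScriptOptions(ingest_mode="sequential")

# Invalid values raise ValidationError instead of needing sys.exit()
try:
    ScriptOptions(ingest_mode="bogus")
except ValidationError:
    pass
```

This removes the custom `@field_validator` entirely, and the error message Pydantic produces lists the permitted values.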
@@ -24,6 +39,8 @@ def main():
        ssh.drop_database_collections(collection_names)
    else:
        local_commands.drop_database_collections(collection_names)
    # Give the drops a chance to complete before adding data
    sleep(5)
I was wondering if we could do a while loop and actually check the state of the DB to only break when the collections are dropped, but seeing as this is "just" an ingest script a hardcoded wait is probably fine as is. Might be something to consider if this ever becomes a problem again (e.g. if 5 seconds turns out to be too short then maybe we could do something more rigorous rather than just bumping to 10).
    if og_api.process:
        t = threading.Thread(
            target=APIStarter.clear_buffers,
            args=(og_api.process,),
        )
        t.start()
I don't think I know what we're doing here. clear_buffers seems to be for printing the stdout/err from og_api.process to the terminal, but if we reach this else block then we didn't actually do anything with og_api.process? Since that's only used on lines 99 and 101 (i.e. in the if block).

I'm not familiar with thread pools in Python so maybe I'm misreading it and this does make sense to someone who knows what they're looking at (in which case maybe a code comment to help novices like me?)
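For context, this is a sketch of the general pattern `clear_buffers` appears to follow: a background thread that continuously drains a child process's stdout so the pipe buffer can't fill up and block the child. All names here are illustrative, not the script's actual implementation:

```python
import subprocess
import sys
import threading


def drain_output(process):
    # Read the child's output line by line until it closes stdout,
    # echoing each line to our own terminal
    for line in process.stdout:
        print(line.rstrip())


# A trivial child process standing in for the API under test
proc = subprocess.Popen(
    [sys.executable, "-c", "print('api started')"],
    stdout=subprocess.PIPE,
    stderr=subprocess.STDOUT,
    text=True,
)

# daemon=True means the drain thread won't keep the interpreter alive
t = threading.Thread(target=drain_output, args=(proc,), daemon=True)
t.start()

proc.wait()
t.join()
```

Without a drain like this, a chatty child process can deadlock once the OS pipe buffer fills, which may be why the thread is started even when the output isn't otherwise used.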
@@ -64,13 +81,34 @@ def main():
    hdf_page_iterator = echo.paginate_hdf_data()
    total_ingestion_start_time = time()

    last_successful_file = Config.config.script_options.file_to_restart_ingestion
    ingestion_restarted = False if last_successful_file else True
It's really minor but the variable name here confuses me a bit, i.e. ingestion_restarted is True if I didn't provide a file_to_restart_ingestion. I would expect the opposite, that giving a file to restart from means that I have "restarted ingestion".

After reading the other functionality, I can see that this is because when no file is set we can start ingesting straight away, but if a file is set we need to find it before we can (re-)start ingesting. In which case, could we name this something like start_ingestion or ingestion_started? Obviously it's personal preference but that would have helped me understand this code better I think.
    if ingestion_restarted:
        if Config.config.script_options.ingest_mode == "sequential":
            for name in object_names:
If last_successful_file is set then we skip the page, which makes sense. But we iterate over all the object_names - should we skip these as well until we find last_successful_file? Is there a guarantee that last_successful_file is the first entry in the page?

Maybe something like:

object_names = object_names[object_names.index(last_successful_file):]

between lines 92 and 93?
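A quick self-contained illustration of the suggested slice, with made-up file names. Note that `list.index` raises ValueError if the file isn't on this page, which the real script would need to handle (or it could fall back to ingesting the whole page):

```python
# Hypothetical page of object names, in chronological order
object_names = ["2023-01.h5", "2023-02.h5", "2023-03.h5"]
last_successful_file = "2023-02.h5"

# Resume from the last successfully ingested file onwards,
# skipping everything before it on this page
object_names = object_names[object_names.index(last_successful_file):]
# object_names is now ['2023-02.h5', '2023-03.h5']
```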
This PR adds some config that switches between sequential and parallel ingestion. As such, I've added the parallel ingestion back in that existed in previous versions of this script. The intention is to use sequential for GitHub Actions and use parallel on large cloud VMs for ingestion of the simulated data on the dev server.
There are a couple of other changes I've made:
- Replaced `mongo` with `mongosh` - from 6.0, the `mongo` command disappeared, so using `mongosh` is a solution to support more MongoDB versions (at least 5.0-7.0, if not more)
- Added a default sort (`_id` ASC) on data export when no order is provided in the request. When I tested parallel ingestion on GitHub Actions, one of the export tests failed because the nature of parallel ingestion means files aren't always ingested in chronological order, so the order of export could change a bit.
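The mode switch described above can be sketched like this. The `ingest` body, worker count, and function names are placeholders rather than the script's real implementation; the point is only the shape of the sequential/parallel branch:

```python
from concurrent.futures import ThreadPoolExecutor


def ingest(name):
    # Placeholder for ingesting a single HDF file from the S3 bucket
    return f"ingested {name}"


def run_ingestion(object_names, ingest_mode="sequential", workers=8):
    if ingest_mode == "sequential":
        # Preserves chronological order - suitable for GitHub Actions,
        # where the export tests depend on ingestion order
        return [ingest(name) for name in object_names]
    # Parallel mode trades strict ordering for throughput, intended
    # for ingesting simulated data on large cloud VMs
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in submission order even though
        # the ingests themselves run concurrently
        return list(pool.map(ingest, object_names))
```

Results come back in submission order either way here; with parallel mode it's the *side effects* (the order files land in the database) that are no longer chronological, which is what the default `_id` sort on export compensates for.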