Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

QC reports require locking index.html which doesn't function correctly on NFS-mounted drives #4423

Closed
joshuacwnewton opened this issue Apr 5, 2024 · 7 comments · Fixed by #4439
Assignees
Labels
bug category: fixes an error in the code sct_qc context:
Milestone

Comments

@joshuacwnewton
Copy link
Member

joshuacwnewton commented Apr 5, 2024

Background

Our QC reports contain a table of QC entries:

image

Each entry in the table is stored as simple JSON data listing the subject, date, CLI script, etc. We generate individual .json files for each entry via our qc.py module.

However, when it comes to loading the data into a table in index.html, we store this JSON data directly within a JavaScript variable inside the index.html file. This means that every time we create a new QC entry using our Python script, we need to completely rewrite the index.html file, substituting the JSON data into a placeholder in the HTML template:

output = template.substitute(sct_json_data=json.dumps(json_data))

<script>var sct_data = $sct_json_data;</script>

untitled(2)

Problem

Because all new QC entries have the same destination file, concurrent writes can cause issues with corrupted/missing entries. So, we explicitly lock the index.html before writing to it.

However, in sct-pipeline/spine-park#15 (comment), we ran into some system errors when trying to lock and write to the index.html file while it was on a mounted drive.

Proposed solution

Initially, I can think of two ways we could fix this issue:

  • Lock and write to the index.html file in a tmpdir on a non-mounted drive, then copy the file to the destination directory when sct_run_batch is complete.
    • This may make the QC report folder difficult to access during processing, as pointed out by @jcohenadad in this comment.
  • Try to eliminate the need for locking entirely. EDIT: See this comment for details on why loading the JSON data using client-side JavaScript probably isn't feasible.
@joshuacwnewton

This comment was marked as outdated.

mguaypaq pushed a commit that referenced this issue Apr 10, 2024
… a newer version of SCT (#4427)

When we generate QC reports, we universally update `index.html` with the
copy stored in the repo, due to the behavior described in #4423.

This can cause problems when, for example, `index.html` and `main.js`
are both updated simultaneously between SCT versions (e.g. #4317). In
such a case, if QC reports are regenerated using the newer SCT version,
`index.html` will be updated with the changes, but `main.js` will not.
And, since `index.html` and `main.js` are so interdependent, this can
often cause broken behavior in the QC report.

So, this PR ensures that the assets are updated alongside `index.html`,
but only if the contents differ.
@joshuacwnewton

This comment was marked as resolved.

@joshuacwnewton

This comment was marked as resolved.

@joshuacwnewton
Copy link
Member Author

joshuacwnewton commented Apr 10, 2024

Ah! It may be that we're dealing with an NFS mount in this situation, in which case wolph/portalocker#92 may apply here:

Locking with NFS is rather problematic, while it's not entirely impossible to do it, it absolutely kills your performance in my experience.
If at all possible I would highly recommend using Redis Locks instead: https://github.com/wolph/portalocker#redis-locks

I'll check in the other issue regarding this. :)

EDIT: They were indeed using NFS mounts!

@mguaypaq

This comment was marked as resolved.

@joshuacwnewton joshuacwnewton changed the title QC reports require locking index.html, which can cause issues with sct_run_batch concurrent writes QC reports require locking index.html which doesn't function correctly on NFS-mounted drives Apr 12, 2024
@joshuacwnewton
Copy link
Member Author

joshuacwnewton commented Apr 12, 2024

Assuming the issue here is NFS mounts, then maybe all we need to do is avoid locking on the NFS-mounted drive.

Actually, wait a minute here...

# We lock `index.html` at the start of this function so that we halt any other processes *before* they have a
# chance to generate or read any .json files. This ensues that the last process to write to index.html has read
# in all of the available .json files, preventing:
# https://github.com/spinalcordtoolbox/spinalcordtoolbox/pull/3701#discussion_r816300380

There are two distinct steps here:

  1. Generating a new .json entry in the QC report's _json folder
  2. Reading in all of the .json files on disk, dumping all of it into the index.html template, and writing the data to a file.

But, all that really matters is that the "reading" step is done sequentially, right? (In other words, generating each individual JSON file concurrently is OK, as long as each process waits to read in all of the JSON files to generate the new index.html file.)

In that case, do we even need to write-lock index.html? Couldn't we just use portalocker.BoundedSemaphore as a mutex?

This is very similar to threading.BoundedSemaphore but works across multiple processes and across multiple operating systems.

@joshuacwnewton
Copy link
Member Author

joshuacwnewton commented Apr 15, 2024

I think I might have a solution that could kill 2 birds with one stone:

  1. Reading in all of the .json files on disk, dumping all of it into the index.html template, and writing the data to a file.

What if we were to isolate this logic inside of a separate, standalone Python script (e.g. refresh_qc_entries.py)? Then, distribute it as an asset inside of the QC folder?

This would have two benefits:

  • Users could re-run this script when they want to refresh the QC entries, which could solve Make it possible to remove lines in a QC report #4438. (The user can delete one or more json files, then re-run the script to copy the new json data to the html file.)
  • When generating the QC report, we could use a mutex to only ever run one instance of refresh_qc_entries.py at a time. (This would mitigate the NFS mount issues, since the mutex isn't tied to the location of the report. Plus, the encapsulation of this separate script would ensure that we're locking the entire "read-JSON-write-HTML" cycle, without the need for open(mode='r+") tricks.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug category: fixes an error in the code sct_qc context:
Projects
None yet
2 participants