Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DSEGOG-267 HDF ingestion validity checks #93

Merged
merged 41 commits into from
Apr 19, 2024

Conversation

Will-Cross1
Copy link
Contributor

Description:

Implamented checks to the HDF file in accrodance to the word document provided by Stephen Dann.

When an HDF file is ingested it is then checked by the code in this pull request in accordance to the word document

it then returns a message to the user in this format:

{
    "message":"Updated 20200407142816",
    "response":{
        "accepted_channels":["PM-201-TJ-CAM-2-FWHMY"],
        "rejected_channels":{
            "PM-201-FE-CAM-2":["data attribute has wrong datatype, should be ndarray","data attribute has wrong datatype, should be uint16 or uint8"],
            "PM-201-HJ-PD":"Channel is already present in existing record"
        },
        "warnings":[]
    }
}

this will give a message depending on wether an existing record has been updated or created

it gives a response with a list of accepted channels as well as rejected channels
rejected channels are in a dictionary form with the key being the rejected channel which links to a list of reason why the channel failed
any warnings raised in the file checker are also returned to the user in this dictionary inside the "warnings" list

HTTP codes:
On creation: 201
On update: 200
On fail: 400

Key functionalities:

  • If a file with the same ID (timestamp) exists in the database already then the partial import checks are ran
    • checks the record metadata (the record can either be rejected or accepted here)
    • checks if any channels exist in the existing record (if they do they are rejected an added to the rejected_channels list)
  • File checks are ran next, this checks the epac_ops_data_version
    • the data version is meant to be "1.0" with the "1." being the major version and the ".0" being the minor version
    • if missing, not a string or the major version is unknown the file is rejected
    • if the minor version is too high a warning is given
  • Record checks are ran next, these check the metadata of the record attempting to be ingested
    • if the timestamp or active_are are incorrect or missing the record is rejected
    • if shot_num or active_experiment are incorrect then the record is rejected
  • Channel checks are ran last as there is no need to run them if the record failed. Rejected channels are added to the response dictionary
    • there are many channel checks with each giving unique fail messages
    • some will not be able to be ran if others have failed which is not a problem as they would have to resubmit the HDF file anyway if that was the case

Testing:

there are many comprehensive pytests as well as integration tests with this PR

list of test commands ran in shell:
(ran in the top folder (operationsgateway-api))
(Works on Mac not sure about windows)
poetry run pytest
poetry run pytest -s "test/records/test_HDF_file.py" -v -vv
poetry run pytest -s "test/records/test_HDF_file.py::TestFile" -v -vv
poetry run pytest -s "test/records/test_HDF_file.py::TestRecord" -v -vv
poetry run pytest -s "test/records/test_HDF_file.py::TestChannel" -v -vv
poetry run pytest -s "test/records/test_HDF_file.py::TestPartialImport" -v -vv
poetry run pytest -s "test/records/test_HDF_file.py::TestIntegrationIngestData" -v -vv

Additional notes:

  • Needed to add matplotlib.use("Agg") to the waveform.py file to fix an error to do with matplotlib that arose
    • doing this meant E402 needed to be ignored inside the .flake8 file or lint would fail
  • needed to make some models optional and allow them to accept Any value
    • this allowed for a better error message to be given if the HDF file has the wrong datatype

added checks from word document

missing some channel checks
added more file tests

added some pytests

combined pytests into better groups
made sure all of the file checks work and don't cause any errors where it should have passed
changed some of the models so that they don't cause the channel to be rejected

instead the channel is rejected inside the ingestion validator giving a nicer error
expected failed responses for channel_dtype pytests work

code has been linted also
(some of the linting does fail but that will be fixed in a later commit)
need to create a mock for some of these as impossible with normal pytest

tests for image_path, path, waveform_id, id_ will be created in a later commit
may need to add more to them to ensure they work on all cases
may need to add further checks at a later point
while making the tests I fixed some hidden bugs in the validator code
such as if the channel name was wrong the code would fail as it made use of the channel name in the manifest

may need to add further tests in another commit to test different variations
by adding the check into the hdf_handler.py file the unrecognised_attribute_checks,
has become Obsolete apart from a small art at the end

This may need to be adressed in a later commit
added tests for the function that runs all of the channel checks

more comprehensive pytests of this function will be added in a later commit
- need to finalise code for channel checks as some isn't used any more
- need to lint the code and make parts less complicated
- need to implement the checks into the actual code
increased the number of lines pytests covered for ingestion_validator.py

simplified functions and removed unused code
simplified complex functions
removed unused parts of code
started planning where the checks go on ingest
started writing integration tests

updated:
test/endpoints/test_get_records.py::TestGetRecords::test_valid_get_records[Query using a single projection]
pytest with new metadata variables

also updated names of channels for new simulated data
changed some variable names

fixed an issue with timestamp in the partial ingest checks that caused
two of the same timestamps to be considered different

altered channel remover code in ingest_data to be slightly less complicated
simplified a function by spliting it into two

changed some formatting with black
linted the program

had to do matplotlib.use("Agg") to prevent an error in waveform.py

added a way to delete from the s3 echo
@MRichards99 MRichards99 changed the base branch from main to DSEGOG-255-realistic-test-data February 20, 2024 14:49
@Will-Cross1 Will-Cross1 changed the base branch from DSEGOG-255-realistic-test-data to main February 22, 2024 11:40
@Will-Cross1 Will-Cross1 changed the base branch from main to DSEGOG-255-realistic-test-data February 22, 2024 11:40
Base automatically changed from DSEGOG-255-realistic-test-data to main February 27, 2024 08:33
@Will-Cross1
Copy link
Contributor Author

There are a few of these in the hdf_handler code that I'm not sure are able to be ran because of other constraints but still effects the code coverage. I'm not sure if removing them could cause problems though.

except ValidationError as exc:
raise ModelError(str(exc)) from exc

I couldn't increase the coverage too much because
some of the code I don't think can get run

more information on the hdf validator PR
@MRichards99
Copy link
Collaborator

@Will-Cross1 thanks for taking a look at increasing the code coverage on hdf_handler.py. What are the constraints that will mean we won't be able to test those exceptions being handled? It's okay if they're not tested (at a certain point, chasing coverage isn't worth the time) but we should keep the exception handling in the code - small amounts of untestable code is better than removing it just for code coverage.

@codecov-commenter
Copy link

codecov-commenter commented Apr 3, 2024

Codecov Report

Attention: Patch coverage is 94.78958% with 26 lines in your changes are missing coverage. Please review.

Project coverage is 94.99%. Comparing base (db0cc7e) to head (859e36e).

Files Patch % Lines
...nsgateway_api/src/records/ingestion/hdf_handler.py 90.27% 6 Missing and 8 partials ⚠️
...ateway_api/src/records/ingestion/channel_checks.py 97.87% 0 Missing and 4 partials ⚠️
...gateway_api/src/records/ingestion/record_checks.py 88.88% 2 Missing and 1 partial ⚠️
operationsgateway_api/src/routes/ingest_data.py 91.66% 2 Missing and 1 partial ⚠️
...api/src/records/ingestion/partial_import_checks.py 95.12% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main      #93      +/-   ##
==========================================
+ Coverage   91.33%   94.99%   +3.66%     
==========================================
  Files          47       50       +3     
  Lines        2446     2837     +391     
  Branches      197      297     +100     
==========================================
+ Hits         2234     2695     +461     
+ Misses        186      105      -81     
- Partials       26       37      +11     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

- Split test classes into separate files and make some minor improvements
@MRichards99 MRichards99 force-pushed the DSEGOG-267-HDF-ingestion-validity-checks branch from 803300e to 1145384 Compare April 17, 2024 15:02
@MRichards99 MRichards99 merged commit 6ba3184 into main Apr 19, 2024
7 checks passed
@MRichards99 MRichards99 deleted the DSEGOG-267-HDF-ingestion-validity-checks branch April 19, 2024 14:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants