Object Analysis campus documentation #1895

Open
elopatin-uc3 opened this issue May 8, 2024 · 1 comment

elopatin-uc3 commented May 8, 2024

Extend existing documentation with details pertaining to analysis categories and tests.
https://github.com/CDLUC3/mrt-cron/blob/main/coll-health-obj-analysis/README.md

Examples: object classification (complexity, high-res files, derivatives), metadata classification, and mime type sustainability.

Purpose: Allow depositors to submit specific analysis requests, which can then be answered with data pulled from the system for their review.

elopatin-uc3 self-assigned this May 8, 2024

elopatin-uc3 commented Jun 7, 2024

Documentation Components

  • Glossary of:
    • Object classifications
    • Metadata classifications
    • Recognized metadata types (e.g. bag-info.txt, etd data.xml, Nuxeo-style metadata)
    • Tests:
      • object classification test (e.g. complex object)
      • metadata classification test (e.g. has sidecar metadata file)
      • mime type tests (e.g. mime extension mismatch, unexpected mime extension warnings)
      • ignored file test
      • object-level metadata tests: ERC who, what, when; local ID presence
      • duplicate checksum test
      • empty file tests: producer (warn), system (info)
      • missing file extension
      • check for URL-like file names
      • file(s) deleted from the object in the latest version
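
To make the glossary entries for the file-level tests concrete, here is a minimal Python sketch of how a few of them could be expressed. It is an illustration only, not the implementation in mrt-cron. `FileRecord`, the function names, the `producer/` path prefix check, the extension table, and every status other than the empty-file ones (producer: Warn, system: Info, as listed above) are assumptions made for this example.

```python
from collections import defaultdict
from dataclasses import dataclass
from pathlib import PurePosixPath

# Hypothetical lookup tables; the real analysis may use different values.
IGNORED_NAMES = {".ds_store", "thumbs.db"}
EXPECTED_EXTENSIONS = {
    "image/tiff": {".tif", ".tiff"},
    "application/xml": {".xml"},
    "text/plain": {".txt"},
}


@dataclass
class FileRecord:
    """Hypothetical, simplified view of one file in an object."""
    path: str
    size: int
    checksum: str
    mime_type: str


def mime_extension_test(f: FileRecord) -> str:
    """Warn when the extension is not one expected for the mime type."""
    expected = EXPECTED_EXTENSIONS.get(f.mime_type)
    if expected is None:
        return "Info"  # assumption: no expectation recorded for this type
    ext = PurePosixPath(f.path).suffix.lower()
    return "Pass" if ext in expected else "Warn"


def ignored_file_test(f: FileRecord) -> str:
    """Warn on unwanted files such as .DS_Store (status is an assumption)."""
    name = PurePosixPath(f.path).name.lower()
    return "Warn" if name in IGNORED_NAMES else "Pass"


def empty_file_test(f: FileRecord) -> str:
    """Empty producer files warn; empty system files are informational."""
    if f.size > 0:
        return "Pass"
    # Assumption: producer files live under a "producer/" path prefix.
    return "Warn" if f.path.startswith("producer/") else "Info"


def missing_extension_test(f: FileRecord) -> str:
    """Warn when the file name has no extension (status is an assumption)."""
    return "Pass" if PurePosixPath(f.path).suffix else "Warn"


def url_like_name_test(f: FileRecord) -> str:
    """Warn when the path embeds something that looks like a URL."""
    lowered = f.path.lower()
    return "Warn" if "http://" in lowered or "https://" in lowered else "Pass"


def duplicate_checksum_groups(files: list[FileRecord]) -> dict[str, list[str]]:
    """Group paths by checksum and keep only checksums seen more than once."""
    by_checksum = defaultdict(list)
    for f in files:
        by_checksum[f.checksum].append(f.path)
    return {c: paths for c, paths in by_checksum.items() if len(paths) > 1}
```

Each function returns one of the statuses named in this issue (Pass, Info, Warn, Fail), so the results could be tabulated per object and reported per collection.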

Sample analysis requests focused on discovering a variety of object qualities

Sample collection requests - coupled with specific classifications or tests

  • missing meaningful object-level metadata
  • unsustainable mime types
  • mime extension mismatch
  • warnings related to an unexpected mime extension (e.g. use of .txt for xml files)
  • missing expected derivatives
  • only derivatives present
  • search for a specific file type within collection
  • search for objects that are missing sidecar metadata
  • edge cases: missing file extensions, empty files, unwanted files (e.g. .DS_Store, Thumbs.db)
  • enumerate objects with a specific metadata type (e.g. bag-info.txt)
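
As a rough illustration of how the collection requests above could be answered once per-object results exist, the following Python sketch filters a list of analysis records by classification, test status, or mime type. The `ObjectAnalysis` structure, the `select` helper, and the classification and test names used in the usage note are all hypothetical; the actual output of the object analysis may be shaped differently.

```python
from dataclasses import dataclass, field


@dataclass
class ObjectAnalysis:
    """Hypothetical per-object analysis result."""
    object_id: str
    classifications: set[str] = field(default_factory=set)
    test_results: dict[str, str] = field(default_factory=dict)
    mime_types: set[str] = field(default_factory=set)


def select(objects, classification=None, test=None, status=None, mime_type=None):
    """Return the ids of objects matching every criterion that was given."""
    matches = []
    for obj in objects:
        if classification is not None and classification not in obj.classifications:
            continue
        if test is not None and obj.test_results.get(test) != status:
            continue
        if mime_type is not None and mime_type not in obj.mime_types:
            continue
        matches.append(obj.object_id)
    return matches
```

For example, `select(results, test="mime-ext-mismatch", status="Warn")` would list the objects behind a "mime extension mismatch" request, and `select(results, mime_type="image/tiff")` would cover a search for a specific file type.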

Enhancements

  • Add a specialized test status to recognized mime types (e.g. Warn for .iiq when classified as TIFF)
  • Associate a recognized mime type with a specific test status (Pass, Info, Warn, Fail)
    • the majority of recognized types are currently flagged with an Info status
  • Add recognized derivative file types
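
One possible shape for the mime type enhancements is a status table with per-extension overrides. The Python sketch below is an assumption about how this could be modeled, not the existing configuration: `MIME_STATUS`, `EXTENSION_OVERRIDES`, and `mime_status` are invented names, and the only value taken from this issue is the Warn for `.iiq` files classified as TIFF.

```python
from pathlib import PurePosixPath

# Hypothetical default status per recognized mime type.
# Most recognized types currently carry an Info status.
MIME_STATUS = {
    "image/tiff": "Info",
    "application/pdf": "Info",
}

# Hypothetical overrides keyed by (mime type, lowercase extension).
EXTENSION_OVERRIDES = {
    ("image/tiff", ".iiq"): "Warn",
}


def mime_status(mime_type, path):
    """Return the status for a file, or None if the mime type is unrecognized."""
    ext = PurePosixPath(path).suffix.lower()
    override = EXTENSION_OVERRIDES.get((mime_type, ext))
    if override is not None:
        return override
    return MIME_STATUS.get(mime_type)
```

Keeping the overrides separate from the defaults means a specialized status can be added for one extension without changing how the rest of that mime type is reported.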

Future tests

  • Local ID valid: test if local ID conforms to naming convention for a collection
  • ERC metadata valid: test if ERC metadata conforms to the naming conventions for a collection
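
A "Local ID valid" test could be sketched as a per-collection pattern check, as in the Python example below. The collection name, the pattern, and the choice of statuses (Info when no convention is configured, Fail on a mismatch) are placeholders chosen for illustration; real conventions would need to come from each collection's depositor.

```python
import re

# Hypothetical naming conventions, keyed by collection name.
LOCAL_ID_PATTERNS = {
    "demo_collection": re.compile(r"[a-z]{3}_\d{4}"),
}


def local_id_valid_test(collection, local_id):
    """Check a local ID against the collection's convention, if one exists."""
    pattern = LOCAL_ID_PATTERNS.get(collection)
    if pattern is None:
        return "Info"  # assumption: no convention configured for this collection
    # fullmatch requires the whole ID to conform, not just a prefix.
    return "Pass" if pattern.fullmatch(local_id) else "Fail"
```

An ERC metadata validity test could follow the same approach, with one pattern per ERC element (who, what, when).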
