Collection Health Prototype - Object Dataset #1544

Closed
2 tasks
terrywbrady opened this issue Jul 31, 2023 · 1 comment
terrywbrady commented Jul 31, 2023

Design Document

TODO

  • Understand how to perform object replace in OpenSearch
  • Mock JSON records representing objects
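
A minimal sketch of both TODO items, offered as one candidate approach rather than a settled design. It assumes the OpenSearch `_bulk` API is used, where an `index` action creates a document or replaces the existing document that has the same `_id`. The index name `objects`, every field name, and the ARK values are hypothetical placeholders, not the actual Merritt schema.

```python
import json


def mock_object_record(object_id, ark, collection, owner, file_count):
    """Build a mock record for one object (all field names are hypothetical)."""
    return {
        "object_id": object_id,
        "ark": ark,
        "collection": collection,
        "owner": owner,
        "file_count": file_count,
    }


def bulk_replace_body(index_name, records, id_field="object_id"):
    """Serialize records as an OpenSearch _bulk request body (NDJSON).

    Each record is preceded by an "index" action line. Reusing the same _id
    on a later load replaces the whole stored document.
    """
    lines = []
    for rec in records:
        action = {"index": {"_index": index_name, "_id": str(rec[id_field])}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(rec))
    # The bulk body must end with a newline.
    return "\n".join(lines) + "\n"


records = [
    mock_object_record(1, "ark:/99999/fk4example1", "demo_collection", "demo_owner", 3),
    mock_object_record(2, "ark:/99999/fk4example2", "demo_collection", "demo_owner", 7),
]
body = bulk_replace_body("objects", records)
print(body)
```

The resulting string could be sent as the body of a `POST /_bulk` request. Deriving `_id` from a stable object identifier is what makes a reload behave as a replace instead of creating duplicates.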

Billing Database Extract (135K rows, 43MB)

Producer Files Extract (18M rows, 2.5GB -- likely more with enhancements)

  • Generate a JSON document for each producer file in Merritt. Include collection information, owner information, "Mime Group" information and possibly "inv_ingests" information
  • Analysis to be performed
    • match file types to a database of sustainable/at risk file types
    • match filenames to patterns
      • identify metadata sidecar files
      • identify content files
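
The analysis steps above could be sketched roughly as follows. This is an illustration under stated assumptions: the sustainability table, its status labels, and the sidecar filename patterns are hypothetical stand-ins for whatever reference data the project settles on.

```python
import fnmatch

# Hypothetical lookup table: MIME type -> sustainability status.
SUSTAINABILITY = {
    "application/pdf": "sustainable",
    "image/tiff": "sustainable",
    "application/x-shockwave-flash": "at_risk",
}

# Hypothetical filename patterns that mark a file as a metadata sidecar.
SIDECAR_PATTERNS = ["mrt-*.txt", "mrt-*.xml", "*.metadata.json"]


def classify_file(pathname, mime_type):
    """Return a JSON-ready document describing one producer file."""
    # Match on the lowercased final path segment only.
    basename = pathname.rsplit("/", 1)[-1].lower()
    is_sidecar = any(fnmatch.fnmatchcase(basename, p) for p in SIDECAR_PATTERNS)
    return {
        "pathname": pathname,
        "mime_type": mime_type,
        "sustainability": SUSTAINABILITY.get(mime_type, "unknown"),
        "role": "metadata_sidecar" if is_sidecar else "content",
    }


print(classify_file("producer/mrt-erc.txt", "text/plain"))
print(classify_file("producer/report.pdf", "application/pdf"))
```

Anything not matched by a sidecar pattern is treated as content here, which is a simplifying assumption; a real pass might need a third "unclassified" bucket.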

Objects Extract (3.5M rows)

  • Generate a JSON document for each object in Merritt
  • Include collection and owner information
  • Include file information
  • Include localid information
  • Analysis to be performed
    • Find objects with metadata files
    • Find objects with content files
    • Find objects with local ids
    • Find objects with meaningful metadata
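
A rough sketch of how per-object documents might be assembled and flagged for the first three analyses listed above. The input keys (`object_id`, `collection`, `owner`, `pathname`, `role`) and the role values are hypothetical, and "meaningful metadata" is left out because the issue does not define a criterion for it.

```python
def build_object_docs(file_records, local_ids):
    """Group per-file records into one JSON-ready document per object.

    file_records: iterable of dicts with hypothetical keys
        object_id, collection, owner, pathname, role
    local_ids: dict mapping object_id -> list of local id strings
    """
    docs = {}
    for rec in file_records:
        oid = rec["object_id"]
        doc = docs.setdefault(oid, {
            "object_id": oid,
            "collection": rec["collection"],
            "owner": rec["owner"],
            "files": [],
            "local_ids": list(local_ids.get(oid, [])),
        })
        doc["files"].append({"pathname": rec["pathname"], "role": rec["role"]})

    # Derive boolean flags so each analysis becomes a simple filter.
    for doc in docs.values():
        roles = {f["role"] for f in doc["files"]}
        doc["has_metadata_files"] = "metadata_sidecar" in roles
        doc["has_content_files"] = "content" in roles
        doc["has_local_ids"] = bool(doc["local_ids"])
    return list(docs.values())


sample_files = [
    {"object_id": 1, "collection": "c1", "owner": "o1",
     "pathname": "producer/a.pdf", "role": "content"},
    {"object_id": 1, "collection": "c1", "owner": "o1",
     "pathname": "producer/mrt-erc.txt", "role": "metadata_sidecar"},
    {"object_id": 2, "collection": "c1", "owner": "o1",
     "pathname": "producer/b.tif", "role": "content"},
]
object_docs = build_object_docs(sample_files, {1: ["local-001"]})
without_local_ids = [d["object_id"] for d in object_docs if not d["has_local_ids"]]
print(without_local_ids)
```

Storing the flags on the document, instead of computing them at query time, keeps the search-side queries to plain term filters.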
@terrywbrady terrywbrady self-assigned this Jul 31, 2023
@elopatin-uc3

@terrywbrady In the same vein of finding objects with meaningful metadata, it would be beneficial to find objects that are missing ERC/object-level metadata.
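
As a hedged illustration of that check, the sketch below lists which ERC fields are absent from an object document. The field names `erc_who`, `erc_what`, `erc_when` and the placeholder value `(:unas)` are assumptions about how object-level metadata might be stored, not confirmed details of the dataset.

```python
# Assumed names of object-level ERC fields on each object document.
ERC_FIELDS = ("erc_who", "erc_what", "erc_when")

# Assumed values that count as "no metadata supplied".
PLACEHOLDERS = {"", "(:unas)"}


def missing_erc_fields(doc):
    """Return the ERC field names that are absent, empty, or placeholders."""
    missing = []
    for field in ERC_FIELDS:
        value = doc.get(field)
        if value is None or str(value).strip() in PLACEHOLDERS:
            missing.append(field)
    return missing


example = {"object_id": 1, "erc_who": "Smith, J.", "erc_what": "(:unas)"}
print(missing_erc_fields(example))
```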

@terrywbrady pinned this issue Aug 4, 2023
@terrywbrady changed the title from "Collection Health Prototype" to "Collection Health Prototype - Object Dataset" Aug 22, 2023