Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Model support for mzXML, sequences, and method files #664

Closed
24 tasks
hepcat72 opened this issue Mar 27, 2023 · 7 comments
Closed
24 tasks

Add Model support for mzXML, sequences, and method files #664

hepcat72 opened this issue Mar 27, 2023 · 7 comments
Labels
component:6-models data models design:needs-proposals Open for design proposals priority:2-required Issue that must be addressed in order to meet release goals type:enhancement New feature or request

Comments

@hepcat72
Copy link
Collaborator

hepcat72 commented Mar 27, 2023

FEATURE REQUEST

Inspiration

In our TraceBase presentation to the Rabinowitz lab, Josh said he would like to be able to get from peak data/groups to the original sequence, and expressed a concern that it seems limiting (to accomplish that) since we don't store mzXML files (or their data).

Description

Add models and model fields to support mzXML files and their sequence (from a "method" file).

Alternatives

There has been lots of meeting discussions on this, which many proposals and docs. (Please link all relevant documentation in the comments below, if any are missing).

Dependencies

This issue cannot be started until the completion of the following issue(s):

Comment

TBD


ISSUE OWNER SECTION

Assumptions

None

Requirements

  • 1. Be able to get from peak groups to the original sequence
  • 2. Be able to obtain data from unanalyzed peaks (e.g. from mzXML files) for inclusion in a new El-Maven/AccuCor/IsoCorr analysis
  • 3. Be able to obtain unanalyzed mzXML files (whether they were used to generate existing peakgroups or not)
  • 4. Be able to correctly associate mzXML files from same/different samples and with same/different names
  • 4.1. Multiple mzXML files originating from the same sample but different scan
  • 4.2. Multiple mzXML files originating from different samples but with the same sample name
  • 4.3. Multiple mzXML files originating from the same sample but with different sample names (e.g. sample1, sample1_pos_scan2)
  • 5. Be able to handle missing mzXML/sequence/scan data from legacy datasets
  • 6. Require mzXML/sequence/scan data for new datasets

(Guidelines/lesser requirements)...

  • 7. If possible, record data that may be used to gauge whether unpicked peaks exist in an mzXML file
  • 8. Minimize redundancy in database as much as possible (e.g. support the M:M relationship between accucor and mxXML)
  • 9. Minimize disk usage as much as possible

Limitations

None

Affected Components

A tentative list of anticipated repository items that will be changed, labeled
with "add", "delete", or "change". One item per line. (Mostly, this will be a
list of files.)

Primary Changes

  • change: ms_run.py
  • add: ms_sequence.py
  • add: ms_scan.py

Changes prompted by the primary changes

  • change: loading scripts, views, templates

DESIGN

Interface Change description

No outward changes to the current interface. All of the views and loading interfaces should be the same.

Code Change Description

MSSequence

  • researcher (null=False)
  • date (null=False)
  • method_file_name (null=True)
  • method_file_hash (unique=True, null=True)

MSRun change (remove all existing and add...)

  • sample_id (null=False)
  • protocol_id (null=False)
  • mssequence_id (null=True)
  • raw_file_name (null=True)
  • raw_file_hash (unique=True, null=True)

MSScanGroup

  • msrun_id (null=False)
  • mzXML_file_name (null=True)
  • mzXML_file_hash (unique=True, null=True)

PeakGroup change (keep fields not mentioned...)

  • ms_run_id
  • msscangroup_id (null=False)

...

I didn't add the file and submission tables/fields - add as you see fit

Make changes to all supporting code to retain current functionality. E.g.:

  • Edit table paths in views (only supporting current data)
  • Edit table paths in formats (only supporting current data)
  • Edit table paths in templates (only supporting current data)
  • Create empty records in the new tables in the loading code
  • etc.

Tests

  • 1. Be able to get from peak groups to the original sequence
  • 2. Be able to obtain data from unanalyzed peaks (e.g. from mzXML files) for inclusion in a new El-Maven/AccuCor/IsoCorr analysis
  • 3. Be able to obtain unanalyzed mzXML files
  • 4. If possible, record data that may be used to gauge whether unpicked peaks exist in an mzXML file
  • 5. Minimize redundancy in database as much as possible
  • 6. Minimize disk usage as much as possible
  • 7. Be able to correctly associate mzXML files from same/different samples and with same/different names
  • 7.1. Multiple mzXML files originating from the same sample but different scan
  • 7.2. Multiple mzXML files originating from different samples but with the same sample name
  • 7.3. Multiple mzXML files originating from the same sample but with different sample names (e.g. sample1, sample1_pos_scan2)
  • 8. Be able to handle missing mzXML/sequence/scan data from legacy datasets
  • 9. Require mzXML/sequence/scan data for new datasets
@hepcat72 hepcat72 added the type:enhancement New feature or request label Mar 27, 2023
@fkang-pu
Copy link
Collaborator

Please review new model design and my notes (part 6, slides 22-26 ) in uploaded file with prefix 20230327.

@hepcat72
Copy link
Collaborator Author

Please review new model design and my notes (part 6, slides 22-26 ) in uploaded file with prefix 20230327.

@fkang-pu - I apologize. What I intended for this issue was just to distill the model changes we all agreed on in the meetings. However, my description in the issue was too broad. I just edited it to apply only to the model changes. Could you make a small design (either in a file or by filling out the issue owner section in the issue above) that contains only the elements of the schema that we came to consensus on in the meetings?

This would consist solely of changed tables and fields.

And, you can reference your larger scope powerpoint files and all our other files in the issue, e.g. in the comment section where it currently says "TBD".

Your documents have been very helpful thus far and should lead to multiple small issues, this being one of them.

@hepcat72 hepcat72 changed the title Add support for mzXML, sequences, and method files Add Model support for mzXML, sequences, and method files Mar 27, 2023
@hepcat72
Copy link
Collaborator Author

Also, slides 22-26 seem to still include elements that we agreed needed to be removed in the last meeting, so when fleshing out the issue owner section, updates from the last meeting should be included.

@jcmatese
Copy link
Collaborator

jcmatese commented Mar 29, 2023

Regarding requirements,

  • this is the first I have heard of Req. 3 which seems out of scope given most of our discussions? Are we proposing to have runs/scans without an associated peakgroupsets/accucor/isocorr? [e.g. our entire system was initiated/focused on loading/analyzing that file]
  • requirement 4.2 also seems to run counter to our model sample.name being unique

@fkang-pu
Copy link
Collaborator

I had a discussion with John about model changes yesterday afternoon. John prefers to make minimum changes to MSRun (e.g. link method file directly from ArchFile to MSRun). I like his suggestions. After reviewing the requirements proposed by Lance and Rob's design descriptions, I created a new ER diagram (part 7, slide 27) in uploaded file with prefix 20230329. Hopefully it will help us to speed up discussions this afternoon and come up with a final design.

@hepcat72
Copy link
Collaborator Author

First of all - excellent! Thanks for the thorough review! I think this demonstrates the value of documenting and reviewing, not only the requirements, but the design as well. Not everyone is at every meeting and me all have our own interpretations.

Regarding requirements,

  • this is the first I have heard of Req. 3

You are correct. When I sat down to formalize requirements, my touchstone was what Josh was getting at in our meeting where we presented version 2, and when I typed it up:

  1. Be able to obtain data from unanalyzed peaks (e.g. from mzXML files) for inclusion in a new El-Maven/AccuCor/IsoCorr analysis

...it occurred to me that unanalyzed/unpicked peaks could come from not only the current used mzXML files in any of a number of accucor/isocorr files, it could come from mzXML files that were not included ion the run. And that triggered a number of other thoughts which I alluded to in the review of Lance's milestone doc. You can see my comment by looking at the comment history of the doc, but to make searching for it unnecessary, it says:

But I also assumed that we would just be providing the method file from which they could obtain the sequence, as opposed to saving the sequence of samples in the database, which is what I'm inferring you're saying here...? Does that mean we may store samples from the sequence that didn't have any peaks analyzed/saved?

It's somewhat of an indirect reference to requirement no. 3, but Lance indicated that we shouldn't have that as a requirement (in the implied context of the milestones), but he did seem open to including finer-grained requirements in the review discussion here, which is why I retained this requirement.

which seems out of scope given most of our discussions?

I think that excluding this requirement could bite us later when Josh wants to find the mzXML file referred to in the sequence in the method file that was totally unanalyzed. While I'd agree that it may not affect the model implementation we've discussed - it could, depending on how it's implemented in associated code updates that support the new model. For example, if Josh wants to be able to find these excluded mzXML files, but our implementation requires (to load a sample) that it has an MSRun child and that requires a scan group and a peakgroup to exist, it would mean a refactor. That refactor would be unnecessary if we account for this possibility.

What made me think of this was the cold exposure validation. Michael had accidentally included 4 accucor files that he had intended to exclude (so they weren't in the sample sheet). It immediately reminded me, when I wrote these requirements of Josh's guiding feedback that he wanted to be able to obtain unanalyzed data. The mzXML files that went into those accucor files represented (by definition) unanalyzed data. They would show up in the sequence, which they would be able to find by downloading the method file, and go to find the mzXML file in tracebase, and if we don't include them, they won't be there.

So including this requirement would prevent us from making a mistake that might make including them more difficult than it would need to be if we'd kept it in mind during this effort.

Are we proposing to have runs/scans without an associated peakgroupsets/accucor/isocorr? [e.g. our entire system was initiated/focused on loading/analyzing that file]

That is my suggestion, yes, in order to satisfy the ultimate goal that started this refactor: the ability to retrieve unanalyzed data from mzXML files. This requirement seems right in line with that goal.

  • requirement 4.2 also seems to run counter to our model sample.name being unique

I would argue that it is in lock step with the goal. Here's my argument:

  • mzXML files will (and do) have name collisions (we fix the samples by adding a prefix)
  • We have no control over mzXML file names on the originating system
  • Renaming files that may exist in a lab notebook or exist as copies on the originating filesystem under a different name would disrupt the ability to cross-reference what we have with external sources
  • We have the sha to fall back on for unique identification of files

We could rename mzXML files in TraceBase, but I would say that doing so is unnecessary (c.i.p.: the sha) and disruptive (breaks external data associations, c.i.p. lab notebooks and the originating file system).

In fact, I think we have all sorts of integrity issues in TraceBase right now with duplicate samples with different names. In fact, I think we should not require sample names to be unique and instead create sample accession numbers using a study prefix and a number, but that is definitely out of scope here. My point is, we shouldn't be uniquely identifying mzXML files by their non-unique names.

@lparsons
Copy link
Contributor

Closing this issue, and replacing with the mzXML Data Storage milestone.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
component:6-models data models design:needs-proposals Open for design proposals priority:2-required Issue that must be addressed in order to meet release goals type:enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

4 participants