Add Model support for mzXML, sequences, and method files #664

hepcat72 · 2023-03-27T14:53:29Z

FEATURE REQUEST

Inspiration

In our TraceBase presentation to the Rabinowitz lab, Josh said he would like to be able to get from peak data/groups to the original sequence, and expressed a concern that it seems limiting (to accomplish that) since we don't store mzXML files (or their data).

Description

Add models and model fields to support mzXML files and their sequence (from a "method" file).

Alternatives

There has been lots of meeting discussions on this, which many proposals and docs. (Please link all relevant documentation in the comments below, if any are missing).

Dependencies

This issue cannot be started until the completion of the following issue(s):

All PRs associated with issue User Data Validation Improvement #580

Comment

TBD

ISSUE OWNER SECTION

Assumptions

None

Requirements

1. Be able to get from peak groups to the original sequence
2. Be able to obtain data from unanalyzed peaks (e.g. from mzXML files) for inclusion in a new El-Maven/AccuCor/IsoCorr analysis
3. Be able to obtain unanalyzed mzXML files (whether they were used to generate existing peakgroups or not)
4. Be able to correctly associate mzXML files from same/different samples and with same/different names
4.1. Multiple mzXML files originating from the same sample but different scan
4.2. Multiple mzXML files originating from different samples but with the same sample name
4.3. Multiple mzXML files originating from the same sample but with different sample names (e.g. sample1, sample1_pos_scan2)
5. Be able to handle missing mzXML/sequence/scan data from legacy datasets
6. Require mzXML/sequence/scan data for new datasets

(Guidelines/lesser requirements)...

7. If possible, record data that may be used to gauge whether unpicked peaks exist in an mzXML file
8. Minimize redundancy in database as much as possible (e.g. support the M:M relationship between accucor and mxXML)
9. Minimize disk usage as much as possible

Limitations

None

Affected Components

A tentative list of anticipated repository items that will be changed, labeled
with "add", "delete", or "change". One item per line. (Mostly, this will be a
list of files.)

Primary Changes

change: ms_run.py
add: ms_sequence.py
add: ms_scan.py

Changes prompted by the primary changes

change: loading scripts, views, templates

DESIGN

Interface Change description

No outward changes to the current interface. All of the views and loading interfaces should be the same.

Code Change Description

MSSequence

researcher (null=False)
date (null=False)
method_file_name (null=True)
method_file_hash (unique=True, null=True)

MSRun change (remove all existing and add...)

sample_id (null=False)
protocol_id (null=False)
mssequence_id (null=True)
raw_file_name (null=True)
raw_file_hash (unique=True, null=True)

MSScanGroup

msrun_id (null=False)
mzXML_file_name (null=True)
mzXML_file_hash (unique=True, null=True)

PeakGroup change (keep fields not mentioned...)

~~ms_run_id~~
msscangroup_id (null=False)

...

I didn't add the file and submission tables/fields - add as you see fit

Make changes to all supporting code to retain current functionality. E.g.:

Edit table paths in views (only supporting current data)
Edit table paths in formats (only supporting current data)
Edit table paths in templates (only supporting current data)
Create empty records in the new tables in the loading code
etc.

Tests

fkang-pu · 2023-03-27T20:51:14Z

Please review new model design and my notes (part 6, slides 22-26 ) in uploaded file with prefix 20230327.

hepcat72 · 2023-03-27T21:50:54Z

Please review new model design and my notes (part 6, slides 22-26 ) in uploaded file with prefix 20230327.

@fkang-pu - I apologize. What I intended for this issue was just to distill the model changes we all agreed on in the meetings. However, my description in the issue was too broad. I just edited it to apply only to the model changes. Could you make a small design (either in a file or by filling out the issue owner section in the issue above) that contains only the elements of the schema that we came to consensus on in the meetings?

This would consist solely of changed tables and fields.

And, you can reference your larger scope powerpoint files and all our other files in the issue, e.g. in the comment section where it currently says "TBD".

Your documents have been very helpful thus far and should lead to multiple small issues, this being one of them.

hepcat72 · 2023-03-27T22:27:05Z

Also, slides 22-26 seem to still include elements that we agreed needed to be removed in the last meeting, so when fleshing out the issue owner section, updates from the last meeting should be included.

jcmatese · 2023-03-29T13:00:09Z

Regarding requirements,

this is the first I have heard of Req. 3 which seems out of scope given most of our discussions? Are we proposing to have runs/scans without an associated peakgroupsets/accucor/isocorr? [e.g. our entire system was initiated/focused on loading/analyzing that file]
requirement 4.2 also seems to run counter to our model sample.name being unique

fkang-pu · 2023-03-29T15:01:51Z

I had a discussion with John about model changes yesterday afternoon. John prefers to make minimum changes to MSRun (e.g. link method file directly from ArchFile to MSRun). I like his suggestions. After reviewing the requirements proposed by Lance and Rob's design descriptions, I created a new ER diagram (part 7, slide 27) in uploaded file with prefix 20230329. Hopefully it will help us to speed up discussions this afternoon and come up with a final design.

hepcat72 · 2023-03-29T15:30:23Z

First of all - excellent! Thanks for the thorough review! I think this demonstrates the value of documenting and reviewing, not only the requirements, but the design as well. Not everyone is at every meeting and me all have our own interpretations.

Regarding requirements,

this is the first I have heard of Req. 3

You are correct. When I sat down to formalize requirements, my touchstone was what Josh was getting at in our meeting where we presented version 2, and when I typed it up:

Be able to obtain data from unanalyzed peaks (e.g. from mzXML files) for inclusion in a new El-Maven/AccuCor/IsoCorr analysis

...it occurred to me that unanalyzed/unpicked peaks could come from not only the current used mzXML files in any of a number of accucor/isocorr files, it could come from mzXML files that were not included ion the run. And that triggered a number of other thoughts which I alluded to in the review of Lance's milestone doc. You can see my comment by looking at the comment history of the doc, but to make searching for it unnecessary, it says:

But I also assumed that we would just be providing the method file from which they could obtain the sequence, as opposed to saving the sequence of samples in the database, which is what I'm inferring you're saying here...? Does that mean we may store samples from the sequence that didn't have any peaks analyzed/saved?

It's somewhat of an indirect reference to requirement no. 3, but Lance indicated that we shouldn't have that as a requirement (in the implied context of the milestones), but he did seem open to including finer-grained requirements in the review discussion here, which is why I retained this requirement.

which seems out of scope given most of our discussions?

I think that excluding this requirement could bite us later when Josh wants to find the mzXML file referred to in the sequence in the method file that was totally unanalyzed. While I'd agree that it may not affect the model implementation we've discussed - it could, depending on how it's implemented in associated code updates that support the new model. For example, if Josh wants to be able to find these excluded mzXML files, but our implementation requires (to load a sample) that it has an MSRun child and that requires a scan group and a peakgroup to exist, it would mean a refactor. That refactor would be unnecessary if we account for this possibility.

What made me think of this was the cold exposure validation. Michael had accidentally included 4 accucor files that he had intended to exclude (so they weren't in the sample sheet). It immediately reminded me, when I wrote these requirements of Josh's guiding feedback that he wanted to be able to obtain unanalyzed data. The mzXML files that went into those accucor files represented (by definition) unanalyzed data. They would show up in the sequence, which they would be able to find by downloading the method file, and go to find the mzXML file in tracebase, and if we don't include them, they won't be there.

So including this requirement would prevent us from making a mistake that might make including them more difficult than it would need to be if we'd kept it in mind during this effort.

Are we proposing to have runs/scans without an associated peakgroupsets/accucor/isocorr? [e.g. our entire system was initiated/focused on loading/analyzing that file]

That is my suggestion, yes, in order to satisfy the ultimate goal that started this refactor: the ability to retrieve unanalyzed data from mzXML files. This requirement seems right in line with that goal.

requirement 4.2 also seems to run counter to our model sample.name being unique

I would argue that it is in lock step with the goal. Here's my argument:

mzXML files will (and do) have name collisions (we fix the samples by adding a prefix)
We have no control over mzXML file names on the originating system
Renaming files that may exist in a lab notebook or exist as copies on the originating filesystem under a different name would disrupt the ability to cross-reference what we have with external sources
We have the sha to fall back on for unique identification of files

We could rename mzXML files in TraceBase, but I would say that doing so is unnecessary (c.i.p.: the sha) and disruptive (breaks external data associations, c.i.p. lab notebooks and the originating file system).

In fact, I think we have all sorts of integrity issues in TraceBase right now with duplicate samples with different names. In fact, I think we should not require sample names to be unique and instead create sample accession numbers using a study prefix and a number, but that is definitely out of scope here. My point is, we shouldn't be uniquely identifying mzXML files by their non-unique names.

lparsons · 2023-07-10T17:44:01Z

Closing this issue, and replacing with the mzXML Data Storage milestone.

hepcat72 added the type:enhancement New feature or request label Mar 27, 2023

hepcat72 changed the title ~~Add support for mzXML, sequences, and method files~~ Add Model support for mzXML, sequences, and method files Mar 27, 2023

lparsons added this to the mzXML data storage milestone Mar 30, 2023

lparsons added design:needs-proposals Open for design proposals component:6-models data models labels Mar 30, 2023

lparsons mentioned this issue Mar 30, 2023

Move PeakGroupSet (i.e. the filename) into MSRun and include it in the UniqueConstraint #636

Closed

6 tasks

hepcat72 added the priority:2-required Issue that must be addressed in order to meet release goals label Apr 28, 2023

hepcat72 mentioned this issue Apr 28, 2023

Add median retention time to Compounds table #563

Open

6 tasks

lparsons mentioned this issue Jul 10, 2023

Convert PeakGroupSet to ArchiveFile #693

Closed

8 tasks

lparsons closed this as completed Jul 10, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add Model support for mzXML, sequences, and method files #664

Add Model support for mzXML, sequences, and method files #664

hepcat72 commented Mar 27, 2023 •

edited

fkang-pu commented Mar 27, 2023

hepcat72 commented Mar 27, 2023

hepcat72 commented Mar 27, 2023

jcmatese commented Mar 29, 2023 •

edited

fkang-pu commented Mar 29, 2023

hepcat72 commented Mar 29, 2023

lparsons commented Jul 10, 2023

Add Model support for mzXML, sequences, and method files #664

Add Model support for mzXML, sequences, and method files #664

Comments

hepcat72 commented Mar 27, 2023 • edited

FEATURE REQUEST

Inspiration

Description

Alternatives

Dependencies

Comment

ISSUE OWNER SECTION

Assumptions

Requirements

Limitations

Affected Components

Primary Changes

Changes prompted by the primary changes

DESIGN

Interface Change description

Code Change Description

Tests

fkang-pu commented Mar 27, 2023

hepcat72 commented Mar 27, 2023

hepcat72 commented Mar 27, 2023

jcmatese commented Mar 29, 2023 • edited

fkang-pu commented Mar 29, 2023

hepcat72 commented Mar 29, 2023

lparsons commented Jul 10, 2023

hepcat72 commented Mar 27, 2023 •

edited

jcmatese commented Mar 29, 2023 •

edited