Allow combining ORSO formatted reduced data with raw data_files in orsopy outputs? #99

Open
bmaranville opened this issue Feb 21, 2023 · 4 comments

Comments

@bmaranville
Contributor

It would be nice to allow combining raw data files (not covered by ORSO specs) with reduced data (following the ORSO schema) in a single output file.

I don't know that it makes sense to try to do this in the .ort file, but in an HDF file there is not really any reason not to do this, except if you're worried about size.

For raw data that is coming in as HDF5 already (Mantid? NeXus outputs?) it seems like it would be natural to make the ORSO outputs be just a section in the output file, either near the root (so that the raw data is contained in a subgroup of the ORSO output, maybe in a header entry) or at a parallel level in the file (and then the ORSO header can just link to the existing data).

This would allow a reduction program to e.g. re-open a reduced file and populate all the inputs with the raw data, to allow inspection of the reduction steps.

We could leverage the "external link" capability of HDF5 files, but then you have to make sure the linked files are in the same folder as the HDF5 that is accessing them, and it therefore also requires that you be working with a local filesystem (instead of a cloud system or the like, which is becoming much more common).
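The two layouts and the link options described above can be sketched with h5py. This is only an illustration under stated assumptions: the group names (`raw`, `reduced`, `data_files`), the dataset names, and the external file name `scan_002.h5` are hypothetical and are not part of any ORSO specification or of orsopy.

```python
import h5py

# In-memory HDF5 file, so the sketch writes nothing to disk.
with h5py.File("bundle.h5", "w", driver="core", backing_store=False) as f:
    # Raw data stored at a parallel level to the reduced data
    # (all group and dataset names here are hypothetical).
    raw = f.create_group("raw")
    scan = raw.create_group("scan_001")
    scan.create_dataset("counts", data=[1200, 850, 430])

    # Reduced data in its own section of the same file.
    reduced = f.create_group("reduced")
    reduced.create_dataset("R", data=[1.0, 0.71, 0.36])
    links = reduced.create_group("data_files")

    # Soft link: the reduced section points at raw data inside this file.
    links["scan_001"] = h5py.SoftLink("/raw/scan_001")

    # External link: points into another file and is resolved only on access,
    # so the target file must be reachable from wherever this file is opened.
    links["scan_002"] = h5py.ExternalLink("scan_002.h5", "/entry")

    # Follow the soft link to read the raw counts.
    linked_counts = links["scan_001"]["counts"][()]
    # Inspect the external link without resolving it.
    link_kind = type(links.get("scan_002", getlink=True)).__name__
```

The soft link keeps everything in one file, while the external link keeps the file small at the cost of depending on a second file being available.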

@bmaranville
Contributor Author

It could be as simple as adding an optional attribute `contents` to `fileio.base.File`.
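A minimal sketch of what such an optional attribute could look like, using a stand-in dataclass rather than orsopy's real class. The class name `RawFile`, its `to_dict` helper, and the example file name are hypothetical; only the idea of an optional `contents` field comes from the suggestion above.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class RawFile:
    """Hypothetical stand-in for a file record; not orsopy's actual class."""

    file: str
    timestamp: Optional[datetime] = None
    # Proposed optional attribute: the raw file's text, embedded verbatim.
    contents: Optional[str] = None

    def to_dict(self) -> dict:
        """Serialize the record, omitting optional fields that are unset."""
        out = {"file": self.file}
        if self.timestamp is not None:
            out["timestamp"] = self.timestamp.isoformat()
        if self.contents is not None:
            out["contents"] = self.contents
        return out


# A record that only names the raw file, as is possible today.
linked = RawFile(file="scan_001.xy")
# A record that also bundles the raw text.
bundled = RawFile(file="scan_001.xy", contents="0.10 1200\n0.20 850\n")
```

Because the field defaults to `None` and is skipped on output, files written without bundled raw data would look the same as before.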

@aglavic
Collaborator

aglavic commented Mar 3, 2023

I definitely would not do this for the text representation as it makes files more complex and does not follow the concept of reduced data for analysis purposes.

For the use case you describe with the reduction software, we already have the reduction parameters and the raw data file names needed to reconstruct the reduction. So I only see an advantage if someone wants to re-reduce data at home without access to the facility IT infrastructure.

I agree, there is no reason not to put additional data in an HDF5 representation, and if we keep the specification as open as for the ORT format there is complete freedom where to put these data.
From my experience at SNS, it would probably not be what you want: each raw data file is about 50 MB, so a single reflectivity measurement could become 250-500 MB for a dataset that would otherwise be reduced to 50 kB.
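For scale, the sizes quoted above work out as follows. This is a back-of-the-envelope sketch: the number of raw files per measurement is inferred from the quoted totals, and decimal units (1 MB = 1000 kB) are an assumption.

```python
RAW_FILE_MB = 50   # size of one raw data file, as quoted above
REDUCED_KB = 50    # size of the reduced dataset, as quoted above


def bundle_overhead(total_raw_mb):
    """Return (implied number of raw files, growth factor of the bundled file)."""
    n_files = total_raw_mb / RAW_FILE_MB
    growth = total_raw_mb * 1000 / REDUCED_KB  # assumes 1 MB = 1000 kB
    return n_files, growth


low = bundle_overhead(250)   # lower end of the quoted range
high = bundle_overhead(500)  # upper end of the quoted range
```

Under these assumptions the bundled file would be several thousand times larger than the reduced data alone.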

@bmaranville
Contributor Author

The use case I had in mind was for benchtop x-rays, where the raw data tends to be:

  • easy to lose track of (only exists on users' USB flash drives for example)
  • text-based
  • small

Bundling the raw data with the reduced data is clearly superior in this case, where the only alternative is to refer to an arbitrary, often non-unique filename from a user's computer.

I would agree that in many cases (ORT format, and with large datasets from SNS) the instrument scientist would probably not choose to bundle the raw data with the ORSO-formatted reflectivity data.

@aglavic
Collaborator

aglavic commented Mar 22, 2023

In the case of benchtop systems there is no need to bundle raw data, as the "reduced" data is already the raw data. I would have thought that companies implementing the .ort file would drop support for other text-based formats completely.

All the ASCII-based XRR file formats I've seen so far are very simplistic and (besides a custom header) contain only the angle and counts, plus maybe the attenuator. In a reduced .ort file they could still keep these raw columns as optional information and just add the calculated Q column and the error on reflectivity. A bundled raw file would just repeat the same column data.
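The added columns mentioned above could be computed along these lines. This is a sketch under several assumptions that are not stated in the thread: the angle is the incident angle in degrees (half the scattering angle), the wavelength is Cu K-alpha1, the error is pure Poisson counting error, and the direct-beam intensity is known exactly. The function name and the sample rows are made up for illustration.

```python
import math

WAVELENGTH_ANGSTROM = 1.5406  # assumed Cu K-alpha1; use the instrument's value


def reduce_point(theta_deg, counts, direct_beam_counts, attenuation=1.0):
    """Return (Q, R, dR) for one raw point.

    theta_deg is assumed to be the incident angle (half the scattering angle).
    attenuation is the factor by which the attenuator reduced the count rate.
    """
    # Momentum transfer in inverse Angstrom: Q = 4*pi*sin(theta)/lambda.
    q = 4.0 * math.pi * math.sin(math.radians(theta_deg)) / WAVELENGTH_ANGSTROM
    scale = attenuation / direct_beam_counts
    r = counts * scale
    dr = math.sqrt(counts) * scale  # Poisson counting error only
    return q, r, dr


# Hypothetical raw rows: (angle in degrees, counts, attenuation factor).
raw_rows = [(0.2, 98000, 1.0), (0.5, 1600, 1.0), (1.0, 25, 1.0)]
reduced_rows = [reduce_point(t, c, 100000, a) for t, c, a in raw_rows]
```

The raw angle and counts columns can then sit next to Q, R and dR in the same table, which is the sense in which a separate bundled raw file would be redundant.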

If we want this kind of flexibility in general, we could allow a header flag that makes orsopy ignore that data_set. Any user-defined data could then follow, with a custom header and custom columns. In this case I would specify a few rules that such data should still follow:

  1. The header is stored in the same YAML/JSON-based format.
  2. Columns have the same structure, and the columns key in the header follows the ORSO specification.
  3. The header is not "merged" with the first dataset header, as in ORSO files, but is read separately.
  4. The first dataset cannot be ignored data of this kind; it must be valid ORSO.

I would support such a feature for the additional flexibility. The rules would make it easy to implement reading such files without too much effort and prevent breaking of automatic analysis.
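A reader honoring the four rules above might look roughly like this. It is a sketch, not orsopy code: the flag name `orso_ignore` and the function `split_datasets` are hypothetical, the headers are assumed to be already parsed from YAML/JSON into one dict per dataset without merging (rules 1 and 3), and the columns check is deliberately simplified.

```python
IGNORE_FLAG = "orso_ignore"  # hypothetical flag name, not defined by ORSO


def split_datasets(headers):
    """Return (orso_indices, ignored_indices) for a list of parsed headers.

    headers holds one dict per dataset, each parsed on its own from the
    YAML/JSON header text without merging (rules 1 and 3).
    """
    if not headers:
        raise ValueError("file contains no datasets")
    # Rule 4: the first dataset must be regular ORSO data.
    if headers[0].get(IGNORE_FLAG, False):
        raise ValueError("first dataset must be valid ORSO (rule 4)")

    orso, ignored = [], []
    for index, header in enumerate(headers):
        # Rule 2 (simplified): a columns list whose entries each have a name.
        columns = header.get("columns")
        if not isinstance(columns, list) or not all(
            isinstance(col, dict) and "name" in col for col in columns
        ):
            raise ValueError(f"dataset {index}: invalid columns key (rule 2)")
        if header.get(IGNORE_FLAG, False):
            ignored.append(index)
        else:
            orso.append(index)
    return orso, ignored


example_headers = [
    {"columns": [{"name": "Qz", "unit": "1/angstrom"}, {"name": "R"}]},
    {IGNORE_FLAG: True, "columns": [{"name": "angle"}, {"name": "counts"}]},
]
orso_sets, ignored_sets = split_datasets(example_headers)
```

An analysis program would then process only the indices in `orso_sets` and could skip the rest without parsing their custom header content.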
