
Data order columns first #107

Open: wants to merge 5 commits into main
Conversation

@bmaranville (Contributor)

As we consider support for multidimensional columns (e.g. resolution), it would make the most sense to have the column index first in the data order. Then extra dimensions can be added to the shape.

Under the current ordering, the column length is the first index and the column number is the second, so additional dimensions would have to go after the column index, which is awkward.

Note that multidimensional columns will be most easily supported in a binary format (NeXus).

@arm61 (Contributor) commented May 9, 2023

So this doesn't affect how the data are written, just how we store it internally?

@bmaranville (Contributor, Author)

Correct - it's still columns in the .ort file.

@aglavic (Collaborator) commented May 9, 2023

I'm actually very skeptical about making this change. Multi-dimensional data will be a special case that requires further discussion for the text format, so I don't really see any advantage in changing the index order. Initially I would have been in favour of this change, but now that we have published version 1.x and other programs rely on it, this change will break every program that depends on orsopy.

Is there any important reason to force this change?

@bmaranville (Contributor, Author) commented May 9, 2023

This is only an important change if we are going to support multidimensional datasets. If we as a group decide that is something we are going to do, then this change is to make sure that the shape of the arrays we're using internally looks like this:
(column_number, dataset_dim1, dataset_dim2, dataset_dim3...) and not
(dataset_dim1, column_number, dataset_dim2, dataset_dim3...)

It is not the end of the world if we have the second instead of the first, but it doesn't seem like an easy thing to explain to end-users. Telling them to stack their 1-d columns in column order instead of row order is a much easier thing to ask of them.
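
As a rough illustration (the column names and sizes here are just placeholders), stacking 1-d columns in column order is a single numpy call, while the current ordering is its transpose:

import numpy as np

# placeholder 1-d columns
Qz = np.linspace(0.01, 0.3, 101)
R = np.exp(-Qz)
dR = 0.05 * R

data = np.stack([Qz, R, dR])                  # proposed: shape (3, 101), column index first
data_current = np.column_stack([Qz, R, dR])   # current: shape (101, 3), points first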

This is clearly a breaking change and would have to be done very carefully of course, changing version numbers etc.

I think the primary thing for us all to decide is if we want to support multidimensional datasets going forward. From our early discussions it seemed like this might be important for some use cases, but I can't speak to that directly (I am not one of the ones that would make immediate use of multidimensional datasets - does anyone else want to weigh in?)

@bmaranville bmaranville mentioned this pull request May 9, 2023
@andyfaff (Contributor)

(From memory, multidimensional arrays (such as those used for resolution information) have to use an appropriate number of labelled columns in the .ort file.)

I agree that having OrsoDataset.data.shape == (ncols, npnts) makes more sense from a software POV (that's what this PR is proposing, right?). Besides the back-compat concerns that @aglavic raises, there is also the issue that we generally refer to "columns in a dataset", with e.g. Q/R data being two columns. Won't it be confusing when they turn up as two rows in OrsoDataset.data? Could we refer to them differently?

Still, I don't mind biting the bullet for the changes in this PR; it's still relatively early days.

I would definitely like to store multidimensional data in a file; I was waiting until we had a working HDF file rather than trying to stuff it into .ort. Would it be stored as separate HDF5 datasets rather than trying to keep it in the same HDF5 dataset as Q/R? The question would still come up of how we would keep it in an OrsoDataset.
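
One way to picture the "separate HDF5 datasets" option (a hypothetical layout, not an agreed ORSO/NeXus structure) is one dataset per column, so a 2-d resolution column does not have to share the Q/R table:

import h5py
import numpy as np

with h5py.File("example.orb", "w") as f:
    data = f.create_group("data")
    data.create_dataset("Qz", data=np.linspace(0.01, 0.3, 101))
    data.create_dataset("R", data=np.exp(-np.linspace(0.01, 0.3, 101)))
    data.create_dataset("dR", data=np.zeros((101, 15)))   # multidimensional column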

@bmaranville (Contributor, Author) commented May 10, 2023

There's a working hdf exporter in the PR next door #97, if you want to try it out and see how the data is structured. What I had envisioned was that data would be defined as one of

  • a list of arrays, with one array per column, e.g. [Qz <ndarray, shape=(101,)>, R <ndarray, shape=(101,)>, dR <ndarray, shape=(101, 15)>]
  • a 2-d array where the first dimension is ncolumns and the second is npoints (which can be treated exactly the same in the code as the first case)

Then this is compatible with the .ort writer by using the second form but transposed from the current order.
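
Concretely, the two envisioned forms could look like this (a sketch only; the shapes are illustrative):

import numpy as np

Qz = np.linspace(0.01, 0.3, 101)
R = np.exp(-Qz)
dR = np.zeros((101, 15))            # a multidimensional column, e.g. per-point resolution

data_as_list = [Qz, R, dR]          # first form: one array per column
data_as_array = np.stack([Qz, R])   # second form: shape (ncolumns, npoints); 1-d columns only

ort_table = data_as_array.T         # the .ort writer transposes back to (npoints, ncolumns)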

@bmaranville bmaranville reopened this May 10, 2023
@bmaranville (Contributor, Author)

Sorry I accidentally closed this while trying to type on my phone. Reopened now.

@aglavic (Collaborator) commented May 10, 2023

An alternative, if we want to allow multidimensional data, is to just define the data as a literal list of columns/arrays. The text version could then decide how to write based on the dimensionality of the list elements, and the hdf5 version could store each column as one dataset, which can then be of any dimensionality.

This would require some different tests in the class, but it would not break code that loads data from .ort files.
Another alternative would be some kind of future flag that switches the order, plus a deprecation warning, with the switch made at a later stage. I don't know, just brainstorming here.

@andyfaff (Contributor) commented Dec 5, 2023

a list of arrays, with one array per column, e.g. [Qz <ndarray, shape=(101,)>, R <ndarray, shape=(101,)>, dR <ndarray, shape=(101, 15)>]

I think I prefer this option. It allows dR to be something along the lines of (101, 31), which is the kind of array size that might be used.

@bmaranville (Contributor, Author)

It seems like we all agree on the "list of columns" approach.

@andyfaff (Contributor)

I wonder if we can make further progress with this PR now? (It would probably need the merge conflicts resolved, and {load,save}_nexus to be updated.)

It seemed like we settled on a list of columns approach?

How about:
If one sets up OrsoDataset with a numpy array, one assumes that the second dimension has the correct number of columns (and stores the array unchanged).
If one sets up OrsoDataset with a sequence, one assumes that each entry is a column (and stores the sequence converted to a tuple).

If one is saving to ORT, one concatenates the columns; if an individual column is multidimensional, raise an error.
If one is saving to ORB, save each column to an individual HDF dataset.

If one is loading from ORT, leave the code unchanged.
If one is loading from ORB, load with the sequence approach.
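
A minimal sketch of that dispatch (the helper names here are hypothetical, not the existing orsopy API) might look like:

import numpy as np

def normalise_data(data):
    """Store a 2-d array unchanged; treat any other sequence as one array per column."""
    if isinstance(data, np.ndarray):
        # assume the second dimension already holds the columns
        return data
    # sequence of columns: keep as a tuple, one (possibly multidimensional) array per column
    return tuple(np.asarray(col) for col in data)

def to_ort_table(columns):
    """Concatenate 1-d columns into the (npoints, ncols) table written to .ort."""
    for col in columns:
        if np.asarray(col).ndim != 1:
            raise ValueError(".ort cannot hold multidimensional columns")
    return np.column_stack(columns)

Saving to ORB would then write each entry of the tuple to its own HDF dataset, and loading from ORB would rebuild the tuple, as described above.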

@aglavic (Collaborator) commented Feb 26, 2024

These seem like good solutions to me.

In addition, I think it would be nice if "save_orso" and "load_orso" could handle both ".ort" and ".orb", as that would make it easier for software implementing them to support both file formats. (Either by suffix or, in the case of load, by checking the first line?)

@bmaranville (Contributor, Author)

For reading in, would a check of the "magic number" to see if the file is an HDF5 file be sufficient? I would rather not go by filename or extension. Sometimes those are mandated by facility policy and not under the users' (or instrument scientists') control.

@aglavic (Collaborator) commented Feb 27, 2024

Yes, my idea was to use the extension if provided as either .ort or .orb, and otherwise evaluate the file content in a straightforward way. I'm not sure what would be the easiest way, but the proposed magic number approach seems fast and easy and should give no false positives for the binary case.

@andyfaff (Contributor)

The first code section of load_orso could just be:

try:
    return load_nexus(...)
except ThisIsntAnHDFFileError:
    pass

# load ort as usual

@bmaranville (Contributor, Author)

The first code section of load_orso could just be:

try:
    return load_nexus(...)
except ThisIsntAnHDFFileError:
    pass

# load ort as usual

I think there is a function in h5py for "is this hdf5", which I would prefer over raising exceptions.
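
For reference, h5py provides such a check as h5py.is_hdf5(filename). A sketch of the content-based dispatch (assuming a load_nexus reader for the binary format, as discussed above, alongside the existing load_orso text reader) could be:

import h5py
from orsopy import fileio

def load_any(fname):
    # dispatch on file content (the HDF5 signature), not on the extension
    if h5py.is_hdf5(fname):
        return fileio.load_nexus(fname)  # hypothetical binary (.orb) reader from the NeXus PR
    return fileio.load_orso(fname)       # existing text (.ort) reader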

@aglavic (Collaborator) commented Feb 27, 2024

At some point in these functions we will have to raise an error when the format is not correct. So we could either perform this test in load_nexus and then catch the specific ORSO error (as proposed by Andrew), or make the check in the two functions separately. For me it wouldn't really matter.

In any case, the test for hdf5 should be done first, before any further validation of ORSO format compatibility. (As should a test for a text file in the case of the .ort reader.)

@bmaranville (Contributor, Author) commented Feb 27, 2024

As of Python 3.11, unpacking variable-length tuples with typing of the members is allowed (though specifying the types looks a little awkward): https://peps.python.org/pep-0646/#unpacking-unbounded-tuple-types . We could use this to express these constraints in the Python types, creating a subclass for e.g. QzColumn and requiring that the first element of Orso.columns is of type QzColumn.

Maybe we can look at this issue again in a few years, when 3.11 is the minimum supported Python.
Example QzColumn:

from dataclasses import dataclass
from typing import Literal, Optional

# assuming Column and ErrorColumn come from orsopy.fileio.base
from orsopy.fileio.base import Column, ErrorColumn


@dataclass
class QzColumn(Column):
    """
    Information about the Qz data column.
    """

    name: Literal["Qz"] = "Qz"
    unit: Optional[Literal["1/nm", "1/angstrom"]] = None
    physical_quantity = "momentum_transfer"


@dataclass
class QzErrorColumn(ErrorColumn):
    """
    Information about the error column associated with Qz.
    """

    error_of: Literal["Qz"] = "Qz"
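
On Python 3.11+, the PEP 646 syntax linked above would then let the constraint on Orso.columns be written roughly as follows (a sketch only, reusing QzColumn and Column from the snippet above):

# requires Python 3.11 (PEP 646): a QzColumn first, then any number of further Column entries
ColumnSpec = tuple[QzColumn, *tuple[Column, ...]]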
