
Don't redundantly serialize dynamic matrix element count #1221

Closed

Conversation

@Ralith (Collaborator) commented Mar 15, 2023

Inspired by #1220. Breaking change. Previously, matrices would be serialized through serde with a length tag, followed by element data, followed by dimensions. This change moves the dimensions to the front and multiplies them together to infer length, saving some space.

Could probably use testing, if the change is welcome.
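The proposed format change can be illustrated with a small stand-alone sketch (this is not the actual serde implementation in the PR; `encode_new` and `decode_new` are hypothetical names for illustration): the redundant length tag disappears from the wire format, and the deserializer validates the element count against the product of the dimensions instead.

```rust
// Hypothetical sketch of the proposed wire layout, not nalgebra's real code.
// Old layout: length tag, elements, then dimensions (the length is redundant).
// New layout: dimensions first; element count is inferred as nrows * ncols.

fn encode_new(nrows: usize, ncols: usize, elems: &[f64]) -> (usize, usize, Vec<f64>) {
    // The caller guarantees the element count matches the dimensions.
    assert_eq!(elems.len(), nrows * ncols);
    (nrows, ncols, elems.to_vec())
}

fn decode_new(nrows: usize, ncols: usize, elems: &[f64]) -> Option<Vec<f64>> {
    // No stored length tag: validate against the inferred count instead.
    if elems.len() == nrows * ncols {
        Some(elems.to_vec())
    } else {
        None
    }
}
```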

@Ralith force-pushed the reduced-vec-serialize branch 2 times, most recently from 193e557 to f3d4aa2 on March 16, 2023
@Ralith (Collaborator, Author) commented Mar 16, 2023

Ah, I see there's already test coverage, excellent.

@sebcrozet added the labels enhancement and need community feedback (feedback is needed from the community before implementing a solution) on Apr 7, 2023
@sebcrozet (Member)

Thank you for this PR! I’m not sure this is worth it considering the added complexity to the serialization code, just to save one usize from dynamic matrices (which are supposed to be somewhat large in the first place).
So I’d like to see some concrete use cases where saving that bit of storage has a significant impact.

@Ralith (Collaborator, Author) commented Apr 8, 2023

Yeah, I don't feel strongly that it's a win, but by the time the volume of code required was apparent it was mostly done, so I figured I'd offer it for review.

One issue is that this is a format-breaking change, so making it in response to a concrete use case in the future would be more disruptive than making it now. On the other hand, merging it now means more maintenance overhead in the intervening time.

@Andlon (Sponsor, Collaborator) commented Apr 14, 2023

While we're on this topic, I have been thinking lately that we might want to rethink how nalgebra data structures are serialized/deserialized altogether. I think we should consider different use cases. We can distinguish between:

  • machine-readable formats, possibly binary, like bincode or similar.
  • human readable formats, such as JSON.

These two use cases have very different requirements. For machine-readable formats, we want efficiency: serialization and deserialization should be fast, and the data should ideally be laid out as closely as possible to nalgebra's internal format to facilitate this.

For the second use case, both the current serialization format and the one proposed here are problematic. I have a concrete use case where I'm deserializing JSON as configuration (well, strictly speaking I'm using JSON5, but that's beside the point). So I might have a struct like

#[derive(Serialize, Deserialize)]
struct Config {
    // In practice I have other things than an inertia tensor,
    // but I'll use an example that might be familiar to people here
    inertia: Matrix3<f64>,
    velocity: Vector3<f64>,
}

Serializing this matrix to JSON with serde_json gives the following output:

{
  "inertia": [
    1.0,
    4.0,
    7.0,
    2.0,
    5.0,
    8.0,
    3.0,
    6.0,
    9.0
  ],
  "velocity": [
    1.0,
    2.0,
    3.0
  ]
}

While this (coincidentally) works well for vectors, as their layout is essentially unambiguous, it's clearly very suboptimal for matrices.

Anyway, my Config struct here was specialized to 3D. Let's say I want to support dimensions 1, 2 and 3. Then I might replace the structures here with dynamic matrices/vectors and choose the appropriate code path at runtime:

#[derive(Serialize, Deserialize)]
struct Config {
    // In practice I have other things than an inertia tensor,
    // but I'll use an example that might be familiar to people here
    inertia: DMatrix<f64>,
    velocity: DVector<f64>,
}

The (current) output is then:

{
  "inertia": [
    [
      1.0,
      4.0,
      7.0,
      2.0,
      5.0,
      8.0,
      3.0,
      6.0,
      9.0
    ],
    3,
    3
  ],
  "velocity": [
    [
      1.0,
      2.0,
      3.0
    ],
    3,
    null
  ]
}

I don't think this is "sufficiently human-readable" (also, why null? That's weird ...), and the changes proposed here don't improve on this (by the way, there appears to be no length tag here?). In my opinion, for human-readable formats (you can query this with (De)serializer::is_human_readable), we should serialize/deserialize matrices as row-major nested arrays, so that the above would yield

{
	"inertia": [[1.0, 2.0, 3.0],
		    [4.0, 5.0, 6.0],
		    [7.0, 8.0, 9.0]],
	"velocity": [[1.0], [2.0], [3.0]]
}

It's a little unfortunate that a vector could then no longer be serialized as [1, 2, 3], however. We could make an exception: if we know that we have e.g. a DVector, we could represent it as a flat array, and I think maybe we should. This would mean that DVector and DMatrix would serialize differently (but they could still be lenient in the input they accept for deserialization).

Pretty-printed JSON would unfortunately not align like my example here, you'd typically have all numbers laid out vertically, but this is format-specific and out of our control.

This is important not just for serialization but crucially also for deserialization, so that you can give intuitive, human-readable input.
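The column-major-to-row-major display step described above can be sketched as a plain function (illustrative only; `to_nested_rows` is not part of nalgebra), assuming the input slice is in nalgebra's column-major element order:

```rust
// Illustrative sketch: convert a column-major element slice (nalgebra's
// internal order) into row-major nested rows for human-readable output.
fn to_nested_rows(nrows: usize, ncols: usize, data: &[f64]) -> Vec<Vec<f64>> {
    assert_eq!(data.len(), nrows * ncols);
    (0..nrows)
        .map(|i| (0..ncols).map(|j| data[j * nrows + i]).collect())
        .collect()
}
```

With the inertia example above, the column-major data [1, 4, 7, 2, 5, 8, 3, 6, 9] comes out as the rows [[1, 2, 3], [4, 5, 6], [7, 8, 9]].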

I'd be interested in hearing your thoughts here. Sorry for blowing up the scope of the PR, but I believe that we cannot take this kind of breaking change lightly, because it might completely (and silently) break code that relies on the current serialization behavior. Therefore I'd like to bring up these additional concerns for discussion, to make sure we only make breaking changes like these after taking everything into consideration.

@Ralith (Collaborator, Author) commented Apr 15, 2023

In my opinion, for human-readable formats (you can query this with (De)serializer::is_human_readable) we should serialize/deserialize matrices as row-major nested arrays

I agree this is much more readable. I worry that it might be surprising for the ordering of elements to be transposed when changing between serialization formats.

My experience is that slapping Serialize/Deserialize derives on in-memory structures rarely produces optimally user-friendly configuration schemas outside of trivial cases, so I'm not sure investing a lot of complexity in optimizing that case makes sense. If you want to use serde at all, better results can be had by defining a "config schema" type that you derive serde traits for, and then manually convert into your in-memory structure (where real nalgebra types are used). You can then easily get the desired behavior, e.g. by using an array type in the schema structure and transposing it during conversion.
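The "config schema" pattern described here might look something like the following minimal sketch (`InertiaConfig` and `ColMajor3` are hypothetical illustrative types; `ColMajor3` merely stands in for a column-major matrix type such as Matrix3<f64>):

```rust
// Hypothetical sketch of the "config schema" pattern. The schema type uses
// row-major nested arrays, as a human would write them in JSON; conversion
// into the column-major in-memory type transposes the elements.

// In practice this type would carry the Serialize/Deserialize derives.
struct InertiaConfig {
    // Row-major, e.g. [[1, 2, 3], [4, 5, 6], [7, 8, 9]] in JSON.
    inertia: [[f64; 3]; 3],
}

// Stand-in for a column-major matrix type like nalgebra's Matrix3<f64>.
struct ColMajor3 {
    data: [f64; 9],
}

impl From<InertiaConfig> for ColMajor3 {
    fn from(cfg: InertiaConfig) -> Self {
        let mut data = [0.0; 9];
        for i in 0..3 {
            for j in 0..3 {
                // Transpose row-major input into column-major storage.
                data[j * 3 + i] = cfg.inertia[i][j];
            }
        }
        ColMajor3 { data }
    }
}
```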

@Andlon (Sponsor, Collaborator) commented Apr 17, 2023

I agree this is much more readable. I worry that it might be surprising for the ordering of elements to be transposed when changing between serialization formats.

I share your concern. Perhaps we could still serialize as a nested array for human-readable formats, though, even though it's column-major. I wish there were a way to indicate ordering, though, to make self-describing formats actually self-describing.

My experience is that slapping Serialize/Deserialize derives on in-memory structures rarely produces optimally user-friendly configuration schemas outside of trivial cases, so I'm not sure investing a lot of complexity in optimizing that case makes sense. If you want to use serde at all, better results can be had by defining a "config schema" type that you derive serde traits for, and then manually convert into your in-memory structure (where real nalgebra types are used). You can then easily get the desired behavior, e.g. by using an array type in the schema structure and transposing it during conversion.

I think my specific use case here might be a bit unusual, because the "user" would be a developer using a framework in which they provide their own arbitrary serializable configuration struct. It's kind of like a "Spring Boot" or some other "web app quickstart framework" but for physics simulation. The idea is that each experiment/scene has its own completely specialized configuration struct, and it should be as easy as possible to change your configuration as you change your simulation. Slapping Serialize/Deserialize is perfect for this case. I think I can provide a wrapper type around Matrix that serializes in a more intuitive fashion though. Maybe we could even provide this in nalgebra? 🤔

@tpdickso (Collaborator)

I can highly recommend the wrapper approach -- I think when it comes to human-readable data formats, or more generally archival data-description formats, having wrappers around literally everything nontrivial is a good idea simply to make your data files resilient against version changes. The ability to then customize the interface is a bonus. The built-in serialization implementation can be better left to "disposable" applications, like network traffic and persistent caching. (Even then, there are often benefits to a thin wrapper, like being able to compress floating-point data.)

@Ralith (Collaborator, Author) commented Feb 10, 2024

Closing since no action is expected on this PR, in any case.

@Ralith Ralith closed this Feb 10, 2024