Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for RFC-7049 Canonical CBOR key ordering #13

Closed
vmx opened this issue Apr 5, 2022 · 12 comments
Closed

Support for RFC-7049 Canonical CBOR key ordering #13

vmx opened this issue Apr 5, 2022 · 12 comments

Comments

@vmx
Copy link
Contributor

vmx commented Apr 5, 2022

This library explicitly specifies RF-8949, so this request might be out of scope.

In the project I want to use cbor4ii for I'm stuck with RFC-7049 Canonical CBOR key ordering. This means that keys are sorted by their length first. I wonder if that could perhaps be added behind a feature flag. Here is an implementation the seems to work. I didn't create a PR as this clearly needs more discussion first.

    fn collect_map<K, V, I>(self, iter: I) -> Result<(), Self::Error>
    where
        K: ser::Serialize,
        V: ser::Serialize,
        I: IntoIterator<Item = (K, V)>,
    {
        #[cfg(not(feature = "use_std"))]
        use crate::alloc::vec::Vec;
        use serde::ser::SerializeMap;

        // TODO vmx 2022-04-04: This could perhaps be upstreamed, or the
        // `cbor4ii::serde::buf_writer::BufWriter` could be made public.
        impl enc::Write for Vec<u8> {
            type Error = crate::alloc::collections::TryReserveError;

            fn push(&mut self, input: &[u8]) -> Result<(), Self::Error> {
                self.try_reserve(input.len())?;
                self.extend_from_slice(input);
                Ok(())
            }
        }

        // CBOR RFC-7049 specifies a canonical sort order, where keys are sorted by length first.
        // This was later revised with RFC-8949, but we need to stick to the original order to stay
        // compatible with existing data.
        // We first serialize each map entry into a buffer and then sort those buffers. Byte-wise
        // comparison gives us the right order as keys in DAG-CBOR are always strings and prefixed
        // with the length. Once sorted they are written to the actual output.
        let mut buffer: Vec<u8> = Vec::new();
        let mut mem_serializer = Serializer::new(&mut buffer);
        let mut serializer = Collect {
            bounded: true,
            ser: &mut mem_serializer,
        };
        let mut entries = Vec::new();
        for (key, value) in iter {
            serializer.serialize_entry(&key, &value)
               .map_err(|_| enc::Error::Msg("Map entry cannot be serialized.".into()))?;
            entries.push(serializer.ser.writer.clone());
            serializer.ser.writer.clear();
        }

        TypeNum::new(major::MAP << 5, entries.len() as u64).encode(&mut self.writer)?;
        entries.sort_unstable();
        for entry in entries {
            self.writer.push(&entry)?;
        }

        Ok(())
    }

I'd also like to note that I need even more changes for my use case (it's a subset of CBOR), for which I will need to fork this library. Nonetheless I think it would be a useful addition and I'd also prefer if the fork would be as minimal as possible. I thought I bring it up, to make clear that it won't be a showstopper if this change wouldn't be accepted.

@quininer
Copy link
Owner

quininer commented Apr 5, 2022

I think that's a bit beyond the scope of this crate.

Can this be done outside of this crate? we could have a serde-cbor4ii-canonical crate that wraps the Serializer and overrides the collect_map method.

like

struct CanonicalSerializer<W>(Cbor4iiSerializer<W>);

impl<W: Write> Serializer for &mut CanonicalSerializer<W> {
    type SerializeSeq = <Cbor4iiSerializer<W> as Serializer>::SerializeSeq;

    // forward to Cbor4iiSerializer

    fn collect_map<K, V, I>(self, iter: I) -> Result<(), Self::Error> {
        // your code
    }
}

@vmx
Copy link
Contributor Author

vmx commented Apr 5, 2022

I think that's a bit beyond the scope of this crate.

That's fine. But it brings me to the next point (let me know if I should open a separate issue). As you can see on that implementation, I need access to some things (like TypeNum) which isn't public yet. What I think would be great if cbor4ii could be split into two crates, one core one and one for Serde (similar to what ciborium is doing. This way I could depend on the core crate, but create my own custom Serde implementation. That would help me a lot.

@quininer
Copy link
Owner

quininer commented Apr 5, 2022

cbor4ii exposes MapStartBounded, you don't have to access TypeNum, You can notice that TypeNum is also not accessed in the serde mod.
also if you don't turn on the serde1 feature then it's a core crate and it doesn't need to split out another crate.

The thing I can think of that prevents CanonicalSerializer from being impl externally is that Serializer might need to expose an as_mut to allow CanonicalSerializer to bypass the serde api for writes.

@vmx
Copy link
Contributor Author

vmx commented Apr 6, 2022

cbor4ii exposes MapStartBounded, you don't have to access TypeNum

Thanks! I missed that one, that is exactly what I needed.

also if you don't turn on the serde1 feature then it's a core crate and it doesn't need to split out another crate.

Oh right, I should've thought of this, I'll just do that.

The thing I can think of that prevents CanonicalSerializer from being impl externally is that Serializer might need to expose an as_mut to allow CanonicalSerializer to bypass the serde api for writes.

As I mentioned, I need a few other changes that are not really general purpose (like this sort order), hence don't make sense to be upstreamed/published in a generic way. So I thought I just fork the whole Serde part and copy it over. Though the idea of wrapping the serializer sounds interesting. I'll have a look and come back to you in case I need additional things exposed.


One more thing in regards to the original issue. I find it quite useful if you're able to serialize things into memory. In the above code I've implemented impl enc::Write for Vec<u8>. I've also seen that some tests of this library are using a BufWriter. It would be cool to have this kind of functionality (in whichever way) in the public API of the core library, so that it can be re-used.

@quininer
Copy link
Owner

quininer commented Apr 6, 2022

I didn't initially think about how to expose them, so I chose not to. Currently you can implement enc::Write for your own types.

I would consider adding a mod that provides these helper type.

@vmx
Copy link
Contributor Author

vmx commented Apr 6, 2022

I would consider adding a mod that provides these helper type.

I find the impl enc::Write for Vec<u8> quite convenient to use. Should I create a PR implementing this? Or should I create a PR exposing BufWriter or should I just wait?

@quininer
Copy link
Owner

quininer commented Apr 6, 2022

I would like to expose a BufWriter type, it would be nice if you were willing to open a PR.

@vmx
Copy link
Contributor Author

vmx commented Apr 8, 2022

Thanks a lot for all the help. I think I can now solely rely on the core features (without serde1) and do my own Serde implementation (I still want to experiment with wrapping yours, but that's not related to this issue.

For those who need this kind of collation, this is the version updated for cbor4ii 0.2.20:

fn collect_map<K, V, I>(self, iter: I) -> Result<(), Self::Error>
where
    K: ser::Serialize,
    V: ser::Serialize,
    I: IntoIterator<Item = (K, V)>,
{
    use serde::ser::SerializeMap;
    #[cfg(not(feature = "use_std"))]
    use crate::alloc::vec::Vec;
    use crate::core::utils::BufWriter;

    // CBOR RFC-7049 specifies a canonical sort order, where keys are sorted by length first.
    // This was later revised with RFC-8949, but we need to stick to the original order to stay
    // compatible with existing data.
    // We first serialize each map entry into a buffer and then sort those buffers. Byte-wise
    // comparison gives us the right order as keys in DAG-CBOR are always strings and prefixed
    // with the length. Once sorted they are written to the actual output.
    let mut buffer = BufWriter::new(Vec::new());
    let mut mem_serializer = Serializer::new(&mut buffer);
    let mut serializer = Collect {
        bounded: true,
        ser: &mut mem_serializer,
    };
    let mut entries = Vec::new();
    for (key, value) in iter {
        serializer.serialize_entry(&key, &value)
           .map_err(|_| enc::Error::Msg("Map entry cannot be serialized.".into()))?;
        entries.push(serializer.ser.writer.buffer().to_vec());
        serializer.ser.writer.clear();
    }

    enc::MapStartBounded(entries.len()).encode(&mut self.writer)?;
    entries.sort_unstable();
    for entry in entries {
        self.writer.push(&entry)?;
    }

    Ok(())
}

@vmx vmx closed this as completed Apr 8, 2022
@vmx
Copy link
Contributor Author

vmx commented Apr 11, 2022

I re-open this issue as I'm coming back to your proposal from #13 (comment), where I try to wrap your serializer.

It almost works. The problem is the serializer to serialize the map. In the code from the comment above, I use:

let mut serializer = Collect {
    bounded: true,
    ser: &mut mem_serializer.0,
};

But currently Collect isn't public (with good reason).

I then tried it with:

let mut serializer = Self::SerializeMap {
    bounded: true,
    ser: &mut mem_serializer.0,
};

But that errors with:

error[E0308]: mismatched types
   --> src/ser.rs:290:16
    |
78  | impl<'a, W: enc::Write> ser::Serializer for &'a mut Serializer<W> {
    |          - this type parameter
...
290 |           ser: &mut mem_serializer.0,
    |                ^^^^^^^^^^^^^^^^^^^^^ expected type parameter `W`, found `&mut cbor4ii::core::utils::BufWriter`
    |
    = note: expected mutable reference `&mut cbor4ii::serde::Serializer<W>`
               found mutable reference `&mut cbor4ii::serde::Serializer<&mut cbor4ii::core::utils::BufWriter>`

Next try (thanks to the help from a colleague) was:

let mut serializer = <&mut Serializer<_> as serde::Serializer>::SerializeMap {
    bounded: true,
    ser: &mut mem_serializer.0,
};

That would work, but sadly only on nightly due to rust-lang/rust#86935.

So what to do now? Is making Collect public an option (it would also need to have constructor then)?

@vmx vmx reopened this Apr 11, 2022
@quininer
Copy link
Owner

quininer commented Apr 11, 2022

We should be able to implement a separate Collect for this with just a few lines of code.

or we just need to serialize these two objects into the same buffer, it doesn't necessarily need SerializeMap.

like

    let mut buffer = BufWriter::new(Vec::new());
    let mut entries = Vec::new();
    for (key, value) in iter {
        let mut mem_serializer = Serializer::new(&mut buffer);
        key.serialize(&mut mem_serializer)?;
        value.serialize(&mut mem_serializer)?;
        entries.push(buffer.buffer().to_vec());
        buffer.clear();
    }

@vmx
Copy link
Contributor Author

vmx commented Apr 11, 2022

or we just need to serialize these two objects into the same buffer, it doesn't necessarily need SerializeMap.

Wow, that's even better. I've been spending so much time on Serde, but I'm still not good at it. Thanks a lot this works! I just need to look into some error handling issues, but that should be easy.

@vmx
Copy link
Contributor Author

vmx commented Apr 11, 2022

Here comes the full source of the updated version:

fn collect_map<K, V, I>(self, iter: I) -> Result<(), Self::Error>
where
    K: ser::Serialize,
    V: ser::Serialize,
    I: IntoIterator<Item = (K, V)>,
{
    // CBOR RFC-7049 specifies a canonical sort order, where keys are sorted by length first.
    // This was later revised with RFC-8949, but we need to stick to the original order to stay
    // compatible with existing data.
    // We first serialize each map entry into a buffer and then sort those buffers. Byte-wise
    // comparison gives us the right order as keys in DAG-CBOR are always strings and prefixed
    // with the length. Once sorted they are written to the actual output.
    let mut buffer = BufWriter::new(Vec::new());
    let mut entries = Vec::new();
    for (key, value) in iter {
        let mut mem_serializer = Serializer::new(&mut buffer);
        key.serialize(&mut mem_serializer)
            .map_err(|_| enc::Error::Msg("Map key cannot be serialized.".into()))?;
        value
            .serialize(&mut mem_serializer)
            .map_err(|_| enc::Error::Msg("Map key cannot be serialized.".into()))?;
        entries.push(buffer.buffer().to_vec());
        buffer.clear();
    }

    enc::MapStartBounded(entries.len()).encode(&mut self.0.writer())?;
    entries.sort_unstable();
    for entry in entries {
        self.0.writer().push(&entry)?;
    }

    Ok(())
}

My Serializer looks like this:

pub struct Serializer<W>(cbor4ii::serde::Serializer<W>);

impl<W> Serializer<W> {
    pub fn new(writer: W) -> Serializer<W> {
        Serializer(Cbor4iiSerializer::new(writer))
    }
}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants