Codec proposal: N-Quads (RDF format) #180

Open
joeltg opened this issue Jun 30, 2020 · 9 comments

Comments

@joeltg

joeltg commented Jun 30, 2020

Not sure if this is the right way to bring this up, but I'd like to propose adding a codec for N-Quads files. RDF is the graph data model for the semantic web, and although N-Quads is just one of many RDF serializations, it's commonly regarded as the lowest-level representation with the most regular structure and the least syntactic sugar.

In particular, N-Quads is the output format of the Universal RDF Dataset Normalization Algorithm (URDNA2015) (also brought up in this issue). URDNA2015 is a big deal for the RDF world because it produces a canonical representation (i.e. two isomorphic datasets will produce the exact same serialized N-Quads string). That canonical form is required for all the digital signatures work that's starting to happen, and it's a representation that people will commonly want to hash!

This would also enable a natural interpretation of RDF datasets as IPLD objects, using an IPLD schema for the RDFJS data model with N-Quads as a custom representation.

I see this as a great concrete foundation for bringing the semantic web & decentralized web communities closer together. Is this the kind of codec we're open to adding? Would it be appropriate to open a pull request to table.csv?
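
To make the hashing use case above concrete, here is a minimal Python sketch. It does not implement URDNA2015. The function `toy_canonicalize` is a hypothetical stand-in that only dedupes and sorts lines, which is only meaningful under the assumption that the dataset has no blank nodes and every line is already in canonical N-Quads syntax. SHA-256 is also an assumption; the thread does not name a hash function.

```python
import hashlib


def toy_canonicalize(nquads: str) -> str:
    """Stand-in for URDNA2015 (NOT the real algorithm): dedupe and sort lines.

    Assumes no blank nodes and lines already in canonical N-Quads syntax.
    Real canonicalization also has to relabel blank nodes deterministically.
    """
    lines = {line.strip() for line in nquads.splitlines() if line.strip()}
    return "".join(line + "\n" for line in sorted(lines))


def dataset_digest(nquads: str) -> str:
    """Hash the canonical serialization so equal datasets get equal digests."""
    canonical = toy_canonicalize(nquads)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# The same two statements, serialized in different orders.
a = (
    '<http://example.com/s> <http://example.com/p> "1" .\n'
    '<http://example.com/s> <http://example.com/q> "2" .\n'
)
b = (
    '<http://example.com/s> <http://example.com/q> "2" .\n'
    '<http://example.com/s> <http://example.com/p> "1" .\n'
)

print(dataset_digest(a) == dataset_digest(b))
```

The point is only that a canonical byte string is what makes the digest stable across serializations of the same dataset.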

@joeltg
Author

joeltg commented Jun 30, 2020

IPLD <-> RDF interop has also been discussed a few times in the past, without concrete results:

@rvagg
Member

rvagg commented Jul 1, 2020

I suspect @mikeal and @vmx will have more mature thoughts about RDF than me, but I'd say that in general multicodec can be used to disambiguate types of objects where any such ambiguity exists. It's not strictly tied to IPLD, although IPLD is a logical consumer of multicodecs. Where something is being transmitted or stored and you want to ensure clarity about what type of thing it is, multicodec should be helpful.

So with that in mind, if you have a use-case where that's applicable, IPLD or not, then an entry in the multicodec table would be a good thing. My preference would be to add things where there are concrete examples of them existing in the wild where multicodec could be applied, or at least concrete plans on how they could be applied, but we're taking a fairly relaxed approach to that lately and the idea of explicitly labelling things as "draft" for this purpose is on the cards: #165

Do you see a path to this being used any time soon, or would this be more of a symbolic move for now, saying that multicodec & RDF have potential connectivity?

@joeltg
Author

joeltg commented Jul 1, 2020

I know that I'd use it right away! For the Underlay we're currently storing and referencing lots of N-Quads files as raw objects - including linking to N-Quads files from other N-Quads files using a dweb:/ipld/ URI format (all identifiers in RDF are URIs). One use case we'd really like to pull off is using CAR archives (or something similar) to collect and package all transitively linked files, so we want to be able to tell whether a CID is an N-Quads file, and we want IPLD to know how to traverse its links.
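
As a rough illustration of "telling whether a CID is an N-Quads file": a binary CIDv1 starts with a version varint followed by a multicodec varint, so the content type can be read without fetching the block. The sketch below is self-contained Python under stated assumptions: `NQUADS_PLACEHOLDER` is an arbitrary placeholder value, not an assigned code, and the helper names (`make_cidv1`, `cid_codec`, `is_nquads`) are hypothetical.

```python
import hashlib
from typing import Tuple

RAW = 0x55                      # multicodec code for raw binary
SHA2_256 = 0x12                 # multihash code for sha2-256
NQUADS_PLACEHOLDER = 0x300001   # HYPOTHETICAL: placeholder, not an assigned code


def encode_varint(n: int) -> bytes:
    """Encode an unsigned integer as an unsigned varint (7 bits per byte)."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(data: bytes, offset: int = 0) -> Tuple[int, int]:
    """Decode an unsigned varint; return (value, offset after the varint)."""
    value = 0
    shift = 0
    while True:
        byte = data[offset]
        offset += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, offset
        shift += 7


def make_cidv1(codec: int, payload: bytes) -> bytes:
    """Build a binary CIDv1: version, codec, then a sha2-256 multihash."""
    digest = hashlib.sha256(payload).digest()
    multihash = encode_varint(SHA2_256) + encode_varint(len(digest)) + digest
    return encode_varint(1) + encode_varint(codec) + multihash


def cid_codec(cid: bytes) -> int:
    """Return the multicodec code carried by a binary CIDv1."""
    version, offset = decode_varint(cid)
    if version != 1:
        raise ValueError("only binary CIDv1 is handled in this sketch")
    codec, _ = decode_varint(cid, offset)
    return codec


def is_nquads(cid: bytes) -> bool:
    return cid_codec(cid) == NQUADS_PLACEHOLDER


data = b'<http://example.com/s> <http://example.com/p> "1" .\n'
print(is_nquads(make_cidv1(NQUADS_PLACEHOLDER, data)))
print(is_nquads(make_cidv1(RAW, data)))
```

With the files stored as raw objects, both CIDs would carry 0x55 and the distinction would be lost, which is the gap described above.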

@mikeal
Contributor

mikeal commented Jul 1, 2020

Is there utility you’d get out of an IPLD representation beyond raw though? My understanding is that links in this format are not addressed by hash, so there’s no way to represent them as links in IPLD, so you’re never actually going to get a graph for this format even if there’s a codec.

The only thing a codec would give you is a Data Model representation of the file format (for this it would just be JSON types). But you’d have to keep the serialized representation below the block size limit (1 MB), which is going to be hard: because the format doesn’t link by hash, you don’t have a way to link between blocks in IPLD to handle N-Quads files that are larger than the limit.

That said, if you can get some utility out of it there’s no real barrier to adding the codec as long as we document these constraints, I’d just caution against using it if you’re going to be encoding large data structures this way.

@joeltg
Author

joeltg commented Jul 2, 2020

> Is there utility you’d get out of an IPLD representation beyond raw though?

Yes! It would give us a way of referencing individual quads in a dataset (using integer index paths), which we want to do for tracing provenance. There's no widely accepted method for doing this in the RDF world right now.

You're right that the graph structure (what nodes are connected by what edges) won't be directly represented in IPLD - but it couldn't be even if we tried, since RDF is a directed labelled multigraph (i.e. possibly containing cycles).

I understand that codecs are a different abstraction level than the IPLD data model, and that there would have to be different representation strategies for 1 MB+ datasets, but I still see this as having real utility as a building block for people working to decentralize RDF.
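
The "integer index paths" idea above can be sketched as follows. This is a simplified illustration, not a conforming N-Quads parser: the regular expression, the dict layout (`subject`, `predicate`, `object`, `graph`), and the `resolve` helper are all assumptions made for the example.

```python
import re

# Simplified term matcher: IRIs, blank nodes, and literals with an optional
# datatype or language tag. Assumes a space before the closing " .".
TERM = re.compile(
    r'<[^>]*>|_:\S+|"(?:[^"\\]|\\.)*"(?:\^\^<[^>]*>|@[A-Za-z0-9-]+)?'
)
KEYS = ("subject", "predicate", "object", "graph")


def parse_nquads(text: str) -> list:
    """Parse N-Quads text into a flat list of quads, in file order."""
    quads = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        terms = TERM.findall(line)
        if len(terms) not in (3, 4) or not line.endswith("."):
            raise ValueError("unsupported line: " + line)
        quad = dict(zip(KEYS, terms))
        quad.setdefault("graph", None)  # None stands for the default graph
        quads.append(quad)
    return quads


def resolve(quads: list, path: list):
    """Follow a path such as [1, "object"] into the flat quad list."""
    node = quads
    for segment in path:
        node = node[segment]
    return node


doc = (
    "<http://example.com/a> <http://example.com/knows> "
    "<http://example.com/b> <http://example.com/g> .\n"
    '<http://example.com/a> <http://example.com/name> "Alice Smith"@en .\n'
)
quads = parse_nquads(doc)
print(resolve(quads, [1, "object"]))
print(resolve(quads, [0, "graph"]))
```

Because the quads form a flat list, the integer index is stable for a given canonical file, so a path like `[1, "object"]` paired with the file's hash identifies one term in one statement.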

@jonnycrunch

@joeltg I went down the RDF-over-IPLD path and ran into the fact that RDF graphs contain cycles and thus wouldn't be a good fit for IPLD.

@joeltg
Author

joeltg commented Jul 10, 2020

@jonnycrunch the IPLD data model representation of an N-Quads file wouldn't represent the dataset "directly" by having nodes be maps and edges be keys like in JSON-LD, it would represent the dataset at the lower-level RDFJS Data Model, as a flat array of quads.

IPLD data model stuff could be its own conversation; this issue is just about getting an N-Quads multicodec.

@vmx
Member

vmx commented Jul 10, 2020

Multicodecs describe a lot. We started to put them into categories. One of them is "ipld" to describe codecs that make sense within the IPLD ecosystem. I don't think it's written down anywhere, but I think formats in that category need to support at least Links. Obviously that's not the case for N-Quads.

So we could put it into another category. Then it would be just an identifier of how things are encoded. I think it would be OK to add such a code, but it won't add much value to IPLD. IPLD might link to an N-Quads file, but that would always be the end of the traversal (a sink), just like the raw codec.

@OR13
Contributor

OR13 commented Aug 14, 2020

This is very interesting... I did some related CBOR work here:

https://github.com/transmute-industries/decentralized-cbor

in particular, I represent ZLIB_Compressed_NQuads as CBOR... providing compressed representation for JSON-LD with bi-directional transformation between CBOR and JSON-LD....

There is also work in progress on CBOR-LD as well.... (and obviously DAG_CBOR which powers IPLD).

I agree with vmx, N-Quads are the end of pure IPLD, but there is nothing stopping you from leaving IPLD and following them further... for example, across DIDs or URIs in the N-Quads...

IPLD1 -> IPLD2 -> NQuads  -> did:sov:123
                          -> did:ethr:456
                          -> https://public.oracle.example.com/credentials/123
                          -> https://ipfs.io/CID
                          -> IPLD3       

Some DIDs rely on multicodec already like did:key, and obviously any IRI in an N-Quad might rely on multicodec as well.
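
One way to picture "leaving IPLD and following them further" is to group the IRIs found in an N-Quads file by scheme, so that a caller can decide which resolver to hand each one to. The Python sketch below is an illustration only: the category names, the `classify` and `outbound_references` helpers, and the `EXAMPLE_CID` placeholder are hypothetical, and the `dweb:/ipld/` prefix is taken from the URI format mentioned earlier in the thread.

```python
import re
from collections import defaultdict

# Naive IRI matcher: anything between angle brackets. A real parser would
# skip angle brackets that appear inside string literals.
IRI = re.compile(r"<([^>]*)>")


def classify(iri: str) -> str:
    """Pick a (hypothetical) resolver category from the IRI scheme."""
    if iri.startswith("dweb:/ipld/"):
        return "ipld"   # traversal could re-enter IPLD here
    if iri.startswith("did:"):
        return "did"    # hand off to a DID resolver
    if iri.startswith(("http://", "https://")):
        return "web"    # hand off to an HTTP client
    return "other"


def outbound_references(nquads: str) -> dict:
    """Group every IRI in the file by resolver category."""
    groups = defaultdict(set)
    for line in nquads.splitlines():
        for iri in IRI.findall(line):
            groups[classify(iri)].add(iri)
    return {kind: sorted(iris) for kind, iris in groups.items()}


doc = (
    "<did:sov:123> <https://example.com/vocab#credential> "
    "<https://public.oracle.example.com/credentials/123> .\n"
    "<did:sov:123> <https://example.com/vocab#source> "
    "<dweb:/ipld/EXAMPLE_CID> .\n"  # EXAMPLE_CID is a placeholder, not a real CID
)
refs = outbound_references(doc)
for kind in sorted(refs):
    print(kind, refs[kind])
```

From IPLD's point of view the N-Quads block is still a sink; any further traversal like this happens in application code.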

7 participants
@mikeal @vmx @rvagg @jonnycrunch @OR13 @joeltg and others