barmintor/wtfcrepo3

wtfcrepo3: A FAQ for Fedora Commons Repository v3.x

  1. What's a simple description of Fedora Commons Repository (FCRepo), conceptually speaking?
    FCRepo is a web-facing system for identifying resources, describing them with shallow RDF, and attaching file-like contents to them. An FCRepo repository is like a network graph with files dangling off it. That metaphor drives the two fundamental concepts in FCRepo 3: objects and datastreams.

  2. Ok, what are FCRepo Objects?
    An object is a node in the repository graph. It is identified with a local ID called its PID. The PID consists of two parts: a namespace and a name (usually a sequentially generated number), joined with a colon (':'). In the repository, this PID is used to compose the URI for an object, which will be in the form "info:fedora/$PID". Objects have properties and datastreams.
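
    As a minimal sketch of the PID and URI conventions just described (the function names here are illustrative, not part of FCRepo), the split and the URI composition might look like this in Python:

```python
from typing import Tuple


def split_pid(pid: str) -> Tuple[str, str]:
    """Split a PID such as 'my:1' into its (namespace, name) parts."""
    namespace, sep, name = pid.partition(":")
    if not sep or not namespace or not name:
        raise ValueError(f"not a namespace:name PID: {pid!r}")
    return namespace, name


def pid_to_uri(pid: str) -> str:
    """Compose the internal object URI, in the form info:fedora/$PID."""
    split_pid(pid)  # validate the shape before composing the URI
    return f"info:fedora/{pid}"


print(split_pid("my:1"))
print(pid_to_uri("my:1"))
```

    The validation is deliberately loose: it checks only for the namespace, the colon, and the name, not for whatever character restrictions a given repository enforces.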

  3. And FCRepo Datastreams?
    Datastreams are streams of bytes identified in the context of a FCRepo object. Like objects, they have properties. It's no accident that they sound a lot like files; they usually start out that way. But they may also be pointers to streams of bytes, in the form of a URL. Datastreams and their contents can be versioned in FCRepo 3.

  4. How are simple FCRepo objects structured?
    Objects minimally have a PID, some core properties, and the 3 required system datastreams: DC, RELS-EXT, and AUDIT. There is also an optional system datastream: RELS-INT.

  5. DC? That's great, can we just use our existing Dublin Core?
    Not exactly. FCRepo 3 expects the DC datastream to consist of only the 15 core DC elements, serialized as XML, and wrapped in a container element borrowed from OAI. FCRepo also inserts the PID as a dc:identifier value. When the DC stream is updated, its contents are parsed and re-serialized, so it is not naively possible to verify the contents post-update by checksum. The DC datastream was intended to contain minimal description to assist repository administrators. Its contents are searchable in a rudimentary application bundled with FCRepo, and indexed into the backing triplestore if it is enabled.
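
    Since a checksum will not survive the re-serialization, one alternative is to compare the parsed element values instead. The following Python sketch assumes the usual oai_dc and dc namespace URIs; the "stored" document is a made-up example of what a re-serialized stream might look like, not actual FCRepo output, and the helper names are hypothetical:

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"


def dc_values(xml_text):
    """Return the sorted (element name, text) pairs found in a DC document."""
    root = ET.fromstring(xml_text)
    pairs = []
    for child in root:
        if child.tag.startswith("{" + DC_NS + "}"):
            local_name = child.tag.split("}", 1)[1]
            pairs.append((local_name, (child.text or "").strip()))
    return sorted(pairs)


def matches_after_update(submitted_xml, stored_xml, pid):
    """Compare DC content by value, allowing for the inserted PID identifier."""
    expected = dc_values(submitted_xml)
    if ("identifier", pid) not in expected:
        expected.append(("identifier", pid))
    return sorted(expected) == dc_values(stored_xml)


submitted = """<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
    xmlns:dc="http://purl.org/dc/elements/1.1/"><dc:title>Example</dc:title></oai_dc:dc>"""

# Hypothetical re-serialization: different whitespace, plus the PID identifier.
stored = """<oai_dc:dc xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/">
  <dc:title>Example</dc:title>
  <dc:identifier>my:1</dc:identifier>
</oai_dc:dc>"""

print(submitted == stored)
print(matches_after_update(submitted, stored, "my:1"))
```

    The byte-level comparison fails while the value-level comparison succeeds, which is the situation the checksum caveat above describes.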

  6. What's RELS-EXT?
    RELS-EXT is an RDF/XML serialization of triples for which the containing FCRepo object is the subject. It also has a format requirement: the container element must be an rdf:Description, with an "about" attribute indicating the object by internal URI (this means that rdf:type assertions must be made explicitly as contained statements). Like the DC stream, FCRepo will parse and re-serialize RELS-EXT data. The contents of RELS-EXT are indexed into the backing triplestore if it is enabled.

  7. What's RELS-INT?
    RELS-INT is very similar to RELS-EXT, but rather than serializing triples about the FCRepo object, it serializes triples whose subject is a datastream in the same FCRepo object context. If used, its contents are also indexed into the triplestore.

  8. What's AUDIT?
    AUDIT is a rudimentary accounting of changes to the object. It is not directly modifiable, and its contents are not indexed anywhere by default.

  9. What are the core object properties?
    * info:fedora/fedora-system:def/model#state, with possible values Active, Inactive and Deleted
    * info:fedora/fedora-system:def/model#label, a plain text title for the object
    * info:fedora/fedora-system:def/model#ownerId, the semicolon delimited list of FCRepo user IDs that own the object (by default fedoraAdmin)
    * info:fedora/fedora-system:def/model#createdDate, the date created
    * info:fedora/fedora-system:def/view#lastModifiedDate
    The core object properties are indexed into both the basic search app's index and the triplestore (if enabled). Unfortunately, the object's core properties are not versioned because of an idiosyncrasy in FCRepo's object serialization.

  10. How does FCRepo 3 store objects?
    This is where XML enters the picture: FCRepo serializes the tree of the object, its properties, and its datastreams (though not usually their content) as an XML document using a markup called Fedora Object XML, or FOXML. FOXML documents encapsulate versions of datastreams with pointers to their content or, in some cases, inline XML of their content. The FOXML document approximates what digital preservationists call an Archival Information Package (AIP). While datastream properties (for all versions) are present inline in the FOXML, datastream contents will normally be indicated with a URI. The format of this URI will vary according to whether the datastream's contents are managed by FCRepo (that is, in FCRepo's storage) or externally (either as a file-system URI or an HTTP URL).
    The location of this XML document will depend on the configuration of FCRepo's storage module, but current defaults will place it in a shallow hierarchy under $FEDORA_HOME/data/objectStore based on the MD5 hash of the FCRepo object's internal URI.
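
    To illustrate the hashing idea, here is a rough Python sketch. Only the use of an MD5 hash of the internal URI comes from the description above; the two-character directory prefix and the URL-encoded file name are assumptions about the default layout, so check them against your own objectStore before relying on them:

```python
import hashlib
from urllib.parse import quote


def object_store_path(pid, prefix_len=2):
    """Guess the relative FOXML path for a PID under objectStore."""
    uri = f"info:fedora/{pid}"
    digest = hashlib.md5(uri.encode("utf-8")).hexdigest()
    # Assumption: the leading hex characters of the digest name the directory,
    # and the URL-encoded internal URI names the file inside it.
    return f"{digest[:prefix_len]}/{quote(uri, safe='')}"


print(object_store_path("my:1"))
```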

  11. Are there limits to the size of files that can be ingested into FCRepo 3?
    Until recent iterations of FCRepo 3, an archaism in the Java web application specs limited uploaded parts to 2 GB (owing to the definition of body length as a signed 32-bit integer), and thus implicitly limited the size of POSTed datastream contents. This ought no longer to be the case (file bugs where applicable), but the easiest way to deal with such files is to pass references to the content instead (the dsLocation parameter in the REST API). One significant exception is datastreams with inline XML content: such datastreams must be representable as a byte array, and are thus limited to 2 GB (please, do not create inline XML datastreams this large).
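
    As a sketch of passing a reference rather than a request body, the Python below only builds the URL for an add-datastream request; it does not send anything. The base URL, PID, datastream ID, and file location are placeholders, and the function name is hypothetical. The request itself would be an authenticated POST:

```python
from urllib.parse import quote, urlencode


def add_datastream_url(base_url, pid, ds_id, ds_location, mime_type,
                       control_group="M"):
    """Build a REST URL that adds a datastream by reference (dsLocation)."""
    query = urlencode({
        "controlGroup": control_group,  # assumed here: M for managed content
        "dsLocation": ds_location,      # where FCRepo should fetch the bytes
        "mimeType": mime_type,
    })
    path = f"/objects/{quote(pid, safe=':')}/datastreams/{quote(ds_id)}"
    return f"{base_url}{path}?{query}"


print(add_datastream_url(
    "http://localhost:8080/fedora",       # placeholder base URL
    "my:1",
    "MASTER",                             # placeholder datastream ID
    "http://files.example.org/big.tiff",  # placeholder content location
    "image/tiff",
))
```

    With a managed control group, FCRepo would be expected to fetch the content from dsLocation itself, so the large file never passes through a multipart upload.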

  12. Can you ingest files of any type into FCRepo 3 (including unknown MIME types, or obsolete formats)?
    FCRepo is agnostic about the format of managed or external datastream contents (with the obvious exceptions of the system datastreams). The drawback of an unidentified MIME type shows up in the inferred download name: datastreams with no MIME type, or with application/binary, will be given the suffix '.bin' (this can be circumvented by giving the datastream a dsLabel property). If the MIME type is inadequate for identifying the format, it can be elaborated with the datastream's formatURI property, which is largely documentary (e.g., assign a URI from the PRONOM registry to this property).

  13. If my datastream references content externally, can I still use the AUDIT stream to track changes?
    FCRepo can only track the changes it knows about, but your repository can indicate a known update to external content by updating one of the datastream properties (e.g., pushing a new version with the same content URI but a changed property). Likewise, deletion of the content in the external service will not be recognized in FCRepo unless your repository's workflow also deletes the referring datastream.

  14. Can you create hierarchical relationships of any depth? (objects within collections within collections...)
    The naive answer is "Yes! Objects can have any relationship you define!" But this is not very helpful.
    More complicated relationships between objects are indicated, one triple at a time, in RELS-EXT. For example, in an object with PID 'my:1', you might have the following RELS-EXT data:

<rdf:Description xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" rdf:about="info:fedora/my:1">
	<isMemberOf xmlns="info:fedora/fedora-system:def/relations-external#" rdf:resource="info:fedora/my:2"></isMemberOf>
</rdf:Description>

... and in the object with PID 'my:2', RELS-EXT data:

<rdf:Description xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" rdf:about="info:fedora/my:2">
	<isMemberOf xmlns="info:fedora/fedora-system:def/relations-external#" rdf:resource="info:fedora/my:3"></isMemberOf>
</rdf:Description>

... but your repo will need a way of following those relationships. If you have enabled the triplestore (i.e., the 'resource index'), then you can use that to query these relationships. Alternatively, you might flatten the data out and index it externally. In its minimal configuration, FCRepo only provides you with the shallowest RDF queries.
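
For the resource index route, a query for the direct members of a collection might be composed as in the Python sketch below. It assumes the resource index search is exposed at a risearch endpoint under a placeholder base URL and accepts SPARQL with these parameters; the function names are hypothetical, and the query syntax your triplestore accepts may differ. Nothing is sent here, only the URL is built:

```python
from urllib.parse import urlencode

IS_MEMBER_OF = "info:fedora/fedora-system:def/relations-external#isMemberOf"


def direct_members_query(parent_pid):
    """Compose a SPARQL query for objects that are members of parent_pid."""
    return (
        "SELECT ?member FROM <#ri> WHERE { "
        f"?member <{IS_MEMBER_OF}> <info:fedora/{parent_pid}> "
        "}"
    )


def risearch_url(base_url, query):
    """Build a tuple-query URL for the (assumed) risearch endpoint."""
    params = {"type": "tuples", "lang": "sparql", "format": "CSV",
              "query": query}
    return f"{base_url}/risearch?{urlencode(params)}"


print(risearch_url("http://localhost:8080/fedora", direct_members_query("my:2")))
```

This covers a single hop only; walking a deeper hierarchy means repeating the query for each result.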
Another approach is to separate the complexity of structure into a third entity, with simple parent-child relationships between objects to cohere collections, but more elaborate structures in another datastream to order them appropriately (e.g., some type of structure map in a datastream). Which solution is best depends on the needs of your application and the characteristics of the data. For example, if the intermediate nodes have substantial data of their own, the graph-walking approach might be best.
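
To make the graph-walking option concrete, here is a small Python sketch that follows isMemberOf links through RELS-EXT documents held in memory. The dictionary stands in for however your application fetches RELS-EXT content, the sample data mirrors the my:1, my:2, my:3 chain above, and the helper names are made up for illustration:

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
REL_NS = "info:fedora/fedora-system:def/relations-external#"


def sample_rels_ext(pid, parent_pid):
    """Build a RELS-EXT document asserting pid isMemberOf parent_pid."""
    return (
        f'<rdf:Description xmlns:rdf="{RDF_NS}" rdf:about="info:fedora/{pid}">'
        f'<isMemberOf xmlns="{REL_NS}" rdf:resource="info:fedora/{parent_pid}"/>'
        "</rdf:Description>"
    )


def parent_uris(rels_ext_xml):
    """Return the isMemberOf targets found in one RELS-EXT document."""
    root = ET.fromstring(rels_ext_xml)
    if root.tag == f"{{{RDF_NS}}}Description":
        descriptions = [root]
    else:  # tolerate an rdf:RDF wrapper around the description
        descriptions = root.findall(f"{{{RDF_NS}}}Description")
    return [
        element.get(f"{{{RDF_NS}}}resource")
        for description in descriptions
        for element in description.findall(f"{{{REL_NS}}}isMemberOf")
    ]


def ancestors(uri, rels_ext_by_uri):
    """Follow the first isMemberOf link upward until the chain ends."""
    seen, chain, current = {uri}, [], uri
    while current in rels_ext_by_uri:
        parents = parent_uris(rels_ext_by_uri[current])
        if not parents or parents[0] in seen:
            break  # no parent, or a cycle back to a visited object
        current = parents[0]
        seen.add(current)
        chain.append(current)
    return chain


repo = {
    "info:fedora/my:1": sample_rels_ext("my:1", "my:2"),
    "info:fedora/my:2": sample_rels_ext("my:2", "my:3"),
}
print(ancestors("info:fedora/my:1", repo))
```

The walk follows only the first parent at each step, which is enough for a strict hierarchy; an object that belongs to several collections would need a breadth-first variant.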

Contributors

#wtfcrepo3 Ben Armintor (@barmintor) Andrew Berger (@andrewjbtw)
