Librivox #1290

sgdwn · 2023-02-12T07:50:38Z

Source Site

https://librivox.org/

Value Provided

huge repository of public domain audiobooks

Licenses Provided

Public Domain // CC0

Implementation

🙋 I would be interested in implementing this feature.

zackkrida · 2023-02-14T20:10:55Z

This appears to be the API: https://librivox.org/api/feed/audiobooks

Their forum also seems to be a good resource for learning about the API: https://forum.librivox.org/viewtopic.php?p=2153488&hilit=api#p2153488

sarayourfriend · 2023-06-27T03:19:42Z

I did a bit more digging. We can sort the API results by ID and iterate through them by increasing the offset parameter:

https://librivox.org/api/feed/audiobooks?sort_field=id&offset=2

The API responds quickly even with high offsets (https://librivox.org/api/feed/audiobooks?sort_field=id&sort_order=desc&offset=18000) so I'm not worried that we would cause a problem if we slowly crawled the API in this way. It also returns JSON responses if you set the content type header to application/json.
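The offset-based crawl described above could be sketched roughly as follows. This is only an illustration: the `limit` parameter, the `books` key in the JSON payload, and the injected `fetch_json` callable are assumptions, not confirmed details of the LibriVox API or of any existing Openverse code.

```python
from typing import Callable, Iterator
from urllib.parse import urlencode

FEED_URL = "https://librivox.org/api/feed/audiobooks"


def page_url(offset: int, limit: int = 50) -> str:
    """Build a feed URL sorted by ID. `limit` is an assumed parameter."""
    params = {"sort_field": "id", "offset": offset, "limit": limit}
    return f"{FEED_URL}?{urlencode(params)}"


def iter_books(fetch_json: Callable[[str], dict], limit: int = 50) -> Iterator[dict]:
    """Yield book records page by page until an empty page comes back.

    `fetch_json` is a hypothetical helper that requests a URL and returns
    the decoded JSON body; the "books" key is an assumed response shape.
    """
    offset = 0
    while True:
        payload = fetch_json(page_url(offset, limit))
        books = payload.get("books") or []
        if not books:
            return
        yield from books
        # Advance by the number of records actually received.
        offset += len(books)
```

Passing the fetcher in as an argument keeps the paging logic separate from HTTP concerns such as headers, API keys, and any delay between requests.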

There is code to check a rate limit, but based on that code it doesn't seem to apply to the API, since there's no limit defined on the feed controller. In any case, we should still get an API key for this if we do it.

Note that the API returns results for works that haven't yet been reviewed by the LibriVox folks, so we would need to check that url_librivox exists for the work before including it.
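A minimal sketch of that check, assuming that an unreviewed work shows up with a missing, null, or empty `url_librivox` value (the exact representation is not confirmed here). The function names are hypothetical.

```python
from typing import Iterable, Iterator


def is_published(book: dict) -> bool:
    """Return True only when the record carries a non-empty url_librivox."""
    url = book.get("url_librivox")
    return isinstance(url, str) and bool(url.strip())


def only_published(books: Iterable[dict]) -> Iterator[dict]:
    """Drop records that do not yet have a LibriVox catalogue URL."""
    return (book for book in books if is_published(book))
```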

Additionally, there are further complications regarding how we would present these results. Specifically, what will be considered the "unit" of a single "work" that we would show in results? One audiobook recording can include many readers for different sections (see https://librivox.org/der-abenteuerliche-simplicissimus-teutsch-teil-1-by-hans-jakob-christoffel-von-grimmelshausen/ as an example). That means we can't include individual books as the result without flattening the attribution. To make matters worse, however, the API doesn't have any way (as far as I can tell) to retrieve the readers for sections of a work. You can get the list of sections for an audio book but it does not list the readers (https://librivox.org/api/feed/audiotracks?project_id=2999).

I don't know whether attribution of the recording would need the reader, however, or if it should be attributed to the author of the work being read.

Tenacious-E · 2024-04-08T17:25:50Z

Hello,

I would like to implement this feature.

sarayourfriend · 2024-04-08T21:05:29Z

@Tenacious-E @AetherUnbound I don't think this feature is ready to implement yet. We need to discuss the issues I brought up in this comment, and decide on the approach we will take:

I don't know whether attribution of the recording would need the reader, however, or if it should be attributed to the author of the work being read.

Additionally, there are further complications regarding how we would present these results. Specifically, what will be considered the "unit" of a single "work" that we would show in results? One audiobook recording can include many readers for different sections (see https://librivox.org/der-abenteuerliche-simplicissimus-teutsch-teil-1-by-hans-jakob-christoffel-von-grimmelshausen/ as an example). That means we can't include individual books as the result without flattening the attribution. To make matters worse, however, the API doesn't have any way (as far as I can tell) to retrieve the readers for sections of a work. You can get the list of sections for an audio book but it does not list the readers (https://librivox.org/api/feed/audiotracks?project_id=2999).

These are complex issues that may even need special design considerations to make audio sets (a) more flexible to allow for multiple creators on a single audioset and (b) more prominent in presentation. We also need to be able to add multiple creators to a single work, which may require special work in the catalog data model, or at the very least careful consideration on how to "make it work" with the tools we have now.

The question of multiple creators or contributors is relevant for e.g., classical music, where the musician is the creator (interpreter) and the composer is an important and relevant contributor to credit. They aren't both the creator, though. This needs careful thought and consultation with standard metadata formats for describing such works; we don't need to reinvent the wheel here 🙂

sarayourfriend · 2024-05-03T05:50:20Z

Here's an example of a formalised approach to what I'm describing, regarding the distinction between reader, author, and so forth: https://www.loc.gov/marc/relators/relaterm.html

The "relator" term defines how the entity (in the case of librivox, usually a person) is related to the work. Never mind which MARC fields these go into and which entity goes where. The main thing I am hoping to demonstrate is that this is a known problem that's been "solved" in at least a few existing cataloguing standards, and we don't necessarily need to reinvent the wheel. Here are, for example, LoC's relator terms mapped to DublinCore dc:contributor: https://memory.loc.gov/diglib/loc.terms/relators/dc-contributor.html
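To make the idea concrete, here is a rough sketch of how contributors tagged with relator codes might be modelled. The codes `aut`, `nrt`, `cmp`, and `prf` come from the MARC relator list; everything else (the `Contributor` and `Work` classes, the choice of `nrt` for a LibriVox reader, and the credit-line format) is hypothetical and not part of the current catalog data model.

```python
from __future__ import annotations

from dataclasses import dataclass, field

# A small subset of MARC relator codes, in the order they should be credited.
RELATOR_LABELS = {
    "aut": "Author",
    "nrt": "Narrator",
    "cmp": "Composer",
    "prf": "Performer",
}


@dataclass(frozen=True)
class Contributor:
    name: str
    relator: str  # a MARC relator code such as "aut" or "nrt"


@dataclass
class Work:
    title: str
    contributors: list[Contributor] = field(default_factory=list)

    def by_relator(self, relator: str) -> list[str]:
        """Names of all contributors with the given relator code."""
        return [c.name for c in self.contributors if c.relator == relator]

    def credit_line(self) -> str:
        """Group contributors by role instead of a single 'creator'."""
        parts = []
        for code, label in RELATOR_LABELS.items():
            names = self.by_relator(code)
            if names:
                parts.append(f"{label}: {', '.join(names)}")
        return "; ".join(parts)
```

With a structure like this, a multi-reader audiobook could credit its author and each reader separately rather than flattening them into one field.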

As I said, Librivox isn't the only place this problem presents itself so prominently. Classical music, and really music in general, which we already have plenty of in the catalogue, has the same problem, because these works often involve collaboration between multiple contributors. Our single "creator" field isn't capable of capturing that, and it isn't hard to find examples of classical music recordings that are, if not misattributed, at least confusingly attributed. Photographs of artefacts from a museum have the same fundamental problem, and the particular mode of attribution can be pretty confusing! Even if an institution has chosen to license the photograph rather than release it under CC0 or PDM, it's still strange to say the thing is by them.

Here's an example where that nonsensical attribution is clear: https://openverse.org/image/ca4c9d9d-c22b-45ad-a375-e8431f8d5cec/

In what sense is this work "by" the Biodiversity Heritage Library? It isn't. They made the scan, but their own metadata doesn't credit themselves as the "creator" (because it's absurd and doesn't fit any model of attribution): https://www.biodiversitylibrary.org/page/36934038#page/325/mode/1up

What I mean to say is: this problem already exists, and Librivox is a good example of where this tension and contradiction become so clear it's impossible not to need to solve it before indexing them (in my opinion). At best, we'll encourage misattribution of the works and at worst distribute and represent the works with such poor quality metadata that it wouldn't be possible to consider us good stewards of the data. I already think we risk that with examples like that BioDivLibrary one and the one mentioned in #2594. I keep harping on it and I feel it keeps getting brushed over or not taken seriously, but we do not take good care of the metadata we have from providers that supply it (even huge providers like Wikimedia), and because of that end up with examples like that BioDiv one where two prominent pieces of metadata (in fact, the two most prominent) shown on our page are nonsense, either because it's unintelligible (the "title") or factually incorrect (it isn't meaningfully by the Biodiversity Heritage Library). That particular example comes from pulling the work's data from Flickr, without pulling anything from BHL themselves, but they even have an API, so it's solvable: https://www.biodiversitylibrary.org/docs/api3.html

I'm writing this because I saw a PR go up without any further discussion about these metadata complexities that are especially unavoidable with Librivox, and I'm worried we'll move forward with that and create even more instances of misrepresented works and creators on Openverse. Surely that's fundamentally against the aims of the project.

sgdwn added 🚦 status: awaiting triage Has not been triaged & therefore, not ready for work 🧹 status: ticket work required Needs more details before it can be worked on labels Feb 12, 2023

zackkrida added 🟩 priority: low Low priority and doesn't need to be rushed good first issue New-contributor friendly and removed 🚦 status: awaiting triage Has not been triaged & therefore, not ready for work labels Feb 14, 2023

obulat added the 🧱 stack: catalog Related to the catalog and Airflow DAGs label Feb 23, 2023

openverse-bot added this to Backlog in Openverse Apr 17, 2023

obulat transferred this issue from WordPress/openverse-catalog Apr 17, 2023

obulat mentioned this issue Jul 12, 2023

Librivox #1310

Closed

1 task

AetherUnbound assigned Tenacious-E Apr 8, 2024

AetherUnbound added 🌟 goal: addition Addition of new feature 💻 aspect: code Concerns the software code in the repository and removed good first issue New-contributor friendly 🧹 status: ticket work required Needs more details before it can be worked on labels Apr 8, 2024

obulat mentioned this issue Apr 9, 2024

Add initial Librivox ingestion #4063

Closed

8 tasks
