Librivox #1290

Open
1 task
sgdwn opened this issue Feb 12, 2023 · 5 comments
Assignees
Labels
💻 aspect: code Concerns the software code in the repository 🌟 goal: addition Addition of new feature 🟩 priority: low Low priority and doesn't need to be rushed 🧱 stack: catalog Related to the catalog and Airflow DAGs
Projects

Comments

@sgdwn

sgdwn commented Feb 12, 2023

Source Site

https://librivox.org/

Value Provided

huge repository of public domain audiobooks

Licenses Provided

Public Domain // CC0

Implementation

  • 🙋 I would be interested in implementing this feature.
@sgdwn sgdwn added 🚦 status: awaiting triage Has not been triaged & therefore, not ready for work 🧹 status: ticket work required Needs more details before it can be worked on labels Feb 12, 2023
@zackkrida zackkrida added 🟩 priority: low Low priority and doesn't need to be rushed good first issue New-contributor friendly and removed 🚦 status: awaiting triage Has not been triaged & therefore, not ready for work labels Feb 14, 2023
@zackkrida
Member

This appears to be the API: https://librivox.org/api/feed/audiobooks

Their forum also seems to be a good resource for learning about the API: https://forum.librivox.org/viewtopic.php?p=2153488&hilit=api#p2153488

@obulat obulat added the 🧱 stack: catalog Related to the catalog and Airflow DAGs label Feb 23, 2023
@openverse-bot openverse-bot added this to Backlog in Openverse Apr 17, 2023
@obulat obulat transferred this issue from WordPress/openverse-catalog Apr 17, 2023
@sarayourfriend
Contributor

Did a bit more digging: we can sort the API by ID and iterate through it by increasing the offset parameter:

https://librivox.org/api/feed/audiobooks?sort_field=id&offset=2

The API responds quickly even with high offsets (https://librivox.org/api/feed/audiobooks?sort_field=id&sort_order=desc&offset=18000) so I'm not worried that we would cause a problem if we slowly crawled the API in this way. It also returns JSON responses if you set the content type header to application/json.
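
A minimal sketch of that crawl, not a tested provider script. It assumes the feed also accepts `limit` and `format=json` query parameters and wraps results in a top-level `books` key, none of which is confirmed in this thread. `fetch_page` is a hypothetical callable (URL in, parsed JSON out), so the loop can be exercised without touching the network, and stopping on an empty page is likewise an assumption about how the feed signals the end.

```python
from typing import Callable, Iterator
from urllib.parse import urlencode

API_URL = "https://librivox.org/api/feed/audiobooks"


def page_url(offset: int, limit: int = 50) -> str:
    """Build the URL for one page of the feed, sorted by ID."""
    params = {
        "sort_field": "id",
        "offset": offset,
        # Assumptions: `limit` and `format` are not confirmed in this thread.
        "limit": limit,
        "format": "json",
    }
    return f"{API_URL}?{urlencode(params)}"


def crawl(fetch_page: Callable[[str], dict], limit: int = 50) -> Iterator[dict]:
    """Yield every book, increasing the offset until a page comes back empty."""
    offset = 0
    while True:
        # Assumption: results live under a top-level "books" key.
        books = fetch_page(page_url(offset, limit)).get("books") or []
        if not books:
            return
        yield from books
        offset += limit
```

Passing the fetcher in keeps any rate limiting or API key handling in one place, outside the paging logic.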

There is code to check a rate limit but based on that code, it doesn't seem like it applies to the API as there's no limit defined on the feed controller. In any case, we should still get an API key for this if we do it.

Note that the API returns results for works that haven't yet been reviewed by the LibriVox folks, so we would need to check that url_librivox exists for the work before including it.
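
That check could look something like the sketch below. The field name `url_librivox` comes from the API response; the helper names are hypothetical, and treating an empty or missing value as "not yet reviewed" is an assumption about how unreviewed works appear in the feed.

```python
def is_reviewed(book: dict) -> bool:
    """Treat a work as reviewed only if it has a non-empty `url_librivox`."""
    url = book.get("url_librivox")
    return isinstance(url, str) and bool(url.strip())


def reviewed_only(books: list[dict]) -> list[dict]:
    """Drop works that lack a usable `url_librivox` value."""
    return [book for book in books if is_reviewed(book)]
```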

Additionally, there are further complications regarding how we would present these results. Specifically, what will be considered the "unit" of a single "work" that we would show in results? One audiobook recording can include many readers for different sections (see https://librivox.org/der-abenteuerliche-simplicissimus-teutsch-teil-1-by-hans-jakob-christoffel-von-grimmelshausen/ as an example). That means we can't include individual books as the result without flattening the attribution. To make matters worse, however, the API doesn't have any way (as far as I can tell) to retrieve the readers for sections of a work. You can get the list of sections for an audio book but it does not list the readers (https://librivox.org/api/feed/audiotracks?project_id=2999).

I don't know whether attribution of the recording would need the reader, however, or if it should be attributed to the author of the work being read.

@obulat obulat mentioned this issue Jul 12, 2023
1 task
@Tenacious-E

Hello,

I would like to implement this feature.

@sarayourfriend
Contributor

sarayourfriend commented Apr 8, 2024

@Tenacious-E @AetherUnbound I don't think this feature is ready to implement yet. We need to discuss the issues I brought up in this comment, and decide on the approach we will take:

I don't know whether attribution of the recording would need the reader, however, or if it should be attributed to the author of the work being read.

Additionally, there are further complications regarding how we would present these results. Specifically, what will be considered the "unit" of a single "work" that we would show in results? One audiobook recording can include many readers for different sections (see https://librivox.org/der-abenteuerliche-simplicissimus-teutsch-teil-1-by-hans-jakob-christoffel-von-grimmelshausen/ as an example). That means we can't include individual books as the result without flattening the attribution. To make matters worse, however, the API doesn't have any way (as far as I can tell) to retrieve the readers for sections of a work. You can get the list of sections for an audio book but it does not list the readers (https://librivox.org/api/feed/audiotracks?project_id=2999).

These are complex issues that may even need special design considerations to make audio sets (a) more flexible to allow for multiple creators on a single audioset and (b) more prominent in presentation. We also need to be able to add multiple creators to a single work, which may require special work in the catalog data model, or at the very least careful consideration on how to "make it work" with the tools we have now.

The question of multiple creators or contributors is relevant for e.g., classical music, where the musician is the creator (interpreter) and the composer is an important and relevant contributor to credit. They aren't both the creator, though. (It needs careful thought and consultation with standard metadata formats for describing such works; we don't need to reinvent the wheel here 🙂)

@AetherUnbound AetherUnbound added 🌟 goal: addition Addition of new feature 💻 aspect: code Concerns the software code in the repository and removed good first issue New-contributor friendly 🧹 status: ticket work required Needs more details before it can be worked on labels Apr 8, 2024
@sarayourfriend
Contributor

Here's an example of a formalised approach to what I'm describing, regarding the distinction between reader, author, and so forth: https://www.loc.gov/marc/relators/relaterm.html

The "relator" term defines how the entity (in the case of librivox, usually a person) is related to the work. Never mind which MARC fields these go into and which entity goes where. The main thing I am hoping to demonstrate is that this is a known problem that's been "solved" in at least a few existing cataloguing standards, and we don't necessarily need to reinvent the wheel. Here are, for example, LoC's relator terms mapped to DublinCore dc:contributor: https://memory.loc.gov/diglib/loc.terms/relators/dc-contributor.html
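
To make the idea concrete, here is a rough sketch of contributors tagged with relator codes. It is purely illustrative: `Contributor` and `credit_line` are hypothetical names, not part of the Openverse catalog data model, and only the four relator codes shown are taken from the LoC list.

```python
from dataclasses import dataclass

# A few MARC relator codes and their terms from the LoC list.
RELATOR_TERMS = {
    "aut": "Author",
    "nrt": "Narrator",
    "cmp": "Composer",
    "prf": "Performer",
}


@dataclass(frozen=True)
class Contributor:
    """Hypothetical record: a name plus how that entity relates to the work."""

    name: str
    relator: str  # key into RELATOR_TERMS


def credit_line(title: str, contributors: list[Contributor]) -> str:
    """Group names by role instead of collapsing them into one "creator"."""
    by_role: dict[str, list[str]] = {}
    for contributor in contributors:
        role = RELATOR_TERMS[contributor.relator]
        by_role.setdefault(role, []).append(contributor.name)
    roles = "; ".join(
        f"{role}: {', '.join(names)}" for role, names in by_role.items()
    )
    return f"{title} ({roles})"
```

With placeholder names, a multi-reader audiobook would then be credited along the lines of `credit_line("Example Book", [Contributor("Jane Author", "aut"), Contributor("Reader A", "nrt"), Contributor("Reader B", "nrt")])`, which keeps the author and each reader distinct.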

As I said, Librivox isn't the only place this problem presents itself so prominently. Classical music, and really music in general, which we already have plenty of in the catalogue, has the same problem, as these works often involve collaboration between multiple contributors. Our single "creator" field isn't capable of capturing that, and it isn't hard to find examples of classical music recordings that are, if not misattributed, at least confusingly attributed. Photographs of artefacts from a museum have the same fundamental problem, and the particular mode of attribution can be pretty confusing! Even if an institution has chosen to license the photograph rather than CC0 or PDM it, it's still strange to say the thing is by them.

Here's an example where that nonsensical attribution is clear: https://openverse.org/image/ca4c9d9d-c22b-45ad-a375-e8431f8d5cec/

In what sense is this work "by" the Biodiversity Heritage Library? It isn't. They made the scan, but their own metadata doesn't credit themselves as the "creator" (because it's absurd and doesn't fit any model of attribution): https://www.biodiversitylibrary.org/page/36934038#page/325/mode/1up

What I mean to say is: this problem already exists, and Librivox is a good example of where this tension and contradiction become so clear that, in my opinion, we can't avoid solving it before indexing them. At best, we'll encourage misattribution of the works; at worst, we'll distribute and represent the works with such poor-quality metadata that it wouldn't be possible to consider us good stewards of the data. I already think we risk that with examples like that BioDivLibrary one and the one mentioned in #2594. I keep harping on it and I feel it keeps getting brushed over or not taken seriously, but we do not take good care of the metadata we have from providers that supply it (even huge providers like Wikimedia). Because of that, we end up with examples like that BioDiv one, where the two most prominent pieces of metadata shown on our page are nonsense, either because they're unintelligible (the "title") or factually incorrect (it isn't meaningfully by the Biodiversity Heritage Library). That particular example comes from pulling the work's data from Flickr, without pulling anything from BHL themselves, but they even have an API, so it's solvable: https://www.biodiversitylibrary.org/docs/api3.html

I'm writing this because I saw a PR go up without any further discussion about these metadata complexities that are especially unavoidable with Librivox, and I'm worried we'll move forward with that and create even more instances of misrepresented works and creators on Openverse. Surely that's fundamentally against the aims of the project.

Projects
Status: 📋 Backlog
Openverse
  
Backlog

6 participants