[enhancement] sync ebooks and audiobooks via processing audiobook to text (Pie in the sky idea) #189

Open
zombiehoffa opened this issue Nov 17, 2021 · 19 comments
Labels
ebooks Issue is related to ebooks or ereader enhancement New feature or request

Comments

@zombiehoffa

Once ebooks are a lot more mature, it would be awesome to be able to identify when an ebook and an audiobook are the same book and automagically run speech-to-text on the audiobook so that the audiobook and the ebook can be kept in sync.

@advplyr advplyr added the enhancement New feature or request label Dec 2, 2021
@gelsas

gelsas commented Dec 30, 2021

So basically a self-made version of Amazon's Whispersync feature.
That would be a game changer!

@jrhbcn

jrhbcn commented Apr 7, 2022

I cannot give more +1 to this. For me it would be the killer feature of audiobookshelf as soon as the ebook reader is more mature.

As a reference these libraries might help implementing this: afaligner and aeneas.

@DDriggs00

While I agree that this would be an incredible feature, it is definitely a very long-term goal, and would require an incredible amount of work.

@andrewls

This project also seems relevant. I haven't tried it out yet but I've been meaning to. I'll report back on what I find if I do end up trying it out in the next couple of months. A huge issue with this feature is going to be incorporating support for a reading experience of some kind. For that we could probably look at porting Epub3 Media Overlay functionality out from minstrel but all of that code is pretty dated and therefore likely not in the best of shape, and it also locks you into requiring users to create an EPUB3 file with a media overlay instead of any other possible format we might choose. I've definitely looked at implementing something like this in the past and then didn't keep up on it because I didn't have anywhere near enough free time to dedicate to something of this scale. I agree though, this would be an absolutely incredible feature.

@zombiehoffa
Author

andrewls, wow, that makes this seem a lot more possible than the pie in the sky idea I thought it was.

@pbozzay

pbozzay commented Feb 10, 2023

+1, this would be the killer feature

@donkevlar

Would love to see this as well!

@jonasrk

jonasrk commented Oct 21, 2023

Just found out about audiobookshelf googling for "Whispersync for Voice open source alternatives". Would be so cool to make this happen somehow.

@sphars

sphars commented Dec 24, 2023

Came across this on Hacker News this morning, wonder if it's something that could be integrated, or use the epubs that it creates?

From their docs: It's a self-hosted platform for taking an audiobook (either as an m4b/mp4 file, or as a zip of mp3 files) and an ebook (as an epub file) and producing a new epub file with synced narration support. This follows the media overlay spec for epubs.

@FreedomBen
Contributor

I've been experimenting locally with using whisper.cpp to make transcripts of my audiobooks. The reason transcripts rather than just an epub version is that it includes timestamps, which can be easily used to:

  1. Display "subtitles" while playing the book. This is actually even cooler than I thought it would be. Right now my prototype is hacked together with VLC, but I eventually plan to open a PR so the web and mobile players can display "subtitles" if they exist for the book (and if the feature is enabled). With whisper it's possible to have ABS run a periodic job to auto-generate these transcript files for books where they don't yet exist. It will need to be disabled by default because it uses a ton of CPU, but IMHO it would be a super awesome feature.
  2. Easily find the written text based on a timestamp. I often find myself wanting to look up quotes and things that I heard and want to preserve for later.
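The timestamp lookup in item 2 could be sketched roughly like this. The transcript shape (`start`, `end`, `text`) and the function name `findSegmentAt` are assumptions made up for illustration, not the whisper.cpp output format; a real transcript file would first need to be parsed into something like this.

```javascript
// Hypothetical transcript: segments sorted by start time, times in seconds.
const transcript = [
  { start: 0.0, end: 4.2, text: 'First sentence of the chapter.' },
  { start: 4.2, end: 9.0, text: 'Second sentence of the chapter.' },
  { start: 9.5, end: 13.1, text: 'Third sentence, after a short pause.' }
]

// Binary search for the segment whose [start, end) range contains `time`.
// Returns null when `time` falls in a gap or outside the transcript.
function findSegmentAt(segments, time) {
  let lo = 0
  let hi = segments.length - 1
  while (lo <= hi) {
    const mid = (lo + hi) >> 1
    const seg = segments[mid]
    if (time < seg.start) hi = mid - 1
    else if (time >= seg.end) lo = mid + 1
    else return seg
  }
  return null
}
```

Calling `findSegmentAt(transcript, 5)` returns the second segment, and a timestamp inside the pause (for example `9.2`) returns `null`, so a player could keep showing the previous line in that case.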

I suspect it wouldn't be terribly hard to build a "whispersync" type of thing on top of this (once it exists of course).

If somebody wants to implement this sooner than I have availability, I'm happy to yield it. Let me know and I'll try to knowledge dump what I have. Also happy to brainstorm the idea. I'm @FreedomBen in the Matrix chat

@smoores-dev

The reason transcripts rather than just an epub version is that it includes timestamps

This is actually how Media Overlays work, as well (I'm the author of Storyteller, the project that @sphars linked to). A Media Overlay is just an XML file that maps XHTML elements to segments of audio files. The Storyteller reader apps can (and do!), for example, highlight the current sentence while it's being read.

And they could also allow you to find the written text based on the timestamp (that's essentially the premise that the Storyteller reader apps are predicated on)! For any given timestamp, you can always find the location in the EPUB text that corresponds to it.
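As a rough illustration of that mapping, here is a minimal sketch that reads a tiny, made-up Media Overlay document and looks up the text fragment for a timestamp. The file names, ids, and helper names (`attr`, `parseOverlay`, `textAt`) are hypothetical, and the regex-based parsing is a shortcut that only works for simple, well-formed input like this sample; it is not how Storyteller or any reader app does it. It also assumes clock values written as plain seconds (`3.5s`), not the `h:mm:ss` form.

```javascript
// Made-up overlay with two sentences; real files are generated by the aligner.
const smil = `
<smil xmlns="http://www.w3.org/ns/SMIL" version="3.0">
  <body>
    <par id="par1">
      <text src="chapter1.xhtml#s1"/>
      <audio src="audio/ch1.mp3" clipBegin="0s" clipEnd="3.5s"/>
    </par>
    <par id="par2">
      <text src="chapter1.xhtml#s2"/>
      <audio src="audio/ch1.mp3" clipBegin="3.5s" clipEnd="7.25s"/>
    </par>
  </body>
</smil>`

// Read one double-quoted attribute from a tag string.
function attr(tag, name) {
  const m = tag.match(new RegExp('\\s' + name + '="([^"]*)"'))
  return m ? m[1] : null
}

// Turn each <par> into { textSrc, audioSrc, clipBegin, clipEnd }.
// Assumes every <par> contains exactly one <text> and one <audio>.
function parseOverlay(xml) {
  const pars = xml.match(/<par\b[\s\S]*?<\/par>/g) || []
  return pars.map((par) => {
    const text = par.match(/<text\b[^>]*>/)[0]
    const audio = par.match(/<audio\b[^>]*>/)[0]
    return {
      textSrc: attr(text, 'src'),
      audioSrc: attr(audio, 'src'),
      clipBegin: parseFloat(attr(audio, 'clipBegin')),
      clipEnd: parseFloat(attr(audio, 'clipEnd'))
    }
  })
}

// Find the text fragment being narrated at `time` seconds of `audioSrc`.
function textAt(entries, audioSrc, time) {
  const hit = entries.find(
    (e) => e.audioSrc === audioSrc && time >= e.clipBegin && time < e.clipEnd
  )
  return hit ? hit.textSrc : null
}
```

With the sample above, `textAt(parseOverlay(smil), 'audio/ch1.mp3', 4)` gives `chapter1.xhtml#s2`, which a reader could resolve to an element and highlight.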

@gelsas

gelsas commented Dec 27, 2023

Is it also possible to fine-tune the highlighting even more? I think Amazon Whispersync highlights word by word. I am so used to that by now, so I wondered if it would be possible to do that as well with Storyteller.

@smoores-dev

It's possible! Storyteller has word-level timestamps available, but its reliance on fuzzy search for alignment (to account for inaccuracies in the transcription) might make word-level highlights challenging to get right.
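To give a feel for why fuzzy search is needed, here is a minimal sketch of one generic approach: slide a window over the book's words and pick the position with the smallest word-level edit distance to the transcribed sentence. This is not Storyteller's actual algorithm, all names are made up for illustration, and the normalization only keeps ASCII letters and digits, so it assumes English-like text.

```javascript
// Lowercase, strip punctuation, split into words.
function normalize(s) {
  return s.toLowerCase().replace(/[^a-z0-9\s]/g, '').split(/\s+/).filter(Boolean)
}

// Levenshtein distance between two arrays of words (single-row DP).
function levenshtein(a, b) {
  const row = Array.from({ length: b.length + 1 }, (_, j) => j)
  for (let i = 1; i <= a.length; i++) {
    let diag = row[0]
    row[0] = i
    for (let j = 1; j <= b.length; j++) {
      const above = row[j]
      const cost = a[i - 1] === b[j - 1] ? 0 : 1
      row[j] = Math.min(above + 1, row[j - 1] + 1, diag + cost)
      diag = above
    }
  }
  return row[b.length]
}

// Return the word index in the book where the transcribed sentence fits best.
function locate(transcribed, bookText) {
  const needle = normalize(transcribed)
  const hay = normalize(bookText)
  let best = { index: -1, distance: Infinity }
  for (let i = 0; i + needle.length <= hay.length; i++) {
    const d = levenshtein(needle, hay.slice(i, i + needle.length))
    if (d < best.distance) best = { index: i, distance: d }
  }
  return best
}
```

A mistranscribed word still lands on the right sentence, but the match says nothing reliable about which individual word is which, which is the kind of ambiguity that makes word-level highlighting harder than sentence-level.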

If it's a feature you're interested in, feel free to make an Issue on the Storyteller project! It's on GitLab (gitlab.com/smoores/storyteller), but there's a mirror on GitHub if you don't have a GitLab account; I'll copy any Issues created there over to GitLab.

@mr-ransel

mr-ransel commented Dec 29, 2023

I'm thinking through how Storyteller and Audiobookshelf could be fairly tightly integrated to create "whispersync as a service" and combine the library management of ABS, and the media overlay setup of ST.

Essentially the flow would look like:

  1. User "pairs" an ebook and audiobook in ABS
  2. ABS reaches out to ST over the API, and triggers the generation of an updated epub file, sending the user-defined chapter demarcations as well
  3. ST parses the audiobook tracks, preferably by filesystem reference instead of a wasteful upload, uses the chapter times to assist the algorithm, and generates new marked up epubs
  4. The new epub gets synced back to ABS via either the API or just a filesystem write replacing/adding a duplicate of the existing epubs, but now with the marked up files

An extension would be to handle conversion of non epubs to epub transparently as well for convenience.

Better yet, on top of all this, with a little bit of fuzzy matching the entire library could be ported into ST directly, auto-pairing all the audiobooks and ebooks so no manual pairing is necessary.
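A very small sketch of what auto-pairing might look like, assuming each library item exposes `title` and `author` strings (a hypothetical shape, not the ABS or Storyteller data model). It only pairs items whose normalized title and author match exactly and reports everything else as unmatched, so those could fall back to manual pairing.

```javascript
// Build a comparison key: lowercase, drop parenthesized notes such as
// "(Unabridged)", and collapse punctuation and whitespace.
function titleKey(item) {
  const clean = (s) =>
    s.toLowerCase().replace(/\(.*?\)/g, '').replace(/[^a-z0-9]+/g, ' ').trim()
  return clean(item.title) + '|' + clean(item.author)
}

// Pair audiobooks with ebooks that share the same key.
function autoPair(audiobooks, ebooks) {
  const byKey = new Map(ebooks.map((e) => [titleKey(e), e]))
  const pairs = []
  const unmatched = []
  for (const a of audiobooks) {
    const e = byKey.get(titleKey(a))
    if (e) pairs.push({ audiobook: a, ebook: e })
    else unmatched.push(a)
  }
  return { pairs, unmatched }
}
```

Exact matching on a normalized key is deliberately conservative; a looser comparison would pair more books automatically at the cost of occasional wrong pairings.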

@smoores-dev

That flow sounds excellent to me! I think it would definitely make sense to be able to create a book entity in Storyteller from existing files, in addition to the current upload flow. An automated matching system sounds a little fraught, but I'm open to exploring it; the manual matching system you have laid out here sounds great as a start.

@MxMarx
Contributor

MxMarx commented Feb 12, 2024

I was playing around with Storyteller, it looks so amazing for this! Media overlays don't look super easy to access with epub.js, although there's a pull request for that, but something like this snippet, inserted here, can extract the timestamp to cfi mappings from the epubs output from Storyteller

  // Look up the spine item's manifest entry and its media overlay entry, if any
  var manifestItem = this.book.packaging.manifest[item.idref]
  var overlay = this.book.packaging.manifest[manifestItem.overlay]

  if (overlay) {
    const href = resolveURL(overlay.href, basePath)
    // Arrow function so that `this` in the callback is still the reader
    this.book.load(href).then((overlayXml) => {
      var doc = new DOMParser().parseFromString(overlayXml, 'text/xml')

      // Each <par> pairs a text element with an audio clip
      doc.querySelectorAll('par').forEach((par) => {
        var audio = par.getElementsByTagName('audio')[0]
        var textId = par.getAttribute('id')
        this.audioMapping.push({
          cfi: item.cfiFromElement(item.document.getElementById(textId)),
          clipBegin: parseFloat(audio.getAttribute('clipBegin')),
          clipEnd: parseFloat(audio.getAttribute('clipEnd'))
        })
      })
    })
  }

Since the current epub reader needs the whole epub to be sent to the client, it might be a good idea to use either the original epub since the marked up epub includes embedded audio files, or strip the audio files from Storyteller output.

If using the existing audio files instead of embedding them, another consideration is that the timestamps generated by Storyteller are relative to the audiobook chapters instead of the whole audio. If going down that path, I'm not sure if it would make more sense to modify Storyteller to include some metadata to map the chapter offsets back to the original file, or have audiobookshelf do some post processing after running Storyteller.
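For the post-processing option, a minimal sketch of the offset arithmetic might look like the following. It assumes each mapping entry carries a `chapterId` and that the chapter start offsets within the full audiobook are known; both the field names and the `toAbsolute` helper are hypothetical, not something Storyteller or audiobookshelf currently provides.

```javascript
// Shift chapter-relative clip times onto the timeline of the whole audiobook.
// `mapping`: [{ chapterId, cfi, clipBegin, clipEnd }] with times in seconds,
//            relative to the start of the chapter.
// `chapterStarts`: Map from chapterId to that chapter's start offset in seconds.
function toAbsolute(mapping, chapterStarts) {
  return mapping.map((m) => {
    const start = chapterStarts.get(m.chapterId)
    if (start === undefined) throw new Error('Unknown chapter: ' + m.chapterId)
    return { ...m, clipBegin: m.clipBegin + start, clipEnd: m.clipEnd + start }
  })
}
```

For example, a clip at 12.25s to 15.5s in a chapter that starts at 1800.5s ends up at 1812.75s to 1816s in the full file. The entries are copied, not mutated, so the chapter-relative mapping stays available.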

@stassinari

With the latest iOS 17.4 update, Apple introduced a new transcript feature which is useful and quite intuitive.

I know it's not exactly what this issue is about, but there might be interesting ideas, especially in terms of UX.

@sevenlayercookie

That flow sounds excellent to me! I think it would definitely make sense to be able to create a book entity in Storyteller from existing files, in addition to the current upload flow. An automated matching system sounds a little fraught, but I'm open to exploring it; the manual matching system you have laid out here sounds great as a start.

Have you experimented with live transcription using Whisper? As in, using whisper to transcribe what is currently being played and "buffering" 30 seconds ahead or so. Even using CPU alone, it sounds like faster-whisper can easily outpace an audiobook playing at original speed (1x). Would essentially be Immersive Reading (and would localize to the individual word as well, rather than just the whole sentence). And I suppose this transcription could be cached for future use and fed into the fuzzy search to attempt to sync with an ebook as well.

Basically an on-demand, live transcription version of Storyteller, cutting out need for pre-processing.

@Astorsoft

This idea would be amazing, and outsourcing the sync to a dedicated tool like Storyteller is a great idea. If you want to go down the route of an internal service, however: I've already mentioned this on Storyteller's project, but I think https://github.com/echogarden-project/echogarden is an amazing backend for speech-to-transcript alignment that works with many more languages than English. I did some tests on Swedish and they were very conclusive; based on their docs, it can go down to word-level alignment with great accuracy.

Audiobook/epub alignment is always better than TTS, as narrators often make a great effort to change their tone of voice for each character and do a good job of expressing the characters' feelings. Maybe one day whisper will reach this stage but we're not there yet.

Lastly, good luck on the player part. It's a nightmare to find a good epub reader with media overlay support, at least on Android. Some don't work with specific file formats (like Ogg Vorbis), and some add a weird delay in the playback, making you think the alignment is off when it is in fact perfect when checked on other platforms like Windows.
