[enhancement] sync ebooks and audiobooks via processing audiobook to text (Pie in the sky idea) #189

zombiehoffa · 2021-11-17T04:09:45Z

Once ebooks are a lot more mature, it would be awesome to be able to identify when an ebook and an audiobook are the same book and automagically run speech-to-text on the audiobook so that the audiobook and the ebook can be kept in sync.

gelsas · 2021-12-30T05:11:00Z

So basically a self-made version of Amazon's Whispersync feature.
That would be a game changer!

jrhbcn · 2022-04-07T18:36:16Z

I cannot give more +1 to this. For me it would be the killer feature of audiobookshelf as soon as the ebook reader is more mature.

For reference, these libraries might help with implementing this: afaligner and aeneas.

DDriggs00 · 2022-10-11T17:25:14Z

While I agree that this would be an incredible feature, it is definitely a very long-term goal, and would require an incredible amount of work.

andrewls · 2022-12-19T21:58:46Z

This project also seems relevant. I haven't tried it out yet, but I've been meaning to. I'll report back on what I find if I do end up trying it out in the next couple of months.

A huge issue with this feature is going to be incorporating support for a reading experience of some kind. For that we could probably look at porting the EPUB3 Media Overlay functionality out from minstrel, but all of that code is pretty dated and therefore likely not in the best of shape. It also locks you into requiring users to create an EPUB3 file with a media overlay instead of any other possible format we might choose.

I've definitely looked at implementing something like this in the past and then didn't keep up on it because I didn't have anywhere near enough free time to dedicate to something of this scale. I agree though, this would be an absolutely incredible feature.

zombiehoffa · 2022-12-20T19:08:03Z

andrewls, wow, that makes this seem a lot more possible than the pie in the sky idea I thought it was.

pbozzay · 2023-02-10T03:44:33Z

+1, this would be the killer feature

donkevlar · 2023-03-01T04:38:42Z

Would love to see this as well!

jonasrk · 2023-10-21T14:08:58Z

Just found out about audiobookshelf googling for "Whispersync for Voice open source alternatives". Would be so cool to make this happen somehow.

sphars · 2023-12-24T12:30:18Z

Came across this on Hacker News this morning, wonder if it's something that could be integrated, or use the epubs that it creates?

From their docs: "It's a self-hosted platform for taking an audiobook (either as an m4b/mp4 file, or as a zip of mp3 files) and an ebook (as an epub file) and producing a new epub file with synced narration support. This follows the media overlay spec for epubs."

FreedomBen · 2023-12-24T18:58:37Z

I've been experimenting locally with using whisper.cpp to make transcripts of my audiobooks. The reason transcripts rather than just an epub version is that it includes timestamps, which can be easily used to:

- Display "subtitles" while playing the book. This is actually even cooler than I thought it would be. Right now my prototype is hacked together with VLC player, but I eventually plan to open a PR so the web and mobile players can display "subtitles" if they exist for the book (and if the feature is enabled). With whisper it's possible to have ABS run a periodic job to auto-generate these transcript files for books where they don't yet exist. It will need to be disabled by default because it uses a ton of CPU, but IMHO it would be a super awesome feature.
- Easily find the written text based on a timestamp. I often find myself wanting to look up quotes and things that I heard and want to preserve for later.
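
As a rough illustration of the timestamp lookup, here is a minimal sketch. It assumes the transcript has already been loaded into an array of segments sorted by start time; the `{ start, end, text }` shape and the `findSegment` name are made up for this example and are not the whisper.cpp output format.

```javascript
// Hypothetical transcript shape: segments sorted by start time, in seconds.
const transcript = [
  { start: 0.0, end: 4.0, text: "Chapter one." },
  { start: 4.0, end: 9.5, text: "The harbor was quiet that morning." },
  { start: 9.5, end: 15.0, text: "Nobody had seen the ferry leave." },
];

// Binary search for the segment whose [start, end) range contains `seconds`.
// Returns null when the timestamp falls outside every segment.
function findSegment(segments, seconds) {
  let lo = 0;
  let hi = segments.length - 1;
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    const seg = segments[mid];
    if (seconds < seg.start) hi = mid - 1;
    else if (seconds >= seg.end) lo = mid + 1;
    else return seg;
  }
  return null;
}

// Look up what was being read 5 seconds into the audio.
console.log(findSegment(transcript, 5)?.text);
```

The same lookup would drive "subtitles": on every player time update, find the segment for the current position and display its text.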

I suspect it wouldn't be terribly hard to build a "whispersync" type of thing on top of this (once it exists of course).

If somebody wants to implement this sooner than I have availability, I'm happy to yield it. Let me know and I'll try to knowledge dump what I have. Also happy to brainstorm the idea. I'm @FreedomBen in the Matrix chat

smoores-dev · 2023-12-25T05:50:35Z

> The reason transcripts rather than just an epub version is that it includes timestamps

This is actually how Media Overlays work, as well (I'm the author of Storyteller, the project that @sphars linked to). A Media Overlay is just an XML file that maps XHTML elements to segments of audio files. The Storyteller reader apps can (and do!), for example, highlight the current sentence while it's being read.

And they could also allow you to find the written text based on the timestamp (that's essentially the premise that the Storyteller reader apps are predicated on)! For any given timestamp, you can always find the location in the EPUB text that corresponds to it.
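
To make the "XML file that maps XHTML elements to segments of audio files" concrete, here is a small sketch that builds a minimal Media Overlay (SMIL) document from a list of sentence timings. The `<par>`, `<text>` and `<audio>` elements and the `clipBegin`/`clipEnd` attributes come from the EPUB Media Overlays format; the input shape and the function names are assumptions for illustration, not Storyteller's code.

```javascript
// Escape the characters that are special inside XML attribute values.
function escapeAttr(value) {
  return String(value)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Build a minimal Media Overlay document.
// `clips` is a hypothetical input: one entry per sentence, with times in seconds.
function buildMediaOverlay(clips) {
  const pars = clips.map((clip, i) =>
    [
      `    <par id="par${i}">`,
      `      <text src="${escapeAttr(clip.textSrc)}"/>`,
      `      <audio src="${escapeAttr(clip.audioSrc)}" clipBegin="${clip.begin}s" clipEnd="${clip.end}s"/>`,
      `    </par>`,
    ].join("\n")
  );
  return [
    `<smil xmlns="http://www.w3.org/ns/SMIL" version="3.0">`,
    `  <body>`,
    ...pars,
    `  </body>`,
    `</smil>`,
  ].join("\n");
}

console.log(
  buildMediaOverlay([
    { textSrc: "chapter1.xhtml#s0", audioSrc: "audio/ch1.mp3", begin: 0, end: 3.2 },
    { textSrc: "chapter1.xhtml#s1", audioSrc: "audio/ch1.mp3", begin: 3.2, end: 7.9 },
  ])
);
```

Each `<par>` pairs one text fragment with one audio clip, which is what lets a reader highlight a sentence during playback or jump from a timestamp back to the text.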

gelsas · 2023-12-27T21:11:36Z

Is it also possible to fine-tune the highlighting even more? I think Amazon Whispersync highlights word by word, and I am so used to that by now that I wondered if it would be possible to do that with Storyteller as well.

smoores-dev · 2023-12-28T01:49:05Z

It's possible! Storyteller has word-level timestamps available, but its reliance on fuzzy search for alignment (to account for inaccuracies in the transcription) might make word-level highlights challenging to get right.
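
For readers unfamiliar with the idea, here is a toy sketch of fuzzy search for alignment: slide a window across the book's words and pick the position with the smallest word-level edit distance to the transcribed sentence. This is only an illustration of the general technique, not Storyteller's actual algorithm, and all names and the `maxErrorRatio` threshold are assumptions.

```javascript
// Lowercase and split into words; this simple regex assumes English-like text.
function tokenize(text) {
  return text.toLowerCase().replace(/[^a-z0-9\s']/g, " ").split(/\s+/).filter(Boolean);
}

// Levenshtein distance between two arrays of words.
function editDistance(a, b) {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
    }
    prev = curr;
  }
  return prev[b.length];
}

// Find the best-matching window for a transcribed sentence.
// Returns { wordIndex, distance }, or null if even the best window is too far off.
function fuzzyFind(bookText, transcribed, maxErrorRatio = 0.3) {
  const book = tokenize(bookText);
  const query = tokenize(transcribed);
  if (query.length === 0 || book.length < query.length) return null;
  let best = null;
  for (let i = 0; i + query.length <= book.length; i++) {
    const distance = editDistance(query, book.slice(i, i + query.length));
    if (best === null || distance < best.distance) best = { wordIndex: i, distance };
  }
  return best.distance / query.length <= maxErrorRatio ? best : null;
}

const book = "The old keeper climbed the stairs. He lit the lamp at dusk and watched the ships.";
// "lamb" stands in for a transcription error.
console.log(fuzzyFind(book, "he lit the lamb at dusk"));
```

A sentence can match well as a whole while individual words are still wrong or shifted, which is one way to see why word-level highlights are harder to get right than sentence-level ones.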

If it's a feature you're interested in, feel free to make an Issue on the Storyteller project! It's on GitLab (gitlab.com/smoores/storyteller), but there's a mirror on GitHub if you don't have a GitLab account; I'll copy any Issues created there over to GitLab.

mr-ransel · 2023-12-29T13:47:43Z

I'm thinking through how Storyteller and Audiobookshelf could be fairly tightly integrated to create "whispersync as a service" and combine the library management of ABS, and the media overlay setup of ST.

Essentially the flow would look like:

1. User "pairs" an ebook and audiobook in ABS
2. ABS reaches out to ST over the API and triggers the generation of an updated epub file, sending the user-defined chapter demarcations as well
3. ST parses the audiobook tracks, preferably by filesystem reference instead of a wasteful upload, uses the chapter times to assist the algorithm, and generates new marked-up epubs
4. The new epub gets synced back to ABS via either the API or just a filesystem write, replacing the existing epub or adding the marked-up file alongside it

An extension would be to handle conversion of non epubs to epub transparently as well for convenience.

Better yet, on top of all this, with a little bit of fuzzy matching the entire library could be ported into ST directly and auto-pair all the audio and ebooks so no manual pairing is necessary.
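
A very rough sketch of what title-based auto-pairing could look like: compare the word sets of each audiobook and ebook title and propose a pair only when the overlap is high. The item shape, the Jaccard similarity measure, and the `0.6` threshold are all assumptions for illustration; a real matcher would likely also use author and series metadata.

```javascript
// Reduce a title to a set of lowercase words, dropping punctuation.
function titleTokens(title) {
  return new Set(title.toLowerCase().replace(/[^a-z0-9\s]/g, " ").split(/\s+/).filter(Boolean));
}

// Jaccard similarity: shared words divided by total distinct words.
function jaccard(a, b) {
  let shared = 0;
  for (const token of a) if (b.has(token)) shared++;
  const union = a.size + b.size - shared;
  return union === 0 ? 0 : shared / union;
}

// Propose the best ebook for each audiobook, skipping weak matches.
function proposePairs(audiobooks, ebooks, threshold = 0.6) {
  const pairs = [];
  for (const audio of audiobooks) {
    let best = null;
    for (const ebook of ebooks) {
      const score = jaccard(titleTokens(audio.title), titleTokens(ebook.title));
      if (best === null || score > best.score) best = { ebook, score };
    }
    if (best !== null && best.score >= threshold) {
      pairs.push({ audiobookId: audio.id, ebookId: best.ebook.id, score: best.score });
    }
  }
  return pairs;
}

console.log(
  proposePairs(
    [{ id: "a1", title: "The Long Voyage (Unabridged)" }, { id: "a2", title: "Cooking Basics" }],
    [{ id: "e1", title: "The Long Voyage" }, { id: "e2", title: "Gardening at Home" }]
  )
);
```

Returning proposals with a score, rather than pairing silently, would leave room for a manual confirmation step for borderline matches.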

smoores-dev · 2023-12-29T14:17:47Z

That flow sounds excellent to me! I think it would definitely make sense to be able to create a book entity in Storyteller from existing files, in addition to the current upload flow. An automated matching system sounds a little fraught, but I'm open to exploring it; the manual matching system you have laid out here sounds great as a start.

MxMarx · 2024-02-12T22:56:53Z

I was playing around with Storyteller, and it looks amazing for this! Media overlays don't look super easy to access with epub.js, although there's a pull request for that. Something like this snippet, inserted here, can extract the timestamp-to-CFI mappings from the epubs that Storyteller outputs:

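  // Look up this spine item's Media Overlay (SMIL) document via the manifest.
  const manifestItem = this.book.packaging.manifest[item.idref]
  const overlay = this.book.packaging.manifest[manifestItem.overlay]

  if (overlay) {
    const href = resolveURL(overlay.href, basePath)
    // Arrow functions keep `this` bound to the enclosing object inside the callbacks.
    this.book.load(href).then((overlayXml) => {
      const doc = new DOMParser().parseFromString(overlayXml, 'text/xml')

      doc.querySelectorAll('par').forEach((par) => {
        const audio = par.getElementsByTagName('audio')[0]
        if (!audio) return // skip <par> elements without an audio clip

        // This uses the <par> id as the id of the text element; the fragment
        // in <text src="..."> is the more general source for it.
        const textId = par.getAttribute('id')
        this.audioMapping.push({
          cfi: item.cfiFromElement(item.document.getElementById(textId)),
          // parseFloat handles values like "12.5s" but not "0:00:12.5".
          clipBegin: parseFloat(audio.getAttribute('clipBegin')),
          clipEnd: parseFloat(audio.getAttribute('clipEnd'))
        })
      })
    })
  }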
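
One caveat with `parseFloat` on `clipBegin`/`clipEnd`: Media Overlays use SMIL clock values, which can also be written as `hh:mm:ss.fff` or with units such as `min` and `ms`. A small parser along these lines (the `parseClockValue` name is hypothetical, and this is a sketch rather than a full validator) would cover those forms:

```javascript
// Parse a SMIL clock value (as used by clipBegin/clipEnd) into seconds.
// Returns NaN for input it does not understand.
function parseClockValue(value) {
  const text = value.trim();

  // Timecount form: a number with an optional unit; no unit means seconds.
  const timecount = /^(\d+(?:\.\d+)?)(h|min|s|ms)?$/.exec(text);
  if (timecount) {
    const amount = parseFloat(timecount[1]);
    const unit = timecount[2] ?? "s";
    if (unit === "ms") return amount / 1000;
    return amount * { h: 3600, min: 60, s: 1 }[unit];
  }

  // Clock form: [hh:]mm:ss[.fraction]
  const parts = text.split(":");
  if (parts.length === 2 || parts.length === 3) {
    const numbers = parts.map(Number);
    if (numbers.some(Number.isNaN)) return NaN;
    return numbers.reduce((total, part) => total * 60 + part, 0);
  }
  return NaN;
}

console.log(parseClockValue("0:01:02.5")); // 62.5
console.log(parseClockValue("12.5s")); // 12.5
```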

Since the current epub reader needs the whole epub to be sent to the client, it might be a good idea to either use the original epub (since the marked-up epub includes embedded audio files) or strip the audio files from the Storyteller output.

If using the existing audio files instead of embedding them, another consideration is that the timestamps generated by Storyteller are relative to the audiobook chapters instead of the whole audio. If going down that path, I'm not sure if it would make more sense to modify Storyteller to include some metadata to map the chapter offsets back to the original file, or have audiobookshelf do some post processing after running Storyteller.
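
If the post-processing route were taken, the conversion itself is simple once the chapter start offsets are known. A minimal sketch, assuming a hypothetical lookup table from chapter audio file to its start time in the full audiobook (in seconds); the field names are made up for this example:

```javascript
// Convert chapter-relative clip times into positions in the whole audiobook.
// `chapterStarts` maps a chapter's audio file to its start offset in seconds.
function toAbsoluteTimes(clips, chapterStarts) {
  return clips.map((clip) => {
    const offset = chapterStarts[clip.chapter];
    if (offset === undefined) throw new Error(`Unknown chapter: ${clip.chapter}`);
    return {
      ...clip,
      absoluteBegin: offset + clip.clipBegin,
      absoluteEnd: offset + clip.clipEnd,
    };
  });
}

const chapterStarts = { "ch1.mp3": 0, "ch2.mp3": 1800 };
const clips = [
  { chapter: "ch1.mp3", clipBegin: 12, clipEnd: 15.5 },
  { chapter: "ch2.mp3", clipBegin: 10, clipEnd: 14 },
];
console.log(toAbsoluteTimes(clips, chapterStarts));
```

The harder part is producing `chapterStarts` reliably, which is where metadata emitted by Storyteller or the chapter data ABS already stores would come in.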

stassinari · 2024-03-12T16:10:24Z

With the latest iOS 17.4 update, Apple introduced a new transcript feature which is useful and quite intuitive.

I know it's not exactly what this issue is about, but there might be interesting ideas there, especially in terms of UX.

sevenlayercookie · 2024-03-30T03:32:55Z

> That flow sounds excellent to me! I think it would definitely make sense to be able to create a book entity in Storyteller from existing files, in addition to the current upload flow. An automated matching system sounds a little fraught, but I'm open to exploring it; the manual matching system you have laid out here sounds great as a start.

Have you experimented with live transcription using Whisper? As in, using whisper to transcribe what is currently being played and "buffering" 30 seconds ahead or so. Even using CPU alone, it sounds like faster-whisper can easily outpace an audiobook playing at original speed (1x). Would essentially be Immersive Reading (and would localize to the individual word as well, rather than just the whole sentence). And I suppose this transcription could be cached for future use and fed into the fuzzy search to attempt to sync with an ebook as well.

Basically an on-demand, live transcription version of Storyteller, cutting out need for pre-processing.
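
The buffering part could be as simple as the scheduling function sketched below: given the playhead and how far transcription has progressed, decide which chunk of audio to transcribe next. It assumes a single contiguous transcribed region and says nothing about how the transcription itself is invoked; the names and the 10-second chunk length are made up for this example.

```javascript
// Decide which slice of audio to transcribe next so that the transcript stays
// `lookahead` seconds ahead of the playhead. All times are in seconds.
// Returns null when the buffer is already full.
function nextChunkToTranscribe({ position, transcribedUntil, duration, lookahead = 30, chunkLength = 10 }) {
  // If the listener seeked past the transcribed region, restart at the playhead.
  const start = Math.max(transcribedUntil, position);
  const target = Math.min(position + lookahead, duration);
  if (start >= target) return null;
  return { start, end: Math.min(start + chunkLength, duration) };
}

// Playback just started: transcribe the first chunk.
console.log(nextChunkToTranscribe({ position: 0, transcribedUntil: 0, duration: 3600 }));
// Already 30 seconds ahead of the playhead: nothing to do.
console.log(nextChunkToTranscribe({ position: 5, transcribedUntil: 35, duration: 3600 }));
```

Finished chunks could then be written to a cache so a second listen, or a later sync with an ebook, would not need to transcribe again.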

Astorsoft · 2024-05-09T09:27:42Z

This idea would be amazing, and outsourcing the sync to a dedicated tool like Storyteller is a great idea. If you want to go down the route of an internal service, however (I've already mentioned this on Storyteller's project), I think https://github.com/echogarden-project/echogarden is an amazing backend for speech-to-transcript alignment that works with many more languages than English. I did some tests on Swedish and the results were very conclusive. Based on their docs, it can go down to word-level alignment with great accuracy.

Audiobook/epub alignment is always better than TTS, as the reader often makes a great effort to change their tone of voice for each character and does a good job of expressing the characters' feelings. Maybe one day TTS will reach this stage, but we're not there yet.

Lastly, good luck on the player part. It's a nightmare to find a good epub reader with media overlay support, at least on Android. Some don't work with specific file formats (like Ogg Vorbis), and some add a weird delay in the playback, making you think the alignment is off when it is in fact perfect when checked on other platforms like Windows.

advplyr added the enhancement New feature or request label Dec 2, 2021

advplyr mentioned this issue May 4, 2023

[Enhancement]: Add Whisper support #1723

Open

mr-ransel mentioned this issue Nov 13, 2023

[Enhancement]: Ability to estimate equivalent timestamp/page numbers between ebooks and audiobooks #2308

Open

advplyr added the ebooks Issue is related to ebooks or ereader label Dec 17, 2023
