Wordcab Transcribe

πŸ’¬ Speech recognition is now a commodity


FastAPI-based API for transcribing audio files using faster-whisper and Auto-Tuning Spectral Clustering for diarization (based on this GitHub implementation).

Important

To see how Wordcab Transcribe performs against the other ASR tools available on the market, check out our benchmark project: Rate that ASR.

Key features

  • ⚑ Fast: The faster-whisper library and CTranslate2 make audio processing incredibly fast compared to other implementations.
  • 🐳 Easy to deploy: You can deploy the project on your workstation or in the cloud using Docker.
  • πŸ”₯ Batch requests: You can transcribe multiple audio files at once because batch requests are implemented in the API.
  • πŸ’Έ Cost-effective: As an open-source solution, you won't have to pay for costly ASR platforms.
  • 🫢 Easy-to-use API: With just a few lines of code, you can use the API to transcribe audio files or even YouTube videos.
  • πŸ€— MIT License: You can use the project for commercial purposes without any restrictions.

Requirements

Local development

  • Linux (tested on Ubuntu Server 20.04/22.04)
  • Python >=3.8, <3.12
  • Hatch
  • FFmpeg
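
On a fresh Ubuntu machine, the prerequisites can be installed along these lines (a minimal sketch; assumes Ubuntu 20.04/22.04 and pipx for installing Hatch):

sudo apt update && sudo apt install -y ffmpeg
python3 -m pip install --user pipx
pipx install hatch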

Run the API locally πŸš€

hatch run runtime:launch
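
Once the server is up, you can check that it is reachable from another terminal (a quick sanity check, assuming the default port 5001 used throughout this README):

curl http://localhost:5001/docs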

Deployment

Run the API using Docker

Build the image.

docker build -t wordcab-transcribe:latest .

Run the container.

docker run -d --name wordcab-transcribe \
    --gpus all \
    --shm-size 1g \
    --restart unless-stopped \
    -p 5001:5001 \
    -v ~/.cache:/root/.cache \
    wordcab-transcribe:latest

You can mount a volume to the container to load local whisper models.

If you mount a volume, you need to update the WHISPER_MODEL environment variable in the .env file.

docker run -d --name wordcab-transcribe \
    --gpus all \
    --shm-size 1g \
    --restart unless-stopped \
    -p 5001:5001 \
    -v ~/.cache:/root/.cache \
    -v /path/to/whisper/models:/app/whisper/models \
    wordcab-transcribe:latest
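
For example, with the models mounted at /app/whisper/models as above, the corresponding line in .env could look like this (the large-v3 folder name is only an illustration):

WHISPER_MODEL="/app/whisper/models/large-v3"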

You can simply enter the container using the following command:

docker exec -it wordcab-transcribe /bin/bash

This is useful for checking that everything is working as expected.

Run the API behind a reverse proxy

You can run the API behind a reverse proxy like Nginx. We have included an nginx.conf file to help you get started.

# Create a docker network and connect the api container to it
docker network create transcribe
docker network connect transcribe wordcab-transcribe

# Replace /absolute/path/to/nginx.conf with the absolute path to the nginx.conf
# file on your machine (e.g. /home/user/wordcab-transcribe/nginx.conf).
docker run -d \
    --name nginx \
    --network transcribe \
    -p 80:80 \
    -v /absolute/path/to/nginx.conf:/etc/nginx/nginx.conf:ro \
    nginx

# Check everything is working as expected
docker logs nginx
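
Assuming the bundled nginx.conf forwards port 80 to the wordcab-transcribe container, you can also verify the proxy end to end:

curl http://localhost/docs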

⏱️ Profile the API

You can profile the API process using py-spy.

# Launch the container with the cap-add=SYS_PTRACE option
docker run -d --name wordcab-transcribe \
    --gpus all \
    --shm-size 1g \
    --restart unless-stopped \
    --cap-add=SYS_PTRACE \
    -p 5001:5001 \
    -v ~/.cache:/root/.cache \
    wordcab-transcribe:latest

# Enter the container
docker exec -it wordcab-transcribe /bin/bash

# Install py-spy
pip install py-spy

# Find the PID of the process to profile
top  # 28 for example

# Run the profiler
py-spy record --pid 28 --format speedscope -o profile.speedscope.json

# Launch any task on the API to generate some profiling data

# Exit the container and copy the generated file to your local machine
exit
docker cp wordcab-transcribe:/app/profile.speedscope.json profile.speedscope.json

# Go to https://www.speedscope.app/ and upload the file to visualize the profile

Test the API

Once the container is running, you can test the API.

The API documentation is available at http://localhost:5001/docs.

  • Audio file:
import json
import requests

filepath = "/path/to/audio/file.wav"  # or any other convertible format by ffmpeg
data = {
  "num_speakers": -1,  # # Leave at -1 to guess the number of speakers
  "diarization": True,  # Longer processing time but speaker segment attribution
  "multi_channel": False,  # Only for stereo audio files with one speaker per channel
  "source_lang": "en",  # optional, default is "en"
  "timestamps": "s",  # optional, default is "s". Can be "s", "ms" or "hms".
  "word_timestamps": False,  # optional, default is False
}

with open(filepath, "rb") as f:
    files = {"file": f}
    response = requests.post(
        "http://localhost:5001/api/v1/audio",
        files=files,
        data=data,
    )

r_json = response.json()

filename = filepath.rsplit(".", 1)[0]  # strip only the file extension
with open(f"{filename}.json", "w", encoding="utf-8") as f:
    json.dump(r_json, f, indent=4, ensure_ascii=False)

  • YouTube video:
import json
import requests

headers = {"accept": "application/json", "Content-Type": "application/json"}
params = {"url": "https://youtu.be/JZ696sbfPHs"}
data = {
    "diarization": True,  # Longer processing time but speaker segment attribution
    "source_lang": "en",  # optional, default is "en"
    "timestamps": "s",  # optional, default is "s". Can be "s", "ms" or "hms".
    "word_timestamps": False,  # optional, default is False
}

response = requests.post(
    "http://localhost:5001/api/v1/youtube",
    headers=headers,
    params=params,
    data=json.dumps(data),
)

r_json = response.json()

with open("youtube_video_output.json", "w", encoding="utf-8") as f:
    json.dump(r_json, f, indent=4, ensure_ascii=False)

Running Local Models

You can point the API at a local folder to use a custom model. If you do so, mount the folder as a volume in the docker run command, or include the model directory in your Dockerfile to bake it into the image.

Note that for the default tensorrt-llm whisper engine, the simplest way to get a converted model is to use hatch to start the server locally once. Specify the WHISPER_MODEL and ALIGN_MODEL in .env, then run hatch run runtime:launch in your terminal. This will download and convert these models.
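
For instance, reusing the model names from the Dockerfile example below (illustrative values only):

WHISPER_MODEL="large-v3"
ALIGN_MODEL="tiny"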

You'll then find the converted models in cloned_wordcab_transcribe_repo/src/wordcab_transcribe/whisper_models. Then in your Dockerfile, copy the converted models to the /app/src/wordcab_transcribe/whisper_models directory.

Example Dockerfile line for WHISPER_MODEL:
COPY cloned_wordcab_transcribe_repo/src/wordcab_transcribe/whisper_models/large-v3 /app/src/wordcab_transcribe/whisper_models/large-v3

Example Dockerfile line for ALIGN_MODEL:
COPY cloned_wordcab_transcribe_repo/src/wordcab_transcribe/whisper_models/tiny /app/src/wordcab_transcribe/whisper_models/tiny

πŸš€ Contributing

Getting started

  1. Ensure you have Hatch installed (with pipx, for example):
pipx install hatch
  2. Clone the repo:
git clone https://github.com/Wordcab/wordcab-transcribe.git
cd wordcab-transcribe
  3. Install dependencies and start coding:
hatch env create
  4. Run tests:
# Quality checks without modifying the code
hatch run quality:check

# Quality checks and auto-formatting
hatch run quality:format

# Run tests with coverage
hatch run tests:run

Working workflow

  1. Create an issue for the feature or bug you want to work on.
  2. Create a branch using the left panel on GitHub.
  3. git fetch and git checkout the branch.
  4. Make changes and commit.
  5. Push the branch to GitHub.
  6. Create a pull request and ask for review.
  7. Merge the pull request when it's approved and CI passes.
  8. Delete the branch.
  9. Update your local repo with git fetch and git pull.