Beat Service

Local Python service for beats, downbeats, and lyrics.

The beat service is a small FastAPI app at beat-service/app.py. It runs on http://127.0.0.1:8011 and exposes three endpoints used by the studio.

Why Python, why local

  • Beat detection (beat_this), source separation (Demucs), and transcription (faster-whisper) all live in the PyTorch ecosystem.
  • The models are heavy to import (~2 s for PyTorch alone) and slow to warm. Keeping them in their own process means the Next.js studio reloads instantly and the models stay hot across requests.
  • Running locally side-steps quota and privacy concerns — the audio never leaves your laptop.

Requirements

  • Python 3.10 / 3.11 / 3.12. PyTorch wheels are unreliable on 3.13+.
  • FFmpeg on PATH (Demucs writes WAV stems via torchaudio.save, which on torchaudio ≥ 2.7 requires torchcodec plus a system FFmpeg).
  • On Apple Silicon the model uses the MPS backend; otherwise it falls back to CPU.

Endpoints

GET /health

Reports liveness and whether the heavy models have loaded.
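A client can poll this endpoint before sending heavier requests. The sketch below is a minimal example that checks only for an HTTP 200; the response body schema is not described here, so it is not parsed. The helper names `is_up` and `wait_until_up` are hypothetical.

```python
import time
import urllib.error
import urllib.request

HEALTH_URL = "http://127.0.0.1:8011/health"


def is_up(url: str = HEALTH_URL, timeout: float = 2.0) -> bool:
    """Return True if the service answers GET /health with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or HTTP error: treat as "not up yet".
        return False


def wait_until_up(probe=is_up, attempts: int = 30, delay_sec: float = 1.0) -> bool:
    """Call probe() until it succeeds or the attempts run out."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            time.sleep(delay_sec)
    return False
```

Passing the probe as a parameter keeps the retry loop testable without a running service.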

POST /beat-grid

Returns beats_sec, downbeats_sec, and a tempo_bpm derived from the median beat interval.
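As an illustration of the tempo derivation, the following sketch computes a BPM value from a list of beat times as 60 divided by the median inter-beat interval. It is not the service's actual code; the function name `tempo_bpm_from_beats` and the handling of short inputs are assumptions.

```python
from statistics import median
from typing import List, Optional


def tempo_bpm_from_beats(beats_sec: List[float]) -> Optional[float]:
    """Estimate tempo as 60 / median inter-beat interval.

    Returns None when fewer than two strictly increasing beats are given.
    """
    # Differences between consecutive beats; non-increasing pairs are skipped.
    intervals = [b - a for a, b in zip(beats_sec, beats_sec[1:]) if b > a]
    if not intervals:
        return None
    return 60.0 / median(intervals)
```

Using the median rather than the mean keeps a single missed or doubled beat from skewing the estimate.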

curl -X POST http://127.0.0.1:8011/beat-grid \
  -H "Content-Type: application/json" \
  -d '{"audioPath":"/projects/<project-id>/audio.mp3","beatsPerBar":4}'

audioPath is resolved relative to public/.
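One plausible way to perform that resolution is sketched below. It is an assumption about the behaviour, not the service's code: the function name `resolve_audio_path`, the `public_root` argument, and the rejection of paths that escape public/ are all hypothetical.

```python
from pathlib import Path


def resolve_audio_path(audio_path: str, public_root: Path) -> Path:
    """Map a request audioPath such as "/projects/x/audio.mp3" onto public_root."""
    root = public_root.resolve()
    # Strip leading slashes so the path is joined under root instead of replacing it.
    candidate = (root / audio_path.lstrip("/")).resolve()
    # Reject paths that climb out of public/ via ".." segments.
    if candidate != root and root not in candidate.parents:
        raise ValueError(f"audioPath escapes public/: {audio_path!r}")
    return candidate
```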

POST /lyrics

Word-level lyrics with per-word timestamps. The pipeline is:

  1. Demucs (htdemucs) isolates the vocal stem and caches it under beat-service/.cache/vocals/. Subsequent calls for the same file hit the cache instantly. Set useDemucs: false to skip this and transcribe the original mix (faster, much worse on music with drums).
  2. faster-whisper transcribes the vocal stem with word_timestamps=True, so each word carries start_sec, end_sec, and confidence. These line up with the beats_sec from /beat-grid. Word times are approximate (especially on fast vocals); see Lessons Learned for caveats and a TODO on tighter alignment.
  3. Energy onset refine (Demucs only, default on): nudges each word’s start_sec forward to the first short analysis window where stem energy exceeds an adaptive dB threshold (in the same spirit as voiced gating). The shift is capped so that Whisper's timing is never discarded outright. This reduces cases where words appear to start before audible vocals. Set refineWordEnergyStarts: false to skip; see debug.word_start_energy_refine in the response.

curl -X POST http://127.0.0.1:8011/lyrics \
  -H "Content-Type: application/json" \
  -d '{"audioPath":"/projects/<project-id>/audio.mp3","language":"en"}'
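The energy onset refine in step 3 can be sketched roughly as follows. This is a simplified illustration, not the service's implementation: the window length, the threshold rule (loudest window in the word minus a fixed margin), the shift cap, and the names `window_db` and `refine_word_start` are all assumptions.

```python
import math


def window_db(samples, start, size):
    """RMS level of samples[start:start + size] in dB, floored at -120 dB."""
    chunk = samples[start:start + size]
    if not chunk:
        return -120.0
    rms = math.sqrt(sum(x * x for x in chunk) / len(chunk))
    return 20.0 * math.log10(max(rms, 1e-6))


def refine_word_start(samples, sample_rate, start_sec, end_sec,
                      window_sec=0.02, margin_db=30.0, max_shift_sec=0.25):
    """Nudge start_sec forward to the first window at or above the threshold.

    The threshold adapts to the word: loudest window minus margin_db.
    If no window qualifies within max_shift_sec, the Whisper start is kept.
    """
    size = max(1, round(window_sec * sample_rate))
    first = round(start_sec * sample_rate)
    last = min(round(end_sec * sample_rate), len(samples))
    starts = list(range(first, max(first, last - size + 1), size))
    if not starts:
        return start_sec
    levels = [window_db(samples, s, size) for s in starts]
    threshold = max(levels) - margin_db
    limit = first + round(max_shift_sec * sample_rate)
    for s, level in zip(starts, levels):
        if s > limit:
            break  # cap reached: keep the original timestamp
        if level >= threshold:
            return max(start_sec, s / sample_rate)
    return start_sec
```

With 100 ms of silence before the vocal, the start moves to roughly 0.1 s; with 400 ms of silence, the cap applies and the original start is returned unchanged.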

Tunables

  • WHISPER_MODEL — defaults to large-v3 (~3 GB). It is the only model that reliably handles sung vocals; small / medium mishear most lyrics and fall back to memorised training-set phrases on weak passages. Override only when transcribing spoken-word English.
  • useDemucs — request-level flag to skip vocal isolation.
  • refineWordEnergyStarts — default true with Demucs; set false on /lyrics to skip the post-Whisper energy onset nudge for word starts.
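The sketch below shows one way these tunables might be read on the Python side. It assumes that useDemucs defaults to true when omitted, and that the refine step is disabled whenever Demucs is skipped; the names `whisper_model_name` and `LyricsOptions` are hypothetical.

```python
import os
from dataclasses import dataclass


def whisper_model_name(env=os.environ) -> str:
    """Read WHISPER_MODEL, falling back to the documented default large-v3."""
    return env.get("WHISPER_MODEL", "").strip() or "large-v3"


@dataclass
class LyricsOptions:
    """Request-level flags for /lyrics (Python-side names are illustrative)."""
    use_demucs: bool = True
    refine_word_energy_starts: bool = True

    @classmethod
    def from_payload(cls, payload: dict) -> "LyricsOptions":
        use_demucs = bool(payload.get("useDemucs", True))
        # The refine step works on the isolated stem, so it is off without Demucs.
        refine = use_demucs and bool(payload.get("refineWordEnergyStarts", True))
        return cls(use_demucs=use_demucs, refine_word_energy_starts=refine)
```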

First-run timings (Apple Silicon, CPU/MPS)

Step                             Time
Initial PyTorch import           ~2 s
beat_this checkpoint pull        ~10 s
large-v3 Whisper download        ~60 s+
Demucs separation (3 min)        ~30 s
Whisper transcription (3 min)    ~10 s

Cached calls for a previously analysed track typically complete in a few seconds.

Cache layout

  • beat-service/.cache/vocals/<hash>.wav — isolated vocal stems.
  • beat-service/.cache/whisper/ — Whisper checkpoints (managed by faster-whisper).
  • ~/.cache/torch/hub/beat_this and Demucs weights.

Delete a cached vocal stem to force re-isolation; delete a Whisper checkpoint to force a re-download.
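The exact scheme behind <hash> is not specified here. As a sketch under the assumption that the key is a SHA-256 digest of the audio file's bytes, a cache path could be derived like this (the function name `vocal_stem_cache_path` is hypothetical):

```python
import hashlib
from pathlib import Path


def vocal_stem_cache_path(audio_file: Path, cache_dir: Path) -> Path:
    """Return cache_dir/<sha256 of file bytes>.wav for the given audio file."""
    digest = hashlib.sha256()
    with audio_file.open("rb") as fh:
        # Read in 1 MiB chunks so large audio files are not loaded at once.
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return cache_dir / f"{digest.hexdigest()}.wav"
```

A content-based key means a re-exported or edited file gets a fresh stem, while renaming the same file still hits the cache.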