Beat Service

The beat service is a small FastAPI app at beat-service/app.py. It runs on http://127.0.0.1:8011 and exposes three endpoints used by the studio.

Why Python, why local

Beat detection (beat_this), source separation (Demucs), and transcription (faster-whisper) all live in the PyTorch ecosystem.
The models are heavy to import (~2 s for PyTorch alone) and slow to warm up. Keeping them in their own process means the Next.js studio reloads instantly and the models stay hot across requests.
Running locally sidesteps quota and privacy concerns: the audio never leaves your laptop.

Requirements

Python 3.10 / 3.11 / 3.12. PyTorch wheels are unreliable on 3.13+.
FFmpeg on PATH (Demucs writes WAV stems via torchaudio.save, which on torchaudio ≥ 2.7 requires torchcodec plus a system FFmpeg).
On Apple Silicon the model uses the MPS backend; otherwise it falls back to CPU.
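
The requirements above can be checked before starting the service. The following is a minimal sketch, not code from beat-service/app.py: the function names are hypothetical, and the device probe assumes PyTorch exposes `torch.backends.mps.is_available()`, as recent releases do.

```python
import shutil
import sys


def python_supported(version=None):
    """Return True for Python 3.10, 3.11, or 3.12."""
    major, minor = (version or sys.version_info)[:2]
    return major == 3 and 10 <= minor <= 12


def pick_device():
    """Prefer MPS when PyTorch reports it as available, otherwise use CPU."""
    try:
        import torch
    except ImportError:
        return "cpu"  # PyTorch is not installed, so there is nothing to probe
    mps = getattr(torch.backends, "mps", None)
    return "mps" if mps is not None and mps.is_available() else "cpu"


def check_environment():
    """Collect readable problems instead of raising on the first one."""
    problems = []
    if not python_supported():
        problems.append(
            "Python %d.%d is outside the 3.10-3.12 range" % sys.version_info[:2]
        )
    if shutil.which("ffmpeg") is None:
        problems.append("ffmpeg was not found on PATH")
    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print("warning:", problem)
    print("device:", pick_device())
```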

Endpoints

`GET /health`

Reports liveness and whether the heavy models have loaded.

`POST /beat-grid`

Returns beats_sec, downbeats_sec, and a tempo_bpm derived from the median beat interval.

curl -X POST http://127.0.0.1:8011/beat-grid \
  -H "Content-Type: application/json" \
  -d '{"audioPath":"/projects/<project-id>/audio.mp3","beatsPerBar":4}'

audioPath is resolved relative to public/.
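
To illustrate how a tempo can be derived from the median beat interval, here is a small sketch. It is an assumption about the arithmetic, not the service's actual code, and `tempo_from_beats` is a hypothetical name. Using the median rather than the mean keeps a single missed beat from skewing the estimate.

```python
from statistics import median


def tempo_from_beats(beats_sec):
    """Estimate BPM as 60 divided by the median inter-beat interval."""
    if len(beats_sec) < 2:
        return None  # at least one interval is needed
    intervals = [b - a for a, b in zip(beats_sec, beats_sec[1:])]
    intervals = [i for i in intervals if i > 0]  # ignore duplicate timestamps
    if not intervals:
        return None
    return 60.0 / median(intervals)


# Steady beats every 0.5 s give 120 BPM.
steady = tempo_from_beats([0.0, 0.5, 1.0, 1.5])

# One missed beat (a 1.0 s gap) does not change the median interval.
with_gap = tempo_from_beats([0.0, 0.5, 1.0, 2.0, 2.5])
```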

`POST /lyrics`

Returns word-level lyrics with per-word timestamps. The pipeline is:

Demucs (htdemucs) isolates the vocal stem and caches it under beat-service/.cache/vocals/. Subsequent calls for the same file hit the cache instantly. Set useDemucs: false to skip this step and transcribe the original mix (faster, but much less accurate on music with drums).
faster-whisper transcribes the vocal stem with word_timestamps=True, so each word carries start_sec, end_sec, and confidence. These line up with the beats_sec from /beat-grid. Word times are approximate (especially on fast vocals); see Lessons Learned for caveats and a TODO on tighter alignment.
Energy onset refinement (Demucs only, on by default) nudges each word's start_sec forward to the first short analysis window in which the stem energy exceeds an adaptive dB threshold (in the same spirit as voiced gating). The shift is capped so that Whisper's timing is never discarded outright. This reduces cases where words appear to start before the vocals are audible. Set refineWordEnergyStarts: false to skip this step; see debug.word_start_energy_refine in the response.
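
The energy onset step can be sketched roughly as follows. This is an illustration of the idea only, not the service's implementation: the function names, the 10 ms window, the 12 dB margin, and the 0.25 s cap are all assumed values, and the "adaptive" threshold here is simply set relative to the loudest window inside the word.

```python
import math


def window_db(samples, start, size):
    """RMS level of one window in dB relative to full scale (1.0)."""
    chunk = samples[start:start + size]
    if not chunk:
        return -120.0
    rms = math.sqrt(sum(x * x for x in chunk) / len(chunk))
    return 20.0 * math.log10(max(rms, 1e-6))  # floor at -120 dB


def refine_word_start(samples, sample_rate, start_sec, end_sec,
                      window_sec=0.01, margin_db=12.0, max_shift_sec=0.25):
    """Move start_sec forward to the first window above the threshold.

    The threshold sits margin_db below the loudest window between
    start_sec and end_sec. If no window qualifies within max_shift_sec,
    the original Whisper start time is returned unchanged.
    """
    size = max(1, round(window_sec * sample_rate))
    first = int(start_sec * sample_rate)
    last = min(int(end_sec * sample_rate), len(samples))
    starts = range(first, last, size)
    levels = [window_db(samples, s, size) for s in starts]
    if not levels:
        return start_sec
    threshold = max(levels) - margin_db
    for s, level in zip(starts, levels):
        shift = (s - first) / sample_rate
        if shift > max_shift_sec:
            break  # cap reached: keep Whisper's estimate
        if level >= threshold:
            return start_sec + shift
    return start_sec
```

With 0.2 s of silence followed by a steady tone, a word that Whisper placed at 0.0 s moves to about 0.2 s. If the silence lasts longer than the cap, the original start time is kept.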

curl -X POST http://127.0.0.1:8011/lyrics \
  -H "Content-Type: application/json" \
  -d '{"audioPath":"/projects/<project-id>/audio.mp3","language":"en"}'
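
The same calls can be made from Python with the standard library. This is a minimal sketch: `build_request` and `post_json` are hypothetical helper names, and the generous timeout is an assumption chosen because first-run requests may include model downloads.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8011"


def build_request(endpoint, audio_path, **options):
    """Build a JSON POST request for /beat-grid or /lyrics without sending it."""
    payload = {"audioPath": audio_path, **options}
    return urllib.request.Request(
        BASE_URL + endpoint,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def post_json(request, timeout=600):
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

For example, `post_json(build_request("/lyrics", "/projects/<project-id>/audio.mp3", language="en", useDemucs=False))` would transcribe the original mix, assuming the service is running locally.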

Tunables

WHISPER_MODEL — defaults to large-v3 (~3 GB). It is the only model that reliably handles sung vocals; small / medium mishear most lyrics and fall back to memorised training-set phrases on weak passages. Override only when transcribing spoken-word English.
useDemucs — request-level flag to skip vocal isolation.
refineWordEnergyStarts — default true with Demucs; set false on /lyrics to skip the post-Whisper energy onset nudge for word starts.

First-run timings (Apple Silicon, CPU/MPS)

Step	Time
Initial PyTorch import	~2 s
`beat_this` checkpoint pull	~10 s
`large-v3` Whisper download	~60 s+
Demucs separation (3 min track)	~30 s
Whisper transcription (3 min track)	~10 s

Cached calls for a previously analysed track typically complete in a few seconds.

Cache layout

beat-service/.cache/vocals/<hash>.wav — isolated vocal stems.
beat-service/.cache/whisper/ — Whisper checkpoints (managed by faster-whisper).
~/.cache/torch/hub/ — beat_this and Demucs weights.

Delete a cached vocal stem to force re-isolation; delete a Whisper checkpoint to force a re-download.
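
Clearing the vocal cache can also be scripted. The sketch below assumes only the directory layout listed above; `clear_vocal_stems` is a hypothetical helper, and it removes every cached stem rather than a single track's, since the hashing scheme is not documented here.

```python
from pathlib import Path


def clear_vocal_stems(cache_dir="beat-service/.cache/vocals"):
    """Delete every cached .wav stem and return how many were removed."""
    root = Path(cache_dir)
    if not root.is_dir():
        return 0  # nothing cached yet
    removed = 0
    for stem in root.glob("*.wav"):
        stem.unlink()
        removed += 1
    return removed
```

The next /lyrics call for any track then re-runs Demucs and repopulates the cache.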