Beat Service
The beat service is a small FastAPI app at beat-service/app.py. It runs
on http://127.0.0.1:8011 and exposes three endpoints used by the
studio.
Why Python, why local
- Beat detection (beat_this), source separation (Demucs), and transcription (faster-whisper) all live in the PyTorch ecosystem.
- The models are heavy to import (~2 s for PyTorch alone) and slow to warm. Keeping them in their own process means the Next.js studio reloads instantly and the models stay hot across requests.
- Running locally side-steps quota and privacy concerns — the audio never leaves your laptop.
Requirements
- Python 3.10 / 3.11 / 3.12. PyTorch wheels are unreliable on 3.13+.
- FFmpeg on PATH (Demucs writes WAV stems via torchaudio.save, which on torchaudio ≥ 2.7 requires torchcodec plus a system FFmpeg).
- On Apple Silicon the model uses the MPS backend; otherwise it falls back to CPU.
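The two host-level requirements (Python version and FFmpeg on PATH) can be checked before starting the service. The following is a minimal sketch using only the standard library; the function name preflight_problems and the message wording are illustrative and not part of the service.

```python
import shutil
import sys

# Python 3.10 to 3.12, per the requirements listed above.
SUPPORTED_MINORS = (10, 11, 12)


def preflight_problems(version, ffmpeg_path):
    """Return a list of human-readable problems; an empty list means both checks passed."""
    problems = []
    major, minor = version[0], version[1]
    if major != 3 or minor not in SUPPORTED_MINORS:
        problems.append(f"Python {major}.{minor} is outside the supported 3.10 to 3.12 range")
    if ffmpeg_path is None:
        problems.append("ffmpeg was not found on PATH")
    return problems


if __name__ == "__main__":
    # shutil.which returns None when the executable is not on PATH.
    for problem in preflight_problems(sys.version_info, shutil.which("ffmpeg")):
        print("WARN:", problem)
```

Passing the version and the FFmpeg path in as arguments keeps the check easy to exercise without changing the host environment.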
Endpoints
GET /health
Reports liveness and whether the heavy models have loaded.
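A caller can probe this endpoint before sending heavier requests. The sketch below fetches the JSON body with the standard library and returns None when the service is not reachable. The response field names are not documented here, so the helper (a hypothetical fetch_health) returns the parsed body as-is rather than assuming any keys.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8011"


def fetch_health(base_url=BASE_URL, timeout=2.0):
    """Return the parsed /health JSON body, or None if the service cannot be reached."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return json.load(resp)
    except (OSError, ValueError):
        # OSError covers refused connections, timeouts, and HTTP errors
        # (urllib's URLError and HTTPError subclass it). ValueError covers a
        # malformed URL or a body that is not valid JSON.
        return None


if __name__ == "__main__":
    print(fetch_health())
```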
POST /beat-grid
Returns beats_sec, downbeats_sec, and a tempo_bpm derived from the
median beat interval.
curl -X POST http://127.0.0.1:8011/beat-grid \
-H "Content-Type: application/json" \
-d '{"audioPath":"/projects/<project-id>/audio.mp3","beatsPerBar":4}'
audioPath is resolved relative to public/.
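The tempo derivation described for /beat-grid can be illustrated with a few lines of Python. This is a sketch of the general technique (60 divided by the median inter-beat interval), not the service's actual code; the function name tempo_from_beats is an assumption.

```python
from statistics import median


def tempo_from_beats(beats_sec):
    """Estimate tempo in BPM from beat timestamps via the median inter-beat interval."""
    # Differences between consecutive beats; non-increasing pairs are ignored.
    intervals = [b - a for a, b in zip(beats_sec, beats_sec[1:]) if b > a]
    if not intervals:
        return None  # fewer than two distinct beats: tempo is undefined
    return 60.0 / median(intervals)


if __name__ == "__main__":
    # One missed beat (the 1.5 s gap) does not move the median interval.
    print(tempo_from_beats([0.0, 0.5, 1.0, 2.5, 3.0]))
```

Using the median rather than the mean keeps a single missed or spurious beat from skewing the estimate.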
POST /lyrics
Word-level lyrics with per-word timestamps. The pipeline is:
- Demucs (htdemucs) isolates the vocal stem and caches it under beat-service/.cache/vocals/. Subsequent calls for the same file hit the cache instantly. Set useDemucs: false to skip this and transcribe the original mix (faster, much worse on music with drums).
- faster-whisper transcribes the vocal stem with word_timestamps=True, so each word carries start_sec, end_sec, and confidence. These line up with the beats_sec from /beat-grid. Word times are approximate (especially on fast vocals); see Lessons Learned for caveats and a TODO on tighter alignment.
- Energy onset refine (Demucs only, default on): nudge each word’s start_sec forward to the first short analysis window where stem energy beats an adaptive dB threshold (same spirit as voiced gating), capped so Whisper is not discarded blindly. Reduces cases where words appear to start before audible vocals. Set refineWordEnergyStarts: false to skip; see debug.word_start_energy_refine in the response.
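The energy onset refinement step can be sketched in plain Python. The service's window length, cap, and threshold rule are not given here, so every default below (20 ms windows, a 0.3 s cap, a threshold 20 dB under the loudest nearby window with a -60 dB floor) is an assumption chosen for illustration, as are the function names.

```python
import math


def window_db(samples, start, length):
    """RMS level of one window in dB, floored so silence does not hit log(0)."""
    chunk = samples[start:start + length]
    if not chunk:
        return -120.0
    rms = math.sqrt(sum(x * x for x in chunk) / len(chunk))
    return 20.0 * math.log10(max(rms, 1e-6))


def refine_start(samples, sample_rate, start_sec, window_sec=0.02,
                 max_shift_sec=0.3, margin_db=20.0, floor_db=-60.0):
    """Move start_sec forward to the first window that is loud enough.

    Returns start_sec unchanged when no window inside the cap qualifies,
    so the original (Whisper) time is kept rather than discarded.
    """
    win = max(1, round(window_sec * sample_rate))
    first = round(start_sec * sample_rate)
    last = min(len(samples), first + round(max_shift_sec * sample_rate))
    starts = list(range(first, last, win))
    levels = [window_db(samples, s, win) for s in starts]
    if not levels:
        return start_sec
    # Assumed adaptive rule: relative to the loudest nearby window, with an absolute floor.
    threshold = max(max(levels) - margin_db, floor_db)
    for s, level in zip(starts, levels):
        if level >= threshold:
            return s / sample_rate
    return start_sec


if __name__ == "__main__":
    # Half a second of silence followed by a steady signal, sampled at 1 kHz.
    stem = [0.0] * 500 + [0.5] * 500
    print(refine_start(stem, 1000, 0.3))
```

The cap (max_shift_sec) bounds how far a word can be moved, which matches the stated intent of not overriding the transcriber's timing blindly.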
curl -X POST http://127.0.0.1:8011/lyrics \
-H "Content-Type: application/json" \
-d '{"audioPath":"/projects/<project-id>/audio.mp3","language":"en"}'
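Because word times and beats_sec share one timeline, a consumer can look up the beat closest to each word. The sketch below shows one way to do that with a binary search. The helper name nearest_beat_index is hypothetical, and the sample word only uses the three per-word fields named above; the full response shape is not assumed.

```python
from bisect import bisect_left


def nearest_beat_index(beats_sec, t):
    """Index of the beat closest to time t; beats_sec must be sorted ascending."""
    if not beats_sec:
        raise ValueError("beats_sec is empty")
    i = bisect_left(beats_sec, t)
    if i == 0:
        return 0
    if i == len(beats_sec):
        return len(beats_sec) - 1
    # Pick the closer neighbour; ties go to the earlier beat.
    return i if beats_sec[i] - t < t - beats_sec[i - 1] else i - 1


if __name__ == "__main__":
    beats = [0.0, 0.5, 1.0, 1.5]
    word = {"start_sec": 0.48, "end_sec": 0.70, "confidence": 0.9}
    print(nearest_beat_index(beats, word["start_sec"]))
```

Since word times are approximate, treating the result as a hint rather than a hard snap is probably the safer design.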
Tunables
- WHISPER_MODEL: defaults to large-v3 (~3 GB). It is the only model that reliably handles sung vocals; small/medium mishear most lyrics and fall back to memorised training-set phrases on weak passages. Override only when transcribing spoken-word English.
- useDemucs: request-level flag to skip vocal isolation.
- refineWordEnergyStarts: default true with Demucs; set false on /lyrics to skip the post-Whisper energy onset nudge for word starts.
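The request-level flags can be set from Python as well as curl. The sketch below builds a POST /lyrics request with the standard library and only includes a flag when the caller overrides it, so the service defaults apply otherwise. The function name and its parameters are illustrative; the JSON keys are the ones documented above.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8011"


def build_lyrics_request(audio_path, language="en", use_demucs=None, refine_starts=None):
    """Build (but do not send) a POST /lyrics request; None leaves a flag at its default."""
    payload = {"audioPath": audio_path, "language": language}
    if use_demucs is not None:
        payload["useDemucs"] = use_demucs
    if refine_starts is not None:
        payload["refineWordEnergyStarts"] = refine_starts
    return urllib.request.Request(
        f"{BASE_URL}/lyrics",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    req = build_lyrics_request("/projects/<project-id>/audio.mp3", use_demucs=False)
    print(req.get_method(), req.full_url, req.data.decode("utf-8"))
```

To send the request, pass it to urllib.request.urlopen and parse the body with json.load.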
First-run timings (Apple Silicon, CPU/MPS)
| Step | Time |
|---|---|
| Initial PyTorch import | ~2 s |
| beat_this checkpoint pull | ~10 s |
| large-v3 Whisper download | ~60 s+ |
| Demucs separation (3 min) | ~30 s |
| Whisper transcription (3 min) | ~10 s |
Cached calls for a previously analysed track typically complete in a few seconds.
Cache layout
- beat-service/.cache/vocals/<hash>.wav: isolated vocal stems.
- beat-service/.cache/whisper/: Whisper checkpoints (managed by faster-whisper).
- ~/.cache/torch/hub/: beat_this and Demucs weights.
Delete a cached vocal stem to force re-isolation; delete a Whisper checkpoint to force a re-download.
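To decide which stem to delete, it can help to list what is cached. The sketch below lists the .wav files under the vocals cache, newest first, with their sizes. How <hash> maps back to a track is not described here, so ordering by modification time is only a heuristic, and the helper name cached_stems is an assumption.

```python
from pathlib import Path

# Relative to the repository root, per the cache layout above.
VOCALS_CACHE = Path("beat-service/.cache/vocals")


def cached_stems(cache_dir=VOCALS_CACHE):
    """Return (path, size_bytes) for each cached vocal stem, newest first."""
    if not cache_dir.is_dir():
        return []
    stems = sorted(cache_dir.glob("*.wav"), key=lambda p: p.stat().st_mtime, reverse=True)
    return [(p, p.stat().st_size) for p in stems]


if __name__ == "__main__":
    for path, size in cached_stems():
        print(f"{size / 1e6:8.1f} MB  {path.name}")
```

Calling path.unlink() on one of the returned paths removes that stem, which should trigger re-isolation on the next /lyrics call for the same file.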
