Models

splifft supports:

  • Separation: isolate stems (vocals, drums, bass, ...)
  • Sequence labeling: predict frame-wise musical signals (beat, pitch, onset, ...)

Goal: one runtime, one config style, one cache/registry flow. More information on config shape is in splifft.config.Config.

Supported models

BS-Roformer / Mel-Roformer

Use this family when separation quality is the top priority.

In splifft, Mel-Roformer is implemented as the same model family, with splifft.models.bs_roformer.MelBandsConfig enabled.

MDX23C

This is an older architecture that won the Sound Demixing Challenge 2023 Leaderboard C. It is used in some drum separation checkpoints from the community registry.

Beat This

Outputs frame-wise beat/downbeat activations (.npy). We intentionally avoid depending on legacy DBN post-processing stacks so inference stays lightweight and reproducible.
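Because splifft emits raw activations, turning them into beat times is left to the consumer. A minimal sketch using naive local-maximum peak picking (the frame rate and threshold values here are illustrative assumptions, not splifft defaults):

```python
import numpy as np

def pick_beats(activations: np.ndarray, fps: float = 50.0, threshold: float = 0.5) -> np.ndarray:
    """Return beat times in seconds via naive peak picking on a 1-D activation curve."""
    a = activations
    # a frame is a peak if it exceeds the threshold and is >= both neighbours
    peaks = (a[1:-1] > threshold) & (a[1:-1] >= a[:-2]) & (a[1:-1] >= a[2:])
    frames = np.flatnonzero(peaks) + 1
    return frames / fps

acts = np.zeros(200)
acts[[10, 60, 110, 160]] = 0.9  # synthetic beat activations
print(pick_beats(acts))  # beats at 0.2, 1.2, 2.2, 3.2 s
```

Compared to a DBN decode, this has no tempo prior, so it is only suitable when the activations are already clean.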

Why not add DBN?

The reference package beat_this depends on madmom.features.downbeats.DBNDownBeatTrackingProcessor as a post-processing step, which uses:

  • a state space that represents progression through a measure; for a 4/4 time signature, states represent grid positions (e.g. "beat 1, 25% through")
  • a transition model that encodes tempo (BPM) and continuity, penalising sudden tempo jumps
  • an observation model that takes the raw logit values (activations) as input probabilities
  • a Viterbi-like algorithm that finds the most likely path through the states
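For intuition, the decoding step above amounts to a standard Viterbi pass over discrete states. A toy sketch with a four-position 4/4 grid (the state space, transition probabilities, and observation values here are illustrative, not madmom's actual model):

```python
import numpy as np

def viterbi(log_init, log_trans, log_obs):
    """Most likely state path given log-domain initial, transition and observation scores."""
    n_frames, n_states = log_obs.shape
    score = log_init + log_obs[0]
    back = np.zeros((n_frames, n_states), dtype=int)
    for t in range(1, n_frames):
        cand = score[:, None] + log_trans   # cand[i, j]: score of moving i -> j
        back[t] = cand.argmax(axis=0)       # best predecessor for each state j
        score = cand.max(axis=0) + log_obs[t]
    path = [int(score.argmax())]
    for t in range(n_frames - 1, 0, -1):    # backtrack from the final frame
        path.append(int(back[t, path[-1]]))
    return path[::-1]

eps = -1e9  # effectively impossible transitions
log_trans = np.full((4, 4), eps)
for i in range(4):
    log_trans[i, (i + 1) % 4] = np.log(0.9)  # advance one grid position
    log_trans[i, i] = np.log(0.1)            # slightly slower tempo: stay put
log_init = np.log(np.full(4, 0.25))
beat = np.log([0.97, 0.01, 0.01, 0.01])      # activation peaks on the downbeat state
rest = np.log([0.01, 0.33, 0.33, 0.33])
log_obs = np.array([beat if t % 4 == 0 else rest for t in range(8)])
print(viterbi(log_init, log_trans, log_obs))  # [0, 1, 2, 3, 0, 1, 2, 3]
```

The transition model smooths over noisy activations, which is exactly the machinery splifft chooses not to ship.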

However, madmom has been abandoned as of 2024-08-25, so we do not depend on it.

PESTO

Use PESTO when the input is dominated by a single melodic source (voice, lead, solo line). It is designed for stable frame-level F0 tracking.

Outputs pitch, confidence, volume, activations as frame sequences (.npy).
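Since pitch and confidence are parallel frame sequences, a consumer will typically gate the pitch track by confidence. A minimal sketch (the array names, shapes, and the 0.5 threshold are assumptions for illustration; the actual .npy contents may differ):

```python
import numpy as np

pitch = np.array([220.0, 221.0, 0.0, 219.5])  # Hz per frame
confidence = np.array([0.9, 0.8, 0.1, 0.85])  # per-frame confidence in [0, 1]

voiced = confidence >= 0.5                    # simple confidence gate
clean = np.where(voiced, pitch, np.nan)       # mark unvoiced frames as NaN
print(clean)
```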

Basic Pitch (polyphonic pitch)

Use Basic Pitch when multiple notes can be active at once. In splifft we intentionally expose raw outputs and do not support MIDI decoding, so downstream applications can choose their own thresholding/hysteresis policy.

Outputs onset, note, and contour activation maps (.npy). Onset and note maps use 1 bin per semitone; contour uses 3 bins per semitone.
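As one example of a downstream policy, a consumer might decode a single note bin with two-threshold hysteresis: a note starts only when the activation is high, but stays active until it drops well below that. A minimal sketch (the thresholds and array are illustrative assumptions):

```python
import numpy as np

def hysteresis(act: np.ndarray, on: float = 0.6, off: float = 0.3) -> np.ndarray:
    """Frame-wise note on/off: start a note above `on`, keep it until below `off`."""
    active = np.zeros(len(act), dtype=bool)
    state = False
    for t, a in enumerate(act):
        state = a >= on if not state else a >= off
        active[t] = state
    return active

act = np.array([0.1, 0.7, 0.5, 0.4, 0.2, 0.7])  # one note bin over six frames
print(hysteresis(act))  # [False  True  True  True False  True]
```

The gap between the two thresholds suppresses flicker around a single cutoff, which is why exposing raw maps instead of baked-in MIDI is useful.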

Visual comparisons (MVSep)

The following are quick visual comparisons of separation quality on MVSep (separation models only).

Instrumental

Vocals