Types

types

Types for documentation and data validation (for use in pydantic).

They provide semantic meaning only and we additionally use NewType for strong semantic distinction to avoid mixing up different kinds of tensors.

Note that no code implementations shall be placed here.
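As a minimal sketch of why NewType gives a stronger distinction than a plain alias, the example below uses lists of floats as a stand-in for tensors. This is an assumption made to keep the sketch dependency-free; the module itself wraps Tensor. The names `Raw`, `Normalized` and `normalize` are hypothetical and not part of this module.

```python
from typing import NewType

# Stand-ins for tensor types (assumption: the real module wraps `Tensor`).
Raw = NewType("Raw", list[float])
Normalized = NewType("Normalized", list[float])


def normalize(x: Raw) -> Normalized:
    """Hypothetical helper: scale the signal by its peak absolute value."""
    peak = max(abs(v) for v in x) or 1.0  # avoid dividing by zero on silence
    return Normalized([v / peak for v in x])


raw = Raw([0.5, -2.0, 1.0])
norm = normalize(raw)

# A static type checker would reject `normalize(norm)` because `Normalized`
# is not `Raw`. At runtime, NewType returns its argument unchanged.
```

Because the check happens only in the type checker, the distinction costs nothing at runtime.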

Attributes:

Name Type Description
StrPath TypeAlias
BytesPath TypeAlias
Gt0 TypeAlias
Ge0 TypeAlias
ModelType TypeAlias

The type of the model, e.g. bs_roformer, demucs

ModelInputType TypeAlias
ModelOutputType TypeAlias
ChunkSize TypeAlias

The length of an audio segment, in samples, processed by the model at one time.

HopSize TypeAlias

The step size, in samples, between the start of consecutive chunks.

Dropout TypeAlias
ModelOutputStemName TypeAlias

The output stem name, e.g. vocals, drums, bass, etc.

Samples TypeAlias

Number of samples in the audio signal.

SampleRate TypeAlias

The number of samples of audio recorded per second (hertz).

Channels TypeAlias

Number of audio streams.

FileFormat TypeAlias
BitRate TypeAlias

Number of bits of information in each sample.

RawAudioTensor

Time domain tensor of audio samples.

NormalizedAudioTensor

A mixture tensor that has been normalized using on-the-fly statistics.

ComplexSpectrogram

A complex-valued representation of audio's frequency content over time via the STFT.

HybridModelInput TypeAlias

Input for hybrid models that require both spectrogram and waveform.

WindowShape TypeAlias

The shape of the window function applied to each chunk before computing the STFT.

FftSize TypeAlias

The number of frequency bins in the STFT, controlling the frequency resolution.

Bands TypeAlias
BatchSize TypeAlias

The number of chunks processed simultaneously by the GPU.

PaddingMode TypeAlias

The method used to pad the audio before chunking, crucial for handling the edges of the audio signal.

ChunkDuration TypeAlias

The length of an audio segment, in seconds, processed by the model at one time.

OverlapRatio TypeAlias

The fraction of a chunk that overlaps with the next one.

Padding TypeAlias

Samples to add to the beginning and end of each chunk.

PaddedChunkedAudioTensor

A batch of audio chunks from a padded source.

NumModelStems TypeAlias

The number of stems the model outputs. This should be the length of splifft.models.ModelParamsLike.output_stem_names.

SeparatedSpectrogramTensor

A batch of separated spectrograms.

SeparatedChunkedTensor

A batch of separated audio chunks from the model.

WindowTensor

A 1D tensor representing a window function.

RawSeparatedTensor

The final, stitched, raw-domain separated audio.

PreprocessFn TypeAlias
PostprocessFn TypeAlias
Identifier TypeAlias

Format: {{architecture}}-{{first_author}}-{{unique_name_short}}. Replace spaces with underscores.

Instrument TypeAlias
Metric TypeAlias
Sdr TypeAlias

Signal-to-Distortion Ratio (decibels). Higher is better.

SiSdr TypeAlias

Scale-Invariant SDR (SI-SDR) is invariant to scaling errors (decibels). Higher is better.

L1Norm TypeAlias

L1 norm (mean absolute error) between two signals (dimensionless). Lower is better.

DbDifferenceMel TypeAlias

Difference in the dB-scaled mel spectrogram.

Bleedless TypeAlias

A metric to quantify the amount of "bleeding" from other sources. Higher is better.

Fullness TypeAlias

A metric to quantify how much of the original source is missing. Higher is better.

StrPath module-attribute

StrPath: TypeAlias = str | PathLike[str]

BytesPath module-attribute

BytesPath: TypeAlias = bytes | PathLike[bytes]

Gt0 module-attribute

Gt0: TypeAlias = Annotated[_T, Gt(0)]

Ge0 module-attribute

Ge0: TypeAlias = Annotated[_T, Ge(0)]

ModelType module-attribute

ModelType: TypeAlias = str

The type of the model, e.g. bs_roformer, demucs

ModelInputType module-attribute

ModelInputType: TypeAlias = Literal[
    "waveform", "spectrogram", "waveform_and_spectrogram"
]

ModelOutputType module-attribute

ModelOutputType: TypeAlias = Literal[
    "waveform", "spectrogram_mask", "spectrogram"
]

ChunkSize module-attribute

ChunkSize: TypeAlias = Gt0[int]

The length of an audio segment, in samples, processed by the model at one time.

A full audio track is often too long to fit into GPU memory, so we process it in fixed-size chunks. A larger chunk size may allow the model to capture more temporal context at the cost of increased memory usage.

HopSize module-attribute

HopSize: TypeAlias = Gt0[int]

The step size, in samples, between the start of consecutive chunks.

To avoid artifacts at the edges of chunks, we process them with overlap. The hop size is the distance we "slide" the chunking window forward. HopSize < ChunkSize implies overlap, and the overlap amount is ChunkSize - HopSize.
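The relationship between chunk size, hop size and overlap can be sketched as follows. The helper `chunk_starts` is hypothetical and only lists the chunks that fit entirely inside the signal; how the library itself handles the tail is not covered here.

```python
def chunk_starts(num_samples: int, chunk_size: int, hop_size: int) -> list[int]:
    """Start indices of all chunks that fit entirely inside the signal."""
    if not 0 < hop_size <= chunk_size:
        raise ValueError("expected 0 < hop_size <= chunk_size")
    # The last valid start is num_samples - chunk_size.
    return list(range(0, num_samples - chunk_size + 1, hop_size))


chunk_size, hop_size = 8, 6
starts = chunk_starts(20, chunk_size, hop_size)
overlap = chunk_size - hop_size  # samples shared by consecutive chunks
```

With these toy numbers the chunks start at samples 0, 6 and 12, and each chunk shares 2 samples with its neighbour.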

Dropout module-attribute

Dropout: TypeAlias = Annotated[float, Ge(0.0), Le(1.0)]

ModelOutputStemName module-attribute

ModelOutputStemName: TypeAlias = Annotated[str, MinLen(1)]

The output stem name, e.g. vocals, drums, bass, etc.

Samples module-attribute

Samples: TypeAlias = Gt0[int]

Number of samples in the audio signal.

SampleRate module-attribute

SampleRate: TypeAlias = Gt0[int]

The number of samples of audio recorded per second (hertz).

See concepts for more details.

Channels module-attribute

Channels: TypeAlias = Gt0[int]

Number of audio streams.

  • 1: Mono audio
  • 2: Stereo (left and right). Models are usually trained on stereo audio.

FileFormat module-attribute

FileFormat: TypeAlias = Literal['flac', 'wav', 'ogg']

BitRate module-attribute

BitRate: TypeAlias = Literal[8, 16, 24, 32, 64]

Number of bits of information in each sample.

It determines the dynamic range of the audio signal: the difference between the quietest and loudest possible sounds.

  • 16-bit: Standard for CD audio: ~96 dB dynamic range.
  • 24-bit: Common in professional audio, allowing for more headroom during mixing.
  • 32-bit float: Standard in digital audio workstations (DAWs) and deep learning models. The amplitude is represented by a floating-point number, which prevents clipping (distortion from exceeding the maximum value). This library primarily works with fp32 tensors.
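The ~96 dB figure for 16-bit audio follows from the ratio between the largest and smallest representable amplitude of linear integer PCM, roughly 6.02 dB per bit. A small sketch of that arithmetic (the function name is hypothetical, and the formula does not apply to floating-point formats):

```python
import math


def dynamic_range_db(bits: int) -> float:
    """Theoretical dynamic range of linear integer PCM with `bits` bits per sample."""
    return 20 * math.log10(2**bits)


for bits in (8, 16, 24):
    print(f"{bits}-bit: {dynamic_range_db(bits):.1f} dB")
```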

RawAudioTensor module-attribute

RawAudioTensor = NewType('RawAudioTensor', Tensor)

Time domain tensor of audio samples. Shape (channels, samples)

NormalizedAudioTensor module-attribute

NormalizedAudioTensor = NewType(
    "NormalizedAudioTensor", Tensor
)

A mixture tensor that has been normalized using on-the-fly statistics. Shape (channels, samples)

ComplexSpectrogram module-attribute

ComplexSpectrogram = NewType('ComplexSpectrogram', Tensor)

A complex-valued representation of audio's frequency content over time via the STFT.

Shape (channels, frequency bins, time frames, 2)

See concepts for more details.

HybridModelInput module-attribute

Input for hybrid models that require both spectrogram and waveform.

WindowShape module-attribute

WindowShape: TypeAlias = Literal[
    "hann", "hamming", "linear_fade"
]

The shape of the window function applied to each chunk before computing the STFT.
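For reference, the Hann and Hamming shapes have standard textbook definitions, sketched below in their symmetric form. Whether the library uses the symmetric or the periodic variant, and how it defines linear_fade, is not stated here, so both of those are left out of this sketch.

```python
import math


def hann(n: int) -> list[float]:
    """Symmetric Hann window of length n (n >= 2); tapers to zero at both ends."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def hamming(n: int) -> list[float]:
    """Symmetric Hamming window of length n (n >= 2), classic 0.54/0.46 coefficients."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


w = hann(5)
```

The Hann window reaches zero at its edges, while the Hamming window stays at about 0.08 there.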

FftSize module-attribute

FftSize: TypeAlias = Gt0[int]

The number of frequency bins in the STFT, controlling the frequency resolution.

Bands module-attribute

Bands: TypeAlias = Tensor

BatchSize module-attribute

BatchSize: TypeAlias = Gt0[int]

The number of chunks processed simultaneously by the GPU.

Increasing the batch size can improve GPU utilisation and speed up training, but it requires more memory.

PaddingMode module-attribute

PaddingMode: TypeAlias = Literal[
    "reflect", "constant", "replicate"
]

The method used to pad the audio before chunking, crucial for handling the edges of the audio signal.

  • reflect: Pads the signal by reflecting the audio at the boundary. This creates a smooth continuation and often yields the best results for music.
  • constant: Pads with zeros. Simpler, but can introduce silence at the edges.
  • replicate: Repeats the last sample at the edge.
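The three modes can be illustrated on a short list of samples. This is a dependency-free sketch rather than the library's implementation; the helper `pad` is hypothetical, and its reflect behaviour (mirroring around the edge sample without repeating it) is an assumption.

```python
def pad(signal: list[float], n: int, mode: str) -> list[float]:
    """Pad `signal` with `n` samples on each side (requires 0 <= n < len(signal))."""
    if not 0 <= n < len(signal):
        raise ValueError("expected 0 <= n < len(signal)")
    if mode == "reflect":
        # Mirror around the edge sample without repeating it.
        left = signal[1 : n + 1][::-1]
        right = signal[-n - 1 : -1][::-1]
    elif mode == "constant":
        # Zero padding.
        left = right = [0.0] * n
    elif mode == "replicate":
        # Repeat the edge sample.
        left, right = [signal[0]] * n, [signal[-1]] * n
    else:
        raise ValueError(f"unknown mode: {mode}")
    return left + signal + right


x = [1.0, 2.0, 3.0, 4.0]
reflected = pad(x, 2, "reflect")
zeros = pad(x, 2, "constant")
replicated = pad(x, 2, "replicate")
```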

ChunkDuration module-attribute

ChunkDuration: TypeAlias = Gt0[float]

The length of an audio segment, in seconds, processed by the model at one time.

Equivalent to chunk size divided by the sample rate.
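A small sketch of the conversion in both directions. The function names are hypothetical, and rounding to the nearest sample when going from seconds to samples is an assumption of this sketch.

```python
def chunk_duration_seconds(chunk_size: int, sample_rate: int) -> float:
    """Chunk length in seconds: samples divided by samples per second."""
    return chunk_size / sample_rate


def chunk_size_from_duration(duration: float, sample_rate: int) -> int:
    """Chunk length in samples (assumption: round to the nearest sample)."""
    return round(duration * sample_rate)


duration = chunk_duration_seconds(352_800, 44_100)
```

For example, a chunk of 352800 samples at 44100 Hz lasts 8 seconds.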

OverlapRatio module-attribute

OverlapRatio: TypeAlias = Annotated[float, Ge(0), Lt(1)]

The fraction of a chunk that overlaps with the next one.

The relationship with hop size is: $$ \text{hop_size} = \text{chunk_size} \cdot (1 - \text{overlap_ratio}) $$

  • A ratio of 0.0 means no overlap (hop_size = chunk_size).
  • A ratio of 0.5 means 50% overlap (hop_size = chunk_size / 2).
  • A higher overlap ratio increases computational cost as more chunks are processed, but it can lead to smoother results by averaging more predictions for each time frame.
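The formula above can be sketched directly. The function name is hypothetical; truncating to a whole number of samples and clamping the result to at least 1 are assumptions, since the rounding rule is not specified here.

```python
def hop_size_from_overlap(chunk_size: int, overlap_ratio: float) -> int:
    """hop_size = chunk_size * (1 - overlap_ratio), as a whole number of samples."""
    if not 0 <= overlap_ratio < 1:
        raise ValueError("expected 0 <= overlap_ratio < 1")
    # Assumption: truncate, and never return a hop of zero samples.
    return max(1, int(chunk_size * (1 - overlap_ratio)))


no_overlap = hop_size_from_overlap(1000, 0.0)
half_overlap = hop_size_from_overlap(1000, 0.5)
```

Halving the hop size roughly doubles the number of chunks the model has to process, which is where the extra computational cost comes from.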

Padding module-attribute

Padding: TypeAlias = Gt0[int]

Samples to add to the beginning and end of each chunk.

  • To ensure that the very beginning and end of a track can be centered within a chunk, we often add "reflection padding" or "zero padding" before chunking.
  • To ensure that the last chunk is full-size, we may pad the audio so its length is a multiple of the hop size.
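The second point reduces to a small piece of modular arithmetic. A minimal sketch, with a hypothetical function name:

```python
def padding_to_multiple(num_samples: int, hop_size: int) -> int:
    """Samples to append so that the padded length is a multiple of hop_size."""
    if hop_size <= 0:
        raise ValueError("hop_size must be positive")
    # Python's modulo of a negative number is non-negative for a positive divisor.
    return -num_samples % hop_size


extra = padding_to_multiple(10, 4)
```

Here a signal of 10 samples needs 2 extra samples to reach 12, the next multiple of 4.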

PaddedChunkedAudioTensor module-attribute

PaddedChunkedAudioTensor = NewType(
    "PaddedChunkedAudioTensor", Tensor
)

A batch of audio chunks from a padded source. Shape (batch size, channels, chunk size)

NumModelStems module-attribute

NumModelStems: TypeAlias = Gt0[int]

The number of stems the model outputs. This should be the length of splifft.models.ModelParamsLike.output_stem_names.

SeparatedSpectrogramTensor module-attribute

SeparatedSpectrogramTensor = NewType(
    "SeparatedSpectrogramTensor", Tensor
)

A batch of separated spectrograms. Shape (b, n, f*s, t, c=2)

SeparatedChunkedTensor module-attribute

SeparatedChunkedTensor = NewType(
    "SeparatedChunkedTensor", Tensor
)

A batch of separated audio chunks from the model. Shape (batch size, number of stems, channels, chunk size)

WindowTensor module-attribute

WindowTensor = NewType('WindowTensor', Tensor)

A 1D tensor representing a window function. Shape (chunk size)

RawSeparatedTensor module-attribute

RawSeparatedTensor = NewType('RawSeparatedTensor', Tensor)

The final, stitched, raw-domain separated audio. Shape (number of stems, channels, samples)

PreprocessFn module-attribute

PostprocessFn module-attribute

PostprocessFn: TypeAlias = Callable[
    ..., SeparatedChunkedTensor
]

Identifier module-attribute

Identifier: TypeAlias = LowerCase[str]

Format: {{architecture}}-{{first_author}}-{{unique_name_short}}. Replace spaces with underscores.
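A sketch of how such an identifier could be assembled from its three parts. The helper `make_identifier` and the example values are hypothetical; lowercasing is assumed from the LowerCase annotation.

```python
def make_identifier(architecture: str, first_author: str, unique_name_short: str) -> str:
    """Join the three parts with hyphens, lowercased, with spaces as underscores."""
    parts = (architecture, first_author, unique_name_short)
    return "-".join(p.strip().lower().replace(" ", "_") for p in parts)


# Placeholder values for illustration only.
identifier = make_identifier("BS Roformer", "Doe", "Vocals V1")
```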

Instrument module-attribute

Instrument: TypeAlias = Literal[
    "instrum",
    "vocals",
    "drums",
    "bass",
    "other",
    "piano",
    "lead_vocals",
    "back_vocals",
    "guitar",
    "vocals1",
    "vocals2",
    "strings",
    "wind",
    "music",
    "sfx",
    "speech",
    "restored",
    "back",
    "lead",
    "back-instrum",
    "kick",
    "snare",
    "toms",
    "hh",
    "cymbals",
    "hh-cymbals",
    "male",
    "female",
    "violin",
    "dry",
    "reverb",
    "clean",
    "crowd",
    "denoised",
    "noise",
    "vocals_dry",
    "dereverb",
]

Metric module-attribute

Metric: TypeAlias = Literal[
    "sdr",
    "si_sdr",
    "l1_freq",
    "log_wmse",
    "aura_stft",
    "aura_mrstft",
    "bleedless",
    "fullness",
]

Sdr module-attribute

Signal-to-Distortion Ratio (decibels). Higher is better.

Measures the ratio of the power of the clean reference signal to the power of all other error components (interference, artifacts, and spatial distortion).

Definition: $$ \text{SDR} = 10 \log_{10} \frac{||\mathbf{s}||^2}{||\mathbf{s} - \mathbf{\hat{s}}||^2}, $$ where:

  • \(\mathbf{s}\): ground truth source signal
  • \(\mathbf{\hat{s}}\): estimated source signal produced by the model
  • \(||\cdot||^2\): squared L2 norm (power) of the signal
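The definition translates directly into code. The sketch below works on plain lists of samples. The small `eps` term, which guards against division by zero and the logarithm of zero, is an assumption and not part of the definition.

```python
import math


def sdr(reference: list[float], estimate: list[float], eps: float = 1e-12) -> float:
    """Signal-to-Distortion Ratio in decibels for two equal-length signals."""
    signal_power = sum(s * s for s in reference)
    error_power = sum((s - e) ** 2 for s, e in zip(reference, estimate))
    # eps (assumption) keeps the ratio finite for silent or perfect inputs.
    return 10 * math.log10((signal_power + eps) / (error_power + eps))


reference = [1.0, 0.0, -1.0, 0.0]
estimate = [0.9, 0.0, -0.9, 0.0]
score = sdr(reference, estimate)
```

In this toy example the error power is one hundredth of the signal power, so the score is about 20 dB.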

SiSdr module-attribute

SiSdr: TypeAlias = float

Scale-Invariant SDR (SI-SDR) is invariant to scaling errors (decibels). Higher is better.

It projects the estimate onto the reference to find the optimal scaling factor \(\alpha\), creating a scaled reference that best matches the estimate's amplitude.

  • Optimal scaling factor: \(\alpha = \frac{\langle\mathbf{\hat{s}}, \mathbf{s}\rangle}{||\mathbf{s}||^2}\)
  • Scaled reference: \(\mathbf{s}_\text{target} = \alpha \cdot \mathbf{s}\)
  • Error: \(\mathbf{e} = \mathbf{\hat{s}} - \mathbf{s}_\text{target}\)
  • \(\text{SI-SDR} = 10 \log_{10} \frac{||\mathbf{s}_\text{target}||^2}{||\mathbf{e}||^2}\)
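The four steps above can be sketched on plain lists of samples. As before, `eps` is an assumption added for numerical safety.

```python
import math


def si_sdr(reference: list[float], estimate: list[float], eps: float = 1e-12) -> float:
    """Scale-Invariant SDR in decibels for two equal-length signals."""
    dot = sum(e * s for s, e in zip(reference, estimate))
    ref_energy = sum(s * s for s in reference)
    alpha = dot / (ref_energy + eps)  # optimal scaling factor
    target = [alpha * s for s in reference]  # scaled reference
    error = [e - t for e, t in zip(estimate, target)]
    target_energy = sum(t * t for t in target)
    error_energy = sum(x * x for x in error)
    return 10 * math.log10((target_energy + eps) / (error_energy + eps))


reference = [1.0, 0.0, -1.0, 0.0]
estimate = [1.0, 0.1, -1.0, 0.0]
base = si_sdr(reference, estimate)
louder = si_sdr(reference, [2 * e for e in estimate])
```

Doubling the estimate's amplitude leaves the score essentially unchanged, which is the scale invariance the name refers to.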

L1Norm module-attribute

L1Norm: TypeAlias = float

L1 norm (mean absolute error) between two signals (dimensionless). Lower is better.

Measures the average absolute difference between the reference and estimated signals.

  • Time domain: \(\mathcal{L}_\text{L1} = \frac{1}{N} \sum_{n=1}^{N} |\mathbf{s}[n] - \mathbf{\hat{s}}[n]|\)
  • Frequency domain: \(\mathcal{L}_\text{L1Freq} = \frac{1}{MK}\sum_{m=1}^{M} \sum_{k=1}^{K} \left||S(m, k)| - |\hat{S}(m, k)|\right|\)
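Both variants are plain averages of absolute differences. In the sketch below, the frequency-domain version takes precomputed M x K magnitude spectrograms as nested lists. Computing the STFT itself is outside the scope of the sketch, and the function names are hypothetical.

```python
def l1_time(reference: list[float], estimate: list[float]) -> float:
    """Mean absolute difference between two equal-length time-domain signals."""
    return sum(abs(s - e) for s, e in zip(reference, estimate)) / len(reference)


def l1_freq(ref_mag: list[list[float]], est_mag: list[list[float]]) -> float:
    """Mean absolute difference between two M x K magnitude spectrograms."""
    total = sum(
        abs(r - e)
        for ref_row, est_row in zip(ref_mag, est_mag)
        for r, e in zip(ref_row, est_row)
    )
    count = sum(len(row) for row in ref_mag)  # M * K
    return total / count


time_loss = l1_time([1.0, 2.0, 3.0], [1.0, 1.0, 5.0])
freq_loss = l1_freq([[1.0, 2.0], [3.0, 4.0]], [[1.0, 1.0], [1.0, 4.0]])
```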

DbDifferenceMel module-attribute

DbDifferenceMel: TypeAlias = float

Difference in the dB-scaled mel spectrogram. $$ \mathbf{D}(m, k) = \text{dB}(|\hat{S}_\text{mel}(m, k)|) - \text{dB}(|S_\text{mel}(m, k)|) $$

Bleedless module-attribute

Bleedless: TypeAlias = float

A metric to quantify the amount of "bleeding" from other sources. Higher is better.

Bleed is the average energy of the parts of the estimate's mel spectrogram that are louder than the reference: $$ \text{Bleed} = \text{mean}(\mathbf{D}(m, k)) \quad \forall \quad \mathbf{D}(m, k) > 0 $$ A high Bleed value indicates that the estimate contains unwanted energy (bleed) from other sources, so a better Bleedless score corresponds to less bleed.

Fullness module-attribute

Fullness: TypeAlias = float

A metric to quantify how much of the original source is missing. Higher is better.

Complementary to Bleedless. Measures the average energy deficit in the parts of the estimate's mel spectrogram that are quieter than the reference, i.e. where parts of the target signal were lost during separation. A better score indicates that more of the original source's character is preserved. $$ \text{Fullness} = \text{mean}(|\mathbf{D}(m, k)|) \quad \forall \quad \mathbf{D}(m, k) < 0 $$
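The two means used by Bleedless and Fullness both come from the same dB difference matrix \(\mathbf{D}\), split by sign. A minimal sketch on a nested list follows. The function name is hypothetical, returning 0.0 when no entries of a given sign exist is an assumption, and any further mapping from these means to the final scores is not covered here.

```python
def split_db_difference(d: list[list[float]]) -> tuple[float, float]:
    """Return (mean of positive entries, mean absolute value of negative entries) of D."""
    flat = [v for row in d for v in row]
    louder = [v for v in flat if v > 0]  # estimate louder than the reference
    quieter = [-v for v in flat if v < 0]  # estimate quieter than the reference
    mean_louder = sum(louder) / len(louder) if louder else 0.0
    mean_quieter = sum(quieter) / len(quieter) if quieter else 0.0
    return mean_louder, mean_quieter


d = [[2.0, -1.0], [4.0, -5.0]]
mean_louder, mean_quieter = split_db_difference(d)
```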