Types

types

Types for documentation and data validation (for use in pydantic).

They provide semantic meaning only. We additionally use NewType for strong semantic distinction, to avoid mixing up different kinds of tensors.
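
As a rough illustration of the NewType pattern described above, the sketch below uses a hypothetical `Tensor` stand-in class (so that it runs without PyTorch) and a hypothetical `run_model` function; neither is part of this module.

```python
from typing import NewType


class Tensor:
    """Hypothetical stand-in for torch.Tensor so the sketch runs without PyTorch."""

    def __init__(self, data: list[float]) -> None:
        self.data = data


RawAudioTensor = NewType("RawAudioTensor", Tensor)
NormalizedAudioTensor = NewType("NormalizedAudioTensor", Tensor)


def run_model(mixture: NormalizedAudioTensor) -> Tensor:
    """Hypothetical consumer that only accepts normalized audio."""
    return mixture


raw = RawAudioTensor(Tensor([0.1, -0.2, 0.3]))

# A static type checker would reject `run_model(raw)`, because `raw` is a
# RawAudioTensor and not a NormalizedAudioTensor.
# At runtime NewType is an identity function: no wrapper object is created.
# In real code this re-tagging would follow an actual normalization step.
normalized = NormalizedAudioTensor(raw)
output = run_model(normalized)
```

The distinction exists only for the type checker, so it adds no runtime cost.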

Note that no code implementations shall be placed here.

Attributes:

Name	Type	Description
`StrPath`	`TypeAlias`
`BytesPath`	`TypeAlias`
`Gt0`	`TypeAlias`
`Ge0`	`TypeAlias`
`ModelType`	`TypeAlias`	The type of the model, e.g. `bs_roformer`, `demucs`
`ModelInputType`	`TypeAlias`
`ModelOutputType`	`TypeAlias`
`ChunkSize`	`TypeAlias`	The length of an audio segment, in samples, processed by the model at one time.
`HopSize`	`TypeAlias`	The step size, in samples, between the start of consecutive chunks.
`Dropout`	`TypeAlias`
`ModelOutputStemName`	`TypeAlias`	The output stem name, e.g. `vocals`, `drums`, `bass`, etc.
`Samples`	`TypeAlias`	Number of samples in the audio signal.
`SampleRate`	`TypeAlias`	The number of samples of audio recorded per second (hertz).
`Channels`	`TypeAlias`	Number of audio streams.
`FileFormat`	`TypeAlias`
`BitRate`	`TypeAlias`	Number of bits of information in each sample.
`RawAudioTensor`		Time domain tensor of audio samples.
`NormalizedAudioTensor`		A mixture tensor that has been normalized using on-the-fly statistics.
`ComplexSpectrogram`		A complex-valued representation of audio's frequency content over time via the STFT.
`HybridModelInput`	`TypeAlias`	Input for hybrid models that require both spectrogram and waveform.
`WindowShape`	`TypeAlias`	The shape of the window function applied to each chunk before computing the STFT.
`FftSize`	`TypeAlias`	The number of frequency bins in the STFT, controlling the frequency resolution.
`Bands`	`TypeAlias`	Groups of adjacent frequency bins in the spectrogram.
`BatchSize`	`TypeAlias`	The number of chunks processed simultaneously by the GPU.
`PaddingMode`	`TypeAlias`	The method used to pad the audio before chunking, crucial for handling the edges of the audio signal.
`ChunkDuration`	`TypeAlias`	The length of an audio segment, in seconds, processed by the model at one time.
`OverlapRatio`	`TypeAlias`	The fraction of a chunk that overlaps with the next one.
`Padding`	`TypeAlias`	Samples to add to the beginning and end of each chunk.
`PaddedChunkedAudioTensor`		A batch of audio chunks from a padded source.
`NumModelStems`	`TypeAlias`	The number of stems the model outputs. This should be the length of [splifft.models.ModelParamsLike.output_stem_names].
`SeparatedSpectrogramTensor`		A batch of separated spectrograms.
`SeparatedChunkedTensor`		A batch of separated audio chunks from the model.
`WindowTensor`		A 1D tensor representing a window function.
`RawSeparatedTensor`		The final, stitched, raw-domain separated audio.
`PreprocessFn`	`TypeAlias`
`PostprocessFn`	`TypeAlias`
`Identifier`	`TypeAlias`	`{{architecture}}-{{first_author}}-{{unique_name_short}}`, use underscore if it has spaces
`Instrument`	`TypeAlias`
`Metric`	`TypeAlias`
`Sdr`	`TypeAlias`	Signal-to-Distortion Ratio (decibels). Higher is better.
`SiSdr`	`TypeAlias`	Scale-Invariant SDR (SI-SDR) is invariant to scaling errors (decibels). Higher is better.
`L1Norm`	`TypeAlias`	L1 norm (mean absolute error) between two signals (dimensionless). Lower is better.
`DbDifferenceMel`	`TypeAlias`	Difference in the dB-scaled mel spectrogram.
`Bleedless`	`TypeAlias`	A metric to quantify the amount of "bleeding" from other sources. Higher is better.
`Fullness`	`TypeAlias`	A metric to quantify how much of the original source is missing. Higher is better.

StrPath `module-attribute`

StrPath: TypeAlias = str | PathLike[str]

BytesPath `module-attribute`

BytesPath: TypeAlias = bytes | PathLike[bytes]

Gt0 `module-attribute`

Gt0: TypeAlias = Annotated[_T, Gt(0)]

Ge0 `module-attribute`

Ge0: TypeAlias = Annotated[_T, Ge(0)]
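
To show how a generic `Annotated` alias such as `Gt0` carries its constraint, here is a standard-library-only sketch. The `Gt` dataclass is a hypothetical stand-in for the real constraint marker, and `validate` is a hypothetical helper written for this example; in practice a validation library such as pydantic reads this metadata.

```python
from dataclasses import dataclass
from typing import Annotated, Any, TypeVar, get_args

_T = TypeVar("_T")


@dataclass(frozen=True)
class Gt:
    """Hypothetical stand-in for a 'greater than' constraint marker."""

    gt: float


# Generic alias: the base type is filled in later, the constraint is fixed.
Gt0 = Annotated[_T, Gt(0)]
ChunkSize = Gt0[int]  # equivalent to Annotated[int, Gt(0)]


def validate(value: Any, annotated: Any) -> Any:
    """Check `value` against the base type and any Gt constraints (illustrative only)."""
    base, *constraints = get_args(annotated)
    if not isinstance(value, base):
        raise TypeError(f"expected {base.__name__}, got {type(value).__name__}")
    for constraint in constraints:
        if isinstance(constraint, Gt) and not value > constraint.gt:
            raise ValueError(f"{value} must be greater than {constraint.gt}")
    return value


chunk_size = validate(1024, ChunkSize)
```

Calling `validate(0, ChunkSize)` raises `ValueError` in this sketch, since 0 is not strictly greater than 0.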

ModelType `module-attribute`

ModelType: TypeAlias = str

The type of the model, e.g. bs_roformer, demucs

ModelInputType `module-attribute`

ModelInputType: TypeAlias = Literal[
    "waveform", "spectrogram", "waveform_and_spectrogram"
]

ModelOutputType `module-attribute`

ModelOutputType: TypeAlias = Literal[
    "waveform", "spectrogram_mask", "spectrogram"
]

ChunkSize `module-attribute`

ChunkSize: TypeAlias = Gt0[int]

The length of an audio segment, in samples, processed by the model at one time.

A full audio track is often too long to fit into GPU memory, so we process it in fixed-size chunks. A larger chunk size may allow the model to capture more temporal context at the cost of increased memory usage.

HopSize `module-attribute`

HopSize: TypeAlias = Gt0[int]

The step size, in samples, between the start of consecutive chunks.

To avoid artifacts at the edges of chunks, we process them with overlap. The hop size is the distance we "slide" the chunking window forward. HopSize < ChunkSize implies overlap, and the overlap amount is ChunkSize - HopSize.
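
The sliding behaviour can be sketched in plain Python. The helper name `chunk_starts` is hypothetical, and the sketch assumes the signal is at least one chunk long and ignores any padding.

```python
def chunk_starts(num_samples: int, chunk_size: int, hop_size: int) -> list[int]:
    """Return the start index of every full chunk in a signal of `num_samples`."""
    if chunk_size <= 0 or hop_size <= 0:
        raise ValueError("chunk_size and hop_size must be positive")
    last_start = max(num_samples - chunk_size, 0)
    return list(range(0, last_start + 1, hop_size))


# With hop_size < chunk_size, consecutive chunks share chunk_size - hop_size samples.
starts = chunk_starts(num_samples=10, chunk_size=4, hop_size=2)
overlap = 4 - 2
```

Here the chunks start at samples 0, 2, 4 and 6, and each chunk shares 2 samples with the next one.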

Dropout `module-attribute`

Dropout: TypeAlias = Annotated[float, Ge(0.0), Le(1.0)]

ModelOutputStemName `module-attribute`

ModelOutputStemName: TypeAlias = Annotated[str, MinLen(1)]

The output stem name, e.g. vocals, drums, bass, etc.

Samples `module-attribute`

Samples: TypeAlias = Gt0[int]

Number of samples in the audio signal.

SampleRate `module-attribute`

SampleRate: TypeAlias = Gt0[int]

The number of samples of audio recorded per second (hertz).

See concepts for more details.

Channels `module-attribute`

Channels: TypeAlias = Gt0[int]

Number of audio streams.

1: Mono audio
2: Stereo (left and right). Models are usually trained on stereo audio.

FileFormat `module-attribute`

FileFormat: TypeAlias = Literal['flac', 'wav', 'ogg']

BitRate `module-attribute`

BitRate: TypeAlias = Literal[8, 16, 24, 32, 64]

Number of bits of information in each sample.

It determines the dynamic range of the audio signal: the difference between the quietest and loudest possible sounds.

16-bit: Standard for CD audio: ~96 dB dynamic range.
24-bit: Common in professional audio, allowing for more headroom during mixing.
32-bit float: Standard in digital audio workstations (DAWs) and deep learning models. The amplitude is represented by a floating-point number, which prevents clipping (distortion from exceeding the maximum value). This library primarily works with fp32 tensors.
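
The ~96 dB figure for 16-bit audio follows from the usual rule of thumb for integer PCM, where the dynamic range is $20 \log_{10}(2^{\text{bits}})$, roughly 6.02 dB per bit. A small sketch (the function name is hypothetical, and the formula does not apply to floating-point formats):

```python
import math


def pcm_dynamic_range_db(bits: int) -> float:
    """Approximate dynamic range of integer PCM audio with the given bit depth."""
    return 20.0 * math.log10(2**bits)


range_16 = pcm_dynamic_range_db(16)  # roughly 96.3 dB
range_24 = pcm_dynamic_range_db(24)  # roughly 144.5 dB
```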

RawAudioTensor `module-attribute`

RawAudioTensor = NewType('RawAudioTensor', Tensor)

Time domain tensor of audio samples. Shape (channels, samples)

NormalizedAudioTensor `module-attribute`

NormalizedAudioTensor = NewType(
    "NormalizedAudioTensor", Tensor
)

A mixture tensor that has been normalized using on-the-fly statistics. Shape (channels, samples)

ComplexSpectrogram `module-attribute`

ComplexSpectrogram = NewType('ComplexSpectrogram', Tensor)

A complex-valued representation of audio's frequency content over time via the STFT.

Shape (channels, frequency bins, time frames, 2)

See concepts for more details.

HybridModelInput `module-attribute`

HybridModelInput: TypeAlias = tuple[
    ComplexSpectrogram,
    RawAudioTensor | NormalizedAudioTensor,
]

Input for hybrid models that require both spectrogram and waveform.

WindowShape `module-attribute`

WindowShape: TypeAlias = Literal[
    "hann", "hamming", "linear_fade"
]

The shape of the window function applied to each chunk before computing the STFT.
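
For reference, the textbook Hann and Hamming windows can be computed as below. This is a standard-library sketch using the periodic convention (dividing by the window length); whether this library uses that convention is an assumption, and `linear_fade` is not covered because its definition is not given here.

```python
import math


def hann_window(length: int) -> list[float]:
    """Periodic Hann window: w[n] = 0.5 - 0.5 * cos(2*pi*n / length)."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * n / length) for n in range(length)]


def hamming_window(length: int) -> list[float]:
    """Periodic Hamming window: w[n] = 0.54 - 0.46 * cos(2*pi*n / length)."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / length) for n in range(length)]


hann = hann_window(4)
hamming = hamming_window(4)
```

The Hann window tapers to zero at its start, while the Hamming window keeps a small non-zero value (0.08) there.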

FftSize `module-attribute`

FftSize: TypeAlias = Gt0[int]

The number of frequency bins in the STFT, controlling the frequency resolution.
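
As a rough guide to how the FFT size relates to frequency resolution, the sketch below uses two standard relations: the spacing between bins is `sample_rate / fft_size`, and a one-sided STFT of a real signal keeps `fft_size // 2 + 1` bins. The function names are hypothetical, and whether this library uses a one-sided STFT is an assumption.

```python
def bin_spacing_hz(sample_rate: int, fft_size: int) -> float:
    """Frequency distance between adjacent STFT bins, in hertz."""
    return sample_rate / fft_size


def one_sided_bins(fft_size: int) -> int:
    """Number of non-redundant bins for a real-valued input signal."""
    return fft_size // 2 + 1


spacing = bin_spacing_hz(sample_rate=44100, fft_size=2048)  # about 21.5 Hz
bins = one_sided_bins(2048)
```

Doubling the FFT size halves the bin spacing, at the cost of coarser time resolution.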

Bands `module-attribute`

Bands: TypeAlias = Tensor

Groups of adjacent frequency bins in the spectrogram.

BatchSize `module-attribute`

BatchSize: TypeAlias = Gt0[int]

The number of chunks processed simultaneously by the GPU.

Increasing the batch size can improve GPU utilisation and speed up training, but it requires more memory.

PaddingMode `module-attribute`

PaddingMode: TypeAlias = Literal[
    "reflect", "constant", "replicate"
]

The method used to pad the audio before chunking, crucial for handling the edges of the audio signal.

reflect: Pads the signal by reflecting the audio at the boundary. This creates a smooth continuation and often yields the best results for music.
constant: Pads with zeros. Simpler, but can introduce silence at the edges.
replicate: Repeats the last sample at the edge.
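
The three modes can be compared on a tiny one-dimensional signal. The helper `pad_1d` below is a hypothetical pure-Python sketch written for this illustration; it is not the implementation used by the library.

```python
def pad_1d(x: list[float], pad: int, mode: str) -> list[float]:
    """Pad `x` with `pad` samples on both sides using the given mode."""
    if mode == "constant":
        left = [0.0] * pad
        right = [0.0] * pad
    elif mode == "replicate":
        left = [x[0]] * pad
        right = [x[-1]] * pad
    elif mode == "reflect":
        if pad >= len(x):
            raise ValueError("reflect padding must be smaller than the signal length")
        # Mirror around the boundary sample without repeating it.
        left = x[pad:0:-1]
        right = x[-2 : -pad - 2 : -1]
    else:
        raise ValueError(f"unknown padding mode: {mode}")
    return left + x + right


signal = [1.0, 2.0, 3.0, 4.0]
reflected = pad_1d(signal, 2, "reflect")
zero_padded = pad_1d(signal, 2, "constant")
replicated = pad_1d(signal, 2, "replicate")
```

For this signal, reflect padding yields `3, 2, 1, 2, 3, 4, 3, 2`, which continues the waveform without the jump to zero that constant padding introduces.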

ChunkDuration `module-attribute`

ChunkDuration: TypeAlias = Gt0[float]

The length of an audio segment, in seconds, processed by the model at one time.

Equivalent to chunk size divided by the sample rate.

OverlapRatio `module-attribute`

OverlapRatio: TypeAlias = Annotated[float, Ge(0), Lt(1)]

The fraction of a chunk that overlaps with the next one.

The relationship with hop size is: $$ \text{hop_size} = \text{chunk_size} \cdot (1 - \text{overlap_ratio}) $$

A ratio of 0.0 means no overlap (hop_size = chunk_size).
A ratio of 0.5 means 50% overlap (hop_size = chunk_size / 2).
A higher overlap ratio increases computational cost as more chunks are processed, but it can lead to smoother results by averaging more predictions for each time frame.
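
The relationships between chunk duration, chunk size, overlap ratio and hop size can be sketched as follows. The function names and the rounding to the nearest sample are assumptions made for this example.

```python
def chunk_size_from_duration(chunk_duration: float, sample_rate: int) -> int:
    """Convert a chunk duration in seconds to a chunk size in samples."""
    return round(chunk_duration * sample_rate)


def hop_size_from_overlap(chunk_size: int, overlap_ratio: float) -> int:
    """Apply hop_size = chunk_size * (1 - overlap_ratio), keeping at least 1 sample."""
    if not 0.0 <= overlap_ratio < 1.0:
        raise ValueError("overlap_ratio must be in [0, 1)")
    return max(1, round(chunk_size * (1.0 - overlap_ratio)))


chunk_size = chunk_size_from_duration(chunk_duration=8.0, sample_rate=44100)
hop_no_overlap = hop_size_from_overlap(chunk_size, 0.0)
hop_half_overlap = hop_size_from_overlap(chunk_size, 0.5)
```

An 8-second chunk at 44100 Hz is 352800 samples, so a 50% overlap gives a hop size of 176400 samples.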

Padding `module-attribute`

Padding: TypeAlias = Gt0[int]

Samples to add to the beginning and end of each chunk.

To ensure that the very beginning and end of a track can be centered within a chunk, we may add "reflection padding" or "zero padding" before chunking.
To ensure that the last chunk is full-size, we may pad the audio so its length is a multiple of the hop size.
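
The amount of end padding needed to reach a multiple of the hop size is a one-line computation. The function name below is hypothetical, and the library's actual padding logic may differ.

```python
def end_padding_for_hop(num_samples: int, hop_size: int) -> int:
    """Samples to append so that the padded length is a multiple of `hop_size`."""
    if hop_size <= 0:
        raise ValueError("hop_size must be positive")
    return (-num_samples) % hop_size


pad_needed = end_padding_for_hop(num_samples=10, hop_size=4)
pad_not_needed = end_padding_for_hop(num_samples=12, hop_size=4)
```

A 10-sample signal with a hop size of 4 needs 2 extra samples, while a 12-sample signal needs none.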

PaddedChunkedAudioTensor `module-attribute`

PaddedChunkedAudioTensor = NewType(
    "PaddedChunkedAudioTensor", Tensor
)

A batch of audio chunks from a padded source. Shape (batch size, channels, chunk size)

NumModelStems `module-attribute`

NumModelStems: TypeAlias = Gt0[int]

The number of stems the model outputs. This should be the length of [splifft.models.ModelParamsLike.output_stem_names].

SeparatedSpectrogramTensor `module-attribute`

SeparatedSpectrogramTensor = NewType(
    "SeparatedSpectrogramTensor", Tensor
)

A batch of separated spectrograms. Shape (b, n, f*s, t, c=2)

SeparatedChunkedTensor `module-attribute`

SeparatedChunkedTensor = NewType(
    "SeparatedChunkedTensor", Tensor
)

A batch of separated audio chunks from the model. Shape (batch size, number of stems, channels, chunk size)

WindowTensor `module-attribute`

WindowTensor = NewType('WindowTensor', Tensor)

A 1D tensor representing a window function. Shape (chunk size)

RawSeparatedTensor `module-attribute`

RawSeparatedTensor = NewType('RawSeparatedTensor', Tensor)

The final, stitched, raw-domain separated audio. Shape (number of stems, channels, samples)

PreprocessFn `module-attribute`

PreprocessFn: TypeAlias = Callable[
    [RawAudioTensor | NormalizedAudioTensor],
    tuple[Tensor, ...],
]

PostprocessFn `module-attribute`

PostprocessFn: TypeAlias = Callable[
    ..., SeparatedChunkedTensor
]

Identifier `module-attribute`

Identifier: TypeAlias = LowerCase[str]

{{architecture}}-{{first_author}}-{{unique_name_short}}, use underscore if it has spaces

Instrument `module-attribute`

Instrument: TypeAlias = Literal[
    "instrum",
    "vocals",
    "drums",
    "bass",
    "other",
    "piano",
    "lead_vocals",
    "back_vocals",
    "guitar",
    "vocals1",
    "vocals2",
    "strings",
    "wind",
    "music",
    "sfx",
    "speech",
    "restored",
    "back",
    "lead",
    "back-instrum",
    "kick",
    "snare",
    "toms",
    "hh",
    "cymbals",
    "hh-cymbals",
    "male",
    "female",
    "violin",
    "dry",
    "reverb",
    "clean",
    "crowd",
    "denoised",
    "noise",
    "vocals_dry",
    "dereverb",
]

Metric `module-attribute`

Metric: TypeAlias = Literal[
    "sdr",
    "si_sdr",
    "l1_freq",
    "log_wmse",
    "aura_stft",
    "aura_mrstft",
    "bleedless",
    "fullness",
]

Sdr `module-attribute`

Sdr: TypeAlias = float

Signal-to-Distortion Ratio (decibels). Higher is better.

Measures the ratio of the power of the clean reference signal to the power of all other error components (interference, artifacts, and spatial distortion).

Definition: $$ \text{SDR} = 10 \log_{10} \frac{||\mathbf{s}||^2}{||\mathbf{s} - \mathbf{\hat{s}}||^2}, $$ where:

$\mathbf{s}$: ground truth source signal
$\mathbf{\hat{s}}$: estimated source signal produced by the model
$||\cdot||^2$: squared L2 norm (power) of the signal
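
The definition above translates directly into code. This is a minimal standard-library sketch on plain lists of samples; the function name is hypothetical, and it assumes the estimate is not identical to the reference (otherwise the denominator is zero).

```python
import math


def sdr_db(reference: list[float], estimate: list[float]) -> float:
    """Signal-to-Distortion Ratio in decibels for two equal-length signals."""
    if len(reference) != len(estimate):
        raise ValueError("signals must have the same length")
    signal_power = sum(s * s for s in reference)
    error_power = sum((s - e) ** 2 for s, e in zip(reference, estimate))
    return 10.0 * math.log10(signal_power / error_power)


reference = [1.0, 0.0, -1.0, 0.0]
estimate = [0.9, 0.0, -0.9, 0.0]
score = sdr_db(reference, estimate)  # about 20 dB
```

The estimate here is the reference scaled by 0.9, so the error power is 1% of the signal power, which corresponds to about 20 dB.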

SiSdr `module-attribute`

SiSdr: TypeAlias = float

Scale-Invariant SDR (SI-SDR) is invariant to scaling errors (decibels). Higher is better.

It projects the estimate onto the reference to find the optimal scaling factor $\alpha$, creating a scaled reference that best matches the estimate's amplitude.

Optimal scaling factor: $\alpha = \frac{\langle\mathbf{\hat{s}}, \mathbf{s}\rangle}{||\mathbf{s}||^2}$
Scaled reference: $\mathbf{s}_\text{target} = \alpha \cdot \mathbf{s}$
Error: $\mathbf{e} = \mathbf{\hat{s}} - \mathbf{s}_\text{target}$
$\text{SI-SDR} = 10 \log_{10} \frac{||\mathbf{s}_\text{target}||^2}{||\mathbf{e}||^2}$
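
The four steps above can be sketched with the standard library only. The function name is hypothetical, and the sketch assumes a non-zero reference and a non-zero residual error.

```python
import math


def si_sdr_db(reference: list[float], estimate: list[float]) -> float:
    """Scale-Invariant SDR in decibels for two equal-length signals."""
    if len(reference) != len(estimate):
        raise ValueError("signals must have the same length")
    # Optimal scaling factor: projection of the estimate onto the reference.
    dot = sum(e * s for e, s in zip(estimate, reference))
    alpha = dot / sum(s * s for s in reference)
    # Scaled reference and residual error.
    target = [alpha * s for s in reference]
    error = [e - t for e, t in zip(estimate, target)]
    target_power = sum(t * t for t in target)
    error_power = sum(x * x for x in error)
    return 10.0 * math.log10(target_power / error_power)


reference = [1.0, 0.0, -1.0, 0.0]
estimate = [1.0, 0.1, -1.0, -0.1]
louder_estimate = [2.0 * e for e in estimate]

score = si_sdr_db(reference, estimate)
score_louder = si_sdr_db(reference, louder_estimate)
```

Both scores come out at about 20 dB: doubling the estimate's amplitude leaves SI-SDR unchanged, whereas plain SDR would penalize it.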

L1Norm `module-attribute`

L1Norm: TypeAlias = float

L1 norm (mean absolute error) between two signals (dimensionless). Lower is better.

Measures the average absolute difference between the reference and estimated signals.

Time domain: $\mathcal{L}_\text{L1} = \frac{1}{N} \sum_{n=1}^{N} |\mathbf{s}[n] - \mathbf{\hat{s}}[n]|$
Frequency domain: $\mathcal{L}_\text{L1Freq} = \frac{1}{MK}\sum_{m=1}^{M} \sum_{k=1}^{K} \left||S(m, k)| - |\hat{S}(m, k)|\right|$
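
The time-domain variant is a plain mean absolute error, sketched below with a hypothetical function name. The frequency-domain variant applies the same averaging to the difference of spectrogram magnitudes and is not shown.

```python
def l1_time_domain(reference: list[float], estimate: list[float]) -> float:
    """Mean absolute difference between two equal-length signals."""
    if len(reference) != len(estimate):
        raise ValueError("signals must have the same length")
    return sum(abs(s - e) for s, e in zip(reference, estimate)) / len(reference)


loss = l1_time_domain([1.0, 2.0, 3.0], [1.0, 1.0, 5.0])
```

The absolute differences are 0, 1 and 2, so the loss is 1.0.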

DbDifferenceMel `module-attribute`

DbDifferenceMel: TypeAlias = float

Difference in the dB-scaled mel spectrogram. $$ \mathbf{D}(m, k) = \text{dB}(|\hat{S}_\text{mel}(m, k)|) - \text{dB}(|S_\text{mel}(m, k)|) $$

Bleedless `module-attribute`

Bleedless: TypeAlias = float

A metric to quantify the amount of "bleeding" from other sources. Higher is better.

Measures the average energy of the parts of the mel spectrogram that are louder than the reference. A high value of the bleed term below indicates that the estimate contains unwanted energy (bleed) from other sources: $$ \text{Bleed} = \text{mean}(\mathbf{D}(m, k)) \quad \forall \quad \mathbf{D}(m, k) > 0 $$

Fullness `module-attribute`

Fullness: TypeAlias = float

A metric to quantify how much of the original source is missing. Higher is better.

Complementary to Bleedless. Measures the average energy of the parts of the mel spectrogram that are quieter than the reference. The mean below grows as parts of the target signal are lost during the separation; a higher Fullness score indicates that more of the original source's character is preserved. $$ \text{Fullness} = \text{mean}(|\mathbf{D}(m, k)|) \quad \forall \quad \mathbf{D}(m, k) < 0 $$
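
The two masked means used by Bleedless and Fullness can be sketched from a precomputed difference matrix $\mathbf{D}$. The function name is hypothetical, returning 0.0 when no bin qualifies is an assumption, and how the library maps these means onto the final "higher is better" scores is not specified here.

```python
def masked_means(d: list[list[float]]) -> tuple[float, float]:
    """Return (mean of positive entries, mean absolute value of negative entries)."""
    positive = [v for row in d for v in row if v > 0]
    negative = [abs(v) for row in d for v in row if v < 0]
    # Assumption: fall back to 0.0 when there are no qualifying bins.
    louder_mean = sum(positive) / len(positive) if positive else 0.0
    quieter_mean = sum(negative) / len(negative) if negative else 0.0
    return louder_mean, quieter_mean


# Toy dB-difference matrix: rows are mel bands, columns are time frames.
d_matrix = [[1.0, -2.0], [3.0, -4.0]]
louder_mean, quieter_mean = masked_means(d_matrix)
```

In this toy matrix the estimate is on average 2 dB louder than the reference where it overshoots, and 3 dB quieter where it undershoots.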
