Types
types
Types for documentation and data validation (for use in pydantic).
They provide semantic meaning only; we additionally use `NewType` for strong semantic distinction, to avoid mixing up different kinds of tensors.
Note that no code implementations shall be placed here.
Attributes:
Name | Type | Description |
---|---|---|
StrPath | TypeAlias | |
BytesPath | TypeAlias | |
Gt0 | TypeAlias | |
Ge0 | TypeAlias | |
ModelType | TypeAlias | The type of the model, e.g. `bs_roformer`, `demucs`. |
ModelInputType | TypeAlias | |
ModelOutputType | TypeAlias | |
ChunkSize | TypeAlias | The length of an audio segment, in samples, processed by the model at one time. |
HopSize | TypeAlias | The step size, in samples, between the start of consecutive chunks. |
Dropout | TypeAlias | |
ModelOutputStemName | TypeAlias | The output stem name, e.g. `vocals`, `drums`, `bass`, etc. |
Samples | TypeAlias | Number of samples in the audio signal. |
SampleRate | TypeAlias | The number of samples of audio recorded per second (hertz). |
Channels | TypeAlias | Number of audio streams. |
FileFormat | TypeAlias | |
BitRate | TypeAlias | Number of bits of information in each sample. |
RawAudioTensor | | Time domain tensor of audio samples. |
NormalizedAudioTensor | | A mixture tensor that has been normalized using on-the-fly statistics. |
ComplexSpectrogram | | A complex-valued representation of audio's frequency content over time via the STFT. |
HybridModelInput | TypeAlias | Input for hybrid models that require both spectrogram and waveform. |
WindowShape | TypeAlias | The shape of the window function applied to each chunk before computing the STFT. |
FftSize | TypeAlias | The number of frequency bins in the STFT, controlling the frequency resolution. |
Bands | TypeAlias | Groups of adjacent frequency bins in the spectrogram. |
BatchSize | TypeAlias | The number of chunks processed simultaneously by the GPU. |
PaddingMode | TypeAlias | The method used to pad the audio before chunking, crucial for handling the edges of the audio signal. |
ChunkDuration | TypeAlias | The length of an audio segment, in seconds, processed by the model at one time. |
OverlapRatio | TypeAlias | The fraction of a chunk that overlaps with the next one. |
Padding | TypeAlias | Samples to add to the beginning and end of each chunk. |
PaddedChunkedAudioTensor | | A batch of audio chunks from a padded source. |
NumModelStems | TypeAlias | The number of stems the model outputs. This should be the length of `splifft.models.ModelParamsLike.output_stem_names`. |
SeparatedSpectrogramTensor | | A batch of separated spectrograms. |
SeparatedChunkedTensor | | A batch of separated audio chunks from the model. |
WindowTensor | | A 1D tensor representing a window function. |
RawSeparatedTensor | | The final, stitched, raw-domain separated audio. |
PreprocessFn | TypeAlias | |
PostprocessFn | TypeAlias | |
Identifier | TypeAlias | |
Instrument | TypeAlias | |
Metric | TypeAlias | |
Sdr | TypeAlias | Signal-to-Distortion Ratio (decibels). Higher is better. |
SiSdr | TypeAlias | Scale-Invariant SDR (SI-SDR) is invariant to scaling errors (decibels). Higher is better. |
L1Norm | TypeAlias | L1 norm (mean absolute error) between two signals (dimensionless). Lower is better. |
DbDifferenceMel | TypeAlias | Difference in the dB-scaled mel spectrogram. |
Bleedless | TypeAlias | A metric to quantify the amount of "bleeding" from other sources. Higher is better. |
Fullness | TypeAlias | A metric to quantify how much of the original source is missing. Higher is better. |
ModelType
module-attribute
The type of the model, e.g. `bs_roformer`, `demucs`.
ModelInputType
module-attribute
ModelOutputType
module-attribute
ChunkSize
module-attribute
The length of an audio segment, in samples, processed by the model at one time.
A full audio track is often too long to fit into GPU memory, so we process it in fixed-size chunks instead. A larger chunk size may allow the model to capture more temporal context at the cost of increased memory usage.
HopSize
module-attribute
The step size, in samples, between the start of consecutive chunks.
To avoid artifacts at the edges of chunks, we process them with overlap. The hop size is the distance we "slide" the chunking window forward. `HopSize < ChunkSize` implies overlap, and the overlap amount is `ChunkSize - HopSize`.
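As a rough illustration of how chunk size and hop size interact, the sketch below lists the start index of every full chunk in a signal. The helper `chunk_starts` is a hypothetical name for this example and is not part of the library.

```python
def chunk_starts(num_samples: int, chunk_size: int, hop_size: int) -> list[int]:
    """Start indices of all full-size chunks that fit in the signal (illustrative helper)."""
    if not 0 < hop_size <= chunk_size:
        raise ValueError("hop_size must be in the range (0, chunk_size]")
    return list(range(0, num_samples - chunk_size + 1, hop_size))


chunk_size, hop_size = 4, 2
starts = chunk_starts(num_samples=10, chunk_size=chunk_size, hop_size=hop_size)
overlap = chunk_size - hop_size  # samples shared by consecutive chunks
```

With these toy values the chunks cover samples 0-3, 2-5, 4-7 and 6-9, so each pair of neighbours shares two samples.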
ModelOutputStemName
module-attribute
The output stem name, e.g. `vocals`, `drums`, `bass`, etc.
SampleRate
module-attribute
The number of samples of audio recorded per second (hertz).
See concepts for more details.
Channels
module-attribute
Number of audio streams.
- 1: Mono audio
- 2: Stereo (left and right). Models are usually trained on stereo audio.
BitRate
module-attribute
Number of bits of information in each sample.
It determines the dynamic range of the audio signal: the difference between the quietest and loudest possible sounds.
- 16-bit: Standard for CD audio, with ~96 dB of dynamic range.
- 24-bit: Common in professional audio, allowing for more headroom during mixing.
- 32-bit float: Standard in digital audio workstations (DAWs) and deep learning models. The amplitude is represented by a floating-point number, which prevents clipping (distortion from exceeding the maximum value). This library primarily works with fp32 tensors.
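The ~96 dB figure for 16-bit audio follows from the ratio between the largest and smallest representable amplitude steps. A minimal sketch, assuming integer PCM samples (the formula does not apply to floating-point formats); `dynamic_range_db` is an illustrative helper name:

```python
import math


def dynamic_range_db(bits: int) -> float:
    """Theoretical dynamic range of integer PCM with the given number of bits per sample."""
    return 20 * math.log10(2**bits)


range_16 = dynamic_range_db(16)  # roughly 96.3 dB
range_24 = dynamic_range_db(24)  # roughly 144.5 dB
```

Each additional bit adds about 6 dB, which is why 24-bit audio offers noticeably more headroom than 16-bit.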
RawAudioTensor
module-attribute
NormalizedAudioTensor
module-attribute
A mixture tensor that has been normalized using on-the-fly statistics. Shape (channels, samples)
ComplexSpectrogram
module-attribute
A complex-valued representation of audio's frequency content over time via the STFT.
Shape (channels, frequency bins, time frames, 2)
See concepts for more details.
HybridModelInput
module-attribute
HybridModelInput: TypeAlias = tuple[
    ComplexSpectrogram,
    RawAudioTensor | NormalizedAudioTensor,
]
Input for hybrid models that require both spectrogram and waveform.
WindowShape
module-attribute
The shape of the window function applied to each chunk before computing the STFT.
FftSize
module-attribute
The number of frequency bins in the STFT, controlling the frequency resolution.
Bands
module-attribute
Groups of adjacent frequency bins in the spectrogram.
BatchSize
module-attribute
The number of chunks processed simultaneously by the GPU.
Increasing the batch size can improve GPU utilisation and speed up training, but it requires more memory.
PaddingMode
module-attribute
The method used to pad the audio before chunking, crucial for handling the edges of the audio signal.
- `reflect`: Pads the signal by reflecting the audio at the boundary. This creates a smooth continuation and often yields the best results for music.
- `constant`: Pads with zeros. Simpler, but can introduce silence at the edges.
- `replicate`: Repeats the last sample at the edge.
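The three modes can be compared on a tiny one-dimensional signal. The sketch below is a dependency-free illustration using plain lists; `pad_1d` is a hypothetical helper, and the `reflect` branch assumes the boundary sample itself is not repeated.

```python
def pad_1d(x: list[float], pad: int, mode: str) -> list[float]:
    """Pad both ends of a 1D signal by `pad` samples (illustrative helper)."""
    if pad == 0:
        return list(x)
    if not x:
        raise ValueError("cannot pad an empty signal")
    if mode == "reflect":
        if pad >= len(x):
            raise ValueError("reflect padding must be smaller than the signal length")
        left = x[1 : pad + 1][::-1]    # mirror around the first sample
        right = x[-pad - 1 : -1][::-1]  # mirror around the last sample
    elif mode == "constant":
        left = right = [0.0] * pad
    elif mode == "replicate":
        left, right = [x[0]] * pad, [x[-1]] * pad
    else:
        raise ValueError(f"unknown padding mode: {mode}")
    return left + list(x) + right


signal = [1.0, 2.0, 3.0, 4.0]
reflected = pad_1d(signal, 2, "reflect")
zero_padded = pad_1d(signal, 2, "constant")
replicated = pad_1d(signal, 2, "replicate")
```

For this signal, `reflect` continues the waveform as `3, 2 | 1, 2, 3, 4 | 3, 2`, whereas `constant` inserts an abrupt jump to zero at both edges.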
ChunkDuration
module-attribute
The length of an audio segment, in seconds, processed by the model at one time.
Equivalent to chunk size divided by the sample rate.
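For instance, under the assumption of a 44.1 kHz sample rate, the conversion could be sketched as follows (`chunk_duration_seconds` is an illustrative name):

```python
def chunk_duration_seconds(chunk_size: int, sample_rate: int) -> float:
    """Convert a chunk length in samples to seconds."""
    return chunk_size / sample_rate


duration = chunk_duration_seconds(chunk_size=441_000, sample_rate=44_100)
```

Here a chunk of 441,000 samples corresponds to ten seconds of audio.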
OverlapRatio
module-attribute
The fraction of a chunk that overlaps with the next one.
The relationship with hop size is: $$ \text{hop_size} = \text{chunk_size} \cdot (1 - \text{overlap_ratio}) $$
- A ratio of `0.0` means no overlap (hop_size = chunk_size).
- A ratio of `0.5` means 50% overlap (hop_size = chunk_size / 2).
- A higher overlap ratio increases computational cost as more chunks are processed, but it can lead to smoother results by averaging more predictions for each time frame.
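The relationship above can be turned into a small helper. This is a sketch only: rounding to the nearest integer and clamping to at least one sample are assumptions made here, and `hop_size_from_overlap` is a hypothetical name.

```python
def hop_size_from_overlap(chunk_size: int, overlap_ratio: float) -> int:
    """hop_size = chunk_size * (1 - overlap_ratio), rounded to whole samples."""
    if not 0.0 <= overlap_ratio < 1.0:
        raise ValueError("overlap_ratio must be in the range [0, 1)")
    # Rounding and the lower bound of 1 are assumptions of this sketch.
    return max(1, round(chunk_size * (1.0 - overlap_ratio)))


no_overlap = hop_size_from_overlap(1024, 0.0)
half_overlap = hop_size_from_overlap(1024, 0.5)
three_quarter_overlap = hop_size_from_overlap(1024, 0.75)
```

Halving the hop size roughly doubles the number of chunks, which is where the extra computational cost comes from.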
Padding
module-attribute
Samples to add to the beginning and end of each chunk.
- To ensure that the very beginning and end of a track can be centered within a chunk, we often add "reflection padding" or "zero padding" before chunking.
- To ensure that the last chunk is full-size, we may pad the audio so its length is a multiple of the hop size.
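The second point amounts to rounding the signal length up to the next multiple of the hop size. A minimal sketch, with `end_padding_for_hop` as a hypothetical helper name:

```python
def end_padding_for_hop(num_samples: int, hop_size: int) -> int:
    """Samples to append so that the padded length is a multiple of hop_size."""
    if hop_size <= 0:
        raise ValueError("hop_size must be positive")
    return (-num_samples) % hop_size


extra = end_padding_for_hop(num_samples=10, hop_size=4)        # pad 10 up to 12
already_aligned = end_padding_for_hop(num_samples=12, hop_size=4)
```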
PaddedChunkedAudioTensor
module-attribute
A batch of audio chunks from a padded source. Shape (batch size, channels, chunk size)
NumModelStems
module-attribute
The number of stems the model outputs. This should be the length of `splifft.models.ModelParamsLike.output_stem_names`.
SeparatedSpectrogramTensor
module-attribute
A batch of separated spectrograms. Shape (b, n, f*s, t, c=2)
SeparatedChunkedTensor
module-attribute
A batch of separated audio chunks from the model. Shape (batch size, number of stems, channels, chunk size)
WindowTensor
module-attribute
A 1D tensor representing a window function. Shape (chunk size)
RawSeparatedTensor
module-attribute
The final, stitched, raw-domain separated audio. Shape (number of stems, channels, samples)
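One way to picture the stitching step is overlap-add with averaging. The sketch below handles a single stem and a single channel using plain lists and uniform weights; this is an assumption made for brevity, and a window function (see WindowTensor) could replace the uniform weights. Both helper names are hypothetical.

```python
def split_into_chunks(signal: list[float], chunk_size: int, hop_size: int) -> list[list[float]]:
    """Cut a signal into full-size, overlapping chunks."""
    starts = range(0, len(signal) - chunk_size + 1, hop_size)
    return [signal[start : start + chunk_size] for start in starts]


def stitch_by_averaging(chunks: list[list[float]], hop_size: int, num_samples: int) -> list[float]:
    """Overlap-add the chunks and average wherever several chunks cover a sample."""
    total = [0.0] * num_samples
    count = [0] * num_samples
    for index, chunk in enumerate(chunks):
        start = index * hop_size
        for offset, value in enumerate(chunk):
            position = start + offset
            if position < num_samples:
                total[position] += value
                count[position] += 1
    return [t / c if c else 0.0 for t, c in zip(total, count)]


signal = [float(n) for n in range(10)]
chunks = split_into_chunks(signal, chunk_size=4, hop_size=2)
restored = stitch_by_averaging(chunks, hop_size=2, num_samples=len(signal))
```

Because the chunks here are unmodified copies of the input, averaging the overlaps reproduces the original signal exactly; with real model outputs, the averaging smooths disagreements between neighbouring chunks.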
PreprocessFn
module-attribute
PreprocessFn: TypeAlias = Callable[
    [RawAudioTensor | NormalizedAudioTensor],
    tuple[Tensor, ...],
]
Identifier
module-attribute
`{{architecture}}-{{first_author}}-{{unique_name_short}}`, using underscores in place of any spaces.
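A small sketch of assembling such an identifier follows. The helper `make_identifier` and the example values are hypothetical and only illustrate the naming pattern.

```python
def make_identifier(architecture: str, first_author: str, unique_name_short: str) -> str:
    """Join the three parts with hyphens, replacing spaces inside a part with underscores."""
    parts = (architecture, first_author, unique_name_short)
    return "-".join(part.strip().replace(" ", "_") for part in parts)


# Example values are made up for illustration.
identifier = make_identifier("bs_roformer", "jane doe", "vocals v1")
```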
Instrument
module-attribute
Instrument: TypeAlias = Literal[
    "instrum",
    "vocals",
    "drums",
    "bass",
    "other",
    "piano",
    "lead_vocals",
    "back_vocals",
    "guitar",
    "vocals1",
    "vocals2",
    "strings",
    "wind",
    "music",
    "sfx",
    "speech",
    "restored",
    "back",
    "lead",
    "back-instrum",
    "kick",
    "snare",
    "toms",
    "hh",
    "cymbals",
    "hh-cymbals",
    "male",
    "female",
    "violin",
    "dry",
    "reverb",
    "clean",
    "crowd",
    "denoised",
    "noise",
    "vocals_dry",
    "dereverb",
]
Metric
module-attribute
Metric: TypeAlias = Literal[
    "sdr",
    "si_sdr",
    "l1_freq",
    "log_wmse",
    "aura_stft",
    "aura_mrstft",
    "bleedless",
    "fullness",
]
Sdr
module-attribute
Signal-to-Distortion Ratio (decibels). Higher is better.
Measures the ratio of the power of the clean reference signal to the power of all other error components (interference, artifacts, and spatial distortion).
Definition: $$ \text{SDR} = 10 \log_{10} \frac{||\mathbf{s}||^2}{||\mathbf{s} - \mathbf{\hat{s}}||^2}, $$ where:
- \(\mathbf{s}\): ground truth source signal
- \(\mathbf{\hat{s}}\): estimated source signal produced by the model
- \(||\cdot||^2\): squared L2 norm (power) of the signal
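The definition translates directly into code. The following is a dependency-free sketch on plain lists, not the library's implementation; `sdr_db` is an illustrative name.

```python
import math


def sdr_db(reference: list[float], estimate: list[float]) -> float:
    """SDR = 10 * log10(||s||^2 / ||s - s_hat||^2), in decibels."""
    if len(reference) != len(estimate):
        raise ValueError("signals must have the same length")
    signal_power = sum(s * s for s in reference)
    error_power = sum((s - e) ** 2 for s, e in zip(reference, estimate))
    if error_power == 0.0:
        return math.inf  # perfect reconstruction
    return 10 * math.log10(signal_power / error_power)


reference = [1.0, -1.0, 1.0, -1.0]
estimate = [0.9, -0.9, 0.9, -0.9]  # every sample is 10% too quiet
score = sdr_db(reference, estimate)
```

An amplitude error of 10% gives a power ratio of 100, i.e. about 20 dB, which shows that plain SDR penalises pure scaling errors.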
SiSdr
module-attribute
Scale-Invariant SDR (SI-SDR) is invariant to scaling errors (decibels). Higher is better.
It projects the estimate onto the reference to find the optimal scaling factor \(\alpha\), creating a scaled reference that best matches the estimate's amplitude.
- Optimal scaling factor: \(\alpha = \frac{\langle\mathbf{\hat{s}}, \mathbf{s}\rangle}{||\mathbf{s}||^2}\)
- Scaled reference: \(\mathbf{s}_\text{target} = \alpha \cdot \mathbf{s}\)
- Error: \(\mathbf{e} = \mathbf{\hat{s}} - \mathbf{s}_\text{target}\)
- \(\text{SI-SDR} = 10 \log_{10} \frac{||\mathbf{s}_\text{target}||^2}{||\mathbf{e}||^2}\)
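The four steps above can be sketched on plain lists as follows. This is an illustration of the formulas rather than the library's implementation, and `si_sdr_db` is a hypothetical name.

```python
import math


def si_sdr_db(reference: list[float], estimate: list[float]) -> float:
    """Scale-invariant SDR in decibels, following the steps listed above."""
    if len(reference) != len(estimate):
        raise ValueError("signals must have the same length")
    dot = sum(e * s for e, s in zip(estimate, reference))
    reference_energy = sum(s * s for s in reference)
    alpha = dot / reference_energy                      # optimal scaling factor
    target = [alpha * s for s in reference]             # scaled reference
    error = [e - t for e, t in zip(estimate, target)]   # residual
    error_energy = sum(x * x for x in error)
    if error_energy == 0.0:
        return math.inf
    return 10 * math.log10(sum(t * t for t in target) / error_energy)


reference = [1.0, 0.0, -1.0, 0.0]
estimate = [1.0, 0.1, -1.0, -0.1]
louder_estimate = [3.0 * e for e in estimate]

score = si_sdr_db(reference, estimate)
score_after_scaling = si_sdr_db(reference, louder_estimate)
```

Both scores come out at about 20 dB: multiplying the estimate by a constant changes \(\alpha\) but leaves the ratio untouched.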
L1Norm
module-attribute
L1 norm (mean absolute error) between two signals (dimensionless). Lower is better.
Measures the average absolute difference between the reference and estimated signals.
- Time domain: \(\mathcal{L}_\text{L1} = \frac{1}{N} \sum_{n=1}^{N} |\mathbf{s}[n] - \mathbf{\hat{s}}[n]|\)
- Frequency domain: \(\mathcal{L}_\text{L1Freq} = \frac{1}{MK}\sum_{m=1}^{M} \sum_{k=1}^{K} \left||S(m, k)| - |\hat{S}(m, k)|\right|\)
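Both variants are mean absolute errors and can be sketched without any dependencies. The frequency-domain helper below assumes the spectrogram magnitudes have already been computed and are passed in as nested lists; both function names are illustrative.

```python
def l1_time(reference: list[float], estimate: list[float]) -> float:
    """Mean absolute difference between two time-domain signals."""
    if len(reference) != len(estimate):
        raise ValueError("signals must have the same length")
    return sum(abs(s - e) for s, e in zip(reference, estimate)) / len(reference)


def l1_freq(reference_mag: list[list[float]], estimate_mag: list[list[float]]) -> float:
    """Mean absolute difference between two magnitude spectrograms (precomputed)."""
    differences = [
        abs(s - e)
        for ref_row, est_row in zip(reference_mag, estimate_mag)
        for s, e in zip(ref_row, est_row)
    ]
    return sum(differences) / len(differences)


time_error = l1_time([0.0, 0.5, -0.5, 1.0], [0.0, 0.25, -0.25, 0.5])
freq_error = l1_freq([[1.0, 2.0], [3.0, 4.0]], [[1.0, 1.0], [2.0, 4.0]])
```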
DbDifferenceMel
module-attribute
Difference in the dB-scaled mel spectrogram. $$ \mathbf{D}(m, k) = \text{dB}(|\hat{S}_\text{mel}(m, k)|) - \text{dB}(|S_\text{mel}(m, k)|) $$
Bleedless
module-attribute
A metric to quantify the amount of "bleeding" from other sources. Higher is better.
Measures the average energy of the parts of the estimate's mel spectrogram that are louder than the reference. A high bleed value indicates that the estimate contains unwanted energy (bleed) from other sources, so a higher Bleedless score corresponds to less bleed: $$ \text{Bleed} = \text{mean}(\mathbf{D}(m, k)) \quad \forall \quad \mathbf{D}(m, k) > 0 $$
Fullness
module-attribute
A metric to quantify how much of the original source is missing. Higher is better.
Complementary to Bleedless. Measures the average energy of the parts of the estimate's mel spectrogram that are quieter than the reference. A large deficit indicates that parts of the target were lost during the separation; a higher Fullness score indicates that more of the original source's character is preserved. $$ \text{Fullness} = \text{mean}(|\mathbf{D}(m, k)|) \quad \forall \quad \mathbf{D}(m, k) < 0 $$
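The two masked means from the Bleedless and Fullness formulas can be sketched on a small, made-up dB-difference matrix \(\mathbf{D}\). The helper name and the values are hypothetical, and the sketch stops at the raw means: how they are mapped to the final "higher is better" scores is not covered here.

```python
def masked_means(diff_db: list[list[float]]) -> tuple[float, float]:
    """Return (mean of positive entries, mean absolute value of negative entries) of D."""
    flat = [d for row in diff_db for d in row]
    positive = [d for d in flat if d > 0]        # estimate louder than reference
    negative = [abs(d) for d in flat if d < 0]   # estimate quieter than reference
    excess = sum(positive) / len(positive) if positive else 0.0
    deficit = sum(negative) / len(negative) if negative else 0.0
    return excess, deficit


# Made-up dB differences: rows and columns index the mel spectrogram cells.
diff_db = [
    [2.0, -1.0],
    [0.0, 4.0],
    [-3.0, 0.0],
]
excess, deficit = masked_means(diff_db)
```

Cells where the estimate matches the reference exactly (a difference of zero) contribute to neither mean.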