Core

core

Reusable, pure algorithmic components for inference and training.

Classes:

Name	Description
`Audio`
`NormalizationStats`	Statistics for normalizing and denormalizing audio.
`NormalizedAudio`	Container for normalized audio and its original stats.
`StreamingStitcher`	Incremental overlap-add for chunked model outputs.
`SequenceChunkSplit`
`ModelWaveformToWaveform`
`LogMelSpect`	Computes the log-mel spectrogram of a waveform.
`SequenceFeatureExtractor`	Protocol for sequence feature extractors.
`IdentitySequenceFeatureExtractor`
`LogMelSequenceFeatureExtractor`
`HcqtSequenceFeatureExtractor`

Functions:

Name	Description
`normalize_audio`	Preprocess the raw audio in the time domain to have a mean of 0 and a std of 1 before passing it to the model.
`denormalize_audio`	Restore the model output to its original loudness.
`generate_chunks`	Generates batches of overlapping chunks from an audio tensor.
`stitch_chunks`	Stitches processed audio chunks back together using the overlap-add method.
`aggregate_logits`	Stitches time-series logits (split/aggregate strategy).
`aggregate_sequence_chunks`	Aggregate generic time-major chunk outputs.
`pad_dim`	Pad an arbitrary tensor on a specific dimension.
`split_sequence_tensor`	Split a time-major sequence tensor into overlapping chunks.
`apply_mask`	Applies a complex mask to a spectrogram.
`get_model_floating_dtype`	Infer floating input dtype from the model's first floating parameter.
`to_model_device`	Move tensor to model device while preserving model floating dtype compatibility.
`create_w2w_model`
`to_log_magnitude`	Convert complex or real spectrogram-like tensors to dB log-magnitude.
`to_log_power`	Convert complex or real spectrogram-like tensors to dB log-power.
`create_sequence_feature_extractor`
`derive_stems`	It is the caller's responsibility to ensure that all tensors are aligned and have the same shape.
`str_to_torch_dtype`

Audio `dataclass`

Audio(data: _AudioTensorLike, sample_rate: SampleRate)

Bases: Generic[_AudioTensorLike]

Attributes:

Name	Type	Description
`data`	`_AudioTensorLike`	Either a raw or a normalized audio tensor.
`sample_rate`	`SampleRate`

data `instance-attribute`

data: _AudioTensorLike

Either a raw or a normalized audio tensor.

sample_rate `instance-attribute`

sample_rate: SampleRate

NormalizationStats `dataclass`

NormalizationStats(
    mean: float, std: Annotated[float, Gt(0)]
)

Statistics for normalizing and denormalizing audio.

Attributes:

Name	Type	Description
`mean`	`float`	Mean \(\mu\) of the mixture
`std`	`Annotated[float, Gt(0)]`	Standard deviation \(\sigma\) of the mixture

mean `instance-attribute`

mean: float

Mean \(\mu\) of the mixture

std `instance-attribute`

std: Annotated[float, Gt(0)]

Standard deviation \(\sigma\) of the mixture

NormalizedAudio `dataclass`

NormalizedAudio(
    audio: Audio[NormalizedAudioTensor],
    stats: NormalizationStats,
)

Container for normalized audio and its original stats.

Attributes:

Name	Type	Description
`audio`	`Audio[NormalizedAudioTensor]`
`stats`	`NormalizationStats`

audio `instance-attribute`

audio: Audio[NormalizedAudioTensor]

stats `instance-attribute`

stats: NormalizationStats

normalize_audio

normalize_audio(
    audio: Audio[RawAudioTensor],
) -> NormalizedAudio

Preprocess the raw audio in the time domain to have a mean of 0 and a std of 1 before passing it to the model.

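Statistics are computed on the mean of the channels (a mono mixdown), and the same mean and std are then applied to every channel. For near-silent input (std ≤ 1e-8) the audio is returned unchanged and std is recorded as 1.0.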

Source code in src/splifft/core.py

def normalize_audio(audio: Audio[t.RawAudioTensor]) -> NormalizedAudio:
    """Preprocess the raw audio in the time domain to have a mean of 0 and a std of 1
    before passing it to the model.

    Operates on the mean of the [channels][splifft.types.Channels].
    """
    mono_audio = audio.data.mean(dim=0)
    mean = float(mono_audio.mean())
    std = float(mono_audio.std())

    if std <= 1e-8:  # silent audio
        return NormalizedAudio(
            audio=Audio(data=t.NormalizedAudioTensor(audio.data), sample_rate=audio.sample_rate),
            stats=NormalizationStats(mean, 1.0),
        )

    normalized_data = (audio.data - mean) / std
    return NormalizedAudio(
        audio=Audio(data=t.NormalizedAudioTensor(normalized_data), sample_rate=audio.sample_rate),
        stats=NormalizationStats(mean, std),
    )

denormalize_audio

denormalize_audio(
    audio_data: NormalizedAudioTensor,
    stats: NormalizationStats,
) -> RawAudioTensor

Restore the model output to its original loudness by inverting normalize_audio: each sample \(x\) becomes \(x\sigma + \mu\).

Source code in src/splifft/core.py

def denormalize_audio(
    audio_data: t.NormalizedAudioTensor, stats: NormalizationStats
) -> t.RawAudioTensor:
    """Take the model output and restore them to their original loudness."""
    return t.RawAudioTensor((audio_data * stats.std) + stats.mean)
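The normalize/denormalize pair is an affine map and its inverse. A minimal sketch in plain PyTorch (the helpers `normalize` and `denormalize` are hypothetical, not splifft's API) mirrors the logic above and checks the round trip on a stereo signal:

```python
import torch


def normalize(audio: torch.Tensor, eps: float = 1e-8) -> tuple[torch.Tensor, float, float]:
    # Statistics come from the channel mean (mono mixdown), as in normalize_audio.
    mono = audio.mean(dim=0)
    mean, std = float(mono.mean()), float(mono.std())
    if std <= eps:  # near-silent input: leave the data untouched and record std = 1.0
        return audio, mean, 1.0
    return (audio - mean) / std, mean, std


def denormalize(data: torch.Tensor, mean: float, std: float) -> torch.Tensor:
    # Inverse affine map: x = x_norm * std + mean
    return data * std + mean


torch.manual_seed(0)
stereo = torch.randn(2, 44_100) * 0.3 + 0.05  # (channels, samples)
normed, mean, std = normalize(stereo)
restored = denormalize(normed, mean, std)
print(torch.allclose(restored, stereo, atol=1e-5))  # True
```

Because one mean and one std are shared across channels, the inter-channel balance (and therefore the stereo image) is preserved.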

generate_chunks

generate_chunks(
    audio_data: RawAudioTensor | NormalizedAudioTensor,
    chunk_size: ChunkSize,
    hop_size: HopSize,
    batch_size: BatchSize,
    *,
    padding_mode: PaddingMode = "reflect",
) -> Iterator[PaddedChunkedAudioTensor]

Generates batches of overlapping chunks from an audio tensor.

Reflect padding requires the pad size (chunk_size - hop_size) to be strictly less than the length of the audio's last dimension; use the constant padding mode otherwise.

Returns:

Type	Description
`Iterator[PaddedChunkedAudioTensor]`	An iterator that yields batches of chunks of shape (B, C, chunk_T).

Source code in src/splifft/core.py

def generate_chunks(
    audio_data: t.RawAudioTensor | t.NormalizedAudioTensor,
    chunk_size: t.ChunkSize,
    hop_size: t.HopSize,
    batch_size: t.BatchSize,
    *,
    padding_mode: t.PaddingMode = "reflect",
) -> Iterator[t.PaddedChunkedAudioTensor]:
    """Generates batches of overlapping chunks from an audio tensor.

    Note that reflect padding requires pad size to be strictly less than the audio dimension size.
    Use `constant` padding mode otherwise.

    :return: An iterator that yields batches of chunks of shape (B, C, chunk_T).
    """
    padding = chunk_size - hop_size
    padded_audio = F.pad(audio_data, (padding, padding), mode=padding_mode)

    padded_len = padded_audio.shape[-1]
    rem = (padded_len - chunk_size) % hop_size
    if rem != 0:
        final_pad = hop_size - rem
        padded_audio = F.pad(padded_audio, (0, final_pad), mode="constant", value=0)

    unfolded = padded_audio.unfold(
        dimension=-1, size=chunk_size, step=hop_size
    )  # (C, num_chunks, chunk_size)

    num_chunks = unfolded.shape[1]
    unfolded = unfolded.permute(1, 0, 2)  # (num_chunks, C, chunk_size)

    for i in range(0, num_chunks, batch_size):
        yield t.PaddedChunkedAudioTensor(unfolded[i : i + batch_size])
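The padding arithmetic can be checked with a minimal sketch that re-implements the same pad-and-unfold steps in plain PyTorch; `chunk_audio` is a hypothetical helper, not part of splifft. For 1000 samples with chunk_size=256 and hop_size=128, padding both ends by 128 gives 1256 samples, the final pad adds 24 more to reach 1280, and unfolding yields (1280 - 256) / 128 + 1 = 9 chunks.

```python
import torch
import torch.nn.functional as F


def chunk_audio(audio, chunk_size, hop_size, padding_mode="reflect"):
    # Pad both ends by chunk_size - hop_size so the edges get the same overlap as the interior.
    padding = chunk_size - hop_size
    padded = F.pad(audio, (padding, padding), mode=padding_mode)
    # Right-pad with zeros so the last window ends exactly at the end of the buffer.
    rem = (padded.shape[-1] - chunk_size) % hop_size
    if rem:
        padded = F.pad(padded, (0, hop_size - rem), mode="constant", value=0.0)
    # (C, num_chunks, chunk_size) -> (num_chunks, C, chunk_size)
    return padded.unfold(-1, chunk_size, hop_size).permute(1, 0, 2)


torch.manual_seed(0)
audio = torch.randn(2, 1000)  # (channels, samples)
chunks = chunk_audio(audio, chunk_size=256, hop_size=128)
print(chunks.shape)  # torch.Size([9, 2, 256])
batches = list(chunks.split(4))  # batch_size=4, like the generator's slicing
print([b.shape[0] for b in batches])  # [4, 4, 1]
```

Since unfold returns a view, chunking itself does not copy the padded audio; memory is only committed once each batch is processed.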

stitch_chunks

stitch_chunks(
    processed_chunks: Sequence[SeparatedChunkedTensor],
    num_stems: NumModelStems,
    chunk_size: ChunkSize,
    hop_size: HopSize,
    target_num_samples: Samples,
    *,
    window: WindowTensor,
) -> RawSeparatedTensor

Stitches processed audio chunks back together using the overlap-add method.

Warning

This function materializes all chunks in memory simultaneously before overlap-add, so memory use grows with track length and stem count and can easily cause out-of-memory (OOM) errors. Prefer StreamingStitcher, which performs overlap-add incrementally.

Reconstructs the full audio signal from a sequence of overlapping, processed chunks. Rather than requiring the window itself to satisfy the constant-overlap-add condition \(\sum_{m=-\infty}^{\infty} w[n - mH] = C\), where \(H\) is the hop size, it divides each output sample by the accumulated window sum at that position, so the effective weights sum to a constant at every time step.

Assumptions:

- processed_chunks is non-empty.
- Every batch has shape (batch, num_stems, channels, chunk_size).
- Batches are already ordered in the same temporal order used during chunking.
- All batches share the same channel count and dtype.
- window has length chunk_size and matches the overlap-add window used during chunking.

Source code in src/splifft/core.py

def stitch_chunks(
    processed_chunks: Sequence[t.SeparatedChunkedTensor],
    num_stems: t.NumModelStems,
    chunk_size: t.ChunkSize,
    hop_size: t.HopSize,
    target_num_samples: t.Samples,
    *,
    window: t.WindowTensor,
) -> t.RawSeparatedTensor:
    r"""Stitches processed audio chunks back together using the [overlap-add method](https://en.wikipedia.org/wiki/Overlap%E2%80%93add_method).

    !!! warning

        This function materializes all chunks in memory simultaneously before overlap-add,
        which scales poorly with track length and stem count, often leading to OOM.
        Prefer using [`StreamingStitcher`][splifft.core.StreamingStitcher] instead,
        which performs overlap-add incrementally.

    Reconstructs the full audio signal from a sequence of overlapping, processed chunks. Ensures
    that the sum of all overlapping windows is constant at every time step:
    $\sum_{m=-\infty}^{\infty} w[n - mH] = C$ where $H$ is the [hop size][splifft.types.HopSize].

    Assumptions:

    - `processed_chunks` is non-empty.
    - Every batch has shape `(batch, num_stems, channels, chunk_size)`.
    - Batches are already ordered in the same temporal order used during chunking.
    - All batches share the same channel count and dtype.
    - `window` has length `chunk_size` and matches the overlap-add window used during chunking.
    """
    first_batch = processed_chunks[0]
    num_channels = first_batch.shape[2]

    total_chunks = sum(batch.shape[0] for batch in processed_chunks)
    total_length = (total_chunks - 1) * hop_size + chunk_size
    window_on_device = t.WindowTensor(window.to(device=first_batch.device, dtype=first_batch.dtype))

    stitched = torch.zeros(
        (num_stems, num_channels, total_length),
        device=first_batch.device,
        dtype=first_batch.dtype,
    )
    norm_window = torch.zeros(total_length, device=first_batch.device, dtype=first_batch.dtype)
    window_view = window_on_device.view(1, 1, -1)

    chunk_index = 0
    for chunk_batch in processed_chunks:
        batch_size = chunk_batch.shape[0]

        start = chunk_index * hop_size
        for chunk in chunk_batch.unbind(0):
            # NOTE: on CPU, dst.addcmul_(x, y) is not bit-identical to dst.add_(x * y) once
            # dst already contains data. on CUDA, both ops are bit-identical.
            stitched.narrow(-1, start, chunk_size).addcmul_(chunk, window_view)
            norm_window.narrow(0, start, chunk_size).add_(window_on_device)
            start += hop_size
        chunk_index += batch_size

    norm_window.clamp_min_(1e-8)  # for edges where the window sum might be zero
    stitched /= norm_window.view(1, 1, -1)

    padding = chunk_size - hop_size
    if padding > 0:
        stitched = stitched[..., padding:-padding]

    return t.RawSeparatedTensor(stitched[..., :target_num_samples])
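A round-trip check makes the window normalization concrete. The sketch below uses a hypothetical `overlap_add` helper written in plain PyTorch (not splifft's API): it chunks a signal the same way generate_chunks does (constant padding here), passes the chunks through unchanged as a single stem, and stitches them back.

```python
import torch
import torch.nn.functional as F


def overlap_add(chunks, hop_size, window, target_len):
    # chunks: (num_chunks, stems, channels, chunk_size), in temporal order
    num_chunks, stems, channels, chunk_size = chunks.shape
    total = (num_chunks - 1) * hop_size + chunk_size
    out = torch.zeros(stems, channels, total, dtype=chunks.dtype)
    norm = torch.zeros(total, dtype=chunks.dtype)
    for i, chunk in enumerate(chunks):
        start = i * hop_size
        out[..., start : start + chunk_size] += chunk * window
        norm[start : start + chunk_size] += window
    out /= norm.clamp_min(1e-8)  # divide by the accumulated window weight
    padding = chunk_size - hop_size
    if padding > 0:
        out = out[..., padding:-padding]  # drop the padding added before chunking
    return out[..., :target_len]


torch.manual_seed(0)
audio = torch.randn(2, 1000)
chunk_size, hop_size = 256, 128
padding = chunk_size - hop_size
padded = F.pad(audio, (padding, padding))
rem = (padded.shape[-1] - chunk_size) % hop_size
if rem:
    padded = F.pad(padded, (0, hop_size - rem))
# (num_chunks, stems=1, channels, chunk_size)
chunks = padded.unfold(-1, chunk_size, hop_size).permute(1, 0, 2).unsqueeze(1)
restored = overlap_add(chunks, hop_size, torch.hann_window(chunk_size), audio.shape[-1])
print(restored.shape)  # torch.Size([1, 2, 1000])
print(torch.allclose(restored[0], audio, atol=1e-5))  # True
```

Because every sample is divided by its own window sum, the reconstruction also holds for any strictly positive window, not only for windows that satisfy the COLA condition.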

StreamingStitcher

StreamingStitcher(
    chunk_size: ChunkSize,
    hop_size: HopSize,
    target_num_samples: Samples,
    window: WindowTensor,
)

Incremental overlap-add for chunked model outputs.

Classes:

Name	Description
`Buffers`

Methods:

Name	Description
`step`	Accumulate one `(stems, channels, chunk_size)` chunk.
`flush`	Emit the trailing valid tail after the final chunk.

Attributes:

Name	Type	Description
`chunk_size`
`hop_size`
`target_num_samples`
`window`
`tail_size`
`samples_to_skip`
`samples_to_keep`
`buffers`	`Buffers | None`

Source code in src/splifft/core.py

def __init__(
    self,
    chunk_size: t.ChunkSize,
    hop_size: t.HopSize,
    target_num_samples: t.Samples,
    window: t.WindowTensor,
) -> None:
    self.chunk_size = chunk_size
    self.hop_size = hop_size
    self.target_num_samples = target_num_samples
    self.window = window
    self.tail_size = chunk_size - hop_size

    self.samples_to_skip = self.tail_size
    self.samples_to_keep = target_num_samples
    self.buffers: StreamingStitcher.Buffers | None = None

Buffers `dataclass`

Buffers(
    buffer: Tensor,
    norm_buffer: Tensor,
    shift_buffer: Tensor,
    shift_norm_buffer: Tensor,
)

Attributes:

Name	Type	Description
`buffer`	`Tensor`
`norm_buffer`	`Tensor`
`shift_buffer`	`Tensor`
`shift_norm_buffer`	`Tensor`

buffer `instance-attribute`

buffer: Tensor

norm_buffer `instance-attribute`

norm_buffer: Tensor

shift_buffer `instance-attribute`

shift_buffer: Tensor

shift_norm_buffer `instance-attribute`

shift_norm_buffer: Tensor

chunk_size `instance-attribute`

chunk_size = chunk_size

hop_size `instance-attribute`

hop_size = hop_size

target_num_samples `instance-attribute`

target_num_samples = target_num_samples

window `instance-attribute`

window = window

tail_size `instance-attribute`

tail_size = chunk_size - hop_size

samples_to_skip `instance-attribute`

samples_to_skip = tail_size

samples_to_keep `instance-attribute`

samples_to_keep = target_num_samples

buffers `instance-attribute`

buffers: Buffers | None = None

step

step(chunk: Tensor) -> Tensor | None

Accumulate one (stems, channels, chunk_size) chunk.

Source code in src/splifft/core.py

def step(self, chunk: Tensor) -> Tensor | None:
    """Accumulate one `(stems, channels, chunk_size)` chunk."""
    # NOTE: the caller still uses a single output tensor at the end (because streaming
    # serialisation is a can of worms) but we still keep the overlap add state bounded.
    if chunk.ndim != 3:
        raise ValueError(f"expected chunk rank 3, got shape {tuple(chunk.shape)}")
    if chunk.shape[-1] != self.chunk_size:
        raise ValueError(f"expected chunk length {self.chunk_size}, got {chunk.shape[-1]}")
    if self.buffers is None:
        chunk_shape = (*chunk.shape[:-1], self.chunk_size)
        self.buffers = StreamingStitcher.Buffers(
            # alloc from declared chunk size. does upstream hand us an edge chunk?
            buffer=torch.zeros(chunk_shape, device=chunk.device, dtype=chunk.dtype),
            norm_buffer=torch.zeros(self.chunk_size, device=chunk.device, dtype=chunk.dtype),
            shift_buffer=torch.zeros(
                (*chunk.shape[:-1], self.tail_size),
                device=chunk.device,
                dtype=chunk.dtype,
            ),
            shift_norm_buffer=torch.zeros(
                self.tail_size,
                device=chunk.device,
                dtype=chunk.dtype,
            ),
        )

    buffer = self.buffers.buffer
    norm_buffer = self.buffers.norm_buffer
    shift_buffer = self.buffers.shift_buffer
    shift_norm_buffer = self.buffers.shift_norm_buffer

    buffer.addcmul_(chunk, self.window.view(1, 1, -1))
    norm_buffer.add_(self.window)

    finalized = buffer[..., : self.hop_size] / norm_buffer[: self.hop_size].clamp_min(1e-8)

    if self.tail_size > 0:
        # NOTE: while a ring buffer gives slightly lower RSS, it significantly increased wall
        # clock. we just use explicit shift for now
        shift_buffer.copy_(buffer[..., self.hop_size :])
        buffer[..., : self.tail_size].copy_(shift_buffer)
        buffer[..., self.tail_size :].zero_()

        shift_norm_buffer.copy_(norm_buffer[self.hop_size :])
        norm_buffer[: self.tail_size].copy_(shift_norm_buffer)
        norm_buffer[self.tail_size :].zero_()
    else:
        buffer.zero_()
        norm_buffer.zero_()

    return self._extract_valid(finalized)

flush

flush() -> Tensor | None

Emit the trailing valid tail after the final chunk.

Source code in src/splifft/core.py

def flush(self) -> Tensor | None:
    """Emit the trailing valid tail after the final chunk."""
    if self.buffers is None or self.samples_to_keep <= 0 or self.tail_size == 0:
        return None
    finalized = self.buffers.buffer[..., : self.tail_size] / self.buffers.norm_buffer[
        : self.tail_size
    ].clamp_min(1e-8)
    return self._extract_valid(finalized)
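The step/flush pattern above can be illustrated with a minimal sketch, assuming a hypothetical `MiniStreamingOLA` class that keeps only a chunk_size-long buffer, emits hop_size finished samples per step, and emits the remaining tail on flush. It omits the edge trimming that StreamingStitcher performs through samples_to_skip and samples_to_keep (via _extract_valid, not shown here), and checks that the streamed output matches a fully materialized overlap-add.

```python
import torch


class MiniStreamingOLA:
    """Hypothetical sketch: bounded-memory overlap-add emitting hop_size samples per step."""

    def __init__(self, chunk_size: int, hop_size: int, window: torch.Tensor):
        self.hop_size = hop_size
        self.tail_size = chunk_size - hop_size
        self.window = window
        self.buffer = None  # running weighted sum, allocated lazily from the first chunk
        self.norm = torch.zeros(chunk_size, dtype=window.dtype)  # running window sum

    def step(self, chunk: torch.Tensor) -> torch.Tensor:
        if self.buffer is None:
            self.buffer = torch.zeros_like(chunk)
        self.buffer += chunk * self.window
        self.norm += self.window
        # The first hop_size samples receive no contribution from any later chunk.
        done = self.buffer[..., : self.hop_size] / self.norm[: self.hop_size].clamp_min(1e-8)
        # Shift the still-open tail to the front and zero the freed region.
        self.buffer = torch.cat(
            [self.buffer[..., self.hop_size :], torch.zeros_like(self.buffer[..., : self.hop_size])],
            dim=-1,
        )
        self.norm = torch.cat([self.norm[self.hop_size :], torch.zeros_like(self.norm[: self.hop_size])])
        return done

    def flush(self) -> torch.Tensor:
        # Emit the tail left over after the final chunk.
        return self.buffer[..., : self.tail_size] / self.norm[: self.tail_size].clamp_min(1e-8)


torch.manual_seed(0)
chunk_size, hop_size, num_chunks = 8, 4, 5
window = torch.hann_window(chunk_size) + 0.1  # strictly positive, so no sample has zero weight
chunks = torch.randn(num_chunks, 2, chunk_size)  # (num_chunks, channels, chunk_size)

ola = MiniStreamingOLA(chunk_size, hop_size, window)
streamed = torch.cat([ola.step(c) for c in chunks] + [ola.flush()], dim=-1)

# Reference: materialize everything, then normalize once.
total = (num_chunks - 1) * hop_size + chunk_size
reference = torch.zeros(2, total)
norm = torch.zeros(total)
for i, c in enumerate(chunks):
    reference[:, i * hop_size : i * hop_size + chunk_size] += c * window
    norm[i * hop_size : i * hop_size + chunk_size] += window
reference /= norm
print(torch.allclose(streamed, reference, atol=1e-6))  # True
```

State stays at one chunk plus one window regardless of track length, which is the property that makes the streaming variant preferable to stitch_chunks.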

aggregate_logits

aggregate_logits(
    processed_chunks: Sequence[LogitsTensor],
    starts: Sequence[int],
    full_size: int,
    chunk_size: int,
    num_stems: int,
    *,
    trim_margin: int = 0,
    overlap_mode: OverlapMode = "keep_first",
) -> LogitsTensor

Stitches time-series logits (split/aggregate strategy).

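This reproduces beat_this's aggregation behavior exactly:

- trim trim_margin frames from each side of every chunk
- write the remaining frames into a full-size buffer; frames covered by no chunk keep the fill value -1000.0
- in keep_first mode, process chunks in reverse so earlier chunks overwrite later ones; keep_last processes them in order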

Source code in src/splifft/core.py

def aggregate_logits(
    processed_chunks: Sequence[t.LogitsTensor],
    starts: Sequence[int],
    full_size: int,
    chunk_size: int,
    num_stems: int,
    *,
    trim_margin: int = 0,
    overlap_mode: t.OverlapMode = "keep_first",
) -> t.LogitsTensor:
    """Stitches time-series logits (split/aggregate strategy).

    This is a 1:1 map of beat_this's aggregation behavior:
    - trim `trim_margin` frames from each chunk side
    - write into a full-size buffer
    - in `keep_first` mode, process chunks in reverse so earlier chunks
      overwrite later ones
    """
    all_chunks = torch.cat(tuple(processed_chunks), dim=0)
    total_chunks, _, chunk_len_frames = all_chunks.shape

    if len(starts) != total_chunks:
        raise ValueError(f"expected {total_chunks=} starts, got {len(starts)}")
    if chunk_len_frames != chunk_size:
        raise ValueError(f"expected {chunk_size=} but got chunk length {chunk_len_frames}")
    if trim_margin * 2 >= chunk_len_frames:
        raise ValueError(f"{trim_margin=} is too large for {chunk_len_frames=}")

    buffer = torch.full(
        (num_stems, full_size), -1000.0, device=all_chunks.device, dtype=all_chunks.dtype
    )

    if overlap_mode == "keep_first":
        indices = range(total_chunks - 1, -1, -1)
    elif overlap_mode == "keep_last":
        indices = range(total_chunks)
    else:
        assert_never(overlap_mode)

    for i in indices:
        chunk = all_chunks[i]
        chunk_valid = chunk[:, trim_margin : chunk_len_frames - trim_margin]
        start = starts[i] + trim_margin
        end = starts[i] + chunk_size - trim_margin

        clipped_start = max(0, start)
        clipped_end = min(end, full_size)
        if clipped_start >= clipped_end:
            continue

        src_start = clipped_start - start
        src_end = src_start + (clipped_end - clipped_start)
        buffer[:, clipped_start:clipped_end] = chunk_valid[:, src_start:src_end]

    return t.LogitsTensor(buffer)
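The effect of overlap_mode is easiest to see on two overlapping chunks. The sketch below uses a hypothetical `aggregate` helper that mirrors the loop above for (stems, chunk_len) tensors; it is not splifft's function.

```python
import torch


def aggregate(chunks, starts, full_size, trim_margin=0, keep_first=True, fill=-1000.0):
    # chunks: list of (stems, chunk_len) tensors; starts may be negative
    stems, chunk_len = chunks[0].shape
    out = torch.full((stems, full_size), fill)
    # keep_first walks chunks backwards so earlier chunks are written last and win overlaps
    order = range(len(chunks) - 1, -1, -1) if keep_first else range(len(chunks))
    for i in order:
        valid = chunks[i][:, trim_margin : chunk_len - trim_margin]
        start = starts[i] + trim_margin
        lo, hi = max(0, start), min(start + valid.shape[1], full_size)
        if lo < hi:  # skip chunks that fall entirely outside the buffer
            out[:, lo:hi] = valid[:, lo - start : hi - start]
    return out


early, late = torch.zeros(1, 6), torch.ones(1, 6)
first = aggregate([early, late], [0, 4], 12)[0].tolist()
last = aggregate([early, late], [0, 4], 12, keep_first=False)[0].tolist()
print(first)  # [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, -1000.0, -1000.0]
print(last)   # [0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, -1000.0, -1000.0]
```

Frames 4 and 5 are covered by both chunks and take the earlier chunk's values under keep_first; frames 10 and 11 are covered by neither and keep the very negative fill logit.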

aggregate_sequence_chunks

aggregate_sequence_chunks(
    processed_chunks: Sequence[Tensor],
    starts: Sequence[int],
    full_size: int,
    chunk_size: int,
    *,
    trim_margin: int = 0,
    overlap_mode: OverlapMode = "keep_first",
) -> Tensor

Aggregate generic time-major chunk outputs.

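Each processed_chunks[i] must have shape (chunk_time, ...), where ... can contain any additional feature dimensions (for example, bins for activations). Trimming and overlap handling follow aggregate_logits, except that frames covered by no chunk are zero-filled.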

Source code in src/splifft/core.py

def aggregate_sequence_chunks(
    processed_chunks: Sequence[Tensor],
    starts: Sequence[int],
    full_size: int,
    chunk_size: int,
    *,
    trim_margin: int = 0,
    overlap_mode: t.OverlapMode = "keep_first",
) -> Tensor:
    """Aggregate generic time-major chunk outputs.

    Each `processed_chunks[i]` must have shape `(chunk_time, ...)` where `...` can
    contain any additional feature dimensions (for example bins for `activations`).
    """
    if not processed_chunks:
        raise ValueError("expected at least one chunk")

    chunk_len_frames = int(processed_chunks[0].shape[0])
    if trim_margin * 2 >= chunk_len_frames:
        raise ValueError(f"{trim_margin=} is too large for {chunk_len_frames=}")
    if len(starts) != len(processed_chunks):
        raise ValueError(f"expected {len(processed_chunks)=} starts, got {len(starts)}")
    if chunk_len_frames != chunk_size:
        raise ValueError(f"expected {chunk_size=} but got chunk length {chunk_len_frames}")

    tail_shape = tuple(processed_chunks[0].shape[1:])
    buffer = processed_chunks[0].new_zeros((full_size, *tail_shape))

    if overlap_mode == "keep_first":
        indices = range(len(processed_chunks) - 1, -1, -1)
    elif overlap_mode == "keep_last":
        indices = range(len(processed_chunks))
    else:
        assert_never(overlap_mode)

    for i in indices:
        chunk = processed_chunks[i]
        if tuple(chunk.shape[1:]) != tail_shape:
            raise ValueError(
                f"all stream chunks must have identical non-time dimensions, got {chunk.shape[1:]} and {tail_shape}"
            )

        chunk_valid = chunk[trim_margin : chunk_len_frames - trim_margin]
        start = starts[i] + trim_margin
        end = starts[i] + chunk_size - trim_margin

        clipped_start = max(0, start)
        clipped_end = min(end, full_size)
        if clipped_start >= clipped_end:
            continue

        src_start = clipped_start - start
        src_end = src_start + (clipped_end - clipped_start)
        buffer[clipped_start:clipped_end] = chunk_valid[src_start:src_end]

    return buffer

pad_dim

pad_dim(
    tensor: Tensor,
    *,
    dim: int,
    pad: tuple[int, int],
    value: float = 0.0,
) -> Tensor

Pad an arbitrary tensor on a specific dimension.

This avoids relying on F.pad's argument ordering, whose pad tuple lists widths starting from the last dimension and moving backwards.

Source code in src/splifft/core.py

def pad_dim(tensor: Tensor, *, dim: int, pad: tuple[int, int], value: float = 0.0) -> Tensor:
    """Pad an arbitrary tensor on a specific dimension.

    This avoids relying on `F.pad`'s reverse-dimension argument ordering.
    """
    left, right = pad
    if left < 0 or right < 0:
        raise ValueError(f"expected non-negative pad widths, got left={left}, right={right}")
    if left == 0 and right == 0:
        return tensor

    rank = tensor.ndim
    resolved_dim = dim if dim >= 0 else rank + dim
    if resolved_dim < 0 or resolved_dim >= rank:
        raise IndexError(f"dim out of range for rank-{rank} tensor: {dim}")

    pieces: list[Tensor] = []
    if left:
        left_shape = list(tensor.shape)
        left_shape[resolved_dim] = left
        pieces.append(tensor.new_full(left_shape, fill_value=value))

    pieces.append(tensor)

    if right:
        right_shape = list(tensor.shape)
        right_shape[resolved_dim] = right
        pieces.append(tensor.new_full(right_shape, fill_value=value))

    return torch.cat(pieces, dim=resolved_dim)
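To see the ordering pitfall, the sketch below pads dimension 1 of a (2, 3, 4) tensor both ways: with F.pad, whose tuple must first list the (unchanged) last dimension, and with a hypothetical `pad_along` helper that follows the same build-and-concatenate approach as pad_dim.

```python
import torch
import torch.nn.functional as F

x = torch.ones(2, 3, 4)
# F.pad's tuple starts at the LAST dim: (last_left, last_right, second_last_left, second_last_right, ...)
via_fpad = F.pad(x, (0, 0, 1, 2), value=-1.0)  # pads dim 1 by (1, 2)


def pad_along(t, dim, left, right, value=0.0):
    # hypothetical helper mirroring pad_dim: build fill blocks and concatenate along `dim`
    shape = list(t.shape)
    parts = []
    if left:
        shape[dim] = left
        parts.append(t.new_full(shape, value))
    parts.append(t)
    if right:
        shape[dim] = right
        parts.append(t.new_full(shape, value))
    return torch.cat(parts, dim=dim)


via_cat = pad_along(x, 1, 1, 2, value=-1.0)
print(via_fpad.shape)  # torch.Size([2, 6, 4])
print(torch.equal(via_fpad, via_cat))  # True
```

Naming the dimension explicitly removes the need to count how many trailing (0, 0) pairs to prepend, which is easy to get wrong for higher-rank tensors.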

SequenceChunkSplit

Bases: NamedTuple

Attributes:

Name	Type	Description
`chunks`	`list[Tensor]`
`starts`	`list[int]`

chunks `instance-attribute`

chunks: list[Tensor]

starts `instance-attribute`

starts: list[int]

split_sequence_tensor

split_sequence_tensor(
    sequence: Tensor,
    chunk_size: int,
    *,
    trim_margin: int = 0,
    avoid_short_end: bool = True,
) -> SequenceChunkSplit

Split a time-major sequence tensor into overlapping chunks.

sequence must be shaped (time, ...), where ... can be any feature tail.

Source code in src/splifft/core.py

def split_sequence_tensor(
    sequence: Tensor,
    chunk_size: int,
    *,
    trim_margin: int = 0,
    avoid_short_end: bool = True,
) -> SequenceChunkSplit:
    """Split a time-major sequence tensor into overlapping chunks.

    `sequence` must be shaped `(time, ...)`, where `...` can be any feature tail.
    """
    full_size = sequence.shape[0]
    if (step := chunk_size - 2 * trim_margin) <= 0:
        raise ValueError(
            f"expected chunk_size - 2*trim_margin > 0, got {chunk_size=}, {trim_margin=}"
        )
    if not (starts := list(range(-trim_margin, full_size - trim_margin, step))):
        starts = [-trim_margin]
    if avoid_short_end and full_size > step:
        starts[-1] = full_size - (chunk_size - trim_margin)

    chunks: list[Tensor] = []
    for start in starts:
        src_start = max(start, 0)
        src_end = min(start + chunk_size, full_size)
        left = max(0, -start)
        right = max(0, start + chunk_size - full_size)

        chunk = sequence[src_start:src_end]
        if left > 0 or right > 0:
            chunk = pad_dim(chunk, dim=0, pad=(left, right), value=0.0)
        chunks.append(chunk)

    return SequenceChunkSplit(chunks, starts)
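Splitting and aggregating are designed as inverses when the model is the identity. The sketch below re-implements both in plain PyTorch for a (time, features) tensor (the helpers `split_sequence` and `merge_sequence` are hypothetical, not splifft's API). With 25 frames, chunk_size=10 and trim_margin=2, the step is 10 - 2·2 = 6, the regular starts are -2, 4, 10, 16, 22, and avoid_short_end moves the last start back to 25 - (10 - 2) = 17 so the final chunk's valid region ends exactly at frame 25.

```python
import torch
import torch.nn.functional as F


def split_sequence(seq, chunk_size, trim_margin=0, avoid_short_end=True):
    full = seq.shape[0]
    step = chunk_size - 2 * trim_margin
    starts = list(range(-trim_margin, full - trim_margin, step)) or [-trim_margin]
    if avoid_short_end and full > step:
        starts[-1] = full - (chunk_size - trim_margin)
    chunks = []
    for s in starts:
        piece = seq[max(s, 0) : min(s + chunk_size, full)]
        left, right = max(0, -s), max(0, s + chunk_size - full)
        # zero-pad along time (dim 0) of a (time, features) tensor
        chunks.append(F.pad(piece, (0, 0, left, right)))
    return chunks, starts


def merge_sequence(chunks, starts, full, chunk_size, trim_margin=0):
    out = chunks[0].new_zeros((full, *chunks[0].shape[1:]))
    for i in reversed(range(len(chunks))):  # keep_first: earlier chunks overwrite later ones
        valid = chunks[i][trim_margin : chunk_size - trim_margin]
        start = starts[i] + trim_margin
        lo, hi = max(0, start), min(start + valid.shape[0], full)
        if lo < hi:
            out[lo:hi] = valid[lo - start : hi - start]
    return out


seq = torch.arange(50.0).reshape(25, 2)  # (time, features)
chunks, starts = split_sequence(seq, chunk_size=10, trim_margin=2)
print(starts)  # [-2, 4, 10, 16, 17]
merged = merge_sequence(chunks, starts, full=25, chunk_size=10, trim_margin=2)
print(torch.equal(merged, seq))  # True
```

Negative starts are what let the first chunk's trimmed margin fall on padding, so every real frame is taken from a chunk position that had context on both sides.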

apply_mask

apply_mask(
    spec_for_masking: ComplexSpectrogram,
    mask_batch: ComplexSpectrogram,
    mask_add_sub_dtype: dtype | None,
    mask_out_dtype: dtype | None,
) -> SeparatedSpectrogramTensor

Applies a complex mask to a spectrogram.

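This is equivalent to viewing both tensors with torch.view_as_complex and multiplying them, but CoreML does not support that (https://github.com/apple/coremltools/issues/2003), so the product \((a + bi)(c + di) = (ac - bd) + (ad + bc)i\) is computed by hand on the trailing real/imaginary axis. The addition and subtraction run in mask_add_sub_dtype, and the stacked result is cast to mask_out_dtype.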

Source code in src/splifft/core.py

def apply_mask(
    spec_for_masking: t.ComplexSpectrogram,
    mask_batch: t.ComplexSpectrogram,
    mask_add_sub_dtype: torch.dtype | None,
    mask_out_dtype: torch.dtype | None,
) -> t.SeparatedSpectrogramTensor:
    """Applies a complex mask to a spectrogram.

    While this can be simply replaced by a complex multiplication and `torch.view_as_complex`,
    CoreML does not support it: https://github.com/apple/coremltools/issues/2003 so we handroll our
    own.
    """
    spec_real = spec_for_masking[..., 0]
    spec_imag = spec_for_masking[..., 1]
    mask_real = mask_batch[..., 0]
    mask_imag = mask_batch[..., 1]

    # see: 14385, 14401, 14392, 14408
    ac = spec_real * mask_real
    bd = spec_imag * mask_imag
    ad = spec_real * mask_imag
    bc = spec_imag * mask_real

    # see: 509, 506, 505, 504, 741, 747
    out_real = ac.to(mask_add_sub_dtype) - bd.to(mask_add_sub_dtype)
    out_imag = ad.to(mask_add_sub_dtype) + bc.to(mask_add_sub_dtype)

    # see: 503, 501
    separated_spec = torch.stack([out_real, out_imag], dim=-1).to(mask_out_dtype)
    return t.SeparatedSpectrogramTensor(separated_spec)
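The hand-rolled product can be checked against PyTorch's native complex multiplication. The sketch below (the helper `apply_mask_real` is hypothetical) keeps everything in float32 and leaves out the dtype casts controlled by mask_add_sub_dtype and mask_out_dtype.

```python
import torch


def apply_mask_real(spec: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    # spec, mask: (..., 2) with the real part at index 0 and the imaginary part at index 1
    a, b = spec[..., 0], spec[..., 1]
    c, d = mask[..., 0], mask[..., 1]
    # (a + bi)(c + di) = (ac - bd) + (ad + bc)i
    return torch.stack([a * c - b * d, a * d + b * c], dim=-1)


torch.manual_seed(0)
spec = torch.randn(2, 4, 5, 2)
mask = torch.randn(2, 4, 5, 2)
out = apply_mask_real(spec, mask)
expected = torch.view_as_real(torch.view_as_complex(spec) * torch.view_as_complex(mask))
print(torch.allclose(out, expected, atol=1e-5))  # True
```

Only real-valued multiplies, adds, and a stack are involved, which is why this formulation exports to backends without complex tensor support.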

get_model_floating_dtype

get_model_floating_dtype(model: Module) -> dtype | None

Infer the model's floating input dtype from its first parameter. Returns None if the model has no parameters or if that first parameter is not floating point.

Source code in src/splifft/core.py

def get_model_floating_dtype(model: nn.Module) -> torch.dtype | None:
    """Infer floating input dtype from the model's first floating parameter."""
    first_param = next(model.parameters(), None)
    if first_param is None:
        return None
    return first_param.dtype if first_param.is_floating_point() else None

to_model_device

to_model_device(
    tensor: Tensor,
    *,
    model_device: device,
    model_floating_dtype: dtype | None,
) -> Tensor

Move a tensor to the model device. Floating-point tensors are also cast to model_floating_dtype when it is not None; integer and boolean tensors keep their dtype.

Source code in src/splifft/core.py

def to_model_device(
    tensor: Tensor,
    *,
    model_device: torch.device,
    model_floating_dtype: torch.dtype | None,
) -> Tensor:
    """Move tensor to model device while preserving model floating dtype compatibility."""
    if model_floating_dtype is not None and tensor.is_floating_point():
        return tensor.to(device=model_device, dtype=model_floating_dtype)
    return tensor.to(device=model_device)
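The two helpers combine as follows. The sketch re-implements them under hypothetical names (`first_param_dtype`, `move`) so it runs without splifft, using a half-precision linear layer on CPU.

```python
import torch
from torch import nn


def first_param_dtype(model: nn.Module):
    # mirrors get_model_floating_dtype: only the FIRST parameter is inspected
    p = next(model.parameters(), None)
    return p.dtype if p is not None and p.is_floating_point() else None


def move(tensor: torch.Tensor, device: torch.device, floating_dtype):
    # floating tensors follow the model dtype; integer/bool tensors keep theirs
    if floating_dtype is not None and tensor.is_floating_point():
        return tensor.to(device=device, dtype=floating_dtype)
    return tensor.to(device=device)


model = nn.Linear(4, 2).to(torch.float16)
dtype = first_param_dtype(model)
cpu = torch.device("cpu")
print(dtype)  # torch.float16
print(move(torch.randn(3, 4), cpu, dtype).dtype)  # torch.float16
print(move(torch.arange(3), cpu, dtype).dtype)  # torch.int64
print(first_param_dtype(nn.ReLU()))  # None
```

Leaving non-floating tensors untouched matters for inputs such as indices or masks, which would be corrupted by a cast to float16.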

ModelWaveformToWaveform

ModelWaveformToWaveform(
    model: Module,
    preprocess: PreprocessFn,
    postprocess: PostprocessFn,
    *,
    io_device: device,
    model_device: device,
)

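Bases: Module

Wraps a model with its preprocessing (for example an STFT) and postprocessing (for example mask application and iSTFT) so that it maps waveform chunks directly to separated waveform chunks or logits. Inputs are moved to model_device and cast to the model's floating dtype; tensor outputs are moved back to io_device.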

Methods:

Name	Description
`forward`

Attributes:

Name	Type	Description
`model`
`preprocess`
`postprocess`
`io_device`
`model_device`
`model_input_dtype`

Source code in src/splifft/core.py

def __init__(
    self,
    model: nn.Module,
    preprocess: t.PreprocessFn,
    postprocess: t.PostprocessFn,
    *,
    io_device: torch.device,
    model_device: torch.device,
):
    super().__init__()
    self.model = model
    self.preprocess = preprocess
    self.postprocess = postprocess
    self.io_device = io_device
    self.model_device = model_device
    self.model_input_dtype = get_model_floating_dtype(self.model)

model `instance-attribute`

model = model

preprocess `instance-attribute`

preprocess = preprocess

postprocess `instance-attribute`

postprocess = postprocess

io_device `instance-attribute`

io_device = io_device

model_device `instance-attribute`

model_device = model_device

model_input_dtype `instance-attribute`

model_input_dtype = get_model_floating_dtype(model)

forward

forward(
    waveform_chunk: RawAudioTensor | NormalizedAudioTensor,
) -> SeparatedChunkedTensor | LogitsTensor

Source code in src/splifft/core.py

def forward(
    self, waveform_chunk: t.RawAudioTensor | t.NormalizedAudioTensor
) -> t.SeparatedChunkedTensor | t.LogitsTensor:
    model_waveform_chunk = cast(
        t.RawAudioTensor | t.NormalizedAudioTensor,
        to_model_device(
            waveform_chunk,
            model_device=self.model_device,
            model_floating_dtype=self.model_input_dtype,
        ),
    )
    preprocessed_input = self.preprocess(model_waveform_chunk)
    model_output = self.model(*preprocessed_input)
    postprocessed = self.postprocess(model_output, *preprocessed_input)
    if isinstance(postprocessed, Tensor):
        return cast(
            t.SeparatedChunkedTensor | t.LogitsTensor,
            postprocessed.to(self.io_device),
        )
    return postprocessed
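The forward contract is: preprocess returns a tuple of model inputs, the model receives them unpacked, and postprocess receives the model output followed by the same inputs. A minimal sketch, assuming a hypothetical `WaveformWrapper` class, an identity model and a toy two-stem postprocess (none of these are splifft's API), shows the shapes involved.

```python
import torch
from torch import nn


class WaveformWrapper(nn.Module):
    """Hypothetical stand-in for the preprocess -> model -> postprocess contract."""

    def __init__(self, model, preprocess, postprocess, io_device):
        super().__init__()
        self.model = model
        self.preprocess = preprocess
        self.postprocess = postprocess
        self.io_device = io_device

    def forward(self, chunk):
        inputs = self.preprocess(chunk)  # tuple of model inputs
        output = self.model(*inputs)  # inputs are unpacked into the model call
        result = self.postprocess(output, *inputs)  # postprocess also receives the inputs
        return result.to(self.io_device)


def preprocess(x):
    return (x,)


def postprocess(out, x):
    # toy two-stem "separation": half the signal and the residual, stacked on dim 1
    return torch.stack([0.5 * out, x - 0.5 * out], dim=1)


wrapper = WaveformWrapper(nn.Identity(), preprocess, postprocess, torch.device("cpu"))
batch = torch.randn(3, 2, 16)  # (batch, channels, chunk_size)
stems = wrapper(batch)
print(stems.shape)  # torch.Size([3, 2, 2, 16])
```

Passing the preprocessed inputs to postprocess is what allows a mask-predicting model to be turned back into waveforms: the postprocessor can multiply the mask with the input spectrogram it already has.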

create_w2w_model

create_w2w_model(
    model: Module,
    model_input_type: ModelInputType,
    model_output_type: ModelOutputType,
    stft_cfg: StftConfig | None,
    num_channels: Channels,
    chunk_size: ChunkSize,
    masking_cfg: MaskingConfig,
    *,
    io_device: device,
    model_device: device,
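) -> ModelWaveformToWaveform

Build a ModelWaveformToWaveform for the given input and output types. An STFT preprocessor is added when model_input_type is "spectrogram" or "waveform_and_spectrogram", and an iSTFT postprocessor when model_output_type is "spectrogram_mask" or "spectrogram"; in either case a ValueError is raised if stft_cfg is None. When model_device is a CPU, float16 STFT-convolution and masking dtypes are promoted to float32.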

Source code in src/splifft/core.py

def create_w2w_model(
    model: nn.Module,
    model_input_type: t.ModelInputType,
    model_output_type: t.ModelOutputType,
    stft_cfg: StftConfig | None,
    num_channels: t.Channels,
    chunk_size: t.ChunkSize,
    masking_cfg: MaskingConfig,
    *,
    io_device: torch.device,
    model_device: torch.device,
) -> ModelWaveformToWaveform:
    needs_stft = model_input_type == "spectrogram" or model_input_type == "waveform_and_spectrogram"
    needs_istft = model_output_type == "spectrogram_mask" or model_output_type == "spectrogram"

    if (needs_stft or needs_istft) and stft_cfg is None:
        raise ValueError(
            "expected stft config for models that operate on spectrograms, but found `None`."
        )

    def _identity_preprocess(
        chunk: t.RawAudioTensor | t.NormalizedAudioTensor,
    ) -> t.WaveformModelInput:
        return t.WaveformModelInput(chunk)

    preprocess: t.PreprocessFn = _identity_preprocess
    postprocess: t.PostprocessFn = lambda model_output, *_: model_output  # noqa: E731

    if needs_stft:
        assert stft_cfg is not None
        conv_dtype = stft_cfg.conv_dtype
        if model_device.type == "cpu" and conv_dtype == torch.float16:
            conv_dtype = torch.float32

        stft_module = Stft(
            n_fft=stft_cfg.n_fft,
            hop_length=stft_cfg.hop_length,
            win_length=stft_cfg.win_length,
            window_fn=lambda win_len: _get_window_fn(stft_cfg.window_shape, win_len, model_device),
            conv_dtype=conv_dtype,
        ).to(model_device)
        if model_input_type == "spectrogram":
            preprocess = _create_stft_preprocessor(stft_module)
        elif model_input_type == "waveform_and_spectrogram":
            preprocess = _create_hybrid_preprocessor(stft_module)
        else:
            raise NotImplementedError(f"unsupported input type for stft: {model_input_type}")

    if needs_istft:
        assert stft_cfg is not None
        istft_module = IStft(
            n_fft=stft_cfg.n_fft,
            hop_length=stft_cfg.hop_length,
            win_length=stft_cfg.win_length,
            window_fn=lambda win_len: _get_window_fn(stft_cfg.window_shape, win_len, model_device),
        ).to(model_device)

        add_sub_dtype = masking_cfg.add_sub_dtype
        out_dtype = masking_cfg.out_dtype
        if model_device.type == "cpu":
            if add_sub_dtype == torch.float16:
                add_sub_dtype = torch.float32
            if out_dtype == torch.float16:
                out_dtype = torch.float32

        postprocess = _create_spec_postprocessor(
            istft_module,
            num_channels,
            chunk_size,
            add_sub_dtype,
            out_dtype,
            model_output_type,  # type: ignore
        )
    return ModelWaveformToWaveform(
        model,
        preprocess,
        postprocess,
        io_device=io_device,
        model_device=model_device,
    )
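Across the branches above, a `torch.float16` request is promoted to `torch.float32` when the model runs on CPU. This applies to the STFT convolution (`conv_dtype`) and to the iSTFT/masking arithmetic (`add_sub_dtype`, `out_dtype`). The likely reason is that half-precision CPU kernels are limited or slow in many PyTorch builds. Below is a minimal sketch of that rule as a standalone helper. `cpu_safe_dtype` is a hypothetical name, not part of splifft.

```python
import torch


def cpu_safe_dtype(device: torch.device, dtype: torch.dtype) -> torch.dtype:
    """Hypothetical helper: promote float16 to float32 on CPU, keep everything else."""
    if device.type == "cpu" and dtype == torch.float16:
        return torch.float32
    return dtype


cpu = torch.device("cpu")
print(cpu_safe_dtype(cpu, torch.float16))   # torch.float32
print(cpu_safe_dtype(cpu, torch.bfloat16))  # torch.bfloat16 (only float16 is promoted)
print(cpu_safe_dtype(torch.device("cuda"), torch.float16))  # torch.float16
```

Only `float16` is promoted, matching the comparisons in the source. `bfloat16` and non-CPU devices pass through unchanged.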

LogMelSpect

LogMelSpect(
    sample_rate: int,
    n_fft: int,
    hop_length: int,
    n_mels: int,
    f_min: float = 0.0,
    f_max: float | None = None,
    mel_scale: Literal["htk", "slaney"] = "slaney",
    normalized: bool | str = "frame_length",
    power: float = 1.0,
    log_multiplier: float = 1000.0,
)

Bases: Module

Computes the log-mel spectrogram of a waveform.

Methods:

Name	Description
`forward`	Compute the log-mel spectrogram of a waveform of shape (batch, channels, time) or (batch, time).

Attributes:

Name	Type	Description
`spect_class`
`log_multiplier`

Source code in src/splifft/core.py

def __init__(
    self,
    sample_rate: int,
    n_fft: int,
    hop_length: int,
    n_mels: int,
    f_min: float = 0.0,
    f_max: float | None = None,
    mel_scale: Literal["htk", "slaney"] = "slaney",
    normalized: bool | str = "frame_length",
    power: float = 1.0,
    log_multiplier: float = 1000.0,
):
    super().__init__()
    import torchaudio.transforms as T

    self.spect_class = T.MelSpectrogram(
        sample_rate=sample_rate,
        n_fft=n_fft,
        hop_length=hop_length,
        f_min=f_min,
        f_max=f_max,
        n_mels=n_mels,
        mel_scale=mel_scale,
        normalized=cast(Any, normalized),
        power=power,
    )
    self.log_multiplier = log_multiplier

spect_class `instance-attribute`

spect_class = MelSpectrogram(
    sample_rate=sample_rate,
    n_fft=n_fft,
    hop_length=hop_length,
    f_min=f_min,
    f_max=f_max,
    n_mels=n_mels,
    mel_scale=mel_scale,
    normalized=cast(Any, normalized),
    power=power,
)

log_multiplier `instance-attribute`

log_multiplier = log_multiplier

forward

forward(x: Tensor) -> LogMelSpectrogram

Parameters:

Name	Type	Description	Default
`x`	`Tensor`	Waveform tensor of shape (batch, channels, time) or (batch, time)	required

Returns:

Type	Description
`LogMelSpectrogram`	Log-Mel spectrogram of shape (batch, channels, n_mels, time)

Source code in src/splifft/core.py

def forward(self, x: Tensor) -> t.LogMelSpectrogram:
    """
    :param x: Waveform tensor of shape (batch, channels, time) or (batch, time)
    :return: Log-Mel spectrogram of shape (batch, channels, n_mels, time)
    """
    if x.ndim == 2:
        x = x.unsqueeze(1)
    mel_spec = self.spect_class(x)
    return torch.log1p(self.log_multiplier * mel_spec)  # type: ignore
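`forward` compresses the mel energies with `log1p(log_multiplier * mel)`. Zero energy maps to exactly 0, and the result stays finite, unlike a plain logarithm. The torch-only sketch below shows that step. A hand-made tensor stands in for the `MelSpectrogram` output so torchaudio is not required. It also shows the 2-D input promotion that `forward` performs.

```python
import torch

log_multiplier = 1000.0  # LogMelSpect default
# Stand-in for MelSpectrogram output: (batch=1, channels=1, n_mels=1, frames=4)
mel_spec = torch.tensor([[[[0.0, 1e-3, 1e-1, 1.0]]]])

log_mel = torch.log1p(log_multiplier * mel_spec)
print(log_mel)  # ~ [0.0, 0.693, 4.615, 6.909]: zero energy stays at 0, larger energies grow logarithmically

# A (batch, time) waveform is promoted to (batch, 1, time) before the mel transform
waveform = torch.zeros(2, 16000)
print(waveform.unsqueeze(1).shape)  # torch.Size([2, 1, 16000])
```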

to_log_magnitude

to_log_magnitude(
    x: Tensor, *, epsilon: float = 1e-08
) -> Tensor

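Convert complex or real spectrogram-like tensors to dB log-magnitude, `20 * log10(max(|x|, epsilon))`. A trailing dimension of size 2 is interpreted as (real, imaginary) pairs.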

Source code in src/splifft/core.py

def to_log_magnitude(x: Tensor, *, epsilon: float = 1e-8) -> Tensor:
    """Convert complex or real spectrogram-like tensors to dB log-magnitude."""
    if x.shape[-1] == 2:
        x = torch.sqrt(x[..., 0] ** 2 + x[..., 1] ** 2)
    else:
        x = x.abs()
    return x.clamp_min(epsilon).log10().mul(20)
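The branch is chosen by `x.shape[-1] == 2`. As a result, a complex tensor and its `torch.view_as_real` view give the same result. However, any tensor whose last dimension happens to have size 2 is also routed to the pair branch, for example a complex spectrogram with exactly two frames. The sketch below re-states the formula locally as `log_magnitude_db` (a hypothetical name) so it runs without splifft installed.

```python
import torch


def log_magnitude_db(x: torch.Tensor, epsilon: float = 1e-8) -> torch.Tensor:
    # Local re-statement of to_log_magnitude: trailing size-2 dim = (real, imag)
    if x.shape[-1] == 2:
        x = torch.sqrt(x[..., 0] ** 2 + x[..., 1] ** 2)
    else:
        x = x.abs()
    return x.clamp_min(epsilon).log10().mul(20)


z = torch.tensor([3 + 4j, 1 + 0j, 0j], dtype=torch.complex64)  # last dim is 3, not 2
from_complex = log_magnitude_db(z)                    # uses |z| directly
from_pairs = log_magnitude_db(torch.view_as_real(z))  # shape (3, 2) -> pair branch
print(from_complex)  # |3+4j| = 5 -> ~13.98 dB, 1 -> 0 dB, 0 -> floor 20*log10(1e-8) = -160 dB
```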

to_log_power

to_log_power(x: Tensor, *, epsilon: Gt0[float]) -> Tensor

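Convert complex or real spectrogram-like tensors to dB log-power, `10 * log10(|x|^2 + epsilon)`. A trailing dimension of size 2 is interpreted as (real, imaginary) pairs.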

Source code in src/splifft/core.py

def to_log_power(x: Tensor, *, epsilon: t.Gt0[float]) -> Tensor:
    """Convert complex or real spectrogram-like tensors to dB log-power."""
    if x.shape[-1] == 2:
        x = x[..., 0] ** 2 + x[..., 1] ** 2
    else:
        x = x.abs() ** 2
    return x.add(epsilon).log10().mul(10)
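Since `10 * log10(|x|^2) = 20 * log10(|x|)`, `to_log_power` and `to_log_magnitude` agree wherever `|x|^2` is much larger than `epsilon`. They differ only in how silence is floored:

- `to_log_power` adds `epsilon` to the power, so a zero bin maps to `10 * log10(epsilon)`.
- `to_log_magnitude` clamps the magnitude, so a zero bin maps to `20 * log10(epsilon)`.

The numeric check below re-states both formulas locally.

```python
import torch


def log_power_db(x: torch.Tensor, epsilon: float) -> torch.Tensor:
    # Local re-statement of to_log_power
    if x.shape[-1] == 2:
        x = x[..., 0] ** 2 + x[..., 1] ** 2
    else:
        x = x.abs() ** 2
    return x.add(epsilon).log10().mul(10)


mag = torch.tensor([5.0, 1.0, 0.0])
power_db = log_power_db(mag, epsilon=1e-10)
magnitude_db = mag.clamp_min(1e-8).log10().mul(20)  # to_log_magnitude with its default epsilon

print(power_db)      # loud bins match magnitude_db; silent bin floored at 10*log10(1e-10) = -100 dB
print(magnitude_db)  # silent bin floored at 20*log10(1e-8) = -160 dB
```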

SequenceFeatureExtractor

Bases: Protocol

Protocol for sequence feature extractors.

Required contract:

- input: (B, C, T)
- output: (B, seq_len, feature_dim)

Methods:

Name	Description
`__call__`

Attributes:

Name	Type	Description
`hop_length_samples`	`int`
`stage_name`	`str`

hop_length_samples `instance-attribute`

hop_length_samples: int

stage_name `instance-attribute`

stage_name: str

__call__

__call__(x: Tensor) -> Tensor

Source code in src/splifft/core.py

def __call__(self, x: Tensor) -> Tensor: ...
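`SequenceFeatureExtractor` is a `Protocol`, so any object fits without subclassing if it has:

- `hop_length_samples`
- `stage_name`
- a `__call__` mapping (B, C, T) to (B, seq_len, feature_dim)

Below is a minimal sketch of a custom extractor. `FrameRmsExtractor` and its non-overlapping framing are hypothetical and only illustrate the contract. The protocol is mirrored locally as `SequenceFeatureExtractorLike` so the snippet runs on its own.

```python
from typing import Protocol

import torch
from torch import Tensor, nn


class SequenceFeatureExtractorLike(Protocol):
    # Local mirror of the documented protocol
    hop_length_samples: int
    stage_name: str

    def __call__(self, x: Tensor) -> Tensor: ...


class FrameRmsExtractor(nn.Module):
    """Hypothetical extractor: RMS energy of non-overlapping frames, one feature per channel."""

    stage_name = "frame_rms"

    def __init__(self, hop_length_samples: int):
        super().__init__()
        self.hop_length_samples = hop_length_samples

    def forward(self, x: Tensor) -> Tensor:
        if x.ndim != 3:
            raise ValueError(f"expected shape (B,C,T), got {tuple(x.shape)}")
        b, c, t_len = x.shape
        n_frames = t_len // self.hop_length_samples  # trailing partial frame is dropped
        frames = x[..., : n_frames * self.hop_length_samples]
        frames = frames.reshape(b, c, n_frames, self.hop_length_samples)
        rms = frames.pow(2).mean(dim=-1).sqrt()  # (B, C, n_frames)
        return rms.transpose(1, 2)  # (B, seq_len=n_frames, feature_dim=C)


extractor: SequenceFeatureExtractorLike = FrameRmsExtractor(hop_length_samples=4)
features = extractor(torch.ones(2, 1, 10))
print(features.shape)  # torch.Size([2, 2, 1])
```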

IdentitySequenceFeatureExtractor

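Bases: Module, SequenceFeatureExtractor

Passes a mono waveform through as a sequence: (B, 1, T) becomes (B, T, 1), one sequence step per sample.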

Methods:

Name	Description
`forward`

Attributes:

Name	Type	Description
`hop_length_samples`
`stage_name`

hop_length_samples `class-attribute` `instance-attribute`

hop_length_samples = 1

stage_name `class-attribute` `instance-attribute`

stage_name = 'sequence_features'

forward

forward(x: Tensor) -> Tensor

Source code in src/splifft/core.py

def forward(self, x: Tensor) -> Tensor:
    if x.ndim != 3:
        raise ValueError(f"expected shape (B,C,T), got {tuple(x.shape)}")
    if x.shape[1] != 1:
        raise ValueError(
            f"identity sequence extractor expects mono input with C=1, got shape={tuple(x.shape)}"
        )
    return _ensure_btf(x.transpose(1, 2), source="identity extractor")
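The identity extractor turns each waveform sample into one sequence step with a single feature, which is why its `hop_length_samples` is 1. The sketch below shows the reshaping it performs before `_ensure_btf`. That helper's exact checks are not shown in this section.

```python
import torch

x = torch.arange(6.0).reshape(2, 1, 3)  # (B=2, C=1, T=3) mono waveforms
seq = x.transpose(1, 2)                # (B, seq_len=T, feature_dim=1)
print(seq.shape)     # torch.Size([2, 3, 1])
print(seq[0, :, 0])  # tensor([0., 1., 2.]): the samples of the first waveform, in order
```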

LogMelSequenceFeatureExtractor

LogMelSequenceFeatureExtractor(
    mel: LogMelSpect, *, hop_length_samples: int
)

Bases: Module, SequenceFeatureExtractor

Wraps a `LogMelSpect` and returns the log-mel frames of a mono waveform as (B, frames, n_mels).

Methods:

Name	Description
`forward`

Attributes:

Name	Type	Description
`stage_name`
`mel`
`hop_length_samples`

Source code in src/splifft/core.py

def __init__(self, mel: LogMelSpect, *, hop_length_samples: int):
    super().__init__()
    self.mel = mel
    self.hop_length_samples = hop_length_samples

stage_name `class-attribute` `instance-attribute`

stage_name = 'mel'

mel `instance-attribute`

mel = mel

hop_length_samples `instance-attribute`

hop_length_samples = hop_length_samples

forward

forward(x: Tensor) -> Tensor

Source code in src/splifft/core.py

def forward(self, x: Tensor) -> Tensor:
    if x.ndim != 3:
        raise ValueError(f"expected shape (B,C,T), got {tuple(x.shape)}")
    if x.shape[1] != 1:
        raise ValueError(
            f"mel extractor expects mono input with C=1, got shape={tuple(x.shape)}"
        )
    x_mono = x[:, :1]
    mel = self.mel(x_mono).squeeze(1)
    return _ensure_btf(mel.transpose(1, 2), source="mel extractor")
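The mel extractor's shape bookkeeping is traced below. A random tensor stands in for the `LogMelSpect` output, so neither torchaudio nor a real waveform is needed. The frame count of 7 is arbitrary.

```python
import torch

b, n_mels, frames = 2, 80, 7
mel = torch.randn(b, 1, n_mels, frames)  # stand-in for self.mel(x_mono): (B, 1, n_mels, frames)
mel = mel.squeeze(1)                     # (B, n_mels, frames)
seq = mel.transpose(1, 2)                # (B, seq_len=frames, feature_dim=n_mels)
print(seq.shape)  # torch.Size([2, 7, 80])
```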

HcqtSequenceFeatureExtractor

HcqtSequenceFeatureExtractor(
    hcqt: Module,
    *,
    hop_length_samples: HopSize,
    log_epsilon: Gt0[float],
    power_epsilon: Ge0[float] | None,
)

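Bases: Module, SequenceFeatureExtractor

Applies a harmonic CQT module to a mono waveform and converts the result to dB. It uses log-magnitude with `log_epsilon` when `power_epsilon` is None, otherwise log-power with `power_epsilon`. Harmonics and bins are then flattened into a (B, frames, harmonics * bins) sequence.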

Methods:

Name	Description
`forward`

Attributes:

Name	Type	Description
`stage_name`
`hcqt`
`hop_length_samples`
`log_epsilon`
`power_epsilon`

Source code in src/splifft/core.py

def __init__(
    self,
    hcqt: nn.Module,
    *,
    hop_length_samples: t.HopSize,
    log_epsilon: t.Gt0[float],
    power_epsilon: t.Ge0[float] | None,
):
    super().__init__()
    self.hcqt = hcqt
    self.hop_length_samples = hop_length_samples
    self.log_epsilon = log_epsilon
    self.power_epsilon = power_epsilon

stage_name `class-attribute` `instance-attribute`

stage_name = 'hcqt'

hcqt `instance-attribute`

hcqt = hcqt

hop_length_samples `instance-attribute`

hop_length_samples = hop_length_samples

log_epsilon `instance-attribute`

log_epsilon = log_epsilon

power_epsilon `instance-attribute`

power_epsilon = power_epsilon

forward

forward(x: Tensor) -> Tensor

Source code in src/splifft/core.py

def forward(self, x: Tensor) -> Tensor:
    if x.ndim != 3:
        raise ValueError(f"expected shape (B,C,T), got {tuple(x.shape)}")
    if x.shape[1] != 1:
        raise ValueError(
            f"cqt extractor expects mono input with C=1, got shape={tuple(x.shape)}"
        )
    x_mono = x[:, 0]
    hcqt_output = self.hcqt(x_mono)
    if self.power_epsilon is None:
        cqt = to_log_magnitude(hcqt_output, epsilon=self.log_epsilon)
    else:
        cqt = to_log_power(hcqt_output, epsilon=self.power_epsilon)
    cqt = cqt.permute(0, 3, 1, 2)
    b, t_len, harmonics, bins = cqt.shape
    return _ensure_btf(cqt.reshape(b, t_len, harmonics * bins), source="cqt extractor")
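The sketch below shows how the HCQT output is flattened into per-frame feature vectors. It assumes `self.hcqt` returns (B, harmonics, bins, frames, 2) with (real, imag) pairs in the last dimension. That layout is inferred from the `permute(0, 3, 1, 2)` and the shape unpacking after the dB conversion; it is not confirmed by the source. A complex (B, harmonics, bins, frames) output would take the `.abs()` branch and end in the same shape.

```python
import torch

b, harmonics, bins, frames = 2, 3, 4, 5
# Assumed HCQT output layout: (B, harmonics, bins, frames, 2), last dim = (real, imag)
hcqt_output = torch.randn(b, harmonics, bins, frames, 2)

# to_log_magnitude's pair branch, then the dB conversion
mag = torch.sqrt(hcqt_output[..., 0] ** 2 + hcqt_output[..., 1] ** 2)
log_mag = mag.clamp_min(1e-8).log10().mul(20)  # (B, harmonics, bins, frames)

seq = log_mag.permute(0, 3, 1, 2)               # (B, frames, harmonics, bins)
features = seq.reshape(b, frames, harmonics * bins)
print(features.shape)  # torch.Size([2, 5, 12])

# Column h * bins + k of each frame holds harmonic h, CQT bin k
h, k, t = 2, 1, 4
print(torch.equal(features[:, t, h * bins + k], log_mag[:, h, k, t]))  # True
```

Because harmonics precede bins in the permuted layout, each feature vector is harmonic-major: all bins of harmonic 0, then all bins of harmonic 1, and so on.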

create_sequence_feature_extractor

create_sequence_feature_extractor(
    feature_cfg: FeatureExtractionConfig | None,
    *,
    sample_rate: SampleRate,
    device: device,
) -> SequenceFeatureExtractor

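Build the sequence feature extractor described by `feature_cfg`, with any transform modules placed on `device`:

- the identity extractor when `feature_cfg` is None
- a log-mel extractor for `mel`
- a harmonic CQT extractor for `hcqt_regular` and `hcqt_recursive`

The `stft` kind raises NotImplementedError.

Source code in src/splifft/core.py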

def create_sequence_feature_extractor(
    feature_cfg: FeatureExtractionConfig | None,
    *,
    sample_rate: t.SampleRate,
    device: torch.device,
) -> SequenceFeatureExtractor:
    if feature_cfg is None:
        return IdentitySequenceFeatureExtractor()

    if feature_cfg.kind == "mel":
        mel = LogMelSpect(
            sample_rate=feature_cfg.sample_rate,
            n_fft=feature_cfg.n_fft,
            hop_length=feature_cfg.hop_length,
            n_mels=feature_cfg.n_mels,
            f_min=feature_cfg.f_min,
            f_max=feature_cfg.f_max,
            mel_scale=feature_cfg.mel_scale,
            normalized=feature_cfg.normalized,
            power=feature_cfg.power,
            log_multiplier=feature_cfg.log_multiplier,
        ).to(device)
        return LogMelSequenceFeatureExtractor(mel=mel, hop_length_samples=feature_cfg.hop_length)

    if feature_cfg.kind == "hcqt_regular":
        hop_length_samples = _hop_size_ms_to_samples(feature_cfg.hop_size_ms, sample_rate)
        hcqt = HarmonicRegularCQT(
            sr=sample_rate,
            hop_length=hop_length_samples,
            harmonics=feature_cfg.harmonics,
            fmin=feature_cfg.fmin,
            bins_per_semitone=feature_cfg.bins_per_semitone,
            n_bins=feature_cfg.n_bins,
            center_bins=feature_cfg.center_bins,
            gamma=feature_cfg.gamma,
            center=feature_cfg.center,
            filter_scale=feature_cfg.filter_scale,
            window=feature_cfg.window,
        ).to(device)

        return HcqtSequenceFeatureExtractor(
            hcqt=hcqt,
            hop_length_samples=hop_length_samples,
            log_epsilon=feature_cfg.log_epsilon,
            power_epsilon=feature_cfg.power_epsilon,
        )

    if feature_cfg.kind == "hcqt_recursive":
        hop_length_samples = _hop_size_ms_to_samples(feature_cfg.hop_size_ms, sample_rate)
        hcqt = HarmonicRecursiveCQT(
            sr=sample_rate,
            hop_length=hop_length_samples,
            harmonics=feature_cfg.harmonics,
            fmin=feature_cfg.fmin,
            bins_per_semitone=feature_cfg.bins_per_semitone,
            n_bins=feature_cfg.n_bins,
            center_bins=feature_cfg.center_bins,
            center=feature_cfg.center,
        ).to(device)

        return HcqtSequenceFeatureExtractor(
            hcqt=hcqt,
            hop_length_samples=hop_length_samples,
            log_epsilon=feature_cfg.log_epsilon,
            power_epsilon=feature_cfg.power_epsilon,
        )

    if feature_cfg.kind == "stft":
        raise NotImplementedError(
            "sequence feature extraction does not support `stft`; use a mel or CQT transform"
        )

    assert_never(feature_cfg.kind)
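Every extractor built here reports `hop_length_samples`, so a consumer can map sequence indices back to time. The sketch below assumes step `i` is anchored at sample `i * hop_length_samples`. That assumption fits centered STFT-style framing; the exact alignment of each backend is not specified in this section. `frame_times_seconds` is a hypothetical helper. Note that the mel branch builds `LogMelSpect` with `feature_cfg.sample_rate`, so that is the rate to pair with its hop.

```python
def frame_times_seconds(seq_len: int, hop_length_samples: int, sample_rate: int) -> list[float]:
    """Hypothetical helper: time in seconds of each sequence step's anchor sample."""
    return [i * hop_length_samples / sample_rate for i in range(seq_len)]


# e.g. a hop of 441 samples at 44.1 kHz is 10 ms per sequence step
times = frame_times_seconds(seq_len=4, hop_length_samples=441, sample_rate=44100)
print(times)  # [0.0, 0.01, 0.02, 0.03]
```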

derive_stems

derive_stems(
    separated_stems: Mapping[
        ModelOutputStemName, RawAudioTensor
    ],
    mixture_input: RawAudioTensor,
    stem_rules: DerivedStemsConfig,
) -> dict[StemName, RawAudioTensor]

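Derive additional stems from the separated stems and the mixture by applying each `subtract` or `sum` rule in `stem_rules`. It is the caller's responsibility to ensure that all tensors are aligned and have the same shape.

Note

The mixture input and separated stems must first be denormalized (see `denormalize_audio`).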

Source code in src/splifft/core.py

def derive_stems(
    separated_stems: Mapping[t.ModelOutputStemName, t.RawAudioTensor],
    mixture_input: t.RawAudioTensor,
    stem_rules: DerivedStemsConfig,
) -> dict[StemName, t.RawAudioTensor]:
    """
    It is the caller's responsibility to ensure that all tensors are aligned and have the same shape.

    !!! note
        Mixture input and separated stems must first be [denormalized][splifft.core.denormalize_audio].
    """
    stems = {
        "mixture": t.RawAudioTensor(mixture_input),  # for subtraction
        **separated_stems,
    }

    for derived_name, rule in stem_rules.items():
        if rule.operation == "subtract":
            # pydantic should have already validated that the stem names exist so safe to index directly
            minuend = stems[rule.stem_name]
            subtrahend = stems[rule.by_stem_name]
            stems[derived_name] = t.RawAudioTensor(minuend - subtrahend)
        elif rule.operation == "sum":
            to_sum = tuple(stems[s] for s in rule.stem_names)
            stems[derived_name] = t.RawAudioTensor(torch.stack(to_sum).sum(dim=0))
        else:
            assert_never(rule)

    stems.pop("mixture", None)
    return stems
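Rules are applied in insertion order, and each result is written back into the working dict. A later rule can therefore build on a stem derived by an earlier one. `"mixture"` is available as an operand but is dropped from the result. The sketch below shows that evaluation order. `SubtractRule`, `SumRule` and `apply_rules` are hypothetical stand-ins, not splifft's `DerivedStemsConfig`. They only mirror the fields `derive_stems` reads: `operation`, `stem_name`, `by_stem_name` and `stem_names`.

```python
from dataclasses import dataclass

import torch
from torch import Tensor


@dataclass
class SubtractRule:  # hypothetical stand-in
    stem_name: str
    by_stem_name: str
    operation: str = "subtract"


@dataclass
class SumRule:  # hypothetical stand-in
    stem_names: tuple[str, ...]
    operation: str = "sum"


def apply_rules(separated: dict[str, Tensor], mixture: Tensor, rules: dict) -> dict[str, Tensor]:
    stems = {"mixture": mixture, **separated}  # mixture is only an operand
    for name, rule in rules.items():  # dicts preserve insertion order
        if rule.operation == "subtract":
            stems[name] = stems[rule.stem_name] - stems[rule.by_stem_name]
        elif rule.operation == "sum":
            stems[name] = torch.stack([stems[s] for s in rule.stem_names]).sum(dim=0)
    stems.pop("mixture", None)
    return stems


mixture = torch.tensor([[1.0, 2.0, 3.0]])  # (channels, samples), already denormalized
separated = {
    "vocals": torch.tensor([[0.5, 0.5, 0.5]]),
    "drums": torch.tensor([[0.25, 0.25, 0.25]]),
}
rules = {
    "instrumental": SubtractRule("mixture", "vocals"),
    "other": SubtractRule("instrumental", "drums"),  # uses the stem derived just above
    "vocals_and_drums": SumRule(("vocals", "drums")),
}
out = apply_rules(separated, mixture, rules)
print(sorted(out))  # ['drums', 'instrumental', 'other', 'vocals', 'vocals_and_drums']
```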

str_to_torch_dtype

str_to_torch_dtype(value: Any) -> dtype

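Resolve a dtype name such as `"float16"` to the matching `torch.dtype`. Raises TypeError if `value` is not a string or names something other than a dtype, and ValueError if no such name exists under `torch`.

Source code in src/splifft/core.py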

def str_to_torch_dtype(value: Any) -> torch.dtype:
    if not isinstance(value, str):
        raise TypeError(f"expected dtype to be a string, got {value} (type {type(value)})")
    try:
        dtype = getattr(torch, value)
    except AttributeError:
        raise ValueError(f"`{value}` cannot be found under the `torch` namespace")
    if not isinstance(dtype, torch.dtype):
        raise TypeError(f"expected {dtype} to be a dtype but it is a {type(dtype)}")
    return dtype
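`str_to_torch_dtype` looks names up with `getattr(torch, value)`, so any attribute of the `torch` module is a candidate. Aliases such as `"half"` therefore resolve as well as canonical names. Names that exist but are not dtypes, such as `"nn"`, fail the `isinstance(..., torch.dtype)` check. The loop below reproduces that lookup directly to show which inputs would be accepted.

```python
import torch

for name in ("float16", "half", "bfloat16", "nn", "not_a_dtype"):
    obj = getattr(torch, name, None)  # None stands in for the AttributeError -> ValueError path
    print(f"{name!r}: {'dtype' if isinstance(obj, torch.dtype) else 'rejected'}")
```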
