Models#

autrainer provides a number of audio-specific models as well as wrappers for torchvision and timm models.

Tip

To create custom models, refer to the custom models tutorial.

Default configurations whose id ends in -T indicate that the model uses transfer learning with pretrained weights. To avoid race conditions when using Launcher Plugins that may run multiple training jobs in parallel, the autrainer fetch CLI command or the fetch() wrapper function should be used to download the pretrained weights before training.

Note

Most models are pretrained on the ImageNet or AudioSet datasets. To ensure compatibility with any number of output dimensions, the last linear layer of the model is replaced by a new linear layer with the correct number of output dimensions, which is therefore not pretrained.

The weights for all pretrained models that are provided by autrainer can be automatically downloaded using the autrainer fetch CLI command or the fetch() CLI wrapper function.
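
For example, the following minimal sketch downloads all pretrained weights referenced by the current configuration before any runs are launched; it assumes the wrapper is exposed as autrainer.cli.fetch() and that a conf/config.yaml exists in the working directory:

import autrainer.cli

# Download pretrained model weights (and datasets) referenced in conf/config.yaml
# before starting parallel training jobs, so no two jobs race on the same files.
autrainer.cli.fetch()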

To optionally load model, optimizer, or scheduler checkpoints, the following attributes can be set in any model configuration file (see the hypothetical example after this list):

  • model_checkpoint: The path to the model checkpoint file. Defaults to None.

  • optimizer_checkpoint: The path to the optimizer checkpoint file. Defaults to None.

  • scheduler_checkpoint: The path to the scheduler checkpoint file. Defaults to None.

  • skip_last_layer: Whether to skip loading the state of the last linear or convolutional layer. When set to True, the state of the last layer (if present) is omitted from both the model and optimizer, allowing for training on a different target dataset with varying output dimensions. Defaults to True.
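
For instance, a model configuration might resume from existing checkpoints. The following sketch is hypothetical; the id and the checkpoint paths are placeholders, and the actual checkpoint file names depend on how the checkpoints were saved:

conf/model/ToyFFNN-Checkpoint.yaml# (hypothetical)
id: ToyFFNN-Checkpoint
_target_: autrainer.models.FFNN
input_size: 64
hidden_size: 64
model_checkpoint: /path/to/model.pt
optimizer_checkpoint: /path/to/optimizer.pt
scheduler_checkpoint: null
skip_last_layer: true

transform:
  type: tabular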

Note

Loading a checkpoint assumes that the model architecture is the same as the one used to create the checkpoint and that the last layer of the model is specified as the final Linear or _ConvNd module. If the last layer is not the final layer in the module order, it may not be correctly identified for skipping.

Abstract Model#

All models inherit from the AbstractModel class and implement the forward() and embeddings() methods.

class autrainer.models.AbstractModel(output_dim)[source]#

Abstract model class.

Parameters:

output_dim (int) – Output dimension of the model.

abstract forward(x)[source]#

Forward pass of the model.

Parameters:

x (Tensor) – Input tensor.

Return type:

Tensor

Returns:

Output tensor.

abstract embeddings(x)[source]#

Get embeddings from the model.

Parameters:

x (Tensor) – Input tensor.

Return type:

Tensor

Returns:

Embeddings.
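
For illustration, a minimal sketch of a custom model implementing both methods is shown below. The class and its layers are hypothetical; refer to the custom models tutorial for the full requirements:

import torch

from autrainer.models import AbstractModel


class ToyLinearModel(AbstractModel):  # hypothetical example model
    def __init__(self, output_dim: int, input_size: int = 64, hidden_size: int = 32) -> None:
        super().__init__(output_dim)
        self.encoder = torch.nn.Sequential(
            torch.nn.Linear(input_size, hidden_size),
            torch.nn.ReLU(),
        )
        self.classifier = torch.nn.Linear(hidden_size, output_dim)

    def embeddings(self, x: torch.Tensor) -> torch.Tensor:
        # hidden representation before the final classification layer
        return self.encoder(x)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.embeddings(x))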

Model Wrappers#

For convenience, we provide wrappers for torchvision and timm models.

class autrainer.models.TorchvisionModel(output_dim, torchvision_name, transfer=False, **kwargs)[source]#

Wrapper for torchvision models.

Parameters:
  • output_dim (int) – Number of output classes.

  • torchvision_name (str) – Name of the model available in torchvision.models.

  • transfer (bool) – Whether to load the model with pretrained weights. The “DEFAULT” weights are used if transfer is True. The final layer is replaced with a new layer with output_dim output features. Defaults to False.

  • kwargs – Additional arguments to pass to the model constructor.
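
As a short sketch, the wrapper can also be instantiated directly with the documented signature; the input shape below assumes a standard 3-channel image model:

import torch

from autrainer.models import TorchvisionModel

# randomly initialized ResNet-18 with a 10-class output layer
model = TorchvisionModel(output_dim=10, torchvision_name="resnet18", transfer=False)
logits = model(torch.randn(1, 3, 224, 224))  # shape: (1, 10)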

Default Configurations

autrainer provides default configurations for all torchvision classification models. For more information on the available torchvision models as well as their parameters, refer to the torchvision classification models documentation.

By default, models using transfer learning (indicated by a trailing -T in the model name) use the default pretrained weights provided by torchvision.
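
Additional keyword arguments in a configuration are passed through to the underlying torchvision constructor. As a hypothetical example, the following custom configuration forwards torchvision's dropout argument to alexnet:

conf/model/AlexNet-Dropout.yaml# (hypothetical)
id: AlexNet-Dropout
_target_: autrainer.models.TorchvisionModel
torchvision_name: alexnet
transfer: false
dropout: 0.6

transform:
  type: image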

AlexNet

conf/model/AlexNet-T.yaml#
id: AlexNet-T
_target_: autrainer.models.TorchvisionModel
torchvision_name: alexnet
transfer: true

transform:
  type: image
conf/model/AlexNet.yaml#
id: AlexNet
_target_: autrainer.models.TorchvisionModel
torchvision_name: alexnet
transfer: false

transform:
  type: image

ConvNeXt

conf/model/ConvNeXt-Base-T.yaml#
id: ConvNeXt-Base-T
_target_: autrainer.models.TorchvisionModel
torchvision_name: convnext_base
transfer: true

transform:
  type: image
conf/model/ConvNeXt-Base.yaml#
id: ConvNeXt-Base
_target_: autrainer.models.TorchvisionModel
torchvision_name: convnext_base
transfer: false

transform:
  type: image
conf/model/ConvNeXt-Large-T.yaml#
id: ConvNeXt-Large-T
_target_: autrainer.models.TorchvisionModel
torchvision_name: convnext_large
transfer: true

transform:
  type: image
conf/model/ConvNeXt-Large.yaml#
id: ConvNeXt-Large
_target_: autrainer.models.TorchvisionModel
torchvision_name: convnext_large
transfer: false

transform:
  type: image
conf/model/ConvNeXt-Small-T.yaml#
id: ConvNeXt-Small-T
_target_: autrainer.models.TorchvisionModel
torchvision_name: convnext_small
transfer: true

transform:
  type: image
conf/model/ConvNeXt-Small.yaml#
id: ConvNeXt-Small
_target_: autrainer.models.TorchvisionModel
torchvision_name: convnext_small
transfer: false

transform:
  type: image
conf/model/ConvNeXt-Tiny-T.yaml#
id: ConvNeXt-Tiny-T
_target_: autrainer.models.TorchvisionModel
torchvision_name: convnext_tiny
transfer: true

transform:
  type: image
conf/model/ConvNeXt-Tiny.yaml#
id: ConvNeXt-Tiny
_target_: autrainer.models.TorchvisionModel
torchvision_name: convnext_tiny
transfer: false

transform:
  type: image

Densenet

conf/model/Densenet-121-T.yaml#
id: Densenet-121-T
_target_: autrainer.models.TorchvisionModel
torchvision_name: densenet121
transfer: true
transform:
  type: image
conf/model/Densenet-121.yaml#
id: Densenet-121
_target_: autrainer.models.TorchvisionModel
torchvision_name: densenet121
transfer: false
transform:
  type: image

EfficientNet

conf/model/EfficientNet-B0-T.yaml#
id: EfficientNet-B0-T
_target_: autrainer.models.TorchvisionModel
torchvision_name: efficientnet_b0
transfer: true
transform:
  type: image
conf/model/EfficientNet-B0.yaml#
id: EfficientNet-B0
_target_: autrainer.models.TorchvisionModel
torchvision_name: efficientnet_b0
transfer: false
transform:
  type: image
conf/model/EfficientNet-B1-T.yaml#
id: EfficientNet-B1-T
_target_: autrainer.models.TorchvisionModel
torchvision_name: efficientnet_b1
transfer: true
transform:
  type: image
conf/model/EfficientNet-B1.yaml#
id: EfficientNet-B1
_target_: autrainer.models.TorchvisionModel
torchvision_name: efficientnet_b1
transfer: false
transform:
  type: image
conf/model/EfficientNet-B2-T.yaml#
id: EfficientNet-B2-T
_target_: autrainer.models.TorchvisionModel
torchvision_name: efficientnet_b2
transfer: true
transform:
  type: image
conf/model/EfficientNet-B2.yaml#
id: EfficientNet-B2
_target_: autrainer.models.TorchvisionModel
torchvision_name: efficientnet_b2
transfer: false
transform:
  type: image
conf/model/EfficientNet-B3-T.yaml#
id: EfficientNet-B3-T
_target_: autrainer.models.TorchvisionModel
torchvision_name: efficientnet_b3
transfer: true
transform:
  type: image
conf/model/EfficientNet-B3.yaml#
id: EfficientNet-B3
_target_: autrainer.models.TorchvisionModel
torchvision_name: efficientnet_b3
transfer: false
transform:
  type: image
conf/model/EfficientNet-B4-T.yaml#
id: EfficientNet-B4-T
_target_: autrainer.models.TorchvisionModel
torchvision_name: efficientnet_b4
transfer: true
transform:
  type: image
conf/model/EfficientNet-B4.yaml#
id: EfficientNet-B4
_target_: autrainer.models.TorchvisionModel
torchvision_name: efficientnet_b4
transfer: false
transform:
  type: image
conf/model/EfficientNet-B5-T.yaml#
id: EfficientNet-B5-T
_target_: autrainer.models.TorchvisionModel
torchvision_name: efficientnet_b5
transfer: true
transform:
  type: image
conf/model/EfficientNet-B5.yaml#
id: EfficientNet-B5
_target_: autrainer.models.TorchvisionModel
torchvision_name: efficientnet_b5
transfer: false
transform:
  type: image
conf/model/EfficientNet-B6-T.yaml#
id: EfficientNet-B6-T
_target_: autrainer.models.TorchvisionModel
torchvision_name: efficientnet_b6
transfer: true
transform:
  type: image
conf/model/EfficientNet-B6.yaml#
id: EfficientNet-B6
_target_: autrainer.models.TorchvisionModel
torchvision_name: efficientnet_b6
transfer: false
transform:
  type: image
conf/model/EfficientNet-B7-T.yaml#
id: EfficientNet-B7-T
_target_: autrainer.models.TorchvisionModel
torchvision_name: efficientnet_b7
transfer: true
transform:
  type: image
conf/model/EfficientNet-B7.yaml#
id: EfficientNet-B7
_target_: autrainer.models.TorchvisionModel
torchvision_name: efficientnet_b7
transfer: false
transform:
  type: image
conf/model/EfficientNetV2-L-T.yaml#
id: EfficientNetV2-L-T
_target_: autrainer.models.TorchvisionModel
torchvision_name: efficientnet_v2_l
transfer: true
transform:
  type: image
conf/model/EfficientNetV2-L.yaml#
id: EfficientNetV2-L
_target_: autrainer.models.TorchvisionModel
torchvision_name: efficientnet_v2_l
transfer: false
transform:
  type: image
conf/model/EfficientNetV2-M-T.yaml#
id: EfficientNetV2-M-T
_target_: autrainer.models.TorchvisionModel
torchvision_name: efficientnet_v2_m
transfer: true
transform:
  type: image
conf/model/EfficientNetV2-M.yaml#
id: EfficientNetV2-M
_target_: autrainer.models.TorchvisionModel
torchvision_name: efficientnet_v2_m
transfer: false
transform:
  type: image
conf/model/EfficientNetV2-S-T.yaml#
id: EfficientNetV2-S-T
_target_: autrainer.models.TorchvisionModel
torchvision_name: efficientnet_v2_s
transfer: true
transform:
  type: image
conf/model/EfficientNetV2-S.yaml#
id: EfficientNetV2-S
_target_: autrainer.models.TorchvisionModel
torchvision_name: efficientnet_v2_s
transfer: false
transform:
  type: image

GoogLeNet

conf/model/GoogLeNet-T.yaml#
id: GoogLeNet-T
_target_: autrainer.models.TorchvisionModel
torchvision_name: googlenet
transfer: true
aux_logits: false
transform:
  type: image
  base:
    - autrainer.transforms.Resize:
        height: 224
        width: 224
conf/model/GoogLeNet.yaml#
id: GoogLeNet
_target_: autrainer.models.TorchvisionModel
torchvision_name: googlenet
transfer: false
aux_logits: false
transform:
  type: image
  base:
    - autrainer.transforms.Resize:
        height: 224
        width: 224

InceptionV3

conf/model/InceptionV3-T.yaml#
id: InceptionV3-T
_target_: autrainer.models.TorchvisionModel
torchvision_name: inception_v3
transfer: true
aux_logits: false

transform:
  type: image
  base:
    - autrainer.transforms.Resize:
        height: 299
        width: 299
conf/model/InceptionV3.yaml#
id: InceptionV3
_target_: autrainer.models.TorchvisionModel
torchvision_name: inception_v3
transfer: false
aux_logits: false

transform:
  type: image
  base:
    - autrainer.transforms.Resize:
        height: 299
        width: 299

MaxViT

conf/model/MaxViT-T-T.yaml#
id: MaxViT-T-T
_target_: autrainer.models.TorchvisionModel
torchvision_name: maxvit_t
transfer: true
transform:
  type: image
  base:
    - autrainer.transforms.Resize:
        height: 224
        width: 224
conf/model/MaxViT-T.yaml#
id: MaxViT-T
_target_: autrainer.models.TorchvisionModel
torchvision_name: maxvit_t
transfer: false
transform:
  type: image
  base:
    - autrainer.transforms.Resize:
        height: 224
        width: 224

MnasNet

conf/model/MnasNet-0.5-T.yaml#
id: MnasNet-0.5-T
_target_: autrainer.models.TorchvisionModel
torchvision_name: mnasnet0_5
transfer: true
transform:
  type: image
conf/model/MnasNet-0.5.yaml#
id: MnasNet-0.5
_target_: autrainer.models.TorchvisionModel
torchvision_name: mnasnet0_5
transfer: false
transform:
  type: image
conf/model/MnasNet-0.75-T.yaml#
id: MnasNet-0.75-T
_target_: autrainer.models.TorchvisionModel
torchvision_name: mnasnet0_75
transfer: true
transform:
  type: image
conf/model/MnasNet-0.75.yaml#
id: MnasNet-0.75
_target_: autrainer.models.TorchvisionModel
torchvision_name: mnasnet0_75
transfer: false
transform:
  type: image
conf/model/MnasNet-1.0-T.yaml#
id: MnasNet-1.0-T
_target_: autrainer.models.TorchvisionModel
torchvision_name: mnasnet1_0
transfer: true
transform:
  type: image
conf/model/MnasNet-1.0.yaml#
id: MnasNet-1.0
_target_: autrainer.models.TorchvisionModel
torchvision_name: mnasnet1_0
transfer: false
transform:
  type: image
conf/model/MnasNet-1.3-T.yaml#
id: MnasNet-1.3-T
_target_: autrainer.models.TorchvisionModel
torchvision_name: mnasnet1_3
transfer: true
transform:
  type: image
conf/model/MnasNet-1.3.yaml#
id: MnasNet-1.3
_target_: autrainer.models.TorchvisionModel
torchvision_name: mnasnet1_3
transfer: false
transform:
  type: image

MobileNet

conf/model/MobileNetV2-T.yaml#
id: MobileNetV2-T
_target_: autrainer.models.TorchvisionModel
torchvision_name: mobilenet_v2
transfer: true
transform:
  type: image
conf/model/MobileNetV2.yaml#
id: MobileNetV2
_target_: autrainer.models.TorchvisionModel
torchvision_name: mobilenet_v2
transfer: false
transform:
  type: image
conf/model/MobileNetV3-Large-T.yaml#
id: MobileNetV3-Large-T
_target_: autrainer.models.TorchvisionModel
torchvision_name: mobilenet_v3_large
transfer: true
transform:
  type: image
conf/model/MobileNetV3-Large.yaml#
id: MobileNetV3-Large
_target_: autrainer.models.TorchvisionModel
torchvision_name: mobilenet_v3_large
transfer: false
transform:
  type: image
conf/model/MobileNetV3-Small-T.yaml#
id: MobileNetV3-Small-T
_target_: autrainer.models.TorchvisionModel
torchvision_name: mobilenet_v3_small
transfer: true
transform:
  type: image
conf/model/MobileNetV3-Small.yaml#
id: MobileNetV3-Small
_target_: autrainer.models.TorchvisionModel
torchvision_name: mobilenet_v3_small
transfer: false
transform:
  type: image

RegNet

conf/model/RegNetX-1.6GF-T.yaml#
id: RegNetX-1.6GF-T
_target_: autrainer.models.TorchvisionModel
torchvision_name: regnet_x_1_6gf
transfer: true
transform:
  type: image
conf/model/RegNetX-1.6GF.yaml#
id: RegNetX-1.6GF
_target_: autrainer.models.TorchvisionModel
torchvision_name: regnet_x_1_6gf
transfer: false
transform:
  type: image
conf/model/RegNetX-16GF-T.yaml#
id: RegNetX-16GF-T
_target_: autrainer.models.TorchvisionModel
torchvision_name: regnet_x_16gf
transfer: true
transform:
  type: image
conf/model/RegNetX-16GF.yaml#
id: RegNetX-16GF
_target_: autrainer.models.TorchvisionModel
torchvision_name: regnet_x_16gf
transfer: false
transform:
  type: image
conf/model/RegNetX-3.2GF-T.yaml#
id: RegNetX-3.2GF-T
_target_: autrainer.models.TorchvisionModel
torchvision_name: regnet_x_3_2gf
transfer: true
transform:
  type: image
conf/model/RegNetX-3.2GF.yaml#
id: RegNetX-3.2GF
_target_: autrainer.models.TorchvisionModel
torchvision_name: regnet_x_3_2gf
transfer: false
transform:
  type: image
conf/model/RegNetX-32GF-T.yaml#
id: RegNetX-32GF-T
_target_: autrainer.models.TorchvisionModel
torchvision_name: regnet_x_32gf
transfer: true
transform:
  type: image
conf/model/RegNetX-32GF.yaml#
id: RegNetX-32GF
_target_: autrainer.models.TorchvisionModel
torchvision_name: regnet_x_32gf
transfer: false
transform:
  type: image
conf/model/RegNetX-400MF-T.yaml#
id: RegNetX-400MF-T
_target_: autrainer.models.TorchvisionModel
torchvision_name: regnet_x_400mf
transfer: true
transform:
  type: image
conf/model/RegNetX-400MF.yaml#
id: RegNetX-400MF
_target_: autrainer.models.TorchvisionModel
torchvision_name: regnet_x_400mf
transfer: false
transform:
  type: image
conf/model/RegNetX-800MF-T.yaml#
id: RegNetX-800MF-T
_target_: autrainer.models.TorchvisionModel
torchvision_name: regnet_x_800mf
transfer: true
transform:
  type: image
conf/model/RegNetX-800MF.yaml#
id: RegNetX-800MF
_target_: autrainer.models.TorchvisionModel
torchvision_name: regnet_x_800mf
transfer: false
transform:
  type: image
conf/model/RegNetX-8GF-T.yaml#
id: RegNetX-8GF-T
_target_: autrainer.models.TorchvisionModel
torchvision_name: regnet_x_8gf
transfer: true
transform:
  type: image
conf/model/RegNetX-8GF.yaml#
id: RegNetX-8GF
_target_: autrainer.models.TorchvisionModel
torchvision_name: regnet_x_8gf
transfer: false
transform:
  type: image
conf/model/RegNetY-1.6GF-T.yaml#
id: RegNetY-1.6GF-T
_target_: autrainer.models.TorchvisionModel
torchvision_name: regnet_y_1_6gf
transfer: true
transform:
  type: image
conf/model/RegNetY-1.6GF.yaml#
id: RegNetY-1.6GF
_target_: autrainer.models.TorchvisionModel
torchvision_name: regnet_y_1_6gf
transfer: false
transform:
  type: image
conf/model/RegNetY-128GF-T.yaml#
id: RegNetY-128GF-T
_target_: autrainer.models.TorchvisionModel
torchvision_name: regnet_y_128gf
transfer: true
transform:
  type: image
conf/model/RegNetY-128GF.yaml#
id: RegNetY-128GF
_target_: autrainer.models.TorchvisionModel
torchvision_name: regnet_y_128gf
transfer: false
transform:
  type: image
conf/model/RegNetY-16GF-T.yaml#
id: RegNetY-16GF-T
_target_: autrainer.models.TorchvisionModel
torchvision_name: regnet_y_16gf
transfer: true
transform:
  type: image
conf/model/RegNetY-16GF.yaml#
id: RegNetY-16GF
_target_: autrainer.models.TorchvisionModel
torchvision_name: regnet_y_16gf
transfer: false
transform:
  type: image
conf/model/RegNetY-3.2GF-T.yaml#
id: RegNetY-3.2GF-T
_target_: autrainer.models.TorchvisionModel
torchvision_name: regnet_y_3_2gf
transfer: true
transform:
  type: image
conf/model/RegNetY-3.2GF.yaml#
id: RegNetY-3.2GF
_target_: autrainer.models.TorchvisionModel
torchvision_name: regnet_y_3_2gf
transfer: false
transform:
  type: image
conf/model/RegNetY-32GF-T.yaml#
id: RegNetY-32GF-T
_target_: autrainer.models.TorchvisionModel
torchvision_name: regnet_y_32gf
transfer: true
transform:
  type: image
conf/model/RegNetY-32GF.yaml#
id: RegNetY-32GF
_target_: autrainer.models.TorchvisionModel
torchvision_name: regnet_y_32gf
transfer: false
transform:
  type: image
conf/model/RegNetY-400MF-T.yaml#
id: RegNetY-400MF-T
_target_: autrainer.models.TorchvisionModel
torchvision_name: regnet_y_400mf
transfer: true
transform:
  type: image
conf/model/RegNetY-400MF.yaml#
id: RegNetY-400MF
_target_: autrainer.models.TorchvisionModel
torchvision_name: regnet_y_400mf
transfer: false
transform:
  type: image
conf/model/RegNetY-800MF-T.yaml#
id: RegNetY-800MF-T
_target_: autrainer.models.TorchvisionModel
torchvision_name: regnet_y_800mf
transfer: true
transform:
  type: image
conf/model/RegNetY-800MF.yaml#
id: RegNetY-800MF
_target_: autrainer.models.TorchvisionModel
torchvision_name: regnet_y_800mf
transfer: false
transform:
  type: image
conf/model/RegNetY-8GF-T.yaml#
id: RegNetY-8GF-T
_target_: autrainer.models.TorchvisionModel
torchvision_name: regnet_y_8gf
transfer: true
transform:
  type: image
conf/model/RegNetY-8GF.yaml#
id: RegNetY-8GF
_target_: autrainer.models.TorchvisionModel
torchvision_name: regnet_y_8gf
transfer: false
transform:
  type: image

ResNet

conf/model/ResNet-101-T.yaml#
id: ResNet-101-T
_target_: autrainer.models.TorchvisionModel
torchvision_name: resnet101
transfer: true
transform:
  type: image
conf/model/ResNet-101.yaml#
id: ResNet-101
_target_: autrainer.models.TorchvisionModel
torchvision_name: resnet101
transfer: false
transform:
  type: image
conf/model/ResNet-152-T.yaml#
id: ResNet-152-T
_target_: autrainer.models.TorchvisionModel
torchvision_name: resnet152
transfer: true
transform:
  type: image
conf/model/ResNet-152.yaml#
id: ResNet-152
_target_: autrainer.models.TorchvisionModel
torchvision_name: resnet152
transfer: false
transform:
  type: image
conf/model/ResNet-18-T.yaml#
id: ResNet-18-T
_target_: autrainer.models.TorchvisionModel
torchvision_name: resnet18
transfer: true
transform:
  type: image
conf/model/ResNet-18.yaml#
id: ResNet-18
_target_: autrainer.models.TorchvisionModel
torchvision_name: resnet18
transfer: false
transform:
  type: image
conf/model/ResNet-34-T.yaml#
id: ResNet-34-T
_target_: autrainer.models.TorchvisionModel
torchvision_name: resnet34
transfer: true
transform:
  type: image
conf/model/ResNet-34.yaml#
id: ResNet-34
_target_: autrainer.models.TorchvisionModel
torchvision_name: resnet34
transfer: false
transform:
  type: image
conf/model/ResNet-50-T.yaml#
id: ResNet-50-T
_target_: autrainer.models.TorchvisionModel
torchvision_name: resnet50
transfer: true
transform:
  type: image
conf/model/ResNet-50.yaml#
id: ResNet-50
_target_: autrainer.models.TorchvisionModel
torchvision_name: resnet50
transfer: false
transform:
  type: image

ResNeXt

conf/model/ResNeXt-101-32x8d-T.yaml#
id: ResNeXt-101-32x8d-T
_target_: autrainer.models.TorchvisionModel
torchvision_name: resnext101_32x8d
transfer: true
transform:
  type: image
conf/model/ResNeXt-101-32x8d.yaml#
id: ResNeXt-101-32x8d
_target_: autrainer.models.TorchvisionModel
torchvision_name: resnext101_32x8d
transfer: false
transform:
  type: image
conf/model/ResNeXt-101-64x4d-T.yaml#
id: ResNeXt-101-64x4d-T
_target_: autrainer.models.TorchvisionModel
torchvision_name: resnext101_64x4d
transfer: true
transform:
  type: image
conf/model/ResNeXt-101-64x4d.yaml#
id: ResNeXt-101-64x4d
_target_: autrainer.models.TorchvisionModel
torchvision_name: resnext101_64x4d
transfer: false
transform:
  type: image
conf/model/ResNeXt-50-32x4d-T.yaml#
id: ResNeXt-50-32x4d-T
_target_: autrainer.models.TorchvisionModel
torchvision_name: resnext50_32x4d
transfer: true
transform:
  type: image
conf/model/ResNeXt-50-32x4d.yaml#
id: ResNeXt-50-32x4d
_target_: autrainer.models.TorchvisionModel
torchvision_name: resnext50_32x4d
transfer: false
transform:
  type: image

ShuffleNet

conf/model/ShuffleNetV2-0.5x-T.yaml#
id: ShuffleNetV2-0.5x-T
_target_: autrainer.models.TorchvisionModel
torchvision_name: shufflenet_v2_x0_5
transfer: true
transform:
  type: image
conf/model/ShuffleNetV2-0.5x.yaml#
id: ShuffleNetV2-0.5x
_target_: autrainer.models.TorchvisionModel
torchvision_name: shufflenet_v2_x0_5
transfer: false
transform:
  type: image
conf/model/ShuffleNetV2-1.0x-T.yaml#
id: ShuffleNetV2-1.0x-T
_target_: autrainer.models.TorchvisionModel
torchvision_name: shufflenet_v2_x1_0
transfer: true
transform:
  type: image
conf/model/ShuffleNetV2-1.0x.yaml#
id: ShuffleNetV2-1.0x
_target_: autrainer.models.TorchvisionModel
torchvision_name: shufflenet_v2_x1_0
transfer: false
transform:
  type: image
conf/model/ShuffleNetV2-1.5x-T.yaml#
id: ShuffleNetV2-1.5x-T
_target_: autrainer.models.TorchvisionModel
torchvision_name: shufflenet_v2_x1_5
transfer: true
transform:
  type: image
conf/model/ShuffleNetV2-1.5x.yaml#
id: ShuffleNetV2-1.5x
_target_: autrainer.models.TorchvisionModel
torchvision_name: shufflenet_v2_x1_5
transfer: false
transform:
  type: image
conf/model/ShuffleNetV2-2.0x-T.yaml#
id: ShuffleNetV2-2.0x-T
_target_: autrainer.models.TorchvisionModel
torchvision_name: shufflenet_v2_x2_0
transfer: true
transform:
  type: image
conf/model/ShuffleNetV2-2.0x.yaml#
id: ShuffleNetV2-2.0x
_target_: autrainer.models.TorchvisionModel
torchvision_name: shufflenet_v2_x2_0
transfer: false
transform:
  type: image

SqueezeNet

conf/model/SqueezeNet-1.0-T.yaml#
id: SqueezeNet-1.0-T
_target_: autrainer.models.TorchvisionModel
torchvision_name: squeezenet1_0
transfer: true
transform:
  type: image
conf/model/SqueezeNet-1.0.yaml#
id: SqueezeNet-1.0
_target_: autrainer.models.TorchvisionModel
torchvision_name: squeezenet1_0
transfer: false
transform:
  type: image
conf/model/SqueezeNet-1.1-T.yaml#
id: SqueezeNet-1.1-T
_target_: autrainer.models.TorchvisionModel
torchvision_name: squeezenet1_1
transfer: true
transform:
  type: image
conf/model/SqueezeNet-1.1.yaml#
id: SqueezeNet-1.1
_target_: autrainer.models.TorchvisionModel
torchvision_name: squeezenet1_1
transfer: false
transform:
  type: image

Swin

conf/model/Swin-B-T.yaml#
id: Swin-B-T
_target_: autrainer.models.TorchvisionModel
torchvision_name: swin_b
transfer: true
transform:
  type: image
  base:
    - autrainer.transforms.Resize:
        height: 224
        width: 224
conf/model/Swin-B.yaml#
id: Swin-B
_target_: autrainer.models.TorchvisionModel
torchvision_name: swin_b
transfer: false
transform:
  type: image
  base:
    - autrainer.transforms.Resize:
        height: 224
        width: 224
conf/model/Swin-S-T.yaml#
id: Swin-S-T
_target_: autrainer.models.TorchvisionModel
torchvision_name: swin_s
transfer: true
transform:
  type: image
  base:
    - autrainer.transforms.Resize:
        height: 224
        width: 224
conf/model/Swin-S.yaml#
id: Swin-S
_target_: autrainer.models.TorchvisionModel
torchvision_name: swin_s
transfer: false
transform:
  type: image
  base:
    - autrainer.transforms.Resize:
        height: 224
        width: 224
conf/model/Swin-T-T.yaml#
id: Swin-T-T
_target_: autrainer.models.TorchvisionModel
torchvision_name: swin_t
transfer: true
transform:
  type: image
  base:
    - autrainer.transforms.Resize:
        height: 224
        width: 224
conf/model/Swin-T.yaml#
id: Swin-T
_target_: autrainer.models.TorchvisionModel
torchvision_name: swin_t
transfer: false
transform:
  type: image
  base:
    - autrainer.transforms.Resize:
        height: 224
        width: 224
conf/model/Swin-V2-B-T.yaml#
id: Swin-V2-B-T
_target_: autrainer.models.TorchvisionModel
torchvision_name: swin_v2_b
transfer: true
transform:
  type: image
  base:
    - autrainer.transforms.Resize:
        height: 256
        width: 256
conf/model/Swin-V2-B.yaml#
id: Swin-V2-B
_target_: autrainer.models.TorchvisionModel
torchvision_name: swin_v2_b
transfer: false
transform:
  type: image
  base:
    - autrainer.transforms.Resize:
        height: 256
        width: 256
conf/model/Swin-V2-S-T.yaml#
id: Swin-V2-S-T
_target_: autrainer.models.TorchvisionModel
torchvision_name: swin_v2_s
transfer: true
transform:
  type: image
  base:
    - autrainer.transforms.Resize:
        height: 256
        width: 256
conf/model/Swin-V2-S.yaml#
id: Swin-V2-S
_target_: autrainer.models.TorchvisionModel
torchvision_name: swin_v2_s
transfer: false
transform:
  type: image
  base:
    - autrainer.transforms.Resize:
        height: 256
        width: 256
conf/model/Swin-V2-T-T.yaml#
id: Swin-V2-T-T
_target_: autrainer.models.TorchvisionModel
torchvision_name: swin_v2_t
transfer: true
transform:
  type: image
  base:
    - autrainer.transforms.Resize:
        height: 256
        width: 256
conf/model/Swin-V2-T.yaml#
id: Swin-V2-T
_target_: autrainer.models.TorchvisionModel
torchvision_name: swin_v2_t
transfer: false
transform:
  type: image
  base:
    - autrainer.transforms.Resize:
        height: 256
        width: 256

VGG

conf/model/VGG-11-BN-T.yaml#
id: VGG-11-BN-T
_target_: autrainer.models.TorchvisionModel
torchvision_name: vgg11_bn
transfer: true
transform:
  type: image
conf/model/VGG-11-BN.yaml#
id: VGG-11-BN
_target_: autrainer.models.TorchvisionModel
torchvision_name: vgg11_bn
transfer: false
transform:
  type: image
conf/model/VGG-11-T.yaml#
id: VGG-11-T
_target_: autrainer.models.TorchvisionModel
torchvision_name: vgg11
transfer: true
transform:
  type: image
conf/model/VGG-11.yaml#
id: VGG-11
_target_: autrainer.models.TorchvisionModel
torchvision_name: vgg11
transfer: false
transform:
  type: image
conf/model/VGG-13-BN-T.yaml#
id: VGG-13-BN-T
_target_: autrainer.models.TorchvisionModel
torchvision_name: vgg13_bn
transfer: true
transform:
  type: image
conf/model/VGG-13-BN.yaml#
id: VGG-13-BN
_target_: autrainer.models.TorchvisionModel
torchvision_name: vgg13_bn
transfer: false
transform:
  type: image
conf/model/VGG-13-T.yaml#
id: VGG-13-T
_target_: autrainer.models.TorchvisionModel
torchvision_name: vgg13
transfer: true
transform:
  type: image
conf/model/VGG-13.yaml#
id: VGG-13
_target_: autrainer.models.TorchvisionModel
torchvision_name: vgg13
transfer: false
transform:
  type: image
conf/model/VGG-16-BN-T.yaml#
id: VGG-16-BN-T
_target_: autrainer.models.TorchvisionModel
torchvision_name: vgg16_bn
transfer: true
transform:
  type: image
conf/model/VGG-16-BN.yaml#
id: VGG-16-BN
_target_: autrainer.models.TorchvisionModel
torchvision_name: vgg16_bn
transfer: false
transform:
  type: image
conf/model/VGG-16-T.yaml#
id: VGG-16-T
_target_: autrainer.models.TorchvisionModel
torchvision_name: vgg16
transfer: true
transform:
  type: image
conf/model/VGG-16.yaml#
id: VGG-16
_target_: autrainer.models.TorchvisionModel
torchvision_name: vgg16
transfer: false
transform:
  type: image
conf/model/VGG-19-BN-T.yaml#
id: VGG-19-BN-T
_target_: autrainer.models.TorchvisionModel
torchvision_name: vgg19_bn
transfer: true
transform:
  type: image
conf/model/VGG-19-BN.yaml#
id: VGG-19-BN
_target_: autrainer.models.TorchvisionModel
torchvision_name: vgg19_bn
transfer: false
transform:
  type: image
conf/model/VGG-19-T.yaml#
id: VGG-19-T
_target_: autrainer.models.TorchvisionModel
torchvision_name: vgg19
transfer: true
transform:
  type: image
conf/model/VGG-19.yaml#
id: VGG-19
_target_: autrainer.models.TorchvisionModel
torchvision_name: vgg19
transfer: false
transform:
  type: image

Wide-ResNet

conf/model/Wide-ResNet-101-2-T.yaml#
id: Wide-ResNet-101-2-T
_target_: autrainer.models.TorchvisionModel
torchvision_name: wide_resnet101_2
transfer: true
transform:
  type: image
conf/model/Wide-ResNet-101-2.yaml#
id: Wide-ResNet-101-2
_target_: autrainer.models.TorchvisionModel
torchvision_name: wide_resnet101_2
transfer: false
transform:
  type: image
conf/model/Wide-ResNet-50-2-T.yaml#
id: Wide-ResNet-50-2-T
_target_: autrainer.models.TorchvisionModel
torchvision_name: wide_resnet50_2
transfer: true
transform:
  type: image
conf/model/Wide-ResNet-50-2.yaml#
id: Wide-ResNet-50-2
_target_: autrainer.models.TorchvisionModel
torchvision_name: wide_resnet50_2
transfer: false
transform:
  type: image

class autrainer.models.TimmModel(output_dim, timm_name, transfer=False, **kwargs)[source]#

Wrapper for timm models.

Parameters:
  • output_dim (int) – Number of output classes.

  • timm_name (str) – Name of the model available in timm.create_model.

  • transfer (bool) – Whether to load the model with pretrained weights. The final layer is replaced with a new layer with output_dim output features. Defaults to False.

  • kwargs – Additional arguments to pass to the model constructor.

Default Configurations

autrainer does not provide default configurations for timm models. To discover the available timm models for creating a custom configuration, refer to the timm documentation; a hypothetical example configuration is shown below.
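
As a hedged sketch, a custom timm configuration can mirror the torchvision configurations above; the id and file name are placeholders, and mobilenetv3_large_100 is one of the model names registered in timm:

conf/model/MobileNetV3-100-timm-T.yaml# (hypothetical)
id: MobileNetV3-100-timm-T
_target_: autrainer.models.TimmModel
timm_name: mobilenetv3_large_100
transfer: true

transform:
  type: image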

Audio Models#

autrainer provides a number of different audio-specific models.

class autrainer.models.ASTModel(output_dim, num_hidden_layers=12, hidden_size=128, dropout=0.5, sigmoid=False, transfer=None)[source]#

Audio Spectrogram Transformer (AST) model. For more information see: https://huggingface.co/docs/transformers/v4.31.0/en/model_doc/audio-spectrogram-transformer

Parameters:
  • output_dim (int) – Output dimension of the model.

  • num_hidden_layers (int) – Number of hidden layers in the transformer. Defaults to 12.

  • hidden_size (int) – Hidden size of the linear layer. Defaults to 128.

  • dropout (float) – Dropout rate. Defaults to 0.5.

  • sigmoid (bool) – If True, a sigmoid activation function is applied to the output. Defaults to False.

  • transfer (Optional[str]) – Name of the pretrained model to load. If None, the default AST fine-tuned on AudioSet is used. Defaults to None. For more information see: https://huggingface.co/MIT/ast-finetuned-audioset-10-10-0.4593

Default Configurations
conf/model/ASTModel-T.yaml#
id: ASTModel-T
_target_: autrainer.models.ASTModel
transfer: MIT/ast-finetuned-audioset-10-10-0.4593

transform:
  type: raw
  base:
    - autrainer.transforms.FeatureExtractor:
        fe_type: AST
        fe_transfer: MIT/ast-finetuned-audioset-10-10-0.4593

class autrainer.models.AudioRNNModel(output_dim, model_name, hidden_size=256, num_layers=2, dropout=0.5, cell='LSTM', bidirectional=False)[source]#

Audio RNN model.

Parameters:
  • output_dim (int) – Output dimension of the model.

  • model_name (str) – Model name in [“emo18”, “zhao19”].

  • hidden_size (int) – Hidden size of the RNN. Defaults to 256.

  • num_layers (int) – Number of layers of the RNN. Defaults to 2.

  • dropout (float) – Dropout rate. Defaults to 0.5.

  • cell (str) – Type of RNN cell in [“LSTM”, “GRU”]. Defaults to “LSTM”.

  • bidirectional (bool) – Whether to use a bidirectional RNN. Defaults to False.

Default Configurations
conf/model/End2You-emo18.yaml#
id: End2You-emo18
_target_: autrainer.models.AudioRNNModel
model_name: emo18

transform:
  type: raw
conf/model/End2You-zhao19.yaml#
id: End2You-zhao19
_target_: autrainer.models.AudioRNNModel
model_name: zhao19

transform:
  type: raw

class autrainer.models.Cnn10(output_dim, sigmoid_output=False, sigmoid_predictions=False, segmentwise=False, in_channels=1, transfer=None)[source]#

CNN10 PANN model. For more information see: https://doi.org/10.48550/arXiv.1912.10211

Parameters:
  • output_dim (int) – Output dimension of the model.

  • sigmoid_output (bool) – Whether to apply sigmoid activation to the output. Defaults to False.

  • sigmoid_predictions (bool) – Whether to apply sigmoid activation during inference. Defaults to False.

  • segmentwise (bool) – Whether to use the segmentwise or the clipwise path. Defaults to False.

  • in_channels (int) – Number of input channels. Defaults to 1.

  • transfer (Optional[str]) – Link to the weights to transfer. If None, the weights will be randomly initialized. Defaults to None.

Default Configurations
conf/model/Cnn10-32k-T-online.yaml#
id: Cnn10-32k-T-online
_target_: autrainer.models.Cnn10
transfer: https://zenodo.org/records/3987831/files/Cnn10_mAP%3D0.380.pth

transform:
  type: raw
  base:
    - autrainer.transforms.Resample:
        current_sr: 48000
        target_sr: 32000
    - autrainer.transforms.PannMel:
        sample_rate: 32000
        window_size: 1024
        hop_size: 320
        mel_bins: 64
        fmin: 50
        fmax: 14000
        ref: 1.0
        amin: 1e-10
        top_db: null
conf/model/Cnn10-32k-T.yaml#
id: Cnn10-32k-T
_target_: autrainer.models.Cnn10
transfer: https://zenodo.org/records/3987831/files/Cnn10_mAP%3D0.380.pth

transform:
  type: grayscale
  base:
    - autrainer.transforms.Normalize: null
conf/model/Cnn10.yaml#
id: Cnn10
_target_: autrainer.models.Cnn10

transform:
  type: grayscale
  base:
    - autrainer.transforms.Normalize: null

class autrainer.models.Cnn14(output_dim, sigmoid_output=False, segmentwise=False, in_channels=1, transfer=None)[source]#

CNN14 PANN model. For more information see: https://doi.org/10.48550/arXiv.1912.10211

Parameters:
  • output_dim (int) – Output dimension of the model.

  • sigmoid_output (bool) – Whether to apply sigmoid activation to the output. Defaults to False.

  • segmentwise (bool) – Whether to use the segmentwise or the clipwise path. Defaults to False.

  • in_channels (int) – Number of input channels. Defaults to 1.

  • transfer (Optional[str]) – Link to the weights to transfer. If None, the weights will be randomly initialized. Defaults to None.

Default Configurations
conf/model/Cnn14-16k-T.yaml#
id: Cnn14-16k-T
_target_: autrainer.models.Cnn14
transfer: https://zenodo.org/records/3987831/files/Cnn14_16k_mAP%3D0.438.pth

transform:
  type: grayscale
  base:
    - autrainer.transforms.Normalize: null
conf/model/Cnn14-32k-T.yaml#
id: Cnn14-32k-T
_target_: autrainer.models.Cnn14
transfer: https://zenodo.org/records/3987831/files/Cnn14_mAP%3D0.431.pth

transform:
  type: grayscale
  base:
    - autrainer.transforms.Normalize: null
conf/model/Cnn14.yaml#
id: Cnn14
_target_: autrainer.models.Cnn14

transform:
  type: grayscale
  base:
    - autrainer.transforms.Normalize: null

class autrainer.models.FFNN(output_dim, input_size, hidden_size, num_layers=2, sigmoid=False, softmax=False, dropout=0.5)[source]#

Feedforward neural network.

Parameters:
  • output_dim (int) – Output dimension.

  • input_size (int) – Input size.

  • hidden_size (int) – Hidden size.

  • num_layers (int) – Number of layers.

  • sigmoid (bool) – Whether to use sigmoid activation.

  • softmax (bool) – Whether to use softmax activation.

  • dropout (float) – Dropout rate.

Default Configurations
conf/model/ToyFFNN.yaml#
id: ToyFFNN
_target_: autrainer.models.FFNN
input_size: 64
hidden_size: 64
num_layers: 2

transform:
  type: tabular
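
Like all models, the FFNN can also be instantiated directly; a short sketch using the documented signature, where the batch size and feature dimension are illustrative and assume flattened tabular features:

import torch

from autrainer.models import FFNN

model = FFNN(output_dim=10, input_size=64, hidden_size=64, num_layers=2)
features = torch.randn(8, 64)  # a batch of 8 feature vectors
logits = model(features)  # shape: (8, 10)
hidden = model.embeddings(features)  # representation before the output layer
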
class autrainer.models.LEAFNet(output_dim, leaf_filters=40, kernel_size=25, stride=0.0625, window_stride=10, padding_kernel_size=25, sample_rate=16000, min_freq=60, max_freq=7800, efficientnet_type='efficientnet_b0', mode='interspeech', initialization='mel', generator_seed=42, transfer=False)[source]#

EfficientNet with LEAF-Is frontend. Used to reproduce work from: https://www.isca-archive.org/interspeech_2023/meng23c_interspeech.html

See also the original LEAF and PCEN papers (cf. speechbrain).

We take and slightly adapt the LEAF frontend implementation from: Hanyu-Meng/Adapting-LEAF

Parameters:
  • output_dim (int) – Output dimension.

  • leaf_filters (int) – Number of LEAF filterbanks to train. Defaults to 40.

  • kernel_size (int) – Size of kernels applied by LEAF (in ms). Defaults to 25.

  • stride (float) – Stride of LEAF (in ms). Defaults to 0.0625.

  • window_stride (int) – Stride of lowpass filter (in ms). Defaults to 10.

  • padding_kernel_size (int) – Size of lowpass filter (in ms). Defaults to 25.

  • sample_rate (int) – Used to compute LEAF params. Defaults to 16000.

  • min_freq (int) – Minimum freq analyzed by LEAF. Defaults to 60.

  • max_freq (int) – Maximum freq analyzed by LEAF. Defaults to 7800.

  • efficientnet_type (str) – EfficientNet type to use from timm. Defaults to “efficientnet_b0”.

  • mode (str) – Implementation according to “interspeech” paper (Meng et al.) or “speech_brain”. Defaults to “interspeech”.

  • initialization (str) – Filterbank initialization in [“mel”, “bark”, “linear-constant”, “constant”, “uniform”, “zeros”]. Defaults to “mel”.

  • generator_seed (int) – Seed for random generator. Defaults to 42.

  • transfer (bool) – Whether to use EfficientNet weights from ImageNet. Defaults to False.

Raises:
  • ValueError – If efficientnet_type is not supported.

  • ValueError – If mode is not supported.

Default Configurations
conf/model/LEAFNet.yaml#
id: LEAFNet
_target_: autrainer.models.LEAFNet

transform:
  type: raw

class autrainer.models.SeqFFNN(output_dim, backbone_input_dim, backbone_hidden_size, backbone_num_layers, hidden_size, num_layers=2, sigmoid=False, softmax=False, dropout=0.5, backbone_dropout=0.5, backbone_cell='LSTM', backbone_time_pooling=True, backbone_bidirectional=False)[source]#

Sequential model with FFNN frontend.

Parameters:
  • output_dim (int) – Output dimension of the FFNN.

  • backbone_input_dim (int) – Input dimension of the backbone.

  • backbone_hidden_size (int) – Hidden size of the backbone.

  • backbone_num_layers (int) – Number of layers of the backbone.

  • hidden_size (int) – Hidden size of the FFNN.

  • num_layers (int) – Number of layers of the FFNN. Defaults to 2.

  • sigmoid (bool) – Whether to apply sigmoid activation. Defaults to False.

  • softmax (bool) – Whether to apply softmax activation. Defaults to False.

  • dropout (float) – Dropout rate of the FFNN. Defaults to 0.5.

  • backbone_dropout (float) – Dropout rate of the backbone. Defaults to 0.5.

  • backbone_cell (str) – Cell type of the backbone in [“LSTM”, “GRU”]. Defaults to “LSTM”.

  • backbone_time_pooling (bool) – Whether to apply time pooling in the backbone. Defaults to True.

  • backbone_bidirectional (bool) – Whether to use a bidirectional backbone. Defaults to False.

Default Configurations
conf/model/Seq-FFNN-eGeMAPS.yaml#
id: Seq-FFNN-eGeMAPS
_target_: autrainer.models.SeqFFNN
backbone_input_dim: 25
backbone_cell: LSTM
backbone_hidden_size: 32
backbone_num_layers: 2
hidden_size: 32
num_layers: 2
sigmoid: false
softmax: false
dropout: 0.5

transform:
  type: tabular
conf/model/Seq-FFNN-IS09.yaml#
id: Seq-FFNN-IS09
_target_: autrainer.models.SeqFFNN
backbone_input_dim: 32
backbone_cell: LSTM
backbone_hidden_size: 32
backbone_num_layers: 2
hidden_size: 32
num_layers: 2
sigmoid: false
softmax: false
dropout: 0.5

transform:
  type: tabular
conf/model/Seq-FFNN-IS10.yaml#
id: Seq-FFNN-IS10
_target_: autrainer.models.SeqFFNN
backbone_input_dim: 76
backbone_cell: LSTM
backbone_hidden_size: 32
backbone_num_layers: 2
hidden_size: 32
num_layers: 2
sigmoid: false
softmax: false
dropout: 0.5

transform:
  type: tabular
conf/model/Seq-FFNN-IS11.yaml#
id: Seq-FFNN-IS11
_target_: autrainer.models.SeqFFNN
backbone_input_dim: 118
backbone_cell: LSTM
backbone_hidden_size: 32
backbone_num_layers: 2
hidden_size: 32
num_layers: 2
sigmoid: false
softmax: false
dropout: 0.5

transform:
  type: tabular
conf/model/Seq-FFNN-IS12.yaml#
id: Seq-FFNN-IS12
_target_: autrainer.models.SeqFFNN
backbone_input_dim: 128
backbone_cell: LSTM
backbone_hidden_size: 32
backbone_num_layers: 2
hidden_size: 32
num_layers: 2
sigmoid: false
softmax: false
dropout: 0.5

transform:
  type: tabular
conf/model/Seq-FFNN-IS13.yaml#
id: Seq-FFNN-IS13
_target_: autrainer.models.SeqFFNN
backbone_input_dim: 130
backbone_cell: LSTM
backbone_hidden_size: 32
backbone_num_layers: 2
hidden_size: 32
num_layers: 2
sigmoid: false
softmax: false
dropout: 0.5

transform:
  type: tabular
conf/model/Seq-FFNN-IS16.yaml#
id: Seq-FFNN-IS16
_target_: autrainer.models.SeqFFNN
backbone_input_dim: 130
backbone_cell: LSTM
backbone_hidden_size: 32
backbone_num_layers: 2
hidden_size: 32
num_layers: 2
sigmoid: false
softmax: false
dropout: 0.5

transform:
  type: tabular

class autrainer.models.TDNNFFNN(output_dim, hidden_size, num_layers=2, sigmoid=False, softmax=False, dropout=0.5)[source]#

Time Delay Neural Network with FFNN frontend.

Parameters:
  • output_dim (int) – Output dimension.

  • hidden_size (int) – Hidden size.

  • num_layers (int) – Number of layers. Defaults to 2.

  • sigmoid (bool) – Whether to use sigmoid activation. Defaults to False.

  • softmax (bool) – Whether to use softmax activation. Defaults to False.

  • dropout (float) – Dropout rate. Defaults to 0.5.

Default Configurations
conf/model/TDNNFFNN-T.yaml#
id: TDNNFFNN-T
_target_: autrainer.models.TDNNFFNN
hidden_size: 32

transform:
  type: raw

class autrainer.models.W2V2FFNN(output_dim, model_name, freeze_extractor, hidden_size, num_layers=2, sigmoid=False, softmax=False, dropout=0.5)[source]#

Wav2Vec2 model with FFNN frontend adapted for audio classification. For more information, see: https://huggingface.co/docs/transformers/model_doc/wav2vec2

Parameters:
  • output_dim (int) – Output dimension of the FFNN.

  • model_name (str) – Name of the model loaded from Huggingface.

  • freeze_extractor (bool) – Whether to freeze the feature extractor.

  • hidden_size (int) – Hidden size of the FFNN.

  • num_layers (int) – Number of layers of the FFNN. Defaults to 2.

  • sigmoid (bool) – Whether to apply sigmoid activation. Defaults to False.

  • softmax (bool) – Whether to apply softmax activation. Defaults to False.

  • dropout (float) – Dropout rate. Defaults to 0.5.

Default Configurations
conf/model/w2v2-b.yaml#
id: w2v2-b
_target_: autrainer.models.W2V2FFNN
model_name: facebook/wav2vec2-base
freeze_extractor: true
hidden_size: 512
num_layers: 2
sigmoid: false
softmax: false
dropout: 0.5

transform:
  type: raw
  base:
    - autrainer.transforms.FeatureExtractor:
        fe_type: W2V2
        fe_transfer: facebook/wav2vec2-base
conf/model/w2v2-l-100k.yaml#
id: w2v2-l-100k
_target_: autrainer.models.W2V2FFNN
model_name: facebook/wav2vec2-large-100k-voxpopuli
freeze_extractor: true
hidden_size: 512
num_layers: 2
sigmoid: false
softmax: false
dropout: 0.5

transform:
  type: raw
  base:
    - autrainer.transforms.FeatureExtractor:
        fe_type: W2V2
        fe_transfer: facebook/wav2vec2-large-100k-voxpopuli
conf/model/w2v2-l-emo.yaml#
id: w2v2-l-emo
_target_: autrainer.models.W2V2FFNN
model_name: audeering/wav2vec2-large-robust-12-ft-emotion-msp-dim
freeze_extractor: true
hidden_size: 512
num_layers: 2
sigmoid: false
softmax: false
dropout: 0.5

transform:
  type: raw
  base:
    - autrainer.transforms.FeatureExtractor:
        fe_type: W2V2
        fe_transfer: audeering/wav2vec2-large-robust-12-ft-emotion-msp-dim
conf/model/w2v2-l-rob.yaml#
id: w2v2-l-rob
_target_: autrainer.models.W2V2FFNN
model_name: facebook/wav2vec2-large-robust
freeze_extractor: true
hidden_size: 512
num_layers: 2
sigmoid: false
softmax: false
dropout: 0.5

transform:
  type: raw
  base:
    - autrainer.transforms.FeatureExtractor:
        fe_type: W2V2
        fe_transfer: facebook/wav2vec2-large-robust
conf/model/w2v2-l.yaml#
id: w2v2-l
_target_: autrainer.models.W2V2FFNN
model_name: facebook/wav2vec2-large
freeze_extractor: true
hidden_size: 512
num_layers: 2
sigmoid: false
softmax: false
dropout: 0.5

transform:
  type: raw
  base:
    - autrainer.transforms.FeatureExtractor:
        fe_type: W2V2
        fe_transfer: facebook/wav2vec2-large

class autrainer.models.WhisperFFNN(output_dim, model_name, hidden_size, num_layers=2, sigmoid=False, softmax=False, dropout=0.5)[source]#

Whisper model with FFNN frontend adapted for audio classification. For more information, see: https://doi.org/10.48550/arXiv.2212.04356

Parameters:
  • model_name (str) – Name of the model loaded from Huggingface.

  • hidden_size (int) – Hidden size of the FFNN.

  • output_dim (int) – Output dimension of the FFNN.

  • num_layers (int) – Number of layers of the FFNN. Defaults to 2.

  • sigmoid (bool) – Whether to apply sigmoid activation. Defaults to False.

  • softmax (bool) – Whether to apply softmax activation. Defaults to False.

  • dropout (float) – Dropout rate. Defaults to 0.5.

Default Configurations
conf/model/Whisper-FFNN-Base-T.yaml#
id: Whisper-FFNN-Base-T
_target_: autrainer.models.WhisperFFNN
model_name: openai/whisper-base
hidden_size: 512
num_layers: 2
sigmoid: false
softmax: false
dropout: 0.5

transform:
  type: raw
  base:
    - autrainer.transforms.FeatureExtractor:
        fe_type: Whisper
        fe_transfer: openai/whisper-base
conf/model/Whisper-FFNN-Small-T.yaml#
id: Whisper-FFNN-Small-T
_target_: autrainer.models.WhisperFFNN
model_name: openai/whisper-small
hidden_size: 512
num_layers: 2
sigmoid: false
softmax: false
dropout: 0.5

transform:
  type: raw
  base:
    - autrainer.transforms.FeatureExtractor:
        fe_type: Whisper
        fe_transfer: openai/whisper-small
conf/model/Whisper-FFNN-Tiny-T.yaml#
id: Whisper-FFNN-Tiny-T
_target_: autrainer.models.WhisperFFNN
model_name: openai/whisper-tiny
hidden_size: 512
num_layers: 2
sigmoid: false
softmax: false
dropout: 0.5

transform:
  type: raw
  base:
    - autrainer.transforms.FeatureExtractor:
        fe_type: Whisper
        fe_transfer: openai/whisper-tiny