Transforms#

Transforms preprocess the input data before it is fed to the model and are specified as pipelines using Shorthand Syntax. They can be applied both offline (preprocessing) and online.

Tip

To create custom transforms, refer to the custom transforms tutorial.

Multiple transforms are combined into a pipeline, which is managed by the TransformManager. SmartCompose automatically assembles each pipeline and sorts it based on the order attribute of every transform.

Transforms with a lower order are applied first, followed by transforms with a higher order. If two transforms share the same order, they are applied in the order they are specified in the configuration.

Both types of transforms can utilize any of the available transforms provided by autrainer, or custom transforms inheriting from the AbstractTransform class.

While the choice to use offline or online transforms depends on the use case and is a tradeoff between storage and computational costs, both can be used in conjunction for maximum flexibility.

Preprocessing Transforms#

Preprocessing transforms are specified in the dataset configuration by aligning the features_subdir attribute with the name of the preprocessing file. This way, the processed data is stored in a subdirectory of the dataset directory that shares the name of the preprocessing file. The processed data is saved using the file_handler specified in the dataset configuration.
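For example, a dataset configuration might reference the log_mel_16k preprocessing configuration shown further below (a hypothetical excerpt; the dataset name, the surrounding keys, and the NumpyFileHandler are assumptions):

# conf/dataset/MyAudioDataset.yaml (hypothetical excerpt)
features_subdir: log_mel_16k  # matches conf/preprocessing/log_mel_16k.yaml
file_handler: autrainer.datasets.utils.NumpyFileHandler  # assumption: handler used to store the extracted features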

Tip

To create custom preprocessing transforms, refer to the custom preprocessing transforms tutorial.

Preprocessing transforms consist of the following attributes:

  • file_handler: The file handler to use for loading the data specified with Shorthand Syntax. To discover all available file handlers, refer to the file handlers section.

  • pipeline: The sequence of transformations to apply to the data. The pipeline is specified using Shorthand Syntax and can include any of the available transforms.

Note

To avoid race conditions during parallel training, the preprocessing is applied to the entire dataset before training.

To automatically preprocess all datasets specified in the main configuration (conf/config.yaml) file, the following preprocessing CLI command can be used:

autrainer preprocess

Alternatively, use the following preprocessing CLI wrapper function:

autrainer.cli.preprocess()
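For example, in a script (a minimal sketch using only the documented default invocation):

import autrainer.cli

# preprocess all datasets specified in conf/config.yaml
autrainer.cli.preprocess()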

autrainer offers default configurations for log-Mel spectrogram extraction and openSMILE feature extraction.

Default Configurations

Log-Mel Spectrograms

Log-Mel spectrograms are a common representation of audio data.

autrainer offers default configurations for log-Mel spectrogram extraction for different sampling rates. The generation process is adapted from: PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition.

conf/preprocessing/log_mel_16k.yaml#
file_handler:
  autrainer.datasets.utils.AudioFileHandler:
    target_sample_rate: 16000
pipeline:
  - autrainer.transforms.StereoToMono
  - autrainer.transforms.PannMel:
      sample_rate: 16000
      window_size: 512
      hop_size: 160
      mel_bins: 64
      fmin: 50
      fmax: 8000
      ref: 1.0
      amin: 1e-10
      top_db: null

# adapted from:
# https://github.com/qiuqiangkong/audioset_tagging_cnn/blob/d2f4b8c18eab44737fcc0de1248ae21eb43f6aa4/pytorch/models.py#L2547
conf/preprocessing/log_mel_32k.yaml#
file_handler:
  autrainer.datasets.utils.AudioFileHandler:
    target_sample_rate: 32000
pipeline:
  - autrainer.transforms.StereoToMono
  - autrainer.transforms.PannMel:
      sample_rate: 32000
      window_size: 1024
      hop_size: 320
      mel_bins: 64
      fmin: 50
      fmax: 14000
      ref: 1.0
      amin: 1e-10
      top_db: null

# adapted from:
# https://github.com/qiuqiangkong/audioset_tagging_cnn/blob/d2f4b8c18eab44737fcc0de1248ae21eb43f6aa4/pytorch/main.py#L315
conf/preprocessing/log_mel_8k.yaml#
file_handler:
  autrainer.datasets.utils.AudioFileHandler:
    target_sample_rate: 8000
pipeline:
  - autrainer.transforms.StereoToMono
  - autrainer.transforms.PannMel:
      sample_rate: 8000
      window_size: 256
      hop_size: 80
      mel_bins: 64
      fmin: 50
      fmax: 4000
      ref: 1.0
      amin: 1e-10
      top_db: null

# adapted from:
# https://github.com/qiuqiangkong/audioset_tagging_cnn/blob/d2f4b8c18eab44737fcc0de1248ae21eb43f6aa4/pytorch/models.py#L2644

openSMILE Features

openSMILE is a widely used feature extraction tool for audio data.

autrainer provides default configurations for extracting openSMILE features. openSMILE is an optional dependency and may need to be installed separately. For installation, refer to the installation guide.

conf/preprocessing/eGeMAPSv02-llds.yaml#
file_handler:
  autrainer.datasets.utils.AudioFileHandler:
    target_sample_rate: 16000
pipeline:
  - autrainer.transforms.StereoToMono
  - autrainer.transforms.OpenSMILE:
      feature_set: eGeMAPSv02
      sample_rate: 16000
conf/preprocessing/eGeMAPSv02.yaml#
file_handler:
  autrainer.datasets.utils.AudioFileHandler:
    target_sample_rate: 16000
pipeline:
  - autrainer.transforms.StereoToMono
  - autrainer.transforms.OpenSMILE:
      feature_set: eGeMAPSv02
      sample_rate: 16000
      functionals: true
conf/preprocessing/ComParE_2016-llds.yaml#
file_handler:
  autrainer.datasets.utils.AudioFileHandler:
    target_sample_rate: 16000
pipeline:
  - autrainer.transforms.StereoToMono
  - autrainer.transforms.OpenSMILE:
      feature_set: ComParE_2016
      sample_rate: 16000
conf/preprocessing/ComParE_2016.yaml#
file_handler:
  autrainer.datasets.utils.AudioFileHandler:
    target_sample_rate: 16000
pipeline:
  - autrainer.transforms.StereoToMono
  - autrainer.transforms.OpenSMILE:
      feature_set: ComParE_2016
      sample_rate: 16000
      functionals: true

Online Transforms#

Online transforms are specified with the transform attribute of both the model and the dataset configuration and are applied to the input data before it is passed to the model. Each pipeline is specified as a list of transforms using Shorthand Syntax.

Tip

To create custom online transforms, refer to the custom online transforms tutorial.

If augmentations are used, they are merged with the online transforms to create a single pipeline.

Model and dataset configurations specify the transforms to be applied to the input data. Each transform configuration includes a type attribute, specifying the type of data the model expects or the dataset provides:

  • image: RGB images

  • grayscale: Grayscale images (e.g. spectrograms or other single-channel images)

  • raw: Raw audio waveforms

  • tabular: Tabular data (e.g. openSMILE features)

Note

The conversion between RGB and grayscale images is handled automatically by the pipeline to ensure compatibility between models and datasets.

Different transforms can be specified for train, dev, and test pipelines. In addition, base provides a common set of transforms to include in all pipelines.

For example, the following configuration applies different RandomCrop transforms for the train, dev, and test pipelines (which may not necessarily be useful in practice):

conf/model/SpectrogramModel.yaml#
...
transform:
  type: grayscale
  train:
    - autrainer.transforms.RandomCrop:
        size: 111
  dev:
    - autrainer.transforms.RandomCrop:
        size: 121
  test:
    - autrainer.transforms.RandomCrop:
        size: 131

Models and datasets can remove an existing transform from the pipeline to ensure compatibility between datasets and models. Removal is done by setting the transform to null in the configuration. If a transform is set to null but is not present in the pipeline, the removal is ignored.

For example, a model taking in spectrograms could set the Normalize transform to null to remove it from the pipeline as it is not needed for spectrograms:

conf/model/SpectrogramModel.yaml#
...
transform:
  type: grayscale
  base:
    - autrainer.transforms.Normalize: null

Note

Model online transforms take precedence over dataset online transforms. If a model specifies a transform that is also specified in the dataset, the model transform is used, overriding the dataset transform.
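For example (two hypothetical configuration excerpts), if both the dataset and the model specify a Normalize transform, the model's parameters win:

# conf/dataset/SpectrogramDataset.yaml (hypothetical excerpt)
transform:
  type: grayscale
  base:
    - autrainer.transforms.Normalize:
        mean: [0.5]
        std: [0.5]

# conf/model/SpectrogramModel.yaml (hypothetical excerpt)
transform:
  type: grayscale
  base:
    - autrainer.transforms.Normalize:
        mean: [0.0]
        std: [1.0]

The composed pipeline applies Normalize with a mean of 0.0 and a standard deviation of 1.0, as the model transform overrides the dataset transform.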

Transform Manager#

The TransformManager is responsible for building the transformation pipeline based on the model and dataset configurations. In addition, it handles the inclusion of augmentations.

class autrainer.transforms.TransformManager(model_transform, dataset_transform, train_augmentation=None, dev_augmentation=None, test_augmentation=None)[source]#

Transform manager for composing transforms for the train, dev, and test datasets. Automatically handles the creation of all transformation pipelines including incorporating optional augmentations.

Parameters:
  • model_transform (Union[DictConfig, Dict]) – The model transform configuration.

  • dataset_transform (Union[DictConfig, Dict]) – The dataset transform configuration.

  • train_augmentation (Optional[SmartCompose]) – The train augmentation pipeline. Defaults to None.

  • dev_augmentation (Optional[SmartCompose]) – The dev augmentation pipeline. Defaults to None.

  • test_augmentation (Optional[SmartCompose]) – The test augmentation pipeline. Defaults to None.

get_transforms()[source]#

Get the composed transform pipelines for the train, dev, and test datasets.

Return type:

Tuple[SmartCompose, SmartCompose, SmartCompose]

Returns:

The composed transform pipelines for the train, dev, and test datasets.
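A minimal usage sketch (the exact dictionary structure accepted by the constructor is an assumption here; it mirrors the transform configurations shown above):

from autrainer.transforms import TransformManager

# hypothetical transform configurations for a grayscale model and dataset
model_transform = {"type": "grayscale", "base": [], "train": [], "dev": [], "test": []}
dataset_transform = {"type": "grayscale", "base": [], "train": [], "dev": [], "test": []}

manager = TransformManager(model_transform, dataset_transform)
train_pipeline, dev_pipeline, test_pipeline = manager.get_transforms()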

Smart Compose#

The SmartCompose is a helper that allows for the composition and ordering of transformations.

class autrainer.transforms.SmartCompose(transforms, **kwargs)[source]#

SmartCompose wrapper for torchvision.transforms.Compose, which allows for simple composition of transforms by adding them together. Transforms are automatically sorted by their order attribute if present.

Additionally, SmartCompose allows for the specification of a collate function to be used in a DataLoader; the last collate function specified will be used.

Parameters:
  • transforms (List[Union[Compose, AbstractTransform, Callable]]) – List of transforms to compose.

  • **kwargs – Additional keyword arguments to pass to torchvision.transforms.Compose.

__add__(other)[source]#

Add another transform to the composition. Transforms are automatically sorted by their order attribute if present.

Parameters:

other (Union[Compose, AbstractTransform, Callable, List[Callable]]) – Transform to add to the composition.

Raises:

TypeError – If the addition is not a valid type. Supported types are ‘torchvision.transforms.Compose’, ‘AbstractTransform’, ‘Callable’, or ‘List[Callable]’.

Return type:

SmartCompose

Returns:

New SmartCompose object with the added transform.

get_collate_fn(data)[source]#

Get the collate function. If no collate function is present in the transforms, None is returned. If multiple collate functions are present, the last one is used.

Parameters:

data (AbstractDataset) – Dataset to get the collate function for.

Return type:

Optional[Callable]

Returns:

Collate function.

__call__(x, index)[source]#

Apply the transforms to the input tensor.

Parameters:
  • x (Tensor) – Input tensor.

  • index (int) – Dataset index of the input tensor.

Return type:

Tensor

Returns:

Transformed tensor.
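A minimal composition sketch using the documented __add__ behavior (parameter values are taken from the log_mel_16k configuration above; the input shape is an assumption):

import torch

from autrainer.transforms import PannMel, SmartCompose, StereoToMono

# transforms are sorted by order: StereoToMono (-95) runs before PannMel (-90)
pipeline = SmartCompose([StereoToMono()]) + PannMel(
    window_size=512,
    hop_size=160,
    sample_rate=16000,
    fmin=50,
    fmax=8000,
    mel_bins=64,
    ref=1.0,
    amin=1e-10,
    top_db=None,
)

waveform = torch.randn(2, 16000)  # 1 second of stereo audio at 16 kHz
spectrogram = pipeline(waveform, index=0)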

Abstract Transform#

All transforms must inherit from AbstractTransform and implement the __call__() method.

class autrainer.transforms.AbstractTransform(order=0, **kwargs)[source]#

Abstract class for a transform.

Parameters:
  • order (int) – The order of the transform in the pipeline. Larger means later in the pipeline. If multiple transforms have the same order, they are applied in the order they were added to the pipeline. Defaults to 0.

  • kwargs – Additional keyword arguments to store in the object.

abstract __call__(x, index=None)[source]#

Apply the transform to the input tensor.

Parameters:
  • x (Tensor) – The input tensor.

  • index (Optional[int]) – The index of the input tensor in the dataset. Defaults to None.

Return type:

Tensor

Returns:

The transformed tensor.
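As a minimal sketch of a custom transform (the ClampTransform name and its parameters are hypothetical):

from typing import Optional

import torch

from autrainer.transforms import AbstractTransform


class ClampTransform(AbstractTransform):
    """Hypothetical transform that clamps tensor values to a range."""

    def __init__(
        self,
        min_val: float = 0.0,
        max_val: float = 1.0,
        order: int = 0,
    ) -> None:
        super().__init__(order=order)
        self.min_val = min_val
        self.max_val = max_val

    def __call__(self, x: torch.Tensor, index: Optional[int] = None) -> torch.Tensor:
        # clamp all values to the [min_val, max_val] range
        return torch.clamp(x, self.min_val, self.max_val)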

Available Transforms#

autrainer provides a set of predefined transforms that can be used in any configuration.

class autrainer.transforms.AnyToTensor(order=-100)[source]#

Convert a numpy array, torch tensor, or a PIL image to a torch tensor.

Parameters:

order (int) – The order of the transform in the pipeline. Defaults to -100.

class autrainer.transforms.Expand(size, method='pad', axis=-2, order=-85)[source]#

Expand a tensor along a specific axis to a specific size, ensuring it is at least the specified size. If the tensor is smaller than the target size, it will be padded with zeros. If the tensor is larger than the target size, it will not be cropped.

Parameters:
  • size (int) – The target size.

  • method (str) – The method to use in [“pad”, “replicate”]. Defaults to “pad”.

  • axis (int) – The axis along which to expand. Defaults to -2.

  • order (int) – The order of the transform in the pipeline. Defaults to -85.

class autrainer.transforms.FeatureExtractor(fe_type=None, fe_transfer=None, sampling_rate=16000, order=-80)[source]#

Extract features from an audio signal using a feature extractor from the Hugging Face Transformers library.

Parameters:
  • fe_type (Optional[str]) – The class of feature extractor to use in [“AST”, “Whisper”, “W2V2”, None]. If None, the AutoFeatureExtractor will be used. Defaults to None.

  • fe_transfer (Optional[str]) – The name of a pretrained feature extractor to use. If None, the feature extractor will be initialized with default values. Defaults to None.

  • sampling_rate (int) – The sampling rate of the audio signal. Defaults to 16000.

  • order (int) – The order of the transform in the pipeline. Defaults to -80.

Raises:

ValueError – If neither ‘fe_type’ nor ‘fe_transfer’ is provided.

class autrainer.transforms.GrayscaleToRGB(order=100)[source]#

Convert a grayscale image to an RGB image.

Parameters:

order (int) – The order of the transform in the pipeline. Defaults to 100.

class autrainer.transforms.Normalize(mean, std, order=95)[source]#

Normalize a tensor with a mean and standard deviation.

Parameters:
  • mean (List[float]) – The mean to use for normalization.

  • std (List[float]) – The standard deviation to use for normalization.

  • order (int) – The order of the transform in the pipeline. Defaults to 95.

class autrainer.transforms.NumpyToTensor(order=-100)[source]#

Convert a numpy array to a torch tensor.

Parameters:

order (int) – The order of the transform in the pipeline. Defaults to -100.

class autrainer.transforms.OpenSMILE(feature_set, sample_rate, functionals=False, order=-80)[source]#

Extract features from an audio signal using openSMILE.

The openSMILE library must be installed to use this transform. To install the required extras, run: ‘pip install autrainer[opensmile]’.

Parameters:
  • feature_set (str) – The feature set to use.

  • sample_rate (int) – The sample rate of the audio signal.

  • functionals (bool) – Whether to use functionals. Defaults to False.

  • order (int) – The order of the transform in the pipeline. Defaults to -80.

Raises:

ImportError – If openSMILE is not available.

class autrainer.transforms.PannMel(window_size, hop_size, sample_rate, fmin, fmax, mel_bins, ref, amin, top_db, order=-90)[source]#

Create a log-Mel spectrogram from an audio signal analogous to the Pretrained Audio Neural Networks (PANN) implementation.

For more information, see: https://doi.org/10.48550/arXiv.1912.10211

Parameters:
  • window_size (int) – The size of the window.

  • hop_size (int) – Hop length.

  • sample_rate (int) – The sample rate of the audio signal.

  • fmin (int) – The minimum frequency.

  • fmax (int) – The maximum frequency.

  • mel_bins (int) – The number of mel bins.

  • ref (float) – The reference amplitude.

  • amin (float) – The minimum amplitude.

  • top_db (int) – The top decibel.

  • order (int) – The order of the transform in the pipeline. Defaults to -90.

class autrainer.transforms.RandomCrop(size, method='pad', axis=-2, fix_randomization=False, order=-90)[source]#

Randomly crop a tensor along a specific axis to a specific size, ensuring it is the specified size.

If the tensor is larger than the target size, it will be randomly cropped along the specified axis.

If the tensor is smaller than the target size, it will be padded.

Parameters:
  • size (int) – The target size.

  • method (str) – Padding method to use if the tensor is smaller than the target size in [“pad”, “replicate”]. Defaults to “pad”.

  • axis (int) – The axis along which to crop. Defaults to -2.

  • fix_randomization (bool) – Whether to fix the randomization. Defaults to False.

  • order (int) – The order of the transform in the pipeline. Defaults to -90.
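A short sketch of the cropping behavior (assuming a (channels, time, frequency) tensor layout):

import torch

from autrainer.transforms import RandomCrop

crop = RandomCrop(size=101)  # operates on axis -2 by default
spectrogram = torch.randn(1, 128, 64)
cropped = crop(spectrogram, index=0)  # time axis randomly cropped from 128 to 101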

class autrainer.transforms.Resample(current_sr, target_sr, order=-95)[source]#

Resample an audio signal to a target sample rate.

Parameters:
  • current_sr (int) – The current sample rate.

  • target_sr (int) – The target sample rate.

  • order (int) – The order of the transform in the pipeline. Defaults to -95.

class autrainer.transforms.Resize(height=64, width=64, interpolation='bilinear', antialias=True, order=-90)[source]#

Resize an image to a specific height and width. If “Any” is provided for the height or width, the other dimension will be used as the size, creating a square image.

Parameters:
  • height (int) – The target height. If set to “Any”, the width will be used as the size. Defaults to 64.

  • width (int) – The target width. If set to “Any”, the height will be used as the size. Defaults to 64.

  • interpolation (str) – The interpolation method to use. Defaults to ‘bilinear’.

  • antialias (bool) – Whether to use antialiasing. Defaults to True.

  • order (int) – The order of the transform in the pipeline. Defaults to -90.

class autrainer.transforms.RGBAToRGB(order=-95)[source]#

Convert an RGBA image to an RGB image by dropping the alpha channel.

Parameters:

order (int) – The order of the transform in the pipeline. Defaults to -95.

class autrainer.transforms.RGBAToGrayscale(order=-95)[source]#

Convert an RGBA image to grayscale by dropping the alpha channel and converting to grayscale.

For the conversion to grayscale, the luminance is calculated in line with the implementation in torchvision.transforms.Grayscale: Y = 0.2989 * R + 0.587 * G + 0.114 * B.

Parameters:

order (int) – The order of the transform in the pipeline. Defaults to -95.

class autrainer.transforms.ScaleRange(range=[0.0, 1.0], order=90)[source]#

Scale a tensor to a specific target range.

Parameters:
  • range (List[float]) – The range to scale the tensor to. Defaults to [0.0, 1.0].

  • order (int) – The order of the transform in the pipeline. Defaults to 90.

class autrainer.transforms.RGBToGrayscale(order=100)[source]#

Convert an RGB image to a grayscale image.

For the conversion to grayscale, the luminance is calculated in line with the implementation in torchvision.transforms.Grayscale: Y = 0.2989 * R + 0.587 * G + 0.114 * B.

Parameters:

order (int) – The order of the transform in the pipeline. Defaults to 100.

class autrainer.transforms.SpectToImage(height, width, cmap='magma', order=-90)[source]#

Convert a spectrogram to a 3-channel image based on a colormap.

Parameters:
  • height (int) – The height of the image.

  • width (int) – The width of the image.

  • cmap (str) – The colormap to use. Defaults to “magma”.

  • order (int) – The order of the transform in the pipeline. Defaults to -90.

class autrainer.transforms.SquarePadCrop(mode='pad', order=-91)[source]#

Pad or crop an image to make it square. If the image is padded, the padding will be added to the shorter sides as black pixels.

Parameters:
  • mode (str) – The mode to use in [“pad”, “crop”]. Defaults to “pad”.

  • order (int) – The order of the transform in the pipeline. Defaults to -91.

Raises:

ValueError – If an invalid mode is provided.

class autrainer.transforms.StereoToMono(order=-95)[source]#

Convert a stereo audio signal to mono by averaging over the first (channel) dimension.

Parameters:

order (int) – The order of the transform in the pipeline. Defaults to -95.