Transforms#

Transforms preprocess the input data before it is fed to the model and are specified as pipelines using Shorthand Syntax. They can be applied both offline (preprocessing) and online.

Tip

To create custom transforms, refer to the custom transforms tutorial.

Multiple transforms are combined into a pipeline, which is managed by the TransformManager. SmartCompose automatically assembles each pipeline and sorts it based on the order attribute of every transform.

Transforms with a lower order are applied first, followed by transforms with a higher order. If two transforms share the same order, they are applied in the order they are specified in the configuration.

Both types of transforms can utilize any of the available transforms provided by autrainer, or custom transforms inheriting from the AbstractTransform class.

While the choice to use offline or online transforms depends on the use case and is a tradeoff between storage and computational costs, both can be used in conjunction for maximum flexibility.

Preprocessing Transforms#

Preprocessing transforms are specified in the dataset configuration by aligning the features_subdir attribute with the name of the preprocessing file. This way, the processed data is stored in a subdirectory of the dataset directory that shares the name of the preprocessing file. The processed data is saved using the file_handler specified in the dataset configuration.
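For example, a dataset configuration might reference the log_mel_16k preprocessing configuration shown further below (a hypothetical excerpt; the dataset name, the surrounding keys, and the NumpyFileHandler are assumptions):

# conf/dataset/MyAudioDataset.yaml (hypothetical excerpt)
features_subdir: log_mel_16k  # matches conf/preprocessing/log_mel_16k.yaml
file_handler: autrainer.datasets.utils.NumpyFileHandler  # assumption: handler used to store the extracted features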

Tip

To create custom preprocessing transforms, refer to the custom preprocessing transforms tutorial.

Preprocessing transforms consist of the following attributes:

  • file_handler: The file handler to use for loading the data specified with Shorthand Syntax. To discover all available file handlers, refer to the file handlers section.

  • pipeline: The sequence of transformations to apply to the data. The pipeline is specified using Shorthand Syntax and can include any of the available transforms.

Note

To avoid race conditions during parallel training, the preprocessing is applied to the entire dataset before training.

To automatically preprocess all datasets specified in the main configuration (conf/config.yaml) file, the following preprocessing CLI command can be used:

autrainer preprocess

Alternatively, use the following preprocessing CLI wrapper function:

autrainer.cli.preprocess()
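For example, in a script (a minimal sketch using only the documented default invocation):

import autrainer.cli

# preprocess all datasets specified in conf/config.yaml
autrainer.cli.preprocess()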

autrainer offers default configurations for log-Mel spectrogram extraction and openSMILE feature extraction.

Default Configurations

Log-Mel Spectrograms

Log-Mel spectrograms are a common representation of audio data.

autrainer offers default configurations for log-Mel spectrogram extraction for different sampling rates. The generation process is adapted from: PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition.

conf/preprocessing/log_mel_16k.yaml#
file_handler:
  autrainer.datasets.utils.AudioFileHandler:
    target_sample_rate: 16000
pipeline:
  - autrainer.transforms.StereoToMono
  - autrainer.transforms.PannMel:
      sample_rate: 16000
      window_size: 512
      hop_size: 160
      mel_bins: 64
      fmin: 50
      fmax: 8000
      ref: 1.0
      amin: 1e-10
      top_db: null

# adapted from:
# https://github.com/qiuqiangkong/audioset_tagging_cnn/blob/d2f4b8c18eab44737fcc0de1248ae21eb43f6aa4/pytorch/models.py#L2547
conf/preprocessing/log_mel_32k.yaml#
file_handler:
  autrainer.datasets.utils.AudioFileHandler:
    target_sample_rate: 32000
pipeline:
  - autrainer.transforms.StereoToMono
  - autrainer.transforms.PannMel:
      sample_rate: 32000
      window_size: 1024
      hop_size: 320
      mel_bins: 64
      fmin: 50
      fmax: 14000
      ref: 1.0
      amin: 1e-10
      top_db: null

# adapted from:
# https://github.com/qiuqiangkong/audioset_tagging_cnn/blob/d2f4b8c18eab44737fcc0de1248ae21eb43f6aa4/pytorch/main.py#L315
conf/preprocessing/log_mel_8k.yaml#
file_handler:
  autrainer.datasets.utils.AudioFileHandler:
    target_sample_rate: 8000
pipeline:
  - autrainer.transforms.StereoToMono
  - autrainer.transforms.PannMel:
      sample_rate: 8000
      window_size: 256
      hop_size: 80
      mel_bins: 64
      fmin: 50
      fmax: 4000
      ref: 1.0
      amin: 1e-10
      top_db: null

# adapted from:
# https://github.com/qiuqiangkong/audioset_tagging_cnn/blob/d2f4b8c18eab44737fcc0de1248ae21eb43f6aa4/pytorch/models.py#L2644

openSMILE Features

openSMILE is a widely used feature extraction tool for audio data.

autrainer provides default configurations for extracting openSMILE features. openSMILE is an optional dependency and may need to be installed separately. For installation, refer to the installation guide.

conf/preprocessing/eGeMAPSv02-llds.yaml#
file_handler:
  autrainer.datasets.utils.AudioFileHandler:
    target_sample_rate: 16000
pipeline:
  - autrainer.transforms.StereoToMono
  - autrainer.transforms.OpenSMILE:
      feature_set: eGeMAPSv02
      sample_rate: 16000
conf/preprocessing/eGeMAPSv02.yaml#
file_handler:
  autrainer.datasets.utils.AudioFileHandler:
    target_sample_rate: 16000
pipeline:
  - autrainer.transforms.StereoToMono
  - autrainer.transforms.OpenSMILE:
      feature_set: eGeMAPSv02
      sample_rate: 16000
      functionals: true
conf/preprocessing/ComParE_2016-llds.yaml#
file_handler:
  autrainer.datasets.utils.AudioFileHandler:
    target_sample_rate: 16000
pipeline:
  - autrainer.transforms.StereoToMono
  - autrainer.transforms.OpenSMILE:
      feature_set: ComParE_2016
      sample_rate: 16000
conf/preprocessing/ComParE_2016.yaml#
file_handler:
  autrainer.datasets.utils.AudioFileHandler:
    target_sample_rate: 16000
pipeline:
  - autrainer.transforms.StereoToMono
  - autrainer.transforms.OpenSMILE:
      feature_set: ComParE_2016
      sample_rate: 16000
      functionals: true

Online Transforms#

Online transforms are specified with the transform attribute of both the model and the dataset configuration and are applied to the input data before it is passed to the model. Each pipeline is specified as a list of transforms using Shorthand Syntax.

Tip

To create custom online transforms, refer to the custom online transforms tutorial.

If augmentations are used, they are merged with the online transforms to create a single pipeline.

Model and dataset configurations specify the transforms to be applied to the input data. Each transform configuration includes a type attribute, specifying the type of data the model expects or the dataset provides:

  • image: RGB images

  • grayscale: Grayscale images (e.g. spectrograms or other single-channel images)

  • raw: Raw audio waveforms

  • tabular: Tabular data (e.g. openSMILE features)

Note

The conversion between RGB and grayscale images is handled automatically by the pipeline to ensure compatibility between models and datasets.

Different transforms can be specified for train, dev, and test pipelines. In addition, base provides a common set of transforms to include in all pipelines.

For example, the following configuration applies different RandomCrop transforms for the train, dev, and test pipelines (which may not necessarily be useful in practice):

conf/model/SpectrogramModel.yaml#
...
transform:
  type: grayscale
  train:
    - autrainer.transforms.RandomCrop:
        size: 111
  dev:
    - autrainer.transforms.RandomCrop:
        size: 121
  test:
    - autrainer.transforms.RandomCrop:
        size: 131

Models and datasets can remove an existing transform from the pipeline to ensure compatibility between datasets and models. Removal is done by setting the transform to null in the configuration. If a transform is set to null but is not present in the pipeline, the removal is ignored.

For example, a model taking in spectrograms could set the Normalize transform to null to remove it from the pipeline as it is not needed for spectrograms:

conf/model/SpectrogramModel.yaml#
...
transform:
  type: grayscale
  base:
    - autrainer.transforms.Normalize: null

Note

Model online transforms take precedence over dataset online transforms. If a model specifies a transform that is also specified in the dataset, the model transform is used, overriding the dataset transform.
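For example (two hypothetical configuration excerpts), if both the dataset and the model specify a Normalize transform, the model's parameters win:

# conf/dataset/SpectrogramDataset.yaml (hypothetical excerpt)
transform:
  type: grayscale
  base:
    - autrainer.transforms.Normalize:
        mean: [0.5]
        std: [0.5]

# conf/model/SpectrogramModel.yaml (hypothetical excerpt)
transform:
  type: grayscale
  base:
    - autrainer.transforms.Normalize:
        mean: [0.0]
        std: [1.0]

The composed pipeline applies Normalize with a mean of 0.0 and a standard deviation of 1.0, as the model transform overrides the dataset transform.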

Transform Manager#

The TransformManager is responsible for building the transformation pipeline based on the model and dataset configurations. In addition, it handles the inclusion of augmentations.

class autrainer.transforms.TransformManager(model_transform, dataset_transform, train_augmentation=None, dev_augmentation=None, test_augmentation=None)[source]#

Transform manager for composing transforms for the train, dev, and test datasets. Automatically handles the creation of all transformation pipelines including incorporating optional augmentations.

Parameters:
  • model_transform (Union[DictConfig, Dict]) – The model transform configuration.

  • dataset_transform (Union[DictConfig, Dict]) – The dataset transform configuration.

  • train_augmentation (Optional[SmartCompose]) – The train augmentation pipeline. Defaults to None.

  • dev_augmentation (Optional[SmartCompose]) – The dev augmentation pipeline. Defaults to None.

  • test_augmentation (Optional[SmartCompose]) – The test augmentation pipeline. Defaults to None.

get_transforms()[source]#

Get the composed transform pipelines for the train, dev, and test datasets.

Return type:

Tuple[SmartCompose, SmartCompose, SmartCompose]

Returns:

The composed transform pipelines for the train, dev, and test datasets.
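A minimal usage sketch (the exact dictionary structure accepted by the constructor is an assumption here; it mirrors the transform configurations shown above):

from autrainer.transforms import TransformManager

# hypothetical transform configurations for a grayscale model and dataset
model_transform = {"type": "grayscale", "base": [], "train": [], "dev": [], "test": []}
dataset_transform = {"type": "grayscale", "base": [], "train": [], "dev": [], "test": []}

manager = TransformManager(model_transform, dataset_transform)
train_pipeline, dev_pipeline, test_pipeline = manager.get_transforms()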

Smart Compose#

The SmartCompose is a helper that allows for the composition and ordering of transformations.

class autrainer.transforms.SmartCompose(transforms, **kwargs)[source]#

SmartCompose wrapper for torchvision.transforms.Compose, which allows for simple composition of transforms by adding them together. Transforms are automatically sorted by their order attribute if present.

Additionally, SmartCompose allows for the specification of a collate function to be used in a DataLoader; the last collate function specified will be used.

Parameters:
  • transforms (List[Union[Compose, AbstractTransform, Callable]]) – List of transforms to compose.

  • **kwargs – Additional keyword arguments to pass to torchvision.transforms.Compose.

__add__(other)[source]#

Add another transform to the composition. Transforms are automatically sorted by their order attribute if present.

Parameters:

other (Union[Compose, AbstractTransform, Callable, List[Callable]]) – Transform to add to the composition.

Raises:

TypeError – If the addition is not a valid type. Supported types are ‘torchvision.transforms.Compose’, ‘AbstractTransform’, ‘Callable’, or ‘List[Callable]’.

Return type:

SmartCompose

Returns:

New SmartCompose object with the added transform.

get_collate_fn(data)[source]#

Get the collate function. If no collate function is present in the transforms, None is returned. If multiple collate functions are present, the last one is used.

Parameters:

data (AbstractDataset) – Dataset to get the collate function for.

Return type:

Optional[Callable]

Returns:

Collate function.

__call__(x, index)[source]#

Apply the transforms to the input tensor.

Parameters:
  • x (Tensor) – Input tensor.

  • index (int) – Dataset index of the input tensor.

Return type:

Tensor

Returns:

Transformed tensor.
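A minimal composition sketch using the documented __add__ behavior (parameter values are taken from the log_mel_16k configuration above; the input shape is an assumption):

import torch

from autrainer.transforms import PannMel, SmartCompose, StereoToMono

# transforms are sorted by order: StereoToMono (-95) runs before PannMel (-90)
pipeline = SmartCompose([StereoToMono()]) + PannMel(
    window_size=512,
    hop_size=160,
    sample_rate=16000,
    fmin=50,
    fmax=8000,
    mel_bins=64,
    ref=1.0,
    amin=1e-10,
    top_db=None,
)

waveform = torch.randn(2, 16000)  # 1 second of stereo audio at 16 kHz
spectrogram = pipeline(waveform, index=0)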

Abstract Transform#

All transforms must inherit from AbstractTransform and implement the __call__() method.

class autrainer.transforms.AbstractTransform(order=0, **kwargs)[source]#

Abstract class for a transform.

Parameters:
  • order (int) – The order of the transform in the pipeline. Larger means later in the pipeline. If multiple transforms have the same order, they are applied in the order they were added to the pipeline. Defaults to 0.

  • kwargs – Additional keyword arguments to store in the object.

abstract __call__(x, index=None)[source]#

Apply the transform to the input tensor.

Parameters:
  • x (Tensor) – The input tensor.

  • index (Optional[int]) – The index of the input tensor in the dataset. Defaults to None.

Return type:

Tensor

Returns:

The transformed tensor.
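As a minimal sketch of a custom transform (the ClampTransform name and its parameters are hypothetical):

from typing import Optional

import torch

from autrainer.transforms import AbstractTransform


class ClampTransform(AbstractTransform):
    """Hypothetical transform that clamps tensor values to a range."""

    def __init__(
        self,
        min_val: float = 0.0,
        max_val: float = 1.0,
        order: int = 0,
    ) -> None:
        super().__init__(order=order)
        self.min_val = min_val
        self.max_val = max_val

    def __call__(self, x: torch.Tensor, index: Optional[int] = None) -> torch.Tensor:
        # clamp all values to the [min_val, max_val] range
        return torch.clamp(x, self.min_val, self.max_val)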

Available Transforms#

autrainer provides a set of predefined transforms that can be used in any configuration.

class autrainer.transforms.AnyToTensor(order=-100)[source]#

Convert a numpy array, torch tensor, or a PIL image to a torch tensor.

Parameters:

order (int) – The order of the transform in the pipeline. Defaults to -100.

class autrainer.transforms.Expand(size, method='pad', axis=-2, order=-85)[source]#

Expand a tensor along a specific axis to a specific size, ensuring it is at least the specified size. If the tensor is smaller than the target size, it will be padded with zeros. If the tensor is larger than the target size, it will not be cropped.

Parameters:
  • size (int) – The target size.

  • method (str) – The method to use in [“pad”, “replicate”]. Defaults to “pad”.

  • axis (int) – The axis along which to expand. Defaults to -2.

  • order (int) – The order of the transform in the pipeline. Defaults to -85.

class autrainer.transforms.FeatureExtractor(fe_type=None, fe_transfer=None, sampling_rate=16000, order=-80)[source]#

Extract features from an audio signal using a feature extractor from the Hugging Face Transformers library.

Parameters:
  • fe_type (Optional[str]) – The class of feature extractor to use in [“AST”, “Whisper”, “W2V2”, None]. If None, the AutoFeatureExtractor will be used. Defaults to None.

  • fe_transfer (Optional[str]) – The name of a pretrained feature extractor to use. If None, the feature extractor will be initialized with default values. Defaults to None.

  • sampling_rate (int) – The sampling rate of the audio signal. Defaults to 16000.

  • order (int) – The order of the transform in the pipeline. Defaults to -80.

Raises:

ValueError – If neither ‘fe_type’ nor ‘fe_transfer’ is provided.

class autrainer.transforms.GrayscaleToRGB(order=100)[source]#

Convert a grayscale image to an RGB image.

Parameters:

order (int) – The order of the transform in the pipeline. Defaults to 100.

class autrainer.transforms.Normalize(mean, std, order=95)[source]#

Normalize a tensor with a mean and standard deviation.

Parameters:
  • mean (List[float]) – The mean to use for normalization.

  • std (List[float]) – The standard deviation to use for normalization.

  • order (int) – The order of the transform in the pipeline. Defaults to 95.

class autrainer.transforms.NumpyToTensor(order=-100)[source]#

Convert a numpy array to a torch tensor.

Parameters:

order (int) – The order of the transform in the pipeline. Defaults to -100.

class autrainer.transforms.OpenSMILE(feature_set, sample_rate, functionals=False, order=-80)[source]#

Extract features from an audio signal using openSMILE.

The openSMILE library must be installed to use this transform. To install the required extras, run: ‘pip install autrainer[opensmile]’.

Parameters:
  • feature_set (str) – The feature set to use.

  • sample_rate (int) – The sample rate of the audio signal.

  • functionals (bool) – Whether to use functionals. Defaults to False.

  • order (int) – The order of the transform in the pipeline. Defaults to -80.

Raises:

ImportError – If openSMILE is not available.

class autrainer.transforms.PannMel(window_size, hop_size, sample_rate, fmin, fmax, mel_bins, ref, amin, top_db, order=-90)[source]#

Create a log-Mel spectrogram from an audio signal analogous to the Pretrained Audio Neural Networks (PANN) implementation.

For more information, see: https://doi.org/10.48550/arXiv.1912.10211

Parameters:
  • window_size (int) – The size of the window.

  • hop_size (int) – Hop length.

  • sample_rate (int) – The sample rate of the audio signal.

  • fmin (int) – The minimum frequency.

  • fmax (int) – The maximum frequency.

  • mel_bins (int) – The number of mel bins.

  • ref (float) – The reference amplitude.

  • amin (float) – The minimum amplitude.

  • top_db (int) – The top decibel.

  • order (int) – The order of the transform in the pipeline. Defaults to -90.

class autrainer.transforms.RandomCrop(size, method='pad', axis=-2, fix_randomization=False, order=-90)[source]#

Randomly crop a tensor along a specific axis to a specific size, ensuring it is the specified size.

If the tensor is larger than the target size, it will be randomly cropped along the specified axis.

If the tensor is smaller than the target size, it will be padded.

Parameters:
  • size (int) – The target size.

  • method (str) – Padding method to use if the tensor is smaller than the target size in [“pad”, “replicate”]. Defaults to “pad”.

  • axis (int) – The axis along which to crop. Defaults to -2.

  • fix_randomization (bool) – Whether to fix the randomization. Defaults to False.

  • order (int) – The order of the transform in the pipeline. Defaults to -90.
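A short sketch of the cropping behavior (assuming a (channels, time, frequency) tensor layout):

import torch

from autrainer.transforms import RandomCrop

crop = RandomCrop(size=101)  # operates on axis -2 by default
spectrogram = torch.randn(1, 128, 64)
cropped = crop(spectrogram, index=0)  # time axis randomly cropped from 128 to 101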

class autrainer.transforms.Resample(current_sr, target_sr, order=-95)[source]#

Resample an audio signal to a target sample rate.

Parameters:
  • current_sr (int) – The current sample rate.

  • target_sr (int) – The target sample rate.

  • order (int) – The order of the transform in the pipeline. Defaults to -95.

class autrainer.transforms.Resize(height=64, width=64, interpolation='bilinear', antialias=True, order=-90)[source]#

Resize an image to a specific height and width. If “Any” is provided for the height or width, the other dimension will be used as the size, creating a square image.

Parameters:
  • height (int) – The target height. If set to “Any”, the width will be used as the size. Defaults to 64.

  • width (int) – The target width. If set to “Any”, the height will be used as the size. Defaults to 64.

  • interpolation (str) – The interpolation method to use. Defaults to ‘bilinear’.

  • antialias (bool) – Whether to use antialiasing. Defaults to True.

  • order (int) – The order of the transform in the pipeline. Defaults to -90.

class autrainer.transforms.RGBAToRGB(order=-95)[source]#

Convert an RGBA image to an RGB image by dropping the alpha channel.

Parameters:

order (int) – The order of the transform in the pipeline. Defaults to -95.

class autrainer.transforms.RGBAToGrayscale(order=-95)[source]#

Convert an RGBA image to grayscale by dropping the alpha channel and converting to grayscale.

For the conversion to grayscale, the luminance is calculated in line with the implementation in torchvision.transforms.Grayscale: Y = 0.2989 * R + 0.587 * G + 0.114 * B.

Parameters:

order (int) – The order of the transform in the pipeline. Defaults to -95.

class autrainer.transforms.ScaleRange(range=[0.0, 1.0], order=90)[source]#

Scale a tensor to a specific target range.

Parameters:
  • range (List[float]) – The range to scale the tensor to. Defaults to [0.0, 1.0].

  • order (int) – The order of the transform in the pipeline. Defaults to 90.

class autrainer.transforms.RGBToGrayscale(order=100)[source]#

Convert an RGB image to a grayscale image.

For the conversion to grayscale, the luminance is calculated in line with the implementation in torchvision.transforms.Grayscale: Y = 0.2989 * R + 0.587 * G + 0.114 * B.

Parameters:

order (int) – The order of the transform in the pipeline. Defaults to 100.

class autrainer.transforms.SpectToImage(height, width, cmap='magma', order=-90)[source]#

Convert a spectrogram to a 3-channel image based on a colormap.

Parameters:
  • height (int) – The height of the image.

  • width (int) – The width of the image.

  • cmap (str) – The colormap to use. Defaults to “magma”.

  • order (int) – The order of the transform in the pipeline. Defaults to -90.

class autrainer.transforms.SquarePadCrop(mode='pad', order=-91)[source]#

Pad or crop an image to make it square. If the image is padded, the padding will be added to the shorter sides as black pixels.

Parameters:
  • mode (str) – The mode to use in [“pad”, “crop”]. Defaults to “pad”.

  • order (int) – The order of the transform in the pipeline. Defaults to -91.

Raises:

ValueError – If an invalid mode is provided.

class autrainer.transforms.StereoToMono(order=-95)[source]#

Convert a stereo audio signal to mono by averaging over the first (channel) dimension.

Parameters:

order (int) – The order of the transform in the pipeline. Defaults to -95.