Tutorials#

aucurriculum is designed to be flexible and extensible, allowing for the creation of custom scoring and pacing functions.

For each, a tutorial is provided below to demonstrate its implementation and configuration.

For the following tutorials, all Python files should be placed in the project root directory and all configuration files should be placed in the corresponding subdirectories of the conf/ directory.
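For example, after completing both tutorials, the project could look roughly like this (conf/config.yaml denotes the main configuration file referenced at the end of this page):

.
├── probability_score.py
├── discrete_convergence.py
└── conf/
    ├── config.yaml
    └── curriculum/
        ├── scoring/
        │   └── ProbabilityScore.yaml
        └── pacing/
            └── DiscreteConvergence.yaml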

Custom Scoring Functions#

To create a custom scoring function, inherit from AbstractScore and implement the run() method.

For example, the following model-based scoring function determines the difficulty of each sample by computing the probability (assuming higher probabilities indicate easier samples) of the most likely class (regardless of the true class):

probability_score.py#
import os

from autrainer.core.utils import Timer, set_device
from omegaconf import DictConfig
import torch
from torch.utils.data import DataLoader

from aucurriculum.curricula.scoring import AbstractScore


class ProbabilityScore(AbstractScore):
    def __init__(
        self,
        output_directory: str,
        results_dir: str,
        experiment_id: str,
        run_name: str,
        stop: str = "best",
        subset: str = "train",
    ) -> None:
        """Probability scoring function determining the difficulty of a sample
        based on the model's highest output probability (regardless of the true
        class).

        Args:
            output_directory: Directory where the scores will be stored.
            results_dir: The directory where the results are stored.
            experiment_id: The ID of the grid search experiment.
            run_name: Name or list of names of the runs to score. Runs can be
                single runs or aggregated runs.
            stop: Model state dict to load or to stop at in ["best", "last"].
                Defaults to "best".
            subset: Dataset subset to use for scoring in ["train", "dev",
                "test"]. Defaults to "train".
        """
        super().__init__(
            output_directory=output_directory,
            results_dir=results_dir,
            experiment_id=experiment_id,
            run_name=run_name,
            stop=stop,
            subset=subset,
            reverse_score=True,  # assume higher probabilities are easier
        )

    def run(
        self, config: DictConfig, run_config: DictConfig, run_name: str
    ) -> None:
        run_name, full_run_name = self.split_run_name(run_name)
        run_path = os.path.join(self.output_directory, full_run_name)
        data, model = self.prepare_data_and_model(run_config)
        dataset = self.get_dataset_subset(data, self.subset)
        batch_size = config.get("batch_size", run_config.get("batch_size", 32))
        loader = DataLoader(dataset, batch_size=batch_size)
        self.load_model_checkpoint(model, run_name)
        device = set_device(config.device)
        forward_timer = Timer(run_path, "model_forward")
        probabilities, labels = self.forward_pass(
            model=model,
            loader=loader,
            batch_size=batch_size,
            output_map_fn=self.score,
            tqdm_desc=run_name,
            disable_progress_bar=not config.get("progress_bar", False),
            device=device,
            timer=forward_timer,
        )
        forward_timer.save()
        df = self.create_dataframe(probabilities, labels, data)
        self.save_scores(df, run_path)

    def score(self, outputs: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        """Compute the highest probability per sample regardless of the true
        class.

        Args:
            outputs: Batch of model outputs.
            y: Batch of labels.

        Returns:
            Batch of highest probability per sample.
        """
        return torch.softmax(outputs, dim=1).max(dim=1).values
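
Since the output_map_fn receives both the model outputs and the labels, the same skeleton supports other scoring criteria. As a hypothetical drop-in replacement for the score() method above (not part of the tutorial, shown only as a sketch), the difficulty could instead be based on the probability assigned to the true class:

    def score(self, outputs: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        """Hypothetical variant: probability assigned to the true class.

        Assumes y holds integer class indices of shape (batch_size,).
        """
        probabilities = torch.softmax(outputs, dim=1)
        # pick the probability of the true class for each sample
        return probabilities.gather(1, y.long().unsqueeze(1)).squeeze(1)

With reverse_score=True passed to the base class, higher true-class probabilities would again be treated as easier samples.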

Next, create a ProbabilityScore.yaml configuration file for the scoring function in the conf/curriculum/scoring/ directory:

conf/curriculum/scoring/ProbabilityScore.yaml#
id: ProbabilityScore
type: ProbabilityScore
_target_: probability_score.ProbabilityScore
stop: best # "best" or "last"
subset: train # train, dev, test

run_name: ??? # has to be defined based on a finished run

The id should match the name of the configuration file. The _target_ should point to the custom scoring function class via a Python import path (here assuming that the probability_score.py file is in the root directory of the project).

The run_name should be a run name or list of run names from which to load the models for scoring.
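
For example, run_name can be set to a single run or to a list of runs (the run names below are placeholders; use the names of your own finished runs):

run_name: SomeFinishedRunName # placeholder: a single finished run

# or a list of runs:
run_name:
  - SomeFinishedRunName
  - AnotherFinishedRunName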

Custom Pacing Functions#

To create a custom pacing function, inherit from AbstractPace and implement the get_dataset_size() method.

For example, the following pacing function determines the dataset size at each iteration based on the convergence of the model, adding a new discrete bucket of samples when the tracking metric does not improve for a specified number of iterations:

discrete_convergence.py#
from typing import TYPE_CHECKING

from aucurriculum.curricula.pacing import AbstractPace


if TYPE_CHECKING:
    from autrainer.training import ModularTaskTrainer


class DiscreteConvergence(AbstractPace):
    def __init__(
        self,
        initial_size: float,
        final_iteration: float,
        total_iterations: int,
        dataset_size: int,
        patience: int = 1,
        min_improvement: float = 0.0,
        buckets: int = 10,
    ) -> None:
        """Discrete convergence pacing function adding a new bucket of training
        data every time the validation performance of the tracking metric does
        not improve by at least `min_improvement` for `patience` iterations.

        Args:
            initial_size: The initial fraction of the dataset to start training
                with.
            final_iteration: The fraction of training iterations at which the
                dataset size will be the full dataset size. If not all buckets
                are introduced by this iteration, the remaining buckets will be
                added immediately.
            total_iterations: The total number of training iterations.
            dataset_size: The size of the dataset.
            patience: The number of iterations to wait before adding a new
                bucket of training data. Defaults to 1.
            min_improvement: The minimum improvement in the tracking metric to
                consider as an improvement. Defaults to 0.0.
            buckets: The number of buckets to divide the remaining dataset size
                into. Defaults to 10.
        """
        super().__init__(
            initial_size, final_iteration, total_iterations, dataset_size
        )
        if patience < 1:
            raise ValueError(f"patience {patience} must be a positive integer")
        self.patience = patience
        if min_improvement < 0:
            raise ValueError(
                f"min_improvement {min_improvement} must be a non-negative float"
            )
        self.min_improvement = min_improvement
        if buckets < 1:
            raise ValueError(f"buckets {buckets} must be a positive integer")
        self.bucket_size = int((1 - initial_size) * dataset_size / buckets)
        self.current_size = int(initial_size * dataset_size)
        self.current_wait = 0

    def get_dataset_size(self, iteration: int) -> int:
        if self.total_iterations * self.final_iteration <= iteration:
            return self.dataset_size
        return self.current_size

    def convergence_criterion(self, metric: float) -> None:
        if self.metric_fn.compare(
            metric, self.current_best + self.min_improvement
        ):
            self.current_best = metric
            self.current_wait = 0
            return

        self.current_wait += 1
        if self.current_wait >= self.patience:
            size = min(self.current_size + self.bucket_size, self.dataset_size)
            self.current_size = size
            self.current_wait = 0

    def cb_on_train_begin(self, trainer: "ModularTaskTrainer") -> None:
        self.metric_fn = trainer.data.tracking_metric
        self.current_best = self.metric_fn.starting_metric
        if self.metric_fn.suffix == "min":
            self.min_improvement = -self.min_improvement

    def cb_on_val_end(
        self, trainer: "ModularTaskTrainer", iteration: int, val_results: dict
    ) -> None:
        self.convergence_criterion(val_results[self.metric_fn.name])

Next, create a DiscreteConvergence.yaml configuration file for the pacing function in the conf/curriculum/pacing/ directory:

conf/curriculum/pacing/DiscreteConvergence.yaml#
id: DiscreteConvergence
_target_: discrete_convergence.DiscreteConvergence
initial_size: ???
final_iteration: ???

patience: 5
min_improvement: 0.1
buckets: 10

The id should match the name of the configuration file. The _target_ should point to the custom pacing function class via a Python import path (here assuming that the discrete_convergence.py file is in the root directory of the project). The patience controls the number of iterations to wait without sufficient improvement before adding a new bucket of samples. The min_improvement specifies the minimum improvement in the tracking metric that still counts as progress and resets the patience counter. The buckets determines the number of discrete buckets the remaining dataset (beyond the initial fraction) is divided into.
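
As a rough illustration with hypothetical numbers (assuming the pacing function can be instantiated outside a training run for inspection): with 10,000 samples, an initial fraction of 0.1, and 10 buckets, training starts with 1,000 samples and each convergence event adds a bucket of 900 samples until the full dataset is reached:

from discrete_convergence import DiscreteConvergence

# hypothetical values for illustration only
pace = DiscreteConvergence(
    initial_size=0.1,
    final_iteration=0.8,
    total_iterations=50,
    dataset_size=10_000,
    patience=5,
    min_improvement=0.1,
    buckets=10,
)
print(pace.current_size)  # 1000 samples to start with
print(pace.bucket_size)  # 900 samples added per bucket
print(pace.get_dataset_size(40))  # 10000: full dataset from iteration 40 onwards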

Both initial_size and final_iteration serve as placeholders (indicated by ???) and are automatically passed to the pacing function configuration; they are specified in the main configuration file (e.g. conf/config.yaml).