Tutorials
aucurriculum is designed to be flexible and extensible, allowing for the creation of custom …
For each, a tutorial is provided below to demonstrate their implementation and configuration.
For the following tutorials, all Python files should be placed in the project root directory and all configuration files should be placed in the corresponding subdirectories of the conf/ directory.
Custom Scoring Functions
To create a custom scoring function, inherit from AbstractScore and implement the run() method.
For example, the following model-based scoring function scores each sample by the probability the model assigns to its most likely class, regardless of the true class. Higher probabilities are assumed to indicate easier samples:
import os

from autrainer.core.utils import Timer, set_device
from omegaconf import DictConfig
import torch
from torch.utils.data import DataLoader

from aucurriculum.curricula.scoring import AbstractScore


class ProbabilityScore(AbstractScore):
    def __init__(
        self,
        output_directory: str,
        results_dir: str,
        experiment_id: str,
        run_name: str,
        stop: str = "best",
        subset: str = "train",
    ) -> None:
        """Probability scoring function determining the difficulty of a sample
        based on the model's highest output probability (regardless of the true
        class).

        Args:
            output_directory: Directory where the scores will be stored.
            results_dir: The directory where the results are stored.
            experiment_id: The ID of the grid search experiment.
            run_name: Name or list of names of the runs to score. Runs can be
                single runs or aggregated runs.
            stop: Model state dict to load or to stop at in ["best", "last"].
                Defaults to "best".
            subset: Dataset subset to use for scoring in ["train", "dev",
                "test"]. Defaults to "train".
        """
        super().__init__(
            output_directory=output_directory,
            results_dir=results_dir,
            experiment_id=experiment_id,
            run_name=run_name,
            stop=stop,
            subset=subset,
            reverse_score=True,  # assume higher probabilities are easier
        )

    def run(
        self, config: DictConfig, run_config: DictConfig, run_name: str
    ) -> None:
        run_name, full_run_name = self.split_run_name(run_name)
        run_path = os.path.join(self.output_directory, full_run_name)
        data, model = self.prepare_data_and_model(run_config)
        dataset = self.get_dataset_subset(data, self.subset)
        batch_size = config.get("batch_size", run_config.get("batch_size", 32))
        loader = DataLoader(dataset, batch_size=batch_size)
        self.load_model_checkpoint(model, run_name)
        device = set_device(config.device)
        forward_timer = Timer(run_path, "model_forward")
        probabilities, labels = self.forward_pass(
            model=model,
            loader=loader,
            batch_size=batch_size,
            output_map_fn=self.score,
            tqdm_desc=run_name,
            disable_progress_bar=not config.get("progress_bar", False),
            device=device,
            timer=forward_timer,
        )
        forward_timer.save()
        df = self.create_dataframe(probabilities, labels, data)
        self.save_scores(df, run_path)

    def score(self, outputs: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        """Compute the highest probability per sample regardless of the true
        class.

        Args:
            outputs: Batch of model outputs.
            y: Batch of labels.

        Returns:
            Batch of highest probability per sample.
        """
        return torch.softmax(outputs, dim=1).max(dim=1).values
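To make the computation in score() concrete, the following dependency-free sketch reproduces the same quantity (the maximum softmax probability per sample) in plain Python. The helper max_softmax_probability and the example logits are hypothetical and only serve as illustration; the descending sort mirrors the assumption that higher probabilities indicate easier samples and is not taken from the library's internals.

```python
import math


def max_softmax_probability(logits):
    """Return the highest softmax probability for one sample's logits."""
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(value - shift) for value in logits]
    return max(exps) / sum(exps)


# Hypothetical logits for three samples of a three-class problem.
batch = {
    "confident": [4.0, 0.0, 0.0],
    "uncertain": [0.2, 0.1, 0.0],
    "two_way_tie": [2.0, 2.0, -3.0],
}

scores = {
    name: max_softmax_probability(logits) for name, logits in batch.items()
}

# Higher probability is assumed to mean "easier", so sorting in descending
# order yields an easy-to-hard ordering of the samples.
easy_to_hard = sorted(scores, key=scores.get, reverse=True)

for name in easy_to_hard:
    print(f"{name}: {scores[name]:.3f}")
```

A sample whose logits are spread almost evenly receives a score close to 1 divided by the number of classes, so it ends up at the hard end of the ordering even if its top prediction happens to be correct.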
Next, create a ProbabilityScore.yaml configuration file for the scoring function in the conf/curriculum/scoring/ directory:
id: ProbabilityScore
type: ProbabilityScore
_target_: probability_score.ProbabilityScore
stop: best # "best" or "last"
subset: train # train, dev, test

run_name: ??? # has to be defined based on a finished run
The id should match the name of the configuration file.
The _target_ should point to the custom scoring function class via a Python import path (here assuming that the probability_score.py file is in the root directory of the project).
The run_name should be a run name or list of run names from which to load the models for scoring.
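As a rough illustration of what "pointing to a class via a Python import path" means, the sketch below splits a dotted path into a module and an attribute and imports it. The helper resolve_target is hypothetical, and the instantiation logic actually used by the library may differ. A standard-library class stands in for probability_score.ProbabilityScore so that the example runs without the tutorial files.

```python
import importlib


def resolve_target(target: str):
    """Split a dotted import path into module and attribute and import it."""
    module_path, _, attribute = target.rpartition(".")
    module = importlib.import_module(module_path)
    return getattr(module, attribute)


# "collections.OrderedDict" plays the role of
# "probability_score.ProbabilityScore" in this sketch.
cls = resolve_target("collections.OrderedDict")

# The remaining configuration keys would be passed as keyword arguments.
instance = cls(stop="best", subset="train")
print(cls.__name__, dict(instance))
```

This also shows why the file location matters: the module part of the path (probability_score) has to be importable from the directory in which the project is run.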
Custom Pacing Functions
To create a custom pacing function, inherit from AbstractPace and implement the get_dataset_size() method.
For example, the following pacing function determines the dataset size at each iteration based on the convergence of the model, adding a new discrete bucket of samples when the tracking metric does not improve for a specified number of iterations:
from typing import TYPE_CHECKING

from aucurriculum.curricula.pacing import AbstractPace


if TYPE_CHECKING:
    from autrainer.training import ModularTaskTrainer


class DiscreteConvergence(AbstractPace):
    def __init__(
        self,
        initial_size: float,
        final_iteration: float,
        total_iterations: int,
        dataset_size: int,
        patience: int = 1,
        min_improvement: float = 0.0,
        buckets: int = 10,
    ) -> None:
        """Discrete convergence pacing function adding a new bucket of training
        data every time the validation performance of the tracking metric does
        not improve by at least `min_improvement` for `patience` iterations.

        Args:
            initial_size: The initial fraction of the dataset to start training
                with.
            final_iteration: The fraction of training iterations at which the
                dataset size will be the full dataset size. If not all buckets
                are introduced by this iteration, the remaining buckets will be
                added immediately.
            total_iterations: The total number of training iterations.
            dataset_size: The size of the dataset.
            patience: The number of iterations to wait before adding a new
                bucket of training data. Defaults to 1.
            min_improvement: The minimum improvement in the tracking metric to
                consider as an improvement. Defaults to 0.0.
            buckets: The number of buckets to divide the remaining dataset size
                into. Defaults to 10.
        """
        super().__init__(
            initial_size, final_iteration, total_iterations, dataset_size
        )
        if patience < 1:
            raise ValueError(f"patience {patience} must be a positive integer")
        self.patience = patience
        if min_improvement < 0:
            raise ValueError(
                f"min_improvement {min_improvement} must be a non-negative float"
            )
        self.min_improvement = min_improvement
        if buckets < 1:
            raise ValueError(f"buckets {buckets} must be a positive integer")
        # Samples beyond the initial subset are split into equally sized buckets.
        self.bucket_size = int((1 - initial_size) * dataset_size / buckets)
        self.current_size = int(initial_size * dataset_size)
        self.current_wait = 0

    def get_dataset_size(self, iteration: int) -> int:
        # Use the full dataset once the final iteration has been reached.
        if self.total_iterations * self.final_iteration <= iteration:
            return self.dataset_size
        return self.current_size

    def convergence_criterion(self, metric: float) -> None:
        # Sufficient improvement: store the new best value and reset the wait.
        if self.metric_fn.compare(
            metric, self.current_best + self.min_improvement
        ):
            self.current_best = metric
            self.current_wait = 0
            return

        # No sufficient improvement: add a bucket once the patience runs out.
        self.current_wait += 1
        if self.current_wait >= self.patience:
            size = min(self.current_size + self.bucket_size, self.dataset_size)
            self.current_size = size
            self.current_wait = 0

    def cb_on_train_begin(self, trainer: "ModularTaskTrainer") -> None:
        self.metric_fn = trainer.data.tracking_metric
        self.current_best = self.metric_fn.starting_metric
        # For metrics that are minimized, an improvement is a decrease.
        if self.metric_fn.suffix == "min":
            self.min_improvement = -self.min_improvement

    def cb_on_val_end(
        self, trainer: "ModularTaskTrainer", iteration: int, val_results: dict
    ) -> None:
        self.convergence_criterion(val_results[self.metric_fn.name])
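The bucket logic of DiscreteConvergence can be traced without a training run. The following is a minimal standalone sketch, not the library class: ConvergencePacer is a hypothetical re-implementation that assumes a metric where higher is better (for example accuracy), replaces metric_fn.compare with a plain greater-than comparison, and leaves out the final_iteration cutoff and the trainer callbacks. The validation values are made up.

```python
class ConvergencePacer:
    """Standalone sketch of the bucket logic for a higher-is-better metric."""

    def __init__(
        self, initial_size, dataset_size, patience, min_improvement, buckets
    ):
        self.dataset_size = dataset_size
        self.patience = patience
        self.min_improvement = min_improvement
        self.bucket_size = int((1 - initial_size) * dataset_size / buckets)
        self.current_size = int(initial_size * dataset_size)
        self.current_best = float("-inf")
        self.current_wait = 0

    def update(self, metric):
        """Record one validation result and return the new dataset size."""
        if metric > self.current_best + self.min_improvement:
            # Sufficient improvement: keep the current dataset size.
            self.current_best = metric
            self.current_wait = 0
        else:
            self.current_wait += 1
            if self.current_wait >= self.patience:
                # Patience exhausted: add one bucket, capped at the full size.
                self.current_size = min(
                    self.current_size + self.bucket_size, self.dataset_size
                )
                self.current_wait = 0
        return self.current_size


pacer = ConvergencePacer(
    initial_size=0.25,
    dataset_size=1000,
    patience=2,
    min_improvement=0.01,
    buckets=3,
)

# Hypothetical validation accuracies, one per iteration.
val_accuracy = [0.50, 0.60, 0.605, 0.60, 0.70, 0.70, 0.70]
sizes = [pacer.update(metric) for metric in val_accuracy]
print(sizes)
```

With these settings each bucket holds 250 samples. The third value (0.605) does not beat the best value (0.60) by at least min_improvement, so it counts towards the patience, and a bucket is added after the second iteration in a row without sufficient improvement.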
Next, create a DiscreteConvergence.yaml configuration file for the pacing function in the conf/curriculum/pacing/ directory:
id: DiscreteConvergence
_target_: discrete_convergence.DiscreteConvergence
initial_size: ???
final_iteration: ???

patience: 5
min_improvement: 0.1
buckets: 10
The id should match the name of the configuration file.
The _target_ should point to the custom pacing function class via a Python import path (here assuming that the discrete_convergence.py file is in the root directory of the project).
The patience controls the number of iterations to wait for improvement before adding a new bucket of samples.
The min_improvement specifies the minimum change in the tracking metric that counts as an improvement; smaller changes count towards the patience.
The buckets determines the number of discrete buckets into which the part of the dataset not covered by initial_size is divided.
Both the initial_size and final_iteration serve as placeholders (indicated by ???) and are automatically passed to the pacing function configuration in the main configuration file (e.g. conf/config.yaml).