Scoring Functions#

Scoring functions compute a difficulty score for each sample in a dataset. Difficulty scores are converted into a difficulty ordering by ranking all samples in ascending order of difficulty.

Tip

To create custom scoring functions, refer to the custom scoring functions tutorial.

Such a difficulty ordering can be used to create a curriculum, a sequence of samples ordered by difficulty, which is then used in downstream tasks to train a model in combination with a pacing function.

Most scoring functions obtain sample difficulty scores from a single training configuration. aucurriculum additionally allows for the automatic creation of ensemble scoring functions, which consist of multiple configurations and obtain the final difficulty ordering by averaging the per-sample difficulty scores across all atomic scoring functions.
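As an illustration, this behavior can be sketched with a few lines of pandas. The scores below are made up, and the sketch is not the library's implementation (which is provided by postprocess() and rank_and_normalize() further down):

import pandas as pd

# Hypothetical per-sample difficulty scores from two atomic scoring functions.
scores = pd.DataFrame({
    "run_a": [0.90, 0.10, 0.50, 0.30],
    "run_b": [0.80, 0.20, 0.60, 0.10],
})

# Ensemble: average the per-sample scores across all atomic scoring functions.
scores["mean"] = scores.mean(axis=1)

# Rank in ascending order of difficulty and normalize to [0, 1].
ranks = scores["mean"].rank(method="first") - 1
scores["ordering"] = ranks / ranks.max()  # 0 = easiest, 1 = hardest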

Note

aucurriculum currently supports scoring functions exclusively for multi-class classification tasks, both for obtaining sample difficulty scores and for curriculum-based training.

aucurriculum provides model-based, predefined, and random scoring functions to compute sample difficulty scores.

Tip

All scoring function configurations contain placeholder values (indicated by ???) that need to be replaced with the appropriate values. For more information on how to configure scoring functions, refer to the quickstart guide.

Curriculum Score Manager#

CurriculumScoreManager manages the calculation of scoring functions in three steps:

  1. preprocess(): Preprocess a scoring function configuration, creating one or more atomic scoring functions.

  2. run(): Run a single atomic scoring function, producing a difficulty score for each sample.

  3. postprocess(): Optionally combine multiple atomic scoring functions, creating a single difficulty score for each sample in a dataset.
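A minimal usage sketch of this three-step workflow, assuming cfg is an already-composed curriculum DictConfig; the output directory and the score ID "CELoss" are placeholders:

from aucurriculum.curricula import CurriculumScoreManager

manager = CurriculumScoreManager(cfg, output_directory="results/curriculum")

# 1. Expand the scoring configuration into atomic scoring functions.
run_configs, run_names = manager.preprocess()

# 2. Compute per-sample difficulty scores for each atomic scoring function.
for run_config, run_name in zip(run_configs, run_names):
    manager.run(run_config, run_name)

# 3. Combine the atomic scores into a single difficulty ordering.
manager.postprocess(score_id="CELoss")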

class aucurriculum.curricula.CurriculumScoreManager(cfg, output_directory)[source]#

Curriculum score manager to control pre-processing, running, and post-processing of scoring functions.

Parameters:
  • cfg (DictConfig) – Curriculum configuration.

  • output_directory (str) – The output directory to save the results.

preprocess()[source]#

Preprocess the scoring function, possibly creating one or more configurations and run names.

Return type:

Tuple[list, list]

Returns:

List of configurations and list of run names.

run(run_config, run_name)[source]#

Run a single scoring function.

Parameters:
  • run_config (DictConfig) – The run configuration.

  • run_name (str) – The name of the run.

Return type:

None

postprocess(score_id, correlation=None)[source]#

Postprocess the scoring function and optionally create a correlation matrix.

Parameters:
  • score_id (str) – The score ID to postprocess.

  • correlation (Optional[DictConfig]) – The correlation matrix configuration. Dictionary of lists of score IDs to include in a single correlation matrix. Defaults to None.

Return type:

None
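Continuing the sketch above, the correlation argument is a dictionary-style configuration in which each key names one correlation matrix and its list contains the score IDs to compare in that matrix; the key and score IDs below are placeholders:

from omegaconf import OmegaConf

correlation = OmegaConf.create(
    {"model_based": ["CELoss", "CumulativeAccuracy", "FirstIteration"]}
)
manager.postprocess(score_id="CELoss", correlation=correlation)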

Abstract Scoring Function#

All scoring functions inherit from the AbstractScore class and implement the run() method calculating the difficulty scores for each sample in the dataset. AbstractScore additionally provides common methods that are shared among most scoring functions.

Note

Scoring functions can optionally override the preprocess() and postprocess() methods to perform additional operations before and after the scoring function is run, such as combining multiple atomic scoring functions in a different way than averaging.
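The following is a minimal sketch of a custom scoring function built on these shared methods. The class name, the use of the maximum softmax probability as a difficulty proxy, the DataLoader construction, the batch size, the attribute names (self.subset, self.output_directory), and the file path passed to save_scores() are illustrative assumptions rather than aucurriculum conventions:

import os

import torch
from omegaconf import DictConfig

from aucurriculum.curricula.scoring import AbstractScore


class MaxSoftmax(AbstractScore):
    """Illustrative only: treats a higher maximum softmax probability as an
    easier sample, so it would be configured with reverse_score=True."""

    def run(self, config: DictConfig, run_config: DictConfig, run_name: str) -> None:
        # Instantiate the dataset and model of the underlying training run.
        data, model = self.prepare_data_and_model(run_config)
        # Load the trained checkpoint selected by `stop` ("best" or "last").
        self.load_model_checkpoint(model, run_name)

        # Illustrative DataLoader over the scored subset; batch size assumed.
        subset = self.get_dataset_subset(data, self.subset)
        loader = torch.utils.data.DataLoader(subset, batch_size=32, shuffle=False)

        # Map raw logits to the maximum softmax probability per sample.
        outputs, labels = self.forward_pass(
            model,
            loader,
            batch_size=32,
            output_map_fn=lambda x: torch.softmax(x, dim=-1).max(dim=-1).values,
        )
        df = self.create_dataframe(outputs, labels, data)
        # Where and how scores are stored for postprocess() is assumed here.
        self.save_scores(df, os.path.join(self.output_directory, f"{run_name}.csv"))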

class aucurriculum.curricula.scoring.AbstractScore(output_directory, results_dir, experiment_id, run_name, stop=None, subset='train', reverse_score=False, criterion=None)[source]#

Abstract class for scoring functions.

Parameters:
  • output_directory (str) – Directory where the scores will be stored.

  • results_dir (str) – The directory where the results are stored.

  • experiment_id (str) – The ID of the grid search experiment.

  • run_name (Union[str, List[str]]) – Name or list of names of the runs to score. Runs can be single runs or aggregated runs.

  • stop (Optional[str]) – Model state dict to load or to stop at in [“best”, “last”]. Defaults to None.

  • subset (str) – Dataset subset to use for scoring in [“train”, “dev”, “test”]. Defaults to “train”.

  • reverse_score (bool) – Whether to reverse the score ranking. Defaults to False.

  • criterion (Optional[str]) – The criterion to use for scoring. If None, no criterion will be used. Defaults to None.

Raises:

ValueError – If subset is not in [“train”, “dev”, “test”] or if stop is not in [“best”, “last”, None].

preprocess()[source]#

Preprocess one or multiple runs, creating a list of configurations and a list of run names to score.

Return type:

Tuple[list, list]

Returns:

List of configurations and list of run names to score.

abstract run(config, run_config, run_name)[source]#

Run the scoring function for a single run and generate the scores.

Parameters:
  • config (DictConfig) – The configuration of the curriculum scoring function.

  • run_config (DictConfig) – The configuration of the run to score.

  • run_name (str) – The name of the run to score.

Return type:

None

postprocess(score_id, runs)[source]#

Postprocess the scores and create the final scoring function ordering by averaging the scores of multiple runs and ranking the samples based on the mean score.

Parameters:
  • score_id (str) – ID of the score to save.

  • runs (list) – List of run names to postprocess and include in the score.

Return type:

None

split_run_name(run_name)[source]#

Split the full run name into the underlying training run name and the full run name containing the stop iteration (and optional criterion).

Parameters:

run_name (str) – The run name to split.

Return type:

Tuple[str, str]

Returns:

The name of the underlying training run and the full run name.

create_criterion(data, reduction='none')[source]#

Create the criterion for the scoring function based on the criterion configuration.

Parameters:
  • data (AbstractDataset) – Dataset to use for criterion setup.

  • reduction (str) – Reduction to use for the criterion. Defaults to “none”.

Return type:

Module

Returns:

Criterion for the scoring function.

static prepare_data_and_model(cfg)[source]#

Prepare the dataset and model for the scoring function based on the underlying training run configuration.

Parameters:

cfg (DictConfig) – The configuration of the underlying training run.

Return type:

Tuple[AbstractDataset, AbstractModel]

Returns:

The instantiated dataset and model.

load_model_checkpoint(model, run_name)[source]#

Load the trained model checkpoint based on the run name and run configuration. The model will be loaded from the best checkpoint if stop is set to “best” and from the last checkpoint if stop is set to “last”.

Parameters:
  • model (Module) – Model to load the checkpoint into.

  • run_name (str) – Name of the run to load the checkpoint from.

Return type:

None

static forward_pass(model, loader, batch_size, output_map_fn, output_size=None, tqdm_desc='Scoring Forward Pass', disable_progress_bar=True, device=None, timer=None)[source]#

Perform a forward pass through the model and return the outputs and labels.

Parameters:
  • model (AbstractModel) – Model to perform the forward pass with.

  • loader (DataLoader) – DataLoader to use for the forward pass.

  • batch_size (int) – Batch size to use for the forward pass.

  • output_map_fn (Callable[[Tensor], Tensor]) – Function to map the model outputs to the desired output format.

  • output_size (Optional[int]) – Size of the output tensor. If None, the model output should be a single scalar. Defaults to None.

  • tqdm_desc (str) – Description for the tqdm progress bar. Defaults to “Scoring Forward Pass”.

  • disable_progress_bar (bool) – Whether to disable the progress bar. Defaults to True.

  • device (Optional[device]) – Device to use for the forward pass. If None, the device will be set to “cpu”. Defaults to None.

  • timer (Optional[Timer]) – Timer to time the forward pass. If provided, the timer is started before the forward pass and stopped after the forward pass. Defaults to None.

Return type:

Tuple[ndarray, ndarray]

Returns:

Mapped model outputs and labels.
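For multi-class classification, output_map_fn typically maps raw logits to probabilities or predictions. A hypothetical call, in which model, loader, and num_classes are assumed to already exist:

import torch

def to_probabilities(logits: torch.Tensor) -> torch.Tensor:
    # Convert raw logits to per-class softmax probabilities.
    return torch.softmax(logits, dim=-1)

outputs, labels = AbstractScore.forward_pass(
    model,
    loader,
    batch_size=32,
    output_map_fn=to_probabilities,
    output_size=num_classes,  # each mapped output has num_classes entries
)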

create_dataframe(scores, labels, data)[source]#

Create a dataframe from the scores, labels, and dataset.

Parameters:
  • scores (ndarray) – The score for each sample.

  • labels (ndarray) – The encoded labels for each sample.

  • data (AbstractDataset) – The dataset.

Return type:

DataFrame

Returns:

The dataframe with the scores, encoded labels, and decoded labels.

rank_and_normalize(df)[source]#

Rank and normalize the scores in the dataframe by ranking the scores using method=”first” and normalizing the ranks to the range [0, 1]. If reverse_score is set to True, the ranks will be reversed.

In the resulting difficulty ordering, lower ranks always indicate easier samples.

Parameters:

df (DataFrame) – The output dataframe with “mean” column containing the scores.

Return type:

Series

Returns:

The normalized ranks.

static save_scores(df, path)[source]#

Save the scores dataframe to the specified path.

Parameters:
  • df (DataFrame) – The scores dataframe.

  • path (str) – The path to save the scores dataframe.

Return type:

None

static get_dataset_subset(data, subset)[source]#

Get the dataset subset.

Parameters:
  • data (AbstractDataset) – The dataset.

  • subset (str) – The dataset subset to get in [“train”, “dev”, “test”].

Return type:

Dataset

Returns:

The dataset subset.

static get_dataframe(data, subset)[source]#

Get the dataframe of the dataset subset.

Parameters:
  • data (AbstractDataset) – The dataset.

  • subset (str) – The dataset subset to get in [“train”, “dev”, “test”].

Return type:

DataFrame

Returns:

The dataframe of the dataset subset.

Model-based Scoring Functions#

Model-based scoring functions obtain a difficulty score for each sample by leveraging trained models and using, among others, training dynamics, model predictions, or losses to determine the difficulty of a sample.

Most model-based scoring functions require a trained model to compute the difficulty scores; the model is specified in the scoring function configuration via the run_name parameter. run_name should be a run name or a list of run names from which to load the models for scoring, and these runs must exist in the results_dir and experiment_id set in the curriculum scoring configuration file (conf/curriculum.yaml). It is also possible to specify (lists of) aggregated run names, which are automatically resolved to the underlying runs, effectively creating an ensemble scoring function.

class aucurriculum.curricula.scoring.CELoss(output_directory, results_dir, experiment_id, run_name, criterion, stop='best', subset='train')[source]#

Cross-Entropy Loss scoring function computing the cross-entropy loss for each sample in the dataset individually. Originally termed bootstrapping, it is implemented as described in: https://arxiv.org/abs/1904.03626

Parameters:
  • output_directory (str) – Directory where the scores will be stored.

  • results_dir (str) – The directory where the results are stored.

  • experiment_id (str) – The ID of the grid search experiment.

  • run_name (str) – Name or list of names of the runs to score. Runs can be single runs or aggregated runs.

  • criterion (str) – The criterion to use for obtaining the per-example loss. The reduction of the criterion is automatically set to “none”.

  • stop (str) – Model state dict to load or to stop at in [“best”, “last”]. Defaults to “best”.

  • subset (str) – Dataset subset to use for scoring in [“train”, “dev”, “test”]. Defaults to “train”.

Default Configurations
conf/curriculum/scoring/CELoss.yaml#
id: CELoss
type: CELoss
_target_: aucurriculum.curricula.scoring.CELoss
run_name: ???
criterion: autrainer.criterions.CrossEntropyLoss
stop: best # "best" or "last"
subset: train # train, dev, test

class aucurriculum.curricula.scoring.CumulativeAccuracy(output_directory, results_dir, experiment_id, run_name, stop='best', subset='train')[source]#

Cumulative Accuracy scoring function computing the mean accuracy from the first to the stop epoch for each sample in the dataset individually. The scoring function serves as a computationally less expensive proxy to the C-score as described in: https://arxiv.org/abs/2002.03206

Parameters:
  • output_directory (str) – Directory where the scores will be stored.

  • results_dir (str) – The directory where the results are stored.

  • experiment_id (str) – The ID of the grid search experiment.

  • run_name (str) – Name or list of names of the runs to score. Runs can be single runs or aggregated runs.

  • stop (str) – Model state dict to load or to stop at in [“best”, “last”]. Defaults to “best”.

  • subset (str) – Dataset subset to use for scoring in [“train”, “dev”, “test”]. Defaults to “train”.

Default Configurations
conf/curriculum/scoring/CumulativeAccuracy.yaml#
id: CumulativeAccuracy
type: CumulativeAccuracy
_target_: aucurriculum.curricula.scoring.CumulativeAccuracy
run_name: ???
stop: best # "best" or "last"
subset: train # train, dev, test

class aucurriculum.curricula.scoring.CVLoss(output_directory, results_dir, experiment_id, splits, setup, criterion, stop='best', subset='train')[source]#

Cross-Validation Loss scoring function computing the cross-entropy loss for each sample in the dataset individually. The dataset is split into splits parts and the loss is computed for each part individually by training on the remaining parts as described in: TODO: add reference once paper is published.

Parameters:
  • output_directory (str) – Directory where the scores will be stored.

  • results_dir (str) – The directory where the results are stored.

  • experiment_id (str) – The ID of the grid search experiment.

  • splits (int) – Number of splits for the cross-validation.

  • setup (DictConfig) –

    Configuration for the grid search to perform for each split. Each configuration parameter can be a string or list of strings for multiple configurations. The following parameters are required:

    • filters: Optional list of filters to apply to the runs.

    • dataset: Dataset ID.

    • model: Model ID.

    • optimizer: Optimizer ID.

    • learning_rate: Learning rate.

    • scheduler: Scheduler ID.

    • augmentation: Augmentation ID.

    • seed: Seed.

    • batch_size: Batch size.

    • inference_batch_size: Batch size for inference.

    • plotting: Plotting ID.

    • training_type: Training type.

    • iterations: Number of iterations.

    • eval_frequency: Evaluation frequency.

    • save_frequency: Save frequency.

    • save_train_outputs: Whether to save the training outputs.

    • save_dev_outputs: Whether to save the dev outputs.

    • save_test_outputs: Whether to save the test outputs.

  • criterion (str) – The criterion to use for obtaining the per-example loss. The reduction of the criterion is automatically set to “none”.

  • stop (str) – Model state dict to load or to stop at in [“best”, “last”]. Defaults to “best”.

  • subset (str) – Dataset subset to use for scoring in [“train”, “dev”, “test”]. Defaults to “train”.

Raises:

ValueError – If the number of splits is less than 2.

Default Configurations
conf/curriculum/scoring/CVLoss.yaml#
id: CVLoss
type: CVLoss
_target_: aucurriculum.curricula.scoring.CVLoss
splits: 3
setup:
  filters: null
  dataset: ???
  model: ???
  optimizer: ???
  learning_rate: ???
  scheduler: ???
  augmentation: ???
  seed: ???
  batch_size: ???
  inference_batch_size: ???
  plotting: Default
  training_type: epoch
  iterations: 5
  eval_frequency: 1
  save_frequency: 1
  save_train_outputs: true
  save_dev_outputs: true
  save_test_outputs: true
criterion: autrainer.criterions.CrossEntropyLoss
stop: best
subset: train

class aucurriculum.curricula.scoring.FirstIteration(output_directory, results_dir, experiment_id, run_name, stop='best', subset='train')[source]#

First Iteration scoring function computing, for each sample in the dataset individually, the first iteration in which the model correctly predicts the target and continues to do so for all subsequent iterations, as described in: https://arxiv.org/abs/2012.03107

Parameters:
  • output_directory (str) – Directory where the scores will be stored.

  • results_dir (str) – The directory where the results are stored.

  • experiment_id (str) – The ID of the grid search experiment.

  • run_name (str) – Name or list of names of the runs to score. Runs can be single runs or aggregated runs.

  • stop (str) – Model state dict to load or to stop at in [“best”, “last”]. Defaults to “best”.

  • subset (str) – Dataset subset to use for scoring in [“train”, “dev”, “test”]. Defaults to “train”.

Default Configurations
conf/curriculum/scoring/FirstIteration.yaml#
id: FirstIteration
type: FirstIteration
_target_: aucurriculum.curricula.scoring.FirstIteration
run_name: ???
stop: best # "best" or "last"
subset: train # train, dev, test

class aucurriculum.curricula.scoring.PredictionDepth(output_directory, results_dir, experiment_id, run_name, probe_placements, max_embedding_size=None, match_dimensions=False, knn_n_neighbors=30, knn_batch_size=1024, save_embeddings=False, stop='best', subset='train')[source]#

Prediction Depth scoring function computing, for each sample in the dataset individually, the depth from which the first and all subsequent KNN probes align with the model's prediction, as described in: https://arxiv.org/abs/2106.09647

Parameters:
  • output_directory (str) – Directory where the scores will be stored.

  • results_dir (str) – The directory where the results are stored.

  • experiment_id (str) – The ID of the grid search experiment.

  • run_name (str) – Name or list of names of the runs to score. Runs can be single runs or aggregated runs.

  • probe_placements (Union[List[str], Dict[str, List[str]]]) – Names of the nodes in the traced model graph where the probes should be placed, specified using regex patterns. The input and output of the model are automatically added. If a list is provided, the same placements will be used for all runs. If a dictionary is provided, the placements will be used for the corresponding run names.

  • max_embedding_size (Optional[int]) – Maximum dimensionality of the flattened embeddings. If embeddings exceed this size, they will be pooled. Defaults to None.

  • match_dimensions (bool) – Whether to match the spatial dimensions of the embeddings and create square embeddings. Defaults to False.

  • knn_n_neighbors (int) – Number of neighbors to use for the parallel k-nearest neighbors algorithm. Defaults to 30.

  • knn_batch_size (int) – Batch size for the parallel k-nearest neighbors algorithm. Defaults to 1024.

  • save_embeddings (bool) – Whether to save the embeddings for each probe. Defaults to False.

  • stop (str) – Model state dict to load or to stop at in [“best”, “last”]. Defaults to “best”.

  • subset (str) – Dataset subset to use for scoring in [“train”, “dev”, “test”]. Defaults to “train”.

Default Configurations
conf/curriculum/scoring/PredictionDepth.yaml#
id: PredictionDepth
type: PredictionDepth
_target_: aucurriculum.curricula.scoring.PredictionDepth
run_name: ???
probe_placements: ???
max_embedding_size: 65536
match_dimensions: true
stop: best # "best" or "last"
subset: train # train, dev, test

class aucurriculum.curricula.scoring.TransferTeacher(output_directory, results_dir, experiment_id, model, dataset, subset='train')[source]#

Transfer Teacher scoring function computing, for each sample in the dataset, the margin to the decision boundary of a support vector machine (SVM) trained on the embeddings of a pre-trained model, as described in: https://arxiv.org/abs/1904.03626

Parameters:
  • output_directory (str) – Directory where the scores will be stored.

  • results_dir (str) – The directory where the results are stored.

  • experiment_id (str) – The ID of the grid search experiment.

  • model (Union[str, List[str]]) – Model ID or list of model IDs to use for scoring.

  • dataset (str) – Dataset ID to use for scoring.

  • subset (str) – Dataset subset to use for scoring in [“train”, “dev”, “test”]. Defaults to “train”.

Default Configurations
conf/curriculum/scoring/TransferTeacher.yaml#
id: TransferTeacher
type: TransferTeacher
_target_: aucurriculum.curricula.scoring.TransferTeacher
model: ???
dataset: ???
subset: train # train, dev, test

Predefined Scoring Functions#

Predefined scoring functions determine the difficulty of a sample based on predefined criteria, with the per-sample scores provided in a CSV file.

class aucurriculum.curricula.scoring.Predefined(output_directory, results_dir, experiment_id, file, scores_column, reverse, dataset, subset='train')[source]#

Predefined scoring function using predefined scores from a file.

Parameters:
  • output_directory (str) – Directory where the scores will be stored.

  • results_dir (str) – The directory where the results are stored.

  • experiment_id (str) – The ID of the grid search experiment.

  • file (str) – Path to the file containing the scores.

  • scores_column (str) – Column name of the scores in the file.

  • reverse (bool) – Whether to reverse the order of the scores.

  • dataset (str) – Dataset ID to use for scoring.

  • subset (str) – Dataset subset to use for scoring in [“train”, “dev”, “test”]. Defaults to “train”.

Raises:

ValueError – If the file does not exist.

Default Configurations
conf/curriculum/scoring/Predefined.yaml#
id: Predefined
type: Predefined
_target_: aucurriculum.curricula.scoring.Predefined
file: ???
scores_column: ???
reverse: ??? # true, false
dataset: ???
subset: train # train, dev, test
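As an illustration, such a file could be created with pandas. The file name, the identifier column, and the snr_db scores column below are hypothetical; only the column referenced by scores_column is read as the difficulty score, and the exact columns expected alongside it depend on the dataset:

import pandas as pd

df = pd.DataFrame({
    "filename": ["a.wav", "b.wav", "c.wav"],  # sample identifiers (illustrative)
    "snr_db": [21.5, 3.2, 12.7],              # predefined difficulty criterion
})
df.to_csv("predefined_scores.csv", index=False)

# The configuration would then set, e.g.:
#   file: predefined_scores.csv
#   scores_column: snr_db
#   reverse: true  # higher SNR is assumed easier, so the ranking is reversed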

Random Scoring Functions#

Random scoring functions assign a random difficulty score to each sample in the dataset.

class aucurriculum.curricula.scoring.Random(output_directory, results_dir, experiment_id, dataset, seed, subset='train')[source]#

Random scoring function that assigns random scores to each sample in the dataset.

Parameters:
  • output_directory (str) – Directory where the scores will be stored.

  • results_dir (str) – The directory where the results are stored.

  • experiment_id (str) – The ID of the grid search experiment.

  • dataset (str) – Dataset ID to use for scoring.

  • seed (int) – Seed to use for random scoring.

  • subset (str) – Dataset subset to use for scoring in [“train”, “dev”, “test”]. Defaults to “train”.

Default Configurations
conf/curriculum/scoring/Random.yaml#
id: Random
type: Random
_target_: aucurriculum.curricula.scoring.Random
dataset: ???
seed: 1
subset: train # train, dev, test