stratified_train_test_split

openstef_models.utils.data_split.stratified_train_test_split(dataset: T, test_fraction: float, stratification_fraction: float = 0.15, target_column: str = 'load', random_state: int = 42, min_days_for_stratification: int = 4) → tuple[T, T]

Split a dataset into train and test sets with stratification on extreme values.

Splits the data so that extreme high and low values are proportionally represented in both the training and test sets. This keeps the test distribution representative for model evaluation, which matters most in forecasting tasks where extreme events are critical.
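The idea can be illustrated with a minimal pandas sketch. This is not the library's implementation: the helper name sketch_stratified_day_split, the use of daily maxima and minima to define "extreme" days, and the per-group random sampling are all assumptions made for illustration. It only shows how sampling each group of days separately keeps extremes in both splits.

```python
import numpy as np
import pandas as pd


def sketch_stratified_day_split(
    frame: pd.DataFrame,
    test_fraction: float,
    stratification_fraction: float = 0.15,
    target_column: str = "load",
    random_state: int = 42,
) -> tuple[pd.DataFrame, pd.DataFrame]:
    """Hypothetical helper: split whole days, sampling each stratum separately."""
    rng = np.random.default_rng(random_state)
    days = frame.index.normalize()  # midnight timestamp of each row's day

    # Assumption: a day is "extreme" if its daily max (or min) ranks among
    # the top (or bottom) stratification_fraction of all days.
    daily_max = frame[target_column].groupby(days).max()
    daily_min = frame[target_column].groupby(days).min()
    n_extreme = max(1, int(len(daily_max) * stratification_fraction))
    high_days = set(daily_max.nlargest(n_extreme).index)
    low_days = set(daily_min.nsmallest(n_extreme).index) - high_days
    normal_days = set(daily_max.index) - high_days - low_days

    # Draw the same fraction of test days from every group.
    test_days = set()
    for group in (high_days, low_days, normal_days):
        ordered = sorted(group)
        n_test = int(round(len(ordered) * test_fraction))
        if n_test:
            picked = rng.choice(len(ordered), size=n_test, replace=False)
            test_days.update(ordered[i] for i in picked)

    in_test = days.isin(list(test_days))
    return frame[~in_test], frame[in_test]
```

Because days are assigned as a whole, no day contributes rows to both splits, and each group of days contributes roughly test_fraction of its days to the test set.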

Parameters:
  • dataset (T) – The dataset to split.

  • test_fraction (float) – Fraction of data to include in the test split. Must be between 0 and 1.

  • stratification_fraction (float) – Fraction of extreme days to consider for stratification. Defaults to 0.15.

  • target_column (str) – Name of the column containing the values to stratify on. Defaults to 'load'.

  • random_state (int) – Random seed for reproducible splits. Defaults to 42.

  • min_days_for_stratification (int) – Minimum number of days required for stratification. Defaults to 4.

Returns:

Tuple of (train_dataset, test_dataset).

Raises:

ValueError – If test_fraction is not between 0 and 1.
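A check of this kind might look like the sketch below. The helper name check_test_fraction is hypothetical, and treating the bounds as exclusive is an assumption, since the documentation does not say whether 0 and 1 themselves are accepted.

```python
def check_test_fraction(test_fraction: float) -> None:
    """Hypothetical helper: reject fractions outside the open interval (0, 1)."""
    # Assumption: both endpoints are rejected; the real function may differ.
    if not 0.0 < test_fraction < 1.0:
        raise ValueError(
            f"test_fraction must be between 0 and 1, got {test_fraction}"
        )
```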

Return type:

tuple[T, T]

Note

Falls back to chronological splitting if there are too few days for stratification.
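A chronological fallback could be sketched as follows. Both helper names are hypothetical, and the details are assumptions: that the most recent rows form the test set, that the split is row-based, and that "too few" means fewer than min_days_for_stratification distinct calendar days.

```python
import pandas as pd


def has_enough_days(frame: pd.DataFrame, min_days_for_stratification: int = 4) -> bool:
    """Hypothetical helper: count distinct calendar days in the index."""
    return frame.index.normalize().nunique() >= min_days_for_stratification


def sketch_chronological_split(
    frame: pd.DataFrame, test_fraction: float
) -> tuple[pd.DataFrame, pd.DataFrame]:
    """Hypothetical helper: earliest rows train, latest rows test."""
    ordered = frame.sort_index()
    n_test = int(round(len(ordered) * test_fraction))
    cut = len(ordered) - n_test
    return ordered.iloc[:cut], ordered.iloc[cut:]
```

With only three days of data, for example, has_enough_days returns False under the default threshold, and the split reduces to a single cut in time.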

Here, T is a type variable bound to TimeSeriesDataset.