openstef.model_selection package


openstef.model_selection.model_selection module

openstef.model_selection.model_selection.backtest_split_default(data, n_folds, test_fraction=0.15, stratification_min_max=True, randomize_fold_split=False)

Default cross validation strategy.

  • data (DataFrame)

  • n_folds (int)

  • test_fraction (float)

  • stratification_min_max (bool)

  • randomize_fold_split (bool)

Return type:

Iterable[tuple[DataFrame, DataFrame, DataFrame]]


Iterable over train, validation, and test splits


A generator is used so that the splits are evaluated lazily and multiple copies of the data are avoided.
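The lazy behaviour described above can be illustrated with a minimal sketch. It is not the openstef implementation: the function name lazy_fold_splits, the use of plain lists instead of DataFrames, and the way rows are assigned to folds are all assumptions made for illustration.

```python
from typing import Iterator, List, Sequence, Tuple


def lazy_fold_splits(
    rows: Sequence[int], n_folds: int, test_fraction: float = 0.15
) -> Iterator[Tuple[List[int], List[int], List[int]]]:
    """Yield (train, validation, test) splits one fold at a time."""
    n_test = int(len(rows) * test_fraction)
    # Assumption of this sketch: the most recent rows form the test set.
    remaining = list(rows[: len(rows) - n_test])
    test = list(rows[len(rows) - n_test:])
    for fold in range(n_folds):
        # Every n_folds-th row, offset by the fold number, goes to validation.
        validation = remaining[fold::n_folds]
        train = [r for i, r in enumerate(remaining) if i % n_folds != fold]
        # Nothing for the next fold is built until the caller asks for it.
        yield train, validation, test
```

Because the function is a generator, only one fold's train and validation lists exist at a time when the caller iterates with a for loop.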

openstef.model_selection.model_selection.group_kfold(input_data, n_folds, randomize_fold_split=True)

Function to group data into groups, according to the date and the number of folds.

Each date gets assigned a number between 0 and n_folds.

  • input_data (DataFrame) – Input data

  • n_folds (int) – Number of folds

  • randomize_fold_split (bool) – Indicates if random split needs to be applied

Return type:



Grouped data
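As a rough sketch of the idea of assigning each date to a fold, the hypothetical helper below maps every unique calendar date to a fold number. The name assign_date_folds, the seed parameter, the dictionary return value, and the choice of fold numbers 0 to n_folds - 1 are assumptions of this example, not the behaviour of group_kfold itself.

```python
import random
from datetime import date, datetime
from typing import Dict, Iterable, Optional


def assign_date_folds(
    timestamps: Iterable[datetime],
    n_folds: int,
    randomize_fold_split: bool = True,
    seed: Optional[int] = None,
) -> Dict[date, int]:
    """Map each unique calendar date to a fold number in range(n_folds)."""
    unique_dates = sorted({ts.date() for ts in timestamps})
    # Round-robin fold numbers keep the folds as equal in size as possible.
    fold_numbers = [i % n_folds for i in range(len(unique_dates))]
    if randomize_fold_split:
        # Shuffle which date receives which fold number.
        random.Random(seed).shuffle(fold_numbers)
    return dict(zip(unique_dates, fold_numbers))
```

All rows that share a date end up in the same fold, which keeps a single day from being split between training and validation data.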

openstef.model_selection.model_selection.random_sample(all_peaks, k)

Random sampling of numbers out of a np.array.

Implemented because the SonarCloud security check does not accept the built-in random functions.

  • all_peaks (array) – List with numbers to sample from

  • k (int) – Number of wanted samples

Return type:



Sorted array with the random samples (dates from the peaks)
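One way to sample without the default pseudo-random generator is to draw from the operating system's entropy source. The sketch below is only an illustration under that assumption: secure_random_sample is a hypothetical name, it works on plain sequences rather than np.array, it samples without replacement, and it is not claimed to match how random_sample is implemented.

```python
import secrets
from typing import List, Sequence


def secure_random_sample(values: Sequence[int], k: int) -> List[int]:
    """Return k distinct elements of values, sorted in ascending order."""
    # SystemRandom draws from the OS entropy source, so it cannot be seeded.
    rng = secrets.SystemRandom()
    return sorted(rng.sample(list(values), k))
```

The result differs between calls, so any check on it should look at its length, ordering, and membership rather than at exact values.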

openstef.model_selection.model_selection.sample_indices_train_val(data, peaks)

Sample indices of a given period length, assuming the peaks are evenly spread.

  • data (DataFrame) – Clean data with features

  • peaks (DataFrame) – Data frame of selected peaks to sample the dates from

Return type:

tuple[array, array]


  • List with the start point of each peak

  • Sorted list with the indices corresponding to the peak

openstef.model_selection.model_selection.split_data_train_validation_test(data_, test_fraction=0.1, validation_fraction=0.15, back_test=False, stratification_min_max=True)

Split input data into train, validation and test sets.

Function for splitting data with features into train, validation and test datasets. In an operational setting the following sequence is returned (when using stratification):

Train >> Validation (the test set is the train and validation sets combined)

For a back test (indicated with argument “back_test”) the following sequence is returned:

Train >> Validation >> Test

The ratios of the different sets can be set with test_fraction and validation_fraction.

  • data_ – Cleaned data with features

  • test_fraction (float) – Number between 0 and 1 that indicates the desired fraction of test data.

  • validation_fraction (float) – Number between 0 and 1 that indicates the desired fraction of validation data.

  • back_test (bool) – Indicates if data is intended for a back test.

  • stratification_min_max (bool) – Indicates if validation data must be sampled as periods, using stratification on min and max values per day. If True, ‘extreme days’ are included in both the validation and train sets, so that the validation set is representative of the train set.

Return type:

tuple[DataFrame, DataFrame, DataFrame]


  • Train data.

  • Validation data.

  • Test data.


ValueError – When the test and validation fractions are too high.
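The two sequences described above can be sketched without stratification as follows. This is a simplified illustration, not the library code: chronological_split is a hypothetical name, it works on plain sequences, the validation fraction is taken relative to the rows left after removing the test set, test_fraction is unused in the operational branch, and the exact condition that triggers the ValueError is an assumption.

```python
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def chronological_split(
    rows: Sequence[T],
    test_fraction: float = 0.1,
    validation_fraction: float = 0.15,
    back_test: bool = False,
) -> Tuple[List[T], List[T], List[T]]:
    """Split time-ordered rows into (train, validation, test)."""
    # Assumed threshold: the fractions must leave room for training data.
    if test_fraction + validation_fraction >= 1:
        raise ValueError("Test and validation fractions are too high.")
    all_rows = list(rows)
    if back_test:
        # Back test: Train >> Validation >> Test, test is the most recent part.
        n_test = int(len(all_rows) * test_fraction)
        rest = all_rows[: len(all_rows) - n_test]
        test = all_rows[len(all_rows) - n_test:]
    else:
        # Operational: Train >> Validation, test is both combined.
        rest = all_rows
        test = all_rows
    n_val = int(len(rest) * validation_fraction)
    train = rest[: len(rest) - n_val]
    validation = rest[len(rest) - n_val:]
    return train, validation, test
```

With back_test=True the three sets are disjoint and ordered in time; in the operational branch the test set deliberately overlaps the other two.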

Module contents