openstef.model_selection package
Submodules
openstef.model_selection.model_selection module
- openstef.model_selection.model_selection.backtest_split_default(data, n_folds, test_fraction=0.15, stratification_min_max=True, randomize_fold_split=False)
Default cross validation strategy.
- Parameters:
data (DataFrame)
n_folds (int)
test_fraction (float)
stratification_min_max (bool)
randomize_fold_split (bool)
- Return type:
Iterable[tuple[DataFrame, DataFrame, DataFrame]]
- Returns:
Iterable over train, val, test splits
Notes
We use a generator for lazy evaluation, which avoids making multiple copies of the data.
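The lazy behaviour described in the note can be illustrated with a minimal sketch. It is not the OpenSTEF implementation: `iter_backtest_splits` is a hypothetical helper, a plain sequence of row numbers stands in for the DataFrame, and the choice of contiguous test blocks and a 20% validation tail is an assumption made only for illustration.

```python
from typing import Iterator, Sequence


def iter_backtest_splits(
    rows: Sequence[int], n_folds: int, test_fraction: float = 0.15
) -> Iterator[tuple]:
    """Hypothetical helper: lazily yield (train, val, test) for each fold."""
    n_test = max(1, round(len(rows) * test_fraction))
    for fold in range(n_folds):
        # Assumption: each fold holds out a different contiguous block as test set.
        start = fold * n_test
        test = list(rows[start:start + n_test])
        rest = list(rows[:start]) + list(rows[start + n_test:])
        # Assumption: the last 20% of the remaining rows is used for validation.
        n_val = max(1, len(rest) // 5)
        yield rest[:-n_val], rest[-n_val:], test


splits = iter_backtest_splits(range(100), n_folds=3)
# Nothing has been computed yet; asking for the next item builds only fold 0.
first_train, first_val, first_test = next(splits)
```

Because the folds are produced one at a time, only the split currently being consumed is held in memory.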
- openstef.model_selection.model_selection.group_kfold(input_data, n_folds, randomize_fold_split=True)
Assigns the data to fold groups, based on the date and the number of folds.
Each date gets assigned a number between 0 and n_folds.
- Parameters:
input_data (DataFrame) – Input data
n_folds (int) – Number of folds
randomize_fold_split (bool) – Indicates if random split needs to be applied
- Return type:
DataFrame
- Returns:
Grouped data
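As a rough sketch of date-based grouping, the example below maps every calendar date to one fold number so that all samples of a day end up in the same fold. `assign_fold_per_date` is a hypothetical helper, plain `datetime` objects stand in for the DataFrame index, and the fold numbers here run from 0 to n_folds - 1, which is an assumption of this sketch rather than a statement about the library.

```python
import secrets
from datetime import datetime, timedelta


def assign_fold_per_date(timestamps, n_folds, randomize=True):
    """Hypothetical helper: give every calendar date a single fold number."""
    dates = sorted({ts.date() for ts in timestamps})
    # Cycle through the fold numbers so the folds stay roughly equal in size.
    folds = [i % n_folds for i in range(len(dates))]
    if randomize:
        # Shuffle which date gets which fold; fold sizes are unchanged.
        secrets.SystemRandom().shuffle(folds)
    fold_of_date = dict(zip(dates, folds))
    # Every timestamp inherits the fold of its date.
    return [fold_of_date[ts.date()] for ts in timestamps]


start = datetime(2023, 1, 1)
timestamps = [start + timedelta(hours=6 * i) for i in range(24)]  # six days
fold_ids = assign_fold_per_date(timestamps, n_folds=3, randomize=False)
```

Keeping whole days together avoids placing samples from the same day in different folds.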
- openstef.model_selection.model_selection.random_sample(all_peaks, k)
Random sampling of numbers out of a np.array.
Implemented because the SonarCloud security check does not accept the built-in random functions.
- Parameters:
all_peaks (array) – List with numbers to sample from
k (int) – Number of wanted samples
- Return type:
array
- Returns:
Sorted array with the random samples (dates from the peaks)
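One common way to satisfy security scanners that flag the default pseudo-random generator is to draw from the operating system's entropy source instead. The sketch below shows that idea with the standard library; `secure_random_sample` is a hypothetical name, sampling without replacement is an assumption, and this is not necessarily how `random_sample` is implemented.

```python
import secrets


def secure_random_sample(values, k):
    """Hypothetical helper: draw k distinct items and return them sorted."""
    # SystemRandom draws from os.urandom rather than the Mersenne Twister.
    rng = secrets.SystemRandom()
    return sorted(rng.sample(list(values), k))


picked = secure_random_sample([3, 8, 15, 21, 34, 55], k=3)
```

The result cannot be seeded, so repeated runs give different samples.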
- openstef.model_selection.model_selection.sample_indices_train_val(data, peaks)
Sample indices of given period length assuming the peaks are evenly spread.
- Parameters:
data (DataFrame) – Clean data with features
peaks (DataFrame) – Data frame of selected peaks to sample the dates from
- Return type:
tuple[array, array]
- Returns:
List with the start point of each peak
Sorted list with the indices corresponding to the peak
- openstef.model_selection.model_selection.split_data_train_validation_test(data_, test_fraction=0.1, validation_fraction=0.15, back_test=False, stratification_min_max=True)
Split input data into train, test and validation sets.
Function for splitting data with features into a train, test and validation dataset. In an operational setting the following sequence is returned (when using stratification):
Test >> Train >> Validation
For a back test (indicated with argument “back_test”) the following sequence is returned:
Train >> Validation >> Test
The ratios of the different types can be set with test_fraction and validation_fraction.
- Parameters:
data_ – Cleaned data with features
test_fraction (float) – Number between 0 and 1 that indicates the desired fraction of test data.
validation_fraction (float) – Number between 0 and 1 that indicates the desired fraction of validation data.
back_test (bool) – Indicates if data is intended for a back test.
stratification_min_max (bool) – Indicates if validation data must be sampled as periods, using stratification on min and max values per day. If True, ‘extreme days’ are guaranteed to be included in both the validation and train sets, so that the validation set is representative of the train set.
- Return type:
tuple[DataFrame, DataFrame, DataFrame]
- Returns:
Train data.
Validation data.
Test data.
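The two orderings described for this function can be sketched with contiguous blocks of time-ordered rows. This is only an illustration of the sequence Test >> Train >> Validation versus Train >> Validation >> Test: `chronological_split` is a hypothetical helper, a list of row numbers stands in for the DataFrame, and stratification on daily minima and maxima is deliberately left out.

```python
def chronological_split(rows, test_fraction=0.1, validation_fraction=0.15, back_test=False):
    """Hypothetical helper: contiguous split of time-ordered rows, no stratification."""
    n = len(rows)
    n_test = round(n * test_fraction)
    n_val = round(n * validation_fraction)
    if back_test:
        # Train >> Validation >> Test: the newest rows are held out for testing.
        n_train = n - n_val - n_test
        train = rows[:n_train]
        validation = rows[n_train:n_train + n_val]
        test = rows[n_train + n_val:]
    else:
        # Test >> Train >> Validation: the oldest rows form the test set.
        test = rows[:n_test]
        train = rows[n_test:n - n_val]
        validation = rows[n - n_val:]
    # Same return order as documented: train, validation, test.
    return train, validation, test


rows = list(range(100))  # stand-in for 100 time-ordered samples
train, validation, test = chronological_split(rows, back_test=True)
op_train, op_validation, op_test = chronological_split(rows, back_test=False)
```

In the back-test case the test block covers the most recent rows, which mimics evaluating a model on data it could not have seen during training.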