openstef.pipeline package¶
Submodules¶
openstef.pipeline.create_basecase_forecast module¶
- openstef.pipeline.create_basecase_forecast.create_basecase_forecast_pipeline(pj, input_data)¶
Compute the base case forecast and confidence intervals for a given prediction job and input data.
- Parameters:
pj (PredictionJobDataClass) – Prediction job
input_data (DataFrame) – data frame containing the input data necessary for the prediction.
- Return type:
DataFrame
- Returns:
Base case forecast
- Raises:
NoRealisedLoadError – When no realised load is available for the given datetime range.
- openstef.pipeline.create_basecase_forecast.generate_basecase_confidence_interval(data_with_features)¶
Calculate confidence interval for a basecase forecast.
- Parameters:
data_with_features (DataFrame) – Input dataframe that is used to make the basecase forecast.
- Return type:
DataFrame
- Returns:
Dataframe with the confidence interval.
openstef.pipeline.create_component_forecast module¶
- openstef.pipeline.create_component_forecast.create_components_forecast_pipeline(pj, input_data, weather_data)¶
Pipeline for creating a component forecast using the Dazls prediction model.
- Parameters:
pj (PredictionJobDataClass) – Prediction job
input_data (DataFrame) – Input forecast for the components forecast.
weather_data (DataFrame) – Weather data with ‘radiation’ and ‘windspeed_100m’ columns
- Return type:
DataFrame
- Returns:
DataFrame with component forecasts. The dataframe contains these columns: “forecast_wind_on_shore”, “forecast_solar”, “forecast_other”, “pid”, “customer”, “description”, “type”, “algtype”.
- openstef.pipeline.create_component_forecast.create_input(pj, input_data, weather_data)¶
Prepare the input data.
This data is used for the Dazls model prediction, so it is shaped according to the Dazls model requirements.
- Parameters:
pj (PredictionJobDataClass) – Prediction job
input_data (DataFrame) – Input forecast for the components forecast.
weather_data (DataFrame) – Weather data with ‘radiation’ and ‘windspeed_100m’ columns
- Return type:
DataFrame
- Returns:
Dataframe that is used as input for the Dazls prediction function.
openstef.pipeline.create_forecast module¶
- openstef.pipeline.create_forecast.create_forecast_pipeline(pj, input_data, mlflow_tracking_uri)¶
Create forecast pipeline.
This is the top-level pipeline, which includes loading the most recent model for the given prediction job.
Expected prediction job keys: “id”,
- Parameters:
pj (PredictionJobDataClass) – Prediction job
input_data (DataFrame) – Training input data (without features)
mlflow_tracking_uri (str) – MLflow tracking URI
- Return type:
DataFrame
- Returns:
DataFrame with the forecast
- Raises:
InputDataOngoingZeroFlatlinerError – When all recent load measurements are zero.
LookupError – When no model is found for the given prediction job in MLflow.
- openstef.pipeline.create_forecast.create_forecast_pipeline_core(pj, input_data, model, model_specs)¶
Create forecast pipeline (core).
Computes the forecasts and confidence intervals given a prediction job and input data. This pipeline has no database or persistent storage dependencies.
Expected prediction job keys: “resolution_minutes”, “id”, “type”, “name”, “quantiles”
- Parameters:
pj (PredictionJobDataClass) – Prediction job.
input_data (DataFrame) – Input data for the prediction.
model (OpenstfRegressor) – Model to use for this prediction.
model_specs (ModelSpecificationDataClass) – Model specifications.
- Return type:
DataFrame
- Returns:
Forecast
- Raises:
InputDataOngoingZeroFlatlinerError – When all recent load measurements are zero.
openstef.pipeline.optimize_hyperparameters module¶
- openstef.pipeline.optimize_hyperparameters.optimize_hyperparameters_pipeline(pj, input_data, mlflow_tracking_uri, artifact_folder, n_trials=100)¶
Optimize hyperparameters pipeline.
Expected prediction job keys: “name”, “model”
- Parameters:
pj (PredictionJobDataClass) – Prediction job
input_data (DataFrame) – Raw training input data
mlflow_tracking_uri (str) – Path/URI to the MLflow service
artifact_folder (str) – Path where artifacts, such as trained models, are stored
horizons – horizons for feature engineering.
n_trials (int) – The number of trials. Defaults to N_TRIALS.
- Raises:
ValueError – If the input_data is insufficient.
InputDataInsufficientError – If the input dataframe is empty.
InputDataWrongColumnOrderError – If the load column is missing in the input dataframe.
OldModelHigherScoreError – When old model is better than new model.
- Return type:
dict
- Returns:
Optimized hyperparameters.
- openstef.pipeline.optimize_hyperparameters.optimize_hyperparameters_pipeline_core(pj, input_data, horizons=[0.25, 47.0], n_trials=100)¶
Optimize hyperparameters pipeline core.
Expected prediction job keys: “name”, “model”
- Parameters:
pj (PredictionJobDataClass) – Prediction job
input_data (DataFrame) – Raw training input data
horizons (list[float]) – horizons for feature engineering in hours.
n_trials (int) – The number of trials. Defaults to N_TRIALS.
- Raises:
ValueError – If the input_data is insufficient.
InputDataInsufficientError – If the input dataframe is empty.
InputDataWrongColumnOrderError – If the load column is missing in the input dataframe.
OldModelHigherScoreError – When old model is better than new model.
InputDataOngoingZeroFlatlinerError – When all recent load measurements are zero.
- Return type:
tuple[OpenstfRegressor, ModelSpecificationDataClass, Report, dict, int, dict[str, Any]]
- Returns:
Best model,
Model specifications of the best model,
Report of the best training round,
Trials,
Best trial number,
Optimized hyperparameters.
- openstef.pipeline.optimize_hyperparameters.optuna_optimization(pj, objective, validated_data_with_features, n_trials)¶
Perform hyperparameter optimization with optuna.
- Parameters:
pj (PredictionJobDataClass) – Prediction job
objective (RegressorObjective) – Objective function for optuna
validated_data_with_features (DataFrame) – cleaned input dataframe
n_trials (int) – number of optuna trials
- Return type:
tuple[Study, RegressorObjective]
- Returns:
Optimization study from optuna
The objective object used by optuna
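The trial loop behind a hyperparameter search can be sketched without optuna. The block below is a plain random search, used here as a stand-in and not the sampler optuna applies. The names `random_search` and `toy_objective`, the search space, and the two hyperparameters are hypothetical and chosen for illustration only. It shows how `n_trials` bounds the search and how a best trial number and its parameters are obtained.

```python
import random
from typing import Callable, Dict, Tuple


def random_search(
    objective: Callable[[Dict[str, float]], float], n_trials: int, seed: int = 0
) -> Tuple[int, Dict[str, float], float]:
    """Run n_trials random trials and return (best trial number, params, score)."""
    rng = random.Random(seed)
    best: Tuple[int, Dict[str, float], float] = (-1, {}, float("inf"))
    for trial_number in range(n_trials):
        # Hypothetical search space with two illustrative hyperparameters.
        params = {
            "learning_rate": rng.uniform(0.01, 0.5),
            "max_depth": float(rng.randint(2, 10)),
        }
        score = objective(params)  # lower is better
        if score < best[2]:
            best = (trial_number, params, score)
    return best


def toy_objective(params: Dict[str, float]) -> float:
    # Toy loss with its minimum at learning_rate=0.1 and max_depth=6.
    return (params["learning_rate"] - 0.1) ** 2 + (params["max_depth"] - 6) ** 2


best_trial, best_params, best_score = random_search(toy_objective, n_trials=100)
```

With optuna, the loop is handled by the study object and the sampling is guided by earlier trials, so fewer trials are typically needed than with random search.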
openstef.pipeline.train_create_forecast_backtest module¶
- openstef.pipeline.train_create_forecast_backtest.train_model_and_forecast_back_test(pj, modelspecs, input_data, training_horizons=None, n_folds=1)¶
Pipeline for a back test.
When the number of folds is larger than 1, the pipeline is applied as a back test that forecasts the entire input range.
It uses k-fold cross validation to split the data multiple times.
The results of all test sets are combined to obtain the forecast for the whole input range.
The days for each fold can be selected either randomly or not.
DO NOT USE THIS PIPELINE FOR OPERATIONAL FORECASTS
- Parameters:
pj (PredictionJobDataClass) – Prediction job.
modelspecs (ModelSpecificationDataClass) – Dataclass containing model specifications
input_data (DataFrame) – Input data
training_horizons (list[float]) – horizons to train on in hours. These horizons are also used to make predictions (one for every horizon)
n_folds (int) – number of folds to apply (if 1, no cross validation will be applied)
- Return type:
tuple[DataFrame, list[OpenstfRegressor], list[DataFrame], list[DataFrame], list[DataFrame]]
- Returns:
Forecast (pandas.DataFrame)
Fitted models (list[OpenstfRegressor])
Train data sets (list[pd.DataFrame])
Validation data sets (list[pd.DataFrame])
Test data sets (list[pd.DataFrame])
- Raises:
InputDataInsufficientError – when input data is insufficient.
InputDataWrongColumnOrderError – when input data has an invalid column order.
ValueError – when the horizon is a string and the corresponding column is not in the input data.
InputDataOngoingZeroFlatlinerError – when all recent load measurements are zero.
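The fold logic described for the back test can be sketched as follows. This is a minimal illustration of assigning days to folds so that every day ends up in exactly one test set. It is not the splitting code used by the pipeline; `split_days_into_folds`, its parameters, and the interleaved assignment in the non-random case are assumptions.

```python
import random
from datetime import date, timedelta
from typing import List, Optional


def split_days_into_folds(
    days: List[date], n_folds: int, randomize: bool = False, seed: Optional[int] = None
) -> List[List[date]]:
    """Assign every day to exactly one of n_folds test sets."""
    days = list(days)
    if randomize:
        random.Random(seed).shuffle(days)
    # Fold i takes every n_folds-th day, starting at position i.
    return [sorted(days[i::n_folds]) for i in range(n_folds)]


days = [date(2021, 11, 5) + timedelta(days=d) for d in range(14)]
folds = split_days_into_folds(days, n_folds=3, randomize=True, seed=0)

for test_days in folds:
    # Everything outside the test fold is available for training and validation.
    train_days = [d for d in days if d not in test_days]
    # ... train on train_days, then forecast test_days ...

# Concatenating the test folds covers the whole input range exactly once.
covered = sorted(d for fold in folds for d in fold)
```

Because each day is a test day in exactly one fold, joining the per-fold forecasts yields one forecast for the whole input range.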
- openstef.pipeline.train_create_forecast_backtest.train_model_and_forecast_test_core(pj, modelspecs, train_data, validation_data, test_data)¶
Trains the model and forecasts on the test set.
- Parameters:
pj (PredictionJobDataClass) – Prediction job.
modelspecs (ModelSpecificationDataClass) – Dataclass containing model specifications
train_data (DataFrame) – Train data with computed features
validation_data (DataFrame) – Validation data with computed features
test_data (DataFrame) – Test data with computed features
- Return type:
tuple[OpenstfRegressor, DataFrame]
- Returns:
The trained model
The forecast on the test set.
- Raises:
NotImplementedError – When using invalid model type in the prediction job.
InputDataWrongColumnOrderError – When ‘load’ column is not first and ‘horizon’ column is not last.
openstef.pipeline.train_model module¶
- openstef.pipeline.train_model.train_model_pipeline(pj, input_data, check_old_model_age, mlflow_tracking_uri, artifact_folder)¶
Middle level pipeline that takes care of all persistent storage dependencies.
Expected prediction job keys: “id”, “model”, “hyper_params”, “feature_names”.
- Parameters:
pj (PredictionJobDataClass) – Prediction job
input_data (DataFrame) – Raw training input data
check_old_model_age (bool) – Check if training should be skipped because the model is too young
mlflow_tracking_uri (str) – Tracking URI for MLflow
artifact_folder (str) – Path where artifacts, such as trained models, are stored
- Return type:
Optional[tuple[DataFrame, DataFrame, DataFrame]]
- Returns:
If pj.save_train_forecasts is False, None is returned. Otherwise:
The train dataset with forecasts
The validation dataset with forecasts
The test dataset with forecasts
- Raises:
InputDataInsufficientError – when input data is insufficient.
InputDataWrongColumnOrderError – when input data has an invalid column order. ‘load’ column should be first and ‘horizon’ column last.
OldModelHigherScoreError – When old model is better than new model.
SkipSaveTrainingForecasts – If old model is better or younger than MAXIMUM_MODEL_AGE, the model is not saved.
- openstef.pipeline.train_model.train_model_pipeline_core(pj, model_specs, input_data, old_model=None, horizons=[0.25, 47.0])¶
Train model core pipeline.
Trains a new model given a prediction job, input data and compares it to an old model. This pipeline has no database or persistent storage dependencies.
- Parameters:
pj (PredictionJobDataClass) – Prediction job
model_specs (ModelSpecificationDataClass) – Dataclass containing model specifications
input_data (DataFrame) – Input data
old_model (OpenstfRegressor) – Old model to compare to. Defaults to None.
horizons (list[float]) – Horizons to train on in hours, relevant for feature engineering.
- Raises:
InputDataInsufficientError – when input data is insufficient.
InputDataWrongColumnOrderError – when input data has an invalid column order.
OldModelHigherScoreError – When old model is better than new model.
InputDataOngoingZeroFlatlinerError – when all recent load measurements are zero.
- Return type:
Tuple[OpenstfRegressor, Report, ModelSpecificationDataClass, tuple[DataFrame, DataFrame, DataFrame]]
- Returns:
Fitted_model (OpenstfRegressor)
Report (Report)
Modelspecs (ModelSpecificationDataClass)
Datasets (tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]): The train, validation and test sets
- openstef.pipeline.train_model.train_pipeline_common(pj, model_specs, input_data, horizons, test_fraction=0.0, backtest=False, test_data_predefined=pd.DataFrame())¶
Common pipeline shared with operational training and backtest training.
- Parameters:
pj (PredictionJobDataClass) – Prediction job
model_specs (ModelSpecificationDataClass) – Dataclass containing model specifications
input_data (DataFrame) – Input data
horizons (list[float]) – horizons to train on in hours.
test_fraction (float) – fraction of data to use for testing
backtest (bool) – boolean if we need to do a backtest
test_data_predefined (DataFrame) – Predefined test data frame to be used in the pipeline (empty data frame by default)
- Return type:
tuple[OpenstfRegressor, Report, DataFrame, DataFrame, DataFrame, DataFrame]
- Returns:
The trained model
Report
The train data
The validation data
The test data
- Raises:
InputDataInsufficientError – when input data is insufficient.
InputDataWrongColumnOrderError – when input data has an invalid column order. ‘load’ column should be first and ‘horizon’ column last.
InputDataOngoingZeroFlatlinerError – when all recent load measurements are zero.
- openstef.pipeline.train_model.train_pipeline_step_compute_features(pj, model_specs, input_data, horizons=list[float])¶
Compute features and perform consistency checks.
- Parameters:
pj (PredictionJobDataClass) – Prediction job
model_specs (ModelSpecificationDataClass) – Dataclass containing model specifications
input_data (DataFrame) – Input data
horizons – horizons to train on in hours.
- Return type:
DataFrame
- Returns:
The dataframe with the features needed to train the model.
- Raises:
InputDataInsufficientError – when input data is insufficient.
InputDataWrongColumnOrderError – when input data has an invalid column order.
ValueError – when the horizon is a string and the corresponding column is not in the input data.
InputDataOngoingZeroFlatlinerError – when all recent load measurements are zero.
- openstef.pipeline.train_model.train_pipeline_step_load_model(pj, serializer)¶
- Return type:
Tuple[OpenstfRegressor, ModelSpecificationDataClass, Union[int, float]]
- openstef.pipeline.train_model.train_pipeline_step_split_data(data_with_features, pj, test_fraction, backtest=False, test_data_predefined=pd.DataFrame())¶
The default way to perform the train, validation, test split.
- Parameters:
data_with_features (DataFrame) – Input data
pj (PredictionJobDataClass) – Prediction job
test_fraction (float) – fraction of data to use for testing
backtest (bool) – boolean if we need to do a backtest
test_data_predefined (DataFrame) – Predefined test data frame to be used in the pipeline (empty data frame by default)
- Return type:
Tuple[DataFrame, DataFrame, DataFrame, DataFrame]
- Returns:
Train dataset
Validation dataset
Test dataset
- openstef.pipeline.train_model.train_pipeline_step_train_model(pj, model_specs, train_data, validation_data)¶
Train the model.
- Parameters:
pj (PredictionJobDataClass) – Prediction job
model_specs (ModelSpecificationDataClass) – Dataclass containing model specifications
train_data (DataFrame) – The training data
validation_data (DataFrame) – The validation data
- Return type:
OpenstfRegressor
- Returns:
The trained model
- Raises:
NotImplementedError – When using invalid model type in the prediction job.
InputDataWrongColumnOrderError – When ‘load’ column is not first and ‘horizon’ column is not last.
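The column order requirement behind InputDataWrongColumnOrderError can be sketched as a small check on the column names. This is an illustration only: `ColumnOrderError` and `check_column_order` are hypothetical stand-ins, and the sketch assumes that either violation ('load' not first, or 'horizon' not last) is enough to raise.

```python
from typing import Sequence


class ColumnOrderError(ValueError):
    """Hypothetical stand-in for InputDataWrongColumnOrderError."""


def check_column_order(columns: Sequence[str]) -> None:
    """Raise if 'load' is not the first or 'horizon' is not the last column."""
    if not columns or columns[0] != "load" or columns[-1] != "horizon":
        raise ColumnOrderError(
            "'load' column should be first and 'horizon' column last, "
            f"got: {list(columns)}"
        )


# A valid layout: target first, features in between, horizon last.
check_column_order(["load", "radiation", "windspeed_100m", "horizon"])
```

With a pandas dataframe, the same check would be applied to `list(df.columns)`.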
openstef.pipeline.utils module¶
- openstef.pipeline.utils.generate_forecast_datetime_range(forecast_data)¶
Generate the forecast range based on the last cluster of null values in the first (target) column of the forecast data.
Example
A forecast dataset with data between 2021-11-05 and 2021-11-19, and the target column ‘load’ as first column is given as input to this function. The first column ‘load’ has null values between 2021-11-17 04:00:00 and 2021-11-19 05:00:00. The null values at the end of the column indicate when forecasts are needed. Therefore this function sets starting time of forecasts as 2021-11-17 04:00:00 and end time of forecasts as 2021-11-19 05:00:00.
- Parameters:
forecast_data (DataFrame) – The forecast dataframe.
- Return type:
tuple[datetime, datetime]
- Returns:
Start and end datetimes of the forecast range.
- Raises:
ValueError – If the target column does not have null values.
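The example above can be reproduced with a small sketch that works on plain Python lists instead of a dataframe. It is not the library's implementation: `trailing_null_range` is a hypothetical helper, and the hourly resolution of the data is an assumption. It locates the last contiguous run of missing values in the target column and returns the first and last timestamp of that run.

```python
from datetime import datetime, timedelta
from typing import Optional, Sequence, Tuple


def trailing_null_range(
    timestamps: Sequence[datetime], values: Sequence[Optional[float]]
) -> Tuple[datetime, datetime]:
    """Return the start and end of the last contiguous run of None values."""
    # Find the index of the last null value, scanning from the end.
    end = next(
        (i for i in range(len(values) - 1, -1, -1) if values[i] is None), None
    )
    if end is None:
        raise ValueError("Target column does not have null values.")
    # Walk backwards while the values stay null to find the start of the run.
    start = end
    while start > 0 and values[start - 1] is None:
        start -= 1
    return timestamps[start], timestamps[end]


# Hourly data from 2021-11-05 00:00 up to and including 2021-11-19 05:00.
first = datetime(2021, 11, 5)
last = datetime(2021, 11, 19, 5)
n_hours = int((last - first) / timedelta(hours=1)) + 1
timestamps = [first + timedelta(hours=h) for h in range(n_hours)]
# Load is known until 2021-11-17 03:00 and null afterwards.
load = [1.0 if t < datetime(2021, 11, 17, 4) else None for t in timestamps]

forecast_start, forecast_end = trailing_null_range(timestamps, load)
```

Here `forecast_start` is 2021-11-17 04:00:00 and `forecast_end` is 2021-11-19 05:00:00, matching the example. A column without any null values raises ValueError.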