openstef.pipeline package

Submodules

openstef.pipeline.create_basecase_forecast module

openstef.pipeline.create_basecase_forecast.create_basecase_forecast_pipeline(pj, input_data)

Compute the base case forecast and confidence intervals for a given prediction job and input data.

Parameters:
  • pj (PredictionJobDataClass) – Prediction job

  • input_data (DataFrame) – data frame containing the input data necessary for the prediction.

Return type:

DataFrame

Returns:

Base case forecast

Raises:

NoRealisedLoadError – When no realised load is available for the given datetime range.
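
A minimal usage sketch, assuming pj is an existing PredictionJobDataClass and input_data is a datetime-indexed DataFrame containing the realised load needed to derive the basecase:

    from openstef.pipeline.create_basecase_forecast import create_basecase_forecast_pipeline

    # pj and input_data are assumed to be prepared elsewhere.
    basecase_forecast = create_basecase_forecast_pipeline(pj, input_data)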

openstef.pipeline.create_basecase_forecast.generate_basecase_confidence_interval(data_with_features)

Calculate confidence interval for a basecase forecast.

Parameters:

data_with_features (DataFrame) – Input dataframe that is used to make the basecase forecast.

Return type:

DataFrame

Returns:

Dataframe with the confidence interval.

openstef.pipeline.create_component_forecast module

openstef.pipeline.create_component_forecast.create_components_forecast_pipeline(pj, input_data, weather_data)

Pipeline for creating a component forecast using Dazls prediction model.

Parameters:
  • pj (PredictionJobDataClass) – Prediction job

  • input_data (DataFrame) – Input forecast for the components forecast.

  • weather_data (DataFrame) – Weather data with ‘radiation’ and ‘windspeed_100m’ columns

Return type:

DataFrame

Returns:

DataFrame with component forecasts. The dataframe contains these columns: “forecast_wind_on_shore”, “forecast_solar”, “forecast_other”, “pid”, “customer”, “description”, “type”, “algtype”
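
A hedged usage sketch, assuming pj, the input forecast frame and a weather frame covering the same period are available:

    from openstef.pipeline.create_component_forecast import create_components_forecast_pipeline

    # weather_data must contain at least the 'radiation' and 'windspeed_100m' columns.
    components = create_components_forecast_pipeline(pj, input_data, weather_data)
    print(components[["forecast_wind_on_shore", "forecast_solar", "forecast_other"]].head())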

openstef.pipeline.create_component_forecast.create_input(pj, input_data, weather_data)

This function prepares the input data.

The data will be used for the Dazls model prediction, so it is formatted according to the Dazls model requirements.

Parameters:
  • pj (PredictionJobDataClass) – Prediction job

  • input_data (DataFrame) – Input forecast for the components forecast.

  • weather_data (DataFrame) – Weather data with ‘radiation’ and ‘windspeed_100m’ columns

Return type:

DataFrame

Returns:

A dataframe that can be used by the Dazls prediction function.

openstef.pipeline.create_forecast module

openstef.pipeline.create_forecast.create_forecast_pipeline(pj, input_data, mlflow_tracking_uri)

Create forecast pipeline.

This is the top-level pipeline, which includes loading the most recent model for the given prediction job.

Expected prediction job keys: “id”

Parameters:
  • pj (PredictionJobDataClass) – Prediction job

  • input_data (DataFrame) – Training input data (without features)

  • mlflow_tracking_uri (str) – MlFlow tracking URI

Return type:

DataFrame

Returns:

DataFrame with the forecast

Raises:

openstef.pipeline.create_forecast.create_forecast_pipeline_core(pj, input_data, model, model_specs)

Create forecast pipeline (core).

Computes the forecasts and confidence intervals given a prediction job and input data. This pipeline has no database or persistent storage dependencies.

Expected prediction job keys: “resolution_minutes”, “id”, “type”, “name”, “quantiles”

Parameters:
  • pj (PredictionJobDataClass) – Prediction job

  • input_data (DataFrame) – Input data

  • model (OpenstfRegressor) – Model to use for the forecast

  • model_specs (ModelSpecificationDataClass) – Dataclass containing model specifications

Return type:

DataFrame

Returns:

Forecast

Raises:

InputDataOngoingZeroFlatlinerError – When all recent load measurements are zero.
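
A hedged usage sketch of the top-level entry point of this module; pj and input_data are assumed to exist, input_data is expected to contain the historic load and end with null ‘load’ values marking the period to forecast, and the tracking URI is an illustrative placeholder for an MLflow deployment that holds a trained model for pj:

    from openstef.pipeline.create_forecast import create_forecast_pipeline

    forecast = create_forecast_pipeline(
        pj, input_data, mlflow_tracking_uri="http://localhost:5000"
    )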

openstef.pipeline.optimize_hyperparameters module

openstef.pipeline.optimize_hyperparameters.optimize_hyperparameters_pipeline(pj, input_data, mlflow_tracking_uri, artifact_folder, n_trials=100)

Optimize hyperparameters pipeline.

Expected prediction job keys: “name”, “model”

Parameters:
  • pj (PredictionJobDataClass) – Prediction job

  • input_data (DataFrame) – Raw training input data

  • mlflow_tracking_uri (str) – Path/Uri to mlflow service

  • artifact_folder (str) – Path where artifacts, such as trained models, are stored

  • horizons – horizons for feature engineering.

  • n_trials (int) – The number of trials. Defaults to N_TRIALS.

Raises:

Return type:

dict

Returns:

Optimized hyperparameters.
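
A hedged usage sketch; pj and the raw input_data are assumed to exist, and the tracking URI and artifact folder are illustrative placeholders:

    from openstef.pipeline.optimize_hyperparameters import optimize_hyperparameters_pipeline

    best_params = optimize_hyperparameters_pipeline(
        pj,
        input_data,
        mlflow_tracking_uri="http://localhost:5000",
        artifact_folder="./artifacts",
        n_trials=20,
    )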

openstef.pipeline.optimize_hyperparameters.optimize_hyperparameters_pipeline_core(pj, input_data, horizons=[0.25, 47.0], n_trials=100)

Optimize hyperparameters pipeline core.

Expected prediction job keys: “name”, “model”

Parameters:
  • pj (PredictionJobDataClass) – Prediction job

  • input_data (DataFrame) – Raw training input data

  • horizons (list[float]) – horizons for feature engineering in hours.

  • n_trials (int) – The number of trials. Defaults to N_TRIALS.

Raises:

Return type:

tuple[OpenstfRegressor, ModelSpecificationDataClass, Report, dict, int, dict[str, Any]]

Returns:

  • Best model,

  • Model specifications of the best model,

  • Report of the best training round,

  • Trials,

  • Best trial number,

  • Optimized hyperparameters.
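
A hedged sketch of the core variant; pj and input_data are assumed to exist, and the six return values are unpacked in the order listed above:

    from openstef.pipeline.optimize_hyperparameters import optimize_hyperparameters_pipeline_core

    (
        best_model,
        best_model_specs,
        report,
        trials,
        best_trial_number,
        best_params,
    ) = optimize_hyperparameters_pipeline_core(pj, input_data, n_trials=20)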

openstef.pipeline.optimize_hyperparameters.optuna_optimization(pj, objective, validated_data_with_features, n_trials)

Perform hyperparameter optimization with optuna.

Parameters:
  • pj (PredictionJobDataClass) – Prediction job

  • objective (RegressorObjective) – Objective function for optuna

  • validated_data_with_features (DataFrame) – cleaned input dataframe

  • n_trials (int) – number of optuna trials

Return type:

tuple[Study, RegressorObjective]

Returns:

  • Optimization study from optuna

  • The objective object used by optuna

openstef.pipeline.train_create_forecast_backtest module

openstef.pipeline.train_create_forecast_backtest.train_model_and_forecast_back_test(pj, modelspecs, input_data, training_horizons=None, n_folds=1)

Pipeline for a back test.

When the number of folds is larger than 1, the pipeline is applied as a back test that forecasts the entire input range.

  • Makes use of k-fold cross validation in order to split the data multiple times.

  • The results of all test sets are combined to obtain the forecast for the whole input range.

  • The days for each fold can be selected either randomly or not.

DO NOT USE THIS PIPELINE FOR OPERATIONAL FORECASTS

Parameters:
  • pj (PredictionJobDataClass) – Prediction job.

  • modelspecs (ModelSpecificationDataClass) – Dataclass containing model specifications

  • input_data (DataFrame) – Input data

  • training_horizons (Optional[list[float]]) – horizons to train on in hours. These horizons are also used to make predictions (one for every horizon)

  • n_folds (int) – number of folds to apply (if 1, no cross validation will be applied)

Return type:

tuple[DataFrame, list[OpenstfRegressor], list[DataFrame], list[DataFrame], list[DataFrame]]

Returns:

  • Forecast (pandas.DataFrame)

  • Fitted models (list[OpenstfRegressor])

  • Train data sets (list[pd.DataFrame])

  • Validation data sets (list[pd.DataFrame])

  • Test data sets (list[pd.DataFrame])

Raises:

openstef.pipeline.train_create_forecast_backtest.train_model_and_forecast_test_core(pj, modelspecs, train_data, validation_data, test_data)

Trains the model and forecast on the test set.

Parameters:
  • pj (PredictionJobDataClass) – Prediction job.

  • modelspecs (ModelSpecificationDataClass) – Dataclass containing model specifications

  • train_data (DataFrame) – Train data with computed features

  • validation_data (DataFrame) – Validation data with computed features

  • test_data (DataFrame) – Test data with computed features

Return type:

tuple[OpenstfRegressor, DataFrame]

Returns:

  • The trained model

  • The forecast on the test set.

Raises:
  • NotImplementedError – When using invalid model type in the prediction job.

  • InputDataWrongColumnOrderError – When ‘load’ column is not first and ‘horizon’ column is not last.
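
A hedged sketch of running a back test with the pipeline documented at the top of this module; pj, modelspecs and input_data are assumed to exist, and the horizons are illustrative values:

    from openstef.pipeline.train_create_forecast_backtest import train_model_and_forecast_back_test

    # With n_folds > 1, k-fold cross validation is used and the per-fold test
    # forecasts are combined into one forecast for the whole input range.
    forecast, models, train_sets, validation_sets, test_sets = train_model_and_forecast_back_test(
        pj, modelspecs, input_data, training_horizons=[0.25, 47.0], n_folds=3
    )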

openstef.pipeline.train_model module

openstef.pipeline.train_model.train_model_pipeline(pj, input_data, check_old_model_age, mlflow_tracking_uri, artifact_folder)

Middle level pipeline that takes care of all persistent storage dependencies.

Expected prediction job keys: “id”, “model”, “hyper_params”, “feature_names”.

Parameters:
  • pj (PredictionJobDataClass) – Prediction job

  • input_data (DataFrame) – Raw training input data

  • check_old_model_age (bool) – Check if training should be skipped because the model is too young

  • mlflow_tracking_uri (str) – Tracking URI for MLFlow

  • artifact_folder (str) – Path where artifacts, such as trained models, are stored

Return type:

Optional[tuple[DataFrame, DataFrame, DataFrame]]

Returns:

If pj.save_train_forecasts is False, None is returned. Otherwise:

  • The train dataset with forecasts

  • The validation dataset with forecasts

  • The test dataset with forecasts

Raises:

openstef.pipeline.train_model.train_model_pipeline_core(pj, model_specs, input_data, old_model=None, horizons=[0.25, 47.0])

Train model core pipeline.

Trains a new model given a prediction job, input data and compares it to an old model. This pipeline has no database or persistent storage dependencies.

Parameters:
  • pj (PredictionJobDataClass) – Prediction job

  • model_specs (ModelSpecificationDataClass) – Dataclass containing model specifications

  • input_data (DataFrame) – Input data

  • old_model (Optional[OpenstfRegressor]) – Old model to compare to. Defaults to None.

  • horizons (list[float]) – Horizons to train on in hours, relevant for feature engineering.

Raises:

Return type:

tuple[OpenstfRegressor, Report, ModelSpecificationDataClass, tuple[DataFrame, DataFrame, DataFrame]]

Returns:

  • Fitted_model (OpenstfRegressor)

  • Report (Report)

  • Modelspecs (ModelSpecificationDataClass)

  • Datasets (tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]): The train, validation and test sets
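
A hedged sketch of the core training pipeline; pj, model_specs and input_data are assumed to exist, and the four return values are unpacked in the order listed above:

    from openstef.pipeline.train_model import train_model_pipeline_core

    fitted_model, report, model_specs_out, datasets = train_model_pipeline_core(
        pj, model_specs, input_data
    )
    train_set, validation_set, test_set = datasets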

openstef.pipeline.train_model.train_pipeline_common(pj, model_specs, input_data, horizons, test_fraction=0.0, backtest=False, test_data_predefined=pd.DataFrame())

Common pipeline shared by operational training and backtest training.

Parameters:
  • pj (PredictionJobDataClass) – Prediction job

  • model_specs (ModelSpecificationDataClass) – Dataclass containing model specifications

  • input_data (DataFrame) – Input data

  • horizons (list[float]) – horizons to train on in hours.

  • test_fraction (float) – fraction of data to use for testing

  • backtest (bool) – Whether to run a backtest

  • test_data_predefined (DataFrame) – Predefined test data frame to be used in the pipeline (empty data frame by default)

Return type:

tuple[OpenstfRegressor, Report, DataFrame, DataFrame, DataFrame]

Returns:

  • The trained model

  • Report

  • The train data

  • The validation data

  • The test data

Raises:

openstef.pipeline.train_model.train_pipeline_step_compute_features(pj, model_specs, input_data, horizons=list[float])

Compute features and perform consistency checks.

Parameters:
  • pj (PredictionJobDataClass) – Prediction job

  • model_specs (ModelSpecificationDataClass) – Dataclass containing model specifications

  • input_data (DataFrame) – Input data

  • horizons (list[float]) – Horizons to train on in hours, relevant for feature engineering.

Return type:

DataFrame

Returns:

The dataframe with the features needed to train the model

Raises:

openstef.pipeline.train_model.train_pipeline_step_load_model(pj, serializer)

Return type:

tuple[OpenstfRegressor, ModelSpecificationDataClass, Union[int, float]]

openstef.pipeline.train_model.train_pipeline_step_split_data(data_with_features, pj, test_fraction, backtest=False, test_data_predefined=pd.DataFrame())

The default way to perform train, val, test split.

Parameters:
  • data_with_features (DataFrame) – Input data

  • pj (PredictionJobDataClass) – Prediction job

  • test_fraction (float) – fraction of data to use for testing

  • backtest (bool) – Whether to run a backtest

  • test_data_predefined (DataFrame) – Predefined test data frame to be used in the pipeline (empty data frame by default)

Return type:

tuple[DataFrame, DataFrame, DataFrame]

Returns:

  • Train dataset

  • Validation dataset

  • Test dataset

openstef.pipeline.train_model.train_pipeline_step_train_model(pj, model_specs, train_data, validation_data)

Train the model.

Parameters:
  • pj (PredictionJobDataClass) – Prediction job

  • model_specs (ModelSpecificationDataClass) – Dataclass containing model specifications

  • train_data (DataFrame) – Train data with computed features

  • validation_data (DataFrame) – Validation data with computed features

Return type:

OpenstfRegressor

Returns:

The trained model

Raises:
  • NotImplementedError – When using invalid model type in the prediction job.

  • InputDataWrongColumnOrderError – When ‘load’ column is not first and ‘horizon’ column is not last.
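
The step functions above can be chained into a small custom training flow. A hedged sketch, assuming pj, model_specs and raw input_data exist; the horizons and test fraction are illustrative values, and the split step is assumed to return the three datasets listed in its Returns section:

    from openstef.pipeline.train_model import (
        train_pipeline_step_compute_features,
        train_pipeline_step_split_data,
        train_pipeline_step_train_model,
    )

    # Compute features for the chosen horizons.
    data_with_features = train_pipeline_step_compute_features(
        pj, model_specs, input_data, horizons=[0.25, 47.0]
    )
    # Split into train / validation / test sets.
    train_data, validation_data, test_data = train_pipeline_step_split_data(
        data_with_features, pj, test_fraction=0.15
    )
    # Train the model on the train and validation sets.
    model = train_pipeline_step_train_model(pj, model_specs, train_data, validation_data)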

openstef.pipeline.utils module

openstef.pipeline.utils.generate_forecast_datetime_range(forecast_data)

Generate forecast range based on last cluster of null values in first target column of forecast data.

Example

A forecast dataset with data between 2021-11-05 and 2021-11-19, and the target column ‘load’ as its first column, is given as input to this function. The ‘load’ column has null values between 2021-11-17 04:00:00 and 2021-11-19 05:00:00. These trailing null values indicate where forecasts are needed. The function therefore sets the forecast start time to 2021-11-17 04:00:00 and the forecast end time to 2021-11-19 05:00:00.

Parameters:

forecast_data (DataFrame) – The forecast dataframe.

Return type:

tuple[datetime, datetime]

Returns:

Start and end datetimes of the forecast range.

Raises:

ValueError – If the target column does not have null values.
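
A small, self-contained illustration of the behaviour described in the example above; the timestamps and load values are made up:

    import numpy as np
    import pandas as pd

    from openstef.pipeline.utils import generate_forecast_datetime_range

    index = pd.date_range("2021-11-17 00:00:00", periods=8, freq="H")
    forecast_data = pd.DataFrame(
        {"load": [1.0, 2.0, 3.0, 4.0, 5.0, np.nan, np.nan, np.nan]},
        index=index,
    )

    # The trailing null values run from 05:00 to 07:00, so that becomes the forecast range.
    start, end = generate_forecast_datetime_range(forecast_data)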

Module contents