π Backtesting OpenSTEF Models with OpenSTEF-BEAM#
This tutorial demonstrates how to use OpenSTEF-BEAM (Backtesting, Evaluation, Analysis, Metrics) to systematically evaluate forecasting models. Youβll learn how to:
Configure benchmark experiments with multiple model types
Run parallel backtests across dozens of energy assets
Compare model performance with standardized metrics
Generate analysis reports with interactive visualizations
BEAM provides a rigorous framework for model evaluation, ensuring fair comparisons and reproducible results.
π§ Environment Setup#
First, we configure thread settings to prevent conflicts with XGBoostβs internal parallelization when running multiple processes.
# --- Thread Configuration ---
# Prevent thread contention when running parallel backtests with XGBoost
import os
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["OPENBLAS_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"
# --- Standard Imports ---
import logging
import multiprocessing
from pathlib import Path
logging.basicConfig(level=logging.INFO, format="[%(asctime)s][%(levelname)s] %(message)s")
βοΈ Benchmark Configuration#
Configure the benchmark parameters:
Output paths β where to store results for each model
Forecast horizons β how far ahead to predict (using ISO 8601 duration format)
Quantiles β prediction intervals for probabilistic evaluation
# Import types for configuration
from openstef_beam.benchmarking.benchmarks.liander2024 import Liander2024Category
from openstef_core.types import LeadTime, Q # LeadTime: forecast horizon, Q: quantile
# --- Output Paths ---
OUTPUT_PATH = Path("./benchmark_results")
BENCHMARK_RESULTS_PATH_XGBOOST = OUTPUT_PATH / "XGBoost"
BENCHMARK_RESULTS_PATH_GBLINEAR = OUTPUT_PATH / "GBLinear"
# --- Parallelization ---
N_PROCESSES = multiprocessing.cpu_count() # Use all available CPU cores
print(f"π₯οΈ Running with {N_PROCESSES} parallel processes")
# --- Forecast Configuration ---
FORECAST_HORIZONS = [LeadTime.from_string("P3D")] # 3-day ahead forecast (ISO 8601: P3D)
# Quantiles for probabilistic forecasting (7 quantiles covering 5th to 95th percentile)
PREDICTION_QUANTILES = [
Q(0.05),
Q(0.1),
Q(0.3), # Lower quantiles
Q(0.5), # Median
Q(0.7),
Q(0.9),
Q(0.95), # Upper quantiles
]
# --- Benchmark Filter (optional) ---
# Set to None to run all categories, or specify categories like:
# BENCHMARK_FILTER = [Liander2024Category.TRANSFORMER, Liander2024Category.MV_FEEDER]
BENCHMARK_FILTER: list[Liander2024Category] | None = None
π οΈ Model Configuration#
We define a common configuration that both models share, then create model-specific variants. This ensures fair comparison by keeping all settings identical except the model type.
Available Models:#
XGBoost β Gradient boosting trees (handles complex nonlinear patterns)
GBLinear β Gradient boosted linear model (better extrapolation, faster)
# Import workflow configuration
from openstef_models.presets import ForecastingWorkflowConfig
# Common configuration shared by all models
# This ensures fair comparison by keeping all settings identical
common_config = ForecastingWorkflowConfig(
model_id="benchmark_model_",
run_name=None,
model="flatliner", # Placeholder - will be overwritten per model
# Forecast settings
horizons=FORECAST_HORIZONS,
quantiles=PREDICTION_QUANTILES,
# Model reuse: reuse trained model for same target (speeds up backtesting)
model_reuse_enable=True,
mlflow_storage=None, # Disable MLflow for this demo
# Weather feature column mappings (match dataset column names)
radiation_column="shortwave_radiation",
wind_speed_column="wind_speed_80m", # 80m wind speed for better wind park predictions
pressure_column="surface_pressure",
temperature_column="temperature_2m",
relative_humidity_column="relative_humidity_2m",
# Additional features
energy_price_column="EPEX_NL", # Day-ahead electricity price
rolling_aggregate_features=["mean", "median", "max", "min"], # Rolling window stats
# Logging
verbosity=0, # Quiet mode for batch processing
)
# Create model-specific configurations by copying common config and updating model type
xgboost_config = common_config.model_copy(update={"model": "xgboost"})
gblinear_config = common_config.model_copy(update={"model": "gblinear"})
print("β
Model configurations created:")
print(f" - XGBoost: {xgboost_config.model}")
print(f" - GBLinear: {gblinear_config.model}")
πΎ Storage Configuration#
LocalBenchmarkStorage manages the file structure for benchmark results:
benchmark_results/
βββ XGBoost/
β βββ backtest/ # Raw predictions
β βββ evaluation/ # Metrics per target
β βββ analysis/ # Visualizations (HTML)
βββ GBLinear/
βββ ...
# Initialize storage backends for each model
from openstef_beam.benchmarking.storage.local_storage import LocalBenchmarkStorage
storage_xgboost = LocalBenchmarkStorage(base_path=BENCHMARK_RESULTS_PATH_XGBOOST)
storage_gblinear = LocalBenchmarkStorage(base_path=BENCHMARK_RESULTS_PATH_GBLINEAR)
print(f"π XGBoost results: {BENCHMARK_RESULTS_PATH_XGBOOST}")
print(f"π GBLinear results: {BENCHMARK_RESULTS_PATH_GBLINEAR}")
π Run Backtests#
Now we run the Liander 2024 Benchmark β a comprehensive evaluation suite that:
Downloads the benchmark dataset from HuggingFace Hub (if needed)
Runs backtests across 5 asset categories (transformers, feeders, solar/wind parks)
Computes metrics and generates analysis visualizations
β οΈ Note: This may take several minutes depending on your hardware.
# Import benchmark components
from openstef_beam.benchmarking.baselines.openstef4 import create_openstef4_preset_backtest_forecaster
from openstef_beam.benchmarking.benchmarks.liander2024 import create_liander2024_benchmark_runner
from openstef_beam.benchmarking.callbacks.strict_execution_callback import StrictExecutionCallback
# --- Run XGBoost Benchmark ---
print("π² Running XGBoost benchmark...")
create_liander2024_benchmark_runner(
storage=storage_xgboost,
callbacks=[StrictExecutionCallback()], # Fail fast on errors
).run(
forecaster_factory=create_openstef4_preset_backtest_forecaster(
workflow_config=xgboost_config,
),
run_name="xgboost",
n_processes=N_PROCESSES,
filter_args=BENCHMARK_FILTER,
)
print("β
XGBoost benchmark complete!")
# --- Run GBLinear Benchmark ---
print("\nπ Running GBLinear benchmark...")
create_liander2024_benchmark_runner(
storage=storage_gblinear,
callbacks=[StrictExecutionCallback()],
).run(
forecaster_factory=create_openstef4_preset_backtest_forecaster(
workflow_config=gblinear_config,
),
run_name="gblinear",
n_processes=N_PROCESSES,
filter_args=BENCHMARK_FILTER,
)
print("β
GBLinear benchmark complete!")
π Compare Model Performance#
The BenchmarkComparisonPipeline generates side-by-side analysis of multiple models:
Global metrics across all targets
Per-category breakdowns (transformers, feeders, etc.)
Time-windowed performance analysis
# Run model comparison analysis
from openstef_beam.benchmarking import BenchmarkComparisonPipeline
from openstef_beam.benchmarking.benchmarks.liander2024 import LIANDER2024_ANALYSIS_CONFIG
# Create comparison pipeline
target_provider = create_liander2024_benchmark_runner(
storage=LocalBenchmarkStorage(base_path=OUTPUT_PATH),
).target_provider
comparison_pipeline = BenchmarkComparisonPipeline(
analysis_config=LIANDER2024_ANALYSIS_CONFIG,
storage=LocalBenchmarkStorage(base_path=OUTPUT_PATH),
target_provider=target_provider,
)
# Generate comparison reports
print("π Generating comparison analysis...")
comparison_pipeline.run(
run_data={
"xgboost": storage_xgboost,
"gblinear": storage_gblinear,
}
)
print("β
Comparison analysis complete!")
π View Analysis Results#
The benchmark generates interactive HTML visualizations. Letβs open the most important ones:
Key Metrics:#
rCRPS (relative Continuous Ranked Probability Score) β measures probabilistic forecast accuracy
rMAE (relative Mean Absolute Error) β measures point forecast accuracy
Lower values = better performance
# Open key analysis plots in browser
# HTML visualizations are interactive and best viewed in a browser
import os
import webbrowser
# Base path for analysis results
analysis_base = os.path.abspath("./benchmark_results/analysis/D-1T06:00")
# Define key visualizations to open
visualizations = [
("rCRPS Grouped by Category", "rCRPS_grouped.html"),
("rCRPS Time-Windowed (7 days)", "rCRPS_windowed_7D.html"),
]
print("π Opening analysis visualizations in browser...\n")
for name, filename in visualizations:
filepath = os.path.join(analysis_base, filename)
if Path(filepath).exists():
print(f" π {name}")
webbrowser.open(f"file://{filepath}")
else:
print(f" β οΈ {name} not found at {filepath}")
π Explore Individual Target Results#
You can also view time series plots for individual targets. Letβs look at a transformer forecast:
# List available target-specific visualizations
import glob
# Find all time series plots for individual targets
target_plots = glob.glob("./benchmark_results/XGBoost/analysis/*/*/time_series_plot*.html")
if target_plots:
print("π Available target-specific time series plots:\n")
for i, plot in enumerate(sorted(target_plots)[:5]): # Show first 5
parts = plot.split("/")
category = parts[-3] # e.g., "transformer"
target = parts[-2] # e.g., "OS Apeldoorn"
print(f" {i + 1}. {category}/{target}")
# Open the first transformer plot as an example
transformer_plots = [p for p in target_plots if "transformer" in p]
if transformer_plots:
example_plot = os.path.abspath(transformer_plots[0])
print(f"\nπ Opening example: {transformer_plots[0]}")
webbrowser.open(f"file://{example_plot}")
else:
print("β οΈ No target-specific plots found. Run the benchmark first.")
π― Summary#
In this tutorial, you learned how to:
β Configure benchmark experiments with
ForecastingWorkflowConfigβ Run parallel backtests using the Liander 2024 benchmark
β Compare models (XGBoost vs GBLinear) with
BenchmarkComparisonPipelineβ Analyze results with interactive HTML visualizations
π Output Structure#
benchmark_results/
βββ XGBoost/
β βββ backtest/ # Raw predictions (parquet)
β βββ evaluation/ # Metrics per target
β βββ analysis/ # HTML visualizations
βββ GBLinear/
β βββ ...
βββ analysis/ # Comparison analysis (both models)
βββ D-1T06:00/
βββ rCRPS_grouped.html # Probabilistic accuracy by category
βββ rMAE_grouped.html # Point forecast accuracy
βββ summary.html # Overall summary
π Next Steps#
Experiment with different
FORECAST_HORIZONS(e.g.,"PT6H","P7D")Add more quantiles for higher resolution prediction intervals
Filter specific categories with
BENCHMARK_FILTERIntegrate MLflow for experiment tracking