EvaluationPipeline#

class openstef_beam.evaluation.EvaluationPipeline(config: EvaluationConfig, quantiles: list[Quantile], window_metric_providers: list[MetricProvider], global_metric_providers: list[MetricProvider]) → None[source]#

Bases: object

Pipeline for evaluating probabilistic forecasting models.

Computes metrics across various dimensions:
  • Prediction availability times
  • Lead times
  • Time windows
  • Global and windowed metrics

Always includes observed probability as a calibration metric.

Parameters:
  • config (EvaluationConfig)

  • quantiles (list[Quantile])

  • window_metric_providers (list[MetricProvider])

  • global_metric_providers (list[MetricProvider])
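
A minimal construction sketch (hedged: the EvaluationConfig arguments and the concrete metric providers below are illustrative placeholders, not confirmed API):

    from openstef_beam.evaluation import EvaluationPipeline

    # Placeholders only; see EvaluationConfig and the metric provider
    # modules for the actual constructor arguments.
    config = EvaluationConfig(...)  # time-dimension settings (placeholder)
    quantiles = [Quantile(0.1), Quantile(0.5), Quantile(0.9)]  # 0.5 is required

    pipeline = EvaluationPipeline(
        config=config,
        quantiles=quantiles,
        window_metric_providers=[...],  # metrics computed per time window
        global_metric_providers=[...],  # metrics computed over the whole dataset
    )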

__init__(config: EvaluationConfig, quantiles: list[Quantile], window_metric_providers: list[MetricProvider], global_metric_providers: list[MetricProvider]) → None[source]#

Initializes the pipeline with configuration and metric providers.

Automatically adds ObservedProbabilityProvider to global metrics to ensure calibration is always evaluated.

Parameters:
  • config (EvaluationConfig) – Configuration for the evaluation pipeline, defining its time dimensions.

  • quantiles (list[Quantile]) – List of quantiles to evaluate; must include 0.5 (the median).

  • window_metric_providers (list[MetricProvider]) – Metrics to compute for each time window.

  • global_metric_providers (list[MetricProvider]) – Metrics to compute over the entire dataset.

Raises:

ValueError – If the quantiles list does not include 0.5 (the median quantile).
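
A short sketch of this requirement (assuming Quantile can be constructed from a float, which is an assumption):

    # The median is mandatory: this raises ValueError.
    EvaluationPipeline(config, [Quantile(0.1), Quantile(0.9)], [], [])

    # This succeeds, because 0.5 is present.
    EvaluationPipeline(config, [Quantile(0.1), Quantile(0.5), Quantile(0.9)], [], [])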

run(predictions: TimeSeriesMixin, ground_truth: TimeSeriesMixin, target_column: str, evaluation_mask: DatetimeIndex | None = None) → EvaluationReport[source]#

Evaluates predictions against ground truth.

Segments data by available_at and lead_time configurations, then computes metrics for each subset.

Parameters:
  • predictions (TimeSeriesMixin) – Forecasted values with versioning information.

  • ground_truth (TimeSeriesMixin) – Actual observed values for comparison.

  • target_column (str) – Name of the target column in the ground truth dataset.

  • evaluation_mask (DatetimeIndex | None) – Optional datetime index limiting the evaluation period.

Returns:

EvaluationReport containing metrics for each subset, organized by filtering criteria such as lead time windows and availability timestamps.

Raises:
  • ValueError – If predictions and ground truth have different sample intervals.

  • MissingColumnsError – If any configured quantile columns are missing from predictions.

Return type:

EvaluationReport
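
A usage sketch, assuming predictions and ground_truth are already loaded as TimeSeriesMixin datasets and that the target column is named "load" (all names illustrative):

    import pandas as pd

    # Optionally restrict the evaluation to a fixed period.
    mask = pd.date_range("2024-01-01", "2024-02-01", freq="15min")

    report = pipeline.run(
        predictions=predictions,    # forecasts with versioning information
        ground_truth=ground_truth,  # observed values
        target_column="load",       # hypothetical column name
        evaluation_mask=mask,
    )
    # report holds one entry per subset, keyed by filtering criteria
    # such as lead-time windows and availability timestamps.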

run_for_subset(filtering: Filtering, predictions: ForecastDataset, evaluation_mask: DatetimeIndex | None = None) → EvaluationSubsetReport[source]#

Evaluates a single evaluation subset.

Computes metrics for the provided subset without additional filtering.

Parameters:
  • filtering (Filtering) – The filtering criteria describing this subset.

  • predictions (ForecastDataset) – ForecastDataset containing the predicted values for this subset.

  • evaluation_mask (DatetimeIndex | None) – Optional datetime index limiting the evaluation period.

Returns:

EvaluationSubsetReport containing computed metrics for the subset.

Return type:

EvaluationSubsetReport
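
A sketch for a single subset, assuming a Filtering value and a ForecastDataset are already at hand (both names illustrative):

    subset_report = pipeline.run_for_subset(
        filtering=filtering,           # criteria describing this subset
        predictions=subset_forecasts,  # ForecastDataset with the predictions
    )
    # subset_report is an EvaluationSubsetReport with the computed
    # metrics for this subset; no additional filtering is applied.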