Imputer

class openstef_models.transforms.general.imputer.Imputer(**data: Any) -> None

Bases: BaseConfig, TimeSeriesTransform

Transform that imputes missing values in specified columns of time series data.

This transform applies imputation strategies to handle missing values in the dataset. It focuses solely on missing value imputation and avoids filling future values by preserving trailing NaNs after the last valid value in each column.

The transform works by:

  1. Validating that selected columns are not completely empty

  2. Applying imputation to the specified columns

  3. Restoring trailing NaNs to preserve time series integrity
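
The three steps above can be sketched for a single column in plain Python. This is only an illustration of the idea under simplifying assumptions (one column, mean strategy, a list of floats instead of a TimeSeriesDataset); the helper name `impute_mean_keep_trailing` is hypothetical and is not part of openstef_models.

```python
import math


def impute_mean_keep_trailing(values):
    """Mean-impute gaps in one column while leaving trailing NaNs untouched."""
    observed = [v for v in values if not math.isnan(v)]

    # Step 1: a completely empty column offers nothing to impute from.
    if not observed:
        raise ValueError("column is completely empty")
    mean = sum(observed) / len(observed)

    # Index of the last valid value; everything after it counts as "future".
    last_valid = max(i for i, v in enumerate(values) if not math.isnan(v))

    # Step 2: impute every missing value in the column.
    filled = [mean if math.isnan(v) else v for v in values]

    # Step 3: restore the NaNs that followed the last valid value.
    trailing = len(values) - last_valid - 1
    return filled[: last_valid + 1] + [math.nan] * trailing


radiation = [100.0, math.nan, 110.0, math.nan]
print(impute_mean_keep_trailing(radiation))
```

In this sketch the gap between 100.0 and 110.0 is filled with the column mean, while the final NaN stays in place because no valid value follows it.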

Imputation Strategies:

  • Simple strategies (mean, median, most_frequent, constant): Fill gaps with a per-column statistic computed during fit(), or with fill_value when the strategy is constant

  • Iterative strategy: Multivariate imputation (default estimator: BayesianRidge()). It exploits relationships between features, but can be slower and may need parameter tuning.

Note: If you have completely empty columns, use EmptyFeatureRemover first to remove them before applying imputation.
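
To make the simple strategies concrete, the following sketch computes the fill value each one would derive from a single column. It is an approximation using only the standard library, not the transform's actual implementation; `fit_statistic` is a hypothetical helper, and tie-breaking for most_frequent may differ from what the library does.

```python
import math
import statistics


def fit_statistic(train_values, strategy, fill_value=None):
    """Return the value a simple strategy would use to fill gaps in one column."""
    observed = [v for v in train_values if not math.isnan(v)]
    if strategy == "mean":
        return statistics.mean(observed)
    if strategy == "median":
        return statistics.median(observed)
    if strategy == "most_frequent":
        # Assumption: ties resolve to the first value seen (statistics.mode behaviour).
        return statistics.mode(observed)
    if strategy == "constant":
        # No statistic is learned; the caller supplies the value.
        return fill_value
    raise ValueError(f"unknown strategy: {strategy}")


temperature = [20.0, math.nan, 24.0, 21.0, 21.0]
for strategy in ("mean", "median", "most_frequent"):
    print(strategy, fit_statistic(temperature, strategy))
```

The missing entry is ignored when the statistic is computed, which is why a completely empty column cannot be imputed this way.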

Example

>>> from datetime import timedelta
>>> import numpy as np
>>> import pandas as pd
>>> from openstef_core.datasets import TimeSeriesDataset
>>> from openstef_models.transforms.general import (
...     Imputer,
... )
>>> data = pd.DataFrame(
...     {
...         "radiation": [100, np.nan, 110, np.nan],
...         "temperature": [20, np.nan, 24, 21],
...         "wind_speed": [5, 6, np.nan, np.nan]
...     },
...     index=pd.date_range("2025-01-01", periods=4, freq="1h"),
... )
>>> dataset = TimeSeriesDataset(data, timedelta(hours=1))
>>> # Apply imputation to all columns (default behavior)
>>> transform_all = Imputer(imputation_strategy="mean")
>>> transform_all.fit(dataset)
>>> result_all = transform_all.transform(dataset)
>>> # Apply imputation only to specific columns
>>> from openstef_models.utils.feature_selection import FeatureSelection
>>> transform_selective = Imputer(
...     imputation_strategy="mean",
...     selection=FeatureSelection(include={"temperature", "wind_speed"})
... )
>>> transform_selective.fit(dataset)
>>> result_selective = transform_selective.transform(dataset)
>>> result_selective.data["temperature"].isna().sum() == 0  # Temperature NaNs filled
np.True_
>>> result_selective.data["radiation"].isna().sum() == 2  # Radiation NaNs preserved
np.True_
>>> # Use iterative imputation with custom estimator
>>> from sklearn.ensemble import RandomForestRegressor
>>> transform_iterative = Imputer(
...     imputation_strategy="iterative",
...     selection=FeatureSelection(include={"temperature", "radiation"}),
...     impute_estimator=RandomForestRegressor(
...         n_estimators=2,  # not many trees for test speed
...         max_depth=3,  # shallow tree for test speed
...         bootstrap=True,
...         max_samples=0.5,
...         n_jobs=1,
...         random_state=0,
...     ),
...     max_iterations=20,
...     tolerance=0.1
... )
>>> transform_iterative.fit(dataset)
>>> result_iterative = transform_iterative.transform(dataset)
>>> result_iterative.data["temperature"].isna().sum() == 0  # Temperature NaNs filled
np.True_
>>> np.isnan(result_iterative.data["radiation"].iloc[1])  # Radiation first NaN replaced
np.False_
>>> np.isnan(result_iterative.data["radiation"].iloc[3]) # Check if trailing NaN is preserved
np.True_
>>> result_iterative.data["wind_speed"].isna().sum() == 2  # Wind speed NaNs preserved
np.True_
Parameters:

data (Any)

imputation_strategy: TypeAliasType
missing_value: float
fill_value: float | str | None
impute_estimator: Any
initial_strategy: Literal['mean', 'median', 'most_frequent', 'constant']
tolerance: float
max_iterations: int
selection: FeatureSelection
fill_future_values: FeatureSelection
validate_fill_value_with_strategy() -> Imputer

Validate that fill_value is provided when strategy is CONSTANT.

Return type:

Imputer

Returns:

The validated model instance.

Raises:

ValueError – If imputation_strategy is CONSTANT but fill_value is None.
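
The check described above can be illustrated with a small stand-in class. This is a sketch under stated assumptions: `ImputerConfigSketch` is hypothetical, it uses a dataclass rather than the pydantic machinery that the real Imputer inherits from BaseConfig, and the error message is invented for the example.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class ImputerConfigSketch:
    """Hypothetical stand-in for the two fields involved in the check."""

    imputation_strategy: str = "mean"
    fill_value: Optional[Union[float, str]] = None

    def __post_init__(self):
        # The constant strategy has nothing to fill with unless fill_value is set.
        if self.imputation_strategy == "constant" and self.fill_value is None:
            raise ValueError(
                "fill_value must be provided when imputation_strategy is 'constant'"
            )


ok = ImputerConfigSketch(imputation_strategy="constant", fill_value=0.0)
print(ok.fill_value)

try:
    ImputerConfigSketch(imputation_strategy="constant")
except ValueError as err:
    print("rejected:", err)
```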

model_post_init(context: Any) -> None

Override this method to perform additional initialization after __init__ and model_construct. This is useful if you want to do some validation that requires the entire model to be initialized.

Parameters:

context (Any)

Return type:

None

property is_fitted: bool

Check if the transform has been fitted.

fit(data: TimeSeriesDataset) -> None

Fit the transform to the input data.

This method should be called before applying the transform to the data. It allows the transform to learn any necessary parameters from the data.

Parameters:

data (TimeSeriesDataset)

Return type:

None

transform(data: TimeSeriesDataset) -> TimeSeriesDataset

Transform the input data.

This method should apply a transformation to the input data and return a new instance.

Parameters:

data (TimeSeriesDataset)

Returns:

A new instance of the transformed data.

Raises:

NotFittedError – If the transform has not been fitted yet.

Return type:

TimeSeriesDataset
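
The fit-then-transform contract, including the error raised when transform() is called first, can be sketched as follows. Both classes are hypothetical stand-ins written for this illustration: `NotFittedError` here is a local exception rather than the library's own, and `MeanFillSketch` works on a plain list instead of a TimeSeriesDataset.

```python
import math


class NotFittedError(RuntimeError):
    """Local stand-in for the library's NotFittedError."""


class MeanFillSketch:
    """Hypothetical transform that learns a mean in fit() and applies it later."""

    def __init__(self):
        self._mean = None

    @property
    def is_fitted(self):
        return self._mean is not None

    def fit(self, values):
        # Learn the statistic from the non-missing values only.
        observed = [v for v in values if not math.isnan(v)]
        self._mean = sum(observed) / len(observed)

    def transform(self, values):
        if not self.is_fitted:
            raise NotFittedError("call fit() before transform()")
        # Build a new list so the input is left unchanged.
        return [self._mean if math.isnan(v) else v for v in values]


sketch = MeanFillSketch()
try:
    sketch.transform([1.0, math.nan])
except NotFittedError as err:
    print("not fitted:", err)

sketch.fit([20.0, math.nan, 24.0])
print(sketch.transform([math.nan, 30.0]))
```

Note that the value used in transform() comes from the data passed to fit(), not from the data being transformed.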

features_added() -> list[str]

List of feature names added by this transform.

Return type:

list[str]

Returns:

A list of strings representing the names of features added to the dataset by this transform. Default is an empty list.

model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': False, 'extra': 'ignore', 'protected_namespaces': ()}

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.