Imputer

class openstef_models.transforms.general.imputer.Imputer(**data: Any) → None

Bases: BaseConfig, TimeSeriesTransform

Transform that imputes missing values in specified columns of time series data.

This transform applies imputation strategies to handle missing values in the dataset. It focuses solely on missing value imputation and avoids filling future values by preserving trailing NaNs after the last valid value in each column.

The transform works by:

1. Validating that selected columns are not completely empty
2. Applying imputation to the specified columns
3. Restoring trailing NaNs to preserve time series integrity
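The three steps above can be sketched with plain pandas. This is an illustrative sketch only, not the library's implementation: the helper name `impute_mean_keep_trailing` is hypothetical, it works on a bare DataFrame instead of a TimeSeriesDataset, and mean imputation stands in for whichever strategy is configured.

```python
import numpy as np
import pandas as pd


def impute_mean_keep_trailing(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: mean-impute gaps but leave trailing NaNs untouched."""
    # Step 1: reject columns that contain no valid values at all.
    empty = [col for col in df.columns if df[col].notna().sum() == 0]
    if empty:
        raise ValueError(f"Columns are completely empty: {empty}")

    # Step 2: fill every NaN with the column mean.
    filled = df.fillna(df.mean())

    # Step 3: put NaNs back after the last valid observation of each column,
    # so that values in the "future" are never invented.
    for col in df.columns:
        last_valid = df[col].last_valid_index()
        filled.loc[filled.index > last_valid, col] = np.nan
    return filled


data = pd.DataFrame(
    {
        "radiation": [100, np.nan, 110, np.nan],
        "wind_speed": [5, 6, np.nan, np.nan],
    },
    index=pd.date_range("2025-01-01", periods=4, freq="1h"),
)
result = impute_mean_keep_trailing(data)
```

In this sketch the gap inside `radiation` is filled with the column mean, while the NaNs at the end of both columns stay NaN because no valid observation follows them.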

Imputation Strategies:

Simple strategies (mean, median, most_frequent, constant): Use statistics computed during fit()

Iterative strategy: Multivariate imputation using an estimator (default: BayesianRidge()). It exploits relationships between features, but can be slower and may need parameter tuning.
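The difference between the two families can be illustrated directly with scikit-learn. It is an assumption here, not something this page states, that the transform's behaviour resembles scikit-learn's `SimpleImputer` and `IterativeImputer`, and that `impute_estimator`, `max_iterations` and `tolerance` play the role of `estimator`, `max_iter` and `tol`. The sketch only shows how the strategies behave on a small array.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401  (must precede the IterativeImputer import)
from sklearn.impute import IterativeImputer, SimpleImputer
from sklearn.linear_model import BayesianRidge

# Two strongly correlated features; the second column has one gap.
X = np.array(
    [
        [1.0, 2.0],
        [2.0, 4.0],
        [3.0, 6.0],
        [4.0, 8.0],
        [5.0, np.nan],
    ]
)

# Simple strategy: the statistic is computed per column during fit()
# and reused unchanged by every later transform().
simple = SimpleImputer(strategy="mean").fit(X)
simple_filled = simple.transform(X)  # gap becomes the column mean, 5.0

# Iterative strategy: the column with gaps is regressed on the other column,
# so the estimate can follow the trend instead of the mean.
iterative = IterativeImputer(
    estimator=BayesianRidge(),
    initial_strategy="mean",
    max_iter=10,
    tol=1e-3,
    random_state=0,
)
iterative_filled = iterative.fit_transform(X)
```

The simple strategy ignores the first column entirely, whereas the iterative strategy uses it as a predictor, which is why it costs more and is sensitive to the estimator settings.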

Note: If you have completely empty columns, use EmptyFeatureRemover first to remove them before applying imputation.
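To see which columns would trip the validation, the check can be previewed with plain pandas. This sketch does not use `EmptyFeatureRemover` (its interface is not described in this section); `DataFrame.dropna(axis=1, how="all")` is used as a stand-in, and the column names are made up for illustration.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(
    {
        "temperature": [20.0, np.nan, 24.0],
        "broken_sensor": [np.nan, np.nan, np.nan],  # completely empty
    }
)

# Columns without a single valid value offer no statistic to impute from.
empty_columns = [col for col in data.columns if data[col].isna().all()]

# Stand-in for the removal step: drop columns where every value is NaN.
cleaned = data.dropna(axis=1, how="all")
```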

Example

>>> from datetime import timedelta
>>> import numpy as np
>>> import pandas as pd
>>> from openstef_core.datasets import TimeSeriesDataset
>>> from openstef_models.transforms.general import (
...     Imputer,
... )
>>> data = pd.DataFrame(
...     {
...         "radiation": [100, np.nan, 110, np.nan],
...         "temperature": [20, np.nan, 24, 21],
...         "wind_speed": [5, 6, np.nan, np.nan]
...     },
...     index=pd.date_range("2025-01-01", periods=4, freq="1h"),
... )
>>> dataset = TimeSeriesDataset(data, timedelta(hours=1))
>>> # Apply imputation to all columns (default behavior)
>>> transform_all = Imputer(imputation_strategy="mean")
>>> transform_all.fit(dataset)
>>> result_all = transform_all.transform(dataset)
>>> # Apply imputation only to specific columns
>>> from openstef_models.utils.feature_selection import FeatureSelection
>>> transform_selective = Imputer(
...     imputation_strategy="mean",
...     selection=FeatureSelection(include={"temperature", "wind_speed"})
... )
>>> transform_selective.fit(dataset)
>>> result_selective = transform_selective.transform(dataset)
>>> result_selective.data["temperature"].isna().sum() == 0  # Temperature NaNs filled
np.True_
>>> result_selective.data["radiation"].isna().sum() == 2  # Radiation NaNs preserved
np.True_
>>> # Use iterative imputation with custom estimator
>>> from sklearn.ensemble import RandomForestRegressor
>>> transform_iterative = Imputer(
...     imputation_strategy="iterative",
...     selection=FeatureSelection(include={"temperature", "radiation"}),
...     impute_estimator=RandomForestRegressor(
...         n_estimators=2,  # not many trees for test speed
...         max_depth=3,  # shallow tree for test speed
...         bootstrap=True,
...         max_samples=0.5,
...         n_jobs=1,
...         random_state=0,
...     ),
...     max_iterations=20,
...     tolerance=0.1
... )
>>> transform_iterative.fit(dataset)
>>> result_iterative = transform_iterative.transform(dataset)
>>> result_iterative.data["temperature"].isna().sum() == 0  # Temperature NaNs filled
np.True_
>>> np.isnan(result_iterative.data["radiation"].iloc[1])  # Radiation first NaN replaced
np.False_
>>> np.isnan(result_iterative.data["radiation"].iloc[3]) # Check if trailing NaN is preserved
np.True_
>>> result_iterative.data["wind_speed"].isna().sum() == 2  # Wind speed NaNs preserved
np.True_

Parameters: data (Any)

imputation_strategy: ImputationStrategy

missing_value: float

fill_value: float | str | None

impute_estimator: Any

initial_strategy: Literal['mean', 'median', 'most_frequent', 'constant']

tolerance: float

max_iterations: int

selection: FeatureSelection

fill_future_values: FeatureSelection

validate_fill_value_with_strategy() → Imputer

Validate that fill_value is provided when strategy is CONSTANT.

Return type: Imputer
Returns: The validated model instance.
Raises: ValueError – If imputation_strategy is CONSTANT but fill_value is None.
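The rule enforced by this validator can be sketched without any dependencies. The real class is a pydantic model; the dataclass below, along with the names `Strategy` and `ImputerConfigSketch`, is a hypothetical stand-in that only mirrors the check described above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Union


class Strategy(str, Enum):
    """Hypothetical stand-in for ImputationStrategy."""

    MEAN = "mean"
    CONSTANT = "constant"


@dataclass
class ImputerConfigSketch:
    """Hypothetical config object, not the library's Imputer."""

    imputation_strategy: Strategy
    fill_value: Optional[Union[float, str]] = None

    def __post_init__(self) -> None:
        # CONSTANT has no statistic to fall back on, so it needs an explicit value.
        if self.imputation_strategy is Strategy.CONSTANT and self.fill_value is None:
            raise ValueError(
                "fill_value must be provided when imputation_strategy is CONSTANT"
            )


# Valid: a constant strategy with an explicit fill value.
ok = ImputerConfigSketch(Strategy.CONSTANT, fill_value=0.0)

# Invalid: a constant strategy without a fill value raises ValueError.
try:
    ImputerConfigSketch(Strategy.CONSTANT)
    error_message = None
except ValueError as exc:
    error_message = str(exc)
```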

model_post_init(context: Any) → None

Override this method to perform additional initialization after __init__ and model_construct. This is useful if you want to do some validation that requires the entire model to be initialized.

Parameters: context (Any)
Return type: None

property is_fitted: bool

Check if the transform has been fitted.

fit(data: TimeSeriesDataset) → None

Fit the transform to the input data.

This method should be called before applying the transform to the data. It allows the transform to learn any necessary parameters from the data.

Parameters:

data (TimeSeriesDataset) – The input data to fit the transform on.

Return type:

None

transform(data: TimeSeriesDataset) → TimeSeriesDataset

Transform the input data.

This method should apply a transformation to the input data and return a new instance.

Parameters:

data (TimeSeriesDataset) – The input data to be transformed.

Returns:

A new instance of the transformed data.

Raises:

NotFittedError – If the transform has not been fitted yet.

Return type:

TimeSeriesDataset
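The fit-then-transform contract described for `fit()`, `transform()` and `is_fitted` can be sketched as follows. Everything here is a hypothetical stand-in: `MeanFillSketch` is not the library's class, `NotFittedError` is defined locally instead of being imported, and a plain DataFrame replaces TimeSeriesDataset. Trailing-NaN handling is deliberately left out to keep the focus on the contract.

```python
from typing import Optional

import numpy as np
import pandas as pd


class NotFittedError(Exception):
    """Local stand-in for the library's NotFittedError."""


class MeanFillSketch:
    """Hypothetical transform that follows the fit/transform contract."""

    def __init__(self) -> None:
        self._means: Optional[pd.Series] = None

    @property
    def is_fitted(self) -> bool:
        return self._means is not None

    def fit(self, data: pd.DataFrame) -> None:
        # Learn the per-column statistics once, from the fitting data.
        self._means = data.mean()

    def transform(self, data: pd.DataFrame) -> pd.DataFrame:
        if not self.is_fitted:
            raise NotFittedError("Call fit() before transform().")
        # fillna returns a new DataFrame; the input is left unmodified.
        return data.fillna(self._means)


train = pd.DataFrame({"temperature": [20.0, np.nan, 24.0]})
new = pd.DataFrame({"temperature": [np.nan, 30.0]})

sketch = MeanFillSketch()
try:
    sketch.transform(new)
    raised = False
except NotFittedError:
    raised = True

sketch.fit(train)
result = sketch.transform(new)  # the gap is filled with the training mean, 22.0
```

Note that the value used to fill `new` comes from `train`, which is the point of separating the two calls: statistics are learned once and then applied to any later dataset.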

features_added() → list[str]

List of feature names added by this transform.

Return type: list[str]
Returns: A list of strings representing the names of features added to the dataset by this transform. Default is an empty list.

model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': False, 'extra': 'ignore', 'protected_namespaces': ()}

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.