Imputer
- class openstef_models.transforms.general.imputer.Imputer(**data: Any) -> None
Bases: BaseConfig, TimeSeriesTransform
Transform that imputes missing values in specified columns of time series data.
This transform applies imputation strategies to handle missing values in the dataset. It focuses solely on missing value imputation and avoids filling future values by preserving trailing NaNs after the last valid value in each column.
The transform works by:
- Validating that selected columns are not completely empty
- Applying imputation to the specified columns
- Restoring trailing NaNs to preserve time series integrity
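The three steps above can be sketched with plain pandas. This is only an illustration of the idea, not the library's implementation: the helper name `impute_mean_keep_trailing` is hypothetical, and it computes the column means in the same call rather than in a separate fit() step.

```python
import numpy as np
import pandas as pd


def impute_mean_keep_trailing(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: fill gaps with column means, keep trailing NaNs."""
    # Step 1: reject columns that contain no values at all.
    empty = [col for col in df.columns if df[col].isna().all()]
    if empty:
        raise ValueError(f"Completely empty columns: {empty}")

    # Step 2: impute every NaN with the mean of its column.
    filled = df.fillna(df.mean())

    # Step 3: put NaN back after the last valid observation of each column,
    # so that no "future" values are invented.
    for col in df.columns:
        last_valid = df[col].last_valid_index()
        filled.loc[filled.index > last_valid, col] = np.nan
    return filled


frame = pd.DataFrame(
    {
        "radiation": [100, np.nan, 110, np.nan],
        "wind_speed": [5, 6, np.nan, np.nan],
    },
    index=pd.date_range("2025-01-01", periods=4, freq=pd.Timedelta(hours=1)),
)
result = impute_mean_keep_trailing(frame)
```

In this sketch the gap in the middle of `radiation` is filled, while the final `radiation` value and both trailing `wind_speed` values stay NaN.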
Imputation Strategies:
- Simple strategies (mean, median, most_frequent, constant): Use statistics computed during fit()
- Iterative strategy: Multivariate imputation (default: BayesianRidge()). Leverages relations between features but can be slower and needs parameter tuning.
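To make the difference between the two families concrete, here is a small sketch using scikit-learn directly. It assumes the strategies behave like scikit-learn's `SimpleImputer` and `IterativeImputer`, which the strategy names and the `BayesianRidge()` default suggest but which this page does not confirm.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401  (required to import IterativeImputer)
from sklearn.impute import IterativeImputer, SimpleImputer
from sklearn.linear_model import BayesianRidge

# Two related features; the second column has one gap.
X = np.array(
    [
        [1.0, 2.0],
        [2.0, 4.0],
        [3.0, np.nan],
        [4.0, 8.0],
    ]
)

# Simple strategy: the statistic (here the column mean) is learned in fit()
# and reused for every gap in that column.
simple = SimpleImputer(strategy="mean").fit(X)
X_simple = simple.transform(X)

# Iterative strategy: each feature with gaps is modelled from the other
# features, so the first column is used as a predictor for the second.
iterative = IterativeImputer(
    estimator=BayesianRidge(),
    max_iter=10,
    tol=1e-3,
    random_state=0,
).fit(X)
X_iterative = iterative.transform(X)
```

The simple imputer fills the gap with the mean of the observed values in the second column, regardless of the first column; the iterative imputer bases its estimate on the relation between the two columns.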
Note: If you have completely empty columns, use EmptyFeatureRemover first to remove them before applying imputation.
Example
>>> from datetime import timedelta
>>> import numpy as np
>>> import pandas as pd
>>> from openstef_core.datasets import TimeSeriesDataset
>>> from openstef_models.transforms.general import (
...     Imputer,
... )
>>> data = pd.DataFrame(
...     {
...         "radiation": [100, np.nan, 110, np.nan],
...         "temperature": [20, np.nan, 24, 21],
...         "wind_speed": [5, 6, np.nan, np.nan]
...     },
...     index=pd.date_range("2025-01-01", periods=4, freq="1h"),
... )
>>> dataset = TimeSeriesDataset(data, timedelta(hours=1))
>>> # Apply imputation to all columns (default behavior)
>>> transform_all = Imputer(imputation_strategy="mean")
>>> transform_all.fit(dataset)
>>> result_all = transform_all.transform(dataset)
>>> # Apply imputation only to specific columns
>>> from openstef_models.utils.feature_selection import FeatureSelection
>>> transform_selective = Imputer(
...     imputation_strategy="mean",
...     selection=FeatureSelection(include={"temperature", "wind_speed"})
... )
>>> transform_selective.fit(dataset)
>>> result_selective = transform_selective.transform(dataset)
>>> result_selective.data["temperature"].isna().sum() == 0  # Temperature NaNs filled
np.True_
>>> result_selective.data["radiation"].isna().sum() == 2  # Radiation NaNs preserved
np.True_
>>> # Use iterative imputation with custom estimator
>>> from sklearn.ensemble import RandomForestRegressor
>>> transform_iterative = Imputer(
...     imputation_strategy="iterative",
...     selection=FeatureSelection(include={"temperature", "radiation"}),
...     impute_estimator=RandomForestRegressor(
...         n_estimators=2,  # not many trees for test speed
...         max_depth=3,  # shallow tree for test speed
...         bootstrap=True,
...         max_samples=0.5,
...         n_jobs=1,
...         random_state=0,
...     ),
...     max_iterations=20,
...     tolerance=0.1
... )
>>> transform_iterative.fit(dataset)
>>> result_iterative = transform_iterative.transform(dataset)
>>> result_iterative.data["temperature"].isna().sum() == 0  # Temperature NaNs filled
np.True_
>>> np.isnan(result_iterative.data["radiation"].iloc[1])  # Radiation first NaN replaced
np.False_
>>> np.isnan(result_iterative.data["radiation"].iloc[3])  # Check if trailing NaN is preserved
np.True_
>>> result_iterative.data["wind_speed"].isna().sum() == 2  # Wind speed NaNs preserved
np.True_
- Parameters:
data (Any)
- imputation_strategy: TypeAliasType
- missing_value: float
- impute_estimator: Any
- initial_strategy: Literal['mean', 'median', 'most_frequent', 'constant']
- tolerance: float
- max_iterations: int
- selection: FeatureSelection
- fill_future_values: FeatureSelection
- validate_fill_value_with_strategy() -> Imputer
Validate that fill_value is provided when strategy is CONSTANT.
- Return type:
Imputer
- Returns:
The validated model instance.
- Raises:
ValueError – If imputation_strategy is CONSTANT but fill_value is None.
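A validator of this kind is commonly written as a pydantic "after" model validator. The sketch below assumes pydantic v2 and uses a hypothetical stand-in class, `ConstantFillConfig`, with only the two fields named in the description; it is not the actual Imputer definition.

```python
from typing import Literal, Optional

from pydantic import BaseModel, model_validator


class ConstantFillConfig(BaseModel):
    """Hypothetical stand-in holding only the fields relevant to the check."""

    imputation_strategy: Literal["mean", "median", "most_frequent", "constant"] = "mean"
    fill_value: Optional[float] = None

    @model_validator(mode="after")
    def validate_fill_value_with_strategy(self) -> "ConstantFillConfig":
        # Runs after field validation, so both fields are already populated.
        if self.imputation_strategy == "constant" and self.fill_value is None:
            raise ValueError("fill_value is required when imputation_strategy is 'constant'")
        return self


# Valid: a constant strategy with an explicit fill value.
ok = ConstantFillConfig(imputation_strategy="constant", fill_value=0.0)

# Invalid: constant strategy without a fill value. pydantic reports the
# failure as a ValidationError, which is a subclass of ValueError.
try:
    ConstantFillConfig(imputation_strategy="constant")
    rejected = False
except ValueError:
    rejected = True
```

Checking the combination in a model-level validator rather than a field validator keeps the rule independent of the order in which the two fields are declared.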
- model_post_init(context: Any) -> None
Override this method to perform additional initialization after __init__ and model_construct. This is useful if you want to do some validation that requires the entire model to be initialized.
- property is_fitted: bool
Check if the transform has been fitted.
- fit(data: TimeSeriesDataset) -> None
Fit the transform to the input data.
This method should be called before applying the transform to the data. It allows the transform to learn any necessary parameters from the data.
- Parameters:
data (TimeSeriesDataset) – The input data to fit the transform on.
- Return type:
None
- transform(data: TimeSeriesDataset) -> TimeSeriesDataset
Transform the input data.
This method should apply a transformation to the input data and return a new instance.
- Parameters:
data (TimeSeriesDataset) – The input data to be transformed.
- Returns:
A new instance of the transformed data.
- Raises:
NotFittedError – If the transform has not been fitted yet.
- Return type:
TimeSeriesDataset
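The fit/transform contract described above (learn parameters in fit(), return a new object from transform(), raise if transform() is called first) can be illustrated with a minimal standard-library sketch. `MeanFiller` and the local `NotFittedError` are hypothetical stand-ins written for this example; they are not the library's classes.

```python
import math
from typing import List, Optional


class NotFittedError(Exception):
    """Local stand-in for the library's exception of the same name."""


class MeanFiller:
    """Hypothetical transform: learns a mean in fit(), applies it in transform()."""

    def __init__(self) -> None:
        self._mean: Optional[float] = None

    @property
    def is_fitted(self) -> bool:
        return self._mean is not None

    def fit(self, values: List[float]) -> None:
        # Learn the statistic from the observed (non-NaN) values only.
        observed = [v for v in values if not math.isnan(v)]
        self._mean = sum(observed) / len(observed)

    def transform(self, values: List[float]) -> List[float]:
        if not self.is_fitted:
            raise NotFittedError("Call fit() before transform().")
        # Build a new list; the input is left untouched.
        return [self._mean if math.isnan(v) else v for v in values]


filler = MeanFiller()
temperature = [20.0, float("nan"), 24.0, 21.0]

try:
    filler.transform(temperature)
    raised_before_fit = False
except NotFittedError:
    raised_before_fit = True

filler.fit(temperature)
filled = filler.transform(temperature)
```

After fitting, `filled` contains the learned mean in place of the gap, while `temperature` itself still holds the original NaN.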
- model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': False, 'extra': 'ignore', 'protected_namespaces': ()}
Configuration for the model, should be a dictionary conforming to pydantic.config.ConfigDict.