openstef.model.metamodels package#

Submodules#

openstef.model.metamodels.grouped_regressor module#

This module defines the grouped regressor.

class openstef.model.metamodels.grouped_regressor.GroupedRegressor(base_estimator, group_columns, n_jobs=1)#

Bases: BaseEstimator, RegressorMixin, MetaEstimatorMixin

Meta-model that trains an instance of the base estimator for each key of a groupby operation applied on the data.

The base estimator is a sklearn regressor, and the groupby is performed on the columns given by group_columns. The fit and predict methods can run in parallel across group keys using joblib.

Example:

data =  | index | group | x0 | x1 | x3 | y |
        |   0   |   1   | .. | .. | .. | . |
        |   1   |   2   | .. | .. | .. | . |
        |   2   |   1   | .. | .. | .. | . |
        |   3   |   2   | .. | .. | .. | . |

        [              X              ][ Y ]
With group_columns=’group’, the GroupedRegressor fits two models on this data:
  • Model 1 is fitted on rows 0 and 2, with columns x0, x1 and x3 as the features and column y as the target.

  • Model 2 is fitted on rows 1 and 3, with columns x0, x1 and x3 as the features and column y as the target.
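
The grouping behaviour in the example above can be sketched without any dependencies. This is a minimal illustration, not OpenSTEF's implementation: `MeanRegressor` and `fit_per_group` are hypothetical names, the numeric values are made up, and a trivial mean predictor stands in for a real sklearn regressor.

```python
from collections import defaultdict


class MeanRegressor:
    """Hypothetical stand-in for a sklearn regressor: predicts the training mean."""

    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


def fit_per_group(rows, group_key, target_key):
    """Fit one estimator per distinct value of the group column."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[group_key]].append(row)

    estimators = {}
    for key, members in grouped.items():
        # Features are every column except the group column and the target.
        X = [[v for k, v in r.items() if k not in (group_key, target_key)]
             for r in members]
        y = [r[target_key] for r in members]
        estimators[key] = MeanRegressor().fit(X, y)
    return estimators


data = [
    {"group": 1, "x0": 0.1, "x1": 1.0, "x3": 5.0, "y": 10.0},  # row 0
    {"group": 2, "x0": 0.2, "x1": 2.0, "x3": 6.0, "y": 20.0},  # row 1
    {"group": 1, "x0": 0.3, "x1": 3.0, "x3": 7.0, "y": 30.0},  # row 2
    {"group": 2, "x0": 0.4, "x1": 4.0, "x3": 8.0, "y": 40.0},  # row 3
]

# One fitted estimator per group key, similar in spirit to estimators_.
estimators = fit_per_group(data, group_key="group", target_key="y")
```

The resulting dictionary has one entry per group key (1 and 2), each fitted only on the rows belonging to that group.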

Parameters:
  • base_estimator (RegressorMixin) – Regressor fitted on each group.

  • group_columns (Union[str, int, list[str], list[int]]) – Name(s) of the column(s) used as the key for groupby operation.

  • n_jobs (int) – The maximum number of concurrently running jobs, such as the number of Python worker processes when backend=”multiprocessing” or the size of the thread pool when backend=”threading”. Defaults to 1.

feature_names_#

All input features (excluding group_columns).

estimators_#

Dictionary that stores the fitted estimator for each group. The keys are the group keys and the values are the regressors fitted on the corresponding grouped data.

fit(x, y, eval_set=None, **kwargs)#

Fit the model.

classmethod grouped_compute(df, group_columns, func, n_jobs=1, eval_set=None)#

Computes the specified function on each group defined by the grouping columns.

It is a utility function used to perform fit and predict on each group. In the return value, group_res is a tuple in which each element holds the results for one group, gb is the groupby object, and df_res is the final dataframe that aggregates the results of all groups.

Parameters:
  • df (DataFrame) – DataFrame containing the input data necessary for the computation.

  • group_columns (Union[list[str], list[int]]) – List of the columns used for the groupby operation.

  • func (Callable[[tuple, DataFrame], array]) – Function that takes the group key and the corresponding data of that group and performs the computation on it.

  • n_jobs (int) – The maximum number of concurrently running jobs.

Return type:

tuple[tuple[array, ...], DataFrameGroupBy, DataFrame]

Returns:

The tuple of the results of each group, the groupby object, and the global dataframe of results.
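
The split, apply and reassemble pattern described above can be sketched in plain Python. This is only an illustration under stated assumptions: `grouped_compute_sketch` and `double_x0` are hypothetical names, rows are plain dictionaries rather than a DataFrame, and a standard-library thread pool stands in for joblib.

```python
from concurrent.futures import ThreadPoolExecutor


def grouped_compute_sketch(rows, group_columns, func, n_jobs=1):
    """Apply func(key, group_rows) to each group and reassemble the results.

    Returns (group_res, groups, results_in_row_order).
    """
    # Build the groups, remembering each row's original position.
    groups = {}
    for position, row in enumerate(rows):
        key = tuple(row[c] for c in group_columns)
        groups.setdefault(key, []).append(position)

    def run(item):
        key, positions = item
        return func(key, [rows[p] for p in positions])

    # A thread pool stands in for joblib here; n_jobs caps the worker count.
    with ThreadPoolExecutor(max_workers=n_jobs) as pool:
        group_res = tuple(pool.map(run, groups.items()))

    # Scatter each group's results back to the original row positions.
    results = [None] * len(rows)
    for positions, values in zip(groups.values(), group_res):
        for position, value in zip(positions, values):
            results[position] = value
    return group_res, groups, results


rows = [
    {"group": 1, "x0": 1.0},
    {"group": 2, "x0": 2.0},
    {"group": 1, "x0": 3.0},
    {"group": 2, "x0": 4.0},
]


def double_x0(key, group_rows):
    # Hypothetical per-group computation: one output value per input row.
    return [2 * r["x0"] for r in group_rows]


group_res, groups, results = grouped_compute_sketch(
    rows, ["group"], double_x0, n_jobs=2
)
```

Here `group_res` holds one result list per group, while `results` restores the original row order, which mirrors the distinction between the per-group tuple and the global dataframe of results.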

predict(x, **kwargs)#

Make a prediction.

Return type:

ndarray

openstef.model.metamodels.missing_values_handler module#

This module defines the missing value handler.

class openstef.model.metamodels.missing_values_handler.MissingValuesHandler(base_estimator, missing_values=nan, imputation_strategy=None, fill_value=None)#

Bases: BaseEstimator, RegressorMixin, MetaEstimatorMixin

Class for a meta-model that handles missing values and removes columns filled exclusively with NaN.

It’s a pipeline of:

  • An Imputation transformer for completing missing values.

  • A Regressor fitted on the filled data.

Parameters:
  • base_estimator (RegressorMixin) – Regressor used in the pipeline.

  • missing_values (Union[int, float, str, None]) – The placeholder for the missing values. All occurrences of missing_values will be imputed. For pandas’ dataframes with nullable integer dtypes with missing values, missing_values should be set to np.nan, since pd.NA will be converted to np.nan.

  • imputation_strategy (Optional[str]) – The imputation strategy.
      - If None, no imputation is performed.
      - If “mean”, then replace missing values using the mean along each column. Can only be used with numeric data.
      - If “median”, then replace missing values using the median along each column. Can only be used with numeric data.
      - If “most_frequent”, then replace missing values using the most frequent value along each column. Can be used with strings or numeric data. If there is more than one such value, only the smallest is returned.
      - If “constant”, then replace missing values with fill_value. Can be used with strings or numeric data.

  • fill_value (Union[str, int, float, None]) – When imputation_strategy == “constant”, fill_value is used to replace all occurrences of missing_values. If left to the default, fill_value will be 0 when imputing numerical data and “missing_value” for strings or object data types.
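
The two preprocessing steps described above (dropping columns that contain only NaN, then imputing the remaining gaps) can be sketched with the standard library alone. This is a hedged illustration, not the class's actual code: `is_missing`, `drop_all_nan_columns` and `impute_column` are hypothetical helpers, and columns are plain lists rather than DataFrame columns.

```python
import math
import statistics
from collections import Counter


def is_missing(value):
    return isinstance(value, float) and math.isnan(value)


def drop_all_nan_columns(columns):
    """Keep only columns that contain at least one observed value."""
    return {name: vals for name, vals in columns.items()
            if not all(is_missing(v) for v in vals)}


def impute_column(values, strategy, fill_value=0):
    """Fill missing entries of one column according to the given strategy."""
    if strategy is None:
        return list(values)  # no imputation
    observed = [v for v in values if not is_missing(v)]
    if strategy == "mean":
        fill = statistics.mean(observed)
    elif strategy == "median":
        fill = statistics.median(observed)
    elif strategy == "most_frequent":
        counts = Counter(observed)
        top = max(counts.values())
        # Ties are broken by taking the smallest of the most frequent values.
        fill = min(v for v, c in counts.items() if c == top)
    elif strategy == "constant":
        fill = fill_value
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")
    return [fill if is_missing(v) else v for v in values]


nan = float("nan")
columns = {
    "x0": [1.0, nan, 3.0],
    "x1": [nan, nan, nan],  # filled exclusively with NaN, so it is dropped
    "x2": [2.0, 2.0, nan],
}

valid = drop_all_nan_columns(columns)
imputed = {name: impute_column(vals, "mean") for name, vals in valid.items()}
```

After these two steps only `x0` and `x2` remain, with their gaps filled by the column mean; a regressor would then be fitted on the imputed columns.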

feature_names#

All input features.

non_null_columns_#

Valid features used by the regressor.

n_features_in_#

Number of input features.

regressor_#

RegressorMixin – Regressor fitted on valid columns.

imputer_#

SimpleImputer – Imputer for missing values, fitted on valid columns.

pipeline_#

Pipeline – Pipeline that chains the imputer and the regressor.

feature_importances_#

ndarray of shape (n_features_in_,) – The feature importances from the regressor for valid features, and zero for all other features.
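
As a rough sketch of how importances for the valid columns could be mapped back onto the full feature list, assuming a hypothetical helper `expand_importances` and made-up importance values:

```python
def expand_importances(feature_names, non_null_columns, regressor_importances):
    """Map importances for the valid columns back onto all input features."""
    by_name = dict(zip(non_null_columns, regressor_importances))
    # Columns that were dropped (all NaN) get an importance of zero.
    return [by_name.get(name, 0.0) for name in feature_names]


feature_names = ["x0", "x1", "x2"]
non_null_columns = ["x0", "x2"]     # "x1" was all NaN and was dropped
regressor_importances = [0.7, 0.3]  # hypothetical values from the regressor

importances = expand_importances(
    feature_names, non_null_columns, regressor_importances
)
```

The output has one entry per input feature, so its length matches n_features_in_ even though the regressor only saw the valid columns.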

fit(x, y)#

Fit model.

predict(x)#

Make a prediction.

Module contents#