TimeSeriesDataset

class openstef_core.datasets.TimeSeriesDataset(data: DataFrame, sample_interval: timedelta = timedelta(minutes=15), *, horizon_column: str = 'horizon', available_at_column: str = 'available_at', is_sorted: bool = False) → None

Bases: TimeSeriesMixin, DatasetMixin

A time series dataset with regular sampling intervals and optional versioning.

This class represents time series data with a consistent sampling interval and provides operations for data access, persistence, and filtering. It supports both regular time series and versioned datasets where data availability is tracked over time through either a horizon column or an available_at column.

The dataset automatically detects versioning:
  • If a horizon column exists, data is versioned by forecast horizon

  • If an available_at column exists, data is versioned by availability time

  • Otherwise, data is treated as a regular time series
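The detection rule above can be sketched in plain pandas. The helper name `detect_versioning` and its return strings are hypothetical, and the precedence (horizon checked first) is an assumption read off the order of the list; this is an illustration, not the library's implementation.

```python
from typing import Optional

import pandas as pd


def detect_versioning(
    df: pd.DataFrame,
    horizon_column: str = "horizon",
    available_at_column: str = "available_at",
) -> Optional[str]:
    """Hypothetical helper: report which versioning column, if any, is present."""
    if horizon_column in df.columns:
        return "horizon"
    if available_at_column in df.columns:
        return "available_at"
    return None


index = pd.date_range("2025-01-01", periods=2, freq="1h")
plain = pd.DataFrame({"load": [100, 120]}, index=index)
by_horizon = plain.assign(horizon=pd.to_timedelta(["1h", "2h"]))
by_available_at = plain.assign(available_at=index - pd.Timedelta(hours=1))

print(detect_versioning(plain))            # None
print(detect_versioning(by_horizon))       # horizon
print(detect_versioning(by_available_at))  # available_at
```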

The dataset guarantees:
  • Data is sorted by timestamp in ascending order

  • Consistent sampling interval across all data points

  • DateTime index for temporal operations

Variables:
  • data – DataFrame containing the time series data indexed by timestamp.

  • horizon_column – Name of the column storing forecast horizons (if versioned by horizon).

  • available_at_column – Name of the column storing availability times (if versioned).

Example

Create a simple time series dataset:

>>> import pandas as pd
>>> from datetime import timedelta
>>> data = pd.DataFrame({
...     'temperature': [20.1, 22.3, 21.5],
...     'load': [100, 120, 110]
... }, index=pd.date_range('2025-01-01', periods=3, freq='15min'))
>>> dataset = TimeSeriesDataset(data, sample_interval=timedelta(minutes=15))
>>> dataset.feature_names
['temperature', 'load']
>>> dataset.is_versioned
False

Create a versioned dataset with horizons:

>>> data_with_horizon = pd.DataFrame({
...     'load': [100, 120],
...     'horizon': pd.to_timedelta(['1h', '2h'])
... }, index=pd.date_range('2025-01-01', periods=2, freq='1h'))
>>> dataset = TimeSeriesDataset(data_with_horizon, sample_interval=timedelta(hours=1))
>>> dataset.is_versioned
True
>>> dataset.horizons is not None
True
Parameters:
  • data (DataFrame)

  • sample_interval (timedelta)

  • horizon_column (str)

  • available_at_column (str)

  • is_sorted (bool)

index_name: ClassVar[str] = 'timestamp'
__init__(data: DataFrame, sample_interval: timedelta = timedelta(minutes=15), *, horizon_column: str = 'horizon', available_at_column: str = 'available_at', is_sorted: bool = False) → None

Initialize a time series dataset.

The dataset automatically detects whether it’s versioned based on column presence:
  • If horizon_column exists: versioned by forecast horizon

  • If available_at_column exists: versioned by availability time

  • Otherwise: regular time series

Parameters:
  • data (DataFrame) – DataFrame with DatetimeIndex containing the time series data.

  • sample_interval (timedelta) – Fixed interval between consecutive data points.

  • horizon_column (str) – Name of the column storing forecast horizons.

  • available_at_column (str) – Name of the column storing availability times.

  • is_sorted (bool) – Whether the data is sorted by timestamp.

Raises:

TypeError – If data index is not a pandas DatetimeIndex or if versioning columns have incorrect types.

horizon_column: str
available_at_column: str
data: DataFrame
property index: DatetimeIndex

Get the datetime index of the dataset.

Returns:

DatetimeIndex representing all timestamps in the dataset.

property sample_interval: timedelta

Get the fixed time interval between consecutive data points.

Returns:

The sample interval as a timedelta.

property feature_names: list[str]

Get the names of all available features in the dataset.

Returns:

List of feature names, excluding metadata columns like timestamp, available_at, or horizon.

property is_versioned: bool

Check if the dataset tracks data availability over time.

Returns:

True if the dataset is versioned (tracks availability via horizon or available_at columns), False for regular time series.

property horizons: list[LeadTime] | None

Get the list of forecast horizons present in the dataset.

Returns:

List of unique forecast horizons if the dataset is versioned by horizons, None otherwise.

property available_at_series: Series | None

Get the availability times as a pandas Series.

Returns:

Series containing availability times indexed by timestamp if versioned, None for non-versioned datasets.

property lead_time_series: Series | None

Get the lead times as a pandas Series.

Lead time is the gap between when data became available and the timestamp.

Returns:

Series containing lead times indexed by timestamp if versioned, None for non-versioned datasets.
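As a rough illustration of the definition above, the lead time can be computed in plain pandas as the timestamp minus the availability time. The sign convention and the series name `lead_time` are assumptions made for this sketch; the property itself may compute or label the result differently.

```python
import pandas as pd

index = pd.date_range("2025-01-01 12:00", periods=3, freq="1h", name="timestamp")
df = pd.DataFrame(
    {
        "load": [100, 120, 110],
        # Each value became available some time before its timestamp.
        "available_at": pd.to_datetime(
            ["2025-01-01 06:00", "2025-01-01 12:00", "2025-01-01 13:30"]
        ),
    },
    index=index,
)

# Lead time = timestamp - availability time, computed element-wise.
lead_time = pd.Series(
    df.index - pd.DatetimeIndex(df["available_at"]),
    index=df.index,
    name="lead_time",
)
print(lead_time)
```

With this data the three lead times are 6 hours, 1 hour, and 30 minutes.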

filter_by_range(start: datetime | None = None, end: datetime | None = None) → Self

Filter the dataset to include only data within the specified time range.

Parameters:
  • start (datetime | None) – Inclusive start time of the range. If None, no start boundary applied.

  • end (datetime | None) – Exclusive end time of the range. If None, no end boundary applied.

Returns:

New instance containing only data within [start, end).

Return type:

Self
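The half-open interval [start, end) can be illustrated with a plain pandas boolean mask. This is a sketch of the documented boundary semantics only, not the method's implementation.

```python
import pandas as pd

index = pd.date_range("2025-01-01 00:00", periods=5, freq="15min")
df = pd.DataFrame({"load": [100, 120, 110, 130, 125]}, index=index)

start = pd.Timestamp("2025-01-01 00:15")
end = pd.Timestamp("2025-01-01 00:45")

# Inclusive start, exclusive end: keeps 00:15 and 00:30, drops 00:45.
in_range = df[(df.index >= start) & (df.index < end)]
print(in_range["load"].tolist())  # [120, 110]
```

Because the end is exclusive, consecutive ranges that share a boundary never return the same row twice.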

filter_by_available_before(available_before: datetime) → Self

Filter to include only data available before the specified timestamp.

Parameters:
  • available_before (datetime) – Cutoff time for data availability.

Returns:

New instance containing only data available before the cutoff.

Return type:

Self
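A minimal pandas sketch of an availability cutoff follows. It assumes an `available_at` column and treats "before" as strictly earlier than the cutoff; whether the method includes rows available exactly at the cutoff is not stated here.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "load": [100, 120, 110],
        "available_at": pd.to_datetime(
            ["2025-01-01 06:00", "2025-01-01 12:00", "2025-01-01 13:30"]
        ),
    },
    index=pd.date_range("2025-01-01 12:00", periods=3, freq="1h"),
)

cutoff = pd.Timestamp("2025-01-01 12:00")

# Assumption: keep only rows whose availability time is strictly before the cutoff.
known_before_cutoff = df[df["available_at"] < cutoff]
print(known_before_cutoff["load"].tolist())  # [100]
```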

filter_by_available_at(available_at: AvailableAt) → Self

Filter based on realistic daily data availability constraints.

Parameters:
  • available_at (AvailableAt) – Specification defining when data becomes available.

Returns:

New instance with data filtered by availability pattern.

Return type:

Self

filter_by_lead_time(lead_time: LeadTime) → Self

Filter to include only data that was available at least the specified lead time before its timestamp.

Parameters:
  • lead_time (LeadTime) – Minimum time gap required between data availability and timestamp.

Returns:

New instance containing only data available with the required lead time.

Return type:

Self
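The minimum lead time requirement can be sketched in plain pandas as a threshold on timestamp minus availability time. The column name `available_at` matches the default shown on this page; the use of `pd.Timedelta` in place of `LeadTime` is a simplification for this sketch.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "load": [100, 120, 110],
        "available_at": pd.to_datetime(
            ["2025-01-01 06:00", "2025-01-01 12:00", "2025-01-01 13:30"]
        ),
    },
    index=pd.date_range("2025-01-01 12:00", periods=3, freq="1h"),
)

min_lead_time = pd.Timedelta(hours=1)

# Lead times here are 6h, 1h and 30min; only the first two meet the minimum.
lead_time = df.index - pd.DatetimeIndex(df["available_at"])
with_required_lead = df[lead_time >= min_lead_time]
print(with_required_lead["load"].tolist())  # [100, 120]
```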

select_version() → Self

Select a specific version of the dataset based on data availability.

Creates a point-in-time snapshot by selecting the latest available version for each timestamp. Essential for preventing lookahead bias in backtesting.

Return type:

Self

Returns:

TimeSeriesDataset containing the selected version of the data.
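One way to picture "latest available version for each timestamp" is the pandas sketch below. It assumes versions are distinguished by an `available_at` column and that duplicate timestamps represent competing versions; the method's actual selection logic may differ.

```python
import pandas as pd

# Two versions exist for 12:00; only one for 13:00.
df = pd.DataFrame(
    {
        "load": [100, 105, 120],
        "available_at": pd.to_datetime(
            ["2025-01-01 06:00", "2025-01-01 11:00", "2025-01-01 12:00"]
        ),
    },
    index=pd.DatetimeIndex(
        ["2025-01-01 12:00", "2025-01-01 12:00", "2025-01-01 13:00"],
        name="timestamp",
    ),
)

# Order versions by availability, then keep the last (latest) row per timestamp.
ordered = df.sort_values("available_at", kind="stable")
latest = ordered[~ordered.index.duplicated(keep="last")].sort_index()
print(latest["load"].tolist())  # [105, 120]
```

Combining this with an availability cutoff first gives a point-in-time view: only versions known at the cutoff compete for each timestamp.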

filter_index(mask: Index) → Self

Filter dataset to include only timestamps present in the mask.

Returns:

New dataset containing only rows with timestamps in the mask.

Parameters:

mask (Index)

Return type:

Self
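In plain pandas, keeping only the timestamps present in an Index can be sketched with `Index.isin`. This mirrors the described behavior under the assumption that mask entries absent from the data are simply ignored.

```python
import pandas as pd

index = pd.date_range("2025-01-01", periods=4, freq="1h")
df = pd.DataFrame({"load": [100, 120, 110, 130]}, index=index)

# The mask is an Index of timestamps to keep; the last one is not in the data.
mask = pd.DatetimeIndex(
    ["2025-01-01 01:00", "2025-01-01 03:00", "2025-01-02 00:00"]
)

kept = df[df.index.isin(mask)]
print(kept["load"].tolist())  # [120, 130]
```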

select_horizon(horizon: LeadTime) → Self

Select data for a specific forecast horizon.

Parameters:
  • horizon (LeadTime) – The forecast horizon to filter the dataset by.

Returns:

A new TimeSeriesDataset instance containing only data for the specified horizon.

Return type:

Self
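Selecting one horizon from a horizon-versioned frame can be sketched as an equality filter on the horizon column. The sketch uses `pd.Timedelta` in place of `LeadTime` and says nothing about whether the method keeps or drops the horizon column afterwards.

```python
import pandas as pd

# Each timestamp appears once per forecast horizon.
df = pd.DataFrame(
    {
        "load": [100, 101, 120, 121],
        "horizon": pd.to_timedelta(["1h", "2h", "1h", "2h"]),
    },
    index=pd.DatetimeIndex(
        [
            "2025-01-01 00:00",
            "2025-01-01 00:00",
            "2025-01-01 01:00",
            "2025-01-01 01:00",
        ]
    ),
)

one_hour = df[df["horizon"] == pd.Timedelta(hours=1)]
print(one_hour["load"].tolist())  # [100, 120]
```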

to_pandas() → DataFrame

Convert the dataset to a pandas DataFrame with metadata stored in attrs.

Stores sample_interval, available_at_column, and horizon_column in the DataFrame’s attrs dictionary for later reconstruction.

Return type:

DataFrame

Returns:

DataFrame with dataset data and metadata in attrs.

classmethod from_pandas(df: DataFrame, *, sample_interval: timedelta | None = None, available_at_column: str = 'available_at', horizon_column: str = 'horizon') → Self

Create a dataset instance from a pandas DataFrame with metadata in attrs.

Reads sample_interval, available_at_column, and horizon_column from the DataFrame’s attrs dictionary.

Parameters:
  • df (DataFrame) – DataFrame containing dataset data with metadata in attrs.

  • sample_interval (timedelta | None) – Fixed interval between consecutive data points. If None, reads from attrs.

  • available_at_column (str) – Name of the column storing availability times.

  • horizon_column (str) – Name of the column storing forecast horizons.

Returns:

New TimeSeriesDataset instance reconstructed from the DataFrame.

Return type:

Self
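The attrs-based round trip can be pictured with pandas alone. The key names below follow the three attributes named on this page, but the exact keys and value encodings used by to_pandas and from_pandas are assumptions, as are the fallback defaults.

```python
from datetime import timedelta

import pandas as pd

df = pd.DataFrame(
    {"load": [100, 120, 110]},
    index=pd.date_range("2025-01-01", periods=3, freq="15min", name="timestamp"),
)

# Writing side: attach the metadata to the frame itself.
df.attrs["sample_interval"] = timedelta(minutes=15)
df.attrs["available_at_column"] = "available_at"
df.attrs["horizon_column"] = "horizon"

# Reading side: fall back to a default when a key is missing (assumed behavior).
sample_interval = df.attrs.get("sample_interval", timedelta(minutes=15))
horizon_column = df.attrs.get("horizon_column", "horizon")
print(sample_interval, horizon_column)
```

Note that pandas documents `DataFrame.attrs` as experimental, and not every operation is guaranteed to carry it over to the result.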

to_parquet(path: Annotated[Path, PathType(path_type=file)]) → None

Save the dataset to a parquet file.

Stores both the dataset’s data and all necessary metadata for complete reconstruction. Metadata should be stored in the parquet file’s attrs dictionary.

Parameters:

path (Annotated[Path, PathType(path_type=file)]) – File path where the dataset should be saved.

See also

read_parquet: Counterpart method for loading datasets.

Return type:

None

classmethod read_parquet(path: Annotated[Path, PathType(path_type=file)], *, sample_interval: timedelta | None = None, timestamp_column: str = 'timestamp', available_at_column: str = 'available_at', horizon_column: str = 'horizon') → Self

Load a dataset from a parquet file.

Reconstructs a dataset from a parquet file created with to_parquet, including all data and metadata. Should handle missing metadata gracefully with sensible defaults.

Parameters:
  • path (Annotated[Path, PathType(path_type=file)]) – Path to the parquet file to load.

  • sample_interval (timedelta | None)

  • timestamp_column (str)

  • available_at_column (str)

  • horizon_column (str)

Returns:

New dataset instance reconstructed from the file.

Return type:

Self

See also

to_parquet: Counterpart method for saving datasets.

pipe_pandas(func: Callable[[Concatenate[DataFrame, P]], DataFrame], *args: P, **kwargs: P) → Self

Apply a pandas DataFrame transformation function to the dataset.

Executes a function on the underlying DataFrame and wraps the result back into a TimeSeriesDataset, preserving all metadata.

Returns:

New dataset with the transformation applied.

Parameters:
  • func (Callable[[Concatenate[DataFrame, ParamSpec(P)]], DataFrame])

  • args (ParamSpecArgs)

  • kwargs (ParamSpecKwargs)

Return type:

Self
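The pattern described here (run a function on the frame, then re-wrap the result with unchanged metadata) can be sketched with a small stand-in class. `MiniDataset` and `scale` are hypothetical names used only for illustration; they are not part of the library.

```python
from dataclasses import dataclass, replace
from datetime import timedelta
from typing import Callable

import pandas as pd


@dataclass(frozen=True)
class MiniDataset:
    """Hypothetical stand-in for a dataset: a frame plus one piece of metadata."""

    data: pd.DataFrame
    sample_interval: timedelta

    def pipe_pandas(
        self, func: Callable[..., pd.DataFrame], *args, **kwargs
    ) -> "MiniDataset":
        # Run the transformation on the frame, then re-wrap with unchanged metadata.
        return replace(self, data=func(self.data, *args, **kwargs))


def scale(df: pd.DataFrame, factor: float) -> pd.DataFrame:
    """Example transformation: multiply every value by a constant."""
    return df * factor


original = MiniDataset(
    data=pd.DataFrame(
        {"load": [100.0, 120.0]},
        index=pd.date_range("2025-01-01", periods=2, freq="15min"),
    ),
    sample_interval=timedelta(minutes=15),
)
scaled = original.pipe_pandas(scale, factor=2.0)
print(scaled.data["load"].tolist())  # [200.0, 240.0]
```

Returning a new object rather than mutating in place keeps the original dataset usable after the transformation.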

select_features(feature_names: list[str]) → TimeSeriesDataset

Select a subset of features from the dataset.

Parameters:
  • feature_names (list[str]) – List of feature column names to retain in the dataset.

Returns:

A new TimeSeriesDataset instance containing only the specified features.

Return type:

TimeSeriesDataset

copy_with(data: DataFrame, *, is_sorted: bool = False) → TimeSeriesDataset

Create a copy of the dataset with new data.

Parameters:
  • data (DataFrame) – New DataFrame to use for the dataset.

  • is_sorted (bool) – Whether the new data is already sorted by timestamp.

Returns:

New TimeSeriesDataset instance with the provided data and same metadata.

Return type:

TimeSeriesDataset