TimeSeriesDataset

class openstef_core.datasets.TimeSeriesDataset(data: DataFrame, sample_interval: timedelta = timedelta(minutes=15), *, horizon_column: str = 'horizon', available_at_column: str = 'available_at', is_sorted: bool = False, check_frequency: bool = False) → None

Bases: TimeSeriesMixin, DatasetMixin

A time series dataset with regular sampling intervals and optional versioning.

This class represents time series data with a consistent sampling interval and provides operations for data access, persistence, and filtering. It supports both regular time series and versioned datasets where data availability is tracked over time through either a horizon column or an available_at column.

The dataset automatically detects versioning:

If a horizon column exists, data is versioned by forecast horizon
If an available_at column exists, data is versioned by availability time
Otherwise, data is treated as a regular time series
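The detection rule above can be sketched in plain Python. This is a minimal illustration, not the library's implementation: `detect_versioning` is a hypothetical helper, and checking the horizon column before the available_at column is an assumption taken from the order in which the rules are listed.

```python
def detect_versioning(columns, horizon_column="horizon", available_at_column="available_at"):
    """Hypothetical helper: report how a set of column names would be versioned."""
    # Assumption: the horizon column takes precedence when both are present.
    if horizon_column in columns:
        return "horizon"
    if available_at_column in columns:
        return "available_at"
    # Neither column is present, so this is a regular time series.
    return None


print(detect_versioning(["load", "horizon"]))
print(detect_versioning(["load", "available_at"]))
print(detect_versioning(["load", "temperature"]))
```

The column names are parameters because the constructor lets callers rename both versioning columns.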

The dataset guarantees:

Data is sorted by timestamp in ascending order
Consistent sampling interval across all data points
DateTime index for temporal operations

Variables:

data – DataFrame containing the time series data indexed by timestamp.
horizon_column – Name of the column storing forecast horizons (if versioned by horizon).
available_at_column – Name of the column storing availability times (if versioned).

Example

Create a simple time series dataset:

>>> import pandas as pd
>>> from datetime import timedelta
>>> data = pd.DataFrame({
...     'temperature': [20.1, 22.3, 21.5],
...     'load': [100, 120, 110]
... }, index=pd.date_range('2025-01-01', periods=3, freq='15min'))
>>> dataset = TimeSeriesDataset(data, sample_interval=timedelta(minutes=15))
>>> dataset.feature_names
['temperature', 'load']
>>> dataset.is_versioned
False

Create a versioned dataset with horizons:

>>> data_with_horizon = pd.DataFrame({
...     'load': [100, 120],
...     'horizon': pd.to_timedelta(['1h', '2h'])
... }, index=pd.date_range('2025-01-01', periods=2, freq='1h'))
>>> dataset = TimeSeriesDataset(data_with_horizon, sample_interval=timedelta(hours=1))
>>> dataset.is_versioned
True
>>> dataset.horizons is not None
True

Parameters:

data (DataFrame)
sample_interval (timedelta)
horizon_column (str)
available_at_column (str)
is_sorted (bool)
check_frequency (bool)

index_name: ClassVar[str] = 'timestamp'

__init__(data: DataFrame, sample_interval: timedelta = timedelta(minutes=15), *, horizon_column: str = 'horizon', available_at_column: str = 'available_at', is_sorted: bool = False, check_frequency: bool = False) → None

Initialize a time series dataset.

The dataset automatically detects whether it’s versioned based on column presence:

If horizon_column exists: versioned by forecast horizon
If available_at_column exists: versioned by availability time
Otherwise: regular time series

Parameters:

data (DataFrame) – DataFrame with DatetimeIndex containing the time series data.
sample_interval (timedelta) – Fixed interval between consecutive data points.
horizon_column (str) – Name of the column storing forecast horizons.
available_at_column (str) – Name of the column storing availability times.
is_sorted (bool) – Whether the data is sorted by timestamp.
check_frequency (bool) – Whether to check that the data frequency matches sample_interval.

Raises:

TypeError – If data index is not a pandas DatetimeIndex or if versioning columns have incorrect types.
ValueError – If data frequency does not match sample_interval.
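As a rough illustration of what a frequency check could look like, the sketch below compares every gap between consecutive timestamps with the sample interval. It shows one strict reading of "frequency matches sample_interval"; the library's actual check may be more lenient, for example about gaps or about repeated timestamps in versioned data. The helper name `matches_sample_interval` is hypothetical.

```python
from datetime import datetime, timedelta


def matches_sample_interval(timestamps, sample_interval):
    """Hypothetical helper: True when every consecutive gap equals sample_interval."""
    ordered = sorted(timestamps)
    # Compare each timestamp with its successor.
    return all(later - earlier == sample_interval for earlier, later in zip(ordered, ordered[1:]))


start = datetime(2025, 1, 1)
regular = [start + i * timedelta(minutes=15) for i in range(4)]
# The third point is 35 minutes after the second, which breaks the 15-minute grid.
irregular = regular[:2] + [start + timedelta(minutes=50)]

print(matches_sample_interval(regular, timedelta(minutes=15)))
print(matches_sample_interval(irregular, timedelta(minutes=15)))
```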

horizon_column: str

available_at_column: str

data: DataFrame

property index: DatetimeIndex

Get the datetime index of the dataset.

Returns:

DatetimeIndex representing all timestamps in the dataset.

property sample_interval: timedelta

Get the fixed time interval between consecutive data points.

Returns:

The sample interval as a timedelta.

property feature_names: list[str]

Get the names of all available features in the dataset.

Returns:

List of feature names, excluding metadata columns like timestamp, available_at, or horizon.

property is_versioned: bool

Check if the dataset tracks data availability over time.

Returns:

True if the dataset is versioned (tracks availability via horizon or available_at columns), False for regular time series.

property horizons: list[LeadTime] | None

Get the list of forecast horizons present in the dataset.

Returns:

List of unique forecast horizons if the dataset is versioned by horizons, None otherwise.

property available_at_series: Series | None

Get the availability times as a pandas Series.

Returns:

Series containing availability times indexed by timestamp if versioned, None for non-versioned datasets.

property lead_time_series: Series | None

Get the lead times as a pandas Series.

Lead time is the gap between when data became available and the timestamp.

Returns:

Series containing lead times indexed by timestamp if versioned, None for non-versioned datasets.
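The lead-time definition can be made concrete with a small standard-library sketch. It assumes lead time is computed as the timestamp minus the availability time, so data that was available earlier has a longer lead time; the rows are invented for illustration and this is not the library's code.

```python
from datetime import datetime, timedelta

# Each row is (timestamp, available_at); the values are made up for illustration.
rows = [
    (datetime(2025, 1, 2, 12, 0), datetime(2025, 1, 1, 12, 0)),
    (datetime(2025, 1, 2, 12, 15), datetime(2025, 1, 2, 11, 15)),
]

# Assumption: lead time = timestamp - available_at.
lead_times = [timestamp - available_at for timestamp, available_at in rows]

for (timestamp, _), lead_time in zip(rows, lead_times):
    print(timestamp.isoformat(), lead_time)
```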

filter_by_range(start: datetime | None = None, end: datetime | None = None) → Self

Filter the dataset to include only data within the specified time range.

Parameters:

start (datetime | None) – Inclusive start time of the range. If None, no start boundary applied.
end (datetime | None) – Exclusive end time of the range. If None, no end boundary applied.

Returns:

New instance containing only data within [start, end).

Return type:

Self
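The half-open interval [start, end) described for filter_by_range can be illustrated without the library. The sketch below uses a hypothetical predicate, `in_half_open_range`, that treats a missing bound as unbounded; it only demonstrates the boundary semantics stated above.

```python
from datetime import datetime, timedelta


def in_half_open_range(timestamp, start=None, end=None):
    """Hypothetical predicate: True when start <= timestamp < end."""
    if start is not None and timestamp < start:
        return False
    # The end bound is exclusive, so a timestamp equal to end is rejected.
    if end is not None and timestamp >= end:
        return False
    return True


# Four points on a 15-minute grid: 00:00, 00:15, 00:30, 00:45.
timestamps = [datetime(2025, 1, 1) + i * timedelta(minutes=15) for i in range(4)]

kept = [ts for ts in timestamps if in_half_open_range(ts, start=timestamps[1], end=timestamps[3])]
print([ts.strftime("%H:%M") for ts in kept])
```

Because the end is exclusive, consecutive ranges that share a boundary never select the same row twice.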

filter_by_available_before(available_before: datetime) → Self

Filter to include only data available before the specified timestamp.

Parameters:

available_before (datetime) – Cutoff time for data availability.

Returns:

New instance containing only data available before the cutoff.

Return type:

Self

filter_by_available_at(available_at: AvailableAt) → Self

Filter based on realistic daily data availability constraints.

Parameters:

available_at (AvailableAt) – Specification defining when data becomes available.

Returns:

New instance with data filtered by availability pattern.

Return type:

Self

filter_by_lead_time(lead_time: LeadTime) → Self

Filter to include only data whose lead time is at least the specified lead time.

Parameters:

lead_time (LeadTime) – Minimum time gap required between data availability and timestamp.

Returns:

New instance containing only data available with the required lead time.

Return type:

Self
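A minimal sketch of the lead-time filter, using plain tuples instead of a dataset. It assumes lead time is the timestamp minus the availability time and uses a `timedelta` as a stand-in for the library's LeadTime type; `has_min_lead_time` is a hypothetical helper.

```python
from datetime import datetime, timedelta


def has_min_lead_time(timestamp, available_at, lead_time):
    """Hypothetical predicate: the row was available at least lead_time ahead."""
    return timestamp - available_at >= lead_time


# Each row is (timestamp, available_at, value); the values are invented.
rows = [
    (datetime(2025, 1, 2, 12), datetime(2025, 1, 1, 12), 100.0),  # available 24 h ahead
    (datetime(2025, 1, 2, 12), datetime(2025, 1, 2, 10), 105.0),  # available 2 h ahead
    (datetime(2025, 1, 2, 13), datetime(2025, 1, 1, 18), 110.0),  # available 19 h ahead
]

# Keep only rows that were available at least 12 hours before their timestamp.
kept = [row for row in rows if has_min_lead_time(row[0], row[1], timedelta(hours=12))]
print([value for _, _, value in kept])
```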

select_version() → Self

Select a specific version of the dataset based on data availability.

Creates a point-in-time snapshot by selecting the latest available version for each timestamp. Essential for preventing lookahead bias in backtesting.

Returns:

TimeSeriesDataset containing the selected version of the data.

Return type:

Self
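The idea of keeping the latest available version for each timestamp can be sketched with plain tuples. This is an illustration of the concept under the assumption that "latest" means the greatest availability time; `select_latest_version` is a hypothetical function, not the library's implementation.

```python
from datetime import datetime


def select_latest_version(rows):
    """Hypothetical helper: keep, per timestamp, the row with the latest available_at."""
    latest = {}
    for timestamp, available_at, value in rows:
        current = latest.get(timestamp)
        # Replace the stored row when this one became available later.
        if current is None or available_at > current[0]:
            latest[timestamp] = (available_at, value)
    # Return rows in ascending timestamp order.
    return [(ts, avail, val) for ts, (avail, val) in sorted(latest.items())]


# Each row is (timestamp, available_at, value); 12:00 has two versions.
rows = [
    (datetime(2025, 1, 2, 12), datetime(2025, 1, 1, 12), 100.0),
    (datetime(2025, 1, 2, 12), datetime(2025, 1, 2, 10), 105.0),
    (datetime(2025, 1, 2, 13), datetime(2025, 1, 1, 18), 110.0),
]

snapshot = select_latest_version(rows)
print([value for _, _, value in snapshot])
```

In a backtest, this selection would typically follow an availability filter, so that the "latest" version is the latest one that existed at the simulated point in time.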

filter_index(mask: Index) → Self

Filter dataset to include only timestamps present in the mask.

Parameters:

mask (Index)

Returns:

New dataset containing only rows with timestamps in the mask.

Return type:

Self

select_horizon(horizon: LeadTime) → Self

Select data for a specific forecast horizon.

Parameters:

horizon (LeadTime) – The forecast horizon to filter the dataset by.

Returns:

A new TimeSeriesDataset instance containing only data for the specified horizon.

Return type:

Self

to_pandas() → DataFrame

Convert the dataset to a pandas DataFrame with metadata stored in attrs.

Stores sample_interval, available_at_column, and horizon_column in the DataFrame’s attrs dictionary for later reconstruction.

Returns:

DataFrame with dataset data and metadata in attrs.

Return type:

DataFrame

classmethod from_pandas(df: DataFrame, *, sample_interval: timedelta | None = None, available_at_column: str = 'available_at', horizon_column: str = 'horizon') → Self

Create a dataset instance from a pandas DataFrame with metadata in attrs.

Reads sample_interval, available_at_column, and horizon_column from the DataFrame’s attrs dictionary.

Parameters:

df (DataFrame) – DataFrame containing dataset data with metadata in attrs.
sample_interval (timedelta | None) – Fixed interval between consecutive data points. If None, reads from attrs.
available_at_column (str) – Name of the column storing availability times.
horizon_column (str) – Name of the column storing forecast horizons.

Returns:

New TimeSeriesDataset instance reconstructed from the DataFrame.

Return type:

Self

to_parquet(path: Annotated[Path, PathType(path_type=file)]) → None

Save the dataset to a parquet file.

Stores both the dataset’s data and all necessary metadata for complete reconstruction. Metadata should be stored in the parquet file’s attrs dictionary.

Parameters:

path (Annotated[Path, PathType(path_type=file)]) – File path where the dataset should be saved.