VersionedTimeSeriesDataset

class openstef_core.datasets.VersionedTimeSeriesDataset(data_parts: list[TimeSeriesDataset], *, index: DatetimeIndex | None = None) → None

Bases: TimeSeriesMixin, DatasetMixin

A versioned time series dataset composed of multiple data parts.

This class combines multiple TimeSeriesDataset instances into a unified dataset that tracks data availability over time. It provides methods to filter datasets by time ranges, availability constraints, and lead times, as well as select specific versions of the data for point-in-time reconstruction.

The dataset is particularly useful for realistic backtesting scenarios where data arrives with delays or gets revised over time.

Key motivation: This architecture solves the O(n²) space complexity problem that occurs when concatenating DataFrames with misaligned (timestamp, available_at) pairs. Instead of immediately combining data, it uses lazy composition that delays actual DataFrame concatenation until select_version() is called.

Variables: data_parts – List of TimeSeriesDataset instances that compose this dataset.

Example

Create a versioned dataset by combining multiple data parts:

>>> import pandas as pd
>>> from datetime import datetime, timedelta
>>> from openstef_core.datasets import TimeSeriesDataset, VersionedTimeSeriesDataset
>>>
>>> # Create weather data part
>>> weather_data = pd.DataFrame({
...     'temperature': [20.5],
...     'available_at': [datetime(2025, 1, 1, 16, 0)]
... }, index=pd.DatetimeIndex([datetime(2025, 1, 1, 10, 0)]))
>>> weather_part = TimeSeriesDataset(weather_data, timedelta(hours=1))
>>>
>>> # Combine into versioned dataset
>>> dataset = VersionedTimeSeriesDataset([weather_part])
>>> dataset.is_versioned
True

Note

All data parts must have identical sample intervals and disjoint feature sets. The final dataset index is the union of all part indices, enabling flexible composition of data sources with different coverage periods.
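
The composition rules in this note can be sketched without the library. The snippet below is a minimal illustration only: the dictionaries standing in for data parts, the `combined_index` helper, and the use of `ValueError` are all hypothetical (the class itself documents `TimeSeriesValidationError`), and the real validation may differ in detail.

```python
from datetime import datetime, timedelta

# Hypothetical stand-ins for data parts: interval, feature names, timestamps.
parts = [
    {
        "interval": timedelta(hours=1),
        "features": {"temperature"},
        "index": [datetime(2025, 1, 1, 10), datetime(2025, 1, 1, 11)],
    },
    {
        "interval": timedelta(hours=1),
        "features": {"load"},
        "index": [datetime(2025, 1, 1, 11), datetime(2025, 1, 1, 12)],
    },
]


def combined_index(parts):
    """Check the two composition rules, then return the union of all indices."""
    intervals = {part["interval"] for part in parts}
    if len(intervals) != 1:
        raise ValueError("all parts must share one sample interval")

    seen = set()
    for part in parts:
        overlap = seen & part["features"]
        if overlap:
            raise ValueError(f"feature sets must be disjoint: {sorted(overlap)}")
        seen |= part["features"]

    # Union of indices: parts may cover different periods.
    return sorted(set().union(*(part["index"] for part in parts)))


index = combined_index(parts)
```

Here the two parts overlap only at 11:00, so the combined index spans 10:00 through 12:00.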

Parameters:

data_parts (list[TimeSeriesDataset])
index (DatetimeIndex | None)

__init__(data_parts: list[TimeSeriesDataset], *, index: DatetimeIndex | None = None) → None

Initialize a versioned time series dataset from multiple parts.

Parameters:

data_parts (list[TimeSeriesDataset]) – List of TimeSeriesDataset instances to combine. Must have identical sample intervals and disjoint feature sets.
index (DatetimeIndex | None) – Optional explicit index for the combined dataset. If not provided, the union of all part indices will be used.

Raises:

TimeSeriesValidationError – If no data parts provided or validation fails.

data_parts: list[TimeSeriesDataset]

property index: DatetimeIndex

Get the datetime index of the dataset.

Returns: DatetimeIndex representing all timestamps in the dataset.

property sample_interval: timedelta

Get the fixed time interval between consecutive data points.

Returns: The sample interval as a timedelta.

property feature_names: list[str]

Get the names of all available features in the dataset.

Returns: List of feature names, excluding metadata columns such as timestamp, available_at, or horizon.

property is_versioned: bool

Check if the dataset tracks data availability over time.

Returns: True if the dataset is versioned (tracks availability via horizon or available_at columns), False for a regular time series.

filter_by_range(start: datetime | None = None, end: datetime | None = None) → Self

Filter the dataset to include only data within the specified time range.

Parameters:

start (datetime | None) – Inclusive start time of the range. If None, no start boundary is applied.
end (datetime | None) – Exclusive end time of the range. If None, no end boundary is applied.

Returns:

New instance containing only data within [start, end).

Return type:

Self
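
The half-open interval [start, end) can be illustrated with plain timestamps. This is a sketch of the boundary semantics described above, not the library's implementation; `in_range` is a hypothetical helper.

```python
from datetime import datetime


def in_range(ts, start=None, end=None):
    """Return True if ts lies in [start, end); None means unbounded."""
    return (start is None or ts >= start) and (end is None or ts < end)


timestamps = [datetime(2025, 1, 1, hour) for hour in (9, 10, 11, 12)]

# start is inclusive (10:00 kept), end is exclusive (12:00 dropped).
kept = [
    ts
    for ts in timestamps
    if in_range(ts, start=datetime(2025, 1, 1, 10), end=datetime(2025, 1, 1, 12))
]
```

Because the end is exclusive, consecutive ranges such as [10:00, 12:00) and [12:00, 14:00) never share a timestamp.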

filter_by_available_before(available_before: datetime) → Self

Filter to include only data available before the specified timestamp.

Parameters:

available_before (datetime) – Cutoff time for data availability.

Returns:

New instance containing only data available before the cutoff.

Return type:

Self
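
As a rough illustration, availability filtering compares each row's available_at value against the cutoff. The tuple layout and the `available_before` helper below are hypothetical, and the sketch assumes a strict comparison; whether the library treats the cutoff itself as included is not stated here.

```python
from datetime import datetime

# Hypothetical rows: (timestamp, available_at, value).
# Three successive forecasts for the same target timestamp.
rows = [
    (datetime(2025, 1, 2, 10), datetime(2025, 1, 1, 6), 18.0),
    (datetime(2025, 1, 2, 10), datetime(2025, 1, 1, 18), 19.5),
    (datetime(2025, 1, 2, 10), datetime(2025, 1, 2, 6), 20.1),
]


def available_before(rows, cutoff):
    """Keep rows whose available_at is strictly before the cutoff (assumption)."""
    return [row for row in rows if row[1] < cutoff]


# At midnight on January 2, only the first two forecasts had been published.
visible = available_before(rows, datetime(2025, 1, 2, 0))
```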

filter_by_available_at(available_at: AvailableAt) → Self

Filter based on realistic daily data availability constraints.

Parameters:

available_at (AvailableAt) – Specification defining when data becomes available.

Returns:

New instance with data filtered by availability pattern.

Return type:

Self

filter_by_lead_time(lead_time: LeadTime) → Self

Filter to include only data that was available at least the specified lead time before its timestamp.

Parameters:

lead_time (LeadTime) – Minimum time gap required between data availability and timestamp.

Returns:

New instance containing only data available with the required lead time.

Return type:

Self
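
The lead-time condition amounts to requiring timestamp minus available_at to be at least the lead time. The sketch below models the lead time as a plain timedelta, which is an assumption; the LeadTime type and the library's internals are not reproduced here.

```python
from datetime import datetime, timedelta

# Hypothetical rows: (timestamp, available_at, value).
target = datetime(2025, 1, 2, 12)
rows = [
    (target, datetime(2025, 1, 2, 11, 30), 3.0),  # gap of 30 minutes
    (target, datetime(2025, 1, 2, 10, 0), 2.0),   # gap of 2 hours
    (target, datetime(2025, 1, 1, 12, 0), 1.0),   # gap of 24 hours
]


def with_lead_time(rows, lead_time):
    """Keep rows that were available at least lead_time before their timestamp."""
    return [row for row in rows if row[0] - row[1] >= lead_time]


# A one-hour lead time drops the version published 30 minutes ahead.
kept = with_lead_time(rows, timedelta(hours=1))
```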

select_version() → TimeSeriesDataset

Select a specific version of the dataset based on data availability.

Creates a point-in-time snapshot by selecting the latest available version for each timestamp. Essential for preventing lookahead bias in backtesting.

Returns: TimeSeriesDataset containing the selected version of the data.
Return type: TimeSeriesDataset
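
Selecting the latest available version for each timestamp can be pictured as a group-by on timestamp that keeps the row with the greatest available_at. The following is a minimal stdlib sketch with a hypothetical row layout, not the library's implementation.

```python
from datetime import datetime

# Hypothetical rows: (timestamp, available_at, value).
rows = [
    (datetime(2025, 1, 1, 10), datetime(2025, 1, 1, 6), 19.0),
    (datetime(2025, 1, 1, 10), datetime(2025, 1, 1, 9), 20.5),  # later revision
    (datetime(2025, 1, 1, 11), datetime(2025, 1, 1, 6), 21.0),
]

# For each timestamp, remember the version with the latest available_at.
latest = {}
for ts, available_at, value in rows:
    current = latest.get(ts)
    if current is None or available_at > current[0]:
        latest[ts] = (available_at, value)

# Drop the availability metadata to obtain a plain time series.
snapshot = {ts: value for ts, (_, value) in sorted(latest.items())}
```

In a backtest, the availability filters would typically be applied first, so that the "latest" version is the latest one that was actually known at the simulated point in time.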

to_parquet(path: Annotated[Path, PathType(path_type=file)]) → None

Save the dataset to a parquet file.

Stores both the dataset’s data and all necessary metadata for complete reconstruction. Metadata should be stored in the parquet file’s attrs dictionary.

Parameters: path (Annotated[Path, PathType(path_type=file)]) – File path where the dataset should be saved.

See also

to_parquet: Counterpart method for saving datasets.

Parameters:

path (Path)
sample_interval (timedelta | None)
timestamp_column (str)
available_at_column (str)
horizon_column (str)

Return type:

Self

classmethod concat(datasets: Sequence[Self], mode: ConcatMode) → Self

Concatenate multiple versioned datasets into a single dataset.

Combines multiple VersionedTimeSeriesDataset instances using the specified concatenation mode. Supports different strategies for handling overlapping time indices across datasets.

This method is useful when you have data from different sources or time periods that need to be combined while preserving their versioning information. For example, combining weather data from different providers or merging historical data with recent updates.

Parameters:

datasets (Sequence[Self]) – Sequence of VersionedTimeSeriesDataset instances to concatenate. Must contain at least one dataset.
mode (ConcatMode) – Concatenation mode determining how to handle overlapping indices:
- “left”: Use indices from the first dataset only.
- “outer”: Union of all indices across datasets.
- “inner”: Intersection of all indices across datasets.

Returns:

New VersionedTimeSeriesDataset containing all data parts from input datasets.

Raises:

TimeSeriesValidationError – If no datasets are provided for concatenation.

Return type:

Self
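
The three modes differ only in how the resulting index is derived from the input indices. The sketch below shows that set logic on plain lists of timestamps; `combine_indices` is a hypothetical helper and `ValueError` stands in for the documented `TimeSeriesValidationError`.

```python
from datetime import datetime


def combine_indices(indices, mode):
    """Derive the combined index for 'left', 'outer', or 'inner' mode."""
    if not indices:
        raise ValueError("at least one dataset is required")
    if mode == "left":
        return sorted(indices[0])
    if mode == "outer":
        return sorted(set().union(*indices))
    if mode == "inner":
        return sorted(set(indices[0]).intersection(*indices[1:]))
    raise ValueError(f"unknown mode: {mode!r}")


first = [datetime(2025, 1, 1, hour) for hour in (10, 11)]
second = [datetime(2025, 1, 1, hour) for hour in (11, 12)]

left = combine_indices([first, second], "left")
outer = combine_indices([first, second], "outer")
inner = combine_indices([first, second], "inner")
```

With these inputs, "left" keeps 10:00 and 11:00, "outer" spans 10:00 through 12:00, and "inner" keeps only the shared 11:00 timestamp.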

classmethod from_dataframe(data: DataFrame, sample_interval: timedelta, *, timestamp_column: str = 'timestamp', available_at_column: str = 'available_at') → Self

Create a VersionedTimeSeriesDataset from a single DataFrame.

Convenience constructor for creating a versioned dataset from a single DataFrame containing all features.

Parameters:

data (DataFrame) – DataFrame containing versioned time series data with timestamp and available_at columns.
sample_interval (timedelta) – The regular interval between consecutive data points.
available_at_column (str) – Name of the column indicating when data became available. Default is ‘available_at’.
timestamp_column (str) – Name of the column indicating the timestamps of the data. Default is ‘timestamp’.

Returns:

New VersionedTimeSeriesDataset instance containing the data.

Return type:

Self

Example

Create dataset from a single DataFrame:

>>> import pandas as pd
>>> from datetime import datetime, timedelta
>>> from openstef_core.datasets import VersionedTimeSeriesDataset
>>> data = pd.DataFrame({
...     'available_at': [datetime.fromisoformat('2025-01-01T10:05:00'),
...                      datetime.fromisoformat('2025-01-01T10:20:00')],
...     'load': [100.0, 120.0],
...     'temperature': [20.0, 22.0]
... }, index=pd.DatetimeIndex([datetime.fromisoformat('2025-01-01T10:00:00'),
...                            datetime.fromisoformat('2025-01-01T10:15:00')], name='timestamp'))
>>> dataset = VersionedTimeSeriesDataset.from_dataframe(data, timedelta(minutes=15))
>>> sorted(dataset.feature_names)
['load', 'temperature']

Note

This is equivalent to creating a TimeSeriesDataset and then wrapping it in a VersionedTimeSeriesDataset, but more convenient for simple cases.


to_horizons(horizons: list[LeadTime]) → TimeSeriesDataset

Convert versioned dataset to horizon-based format for multiple lead times.

Selects data for each specified horizon, adds a horizon column, and combines into a single TimeSeriesDataset. Useful for creating multi-horizon training data.

Parameters: horizons (list[LeadTime])
Returns: TimeSeriesDataset with horizon column indicating forecast lead time.
Return type: TimeSeriesDataset
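
One way to picture the horizon-based format: for each horizon, keep the versions that satisfy that lead time, pick the latest of them per timestamp, and tag the result with the horizon. The sketch below follows that reading of the description above; the row layout, the `to_horizon_rows` helper, and the use of timedelta for lead times are assumptions, not the library's implementation.

```python
from datetime import datetime, timedelta

# Hypothetical rows: (timestamp, available_at, value).
target = datetime(2025, 1, 2, 12)
rows = [
    (target, datetime(2025, 1, 2, 6), 4.0),   # published 6 hours ahead
    (target, datetime(2025, 1, 2, 11), 5.0),  # published 1 hour ahead
]


def to_horizon_rows(rows, horizons):
    """Build one latest-version selection per horizon and stack the results."""
    table = []
    for horizon in horizons:
        latest = {}
        for ts, available_at, value in rows:
            if ts - available_at < horizon:
                continue  # not yet available at this lead time
            if ts not in latest or available_at > latest[ts][0]:
                latest[ts] = (available_at, value)
        table.extend(
            {"timestamp": ts, "value": value, "horizon": horizon}
            for ts, (_, value) in sorted(latest.items())
        )
    return table


table = to_horizon_rows(rows, [timedelta(hours=1), timedelta(hours=6)])
```

The same target timestamp appears once per horizon: the 1-hour horizon sees the fresher value, while the 6-hour horizon only sees what was published early enough.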