VersionedTimeSeriesDataset
- class openstef_core.datasets.VersionedTimeSeriesDataset(data_parts: list[TimeSeriesDataset], *, index: DatetimeIndex | None = None) -> None
Bases: TimeSeriesMixin, DatasetMixin
A versioned time series dataset composed of multiple data parts.
This class combines multiple TimeSeriesDataset instances into a unified dataset that tracks data availability over time. It provides methods to filter datasets by time ranges, availability constraints, and lead times, as well as select specific versions of the data for point-in-time reconstruction.
The dataset is particularly useful for realistic backtesting scenarios where data arrives with delays or gets revised over time.
Key motivation: This architecture solves the O(n²) space complexity problem that occurs when concatenating DataFrames with misaligned (timestamp, available_at) pairs. Instead of immediately combining data, it uses lazy composition that delays actual DataFrame concatenation until select_version() is called.
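The growth described above can be illustrated with a rough sketch in plain Python. It is not the library's implementation: rows are modelled as dictionaries keyed by (timestamp, available_at), and the helper names `make_part`, `eager_cell_count`, and `lazy_cell_count` are hypothetical. Because each part has its own availability delay, almost no keys are shared, so a single wide table needs a row for every key and a column for every feature.

```python
from datetime import datetime, timedelta


def make_part(feature: str, n: int, delay_minutes: int) -> dict:
    """Build one part: rows keyed by (timestamp, available_at), one feature each."""
    start = datetime(2025, 1, 1)
    rows = {}
    for i in range(n):
        ts = start + timedelta(hours=i)
        rows[(ts, ts + timedelta(minutes=delay_minutes))] = {feature: float(i)}
    return rows


def eager_cell_count(parts: list[dict]) -> int:
    """Cells in one wide table: union of all keys times all feature columns."""
    keys = set().union(*(part.keys() for part in parts))
    features = set().union(*(row.keys() for part in parts for row in part.values()))
    return len(keys) * len(features)


def lazy_cell_count(parts: list[dict]) -> int:
    """Cells actually stored when the parts are kept separate."""
    return sum(len(row) for part in parts for row in part.values())


# Four parts with different availability delays, so their keys never coincide.
parts = [make_part(f"feature_{k}", n=100, delay_minutes=5 * (k + 1)) for k in range(4)]
eager = eager_cell_count(parts)
lazy = lazy_cell_count(parts)
```

In this toy setup the wide table holds four times as many cells as the separate parts, and most of them are empty. The gap widens as more misaligned parts are added.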
- Variables:
data_parts – List of TimeSeriesDataset instances that compose this dataset.
Example
Create a versioned dataset by combining multiple data parts:
>>> import pandas as pd
>>> from datetime import datetime, timedelta
>>>
>>> # Create weather data part
>>> weather_data = pd.DataFrame({
...     'temperature': [20.5],
...     'available_at': [datetime(2025, 1, 1, 16, 0)]
... }, index=pd.DatetimeIndex([datetime(2025, 1, 1, 10, 0)]))
>>> weather_part = TimeSeriesDataset(weather_data, timedelta(hours=1))
>>>
>>> # Combine into versioned dataset
>>> dataset = VersionedTimeSeriesDataset([weather_part])
>>> dataset.is_versioned
True
Note
All data parts must have identical sample intervals and disjoint feature sets. The final dataset index is the union of all part indices, enabling flexible composition of data sources with different coverage periods.
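The constraints in this note can be sketched as a small check in plain Python. This is an illustration only: parts are modelled as dictionaries, `validate_parts` is a hypothetical helper, and `ValueError` stands in for the library's TimeSeriesValidationError.

```python
from datetime import datetime, timedelta


def validate_parts(parts: list[dict]) -> list[datetime]:
    """Check the composition rules and return the union index, sorted.

    Each part is a dict with 'sample_interval', 'features', and 'index'.
    """
    if not parts:
        raise ValueError("at least one data part is required")
    # All parts must share one sample interval.
    if len({part["sample_interval"] for part in parts}) != 1:
        raise ValueError("sample intervals differ between parts")
    # Feature sets must be disjoint.
    seen: set[str] = set()
    for part in parts:
        overlap = seen & set(part["features"])
        if overlap:
            raise ValueError(f"duplicate features: {sorted(overlap)}")
        seen |= set(part["features"])
    # The combined index is the union of all part indices.
    return sorted(set().union(*(part["index"] for part in parts)))


hour = timedelta(hours=1)
weather = {
    "sample_interval": hour,
    "features": ["temperature"],
    "index": [datetime(2025, 1, 1, 10), datetime(2025, 1, 1, 11)],
}
load = {
    "sample_interval": hour,
    "features": ["load"],
    "index": [datetime(2025, 1, 1, 11), datetime(2025, 1, 1, 12)],
}
combined_index = validate_parts([weather, load])
```

The two parts cover different periods, yet the combined index spans all three hours.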
- Parameters:
data_parts (list[TimeSeriesDataset])
index (DatetimeIndex | None)
- __init__(data_parts: list[TimeSeriesDataset], *, index: DatetimeIndex | None = None) -> None
Initialize a versioned time series dataset from multiple parts.
- Parameters:
data_parts (list[TimeSeriesDataset]) – List of TimeSeriesDataset instances to combine. Must have identical sample intervals and disjoint feature sets.
index (DatetimeIndex | None) – Optional explicit index for the combined dataset. If not provided, the union of all part indices will be used.
- Raises:
TimeSeriesValidationError – If no data parts provided or validation fails.
- data_parts: list[TimeSeriesDataset]
- property index: DatetimeIndex
Get the datetime index of the dataset.
- Returns:
DatetimeIndex representing all timestamps in the dataset.
- property sample_interval: timedelta
Get the fixed time interval between consecutive data points.
- Returns:
The sample interval as a timedelta.
- property feature_names: list[str]
Get the names of all available features in the dataset.
- Returns:
List of feature names, excluding metadata columns like timestamp, available_at, or horizon.
- property is_versioned: bool
Check if the dataset tracks data availability over time.
- Returns:
True if the dataset is versioned (tracks availability via horizon or available_at columns), False for regular time series.
- filter_by_range(start: datetime | None = None, end: datetime | None = None) -> Self
Filter the dataset to include only data within the specified time range.
- Parameters:
start (datetime | None) – Inclusive start time of the range. If None, no start boundary applied.
end (datetime | None) – Exclusive end time of the range. If None, no end boundary applied.
- Returns:
New instance containing only data within [start, end).
- Return type:
Self
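The half-open interval [start, end) can be sketched on a plain list of timestamps. The helper `in_range` is hypothetical and only mirrors the boundary rules described above; it is not the library's code.

```python
from datetime import datetime


def in_range(timestamps, start=None, end=None):
    """Keep timestamps in the half-open interval [start, end)."""
    return [
        ts
        for ts in timestamps
        if (start is None or ts >= start) and (end is None or ts < end)
    ]


hours = [datetime(2025, 1, 1, h) for h in range(6)]
selected = in_range(hours, start=datetime(2025, 1, 1, 2), end=datetime(2025, 1, 1, 4))
```

Here `selected` contains 02:00 and 03:00 but not 04:00, because the end boundary is exclusive.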
- filter_by_available_before(available_before: datetime) -> Self
Filter to include only data available before the specified timestamp.
- Parameters:
available_before (datetime) – Cutoff time for data availability.
- Returns:
New instance containing only data available before the cutoff.
- Return type:
Self
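As a rough sketch of the idea, rows can be modelled as (timestamp, available_at, value) tuples and filtered on the second field. The helper `available_before` is hypothetical, and the strict comparison is an assumption: this page does not state whether data available exactly at the cutoff is kept.

```python
from datetime import datetime


def available_before(rows, cutoff):
    """Keep rows whose available_at lies before the cutoff (strict, by assumption)."""
    return [row for row in rows if row[1] < cutoff]


rows = [
    # (timestamp, available_at, value)
    (datetime(2025, 1, 2, 12), datetime(2025, 1, 1, 6), 1.0),
    (datetime(2025, 1, 2, 12), datetime(2025, 1, 1, 18), 2.0),
    (datetime(2025, 1, 2, 12), datetime(2025, 1, 2, 6), 3.0),
]
known_at_noon = available_before(rows, datetime(2025, 1, 1, 12))
```

Only the first forecast was available by noon on 1 January, so a backtest evaluated at that moment sees just that row.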
- filter_by_available_at(available_at: AvailableAt) -> Self
Filter based on realistic daily data availability constraints.
- Parameters:
available_at (AvailableAt) – Specification defining when data becomes available.
- Returns:
New instance with data filtered by availability pattern.
- Return type:
Self
- filter_by_lead_time(lead_time: LeadTime) -> Self
Filter to include only data that became available at least the specified lead time before its timestamp.
- Parameters:
lead_time (LeadTime) – Minimum time gap required between data availability and timestamp.
- Returns:
New instance containing only data available with the required lead time.
- Return type:
Self
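The lead-time rule can be sketched with (timestamp, available_at, value) tuples. The helper `with_lead_time` is hypothetical, and a plain `timedelta` stands in for the LeadTime type, whose constructor is not shown on this page.

```python
from datetime import datetime, timedelta


def with_lead_time(rows, lead_time):
    """Keep rows where timestamp - available_at is at least lead_time."""
    return [row for row in rows if row[0] - row[1] >= lead_time]


target = datetime(2025, 1, 2, 12)
rows = [
    # (timestamp, available_at, value)
    (target, target - timedelta(hours=30), 1.0),
    (target, target - timedelta(hours=24), 2.0),
    (target, target - timedelta(hours=1), 3.0),
]
day_ahead = with_lead_time(rows, timedelta(hours=24))
```

The row published one hour ahead is dropped, while the two rows available 24 hours or more in advance remain.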
- select_version() -> TimeSeriesDataset
Select a specific version of the dataset based on data availability.
Creates a point-in-time snapshot by selecting the latest available version for each timestamp. This is essential for preventing lookahead bias in backtesting.
- Returns:
TimeSeriesDataset containing the selected version of the data.
- Return type:
TimeSeriesDataset
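Selecting the latest available version per timestamp can be sketched as follows. This is a minimal illustration on (timestamp, available_at, value) tuples with a hypothetical helper `select_latest`; the library operates on DataFrames and may differ in detail.

```python
from datetime import datetime


def select_latest(rows):
    """Return {timestamp: value}, keeping the value with the latest available_at."""
    latest = {}
    for ts, available_at, value in rows:
        if ts not in latest or available_at > latest[ts][0]:
            latest[ts] = (available_at, value)
    return {ts: latest[ts][1] for ts in sorted(latest)}


rows = [
    # (timestamp, available_at, value)
    (datetime(2025, 1, 2, 12), datetime(2025, 1, 1, 6), 1.0),
    (datetime(2025, 1, 2, 12), datetime(2025, 1, 1, 18), 2.0),
    (datetime(2025, 1, 2, 13), datetime(2025, 1, 1, 6), 5.0),
]
snapshot = select_latest(rows)
```

Applying an availability filter first and then selecting the latest version yields the view that would have existed at the cutoff, which is what keeps later revisions out of a backtest.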
- to_parquet(path: Annotated[Path, PathType(path_type=file)]) -> None
Save the dataset to a parquet file.
Stores both the dataset’s data and all necessary metadata for complete reconstruction. Metadata should be stored in the parquet file’s attrs dictionary.
- Parameters:
path (Annotated[Path, PathType(path_type=file)]) – File path where the dataset should be saved.
See also
read_parquet: Counterpart method for loading datasets.
- Return type:
None
- classmethod read_parquet(path: Annotated[Path, PathType(path_type=file)], *, sample_interval: timedelta | None = None, timestamp_column: str = 'timestamp', available_at_column: str = 'available_at', horizon_column: str = 'horizon') -> Self
Load a dataset from a parquet file.
Reconstructs a dataset from a parquet file created with to_parquet, including all data and metadata. Should handle missing metadata gracefully with sensible defaults.
- Parameters:
path (Annotated[Path, PathType(path_type=file)]) – Path to the parquet file to load.
sample_interval (timedelta | None)
timestamp_column (str)
available_at_column (str)
horizon_column (str)
- Returns:
New dataset instance reconstructed from the file.
- Return type:
Self
See also
to_parquet: Counterpart method for saving datasets.
- classmethod concat(datasets: Sequence[Self], mode: ConcatMode) -> Self
Concatenate multiple versioned datasets into a single dataset.
Combines multiple VersionedTimeSeriesDataset instances using the specified concatenation mode. Supports different strategies for handling overlapping time indices across datasets.
This method is useful when you have data from different sources or time periods that need to be combined while preserving their versioning information. For example, combining weather data from different providers or merging historical data with recent updates.
- Parameters:
datasets (Sequence[Self]) – Sequence of VersionedTimeSeriesDataset instances to concatenate. Must contain at least one dataset.
mode (ConcatMode) – Concatenation mode determining how to handle overlapping indices:
- “left”: Use indices from the first dataset only
- “outer”: Union of all indices across datasets
- “inner”: Intersection of all indices across datasets
- Returns:
New VersionedTimeSeriesDataset containing all data parts from input datasets.
- Raises:
TimeSeriesValidationError – If no datasets are provided for concatenation.
- Return type:
Self
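The three concatenation modes differ only in how the combined index is chosen, which can be sketched with plain sets of timestamps. The helper `combine_indices` is hypothetical, and `ValueError` stands in for TimeSeriesValidationError.

```python
from datetime import datetime


def combine_indices(indices, mode):
    """Combine per-dataset timestamp lists according to the concat mode."""
    if not indices:
        raise ValueError("at least one dataset is required")
    sets = [set(index) for index in indices]
    if mode == "left":
        combined = sets[0]  # indices from the first dataset only
    elif mode == "outer":
        combined = set().union(*sets)  # union of all indices
    elif mode == "inner":
        combined = set.intersection(*sets)  # intersection of all indices
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    return sorted(combined)


first = [datetime(2025, 1, 1, h) for h in (10, 11, 12)]
second = [datetime(2025, 1, 1, h) for h in (11, 12, 13)]
left = combine_indices([first, second], "left")
outer = combine_indices([first, second], "outer")
inner = combine_indices([first, second], "inner")
```

With these inputs, "left" keeps 10:00 to 12:00, "outer" spans 10:00 to 13:00, and "inner" keeps only the shared 11:00 and 12:00.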
- classmethod from_dataframe(data: DataFrame, sample_interval: timedelta, *, timestamp_column: str = 'timestamp', available_at_column: str = 'available_at') -> Self
Create a VersionedTimeSeriesDataset from a single DataFrame.
Convenience constructor for creating a versioned dataset from a single DataFrame containing all features.
- Parameters:
data (DataFrame) – DataFrame containing versioned time series data with timestamp and available_at columns.
sample_interval (timedelta) – The regular interval between consecutive data points.
available_at_column (str) – Name of the column indicating when data became available. Default is ‘available_at’.
timestamp_column (str) – Name of the column indicating the timestamps of the data. Default is ‘timestamp’.
- Returns:
New VersionedTimeSeriesDataset instance containing the data.
- Return type:
Self
Example
Create dataset from a single DataFrame:
>>> import pandas as pd
>>> from datetime import datetime, timedelta
>>> data = pd.DataFrame({
...     'available_at': [datetime.fromisoformat('2025-01-01T10:05:00'),
...                      datetime.fromisoformat('2025-01-01T10:20:00')],
...     'load': [100.0, 120.0],
...     'temperature': [20.0, 22.0]
... }, index=pd.DatetimeIndex([datetime.fromisoformat('2025-01-01T10:00:00'),
...                            datetime.fromisoformat('2025-01-01T10:15:00')], name='timestamp'))
>>> dataset = VersionedTimeSeriesDataset.from_dataframe(data, timedelta(minutes=15))
>>> sorted(dataset.feature_names)
['load', 'temperature']
Note
This is equivalent to creating a TimeSeriesDataset and then wrapping it in a VersionedTimeSeriesDataset, but more convenient for simple cases.
- to_horizons(horizons: list[LeadTime]) -> TimeSeriesDataset
Convert versioned dataset to horizon-based format for multiple lead times.
Selects data for each specified horizon, adds a horizon column, and combines into a single TimeSeriesDataset. Useful for creating multi-horizon training data.
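The steps described here can be sketched on (timestamp, available_at, value) tuples. The helper `to_horizon_rows` is hypothetical, plain `timedelta` values stand in for LeadTime, and the exact selection rule (filter by lead time, then keep the latest available version) is an assumption based on the description above.

```python
from datetime import datetime, timedelta


def to_horizon_rows(rows, horizons):
    """Return (timestamp, horizon, value) rows, one block per horizon."""
    result = []
    for horizon in horizons:
        latest = {}
        for ts, available_at, value in rows:
            if ts - available_at < horizon:
                continue  # not available early enough for this horizon
            if ts not in latest or available_at > latest[ts][0]:
                latest[ts] = (available_at, value)
        # Tag every selected row with the horizon it was selected for.
        result.extend((ts, horizon, latest[ts][1]) for ts in sorted(latest))
    return result


target = datetime(2025, 1, 2, 12)
rows = [
    (target, target - timedelta(hours=6), 1.0),
    (target, target - timedelta(hours=1), 2.0),
]
stacked = to_horizon_rows(rows, [timedelta(hours=1), timedelta(hours=3)])
```

The same timestamp appears once per horizon: the one-hour horizon sees the most recent value, while the three-hour horizon falls back to the older one.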
- Returns:
TimeSeriesDataset with horizon column indicating forecast lead time.
- Parameters:
horizons (list[LeadTime])
- Return type: