TimeSeriesDataset
- class openstef_core.datasets.TimeSeriesDataset(data: DataFrame, sample_interval: timedelta = timedelta(minutes=15), *, horizon_column: str = 'horizon', available_at_column: str = 'available_at', is_sorted: bool = False) -> None
Bases: TimeSeriesMixin, DatasetMixin
A time series dataset with regular sampling intervals and optional versioning.
This class represents time series data with a consistent sampling interval and provides operations for data access, persistence, and filtering. It supports both regular time series and versioned datasets where data availability is tracked over time through either a horizon column or an available_at column.
The dataset automatically detects versioning:
- If a horizon column exists, data is versioned by forecast horizon
- If an available_at column exists, data is versioned by availability time
- Otherwise, data is treated as a regular time series
- The dataset guarantees:
Data is sorted by timestamp in ascending order
Consistent sampling interval across all data points
DateTime index for temporal operations
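The guarantees listed above can be checked on a plain pandas DataFrame. The sketch below is an illustration only: check_guarantees is a hypothetical helper, not part of openstef_core, and it assumes an unversioned series without gaps, so that every step between consecutive timestamps equals the sample interval.

```python
from datetime import timedelta

import pandas as pd


def check_guarantees(df: pd.DataFrame, sample_interval: timedelta) -> bool:
    """Hypothetical helper: True if df looks like a regular, sorted time series."""
    if not isinstance(df.index, pd.DatetimeIndex):
        return False
    if not df.index.is_monotonic_increasing:
        return False
    # Steps between consecutive timestamps (empty for fewer than two rows).
    gaps = df.index.to_series().diff().dropna()
    return bool((gaps == pd.Timedelta(sample_interval)).all())


index = pd.date_range("2025-01-01", periods=4, freq="15min")
regular = pd.DataFrame({"load": [100, 120, 110, 130]}, index=index)
shuffled = regular.iloc[[2, 0, 1, 3]]  # same rows, no longer sorted

print(check_guarantees(regular, timedelta(minutes=15)))   # True
print(check_guarantees(shuffled, timedelta(minutes=15)))  # False
```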
- Variables:
data – DataFrame containing the time series data indexed by timestamp.
horizon_column – Name of the column storing forecast horizons (if versioned by horizon).
available_at_column – Name of the column storing availability times (if versioned).
Example
Create a simple time series dataset:
>>> import pandas as pd
>>> from datetime import timedelta
>>> from openstef_core.datasets import TimeSeriesDataset
>>> data = pd.DataFrame({
...     'temperature': [20.1, 22.3, 21.5],
...     'load': [100, 120, 110]
... }, index=pd.date_range('2025-01-01', periods=3, freq='15min'))
>>> dataset = TimeSeriesDataset(data, sample_interval=timedelta(minutes=15))
>>> dataset.feature_names
['temperature', 'load']
>>> dataset.is_versioned
False
Create a versioned dataset with horizons:
>>> data_with_horizon = pd.DataFrame({
...     'load': [100, 120],
...     'horizon': pd.to_timedelta(['1h', '2h'])
... }, index=pd.date_range('2025-01-01', periods=2, freq='1h'))
>>> dataset = TimeSeriesDataset(data_with_horizon, sample_interval=timedelta(hours=1))
>>> dataset.is_versioned
True
>>> dataset.horizons is not None
True
- Parameters:
data (DataFrame)
sample_interval (timedelta)
horizon_column (str)
available_at_column (str)
is_sorted (bool)
- index_name: ClassVar[str] = 'timestamp'
- __init__(data: DataFrame, sample_interval: timedelta = timedelta(minutes=15), *, horizon_column: str = 'horizon', available_at_column: str = 'available_at', is_sorted: bool = False) -> None
Initialize a time series dataset.
The dataset automatically detects whether it’s versioned based on column presence:
- If horizon_column exists: versioned by forecast horizon
- If available_at_column exists: versioned by availability time
- Otherwise: regular time series
- Parameters:
data (DataFrame) – DataFrame with DatetimeIndex containing the time series data.
sample_interval (timedelta) – Fixed interval between consecutive data points.
horizon_column (str) – Name of the column storing forecast horizons.
available_at_column (str) – Name of the column storing availability times.
is_sorted (bool) – Whether the data is sorted by timestamp.
- Raises:
TypeError – If data index is not a pandas DatetimeIndex or if versioning columns have incorrect types.
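The detection rule and the index check described for __init__ can be sketched with pandas alone. This is not the library's implementation: detect_versioning is a hypothetical helper, and giving the horizon column precedence when both columns are present is an assumption based only on the order in which the rules are listed.

```python
from typing import Optional

import pandas as pd


def detect_versioning(
    df: pd.DataFrame,
    horizon_column: str = "horizon",
    available_at_column: str = "available_at",
) -> Optional[str]:
    """Hypothetical helper mirroring the documented detection order."""
    if not isinstance(df.index, pd.DatetimeIndex):
        raise TypeError("data index must be a pandas DatetimeIndex")
    if horizon_column in df.columns:
        return "horizon"  # assumed to take precedence if both columns exist
    if available_at_column in df.columns:
        return "available_at"
    return None  # regular, unversioned time series


index = pd.date_range("2025-01-01", periods=2, freq="60min")
plain = pd.DataFrame({"load": [100, 120]}, index=index)
by_horizon = plain.assign(horizon=pd.to_timedelta(["1h", "2h"]))
by_available_at = plain.assign(available_at=index - pd.Timedelta(hours=1))

print(detect_versioning(plain))            # None
print(detect_versioning(by_horizon))       # horizon
print(detect_versioning(by_available_at))  # available_at
```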
- horizon_column: str
- available_at_column: str
- data: DataFrame
- property index: DatetimeIndex
Get the datetime index of the dataset.
- Returns:
DatetimeIndex representing all timestamps in the dataset.
- property sample_interval: timedelta
Get the fixed time interval between consecutive data points.
- Returns:
The sample interval as a timedelta.
- property feature_names: list[str]
Get the names of all available features in the dataset.
- Returns:
List of feature names, excluding metadata columns like timestamp, available_at, or horizon.
- property is_versioned: bool
Check if the dataset tracks data availability over time.
- Returns:
True if the dataset is versioned (tracks availability via horizon or available_at columns), False for regular time series.
- property horizons: list[LeadTime] | None
Get the list of forecast horizons present in the dataset.
- Returns:
List of unique forecast horizons if the dataset is versioned by horizons, None otherwise.
- property available_at_series: Series | None
Get the availability times as a pandas Series.
- Returns:
Series containing availability times indexed by timestamp if versioned, None for non-versioned datasets.
- property lead_time_series: Series | None
Get the lead times as a pandas Series.
Lead time is the gap between when data became available and the timestamp.
- Returns:
Series containing lead times indexed by timestamp if versioned, None for non-versioned datasets.
- filter_by_range(start: datetime | None = None, end: datetime | None = None) -> Self
Filter the dataset to include only data within the specified time range.
- Parameters:
start (datetime | None) – Inclusive start time of the range. If None, no start boundary is applied.
end (datetime | None) – Exclusive end time of the range. If None, no end boundary is applied.
- Returns:
New instance containing only data within [start, end).
- Return type:
Self
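The half-open interval [start, end) means a row stamped exactly at start is kept and a row stamped exactly at end is dropped. A minimal pandas sketch of that rule follows; filter_half_open is a hypothetical helper used for illustration, not the method's actual implementation.

```python
from datetime import datetime
from typing import Optional

import numpy as np
import pandas as pd


def filter_half_open(
    df: pd.DataFrame,
    start: Optional[datetime] = None,
    end: Optional[datetime] = None,
) -> pd.DataFrame:
    """Hypothetical helper: keep rows with start <= timestamp < end."""
    keep = np.ones(len(df), dtype=bool)
    if start is not None:
        keep &= df.index >= start  # inclusive lower bound
    if end is not None:
        keep &= df.index < end  # exclusive upper bound
    return df[keep]


index = pd.date_range("2025-01-01", periods=4, freq="15min")
df = pd.DataFrame({"load": [100, 120, 110, 130]}, index=index)

window = filter_half_open(
    df, start=datetime(2025, 1, 1, 0, 15), end=datetime(2025, 1, 1, 0, 45)
)
print(window["load"].tolist())  # [120, 110]
```

Omitting both bounds returns every row, matching the "no boundary applied" behaviour described for None.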
- filter_by_available_before(available_before: datetime) -> Self
Filter to include only data available before the specified timestamp.
- Parameters:
available_before (datetime) – Cutoff time for data availability.
- Returns:
New instance containing only data available before the cutoff.
- Return type:
Self
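On a DataFrame with an available_at column, the cutoff filter amounts to a comparison on that column. The sketch below assumes a strict comparison (rows available exactly at the cutoff are excluded); the documentation above does not say whether the boundary is inclusive, so treat that detail as an assumption.

```python
from datetime import datetime

import pandas as pd

df = pd.DataFrame(
    {
        "load": [100, 120, 110],
        "available_at": pd.to_datetime(
            ["2025-01-01 06:00", "2025-01-01 09:00", "2025-01-01 12:00"]
        ),
    },
    index=pd.date_range("2025-01-02", periods=3, freq="60min"),
)

cutoff = datetime(2025, 1, 1, 9, 0)
# Strict "<" is an assumption about the boundary behaviour.
visible = df[df["available_at"] < cutoff]
print(visible["load"].tolist())  # [100]
```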
- filter_by_available_at(available_at: AvailableAt) -> Self
Filter based on realistic daily data availability constraints.
- Parameters:
available_at (AvailableAt) – Specification defining when data becomes available.
- Returns:
New instance with data filtered by availability pattern.
- Return type:
Self
- filter_by_lead_time(lead_time: LeadTime) -> Self
Filter to include only data that was available at least the specified lead time before its timestamp.
- Parameters:
lead_time (LeadTime) – Minimum time gap required between data availability and timestamp.
- Returns:
New instance containing only data available with the required lead time.
- Return type:
Self
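The lead-time rule can be illustrated on a DataFrame with an available_at column: compute the gap between each timestamp and its availability time, then keep rows whose gap is at least the requested minimum. The sketch uses a plain pandas Timedelta in place of the library's LeadTime type, and the sign convention (timestamp minus availability time) is an assumption.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "load": [100, 120, 110],
        "available_at": pd.to_datetime(
            ["2025-01-01 06:00", "2025-01-01 06:00", "2025-01-01 12:00"]
        ),
    },
    index=pd.date_range("2025-01-01 12:00", periods=3, freq="60min"),
)

# Gap between the timestamp and the moment the value became available.
lead_time = df.index.to_series() - df["available_at"]

# Keep rows that were available at least six hours ahead.
kept = df[lead_time >= pd.Timedelta(hours=6)]
print(kept["load"].tolist())  # [100, 120]
```

The third row has a lead time of only two hours, so it is dropped.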
- select_version() -> Self
Select a specific version of the dataset based on data availability.
Creates a point-in-time snapshot by selecting the latest available version for each timestamp. Essential for preventing lookahead bias in backtesting.
- Returns:
TimeSeriesDataset containing the selected version of the data.
- Return type:
Self
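Selecting the latest available version per timestamp can be sketched in pandas as "sort by availability, keep the last row of each timestamp". This is one possible way to express the idea, not necessarily how select_version is implemented; the column and index names are chosen for the example.

```python
import pandas as pd

versions = pd.DataFrame(
    {
        "load": [100.0, 105.0, 120.0],
        "available_at": pd.to_datetime(
            ["2025-01-01 00:00", "2025-01-01 06:00", "2025-01-01 06:00"]
        ),
    },
    index=pd.DatetimeIndex(
        ["2025-01-01 12:00", "2025-01-01 12:00", "2025-01-01 13:00"],
        name="timestamp",
    ),
)

# Order versions oldest to newest, then keep the newest row per timestamp.
ordered = versions.sort_values("available_at", kind="stable")
latest = ordered.groupby(level="timestamp").tail(1).sort_index()
print(latest["load"].tolist())  # [105.0, 120.0]
```

The 12:00 timestamp has two versions; only the one made available at 06:00 survives, leaving one row per timestamp.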
- filter_index(mask: Index) -> Self
Filter dataset to include only timestamps present in the mask.
- Returns:
New dataset containing only rows with timestamps in the mask.
- Parameters:
mask (Index)
- Return type:
Self
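In pandas terms, keeping only the timestamps present in a mask corresponds to a membership test on the index. The following sketch shows that behaviour on a plain DataFrame; it is an illustration of the described semantics, and it assumes that mask entries absent from the data are simply ignored.

```python
import pandas as pd

index = pd.date_range("2025-01-01", periods=4, freq="15min")
df = pd.DataFrame({"load": [100, 120, 110, 130]}, index=index)

# The last entry does not occur in the data and has no effect.
mask = pd.DatetimeIndex(
    ["2025-01-01 00:15", "2025-01-01 00:45", "2025-01-02 00:00"]
)

subset = df[df.index.isin(mask)]
print(subset["load"].tolist())  # [120, 130]
```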
- select_horizon(horizon: LeadTime) -> Self
Select data for a specific forecast horizon.
- Parameters:
horizon (LeadTime) – The forecast horizon to filter the dataset by.
- Returns:
A new TimeSeriesDataset instance containing only data for the specified horizon.
- Return type:
Self
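For a dataset versioned by horizon, selecting one horizon corresponds to an equality filter on the horizon column. The sketch below uses a pandas Timedelta in place of the library's LeadTime type and keeps the horizon column in the result; whether select_horizon retains that column is not stated above.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "load": [100, 101, 120, 121],
        "horizon": pd.to_timedelta(["1h", "24h", "1h", "24h"]),
    },
    index=pd.DatetimeIndex(
        [
            "2025-01-01 12:00",
            "2025-01-01 12:00",
            "2025-01-01 13:00",
            "2025-01-01 13:00",
        ]
    ),
)

# Positional boolean mask, so repeated timestamps cause no alignment issues.
is_one_hour = (df["horizon"] == pd.Timedelta(hours=1)).to_numpy()
one_hour = df.loc[is_one_hour]
print(one_hour["load"].tolist())  # [100, 120]
```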
- to_pandas() -> DataFrame
Convert the dataset to a pandas DataFrame with metadata stored in attrs.
Stores sample_interval, available_at_column, and horizon_column in the DataFrame’s attrs dictionary for later reconstruction.
- Returns:
DataFrame with dataset data and metadata in attrs.
- Return type:
DataFrame
- classmethod from_pandas(df: DataFrame, *, sample_interval: timedelta | None = None, available_at_column: str = 'available_at', horizon_column: str = 'horizon') -> Self
Create a dataset instance from a pandas DataFrame with metadata in attrs.
Reads sample_interval, available_at_column, and horizon_column from the DataFrame’s attrs dictionary.
- Parameters:
df (DataFrame) – DataFrame containing dataset data with metadata in attrs.
sample_interval (timedelta | None) – Fixed interval between consecutive data points. If None, reads from attrs.
available_at_column (str) – Name of the column storing availability times.
horizon_column (str) – Name of the column storing forecast horizons.
- Returns:
New TimeSeriesDataset instance reconstructed from the DataFrame.
- Return type:
Self
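The attrs-based round trip relies on pandas' DataFrame.attrs dictionary. The sketch below shows the fallback described for sample_interval (an explicit argument wins, otherwise the value is read from attrs). The attrs key names and the resolve_sample_interval helper are assumptions made for illustration; the library's actual attrs layout may differ.

```python
from datetime import timedelta
from typing import Optional

import pandas as pd


def resolve_sample_interval(
    df: pd.DataFrame, sample_interval: Optional[timedelta] = None
) -> timedelta:
    """Hypothetical helper: an explicit argument wins, otherwise read attrs."""
    if sample_interval is not None:
        return sample_interval
    return df.attrs["sample_interval"]


df = pd.DataFrame(
    {"load": [100, 120]},
    index=pd.date_range("2025-01-01", periods=2, freq="15min"),
)
# Key names are an assumption about how the metadata is stored.
df.attrs.update(
    sample_interval=timedelta(minutes=15),
    available_at_column="available_at",
    horizon_column="horizon",
)

print(resolve_sample_interval(df))                      # 0:15:00
print(resolve_sample_interval(df, timedelta(hours=1)))  # 1:00:00
```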
- to_parquet(path: Annotated[Path, PathType(path_type=file)]) -> None
Save the dataset to a parquet file.
Stores both the dataset’s data and all necessary metadata for complete reconstruction. Metadata should be stored in the parquet file’s attrs dictionary.
- Parameters:
path (Annotated[Path, PathType(path_type=file)]) – File path where the dataset should be saved.
- Return type:
None
See also
read_parquet: Counterpart method for loading datasets.
- classmethod read_parquet(path: Annotated[Path, PathType(path_type=file)], *, sample_interval: timedelta | None = None, timestamp_column: str = 'timestamp', available_at_column: str = 'available_at', horizon_column: str = 'horizon') -> Self
Load a dataset from a parquet file.
Reconstructs a dataset from a parquet file created with to_parquet, including all data and metadata. Should handle missing metadata gracefully with sensible defaults.
- Parameters:
path (Annotated[Path, PathType(path_type=file)]) – Path to the parquet file to load.
sample_interval (timedelta | None)
timestamp_column (str)
available_at_column (str)
horizon_column (str)
- Returns:
New dataset instance reconstructed from the file.
- Return type:
Self
See also
to_parquet: Counterpart method for saving datasets.
- pipe_pandas(func: Callable[[Concatenate[DataFrame, P]], DataFrame], *args: P, **kwargs: P) -> Self
Apply a pandas DataFrame transformation function to the dataset.
Executes a function on the underlying DataFrame and wraps the result back into a TimeSeriesDataset, preserving all metadata.
- Returns:
New dataset with the transformation applied.
- Parameters:
func (Callable[[Concatenate[DataFrame, ParamSpec(P)]], DataFrame])
args (ParamSpecArgs)
kwargs (ParamSpecKwargs)
- Return type:
Self
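A function suitable for this kind of call takes a DataFrame as its first argument, accepts any extra arguments after it, and returns a DataFrame. The sketch below defines such a function (clip_load is a hypothetical example) and applies it with pandas' own DataFrame.pipe, which forwards extra arguments in the same way.

```python
import pandas as pd


def clip_load(df: pd.DataFrame, lower: float, upper: float) -> pd.DataFrame:
    """Example transformation: DataFrame first, extra arguments after."""
    return df.assign(load=df["load"].clip(lower=lower, upper=upper))


df = pd.DataFrame(
    {"load": [90.0, 120.0, 150.0]},
    index=pd.date_range("2025-01-01", periods=3, freq="15min"),
)

# DataFrame.pipe passes df as the first argument, then the keyword arguments.
clipped = df.pipe(clip_load, lower=100.0, upper=140.0)
print(clipped["load"].tolist())  # [100.0, 120.0, 140.0]
```

Going by the signature above, the corresponding call on a dataset would presumably be dataset.pipe_pandas(clip_load, lower=100.0, upper=140.0).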
- select_features(feature_names: list[str]) -> TimeSeriesDataset
Select a subset of features from the dataset.
- Parameters:
feature_names (list[str]) – List of feature column names to retain in the dataset.
- Returns:
A new TimeSeriesDataset instance containing only the specified features.
- Return type:
TimeSeriesDataset
- copy_with(data: DataFrame, *, is_sorted: bool = False) -> TimeSeriesDataset
Create a copy of the dataset with new data.
- Parameters:
data (DataFrame) – New DataFrame to use for the dataset.
is_sorted (bool) – Whether the new data is already sorted by timestamp.
- Returns:
New TimeSeriesDataset instance with the provided data and same metadata.
- Return type:
TimeSeriesDataset