Data¶
pyepo.data provides synthetic data generators and the optDataset class for wrapping data samples.
For more details, see the 02 Optimization Dataset notebook.
Data Generator¶
pyepo.data includes synthetic data generators for four optimization problems: shortest path, multi-dimensional knapsack, traveling salesperson, and portfolio optimization.
Each generator produces feature-cost pairs \((\mathbf{x}, \mathbf{c})\). The feature vector \(\mathbf{x}_i \in \mathbb{R}^p\) follows a standard multivariate Gaussian distribution \(\mathcal{N}(0, \mathbf{I})\), and the cost \(\mathbf{c}_i \in \mathbb{R}^d\) is computed from a polynomial function \(f(\mathbf{x}_i)\) scaled by a multiplicative noise factor \(\boldsymbol{\epsilon}_i \sim U(1-\bar{\epsilon}, 1+\bar{\epsilon})\).
Common parameters across all generators:
num_data (\(n\)): number of data samples
num_features (\(p\)): feature dimension
deg (\(deg\)): polynomial degree of the mapping \(f(\mathbf{x}_i)\)
noise_width (\(\bar{\epsilon}\)): noise half-width (shortest path, knapsack, TSP; portfolio uses
noise_levelinstead – see below)seed: random seed for reproducibility
Shortest Path¶
A random matrix \(\mathcal{B} \in \mathbb{R}^{d \times p}\) with Bernoulli(0.5) entries maps the feature vector into the cost coefficients: \(c_i^j = \big[\tfrac{1}{{3.5}^{deg}} \big(\tfrac{1}{\sqrt{p}}(\mathcal{B} \mathbf{x}_i)_j + 3\big)^{deg} + 1\big] \cdot \epsilon_i^j\).
- pyepo.data.shortestpath.genData(num_data: int, num_features: int, grid: tuple[int, int], deg: int = 1, noise_width: float = 0, seed: int = 135) tuple[ndarray, ndarray]
Generate synthetic feature-cost pairs for the shortest path problem.
Features are sampled from a standard multivariate Gaussian \(\mathcal{N}(0, \mathbf{I})\). A random Bernoulli(0.5) matrix \(\mathcal{B}\) maps each feature vector into the edge-cost coefficients via a polynomial of degree
deg, scaled by multiplicative uniform noise of half-widthnoise_width.- Parameters:
num_data – number of data points
num_features – dimension of features
grid – size of grid network
deg – polynomial degree of the feature-to-cost mapping
noise_width – half-width of the multiplicative uniform noise
seed – random seed (default 135 for reproducibility)
- Returns:
data features (np.ndarray), costs (np.ndarray)
- Return type:
import pyepo
num_data = 1000 # number of data
num_feat = 5 # size of feature
grid = (5,5) # grid size
x, c = pyepo.data.shortestpath.genData(num_data, num_feat, grid, deg=4, noise_width=0, seed=135)
Knapsack¶
Only the cost coefficients are uncertain; item weights are fixed. Let \(m\) be the number of items and \(k\) the number of resource dimensions. The weights \(\mathcal{W} \in \mathbb{R}^{k \times m}\) are sampled from 3 to 8 with one decimal place of precision. The cost coefficients are \(c_i^j = \big\lceil \big[\tfrac{5}{{3.5}^{deg}} \big(\tfrac{1}{\sqrt{p}}(\mathcal{B} \mathbf{x}_i)_j + 3\big)^{deg} + 1\big] \cdot \epsilon_i^j \big\rceil\).
- pyepo.data.knapsack.genData(num_data: int, num_features: int, num_items: int, dim: int = 1, deg: int = 1, noise_width: float = 0, seed: int = 135) tuple[ndarray, ndarray, ndarray]
Generate synthetic feature-cost pairs for the multi-dimensional knapsack.
Item weights are fixed across instances; only the value (cost) of each item depends on features. Features are sampled from a standard Gaussian \(\mathcal{N}(0, \mathbf{I})\), mapped through a random Bernoulli(0.5) matrix \(\mathcal{B}\) and a polynomial of degree
deg, then scaled by multiplicative uniform noise of half-widthnoise_widthand rounded up to the nearest integer.- Parameters:
num_data – number of data points
num_features – dimension of features
num_items – number of items
dim – dimension of multi-dimensional knapsack
deg – polynomial degree of the feature-to-cost mapping
noise_width – half-width of the multiplicative uniform noise
seed – random state seed (default 135 for reproducibility)
- Returns:
weights of items (np.ndarray), data features (np.ndarray), costs (np.ndarray)
- Return type:
import pyepo
num_data = 1000 # number of data
num_feat = 5 # size of feature
num_item = 32 # number of items
dim = 3 # dimension of knapsack
weights, x, c = pyepo.data.knapsack.genData(num_data, num_feat, num_item, dim, deg=4, noise_width=0, seed=135)
Traveling Salesperson¶
The distance matrix has two components: a Euclidean distance term and a feature-encoded term. Coordinates are drawn from a mixture of a Gaussian distribution \(\mathcal{N}(0, \mathbf{I})\) and a uniform distribution \(\mathbf{U}(-2, 2)\). The feature-encoded component is \(\tfrac{1}{{3}^{deg - 1}} \big(\tfrac{1}{\sqrt{p}} (\mathcal{B} \mathbf{x}_i)_j + 3\big)^{deg} \cdot \boldsymbol{\epsilon}_i\), where the elements of \(\mathcal{B}\) are products of Bernoulli \(\mathbf{B}(0.5)\) and uniform \(\mathbf{U}(-2, 2)\) samples.
- pyepo.data.tsp.genData(num_data: int, num_features: int, num_nodes: int, deg: int = 1, noise_width: float = 0, seed: int = 135) tuple[ndarray, ndarray]
Generate synthetic feature-cost pairs for the traveling salesperson problem.
Edge costs combine a Euclidean component (node coordinates drawn from a mixture of a Gaussian \(\mathcal{N}(0, \mathbf{I})\) and a uniform \(\mathbf{U}(-2, 2)\) distribution) with a feature-encoded component obtained by mapping the standard-Gaussian feature vector through a random Bernoulli \(\times\) uniform matrix \(\mathcal{B}\) and a polynomial of degree
deg, scaled by multiplicative noise of half-widthnoise_width.- Parameters:
num_data – number of data points
num_features – dimension of features
num_nodes – number of nodes
deg – polynomial degree of the feature-to-cost mapping
noise_width – half-width of the multiplicative uniform noise
seed – random seed (default 135 for reproducibility)
- Returns:
data features (np.ndarray), costs (np.ndarray)
- Return type:
import pyepo
num_data = 1000 # number of data
num_feat = 5 # size of feature
num_node = 20 # number of nodes
x, c = pyepo.data.tsp.genData(num_data, num_feat, num_node, deg=4, noise_width=0, seed=135)
Portfolio¶
Let \(\bar{r}_{ij} = \big(\tfrac{0.05}{\sqrt{p}}(\mathcal{B} \mathbf{x}_i)_j + {0.1}^{\frac{1}{deg}}\big)^{deg}\). The expected return is \(\mathbf{r}_i = \bar{\mathbf{r}}_i + \mathbf{L} \mathbf{f} + 0.01 \tau \boldsymbol{\epsilon}\), and the covariance matrix is \(\mathbf{\Sigma} = \mathbf{L} \mathbf{L}^{\intercal} + (0.01 \tau)^2 \mathbf{I}\), where \(\mathcal{B}\) follows a Bernoulli distribution, \(\mathbf{L} \sim \mathbf{U}(-0.0025\tau, 0.0025\tau)\), and \(\mathbf{f}, \boldsymbol{\epsilon} \sim \mathcal{N}(0, \mathbf{I})\).
Unlike the other generators, portfolio noise is controlled by noise_level (\(\tau\)), which scales both the factor loadings \(\mathbf{L}\) and the residual noise; noise_width does not apply here.
- pyepo.data.portfolio.genData(num_data: int, num_features: int, num_assets: int, deg: int = 1, noise_level: float = 1, seed: int = 135) tuple[ndarray, ndarray, ndarray]
Generate synthetic feature-cost pairs for portfolio optimization.
Returns the expected returns \(\mathbf{r}\) (the per-instance cost vectors) and a single shared covariance matrix \(\mathbf{\Sigma}\) used in the risk constraint of the predefined portfolio model. The mean returns follow a factor-model structure \(\mathbf{r}_i = \bar{\mathbf{r}}_i + \mathbf{L}\mathbf{f} + 0.01 \tau \boldsymbol{\epsilon}\), where the factor loadings \(\mathbf{L}\) and residual noise are both scaled by
noise_level(\(\tau\)). Unlike the other generators inpyepo.data, portfolio noise is controlled bynoise_levelrather thannoise_width.- Parameters:
num_data – number of data points
num_features – dimension of features
num_assets – number of assets
deg – polynomial degree of the feature-to-return mapping
noise_level – scales factor loadings L and residual noise (tau)
seed – random seed (default 135 for reproducibility)
- Returns:
covariance matrix (np.ndarray), data features (np.ndarray), mean returns (np.ndarray)
- Return type:
import pyepo
num_data = 1000 # number of data
num_feat = 4 # size of feature
num_assets = 50 # number of assets
cov, x, r = pyepo.data.portfolio.genData(num_data, num_feat, num_assets, deg=4, noise_level=1, seed=135)
optDataset¶
pyepo.data.optDataset is a PyTorch Dataset that stores features and cost coefficients, and solves the optimization problem to obtain optimal solutions and objective values.
optDataset is the standard input format for end-to-end training in PyEPO: it precomputes \(\mathbf{w}^*(\mathbf{c})\) and \(z^*(\mathbf{c})\) once at construction time, so the training loop does not pay solver cost for these label lookups. If those labels are already available from another source, optDataset can be skipped and batches fed directly to pyepo.func modules.
- class pyepo.data.dataset.optDataset(model: optModel, feats: ndarray | Tensor, costs: ndarray | Tensor)
PyTorch
Datasetfor predict-then-optimize problems.At construction time it solves the optimization problem for every cost vector and caches the optimal solution \(\mathbf{w}^*(\mathbf{c})\) and objective value \(z^*(\mathbf{c})\). This precomputation removes solver overhead from the training loop, making
optDatasetthe standard input format for end-to-end training in PyEPO. When labels are already available from another source,optDatasetcan be skipped and batches fed directly topyepo.funcmodules.- Variables:
model (optModel) – Optimization model
feats (torch.Tensor) – Data features
costs (torch.Tensor) – Cost vectors
sols (torch.Tensor) – Cached optimal solutions w*(c)
objs (torch.Tensor) – Cached optimal objective values z*(c)
The following example shows how to use optDataset with a PyTorch DataLoader:
import pyepo
from torch.utils.data import DataLoader
# model for shortest path
grid = (5,5) # grid size
model = pyepo.model.grb.shortestPathModel(grid)
# generate data
num_data = 1000 # number of data
num_feat = 5 # size of feature
deg = 4 # polynomial degree
noise_width = 0 # noise width
x, c = pyepo.data.shortestpath.genData(num_data, num_feat, grid, deg, noise_width, seed=135)
# build dataset
dataset = pyepo.data.dataset.optDataset(model, x, c)
# get data loader
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)
optDatasetKNN¶
pyepo.data.optDatasetKNN is a PyTorch Dataset that implements the k-nearest neighbors (kNN) robust loss [1] for decision-focused learning. It stores features and cost coefficients, and computes the mean k-nearest-neighbor solutions and the corresponding optimal objective values.
For a runnable walkthrough, see the 08 kNN Robust Losses notebook.
- class pyepo.data.dataset.optDatasetKNN(model: optModel, feats: ndarray | Tensor, costs: ndarray | Tensor, k: int = 10, weight: float = 0.5)
PyTorch
Datasetfor the kNN-robust decision-focused loss.For each instance the cost vector is replaced with a convex combination of its k nearest neighbours in feature space, and the optimization problem is solved on the smoothed costs. The mean kNN solutions and objective values are cached for training, providing a robust supervision signal under noisy or out-of-distribution feature observations.
Reference: Schutte et al. (2023) https://arxiv.org/abs/2310.04328
- Variables:
model (optModel) – Optimization model
k (int) – number of nearest neighbours selected
weight (float) – self-weight in the kNN convex combination (1.0 = no smoothing)
feats (torch.Tensor) – Data features
costs (torch.Tensor) – kNN-smoothed cost vectors
sols (torch.Tensor) – Mean kNN optimal solutions
objs (torch.Tensor) – Mean kNN optimal objective values
import pyepo
from torch.utils.data import DataLoader
# model for shortest path
grid = (5,5) # grid size
model = pyepo.model.grb.shortestPathModel(grid)
# generate data
num_data = 1000 # number of data
num_feat = 5 # size of feature
deg = 4 # polynomial degree
noise_width = 0 # noise width
x, c = pyepo.data.shortestpath.genData(num_data, num_feat, grid, deg, noise_width, seed=135)
# build dataset
dataset = pyepo.data.dataset.optDatasetKNN(model, x, c, k=10, weight=0.5)
# get data loader
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)
optDatasetConstrs¶
pyepo.data.dataset.optDatasetConstrs is a PyTorch Dataset for the CaVE [2] cone-aligned loss. In addition to the features, costs, optimal solutions, and objective values stored by optDataset, it also extracts the normals of the binding constraints at the optimal vertex for each instance. CaVE then projects the sense-flipped predicted cost vector onto the cone spanned by these normals.
Because the binding-constraint extraction relies on Gurobi’s sparse-matrix and constraint-sense APIs, optDatasetConstrs currently requires a Gurobi-backed optModel. The dataset also enforces that the optimal vertex is binary — CaVE is defined for binary linear programs.
For a runnable walkthrough that uses optDatasetConstrs end-to-end with the CaVE loss, see the 04 CaVE for Binary Linear Programs notebook.
- class pyepo.data.dataset.optDatasetConstrs(model: optModel, feats: ndarray | Tensor, costs: ndarray | Tensor, skip_infeas: bool = False)
PyTorch
Datasetfor the CaVE cone-aligned loss.Stores features and cost coefficients, solves each instance with a fresh copy of the Gurobi model, and extracts the normals of the binding constraints at the optimal vertex in canonical
<=orientation. These normals span the polyhedral cone onto whichconeAlignedCosineprojects the predicted cost vector during training.CaVE is defined for binary linear programs only, so the optimal vertex must be binary; instances that are infeasible or have non-binary optima raise (or are skipped when
skip_infeas=True). Binding-constraint extraction uses Gurobi’s sparse-matrix API, which is why this dataset currently requires a Gurobi-backedoptModel.Per-instance row counts differ (different constraints bind at different vertices), so batches must be assembled with
collate_tight_constraints.Reference: Tang & Khalil (2024) https://link.springer.com/chapter/10.1007/978-3-031-60599-4_12
- Variables:
model (optModel) – Gurobi-backed optimization model
feats (torch.Tensor) – Data features
costs (torch.Tensor) – Cost vectors
sols (torch.Tensor) – Optimal solutions
objs (torch.Tensor) – Optimal objective values
ctrs (list[torch.Tensor]) – Per-instance binding-constraint normals in canonical
<=orientation (ragged row counts)
Per-instance constraint matrices have different row counts (different sets of constraints are tight at different vertices), so batching requires a custom collate_fn:
- pyepo.data.dataset.collate_tight_constraints(batch)
Collate function for
optDatasetConstrsbatches.Stacks the standard
(x, c, w, z)tensors and zero-pads the ragged per-instance binding-constraint matrices to a common row count so they can be assembled into a single(batch, max_rows, num_cost)tensor forconeAlignedCosine.
import pyepo
from torch.utils.data import DataLoader
from pyepo.data.dataset import optDatasetConstrs, collate_tight_constraints
# model for TSP (Gurobi backend required)
model = pyepo.model.grb.tspDFJModel(num_nodes=10)
# generate data
x, c = pyepo.data.tsp.genData(num_data=1000, num_features=5, num_nodes=10, deg=4, seed=135)
# build CaVE dataset (extracts tight binding-constraint normals at the optimum)
dataset = optDatasetConstrs(model, x, c)
# collate_fn pads ragged per-instance constraint matrices
dataloader = DataLoader(dataset, batch_size=32, shuffle=True, collate_fn=collate_tight_constraints)