Data and Datasets¶
pyepo.data provides synthetic data generators and the optDataset class for wrapping data samples.
For more details, see the 02 Optimization Dataset notebook.
Data Generator¶
pyepo.data includes synthetic data generators for four optimization problems: shortest path, multi-dimensional knapsack, traveling salesperson, and portfolio optimization.
Each generator produces feature-cost pairs \((\mathbf{x}, \mathbf{c})\). The feature vector \(\mathbf{x}_i \in \mathbb{R}^p\) follows a standard multivariate Gaussian distribution \(\mathcal{N}(0, \mathbf{I})\), and the cost \(\mathbf{c}_i \in \mathbb{R}^d\) is computed from a polynomial function \(f(\mathbf{x}_i)\) scaled by a multiplicative noise factor \(\boldsymbol{\epsilon}_i \sim U(1-\bar{\epsilon}, 1+\bar{\epsilon})\).
Common parameters across all generators:
num_data (\(n\)): number of data samples
num_features (\(p\)): feature dimension
deg (\(deg\)): polynomial degree of the mapping \(f(\mathbf{x}_i)\)
noise_width (\(\bar{\epsilon}\)): noise half-width (shortest path, knapsack, TSP; portfolio uses
noise_levelinstead; see below)seed: random seed for reproducibility
Shortest Path¶
A random matrix \(\mathcal{B} \in \mathbb{R}^{d \times p}\) with Bernoulli(0.5) entries maps the feature vector into the cost coefficients: \(c_i^j = \tfrac{1}{{3.5}^{deg}} \big[\big(\tfrac{1}{\sqrt{p}}(\mathcal{B} \mathbf{x}_i)_j + 3\big)^{deg} + 1\big] \cdot \epsilon_i^j\).
- pyepo.data.shortestpath.genData(num_data: int, num_features: int, grid: tuple[int, int], deg: int = 1, noise_width: float = 0, seed: int = 135) tuple[ndarray, ndarray]
Generate synthetic feature-cost pairs for the shortest path problem.
Features are sampled from a standard multivariate Gaussian \(\mathcal{N}(0, \mathbf{I})\). A random Bernoulli(0.5) matrix \(\mathcal{B}\) maps each feature vector into the edge-cost coefficients via a polynomial of degree
deg, scaled by multiplicative uniform noise of half-widthnoise_width.- Parameters:
num_data – number of data points
num_features – dimension of features
grid – size of grid network
deg – polynomial degree of the feature-to-cost mapping
noise_width – half-width of the multiplicative uniform noise
seed – random seed (default 135 for reproducibility)
- Returns:
data features (np.ndarray), costs (np.ndarray)
- Return type:
import pyepo
num_data = 1000 # number of data
num_feat = 5 # size of feature
grid = (5,5) # grid size
x, c = pyepo.data.shortestpath.genData(num_data, num_feat, grid, deg=4, noise_width=0, seed=135)
Knapsack¶
Only the cost coefficients are uncertain; item weights are fixed. Let \(m\) be the number of items and \(k\) the number of resource dimensions. The weights \(\mathcal{W} \in \mathbb{R}^{k \times m}\) are drawn uniformly from the half-open range \([3, 8)\) to two decimal places. The cost coefficients are \(c_i^j = \big\lceil \tfrac{5}{{3.5}^{deg}} \big[\big(\tfrac{1}{\sqrt{p}}(\mathcal{B} \mathbf{x}_i)_j + 3\big)^{deg} + 1\big] \cdot \epsilon_i^j \big\rceil\).
- pyepo.data.knapsack.genData(num_data: int, num_features: int, num_items: int, dim: int = 1, deg: int = 1, noise_width: float = 0, seed: int = 135) tuple[ndarray, ndarray, ndarray]
Generate synthetic feature-cost pairs for the multi-dimensional knapsack.
Item weights are fixed across instances; only the value (cost) of each item depends on features. Features are sampled from a standard Gaussian \(\mathcal{N}(0, \mathbf{I})\), mapped through a random Bernoulli(0.5) matrix \(\mathcal{B}\) and a polynomial of degree
deg, then scaled by multiplicative uniform noise of half-widthnoise_widthand rounded up to the nearest integer.- Parameters:
num_data – number of data points
num_features – dimension of features
num_items – number of items
dim – dimension of multi-dimensional knapsack
deg – polynomial degree of the feature-to-cost mapping
noise_width – half-width of the multiplicative uniform noise
seed – random state seed (default 135 for reproducibility)
- Returns:
weights of items (np.ndarray), data features (np.ndarray), costs (np.ndarray)
- Return type:
import pyepo
num_data = 1000 # number of data
num_feat = 5 # size of feature
num_item = 32 # number of items
dim = 3 # dimension of knapsack
weights, x, c = pyepo.data.knapsack.genData(num_data, num_feat, num_item, dim, deg=4, noise_width=0, seed=135)
Traveling Salesperson¶
The distance matrix has two components: a Euclidean distance term and a feature-encoded term. Coordinates are drawn from a mixture of a Gaussian distribution \(\mathcal{N}(0, \mathbf{I})\) and a uniform distribution \(\mathbf{U}(-2, 2)\). The feature-encoded component is \(\tfrac{1}{{3}^{deg - 1}} \big(\tfrac{1}{\sqrt{p}} (\mathcal{B} \mathbf{x}_i)_j + 3\big)^{deg} \cdot \boldsymbol{\epsilon}_i\), where the elements of \(\mathcal{B}\) are products of Bernoulli \(\mathbf{B}(0.5)\) and uniform \(\mathbf{U}(-2, 2)\) samples.
- pyepo.data.tsp.genData(num_data: int, num_features: int, num_nodes: int, deg: int = 1, noise_width: float = 0, seed: int = 135) tuple[ndarray, ndarray]
Generate synthetic feature-cost pairs for the traveling salesperson problem.
Edge costs combine a Euclidean component (node coordinates drawn from a mixture of a Gaussian \(\mathcal{N}(0, \mathbf{I})\) and a uniform \(\mathbf{U}(-2, 2)\) distribution) with a feature-encoded component obtained by mapping the standard-Gaussian feature vector through a random Bernoulli \(\times\) uniform matrix \(\mathcal{B}\) and a polynomial of degree
deg, scaled by multiplicative noise of half-widthnoise_width.- Parameters:
num_data – number of data points
num_features – dimension of features
num_nodes – number of nodes
deg – polynomial degree of the feature-to-cost mapping
noise_width – half-width of the multiplicative uniform noise
seed – random seed (default 135 for reproducibility)
- Returns:
data features (np.ndarray), costs (np.ndarray)
- Return type:
import pyepo
num_data = 1000 # number of data
num_feat = 5 # size of feature
num_node = 20 # number of nodes
x, c = pyepo.data.tsp.genData(num_data, num_feat, num_node, deg=4, noise_width=0, seed=135)
Portfolio¶
Let \(\bar{r}_{ij} = \big(\tfrac{0.05}{\sqrt{p}}(\mathcal{B} \mathbf{x}_i)_j + {0.1}^{\frac{1}{deg}}\big)^{deg}\). The expected return is \(\mathbf{r}_i = \bar{\mathbf{r}}_i + \mathbf{L} \mathbf{f} + 0.01 \tau \boldsymbol{\epsilon}\), and the covariance matrix is \(\mathbf{\Sigma} = \mathbf{L} \mathbf{L}^{\intercal} + (0.01 \tau)^2 \mathbf{I}\), where \(\mathcal{B}\) follows a Bernoulli distribution, \(\mathbf{L} \sim \mathbf{U}(-0.0025\tau, 0.0025\tau)\), and \(\mathbf{f}, \boldsymbol{\epsilon} \sim \mathcal{N}(0, \mathbf{I})\).
Unlike the other generators, portfolio noise is controlled by noise_level (\(\tau\)), which scales both the factor loadings \(\mathbf{L}\) and the residual noise; noise_width does not apply here.
- pyepo.data.portfolio.genData(num_data: int, num_features: int, num_assets: int, deg: int = 1, noise_level: float = 1, seed: int = 135) tuple[ndarray, ndarray, ndarray]
Generate synthetic feature-cost pairs for portfolio optimization.
Returns the expected returns \(\mathbf{r}\) (the per-instance cost vectors) and a single shared covariance matrix \(\mathbf{\Sigma}\) used in the risk constraint of the predefined portfolio model. The mean returns follow a factor-model structure \(\mathbf{r}_i = \bar{\mathbf{r}}_i + \mathbf{L}\mathbf{f} + 0.01 \tau \boldsymbol{\epsilon}\), where the factor loadings \(\mathbf{L}\) and residual noise are both scaled by
noise_level(\(\tau\)). Unlike the other generators inpyepo.data, portfolio noise is controlled bynoise_levelrather thannoise_width.- Parameters:
num_data – number of data points
num_features – dimension of features
num_assets – number of assets
deg – polynomial degree of the feature-to-return mapping
noise_level – scales factor loadings L and residual noise (tau)
seed – random seed (default 135 for reproducibility)
- Returns:
covariance matrix (np.ndarray), data features (np.ndarray), mean returns (np.ndarray)
- Return type:
import pyepo
num_data = 1000 # number of data
num_feat = 4 # size of feature
num_assets = 50 # number of assets
cov, x, r = pyepo.data.portfolio.genData(num_data, num_feat, num_assets, deg=4, noise_level=1, seed=135)
Built-in Problem Models¶
PyEPO includes built-in models for several classic problems. Each is built by a factory that takes a backend keyword (default "gurobi"). Pair one with generated data and an optDataset:
import pyepo
from pyepo import model
grid = (5, 5)
x, c = pyepo.data.shortestpath.genData(1000, 5, grid, deg=4, seed=135)
optmodel = model.shortestPathModel(grid) # default Gurobi
dataset = pyepo.data.dataset.optDataset(optmodel, x, c)
Switch the solver with backend. The generic backends take a solver= argument naming the open solver to run:
model.shortestPathModel(grid, backend="copt")
model.shortestPathModel(grid, backend="pyomo", solver="glpk")
model.shortestPathModel(grid, backend="mpax") # LP on GPU
Note
In end-to-end training, pyepo.func modules call setObj and solve during the forward pass.
Shortest Path Model¶
Minimum-cost path from the northwest to the southeast corner of an (h, w) grid, formulated as a minimum-cost-flow LP. Backends: gurobi, copt, pyomo, ortools, mpax.
Knapsack Model¶
Multi-dimensional 0/1 knapsack: maximize value subject to per-dimension capacities. weights has shape (dim, n_items) and capacity has length dim. Backends: gurobi, copt, pyomo, ortools, mpax (LP relaxation).
weights = [[3, 4, 3, 6, 4], [4, 5, 2, 3, 5], [5, 4, 6, 2, 3]]
capacity = [12, 10, 15]
optmodel = model.knapsackModel(weights, capacity)
- pyepo.model.knapsackModel(weights, capacity, *, backend='gurobi', **kwargs)¶
Multi-dimensional knapsack.
- Parameters:
weights (ndarray) – item weights with shape
(dim, n_items)capacity (ndarray) – per-dimension capacity with length
dimbackend (str) – solver backend; one of
"gurobi","copt","pyomo","ortools","mpax"
Traveling Salesperson Model¶
Shortest tour visiting each city once. formulation is "DFJ" (lazy subtour elimination), "GG", or "MTZ". Backends: gurobi and copt (all three); pyomo (GG, MTZ). On gurobi, recycle_cuts=True keeps the subtour cuts found in one solve for later solves.
optmodel = model.tspModel(20, formulation="DFJ", recycle_cuts=True)
- pyepo.model.tspModel(num_nodes, *, backend='gurobi', formulation='DFJ', **kwargs)¶
Traveling salesperson.
Capacitated Vehicle Routing Model¶
Shortest vehicle routes from a depot that serve every customer within capacity. formulation is "RCI" (lazy rounded-capacity cuts) or "MTZ". Backends: gurobi and copt (both); pyomo (MTZ). On gurobi, "RCI" also accepts recycle_cuts=True to keep the cuts found in one solve for later solves.
optmodel = model.vrpModel(10, demands=[2, 1, 3, 2, 1, 2, 1, 3, 2],
capacity=5.0, num_vehicle=3, formulation="RCI")
- pyepo.model.vrpModel(num_nodes, demands, capacity, num_vehicle, *, backend='gurobi', formulation='RCI', **kwargs)¶
Capacitated vehicle routing.
- Parameters:
num_nodes (int) – number of nodes, with the depot as node 0
demands (list) – per-customer demands with length
num_nodes - 1capacity (float) – vehicle capacity
num_vehicle (int) – number of vehicles
backend (str) – solver backend; one of
"gurobi","copt","pyomo"formulation (str) – ILP formulation;
"RCI"or"MTZ"("RCI"on gurobi and copt only)
Portfolio Model¶
Mean-variance allocation that maximizes return under a risk budget. Backends: gurobi, copt, pyomo.
import numpy as np
covariance = np.cov(np.random.randn(10, 50), rowvar=False)
optmodel = model.portfolioModel(50, covariance)
optDataset¶
pyepo.data.optDataset is a PyTorch Dataset that stores features and cost coefficients, and solves the optimization problem to obtain optimal solutions and objective values.
optDataset precomputes \(\mathbf{w}^*(\mathbf{c})\) and \(z^*(\mathbf{c})\) at construction time. If those labels are already available, batches can be passed directly to pyepo.func modules.
- class pyepo.data.dataset.optDataset(model: optModel, feats: ndarray | Tensor, costs: ndarray | Tensor)
PyTorch
Datasetfor predict-then-optimize problems.At construction time it solves the optimization problem for every cost vector and caches the optimal solution \(\mathbf{w}^*(\mathbf{c})\) and objective value \(z^*(\mathbf{c})\). This precomputation removes solver overhead from the training loop, making
optDatasetthe standard input format for end-to-end training in PyEPO. When labels are already available from another source,optDatasetcan be skipped and batches fed directly topyepo.funcmodules.- Variables:
model (optModel) – Optimization model
feats (torch.Tensor) – Data features
costs (torch.Tensor) – Cost vectors
sols (torch.Tensor) – Cached optimal solutions w*(c)
objs (torch.Tensor) – Cached optimal objective values z*(c)
The following example shows how to use optDataset with a PyTorch DataLoader:
import pyepo
from torch.utils.data import DataLoader
# model for shortest path
grid = (5,5) # grid size
model = pyepo.model.shortestPathModel(grid)
# generate data
num_data = 1000 # number of data
num_feat = 5 # size of feature
deg = 4 # polynomial degree
noise_width = 0 # noise width
x, c = pyepo.data.shortestpath.genData(num_data, num_feat, grid, deg, noise_width, seed=135)
# build dataset
dataset = pyepo.data.dataset.optDataset(model, x, c)
# get data loader
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)
optDatasetKNN¶
pyepo.data.optDatasetKNN is a PyTorch Dataset that implements the k-nearest neighbors (kNN) robust loss [1] for predict-then-optimize training. It stores features and cost coefficients, and computes the mean k-nearest-neighbor solutions and the corresponding optimal objective values.
For a runnable walkthrough, see the 08 kNN Robust Losses notebook.
- class pyepo.data.dataset.optDatasetKNN(model: optModel, feats: ndarray | Tensor, costs: ndarray | Tensor, k: int = 10, weight: float = 0.5)
PyTorch
Datasetfor the kNN-robust decision-focused loss.For each instance the cost vector is replaced with a convex combination of its k nearest neighbours in feature space, and the optimization problem is solved on the smoothed costs. The mean kNN solutions and objective values are cached for training, providing a robust supervision signal under noisy or out-of-distribution feature observations.
Reference: Schutte et al. (2023) https://arxiv.org/abs/2310.04328
- Variables:
model (optModel) – Optimization model
k (int) – number of nearest neighbours selected
weight (float) – self-weight in the kNN convex combination (1.0 = no smoothing)
feats (torch.Tensor) – Data features
costs (torch.Tensor) – kNN-smoothed cost vectors
sols (torch.Tensor) – Mean kNN optimal solutions
objs (torch.Tensor) – Mean kNN optimal objective values
import pyepo
from torch.utils.data import DataLoader
# model for shortest path
grid = (5,5) # grid size
model = pyepo.model.shortestPathModel(grid)
# generate data
num_data = 1000 # number of data
num_feat = 5 # size of feature
deg = 4 # polynomial degree
noise_width = 0 # noise width
x, c = pyepo.data.shortestpath.genData(num_data, num_feat, grid, deg, noise_width, seed=135)
# build dataset
dataset = pyepo.data.dataset.optDatasetKNN(model, x, c, k=10, weight=0.5)
# get data loader
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)
optDatasetConstrs¶
pyepo.data.dataset.optDatasetConstrs is a PyTorch Dataset for the CaVE [2] cone-aligned loss. In addition to the features, costs, optimal solutions, and objective values stored by optDataset, it also extracts the normals of the binding constraints at the optimal vertex for each instance. CaVE then projects the sense-flipped predicted cost vector onto the cone spanned by these normals.
optDatasetConstrs currently requires a Gurobi-backed optModel. The dataset also checks that the optimal vertex is binary, since CaVE is defined for binary linear programs.
For a runnable walkthrough that uses optDatasetConstrs end-to-end with the CaVE loss, see the 04 CaVE for Binary Linear Programs notebook.
- class pyepo.data.dataset.optDatasetConstrs(model: optModel, feats: ndarray | Tensor, costs: ndarray | Tensor, skip_infeas: bool = False)
PyTorch
Datasetfor the CaVE cone-aligned loss.Stores features and cost coefficients, solves each instance, and extracts the normals of the binding constraints at the optimal vertex in canonical
<=orientation. These normals span the polyhedral cone onto whichconeAlignedCosineprojects the predicted cost vector during training.CaVE is defined for binary linear programs only, so the optimal vertex must be binary; instances that are infeasible or have non-binary optima raise (or are skipped when
skip_infeas=True). Binding-constraint extraction uses Gurobi’s sparse-matrix API, which is why this dataset currently requires a Gurobi-backedoptModel.Per-instance row counts differ (different constraints bind at different vertices), so batch with
optDataLoaderor passcollate_tight_constraintsto a PyTorchDataLoader.Reference: Tang & Khalil (2024) https://link.springer.com/chapter/10.1007/978-3-031-60599-4_12
- Variables:
model (optModel) – Gurobi-backed optimization model
feats (torch.Tensor) – Data features
costs (torch.Tensor) – Cost vectors
sols (torch.Tensor) – Optimal solutions
objs (torch.Tensor) – Optimal objective values
ctrs (list[torch.Tensor]) – Per-instance binding-constraint normals in canonical
<=orientation (ragged row counts)
Per-instance constraint matrices have different row counts (different constraints bind at different vertices), so batch with optDataLoader, which pads them automatically:
- class pyepo.data.dataset.optDataLoader(dataset, *args, **kwargs)
DataLoaderthat applies a dataset’s owncollate_fnwhen present.Datasets with ragged samples (e.g.
optDatasetConstrs, whose binding-constraint matrices vary in row count) carry acollate_fn; this loader uses it so the caller never passes one explicitly. Plain datasets fall back to the default PyTorch collation.
import pyepo
from pyepo.data.dataset import optDatasetConstrs, optDataLoader
# model for TSP (Gurobi backend required)
model = pyepo.model.tspModel(num_nodes=10, formulation="DFJ")
# generate data
x, c = pyepo.data.tsp.genData(num_data=1000, num_features=5, num_nodes=10, deg=4, seed=135)
# build CaVE dataset (extracts tight binding-constraint normals at the optimum)
dataset = optDatasetConstrs(model, x, c)
# optDataLoader pads ragged per-instance constraint matrices
dataloader = optDataLoader(dataset, batch_size=32, shuffle=True)