Data and Datasets¶

pyepo.data provides synthetic data generators and the optDataset class for wrapping data samples.

For more details, see the 02 Optimization Dataset notebook.

Data Generator¶

pyepo.data includes synthetic data generators for four optimization problems: shortest path, multi-dimensional knapsack, traveling salesperson, and portfolio optimization.

Each generator produces feature-cost pairs \((\mathbf{x}, \mathbf{c})\). The feature vector \(\mathbf{x}_i \in \mathbb{R}^p\) follows a standard multivariate Gaussian distribution \(\mathcal{N}(0, \mathbf{I})\), and the cost \(\mathbf{c}_i \in \mathbb{R}^d\) is computed from a polynomial function \(f(\mathbf{x}_i)\) scaled by a multiplicative noise factor \(\boldsymbol{\epsilon}_i \sim U(1-\bar{\epsilon}, 1+\bar{\epsilon})\).

Common parameters across all generators:

num_data (\(n\)): number of data samples
num_features (\(p\)): feature dimension
deg (\(deg\)): polynomial degree of the mapping \(f(\mathbf{x}_i)\)
noise_width (\(\bar{\epsilon}\)): noise half-width (shortest path, knapsack, TSP; portfolio uses noise_level instead; see below)
seed: random seed for reproducibility

Shortest Path¶

A random matrix \(\mathcal{B} \in \mathbb{R}^{d \times p}\) with Bernoulli(0.5) entries maps the feature vector into the cost coefficients: \(c_i^j = \tfrac{1}{{3.5}^{deg}} \big[\big(\tfrac{1}{\sqrt{p}}(\mathcal{B} \mathbf{x}_i)_j + 3\big)^{deg} + 1\big] \cdot \epsilon_i^j\).

pyepo.data.shortestpath.genData(num_data: int, num_features: int, grid: tuple[int, int], deg: int = 1, noise_width: float = 0, seed: int = 135) → tuple[ndarray, ndarray]

Generate synthetic feature-cost pairs for the shortest path problem.

Features are sampled from a standard multivariate Gaussian \(\mathcal{N}(0, \mathbf{I})\). A random Bernoulli(0.5) matrix \(\mathcal{B}\) maps each feature vector into the edge-cost coefficients via a polynomial of degree deg, scaled by multiplicative uniform noise of half-width noise_width.

Parameters:

num_data – number of data points
num_features – dimension of features
grid – size of grid network
deg – polynomial degree of the feature-to-cost mapping
noise_width – half-width of the multiplicative uniform noise
seed – random seed (default 135 for reproducibility)

Returns:

data features (np.ndarray), costs (np.ndarray)

Return type:

tuple

import pyepo

num_data = 1000 # number of data
num_feat = 5 # size of feature
grid = (5,5) # grid size
x, c = pyepo.data.shortestpath.genData(num_data, num_feat, grid, deg=4, noise_width=0, seed=135)

Knapsack¶

Only the cost coefficients are uncertain; item weights are fixed. Let \(m\) be the number of items and \(k\) the number of resource dimensions. The weights \(\mathcal{W} \in \mathbb{R}^{k \times m}\) are drawn uniformly from the half-open range \([3, 8)\) to two decimal places. The cost coefficients are \(c_i^j = \big\lceil \tfrac{5}{{3.5}^{deg}} \big[\big(\tfrac{1}{\sqrt{p}}(\mathcal{B} \mathbf{x}_i)_j + 3\big)^{deg} + 1\big] \cdot \epsilon_i^j \big\rceil\).

pyepo.data.knapsack.genData(num_data: int, num_features: int, num_items: int, dim: int = 1, deg: int = 1, noise_width: float = 0, seed: int = 135) → tuple[ndarray, ndarray, ndarray]

Generate synthetic feature-cost pairs for the multi-dimensional knapsack.

Item weights are fixed across instances; only the value (cost) of each item depends on features. Features are sampled from a standard Gaussian \(\mathcal{N}(0, \mathbf{I})\), mapped through a random Bernoulli(0.5) matrix \(\mathcal{B}\) and a polynomial of degree deg, then scaled by multiplicative uniform noise of half-width noise_width and rounded up to the nearest integer.

Parameters:

num_data – number of data points
num_features – dimension of features
num_items – number of items
dim – dimension of multi-dimensional knapsack
deg – polynomial degree of the feature-to-cost mapping
noise_width – half-width of the multiplicative uniform noise
seed – random state seed (default 135 for reproducibility)

Returns:

weights of items (np.ndarray), data features (np.ndarray), costs (np.ndarray)

Return type:

tuple

import pyepo

num_data = 1000 # number of data
num_feat = 5 # size of feature
num_item = 32 # number of items
dim = 3 # dimension of knapsack
weights, x, c = pyepo.data.knapsack.genData(num_data, num_feat, num_item, dim, deg=4, noise_width=0, seed=135)

Traveling Salesperson¶

The distance matrix has two components: a Euclidean distance term and a feature-encoded term. Coordinates are drawn from a mixture of a Gaussian distribution \(\mathcal{N}(0, \mathbf{I})\) and a uniform distribution \(\mathbf{U}(-2, 2)\). The feature-encoded component is \(\tfrac{1}{{3}^{deg - 1}} \big(\tfrac{1}{\sqrt{p}} (\mathcal{B} \mathbf{x}_i)_j + 3\big)^{deg} \cdot \boldsymbol{\epsilon}_i\), where the elements of \(\mathcal{B}\) are products of Bernoulli \(\mathbf{B}(0.5)\) and uniform \(\mathbf{U}(-2, 2)\) samples.

pyepo.data.tsp.genData(num_data: int, num_features: int, num_nodes: int, deg: int = 1, noise_width: float = 0, seed: int = 135) → tuple[ndarray, ndarray]

Generate synthetic feature-cost pairs for the traveling salesperson problem.

Edge costs combine a Euclidean component (node coordinates drawn from a mixture of a Gaussian \(\mathcal{N}(0, \mathbf{I})\) and a uniform \(\mathbf{U}(-2, 2)\) distribution) with a feature-encoded component obtained by mapping the standard-Gaussian feature vector through a random Bernoulli \(\times\) uniform matrix \(\mathcal{B}\) and a polynomial of degree deg, scaled by multiplicative noise of half-width noise_width.

Parameters:

num_data – number of data points
num_features – dimension of features
num_nodes – number of nodes
deg – polynomial degree of the feature-to-cost mapping
noise_width – half-width of the multiplicative uniform noise
seed – random seed (default 135 for reproducibility)

Returns:

data features (np.ndarray), costs (np.ndarray)

Return type:

tuple

import pyepo

num_data = 1000 # number of data
num_feat = 5 # size of feature
num_node = 20 # number of nodes
x, c = pyepo.data.tsp.genData(num_data, num_feat, num_node, deg=4, noise_width=0, seed=135)

Portfolio¶

Let \(\bar{r}_{ij} = \big(\tfrac{0.05}{\sqrt{p}}(\mathcal{B} \mathbf{x}_i)_j + {0.1}^{\frac{1}{deg}}\big)^{deg}\). The expected return is \(\mathbf{r}_i = \bar{\mathbf{r}}_i + \mathbf{L} \mathbf{f} + 0.01 \tau \boldsymbol{\epsilon}\), and the covariance matrix is \(\mathbf{\Sigma} = \mathbf{L} \mathbf{L}^{\intercal} + (0.01 \tau)^2 \mathbf{I}\), where \(\mathcal{B}\) follows a Bernoulli distribution, \(\mathbf{L} \sim \mathbf{U}(-0.0025\tau, 0.0025\tau)\), and \(\mathbf{f}, \boldsymbol{\epsilon} \sim \mathcal{N}(0, \mathbf{I})\).

Unlike the other generators, portfolio noise is controlled by noise_level (\(\tau\)), which scales both the factor loadings \(\mathbf{L}\) and the residual noise; noise_width does not apply here.

pyepo.data.portfolio.genData(num_data: int, num_features: int, num_assets: int, deg: int = 1, noise_level: float = 1, seed: int = 135) → tuple[ndarray, ndarray, ndarray]

Generate synthetic feature-cost pairs for portfolio optimization.

Returns the expected returns \(\mathbf{r}\) (the per-instance cost vectors) and a single shared covariance matrix \(\mathbf{\Sigma}\) used in the risk constraint of the predefined portfolio model. The mean returns follow a factor-model structure \(\mathbf{r}_i = \bar{\mathbf{r}}_i + \mathbf{L}\mathbf{f} + 0.01 \tau \boldsymbol{\epsilon}\), where the factor loadings \(\mathbf{L}\) and residual noise are both scaled by noise_level (\(\tau\)). Unlike the other generators in pyepo.data, portfolio noise is controlled by noise_level rather than noise_width.

Parameters:

num_data – number of data points
num_features – dimension of features
num_assets – number of assets
deg – polynomial degree of the feature-to-return mapping
noise_level – scales factor loadings L and residual noise (tau)
seed – random seed (default 135 for reproducibility)

Returns:

covariance matrix (np.ndarray), data features (np.ndarray), mean returns (np.ndarray)

Return type:

tuple

import pyepo

num_data = 1000 # number of data
num_feat = 4 # size of feature
num_assets = 50 # number of assets
cov, x, r = pyepo.data.portfolio.genData(num_data, num_feat, num_assets, deg=4, noise_level=1, seed=135)

Built-in Problem Models¶

PyEPO includes built-in models for several classic problems. Each is built by a factory that takes a backend keyword (default "gurobi"). Pair one with generated data and an optDataset:

import pyepo
from pyepo import model

grid = (5, 5)
x, c = pyepo.data.shortestpath.genData(1000, 5, grid, deg=4, seed=135)
optmodel = model.shortestPathModel(grid)                  # default Gurobi
dataset = pyepo.data.dataset.optDataset(optmodel, x, c)

Switch the solver with backend. The generic backends take a solver= argument naming the open solver to run:

model.shortestPathModel(grid, backend="copt")
model.shortestPathModel(grid, backend="pyomo", solver="glpk")
model.shortestPathModel(grid, backend="mpax")             # LP on GPU

Note

In end-to-end training, pyepo.func modules call setObj and solve during the forward pass.

Shortest Path Model¶

Minimum-cost path from the northwest to the southeast corner of an (h, w) grid, formulated as a minimum-cost-flow LP. Backends: gurobi, copt, pyomo, ortools, mpax.

pyepo.model.shortestPathModel(grid, *, backend='gurobi', **kwargs)¶

Shortest path on a grid network.

Parameters:

grid (tuple) – grid size (h, w)
backend (str) – solver backend; one of "gurobi", "copt", "pyomo", "ortools", "mpax"

Knapsack Model¶

Multi-dimensional 0/1 knapsack: maximize value subject to per-dimension capacities. weights has shape (dim, n_items) and capacity has length dim. Backends: gurobi, copt, pyomo, ortools, mpax (LP relaxation).

weights = [[3, 4, 3, 6, 4], [4, 5, 2, 3, 5], [5, 4, 6, 2, 3]]
capacity = [12, 10, 15]
optmodel = model.knapsackModel(weights, capacity)

pyepo.model.knapsackModel(weights, capacity, *, backend='gurobi', **kwargs)¶

Multi-dimensional knapsack.

Parameters:

weights (ndarray) – item weights with shape (dim, n_items)
capacity (ndarray) – per-dimension capacity with length dim
backend (str) – solver backend; one of "gurobi", "copt", "pyomo", "ortools", "mpax"

Traveling Salesperson Model¶

Shortest tour visiting each city once. formulation is "DFJ" (lazy subtour elimination), "GG", or "MTZ". Backends: gurobi and copt (all three); pyomo (GG, MTZ). On gurobi, recycle_cuts=True keeps the subtour cuts found in one solve for later solves.

optmodel = model.tspModel(20, formulation="DFJ", recycle_cuts=True)

pyepo.model.tspModel(num_nodes, *, backend='gurobi', formulation='DFJ', **kwargs)¶

Traveling salesperson.

Parameters:

num_nodes (int) – number of nodes
backend (str) – solver backend; one of "gurobi", "copt", "pyomo"
formulation (str) – ILP formulation; one of "DFJ", "GG", "MTZ" ("DFJ" on gurobi and copt only)

Capacitated Vehicle Routing Model¶

Shortest vehicle routes from a depot that serve every customer within capacity. formulation is "RCI" (lazy rounded-capacity cuts) or "MTZ". Backends: gurobi and copt (both); pyomo (MTZ). On gurobi, "RCI" also accepts recycle_cuts=True to keep the cuts found in one solve for later solves.

optmodel = model.vrpModel(10, demands=[2, 1, 3, 2, 1, 2, 1, 3, 2],
                          capacity=5.0, num_vehicle=3, formulation="RCI")

pyepo.model.vrpModel(num_nodes, demands, capacity, num_vehicle, *, backend='gurobi', formulation='RCI', **kwargs)¶

Capacitated vehicle routing.

Parameters:

num_nodes (int) – number of nodes, with the depot as node 0
demands (list) – per-customer demands with length num_nodes - 1
capacity (float) – vehicle capacity
num_vehicle (int) – number of vehicles
backend (str) – solver backend; one of "gurobi", "copt", "pyomo"
formulation (str) – ILP formulation; "RCI" or "MTZ" ("RCI" on gurobi and copt only)

Portfolio Model¶

Mean-variance allocation that maximizes return under a risk budget. Backends: gurobi, copt, pyomo.

import numpy as np
covariance = np.cov(np.random.randn(10, 50), rowvar=False)
optmodel = model.portfolioModel(50, covariance)

pyepo.model.portfolioModel(num_assets, covariance, *, backend='gurobi', **kwargs)¶

Mean-variance portfolio optimization.

Parameters:

num_assets (int) – number of assets
covariance (ndarray) – covariance matrix of the asset returns
backend (str) – solver backend; one of "gurobi", "copt", "pyomo"

optDataset¶

pyepo.data.optDataset is a PyTorch Dataset that stores features and cost coefficients, and solves the optimization problem to obtain optimal solutions and objective values.

optDataset precomputes \(\mathbf{w}^*(\mathbf{c})\) and \(z^*(\mathbf{c})\) at construction time. If those labels are already available, batches can be passed directly to pyepo.func modules.

class pyepo.data.dataset.optDataset(model: optModel, feats: ndarray | Tensor, costs: ndarray | Tensor)

PyTorch Dataset for predict-then-optimize problems.

At construction time it solves the optimization problem for every cost vector and caches the optimal solution \(\mathbf{w}^*(\mathbf{c})\) and objective value \(z^*(\mathbf{c})\). This precomputation removes solver overhead from the training loop, making optDataset the standard input format for end-to-end training in PyEPO. When labels are already available from another source, optDataset can be skipped and batches fed directly to pyepo.func modules.

Variables:

model (optModel) – Optimization model
feats (torch.Tensor) – Data features
costs (torch.Tensor) – Cost vectors
sols (torch.Tensor) – Cached optimal solutions w*(c)
objs (torch.Tensor) – Cached optimal objective values z*(c)

The following example shows how to use optDataset with a PyTorch DataLoader:

import pyepo
from torch.utils.data import DataLoader

# model for shortest path
grid = (5,5) # grid size
model = pyepo.model.shortestPathModel(grid)

# generate data
num_data = 1000 # number of data
num_feat = 5 # size of feature
deg = 4 # polynomial degree
noise_width = 0 # noise width
x, c = pyepo.data.shortestpath.genData(num_data, num_feat, grid, deg, noise_width, seed=135)

# build dataset
dataset = pyepo.data.dataset.optDataset(model, x, c)

# get data loader
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)

optDatasetKNN¶

pyepo.data.optDatasetKNN is a PyTorch Dataset that implements the k-nearest neighbors (kNN) robust loss [1] for predict-then-optimize training. It stores features and cost coefficients, and computes the mean k-nearest-neighbor solutions and the corresponding optimal objective values.

For a runnable walkthrough, see the 08 kNN Robust Losses notebook.

class pyepo.data.dataset.optDatasetKNN(model: optModel, feats: ndarray | Tensor, costs: ndarray | Tensor, k: int = 10, weight: float = 0.5)

PyTorch Dataset for the kNN-robust decision-focused loss.

For each instance the cost vector is replaced with a convex combination of its k nearest neighbours in feature space, and the optimization problem is solved on the smoothed costs. The mean kNN solutions and objective values are cached for training, providing a robust supervision signal under noisy or out-of-distribution feature observations.

Reference: Schutte et al. (2023) https://arxiv.org/abs/2310.04328

Variables:

model (optModel) – Optimization model
k (int) – number of nearest neighbours selected
weight (float) – self-weight in the kNN convex combination (1.0 = no smoothing)
feats (torch.Tensor) – Data features
costs (torch.Tensor) – kNN-smoothed cost vectors
sols (torch.Tensor) – Mean kNN optimal solutions
objs (torch.Tensor) – Mean kNN optimal objective values

import pyepo
from torch.utils.data import DataLoader

# model for shortest path
grid = (5,5) # grid size
model = pyepo.model.shortestPathModel(grid)

# generate data
num_data = 1000 # number of data
num_feat = 5 # size of feature
deg = 4 # polynomial degree
noise_width = 0 # noise width
x, c = pyepo.data.shortestpath.genData(num_data, num_feat, grid, deg, noise_width, seed=135)

# build dataset
dataset = pyepo.data.dataset.optDatasetKNN(model, x, c, k=10, weight=0.5)

# get data loader
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)

optDatasetConstrs¶

pyepo.data.dataset.optDatasetConstrs is a PyTorch Dataset for the CaVE [2] cone-aligned loss. In addition to the features, costs, optimal solutions, and objective values stored by optDataset, it also extracts the normals of the binding constraints at the optimal vertex for each instance. CaVE then projects the sense-flipped predicted cost vector onto the cone spanned by these normals.

optDatasetConstrs currently requires a Gurobi-backed optModel. The dataset also checks that the optimal vertex is binary, since CaVE is defined for binary linear programs.

For a runnable walkthrough that uses optDatasetConstrs end-to-end with the CaVE loss, see the 04 CaVE for Binary Linear Programs notebook.

class pyepo.data.dataset.optDatasetConstrs(model: optModel, feats: ndarray | Tensor, costs: ndarray | Tensor, skip_infeas: bool = False)

PyTorch Dataset for the CaVE cone-aligned loss.

Stores features and cost coefficients, solves each instance, and extracts the normals of the binding constraints at the optimal vertex in canonical <= orientation. These normals span the polyhedral cone onto which coneAlignedCosine projects the predicted cost vector during training.

CaVE is defined for binary linear programs only, so the optimal vertex must be binary; instances that are infeasible or have non-binary optima raise (or are skipped when skip_infeas=True). Binding-constraint extraction uses Gurobi’s sparse-matrix API, which is why this dataset currently requires a Gurobi-backed optModel.

Per-instance row counts differ (different constraints bind at different vertices), so batch with optDataLoader or pass collate_tight_constraints to a PyTorch DataLoader.

Reference: Tang & Khalil (2024) https://link.springer.com/chapter/10.1007/978-3-031-60599-4_12

Variables:

model (optModel) – Gurobi-backed optimization model
feats (torch.Tensor) – Data features
costs (torch.Tensor) – Cost vectors
sols (torch.Tensor) – Optimal solutions
objs (torch.Tensor) – Optimal objective values
ctrs (list[torch.Tensor]) – Per-instance binding-constraint normals in canonical <= orientation (ragged row counts)

Per-instance constraint matrices have different row counts (different constraints bind at different vertices), so batch with optDataLoader, which pads them automatically:

class pyepo.data.dataset.optDataLoader(dataset, *args, **kwargs)

DataLoader that applies a dataset’s own collate_fn when present.

Datasets with ragged samples (e.g. optDatasetConstrs, whose binding-constraint matrices vary in row count) carry a collate_fn; this loader uses it so the caller never passes one explicitly. Plain datasets fall back to the default PyTorch collation.

import pyepo
from pyepo.data.dataset import optDatasetConstrs, optDataLoader

# model for TSP (Gurobi backend required)
model = pyepo.model.tspModel(num_nodes=10, formulation="DFJ")

# generate data
x, c = pyepo.data.tsp.genData(num_data=1000, num_features=5, num_nodes=10, deg=4, seed=135)

# build CaVE dataset (extracts tight binding-constraint normals at the optimum)
dataset = optDatasetConstrs(model, x, c)

# optDataLoader pads ragged per-instance constraint matrices
dataloader = optDataLoader(dataset, batch_size=32, shuffle=True)