Data

pyepo.data provides synthetic data generators and the optDataset class for wrapping data samples.

For more details, see the 02 Optimization Dataset notebook.

Data Generator

pyepo.data includes synthetic data generators for four optimization problems: shortest path, multi-dimensional knapsack, traveling salesperson, and portfolio optimization.

Each generator produces feature-cost pairs \((\mathbf{x}, \mathbf{c})\). The feature vector \(\mathbf{x}_i \in \mathbb{R}^p\) follows a standard multivariate Gaussian distribution \(\mathcal{N}(0, \mathbf{I})\), and the cost \(\mathbf{c}_i \in \mathbb{R}^d\) is computed from a polynomial function \(f(\mathbf{x}_i)\) scaled by a multiplicative noise factor \(\boldsymbol{\epsilon}_i \sim U(1-\bar{\epsilon}, 1+\bar{\epsilon})\).

Common parameters across all generators:

  • num_data (\(n\)): number of data samples

  • num_features (\(p\)): feature dimension

  • deg (\(deg\)): polynomial degree of the mapping \(f(\mathbf{x}_i)\)

  • noise_width (\(\bar{\epsilon}\)): noise half-width (shortest path, knapsack, TSP; portfolio uses noise_level instead – see below)

  • seed: random seed for reproducibility

Shortest Path

A random matrix \(\mathcal{B} \in \mathbb{R}^{d \times p}\) with Bernoulli(0.5) entries maps the feature vector into the cost coefficients: \(c_i^j = \big[\tfrac{1}{{3.5}^{deg}} \big(\tfrac{1}{\sqrt{p}}(\mathcal{B} \mathbf{x}_i)_j + 3\big)^{deg} + 1\big] \cdot \epsilon_i^j\).

pyepo.data.shortestpath.genData(num_data: int, num_features: int, grid: tuple[int, int], deg: int = 1, noise_width: float = 0, seed: int = 135) tuple[ndarray, ndarray]

Generate synthetic feature-cost pairs for the shortest path problem.

Features are sampled from a standard multivariate Gaussian \(\mathcal{N}(0, \mathbf{I})\). A random Bernoulli(0.5) matrix \(\mathcal{B}\) maps each feature vector into the edge-cost coefficients via a polynomial of degree deg, scaled by multiplicative uniform noise of half-width noise_width.

Parameters:
  • num_data – number of data points

  • num_features – dimension of features

  • grid – size of grid network

  • deg – polynomial degree of the feature-to-cost mapping

  • noise_width – half-width of the multiplicative uniform noise

  • seed – random seed (default 135 for reproducibility)

Returns:

data features (np.ndarray), costs (np.ndarray)

Return type:

tuple

import pyepo

num_data = 1000 # number of data
num_feat = 5 # size of feature
grid = (5,5) # grid size
x, c = pyepo.data.shortestpath.genData(num_data, num_feat, grid, deg=4, noise_width=0, seed=135)

Knapsack

Only the cost coefficients are uncertain; item weights are fixed. Let \(m\) be the number of items and \(k\) the number of resource dimensions. The weights \(\mathcal{W} \in \mathbb{R}^{k \times m}\) are sampled from 3 to 8 with one decimal place of precision. The cost coefficients are \(c_i^j = \big\lceil \big[\tfrac{5}{{3.5}^{deg}} \big(\tfrac{1}{\sqrt{p}}(\mathcal{B} \mathbf{x}_i)_j + 3\big)^{deg} + 1\big] \cdot \epsilon_i^j \big\rceil\).

pyepo.data.knapsack.genData(num_data: int, num_features: int, num_items: int, dim: int = 1, deg: int = 1, noise_width: float = 0, seed: int = 135) tuple[ndarray, ndarray, ndarray]

Generate synthetic feature-cost pairs for the multi-dimensional knapsack.

Item weights are fixed across instances; only the value (cost) of each item depends on features. Features are sampled from a standard Gaussian \(\mathcal{N}(0, \mathbf{I})\), mapped through a random Bernoulli(0.5) matrix \(\mathcal{B}\) and a polynomial of degree deg, then scaled by multiplicative uniform noise of half-width noise_width and rounded up to the nearest integer.

Parameters:
  • num_data – number of data points

  • num_features – dimension of features

  • num_items – number of items

  • dim – dimension of multi-dimensional knapsack

  • deg – polynomial degree of the feature-to-cost mapping

  • noise_width – half-width of the multiplicative uniform noise

  • seed – random state seed (default 135 for reproducibility)

Returns:

weights of items (np.ndarray), data features (np.ndarray), costs (np.ndarray)

Return type:

tuple

import pyepo

num_data = 1000 # number of data
num_feat = 5 # size of feature
num_item = 32 # number of items
dim = 3 # dimension of knapsack
weights, x, c = pyepo.data.knapsack.genData(num_data, num_feat, num_item, dim, deg=4, noise_width=0, seed=135)

Traveling Salesperson

The distance matrix has two components: a Euclidean distance term and a feature-encoded term. Coordinates are drawn from a mixture of a Gaussian distribution \(\mathcal{N}(0, \mathbf{I})\) and a uniform distribution \(\mathbf{U}(-2, 2)\). The feature-encoded component is \(\tfrac{1}{{3}^{deg - 1}} \big(\tfrac{1}{\sqrt{p}} (\mathcal{B} \mathbf{x}_i)_j + 3\big)^{deg} \cdot \boldsymbol{\epsilon}_i\), where the elements of \(\mathcal{B}\) are products of Bernoulli \(\mathbf{B}(0.5)\) and uniform \(\mathbf{U}(-2, 2)\) samples.

pyepo.data.tsp.genData(num_data: int, num_features: int, num_nodes: int, deg: int = 1, noise_width: float = 0, seed: int = 135) tuple[ndarray, ndarray]

Generate synthetic feature-cost pairs for the traveling salesperson problem.

Edge costs combine a Euclidean component (node coordinates drawn from a mixture of a Gaussian \(\mathcal{N}(0, \mathbf{I})\) and a uniform \(\mathbf{U}(-2, 2)\) distribution) with a feature-encoded component obtained by mapping the standard-Gaussian feature vector through a random Bernoulli \(\times\) uniform matrix \(\mathcal{B}\) and a polynomial of degree deg, scaled by multiplicative noise of half-width noise_width.

Parameters:
  • num_data – number of data points

  • num_features – dimension of features

  • num_nodes – number of nodes

  • deg – polynomial degree of the feature-to-cost mapping

  • noise_width – half-width of the multiplicative uniform noise

  • seed – random seed (default 135 for reproducibility)

Returns:

data features (np.ndarray), costs (np.ndarray)

Return type:

tuple

import pyepo

num_data = 1000 # number of data
num_feat = 5 # size of feature
num_node = 20 # number of nodes
x, c = pyepo.data.tsp.genData(num_data, num_feat, num_node, deg=4, noise_width=0, seed=135)

Portfolio

Let \(\bar{r}_{ij} = \big(\tfrac{0.05}{\sqrt{p}}(\mathcal{B} \mathbf{x}_i)_j + {0.1}^{\frac{1}{deg}}\big)^{deg}\). The expected return is \(\mathbf{r}_i = \bar{\mathbf{r}}_i + \mathbf{L} \mathbf{f} + 0.01 \tau \boldsymbol{\epsilon}\), and the covariance matrix is \(\mathbf{\Sigma} = \mathbf{L} \mathbf{L}^{\intercal} + (0.01 \tau)^2 \mathbf{I}\), where \(\mathcal{B}\) follows a Bernoulli distribution, \(\mathbf{L} \sim \mathbf{U}(-0.0025\tau, 0.0025\tau)\), and \(\mathbf{f}, \boldsymbol{\epsilon} \sim \mathcal{N}(0, \mathbf{I})\).

Unlike the other generators, portfolio noise is controlled by noise_level (\(\tau\)), which scales both the factor loadings \(\mathbf{L}\) and the residual noise; noise_width does not apply here.

pyepo.data.portfolio.genData(num_data: int, num_features: int, num_assets: int, deg: int = 1, noise_level: float = 1, seed: int = 135) tuple[ndarray, ndarray, ndarray]

Generate synthetic feature-cost pairs for portfolio optimization.

Returns the expected returns \(\mathbf{r}\) (the per-instance cost vectors) and a single shared covariance matrix \(\mathbf{\Sigma}\) used in the risk constraint of the predefined portfolio model. The mean returns follow a factor-model structure \(\mathbf{r}_i = \bar{\mathbf{r}}_i + \mathbf{L}\mathbf{f} + 0.01 \tau \boldsymbol{\epsilon}\), where the factor loadings \(\mathbf{L}\) and residual noise are both scaled by noise_level (\(\tau\)). Unlike the other generators in pyepo.data, portfolio noise is controlled by noise_level rather than noise_width.

Parameters:
  • num_data – number of data points

  • num_features – dimension of features

  • num_assets – number of assets

  • deg – polynomial degree of the feature-to-return mapping

  • noise_level – scales factor loadings L and residual noise (tau)

  • seed – random seed (default 135 for reproducibility)

Returns:

covariance matrix (np.ndarray), data features (np.ndarray), mean returns (np.ndarray)

Return type:

tuple

import pyepo

num_data = 1000 # number of data
num_feat = 4 # size of feature
num_assets = 50 # number of assets
cov, x, r = pyepo.data.portfolio.genData(num_data, num_feat, num_assets, deg=4, noise_level=1, seed=135)

optDataset

pyepo.data.optDataset is a PyTorch Dataset that stores features and cost coefficients, and solves the optimization problem to obtain optimal solutions and objective values.

optDataset is the standard input format for end-to-end training in PyEPO: it precomputes \(\mathbf{w}^*(\mathbf{c})\) and \(z^*(\mathbf{c})\) once at construction time, so the training loop does not pay solver cost for these label lookups. If those labels are already available from another source, optDataset can be skipped and batches fed directly to pyepo.func modules.

class pyepo.data.dataset.optDataset(model: optModel, feats: ndarray | Tensor, costs: ndarray | Tensor)

PyTorch Dataset for predict-then-optimize problems.

At construction time it solves the optimization problem for every cost vector and caches the optimal solution \(\mathbf{w}^*(\mathbf{c})\) and objective value \(z^*(\mathbf{c})\). This precomputation removes solver overhead from the training loop, making optDataset the standard input format for end-to-end training in PyEPO. When labels are already available from another source, optDataset can be skipped and batches fed directly to pyepo.func modules.

Variables:

The following example shows how to use optDataset with a PyTorch DataLoader:

import pyepo
from torch.utils.data import DataLoader

# model for shortest path
grid = (5,5) # grid size
model = pyepo.model.grb.shortestPathModel(grid)

# generate data
num_data = 1000 # number of data
num_feat = 5 # size of feature
deg = 4 # polynomial degree
noise_width = 0 # noise width
x, c = pyepo.data.shortestpath.genData(num_data, num_feat, grid, deg, noise_width, seed=135)

# build dataset
dataset = pyepo.data.dataset.optDataset(model, x, c)

# get data loader
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)

optDatasetKNN

pyepo.data.optDatasetKNN is a PyTorch Dataset that implements the k-nearest neighbors (kNN) robust loss [1] for decision-focused learning. It stores features and cost coefficients, and computes the mean k-nearest-neighbor solutions and the corresponding optimal objective values.

For a runnable walkthrough, see the 08 kNN Robust Losses notebook.

class pyepo.data.dataset.optDatasetKNN(model: optModel, feats: ndarray | Tensor, costs: ndarray | Tensor, k: int = 10, weight: float = 0.5)

PyTorch Dataset for the kNN-robust decision-focused loss.

For each instance the cost vector is replaced with a convex combination of its k nearest neighbours in feature space, and the optimization problem is solved on the smoothed costs. The mean kNN solutions and objective values are cached for training, providing a robust supervision signal under noisy or out-of-distribution feature observations.

Reference: Schutte et al. (2023) https://arxiv.org/abs/2310.04328

Variables:
  • model (optModel) – Optimization model

  • k (int) – number of nearest neighbours selected

  • weight (float) – self-weight in the kNN convex combination (1.0 = no smoothing)

  • feats (torch.Tensor) – Data features

  • costs (torch.Tensor) – kNN-smoothed cost vectors

  • sols (torch.Tensor) – Mean kNN optimal solutions

  • objs (torch.Tensor) – Mean kNN optimal objective values

import pyepo
from torch.utils.data import DataLoader

# model for shortest path
grid = (5,5) # grid size
model = pyepo.model.grb.shortestPathModel(grid)

# generate data
num_data = 1000 # number of data
num_feat = 5 # size of feature
deg = 4 # polynomial degree
noise_width = 0 # noise width
x, c = pyepo.data.shortestpath.genData(num_data, num_feat, grid, deg, noise_width, seed=135)

# build dataset
dataset = pyepo.data.dataset.optDatasetKNN(model, x, c, k=10, weight=0.5)

# get data loader
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)

optDatasetConstrs

pyepo.data.dataset.optDatasetConstrs is a PyTorch Dataset for the CaVE [2] cone-aligned loss. In addition to the features, costs, optimal solutions, and objective values stored by optDataset, it also extracts the normals of the binding constraints at the optimal vertex for each instance. CaVE then projects the sense-flipped predicted cost vector onto the cone spanned by these normals.

Because the binding-constraint extraction relies on Gurobi’s sparse-matrix and constraint-sense APIs, optDatasetConstrs currently requires a Gurobi-backed optModel. The dataset also enforces that the optimal vertex is binary — CaVE is defined for binary linear programs.

For a runnable walkthrough that uses optDatasetConstrs end-to-end with the CaVE loss, see the 04 CaVE for Binary Linear Programs notebook.

class pyepo.data.dataset.optDatasetConstrs(model: optModel, feats: ndarray | Tensor, costs: ndarray | Tensor, skip_infeas: bool = False)

PyTorch Dataset for the CaVE cone-aligned loss.

Stores features and cost coefficients, solves each instance with a fresh copy of the Gurobi model, and extracts the normals of the binding constraints at the optimal vertex in canonical <= orientation. These normals span the polyhedral cone onto which coneAlignedCosine projects the predicted cost vector during training.

CaVE is defined for binary linear programs only, so the optimal vertex must be binary; instances that are infeasible or have non-binary optima raise (or are skipped when skip_infeas=True). Binding-constraint extraction uses Gurobi’s sparse-matrix API, which is why this dataset currently requires a Gurobi-backed optModel.

Per-instance row counts differ (different constraints bind at different vertices), so batches must be assembled with collate_tight_constraints.

Reference: Tang & Khalil (2024) https://link.springer.com/chapter/10.1007/978-3-031-60599-4_12

Variables:

Per-instance constraint matrices have different row counts (different sets of constraints are tight at different vertices), so batching requires a custom collate_fn:

pyepo.data.dataset.collate_tight_constraints(batch)

Collate function for optDatasetConstrs batches.

Stacks the standard (x, c, w, z) tensors and zero-pads the ragged per-instance binding-constraint matrices to a common row count so they can be assembled into a single (batch, max_rows, num_cost) tensor for coneAlignedCosine.

import pyepo
from torch.utils.data import DataLoader
from pyepo.data.dataset import optDatasetConstrs, collate_tight_constraints

# model for TSP (Gurobi backend required)
model = pyepo.model.grb.tspDFJModel(num_nodes=10)

# generate data
x, c = pyepo.data.tsp.genData(num_data=1000, num_features=5, num_nodes=10, deg=4, seed=135)

# build CaVE dataset (extracts tight binding-constraint normals at the optimum)
dataset = optDatasetConstrs(model, x, c)

# collate_fn pads ragged per-instance constraint matrices
dataloader = DataLoader(dataset, batch_size=32, shuffle=True, collate_fn=collate_tight_constraints)