Data

pyepo.data provides synthetic data generators and the optDataset class for wrapping data samples.

For more details, see the 02 Optimization Dataset notebook.

Data Generator

pyepo.data includes synthetic data generators for four optimization problems: shortest path, multi-dimensional knapsack, traveling salesperson, and portfolio optimization.

Each generator produces feature-cost pairs \((\mathbf{x}, \mathbf{c})\). The feature vector \(\mathbf{x}_i \in \mathbb{R}^p\) follows a standard multivariate Gaussian distribution \(\mathcal{N}(0, \mathbf{I})\), and the cost \(\mathbf{c}_i \in \mathbb{R}^d\) is computed from a polynomial function \(f(\mathbf{x}_i)\) multiplied by random noise \(\mathbf{\epsilon}_i \sim U(1-\bar{\epsilon}, 1+\bar{\epsilon})\).

Common parameters across all generators:

  • num_data (\(n\)): number of data samples

  • num_features (\(p\)): feature dimension

  • deg (\(deg\)): polynomial degree of the mapping \(f(\mathbf{x}_i)\)

  • noise_width (\(\bar{\epsilon}\)): noise half-width

  • seed: random seed for reproducibility
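
As a rough illustration of the shared setup, the following NumPy sketch draws Gaussian features and the multiplicative noise described above. The helper name sample_features_and_noise and the cost_dim argument are hypothetical and not part of pyepo; the sketch also uses NumPy's default_rng, so its draws are not expected to match the library's output for the same seed.

```python
import numpy as np

def sample_features_and_noise(num_data, num_features, cost_dim,
                              noise_width=0.0, seed=135):
    """Draw x_i ~ N(0, I) and eps_i ~ U(1 - w, 1 + w) for w = noise_width."""
    rng = np.random.default_rng(seed)
    # features: one p-dimensional standard Gaussian vector per sample
    x = rng.standard_normal((num_data, num_features))
    # multiplicative noise: one factor per sample and cost entry
    eps = rng.uniform(1 - noise_width, 1 + noise_width,
                      size=(num_data, cost_dim))
    return x, eps

x_demo, eps_demo = sample_features_and_noise(100, 5, 40, noise_width=0.5)
_, eps_zero = sample_features_and_noise(10, 3, 4)  # noise_width = 0
```

With noise_width=0 every noise factor equals 1, so the costs reduce to the deterministic polynomial mapping.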

Shortest Path

A random matrix \(\mathcal{B} \in \mathbb{R}^{d \times p}\) with Bernoulli(0.5) entries encodes the features. The cost coefficients are generated as \(c_i^j = [\frac{1}{{3.5}^{deg}} (\frac{1}{\sqrt{p}}(\mathcal{B} \mathbf{x}_i)_j + 3)^{deg} + 1] \cdot \epsilon_i^j\).

pyepo.data.shortestpath.genData(num_data, num_features, grid, deg=1, noise_width=0, seed=135)

A function to generate synthetic data and features for shortest path

Parameters:
  • num_data (int) – number of data points

  • num_features (int) – dimension of features

  • grid (int, int) – size of grid network

  • deg (int) – data polynomial degree

  • noise_width (float) – half width of data random noise

  • seed (int) – random seed

Returns:

data features (np.ndarray), costs (np.ndarray)

Return type:

tuple

import pyepo

num_data = 1000 # number of data
num_feat = 5 # size of feature
grid = (5,5) # grid size
x, c = pyepo.data.shortestpath.genData(num_data, num_feat, grid, deg=4, noise_width=0, seed=135)
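
To make the cost formula for shortest path concrete, here is a minimal NumPy sketch of the same mapping. It is an illustration under stated assumptions, not the library's implementation: the function shortest_path_costs is hypothetical, the cost dimension is assumed to equal the number of edges in the grid graph, and the random draws will not match genData for the same seed.

```python
import numpy as np

def shortest_path_costs(x, num_edges, deg=1, noise_width=0.0, seed=135):
    """Sketch of c_i^j = [((B x_i)_j / sqrt(p) + 3)^deg / 3.5^deg + 1] * eps_i^j."""
    rng = np.random.default_rng(seed)
    n, p = x.shape
    # B in {0, 1}^(d x p): each entry is an independent Bernoulli(0.5) draw
    B = rng.binomial(1, 0.5, size=(num_edges, p))
    # polynomial mapping applied to every (sample, edge) pair
    poly = ((x @ B.T) / np.sqrt(p) + 3) ** deg / 3.5 ** deg + 1
    # multiplicative noise eps ~ U(1 - w, 1 + w); all ones when w = 0
    eps = rng.uniform(1 - noise_width, 1 + noise_width, size=(n, num_edges))
    return poly * eps

x_demo = np.random.default_rng(0).standard_normal((1000, 5))
h, w = 5, 5
num_edges = h * (w - 1) + (h - 1) * w  # edge count of an h x w grid graph
c_demo = shortest_path_costs(x_demo, num_edges, deg=4)
```

For an even degree and zero noise, the polynomial term is non-negative, so every cost in this sketch is at least 1.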

Knapsack

Since uncertain coefficients appear only in the objective function, item weights are fixed. Let \(m\) be the number of items and \(k\) the number of resource dimensions. The weights \(\mathcal{W} \in \mathbb{R}^{k \times m}\) are sampled from 3 to 8 with one decimal place of precision. The cost coefficients are \(c_i^j = \lceil [\frac{5}{{3.5}^{deg}} (\frac{1}{\sqrt{p}}(\mathcal{B} \mathbf{x}_i)_j + 3)^{deg} + 1] \cdot \epsilon_i^j \rceil\).

pyepo.data.knapsack.genData(num_data, num_features, num_items, dim=1, deg=1, noise_width=0, seed=135)

A function to generate synthetic data and features for knapsack

Parameters:
  • num_data (int) – number of data points

  • num_features (int) – dimension of features

  • num_items (int) – number of items

  • dim (int) – dimension of multi-dimensional knapsack

  • deg (int) – data polynomial degree

  • noise_width (float) – half width of data random noise

  • seed (int) – random state seed

Returns:

weights of items (np.ndarray), data features (np.ndarray), costs (np.ndarray)

Return type:

tuple

import pyepo

num_data = 1000 # number of data
num_feat = 5 # size of feature
num_item = 32 # number of items
dim = 3 # dimension of knapsack
weights, x, c = pyepo.data.knapsack.genData(num_data, num_feat, num_item, dim, deg=4, noise_width=0, seed=135)
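
The knapsack formulas can be sketched in NumPy as follows. This is an illustrative reading of the description above rather than the library's code: knapsack_sketch is a hypothetical helper, the weights are assumed to be uniform draws on [3, 8] rounded to one decimal, and \(\mathcal{B}\) is assumed to have Bernoulli(0.5) entries as in the shortest path generator.

```python
import numpy as np

def knapsack_sketch(x, num_items, dim=1, deg=1, noise_width=0.0, seed=135):
    """Return fixed weights (dim x m) and integer-valued costs (n x m)."""
    rng = np.random.default_rng(seed)
    n, p = x.shape
    # fixed item weights, one row per resource dimension (assumed sampling scheme)
    weights = np.round(rng.uniform(3, 8, size=(dim, num_items)), 1)
    # assumed Bernoulli(0.5) feature-encoding matrix
    B = rng.binomial(1, 0.5, size=(num_items, p))
    poly = 5 * ((x @ B.T) / np.sqrt(p) + 3) ** deg / 3.5 ** deg + 1
    eps = rng.uniform(1 - noise_width, 1 + noise_width, size=(n, num_items))
    # the ceiling makes every cost coefficient integer-valued
    return weights, np.ceil(poly * eps)

x_demo = np.random.default_rng(0).standard_normal((200, 5))
weights_demo, values_demo = knapsack_sketch(x_demo, num_items=32, dim=3, deg=4)
```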

Traveling Salesperson

The distance matrix has two components: a Euclidean distance term and a feature-encoded term. Node coordinates are drawn from a mixture of Gaussian \(\mathcal{N}(0, \mathbf{I})\) and uniform \(\textbf{U}(-2, 2)\) distributions. The feature-encoded component is \(\frac{1}{{3}^{deg - 1}} (\frac{1}{\sqrt{p}} (\mathcal{B} \mathbf{x}_i)_j + 3)^{deg} \cdot \epsilon_i\), where the elements of \(\mathcal{B}\) are products of Bernoulli \(\textbf{B}(0.5)\) and uniform \(\textbf{U}(-2, 2)\) samples.

pyepo.data.tsp.genData(num_data, num_features, num_nodes, deg=1, noise_width=0, seed=135)

A function to generate synthetic data and features for traveling salesman

Parameters:
  • num_data (int) – number of data points

  • num_features (int) – dimension of features

  • num_nodes (int) – number of nodes

  • deg (int) – data polynomial degree

  • noise_width (float) – half width of data random noise

  • seed (int) – random seed

Returns:

data features (np.ndarray), costs (np.ndarray)

Return type:

tuple

import pyepo

num_data = 1000 # number of data
num_feat = 5 # size of feature
num_node = 20 # number of nodes
x, c = pyepo.data.tsp.genData(num_data, num_feat, num_node, deg=4, noise_width=0, seed=135)
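
The feature-encoded component of the TSP distances can be sketched as below. Only that component is covered: the coordinate mixture and the way the two components are combined are not specified here, so they are left out. The helper tsp_feature_term is hypothetical, the noise is assumed to be drawn per sample and edge, and the cost dimension is assumed to be the number of undirected edges in a complete graph.

```python
import numpy as np

def tsp_feature_term(x, num_edges, deg=1, noise_width=0.0, seed=135):
    """Sketch of ((B x_i)_j / sqrt(p) + 3)^deg / 3^(deg - 1) * eps."""
    rng = np.random.default_rng(seed)
    n, p = x.shape
    # each entry of B is a Bernoulli(0.5) draw times a U(-2, 2) draw
    B = (rng.binomial(1, 0.5, size=(num_edges, p))
         * rng.uniform(-2, 2, size=(num_edges, p)))
    poly = ((x @ B.T) / np.sqrt(p) + 3) ** deg / 3 ** (deg - 1)
    # assumption: independent noise factor for every (sample, edge) pair
    eps = rng.uniform(1 - noise_width, 1 + noise_width, size=(n, num_edges))
    return poly * eps

num_nodes = 20
num_edges = num_nodes * (num_nodes - 1) // 2  # complete undirected graph
x_demo = np.random.default_rng(0).standard_normal((100, 5))
term_demo = tsp_feature_term(x_demo, num_edges, deg=4)
```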

Portfolio

Let \(\bar{r}_{ij} = (\frac{0.05}{\sqrt{p}}(\mathcal{B} \mathbf{x}_i)_j + {0.1}^{\frac{1}{deg}})^{deg}\). The expected return is \(\mathbf{r}_i = \bar{\mathbf{r}}_i + \mathbf{L} \mathbf{f} + 0.01 \tau \mathbf{\epsilon}\), and the covariance matrix is \(\mathbf{\Sigma} = \mathbf{L} \mathbf{L}^{\intercal} + (0.01 \tau)^2 \mathbf{I}\), where \(\tau\) is the noise level (noise_level), the entries of \(\mathcal{B}\) follow a Bernoulli distribution, \(\mathbf{L} \sim \textbf{U}(-0.0025\tau, 0.0025\tau)\), and \(\mathbf{f}, \mathbf{\epsilon} \sim \mathcal{N}(0, \mathbf{I})\).

pyepo.data.portfolio.genData(num_data, num_features, num_assets, deg=1, noise_level=1, seed=135)

A function to generate synthetic data and features for portfolio

Parameters:
  • num_data (int) – number of data points

  • num_features (int) – dimension of features

  • num_assets (int) – number of assets

  • deg (int) – data polynomial degree

  • noise_level (float) – level of data random noise

  • seed (int) – random seed

Returns:

covariance matrix (np.ndarray), data features (np.ndarray), costs (np.ndarray)

Return type:

tuple

import pyepo

num_data = 1000 # number of data
num_feat = 4 # size of feature
num_assets = 50 # number of assets
cov, x, r = pyepo.data.portfolio.genData(num_data, num_feat, num_assets, deg=4, noise_level=1, seed=135)
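
The return and covariance formulas can be sketched in NumPy as follows. This is a sketch under assumptions, not the library's implementation: portfolio_sketch and its num_factors argument are hypothetical, the shape of \(\mathbf{L}\) (assets by factors) is an assumption, and \(\mathbf{f}\) and \(\mathbf{\epsilon}\) are assumed to be redrawn for every sample.

```python
import numpy as np

def portfolio_sketch(num_data, num_features, num_assets, deg=1,
                     noise_level=1.0, num_factors=4, seed=135):
    """Return (cov, x, r) following the formulas for r_i and Sigma."""
    rng = np.random.default_rng(seed)
    p, tau = num_features, noise_level
    x = rng.standard_normal((num_data, p))
    B = rng.binomial(1, 0.5, size=(num_assets, p))
    # r_bar_ij = (0.05 / sqrt(p) * (B x_i)_j + 0.1^(1/deg))^deg
    r_bar = (0.05 / np.sqrt(p) * (x @ B.T) + 0.1 ** (1 / deg)) ** deg
    # factor loadings L ~ U(-0.0025 tau, 0.0025 tau); shape is an assumption
    L = rng.uniform(-0.0025 * tau, 0.0025 * tau, size=(num_assets, num_factors))
    f = rng.standard_normal((num_data, num_factors))
    eps = rng.standard_normal((num_data, num_assets))
    r = r_bar + f @ L.T + 0.01 * tau * eps
    # Sigma = L L^T + (0.01 tau)^2 I is symmetric positive definite
    cov = L @ L.T + (0.01 * tau) ** 2 * np.eye(num_assets)
    return cov, x, r

cov_demo, x_demo, r_demo = portfolio_sketch(1000, 4, 50, deg=4)
```

Because the diagonal term \((0.01 \tau)^2 \mathbf{I}\) is strictly positive for \(\tau > 0\), the covariance matrix in this sketch is always invertible.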

optDataset

pyepo.data.optDataset is a PyTorch Dataset that stores features and cost coefficients, and solves the optimization problem to obtain optimal solutions and objective values.

optDataset is not required for training with PyEPO, but it provides a convenient way to precompute optimal solutions and objective values when they are not available in the original data.

class pyepo.data.dataset.optDataset(model, feats, costs)

This class is a Torch Dataset for optimization problems.

model

Optimization model

Type:

optModel

feats

Data features

Type:

np.ndarray

costs

Cost vectors

Type:

np.ndarray

sols

Optimal solutions

Type:

np.ndarray

objs

Optimal objective values

Type:

np.ndarray

The following example shows how to use optDataset with a PyTorch DataLoader:

import pyepo
from torch.utils.data import DataLoader

# model for shortest path
grid = (5,5) # grid size
model = pyepo.model.grb.shortestPathModel(grid)

# generate data
num_data = 1000 # number of data
num_feat = 5 # size of feature
deg = 4 # polynomial degree
noise_width = 0 # noise width
x, c = pyepo.data.shortestpath.genData(num_data, num_feat, grid, deg, noise_width, seed=135)

# build dataset
dataset = pyepo.data.dataset.optDataset(model, x, c)

# get data loader
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)
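
To illustrate what precomputing solutions and objective values means, the sketch below builds a tiny map-style dataset without pyepo, PyTorch, or a solver. Everything in it is hypothetical: ToyOptDataset only mirrors the attribute names listed above, solve_pick_one is a toy stand-in for an optimization model (choose the single cheapest item), and the order of the returned tuple is an assumption.

```python
import numpy as np

def solve_pick_one(cost):
    """Toy stand-in for an optModel: select the single cheapest item."""
    sol = np.zeros_like(cost)
    sol[np.argmin(cost)] = 1.0
    return sol, float(cost @ sol)

class ToyOptDataset:
    """Stores feats and costs, and precomputes sols and objs on construction."""

    def __init__(self, solver, feats, costs):
        self.feats = np.asarray(feats, dtype=float)
        self.costs = np.asarray(costs, dtype=float)
        # solve one optimization problem per cost vector
        results = [solver(c) for c in self.costs]
        self.sols = np.array([sol for sol, _ in results])
        self.objs = np.array([[obj] for _, obj in results])

    def __len__(self):
        return len(self.costs)

    def __getitem__(self, index):
        # assumed ordering: features, costs, solutions, objective values
        return (self.feats[index], self.costs[index],
                self.sols[index], self.objs[index])

feats = np.array([[0.0, 1.0], [1.0, 0.0]])
costs = np.array([[3.0, 1.0, 2.0], [0.5, 4.0, 0.7]])
toy = ToyOptDataset(solve_pick_one, feats, costs)
```

Any object that implements __len__ and __getitem__ follows the map-style dataset protocol, which is why such a class can be passed to a PyTorch DataLoader.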

optDatasetKNN

pyepo.data.optDatasetKNN is a PyTorch Dataset for k-nearest neighbors (kNN) robust loss [1] in decision-focused learning. It stores features and cost coefficients, and computes mean kNN solutions and optimal objective values.

class pyepo.data.dataset.optDatasetKNN(model, feats, costs, k=10, weight=0.5)

This class is a Torch Dataset for optimization problems, when using the robust kNN-loss.

Reference: <https://arxiv.org/abs/2310.04328>

model

Optimization model

Type:

optModel

k

number of nearest neighbours selected

Type:

int

weight

weight of kNN-loss

Type:

float

feats

Data features

Type:

np.ndarray

costs

Cost vectors

Type:

np.ndarray

sols

Optimal solutions

Type:

np.ndarray

objs

Optimal objective values

Type:

np.ndarray

import pyepo
from torch.utils.data import DataLoader

# model for shortest path
grid = (5,5) # grid size
model = pyepo.model.grb.shortestPathModel(grid)

# generate data
num_data = 1000 # number of data
num_feat = 5 # size of feature
deg = 4 # polynomial degree
noise_width = 0 # noise width
x, c = pyepo.data.shortestpath.genData(num_data, num_feat, grid, deg, noise_width, seed=135)

# build dataset
dataset = pyepo.data.dataset.optDatasetKNN(model, x, c, k=10, weight=0.5)

# get data loader
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)
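
One plausible reading of "mean kNN solutions" is sketched below in plain NumPy. It is an assumption about the idea, not a description of the library's internals: each sample's cost vector is blended with the costs of its k nearest neighbors in feature space (using weight for the sample's own cost), each blended cost is solved, and the solutions are averaged. The names knn_mean_solutions and solve_pick_one are hypothetical, and the solver is a toy stand-in.

```python
import numpy as np

def solve_pick_one(cost):
    """Toy stand-in for an optModel: select the single cheapest item."""
    sol = np.zeros_like(cost)
    sol[np.argmin(cost)] = 1.0
    return sol

def knn_mean_solutions(feats, costs, solver, k=10, weight=0.5):
    feats = np.asarray(feats, dtype=float)
    costs = np.asarray(costs, dtype=float)
    # pairwise Euclidean distances between feature vectors
    diff = feats[:, None, :] - feats[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # indices of the k closest samples (a sample is its own nearest neighbor)
    neighbors = np.argsort(dist, axis=1)[:, :k]
    mean_sols = np.zeros_like(costs)
    for i, idx in enumerate(neighbors):
        # assumed blend: weight on own cost, (1 - weight) on neighbor costs
        blended = weight * costs[i] + (1 - weight) * costs[idx]
        mean_sols[i] = np.mean([solver(c) for c in blended], axis=0)
    return mean_sols

feats = np.array([[0.0], [0.1], [5.0]])
costs = np.array([[1.0, 4.0], [4.0, 1.0], [1.0, 3.0]])
sols_k1 = knn_mean_solutions(feats, costs, solve_pick_one, k=1, weight=0.25)
sols_k2 = knn_mean_solutions(feats, costs, solve_pick_one, k=2, weight=0.25)
```

With k=1 each sample only sees its own cost, so the result reduces to the ordinary optimal solutions; larger k yields fractional averages of several solutions.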