Data
pyepo.data provides synthetic data generators and the optDataset class for wrapping data samples.
For more details, see the 02 Optimization Dataset notebook.
Data Generator
pyepo.data includes synthetic data generators for four optimization problems: shortest path, multi-dimensional knapsack, traveling salesperson, and portfolio optimization.
Each generator produces feature-cost pairs \((\mathbf{x}, \mathbf{c})\). The feature vector \(\mathbf{x}_i \in \mathbb{R}^p\) follows a standard multivariate Gaussian distribution \(\mathcal{N}(0, \mathbf{I})\), and the cost \(\mathbf{c}_i \in \mathbb{R}^d\) is computed from a polynomial function \(f(\mathbf{x}_i)\) multiplied by random noise \(\mathbf{\epsilon}_i \sim U(1-\bar{\epsilon}, 1+\bar{\epsilon})\).
Common parameters across all generators:
num_data (\(n\)): number of data samples
num_features (\(p\)): feature dimension
deg (\(deg\)): polynomial degree of the mapping \(f(\mathbf{x}_i)\)
noise_width (\(\bar{\epsilon}\)): noise half-width
seed: random seed for reproducibility
Shortest Path
A random matrix \(\mathcal{B} \in \mathbb{R}^{d \times p}\) with Bernoulli(0.5) entries encodes the features. The cost coefficients are generated as \(c_i^j = [\frac{1}{{3.5}^{deg}} (\frac{1}{\sqrt{p}}(\mathcal{B} \mathbf{x}_i)_j + 3)^{deg} + 1] \cdot \epsilon_i^j\).
- pyepo.data.shortestpath.genData(num_data, num_features, grid, deg=1, noise_width=0, seed=135)
A function to generate synthetic data and features for shortest path
- Parameters:
num_data (int) – number of data points
num_features (int) – dimension of features
grid (int, int) – size of grid network
deg (int) – data polynomial degree
noise_width (float) – half width of data random noise
seed (int) – random seed
- Returns:
data features (np.ndarray), costs (np.ndarray)
- Return type:
tuple
import pyepo
num_data = 1000 # number of data samples
num_feat = 5 # feature dimension
grid = (5,5) # grid size
x, c = pyepo.data.shortestpath.genData(num_data, num_feat, grid, deg=4, noise_width=0, seed=135)
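To make the cost formula above concrete, here is a minimal NumPy sketch of the same recipe: Gaussian features, a Bernoulli(0.5) matrix, the scaled polynomial, and multiplicative uniform noise. It is an illustration only, not PyEPO's implementation. The function name `sketch_shortest_path_costs` is hypothetical, the edge count assumes a grid whose nodes connect only to their right and lower neighbours, and the random stream differs from the library's, so the numbers will not match `genData` even with the same seed.

```python
import numpy as np

def sketch_shortest_path_costs(num_data, num_features, grid, deg=1, noise_width=0.0, seed=135):
    """Illustrative re-implementation of the cost formula (not PyEPO's code)."""
    rng = np.random.default_rng(seed)
    # number of edges d in an h x w grid (assumed: right and down edges only)
    d = (grid[0] - 1) * grid[1] + (grid[1] - 1) * grid[0]
    # random matrix B with Bernoulli(0.5) entries, shape (d, p)
    B = rng.binomial(1, 0.5, size=(d, num_features))
    # features x_i ~ N(0, I), shape (n, p)
    x = rng.normal(0.0, 1.0, size=(num_data, num_features))
    # polynomial mapping: ((B x_i)_j / sqrt(p) + 3)^deg / 3.5^deg + 1
    c = ((x @ B.T) / np.sqrt(num_features) + 3.0) ** deg
    c = c / 3.5 ** deg + 1.0
    # multiplicative noise eps ~ U(1 - noise_width, 1 + noise_width)
    eps = rng.uniform(1.0 - noise_width, 1.0 + noise_width, size=c.shape)
    return x, c * eps

x_demo, c_demo = sketch_shortest_path_costs(100, 5, (5, 5), deg=4, noise_width=0.5)
print(x_demo.shape, c_demo.shape)  # (100, 5) (100, 40)
```

With an even degree the polynomial term is non-negative, so every cost is at least `1 - noise_width` in this sketch.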
Knapsack
Since uncertain coefficients appear only in the objective function, item weights are fixed. Let \(m\) be the number of items and \(k\) the number of resource dimensions. The weights \(\mathcal{W} \in \mathbb{R}^{k \times m}\) are sampled from 3 to 8 with one decimal place of precision. The cost coefficients are \(c_i^j = \lceil [\frac{5}{{3.5}^{deg}} (\frac{1}{\sqrt{p}}(\mathcal{B} \mathbf{x}_i)_j + 3)^{deg} + 1] \cdot \epsilon_i^j \rceil\), where \(\mathcal{B} \in \mathbb{R}^{m \times p}\) is a random matrix with Bernoulli(0.5) entries, as in the shortest path generator, and \(\epsilon_i^j\) is the multiplicative noise.
- pyepo.data.knapsack.genData(num_data, num_features, num_items, dim=1, deg=1, noise_width=0, seed=135)
A function to generate synthetic data and features for knapsack
- Parameters:
num_data (int) – number of data points
num_features (int) – dimension of features
num_items (int) – number of items
dim (int) – dimension of multi-dimensional knapsack
deg (int) – data polynomial degree
noise_width (float) – half width of data random noise
seed (int) – random seed
- Returns:
weights of items (np.ndarray), data features (np.ndarray), costs (np.ndarray)
- Return type:
tuple
import pyepo
num_data = 1000 # number of data samples
num_feat = 5 # feature dimension
num_item = 32 # number of items
dim = 3 # dimension of knapsack
weights, x, c = pyepo.data.knapsack.genData(num_data, num_feat, num_item, dim, deg=4, noise_width=0, seed=135)
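The knapsack recipe differs from the shortest path one in three places: fixed weights, a factor of 5 in front of the polynomial, and a final ceiling. The NumPy sketch below shows those steps under stated assumptions. The name `sketch_knapsack_data` is hypothetical, and drawing weights as integers from 30 to 80 divided by 10 is only one way to read "3 to 8 with one decimal place"; the library may sample them differently.

```python
import numpy as np

def sketch_knapsack_data(num_data, num_features, num_items, dim=1, deg=1, noise_width=0.0, seed=135):
    """Illustrative re-implementation of the knapsack recipe (not PyEPO's code)."""
    rng = np.random.default_rng(seed)
    # fixed weights W, shape (k, m): 3.0, 3.1, ..., 8.0 (assumed sampling scheme)
    weights = rng.integers(30, 81, size=(dim, num_items)) / 10.0
    # Bernoulli(0.5) matrix B, shape (m, p), and Gaussian features, shape (n, p)
    B = rng.binomial(1, 0.5, size=(num_items, num_features))
    x = rng.normal(0.0, 1.0, size=(num_data, num_features))
    # polynomial mapping: 5 / 3.5^deg * ((B x_i)_j / sqrt(p) + 3)^deg + 1
    c = ((x @ B.T) / np.sqrt(num_features) + 3.0) ** deg
    c = 5.0 * c / 3.5 ** deg + 1.0
    # multiplicative noise, then round up to integer-valued costs
    eps = rng.uniform(1.0 - noise_width, 1.0 + noise_width, size=c.shape)
    c = np.ceil(c * eps)
    return weights, x, c

weights_demo, x_k, c_k = sketch_knapsack_data(100, 5, 32, dim=3, deg=4, noise_width=0.5)
print(weights_demo.shape, x_k.shape, c_k.shape)  # (3, 32) (100, 5) (100, 32)
```

The weights do not depend on the features, which reflects the statement above that uncertainty enters only through the objective.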
Traveling Salesperson
The distance matrix has two components: a Euclidean distance term and a feature-encoded term. Coordinates are drawn from a mixture of Gaussian \(\mathcal{N}(0, I)\) and uniform \(\textbf{U}(-2, 2)\) distributions. The feature-encoded component is \(\frac{1}{{3}^{deg - 1}} (\frac{1}{\sqrt{p}} (\mathcal{B} x_i)_j + 3)^{deg} \cdot \epsilon_i\), where the elements of \(\mathcal{B}\) are products of Bernoulli \(\textbf{B}(0.5)\) and uniform \(\textbf{U}(-2, 2)\) samples.
- pyepo.data.tsp.genData(num_data, num_features, num_nodes, deg=1, noise_width=0, seed=135)
A function to generate synthetic data and features for traveling salesperson
- Parameters:
num_data (int) – number of data points
num_features (int) – dimension of features
num_nodes (int) – number of nodes
deg (int) – data polynomial degree
noise_width (float) – half width of data random noise
seed (int) – random seed
- Returns:
data features (np.ndarray), costs (np.ndarray)
- Return type:
tuple
import pyepo
num_data = 1000 # number of data samples
num_feat = 5 # feature dimension
num_node = 20 # number of nodes
x, c = pyepo.data.tsp.genData(num_data, num_feat, num_node, deg=4, noise_width=0, seed=135)
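The two-component distance described above can be sketched in NumPy as follows. Several details are assumptions made for illustration and are not confirmed by the text: the two components are summed, half of the nodes get uniform coordinates and the rest Gaussian ones, each undirected edge \(i < j\) gets one cost, and the noise is drawn per edge. The name `sketch_tsp_costs` is hypothetical.

```python
import numpy as np

def sketch_tsp_costs(num_data, num_features, num_nodes, deg=1, noise_width=0.0, seed=135):
    """Illustrative re-implementation of the TSP recipe (not PyEPO's code)."""
    rng = np.random.default_rng(seed)
    # node coordinates: half U(-2, 2), half N(0, I) (assumed split of the mixture)
    n_unif = num_nodes // 2
    coords = np.vstack([
        rng.uniform(-2.0, 2.0, size=(n_unif, 2)),
        rng.normal(0.0, 1.0, size=(num_nodes - n_unif, 2)),
    ])
    # Euclidean term: one distance per undirected edge (i < j)
    i_idx, j_idx = np.triu_indices(num_nodes, k=1)
    euclid = np.linalg.norm(coords[i_idx] - coords[j_idx], axis=1)
    d = euclid.shape[0]
    # B entries are products of Bernoulli(0.5) and U(-2, 2) samples, shape (d, p)
    B = rng.binomial(1, 0.5, size=(d, num_features)) * rng.uniform(-2.0, 2.0, size=(d, num_features))
    x = rng.normal(0.0, 1.0, size=(num_data, num_features))
    # feature-encoded term: ((B x_i)_j / sqrt(p) + 3)^deg / 3^(deg - 1)
    feat = ((x @ B.T) / np.sqrt(num_features) + 3.0) ** deg / 3.0 ** (deg - 1)
    # multiplicative noise on the feature-encoded term only
    eps = rng.uniform(1.0 - noise_width, 1.0 + noise_width, size=feat.shape)
    return x, euclid[None, :] + feat * eps

x_t, c_t = sketch_tsp_costs(50, 5, 20, deg=2, noise_width=0.1)
print(x_t.shape, c_t.shape)  # (50, 5) (50, 190)
```

For 20 nodes there are 20 * 19 / 2 = 190 undirected edges, which is the cost dimension in this sketch.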
Portfolio
Let \(\bar{r}_{ij} = (\frac{0.05}{\sqrt{p}}(\mathcal{B} \mathbf{x}_i)_j + {0.1}^{\frac{1}{deg}})^{deg}\). The expected return is \(\mathbf{r}_i = \bar{\mathbf{r}}_i + \mathbf{L} \mathbf{f} + 0.01 \tau \mathbf{\epsilon}\), and the covariance matrix is \(\mathbf{\Sigma} = \mathbf{L} \mathbf{L}^{\intercal} + (0.01 \tau)^2 \mathbf{I}\), where \(\tau\) is the noise level (noise_level), the entries of \(\mathcal{B}\) follow a Bernoulli distribution, the entries of \(\mathbf{L}\) are drawn from \(\textbf{U}(-0.0025\tau, 0.0025\tau)\), and \(\mathbf{f}, \mathbf{\epsilon} \sim \mathcal{N}(0, \mathbf{I})\).
- pyepo.data.portfolio.genData(num_data, num_features, num_assets, deg=1, noise_level=1, seed=135)
A function to generate synthetic data and features for portfolio
- Parameters:
num_data (int) – number of data points
num_features (int) – dimension of features
num_assets (int) – number of assets
deg (int) – data polynomial degree
noise_level (float) – level of data random noise
seed (int) – random seed
- Returns:
covariance matrix (np.ndarray), data features (np.ndarray), costs (np.ndarray)
- Return type:
tuple
import pyepo
num_data = 1000 # number of data samples
num_feat = 4 # feature dimension
num_assets = 50 # number of assets
cov, x, r = pyepo.data.portfolio.genData(num_data, num_feat, num_assets, deg=4, noise_level=1, seed=135)
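The return and covariance formulas above translate almost line by line into NumPy. The sketch below is an illustration under assumptions: the name `sketch_portfolio_data` and the `num_factors` parameter (the number of columns of \(\mathbf{L}\)) are hypothetical, a Bernoulli(0.5) matrix is assumed for \(\mathcal{B}\), and \(\mathbf{f}\) and \(\mathbf{\epsilon}\) are redrawn for every sample. None of these choices is stated in the text.

```python
import numpy as np

def sketch_portfolio_data(num_data, num_features, num_assets, deg=1,
                          noise_level=1.0, num_factors=4, seed=135):
    """Illustrative re-implementation of the portfolio recipe (not PyEPO's code)."""
    rng = np.random.default_rng(seed)
    tau = noise_level
    # Bernoulli matrix B (p = 0.5 assumed), shape (m, p), and Gaussian features
    B = rng.binomial(1, 0.5, size=(num_assets, num_features))
    x = rng.normal(0.0, 1.0, size=(num_data, num_features))
    # factor loadings L ~ U(-0.0025 tau, 0.0025 tau), shape (m, num_factors)
    L = rng.uniform(-0.0025 * tau, 0.0025 * tau, size=(num_assets, num_factors))
    # mean return: (0.05 / sqrt(p) * (B x_i)_j + 0.1^(1/deg))^deg
    r_bar = (0.05 / np.sqrt(num_features) * (x @ B.T) + 0.1 ** (1.0 / deg)) ** deg
    # r_i = r_bar_i + L f + 0.01 tau eps, with f and eps standard Gaussian
    f = rng.normal(0.0, 1.0, size=(num_data, num_factors))
    eps = rng.normal(0.0, 1.0, size=(num_data, num_assets))
    r = r_bar + f @ L.T + 0.01 * tau * eps
    # Sigma = L L^T + (0.01 tau)^2 I
    cov = L @ L.T + (0.01 * tau) ** 2 * np.eye(num_assets)
    return cov, x, r

cov_demo, x_p, r_p = sketch_portfolio_data(200, 4, 50, deg=4, noise_level=1.0)
print(cov_demo.shape, x_p.shape, r_p.shape)  # (50, 50) (200, 4) (200, 50)
```

Because \(\mathbf{L}\mathbf{L}^{\intercal}\) is positive semidefinite and the diagonal term is strictly positive for \(\tau > 0\), the covariance matrix in this sketch is symmetric positive definite.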
optDataset
pyepo.data.optDataset is a PyTorch Dataset that stores features and cost coefficients, and solves the optimization problem to obtain optimal solutions and objective values.
optDataset is not required for training with PyEPO, but it provides a convenient way to precompute optimal solutions and objective values when they are not available in the original data.
- class pyepo.data.dataset.optDataset(model, feats, costs)
This class is a Torch Dataset for optimization problems.
- model
Optimization model
- Type:
- feats
Data features
- Type:
np.ndarray
- costs
Cost vectors
- Type:
np.ndarray
- sols
Optimal solutions
- Type:
np.ndarray
- objs
Optimal objective values
- Type:
np.ndarray
The following example shows how to use optDataset with a PyTorch DataLoader:
import pyepo
from torch.utils.data import DataLoader
# model for shortest path
grid = (5,5) # grid size
model = pyepo.model.grb.shortestPathModel(grid)
# generate data
num_data = 1000 # number of data samples
num_feat = 5 # feature dimension
deg = 4 # polynomial degree
noise_width = 0 # noise width
x, c = pyepo.data.shortestpath.genData(num_data, num_feat, grid, deg, noise_width, seed=135)
# build dataset
dataset = pyepo.data.dataset.optDataset(model, x, c)
# get data loader
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)
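The idea behind such a dataset, solving each instance once up front and serving the results by index, can be sketched without PyEPO or a solver licence. The class below is a toy stand-in, not the library's implementation: `ToyOptDataset` and `pick_cheapest` are hypothetical names, the "optimization model" is reduced to choosing the single cheapest item, and the order of the returned tuple is a choice made for this sketch. It follows the map-style protocol (`__len__` and `__getitem__`) that a PyTorch DataLoader expects.

```python
import numpy as np

class ToyOptDataset:
    """Map-style dataset sketch: stores (x, c) and precomputes solutions and objectives."""

    def __init__(self, solve, feats, costs):
        self.feats = np.asarray(feats, dtype=float)
        self.costs = np.asarray(costs, dtype=float)
        sols, objs = [], []
        # solve every instance once, at construction time
        for c in self.costs:
            w, z = solve(c)
            sols.append(w)
            objs.append([z])
        self.sols = np.array(sols)
        self.objs = np.array(objs)

    def __len__(self):
        return len(self.costs)

    def __getitem__(self, idx):
        # features, costs, optimal solution, optimal objective value
        return self.feats[idx], self.costs[idx], self.sols[idx], self.objs[idx]

def pick_cheapest(c):
    """Toy solver: select exactly one item so that total cost is minimal."""
    w = np.zeros_like(c)
    w[np.argmin(c)] = 1.0
    return w, float(c @ w)

feats = np.array([[0.1, 0.2], [0.3, 0.4]])
costs = np.array([[3.0, 1.0, 2.0], [0.5, 4.0, 6.0]])
toy = ToyOptDataset(pick_cheapest, feats, costs)
x0, c0, w0, z0 = toy[0]
print(w0, z0)  # [0. 1. 0.] [1.]
```

Precomputing at construction time trades a one-off solving cost for cheap lookups during training, which matters when the same instances are visited over many epochs.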
optDatasetKNN
pyepo.data.optDatasetKNN is a PyTorch Dataset for k-nearest neighbors (kNN) robust loss [1] in decision-focused learning. It stores features and cost coefficients, and computes mean kNN solutions and optimal objective values.
- class pyepo.data.dataset.optDatasetKNN(model, feats, costs, k=10, weight=0.5)
This class is a Torch Dataset for optimization problems, when using the robust kNN-loss.
Reference: <https://arxiv.org/abs/2310.04328>
- model
Optimization model
- Type:
- k
number of nearest neighbours selected
- Type:
int
- weight
weight of kNN-loss
- Type:
float
- feats
Data features
- Type:
np.ndarray
- costs
Cost vectors
- Type:
np.ndarray
- sols
Optimal solutions
- Type:
np.ndarray
- objs
Optimal objective values
- Type:
np.ndarray
import pyepo
from torch.utils.data import DataLoader
# model for shortest path
grid = (5,5) # grid size
model = pyepo.model.grb.shortestPathModel(grid)
# generate data
num_data = 1000 # number of data samples
num_feat = 5 # feature dimension
deg = 4 # polynomial degree
noise_width = 0 # noise width
x, c = pyepo.data.shortestpath.genData(num_data, num_feat, grid, deg, noise_width, seed=135)
# build dataset
dataset = pyepo.data.dataset.optDatasetKNN(model, x, c, k=10, weight=0.5)
# get data loader
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)
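One plausible reading of "mean kNN solutions" is sketched below, purely as an illustration. For each sample it finds the k nearest samples in feature space, blends the sample's own cost with each neighbour's cost using `weight`, solves every blended instance, and averages the resulting solutions. The blending rule, the use of Euclidean distance, and the names `knn_mean_solutions` and `pick_cheapest` are assumptions of this sketch; the library's exact procedure may differ, so consult the referenced paper and the source for the definitive version.

```python
import numpy as np

def pick_cheapest(c):
    """Toy solver: select exactly one item so that total cost is minimal."""
    w = np.zeros_like(c)
    w[np.argmin(c)] = 1.0
    return w, float(c @ w)

def knn_mean_solutions(solve, feats, costs, k=2, weight=0.5):
    """Average the solutions of k neighbour-blended instances per sample (sketch)."""
    feats = np.asarray(feats, dtype=float)
    costs = np.asarray(costs, dtype=float)
    # pairwise Euclidean distances in feature space, shape (n, n)
    dist = np.linalg.norm(feats[:, None, :] - feats[None, :, :], axis=2)
    # k nearest samples per row; each sample is its own nearest neighbour
    nn = np.argsort(dist, axis=1)[:, :k]
    mean_sols = []
    for i in range(len(costs)):
        sols = []
        for j in nn[i]:
            # assumed blend: weight on own cost, (1 - weight) on the neighbour's cost
            c_mix = weight * costs[i] + (1.0 - weight) * costs[j]
            w, _ = solve(c_mix)
            sols.append(w)
        mean_sols.append(np.mean(sols, axis=0))
    return np.array(mean_sols)

feats = np.array([[0.0], [0.1], [5.0]])
costs = np.array([[1.0, 3.0], [2.0, 1.0], [1.0, 3.0]])
mean_sols = knn_mean_solutions(pick_cheapest, feats, costs, k=2, weight=0.5)
print(mean_sols[1])  # [0.5 0.5]
```

In the example, the second sample prefers item 2 on its own cost, but its nearest neighbour pulls the blended cost toward item 1, so the averaged solution is split evenly. Averaged solutions of this kind are generally fractional even when each individual solution is binary.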