Data

pyepo.data contains synthetic data generators and a dataset class, optDataset, to wrap data samples.

For more details about the dataset, see 02 Optimization Dataset.

Data Generator

pyepo.data includes synthetic datasets for four classic optimization problems: the shortest path problem, the multi-dimensional knapsack problem, the traveling salesperson problem, and portfolio optimization.

The synthetic datasets include features \(\mathbf{x}\) and cost coefficients \(\mathbf{c}\). The feature vector \(\mathbf{x}_i \in \mathbb{R}^p\) follows a standard multivariate Gaussian distribution \(\mathcal{N}(0, \mathbf{I})\), and the corresponding cost \(\mathbf{c}_i \in \mathbb{R}^d\) comes from a polynomial function \(f(\mathbf{x}_i)\) multiplied by random noise \(\mathbf{\epsilon}_i \sim U(1-\bar{\epsilon}, 1+\bar{\epsilon})\). In general, users can control the following parameters:

num_data (\(n\)): data size
num_features (\(p\)): dimension of features \(\mathbf{x}\)
deg (\(deg\)): polynomial degree of function \(f(\mathbf{x}_i)\)
noise_width (\(\bar{\epsilon}\)): noise half-width of \(\mathbf{\epsilon}\)
seed: random state seed to generate data
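
As a rough illustration of how these parameters interact, the sketch below samples features and multiplicative noise using only the Python standard library. It is not PyEPO's implementation: the helper names sample_features and sample_noise are hypothetical, and the number of cost coefficients is passed in directly as num_costs.

```python
import random

def sample_features(num_data, num_features, rng):
    """Draw num_data feature vectors, each entry i.i.d. from N(0, 1)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(num_features)]
            for _ in range(num_data)]

def sample_noise(num_data, num_costs, noise_width, rng):
    """Draw multiplicative noise from U(1 - noise_width, 1 + noise_width)."""
    return [[rng.uniform(1.0 - noise_width, 1.0 + noise_width)
             for _ in range(num_costs)]
            for _ in range(num_data)]

rng = random.Random(135)  # seeded generator, so repeated runs give the same data
x = sample_features(num_data=4, num_features=5, rng=rng)
eps = sample_noise(num_data=4, num_costs=3, noise_width=0.5, rng=rng)
eps_zero = sample_noise(num_data=4, num_costs=3, noise_width=0.0, rng=rng)
```

With noise_width=0 every noise factor equals 1, so the costs become a deterministic function of the features.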

Shortest Path

For the shortest path, a random matrix \(\mathcal{B} \in \mathbb{R}^{d \times p}\), whose entries follow a Bernoulli distribution with probability \(0.5\), encodes the features \(\mathbf{x}_i\). The cost of the objective function \(c_i^j\) is generated from \(c_i^j = [\frac{1}{{3.5}^{deg}} (\frac{1}{\sqrt{p}}(\mathcal{B} \mathbf{x}_i)_j + 3)^{deg} + 1] \cdot \epsilon_i^j\).
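
The formula above can be traced for a single sample in plain Python. This is a minimal sketch rather than the library's code: shortest_path_cost is a hypothetical helper, and d = 40 is only an illustrative number of edges.

```python
import math
import random

def shortest_path_cost(x, b_row, deg, eps):
    """Cost of one edge j for one sample i, following the formula above."""
    p = len(x)
    z = sum(b * xi for b, xi in zip(b_row, x)) / math.sqrt(p)  # (B x)_j / sqrt(p)
    return ((z + 3.0) ** deg / 3.5 ** deg + 1.0) * eps

rng = random.Random(135)
p, d, deg, noise_width = 5, 40, 4, 0.5  # d = 40 is an illustrative edge count
# each entry of B is Bernoulli(0.5)
B = [[1 if rng.random() < 0.5 else 0 for _ in range(p)] for _ in range(d)]
x_i = [rng.gauss(0.0, 1.0) for _ in range(p)]
c_i = [shortest_path_cost(x_i, B[j], deg,
                          rng.uniform(1.0 - noise_width, 1.0 + noise_width))
       for j in range(d)]
```

With an even deg and positive noise, every cost is strictly positive because the bracketed term is at least 1.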

pyepo.data.shortestpath.genData(num_data, num_features, grid, deg=1, noise_width=0, seed=135)

A function to generate synthetic data and features for shortest path

Parameters:

num_data (int) – number of data points
num_features (int) – dimension of features
grid (int, int) – size of grid network
deg (int) – data polynomial degree
noise_width (float) – half width of data random noise
seed (int) – random seed

Returns:

data features (np.ndarray), costs (np.ndarray)

Return type:

tuple

The following code generates data for the shortest path on a grid network:

import pyepo

num_data = 1000 # number of data
num_feat = 5 # size of feature
grid = (5,5) # grid size
x, c = pyepo.data.shortestpath.genData(num_data, num_feat, grid, deg=4, noise_width=0, seed=135)

Knapsack

Because we assume that the uncertain coefficients exist only in the objective function, the weights of items are fixed throughout the data. We define the number of items as \(m\) and the dimension of resources as \(k\). The weights \(\mathcal{W} \in \mathbb{R}^{k \times m}\) are sampled from \(3\) to \(8\) with a precision of \(1\) decimal place. With the same \(\mathcal{B}\), the cost \(c_i^j\) is calculated from \(c_i^j = \lceil [\frac{5}{{3.5}^{deg}} (\frac{1}{\sqrt{p}}(\mathcal{B} \mathbf{x}_i)_j + 3)^{deg} + 1] \cdot \epsilon_i^j \rceil\).
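
To make the rounding and the fixed weights concrete, here is a small standard-library sketch. The helper knapsack_value is hypothetical, and sampling the weights as a rounded uniform draw is an assumption made for illustration; PyEPO may sample them differently.

```python
import math
import random

def knapsack_value(x, b_row, deg, eps):
    """Value of one item j for one sample i, rounded up to an integer."""
    p = len(x)
    z = sum(b * xi for b, xi in zip(b_row, x)) / math.sqrt(p)
    return math.ceil((5.0 * (z + 3.0) ** deg / 3.5 ** deg + 1.0) * eps)

rng = random.Random(135)
m, k, p = 6, 3, 5  # items, resource dimensions, features

# fixed weights in [3, 8] with one decimal place, shared by all samples
# (assumed sampling scheme: uniform draw, then rounding)
weights = [[round(rng.uniform(3.0, 8.0), 1) for _ in range(m)] for _ in range(k)]

B = [[1 if rng.random() < 0.5 else 0 for _ in range(p)] for _ in range(m)]
x_i = [rng.gauss(0.0, 1.0) for _ in range(p)]
c_i = [knapsack_value(x_i, B[j], deg=4, eps=1.0) for j in range(m)]
```

Only c_i depends on the features; weights is generated once and reused for every sample.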

pyepo.data.knapsack.genData(num_data, num_features, num_items, dim=1, deg=1, noise_width=0, seed=135)

A function to generate synthetic data and features for knapsack

Parameters:

num_data (int) – number of data points
num_features (int) – dimension of features
num_items (int) – number of items
dim (int) – dimension of multi-dimensional knapsack
deg (int) – data polynomial degree
noise_width (float) – half width of data random noise
seed (int) – random state seed

Returns:

weights of items (np.ndarray), data features (np.ndarray), costs (np.ndarray)

Return type:

tuple

The following code generates data for the 3-dimensional knapsack:

import pyepo

num_data = 1000 # number of data
num_feat = 5 # size of feature
num_item = 32 # number of items
dim = 3 # dimension of knapsack
weights, x, c = pyepo.data.knapsack.genData(num_data, num_feat, num_item, dim, deg=4, noise_width=0, seed=135)

Traveling Salesperson

The distance consists of two parts: one comes from the Euclidean distance, and the other is derived from feature encoding. For the Euclidean distance, we create coordinates from a mixture of the Gaussian distribution \(\mathcal{N}(0, I)\) and the uniform distribution \(\textbf{U}(-2, 2)\). The feature encoding is \(\frac{1}{{3}^{deg - 1}} (\frac{1}{\sqrt{p}} (\mathcal{B} \mathbf{x}_i)_j + 3)^{deg} \cdot \epsilon_i\), where the elements of \(\mathcal{B}\) come from the product of Bernoulli \(\textbf{B}(0.5)\) and uniform \(\textbf{U}(-2, 2)\).
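
The two-part distance can be sketched for a single edge as follows. The helper tsp_edge_cost is hypothetical, and drawing one node from the Gaussian and the other from the uniform distribution is just one way to read "mixture"; the library may combine the two distributions differently.

```python
import math
import random

def tsp_edge_cost(coord_u, coord_v, x, b_row, deg, eps):
    """Euclidean distance between two nodes plus the feature-encoding term."""
    p = len(x)
    z = sum(b * xi for b, xi in zip(b_row, x)) / math.sqrt(p)
    encoding = (z + 3.0) ** deg / 3.0 ** (deg - 1) * eps
    euclidean = math.hypot(coord_u[0] - coord_v[0], coord_u[1] - coord_v[1])
    return euclidean + encoding

rng = random.Random(135)
p, deg = 5, 4
# one row of B: Bernoulli(0.5) times U(-2, 2), as described above
b_row = [(1 if rng.random() < 0.5 else 0) * rng.uniform(-2.0, 2.0) for _ in range(p)]
x_i = [rng.gauss(0.0, 1.0) for _ in range(p)]
u = (rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0))        # a Gaussian-sampled node
v = (rng.uniform(-2.0, 2.0), rng.uniform(-2.0, 2.0))  # a uniformly sampled node
cost_uv = tsp_edge_cost(u, v, x_i, b_row, deg, eps=1.0)
```

When the features are all zero and the noise is 1, the encoding term equals 3 for any deg, because \(3^{deg} / 3^{deg-1} = 3\).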

pyepo.data.tsp.genData(num_data, num_features, num_nodes, deg=1, noise_width=0, seed=135)

A function to generate synthetic data and features for traveling salesperson

Parameters:

num_data (int) – number of data points
num_features (int) – dimension of features
num_nodes (int) – number of nodes
deg (int) – data polynomial degree
noise_width (float) – half width of data random noise
seed (int) – random seed

Returns:

data features (np.ndarray), costs (np.ndarray)

Return type:

tuple

The following code generates data for the traveling salesperson problem:

import pyepo

num_data = 1000 # number of data
num_feat = 5 # size of feature
num_node = 20 # number of nodes
x, c = pyepo.data.tsp.genData(num_data, num_feat, num_node, deg=4, noise_width=0, seed=135)

Portfolio

Let \(\bar{r}_{ij} = (\frac{0.05}{\sqrt{p}}(\mathcal{B} \mathbf{x}_i)_j + {0.1}^{\frac{1}{deg}})^{deg}\). In the context of portfolio optimization, the expected return of the assets \(\mathbf{r}_i\) is defined as \(\bar{\mathbf{r}}_i + \mathbf{L} \mathbf{f} + 0.01 \tau \mathbf{\epsilon}\) and the covariance matrix \(\mathbf{\Sigma}\) is expressed as \(\mathbf{L} \mathbf{L}^{\intercal} + (0.01 \tau)^2 \mathbf{I}\), where \(\tau\) is the noise level (noise_level), \(\mathcal{B}\) follows a Bernoulli distribution, \(\mathbf{L}\) follows a uniform distribution between \(-0.0025 \tau\) and \(0.0025 \tau\), and \(\mathbf{f}\) and \(\mathbf{\epsilon}\) follow the standard normal distribution.
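
The return and covariance definitions above can be sketched in plain Python. The helpers portfolio_sample and covariance are hypothetical, and the number of columns of \(\mathbf{L}\) (num_factors below) is an assumed size, since the text does not state it.

```python
import math
import random

def portfolio_sample(x, B, L, deg, tau, rng):
    """Return one vector of asset returns r_i = r_bar_i + L f + 0.01 * tau * eps."""
    p = len(x)
    f = [rng.gauss(0.0, 1.0) for _ in range(len(L[0]))]  # shared factor draws
    r = []
    for b_row, l_row in zip(B, L):
        z = sum(b * xi for b, xi in zip(b_row, x)) / math.sqrt(p)
        r_bar = (0.05 * z + 0.1 ** (1.0 / deg)) ** deg
        factor_term = sum(l * fk for l, fk in zip(l_row, f))
        r.append(r_bar + factor_term + 0.01 * tau * rng.gauss(0.0, 1.0))
    return r

def covariance(L, tau):
    """Sigma = L L^T + (0.01 * tau)^2 I."""
    m = len(L)
    return [[sum(a * b for a, b in zip(L[i], L[j]))
             + ((0.01 * tau) ** 2 if i == j else 0.0)
             for j in range(m)] for i in range(m)]

rng = random.Random(135)
m, p, num_factors, deg, tau = 4, 3, 3, 4, 1.0  # num_factors is an assumed size
B = [[1 if rng.random() < 0.5 else 0 for _ in range(p)] for _ in range(m)]
L = [[rng.uniform(-0.0025 * tau, 0.0025 * tau) for _ in range(num_factors)]
     for _ in range(m)]
x_i = [rng.gauss(0.0, 1.0) for _ in range(p)]
r_i = portfolio_sample(x_i, B, L, deg, tau, rng)
sigma = covariance(L, tau)
```

With zero features and no noise, every mean return reduces to \(({0.1}^{1/deg})^{deg} = 0.1\), which is why the offset is written as \({0.1}^{\frac{1}{deg}}\).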

pyepo.data.portfolio.genData(num_data, num_features, num_assets, deg=1, noise_level=1, seed=135)

A function to generate synthetic data and features for portfolio optimization

Parameters:

num_data (int) – number of data points
num_features (int) – dimension of features
num_assets (int) – number of assets
deg (int) – data polynomial degree
noise_level (float) – level of data random noise
seed (int) – random seed

Returns:

covariance matrix (np.ndarray), data features (np.ndarray), costs (np.ndarray)

Return type:

tuple

The following code generates data for portfolio optimization:

import pyepo

num_data = 1000 # number of data
num_feat = 4 # size of feature
num_assets = 50 # number of assets
cov, x, r = pyepo.data.portfolio.genData(num_data, num_feat, num_assets, deg=4, noise_level=1, seed=135)

optDataset

pyepo.data.optDataset is a PyTorch Dataset that stores the features and their corresponding objective-function costs, and solves the optimization problems to obtain optimal solutions and optimal objective values.

optDataset is not required for training with PyEPO, but it makes it easier to obtain optimal solutions and objective values when they are not available in the original data.

class pyepo.data.dataset.optDataset(model, feats, costs)

This class is a Torch Dataset for optimization problems.

model

Optimization models

Type: optModel

feats

Data features

Type: np.ndarray

costs

Cost vectors

Type: np.ndarray

sols

Optimal solutions

Type: np.ndarray

objs

Optimal objective values

Type: np.ndarray

A method to create an optDataset from optModel

Parameters:

model (optModel) – an instance of optModel
feats (np.ndarray) – data features
costs (np.ndarray) – costs of objective function

As in the following example, optDataset and the PyTorch DataLoader wrap the data samples, which makes model training cleaner and more organized.

import pyepo
from torch.utils.data import DataLoader

# model for shortest path
grid = (5,5) # grid size
model = pyepo.model.grb.shortestPathModel(grid)

# generate data
num_data = 1000 # number of data
num_feat = 5 # size of feature
deg = 4 # polynomial degree
noise_width = 0 # noise width
x, c = pyepo.data.shortestpath.genData(num_data, num_feat, grid, deg, noise_width, seed=135)

# build dataset
dataset = pyepo.data.dataset.optDataset(model, x, c)

# get data loader
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)
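
To show what "solving up front" means, the following is a conceptual stand-in written without PyTorch or a solver. ToyModel and ToyOptDataset are hypothetical names, the setObj/solve interface on the toy model is an assumption made for this sketch, and returning four items per sample (features, costs, solution, objective value) is likewise assumed rather than taken from the text above.

```python
class ToyModel:
    """Hypothetical stand-in for an optModel: pick the single cheapest item."""

    def setObj(self, cost):
        self.cost = list(cost)

    def solve(self):
        best = min(range(len(self.cost)), key=self.cost.__getitem__)
        sol = [1.0 if j == best else 0.0 for j in range(len(self.cost))]
        return sol, self.cost[best]

class ToyOptDataset:
    """Precompute optimal solutions and objective values for every sample."""

    def __init__(self, model, feats, costs):
        self.feats, self.costs = feats, costs
        self.sols, self.objs = [], []
        for c in costs:
            model.setObj(c)           # load this sample's cost vector
            sol, obj = model.solve()  # solve once, when the dataset is built
            self.sols.append(sol)
            self.objs.append([obj])

    def __len__(self):
        return len(self.costs)

    def __getitem__(self, index):
        return (self.feats[index], self.costs[index],
                self.sols[index], self.objs[index])

feats = [[0.1, 0.2], [0.3, 0.4]]
costs = [[3.0, 1.0, 2.0], [0.5, 4.0, 2.5]]
dataset = ToyOptDataset(ToyModel(), feats, costs)
x0, c0, w0, z0 = dataset[0]  # features, costs, solution, objective value
```

Because the solutions are computed when the dataset is constructed, a training loop can read them directly instead of calling the solver again for the ground truth.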

optDatasetKNN

pyepo.data.optDatasetKNN is a PyTorch Dataset designed for implementing k-nearest neighbors (kNN) robust loss [1] in decision-focused learning. It stores the features and their corresponding costs of the objective function and solves optimization problems to get mean kNN solutions and optimal objective values.

class pyepo.data.dataset.optDatasetKNN(model, feats, costs, k=10, weight=0.5)

This class is a Torch Dataset for optimization problems when using the robust kNN loss.

Reference: <https://arxiv.org/abs/2310.04328>

model

Optimization models

Type: optModel

k

number of nearest neighbours selected

Type: int

weight

weight of kNN-loss

Type: float

feats

Data features

Type: np.ndarray

costs

Cost vectors

Type: np.ndarray

sols

Optimal solutions

Type: np.ndarray

objs

Optimal objective values

Type: np.ndarray

A method to create an optDatasetKNN from optModel

Parameters:

model (optModel) – an instance of optModel
feats (np.ndarray) – data features
costs (np.ndarray) – costs of objective function
k (int) – number of nearest neighbours selected
weight (float) – weight of kNN-loss

As in the following example, optDatasetKNN and the PyTorch DataLoader wrap the data samples, which makes model training cleaner and more organized.

import pyepo
from torch.utils.data import DataLoader

# model for shortest path
grid = (5,5) # grid size
model = pyepo.model.grb.shortestPathModel(grid)

# generate data
num_data = 1000 # number of data
num_feat = 5 # size of feature
deg = 4 # polynomial degree
noise_width = 0 # noise width
x, c = pyepo.data.shortestpath.genData(num_data, num_feat, grid, deg, noise_width, seed=135)

# build dataset
dataset = pyepo.data.dataset.optDatasetKNN(model, x, c, k=10, weight=0.5)

# get data loader
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)
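
The idea of a "mean kNN solution" can be illustrated with a toy example. Everything here is a sketch under stated assumptions, not PyEPO's implementation: the solver solve_min_item is hypothetical, neighbours are found by Euclidean distance in feature space (with the sample counted as its own nearest neighbour), and the rule that blends a sample's cost with each neighbour's cost using weight is an assumed reading of the weight parameter.

```python
import math

def solve_min_item(cost):
    """Toy solver (hypothetical): one-hot choice of the cheapest item."""
    best = min(range(len(cost)), key=cost.__getitem__)
    return [1.0 if j == best else 0.0 for j in range(len(cost))]

def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def knn_mean_solution(i, feats, costs, k, weight):
    """Average the solutions obtained from costs blended with the k nearest samples."""
    # rank all samples by distance to sample i in feature space
    order = sorted(range(len(feats)), key=lambda j: euclidean(feats[i], feats[j]))
    neighbours = order[:k]  # includes sample i itself (distance 0)
    mean_sol = [0.0] * len(costs[i])
    for j in neighbours:
        # assumed blending rule: weight on own cost, (1 - weight) on the neighbour's
        blended = [weight * ci + (1.0 - weight) * cj
                   for ci, cj in zip(costs[i], costs[j])]
        sol = solve_min_item(blended)
        mean_sol = [m + s / k for m, s in zip(mean_sol, sol)]
    return mean_sol

feats = [[0.0], [0.1], [5.0]]
costs = [[1.0, 2.0], [3.0, 1.0], [1.0, 2.0]]
w0 = knn_mean_solution(0, feats, costs, k=2, weight=0.5)
```

In this toy data, sample 0 and its neighbour disagree on the cheapest item, so the averaged solution is fractional; with weight=1.0 the neighbours have no influence and the result is the sample's own optimal solution.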