Data

pyepo.data contains synthetic data generators and a dataset class, optDataset, to wrap data samples.

For more details about the dataset, please see 02 Optimization Dataset.

Data Generator

pyepo.data includes synthetic datasets for four classic optimization problems: the shortest path problem, the multi-dimensional knapsack problem, the traveling salesperson problem, and portfolio optimization.

The synthetic datasets include features \(\mathbf{x}\) and cost coefficients \(\mathbf{c}\). The feature vector \(\mathbf{x}_i \in \mathbb{R}^p\) follows a standard multivariate Gaussian distribution \(\mathcal{N}(0, \mathbf{I})\), and the corresponding cost \(\mathbf{c}_i \in \mathbb{R}^d\) comes from a polynomial function \(f(\mathbf{x}_i)\) multiplied by random noise \(\mathbf{\epsilon}_i \sim U(1-\bar{\epsilon}, 1+\bar{\epsilon})\). In general, users can control the following parameters:

  • num_data (\(n\)): data size

  • num_features (\(p\)): dimension of the features \(\mathbf{x}\)

  • deg (\(deg\)): polynomial degree of function \(f(\mathbf{x}_i)\)

  • noise_width (\(\bar{\epsilon}\)): noise half-width of \(\mathbf{\epsilon}\)

  • seed: random state seed to generate data
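As a rough illustration of the noise_width parameter, the NumPy sketch below draws the multiplicative noise \(\mathbf{\epsilon}\) from \(U(1-\bar{\epsilon}, 1+\bar{\epsilon})\). The helper sample_noise and the use of np.random.default_rng are assumptions made for this example; they are not part of pyepo.data.

```python
import numpy as np

def sample_noise(shape, noise_width, rng):
    """Draw multiplicative noise from U(1 - noise_width, 1 + noise_width)."""
    return rng.uniform(1 - noise_width, 1 + noise_width, size=shape)

rng = np.random.default_rng(135)  # any seed works; 135 is only illustrative

# noise_width = 0: every noise factor is exactly 1, so costs are noise-free
eps_none = sample_noise((1000, 40), 0.0, rng)

# noise_width = 0.5: factors are spread over the interval [0.5, 1.5]
eps_half = sample_noise((1000, 40), 0.5, rng)
```

With noise_width=0 the costs are a deterministic function of the features; larger values scale each cost by a random factor around 1.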

Shortest Path

For the shortest path, a random matrix \(\mathcal{B} \in \mathbb{R}^{d \times p}\), whose entries follow a Bernoulli distribution with probability \(0.5\), encodes the features \(\mathbf{x}_i\). The cost \(c_i^j\) of the objective function is generated from \(c_i^j = [\frac{1}{{3.5}^{deg}} (\frac{1}{\sqrt{p}}(\mathcal{B} \mathbf{x}_i)_j + 3)^{deg} + 1] \cdot \epsilon_i^j\).
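The cost formula above can be evaluated directly with NumPy. The following is a minimal sketch, not the library's implementation: the function name sketch_shortest_path_costs, the argument num_edges (standing in for \(d\)), and the choice of random generator are all assumptions made for illustration.

```python
import numpy as np

def sketch_shortest_path_costs(x, num_edges, deg=1, noise_width=0.0, seed=135):
    """Evaluate the shortest path cost formula for features x of shape (n, p)."""
    rng = np.random.default_rng(seed)
    n, p = x.shape
    # random 0/1 encoding matrix B of shape (d, p), each entry Bernoulli(0.5)
    B = rng.binomial(1, 0.5, size=(num_edges, p))
    # inner term: (B x_i)_j / sqrt(p) + 3
    base = (x @ B.T) / np.sqrt(p) + 3
    # polynomial term: base^deg / 3.5^deg + 1
    costs = base ** deg / 3.5 ** deg + 1
    # multiplicative noise from U(1 - noise_width, 1 + noise_width)
    eps = rng.uniform(1 - noise_width, 1 + noise_width, size=costs.shape)
    return costs * eps

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 5))  # 8 samples with p = 5 standard Gaussian features
c = sketch_shortest_path_costs(x, num_edges=40, deg=4)  # d = 40 is illustrative
print(c.shape)  # (8, 40)
```

For an even degree and zero noise, every cost is at least 1, because the polynomial term is non-negative before the constant 1 is added.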

pyepo.data.shortestpath.genData(num_data, num_features, grid, deg=1, noise_width=0, seed=135)

A function to generate synthetic data and features for shortest path

Parameters:
  • num_data (int) – number of data points

  • num_features (int) – dimension of features

  • grid (int, int) – size of grid network

  • deg (int) – data polynomial degree

  • noise_width (float) – half width of data random noise

  • seed (int) – random seed

Returns:

data features (np.ndarray), costs (np.ndarray)

Return type:

tuple

The following code generates data for the shortest path on a grid network:

import pyepo

num_data = 1000 # number of data
num_feat = 5 # size of feature
grid = (5,5) # grid size
x, c = pyepo.data.shortestpath.genData(num_data, num_feat, grid, deg=4, noise_width=0, seed=135)

Knapsack

Because we assume that the uncertain coefficients exist only in the objective function, the weights of items are fixed throughout the data. We define the number of items as \(m\) and the dimension of resources as \(k\). The weights \(\mathcal{W} \in \mathbb{R}^{k \times m}\) are sampled from \(3\) to \(8\) with a precision of \(1\) decimal place. With the same \(\mathcal{B}\), the cost \(c_i^j\) is calculated from \(c_i^j = \lceil [\frac{5}{{3.5}^{deg}} (\frac{1}{\sqrt{p}}(\mathcal{B} \mathbf{x}_i)_j + 3)^{deg} + 1] \cdot \epsilon_i^j \rceil\).
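To make the knapsack construction concrete, here is a small NumPy sketch of the weights and the rounded-up costs described above. It is an illustration under stated assumptions rather than the code behind genData: the name sketch_knapsack_data is hypothetical, and the weights are assumed to be drawn uniformly from \([3, 8]\) before rounding.

```python
import numpy as np

def sketch_knapsack_data(x, num_items, dim=1, deg=1, noise_width=0.0, seed=135):
    """Return (weights, costs) following the knapsack formulas for features x."""
    rng = np.random.default_rng(seed)
    n, p = x.shape
    # fixed weights W of shape (k, m): assumed uniform on [3, 8], 1 decimal place
    weights = np.round(rng.uniform(3, 8, size=(dim, num_items)), 1)
    # random 0/1 encoding matrix B of shape (m, p), each entry Bernoulli(0.5)
    B = rng.binomial(1, 0.5, size=(num_items, p))
    base = (x @ B.T) / np.sqrt(p) + 3
    # polynomial term: 5 * base^deg / 3.5^deg + 1
    values = 5 * base ** deg / 3.5 ** deg + 1
    # multiplicative noise, then round up to the next integer
    eps = rng.uniform(1 - noise_width, 1 + noise_width, size=values.shape)
    costs = np.ceil(values * eps)
    return weights, costs

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 5))  # 8 samples with p = 5 features
weights, c = sketch_knapsack_data(x, num_items=32, dim=3, deg=4)
print(weights.shape, c.shape)  # (3, 32) (8, 32)
```

The weights do not depend on the features, which matches the assumption that uncertainty enters only through the objective.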

pyepo.data.knapsack.genData(num_data, num_features, num_items, dim=1, deg=1, noise_width=0, seed=135)

A function to generate synthetic data and features for knapsack

Parameters:
  • num_data (int) – number of data points

  • num_features (int) – dimension of features

  • num_items (int) – number of items

  • dim (int) – dimension of multi-dimensional knapsack

  • deg (int) – data polynomial degree

  • noise_width (float) – half width of data random noise

  • seed (int) – random state seed

Returns:

weights of items (np.ndarray), data features (np.ndarray), costs (np.ndarray)

Return type:

tuple

The following code generates data for a 3-dimensional knapsack:

import pyepo

num_data = 1000 # number of data
num_feat = 5 # size of feature
num_item = 32 # number of items
dim = 3 # dimension of knapsack
weights, x, c = pyepo.data.knapsack.genData(num_data, num_feat, num_item, dim, deg=4, noise_width=0, seed=135)

Traveling Salesperson

The distance consists of two parts: one comes from the Euclidean distance, and the other is derived from feature encoding. For the Euclidean distance, we create coordinates from a mixture of the Gaussian distribution \(\mathcal{N}(0, I)\) and the uniform distribution \(\textbf{U}(-2, 2)\). The feature encoding is \(\frac{1}{{3}^{deg - 1}} (\frac{1}{\sqrt{p}} (\mathcal{B} x_i)_j + 3)^{deg} \cdot \epsilon_i\), where the elements of \(\mathcal{B}\) come from the product of Bernoulli \(\textbf{B}(0.5)\) and uniform \(\textbf{U}(-2, 2)\).
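The two-part distance can be sketched as follows. Several details are assumptions made only for this example and may differ from what genData does: nodes live in the plane, half of them are drawn from each distribution, there is one cost per unordered node pair, and the two parts are summed. The name sketch_tsp_costs is hypothetical.

```python
import numpy as np

def sketch_tsp_costs(x, num_nodes, deg=1, noise_width=0.0, seed=135):
    """Combine Euclidean distances with a feature-encoding term (assumed sum)."""
    rng = np.random.default_rng(seed)
    n, p = x.shape
    # coordinates: assumed half uniform on [-2, 2]^2, half standard Gaussian
    half = num_nodes // 2
    coords = np.concatenate([
        rng.uniform(-2, 2, size=(half, 2)),
        rng.normal(0, 1, size=(num_nodes - half, 2)),
    ])
    # Euclidean distance for every unordered node pair (i < j)
    iu, ju = np.triu_indices(num_nodes, k=1)
    dist = np.linalg.norm(coords[iu] - coords[ju], axis=1)
    num_edges = dist.shape[0]
    # encoding matrix: Bernoulli(0.5) mask times U(-2, 2) values
    mask = rng.binomial(1, 0.5, size=(num_edges, p))
    B = mask * rng.uniform(-2, 2, size=(num_edges, p))
    base = (x @ B.T) / np.sqrt(p) + 3
    eps = rng.uniform(1 - noise_width, 1 + noise_width, size=base.shape)
    # feature encoding: base^deg / 3^(deg - 1), scaled by the noise
    encoding = base ** deg / 3 ** (deg - 1) * eps
    return dist + encoding  # distances broadcast over the n samples

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 5))
c = sketch_tsp_costs(x, num_nodes=20, deg=4)
print(c.shape)  # (8, 190)
```

With 20 nodes there are 20 * 19 / 2 = 190 unordered pairs, which is why each sample has 190 costs in this sketch.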

pyepo.data.tsp.genData(num_data, num_features, num_nodes, deg=1, noise_width=0, seed=135)

A function to generate synthetic data and features for traveling salesperson

Parameters:
  • num_data (int) – number of data points

  • num_features (int) – dimension of features

  • num_nodes (int) – number of nodes

  • deg (int) – data polynomial degree

  • noise_width (float) – half width of data random noise

  • seed (int) – random seed

Returns:

data features (np.ndarray), costs (np.ndarray)

Return type:

tuple

The following code generates data for the traveling salesperson problem:

import pyepo

num_data = 1000 # number of data
num_feat = 5 # size of feature
num_node = 20 # number of nodes
x, c = pyepo.data.tsp.genData(num_data, num_feat, num_node, deg=4, noise_width=0, seed=135)

Portfolio

Let \(\bar{r}_{ij} = (\frac{0.05}{\sqrt{p}}(\mathcal{B} \mathbf{x}_i)_j + {0.1}^{\frac{1}{deg}})^{deg}\). In the context of portfolio optimization, the expected return of the assets \(\mathbf{r}_i\) is defined as \(\bar{\mathbf{r}}_i + \mathbf{L} \mathbf{f} + 0.01 \tau \mathbf{\epsilon}\) and the covariance matrix \(\mathbf{\Sigma}\) is expressed as \(\mathbf{L} \mathbf{L}^{\intercal} + (0.01 \tau)^2 \mathbf{I}\), where \(\tau\) is the noise level (noise_level), \(\mathcal{B}\) follows a Bernoulli distribution, \(\mathbf{L}\) follows a uniform distribution between \(-0.0025 \tau\) and \(0.0025 \tau\), and \(\mathbf{f}\) and \(\mathbf{\epsilon}\) follow a standard normal distribution.
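The return and covariance formulas above translate into a few lines of NumPy. This is a sketch under assumptions, not the library code: the name sketch_portfolio_data is hypothetical, \(\tau\) is taken to be the noise_level argument, the Bernoulli probability is assumed to be 0.5, and the number of factors (num_factors, the column count of \(\mathbf{L}\)) is not given in the text and is chosen arbitrarily.

```python
import numpy as np

def sketch_portfolio_data(x, num_assets, deg=1, noise_level=1.0,
                          num_factors=4, seed=135):
    """Return (cov, r) following the portfolio formulas for features x."""
    rng = np.random.default_rng(seed)
    n, p = x.shape
    tau = noise_level  # assumed to be the tau in the formulas
    # 0/1 encoding matrix B of shape (num_assets, p), assumed Bernoulli(0.5)
    B = rng.binomial(1, 0.5, size=(num_assets, p))
    # mean return: (0.05 / sqrt(p) * (B x_i)_j + 0.1^(1/deg))^deg
    r_bar = (0.05 / np.sqrt(p) * (x @ B.T) + 0.1 ** (1 / deg)) ** deg
    # factor loadings L ~ U(-0.0025 tau, 0.0025 tau); num_factors is hypothetical
    L = rng.uniform(-0.0025 * tau, 0.0025 * tau, size=(num_assets, num_factors))
    f = rng.normal(size=(n, num_factors))   # factors, standard normal
    eps = rng.normal(size=(n, num_assets))  # idiosyncratic noise, standard normal
    # expected return: r_bar + L f + 0.01 tau eps (one row per sample)
    r = r_bar + f @ L.T + 0.01 * tau * eps
    # covariance: L L^T + (0.01 tau)^2 I
    cov = L @ L.T + (0.01 * tau) ** 2 * np.eye(num_assets)
    return cov, r

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 4))
cov, r = sketch_portfolio_data(x, num_assets=50, deg=4, noise_level=1.0)
print(cov.shape, r.shape)  # (50, 50) (8, 50)
```

Because \((0.01 \tau)^2 \mathbf{I}\) is added to the positive semidefinite matrix \(\mathbf{L} \mathbf{L}^{\intercal}\), the covariance is symmetric positive definite whenever \(\tau \neq 0\).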

pyepo.data.portfolio.genData(num_data, num_features, num_assets, deg=1, noise_level=1, seed=135)

A function to generate synthetic data and features for portfolio optimization

Parameters:
  • num_data (int) – number of data points

  • num_features (int) – dimension of features

  • num_assets (int) – number of assets

  • deg (int) – data polynomial degree

  • noise_level (float) – level of data random noise

  • seed (int) – random seed

Returns:

covariance matrix (np.ndarray), data features (np.ndarray), expected returns (np.ndarray)

Return type:

tuple

The following code generates data for the portfolio problem:

import pyepo

num_data = 1000 # number of data
num_feat = 4 # size of feature
num_assets = 50 # number of assets
cov, x, r = pyepo.data.portfolio.genData(num_data, num_feat, num_assets, deg=4, noise_level=1, seed=135)

optDataset

pyepo.data.optDataset is a PyTorch Dataset that stores the features and their corresponding costs of the objective function, and solves optimization problems to get optimal solutions and optimal objective values.

optDataset is not necessary for training with PyEPO, but it makes it easier to obtain optimal solutions and objective values when they are not available in the original data.

class pyepo.data.dataset.optDataset(model, feats, costs)

This class is a Torch Dataset for optimization problems.

model

Optimization models

Type:

optModel

feats

Data features

Type:

np.ndarray

costs

Cost vectors

Type:

np.ndarray

sols

Optimal solutions

Type:

np.ndarray

objs

Optimal objective values

Type:

np.ndarray

A method to create an optDataset from an optModel

Parameters:
  • model (optModel) – an instance of optModel

  • feats (np.ndarray) – data features

  • costs (np.ndarray) – costs of objective function

As in the following example, optDataset and the PyTorch DataLoader wrap the data samples, which makes model training cleaner and more organized.

import pyepo
from torch.utils.data import DataLoader

# model for shortest path
grid = (5,5) # grid size
model = pyepo.model.grb.shortestPathModel(grid)

# generate data
num_data = 1000 # number of data
num_feat = 5 # size of feature
deg = 4 # polynomial degree
noise_width = 0 # noise width
x, c = pyepo.data.shortestpath.genData(num_data, num_feat, grid, deg, noise_width, seed=135)

# build dataset
dataset = pyepo.data.dataset.optDataset(model, x, c)

# get data loader
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)
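To show the idea behind such a dataset without requiring a solver, the sketch below pre-solves every instance once and serves the cached results by index. ToyOptDataset and pick_cheapest are hypothetical stand-ins written for this example; they are not PyEPO classes, and the order of the returned tuple (features, costs, solution, objective value) is an assumption.

```python
import numpy as np

class ToyOptDataset:
    """Hypothetical stand-in for an optimization dataset; not the PyEPO class."""

    def __init__(self, solve, feats, costs):
        self.feats = feats
        self.costs = costs
        # solve every instance once, up front, and cache the results
        results = [solve(c) for c in costs]
        self.sols = np.array([sol for sol, _ in results])
        self.objs = np.array([[obj] for _, obj in results])

    def __len__(self):
        return len(self.costs)

    def __getitem__(self, index):
        # assumed order: features, costs, optimal solution, optimal objective
        return (self.feats[index], self.costs[index],
                self.sols[index], self.objs[index])

def pick_cheapest(cost):
    """Toy 'solver': choose the single cheapest item (one-hot solution)."""
    sol = np.zeros_like(cost)
    sol[np.argmin(cost)] = 1
    return sol, float(cost @ sol)

feats = np.array([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]])
costs = np.array([[3.0, 1.0, 2.0], [0.5, 4.0, 2.5], [2.0, 2.5, 1.5]])
dataset = ToyOptDataset(pick_cheapest, feats, costs)

x, c, w, z = dataset[0]
print(w, z)  # [0. 1. 0.] [1.]
```

Solving once at construction time means the training loop only reads cached arrays, which is the convenience the dataset class is meant to provide.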

optDatasetKNN

pyepo.data.optDatasetKNN is a PyTorch Dataset designed for implementing k-nearest neighbors (kNN) robust loss [1] in decision-focused learning. It stores the features and their corresponding costs of the objective function and solves optimization problems to get mean kNN solutions and optimal objective values.

class pyepo.data.dataset.optDatasetKNN(model, feats, costs, k=10, weight=0.5)

This class is a Torch Dataset for optimization problems when using the robust kNN-loss.

Reference: <https://arxiv.org/abs/2310.04328>

model

Optimization models

Type:

optModel

k

number of nearest neighbours selected

Type:

int

weight

weight of kNN-loss

Type:

float

feats

Data features

Type:

np.ndarray

costs

Cost vectors

Type:

np.ndarray

sols

Optimal solutions

Type:

np.ndarray

objs

Optimal objective values

Type:

np.ndarray

A method to create an optDatasetKNN from an optModel

Parameters:
  • model (optModel) – an instance of optModel

  • feats (np.ndarray) – data features

  • costs (np.ndarray) – costs of objective function

  • k (int) – number of nearest neighbours selected

  • weight (float) – weight of kNN-loss

As in the following example, optDatasetKNN and the PyTorch DataLoader wrap the data samples, which makes model training cleaner and more organized.

import pyepo
from torch.utils.data import DataLoader

# model for shortest path
grid = (5,5) # grid size
model = pyepo.model.grb.shortestPathModel(grid)

# generate data
num_data = 1000 # number of data
num_feat = 5 # size of feature
deg = 4 # polynomial degree
noise_width = 0 # noise width
x, c = pyepo.data.shortestpath.genData(num_data, num_feat, grid, deg, noise_width, seed=135)

# build dataset
dataset = pyepo.data.dataset.optDatasetKNN(model, x, c, k=10, weight=0.5)

# get data loader
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)
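One way to picture "mean kNN solutions" is sketched below: find each sample's k nearest neighbours in feature space, solve a toy problem for every sample, and average the solutions over the neighbourhood. This is only an illustration of the concept under assumptions. The helpers knn_indices and pick_cheapest are hypothetical, the weight parameter is not modelled, and the actual computation in optDatasetKNN may differ.

```python
import numpy as np

def knn_indices(feats, k):
    """Return, for each sample, the indices of its k nearest samples."""
    # pairwise squared Euclidean distances, shape (n, n)
    diff = feats[:, None, :] - feats[None, :, :]
    dist2 = (diff ** 2).sum(axis=2)
    # sort each row and keep the k closest (a sample is its own nearest neighbour)
    return np.argsort(dist2, axis=1)[:, :k]

def pick_cheapest(cost):
    """Toy 'solver': one-hot vector selecting the cheapest item."""
    sol = np.zeros_like(cost)
    sol[np.argmin(cost)] = 1
    return sol

feats = np.array([[0.0], [0.1], [1.0], [1.2], [5.0]])
costs = np.array([[1.0, 4.0], [2.0, 3.0], [3.0, 2.0], [4.0, 1.0], [9.0, 9.0]])

idx = knn_indices(feats, k=2)                       # shape (5, 2)
sols = np.array([pick_cheapest(c) for c in costs])  # shape (5, 2)
mean_knn_sols = sols[idx].mean(axis=1)              # average over each neighbourhood

print(mean_knn_sols[4])  # [0.5 0.5]
```

The last sample sits between neighbours that prefer different items, so its averaged solution is fractional, unlike a single optimal solution.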