Data
pyepo.data
contains synthetic data generators and a dataset class optDataset
to wrap data samples.
For more details about the dataset, please see 02 Optimization Dataset.
Data Generator
pyepo.data
includes synthetic datasets for four classic optimization problems: the shortest path problem, the multi-dimensional knapsack problem, the traveling salesperson problem, and portfolio optimization.
The synthetic datasets include features \(\mathbf{x}\) and cost coefficients \(\mathbf{c}\). The feature vector \(\mathbf{x}_i \in \mathbb{R}^p\) follows a standard multivariate Gaussian distribution \(\mathcal{N}(0, \mathbf{I})\), and the corresponding cost \(\mathbf{c}_i \in \mathbb{R}^d\) comes from a polynomial function \(f(\mathbf{x}_i)\) multiplied by random noise \(\mathbf{\epsilon}_i \sim U(1-\bar{\epsilon}, 1+\bar{\epsilon})\). Users can control the following parameters:
num_data (\(n\)): data size
num_features (\(p\)): dimension of the features \(\mathbf{x}\)
deg (\(deg\)): polynomial degree of function \(f(\mathbf{x}_i)\)
noise_width (\(\bar{\epsilon}\)): noise half-width of \(\mathbf{\epsilon}\)
seed: random state seed to generate data
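As a minimal sketch of the sampling scheme described above, the snippet below draws one feature vector and one multiplicative noise vector using only the Python standard library. It is not PyEPO's implementation; the helper names `sample_features` and `sample_noise` and the cost dimension `d=40` are hypothetical choices for illustration.

```python
import random


def sample_features(p, rng):
    """Draw x_i ~ N(0, I) as a list of p independent standard normals."""
    return [rng.gauss(0.0, 1.0) for _ in range(p)]


def sample_noise(d, noise_width, rng):
    """Draw d noise entries, each uniform on [1 - noise_width, 1 + noise_width]."""
    return [rng.uniform(1.0 - noise_width, 1.0 + noise_width) for _ in range(d)]


rng = random.Random(135)  # fixed seed so the draw is reproducible
x_i = sample_features(p=5, rng=rng)
eps_i = sample_noise(d=40, noise_width=0.5, rng=rng)  # d=40 is illustrative only
```

With noise_width set to 0, every noise entry equals 1, so the costs become a deterministic function of the features.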
Shortest Path
For the shortest path, a random matrix \(\mathcal{B} \in \mathbb{R}^{d \times p}\), whose entries follow a Bernoulli distribution with probability \(0.5\), encodes the features \(\mathbf{x}_i\). The cost of the objective function \(c_i^j\) is generated as \(c_i^j = [\frac{1}{{3.5}^{deg}} (\frac{1}{\sqrt{p}}(\mathcal{B} \mathbf{x}_i)_j + 3)^{deg} + 1] \cdot \epsilon_i^j\).
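The cost formula above can be evaluated by hand as in the following sketch. It uses plain Python lists rather than NumPy arrays, and the function name `shortest_path_cost` and the cost dimension `d = 40` are assumptions made for illustration, not part of the PyEPO API.

```python
import math
import random


def shortest_path_cost(x, B, eps, deg):
    """Evaluate c^j = [((B x)_j / sqrt(p) + 3)^deg / 3.5^deg + 1] * eps_j for all j."""
    p = len(x)
    costs = []
    for B_j, eps_j in zip(B, eps):
        z = sum(b * x_l for b, x_l in zip(B_j, x)) / math.sqrt(p)
        costs.append(((z + 3.0) ** deg / 3.5 ** deg + 1.0) * eps_j)
    return costs


rng = random.Random(135)
p, d, deg = 5, 40, 4  # d = 40 is an illustrative cost dimension
B = [[rng.randint(0, 1) for _ in range(p)] for _ in range(d)]  # Bernoulli(0.5) entries
x = [rng.gauss(0.0, 1.0) for _ in range(p)]                    # features ~ N(0, I)
eps = [1.0] * d                                                # noise_width = 0
c = shortest_path_cost(x, B, eps, deg)
```

For an even degree and zero noise width, every cost is at least 1, because the polynomial term is non-negative before the constant 1 is added.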
- pyepo.data.shortestpath.genData(num_data, num_features, grid, deg=1, noise_width=0, seed=135)
A function to generate synthetic data and features for shortest path
- Parameters:
num_data (int) – number of data points
num_features (int) – dimension of features
grid (int, int) – size of grid network
deg (int) – data polynomial degree
noise_width (float) – half width of data random noise
seed (int) – random seed
- Returns:
data features (np.ndarray), costs (np.ndarray)
- Return type:
tuple
The following code generates data for the shortest path problem on a grid network:
import pyepo
num_data = 1000 # number of data
num_feat = 5 # size of feature
grid = (5,5) # grid size
x, c = pyepo.data.shortestpath.genData(num_data, num_feat, grid, deg=4, noise_width=0, seed=135)
Knapsack
Because we assume that uncertain coefficients appear only in the objective function, the item weights are fixed across the dataset. Let \(m\) be the number of items and \(k\) the number of resource dimensions. The weights \(\mathcal{W} \in \mathbb{R}^{k \times m}\) are sampled from \(3\) to \(8\) with a precision of \(1\) decimal place. With the same \(\mathcal{B}\), the cost \(c_i^j\) is calculated as \(c_i^j = \lceil [\frac{5}{{3.5}^{deg}} (\frac{1}{\sqrt{p}}(\mathcal{B} \mathbf{x}_i)_j + 3)^{deg} + 1] \cdot \epsilon_i^j \rceil\).
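The weight sampling and the rounded-up cost formula above can be sketched as follows. This is an illustration under stated assumptions, not the library code: the helper names `knapsack_weights` and `knapsack_cost` are hypothetical, and the weights are drawn uniformly and rounded to one decimal place to match the description.

```python
import math
import random


def knapsack_weights(num_items, dim, rng):
    """Sample a dim x num_items weight matrix with entries in [3, 8], one decimal place."""
    return [[round(rng.uniform(3.0, 8.0), 1) for _ in range(num_items)]
            for _ in range(dim)]


def knapsack_cost(x, B, eps, deg):
    """Evaluate c^j = ceil([5 * ((B x)_j / sqrt(p) + 3)^deg / 3.5^deg + 1] * eps_j)."""
    p = len(x)
    costs = []
    for B_j, eps_j in zip(B, eps):
        z = sum(b * x_l for b, x_l in zip(B_j, x)) / math.sqrt(p)
        costs.append(math.ceil((5.0 * (z + 3.0) ** deg / 3.5 ** deg + 1.0) * eps_j))
    return costs


rng = random.Random(135)
p, m, k, deg = 5, 32, 3, 4                                     # features, items, resources, degree
W = knapsack_weights(num_items=m, dim=k, rng=rng)              # fixed for the whole dataset
B = [[rng.randint(0, 1) for _ in range(p)] for _ in range(m)]  # Bernoulli(0.5) entries
x = [rng.gauss(0.0, 1.0) for _ in range(p)]
c = knapsack_cost(x, B, [1.0] * m, deg)                        # noise_width = 0
```

The ceiling makes every cost an integer, and the weights depend only on the random generator, not on the features.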
- pyepo.data.knapsack.genData(num_data, num_features, num_items, dim=1, deg=1, noise_width=0, seed=135)
A function to generate synthetic data and features for knapsack
- Parameters:
num_data (int) – number of data points
num_features (int) – dimension of features
num_items (int) – number of items
dim (int) – dimension of multi-dimensional knapsack
deg (int) – data polynomial degree
noise_width (float) – half width of data random noise
seed (int) – random state seed
- Returns:
weights of items (np.ndarray), data features (np.ndarray), costs (np.ndarray)
- Return type:
tuple
The following code generates data for a 3-dimensional knapsack:
import pyepo
num_data = 1000 # number of data
num_feat = 5 # size of feature
num_item = 32 # number of items
dim = 3 # dimension of knapsack
weights, x, c = pyepo.data.knapsack.genData(num_data, num_feat, num_item, dim, deg=4, noise_width=0, seed=135)
Traveling Salesperson
The distance consists of two parts: one comes from the Euclidean distance, and the other is derived from feature encoding. For the Euclidean distance, we create coordinates from a mixture of the Gaussian distribution \(\mathcal{N}(0, I)\) and the uniform distribution \(\textbf{U}(-2, 2)\). The feature-encoding part is \(\frac{1}{{3}^{deg - 1}} (\frac{1}{\sqrt{p}} (\mathcal{B} \mathbf{x}_i)_j + 3)^{deg} \cdot \epsilon_i^j\), where the elements of \(\mathcal{B}\) come from the multiplication of Bernoulli \(\textbf{B}(0.5)\) and uniform \(\textbf{U}(-2, 2)\).
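The two-part distance described above can be sketched as below. Several details are assumptions made for illustration only: half of the node coordinates are drawn from each distribution, the two parts are simply added, costs are indexed by undirected edges \((j, k)\) with \(j < k\), and one noise value is used per edge. The function name `tsp_edge_costs` is hypothetical, and the library may scale the terms differently.

```python
import math
import random


def tsp_edge_costs(x, coords, B, eps, deg):
    """Cost of each undirected edge (j, k), j < k: Euclidean part plus feature part."""
    p = len(x)
    m = len(coords)
    edges = [(j, k) for j in range(m) for k in range(j + 1, m)]
    costs = []
    for (j, k), B_e, eps_e in zip(edges, B, eps):
        dist = math.dist(coords[j], coords[k])                 # Euclidean part
        z = sum(b * x_l for b, x_l in zip(B_e, x)) / math.sqrt(p)
        feat = (z + 3.0) ** deg / 3.0 ** (deg - 1) * eps_e     # feature-encoding part
        costs.append(dist + feat)
    return costs


rng = random.Random(135)
p, m, deg = 5, 6, 4
# Assumption: half of the nodes are uniform on [-2, 2]^2, the rest standard Gaussian.
coords = [(rng.uniform(-2, 2), rng.uniform(-2, 2)) for _ in range(m // 2)]
coords += [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(m - m // 2)]
num_edges = m * (m - 1) // 2
# Each entry of B is a Bernoulli(0.5) draw times a uniform draw on [-2, 2].
B = [[rng.randint(0, 1) * rng.uniform(-2, 2) for _ in range(p)]
     for _ in range(num_edges)]
x = [rng.gauss(0.0, 1.0) for _ in range(p)]
c = tsp_edge_costs(x, coords, B, [1.0] * num_edges, deg)       # noise_width = 0
```

The Euclidean part is shared by all data points, while the feature part changes with each feature vector.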
- pyepo.data.tsp.genData(num_data, num_features, num_nodes, deg=1, noise_width=0, seed=135)
A function to generate synthetic data and features for the traveling salesperson problem
- Parameters:
num_data (int) – number of data points
num_features (int) – dimension of features
num_nodes (int) – number of nodes
deg (int) – data polynomial degree
noise_width (float) – half width of data random noise
seed (int) – random seed
- Returns:
data features (np.ndarray), costs (np.ndarray)
- Return type:
tuple
The following code generates data for the traveling salesperson problem:
import pyepo
num_data = 1000 # number of data
num_feat = 5 # size of feature
num_node = 20 # number of nodes
x, c = pyepo.data.tsp.genData(num_data, num_feat, num_node, deg=4, noise_width=0, seed=135)
Portfolio
Let \(\bar{r}_{ij} = (\frac{0.05}{\sqrt{p}}(\mathcal{B} \mathbf{x}_i)_j + {0.1}^{\frac{1}{deg}})^{deg}\) be the \(j\)-th entry of \(\bar{\mathbf{r}}_i\). In the context of portfolio optimization, the expected return of the assets \(\mathbf{r}_i\) is defined as \(\bar{\mathbf{r}}_i + \mathbf{L} \mathbf{f} + 0.01 \tau \mathbf{\epsilon}\) and the covariance matrix \(\mathbf{\Sigma}\) is expressed as \(\mathbf{L} \mathbf{L}^{\intercal} + (0.01 \tau)^2 \mathbf{I}\), where \(\tau\) is the noise level (noise_level), the entries of \(\mathcal{B}\) follow a Bernoulli distribution, the entries of \(\mathbf{L}\) follow a uniform distribution between \(-0.0025 \tau\) and \(0.0025 \tau\), and \(\mathbf{f}\) and \(\mathbf{\epsilon}\) follow the standard normal distribution.
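The expressions for the expected return and the covariance matrix can be sketched as follows. This is an illustration, not the library code: the function names, the number of factors `q = 4`, the Bernoulli probability of 0.5, and the reading of \(\tau\) as the noise level are all assumptions.

```python
import math
import random


def portfolio_cov(L, tau):
    """Sigma = L L^T + (0.01 * tau)^2 * I for an m x q loading matrix L."""
    m = len(L)
    diag = (0.01 * tau) ** 2
    return [[sum(a * b for a, b in zip(L[j], L[k])) + (diag if j == k else 0.0)
             for k in range(m)] for j in range(m)]


def portfolio_returns(x, B, L, tau, deg, rng):
    """r = r_bar + L f + 0.01 * tau * eps for one feature vector x."""
    p = len(x)
    f = [rng.gauss(0.0, 1.0) for _ in range(len(L[0]))]        # factors shared by all assets
    returns = []
    for B_j, L_j in zip(B, L):
        z = sum(b * x_l for b, x_l in zip(B_j, x))
        r_bar = (0.05 / math.sqrt(p) * z + 0.1 ** (1.0 / deg)) ** deg
        factor_part = sum(l * f_q for l, f_q in zip(L_j, f))
        returns.append(r_bar + factor_part + 0.01 * tau * rng.gauss(0.0, 1.0))
    return returns


rng = random.Random(135)
p, m, q, deg, tau = 4, 50, 4, 4, 1.0   # q (number of factors) is a hypothetical choice
B = [[rng.randint(0, 1) for _ in range(p)] for _ in range(m)]
L = [[rng.uniform(-0.0025 * tau, 0.0025 * tau) for _ in range(q)] for _ in range(m)]
x = [rng.gauss(0.0, 1.0) for _ in range(p)]
Sigma = portfolio_cov(L, tau)
r = portfolio_returns(x, B, L, tau, deg, rng)
```

With zero features and zero noise level, each return reduces to \(({0.1}^{1/deg})^{deg} = 0.1\), which is a quick sanity check on the formula.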
- pyepo.data.portfolio.genData(num_data, num_features, num_assets, deg=1, noise_level=1, seed=135)
A function to generate synthetic data and features for portfolio optimization
- Parameters:
num_data (int) – number of data points
num_features (int) – dimension of features
num_assets (int) – number of assets
deg (int) – data polynomial degree
noise_level (float) – level of data random noise
seed (int) – random seed
- Returns:
covariance matrix (np.ndarray), data features (np.ndarray), expected returns (np.ndarray)
- Return type:
tuple
The following code generates data for portfolio optimization:
import pyepo
num_data = 1000 # number of data
num_feat = 4 # size of feature
num_assets = 50 # number of assets
cov, x, r = pyepo.data.portfolio.genData(num_data, num_feat, num_assets, deg=4, noise_level=1, seed=135)
optDataset
pyepo.data.optDataset
is a PyTorch Dataset that stores the features and their corresponding objective-function costs, and solves the optimization problems to obtain optimal solutions and optimal objective values.
optDataset
is not required for training with PyEPO, but it makes it easier to obtain optimal solutions and objective values when they are not available in the original data.
- class pyepo.data.dataset.optDataset(model, feats, costs)
This class is a PyTorch Dataset for optimization problems.
- model
Optimization model
- Type:
optModel
- feats
Data features
- Type:
np.ndarray
- costs
Cost vectors
- Type:
np.ndarray
- sols
Optimal solutions
- Type:
np.ndarray
- objs
Optimal objective values
- Type:
np.ndarray
A method to create an optDataset from an optModel
- Parameters:
model (optModel) – an instance of optModel
feats (np.ndarray) – data features
costs (np.ndarray) – costs of objective function
As in the following example, optDataset
and the PyTorch DataLoader
wrap the data samples, which keeps model training cleaner and more organized.
import pyepo
from torch.utils.data import DataLoader
# model for shortest path
grid = (5,5) # grid size
model = pyepo.model.grb.shortestPathModel(grid)
# generate data
num_data = 1000 # number of data
num_feat = 5 # size of feature
deg = 4 # polynomial degree
noise_width = 0 # noise width
x, c = pyepo.data.shortestpath.genData(num_data, num_feat, grid, deg, noise_width, seed=135)
# build dataset
dataset = pyepo.data.dataset.optDataset(model, x, c)
# get data loader
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)
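To make the role of the stored attributes (feats, costs, sols, objs) concrete, here is a dependency-free sketch of a dataset-like container that solves one problem per cost vector when it is built. It is not PyEPO's implementation: the class `ToyOptDataset`, the stand-in solver `pick_cheapest`, and the order of the items returned by indexing are hypothetical, and the real class works with an optModel and PyTorch tensors.

```python
class ToyOptDataset:
    """Dataset-like container that precomputes optimal solutions and objective values."""

    def __init__(self, solve, feats, costs):
        self.feats = feats
        self.costs = costs
        self.sols, self.objs = [], []
        for c in costs:
            sol, obj = solve(c)  # solve one instance per cost vector
            self.sols.append(sol)
            self.objs.append(obj)

    def __len__(self):
        return len(self.costs)

    def __getitem__(self, index):
        return (self.feats[index], self.costs[index],
                self.sols[index], self.objs[index])


def pick_cheapest(c):
    """Hypothetical stand-in solver: choose exactly one item with minimum cost."""
    j = min(range(len(c)), key=c.__getitem__)
    sol = [1 if i == j else 0 for i in range(len(c))]
    return sol, c[j]


feats = [[0.1, -0.3], [1.2, 0.4]]
costs = [[4.0, 2.0, 3.0], [1.0, 5.0, 2.5]]
dataset = ToyOptDataset(pick_cheapest, feats, costs)
x, c, w, z = dataset[0]  # features, costs, optimal solution, optimal objective value
```

Solving once at construction time means the training loop can read solutions and objective values by index instead of calling the solver again for the true costs.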
optDatasetKNN
pyepo.data.optDatasetKNN
is a PyTorch Dataset designed for implementing k-nearest neighbors (kNN) robust loss [1] in decision-focused learning. It stores the features and their corresponding costs of the objective function and solves optimization problems to get mean kNN solutions and optimal objective values.
- class pyepo.data.dataset.optDatasetKNN(model, feats, costs, k=10, weight=0.5)
This class is a PyTorch Dataset for optimization problems when using the robust kNN loss.
Reference: <https://arxiv.org/abs/2310.04328>
- model
Optimization model
- Type:
optModel
- k
number of nearest neighbours selected
- Type:
int
- weight
weight of kNN-loss
- Type:
float
- feats
Data features
- Type:
np.ndarray
- costs
Cost vectors
- Type:
np.ndarray
- sols
Optimal solutions
- Type:
np.ndarray
- objs
Optimal objective values
- Type:
np.ndarray
A method to create an optDatasetKNN from an optModel
- Parameters:
model (optModel) – an instance of optModel
feats (np.ndarray) – data features
costs (np.ndarray) – costs of objective function
k (int) – number of nearest neighbours selected
weight (float) – weight of kNN-loss
As in the following example, optDatasetKNN
and the PyTorch DataLoader
wrap the data samples, which keeps model training cleaner and more organized.
import pyepo
from torch.utils.data import DataLoader
# model for shortest path
grid = (5,5) # grid size
model = pyepo.model.grb.shortestPathModel(grid)
# generate data
num_data = 1000 # number of data
num_feat = 5 # size of feature
deg = 4 # polynomial degree
noise_width = 0 # noise width
x, c = pyepo.data.shortestpath.genData(num_data, num_feat, grid, deg, noise_width, seed=135)
# build dataset
dataset = pyepo.data.dataset.optDatasetKNN(model, x, c, k=10, weight=0.5)
# get data loader
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)
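One plausible reading of the k and weight parameters is sketched below: for each sample, find its k nearest neighbours in feature space, blend its own cost vector with each neighbour's cost vector, and average the solutions obtained for the blended costs. The interpolation rule, the inclusion of the sample itself among its neighbours, and the names `knn_costs` and `mean_solution` are assumptions for illustration; the referenced paper gives the exact definition.

```python
import math


def knn_costs(feats, costs, k, weight):
    """Blend each sample's costs with those of its k nearest neighbours in feature space."""
    blended = []
    for i, x_i in enumerate(feats):
        # Sort all samples by Euclidean distance to x_i; the sample itself comes first.
        order = sorted(range(len(feats)), key=lambda j: math.dist(x_i, feats[j]))
        neighbours = order[:k]
        blended.append([
            [weight * a + (1.0 - weight) * b for a, b in zip(costs[i], costs[j])]
            for j in neighbours
        ])
    return blended


def mean_solution(sols):
    """Average a list of solution vectors element-wise."""
    return [sum(col) / len(sols) for col in zip(*sols)]


feats = [[0.0], [1.0], [10.0]]
costs = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
blended = knn_costs(feats, costs, k=2, weight=0.5)
```

In this sketch a weight of 1 reproduces each sample's own costs k times, while smaller weights pull the costs toward those of nearby samples.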