Training Methods +++++++++++++++++ Overview ======== ``pyepo.func`` provides PyTorch autograd modules that wrap an optimization solver for end-to-end training. All modules assume a linear objective with known, fixed constraints; the cost vector is predicted from contextual features. Every module accepts ``processes`` for parallel solving, and all except ``CaVE`` accept ``solve_ratio < 1`` with ``dataset`` for solution-pool caching (see :doc:`../advanced/pool`); modules that return a loss also accept ``reduction`` (``"mean"``, ``"sum"``, or ``"none"``). Each method below is presented with its definition followed by a runnable training loop; all loops build on the shared **Common Setup**. Choosing a Method ================= The modules differ in what they return, which determines how you use them: * **Loss-returning**: return a scalar loss, passed directly to ``.backward()``: SPO+, PG, PFYL, RFYL, CaVE, NCE, CMAP, LTR. * **Solution-returning**: return a predicted, expected, or regularized solution, on which you define a task loss: DPO, I-MLE, AI-MLE, RFWO, DBB, NID. A combined name like ``DPO+MSE`` or ``NID+L1`` denotes a solution-returning module (DPO, NID) followed by a task loss (here MSE or L1) on its output. The table below summarizes each module's return type, typical supervision, and notes. .. list-table:: :header-rows: 1 :widths: 18 28 28 26 * - Module - Returns - Typical supervision - Notes * - ``SPOPlus`` - loss - true costs, true optimal solutions, true objective values - convex surrogate for regret * - ``PG`` - loss - true costs - directional finite differences along the true cost * - ``DPO`` / ``DPOMul`` - expected perturbed solutions - task loss chosen by the user - use the multiplicative variant for sign-sensitive oracles * - ``PFY`` / ``PFYMul`` - loss - true optimal solutions - PFYL; use the multiplicative variant for sign-sensitive oracles * - ``IMLE`` / ``AIMLE`` - expected perturbed solutions - task loss chosen by the user - perturb-and-MAP with Sum-of-Gamma noise * - ``RFWO`` - regularized predicted solutions - task loss chosen by the user - L2-regularized smooth optimizer over the convex hull of feasible solutions * - ``RFY`` - loss - true optimal solutions - RFYL; Fenchel-Young loss paired with L2-regularized Frank-Wolfe * - ``DBB`` / ``NID`` - predicted solutions - task loss chosen by the user - direct solution-level or objective-level losses * - ``CaVE`` - loss - tight binding-constraint normals at the true optimum (``optDatasetConstrs``) - binary linear programs, Gurobi backend only * - ``NCE`` / ``CMAP`` - loss - true optimal solutions and a solution pool - contrastive training with cached negative solutions * - ``ptLTR`` / ``prLTR`` / ``lsLTR`` - loss - true costs and a solution pool - learning-to-rank formulations over feasible solutions Common Setup ============ All training loops below share the same setup: a DSL-defined knapsack problem with a linear predictor. The examples use PyTorch; JAX follows the same method families, see :doc:`../frontends/jax`. For a runnable walkthrough, see the `03 Training and Testing `_ notebook. .. code-block:: python import pyepo from pyepo import EPO, dsl import torch from torch import nn from torch.utils.data import DataLoader # generate data num_data, num_feat, num_item, num_dim = 1000, 5, 10, 3 weights, feat, costs = pyepo.data.knapsack.genData( num_data, num_feat, num_item, num_dim, deg=4, noise_width=0.5, seed=135, ) capacity = (weights.sum(axis=1) * 0.5).astype(int) # define the optimization problem x = dsl.Variable(num_item, vtype=EPO.BINARY) c = dsl.Parameter(num_item) optmodel = dsl.Problem( dsl.Maximize(c @ x), [weights @ x <= capacity], ).compile(backend="gurobi") # dataset and data loader dataset = pyepo.data.dataset.optDataset(optmodel, feat, costs) dataloader = DataLoader(dataset, batch_size=32, shuffle=True) # prediction model predmodel = nn.Linear(num_feat, num_item) optimizer = torch.optim.Adam(predmodel.parameters(), lr=1e-3) # for multiplicative perturbed variants positive_predmodel = nn.Sequential(nn.Linear(num_feat, num_item), nn.Softplus()) Surrogate Losses ================ Surrogate losses replace the non-differentiable regret with a differentiable training objective. Smart Predict-then-Optimize+ Loss (SPO+) ---------------------------------------- SPO+ [#f1]_ is a convex surrogate for the SPO loss (regret), the decision error of the downstream optimization. The derivation starts from the observation that for any :math:`\alpha \geq 0`, .. math:: \mathrm{regret}(\hat{\mathbf{c}}, \mathbf{c}) \leq \max_{\mathbf{w} \in \mathcal{S}} \big\{ \mathbf{c}^\top \mathbf{w} - \alpha\, \hat{\mathbf{c}}^\top \mathbf{w} \big\} + z^*(\alpha \hat{\mathbf{c}}) - z^*(\mathbf{c}). Setting :math:`\alpha = 2` and bounding :math:`z^*(2 \hat{\mathbf{c}}) \leq 2 \hat{\mathbf{c}}^\top \mathbf{w}^*(\mathbf{c})` via the concavity of :math:`z^*` gives the SPO+ loss, .. math:: \mathcal{L}_{\mathrm{SPO+}}(\hat{\mathbf{c}}, \mathbf{c}) = -z^*(2 \hat{\mathbf{c}} - \mathbf{c}) + 2\, \hat{\mathbf{c}}^\top \mathbf{w}^*(\mathbf{c}) - z^*(\mathbf{c}), which is convex in :math:`\hat{\mathbf{c}}` and upper-bounds the regret. By Danskin's theorem, a subgradient is .. math:: 2 \big( \mathbf{w}^*(\mathbf{c}) - \mathbf{w}^*(2 \hat{\mathbf{c}} - \mathbf{c}) \big) \in \frac{\partial \mathcal{L}_{\mathrm{SPO+}}(\hat{\mathbf{c}}, \mathbf{c})}{\partial \hat{\mathbf{c}}}. .. autoclass:: pyepo.func.SPOPlus :noindex: :members: ``processes`` sets the number of solver processes (``0`` uses all available cores). Training loop: .. code-block:: python spo = pyepo.func.SPOPlus(optmodel, processes=2) for epoch in range(20): for x, c, w, z in dataloader: cp = predmodel(x) loss = spo(cp, c, w, z) optimizer.zero_grad() loss.backward() optimizer.step() Perturbation Gradient (PG) -------------------------- PG [#f11]_ is a surrogate loss based on zeroth-order gradient approximation. By Danskin's theorem, regret can be written as the directional derivative of :math:`z^*(\hat{\mathbf{c}})` along :math:`\mathbf{c}` (up to a constant independent of :math:`\hat{\mathbf{c}}`). PG approximates this directional derivative with finite differences. Two variants are available, backward (``two_sides=False``) and central (``two_sides=True``): .. math:: \mathcal{L}_{\mathrm{PG}}^{\mathrm{back}}(\hat{\mathbf{c}}, \mathbf{c}) &\approx \frac{z^*(\hat{\mathbf{c}}) - z^*(\hat{\mathbf{c}} - \sigma \mathbf{c})}{\sigma}, \\ \mathcal{L}_{\mathrm{PG}}^{\mathrm{cent}}(\hat{\mathbf{c}}, \mathbf{c}) &\approx \frac{z^*(\hat{\mathbf{c}} + \sigma \mathbf{c}) - z^*(\hat{\mathbf{c}} - \sigma \mathbf{c})}{2 \sigma}, where :math:`\sigma > 0` is the finite-difference width (the ``sigma`` parameter). The corresponding gradient estimate is .. math:: \frac{\partial \mathcal{L}_{\mathrm{PG}}^{\mathrm{back}}(\hat{\mathbf{c}}, \mathbf{c})}{\partial \hat{\mathbf{c}}} \approx \frac{\mathbf{w}^*(\hat{\mathbf{c}}) - \mathbf{w}^*(\hat{\mathbf{c}} - \sigma \mathbf{c})}{\sigma}. .. autoclass:: pyepo.func.PG :noindex: :members: ``sigma`` controls the finite-difference width. ``two_sides`` enables central differencing. Training loop: .. code-block:: python pg = pyepo.func.PG(optmodel, sigma=0.1, two_sides=False, processes=2) for epoch in range(20): for x, c, w, z in dataloader: cp = predmodel(x) loss = pg(cp, c) optimizer.zero_grad() loss.backward() optimizer.step() Perturbed Methods ================= Perturbed methods estimate gradients by Monte Carlo averaging over random perturbations of the cost vector. They share two design knobs: ``n_samples`` (number of Monte Carlo samples) and ``sigma`` (perturbation amplitude). Differentiable Perturbed Optimizer (DPO) ---------------------------------------- DPO [#f5]_ uses Monte Carlo sampling to estimate solutions by optimizing randomly perturbed costs. Its custom backward pass provides a gradient estimator for end-to-end training. ``DPO`` is the additive Gaussian version; ``DPOMul`` is the multiplicative version for sign-sensitive oracles [#f6]_. The multiplicative variant assumes predicted costs already have the intended nonzero sign. For nonnegative-cost problems, use a positive-output predictor such as ``nn.Softplus()`` plus a small epsilon. DPO replaces the piecewise-constant solution map :math:`\hat{\mathbf{c}} \mapsto \mathbf{w}^*(\hat{\mathbf{c}})` with the expectation over a random perturbation, .. math:: \mathbb{E}_{\boldsymbol{\xi}} \big[ \mathbf{w}^*(\hat{\mathbf{c}} + \sigma \boldsymbol{\xi}) \big] \approx \frac{1}{K} \sum_{\kappa=1}^{K} \mathbf{w}^*(\hat{\mathbf{c}} + \sigma \boldsymbol{\xi}_\kappa), where :math:`\boldsymbol{\xi} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})` and :math:`K` is ``n_samples``. The multiplicative variant ``DPOMul`` perturbs each entry of :math:`\hat{\mathbf{c}}` as .. math:: \hat{\mathbf{c}} \odot \exp\!\big(\sigma \boldsymbol{\xi} - \tfrac{1}{2} \sigma^2\big), with :math:`\odot` the Hadamard product. The exponential factor preserves the sign of each cost entry, which is required when the solver expects, e.g., nonnegative edge costs. .. autoclass:: pyepo.func.DPO :noindex: :members: .. autoclass:: pyepo.func.DPOMul :noindex: :members: Training loop (swap ``predmodel`` for ``positive_predmodel`` when using ``DPOMul``): .. code-block:: python dpo = pyepo.func.DPO(optmodel, n_samples=10, sigma=0.5, processes=2) # if using DPOMul, replace predmodel with positive_predmodel below # dpo = pyepo.func.DPOMul(optmodel, n_samples=10, sigma=0.5, processes=2) criterion = nn.MSELoss() for epoch in range(20): for x, c, w, z in dataloader: cp = predmodel(x) we = dpo(cp) loss = criterion(we, w) optimizer.zero_grad() loss.backward() optimizer.step() Perturbed Fenchel-Young Loss (PFYL) ----------------------------------- PFYL [#f5]_ uses the same perturbed expected solution as DPO inside a Fenchel-Young loss, comparing it directly with the true optimal solution. It returns a scalar loss. ``PFY`` is the additive Gaussian version; ``PFYMul`` is the multiplicative sign-preserving variant with the same sign convention as ``DPOMul``. Let :math:`F(\hat{\mathbf{c}}) = \mathbb{E}_{\boldsymbol{\xi}}\big[ \min_{\mathbf{w} \in \mathcal{S}} (\hat{\mathbf{c}} + \sigma \boldsymbol{\xi})^\top \mathbf{w} \big]` be the expected perturbed minimum, and let :math:`\Omega(\mathbf{w}^*(\mathbf{c}))` be its Fenchel conjugate. The perturbed Fenchel-Young loss is .. math:: \mathcal{L}_{\mathrm{PFYL}}(\hat{\mathbf{c}}, \mathbf{w}^*(\mathbf{c})) = \hat{\mathbf{c}}^\top \mathbf{w}^*(\mathbf{c}) - F(\hat{\mathbf{c}}) - \Omega(\mathbf{w}^*(\mathbf{c})). The dual term :math:`\Omega(\mathbf{w}^*(\mathbf{c}))` does not depend on :math:`\hat{\mathbf{c}}`, so the gradient collapses to a simple difference of solutions, .. math:: \frac{\partial \mathcal{L}_{\mathrm{PFYL}}(\hat{\mathbf{c}}, \mathbf{w}^*(\mathbf{c}))}{\partial \hat{\mathbf{c}}} = \mathbf{w}^*(\mathbf{c}) - \mathbb{E}_{\boldsymbol{\xi}} \big[\mathbf{w}^*(\hat{\mathbf{c}} + \sigma \boldsymbol{\xi})\big] \approx \mathbf{w}^*(\mathbf{c}) - \frac{1}{K} \sum_{\kappa=1}^{K} \mathbf{w}^*(\hat{\mathbf{c}} + \sigma \boldsymbol{\xi}_\kappa). The multiplicative variant uses the same sign-preserving perturbation, .. math:: \hat{\mathbf{c}} \odot \exp\!\big(\sigma \boldsymbol{\xi} - \tfrac{1}{2} \sigma^2\big). .. autoclass:: pyepo.func.PFY :noindex: :members: .. autoclass:: pyepo.func.PFYMul :noindex: :members: Training loop (swap ``predmodel`` for ``positive_predmodel`` when using ``PFYMul``): .. code-block:: python pfy = pyepo.func.PFY(optmodel, n_samples=10, sigma=0.5, processes=2) # if using PFYMul, replace predmodel with positive_predmodel below # pfy = pyepo.func.PFYMul(optmodel, n_samples=10, sigma=0.5, processes=2) for epoch in range(20): for x, c, w, z in dataloader: cp = predmodel(x) loss = pfy(cp, w) optimizer.zero_grad() loss.backward() optimizer.step() Implicit Maximum Likelihood Estimator (I-MLE) --------------------------------------------- I-MLE [#f9]_ uses the perturb-and-MAP framework, sampling noise from a Sum-of-Gamma distribution and interpolating the loss function to approximate finite differences. ``lambd`` controls the interpolation step. I-MLE is framed as imitation learning: bring the model distribution :math:`p(\mathbf{w} \mid \hat{\mathbf{c}})` closer to a target distribution :math:`q(\mathbf{w} \mid \hat{\mathbf{c}})` by minimizing their KL divergence. An upstream task gradient :math:`\mathbf{d} = \nabla_{\mathbf{w}} \mathcal{L}(\hat{\mathbf{c}}, \cdot) \big|_{\mathbf{w} = \mathbf{w}^*(\hat{\mathbf{c}})}` induces a virtual update :math:`\hat{\mathbf{c}}' = \hat{\mathbf{c}} + \lambda \mathbf{d}`, and the gradient is estimated by a directional finite difference between the smoothed solutions at :math:`\hat{\mathbf{c}}'` and :math:`\hat{\mathbf{c}}`: .. math:: \frac{\partial \mathcal{L}_{\mathrm{IMLE}}(\hat{\mathbf{c}}, \cdot)}{\partial \hat{\mathbf{c}}} \approx \frac{1}{K \lambda} \sum_{\kappa=1}^{K} \Big( \mathbf{w}^*(\hat{\mathbf{c}} + \lambda \mathbf{d} + \sigma \boldsymbol{\xi}_\kappa) - \mathbf{w}^*(\hat{\mathbf{c}} + \sigma \boldsymbol{\xi}_\kappa) \Big), where :math:`\sigma` smooths the solution map and :math:`\lambda` (``lambd``) is the finite-difference step. The same noise realization :math:`\boldsymbol{\xi}_\kappa` is shared across the two perturbed evaluations to reduce variance. .. autoclass:: pyepo.func.IMLE :noindex: :members: Training loop: .. code-block:: python imle = pyepo.func.IMLE(optmodel, n_samples=10, sigma=1.0, lambd=10, processes=2) criterion = nn.L1Loss() for epoch in range(20): for x, c, w, z in dataloader: cp = predmodel(x) we = imle(cp) loss = criterion(we, w) optimizer.zero_grad() loss.backward() optimizer.step() Adaptive Implicit Maximum Likelihood Estimator (AI-MLE) ------------------------------------------------------- AI-MLE [#f10]_ extends I-MLE with an adaptive interpolation step. AI-MLE uses the same finite-difference estimator as I-MLE but replaces the fixed step size :math:`\lambda` with an adaptive choice driven by the magnitudes of the predicted cost and the upstream gradient, .. math:: \lambda_t = \alpha_t \cdot \frac{\|\hat{\mathbf{c}}\|_2}{\|\mathbf{d}\|_2}, where :math:`\mathbf{d}` is the upstream task gradient and :math:`\alpha_t > 0` is tuned online: when an exponential moving average of the fraction of nonzero gradient entries drops below one, :math:`\alpha_t` is increased; otherwise it is decreased. This rescaling keeps the perturbation magnitude commensurate with :math:`\hat{\mathbf{c}}` and decouples the step from the absolute scale of :math:`\mathbf{d}`, removing the need to tune :math:`\lambda` by hand. .. autoclass:: pyepo.func.AIMLE :noindex: :members: Training loop (the adaptive step works with far fewer samples than I-MLE): .. code-block:: python aimle = pyepo.func.AIMLE(optmodel, n_samples=2, sigma=1.0, processes=2) criterion = nn.L1Loss() for epoch in range(20): for x, c, w, z in dataloader: cp = predmodel(x) we = aimle(cp) loss = criterion(we, w) optimizer.zero_grad() loss.backward() optimizer.step() Regularized Methods =================== A pure linear optimization layer is not differentiable: the optimal solution :math:`\mathbf{w}^*(\hat{\mathbf{c}}) = \arg\min_{\mathbf{w} \in \mathcal{S}} \hat{\mathbf{c}}^\top \mathbf{w}` is piecewise constant in :math:`\hat{\mathbf{c}}`. It jumps between vertices of :math:`\mathcal{S}`, and its derivative is zero almost everywhere. Regularized methods fix this by adding a strongly convex regularizer :math:`\Omega` to the linear objective, yielding a regularized solution .. math:: \hat{\mathbf{w}}_\Omega(\hat{\mathbf{c}}) = \arg\min_{\mathbf{w} \in \mathrm{conv}(\mathcal{S})} \hat{\mathbf{c}}^\top \mathbf{w} + \Omega(\mathbf{w}), which is unique, lies inside :math:`\mathrm{conv}(\mathcal{S})`, and varies continuously with :math:`\hat{\mathbf{c}}`. This deterministic smoothing is dual to the stochastic smoothing of the perturbed methods above: both replace the piecewise-constant solution map with a continuous surrogate, the former via a convex regularizer and the latter via Monte Carlo averaging over a noise distribution [#f6]_. PyEPO implements the L2 special case :math:`\Omega(\mathbf{w}) = \tfrac{\lambda}{2} \|\mathbf{w}\|_2^2` and solves the regularized program with batched Frank-Wolfe iteration. Frank-Wolfe accesses :math:`\mathrm{conv}(\mathcal{S})` through the original LP/ILP solver of :math:`\mathcal{S}` (a *linear minimization oracle*). ``lambd`` is the L2 strength :math:`\lambda`; ``max_iter`` caps Frank-Wolfe iterations and ``tol`` is the convergence tolerance. L2 Regularized Frank-Wolfe (RFWO) --------------------------------- RFWO [#f6]_ exposes :math:`\hat{\mathbf{w}}_\lambda(\hat{\mathbf{c}})` directly as a differentiable layer. Since it returns a regularized solution rather than a loss, the user defines a task loss on the output. A common choice is MSE against :math:`\mathbf{w}^*(\mathbf{c})`. .. math:: \hat{\mathbf{w}}_\lambda(\hat{\mathbf{c}}) = \arg\min_{\mathbf{w} \in \mathrm{conv}(\mathcal{S})} \hat{\mathbf{c}}^\top \mathbf{w} + \tfrac{\lambda}{2} \|\mathbf{w}\|_2^2. .. autoclass:: pyepo.func.RFWO :noindex: :members: Training loop: .. code-block:: python rfwo = pyepo.func.RFWO(optmodel, lambd=1.0, processes=2) criterion = nn.MSELoss() for epoch in range(20): for x, c, w, z in dataloader: cp = predmodel(x) wr = rfwo(cp) loss = criterion(wr, w) optimizer.zero_grad() loss.backward() optimizer.step() L2 Regularized Frank-Wolfe with Fenchel-Young Loss (RFYL) --------------------------------------------------------- RFYL [#f6]_ pairs the RFWO layer with the **Fenchel-Young loss** [#f13]_ of the regularizer :math:`\Omega`, returning a scalar loss that compares the predicted cost :math:`\hat{\mathbf{c}}` to the true optimum :math:`\mathbf{w}^*(\mathbf{c})`. For any convex regularizer :math:`\Omega: \mathrm{conv}(\mathcal{S}) \to \mathbb{R}` with conjugate :math:`\Omega^*(\boldsymbol{\theta}) = \sup_{\mathbf{w}} \boldsymbol{\theta}^\top \mathbf{w} - \Omega(\mathbf{w})`, the Fenchel-Young loss is .. math:: \mathcal{L}_\Omega^{\mathrm{FY}}(\hat{\mathbf{c}}, \mathbf{w}^*(\mathbf{c})) = \Omega(\mathbf{w}^*(\mathbf{c})) + \Omega^*(-\hat{\mathbf{c}}) + \hat{\mathbf{c}}^\top \mathbf{w}^*(\mathbf{c}). The Fenchel-Young inequality gives three properties used by the training loss: 1. **Non-negative**, with equality iff :math:`\mathbf{w}^*(\mathbf{c}) = \hat{\mathbf{w}}_\Omega(\hat{\mathbf{c}})`. 2. **Convex** in :math:`\hat{\mathbf{c}}`. 3. By Danskin's theorem, the gradient is the residual between the regularized prediction and the target, .. math:: \frac{\partial \mathcal{L}_\Omega^{\mathrm{FY}}(\hat{\mathbf{c}}, \mathbf{w}^*(\mathbf{c}))}{\partial \hat{\mathbf{c}}} = \mathbf{w}^*(\mathbf{c}) - \hat{\mathbf{w}}_\Omega(\hat{\mathbf{c}}), without differentiating through the Frank-Wolfe iterate. PyEPO specializes this to :math:`\Omega(\mathbf{w}) = \tfrac{\lambda}{2}\|\mathbf{w}\|_2^2`. Substituting the closed form of :math:`\Omega^*` and the definition of :math:`\hat{\mathbf{w}}_\lambda` collapses the loss to .. math:: \mathcal{L}_{\mathrm{RFYL}}(\hat{\mathbf{c}}, \mathbf{w}^*(\mathbf{c})) = \Omega(\mathbf{w}^*(\mathbf{c})) - \Omega(\hat{\mathbf{w}}_\lambda(\hat{\mathbf{c}})) + \hat{\mathbf{c}}^\top \big( \mathbf{w}^*(\mathbf{c}) - \hat{\mathbf{w}}_\lambda(\hat{\mathbf{c}}) \big), a "compare regularizers + linear residual" form. The gradient remains :math:`\mathbf{w}^*(\mathbf{c}) - \hat{\mathbf{w}}_\lambda(\hat{\mathbf{c}})`. .. autoclass:: pyepo.func.RFY :noindex: :members: Training loop: .. code-block:: python rfyl = pyepo.func.RFY(optmodel, lambd=1.0, processes=2) for epoch in range(20): for x, c, w, z in dataloader: cp = predmodel(x) loss = rfyl(cp, w) optimizer.zero_grad() loss.backward() optimizer.step() Black-Box Methods ================= Black-box methods replace the zero gradient of the discrete solver with a surrogate backward rule. The training loops below define an objective-value loss on the returned solution. Differentiable Black-Box Optimizer (DBB) ---------------------------------------- DBB [#f3]_ estimates gradients by interpolating the solver output between the predicted cost and a perturbed one. ``lambd`` is a smoothing hyperparameter. Given an upstream gradient :math:`\mathbf{d} = \nabla_{\mathbf{w}} \mathcal{L}(\hat{\mathbf{c}}, \cdot) \big|_{\mathbf{w} = \mathbf{w}^*(\hat{\mathbf{c}})}`, DBB approximates the vector-Jacobian product by interpolating the loss between :math:`\hat{\mathbf{c}}` and a perturbed cost :math:`\hat{\mathbf{c}} + \lambda \mathbf{d}`, yielding .. math:: \frac{\partial \mathcal{L}(\hat{\mathbf{c}}, \cdot)}{\partial \hat{\mathbf{c}}} \approx \frac{\mathbf{w}^*(\hat{\mathbf{c}} + \lambda \mathbf{d}) - \mathbf{w}^*(\hat{\mathbf{c}})}{\lambda}. Larger :math:`\lambda` smooths more aggressively; the recommended range is 10 to 20. Unlike SPO+, the resulting surrogate is nonconvex in :math:`\hat{\mathbf{c}}`, which weakens convergence guarantees even when the predictor is convex in its parameters. .. autoclass:: pyepo.func.DBB :noindex: :members: Training loop: .. code-block:: python dbb = pyepo.func.DBB(optmodel, lambd=10, processes=2) criterion = nn.L1Loss() for epoch in range(20): for x, c, w, z in dataloader: cp = predmodel(x) wp = dbb(cp) zp = (wp * c).sum(1).view(-1, 1) loss = criterion(zp, z) optimizer.zero_grad() loss.backward() optimizer.step() Negative Identity Backpropagation (NID) --------------------------------------- NID [#f4]_ treats the solver as a negative identity mapping during backpropagation; it is hyperparameter-free. NID approximates the solver Jacobian by the (signed) identity, :math:`\partial \mathbf{w}^*(\hat{\mathbf{c}}) / \partial \hat{\mathbf{c}} \approx -\mathbf{I}` for a minimization problem (and :math:`+\mathbf{I}` for maximization). Given an upstream gradient :math:`\mathbf{d} = \partial \mathcal{L}(\hat{\mathbf{c}}, \cdot) / \partial \mathbf{w}^*(\hat{\mathbf{c}})`, the chain rule gives a sign-flipped straight-through estimator, .. math:: \frac{\partial \mathcal{L}(\hat{\mathbf{c}}, \cdot)}{\partial \hat{\mathbf{c}}} \approx -\mathbf{d} \quad (\text{minimization}), \qquad \frac{\partial \mathcal{L}(\hat{\mathbf{c}}, \cdot)}{\partial \hat{\mathbf{c}}} \approx \mathbf{d} \quad (\text{maximization}). For minimization, this rule decreases :math:`\hat{\mathbf{c}}` along directions where :math:`\mathbf{w}^*(\hat{\mathbf{c}})` tends to increase, and vice versa. NID is the special case of DBB in which :math:`\lambda` is chosen so that the interpolated solution coincides with the negative-identity update, but it saves the extra solver call required by DBB. .. autoclass:: pyepo.func.NID :noindex: :members: Training loop: .. code-block:: python nid = pyepo.func.NID(optmodel, processes=2) criterion = nn.L1Loss() for epoch in range(20): for x, c, w, z in dataloader: cp = predmodel(x) wp = nid(cp) zp = (wp * c).sum(1).view(-1, 1) loss = criterion(zp, z) optimizer.zero_grad() loss.backward() optimizer.step() Cone-Aligned Estimation ======================= Cone-aligned losses supervise the *predicted cost vector* directly by aligning it with the polyhedral cone of binding-constraint normals at the true optimum, rather than supervising on the optimal solution itself. Cone-Aligned Vector Estimation (CaVE) ------------------------------------- CaVE [#f12]_ is a surrogate loss for **binary linear programs** (TSP, CVRP, knapsack, shortest path with binary edges, etc.). For each instance, it projects the sense-flipped predicted cost vector onto the polyhedral cone spanned by the binding-constraint normals at the true optimal vertex, then minimizes ``1 - cos(pred, proj)``. For a runnable walkthrough, see the `04 CaVE for Binary Linear Programs `_ notebook. Let :math:`K(\mathbf{w}^*(\mathbf{c}))` be the polyhedral cone spanned by the constraint normals that bind at the true optimal vertex. For a minimization problem, the KKT conditions require :math:`-\hat{\mathbf{c}} \in K(\mathbf{w}^*(\mathbf{c}))` for :math:`\mathbf{w}^*(\mathbf{c})` to remain optimal under the predicted cost. CaVE measures this alignment via a cosine loss against the cone projection :math:`\mathbf{p} = \mathrm{proj}_{K}(-\hat{\mathbf{c}})`, .. math:: \mathcal{L}_{\mathrm{CaVE}}(\hat{\mathbf{c}}, K) = 1 - \frac{(-\hat{\mathbf{c}})^\top \mathbf{p}}{\|\hat{\mathbf{c}}\|_2\, \|\mathbf{p}\|_2}. When :math:`-\hat{\mathbf{c}}` already lies inside the cone, :math:`\mathbf{p} = -\hat{\mathbf{c}}` and the loss is zero; the further :math:`-\hat{\mathbf{c}}` strays outside the cone, the larger the loss. CaVE uses two solvers at different stages: a Gurobi-backed ``optModel`` extracts the binding-constraint normals when the dataset is built, and Clarabel projects the predicted cost onto that cone during training. This avoids solving an ILP in every training step. ``max_iter`` caps the Clarabel iterations; the default ``max_iter=3`` is the paper's **CaVE+** preset, which under-converges the projection on purpose so it stays interior to the cone. Raising it changes the loss, not just its precision. Setting ``solve_ratio < 1`` enables the **CaVE-Hybrid** update: each batch uses the QP projection with probability ``solve_ratio`` and otherwise uses a blend of the normalized predicted cost and the average binding-constraint normal. ``inner_ratio`` controls the blend. .. figure:: ../../images/cave_vrp20.png :width: 100% :align: center CVRP-20 results from notebook 04: ``num_data=1000``, 10 epochs, single process. In this setup, CaVE+ trains 8.2x faster than SPO+; CaVE-Hybrid with ``solve_ratio=0.3`` trains 10.5x faster with higher final regret. Training data comes from ``pyepo.data.dataset.optDatasetConstrs``, which extracts the binding-constraint normals at the optimum for each instance. Per-instance constraint counts are ragged, so batch it with ``pyepo.data.dataset.optDataLoader``, which zero-pads them automatically. CaVE currently requires a Gurobi-backed ``optModel``. .. autoclass:: pyepo.func.CaVE :noindex: :members: Training loop (the batch carries ``tight_ctrs`` in addition to ``(x, c, w, z)``): .. code-block:: python from pyepo.data.dataset import optDatasetConstrs, optDataLoader dataset_constr = optDatasetConstrs(optmodel, feat, costs) dataloader_constr = optDataLoader(dataset_constr, batch_size=32, shuffle=True) cave = pyepo.func.CaVE(optmodel, processes=2) for epoch in range(20): for x, c, w, z, tight_ctrs in dataloader_constr: cp = predmodel(x) loss = cave(cp, tight_ctrs) optimizer.zero_grad() loss.backward() optimizer.step() If you use the **CaVE** loss, please cite: .. code-block:: bibtex @inproceedings{tang2024cave, title={CaVE: A Cone-Aligned Approach for Fast Predict-then-Optimize with Binary Linear Programs}, author={Tang, Bo and Khalil, Elias B}, booktitle={Integration of Constraint Programming, Artificial Intelligence, and Operations Research}, pages={193--210}, year={2024}, publisher={Springer} } Contrastive Methods =================== Contrastive methods train against a pool of cached non-optimal solutions, treated as negative examples. ``solve_ratio`` controls how often new instances are solved exactly during training; ``dataset`` seeds the pool and is required by these methods. See :doc:`../advanced/pool` for details on the solution-pool mechanism. Noise Contrastive Estimation (NCE) ---------------------------------- NCE [#f7]_ is a surrogate loss based on negative examples. It uses a small set of non-optimal solutions as negative samples to maximize the predicted-cost margin between the optimal solution and the rest. Let :math:`\Gamma` be the cached pool of feasible solutions. For a minimization problem, NCE averages the predicted-cost margin between :math:`\mathbf{w}^*(\mathbf{c})` and every member of the pool, .. math:: \mathcal{L}_{\mathrm{NCE}}(\hat{\mathbf{c}}, \mathbf{w}^*(\mathbf{c})) = \frac{1}{|\Gamma|} \sum_{\mathbf{w} \in \Gamma} \big( \hat{\mathbf{c}}^\top \mathbf{w}^*(\mathbf{c}) - \hat{\mathbf{c}}^\top \mathbf{w} \big). (If :math:`\mathbf{w}^*(\mathbf{c}) \in \Gamma`, its term contributes zero and the sum effectively averages over the negatives.) The gradient has a closed form that requires no solver call in the backward pass, .. math:: \frac{\partial \mathcal{L}_{\mathrm{NCE}}(\hat{\mathbf{c}}, \mathbf{w}^*(\mathbf{c}))}{\partial \hat{\mathbf{c}}} = \mathbf{w}^*(\mathbf{c}) - \frac{1}{|\Gamma|} \sum_{\mathbf{w} \in \Gamma} \mathbf{w}. For a fixed :math:`\Gamma`, this update direction stays constant per instance. ``solve_ratio`` controls how often the pool is refreshed. .. autoclass:: pyepo.func.NCE :noindex: :members: Training loop: .. code-block:: python nce = pyepo.func.NCE(optmodel, processes=2, solve_ratio=0.05, dataset=dataset) for epoch in range(20): for x, c, w, z in dataloader: cp = predmodel(x) loss = nce(cp, w) optimizer.zero_grad() loss.backward() optimizer.step() Contrastive MAP (CMAP) ---------------------- CMAP [#f7]_ is a special case of NCE that uses only the lowest predicted-cost negative sample. CMAP keeps only the most-violating member of the pool, the one with the smallest predicted-cost objective, yielding a max-margin contrast against the true optimum. For a minimization problem, .. math:: \mathcal{L}_{\mathrm{CMAP}}(\hat{\mathbf{c}}, \mathbf{w}^*(\mathbf{c})) = \max_{\mathbf{w} \in \Gamma} \big( \hat{\mathbf{c}}^\top \mathbf{w}^*(\mathbf{c}) - \hat{\mathbf{c}}^\top \mathbf{w} \big) = \hat{\mathbf{c}}^\top \mathbf{w}^*(\mathbf{c}) - \min_{\mathbf{w} \in \Gamma} \hat{\mathbf{c}}^\top \mathbf{w}. If :math:`\mathbf{w}^*(\mathbf{c}) \in \Gamma`, that entry contributes a zero margin, so the loss is non-negative and vanishes precisely when the predicted costs already make :math:`\mathbf{w}^*(\mathbf{c})` the minimizer over :math:`\Gamma`. .. autoclass:: pyepo.func.CMAP :noindex: :members: Training loop: .. code-block:: python cmap = pyepo.func.CMAP(optmodel, processes=2, solve_ratio=0.05, dataset=dataset) for epoch in range(20): for x, c, w, z in dataloader: cp = predmodel(x) loss = cmap(cp, w) optimizer.zero_grad() loss.backward() optimizer.step() Learning to Rank ================ Learning to rank [#f8]_ treats predict-then-optimize training as ranking a pool of feasible solutions: predicted costs assign scores to solutions, and the loss encourages the optimal solution to rank highest. Each variant differs in how it scores the ranking. Like contrastive methods, LTR uses ``solve_ratio`` and ``dataset`` to manage the pool. * **Pointwise** regresses each predicted score :math:`\hat{\mathbf{c}}^\top \mathbf{w}` toward the true value :math:`\mathbf{c}^\top \mathbf{w}` for every :math:`\mathbf{w} \in \Gamma`. * **Pairwise** enforces a margin between the optimal solution and each suboptimal one. * **Listwise** models the entire ranking distribution with a SoftMax over the negative predicted costs (for a minimization problem, so the lowest-cost solution gets the highest probability), .. math:: P(\mathbf{w}' \mid \hat{\mathbf{c}}) = \frac{\exp(-\hat{\mathbf{c}}^\top \mathbf{w}')}{\sum_{\mathbf{w} \in \Gamma} \exp(-\hat{\mathbf{c}}^\top \mathbf{w})}, and minimizes the cross-entropy against the true ranking distribution, .. math:: \mathcal{L}_{\mathrm{LTR}}^{\mathrm{list}}(\hat{\mathbf{c}}, \mathbf{c}) = - \sum_{\mathbf{w} \in \Gamma} P(\mathbf{w} \mid \mathbf{c}) \log P(\mathbf{w} \mid \hat{\mathbf{c}}). The gradient has a closed form, .. math:: \frac{\partial \mathcal{L}_{\mathrm{LTR}}^{\mathrm{list}}(\hat{\mathbf{c}}, \mathbf{c})}{\partial \hat{\mathbf{c}}} = \sum_{\mathbf{w} \in \Gamma} P(\mathbf{w} \mid \mathbf{c})\, \mathbf{w} - \sum_{\mathbf{w} \in \Gamma} P(\mathbf{w} \mid \hat{\mathbf{c}})\, \mathbf{w}. Pointwise LTR ------------- .. autoclass:: pyepo.func.ptLTR :noindex: :members: Training loop: .. code-block:: python ltr = pyepo.func.ptLTR(optmodel, processes=2, solve_ratio=0.05, dataset=dataset) for epoch in range(20): for x, c, w, z in dataloader: cp = predmodel(x) loss = ltr(cp, c) optimizer.zero_grad() loss.backward() optimizer.step() Pairwise LTR ------------ .. autoclass:: pyepo.func.prLTR :noindex: :members: Training loop: .. code-block:: python ltr = pyepo.func.prLTR(optmodel, processes=2, solve_ratio=0.05, dataset=dataset) for epoch in range(20): for x, c, w, z in dataloader: cp = predmodel(x) loss = ltr(cp, c) optimizer.zero_grad() loss.backward() optimizer.step() Listwise LTR ------------ .. autoclass:: pyepo.func.lsLTR :noindex: :members: Training loop: .. code-block:: python ltr = pyepo.func.lsLTR(optmodel, processes=2, solve_ratio=0.05, dataset=dataset) for epoch in range(20): for x, c, w, z in dataloader: cp = predmodel(x) loss = ltr(cp, c) optimizer.zero_grad() loss.backward() optimizer.step() Parallel Computation ==================== All ``pyepo.func`` modules support parallel solving during training via the ``processes`` parameter (``0`` uses all available cores). .. image:: ../../images/parallel-tsp.png :width: 650 :class: light-bg The figure reports runtime for different values of ``processes``. .. rubric:: Footnotes .. [#f1] Elmachtoub, A. N., & Grigas, P. (2021). Smart "predict, then optimize". Management Science. .. [#f3] Vlastelica, M., Paulus, A., Musil, V., Martius, G., & Rolinek, M. (2019). Differentiation of blackbox combinatorial solvers. arXiv preprint arXiv:1912.02175. .. [#f4] Sahoo, S. S., Paulus, A., Vlastelica, M., Musil, V., Kuleshov, V., & Martius, G. (2022). Backpropagation through combinatorial algorithms: Identity with projection works. arXiv preprint arXiv:2205.15213. .. [#f5] Berthet, Q., Blondel, M., Teboul, O., Cuturi, M., Vert, J. P., & Bach, F. (2020). Learning with differentiable perturbed optimizers. Advances in Neural Information Processing Systems, 33, 9508-9519. .. [#f6] Dalle, G., Baty, L., Bouvier, L., & Parmentier, A. (2022). Learning with Combinatorial Optimization Layers: a Probabilistic Approach. arXiv preprint arXiv:2207.13513. .. [#f7] Mulamba, M., Mandi, J., Diligenti, M., Lombardi, M., Bucarey, V., & Guns, T. (2021). Contrastive losses and solution caching for predict-and-optimize. Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence. .. [#f8] Mandi, J., Bucarey, V., Mulamba, M., & Guns, T. (2022). Decision-focused learning: through the lens of learning to rank. Proceedings of the 39th International Conference on Machine Learning. .. [#f9] Niepert, M., Minervini, P., & Franceschi, L. (2021). Implicit MLE: backpropagating through discrete exponential family distributions. Advances in Neural Information Processing Systems, 34, 14567-14579. .. [#f10] Minervini, P., Franceschi, L., & Niepert, M. (2023). Adaptive perturbation-based gradient estimation for discrete latent variable models. Proceedings of the AAAI Conference on Artificial Intelligence. .. [#f11] Gupta, V., & Huang, M. (2024). Decision-Focused Learning with Directional Gradients. arXiv preprint arXiv:2402.03256. .. [#f12] Tang, B., & Khalil, E. B. (2024). CaVE: A Cone-Aligned Approach for Fast Predict-then-Optimize with Binary Linear Programs. In Integration of Constraint Programming, Artificial Intelligence, and Operations Research (pp. 193-210). .. [#f13] Blondel, M., Martins, A. F. T., & Niculae, V. (2020). Learning with Fenchel-Young losses. Journal of Machine Learning Research, 21(35), 1-69.