Training Methods
+++++++++++++++++

Overview
========

``pyepo.func`` provides PyTorch autograd modules that wrap an optimization solver for end-to-end training. All modules assume a linear objective with known, fixed constraints; the cost vector is predicted from contextual features.

Every module accepts ``processes`` for parallel solving, and all except ``CaVE`` accept ``solve_ratio < 1`` with ``dataset`` for solution-pool caching (see :doc:`../advanced/pool`); modules that return a loss also accept ``reduction`` (``"mean"``, ``"sum"``, or ``"none"``).

Each method below is presented with its definition followed by a runnable training loop; all loops build on the shared **Common Setup**.


Choosing a Method
=================

The modules differ in what they return, which determines how you use them:

* **Loss-returning**: return a scalar loss, passed directly to ``.backward()``: SPO+, PG, PFYL, RFYL, CaVE, NCE, CMAP, LTR.
* **Solution-returning**: return a predicted, expected, or regularized solution, on which you define a task loss: DPO, I-MLE, AI-MLE, RFWO, DBB, NID.

A combined name like ``DPO+MSE`` or ``NID+L1`` denotes a solution-returning module (DPO, NID) followed by a task loss (here MSE or L1) on its output.

The table below summarizes each module's return type, typical supervision, and notes.

.. list-table::
   :header-rows: 1
   :widths: 18 28 28 26

   * - Module
     - Returns
     - Typical supervision
     - Notes
   * - ``SPOPlus``
     - loss
     - true costs, true optimal solutions, true objective values
     - convex surrogate for regret
   * - ``PG``
     - loss
     - true costs
     - directional finite differences along the true cost
   * - ``DPO`` / ``DPOMul``
     - expected perturbed solutions
     - task loss chosen by the user
     - use the multiplicative variant for sign-sensitive oracles
   * - ``PFY`` / ``PFYMul``
     - loss
     - true optimal solutions
     - PFYL; use the multiplicative variant for sign-sensitive oracles
   * - ``IMLE`` / ``AIMLE``
     - expected perturbed solutions
     - task loss chosen by the user
     - perturb-and-MAP with Sum-of-Gamma noise
   * - ``RFWO``
     - regularized predicted solutions
     - task loss chosen by the user
     - L2-regularized smooth optimizer over the convex hull of feasible solutions
   * - ``RFY``
     - loss
     - true optimal solutions
     - RFYL; Fenchel-Young loss paired with L2-regularized Frank-Wolfe
   * - ``DBB`` / ``NID``
     - predicted solutions
     - task loss chosen by the user
     - direct solution-level or objective-level losses
   * - ``CaVE``
     - loss
     - tight binding-constraint normals at the true optimum (``optDatasetConstrs``)
     - binary linear programs, Gurobi backend only
   * - ``NCE`` / ``CMAP``
     - loss
     - true optimal solutions and a solution pool
     - contrastive training with cached negative solutions
   * - ``ptLTR`` / ``prLTR`` / ``lsLTR``
     - loss
     - true costs and a solution pool
     - learning-to-rank formulations over feasible solutions


Common Setup
============

All training loops below share the same setup: a DSL-defined knapsack problem with a linear predictor. The examples use PyTorch; JAX follows the same method families, see :doc:`../frontends/jax`. For a runnable walkthrough, see the `03 Training and Testing <https://colab.research.google.com/github/khalil-research/PyEPO/blob/main/notebooks/03%20Training%20and%20Testing.ipynb>`_ notebook.

.. code-block:: python

   import pyepo
   from pyepo import EPO, dsl
   import torch
   from torch import nn
   from torch.utils.data import DataLoader

   # generate data
   num_data, num_feat, num_item, num_dim = 1000, 5, 10, 3
   weights, feat, costs = pyepo.data.knapsack.genData(
       num_data, num_feat, num_item, num_dim, deg=4, noise_width=0.5, seed=135,
   )
   capacity = (weights.sum(axis=1) * 0.5).astype(int)

   # define the optimization problem
   x = dsl.Variable(num_item, vtype=EPO.BINARY)
   c = dsl.Parameter(num_item)
   optmodel = dsl.Problem(
       dsl.Maximize(c @ x),
       [weights @ x <= capacity],
   ).compile(backend="gurobi")

   # dataset and data loader
   dataset = pyepo.data.dataset.optDataset(optmodel, feat, costs)
   dataloader = DataLoader(dataset, batch_size=32, shuffle=True)

   # prediction model
   predmodel = nn.Linear(num_feat, num_item)
   optimizer = torch.optim.Adam(predmodel.parameters(), lr=1e-3)

   # for multiplicative perturbed variants
   positive_predmodel = nn.Sequential(nn.Linear(num_feat, num_item), nn.Softplus())


Surrogate Losses
================

Surrogate losses replace the non-differentiable regret with a differentiable training objective.


Smart Predict-then-Optimize+ Loss (SPO+)
----------------------------------------

SPO+ [#f1]_ is a convex surrogate for the SPO loss (regret), the decision error of the downstream optimization.

The derivation starts from the observation that for any :math:`\alpha \geq 0`,

.. math::

   \mathrm{regret}(\hat{\mathbf{c}}, \mathbf{c}) \leq \max_{\mathbf{w} \in \mathcal{S}} \big\{ \mathbf{c}^\top \mathbf{w} - \alpha\, \hat{\mathbf{c}}^\top \mathbf{w} \big\} + z^*(\alpha \hat{\mathbf{c}}) - z^*(\mathbf{c}).

Setting :math:`\alpha = 2` and bounding :math:`z^*(2 \hat{\mathbf{c}}) \leq 2 \hat{\mathbf{c}}^\top \mathbf{w}^*(\mathbf{c})` via the concavity of :math:`z^*` gives the SPO+ loss,

.. math::

   \mathcal{L}_{\mathrm{SPO+}}(\hat{\mathbf{c}}, \mathbf{c}) = -z^*(2 \hat{\mathbf{c}} - \mathbf{c}) + 2\, \hat{\mathbf{c}}^\top \mathbf{w}^*(\mathbf{c}) - z^*(\mathbf{c}),

which is convex in :math:`\hat{\mathbf{c}}` and upper-bounds the regret. By Danskin's theorem, a subgradient is

.. math::

   2 \big( \mathbf{w}^*(\mathbf{c}) - \mathbf{w}^*(2 \hat{\mathbf{c}} - \mathbf{c}) \big) \in \frac{\partial \mathcal{L}_{\mathrm{SPO+}}(\hat{\mathbf{c}}, \mathbf{c})}{\partial \hat{\mathbf{c}}}.

.. autoclass:: pyepo.func.SPOPlus
    :noindex:
    :members:

``processes`` sets the number of solver processes (``0`` uses all available cores).

Training loop:

.. code-block:: python

   spo = pyepo.func.SPOPlus(optmodel, processes=2)

   for epoch in range(20):
       for x, c, w, z in dataloader:
           cp = predmodel(x)
           loss = spo(cp, c, w, z)
           optimizer.zero_grad()
           loss.backward()
           optimizer.step()


Perturbation Gradient (PG)
--------------------------

PG [#f11]_ is a surrogate loss based on zeroth-order gradient approximation.

By Danskin's theorem, regret can be written as the directional derivative of :math:`z^*(\hat{\mathbf{c}})` along :math:`\mathbf{c}` (up to a constant independent of :math:`\hat{\mathbf{c}}`). PG approximates this directional derivative with finite differences. Two variants are available, backward (``two_sides=False``) and central (``two_sides=True``):

.. math::

   \mathcal{L}_{\mathrm{PG}}^{\mathrm{back}}(\hat{\mathbf{c}}, \mathbf{c}) &\approx \frac{z^*(\hat{\mathbf{c}}) - z^*(\hat{\mathbf{c}} - \sigma \mathbf{c})}{\sigma}, \\
   \mathcal{L}_{\mathrm{PG}}^{\mathrm{cent}}(\hat{\mathbf{c}}, \mathbf{c}) &\approx \frac{z^*(\hat{\mathbf{c}} + \sigma \mathbf{c}) - z^*(\hat{\mathbf{c}} - \sigma \mathbf{c})}{2 \sigma},

where :math:`\sigma > 0` is the finite-difference width (the ``sigma`` parameter). The corresponding gradient estimate is

.. math::

   \frac{\partial \mathcal{L}_{\mathrm{PG}}^{\mathrm{back}}(\hat{\mathbf{c}}, \mathbf{c})}{\partial \hat{\mathbf{c}}} \approx \frac{\mathbf{w}^*(\hat{\mathbf{c}}) - \mathbf{w}^*(\hat{\mathbf{c}} - \sigma \mathbf{c})}{\sigma}.

.. autoclass:: pyepo.func.PG
    :noindex:
    :members:

``sigma`` controls the finite-difference width. ``two_sides`` enables central differencing.

Training loop:

.. code-block:: python

   pg = pyepo.func.PG(optmodel, sigma=0.1, two_sides=False, processes=2)

   for epoch in range(20):
       for x, c, w, z in dataloader:
           cp = predmodel(x)
           loss = pg(cp, c)
           optimizer.zero_grad()
           loss.backward()
           optimizer.step()


Perturbed Methods
=================

Perturbed methods estimate gradients by Monte Carlo averaging over random perturbations of the cost vector. They share two design knobs: ``n_samples`` (number of Monte Carlo samples) and ``sigma`` (perturbation amplitude).


Differentiable Perturbed Optimizer (DPO)
----------------------------------------

DPO [#f5]_ uses Monte Carlo sampling to estimate solutions by optimizing randomly perturbed costs. Its custom backward pass provides a gradient estimator for end-to-end training. ``DPO`` is the additive Gaussian version; ``DPOMul`` is the multiplicative version for sign-sensitive oracles [#f6]_. The multiplicative variant assumes predicted costs already have the intended nonzero sign. For nonnegative-cost problems, use a positive-output predictor such as ``nn.Softplus()`` plus a small epsilon.

DPO replaces the piecewise-constant solution map :math:`\hat{\mathbf{c}} \mapsto \mathbf{w}^*(\hat{\mathbf{c}})` with the expectation over a random perturbation,

.. math::

   \mathbb{E}_{\boldsymbol{\xi}} \big[ \mathbf{w}^*(\hat{\mathbf{c}} + \sigma \boldsymbol{\xi}) \big] \approx \frac{1}{K} \sum_{\kappa=1}^{K} \mathbf{w}^*(\hat{\mathbf{c}} + \sigma \boldsymbol{\xi}_\kappa),

where :math:`\boldsymbol{\xi} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})` and :math:`K` is ``n_samples``.

The multiplicative variant ``DPOMul`` perturbs each entry of :math:`\hat{\mathbf{c}}` as

.. math::

   \hat{\mathbf{c}} \odot \exp\!\big(\sigma \boldsymbol{\xi} - \tfrac{1}{2} \sigma^2\big),

with :math:`\odot` the Hadamard product. The exponential factor preserves the sign of each cost entry, which is required when the solver expects, e.g., nonnegative edge costs.

.. autoclass:: pyepo.func.DPO
    :noindex:
    :members:

.. autoclass:: pyepo.func.DPOMul
    :noindex:
    :members:

Training loop (swap ``predmodel`` for ``positive_predmodel`` when using ``DPOMul``):

.. code-block:: python

   dpo = pyepo.func.DPO(optmodel, n_samples=10, sigma=0.5, processes=2)
   # if using DPOMul, replace predmodel with positive_predmodel below
   # dpo = pyepo.func.DPOMul(optmodel, n_samples=10, sigma=0.5, processes=2)

   criterion = nn.MSELoss()

   for epoch in range(20):
       for x, c, w, z in dataloader:
           cp = predmodel(x)
           we = dpo(cp)
           loss = criterion(we, w)
           optimizer.zero_grad()
           loss.backward()
           optimizer.step()


Perturbed Fenchel-Young Loss (PFYL)
-----------------------------------

PFYL [#f5]_ uses the same perturbed expected solution as DPO inside a Fenchel-Young loss, comparing it directly with the true optimal solution. It returns a scalar loss. ``PFY`` is the additive Gaussian version; ``PFYMul`` is the multiplicative sign-preserving variant with the same sign convention as ``DPOMul``.

Let :math:`F(\hat{\mathbf{c}}) = \mathbb{E}_{\boldsymbol{\xi}}\big[ \min_{\mathbf{w} \in \mathcal{S}} (\hat{\mathbf{c}} + \sigma \boldsymbol{\xi})^\top \mathbf{w} \big]` be the expected perturbed minimum, and let :math:`\Omega(\mathbf{w}^*(\mathbf{c}))` be its Fenchel conjugate. The perturbed Fenchel-Young loss is

.. math::

   \mathcal{L}_{\mathrm{PFYL}}(\hat{\mathbf{c}}, \mathbf{w}^*(\mathbf{c})) = \hat{\mathbf{c}}^\top \mathbf{w}^*(\mathbf{c}) - F(\hat{\mathbf{c}}) - \Omega(\mathbf{w}^*(\mathbf{c})).

The dual term :math:`\Omega(\mathbf{w}^*(\mathbf{c}))` does not depend on :math:`\hat{\mathbf{c}}`, so the gradient collapses to a simple difference of solutions,

.. math::

   \frac{\partial \mathcal{L}_{\mathrm{PFYL}}(\hat{\mathbf{c}}, \mathbf{w}^*(\mathbf{c}))}{\partial \hat{\mathbf{c}}} = \mathbf{w}^*(\mathbf{c}) - \mathbb{E}_{\boldsymbol{\xi}} \big[\mathbf{w}^*(\hat{\mathbf{c}} + \sigma \boldsymbol{\xi})\big] \approx \mathbf{w}^*(\mathbf{c}) - \frac{1}{K} \sum_{\kappa=1}^{K} \mathbf{w}^*(\hat{\mathbf{c}} + \sigma \boldsymbol{\xi}_\kappa).

The multiplicative variant uses the same sign-preserving perturbation,

.. math::

   \hat{\mathbf{c}} \odot \exp\!\big(\sigma \boldsymbol{\xi} - \tfrac{1}{2} \sigma^2\big).

.. autoclass:: pyepo.func.PFY
    :noindex:
    :members:

.. autoclass:: pyepo.func.PFYMul
    :noindex:
    :members:

Training loop (swap ``predmodel`` for ``positive_predmodel`` when using ``PFYMul``):

.. code-block:: python

   pfy = pyepo.func.PFY(optmodel, n_samples=10, sigma=0.5, processes=2)
   # if using PFYMul, replace predmodel with positive_predmodel below
   # pfy = pyepo.func.PFYMul(optmodel, n_samples=10, sigma=0.5, processes=2)

   for epoch in range(20):
       for x, c, w, z in dataloader:
           cp = predmodel(x)
           loss = pfy(cp, w)
           optimizer.zero_grad()
           loss.backward()
           optimizer.step()


Implicit Maximum Likelihood Estimator (I-MLE)
---------------------------------------------

I-MLE [#f9]_ uses the perturb-and-MAP framework, sampling noise from a Sum-of-Gamma distribution and interpolating the loss function to approximate finite differences. ``lambd`` controls the interpolation step.

I-MLE is framed as imitation learning: bring the model distribution :math:`p(\mathbf{w} \mid \hat{\mathbf{c}})` closer to a target distribution :math:`q(\mathbf{w} \mid \hat{\mathbf{c}})` by minimizing their KL divergence. An upstream task gradient :math:`\mathbf{d} = \nabla_{\mathbf{w}} \mathcal{L}(\hat{\mathbf{c}}, \cdot) \big|_{\mathbf{w} = \mathbf{w}^*(\hat{\mathbf{c}})}` induces a virtual update :math:`\hat{\mathbf{c}}' = \hat{\mathbf{c}} + \lambda \mathbf{d}`, and the gradient is estimated by a directional finite difference between the smoothed solutions at :math:`\hat{\mathbf{c}}'` and :math:`\hat{\mathbf{c}}`:

.. math::

   \frac{\partial \mathcal{L}_{\mathrm{IMLE}}(\hat{\mathbf{c}}, \cdot)}{\partial \hat{\mathbf{c}}} \approx \frac{1}{K \lambda} \sum_{\kappa=1}^{K} \Big( \mathbf{w}^*(\hat{\mathbf{c}} + \lambda \mathbf{d} + \sigma \boldsymbol{\xi}_\kappa) - \mathbf{w}^*(\hat{\mathbf{c}} + \sigma \boldsymbol{\xi}_\kappa) \Big),

where :math:`\sigma` smooths the solution map and :math:`\lambda` (``lambd``) is the finite-difference step. The same noise realization :math:`\boldsymbol{\xi}_\kappa` is shared across the two perturbed evaluations to reduce variance.

.. autoclass:: pyepo.func.IMLE
    :noindex:
    :members:

Training loop:

.. code-block:: python

   imle = pyepo.func.IMLE(optmodel, n_samples=10, sigma=1.0, lambd=10, processes=2)
   criterion = nn.L1Loss()

   for epoch in range(20):
       for x, c, w, z in dataloader:
           cp = predmodel(x)
           we = imle(cp)
           loss = criterion(we, w)
           optimizer.zero_grad()
           loss.backward()
           optimizer.step()


Adaptive Implicit Maximum Likelihood Estimator (AI-MLE)
-------------------------------------------------------

AI-MLE [#f10]_ extends I-MLE with an adaptive interpolation step.

AI-MLE uses the same finite-difference estimator as I-MLE but replaces the fixed step size :math:`\lambda` with an adaptive choice driven by the magnitudes of the predicted cost and the upstream gradient,

.. math::

   \lambda_t = \alpha_t \cdot \frac{\|\hat{\mathbf{c}}\|_2}{\|\mathbf{d}\|_2},

where :math:`\mathbf{d}` is the upstream task gradient and :math:`\alpha_t > 0` is tuned online: when an exponential moving average of the fraction of nonzero gradient entries drops below one, :math:`\alpha_t` is increased; otherwise it is decreased. This rescaling keeps the perturbation magnitude commensurate with :math:`\hat{\mathbf{c}}` and decouples the step from the absolute scale of :math:`\mathbf{d}`, removing the need to tune :math:`\lambda` by hand.

.. autoclass:: pyepo.func.AIMLE
    :noindex:
    :members:

Training loop (the adaptive step works with far fewer samples than I-MLE):

.. code-block:: python

   aimle = pyepo.func.AIMLE(optmodel, n_samples=2, sigma=1.0, processes=2)
   criterion = nn.L1Loss()

   for epoch in range(20):
       for x, c, w, z in dataloader:
           cp = predmodel(x)
           we = aimle(cp)
           loss = criterion(we, w)
           optimizer.zero_grad()
           loss.backward()
           optimizer.step()


Regularized Methods
===================

A pure linear optimization layer is not differentiable: the optimal solution :math:`\mathbf{w}^*(\hat{\mathbf{c}}) = \arg\min_{\mathbf{w} \in \mathcal{S}} \hat{\mathbf{c}}^\top \mathbf{w}` is piecewise constant in :math:`\hat{\mathbf{c}}`. It jumps between vertices of :math:`\mathcal{S}`, and its derivative is zero almost everywhere. Regularized methods fix this by adding a strongly convex regularizer :math:`\Omega` to the linear objective, yielding a regularized solution

.. math::

   \hat{\mathbf{w}}_\Omega(\hat{\mathbf{c}}) = \arg\min_{\mathbf{w} \in \mathrm{conv}(\mathcal{S})} \hat{\mathbf{c}}^\top \mathbf{w} + \Omega(\mathbf{w}),

which is unique, lies inside :math:`\mathrm{conv}(\mathcal{S})`, and varies continuously with :math:`\hat{\mathbf{c}}`. This deterministic smoothing is dual to the stochastic smoothing of the perturbed methods above: both replace the piecewise-constant solution map with a continuous surrogate, the former via a convex regularizer and the latter via Monte Carlo averaging over a noise distribution [#f6]_.

PyEPO implements the L2 special case :math:`\Omega(\mathbf{w}) = \tfrac{\lambda}{2} \|\mathbf{w}\|_2^2` and solves the regularized program with batched Frank-Wolfe iteration. Frank-Wolfe accesses :math:`\mathrm{conv}(\mathcal{S})` through the original LP/ILP solver of :math:`\mathcal{S}` (a *linear minimization oracle*). ``lambd`` is the L2 strength :math:`\lambda`; ``max_iter`` caps Frank-Wolfe iterations and ``tol`` is the convergence tolerance.


L2 Regularized Frank-Wolfe (RFWO)
---------------------------------

RFWO [#f6]_ exposes :math:`\hat{\mathbf{w}}_\lambda(\hat{\mathbf{c}})` directly as a differentiable layer. Since it returns a regularized solution rather than a loss, the user defines a task loss on the output. A common choice is MSE against :math:`\mathbf{w}^*(\mathbf{c})`.

.. math::

   \hat{\mathbf{w}}_\lambda(\hat{\mathbf{c}}) = \arg\min_{\mathbf{w} \in \mathrm{conv}(\mathcal{S})} \hat{\mathbf{c}}^\top \mathbf{w} + \tfrac{\lambda}{2} \|\mathbf{w}\|_2^2.

.. autoclass:: pyepo.func.RFWO
    :noindex:
    :members:

Training loop:

.. code-block:: python

   rfwo = pyepo.func.RFWO(optmodel, lambd=1.0, processes=2)
   criterion = nn.MSELoss()

   for epoch in range(20):
       for x, c, w, z in dataloader:
           cp = predmodel(x)
           wr = rfwo(cp)
           loss = criterion(wr, w)
           optimizer.zero_grad()
           loss.backward()
           optimizer.step()


L2 Regularized Frank-Wolfe with Fenchel-Young Loss (RFYL)
---------------------------------------------------------

RFYL [#f6]_ pairs the RFWO layer with the **Fenchel-Young loss** [#f13]_ of the regularizer :math:`\Omega`, returning a scalar loss that compares the predicted cost :math:`\hat{\mathbf{c}}` to the true optimum :math:`\mathbf{w}^*(\mathbf{c})`.

For any convex regularizer :math:`\Omega: \mathrm{conv}(\mathcal{S}) \to \mathbb{R}` with conjugate :math:`\Omega^*(\boldsymbol{\theta}) = \sup_{\mathbf{w}} \boldsymbol{\theta}^\top \mathbf{w} - \Omega(\mathbf{w})`, the Fenchel-Young loss is

.. math::

   \mathcal{L}_\Omega^{\mathrm{FY}}(\hat{\mathbf{c}}, \mathbf{w}^*(\mathbf{c})) = \Omega(\mathbf{w}^*(\mathbf{c})) + \Omega^*(-\hat{\mathbf{c}}) + \hat{\mathbf{c}}^\top \mathbf{w}^*(\mathbf{c}).

The Fenchel-Young inequality gives three properties used by the training loss:

1. **Non-negative**, with equality iff :math:`\mathbf{w}^*(\mathbf{c}) = \hat{\mathbf{w}}_\Omega(\hat{\mathbf{c}})`.
2. **Convex** in :math:`\hat{\mathbf{c}}`.
3. By Danskin's theorem, the gradient is the residual between the regularized prediction and the target,

   .. math::

      \frac{\partial \mathcal{L}_\Omega^{\mathrm{FY}}(\hat{\mathbf{c}}, \mathbf{w}^*(\mathbf{c}))}{\partial \hat{\mathbf{c}}} = \mathbf{w}^*(\mathbf{c}) - \hat{\mathbf{w}}_\Omega(\hat{\mathbf{c}}),

   without differentiating through the Frank-Wolfe iterate.

PyEPO specializes this to :math:`\Omega(\mathbf{w}) = \tfrac{\lambda}{2}\|\mathbf{w}\|_2^2`. Substituting the closed form of :math:`\Omega^*` and the definition of :math:`\hat{\mathbf{w}}_\lambda` collapses the loss to

.. math::

   \mathcal{L}_{\mathrm{RFYL}}(\hat{\mathbf{c}}, \mathbf{w}^*(\mathbf{c})) = \Omega(\mathbf{w}^*(\mathbf{c})) - \Omega(\hat{\mathbf{w}}_\lambda(\hat{\mathbf{c}})) + \hat{\mathbf{c}}^\top \big( \mathbf{w}^*(\mathbf{c}) - \hat{\mathbf{w}}_\lambda(\hat{\mathbf{c}}) \big),

a "compare regularizers + linear residual" form. The gradient remains :math:`\mathbf{w}^*(\mathbf{c}) - \hat{\mathbf{w}}_\lambda(\hat{\mathbf{c}})`.

.. autoclass:: pyepo.func.RFY
    :noindex:
    :members:

Training loop:

.. code-block:: python

   rfyl = pyepo.func.RFY(optmodel, lambd=1.0, processes=2)

   for epoch in range(20):
       for x, c, w, z in dataloader:
           cp = predmodel(x)
           loss = rfyl(cp, w)
           optimizer.zero_grad()
           loss.backward()
           optimizer.step()


Black-Box Methods
=================

Black-box methods replace the zero gradient of the discrete solver with a surrogate backward rule. The training loops below define an objective-value loss on the returned solution.


Differentiable Black-Box Optimizer (DBB)
----------------------------------------

DBB [#f3]_ estimates gradients by interpolating the solver output between the predicted cost and a perturbed one. ``lambd`` is a smoothing hyperparameter.

Given an upstream gradient :math:`\mathbf{d} = \nabla_{\mathbf{w}} \mathcal{L}(\hat{\mathbf{c}}, \cdot) \big|_{\mathbf{w} = \mathbf{w}^*(\hat{\mathbf{c}})}`, DBB approximates the vector-Jacobian product by interpolating the loss between :math:`\hat{\mathbf{c}}` and a perturbed cost :math:`\hat{\mathbf{c}} + \lambda \mathbf{d}`, yielding

.. math::

   \frac{\partial \mathcal{L}(\hat{\mathbf{c}}, \cdot)}{\partial \hat{\mathbf{c}}} \approx \frac{\mathbf{w}^*(\hat{\mathbf{c}} + \lambda \mathbf{d}) - \mathbf{w}^*(\hat{\mathbf{c}})}{\lambda}.

Larger :math:`\lambda` smooths more aggressively; the recommended range is 10 to 20. Unlike SPO+, the resulting surrogate is nonconvex in :math:`\hat{\mathbf{c}}`, which weakens convergence guarantees even when the predictor is convex in its parameters.

.. autoclass:: pyepo.func.DBB
    :noindex:
    :members:

Training loop:

.. code-block:: python

   dbb = pyepo.func.DBB(optmodel, lambd=10, processes=2)
   criterion = nn.L1Loss()

   for epoch in range(20):
       for x, c, w, z in dataloader:
           cp = predmodel(x)
           wp = dbb(cp)
           zp = (wp * c).sum(1).view(-1, 1)
           loss = criterion(zp, z)
           optimizer.zero_grad()
           loss.backward()
           optimizer.step()


Negative Identity Backpropagation (NID)
---------------------------------------

NID [#f4]_ treats the solver as a negative identity mapping during backpropagation; it is hyperparameter-free.

NID approximates the solver Jacobian by the (signed) identity, :math:`\partial \mathbf{w}^*(\hat{\mathbf{c}}) / \partial \hat{\mathbf{c}} \approx -\mathbf{I}` for a minimization problem (and :math:`+\mathbf{I}` for maximization). Given an upstream gradient :math:`\mathbf{d} = \partial \mathcal{L}(\hat{\mathbf{c}}, \cdot) / \partial \mathbf{w}^*(\hat{\mathbf{c}})`, the chain rule gives a sign-flipped straight-through estimator,

.. math::

   \frac{\partial \mathcal{L}(\hat{\mathbf{c}}, \cdot)}{\partial \hat{\mathbf{c}}} \approx -\mathbf{d} \quad (\text{minimization}), \qquad \frac{\partial \mathcal{L}(\hat{\mathbf{c}}, \cdot)}{\partial \hat{\mathbf{c}}} \approx \mathbf{d} \quad (\text{maximization}).

For minimization, this rule decreases :math:`\hat{\mathbf{c}}` along directions where :math:`\mathbf{w}^*(\hat{\mathbf{c}})` tends to increase, and vice versa. NID is the special case of DBB in which :math:`\lambda` is chosen so that the interpolated solution coincides with the negative-identity update, but it saves the extra solver call required by DBB.

.. autoclass:: pyepo.func.NID
    :noindex:
    :members:

Training loop:

.. code-block:: python

   nid = pyepo.func.NID(optmodel, processes=2)
   criterion = nn.L1Loss()

   for epoch in range(20):
       for x, c, w, z in dataloader:
           cp = predmodel(x)
           wp = nid(cp)
           zp = (wp * c).sum(1).view(-1, 1)
           loss = criterion(zp, z)
           optimizer.zero_grad()
           loss.backward()
           optimizer.step()


Cone-Aligned Estimation
=======================

Cone-aligned losses supervise the *predicted cost vector* directly by aligning it with the polyhedral cone of binding-constraint normals at the true optimum, rather than supervising on the optimal solution itself.


Cone-Aligned Vector Estimation (CaVE)
-------------------------------------

CaVE [#f12]_ is a surrogate loss for **binary linear programs** (TSP, CVRP, knapsack, shortest path with binary edges, etc.). For each instance, it projects the sense-flipped predicted cost vector onto the polyhedral cone spanned by the binding-constraint normals at the true optimal vertex, then minimizes ``1 - cos(pred, proj)``.

For a runnable walkthrough, see the `04 CaVE for Binary Linear Programs <https://colab.research.google.com/github/khalil-research/PyEPO/blob/main/notebooks/04%20CaVE%20for%20Binary%20Linear%20Programs.ipynb>`_ notebook.

Let :math:`K(\mathbf{w}^*(\mathbf{c}))` be the polyhedral cone spanned by the constraint normals that bind at the true optimal vertex. For a minimization problem, the KKT conditions require :math:`-\hat{\mathbf{c}} \in K(\mathbf{w}^*(\mathbf{c}))` for :math:`\mathbf{w}^*(\mathbf{c})` to remain optimal under the predicted cost. CaVE measures this alignment via a cosine loss against the cone projection :math:`\mathbf{p} = \mathrm{proj}_{K}(-\hat{\mathbf{c}})`,

.. math::

   \mathcal{L}_{\mathrm{CaVE}}(\hat{\mathbf{c}}, K) = 1 - \frac{(-\hat{\mathbf{c}})^\top \mathbf{p}}{\|\hat{\mathbf{c}}\|_2\, \|\mathbf{p}\|_2}.

When :math:`-\hat{\mathbf{c}}` already lies inside the cone, :math:`\mathbf{p} = -\hat{\mathbf{c}}` and the loss is zero; the further :math:`-\hat{\mathbf{c}}` strays outside the cone, the larger the loss.

CaVE uses two solvers at different stages: a Gurobi-backed ``optModel`` extracts the binding-constraint normals when the dataset is built, and Clarabel projects the predicted cost onto that cone during training. This avoids solving an ILP in every training step. ``max_iter`` caps the Clarabel iterations; the default ``max_iter=3`` is the paper's **CaVE+** preset, which under-converges the projection on purpose so it stays interior to the cone. Raising it changes the loss, not just its precision.

Setting ``solve_ratio < 1`` enables the **CaVE-Hybrid** update: each batch uses the QP projection with probability ``solve_ratio`` and otherwise uses a blend of the normalized predicted cost and the average binding-constraint normal. ``inner_ratio`` controls the blend.

.. figure:: ../../images/cave_vrp20.png
    :width: 100%
    :align: center

    CVRP-20 results from notebook 04: ``num_data=1000``, 10 epochs, single process. In this setup, CaVE+ trains 8.2x faster than SPO+; CaVE-Hybrid with ``solve_ratio=0.3`` trains 10.5x faster with higher final regret.

Training data comes from ``pyepo.data.dataset.optDatasetConstrs``, which extracts the binding-constraint normals at the optimum for each instance. Per-instance constraint counts are ragged, so batch it with ``pyepo.data.dataset.optDataLoader``, which zero-pads them automatically. CaVE currently requires a Gurobi-backed ``optModel``.

.. autoclass:: pyepo.func.CaVE
    :noindex:
    :members:

Training loop (the batch carries ``tight_ctrs`` in addition to ``(x, c, w, z)``):

.. code-block:: python

   from pyepo.data.dataset import optDatasetConstrs, optDataLoader

   dataset_constr = optDatasetConstrs(optmodel, feat, costs)
   dataloader_constr = optDataLoader(dataset_constr, batch_size=32, shuffle=True)

   cave = pyepo.func.CaVE(optmodel, processes=2)

   for epoch in range(20):
       for x, c, w, z, tight_ctrs in dataloader_constr:
           cp = predmodel(x)
           loss = cave(cp, tight_ctrs)
           optimizer.zero_grad()
           loss.backward()
           optimizer.step()

If you use the **CaVE** loss, please cite:

.. code-block:: bibtex

   @inproceedings{tang2024cave,
     title={CaVE: A Cone-Aligned Approach for Fast Predict-then-Optimize with Binary Linear Programs},
     author={Tang, Bo and Khalil, Elias B},
     booktitle={Integration of Constraint Programming, Artificial Intelligence, and Operations Research},
     pages={193--210},
     year={2024},
     publisher={Springer}
   }


Contrastive Methods
===================

Contrastive methods train against a pool of cached non-optimal solutions, treated as negative examples. ``solve_ratio`` controls how often new instances are solved exactly during training; ``dataset`` seeds the pool and is required by these methods. See :doc:`../advanced/pool` for details on the solution-pool mechanism.


Noise Contrastive Estimation (NCE)
----------------------------------

NCE [#f7]_ is a surrogate loss based on negative examples. It uses a small set of non-optimal solutions as negative samples to maximize the predicted-cost margin between the optimal solution and the rest.

Let :math:`\Gamma` be the cached pool of feasible solutions. For a minimization problem, NCE averages the predicted-cost margin between :math:`\mathbf{w}^*(\mathbf{c})` and every member of the pool,

.. math::

   \mathcal{L}_{\mathrm{NCE}}(\hat{\mathbf{c}}, \mathbf{w}^*(\mathbf{c})) = \frac{1}{|\Gamma|} \sum_{\mathbf{w} \in \Gamma} \big( \hat{\mathbf{c}}^\top \mathbf{w}^*(\mathbf{c}) - \hat{\mathbf{c}}^\top \mathbf{w} \big).

(If :math:`\mathbf{w}^*(\mathbf{c}) \in \Gamma`, its term contributes zero and the sum effectively averages over the negatives.) The gradient has a closed form that requires no solver call in the backward pass,

.. math::

   \frac{\partial \mathcal{L}_{\mathrm{NCE}}(\hat{\mathbf{c}}, \mathbf{w}^*(\mathbf{c}))}{\partial \hat{\mathbf{c}}} = \mathbf{w}^*(\mathbf{c}) - \frac{1}{|\Gamma|} \sum_{\mathbf{w} \in \Gamma} \mathbf{w}.

For a fixed :math:`\Gamma`, this update direction stays constant per instance. ``solve_ratio`` controls how often the pool is refreshed.

.. autoclass:: pyepo.func.NCE
    :noindex:
    :members:

Training loop:

.. code-block:: python

   nce = pyepo.func.NCE(optmodel, processes=2, solve_ratio=0.05, dataset=dataset)

   for epoch in range(20):
       for x, c, w, z in dataloader:
           cp = predmodel(x)
           loss = nce(cp, w)
           optimizer.zero_grad()
           loss.backward()
           optimizer.step()


Contrastive MAP (CMAP)
----------------------

CMAP [#f7]_ is a special case of NCE that uses only the lowest predicted-cost negative sample.

CMAP keeps only the most-violating member of the pool, the one with the smallest predicted-cost objective, yielding a max-margin contrast against the true optimum. For a minimization problem,

.. math::

   \mathcal{L}_{\mathrm{CMAP}}(\hat{\mathbf{c}}, \mathbf{w}^*(\mathbf{c})) = \max_{\mathbf{w} \in \Gamma} \big( \hat{\mathbf{c}}^\top \mathbf{w}^*(\mathbf{c}) - \hat{\mathbf{c}}^\top \mathbf{w} \big) = \hat{\mathbf{c}}^\top \mathbf{w}^*(\mathbf{c}) - \min_{\mathbf{w} \in \Gamma} \hat{\mathbf{c}}^\top \mathbf{w}.

If :math:`\mathbf{w}^*(\mathbf{c}) \in \Gamma`, that entry contributes a zero margin, so the loss is non-negative and vanishes precisely when the predicted costs already make :math:`\mathbf{w}^*(\mathbf{c})` the minimizer over :math:`\Gamma`.

.. autoclass:: pyepo.func.CMAP
    :noindex:
    :members:

Training loop:

.. code-block:: python

   cmap = pyepo.func.CMAP(optmodel, processes=2, solve_ratio=0.05, dataset=dataset)

   for epoch in range(20):
       for x, c, w, z in dataloader:
           cp = predmodel(x)
           loss = cmap(cp, w)
           optimizer.zero_grad()
           loss.backward()
           optimizer.step()


Learning to Rank
================

Learning to rank [#f8]_ treats predict-then-optimize training as ranking a pool of feasible solutions: predicted costs assign scores to solutions, and the loss encourages the optimal solution to rank highest. Each variant differs in how it scores the ranking. Like contrastive methods, LTR uses ``solve_ratio`` and ``dataset`` to manage the pool.

* **Pointwise** regresses each predicted score :math:`\hat{\mathbf{c}}^\top \mathbf{w}` toward the true value :math:`\mathbf{c}^\top \mathbf{w}` for every :math:`\mathbf{w} \in \Gamma`.
* **Pairwise** enforces a margin between the optimal solution and each suboptimal one.
* **Listwise** models the entire ranking distribution with a SoftMax over the negative predicted costs (for a minimization problem, so the lowest-cost solution gets the highest probability),

  .. math::

     P(\mathbf{w}' \mid \hat{\mathbf{c}}) = \frac{\exp(-\hat{\mathbf{c}}^\top \mathbf{w}')}{\sum_{\mathbf{w} \in \Gamma} \exp(-\hat{\mathbf{c}}^\top \mathbf{w})},

  and minimizes the cross-entropy against the true ranking distribution,

  .. math::

     \mathcal{L}_{\mathrm{LTR}}^{\mathrm{list}}(\hat{\mathbf{c}}, \mathbf{c}) = - \sum_{\mathbf{w} \in \Gamma} P(\mathbf{w} \mid \mathbf{c}) \log P(\mathbf{w} \mid \hat{\mathbf{c}}).

  The gradient has a closed form,

  .. math::

     \frac{\partial \mathcal{L}_{\mathrm{LTR}}^{\mathrm{list}}(\hat{\mathbf{c}}, \mathbf{c})}{\partial \hat{\mathbf{c}}} = \sum_{\mathbf{w} \in \Gamma} P(\mathbf{w} \mid \mathbf{c})\, \mathbf{w} - \sum_{\mathbf{w} \in \Gamma} P(\mathbf{w} \mid \hat{\mathbf{c}})\, \mathbf{w}.


Pointwise LTR
-------------

.. autoclass:: pyepo.func.ptLTR
    :noindex:
    :members:

Training loop:

.. code-block:: python

   ltr = pyepo.func.ptLTR(optmodel, processes=2, solve_ratio=0.05, dataset=dataset)

   for epoch in range(20):
       for x, c, w, z in dataloader:
           cp = predmodel(x)
           loss = ltr(cp, c)
           optimizer.zero_grad()
           loss.backward()
           optimizer.step()


Pairwise LTR
------------

.. autoclass:: pyepo.func.prLTR
    :noindex:
    :members:

Training loop:

.. code-block:: python

   ltr = pyepo.func.prLTR(optmodel, processes=2, solve_ratio=0.05, dataset=dataset)

   for epoch in range(20):
       for x, c, w, z in dataloader:
           cp = predmodel(x)
           loss = ltr(cp, c)
           optimizer.zero_grad()
           loss.backward()
           optimizer.step()


Listwise LTR
------------

.. autoclass:: pyepo.func.lsLTR
    :noindex:
    :members:

Training loop:

.. code-block:: python

   ltr = pyepo.func.lsLTR(optmodel, processes=2, solve_ratio=0.05, dataset=dataset)

   for epoch in range(20):
       for x, c, w, z in dataloader:
           cp = predmodel(x)
           loss = ltr(cp, c)
           optimizer.zero_grad()
           loss.backward()
           optimizer.step()


Parallel Computation
====================

All ``pyepo.func`` modules support parallel solving during training via the ``processes`` parameter (``0`` uses all available cores).

.. image:: ../../images/parallel-tsp.png
   :width: 650
   :class: light-bg

The figure reports runtime for different values of ``processes``.


.. rubric:: Footnotes

.. [#f1] Elmachtoub, A. N., & Grigas, P. (2021). Smart "predict, then optimize". Management Science.
.. [#f3] Vlastelica, M., Paulus, A., Musil, V., Martius, G., & Rolinek, M. (2019). Differentiation of blackbox combinatorial solvers. arXiv preprint arXiv:1912.02175.
.. [#f4] Sahoo, S. S., Paulus, A., Vlastelica, M., Musil, V., Kuleshov, V., & Martius, G. (2022). Backpropagation through combinatorial algorithms: Identity with projection works. arXiv preprint arXiv:2205.15213.
.. [#f5] Berthet, Q., Blondel, M., Teboul, O., Cuturi, M., Vert, J. P., & Bach, F. (2020). Learning with differentiable perturbed optimizers. Advances in Neural Information Processing Systems, 33, 9508-9519.
.. [#f6] Dalle, G., Baty, L., Bouvier, L., & Parmentier, A. (2022). Learning with Combinatorial Optimization Layers: a Probabilistic Approach. arXiv preprint arXiv:2207.13513.
.. [#f7] Mulamba, M., Mandi, J., Diligenti, M., Lombardi, M., Bucarey, V., & Guns, T. (2021). Contrastive losses and solution caching for predict-and-optimize. Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence.
.. [#f8] Mandi, J., Bucarey, V., Mulamba, M., & Guns, T. (2022). Decision-focused learning: through the lens of learning to rank. Proceedings of the 39th International Conference on Machine Learning.
.. [#f9] Niepert, M., Minervini, P., & Franceschi, L. (2021). Implicit MLE: backpropagating through discrete exponential family distributions. Advances in Neural Information Processing Systems, 34, 14567-14579.
.. [#f10] Minervini, P., Franceschi, L., & Niepert, M. (2023). Adaptive perturbation-based gradient estimation for discrete latent variable models. Proceedings of the AAAI Conference on Artificial Intelligence.
.. [#f11] Gupta, V., & Huang, M. (2024). Decision-Focused Learning with Directional Gradients. arXiv preprint arXiv:2402.03256.
.. [#f12] Tang, B., & Khalil, E. B. (2024). CaVE: A Cone-Aligned Approach for Fast Predict-then-Optimize with Binary Linear Programs. In Integration of Constraint Programming, Artificial Intelligence, and Operations Research (pp. 193-210).
.. [#f13] Blondel, M., Martins, A. F. T., & Niculae, V. (2020). Learning with Fenchel-Young losses. Journal of Machine Learning Research, 21(35), 1-69.