Neural SDE from zero
Let the network learn the SDE
Every model you have seen so far -- Black-Scholes, Heston, SABR -- starts from a human-chosen equation. You pick the SDE, then fit a few parameters. Neural SDEs flip the script: let a neural network learn the equation itself from data.
The classical workflow is: a human writes down dS = μ(S,t)·dt + f(S,t)·dW with a specific diffusion f (like σ·S, or σ·Sᵝ, or something involving stochastic vol). Then you calibrate 3-5 parameters to market data.
The neural SDE workflow is: the drift μ(S,t) and diffusion σ(S,t) are the outputs of a neural network. The network has thousands of parameters (weights and biases). You train it by minimizing the error between model prices and observed option prices.
Classical modeling is like choosing a recipe and tuning the oven temperature. Neural SDE modeling is like teaching a chef to invent the recipe by tasting thousands of dishes (observed prices) and adjusting until the output matches what the market serves.
Why bother? Because sometimes no standard model family fits the data well enough. The market dynamics might have features -- regime switches, asymmetric clustering, path-dependent behavior -- that no five-parameter model can capture. A neural SDE can, in principle, approximate any continuous drift and diffusion functions. The question is whether you have enough data and discipline to train it reliably.
Architecture
The network is a standard feedforward architecture. The inputs are the current market state. The outputs are the SDE coefficients. The network IS the model.
Inputs: Spot price S, time t, and optionally market features like current implied vol, skew slope, or term structure shape. The richer the input, the more context the network has for deciding what σ should be at this point.
Hidden layers: Typically 2-4 layers with 32-128 neurons each. ReLU or softplus activations. Nothing exotic. The magic is not in the architecture; it is in what the network learns to represent.
Outputs: The drift μ(S,t) and the diffusion σ(S,t). The diffusion output passes through a softplus or exponential to ensure it stays positive. These two numbers, evaluated at the current state, define what the SDE does at this instant.
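A minimal sketch of this architecture in plain NumPy, with hypothetical sizes (two hidden layers of 32 units, ReLU activations, softplus on the diffusion head); a real implementation would use an autodiff framework, but the forward pass is this simple:

```python
import numpy as np

def softplus(x):
    # numerically stable log(1 + e^x); maps any real output to (0, inf)
    return np.logaddexp(0.0, x)

class DriftDiffusionNet:
    """Feedforward net mapping state (S, t) -> (mu, sigma).

    Hypothetical sizes: two hidden layers of 32 units, ReLU activations,
    softplus on the diffusion head so sigma stays strictly positive.
    """
    def __init__(self, hidden=32, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0.0, 0.5, (2, hidden)); self.b1 = np.zeros(hidden)
        self.W2 = rng.normal(0.0, 0.5, (hidden, hidden)); self.b2 = np.zeros(hidden)
        self.W3 = rng.normal(0.0, 0.5, (hidden, 2)); self.b3 = np.zeros(2)

    def __call__(self, S, t):
        S = np.atleast_1d(np.asarray(S, dtype=float))
        # crude input scaling so spot and time live on comparable ranges
        x = np.stack([S / 100.0, np.broadcast_to(float(t), S.shape)], axis=-1)
        h = np.maximum(0.0, x @ self.W1 + self.b1)   # hidden layer 1 (ReLU)
        h = np.maximum(0.0, h @ self.W2 + self.b2)   # hidden layer 2 (ReLU)
        out = h @ self.W3 + self.b3
        return out[..., 0], softplus(out[..., 1])    # (mu, sigma), sigma > 0

net = DriftDiffusionNet()
mu, sigma = net(100.0, 0.5)   # sigma is guaranteed positive by the softplus
```

The softplus head is the detail that matters: it guarantees the simulated paths never see a negative volatility, regardless of what the weights do.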
Training: Generate paths from the neural SDE using an Euler-Maruyama discretization. Price options along those paths via Monte Carlo. Compare model prices to observed market prices. Backpropagate the pricing error through the path simulation and into the network weights. This is differentiable programming applied to stochastic processes.
The key technical insight: the entire pipeline -- from network weights to SDE coefficients to simulated paths to option prices -- is differentiable. You can compute gradients of the pricing loss with respect to every weight in the network. That is what makes training feasible.
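The forward half of that pipeline can be sketched as follows. The stand-in `net` below returns fixed lognormal-style coefficients rather than learned ones, and the quotes are illustrative numbers; in a real setup the coefficients come from the network and this loss is backpropagated through the simulation into the weights:

```python
import numpy as np

# Stand-in for the trained network: fixed coefficients mimicking lognormal
# dynamics (zero drift, 20% vol). In practice these come from the net.
def net(S, t):
    return 0.0 * S, 0.2 * S

def simulate_paths(S0, T, n_steps, n_paths, seed=1):
    """Euler-Maruyama: S_{k+1} = S_k + mu*dt + sigma*sqrt(dt)*Z."""
    rng = np.random.default_rng(seed)
    dt = T / n_steps
    S = np.full(n_paths, float(S0))
    for k in range(n_steps):
        mu, sigma = net(S, k * dt)
        dW = np.sqrt(dt) * rng.standard_normal(n_paths)
        S = S + mu * dt + sigma * dW
    return S

def pricing_loss(market_quotes, S0=100.0, T=1.0):
    """Mean squared error between MC model prices and market call prices."""
    ST = simulate_paths(S0, T, n_steps=100, n_paths=50_000)
    err = 0.0
    for K, mkt in market_quotes:
        model = np.mean(np.maximum(ST - K, 0.0))   # undiscounted, r = 0
        err += (model - mkt) ** 2
    return err / len(market_quotes)

quotes = [(90.0, 13.59), (100.0, 7.97), (110.0, 4.29)]  # illustrative quotes
loss = pricing_loss(quotes)   # small when model dynamics match the quotes
```

Calibrating Heston fits five numbers; here every weight feeding mu and sigma is a free parameter, and the gradient of this loss with respect to each one is what the backward pass supplies.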
Deep hedging
Once you have a learned SDE for the price dynamics, the natural next step is to also learn the hedge. Deep hedging uses a second network to output the hedge ratio at each timestep, trained jointly with the pricing model.
Classical hedging computes delta from the model analytically: ∂C/∂S under BS, or a numerical approximation under more complex models. This ignores transaction costs, market impact, discrete rebalancing, and liquidity constraints.
Deep hedging says: train a network to output the hedge ratio δ(S, t, portfolio) at each timestep. The training objective is not to minimize tracking error against a theoretical delta. It is to minimize the actual hedging P&L variance (or CVaR, or any risk measure) including transaction costs.
The result: a hedging strategy that is aware of the real-world frictions that classical delta ignores. In backtests, deep hedging strategies often show lower realized hedging cost than model-based delta, especially for:
1. High transaction cost regimes. The network learns to hedge less frequently when costs are high, effectively choosing a wider no-trade band.
2. Illiquid underlyings. The network learns to use correlated liquid instruments as proxy hedges when the direct hedge is expensive.
3. Path-dependent exotics. Where no simple delta formula exists, the network can still learn effective hedges from simulated paths.
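The objective behind these results can be sketched in a few lines: compute the terminal P&L of a hedged short call along each simulated path, including proportional transaction costs, then minimize a risk measure of that P&L. The naive threshold policy below is a placeholder for the hedging network (hypothetical cost level and path model):

```python
import numpy as np

def hedged_pnl(path, dt, policy, K, cost=0.001):
    """Terminal P&L of a short call hedged along one path.

    policy(S, t) -> hedge ratio; a proportional transaction cost is
    charged on every rebalance. Premium received is omitted: only the
    spread of the P&L distribution matters for the risk objective.
    """
    cash, delta_prev = 0.0, 0.0
    for k, S in enumerate(path[:-1]):
        delta = policy(S, k * dt)
        trade = delta - delta_prev
        cash -= trade * S + cost * abs(trade) * S   # trade stock, pay costs
        delta_prev = delta
    ST = path[-1]
    return cash + delta_prev * ST - max(ST - K, 0.0)  # liquidate, pay claim

# Toy lognormal paths and a naive policy: hold 1 share when in the money
rng = np.random.default_rng(0)
dt, n = 1 / 52, 52
paths = 100.0 * np.exp(np.cumsum(
    -0.5 * 0.04 * dt + 0.2 * np.sqrt(dt) * rng.standard_normal((500, n)),
    axis=1))
paths = np.hstack([np.full((500, 1), 100.0), paths])

policy = lambda S, t: 1.0 if S > 100.0 else 0.0
pnl = np.array([hedged_pnl(p, dt, policy, K=100.0) for p in paths])
# training would tune the policy network to shrink e.g. pnl.std() or CVaR
```

Swap the threshold lambda for a network and minimize `pnl.std()` (or a CVaR estimate) by gradient descent, and you have the deep hedging training loop in miniature.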
The most powerful version trains the pricing SDE and the hedging network simultaneously. The SDE learns dynamics that are consistent with observed prices, and the hedging network learns to hedge under those dynamics. The two networks regularize each other: the SDE cannot learn unrealistic dynamics because the hedging network would perform poorly, and vice versa.
What the network discovers
When you inspect the learned σ(S,t) function, it often looks like local vol with stochastic features. The network independently discovers structures that humans spent decades designing.
Train a neural SDE on equity or crypto option data and then plot the learned diffusion function σ(S,t) as a heatmap. Typical findings:
Leverage effect. The network learns that σ(S,t) is higher when S is low and lower when S is high. This is exactly the mechanism that Heston captures with negative ρ and that CEV captures with β < 1. The network does not know about these models. It finds the pattern in the data.
Mean reversion in vol. The learned σ tends to be elevated after recent large moves and reverts toward a baseline. The network has independently discovered the CIR-like mean reversion that Heston hardcodes.
Vol clustering. The network learns that high-vol states persist -- σ(S,t) stays elevated for a while after a spike. This is the GARCH-like clustering that practitioners know well but that simple stochastic vol models struggle with.
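The heatmap diagnostic itself is simple: evaluate the learned diffusion on an (S, t) grid and look for these shapes. The stand-in below hand-codes a leverage effect and a decaying vol level (hypothetical functional form, chosen only to illustrate the check):

```python
import numpy as np

# Hypothetical stand-in for a learned diffusion: vol rises as spot falls
# (leverage effect) and an elevated level decays toward a baseline in time.
def learned_sigma(S, t):
    leverage = 0.2 + 0.3 * np.exp(-S / 80.0)   # higher sigma at low S
    decay = 1.0 + 0.5 * np.exp(-3.0 * t)       # elevated early, reverts
    return leverage * decay

S_grid = np.linspace(60.0, 140.0, 9)
t_grid = np.linspace(0.0, 1.0, 5)
heat = learned_sigma(S_grid[None, :], t_grid[:, None])  # rows: t, cols: S

# leverage effect shows up as sigma strictly decreasing in S at every t
assert np.all(np.diff(heat, axis=1) < 0)
```

With a trained network you would evaluate its diffusion head on the same grid and plot `heat`; the monotonicity check is a cheap sanity test for whether the leverage pattern survived training.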
Each of the three patterns above represents what a neural SDE trained on a different data regime might discover. The point is not that the network is smarter than Heston or SABR. The point is that it arrives at similar structures without being told to look for them. That is strong evidence that those structures are real features of the data, not artifacts of the model family.
The flip side: the network can also discover spurious patterns if the data is noisy or the training is not disciplined. A large network trained on thin data will overfit beautifully -- it will memorize the noise and call it structure.
Practical considerations
Neural SDEs are powerful but demanding. The gap between a research paper and a production system is wide. Know the costs before you commit.
Watch the training loss converge and you will typically see three phases: rapid initial descent (the network learns the broad structure), slower refinement (fine-tuning the wings and tails), and a plateau (diminishing returns, growing overfitting risk).
Training data requirements. You need enough option price data to constrain a high-dimensional function. For a single underlier, that means months or years of daily smile snapshots across multiple expiries. Sparse data (few strikes, few expiries) leads to underdetermined networks that overfit.
Overfitting risk. A neural network with 10,000 parameters can memorize 10,000 data points perfectly. That does not mean it has learned the dynamics. Regularization (dropout, weight decay, early stopping) is essential. Validation on held-out data is non-negotiable.
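Early stopping against a held-out set is straightforward to wire in. A generic sketch, where `step` and `val_loss` are hypothetical stand-ins for one training update and one held-out evaluation:

```python
import numpy as np

def train_with_early_stopping(step, val_loss, max_epochs=500, patience=20):
    """Stop training when held-out loss stops improving.

    `step()` performs one training update; `val_loss()` evaluates on data
    never used for gradient updates. Both are supplied by the harness.
    """
    best, best_epoch, since_best = np.inf, 0, 0
    for epoch in range(max_epochs):
        step()
        v = val_loss()
        if v < best - 1e-6:
            best, best_epoch, since_best = v, epoch, 0
        else:
            since_best += 1
        if since_best >= patience:   # validation stalled: likely overfitting
            break
    return best, best_epoch

# Toy demo: a validation curve that falls then rises (classic overfitting)
losses = (1.0 / (e + 1) + 0.002 * e for e in range(500))
best, when = train_with_early_stopping(lambda: None, lambda: next(losses))
```

The returned `best_epoch` is the checkpoint you keep; everything trained after it was fitting noise in the training set, not structure.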
Interpretability. A five-parameter Heston model tells you a story: kappa says this, rho says that. A neural SDE is a black box with 10,000 parameters. You can inspect the learned function (as in the heatmap above), but you cannot point to a single number and say "that is the mean reversion speed." For a trading desk that needs to explain its model to risk managers, this is a serious drawback.
Computational cost. Training requires thousands of forward passes through the SDE (Monte Carlo paths), each requiring backpropagation through the network at every timestep. This is orders of magnitude more expensive than calibrating Heston or SABR. Inference (pricing a single option with the trained model) is fast, but recalibration is slow.
Current adoption. Neural SDEs and deep hedging are used in research and by quantitative hedge funds with the infrastructure to support them. They are not yet standard on vanilla desks. The typical production setup is: a classical model (Heston, SABR, SLV) for day-to-day pricing, with neural methods used for specific high-value problems where classical models consistently fail.
Use a neural SDE when: (1) you have rich data and the classical model family keeps missing the same patterns, (2) you are pricing exotic instruments where no clean analytical solution exists, or (3) you need a hedging strategy that accounts for real-world frictions. Do not use it when a five-parameter model fits well enough -- you are adding complexity without adding value.
Where to go next:
Heston Model -- the classical stochastic vol benchmark
Stochastic Local Vol -- production-grade calibration with dynamics
Rough Bergomi -- fractional stochastic vol, the frontier before neural methods