CuteMarkets Docs

Backtesting Framework

Framework guides for engineers building realistic options backtests with causal data, quote-aware fills, and robust validation.

Tip: open /docs/backtesting-robustness.md directly for raw markdown (easy copy/paste into an LLM).

Quick definition: robustness testing checks whether a strategy still looks credible after selection rules, time splits, execution assumptions, and portfolio aggregation are applied consistently.

Robustness testing asks whether a strategy still looks credible when the framework stops optimizing on the same data used for evaluation. In options research this matters because small sample sizes, changing liquidity, and many profile variants can produce attractive but fragile results.

Why this matters

Options backtests often have fewer trades than stock-bar studies, and each trade can carry more execution sensitivity. A result can be dominated by one event window, one symbol, one expiration regime, or one fill assumption. Robustness work is the process of asking whether the result is still meaningful after those dependencies are visible.

The goal is not to make every strategy survive. The goal is to identify which candidates deserve more research, which should be paper-tested under strict controls, and which should be closed as no-go branches.

Walk-forward validation

A basic walk-forward process splits time into train and out-of-sample windows:

  1. Generate candidate profiles on the training window.
  2. Select profiles using predeclared metrics and gates.
  3. Run the selected profile on the next out-of-sample window.
  4. Move the window forward and repeat.
  5. Aggregate only the out-of-sample trades for final evaluation.

The simulator rules must stay fixed between train and test. Changing fill rules, DTE windows, or contract availability after seeing a test result invalidates the comparison.
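The steps above can be sketched as a small window generator plus a loop. This is a minimal illustration, not the framework's own code: it assumes calendar-day windows and a rolling (not expanding) training window, and `select_profile` and `run_profile` are hypothetical callables standing in for the selection logic and the simulator.

```python
from datetime import date, timedelta
from typing import Iterator, NamedTuple


class Fold(NamedTuple):
    train_start: date
    train_end: date   # exclusive
    test_start: date
    test_end: date    # exclusive


def walk_forward_folds(start: date, end: date,
                       train_days: int, test_days: int) -> Iterator[Fold]:
    """Yield rolling folds; each test window starts where its train window ends."""
    train_start = start
    while True:
        train_end = train_start + timedelta(days=train_days)
        test_end = train_end + timedelta(days=test_days)
        if test_end > end:
            break
        yield Fold(train_start, train_end, train_end, test_end)
        # Step forward by one test window so test windows never overlap.
        train_start += timedelta(days=test_days)


def run_walk_forward(folds, select_profile, run_profile):
    """Collect only out-of-sample trades across all folds."""
    oos_trades = []
    for fold in folds:
        # Steps 1-2: generate and select using training data only.
        profile = select_profile(fold.train_start, fold.train_end)
        # Step 3: run the selected profile on the next unseen window.
        oos_trades.extend(run_profile(profile, fold.test_start, fold.test_end))
    # Step 5: the caller evaluates this out-of-sample list, nothing else.
    return oos_trades


folds = list(walk_forward_folds(date(2023, 1, 1), date(2024, 1, 1),
                                train_days=180, test_days=30))
```

Because the same `run_profile` callable is used for every fold, the simulator rules cannot drift between train and test without an explicit code change.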

Holdout and nested selection

Use a holdout window when a family has already been explored heavily. Use nested selection when many profiles compete inside each outer fold. The inner loop chooses a profile; the outer loop measures what that choice would have done out of sample.

This discipline is slower than one big backtest, but it answers a better question: would the process have selected something useful before seeing the future?
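A rough sketch of the inner and outer loops, under simplifying assumptions: profiles are fixed parameter sets that need no fitting, each outer fold is a `(train, test)` pair of time-ordered rows, and `score` is a hypothetical callable that returns a single number for a profile on a slice of data.

```python
from statistics import mean


def chunk(data, n):
    """Split data into n contiguous, near-equal chunks, preserving time order."""
    size, rem = divmod(len(data), n)
    out, i = [], 0
    for k in range(n):
        j = i + size + (1 if k < rem else 0)
        out.append(data[i:j])
        i = j
    return out


def nested_selection(profiles, outer_folds, score, inner_chunks=3):
    """Pick a profile inside each outer fold, then score it out of sample."""
    results = []
    for train, test in outer_folds:
        inner = chunk(train, inner_chunks)
        # Inner loop: rank profiles by mean score across inner chunks of
        # the training data. The outer test data is never touched here.
        chosen = max(profiles, key=lambda p: mean(score(p, c) for c in inner))
        # Outer loop: measure what that choice would have done out of sample.
        results.append({"profile": chosen, "oos_score": score(chosen, test)})
    return results
```

The list of `oos_score` values describes the selection process itself, which is the quantity the nested design is meant to measure.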

Portfolio metrics

When multiple symbols can trade on the same day, aggregate PnL by calendar day before computing portfolio risk metrics. Do not flatten per-symbol daily returns into one long series, which treats each symbol's row as if it were a separate day.

The trade log can have many rows per day. The risk series should represent what the portfolio experienced:

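python
from collections import defaultdict

# Sum trade PnL per calendar day so each day contributes one observation.
daily_pnl = defaultdict(float)
for trade in trade_log:
    daily_pnl[trade["entry_day"]] += trade["pnl"]

# Convert to returns on starting equity, in chronological order.
# Assumes entry_day sorts chronologically (date objects or ISO strings).
daily_returns = [
    pnl / initial_equity
    for day, pnl in sorted(daily_pnl.items())
]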

This prevents multi-symbol research from looking smoother than the actual calendar path.
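A toy comparison makes the smoothing effect concrete. The trade log, symbols, and equity figure below are made up for illustration: two symbols win together on one day and lose together on the next.

```python
from collections import defaultdict
from statistics import pstdev

initial_equity = 100_000.0
trade_log = [
    {"entry_day": "2024-03-01", "symbol": "AAA", "pnl": 100.0},
    {"entry_day": "2024-03-01", "symbol": "BBB", "pnl": 100.0},
    {"entry_day": "2024-03-04", "symbol": "AAA", "pnl": -100.0},
    {"entry_day": "2024-03-04", "symbol": "BBB", "pnl": -100.0},
]

# Wrong: every trade row is treated as its own "day".
flattened = [t["pnl"] / initial_equity for t in trade_log]

# Right: one observation per calendar day.
daily_pnl = defaultdict(float)
for t in trade_log:
    daily_pnl[t["entry_day"]] += t["pnl"]
aggregated = [pnl / initial_equity for _, pnl in sorted(daily_pnl.items())]

flattened_vol = pstdev(flattened)    # four small moves
aggregated_vol = pstdev(aggregated)  # two larger moves
```

In this example the aggregated series has twice the daily volatility of the flattened one, so any ratio built on the flattened rows would overstate the quality of the portfolio path.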

Diagnostics

Useful robustness diagnostics include:

| Diagnostic | Purpose | Red flag |
| --- | --- | --- |
| Out-of-sample return | Measures result after selection. | Strong train return with weak or negative test return. |
| Sharpe and Sortino | Measures daily return quality from realized portfolio days. | Good metric from sparse or incorrectly split rows. |
| Max drawdown | Measures path risk. | One drawdown larger than the intended operating budget. |
| Trade count and trades per week | Prevents sparse lucky profiles from dominating. | Attractive result from too few active days. |
| Coverage ratio | Shows how often the framework had enough data to test. | Low coverage hidden behind high returns. |
| PBO | Estimates probability of backtest overfitting across profile combinations. | High PBO after broad parameter search. |
| Deflated Sharpe | Adjusts a Sharpe-like result for multiple testing and non-normality. | Ordinary Sharpe that vanishes after adjustment. |
| Overlap | Checks whether a new profile is just a duplicate of an existing one. | A candidate that adds no independent book contribution. |
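Three of these diagnostics can be computed directly from a calendar-day return series. The sketch below is one common formulation, not necessarily the framework's: it assumes a zero risk-free rate, a zero target for Sortino, 252 trading days per year, and non-compounded returns expressed as a fraction of initial equity. PBO and Deflated Sharpe need the full set of tried profiles and are not covered here.

```python
import math
from statistics import mean, stdev

TRADING_DAYS = 252  # assumed annualization factor


def sharpe(daily_returns, periods=TRADING_DAYS):
    """Annualized mean over sample standard deviation; NaN if undefined."""
    if len(daily_returns) < 2:
        return float("nan")
    sd = stdev(daily_returns)
    if sd == 0:
        return float("nan")
    return mean(daily_returns) / sd * math.sqrt(periods)


def sortino(daily_returns, periods=TRADING_DAYS):
    """Like Sharpe, but penalizes only returns below zero."""
    if not daily_returns:
        return float("nan")
    downside = math.sqrt(mean(min(r, 0.0) ** 2 for r in daily_returns))
    if downside == 0:
        return float("nan")
    return mean(daily_returns) / downside * math.sqrt(periods)


def max_drawdown(daily_returns):
    """Largest peak-to-trough decline of the equity curve, as a positive fraction."""
    equity, peak, worst = 1.0, 1.0, 0.0
    for r in daily_returns:
        equity += r  # additive: returns are PnL over initial equity
        peak = max(peak, equity)
        worst = max(worst, (peak - equity) / peak)
    return worst
```

Returning NaN for degenerate inputs keeps a sparse or constant series from producing a misleadingly large ratio.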

Promotion gates

Treat gates as research controls, not marketing hurdles. A profile can be profitable and still fail if it has too few trades, poor data coverage, high drawdown, unstable folds, or excessive overlap with a stronger profile.
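One way to make gates auditable is to express them as named checks that return the list of blockers. The thresholds and metric names below are hypothetical placeholders; actual gates should be declared before the search is run.

```python
# Hypothetical thresholds, for illustration only.
GATES = {
    "min_trades": 100,
    "min_coverage": 0.90,
    "max_drawdown": 0.15,
    "min_positive_folds": 0.60,
    "max_overlap": 0.70,
}


def failed_checks(metrics, gates=GATES):
    """Return the names of every gate the profile fails (empty list = pass)."""
    checks = {
        "too_few_trades": metrics["trade_count"] < gates["min_trades"],
        "poor_coverage": metrics["coverage_ratio"] < gates["min_coverage"],
        "high_drawdown": metrics["max_drawdown"] > gates["max_drawdown"],
        "unstable_folds": metrics["positive_fold_share"] < gates["min_positive_folds"],
        "excessive_overlap": metrics["overlap"] > gates["max_overlap"],
    }
    return [name for name, failed in checks.items() if failed]


# A profitable candidate that still fails because it traded too rarely.
candidate = {
    "total_return": 0.22,
    "trade_count": 35,
    "coverage_ratio": 0.95,
    "max_drawdown": 0.08,
    "positive_fold_share": 0.75,
    "overlap": 0.20,
}
blockers = failed_checks(candidate)
promoted = not blockers
```

Note that `total_return` never appears in the checks: profitability alone does not promote a profile.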

A useful summary should include both winners and blockers:

  • selected profile
  • rejected finalists
  • failed checks
  • fold-by-fold metrics
  • option availability diagnostics
  • execution rejection counts
  • final trade-level export

No-go reports

No-go reports add to the credibility of the research process; they are not clutter. They show that the process is willing to close weak branches. A no-go report should explain whether the blocker was signal quality, execution realism, liquidity, overfitting, concentration, overlap, or operational risk.

When no-go pages and docs use the same vocabulary as the framework, readers can connect the public research back to the simulator. That is useful for developers because it turns subjective strategy commentary into auditable engineering criteria.
