Backtesting Framework Test Plan

Read this page with Backtesting Framework, Backtesting Engine Loop, Backtesting Data Quality Checklist, Same-Bar Fill Lookahead, and Backtest to Paper Trading Parity Checklist.

Quick definition: a backtesting test plan proves that the simulator preserves causality, selects historical contracts correctly, prices fills from observable data, and reports portfolio risk from the right unit of observation.

A backtesting framework needs tests for scientific behavior, not only for code that runs without errors. The test suite should prove that the simulator does not leak future information, select stale contracts, overstate fills, or compute portfolio metrics from the wrong rows.

Why this matters

Backtesting bugs rarely look like crashes. They look like attractive research. A same-bar fill might improve a breakout system. A current-chain selector might find cleaner historical contracts. A midpoint fill might rescue a short-dated options strategy. A portfolio report might split one calendar day into several pseudo-days and smooth the risk path.

The test plan should turn those failure modes into regression tests. If a future change breaks a causality rule, contract-selection rule, or fill policy, the test name should say exactly what scientific guarantee was lost.

Core test groups

Group	What to prove	Example failure
Causality	Signals only use completed and available data.	Same-bar fill after completed-bar signal.
Contract selection	Historical universes, DTE rules, and cache keys produce stable choices.	Current chain leaks into past date.
Execution realism	Quotes, spreads, stops, targets, and fallbacks behave as configured.	Last price fills despite missing bid/ask.
Portfolio math	Combined daily PnL drives risk metrics when multiple symbols trade.	Two symbols on one day counted as two days.
Robustness	Folds, holdouts, PBO, and deflated Sharpe use the intended rows.	Train rows included in test summary.
Public examples	Docs and CLI examples import public names and run against sample or mocked data.	Renamed symbol breaks published guide.

Causality tests

Write tests that make leakage obvious. For an intraday setup, create a session where the signal bar closes beyond a threshold, so that a fill at the same bar's open would be impossible: the open printed before the signal existed. The expected behavior is a signal on bar t and an entry on the next observable bar or quote after t.
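The same-bar case can be sketched as a small regression test. This is a minimal illustration, not the framework's actual API: `Bar`, `next_bar_entry`, and the next-bar-open fill rule are assumptions chosen to keep the example self-contained.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Bar:
    t: int  # bar index within the session
    open: float
    high: float
    low: float
    close: float


def next_bar_entry(bars: list[Bar], threshold: float) -> Optional[tuple[int, float]]:
    """Hypothetical entry rule: signal on the first close above `threshold`,
    fill at the open of the following bar."""
    for i, bar in enumerate(bars):
        if bar.close > threshold:
            if i + 1 >= len(bars):
                return None  # signal on the last bar: nothing observable to fill against
            nxt = bars[i + 1]
            return nxt.t, nxt.open
    return None


def test_signal_bar_cannot_fill_at_its_own_open():
    bars = [
        Bar(t=0, open=100.0, high=100.5, low=99.5, close=100.2),
        Bar(t=1, open=100.1, high=103.0, low=100.0, close=102.8),  # closes above 102
        Bar(t=2, open=103.1, high=103.5, low=102.5, close=103.0),
    ]
    fill = next_bar_entry(bars, threshold=102.0)
    # Entry lands on bar 2 at its open, never at bar 1's open of 100.1.
    assert fill == (2, 103.1)
```

The fixture is built so that the leaked fill (100.1) is far better than the honest one (103.1), which makes a regression visible in the assertion message rather than in a slightly improved equity curve.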

Also test:

opening-range values come only from the opening-range window
prior-day filters do not include the current session close
premarket filters do not use regular-session bars
exit timestamps cannot precede entry timestamps
daily forecast paths only use data available before the forecast date
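Two of the checks in this list can be sketched with hand-built timestamps. The helper names, the 09:30 to 09:45 window, and the convention that bars are stamped with their start time are all assumptions for illustration.

```python
from datetime import datetime, time


def opening_range_high(bars, start=time(9, 30), end=time(9, 45)):
    """Highest high among bars stamped inside [start, end).

    `bars` is a list of (timestamp, high) pairs; the window times are assumed.
    """
    highs = [high for ts, high in bars if start <= ts.time() < end]
    if not highs:
        raise ValueError("no bars inside the opening-range window")
    return max(highs)


def exit_precedes_entry(entry_ts, exit_ts):
    """True when a trade record claims an exit before its own entry."""
    return exit_ts < entry_ts


def test_opening_range_ignores_later_bars():
    bars = [
        (datetime(2024, 1, 2, 9, 30), 101.0),
        (datetime(2024, 1, 2, 9, 40), 101.6),
        (datetime(2024, 1, 2, 9, 45), 105.0),  # first bar after the window
    ]
    # The 105.0 high belongs to a bar outside the window and must not leak in.
    assert opening_range_high(bars) == 101.6


def test_exit_timestamp_ordering():
    entry = datetime(2024, 1, 2, 10, 0)
    assert exit_precedes_entry(entry, datetime(2024, 1, 2, 9, 59))
    assert not exit_precedes_entry(entry, datetime(2024, 1, 2, 10, 5))
```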

Contract-selection tests

Use tiny deterministic contract universes. Include similar strikes and expirations so ranking errors are visible.

Required cases:

no contracts in the DTE window returns a rejection
spread filters reject the wider contract
volume or open-interest filters reject inactive contracts
changing the entry underlying price can change the selected strike
changing the selection timestamp can change quote-aware ranking
vertical structures reject missing or invalid paired legs
persistent caches do not override a different selection context
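A tiny universe is enough to exercise several of these cases. The selector below is a stand-in written for this example, not the framework's selector: the `Contract` fields, the closest-strike ranking, and the reject-reason strings are assumptions.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Contract:
    symbol: str  # placeholder identifier, not a real OCC symbol
    expiration: date
    strike: float
    bid: float
    ask: float


def select_contract(universe, as_of, underlying_price, min_dte, max_dte, max_spread_pct):
    """Return (contract, None) on success or (None, reject_reason)."""
    in_window = [c for c in universe if min_dte <= (c.expiration - as_of).days <= max_dte]
    if not in_window:
        return None, "no_contract_in_dte_window"
    tight = []
    for c in in_window:
        mid = (c.bid + c.ask) / 2
        if c.bid > 0 and c.ask >= c.bid and (c.ask - c.bid) / mid <= max_spread_pct:
            tight.append(c)
    if not tight:
        return None, "spread_too_wide"
    # Deterministic ranking: closest strike, then expiration, then symbol.
    best = min(tight, key=lambda c: (abs(c.strike - underlying_price), c.expiration, c.symbol))
    return best, None


AS_OF = date(2024, 1, 2)
EXPIRY = date(2024, 1, 19)  # 17 days to expiration from AS_OF
UNIVERSE = [
    Contract("TIGHT_100", EXPIRY, 100.0, bid=2.00, ask=2.10),
    Contract("WIDE_101", EXPIRY, 101.0, bid=1.00, ask=2.00),  # closest to 101 but too wide
    Contract("TIGHT_105", EXPIRY, 105.0, bid=1.00, ask=1.05),
]


def test_selection_cases():
    pick, reason = select_contract(UNIVERSE, AS_OF, 101.0, 7, 30, 0.10)
    assert (pick.symbol, reason) == ("TIGHT_100", None)  # spread filter skips WIDE_101
    pick, _ = select_contract(UNIVERSE, AS_OF, 104.0, 7, 30, 0.10)
    assert pick.symbol == "TIGHT_105"  # a different entry price changes the strike
    assert select_contract(UNIVERSE, AS_OF, 101.0, 30, 45, 0.10) == (None, "no_contract_in_dte_window")
```

Placing the wide contract at the strike nearest the underlying is deliberate: a selector that ranks before filtering, or forgets the spread filter, picks it and the test fails.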

Execution tests

Build quote windows by hand. Tests should cover valid quotes, crossed quotes, missing entry quotes, missing exit quotes, wide spreads, and bar fallback settings.

For stops and targets, create quote sequences where the stop is touched before the target and another where the target is touched first. The framework should record the right exit reason and use the configured fill side.

Scenario	Expected assertion
Fresh valid quote	Fill uses configured side or haircut.
Crossed quote	Trade is rejected or quote is ignored.
Missing entry quote	No entry fill is created.
Wide spread	Reject reason names the spread threshold.
Stop before target	Exit reason is stop with observable timestamp.
Target before stop	Exit reason is target with observable timestamp.
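The stop and target rows can be checked with hand-built quote sequences. This sketch assumes a long position that exits at the bid and a policy that ignores unusable quotes; `Quote`, `is_usable`, and `first_exit` are illustrative names, and a real framework may reject the trade instead of skipping the quote.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Quote:
    ts: int  # monotonically increasing quote timestamp
    bid: Optional[float]
    ask: Optional[float]


def is_usable(q: Quote) -> bool:
    """A quote is usable when both sides exist, the bid is positive, and it is not crossed."""
    return q.bid is not None and q.ask is not None and 0 < q.bid <= q.ask


def first_exit(quotes, stop, target):
    """Walk quotes in order; return (reason, ts, price) for the first level touched."""
    for q in quotes:
        if not is_usable(q):
            continue  # assumed policy: ignore missing or crossed quotes
        if q.bid <= stop:
            return ("stop", q.ts, q.bid)
        if q.bid >= target:
            return ("target", q.ts, q.bid)
    return None


def test_stop_before_target():
    quotes = [Quote(1, 2.0, 2.1), Quote(2, 1.4, 1.5), Quote(3, 3.2, 3.3)]
    assert first_exit(quotes, stop=1.5, target=3.0) == ("stop", 2, 1.4)


def test_target_before_stop():
    quotes = [Quote(1, 2.0, 2.1), Quote(2, 3.1, 3.2), Quote(3, 1.0, 1.1)]
    assert first_exit(quotes, stop=1.5, target=3.0) == ("target", 2, 3.1)


def test_crossed_quote_is_ignored():
    quotes = [Quote(1, 1.0, 0.9), Quote(2, 3.1, 3.2)]  # first quote is crossed
    assert first_exit(quotes, stop=1.5, target=3.0) == ("target", 2, 3.1)
```

Each assertion checks the exit reason, the timestamp, and the fill price together, so a change that keeps the reason but moves the fill to an unobserved price still fails.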

Portfolio and robustness tests

Portfolio tests should use at least two symbols trading on the same calendar day. The expected Sharpe and Sortino inputs should come from one combined daily PnL row for that day, not two separate pseudo-days.
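A minimal sketch of the aggregation step, assuming trades arrive as `(date, symbol, pnl)` tuples; the function name and row shape are illustrative only.

```python
from collections import defaultdict
from datetime import date


def combined_daily_pnl(trades):
    """Sum PnL across symbols per calendar day; return sorted (date, total) rows."""
    totals = defaultdict(float)
    for day, _symbol, pnl in trades:
        totals[day] += pnl
    return sorted(totals.items())


def test_two_symbols_on_one_day_make_one_row():
    trades = [
        (date(2024, 1, 2), "AAA", 100.0),
        (date(2024, 1, 2), "BBB", -40.0),
        (date(2024, 1, 3), "AAA", 25.0),
    ]
    rows = combined_daily_pnl(trades)
    assert len(rows) == 2  # two calendar days, not three pseudo-days
    assert rows[0] == (date(2024, 1, 2), 60.0)
    assert rows[1] == (date(2024, 1, 3), 25.0)
```

Sharpe and Sortino inputs would then be computed from `rows`, so the sample size and the volatility both reflect calendar days rather than per-symbol trade rows.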

Robustness tests should verify:

train and test windows do not overlap unless intentionally configured
selected-fold rows feed selection diagnostics
combined-fold rows feed portfolio diagnostics
sparse profiles fail minimum trade gates
rejected profiles still appear in diagnostic summaries
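Two of these checks are simple enough to sketch directly. The inclusive `(start, end)` window convention, the helper names, and the status strings are assumptions for this example.

```python
from datetime import date


def windows_overlap(train, test):
    """True when two inclusive (start, end) date windows share at least one day."""
    return train[0] <= test[1] and test[0] <= train[1]


def summarize_profiles(trade_counts, min_trades):
    """Keep every profile in the summary; mark sparse ones as rejected."""
    return {
        name: {
            "trades": n,
            "status": "accepted" if n >= min_trades else "rejected_min_trades",
        }
        for name, n in trade_counts.items()
    }


def test_fold_windows_do_not_overlap():
    train = (date(2023, 1, 1), date(2023, 6, 30))
    clean_test = (date(2023, 7, 1), date(2023, 9, 30))
    leaky_test = (date(2023, 6, 30), date(2023, 9, 30))  # shares one day with train
    assert not windows_overlap(train, clean_test)
    assert windows_overlap(train, leaky_test)


def test_sparse_profile_is_rejected_but_reported():
    summary = summarize_profiles({"dense": 120, "sparse": 4}, min_trades=30)
    assert summary["dense"]["status"] == "accepted"
    assert summary["sparse"] == {"trades": 4, "status": "rejected_min_trades"}
```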

CI commands

For the public site, run:

```bash
npm run lint
npm run build
```

For a Python framework package, keep a focused public-surface test command:

```bash
PYTHONPATH=src python -m pytest tests/test_public_surface.py -q
```

The exact file names can change. The standard should not: every framework change that affects causality, selection, fills, or metrics needs a regression test.
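A public-surface test of the kind named in the command above might look like the sketch below. The framework's package name is not given on this page, so the standard-library `json` module stands in for it; `missing_public_names` is a hypothetical helper.

```python
import importlib


def missing_public_names(module_name, names):
    """Return the names that the module does not expose as attributes."""
    module = importlib.import_module(module_name)
    return [name for name in names if not hasattr(module, name)]


def test_public_surface_stand_in():
    # `json` is only a stand-in; replace it with the framework package and
    # the names that published guides and CLI examples actually import.
    assert missing_public_names("json", ["loads", "dumps"]) == []
```

Returning the list of missing names, rather than a boolean, means the failure message shows which renamed symbol broke a published guide.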

Treat these cases as executable coverage for Backtesting Engine Loop, Backtesting Data Quality Checklist, and Backtesting Execution Realism. Good tests assert timestamp ordering, missing-bar behavior, contract selector output, quote-window rejection, same-bar fill prevention, and artifact reproducibility.

Test plan implementation notes

A useful test suite has fixtures that try to fool the simulator. Include a contract that appears after the research date, a quote window with no usable bid, a crossed market, an incomplete cursor, a bar whose high and low arrive after the signal, and a plan response that lacks quote entitlement. The expected result is a reject, not a silent fallback.
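Two of these adversarial fixtures can be sketched with plain dictionaries. The field names (`listed_on`, `expiration`, `bid`, `id`) and the reason strings are assumptions, not the framework's schema.

```python
from datetime import date


def visible_contracts(contracts, as_of):
    """Contracts that were already listed and not yet expired on the research date."""
    return [c for c in contracts if c["listed_on"] <= as_of <= c["expiration"]]


def select_or_reject(contracts, as_of):
    """Return a selection or an explicit reject; never fall back silently."""
    visible = visible_contracts(contracts, as_of)
    if not visible:
        return {"status": "rejected", "reason": "no_contract_listed_as_of"}
    usable = [c for c in visible if c["bid"] is not None and c["bid"] > 0]
    if not usable:
        return {"status": "rejected", "reason": "no_usable_bid"}
    chosen = min(usable, key=lambda c: c["id"])  # deterministic tie-break
    return {"status": "selected", "id": chosen["id"]}


RESEARCH_DATE = date(2024, 1, 2)
FUTURE_LISTED = {"id": "C1", "listed_on": date(2024, 1, 10), "expiration": date(2024, 2, 16), "bid": 1.2}
NO_BID = {"id": "C2", "listed_on": date(2023, 12, 1), "expiration": date(2024, 2, 16), "bid": None}


def test_contract_listed_after_research_date_is_rejected():
    result = select_or_reject([FUTURE_LISTED], RESEARCH_DATE)
    assert result == {"status": "rejected", "reason": "no_contract_listed_as_of"}


def test_quote_without_usable_bid_is_rejected():
    result = select_or_reject([NO_BID], RESEARCH_DATE)
    assert result == {"status": "rejected", "reason": "no_usable_bid"}
```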

Pair fixture tests with artifact tests. A passed backtest should include the selected OCC symbol, as_of contract request, quote timestamp, spread percent, fill rule, aggregate bar window, and rejected candidates. The Backtesting Data Quality Checklist gives the reject vocabulary, while Backtesting Execution Realism defines the fill evidence that each test needs to inspect.
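An artifact test can be as small as a required-field check. The key names below are hypothetical spellings of the items listed above, and the sample values are placeholders rather than real output.

```python
REQUIRED_FIELDS = (
    "occ_symbol",
    "as_of",
    "quote_timestamp",
    "spread_pct",
    "fill_rule",
    "bar_window",
    "rejected_candidates",
)


def missing_artifact_fields(artifact):
    """Return required keys that are absent or set to None."""
    return [f for f in REQUIRED_FIELDS if artifact.get(f) is None]


def test_complete_artifact_passes():
    artifact = {
        "occ_symbol": "PLACEHOLDER",
        "as_of": "2024-01-02",
        "quote_timestamp": "2024-01-02T15:30:00Z",
        "spread_pct": 0.04,
        "fill_rule": "ask_on_entry",
        "bar_window": ["2024-01-02T14:30:00Z", "2024-01-02T21:00:00Z"],
        "rejected_candidates": [],  # an empty list is valid evidence; None is not
    }
    assert missing_artifact_fields(artifact) == []


def test_artifact_without_quote_evidence_fails():
    artifact = {"occ_symbol": "PLACEHOLDER", "as_of": "2024-01-02", "fill_rule": "ask_on_entry"}
    assert missing_artifact_fields(artifact) == [
        "quote_timestamp",
        "spread_pct",
        "bar_window",
        "rejected_candidates",
    ]
```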
