Backtest vs Paper Trading: Why Good Trading Results Break in Live Markets
CuteMarkets Team
Research

Repository reference: cutebacktests
Abstract
The gap between backtest and paper trading is usually discussed as psychology or brokerage friction. In systematic options research, that explanation is too shallow. The deeper problem is parity. A research result can be directionally correct, statistically encouraging, and still fail to survive contact with live routing, real-time contract choice, operational safeguards, and daily review discipline.
This repository has a clean example of that distinction. The current lead paper bot, c66_opening_compression_option_native_short_balance_dte35_v1, reached the top of the paper ladder not because it had the loudest anecdotal PnL, but because it combined out-of-sample stability with a stricter deployment process. Its baseline summary, recorded in baseline_summary.json, shows a base return of 19.18%, stress-medium 16.70%, and stress-harsh 15.56%, with 76 out-of-sample trades in each of the three scenarios. That is exactly the sort of profile that should be tested against paper-trading parity instead of being celebrated prematurely.
Question
The practical question is not "why does live trading always feel worse?" The better question is: what has to stay true when a strategy moves from historical inference into paper execution?
In this repo, the answer is operational and specific. The strategy has to survive a promotion ladder. It has to pass targeted tests. It has to be synced into a clean environment. It has to survive parity checks, dry-run smoke tests, a limited paper loop, and daily review. That ladder is documented in Paper Bots, and it is more valuable than a generic slogan about discipline because it names the failure surfaces directly.
Method: Why Backtest vs Paper Trading Becomes a Parity Problem
The repo's paper-bot contract is useful because it turns "paper trading" into a concrete validation regime rather than a vague next step.
According to Paper Bots, every candidate follows the same sequence:
- local targeted pytest gate
- fresh remote workspace sync and targeted remote gate
- import-origin preflight
- orb-paper-parity on benchmark days or chosen trade days
- one in-session dry-run smoke
- limited live paper loop
- daily review using the generated checklist
That sequence matters because each step catches a different class of failure. Local and remote test gates catch code drift. Import-origin preflight catches workspace and package-path mistakes. Paper parity catches contract or timing drift relative to the research artifact. A dry-run smoke test catches live wiring issues before a longer loop starts. Daily review catches the messy operational problems that a static backtest cannot see.
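The idea that each step catches a different class of failure can be sketched as a fail-fast sequence of gates. This is a minimal illustration, not the repo's implementation: the `Gate` class, the `run_ladder` function, and the placeholder lambdas are all hypothetical stand-ins for whatever commands the real ladder runs.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Gate:
    name: str                  # label borrowed from the ladder steps
    check: Callable[[], bool]  # hypothetical callable; True means the gate passed


def run_ladder(gates: List[Gate]) -> Tuple[List[str], Optional[str]]:
    """Run gates in order and stop at the first failure.

    Returns (names of gates that passed, name of the failing gate or None).
    """
    passed: List[str] = []
    for gate in gates:
        if not gate.check():
            return passed, gate.name
        passed.append(gate.name)
    return passed, None


# Placeholder checks: each lambda stands in for a real command or script.
ladder = [
    Gate("local targeted pytest gate", lambda: True),
    Gate("remote workspace sync and targeted remote gate", lambda: True),
    Gate("import-origin preflight", lambda: True),
    Gate("paper parity on benchmark or chosen trade days", lambda: False),
    Gate("in-session dry-run smoke", lambda: True),
    Gate("limited live paper loop", lambda: True),
]

passed, failed_at = run_ladder(ladder)
print(f"passed {len(passed)} gates; stopped at: {failed_at}")
```

Stopping at the first failure keeps later, more expensive steps (the dry-run smoke and the paper loop) from running on a candidate that has already drifted from its research artifact.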
The daily review checklist in PAPER_BOTS.md is also revealing. It explicitly asks for opened versus expected trades, no-trade symbols, parity mismatches, contract mismatches, fill failures or rejects, broker reconciliation events, duplicate-entry prevention, kill-switch status, and diversification shape versus the existing portfolio. In other words, paper trading is treated here as a measurement exercise. The goal is to find where the live path diverges from the backtest, not to watch green trades print.
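Part of that checklist can be expressed as a simple diff between what the research artifact expected and what the paper session opened. The sketch below assumes trades are keyed by symbol and identified by a contract string. The function name, the dictionary layout, and the sample symbols and contract identifiers are all made up for illustration and do not come from the repo.

```python
from typing import Dict, List


def review_day(expected: Dict[str, str], opened: Dict[str, str]) -> Dict[str, List[str]]:
    """Compare expected trades to opened trades.

    Both inputs map symbol -> option contract identifier (hypothetical strings).
    """
    return {
        # expected a trade, but the paper session opened nothing
        "no_trade_symbols": sorted(s for s in expected if s not in opened),
        # the paper session opened something the backtest did not expect
        "unexpected_symbols": sorted(s for s in opened if s not in expected),
        # both sides traded the symbol, but in different contracts
        "contract_mismatches": sorted(
            s for s in expected if s in opened and expected[s] != opened[s]
        ),
    }


# Made-up sample data, purely for illustration.
expected_trades = {"AAA": "AAA-P-100", "BBB": "BBB-P-50", "CCC": "CCC-P-75"}
opened_trades = {"AAA": "AAA-P-100", "BBB": "BBB-P-55", "DDD": "DDD-P-20"}

report = review_day(expected_trades, opened_trades)
for item, symbols in report.items():
    print(f"{item}: {symbols}")
```

A report like this covers only the trade-level items. Fill failures, reconciliation events, and kill-switch status would need broker and runtime logs that this sketch does not model.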
Evidence / Results
The portfolio ladder in Paper Bots currently lists:
- c66_strict_parity_paper_bot_r1
- c4_open_paper_candidate_r1
- c36_open_paper_candidate_r1
That order tells an important story. The repo did not say that every promising branch deserved the same live treatment. It promoted one branch, kept one as the next candidate, and kept another as a backup candidate. The roadmap in paper_bot_portfolio_r1/roadmap.json is equally explicit: the goal is to build a small diversified paper-bot portfolio instead of continuing broad standalone ORB frontier search.
The positive result is that one strategy really did look strong enough to operationalize. c66 is the current lead_paper_bot, and the repo's summary in Toward The One Piece Of Sharpe reports:
- base out-of-sample return: 19.18%
- stress-medium out-of-sample return: 16.70%
- stress-harsh out-of-sample return: 15.56%
- 76 out-of-sample trades in all three scenarios
Those figures matter because paper trading should start from stability, not from the single best in-sample anecdote. The operational history around c66, summarized in Episode 6, then extended that research artifact into strict-parity validation on server3, first live paper deployment on April 13, and a restart after the server reboot on April 18.
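One way to read "stability" from those three returns is to ask how much of the base return survives each stress scenario. The retention ratio below is an illustrative metric chosen for this sketch, not one the repo reports. Only the three return figures come from the summary above.

```python
# Out-of-sample returns (percent) as reported in the summary above.
base_return = 19.18
stress_returns = {"stress-medium": 16.70, "stress-harsh": 15.56}

# Retention = stressed return divided by base return (illustrative metric).
retention = {name: ret / base_return for name, ret in stress_returns.items()}

for name, ratio in retention.items():
    print(f"{name}: retains {ratio:.1%} of the base return")
```

On these numbers the strategy keeps roughly 87% of its base return under medium stress and roughly 81% under harsh stress, which is the kind of gradual degradation that makes a candidate worth testing in paper.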
What Worked
What worked was the distinction between research success and operational readiness. This repo did not flatten those into the same category.
c66 worked because it combined several qualities that rarely show up together in one options branch. It had positive out-of-sample returns under base and stress conditions. It had a stable trade count across those stress scenarios. It passed a harsher selection process. Then it was wired into an explicit paper-trading contract with kill-switch logic and daily review discipline. That is the right reason to trust a paper candidate.
The promotion ladder also worked as a communication device. Public trading research often sounds more certain than it is because every green branch is presented as "working." Here the ordering itself communicates uncertainty and selectivity. c66 is the lead. c4 is the next candidate, not a peer. c36 is a backup candidate, not a promoted bot. That is much closer to how real research programs should report progress.
What Failed
The most important negative result is that good backtest results alone were not enough for promotion.
c36 is the cleanest example. In Toward The One Piece Of Sharpe, the quality version of the VWAP mean-reversion lane showed +16004 PnL on 15 trades with DSR 0.6400, yet it still failed the trades_per_week_ok gate and stayed backup_candidate or open_paper_only. That is a strong warning against treating a profitable backtest as automatically deployable.
c4 is another useful case. It improved after debugging, but the repo still concluded park_c4 because the portfolio gate remained too harsh. The required conditions included feasible selection, positive return, trades_per_week >= 1.5, orb_overlap_days >= 30, c66_overlap_days >= 30, offset_ratio_on_orb_down_days >= 0.5, zero extra option attempts, and zero quote rejects, as summarized in Episode 8. That is exactly the kind of evidence that disappears when public strategy content only reports the best chart and the best number.
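The gate conditions listed above translate naturally into a checklist that reports every failing condition, not just a pass or fail flag. This is a sketch under assumptions: the thresholds are the ones quoted in the text, but the function name, the metric dictionary keys for the unnamed conditions (`feasible_selection`, `return_pct`, `extra_option_attempts`, `quote_rejects`), and the sample candidate values are hypothetical.

```python
from typing import Any, Dict, List


def portfolio_gate_failures(m: Dict[str, Any]) -> List[str]:
    """Return the names of the gate conditions a candidate fails.

    Thresholds follow the conditions quoted in the text; the layout of the
    metrics dictionary is an assumption made for this sketch.
    """
    checks = {
        "feasible_selection": bool(m["feasible_selection"]),
        "positive_return": m["return_pct"] > 0,
        "trades_per_week": m["trades_per_week"] >= 1.5,
        "orb_overlap_days": m["orb_overlap_days"] >= 30,
        "c66_overlap_days": m["c66_overlap_days"] >= 30,
        "offset_ratio_on_orb_down_days": m["offset_ratio_on_orb_down_days"] >= 0.5,
        "zero_extra_option_attempts": m["extra_option_attempts"] == 0,
        "zero_quote_rejects": m["quote_rejects"] == 0,
    }
    return [name for name, ok in checks.items() if not ok]


# Made-up candidate metrics, purely for illustration.
candidate = {
    "feasible_selection": True,
    "return_pct": 4.2,
    "trades_per_week": 0.9,
    "orb_overlap_days": 41,
    "c66_overlap_days": 22,
    "offset_ratio_on_orb_down_days": 0.6,
    "extra_option_attempts": 0,
    "quote_rejects": 0,
}

failures = portfolio_gate_failures(candidate)
print("promote" if not failures else f"park: failed {failures}")
```

Returning the list of failed conditions, not a single boolean, preserves the negative evidence: a parked branch carries a record of exactly which thresholds it missed.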
There is also a more general failure mode. A backtest can be internally strong and still break in paper because paper trading exposes environment drift, contract mismatches, quote rejects, and operational race conditions that no static research artifact can reveal. That is why "backtest vs paper trading" is not mainly a mindset issue. It is a parity issue, an environment issue, and a process issue.
Takeaway
Backtest vs paper trading is not a contest between theory and emotion. It is a test of whether the research object survives a stricter version of reality. In this repository, that stricter version includes explicit promotion steps, parity checks, dry-run smoke tests, limited paper deployment, and daily review with kill-switch logic.
The best lesson from the current state of the repo is that one branch, c66, earned the right to lead because it survived both research and operational scrutiny. Other branches with real signal still stopped short of promotion. If you want to understand why the research side has to be strict first, read What Is Realistic Options Backtesting? A Practical Guide for Serious Traders. If you want the data-layer view beneath that, Historical Options Backtesting: Data, Fills, and Slippage That Actually Matter covers the contract, quote, and slippage stack. Join the research log to get the next backtest and failure report.