Step 6 — The Test
Before shipping, automated tests play the game like a robot. If anything fails, the deploy is blocked.
Code without tests is a guess.
Before any game ships, we run automated playthroughs — a robot plays the game and checks the result against the plan from Step 3.
If even one test fails, we don't ship. We fix it first.
What the robot does
expected: ball bounces off wall
actual:   ball passes through

✗ 1 test failed → deploy blocked
Fix the spec, then try again.
/tests
The robot plays the game.
Real taps. Real physics. Real score. No human watching.
Each test maps to one rule from the plan. If rule #6 fails, we know exactly which line broke and exactly which programmer wrote it.
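Here's a minimal sketch of what one of those playthrough tests could look like, assuming the game is a web build driven with Playwright. The `window.__game` debug hook and the selectors are assumptions for illustration, not the real BOP harness.

```js
// Hypothetical playthrough test. The window.__game hook is an assumed
// debug export, not real BOP code; the rest is standard Playwright.
const { test, expect } = require('@playwright/test');

test('rule #6: ball rebounds off wall', async ({ page }) => {
  await page.goto('http://localhost:8080'); // local game build (assumed URL)

  // Real tap: a pointer event on the canvas, no human involved.
  await page.locator('canvas').click();

  // Let the real physics run for two seconds of game time.
  await page.waitForTimeout(2000);

  // Read actual state straight from the running game, then compare numbers.
  const ballX = await page.evaluate(() => window.__game.ball.x);
  const wallX = await page.evaluate(() => window.__game.wall.x);

  // Expected: the ball bounced, so it must still be on the play side of the wall.
  expect(ballX).toBeLessThan(wallX);
});
```

The shape is what matters: simulate real input, wait for real physics, read a real number back, compare it to the plan.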
Why we don't let the AI grade itself
This is the most important rule in the whole pipeline.
The AI helped write the code. Then a different system runs the tests. The AI never decides "the test passed." A computer compares the actual output to the expected output. Pure numbers.
# The robot taps 5 times perfectly.
# Reads the combo number from screen.

actual combo: 5
expected:     5
✓ Match → PASS

# If it had been 4 instead:
actual combo: 4
expected:     5
✗ No match → FAIL

No "looks fine" allowed.
/no-vibes
Numbers don't lie.
The check isn't "looks correct" — it's "5 equals 5." If the numbers don't match, the test fails.
This is why our games don't ship with the kind of bugs you see in vibe-coded apps.
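Stripped to its core, the check is one strict equality in a separate test process. A sketch, assuming a hypothetical `readComboFromScreen()` helper that stands in for however the tester extracts the on-screen counter:

```js
// Hypothetical check — readComboFromScreen() is an assumed helper
// (debug hook, pixel read, etc.), not part of the real harness.
import assert from 'node:assert/strict';
import { readComboFromScreen } from './harness.js'; // assumed helper

const expected = 5;                   // the value the plan from Step 3 demands
const actual = readComboFromScreen(); // the value the running game shows

// Pure numbers: either 5 === 5, or this throws and the deploy is blocked.
assert.strictEqual(actual, expected);
```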
What we test for every rule
Each rule from the plan gets at least two tests: a "happy path" (it works) and a "negative path" (it fails correctly).
Happy:    tap exactly at apex → combo += 1
          expect: combo == 1

Negative: tap 200ms after apex → combo == 0
          expect: combo == 0

Edge:     tap exactly at +100ms boundary
          expect: combo += 1 (still in window)

Edge:     tap at +101ms boundary
          expect: combo == 0 (just outside)

# 4 tests for 1 rule.
# Bugs hide in edge cases.
/coverage
Happy + negative + edge.
Most bugs aren't "code is wrong" — they're "edge case forgotten." Testing the boundary (±100ms vs ±101ms) catches the real issues.
Across 23 rules in BOP, that's ~70 automated tests. The tester runs them all in 90 seconds.
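Written out, the four tests for that one rule might look like this under a Jest-style runner. `simulateTap(offsetMs)` is a hypothetical helper that plays one bounce, taps at the given offset from the apex, and returns the combo afterwards; it is not the real harness.

```js
// Sketch of the 4 tests for the apex-tap rule.
// simulateTap() is an assumed helper, named here for illustration only.
const { simulateTap } = require('./harness');

describe('rule: tap within ±100ms of apex increments combo', () => {
  test('happy: tap exactly at apex', async () => {
    expect(await simulateTap(0)).toBe(1);
  });

  test('negative: tap 200ms after apex', async () => {
    expect(await simulateTap(200)).toBe(0);
  });

  test('edge: tap at +100ms, still inside the window', async () => {
    expect(await simulateTap(100)).toBe(1);
  });

  test('edge: tap at +101ms, just outside the window', async () => {
    expect(await simulateTap(101)).toBe(0);
  });
});
```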
Performance tests too
Functional tests check correctness. Performance tests check that the game stays fast on real devices.
▶ 60fps over 60 seconds
  median: 59.4 fps ✓ ≥ 58
  p99:    57.1 fps ✓ ≥ 55
▶ tap input latency measured: 47ms ✓ ≤ 100ms
▶ bundle size measured: 184KB ✓ ≤ 250KB
▶ cold start measured: 1.2s ✓ ≤ 2s
✓ All perf gates pass
/perf
Slow games don't get installs.
iPhone SE is our perf target. If it runs there, it runs everywhere.
Failing perf = failing test = blocked deploy. Same rule as functional bugs.
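The perf gates are numeric comparisons too. A sketch of the fps gate, assuming the harness collects about 60 seconds of requestAnimationFrame timestamps on the target device; the thresholds are the ones above, everything else is illustrative.

```js
// Sketch of the fps gate. frameTimestampsMs is assumed to be ~60 seconds
// of requestAnimationFrame timestamps collected on the device under test.
function percentile(sortedValues, p) {
  const idx = Math.min(
    sortedValues.length - 1,
    Math.floor((p / 100) * sortedValues.length)
  );
  return sortedValues[idx];
}

function fpsGate(frameTimestampsMs) {
  // Frame-to-frame deltas in ms, sorted so percentiles are easy to read off.
  const deltas = [];
  for (let i = 1; i < frameTimestampsMs.length; i++) {
    deltas.push(frameTimestampsMs[i] - frameTimestampsMs[i - 1]);
  }
  deltas.sort((a, b) => a - b);

  // Median and 99th-percentile (slowest 1%) frame times, reported as fps.
  const medianFps = 1000 / percentile(deltas, 50);
  const p99Fps = 1000 / percentile(deltas, 99);

  // Same rule as functional tests: numbers against thresholds, pass or fail.
  return { medianFps, p99Fps, pass: medianFps >= 58 && p99Fps >= 55 };
}
```

A failed gate returns `pass: false`, and that is treated exactly like a failed functional test.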
What happens when a test fails
# Tester comments on Trello:
Rule #20 fails:
  expected: ball rebounds off wall
  actual:   ball passes through

File: src/scenes/BossScene.js:142
Programmer 2 owns this file.

# Team lead routes back:
→ Programmer 2 fix task created
→ Card stays in DEV column
→ Run again after fix

# Max 3 fix rounds.
# If still failing → kill or escalate.
/recovery
Failures are routed automatically.
The tester doesn't fix anything — it identifies which file failed which rule. The team lead routes the fix back to the programmer who wrote it.
3 rounds max. If the programmer can't fix it in 3 rounds, the rule is wrong, not the code.
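The routing itself is deterministic, not clever. Here's a sketch of the team lead's rule, with the ownership map and any Trello wiring left as assumptions; the 3-round cap is the actual policy described above.

```js
// Sketch of the routing rule. The owners map is an assumed example;
// only the 3-round limit reflects the real policy.
const MAX_FIX_ROUNDS = 3;

const owners = {
  'src/scenes/BossScene.js': 'Programmer 2', // assumed ownership map
};

function routeFailure(failure, fixRoundsSoFar) {
  if (fixRoundsSoFar >= MAX_FIX_ROUNDS) {
    // Three rounds burned: the rule is suspect, not the code.
    return { action: 'kill-or-escalate', reason: 'rule needs review' };
  }
  return {
    action: 'create-fix-task',
    assignee: owners[failure.file] ?? 'team lead',
    card: 'stays in DEV column',
    note: `Rule #${failure.rule} fails at ${failure.file}:${failure.line}`,
  };
}

// Example: the wall-rebound failure above, entering its second fix round.
routeFailure({ rule: 20, file: 'src/scenes/BossScene.js', line: 142 }, 1);
```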
Common pitfalls
✗ Letting the AI grade its own tests (vibes scoring)
✗ Skipping perf tests "because it's fast on my Mac"
✗ Only testing happy path
✗ Disabling failing tests instead of fixing them
✗ Treating flaky tests as "intermittent" — they're real bugs
The discipline is: a failing test is a real failure. No exceptions. Disabling tests is a one-way ticket to bug-ridden production.
What you can copy
If you ship anything code-related:
- Write the rules first (Step 3). Then write tests against the rules.
- Don't let the AI grade itself. Use a separate test runner.
- Block deploys on failure. No "we'll fix it later."
- Test happy + negative + edge for every rule.
- Perf tests on the slowest hardware you target.
You'll ship slower for the first week. After that, you ship without bugs.