Step 6 — The Test
Before shipping, automated tests play the game like a robot. If anything fails, the deploy is blocked.
Code without tests is a guess.
Before any game ships, we run automated playthroughs — a robot plays the game and checks the result against the plan from Step 3.
If even one test fails, we don't ship. We fix it first.
What the robot does
expected: ball bounces off wall
actual:   ball passes through

✗ 1 test failed → deploy blocked
Fix the spec, then try again.
/tests
The robot plays the game.
Real taps. Real physics. Real score. No human watching.
Each test maps to one rule from the plan. If rule #6 fails, we know exactly which line broke and exactly which programmer wrote it.
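Here's a minimal sketch of what one of those playthrough tests could look like, assuming the game is a web build driven with Playwright. The `window.__game` debug hook and the selectors are assumptions for illustration, not the real BOP harness.

```js
// Hypothetical playthrough test. The window.__game hook is an assumed
// debug export, not real BOP code; the rest is standard Playwright.
const { test, expect } = require('@playwright/test');

test('rule #6: ball rebounds off wall', async ({ page }) => {
  await page.goto('http://localhost:8080'); // local game build (assumed URL)

  // Real tap: a pointer event on the canvas, no human involved.
  await page.locator('canvas').click();

  // Let the real physics run for two seconds of game time.
  await page.waitForTimeout(2000);

  // Read actual state straight from the running game, then compare numbers.
  const ballX = await page.evaluate(() => window.__game.ball.x);
  const wallX = await page.evaluate(() => window.__game.wall.x);

  // Expected: the ball bounced, so it must still be on the play side of the wall.
  expect(ballX).toBeLessThan(wallX);
});
```

The shape is what matters: simulate real input, wait for real physics, read a real number back, compare it to the plan.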
Why we don't let the AI grade itself
This is the most important rule in the whole pipeline.
The AI helped write the code. Then a different system runs the tests. The AI never decides "the test passed." A computer compares the actual output to the expected output. Pure numbers.
# The robot taps 5 times perfectly.
# Reads the combo number from screen.

actual combo: 5
expected:     5
✓ Match → PASS

# If it had been 4 instead:
actual combo: 4
expected:     5
✗ No match → FAIL

No "looks fine" allowed.
/no-vibes
Numbers don't lie.
The check isn't "looks correct" — it's "5 equals 5." If the numbers don't match, the test fails.
This is why our games don't ship with the kind of bugs you see in vibe-coded apps.
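Stripped to its core, the check is one strict equality in a separate test process. A sketch, assuming a hypothetical `readComboFromScreen()` helper that stands in for however the tester extracts the on-screen counter:

```js
// Hypothetical check — readComboFromScreen() is an assumed helper
// (debug hook, pixel read, etc.), not part of the real harness.
import assert from 'node:assert/strict';
import { readComboFromScreen } from './harness.js'; // assumed helper

const expected = 5;                   // the value the plan from Step 3 demands
const actual = readComboFromScreen(); // the value the running game shows

// Pure numbers: either 5 === 5, or this throws and the deploy is blocked.
assert.strictEqual(actual, expected);
```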
What we test for every rule
Each rule from the plan gets at least two tests: a "happy path" (it works) and a "negative path" (it fails correctly).
Happy:    tap exactly at apex → combo += 1
          expect: combo == 1

Negative: tap 200ms after apex → combo == 0
          expect: combo == 0

Edge:     tap exactly at +100ms boundary
          expect: combo += 1 (still in window)

Edge:     tap at +101ms boundary
          expect: combo == 0 (just outside)

# 4 tests for 1 rule.
# Bugs hide in edge cases.
/coverage
Happy + negative + edge.
Most bugs aren't "code is wrong" — they're "edge case forgotten." Testing the boundary (±100ms vs ±101ms) catches the real issues.
Across 23 rules in BOP, that's ~70 automated tests. The tester runs them all in 90 seconds.
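Written out, the four tests for that one rule might look like this under a Jest-style runner. `simulateTap(offsetMs)` is a hypothetical helper that plays one bounce, taps at the given offset from the apex, and returns the combo afterwards; it is not the real harness.

```js
// Sketch of the 4 tests for the apex-tap rule.
// simulateTap() is an assumed helper, named here for illustration only.
const { simulateTap } = require('./harness');

describe('rule: tap within ±100ms of apex increments combo', () => {
  test('happy: tap exactly at apex', async () => {
    expect(await simulateTap(0)).toBe(1);
  });

  test('negative: tap 200ms after apex', async () => {
    expect(await simulateTap(200)).toBe(0);
  });

  test('edge: tap at +100ms, still inside the window', async () => {
    expect(await simulateTap(100)).toBe(1);
  });

  test('edge: tap at +101ms, just outside the window', async () => {
    expect(await simulateTap(101)).toBe(0);
  });
});
```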
Performance tests too
Functional tests check correctness. Performance tests check that the game stays fast on real devices.
▶ 60fps over 60 seconds
  median: 59.4 fps ✓ ≥ 58
  p99:    57.1 fps ✓ ≥ 55
▶ tap input latency measured: 47ms ✓ ≤ 100ms
▶ bundle size measured: 184KB ✓ ≤ 250KB
▶ cold start measured: 1.2s ✓ ≤ 2s
✓ All perf gates pass
/perf
Slow games don't get installs.
iPhone SE is our perf target. If it runs there, it runs everywhere.
Failing perf = failing test = blocked deploy. Same rule as functional bugs.
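The perf gates are numeric comparisons too. A sketch of the fps gate, assuming the harness collects about 60 seconds of requestAnimationFrame timestamps on the target device; the thresholds are the ones above, everything else is illustrative.

```js
// Sketch of the fps gate. frameTimestampsMs is assumed to be ~60 seconds
// of requestAnimationFrame timestamps collected on the device under test.
function percentile(sortedValues, p) {
  const idx = Math.min(
    sortedValues.length - 1,
    Math.floor((p / 100) * sortedValues.length)
  );
  return sortedValues[idx];
}

function fpsGate(frameTimestampsMs) {
  // Frame-to-frame deltas in ms, sorted so percentiles are easy to read off.
  const deltas = [];
  for (let i = 1; i < frameTimestampsMs.length; i++) {
    deltas.push(frameTimestampsMs[i] - frameTimestampsMs[i - 1]);
  }
  deltas.sort((a, b) => a - b);

  // Median and 99th-percentile (slowest 1%) frame times, reported as fps.
  const medianFps = 1000 / percentile(deltas, 50);
  const p99Fps = 1000 / percentile(deltas, 99);

  // Same rule as functional tests: numbers against thresholds, pass or fail.
  return { medianFps, p99Fps, pass: medianFps >= 58 && p99Fps >= 55 };
}
```

A failed gate returns `pass: false`, and that is treated exactly like a failed functional test.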
What happens when a test fails
# Tester comments on Trello:
Rule #20 fails:
  expected: ball rebounds off wall
  actual:   ball passes through

File: src/scenes/BossScene.js:142
Programmer 2 owns this file.

# Team lead routes back:
→ Programmer 2 fix task created
→ Card stays in DEV column
→ Run again after fix

# Max 3 fix rounds.
# If still failing → kill or escalate.
/recovery
Failures are routed automatically.
The tester doesn't fix anything — it identifies which file failed which rule. The team lead routes the fix back to the programmer who wrote it.
3 rounds max. If the programmer can't fix it in 3 rounds, the rule is wrong, not the code.
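The routing itself is deterministic, not clever. Here's a sketch of the team lead's rule, with the ownership map and any Trello wiring left as assumptions; the 3-round cap is the actual policy described above.

```js
// Sketch of the routing rule. The owners map is an assumed example;
// only the 3-round limit reflects the real policy.
const MAX_FIX_ROUNDS = 3;

const owners = {
  'src/scenes/BossScene.js': 'Programmer 2', // assumed ownership map
};

function routeFailure(failure, fixRoundsSoFar) {
  if (fixRoundsSoFar >= MAX_FIX_ROUNDS) {
    // Three rounds burned: the rule is suspect, not the code.
    return { action: 'kill-or-escalate', reason: 'rule needs review' };
  }
  return {
    action: 'create-fix-task',
    assignee: owners[failure.file] ?? 'team lead',
    card: 'stays in DEV column',
    note: `Rule #${failure.rule} fails at ${failure.file}:${failure.line}`,
  };
}

// Example: the wall-rebound failure above, entering its second fix round.
routeFailure({ rule: 20, file: 'src/scenes/BossScene.js', line: 142 }, 1);
```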
Common pitfalls
✗ Letting the AI grade its own tests (vibes scoring)
✗ Skipping perf tests "because it's fast on my Mac"
✗ Only testing happy path
✗ Disabling failing tests instead of fixing them
✗ Treating flaky tests as "intermittent" — they're real bugs
The discipline is: a failing test is a real failure. No exceptions. Disabling tests is a one-way ticket to bug-ridden production.
What you can copy
If you ship anything code-related:
- Write the rules first (Step 3). Then write tests against the rules.
- Don't let the AI grade itself. Use a separate test runner.
- Block deploys on failure. No "we'll fix it later."
- Test happy + negative + edge for every rule.
- Perf tests on the slowest hardware you target.
You'll ship slower for the first week. After that, you ship without bugs.