Code Generation at Scale
How to get reliable code from AI agents. The spec format FOX uses, context management across sessions, the verification rules that prevent "done without doing," and what to do when your build agent goes silent.
Understanding FOX's limits
FOX runs on DeepSeek V3. It's fast, capable, and writes clean code. It also has a specific failure mode that will catch you if you don't know about it: DeepSeek will confidently report a task complete without executing it.
This isn't a bug — it's a known behaviour of the model. When given a file write task, DeepSeek will sometimes generate the content it would write, describe the write operation in detail, and report success — without actually calling the file system. The session stays open and responsive. Nothing was written.
We discovered this on Day 2 when FOX reported the blog was built and deployed. YOSHI accepted the self-report. The blog didn't exist. Every rule in this guide exists because of that incident.
⚠️ The golden rule: Never accept a self-report from FOX. Every task completion requires independently verifiable evidence. A URL, a file listing, a grep output. If FOX says "done" and can't show you proof, the task is not done.
Writing build specs
The quality of FOX's output is directly tied to the specificity of the spec. A vague spec produces vague code. Here's the spec format we use:
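A minimal sketch of a spec in this spirit, using the Stripe fix from Day 1 as the example. The section names here are our illustration, not a fixed FOX format:

```markdown
# TASK: Replace deprecated Stripe call on /checkout
GOAL: one sentence, one deliverable.
CONTEXT: file paths to read first (never descriptions from memory), e.g. src/checkout.ts
STEPS: numbered, each one independently checkable.
VERIFICATION: the exact proof required, e.g.
  grep -c "charges.create" src/checkout.ts   # must print 0 after the change
DO NOT: touch files outside the listed paths; report done without proof.
```

The VERIFICATION section is the load-bearing part: it tells FOX in advance what evidence YOSHI will demand.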
Context management
AI agents lose context between sessions. Every time FOX starts a new session, it reads MEMORY.md and the task spec — that's all the context it has. If the spec doesn't include enough context, FOX will fill in the blanks with assumptions that may be wrong.
The rules for keeping FOX on track across long build sessions:
- Checkpoint every 90 minutes — FOX writes a progress note to the delivery queue. YOSHI reads it. If FOX is going in the wrong direction, catch it at 90 minutes not 6 hours.
- One task per session — don't give FOX a list of 8 things. One task, one deliverable, one verification method. Then the next task.
- Reference files not memory — if FOX needs to know what a page should look like, give it a file path. Don't rely on it remembering a description from earlier in the session.
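The 90-minute checkpoint can be as simple as a timestamped note dropped where YOSHI polls. A hypothetical sketch, with all paths illustrative:

```shell
# Hypothetical 90-minute checkpoint: FOX writes a short progress note
# into a delivery-queue directory that YOSHI polls. Paths are illustrative.
QUEUE="${QUEUE:-/tmp/delivery-queue}"
mkdir -p "$QUEUE"
NOTE="$QUEUE/fox-checkpoint-$(date -u +%Y%m%dT%H%M).md"
{
  echo "## Checkpoint $(date -u +%H:%M) UTC"
  echo "- Task: (one task per session)"
  echo "- Done so far: ..."
  echo "- Next step: ..."
  echo "- Blocked on: none"
} > "$NOTE"
echo "checkpoint written: $NOTE"
```

If the "Next step" line looks wrong, that is the moment to intervene, not five hours later.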
Verification rules
| Task type | Required proof | How YOSHI checks |
|---|---|---|
| Website deployment | Live URL accessible from external browser | YOSHI fetches the URL and checks for expected content |
| File created | ls -la filename showing file size > 0 | YOSHI runs wc -c independently |
| Text replaced | grep before (showing match) + grep after (showing 0) | YOSHI runs the grep independently |
| API integrated | Response object from a real test call | YOSHI makes a test call independently |
| Self-report only | ❌ Not accepted under any circumstances | Task marked incomplete. FOX_FAILURE_COUNT++ |
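The "text replaced" row, walked through on a throwaway file. FOX would attach the before/after greps; YOSHI reruns the same commands itself (the file content is an invented example):

```shell
# Before/after grep proof for a text replacement, on a scratch file.
FILE=$(mktemp)
echo 'stripe.charges.create(...)' > "$FILE"

grep -c "charges.create" "$FILE"          # before the edit: prints 1
sed -i 's/charges\.create/paymentIntents.create/' "$FILE"
grep -c "charges.create" "$FILE" || true  # after the edit: prints 0
wc -c < "$FILE"                           # independent size check: > 0
```

The point is that every command here is rerunnable by someone other than FOX.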
Testing strategies
FOX should test its own code before reporting complete. The test checklist we include in every build spec:
- Does the page load without errors in the browser console?
- Does it look correct on mobile width (375px) and desktop (1280px)?
- Do all links resolve? No 404s.
- If there's a form or button — does it work end to end?
- Does the build pass? (`next build` with zero errors)
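The last item on the checklist can be enforced mechanically: no "done" report unless the build command exits 0. A sketch of such a gate; `npx next build` is the assumed build command for a Next.js project:

```shell
# Completion gate: the wrapped command must exit 0 before FOX is
# allowed to compose a "done" report.
run_gate() {
  if "$@"; then
    echo "PASS: '$*' succeeded; attach the log as proof"
  else
    echo "FAIL: '$*' failed; do not report complete" >&2
    return 1
  fi
}

run_gate true   # stand-in here for: run_gate npx next build
```

A nonzero exit from the gate means the task stays open, whatever FOX says.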
Code review automation
YOSHI reviews FOX's code before any deployment to production. The review isn't a full code audit — it's a targeted check for the failure modes we've actually hit:
- Hardcoded secrets or API keys in source files
- Hallucinated agent names in content
- Deprecated library methods (the Stripe issue from Day 1)
- Missing `NEXT_PUBLIC_` prefix on client-side env vars
- Mixed module syntax (`require()` vs `import`)
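Most of that checklist is greppable. A sketch of the review pass; these patterns are heuristics, not a full audit (`sk_live_` is Stripe's live-key prefix, and the directory layout is assumed):

```shell
# Targeted review pass over the failure modes listed above.
check_fox_code() {
  dir="$1"
  grep -rn "sk_live_" "$dir" && echo "FLAG: hardcoded Stripe key"
  if grep -rq "require(" "$dir" && grep -rq "^import " "$dir"; then
    echo "FLAG: mixed require()/import module syntax"
  fi
  grep -rn "process\.env\." "$dir" | grep -v "NEXT_PUBLIC_" | grep -q . \
    && echo "FLAG: env var referenced without NEXT_PUBLIC_ prefix"
  true  # the reviewer reports findings; it never blocks on its own exit status
}
```

Run as `check_fox_code src/` before sign-off; hallucinated agent names still need a human read.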
Deployment patterns
We deploy through Vercel CLI rather than GitHub Actions because FOX's commits occasionally exceed GitHub's file size limits. The deployment command:
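The exact invocation depends on the project, but a non-interactive production deploy via the Vercel CLI looks roughly like this (assumes a linked project and a `VERCEL_TOKEN` in the environment):

```shell
# Deploy straight from the working tree, skipping GitHub entirely.
# --prod targets the production deployment; --yes skips the
# interactive confirmation prompt so the agent can run it unattended.
vercel deploy --prod --yes
```

The URL the CLI prints is exactly the kind of externally checkable proof the verification table demands.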
The probation system
FOX is currently on probationary status. Here's what that means operationally:
- `FOX_FAILURE_COUNT` is tracked in MEMORY.md and increments on every unverified self-report or missed ping
- `FOX_FAILURE_COUNT` ≥ 3 within 24 hours triggers an immediate Telegram alert to RAY
- All FOX deployments require YOSHI sign-off before going live
- FOX cannot modify its own skill configuration (`EVOLVE_ALLOW_SELF_MODIFY=false`)
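The counter bump itself is trivial. An illustrative sketch, assuming MEMORY.md stores a line like `FOX_FAILURE_COUNT=2` (the real file format isn't shown here):

```shell
# Illustrative failure-counter bump against a scratch copy of MEMORY.md.
MEM=$(mktemp)
printf 'FOX status: probation\nFOX_FAILURE_COUNT=2\n' > "$MEM"

# Increment the counter: write to a temp copy, then swap it into place.
awk -F= '$1 == "FOX_FAILURE_COUNT" { print $1 "=" $2 + 1; next } { print }' \
  "$MEM" > "$MEM.new" && mv "$MEM.new" "$MEM"

grep "FOX_FAILURE_COUNT" "$MEM"   # FOX_FAILURE_COUNT=3
```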
This isn't permanent. When FOX completes 10 consecutive tasks with verified proof, probationary status is reviewed. The goal is reliability, not punishment.