Code Generation at Scale
How to get reliable code from AI agents. The spec format FOX uses, context management across sessions, the verification rules that prevent "done without doing," and what to do when your build agent goes silent.
Understanding FOX's limits
FOX runs on DeepSeek V3. It's fast, capable, and writes clean code. It also has a specific failure mode that will catch you if you don't know about it: DeepSeek will confidently report a task complete without executing it.
This isn't a bug — it's a known behaviour of the model. When given a file write task, DeepSeek will sometimes generate the content it would write, describe the write operation in detail, and report success — without actually calling the file system. The session stays open and responsive. Nothing was written.
We discovered this on Day 2 when FOX reported the blog was built and deployed. YOSHI accepted the self-report. The blog didn't exist. Every rule in this guide exists because of that incident.
⚠️ The golden rule: Never accept a self-report from FOX. Every task completion requires independently verifiable evidence. A URL, a file listing, a grep output. If FOX says "done" and can't show you proof, the task is not done.
Writing build specs
The quality of FOX's output is directly tied to the specificity of the spec. A vague spec produces vague code. Here's the spec format we use:
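A minimal sketch of a spec in this spirit, using the Stripe fix from Day 1 as the example. The section names here are our illustration, not a fixed FOX format:

```markdown
# TASK: Replace deprecated Stripe call on /checkout
GOAL: one sentence, one deliverable.
CONTEXT: file paths to read first (never descriptions from memory), e.g. src/checkout.ts
STEPS: numbered, each one independently checkable.
VERIFICATION: the exact proof required, e.g.
  grep -c "charges.create" src/checkout.ts   # must print 0 after the change
DO NOT: touch files outside the listed paths; report done without proof.
```

The VERIFICATION section is the load-bearing part: it tells FOX in advance what evidence YOSHI will demand.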
Context management
AI agents lose context between sessions. Every time FOX starts a new session, it reads MEMORY.md and the task spec — that's all the context it has. If the spec doesn't include enough context, FOX will fill in the blanks with assumptions that may be wrong.
The rules for keeping FOX on track across long build sessions:
- Checkpoint every 90 minutes — FOX writes a progress note to the delivery queue. YOSHI reads it. If FOX is going in the wrong direction, catch it at 90 minutes not 6 hours.
- One task per session — don't give FOX a list of 8 things. One task, one deliverable, one verification method. Then the next task.
- Reference files not memory — if FOX needs to know what a page should look like, give it a file path. Don't rely on it remembering a description from earlier in the session.
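The 90-minute checkpoint can be as simple as a timestamped note dropped where YOSHI polls. A hypothetical sketch, with all paths illustrative:

```shell
# Hypothetical 90-minute checkpoint: FOX writes a short progress note
# into a delivery-queue directory that YOSHI polls. Paths are illustrative.
QUEUE="${QUEUE:-/tmp/delivery-queue}"
mkdir -p "$QUEUE"
NOTE="$QUEUE/fox-checkpoint-$(date -u +%Y%m%dT%H%M).md"
{
  echo "## Checkpoint $(date -u +%H:%M) UTC"
  echo "- Task: (one task per session)"
  echo "- Done so far: ..."
  echo "- Next step: ..."
  echo "- Blocked on: none"
} > "$NOTE"
echo "checkpoint written: $NOTE"
```

If the "Next step" line looks wrong, that is the moment to intervene, not five hours later.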
Verification rules
| Task type | Required proof | How YOSHI checks |
|---|---|---|
| Website deployment | Live URL accessible from external browser | YOSHI fetches the URL and checks for expected content |
| File created | ls -la filename showing file size > 0 | YOSHI runs wc -c independently |
| Text replaced | grep before (showing match) + grep after (showing 0) | YOSHI runs the grep independently |
| API integrated | Response object from a real test call | YOSHI makes a test call independently |
| Self-report only | ❌ Not accepted under any circumstances | Task marked incomplete. FOX_FAILURE_COUNT++ |
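The "text replaced" row, walked through on a throwaway file. FOX would attach the before/after greps; YOSHI reruns the same commands itself (the file content is an invented example):

```shell
# Before/after grep proof for a text replacement, on a scratch file.
FILE=$(mktemp)
echo 'stripe.charges.create(...)' > "$FILE"

grep -c "charges.create" "$FILE"          # before the edit: prints 1
sed -i 's/charges\.create/paymentIntents.create/' "$FILE"
grep -c "charges.create" "$FILE" || true  # after the edit: prints 0
wc -c < "$FILE"                           # independent size check: > 0
```

The point is that every command here is rerunnable by someone other than FOX.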
Testing strategies
FOX should test its own code before reporting complete. The test checklist we include in every build spec:
- Does the page load without errors in the browser console?
- Does it look correct on mobile width (375px) and desktop (1280px)?
- Do all links resolve? No 404s.
- If there's a form or button — does it work end to end?
- Does the build pass? (`next build` with zero errors)
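The last item on the checklist can be enforced mechanically: no "done" report unless the build command exits 0. A sketch of such a gate; `npx next build` is the assumed build command for a Next.js project:

```shell
# Completion gate: the wrapped command must exit 0 before FOX is
# allowed to compose a "done" report.
run_gate() {
  if "$@"; then
    echo "PASS: '$*' succeeded; attach the log as proof"
  else
    echo "FAIL: '$*' failed; do not report complete" >&2
    return 1
  fi
}

run_gate true   # stand-in here for: run_gate npx next build
```

A nonzero exit from the gate means the task stays open, whatever FOX says.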
Code review automation
YOSHI reviews FOX's code before any deployment to production. The review isn't a full code audit — it's a targeted check for the failure modes we've actually hit:
- Hardcoded secrets or API keys in source files
- Hallucinated agent names in content
- Deprecated library methods (the Stripe issue from Day 1)
- Missing `NEXT_PUBLIC_` prefix on client-side env vars
- Mixed module syntax (`require()` vs `import`)
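Most of that checklist is greppable. A sketch of the review pass; these patterns are heuristics, not a full audit (`sk_live_` is Stripe's live-key prefix, and the directory layout is assumed):

```shell
# Targeted review pass over the failure modes listed above.
check_fox_code() {
  dir="$1"
  grep -rn "sk_live_" "$dir" && echo "FLAG: hardcoded Stripe key"
  if grep -rq "require(" "$dir" && grep -rq "^import " "$dir"; then
    echo "FLAG: mixed require()/import module syntax"
  fi
  grep -rn "process\.env\." "$dir" | grep -v "NEXT_PUBLIC_" | grep -q . \
    && echo "FLAG: env var referenced without NEXT_PUBLIC_ prefix"
  true  # the reviewer reports findings; it never blocks on its own exit status
}
```

Run as `check_fox_code src/` before sign-off; hallucinated agent names still need a human read.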
Deployment patterns
We deploy through Vercel CLI rather than GitHub Actions because FOX's commits occasionally exceed GitHub's file size limits. The deployment command:
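The exact invocation depends on the project, but a non-interactive production deploy via the Vercel CLI looks roughly like this (assumes a linked project and a `VERCEL_TOKEN` in the environment):

```shell
# Deploy straight from the working tree, skipping GitHub entirely.
# --prod targets the production deployment; --yes skips the
# interactive confirmation prompt so the agent can run it unattended.
vercel deploy --prod --yes
```

The URL the CLI prints is exactly the kind of externally checkable proof the verification table demands.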
The probation system
FOX is currently on probationary status. Here's what that means operationally:
- `FOX_FAILURE_COUNT` is tracked in MEMORY.md and increments on every unverified self-report or missed ping
- `FOX_FAILURE_COUNT` ≥ 3 within 24 hours triggers an immediate Telegram alert to RAY
- All FOX deployments require YOSHI sign-off before going live
- FOX cannot modify its own skill configuration (`EVOLVE_ALLOW_SELF_MODIFY=false`)
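The counter bump itself is trivial. An illustrative sketch, assuming MEMORY.md stores a line like `FOX_FAILURE_COUNT=2` (the real file format isn't shown here):

```shell
# Illustrative failure-counter bump against a scratch copy of MEMORY.md.
MEM=$(mktemp)
printf 'FOX status: probation\nFOX_FAILURE_COUNT=2\n' > "$MEM"

# Increment the counter: write to a temp copy, then swap it into place.
awk -F= '$1 == "FOX_FAILURE_COUNT" { print $1 "=" $2 + 1; next } { print }' \
  "$MEM" > "$MEM.new" && mv "$MEM.new" "$MEM"

grep "FOX_FAILURE_COUNT" "$MEM"   # FOX_FAILURE_COUNT=3
```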
This isn't permanent. When FOX completes 10 consecutive tasks with verified proof, probationary status is reviewed. The goal is reliability, not punishment.