Why AI Needs a New Data Format
The Problem: JSON Was Designed for Humans, Not AI
Every day, AI agents process millions of JSON payloads. They parse curly braces, quoted keys, escaped strings, nested objects — all conventions designed for human readability in the 1990s. But AI agents don't read. They tokenize.
Every redundant quote, every structural delimiter, every human-friendly naming convention costs tokens. And tokens cost money, latency, and context window space.
The data formats we use today — JSON, XML, YAML — were never designed with tokenization in mind. They were designed for human eyes. The result is that AI agents spend a significant portion of their processing budget on syntax rather than semantics.
The Data: 26–71% Token Reduction
We didn't just theorize. We benchmarked. Across 9 live model APIs (MiMo, DeepSeek, Grok, GLM-5.1, Qwen3.6, Hy3, Kimi-K2.6, Nex-N2, and more), we tested MarkZero (Agent IR) against JSON, Markdown, TOON, and other formats.
The results:
| Scenario | MarkZero | JSON | Markdown | Reduction |
|---|---|---|---|---|
| Java OOM (flat, repeated data) | 137 tokens | 192 tokens | 154 tokens | 28.6% |
| LLM Reasoning (AST Node Repair) | 687 tokens | 928 tokens | — | 26% |
| Nested JS (after escaper fix) | 120 tokens | 157 tokens | — | 23.6% |
The range is 26–71% because efficiency depends on data shape. Flat, repeated data benefits most from value interning. The pattern is clear: the more structured data you throw at an LLM, the more tokens you save with Agent IR.
The Key Insight: Agent IR vs Human IR
Here's what makes this fundamentally different from every other "compact format" proposal:
TOON, ONTO, JTON, TERSE — they all use visual delimiters. Indentation, columnar alignment, special characters arranged to be visually parsable by humans. They're still Human IR (Intermediate Representation).
Agent IR is different. It doesn't try to be readable. It uses flat relational referencing — values are interned once and referenced by position. The "markup" is optimized for tokenization, not for your eyes.
Markup is for Screens. Markdown is for Docs. MarkZero is for Intelligence.
This distinction matters. A human can glance at JSON and understand the structure. An LLM doesn't "glance" — it tokenizes. And every token that carries structural overhead instead of semantic meaning is a wasted token.
The Discovery: Header Placement Matters (5.4× Difference)
During our LLM reasoning benchmark with MiMo on an AST node repair task, we found something unexpected. The same MarkZero header — the decoder instructions that tell the LLM how to interpret the format — produced wildly different results depending on where you put it.
| Configuration | Total Tokens | Latency | Result |
|---|---|---|---|
| MZ header in system prompt | 3,733 | 108s | ❌ Worst |
| MZ header in payload (English) | 687 | 11.1s | ✅ Best |
| JSON compact (baseline) | 928 | 23.6s | Baseline |
5.4× difference — from the same format, the same header, just placed differently. When the header was in the system prompt, the model over-tokenized it. When placed inline with the payload, the model processed it as context and decoded efficiently.
This finding has implications beyond MarkZero: how you structure instructions for LLMs matters as much as what you say.
The Escaper Bug: 421T → 120T with Grid Referencing
Early in development, our nested JS data structure required 421 tokens — the worst result in our benchmark. We almost gave up on nested data entirely.
The culprit was the encoder's escaping strategy: when data branched, the encoder escaped the branches inline, adding structural tokens at every level.
The fix was simple in principle: instead of escaping data branches, we reference their path in a grid. Values are interned once in a flat grid, then referenced by coordinate.
Result: 421 tokens → 120 tokens — a 3.5× reduction. This beat TOON (154T) and JSON (157T) for the same nested data.
"Never escape data branches — just reference their path."
Grid referencing is now a core part of MarkZero's design. It's the mechanism that makes Agent IR work for nested structures, not just flat data.
The Positioning: Not Competing, Complementing
Let's be clear about what Agent IR is not:
- Not competing with MCP — MCP is a protocol layer (97M+ monthly SDK downloads). Agent IR is a data format. You can use MarkZero through MCP.
- Not competing with Trae or AGY — Those are agent IDEs. If you use Trae/AGY/Kilo, you can add Agent IR for 26–71% token efficiency on structured data.
- Not replacing JSON everywhere — JSON is great for APIs, config files, and human-readable data. Agent IR is for the AI-to-data interface.
The positioning is simple:
"Agent IR: A data format optimized for AI, not for human eyes. 26–71% more efficient than JSON."
Every other compact format (TOON, ONTO, JTON, TERSE) still uses visual delimiters. They're more compact than JSON, but they're still Human IR. MarkZero is the first format that is genuinely Agent IR — designed from the ground up for tokenization, not for reading.
Try MarkZero
MarkZero is open source. The spec is finalized. The encoder and decoder are tested across 9 models.
npm install @pakakas/markzero
Whether you're building agent tools, optimizing token budgets, or designing AI-native systems — Agent IR gives your agents what they actually need: structured data without the overhead.