Key Takeaways
Token cost + latency are dominated by what you send/receive. In RAG + agent systems, prompt/tool payload size is often the biggest controllable driver of spend and throughput, and JSON’s verbosity (quotes, brackets, long keys) compounds quickly.
Token reduction doesn’t have to hurt quality. Prior work on compression/prompt distillation shows big token cuts can keep quality steady, or even improve it by removing distractors.
TOON is the proposed “format swap” lever. The article argues you can replace JSON with TOON (Token-Oriented Object Notation), a compact, schema-aware, still-human-readable serialization, often saving ~30–60% of tokens under favorable conditions, especially with deep nesting and verbose keys.
Bigger wins show up in “structured-heavy” parts of systems. The best targets are RAG context packaging, tool call args/results, and agent memory/transcripts, where repeated keys and nested objects balloon token counts.
Measure it with a simple A/B harness. Instrument token counts with tiktoken, keep pricing in config, and compare JSON vs TOON on representative tasks for: prompt tokens, output tokens, total tokens, latency, parse success, and estimated cost.
Use typed validation to make it production-safe. Define Pydantic models for outputs, round-trip serialize/parse, and validate every response so format changes don’t silently break downstream logic.
Reliability comes from guardrails, not magic. The recommended tactics: strict “TOON only” instructions, short/consistent keys, minimal repair (strip fences, trim to balanced braces), bounded retries, and fallback to JSON if TOON parsing fails.
Short keys compound savings. Even within TOON, mapping verbose fields to compact keys (e.g., a/c/conf) is presented as a major multiplier across multi-step agent loops.
Streaming works if you parse only when complete. For low-latency UIs, buffer deltas, track bracket balance, and validate only after the structure closes; keep raw blobs for audit logs.
Integration is mostly “swap serialization at the edges.” In LlamaIndex/LangChain/Pydantic-AI style stacks, you keep typed layers the same, but encode/decode context + tool payloads as TOON before they hit memory/logs/prompts.
Know when not to bother. For tiny one-shot prompts with minimal structure, savings may be negligible; and TOON isn’t ideal when a downstream consumer strictly requires JSON and you can’t (or won’t) convert.
Suggested rollout path: inventory schemas → implement TOON adapters + unit tests (incl. UTF-8/non-ASCII) → run A/B suite → feature-flag rollout → monitor parse/accuracy/cost → add retries/fallbacks + observability.
Main thesis: format efficiency is a high-ROI, low-effort optimization. Swap JSON for a compact structured format (TOON), validate aggressively, and you can reduce cost and latency fast, especially in RAG + agent loops where payloads accumulate.
Why formats matter now
In most production RAG and agent systems, the largest and most controllable contributors to cost and latency are the tokens you send and receive. Every extra key name, bracket, and quote in your prompts and tool calls gets byte-pair encoded, billed, and adds latency. JSON is human-friendly but verbose, and it was never designed for token efficiency in generative systems. The consequence: agent transcripts and RAG context routinely grow until they bottleneck throughput and explode budgets.
Specialized compression and prompt-distillation research has repeatedly shown that you can reduce tokens substantially without hurting task quality, sometimes even improving it by removing distractors [5][6][7]. You don’t need a research pipeline to realize a big part of those gains. In this article, we’ll replace JSON with TOON (Token-Oriented Object Notation) to serialize both prompts and structured outputs, measure the impact, and harden the approach for production. TOON is a compact, schema-aware, human-readable format designed for LLMs; typical savings range from 30–60% depending on nesting depth and key verbosity [1][3].
We’ll cover practical Python patterns with Pydantic models, an A/B harness you can run today, integration notes for LlamaIndex/LangChain/Pydantic-AI style stacks, and reliability tactics (schema validation, retries, and fallbacks). We’ll also touch on streaming and low-latency paths where format efficiency stacks with transport gains [13][14][15].
What is TOON, and why it helps
TOON (Token-Oriented Object Notation) is a compact serialization format that preserves structure while reducing redundant characters common in JSON (quotes, long key names, whitespace). It’s designed to be easy for LLMs to produce and parse, while staying readable for humans and interoperable with typed models. Multiple Python libraries implement the format, including toonify, toons (Rust-backed), and toon-formatter [1][2][3][4][5]. In practice, the more nested your data and the more verbose your keys, the larger the win versus JSON, hence the strong results for RAG metadata, tool call arguments, and multi-step agent memories.
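The shape of the savings can be seen with a toy comparison. The TOON-style string below is hand-written for illustration, not the exact output of any particular library, but it mirrors the tabular layout these formats use: keys declared once in a header, then one comma-separated row per record.

```python
import json

docs = [
    {"id": "A1", "score": 0.92},
    {"id": "B7", "score": 0.85},
    {"id": "C3", "score": 0.80},
]
# Compact JSON repeats every key name in every record.
as_json = json.dumps(docs, separators=(",", ":"))

# TOON-style tabular layout: keys appear once, rows carry only values.
as_toon = "[3]{id,score}:\nA1,0.92\nB7,0.85\nC3,0.80"

print(len(as_json), len(as_toon))  # the tabular form is markedly shorter
```

The gap widens with more records, longer keys, and deeper nesting, which is exactly the profile of RAG metadata and agent memories.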
Setup and instrumentation
We’ll instrument token usage and cost using tiktoken. Keep pricing in a config so your harness remains reproducible across models and contracts [8][9]. For TOON, we’ll show a simple adapter that works with popular Python packages.
You can skip the code below; it's included as context for our experiments:
"""JSON vs TOON: API latency, token, and cost comparison across formats."""
import json
import time
from statistics import stdev
import dotenv
import tiktoken
from openai import OpenAI
from pydantic import BaseModel
from scipy.stats import ttest_ind
dotenv.load_dotenv()
MODEL = "gpt-5.2"
RUNS = 5
PRICING = {"gpt-5.2": {"input": 1.75, "output": 14.0}} # $/1M tokens
SYSTEM = "Answer the question using only the provided context. Be concise."
QUESTION = "Explain the fast inverse square root and cite the relevant doc ids."
client = OpenAI()
Then, we also have to add the data we are going to use for the experiment:
# --- Sample data ---
DOCS = [
DocSchema(id="A1", title="Fast Inverse Square Root (1999)",
source="https://en.wikipedia.org/wiki/Fast_inverse_square_root",
span="sec:impl", score=0.92,
summary="Quake III's 1/sqrt(x) bit hack: magic constant 0x5f3759df."),
DocSchema(id="B7", title="Bit-level Floating Point Tricks",
source="https://example.com/bit-tricks",
span="p3", score=0.85,
summary="Casting ints to floats, Newton-Raphson refinement."),
DocSchema(id="C3", title="Numerical Methods in C (Press et al.)",
source="https://example.com/numerical-methods",
span="ch5", score=0.80,
summary="Comprehensive guide to numerical algorithms including sqrt approximations."),
DocSchema(id="D2", title="IEEE 754 Floating-Point Standard",
source="https://en.wikipedia.org/wiki/IEEE_754",
span="sec:binary32", score=0.78,
summary="Binary32 format layout: sign, exponent, mantissa bit fields."),
DocSchema(id="E9", title="Game Engine Architecture (Gregory)",
source="https://example.com/game-engine-arch",
span="ch6.3", score=0.74,
summary="Performance tricks used in AAA game engines, including SIMD and bit hacks."),
DocSchema(id="F5", title="Hacker's Delight (Warren)",
source="https://example.com/hackers-delight",
span="ch11", score=0.71,
summary="Low-level bit manipulation tricks, integer approximations of float ops."),
DocSchema(id="G4", title="Newton-Raphson Method — MathWorld",
source="https://mathworld.wolfram.com/Newton-RaphsonMethod.html",
span="eq:3", score=0.68,
summary="Iterative root-finding method used as refinement step in fast inverse sqrt."),
DocSchema(id="H6", title="Quake III Arena Source Code",
source="https://github.com/id-Software/Quake-III-Arena",
span="q_math.c:561", score=0.66,
summary="Original C implementation of Q_rsqrt with the magic constant and one NR step."),
]
DOCS_DICT = [d.model_dump() for d in DOCS]
Finally, some helper functions to: (a) count the number of tokens, (b) estimate the pricing, (c) make the API calls, and (d) run and report the experiment:
# --- Helpers ---
def count_tokens(text: str) -> int:
try:
enc = tiktoken.encoding_for_model(MODEL)
except Exception:
enc = tiktoken.get_encoding("cl100k_base")
return len(enc.encode(text or ""))
def estimate_cost(tok_in: int, tok_out: int) -> float:
r = PRICING.get(MODEL, {"input": 0.0, "output": 0.0})
return (tok_in / 1e6) * r["input"] + (tok_out / 1e6) * r["output"]
def call_api(context: str) -> dict:
prompt = f"Context:\n{context}\n\nQuestion: {QUESTION}"
t0 = time.perf_counter()
resp = client.chat.completions.create(
model=MODEL,
temperature=0,
messages=[
{"role": "system", "content": SYSTEM},
{"role": "user", "content": prompt},
],
)
latency_ms = (time.perf_counter() - t0) * 1000
tok_in = resp.usage.prompt_tokens
tok_out = resp.usage.completion_tokens
return {
"latency_ms": round(latency_ms, 2),
"tok_in": tok_in,
"tok_out": tok_out,
"cost_usd": round(estimate_cost(tok_in, tok_out), 6),
}
def run_and_report(label: str, context: str, baseline_cost: float) -> list[dict]:
ctx_bytes = len(context.encode())
ctx_tokens = count_tokens(context)
print(f"\n{label} — bytes: {ctx_bytes} context_tokens: {ctx_tokens}")
runs = []
for i in range(RUNS):
r = call_api(context)
runs.append(r)
print(f" run {i+1}: latency_ms={r['latency_ms']:>8} "
f"tok_in={r['tok_in']:>4} tok_out={r['tok_out']:>4} cost_usd={r['cost_usd']}")
lats = [r["latency_ms"] for r in runs]
costs = [r["cost_usd"] for r in runs]
mean_lat = round(sum(lats) / RUNS, 2)
mean_cost = round(sum(costs) / RUNS, 6)
print(f" → mean/std/min/max latency (ms): "
f"{mean_lat} / {round(stdev(lats), 2)} / {round(min(lats), 2)} / {round(max(lats), 2)}")
print(f" → mean cost_usd: {mean_cost} ", end="")
if label != "JSON":
print( f"cost_ratio vs JSON: {round(mean_cost / max(baseline_cost, 1e-9), 3)}", end="")
print()
    return runs
Notes
tiktoken is the canonical tokenizer for OpenAI models; use it to count both prompt and output tokens [8][9].
Keep a model-to-pricing map in config or env to make A/B runs portable across providers.
We pick the first available TOON implementation to ease adoption; you can standardize later on toonify or a Rust-backed variant for speed [2][4][5].
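The pick-first-available strategy can be sketched as a tiny adapter. The toon.encode and toons.dumps calls mirror how the packages are used later in this article; treat the exact signatures as assumptions and check them against your installed versions.

```python
import json

def encode_payload(data) -> str:
    """Encode with the first available TOON implementation; fall back to
    compact JSON so the pipeline never hard-fails on a missing package."""
    try:
        import toon  # toonify implementation; assumed API: toon.encode(obj) -> str
        return toon.encode(data)
    except ImportError:
        pass
    try:
        import toons  # Rust-backed variant; assumed API: toons.dumps(obj) -> str
        return toons.dumps(data)
    except ImportError:
        pass
    return json.dumps(data, separators=(",", ":"))

payload = encode_payload([{"id": "A1", "score": 0.92}])
```

The JSON fallback also doubles as a safe default for environments (CI, local dev) where the TOON packages are not installed.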
Minimal TOON round-trip with Pydantic
Start by validating that TOON faithfully preserves your structure. Use typed models to catch mismatch early and to power guardrails and retries later [10][11][16].
The sample documents above are instances of the simple Pydantic schema below (in the actual script, this class must be defined before DOCS):
# --- Schema ---
class DocSchema(BaseModel):
id: str
title: str
source: str
span: str
score: float
summary: str
If your keys are long or you nest deep, expect larger relative wins with TOON. Document these baselines for your main schemas before moving on to models.
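Before wiring in a real library, it helps to see the round-trip contract on a toy codec. The sketch below uses a simplified tabular encoding as a stand-in for an actual TOON package and checks that decode(encode(x)) == x; in this article's setup you would finish by calling DocSchema.model_validate on each decoded record, which also restores typed fields like score: float (the toy decoder returns strings only).

```python
def encode_rows(rows: list[dict]) -> str:
    """Simplified TOON-style tabular encoder (toy stand-in for a real library)."""
    keys = list(rows[0])
    header = f"[{len(rows)}]{{{','.join(keys)}}}:"
    return "\n".join([header] + [",".join(str(r[k]) for k in keys) for r in rows])

def decode_rows(text: str) -> list[dict]:
    """Inverse of encode_rows; all values come back as strings."""
    header, *lines = text.splitlines()
    keys = header[header.index("{") + 1 : header.index("}")].split(",")
    return [dict(zip(keys, line.split(","))) for line in lines]

rows = [{"id": "A1", "span": "sec:impl"}, {"id": "B7", "span": "p3"}]
assert decode_rows(encode_rows(rows)) == rows  # lossless round-trip
```

Run the same check against your chosen TOON library before trusting it in production, ideally as a unit test over your real schemas.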
Serialization step (main code!)
We’ll instruct the model to produce TOON and parse it back to a Pydantic object. While OpenAI offers structured outputs for JSON schemas, TOON works well with instruction-only prompting and can be validated the same way after parsing [12]. Keep keys concise (e.g., a for answer, c for citations, conf for confidence) to maximize savings.
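A minimal sketch of that key-mapping idea follows; the verbose-to-short map is hypothetical, so substitute whatever names your schemas document.

```python
# Hypothetical mapping: verbose internal names <-> short wire keys.
KEYMAP = {"answer": "a", "citations": "c", "confidence": "conf"}
INVERSE = {short: verbose for verbose, short in KEYMAP.items()}

def shorten(obj: dict) -> dict:
    """Apply short keys before serializing for the prompt."""
    return {KEYMAP.get(k, k): v for k, v in obj.items()}

def widen(obj: dict) -> dict:
    """Restore verbose keys after parsing a model response."""
    return {INVERSE.get(k, k): v for k, v in obj.items()}

result = {"answer": "0x5f3759df", "citations": ["A1", "H6"], "confidence": 0.9}
assert widen(shorten(result)) == result
```

Keeping the mapping in one place means your internal code and Pydantic models stay readable while only the wire format pays the short-key tax.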
Then, how can we use TOON? The answer may surprise you because it's actually quite simple:
# --- Context serializers ---
def json_context() -> str:
return json.dumps(DOCS_DICT, separators=(",", ":"))
def toon_context() -> str:
import toon
return toon.encode(DOCS_DICT)
def toons_context() -> str:
import toons
return toons.dumps(DOCS_DICT, indent=2, delimiter=",")
Yes, it's that simple. Just as you use the json package to dump a dictionary, you use a TOON package to do the same. In our case, we used two different packages: toon (the toonify implementation) and toons. They basically do the same thing, which is transform a long JSON entry like the one below:
[{"id":"A1","title":"Fast Inverse Square Root (1999)","source":"https://en.wikipedia.org/wiki/Fast_inverse_square_root","span":"sec:impl","score":0.92,"summary":"Quake III's 1/sqrt(x) bit hack: magic constant 0x5f3759df."},{"id":"B7"...
To something shorter and simpler:
[8]{id,title,source,span,score,summary}:
A1,Fast Inverse Square Root (1999),"https://en.wikipedia.org/wiki/Fast_inverse_square_root","sec:impl",0.92,"Quake III's 1/sqrt(x) bit hack: magic constant 0x5f3759df."
...
In our case, this reduced the input tokens from 515 to 424 (about 18% fewer tokens). At large scale, that compounds into significantly lower costs and inference times!
Production tip: If you rely on strong structure guarantees, you can also combine these instructions with a constrained decoding tool (e.g., JSON Schema-based approaches) for JSON and then convert. But most teams find instruction-only TOON reliable when coupled with validation and a small retry budget [17][18][19][20].
RAG packaging with TOON and an A/B harness
Now, let’s compress RAG context and measure. We’ll simulate a small retrieval result, package it as JSON vs TOON, call the model, and collect tokens, latency, parse success, and estimated cost. Frameworks like LlamaIndex and LangChain already support structured outputs and token counting; you can graft the packaging functions into their retriever->LLM chains [13][14].
Below we show the experiment run script (that uses the data and auxiliary functions we defined before):
def main():
# JSON baseline
json_ctx = json_context()
json_runs = run_and_report("JSON", json_ctx, baseline_cost=1.0)
json_mean_lat = sum(r["latency_ms"] for r in json_runs) / RUNS
json_mean_cost = sum(r["cost_usd"] for r in json_runs) / RUNS
json_tok_in = sum(r["tok_in"] for r in json_runs) / RUNS
json_tok_out = sum(r["tok_out"] for r in json_runs) / RUNS
json_bytes = len(json_ctx.encode())
# TOON libraries
toon_results = {}
for label, ctx_fn in [("toon (toonify)", toon_context), ("toons", toons_context)]:
try:
ctx = ctx_fn()
runs = run_and_report(label, ctx, baseline_cost=json_mean_cost)
toon_results[label] = {"runs": runs, "bytes": len(ctx.encode())}
except Exception as exc:
print(f"\n{label}: error — {exc}")
# --- Summary ---
print("\n" + "=" * 80)
print(f"{'SUMMARY':^80}")
print("=" * 80)
json_lats = [r["latency_ms"] for r in json_runs]
json_costs = [r["cost_usd"] for r in json_runs]
for label, data in toon_results.items():
runs = data["runs"]
lats = [r["latency_ms"] for r in runs]
costs = [r["cost_usd"] for r in runs]
mean_lat = sum(lats) / RUNS
mean_cost = sum(costs) / RUNS
tok_in = sum(r["tok_in"] for r in runs) / RUNS
tok_out = sum(r["tok_out"] for r in runs) / RUNS
# Welch's t-test (independent samples, unequal variances)
t_lat, p_lat = ttest_ind(json_lats, lats, equal_var=False)
t_cost, p_cost = ttest_ind(json_costs, costs, equal_var=False)
def sig(p):
if p < 0.001: return "*** (p<0.001)"
if p < 0.01: return "** (p<0.01)"
if p < 0.05: return "* (p<0.05)"
return "ns (p≥0.05)"
print(f"\n{label} vs JSON")
print(f" {'metric':<18} {'JSON mean':>12} {'TOON mean':>12} {'ratio':>7} {'t':>7} {'p':>8} significance")
print(f" {'-'*78}")
print(f" {'latency (ms)':<18} {json_mean_lat:>12.2f} {mean_lat:>12.2f} "
f"{mean_lat/max(json_mean_lat,1e-9):>7.3f} {t_lat:>7.3f} {p_lat:>8.4f} {sig(p_lat)}")
print(f" {'cost_usd':<18} {json_mean_cost:>12.6f} {mean_cost:>12.6f} "
f"{mean_cost/max(json_mean_cost,1e-9):>7.3f} {t_cost:>7.3f} {p_cost:>8.4f} {sig(p_cost)}")
print(f" {'tok_in':<18} {json_tok_in:>12.1f} {tok_in:>12.1f} "
f"{tok_in/max(json_tok_in,1e-9):>7.3f}")
print(f" {'tok_out':<18} {json_tok_out:>12.1f} {tok_out:>12.1f} "
f"{tok_out/max(json_tok_out,1e-9):>7.3f}")
print(f" {'context bytes':<18} {json_bytes:>12} {data['bytes']:>12} "
f"{data['bytes']/max(json_bytes,1e-9):>7.3f}")
print("\n" + "=" * 80)
print(" Welch's t-test (two-sided, unequal variances). *p<0.05 **p<0.01 ***p<0.001")
print("=" * 80)
Use your real retriever and schemas to get representative measurements. On typical agent/RAG payloads, you should see TOON prompts consuming substantially fewer tokens than JSON, resulting in lower latency and cost per request, often compounding over multi-step agent loops. Keep a close eye on parse success and task accuracy; in practice, these remain stable when you document short keys and validate outputs.
Experimental results
For the experiments, I ran the same query/context (defined above) 40 times with JSON and with each of the two TOON packages, then computed the statistical significance of the differences so the results are statistically grounded. Check it below!
JSON (baseline)
Mean Latency (ms): 4420.84
Mean Cost (USD, GPT-5.2): 0.003897
Input Tokens: 515
Output Tokens: 214
TOON (toonify)
Mean Latency (ms): 4217.43 (4.6% lower, p<0.05)
Mean Cost (USD, GPT-5.2): 0.003710 (4.8% lower, p<0.001)
Input Tokens: 424
Output Tokens: 212
TOONS (toons)
Mean Latency (ms): 4135.05 (6.5% lower, p<0.05)
Mean Cost (USD, GPT-5.2): 0.003698 (5.1% lower, p<0.001)
Input Tokens: 424
Output Tokens: 211
In general, toon and toons performed about the same, and both are statistically superior to classical JSON in latency and cost. We did not test whether task quality was affected, however; verify that on your own workloads, since it cannot be generalized from this experiment.
What about streaming and real‑time? parsing TOON incrementally
OpenAI’s Responses API supports typed streaming events over a persistent path. TOON works fine with streaming: buffer text deltas and attempt a parse only once a complete structure is observed (e.g., until brackets balance). This keeps your UI responsive without risking partial-structure parse errors [13][14][15].
Heuristics that help streaming parses: ignore deltas until you see the first opening brace or bracket; count bracket balance; only send for validation when closed; and always keep the raw TOON text for audit logging.
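Those heuristics fit in a small buffer class. This is a sketch only: it ignores brackets inside quoted strings, which a production parser must handle.

```python
class StreamBuffer:
    """Accumulate streamed deltas; signal readiness once the first
    structure's brackets balance. Raw text is kept for audit logs."""

    def __init__(self) -> None:
        self.raw = ""
        self.depth = 0
        self.started = False

    def feed(self, delta: str) -> bool:
        """Append a delta; return True when a complete structure has closed."""
        self.raw += delta
        for ch in delta:
            if ch in "{[":
                self.depth += 1
                self.started = True
            elif ch in "}]":
                self.depth -= 1
        return self.started and self.depth == 0

buf = StreamBuffer()
chunks = ['{a:"0x5f', '3759df",c:[A1,', 'H6],conf:0.9}']
ready = [buf.feed(c) for c in chunks]  # only the last chunk closes the structure
```

Only when feed returns True do you hand buf.raw to the TOON parser and your Pydantic validator; until then the UI can render the raw text optimistically.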
Reliability tactics for production
Schema-first: Define Pydantic models and validate every output. Use retries with clarifying prompts when validation fails; bound retries to avoid loops [10][16].
Short keys and consistency: Document short, consistent keys for common schemas. Models learn them quickly and they compound savings across steps.
Repair strategies: Implement minimal repairs (strip code fences, trim to first balanced structure). If parsing still fails, retry once with a focused system reminder (“Return only TOON, no prose”).
Fallbacks: If a request still fails, accept JSON (instruction says TOON or JSON) and parse it; you can convert JSON to your typed object then re‑emit TOON if you need consistent storage.
Observability: Log the raw TOON blob, parse status, retries, and final structured object. For product dashboards, render a JSON view so non-engineers can browse comfortably.
Guardrails: If your domain requires stronger guarantees, pair validation with a guard framework that supports typed outputs and retries. This pattern dovetails with Pydantic models [16].
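The repair and fallback tactics above can be sketched as follows. Here parse_toon stands in for whatever decoder your chosen TOON library exposes, and the fence-stripping regex is a deliberately minimal assumption; harden both for your environment.

```python
import json
import re

def minimal_repair(text: str) -> str:
    """Strip surrounding code fences, then trim to the first balanced structure."""
    text = re.sub(r"^```[a-zA-Z]*\n|\n?```$", "", text.strip())
    starts = [i for i in (text.find("{"), text.find("[")) if i != -1]
    start = min(starts) if starts else 0
    depth = 0
    for i, ch in enumerate(text[start:], start):
        if ch in "{[":
            depth += 1
        elif ch in "}]":
            depth -= 1
            if depth == 0:  # first balanced structure ends here
                return text[start : i + 1]
    return text

def parse_with_fallback(text: str, parse_toon):
    """Try TOON first; fall back to JSON (the instruction says 'TOON or JSON')."""
    repaired = minimal_repair(text)
    try:
        return parse_toon(repaired)
    except Exception:
        return json.loads(repaired)

def _no_toon_parser(_: str):
    raise ValueError("simulated TOON parse failure")

demo = parse_with_fallback('```json\n{"a": 1}\n```', _no_toon_parser)
```

Wrap parse_with_fallback in your bounded-retry loop, and log both the raw blob and which branch succeeded so you can track parse-success rates over time.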
Integrating with LlamaIndex/LangChain/Pydantic‑AI
LlamaIndex: Use structured outputs for Pydantic models and swap your context builder to TOON. Token counters in LlamaIndex (tiktoken-backed) help quantify gains in traces. Your parser stays the same; just decode TOON to dict and validate [13][14].
LangChain: If you already use structured output parsers or function-calling interfaces, you can serialize tool arguments/results as TOON before appending to memory, and decode back on the receiving side. LangSmith’s token/cost tracking can capture savings across chains [21].
Pydantic‑AI: Tools and agent results are typed already. Add a pre/post-processor that encodes conversation state deltas and tool payloads to TOON to slow memory growth. Validation on the typed layer remains unchanged [11].
Accuracy and failure modes
Accuracy: In practice, switching JSON to TOON as a container does not change task semantics. Keep a small test set (extraction, classification, tool-augmented QA) and A/B to confirm equivalence for your domain.
Deeply nested data: TOON’s savings grow with depth, but prompts become harder for the model to navigate if you over-pack. Consider summarizing or using ranked slices (top‑K) to reduce noise [5][6].
Non-ASCII: Ensure your chosen TOON library round-trips UTF‑8 as expected; include a few tests with non‑ASCII to guard regressions.
Unexpected lists/dicts: Pydantic validation will catch shape mismatches. Keep error messages in logs; they are useful for few-shot reminders in retries.
Grammar hints: Include a micro “schema grammar” in the system message (short keys, brackets, no prose). This dramatically reduces deviations without resorting to heavy constrained decoding machinery [17][18][19].
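Such a micro grammar can be as small as a few lines appended to the system message; the wording and short keys below are illustrative, not a fixed spec.

```python
# Hypothetical micro "schema grammar" for the system prompt: names the short
# keys, their types, and shows one example so the model has a target shape.
SCHEMA_HINT = (
    "Return TOON only, no prose and no code fences. "
    "Keys: a (answer: str), c (citations: list of doc ids), conf (float 0-1). "
    'Example: {a:"...",c:[A1,H6],conf:0.9}'
)

system_message = "Answer using only the provided context. " + SCHEMA_HINT
```

Keep this hint in the same config as your key map so the prompt and the parser never drift apart.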
Results, guidance, and when not to use TOON
Expect larger benefits in agent loops (where memory grows across steps) and RAG prompts that embed many chunks with metadata; these cases amplify token savings and reduce cumulative latency per user session. On small one-shot classification prompts with minimal structure, the marginal gain may be negligible. TOON shines when:
Prompts or outputs carry nested structure or many keys.
Tool arguments/results are appended to memories over time.
You need compact logs and payloads for streaming/real-time paths.
Don’t use TOON if a downstream consumer strictly requires JSON and you cannot convert, or if your platform mandates model-side JSON-schema constrained decoding and you’re unwilling to add a conversion layer. Otherwise, TOON is a pragmatic format swap with immediate budget impact, especially when paired with low-latency transports and streaming [13][14][15].
Migration checklist (production)
Inventory schemas used in prompts, tool calls, and outputs; define Pydantic models.
Implement TOON serialize/deserialize and validate round-trips (unit tests with edge cases).
Instrument token counts and latency; run your A/B harness on a representative suite.
Roll out behind a flag; monitor parse success, accuracy, and cost diffs.
Add retries and fallbacks; log raw TOON and typed outputs for observability [16].
Coordinate with roadmap changes (model deprecations, API migrations) so you bundle format gains with transport/perf upgrades [22].
Conclusion
Format efficiency is one of the simplest, highest-ROI levers you can pull today. TOON lets you keep structure without paying JSON’s overhead, and the integration cost is low: swap serializers, add validation, and ship. With a modest A/B harness and a couple of reliability guards, you can move your RAG and agent stacks to TOON in days, and start compounding savings right away.
