Code Mode

When an MCP tool returns a large response — say, 50 FDA adverse event reports weighing 5 MB — that entire payload enters the LLM's context window. Code mode prevents this: the LLM writes a small processing script, the server runs it in a sandbox, and only the script's compact output (typically under 1 KB) reaches the LLM.

Result: 98–100% context reduction across all tested scenarios.

Data Flow

The key idea: the 5,130 KB of raw data never enters the LLM context. The sandbox processes it, and only the 194 bytes of extracted output are returned.

The sandbox is a QuickJS engine compiled to WebAssembly. It has no file system, no network, and no access to Node.js APIs. Scripts can only read DATA (the tool response) and write to console.log(). Each run has a 10-second timeout and a 64 MB memory limit.
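To make the script side of this flow concrete, here is a minimal sketch of the kind of processing script an LLM might write for the "top 10 reactions" case. It assumes the tool response follows the public openFDA drug event shape (`results[].patient.reaction[].reactionmeddrapt`) and that the sandbox exposes it as an already-parsed object; neither detail is confirmed here. `SAMPLE` and `topReactions` are hypothetical names used only so the sketch runs outside the sandbox.

```javascript
// Hypothetical stand-in for the tool response (assumed openFDA-like shape).
// Inside the sandbox, the server would supply DATA instead.
const SAMPLE = {
  results: [
    { patient: { reaction: [{ reactionmeddrapt: "NAUSEA" }, { reactionmeddrapt: "VOMITING" }] } },
    { patient: { reaction: [{ reactionmeddrapt: "NAUSEA" }] } },
    { patient: { reaction: [{ reactionmeddrapt: "HEADACHE" }, { reactionmeddrapt: "NAUSEA" }] } },
  ],
};

// Count reaction names across all reports and return the n most frequent.
// Ties are broken alphabetically so the output is deterministic.
function topReactions(data, n) {
  const counts = new Map();
  for (const report of data.results ?? []) {
    for (const reaction of report.patient?.reaction ?? []) {
      const name = reaction.reactionmeddrapt;
      if (name) counts.set(name, (counts.get(name) ?? 0) + 1);
    }
  }
  return [...counts.entries()]
    .sort((a, b) => b[1] - a[1] || a[0].localeCompare(b[0]))
    .slice(0, n)
    .map(([name, count]) => ({ name, count }));
}

// In the sandbox this call would be topReactions(DATA, 10).
const top = topReactions(SAMPLE, 10);

// Only this compact line would reach the LLM.
console.log(JSON.stringify(top));
```

Whatever the script prints is all the LLM sees, so the script should emit compact JSON rather than echoing whole records.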

Benchmarks

All numbers below are from real API calls to live government endpoints, measured March 9, 2026.

Results by Scenario

Category	What was tested	Tool Response	Code Mode Output	Reduction
Health	FDA drug events → top 10 reactions	5.0 MB	194 B	100.0%
Health	FDA drug events → deaths only	3.5 MB	408 B	100.0%
Health	FDA drug labels → boxed warnings	1.0 MB	1.6 KB	99.9%
Health	FDA 510(k) → clearance summary	203.5 KB	285 B	99.9%
Health	FDA NDC → DEA schedule count	126.8 KB	66 B	99.9%
Health	FDA shortages → status summary	60.6 KB	267 B	99.6%
Health	CDC mortality → national trend	13.0 KB	31 B	99.8%
Financial	CFPB complaints → company ranking	185.2 KB	301 B	99.8%
Economic	Treasury debt → latest values	35.1 KB	129 B	99.6%
Economic	BLS CPI → latest by category	11.4 KB	181 B	98.5%
Justice	DOJ press releases → title list	81.5 KB	941 B	98.9%
Total	11 scenarios	10.2 MB	4.3 KB	99.96%

Token Impact

Token counts are estimated at roughly 4 characters per token:

2,684,738 → 1,099 tokens, a 2,442x reduction across 11 scenarios.
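The estimate can be reproduced with a few lines of arithmetic. This sketch assumes the same 4-characters-per-token heuristic and treats bytes as characters, which is only approximately true for mostly-ASCII JSON. The helper names are hypothetical.

```javascript
// Assumption: ~4 characters per token, with bytes standing in for characters.
const CHARS_PER_TOKEN = 4;

// Estimate the token count of a payload from its size in bytes.
function estimateTokens(byteLength) {
  return Math.ceil(byteLength / CHARS_PER_TOKEN);
}

// Reduction factor between two token counts, rounded down to a whole multiple.
function reductionFactor(beforeTokens, afterTokens) {
  return Math.floor(beforeTokens / afterTokens);
}

const codeModeTokens = estimateTokens(194);      // the 194-byte top-10 reactions output
const overall = reductionFactor(2684738, 1099);  // totals reported for the 11 scenarios

console.log(codeModeTokens, overall); // prints: 49 2442
```

A real tokenizer will produce different counts, so these figures are best read as order-of-magnitude estimates.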

Context Window Usage

A single FDA drug events query (50 results) uses 1.3 million tokens — more than 6x the context window of Claude 3.5 Sonnet (200K). With code mode, the same query uses 49 tokens.

Model	Context Window	FDA Events (Normal)	FDA Events (Code Mode)
Claude 3.5 Sonnet	200K	6.5x overflow ❌	0.025% ✅
GPT-4o	128K	10.3x overflow ❌	0.038% ✅
Gemini 1.5 Pro	1M	1.3x overflow ❌	0.005% ✅
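The comparison in the table is a ratio of payload tokens to window size. The sketch below assumes the 5,130 KB payload at about 4 characters per token and uses the window sizes listed in the table. The names are hypothetical, and rounding may differ slightly from the published figures.

```javascript
// Context window sizes in tokens, as listed in the table above.
const CONTEXT_WINDOWS = {
  "Claude 3.5 Sonnet": 200000,
  "GPT-4o": 128000,
  "Gemini 1.5 Pro": 1000000,
};

// Assumption: 5,130 KB of raw data at ~4 characters per token.
const NORMAL_TOKENS = Math.ceil((5130 * 1024) / 4);
// Code mode output size in tokens, as stated in the text.
const CODE_MODE_TOKENS = 49;

// Fraction of a model's window that a payload occupies (values above 1 overflow).
function windowShare(tokens, windowTokens) {
  return tokens / windowTokens;
}

const usage = Object.entries(CONTEXT_WINDOWS).map(([model, size]) => ({
  model,
  normal: windowShare(NORMAL_TOKENS, size),
  codeMode: windowShare(CODE_MODE_TOKENS, size),
  overflows: NORMAL_TOKENS > size,
}));

console.log(JSON.stringify(usage, null, 2));
```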

Cost at Scale

At $3 per million input tokens (Claude 3.5 Sonnet):

Usage	Normal	Code Mode	Saved
1 FDA query	$3.94	$0.0001	$3.94
10-tool research session	~$7.50	~$0.003	~$7.50
100 sessions/day	~$750/day	~$0.30/day	~$750/day
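The single-query row can be checked with the same heuristic. This sketch assumes the $3 per million input tokens price quoted above and about 1,313,280 tokens for the raw response (5,130 KB at 4 characters per token). It counts input tokens only and uses hypothetical names.

```javascript
// Assumption: flat input price in USD per million tokens, as quoted above.
const USD_PER_MILLION_INPUT_TOKENS = 3;

// Input-side cost of sending a given number of tokens to the model.
function inputCostUsd(tokens) {
  return (tokens / 1e6) * USD_PER_MILLION_INPUT_TOKENS;
}

const normalCost = inputCostUsd(1313280); // raw FDA events response
const codeModeCost = inputCostUsd(49);    // code mode output

console.log(normalCost.toFixed(2), codeModeCost.toFixed(4)); // prints: 3.94 0.0001
```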

When to Use It

Use code mode for	Call tool directly for
Counting and aggregating	Cross-referencing multiple sources
Filtering (e.g., deaths only)	Interpreting or explaining data
Extracting specific fields	Exploring unknown data ("show me everything")
Top-N lists	Small responses (FRED, BLS already compact)

Rule of Thumb

Need to think about the data? → Call the tool directly.
Need specific values from the data? → Use code mode.

What Can Go Wrong

Code mode is powerful, but using it in the wrong situation can hurt your results.

Worst Case: Using code mode when you shouldn't

Scenario	What happens	Impact
Cross-referencing — "Compare FDA adverse events to lobbying spend"	Code mode extracts a count from FDA data, but the LLM never sees the individual reports — it can't connect specific drugs to specific lobbying patterns	Misses the correlations that make the analysis valuable
Discovery — "What's interesting in this drug's safety data?"	The LLM has to write extraction code before knowing what's in the data — it guesses wrong and extracts the wrong fields	Returns unhelpful or misleading output
Narrative context — "Explain why this drug was recalled"	The script extracts `reason_for_recall` as a string, but the LLM never sees the surrounding context (classification, distribution, timeline)	Shallow answer missing important details
Small data — Using code mode on a 2 KB FRED response	The sandbox overhead actually makes the response slower with no size benefit	Wasted processing time, same token usage

Best Case: Using code mode when you should

Scenario	What happens	Impact
Counting — "How many serious vs non-serious Ozempic events?"	Script counts `serious === "1"` across 5 MB of reports, returns two numbers	5 MB → 20 bytes. LLM has room for 5 more data sources
Filtering — "Show me only the death reports for metformin"	Script filters 100 reports down to 10 deaths, returns just those	3.5 MB → 400 bytes. LLM sees only the relevant cases
Pre-aggregation — "What are the top 10 adverse reactions?"	Script counts reaction names and sorts by frequency	5 MB → 200 bytes. LLM gets a clean ranked list
Bulk extraction — "List all device names from these 510(k) clearances"	Script pulls one field from 50 records	200 KB → 300 bytes. Clean list, no noise
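The counting and filtering rows above could be sketched as follows. The field names (`serious`, `seriousnessdeath`, `safetyreportid`, `receivedate`) follow the public openFDA drug event schema. Whether the tool response exposes them unchanged is an assumption, and `SAMPLE_EVENTS` is hypothetical data standing in for DATA.

```javascript
// Hypothetical stand-in for DATA (assumed openFDA-like drug event reports).
const SAMPLE_EVENTS = {
  results: [
    { safetyreportid: "A1", receivedate: "20250103", serious: "1", seriousnessdeath: "1" },
    { safetyreportid: "A2", receivedate: "20250110", serious: "1" },
    { safetyreportid: "A3", receivedate: "20250112", serious: "2" },
    { safetyreportid: "A4", receivedate: "20250120", serious: "1", seriousnessdeath: "1" },
  ],
};

// Counting: tally serious vs non-serious reports ("1" marks a serious report).
function countBySeriousness(data) {
  const counts = { serious: 0, nonSerious: 0 };
  for (const report of data.results ?? []) {
    if (report.serious === "1") counts.serious += 1;
    else counts.nonSerious += 1;
  }
  return counts;
}

// Filtering: keep only death reports, reduced to a few identifying fields.
function deathReports(data) {
  return (data.results ?? [])
    .filter((report) => report.seriousnessdeath === "1")
    .map((report) => ({ id: report.safetyreportid, received: report.receivedate }));
}

// In the sandbox these calls would take DATA instead of SAMPLE_EVENTS.
const seriousness = countBySeriousness(SAMPLE_EVENTS);
const deaths = deathReports(SAMPLE_EVENTS);

console.log(JSON.stringify({ seriousness, deaths }));
```

Returning a handful of fields per matching report, rather than the full record, is what keeps the filtered output in the hundreds of bytes.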

The key difference

Best case: The LLM already knows what specific values it needs, and the raw data is just a container to extract them from.
Worst case: The LLM needs to read the data to figure out what matters — extracting blindly throws away the signal.

Code mode doesn't limit analysis — it enhances it. By shrinking extraction calls from megabytes to bytes, it frees context space to fit more data sources in a single session. The LLM decides per-call whether to use code mode or call tools directly.
