Multi-Model Code Auditing: The Council Pattern

Carlos Garavito · 4 min read
Tags: ai, agents, llm, architecture, security

The Problem

When you use a single language model to review code, you inherit its blind spots. Claude Opus can be excellent at detecting architecture issues yet miss edge cases in input validation. GPT-4 can find obvious security vulnerabilities but ignore subtle resource leaks. Gemini can excel at mathematical reasoning yet overlook concurrency problems.

It's not the models' fault; it's their nature. Each LLM is trained on different datasets, has a different architecture, and optimizes for different objectives. What one sees, another ignores.

The obvious answer is to use the most expensive model. But even flagship models have limits, and more importantly, recent research shows that a diverse set of cheap models can outperform a single premium model.

The Pattern

Multi-model auditing (or "Council Pattern") consists of running multiple LLMs in parallel on the same code and synthesizing their findings. It's not simply running several models and concatenating results. It's leveraging diversity to find what any individual model would miss.

Mixture-of-Agents: Academic Foundation

In 2024, Wang et al. published the Mixture-of-Agents (MoA) paper, which demonstrates this principle formally: their MoA architecture scored 65.1% on AlpacaEval 2.0 versus 57.5% for GPT-4o, using an ensemble of models that were each individually weaker than the strongest single model.

The core idea: each model has a different "field of vision" over the problem. When you combine multiple perspectives, you capture angles that no single model would see. It's the "wisdom of crowds" applied to LLMs.

Why Consensus Works

When 3 out of 4 models find the same issue, there are two explanations:

  1. It's a real and obvious problem
  2. It's a false positive that multiple models share (rare)

In my experience auditing real code: when multiple independent models converge on a finding, it's almost always legitimate. False positives tend to be idiosyncratic to the model.

The inverse pattern is also valuable: when a single model finds something nobody else sees, it can be:

  1. A unique finding that other models lack the capability to detect
  2. An incorrect interpretation of the code

Both cases require human investigation. The point is: diversity of opinion gives you signals that a single model cannot provide.
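
To make that triage concrete, here is a minimal sketch of how consensus and unique findings can be separated once the reports are normalized. It assumes a hypothetical layout that is not part of the original setup: each model writes its findings to reports/<model>.json as a JSON array, every finding carries a normalized "id" field (e.g. "missing-auth-check"), and jq is available. The 3-of-4 threshold mirrors the consensus rule above.

# Hypothetical layout: reports/<model>.json is an array of findings,
# each with a normalized "id" field. Count how many of the 4 models
# reported each id, then split into consensus vs. unique buckets.
jq -r '.[].id' reports/*.json | sort | uniq -c | sort -rn |
while read -r count id; do
  if [ "$count" -ge 3 ]; then
    echo "CONSENSUS ($count/4): $id"   # 3+ models agree: investigate first
  elif [ "$count" -eq 1 ]; then
    echo "UNIQUE (1/4): $id"           # only one model saw it: verify by hand
  fi
done

With 4 models and a threshold of 3, this split maps directly onto the consensus and unique buckets reported in the results below.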

Architectures and Blind Spots

Each model family has distinct strengths:

  • Anthropic Claude: Excellent at data flow analysis, architectural reasoning
  • OpenAI GPT: Strong on common security patterns, general knowledge
  • Google Gemini: Superior mathematical reasoning, analysis efficiency
  • Chinese models (Kimi, DeepSeek): Different training corpora, unique perspectives

These differences aren't bugs, they're features. When auditing code, you want all these angles.

My Test

I tested this pattern on a real project: a TypeScript AI Gateway with ~5,000 lines of code distributed across 46 source files. I ran 4 models in parallel:

  • Claude Opus 4.6 (~$0.83)
  • Claude Sonnet 4.5 (~$0.25)
  • Gemini 3 Pro (~$0.08)
  • Kimi K2.5 (~$0.04)

Total cost: ~$1.20. Wall time: ~2 minutes.

Results

Model          Findings   Unique   Cost
Opus 4.6             21        5   $0.83
Sonnet 4.5           27        5   $0.25
Gemini 3 Pro         11        2   $0.08
Kimi K2.5            25        3   $0.04
Total                84       15   $1.20

Consensus: 9 findings found by 3 or more models.

Universal criticals: 2 issues found by all 4 models (a missing authentication check and an injection vulnerability in the caching layer).

What the Cheap Models Found

The most surprising result: the cheapest model ($0.04 Kimi) found a critical SSRF vulnerability that the 3 more expensive models missed. Gemini ($0.08) detected a cost leak that no Anthropic model caught.

This validates the MoA hypothesis: diversity matters more than individual price.

Unique Behaviors

Opus 4.6 was the only model that proactively explored files beyond those in the initial prompt. The others limited themselves to the explicit scope.

Gemini had the best signal-to-noise ratio: 11 findings, all relevant. Sonnet was more thorough but with more overlap.

Disagreements

Models didn't always agree on severity. A memory growth issue was classified as "Medium" by Opus and Gemini, but "Critical" by Sonnet and Kimi. The difference: it depends on expected traffic volume. Both assessments are valid under different assumptions.

False Positives

Zero. All findings were legitimate, though with different urgency levels.

Practical Insights

1. Diversity > Quantity

4 diverse models outperform 10 copies of the same model. Don't run GPT-4 ten times. Run GPT-4, Claude, Gemini, and a Chinese model once each.

2. Diminishing Returns

After 4-5 diverse models, additional unique findings drop sharply. In my test, 15 unique findings out of 84 total means each model contributed between 2 and 5 unique insights. A fifth model would probably add 1-2.

3. Cheap Models Matter

Don't assume premium models find everything. Kimi ($0.04) found a critical SSRF. Gemini ($0.08) found a cost leak. Both would have been missed with Claude alone.

4. Synthesis is Hard

Combining 84 findings from 4 models into a coherent report requires human effort or an additional synthesizer model. In my implementation, I used a fifth model (another Claude Opus) to deduplicate and prioritize.

5. Consensus != Correctness

That 4 models agree doesn't mean they're right. But in security code, it's a strong signal that warrants immediate investigation.

Implementation

Here's a simplified example using OpenClaw and OpenCode:

# Same prompt, 4 models, parallel execution
PROMPT=$(cat audit-prompt.md source-code.txt)
 
opencode run --model anthropic/claude-opus-4-6 --agent coder "$PROMPT" &
opencode run --model anthropic/claude-sonnet-4-5 --agent coder "$PROMPT" &
opencode run --model google/gemini-3-pro-preview --agent coder "$PROMPT" &
opencode run --model opencode/kimi-k2.5-free --agent coder "$PROMPT" &
wait

Run in parallel, wait for all results, synthesize. Total: ~2 minutes for a medium-sized repository.
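
The snippet above stops at the parallel fan-out. A possible extension, sketched below under the assumption that each run's stdout is that model's report, captures the four reports to files and hands them to a fifth synthesizer run. The reports/ directory and synthesis-prompt.md are placeholder names, not part of the original setup.

# Hypothetical extension: persist each report, then have a fifth model
# deduplicate and prioritize the council's findings.
mkdir -p reports
opencode run --model anthropic/claude-opus-4-6 --agent coder "$PROMPT" > reports/opus.md &
opencode run --model anthropic/claude-sonnet-4-5 --agent coder "$PROMPT" > reports/sonnet.md &
opencode run --model google/gemini-3-pro-preview --agent coder "$PROMPT" > reports/gemini.md &
opencode run --model opencode/kimi-k2.5-free --agent coder "$PROMPT" > reports/kimi.md &
wait

# Fifth pass: a synthesizer model merges, deduplicates, and ranks the findings.
SYNTHESIS=$(cat synthesis-prompt.md reports/*.md)
opencode run --model anthropic/claude-opus-4-6 --agent coder "$SYNTHESIS"

This is the same fifth-model synthesis step described under "Synthesis is Hard" above, expressed in the same CLI form as the fan-out.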

Conclusion

Multi-model auditing isn't theory. With $1.20 and 2 minutes, I found 15 unique problems in production code, including 2 critical vulnerabilities that all models confirmed.

The pattern is simple: run diverse models in parallel, look for consensus on critical findings, investigate unique findings. You don't need the most expensive model. You need the most diverse set.

The MoA research confirms it. My real test validates it. Models have blind spots. Councils don't.


Built with OpenClaw (orchestration) and OpenCode (model execution). The council audit pattern is model-agnostic -- adapt it to your provider setup.