No code-execution tools in the public surface

no_public_code_execution_tools · critical · weight 14 · Post-handshake · hard-fail

Authored by Stanley Hong · AgentReserve (founder).

No advertised tool offers shell access, arbitrary code evaluation, subprocess spawning, or interpreter execution. Public code execution turns the MCP server into a remote shell for any agent that connects to it. The check is a keyword scan over each tool's name, description, and schema, so it can produce false positives: a sandboxed-by-design `eval_expression` math tool will trip it. Because the rule is hard-fail, any match forces a `block` verdict, and the operator must explicitly authorise the surface, typically by moving it behind authentication, scoping it to a sandboxed runtime, or renaming the tool if the keyword match is incidental.

When this rule runs

Requires a successful MCP `initialize` / `tools/list`. Skipped on perimeter-only scans where the server refused or failed the MCP handshake.
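As a rough illustration of the gating described above, the sketch below builds the two JSON-RPC requests involved and decides whether the rule would run. The payload values (`protocolVersion`, `clientInfo`), the helper names `succeeded` and `rule_status`, and the `"run"` / `"skipped"` labels are assumptions made for this example, not the scanner's actual implementation.

```python
# Assumed JSON-RPC 2.0 payloads for the MCP handshake; values are illustrative.
INITIALIZE_REQUEST = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "initialize",
    "params": {
        "protocolVersion": "2024-11-05",
        "capabilities": {},
        "clientInfo": {"name": "example-scanner", "version": "0.0.0"},
    },
}
TOOLS_LIST_REQUEST = {"jsonrpc": "2.0", "id": 2, "method": "tools/list"}


def succeeded(response):
    """A JSON-RPC response counts as a success if it has a result and no error."""
    return (
        isinstance(response, dict)
        and "result" in response
        and "error" not in response
    )


def rule_status(initialize_response, tools_list_response):
    """Hypothetical gate: run only when both handshake steps succeeded."""
    if succeeded(initialize_response) and succeeded(tools_list_response):
        return "run"
    return "skipped"


# Example responses: one healthy server, one that refused the handshake.
ok_init = {"jsonrpc": "2.0", "id": 1, "result": {"capabilities": {}}}
ok_tools = {"jsonrpc": "2.0", "id": 2, "result": {"tools": []}}
refused = {"jsonrpc": "2.0", "id": 1, "error": {"code": -32600, "message": "refused"}}

print(rule_status(ok_init, ok_tools))
print(rule_status(refused, None))
```

A server that returns an empty tool list still counts as a successful handshake in this sketch; the rule then runs and finds nothing to flag.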

Why it matters

A public code-execution tool is, in effect, a remote shell for any agent that connects. The blast radius is limited only by what the server process itself can reach: anything the host can do, the caller can do.

Pass condition

No tool advertises shell access, code evaluation, subprocess spawning or interpreter execution.

Fail condition

At least one tool's name, description, or schema contains shell, eval, exec, run-code, or similar execution vocabulary.

Evidence examples

When the rule fails, the report records evidence in roughly this shape:

  • {"matches": [{"toolName": "run_shell", "keyword": "shell", "source": "name"}]}
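A minimal sketch of a keyword scan that emits evidence in this shape is shown below. It assumes each tool is a dict with `name`, `description`, and `inputSchema` keys, as in an MCP `tools/list` result. The keyword list, the token-splitting rule, and the function names are hypothetical; the real scanner's vocabulary and matching logic are not documented on this page.

```python
import json
import re

# Hypothetical keyword list; the real scanner's vocabulary is not published here.
EXECUTION_KEYWORDS = ("shell", "eval", "exec", "subprocess", "interpreter")


def _tokens(text):
    """Lowercase the text and split on anything that is not a letter or digit."""
    return set(re.split(r"[^a-z0-9]+", text.lower())) - {""}


def scan_tools(tools):
    """Collect one evidence entry per keyword hit in name, description, or schema."""
    matches = []
    for tool in tools:
        sources = {
            "name": tool.get("name", ""),
            "description": tool.get("description", ""),
            "schema": json.dumps(tool.get("inputSchema", {})),
        }
        for source, text in sources.items():
            found = _tokens(text)
            for keyword in EXECUTION_KEYWORDS:
                if keyword in found:
                    matches.append(
                        {"toolName": tool.get("name", ""), "keyword": keyword, "source": source}
                    )
    return {"matches": matches}


tools = [
    {"name": "run_shell", "description": "Run a command on the host.",
     "inputSchema": {"type": "object"}},
    {"name": "eval_expression", "description": "Evaluate arithmetic in a sandbox.",
     "inputSchema": {"type": "object"}},
    {"name": "get_weather", "description": "Look up a forecast.",
     "inputSchema": {"type": "object"}},
]
report = scan_tools(tools)
print(json.dumps(report, indent=2))
```

In this sketch `run_shell` matches on `shell` and the sandboxed `eval_expression` matches on `eval`, which illustrates why an incidental keyword can trip the rule even when the tool is harmless.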

Remediation

Remove code-execution tools from the public surface entirely. If a sandboxed execution capability is intentional, expose it only behind authentication, with explicit allow-lists, isolation, and rate limits.
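One way to keep an intentional execution tool off the public surface is to filter what `tools/list` advertises per caller. The sketch below is an assumption-laden illustration: `EXECUTION_TOOLS`, `advertised_tools`, and the idea of a per-caller allow-list of tool names are all hypothetical, and isolation and rate limiting are out of scope here.

```python
# Hypothetical set of tools the operator treats as code execution.
EXECUTION_TOOLS = {"run_shell", "eval_expression"}


def advertised_tools(all_tools, authenticated, allow_list=frozenset()):
    """Hide execution tools unless the caller is authenticated and allow-listed."""
    visible = []
    for tool in all_tools:
        name = tool["name"]
        if name in EXECUTION_TOOLS and not (authenticated and name in allow_list):
            continue  # Never advertise execution tools to other callers.
        visible.append(tool)
    return visible


all_tools = [{"name": "run_shell"}, {"name": "get_weather"}]

# An anonymous caller sees only the non-execution tool.
public_view = advertised_tools(all_tools, authenticated=False)
# An authenticated caller with an explicit allow-list entry sees both.
trusted_view = advertised_tools(all_tools, authenticated=True, allow_list={"run_shell"})

print([t["name"] for t in public_view])
print([t["name"] for t in trusted_view])
```

Filtering the advertised list only changes what a scan sees; the server would still need to enforce the same check when a tool is actually called.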

Methodology

This rule belongs to the Tool surface risk dimension, which asks what an agent could do if it trusted every advertised tool. The dimension covers destructive actions, credential disclosure, code execution, filesystem mutation, PII handling, prompt-injection-shaped input fields, and injection-bearing tool descriptions: the agent-specific threat surface, not just generic verb risk.

Read the full methodology for how rules are aggregated into a score, how verdicts are decided, and how hard-fail rules override the aggregate.