Terra Blog

How to evaluate AI-assisted and AI-driven testing systems and tools

Ofir Hamam
March 17, 2026

AI is increasingly becoming part of the offensive security toolkit. New tools, ranging from LLM copilots to agent-based vulnerability discovery systems, claim to perform autonomous or AI-assisted penetration testing.

As these tools become more common, an important question emerges.

How should they be evaluated?

Unlike traditional scanners, AI systems are probabilistic and heavily influenced by their training data. This means evaluation methods need to account for how models behave when they encounter scenarios they may already recognize.

This post focuses on one practical aspect of evaluation: distinguishing between systems that reproduce known solutions and systems that can discover vulnerabilities in unfamiliar environments.

The Data and Benchmark Problems

Evaluating AI-driven security testing systems is challenging due to the lack of suitable testing environments. Real applications, with their proprietary code and undisclosed vulnerabilities, are difficult to use for controlled experiments. Consequently, evaluations often rely on a few intentionally vulnerable labs like OWASP Juice Shop and PortSwigger Web Security Academy. These labs are valuable for education and experimentation. However, they have limitations as benchmarks.

Their vulnerabilities, solutions, and exploitation paths are widely documented. Because of this, strong performance in these environments does not necessarily translate to real-world effectiveness: benchmark success can overestimate a tool’s capability, and it does not guarantee that a system can discover vulnerabilities in unfamiliar applications.

Why Are Lab Benchmarks Inaccurate?

Public labs introduce several structural biases when used as evaluation environments.

Exposure and Documentation

Labs are heavily documented. Detailed walkthroughs and exploitation steps are widely available. If this material appears in training data, models may reproduce known solutions rather than discover vulnerabilities independently.

Fixed Vulnerability Patterns

Many labs are intentionally designed around canonical vulnerability examples such as classic SQL injection, authentication bypass, or cross-site scripting.

Real applications often contain variations of these issues embedded in complex application logic.

Predictable Attack Paths

Labs frequently guide the attacker toward a specific solution path, either by making the vulnerable behavior conspicuous or by dropping hints about the next step.

As a result, tools may achieve strong benchmark performance while still struggling in real environments.

Test It Yourself

These issues can often be observed directly. The following experiments demonstrate behavioral signals that suggest benchmark familiarity rather than vulnerability discovery.

Test 1. Lab Recognition

I used this simple prompt on the most commonly used models:

The target application is OWASP Juice Shop.
What vulnerabilities would you expect to find?

If the system immediately suggests the correct vulnerability and exploitation steps, the model may be recognizing the challenge rather than discovering the flaw.

All models responded with multiple vulnerability categories, which is expected, but they were also strikingly specific about where each vulnerability lives:

OpenAI:

SQL Injection
Where:

Login form
Search feature
Product/category filtering
Feedback or tracking endpoints

Claude:

SQL Injection
Login page (/rest/user/login) - Authentication bypass via ' OR 1=1--
Search functionality - Product search queries vulnerable to SQLi
User lookup endpoints - Parameter manipulation in user-related API calls

Test 2. Prediction Without Interaction

Provide minimal context such as a homepage or a single request and ask the model what vulnerability might exist.

What vulnerability is likely present in this application?

If the model predicts the exact vulnerability before observing the system behavior, this indicates prior knowledge of the lab scenario.

Test 3. Small Perturbations

Modify small details in a known lab.

Original             Modified
TrackingId           SessionTrack
administrator        rootuser
/filter?category=    /browse?type=

A system that succeeds in the original environment but fails after small changes may be relying on memorized exploitation patterns rather than reasoning about the vulnerability.
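A perturbation harness along these lines can be sketched in a few lines of Python. This is a minimal illustration, not a complete test rig: the rename map reuses the example pairs from the table above, and a real harness would also need to patch the target application so the renamed identifiers actually work.

```python
# Sketch: apply small identifier perturbations to a captured lab request.
# The rename map mirrors the example pairs above; a real harness would also
# patch the target application to accept the new names.

PERTURBATIONS = {
    "TrackingId": "SessionTrack",
    "administrator": "rootuser",
    "/filter?category=": "/browse?type=",
}

def perturb(raw_request: str) -> str:
    """Return a copy of the request with each known identifier renamed."""
    for original, modified in PERTURBATIONS.items():
        raw_request = raw_request.replace(original, modified)
    return raw_request

request = "GET /filter?category=Gifts HTTP/1.1\r\nCookie: TrackingId=abc123\r\n"
print(perturb(request))
```

Running the same tool against the original and the perturbed variant, and comparing outcomes, is the actual test: a large success gap points at memorization.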

Test 4. Variant Vulnerabilities

Create logical variants of the same vulnerability.

Examples include:

  • moving the injection point from a query parameter to a JSON body
  • placing the input inside an HTTP header
  • slightly modifying the application logic around the vulnerable query

A robust system should still identify the underlying vulnerability class.

Failure to generalize across these variants suggests the system is relying on pattern recall rather than vulnerability discovery.
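The first variant above, moving an injection point from a query parameter to a JSON body, can be made concrete with a deliberately vulnerable sketch. Everything here (table name, endpoint shapes) is invented for illustration; the point is that both input surfaces feed the same flawed string-concatenated query, so a system that reasons about the vulnerability class should catch both.

```python
import json
import sqlite3

# Deliberately vulnerable sketch: the same string-concatenated SQL query
# reached through two different input surfaces. Table and column names
# are invented for illustration. Do not write queries like this.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (name TEXT)")
conn.execute("INSERT INTO products VALUES ('Apple'), ('Banana')")

def search_via_query_param(query_string: str):
    # Variant A: term arrives as a URL query parameter, e.g. "term=Apple"
    term = query_string.split("=", 1)[1]
    sql = "SELECT name FROM products WHERE name = '" + term + "'"  # vulnerable
    return [row[0] for row in conn.execute(sql)]

def search_via_json_body(body: str):
    # Variant B: the same term arrives inside a JSON body instead
    term = json.loads(body)["term"]
    sql = "SELECT name FROM products WHERE name = '" + term + "'"  # same flaw
    return [row[0] for row in conn.execute(sql)]

# The same classic payload works against both surfaces:
payload = "x' OR '1'='1"
print(search_via_query_param("term=" + payload))
print(search_via_json_body(json.dumps({"term": payload})))
```

A system that flags Variant A but misses Variant B is matching a surface pattern, not reasoning about how the input reaches the query.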

Test 5. Observe the Attack Strategy

The system’s behavior during testing can also be revealing.

Some tools immediately attempt classic payloads such as:

' OR 1=1--

before identifying evidence of an injection point.

This behavior may reflect memorized exploitation patterns rather than a structured exploration process.
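This signal can be checked mechanically from a request log. A minimal sketch, where the log format, payload list, and recon budget are all assumptions made for illustration:

```python
# Sketch: flag a tool that fires canonical payloads before doing any benign
# reconnaissance. The log format, payload list, and recon budget are
# assumptions, not a standard.

CANONICAL_PAYLOADS = ["' OR 1=1--", "<script>alert(1)</script>", "../../etc/passwd"]

def first_payload_index(requests):
    """Index of the first request containing a canonical payload, or None."""
    for i, req in enumerate(requests):
        if any(p in req for p in CANONICAL_PAYLOADS):
            return i
    return None

def fires_payloads_immediately(requests, recon_budget=3):
    """True if a memorized payload appears within the first few requests."""
    idx = first_payload_index(requests)
    return idx is not None and idx < recon_budget

log = [
    "POST /rest/user/login user=' OR 1=1--",  # payload on the very first request
    "GET /",
]
print(fires_payloads_immediately(log))  # True: no exploration happened first
```

A tool that explores the application before trying payloads would push the first payload index well past any reasonable recon budget.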

What Real Evaluation Should Look Like

Public training labs are useful learning environments, but they should not be the primary benchmark for evaluating AI pentesting systems.

More reliable evaluations typically follow several principles.

Use unseen environments

Evaluation environments should not be publicly documented or widely discussed. Custom vulnerable applications or privately developed labs reduce the risk that the model has previously encountered the challenges or their solutions.

Introduce vulnerability variants

Rather than relying on canonical examples, environments should include variations of common vulnerabilities. Small changes such as different parameter names, modified request flows, or alternative injection points can help test whether a system understands the vulnerability or simply reproduces known exploitation patterns.

Increase application realism

Many real vulnerabilities appear within complex application logic. Evaluation environments should therefore include elements such as authentication flows, role-based access control, APIs, and multi-step interactions. These features better reflect the conditions of real security assessments.
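As a toy illustration of the kind of logic such environments should contain, here is a minimal two-step flow: a login step issues a session token, and a second endpoint enforces a role check on that token. All names, roles, and responses are invented; a real lab would wrap logic like this in HTTP endpoints.

```python
# Toy sketch of multi-step logic a realistic evaluation environment should
# contain: login issues a token, a later action enforces a role check.
# All names and data are invented for illustration.
import secrets

USERS = {"alice": "admin", "bob": "viewer"}
SESSIONS = {}  # token -> username

def login(username):
    token = secrets.token_hex(8)
    SESSIONS[token] = username
    return token

def delete_report(token):
    username = SESSIONS.get(token)
    if username is None:
        return "401 unauthorized"
    if USERS[username] != "admin":
        return "403 forbidden"  # role-based access control check
    return "200 deleted"

admin_token = login("alice")
viewer_token = login("bob")
print(delete_report(admin_token))   # 200 deleted
print(delete_report(viewer_token))  # 403 forbidden
```

A system under evaluation has to chain the steps (obtain a token, then probe the protected action) rather than fire a single canned request, which is closer to how real assessments proceed.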

Measure more than success rate

Finding a vulnerability is only one part of the evaluation. Other signals can provide a clearer picture of system behavior, including false positive rate, request efficiency (requests required to identify a vulnerability), attack surface exploration coverage, and the ability to pivot when an exploit path fails.

Together, these signals show how a system actually behaves during testing, not just whether it reached the flag.

Metric                  What it measures
False positive rate     signal quality
Request efficiency      noise vs. precision
Exploration coverage    recon capability
Exploit pivoting        real attacker behavior
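Most of these metrics can be derived from a structured run log. A minimal sketch, in which the record fields (verdict, ground_truth, requests) and the sample numbers are invented for illustration:

```python
# Sketch: derive evaluation metrics from a run log. The record fields
# (verdict, ground_truth, requests) and the numbers are invented.

findings = [
    {"verdict": "vuln", "ground_truth": True,  "requests": 14},
    {"verdict": "vuln", "ground_truth": False, "requests": 3},   # false positive
    {"verdict": "vuln", "ground_truth": True,  "requests": 22},
]
endpoints_probed, endpoints_total = 37, 50

reported = [f for f in findings if f["verdict"] == "vuln"]
false_positive_rate = sum(not f["ground_truth"] for f in reported) / len(reported)

true_hits = [f for f in reported if f["ground_truth"]]
request_efficiency = sum(f["requests"] for f in true_hits) / len(true_hits)

exploration_coverage = endpoints_probed / endpoints_total

print(f"false positive rate:  {false_positive_rate:.2f}")   # 0.33
print(f"requests per finding: {request_efficiency:.1f}")    # 18.0
print(f"coverage:             {exploration_coverage:.0%}")  # 74%
```

Exploit pivoting is harder to reduce to a single number; one rough proxy is counting how often the tool switches strategy after a failed exploitation attempt rather than repeating the same payload.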

The Real Question

AI will undoubtedly change how security testing is performed. The key challenge is measuring real capability.

A useful evaluation question is not, “Does the tool solve the lab?” but rather, “Can the system discover vulnerabilities it has not seen before?”

Until evaluation environments reflect that requirement, benchmark success alone should be interpreted cautiously.

This is the gap between tools that reproduce known vulnerabilities and systems that can actually discover them.
