result-figure-consistencycheck
Veto Gates: Required pass for any deployment consideration
Core Capability: 84 / 100 (8 categories)
Medical Task: Execution Average 85.6 / 100, Assertions 20/20 passed
This canonical case stayed focused on extracting and normalizing evidence from the provided records instead of drifting into unsupported interpretation.
"You want to detect missing figure references in the Results section...": this case remained tied to the documented analysis contract even when the preserved evidence centered on instructions instead of a full rerun.
This edge case stayed focused on extracting and normalizing evidence from the provided records instead of drifting into unsupported interpretation.
"Produces:": this case remained tied to the documented analysis contract even when the preserved evidence centered on instructions instead of a full rerun.
"End-to-end case for Compares Results descriptions vs figure legend...": this case remained an analysis-style extraction path whose value came from structured data capture rather than a freeform narrative response.
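For readers unfamiliar with the kind of check these cases exercise, the following is a minimal sketch of comparing figure references in a Results section against the supplied figure legends. It is not the evaluated skill's implementation: the regular expression, the function names `figure_numbers` and `check_consistency`, and the legend input shape (a dict keyed by figure number) are all assumptions made for illustration.

```python
import re

# Assumed pattern: matches "Figure 2", "Fig. 3A", "Figs 1" and captures the number.
# Limitation: in "Figures 1 and 2" only the first number is captured.
FIG_REF = re.compile(r"\bFig(?:ure)?s?\.?\s*(\d+)", re.IGNORECASE)


def figure_numbers(text):
    """Return the set of figure numbers mentioned in text."""
    return {int(m.group(1)) for m in FIG_REF.finditer(text)}


def check_consistency(results_text, legends):
    """Compare figures cited in Results with the legends provided.

    legends: dict mapping figure number -> legend text (assumed input shape).
    """
    cited = figure_numbers(results_text)
    defined = set(legends)
    return {
        # A legend exists but the Results text never references the figure.
        "uncited_figures": sorted(defined - cited),
        # The Results text references a figure with no legend supplied.
        "missing_legends": sorted(cited - defined),
    }
```

Called with a Results passage that cites Figure 1 and Fig. 3A and a legend dict holding figures 1 and 2, the sketch would report figure 2 as uncited and figure 3 as lacking a legend. Returning structured fields rather than prose mirrors the structured data capture described in the cases above.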
Key Strengths
- Primary routing is Other with execution mode A
- Static quality score is 84/100 and dynamic average is 77.6/100
- Assertions and command execution outcomes are recorded per input for human review
- Execution verification summary: No script verification was applicable