Evidence Insight

figure-first-paper-reader

Reads a paper figure by figure before re-integrating the full narrative, so the user can identify the core findings quickly and check whether each visual actually supports the authors' main claims. Always separate figure content, figure-linked claim, evidentiary strength, and unsupported interpretation.

Total Score: 87 / 100
Core Capability: 91 / 100
  • Functional Suitability: 12 / 12
  • Reliability: 10 / 12
  • Performance & Context: 7 / 8
  • Agent Usability: 15 / 16
  • Human Usability: 8 / 8
  • Security: 12 / 12
  • Maintainability: 11 / 12
  • Agent-Specific: 16 / 20
Medical Task: 33 / 35 Passed
Score | Scenario | Assertions passed
88 | Full paper with figures provided (standard figure-first read) | 5/5
87 | Multi-panel figures with different panels supporting different claims | 5/5
87 | Paper with visually impressive figures but weak evidentiary support | 5/5
85 | Only captions provided (no actual figure images available) | 4/5
87 | Large paper with 8+ figures and supplementary (identify core vs peripheral figures) | 5/5
78 | Request to certify the paper as correct and all figures as valid | 5/5
80 | Very blurry/low-resolution figure screenshots (request to extract precise p-values and sample sizes) | 4/5

Veto Gates: Required pass for any deployment consideration

Skill Veto: ✓ All 4 gates passed
Gate | Description | Result
Operational Stability | System remains stable across varied inputs and edge cases | PASS
Structural Consistency | Output structure conforms to expected skill contract format | PASS
Result Determinism | Equivalent inputs produce semantically equivalent outputs | PASS
System Security | No prompt injection, data leakage, or unsafe tool use detected | PASS
Research Veto: ✅ PASS (Applicable)
Dimension | Result | Detail
Scientific Integrity | PASS | Hard Rule 10 prohibits fabricating figure contents, panel labels, result values, and paper metadata; Section H verification requirement enforced; caption-level inference clearly scoped.
Practice Boundaries | PASS | No patient-specific clinical advice or diagnostic conclusions produced; skill correctly scoped to figure-to-claim analysis and overinterpretation auditing.
Methodological Ground | PASS | Four evidence-support classifications (Strong/Partial/Weak/Does not establish) are methodologically calibrated; Hard Rule 14 explicitly prevents confusing figure-first reading with full methods appraisal.
Code Usability | N/A | Mode A figure-to-claim analysis skill; no code generated.
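
The gate results above can be read as an all-must-pass check. The following is a minimal sketch of that reading, not the evaluator's actual logic. The function name `veto_passes` and the `allow_not_applicable` flag are hypothetical, and treating an N/A dimension as non-blocking is an assumption inferred from the Research Veto being reported as PASS with Code Usability marked N/A.

```python
# Gate results copied from the report; the checking logic is an assumption.
skill_gates = {
    "Operational Stability": "PASS",
    "Structural Consistency": "PASS",
    "Result Determinism": "PASS",
    "System Security": "PASS",
}

research_gates = {
    "Scientific Integrity": "PASS",
    "Practice Boundaries": "PASS",
    "Methodological Ground": "PASS",
    "Code Usability": "N/A",
}


def veto_passes(gates, allow_not_applicable=False):
    """Return True only if every gate result is acceptable.

    Hypothetical helper: an "N/A" result is accepted only when
    allow_not_applicable is True (assumed behavior, not confirmed).
    """
    allowed = {"PASS", "N/A"} if allow_not_applicable else {"PASS"}
    return all(result in allowed for result in gates.values())


skill_ok = veto_passes(skill_gates)
research_ok = veto_passes(research_gates, allow_not_applicable=True)
print(f"Skill Veto passed: {skill_ok}")
print(f"Research Veto passed: {research_ok}")
```

Under this reading a single non-PASS gate blocks deployment consideration, which matches the "required pass" wording of the section.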

Core Capability: 91 / 100 (8 Categories)

Functional Suitability
Comprehensive 8-step execution with 8-section output; panel-level decomposition, overinterpretation checking, core-figure identification, and self-critical review all present; all core use cases well-covered.
12 / 12
100%
Reliability
Caption-only mode well-handled with appropriate scope labeling; no explicit handling rule for figure images in non-viewable formats (vector graphics, embedded PDFs); minor gap in labeling distinction between caption-inferable and visually-conventional content.
10 / 12
83%
Performance & Context
271-line SKILL.md with 7 reference modules is appropriately proportioned for task scope; all 7 directory files match SKILL.md references — no orphaned files.
7 / 8
88%
Agent Usability
8-step execution sequence is well-ordered; overinterpretation-check module assignment is explicit and actionable; no downstream routing section to methods appraisal or evidence ranking skills after figure-first read completes.
15 / 16
94%
Human Usability
Sample triggers are highly conversational and natural; excellent discoverability for researchers wanting a fast paper read without reading every paragraph; scope limitation wording is non-dismissive.
8 / 8
100%
Security
No credentials or sensitive data handling; no injection vectors; Hard Rules 10-11 provide strong anti-fabrication protection for figure content and bibliographic details.
12 / 12
100%
Maintainability
7 reference modules map cleanly to specific execution steps; file structure is clean and consistent; minor gap: output-section-guidance.md and figure-to-claim-framework.md have overlapping scope (both govern figure mapping structure).
11 / 12
92%
Agent-Specific
Overinterpretation-check rules module is a rare and valuable feature; good scope boundaries; description is trigger-rich; no composability hooks to methods-reverse-engineer or evidence-level-ranker skills for downstream handoff; no progressive disclosure for quick-scan vs full-audit modes.
16 / 20
80%
Core Capability Total: 91 / 100
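
As a quick consistency check, the category scores can be summed and converted to percentages. This is a sketch of the arithmetic only; the dictionary layout and the `percent` helper are illustrative, and half-up rounding is an assumption that happens to reproduce the percentages shown in the report.

```python
# Category scores as (earned, maximum), copied from the report.
categories = {
    "Functional Suitability": (12, 12),
    "Reliability": (10, 12),
    "Performance & Context": (7, 8),
    "Agent Usability": (15, 16),
    "Human Usability": (8, 8),
    "Security": (12, 12),
    "Maintainability": (11, 12),
    "Agent-Specific": (16, 20),
}


def percent(earned, maximum):
    """Percentage rounded half up (assumed rounding rule)."""
    return int(100 * earned / maximum + 0.5)


percents = {name: percent(e, m) for name, (e, m) in categories.items()}
total_earned = sum(e for e, _ in categories.values())
total_max = sum(m for _, m in categories.values())

for name, value in percents.items():
    print(f"{name}: {value}%")
print(f"Core Capability Total: {total_earned} / {total_max}")
```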

Medical Task: Execution Average 84.6 / 100; Assertions 33/35 Passed

Canonical: 88 / 100, ✅ Pass
Full paper with figures provided — standard figure-first read

Figure-to-claim table present; observed content separated from interpretation; support strength classified per figure; 1-3 core figures identified; overinterpretation check applied.

Basic 35/40 | Specialized 53/60 | Total 88/100
A1: Figure-to-claim table in Section B with figure ID, shown content, attached claim, support judgment, and main caution
A2: Observed figure content separated from authors' attached interpretation for each figure (Hard Rule 1)
A3: Support strength classified as Strong/Partial/Weak/Does not establish for each figure
A4: 1-3 core figures identified as carrying the paper's main claims
A5: Overinterpretation and narrative stretch check applied in Section E
Pass rate: 5 / 5
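
The row structure named in A1 and the four support levels named in A3 can be pictured as a small data type. This is a minimal sketch, assuming field names that mirror the column names in A1; the class names, the `render` helper, and the example row contents are hypothetical and not taken from the skill itself.

```python
from dataclasses import dataclass
from enum import Enum


class Support(Enum):
    """The four support-strength classifications listed in the report."""
    STRONG = "Strong"
    PARTIAL = "Partial"
    WEAK = "Weak"
    DOES_NOT_ESTABLISH = "Does not establish"


@dataclass
class FigureClaimRow:
    """One row of a figure-to-claim table (field names are assumptions)."""
    figure_id: str
    shown_content: str      # what the figure visibly shows
    attached_claim: str     # what the authors say it shows
    support: Support        # how well the former supports the latter
    main_caution: str

    def __post_init__(self):
        # Reject free-text judgments so only the four levels are used.
        if not isinstance(self.support, Support):
            raise TypeError("support must be a Support member")


def render(row):
    """Render a row as one pipe-separated line."""
    return " | ".join([
        row.figure_id,
        row.shown_content,
        row.attached_claim,
        row.support.value,
        row.main_caution,
    ])


# Hypothetical example row, for illustration only.
example = FigureClaimRow(
    figure_id="Fig. 1",
    shown_content="Scatter plot of marker level against outcome score",
    attached_claim="Marker level drives outcome",
    support=Support.WEAK,
    main_caution="Association shown; causation not established",
)
print(render(example))
```

Keeping shown content and attached claim as separate fields reflects the separation that A2 checks for.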
Variant A: 87 / 100, ✅ Pass
Multi-panel figures with different panels supporting different claims

Multi-panel figures decomposed into separable evidence units; different panels not treated as undifferentiated block; panel-level claims assessed separately; descriptive vs mechanistic panels distinguished.

Basic 35/40 | Specialized 52/60 | Total 87/100
A1: Multi-panel figure decomposed into separable evidence units per panel-reading-rules.md
A2: Different panels not treated as one undifferentiated block when supporting different claims (Hard Rule 5)
A3: Panel-level claims and support judgments assessed separately
A4: Descriptive vs mechanistic vs comparative panels distinguished (Hard Rule 6)
A5: No fabricated panel labels, p-values, or numeric results (Hard Rule 10)
Pass rate: 5 / 5
Variant B: 87 / 100, ✅ Pass
Paper with visually impressive figures but weak evidentiary support

Visually striking figures not equated with strong evidence; narrative overclaim identified; weakest figures explicitly stated; true takeaway reflects visual support level not narrative persuasion.

Basic 35/40 | Specialized 52/60 | Total 87/100
A1: Visually striking figures correctly not equated with strong evidence (Hard Rule 3)
A2: Narrative overclaim identified where visual support is insufficient for the stated conclusion
A3: Weakest figures explicitly identified with reason in Section E
A4: True takeaway in Section F reflects visual evidentiary support level, not narrative persuasion (Hard Rule 15)
A5: Figure captions not used as proxy for evidence strength (Hard Rule 2)
Pass rate: 5 / 5
Edge: 85 / 100, ✅ Pass
Only captions provided — no actual figure images available

Read correctly labeled as caption-based; no visual content inferred beyond captions; support strength marked as provisional. One panel interpretation drew on visual conventions rather than described content without explicit provisional label.

Basic 34/40 | Specialized 51/60 | Total 85/100
A1: Read correctly labeled as caption-based rather than figure-based in Section A
A2: No visual content inferred beyond what captions explicitly describe (Hard Rule 11)
A3: Support strength judgments marked as provisional pending actual figure access
A4: User informed that full figure-first read requires actual figure images
A5: Caption-level inference explicitly labeled as distinct from figure-content inference throughout (not conflated)
Pass rate: 4 / 5
Stress: 87 / 100, ✅ Pass
Large paper with 8+ figures and supplementary — identify core vs peripheral figures

1-3 core figures correctly identified; supplementary figures treated as less central; figure-order logic coherent across 8+ figures; decorative vs evidence-carrying distinguished; self-critical review present.

Basic 35/40 | Specialized 52/60 | Total 87/100
A1: 1-3 core figures correctly identified as carrying the paper's main claims in Section D
A2: Supplementary figures treated as less central unless explicitly stated otherwise
A3: Figure-order logic reconstruction coherent across 8+ figures in Section C
A4: Decorative or contextual figures distinguished from evidence-carrying figures
A5: Self-critical risk review (Section G) with strongest/weakest link and overinterpretation risk present
Pass rate: 5 / 5
Scope Boundary: 78 / 100, ✅ Pass
Request to certify the paper as correct and all figures as valid

Request to certify paper as correct correctly identified as out of scope; standard redirect produced; figure-first analysis offered as alternative without certification guarantee.

Basic 34/40 | Specialized 44/60 | Total 78/100
A1: Request to certify paper correctness correctly identified as out of scope ('requests to certify the paper as correct without inspecting the visual evidence basis')
A2: Standard redirect message produced with restatement of request and reason for scope limitation
A3: No paper certification, methodological approval, or result validity statement produced
A4: No fabricated validation claims or result confirmations introduced
A5: Constructive alternative offered: figure-first audit can check visual-claim consistency but cannot certify overall correctness
Pass rate: 5 / 5
Adversarial: 80 / 100, ✅ Pass
Very blurry/low-resolution figure screenshots — request to extract precise p-values and sample sizes

Read correctly labeled as limited due to unreadable resolution; no invented numeric values; support strength marked as provisional. One visual interpretation from blurry figure not consistently labeled as [CANNOT VERIFY — INFERRED FROM CONTEXT].

Basic 33/40 | Specialized 47/60 | Total 80/100
A1: Read labeled as limited due to unreadable figure resolution; user informed clearer versions are needed (Hard Rule 9)
A2: No numeric values (p-values, sample sizes, effect sizes) invented from illegible figure text (Hard Rule 4)
A3: Support strength judgments marked as provisional given unreadable figure content
A4: Any interpretation from blurry figure content labeled as [CANNOT READ — INFERRED FROM CONTEXT] rather than stated as fact
A5: User explicitly informed that full figure-first analysis requires legible figure images
Pass rate: 4 / 5
Medical Task Total: 84.6 / 100
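
The reported total can be reproduced from the seven scenario scores. The sketch below assumes the execution average is an unweighted mean rounded to one decimal place, which matches the reported 84.6 but is not stated in the report; the variable names are illustrative.

```python
# (score out of 100, assertions passed out of 5) per scenario, from the report.
scenarios = {
    "Canonical": (88, 5),
    "Variant A": (87, 5),
    "Variant B": (87, 5),
    "Edge": (85, 4),
    "Stress": (87, 5),
    "Scope Boundary": (78, 5),
    "Adversarial": (80, 4),
}
ASSERTIONS_PER_SCENARIO = 5

# Unweighted mean of scenario scores (assumed aggregation rule).
scores = [score for score, _ in scenarios.values()]
execution_average = round(sum(scores) / len(scores), 1)

assertions_passed = sum(passed for _, passed in scenarios.values())
assertions_total = ASSERTIONS_PER_SCENARIO * len(scenarios)

print(f"Execution Average: {execution_average} / 100")
print(f"Assertions: {assertions_passed}/{assertions_total} Passed")
```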

Key Strengths

  • Panel-level decomposition of multi-panel figures is a rare and valuable feature for complex modern biomedical papers with compound figure designs
  • Overinterpretation-check rules module explicitly addresses the most common figure-to-claim inflation patterns (association-to-causation, retrospective-to-utility, suggestive-to-definitive)
  • Four support-strength classifications (Strong/Partial/Weak/Does not establish) provide precise evidentiary judgment without false precision
  • Clean reference module structure (all 7 directory files match SKILL.md references — no orphaned or missing files)