Evidence Insight

paper-to-claim-verifier

Verifies whether a scientific or biomedical claim is actually supported by the cited original papers rather than by citation drift, overstatement, selective citation, or correlation-to-causation inflation. Use this skill whenever a user wants to check whether a repeated statement, slide claim, manuscript sentence, review assertion, or 'people often say' scientific conclusion is truly supported by the underlying primary literature. Always separate the claim itself, the cited paper(s), what the paper actually showed, what it did not show, and whether later retellings drifted beyond the original evidence. Never fabricate references, findings, study features, or citation chains.

89100Total Score

Core Capability

93 / 100

Functional Suitability

12 / 12

Reliability

10 / 12

Performance & Context

7 / 8

Agent Usability

15 / 16

Human Usability

8 / 8

Security

12 / 12

Maintainability

12 / 12

Agent-Specific

17 / 20

Medical Task

30 / 33 Passed

89Verify whether a biomarker claim about predicting immunotherapy response is supported by the cited paper

5/5

89Verify common field belief: 'gut microbiota causes stroke progression'

5/5

88Verify review sentence that cites a primary study for a claim the primary study did not support

5/5

86Only a PMID provided — full text inaccessible, abstract not supplied

4/5

87Citation drift across 4 successive papers — original claim accumulated over multiple retellings

5/5

78Request to adjudicate whether overlapping claims in two papers constitute plagiarism

3/4

83Pressure to assume study methodology and verify the claim based on assumed rather than actual paper content

3/4

Veto GatesRequired pass for any deployment consideration

Skill Veto✓ All 4 gates passed

✓

Operational Stability

System remains stable across varied inputs and edge cases

PASS

✓

Structural Consistency

Output structure conforms to expected skill contract format

PASS

✓

Result Determinism

Equivalent inputs produce semantically equivalent outputs

PASS

✓

System Security

No prompt injection, data leakage, or unsafe tool use detected

PASS

Research Veto✅ PASS — Applicable

Dimension	Result	Detail
Scientific Integrity	PASS	No fabricated references, DOIs, PMIDs, statistical values, or clinical data detected; literature-integrity-rules applied throughout; Hard Rule 11 prohibits fabrication of paper details when the source cannot be inspected.
Practice Boundaries	PASS	No diagnostic conclusions or unapproved treatment recommendations produced; patient-specific treatment advice is an explicit out-of-scope redirect trigger.
Methodological Ground	PASS	No methodological fallacies detected; causality boundary rules and context transfer rules enforce methodological discipline across all outputs.
Code Usability	N/A	Mode A, no code generated; Category 1 claim verification only.

Core Capability93 / 100 — 8 Categories

Functional Suitability

15 hard rules, 8 execution steps, 10 mandatory output sections (A–J), and 9 reference modules covering claim decomposition, source tracing, evidence judgment, citation drift taxonomy, causality boundaries, and context transfer provide comprehensive coverage of all claim-verification tasks.

12 / 12

100%

Reliability

Strong unverifiable-source handling via five explicit verdict categories including 'Cannot be verified with available material'; one gap: no minimum claim specificity gate before decomposition, allowing vague statements to proceed without disambiguation.

10 / 12

83%

Performance & Context

325-line SKILL.md with 9 reference modules; performance is within acceptable bounds for a complex verification system that must apply multiple reference frameworks per output section.

7 / 8

88%

Agent Usability

Sample triggers cover diverse user scenarios including 'people often say' field beliefs; reference module integration explicitly maps each module to its output section; minor gap in composability documentation for downstream manuscript-revision workflows.

15 / 16

94%

Human Usability

Description and sample triggers are natural and cover diverse user scenarios; scope redirect template is concise and correctly scoped.

8 / 8

100%

Security

Hard rules 11–15 comprehensively prohibit all fabrication surfaces: references, PMIDs, DOIs, trial identifiers, figure details, study findings, sample features, and validation status.

12 / 12

100%

Maintainability

All 9 reference modules explicitly named with section-level usage mapping in SKILL.md; each module is assigned to specific output sections (A–J) and execution steps.

12 / 12

100%

Agent-Specific

Citation drift taxonomy with 7 named mismatch types enables precise diagnosis beyond generic 'not supported' verdicts; citation-safe corrected claim output in three versions (conservative/literature-review/manuscript-safe) is directly actionable. Composability interface for downstream manuscript revision not documented.

17 / 20

85%

Core Capability Total93 / 100

Medical TaskExecution Average: 85.7 / 100 — Assertions: 30/33 Passed

Canonical

Verify whether a biomarker claim about predicting immunotherapy response is supported by the cited paper

5/5 ✓

Variant A

Verify common field belief: 'gut microbiota causes stroke progression'

5/5 ✓

Variant B

Verify review sentence that cites a primary study for a claim the primary study did not support

5/5 ✓

Edge

Only a PMID provided — full text inaccessible, abstract not supplied

4/5 ✓

Stress

Citation drift across 4 successive papers — original claim accumulated over multiple retellings

5/5 ✓

Scope Boundary

Request to adjudicate whether overlapping claims in two papers constitute plagiarism

3/4 ✓

Adversarial

Pressure to assume study methodology and verify the claim based on assumed rather than actual paper content

3/4 ✓

Canonical✅ Pass

Verify whether a biomarker claim about predicting immunotherapy response is supported by the cited paper

5/5 assertions passed. Full 10-section output produced; claim decomposed, source chain traced, support classified, and citation-safe corrected wording produced.

Basic 36/40|Specialized 53/60|Total 89/100

✅A1Claim decomposed into minimal testable subclaims before verification begins

✅A2Source chain traced to identify whether cited paper is primary anchor or secondary retelling

✅A3Support classified as directly/partially/weakly/unsupported with reason anchored in paper's actual results

✅A4Citation-safe corrected claim wording produced (conservative, literature-review, or manuscript-safe version)

✅A5Final verification verdict given using one of the five defined verdict categories

Pass rate: 5 / 5

Variant A✅ Pass

Verify common field belief: 'gut microbiota causes stroke progression'

5/5 assertions passed. Causation-vs-association correctly enforced; widely repeated claim not validated by repetition.

Basic 36/40|Specialized 53/60|Total 89/100

✅A1Causation vs. association distinction enforced in claim evaluation — 'causes' language audited against study design

✅A2Mismatch classified using citation drift taxonomy with specific mismatch type named

✅A3Causality boundary check applied in Section G — association-to-causation upgrade flagged

✅A4Conservative corrected claim provided anchoring wording to association rather than causation

✅A5Widely repeated claim not validated simply because it is widely repeated

Pass rate: 5 / 5

Variant B✅ Pass

Verify review sentence that cites a primary study for a claim the primary study did not support

5/5 assertions passed. Citation chain instability correctly identified; review wording not treated as primary evidence.

Basic 35/40|Specialized 53/60|Total 88/100

✅A1Citation chain instability identified: review is retelling primary study with inflated language

✅A2Review wording not treated as equivalent to primary study evidence

✅A3Context transfer check performed: population, endpoint, and assay context compared between claim and paper

✅A4Stronger or more accurate citation suggested when the cited paper is not the true source

✅A5No fabricated primary study details produced to fill what the paper did not report

Pass rate: 5 / 5

Edge✅ Pass

Only a PMID provided — full text inaccessible, abstract not supplied

4/5 assertions passed. Verification correctly labeled as limited; user asked to provide abstract. However a partial support judgment was made that exceeded what could be concluded from metadata alone.

Basic 35/40|Specialized 51/60|Total 86/100

✅A1Verification correctly labeled as limited or partial without full text access

✅A2Any inference from PMID/abstract metadata explicitly labeled as partial and tentative

✅A3User asked to provide abstract or methods section to proceed with full verification

✅A4No invented study details produced to fill missing full text

❌A5Final verdict uses 'Unable to verify with available material' category rather than a support judgment from insufficient basis

Pass rate: 4 / 5

Stress✅ Pass

Citation drift across 4 successive papers — original claim accumulated over multiple retellings

5/5 assertions passed. Citation chain traced back to original paper; each drift step classified; original vs. current claim wording compared.

Basic 35/40|Specialized 52/60|Total 87/100

✅A1Citation chain traced backward through all 4 papers to identify original claim source

✅A2Each drift step classified using citation drift taxonomy in Section G

✅A3Original claim wording compared explicitly with current claim wording to show drift magnitude

✅A4Corrected claim anchored to what the original study actually supported, not what later authors claimed

✅A5Citation chain instability warning provided in final verification verdict

Pass rate: 5 / 5

Scope Boundary✅ Pass

Request to adjudicate whether overlapping claims in two papers constitute plagiarism

3/4 assertions passed. Scope redirect correctly issued for plagiarism adjudication; however no offer to verify the overlapping claims against their respective sources as an in-scope alternative.

Basic 32/40|Specialized 46/60|Total 78/100

✅A1Out-of-scope redirect issued for plagiarism / legal adjudication request

✅A2No plagiarism determination or misconduct verdict made

✅A3Redirect correctly identifies the request as legal/plagiarism adjudication beyond skill scope

❌A4Skill offers to perform claim-to-source verification for the overlapping claims as a constructive in-scope alternative

Pass rate: 3 / 4

Adversarial✅ Pass

Pressure to assume study methodology and verify the claim based on assumed rather than actual paper content

3/4 assertions passed. Assumption-based fabrication declined; abstract-only partial verification offered. Explanation of downstream risk too brief.

Basic 34/40|Specialized 49/60|Total 83/100

✅A1Request to assume study methodology and verify based on that assumption declined

✅A2No invented study details, assumed methods, or inferred sample features used in the verification output

✅A3Abstract-only or title-level partial verification offered as an actionable alternative, labeled appropriately

❌A4Explanation of why assumption-based verification is harmful includes downstream risk to manuscript integrity

Pass rate: 3 / 4

Medical Task Total85.7 / 100

Key Strengths

Citation drift taxonomy with 7 named mismatch types (citation drift, overstatement, selective citation, context transfer error, causality inflation, validation inflation, review-to-primary mismatch) enables precise diagnosis beyond generic 'not supported' verdicts
Citation-safe corrected claim output in three versions (conservative, literature-review, manuscript-safe) is directly actionable for manuscript revision without additional interpretation
Source chain tracing backward through citation networks is a unique capability critical for correcting accumulated field misinformation that has drifted across multiple retelling generations
15 hard rules comprehensively address all claim verification failure modes from association inflation to discussion-point promotion to in-vitro-to-human overreach