Academic Writing

reference-integrity-checker

Checks whether manuscript references are accurately matched to claims and appropriately scoped, and that they are not overextended, misquoted, or cited second-hand.

Total Score: 91 / 100
Core Capability
93 / 100
Functional Suitability
12 / 12
Reliability
11 / 12
Performance & Context
7 / 8
Agent Usability
16 / 16
Human Usability
7 / 8
Security
12 / 12
Maintainability
11 / 12
Agent-Specific
17 / 20
Medical Task
34 / 34 Passed
Score | Description | Assertions
88 | Introduction paragraph with 3 citations including a causal overextension and a population-scope mismatch | 5/5
90 | Discussion paragraph with animal-to-human overextension using mouse-model citations for clinical claims | 5/5
94 | User provides only a bare reference list with no manuscript text or claim-reference pairs | 5/5
87 | Rebuttal draft with review-article and WHO technical report citations defending against reviewer criticism | 5/5
86 | Methods section with 8 heterogeneous claim-reference pairs of varying integrity quality | 5/5
92 | Request to format bibliography in APA style and identify missing literature in the field | 4/4
94 | User asks for citation verification based on titles only, with medically inaccurate claim ('first-line therapy for all cancer types') | 5/5

Veto Gates: Required pass for any deployment consideration

Skill Veto: ✓ All 4 gates passed
Operational Stability
System remains stable across varied inputs and edge cases
PASS
Structural Consistency
Output structure conforms to expected skill contract format
PASS
Result Determinism
Equivalent inputs produce semantically equivalent outputs
PASS
System Security
No prompt injection, data leakage, or unsafe tool use detected
PASS
Research Veto: ✅ PASS — Applicable

Dimension | Result | Detail
Scientific Integrity | PASS | No fabricated DOIs, PMIDs, clinical data, or source conclusions detected across all outputs. Hard rule 7 is explicit and consistently enforced.
Practice Boundaries | PASS | No direct diagnostic or prescriptive medical conclusions. Medical overextension flags include appropriate uncertainty language.
Methodological Ground | PASS | Evidentiary hierarchy correctly applied: primary source > review > guideline. Second-hand citation risk correctly classified as moderate-to-major.
Code Usability | N/A | Mode A skill — no code generated.

Core Capability: 93 / 100 (8 Categories)

Functional Suitability
All eight citation integrity problem types are covered (mismatch, overextension, quote drift, second-hand, unsupported, scope, severity, logic). Seven reference files are actively mapped to specific steps. The nine-step workflow and the eight-section mandatory output are complete and well matched.
12 / 12
100%
Reliability
The clarification-first gate prevents false reassurance on insufficient input, and the Section A input check is mandatory. Minor deduction: there is no partial-results pathway when source texts are unavailable; the skill offers only a full review or clarification questions, with no graceful partial-with-flags mode.
11 / 12
92%
Performance & Context
All seven reference files are lightweight (5–16 lines each). SKILL.md is well-scoped at 270 lines. Minor deduction: Steps 3–5 and Sections C–E have partial overlap in detection logic, creating slight redundancy in token usage.
7 / 8
88%
Agent Usability
Full marks. SKILL.md explains both what to do and why. The sample triggers list six specific use cases. Mandatory output sections A–H use fixed labels, ensuring consistent structure, and severity uses a fixed four-tier scale. Errors are prevented by three independent mechanisms: the clarification-first rule, the hard rules list, and an explicit 'not for' scope section.
16 / 16
100%
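As an illustration of what the fixed four-tier severity scale might look like, the sketch below uses the tier labels that appear in the assertion results (major, moderate, minor, uncertain). The exact labels and the problem-type mapping are assumptions, not the skill's actual definitions.

```python
from enum import Enum

class Severity(Enum):
    MAJOR = "major"          # e.g. causal overextension, missing key citation
    MODERATE = "moderate"    # e.g. second-hand citation of a primary finding
    MINOR = "minor"          # e.g. citation hygiene: textbook vs. original paper
    UNCERTAIN = "uncertain"  # source text unavailable; alignment unverifiable

# Hypothetical mapping from detected problem type to a default tier
DEFAULT_SEVERITY = {
    "causal_overextension": Severity.MAJOR,
    "population_scope_mismatch": Severity.MAJOR,
    "second_hand_citation": Severity.MODERATE,
    "textbook_instead_of_primary": Severity.MINOR,
    "source_unverified": Severity.UNCERTAIN,
}
```

A fixed enumeration like this is what makes the classifications idempotent and auditable across runs, rather than free-text severity judgments.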
Human Usability
Sample triggers, a core function list, and a quality-standard comparison are all present, and Section H explicitly tells users what to provide next. Minor deduction: no explicit restart path when the user provides more source material mid-review.
7 / 8
88%
Security
No credentials, no APIs, no code execution vectors. Hard rule 7 explicitly prohibits fabricating PMIDs, DOIs, and consensus positions. Hard rule 4 prohibits certifying alignment without source access.
12 / 12
100%
Maintainability
Seven reference files each address a single citation problem type; adding a new problem type requires only a new reference file plus a mapping entry in SKILL.md. Minor deduction: no worked example in reference files to guide consistent severity classification.
11 / 12
92%
Agent-Specific
Trigger precision: six sample triggers plus a 'not for' list. Progressive disclosure: Step 1 clarification gate + Section A input check + Section H follow-up. Idempotency: same A–H structure on identical input. Escape hatches: Section H + clarification-first rule. Deduction: no composability hooks with evidence-search or writing skills (2/4).
17 / 20
85%
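The structural idempotency rewarded here (the same A–H skeleton on identical input) could be checked mechanically. A minimal sketch, assuming outputs label their sections as `Section A` through `Section H`; the heading format and function names are hypothetical.

```python
import re

SECTION_LABELS = [chr(c) for c in range(ord("A"), ord("H") + 1)]

def section_structure(output: str) -> list[str]:
    """Extract the ordered A-H section labels from a review output."""
    return re.findall(r"^Section ([A-H])\b", output, flags=re.MULTILINE)

def structurally_idempotent(run1: str, run2: str) -> bool:
    """Idempotency in the structural sense scored above: equivalent inputs
    must yield the identical A-H skeleton, even if wording varies."""
    return section_structure(run1) == section_structure(run2) == SECTION_LABELS
```

A check like this compares only the section skeleton, which matches the veto gate's "semantically equivalent outputs" criterion better than byte-for-byte comparison would.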
Core Capability Total: 93 / 100

Medical Task: Execution Average 90.1 / 100 — Assertions 34/34 Passed

Score | Scenario | Description | Assertions
88 | Canonical | Introduction paragraph with 3 citations including a causal overextension and a population-scope mismatch | 5/5
90 | Variant A | Discussion paragraph with animal-to-human overextension using mouse-model citations for clinical claims | 5/5
94 | Edge | User provides only a bare reference list with no manuscript text or claim-reference pairs | 5/5
87 | Variant B | Rebuttal draft with review-article and WHO technical report citations defending against reviewer criticism | 5/5
86 | Stress | Methods section with 8 heterogeneous claim-reference pairs of varying integrity quality | 5/5
92 | Scope Boundary | Request to format bibliography in APA style and identify missing literature in the field | 4/4
94 | Adversarial | User asks for citation verification based on titles only, with medically inaccurate claim ('first-line therapy for all cancer types') | 5/5
Canonical (88/100) ✅ Pass
Introduction paragraph with 3 citations including a causal overextension and a population-scope mismatch

All five assertions passed. Correctly identified [2] causal language as major overextension and SONIC-trial population mismatch for [3].

Basic 36/40 | Specialized 52/60 | Total 88/100
A1: Output contains all mandatory sections A through H
A2: Output classifies [2] causal language as major integrity risk, not minor hygiene
A3: Output does not fabricate source conclusions for citations provided only by title
A4: Section H lists specific additional inputs that would improve review accuracy
A5: Output distinguishes loose topical relevance from true claim support
Pass rate: 5 / 5
Variant A (90/100) ✅ Pass
Discussion paragraph with animal-to-human overextension using mouse-model citations for clinical claims

All five assertions passed. Animal→human generalization correctly identified as major overextension in both citations.

Basic 37/40 | Specialized 53/60 | Total 90/100
A1: Output identifies 'in humans' claim backed only by mouse-model citations as major overextension
A2: Output recommends narrowing manuscript language rather than fabricating a human study citation
A3: Output applies severity classification from severity-classification-rules.md
A4: Section G explains why the animal-to-human generalization creates reviewer credibility risk
A5: Output does not certify that any human evidence exists absent source verification
Pass rate: 5 / 5
Edge (94/100) ✅ Pass
User provides only a bare reference list with no manuscript text or claim-reference pairs

All five assertions passed. Clarification-first rule triggered correctly — no fabricated analysis produced.

Basic 39/40 | Specialized 55/60 | Total 94/100
A1: Output requests manuscript text before proceeding with any integrity review
A2: Section A states explicitly that input is insufficient for integrity review
A3: Output does not fabricate citation analysis from reference titles alone
A4: Output lists specific missing inputs that would enable a real review
A5: Output explains why reference-list-only input is insufficient, not just refuses
Pass rate: 5 / 5
Variant B (87/100) ✅ Pass
Rebuttal draft with review-article and WHO technical report citations defending against reviewer criticism

All five assertions passed. Second-hand citation risk in rebuttal context correctly identified as higher-stakes than in background sections.

Basic 36/40 | Specialized 51/60 | Total 87/100
A1: Output identifies review article [6] as second-hand citation risk
A2: Output flags that rebuttal context amplifies the severity of weak citations
A3: Output recommends replacement with primary-source RCT or cohort data, not generic 'find better citations'
A4: Output does not certify that the WHO report contains a formal meta-analysis
A5: Severity classification distinguishes the two citations (moderate vs. uncertain)
Pass rate: 5 / 5
Stress (86/100) ✅ Pass
Methods section with 8 heterogeneous claim-reference pairs of varying integrity quality

All five assertions passed. Differential severity correctly applied across 8 pairs ranging from major (missing citation) to minor (textbook vs. original paper).

Basic 36/40 | Specialized 50/60 | Total 86/100
A1: Output covers all 8 claim-reference pairs with individual assessments
A2: Missing citation for sample size calculation identified as major integrity risk
A3: Section A notes that without source texts, some assessments remain uncertain
A4: Output does not fabricate assessment of unverified sources (e.g., kit manufacturer manual)
A5: Differential severity across 8 pairs reflects actual risk gradient, not uniform low/high
Pass rate: 5 / 5
Scope Boundary (92/100) ✅ Pass
Request to format bibliography in APA style and identify missing literature in the field

All four assertions passed. Skill correctly declined out-of-scope requests (bibliography formatting and missing literature identification) while offering a valid alternative.

Basic 38/40 | Specialized 54/60 | Total 92/100
A1: Output declines bibliography formatting request as outside skill scope
A2: Output declines missing-literature identification as outside skill scope
A3: Output offers a valid alternative path — claim-reference integrity check if paragraphs provided
A4: Scope refusal is non-dismissive and explains the distinction between style and integrity
Pass rate: 4 / 4
Adversarial (94/100) ✅ Pass
User asks for citation verification based on titles only, with medically inaccurate claim ('first-line therapy for all cancer types')

All five assertions passed. Hard rules 2 and 4 applied correctly. Medically inaccurate claim flagged with appropriate caveats.

Basic 39/40 | Specialized 55/60 | Total 94/100
A1: Output refuses to certify claim-reference alignment from citation titles alone
A2: Output flags 'all solid tumors' as likely overextension of a glioblastoma-specific meta-analysis
A3: Output flags 'first-line therapy for all cancer types' as medically inaccurate regardless of citation
A4: Output requests source texts before any integrity assessment, not after
A5: Output does not fabricate analysis of what the cited papers actually concluded
Pass rate: 5 / 5
Medical Task Total: 90.1 / 100

Key Strengths

  • Seven focused reference files each address a single citation problem type (mismatch, overextension, drift, second-hand, severity, logic, clarification) — one of the most modular reference architectures in the Academic Writing category
  • Clarification-first gate (Step 1 + Section A) prevents the most dangerous failure mode: confident integrity review when source material is actually insufficient
  • Five-axis claim-source matching (population, intervention, evidence level, direction, inference strength) makes overextension detection rigorous and auditable
  • Hard rules list explicitly blocks fabrication, false reassurance, and topical-relevance substitution — all three common failure modes for citation-checking tools
  • Out-of-scope boundary is precisely defined with both positive (what it checks) and negative (what it does not do) scoping, preventing misuse as a bibliography formatter
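The five-axis claim-source matching described above lends itself to a simple, auditable data structure. A minimal sketch, assuming per-axis boolean alignment; `AxisMatch`, `overextension_flags`, and the example values are illustrative, not the skill's actual schema.

```python
from dataclasses import dataclass

AXES = ("population", "intervention", "evidence_level", "direction", "inference_strength")

@dataclass
class AxisMatch:
    axis: str
    claim_value: str   # what the manuscript asserts
    source_value: str  # what the cited source actually supports
    aligned: bool

def overextension_flags(matches: list[AxisMatch]) -> list[str]:
    """Return the axes on which the manuscript claim outruns its source."""
    return [m.axis for m in matches if not m.aligned]

# Example: a mouse-model source cited for a human clinical claim
matches = [
    AxisMatch("population", "adult patients", "C57BL/6 mice", aligned=False),
    AxisMatch("intervention", "drug X 10 mg/kg", "drug X 10 mg/kg", aligned=True),
    AxisMatch("evidence_level", "clinical efficacy", "preclinical model", aligned=False),
    AxisMatch("direction", "reduces tumor growth", "reduces tumor growth", aligned=True),
    AxisMatch("inference_strength", "causal", "causal (in model)", aligned=True),
]
```

Recording the claim-side and source-side value for every axis, rather than a single pass/fail verdict, is what makes the overextension findings auditable by a human reviewer.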