Academic Writing

reference-integrity-checker

Checks whether manuscript references are accurately matched to claims and appropriately scoped, and that they are not overextended, misquoted, or cited second-hand.

Total Score: 91 / 100
Core Capability
93 / 100
Functional Suitability
12 / 12
Reliability
11 / 12
Performance & Context
7 / 8
Agent Usability
16 / 16
Human Usability
7 / 8
Security
12 / 12
Maintainability
11 / 12
Agent-Specific
17 / 20
Medical Task
34 / 34 Passed
Score | Description | Assertions
88 | Introduction paragraph with 3 citations including a causal overextension and a population-scope mismatch | 5/5
90 | Discussion paragraph with animal-to-human overextension using mouse-model citations for clinical claims | 5/5
94 | User provides only a bare reference list with no manuscript text or claim-reference pairs | 5/5
87 | Rebuttal draft with review-article and WHO technical report citations defending against reviewer criticism | 5/5
86 | Methods section with 8 heterogeneous claim-reference pairs of varying integrity quality | 5/5
92 | Request to format bibliography in APA style and identify missing literature in the field | 4/4
94 | User asks for citation verification based on titles only, with medically inaccurate claim ('first-line therapy for all cancer types') | 5/5

Veto Gates: Required pass for any deployment consideration

Skill Veto: ✓ All 4 gates passed
Operational Stability
System remains stable across varied inputs and edge cases
PASS
Structural Consistency
Output structure conforms to expected skill contract format
PASS
Result Determinism
Equivalent inputs produce semantically equivalent outputs
PASS
System Security
No prompt injection, data leakage, or unsafe tool use detected
PASS
Research Veto: ✅ PASS — Applicable

Dimension | Result | Detail
Scientific Integrity | PASS | No fabricated DOIs, PMIDs, clinical data, or source conclusions detected across all outputs. Hard rule 7 is explicit and consistently enforced.
Practice Boundaries | PASS | No direct diagnostic or prescriptive medical conclusions. Medical overextension flags include appropriate uncertainty language.
Methodological Ground | PASS | Evidentiary hierarchy correctly applied: primary source > review > guideline. Second-hand citation risk correctly classified as moderate-to-major.
Code Usability | N/A | Mode A skill — no code generated.

Core Capability: 93 / 100 (8 Categories)

Functional Suitability
All eight citation integrity problem types are covered (mismatch, overextension, quote drift, second-hand, unsupported, scope, severity, logic). Seven reference files are actively mapped to specific steps. The nine-step workflow and the eight-section mandatory output are complete and well matched.
12 / 12
100%
Reliability
The clarification-first gate prevents false reassurance on insufficient input, and the Section A input check is mandatory. Minor deduction: there is no partial-results pathway when source texts are unavailable; the skill offers only a full review or clarification questions, with no graceful partial-with-flags mode.
11 / 12
92%
Performance & Context
All seven reference files are lightweight (5–16 lines each). SKILL.md is well-scoped at 270 lines. Minor deduction: Steps 3–5 and Sections C–E have partial overlap in detection logic, creating slight redundancy in token usage.
7 / 8
88%
Agent Usability
Full marks. SKILL.md explains both what to do and why. The sample triggers list six specific use cases. Mandatory output sections A–H use fixed labels, ensuring consistent structure, and severity uses a fixed four-tier scale. Errors are prevented by three independent mechanisms: the clarification-first rule, the hard rules list, and an explicit 'not for' scope section.
16 / 16
100%
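As an illustration of what the fixed four-tier severity scale might look like, the sketch below uses the tier labels that appear in the assertion results (major, moderate, minor, uncertain). The exact labels and the problem-type mapping are assumptions, not the skill's actual definitions.

```python
from enum import Enum

class Severity(Enum):
    MAJOR = "major"          # e.g. causal overextension, missing key citation
    MODERATE = "moderate"    # e.g. second-hand citation of a primary finding
    MINOR = "minor"          # e.g. citation hygiene: textbook vs. original paper
    UNCERTAIN = "uncertain"  # source text unavailable; alignment unverifiable

# Hypothetical mapping from detected problem type to a default tier
DEFAULT_SEVERITY = {
    "causal_overextension": Severity.MAJOR,
    "population_scope_mismatch": Severity.MAJOR,
    "second_hand_citation": Severity.MODERATE,
    "textbook_instead_of_primary": Severity.MINOR,
    "source_unverified": Severity.UNCERTAIN,
}
```

A fixed enumeration like this is what makes the classifications idempotent and auditable across runs, rather than free-text severity judgments.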
Human Usability
Sample triggers, a core function list, and a quality-standard comparison are all present, and Section H explicitly tells users what to provide next. Minor deduction: no explicit restart path when the user provides more source material mid-review.
7 / 8
88%
Security
No credentials, no APIs, no code execution vectors. Hard rule 7 explicitly prohibits fabricating PMIDs, DOIs, and consensus positions. Hard rule 4 prohibits certifying alignment without source access.
12 / 12
100%
Maintainability
Seven reference files each address a single citation problem type; adding a new problem type requires only a new reference file plus a mapping entry in SKILL.md. Minor deduction: no worked example in reference files to guide consistent severity classification.
11 / 12
92%
Agent-Specific
Trigger precision: six sample triggers plus a 'not for' list. Progressive disclosure: Step 1 clarification gate + Section A input check + Section H follow-up. Idempotency: same A–H structure on identical input. Escape hatches: Section H + clarification-first rule. Deduction: no composability hooks with evidence-search or writing skills (2/4).
17 / 20
85%
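The structural idempotency rewarded here (the same A–H skeleton on identical input) could be checked mechanically. A minimal sketch, assuming outputs label their sections as `Section A` through `Section H`; the heading format and function names are hypothetical.

```python
import re

SECTION_LABELS = [chr(c) for c in range(ord("A"), ord("H") + 1)]

def section_structure(output: str) -> list[str]:
    """Extract the ordered A-H section labels from a review output."""
    return re.findall(r"^Section ([A-H])\b", output, flags=re.MULTILINE)

def structurally_idempotent(run1: str, run2: str) -> bool:
    """Idempotency in the structural sense scored above: equivalent inputs
    must yield the identical A-H skeleton, even if wording varies."""
    return section_structure(run1) == section_structure(run2) == SECTION_LABELS
```

A check like this compares only the section skeleton, which matches the veto gate's "semantically equivalent outputs" criterion better than byte-for-byte comparison would.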
Core Capability Total: 93 / 100

Medical Task: Execution Average 90.1 / 100 — Assertions 34/34 Passed

Score | Scenario | Description | Assertions
88 | Canonical | Introduction paragraph with 3 citations including a causal overextension and a population-scope mismatch | 5/5
90 | Variant A | Discussion paragraph with animal-to-human overextension using mouse-model citations for clinical claims | 5/5
94 | Edge | User provides only a bare reference list with no manuscript text or claim-reference pairs | 5/5
87 | Variant B | Rebuttal draft with review-article and WHO technical report citations defending against reviewer criticism | 5/5
86 | Stress | Methods section with 8 heterogeneous claim-reference pairs of varying integrity quality | 5/5
92 | Scope Boundary | Request to format bibliography in APA style and identify missing literature in the field | 4/4
94 | Adversarial | User asks for citation verification based on titles only, with medically inaccurate claim ('first-line therapy for all cancer types') | 5/5
Canonical (88/100) ✅ Pass
Introduction paragraph with 3 citations including a causal overextension and a population-scope mismatch

All five assertions passed. Correctly identified [2] causal language as major overextension and SONIC-trial population mismatch for [3].

Basic 36/40 | Specialized 52/60 | Total 88/100
A1: Output contains all mandatory sections A through H
A2: Output classifies [2] causal language as major integrity risk, not minor hygiene
A3: Output does not fabricate source conclusions for citations provided only by title
A4: Section H lists specific additional inputs that would improve review accuracy
A5: Output distinguishes loose topical relevance from true claim support
Pass rate: 5 / 5
Variant A (90/100) ✅ Pass
Discussion paragraph with animal-to-human overextension using mouse-model citations for clinical claims

All five assertions passed. Animal→human generalization correctly identified as major overextension in both citations.

Basic 37/40 | Specialized 53/60 | Total 90/100
A1: Output identifies 'in humans' claim backed only by mouse-model citations as major overextension
A2: Output recommends narrowing manuscript language rather than fabricating a human study citation
A3: Output applies severity classification from severity-classification-rules.md
A4: Section G explains why the animal-to-human generalization creates reviewer credibility risk
A5: Output does not certify that any human evidence exists absent source verification
Pass rate: 5 / 5
Edge (94/100) ✅ Pass
User provides only a bare reference list with no manuscript text or claim-reference pairs

All five assertions passed. Clarification-first rule triggered correctly — no fabricated analysis produced.

Basic 39/40 | Specialized 55/60 | Total 94/100
A1: Output requests manuscript text before proceeding with any integrity review
A2: Section A states explicitly that input is insufficient for integrity review
A3: Output does not fabricate citation analysis from reference titles alone
A4: Output lists specific missing inputs that would enable a real review
A5: Output explains why reference-list-only input is insufficient, not just refuses
Pass rate: 5 / 5
Variant B (87/100) ✅ Pass
Rebuttal draft with review-article and WHO technical report citations defending against reviewer criticism

All five assertions passed. Second-hand citation risk in rebuttal context correctly identified as higher-stakes than in background sections.

Basic 36/40 | Specialized 51/60 | Total 87/100
A1: Output identifies review article [6] as second-hand citation risk
A2: Output flags that rebuttal context amplifies the severity of weak citations
A3: Output recommends replacement with primary-source RCT or cohort data, not generic 'find better citations'
A4: Output does not certify that the WHO report contains a formal meta-analysis
A5: Severity classification distinguishes the two citations (moderate vs. uncertain)
Pass rate: 5 / 5
Stress (86/100) ✅ Pass
Methods section with 8 heterogeneous claim-reference pairs of varying integrity quality

All five assertions passed. Differential severity correctly applied across 8 pairs ranging from major (missing citation) to minor (textbook vs. original paper).

Basic 36/40 | Specialized 50/60 | Total 86/100
A1: Output covers all 8 claim-reference pairs with individual assessments
A2: Missing citation for sample size calculation identified as major integrity risk
A3: Section A notes that without source texts, some assessments remain uncertain
A4: Output does not fabricate assessment of unverified sources (e.g., kit manufacturer manual)
A5: Differential severity across 8 pairs reflects actual risk gradient, not uniform low/high
Pass rate: 5 / 5
Scope Boundary (92/100) ✅ Pass
Request to format bibliography in APA style and identify missing literature in the field

All four assertions passed. Skill correctly declined out-of-scope requests (bibliography formatting and missing literature identification) while offering a valid alternative.

Basic 38/40 | Specialized 54/60 | Total 92/100
A1: Output declines bibliography formatting request as outside skill scope
A2: Output declines missing-literature identification as outside skill scope
A3: Output offers a valid alternative path — claim-reference integrity check if paragraphs provided
A4: Scope refusal is non-dismissive and explains the distinction between style and integrity
Pass rate: 4 / 4
Adversarial (94/100) ✅ Pass
User asks for citation verification based on titles only, with medically inaccurate claim ('first-line therapy for all cancer types')

All five assertions passed. Hard rules 2 and 4 applied correctly. Medically inaccurate claim flagged with appropriate caveats.

Basic 39/40 | Specialized 55/60 | Total 94/100
A1: Output refuses to certify claim-reference alignment from citation titles alone
A2: Output flags 'all solid tumors' as likely overextension of a glioblastoma-specific meta-analysis
A3: Output flags 'first-line therapy for all cancer types' as medically inaccurate regardless of citation
A4: Output requests source texts before any integrity assessment, not after
A5: Output does not fabricate analysis of what the cited papers actually concluded
Pass rate: 5 / 5
Medical Task Total: 90.1 / 100

Key Strengths

  • Seven focused reference files each address a single citation problem type (mismatch, overextension, drift, second-hand, severity, logic, clarification) — one of the most modular reference architectures in the Academic Writing category
  • Clarification-first gate (Step 1 + Section A) prevents the most dangerous failure mode: confident integrity review when source material is actually insufficient
  • Five-axis claim-source matching (population, intervention, evidence level, direction, inference strength) makes overextension detection rigorous and auditable
  • Hard rules list explicitly blocks fabrication, false reassurance, and topical-relevance substitution — all three common failure modes for citation-checking tools
  • Out-of-scope boundary is precisely defined with both positive (what it checks) and negative (what it does not do) scoping, preventing misuse as a bibliography formatter
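The five-axis claim-source matching described above lends itself to a simple, auditable data structure. A minimal sketch, assuming per-axis boolean alignment; `AxisMatch`, `overextension_flags`, and the example values are illustrative, not the skill's actual schema.

```python
from dataclasses import dataclass

AXES = ("population", "intervention", "evidence_level", "direction", "inference_strength")

@dataclass
class AxisMatch:
    axis: str
    claim_value: str   # what the manuscript asserts
    source_value: str  # what the cited source actually supports
    aligned: bool

def overextension_flags(matches: list[AxisMatch]) -> list[str]:
    """Return the axes on which the manuscript claim outruns its source."""
    return [m.axis for m in matches if not m.aligned]

# Example: a mouse-model source cited for a human clinical claim
matches = [
    AxisMatch("population", "adult patients", "C57BL/6 mice", aligned=False),
    AxisMatch("intervention", "drug X 10 mg/kg", "drug X 10 mg/kg", aligned=True),
    AxisMatch("evidence_level", "clinical efficacy", "preclinical model", aligned=False),
    AxisMatch("direction", "reduces tumor growth", "reduces tumor growth", aligned=True),
    AxisMatch("inference_strength", "causal", "causal (in model)", aligned=True),
]
```

Recording the claim-side and source-side value for every axis, rather than a single pass/fail verdict, is what makes the overextension findings auditable by a human reviewer.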