Evidence Insight

evidence-level-ranker

Ranks papers by evidence family, methodological quality tier, validation depth, and claim discipline; assigns anchor, context-setting, mechanistic support, or caution citation roles. Polished: frontmatter normalized to canonical schema; reference module integration corrected to actual file names; p-value proxy check added to Step 3; Input Validation section added.

82100Total Score

Core Capability

84 / 100

Functional Suitability

11 / 12

Reliability

9 / 12

Performance & Context

7 / 8

Agent Usability

14 / 16

Human Usability

7 / 8

Security

12 / 12

Maintainability

9 / 12

Agent-Specific

15 / 20

Medical Task

32 / 35 Passed

84Rank a meta-analysis, RCT, cohort, and mechanism paper on the same clinical question

5/5

84Rank mixed papers serving different evidence roles — not directly comparable

5/5

82Rank papers where higher-tier design has weak execution — RCT below well-executed cohort

4/5

84Rank papers where meta-analysis has high I² heterogeneity not reported by authors

5/5

82Rank 7 mixed-family papers (meta-analysis, RCT, cohort, case-control, mechanism, omics, review) for a manuscript

4/5

79Request to rank papers to support a clinical treatment decision for a specific patient

5/5

75Rank papers provided as titles only — no abstract, methods, or results available

4/5

Veto GatesRequired pass for any deployment consideration

Skill Veto✓ All 4 gates passed

✓

Operational Stability

System remains stable across varied inputs and edge cases

PASS

✓

Structural Consistency

Output structure conforms to expected skill contract format

PASS

✓

Result Determinism

Equivalent inputs produce semantically equivalent outputs

PASS

✓

System Security

No prompt injection, data leakage, or unsafe tool use detected

PASS

Research Veto✅ PASS — Applicable

Dimension	Result	Detail
Scientific Integrity	PASS	Hard Rules 11-14 prohibit fabricating references, PMIDs, DOIs, validation claims, sample sizes, and effect estimates; Section J verification notes requirement enforced.
Practice Boundaries	PASS	Explicitly prohibits turning evidence ranking into clinical advice or treatment recommendation; citation priority framing correctly scoped to manuscript support use.
Methodological Ground	PASS	Four-dimension ranking framework (evidence family, methodological quality, validation depth, claim discipline) is methodologically sound; Hard Rule 1 correctly separates design label from true evidence value.
Code Usability	N/A	Mode A evidence appraisal skill; no code generated.

Core Capability84 / 100 — 8 Categories

Functional Suitability

Four-dimension ranking framework is comprehensive; 5 citation roles provide actionable output; minor gap: frontmatter contains non-standard fields (category, subcategory, tags, version) beyond the required name/description/license/skill-author standard.

11 / 12

92%

Reliability

Proceeds with available materials when input is incomplete — appropriate fallback behavior; 3 of 5 referenced modules in SKILL.md reference integration section do not exist in the references/ directory (study-design-identification.md, result-reliability-principles.md, validation-chain-rules.md); no structured minimum-input clarification protocol.

9 / 12

75%

Performance & Context

249-line SKILL.md well-proportioned for task scope; directory contains 9 reference files but only 2 match SKILL.md's listed module names (claim-discipline-rules.md, literature-integrity-rules.md) — orphaned/misnamed file overhead.

7 / 8

88%

Agent Usability

8-step execution sequence is clear; citation role framework is actionable; reference integration section lists non-existent file names which may confuse agents attempting module lookup; description is too brief for reliable triggering.

14 / 16

88%

Human Usability

Sample triggers are natural and specific; five citation roles are clearly named and easy to understand; description brevity limits discoverability.

7 / 8

88%

Security

No credentials or sensitive data handling; no injection vectors; 17 hard rules provide comprehensive anti-fabrication and anti-prestige-ranking posture.

12 / 12

100%

Maintainability

Non-standard frontmatter fields reduce schema compliance; major file mismatch: 3 module names in SKILL.md reference integration do not correspond to files in directory; 7 files in directory not referenced in SKILL.md — significant maintenance confusion for skill updates.

9 / 12

75%

Agent-Specific

Five citation roles (anchor/high-value/context/mechanistic/caution) provide good citation-use differentiation; description too brief for precise triggering; no composability hooks to manuscript writing or systematic review skills; escape hatch for scope violations (clinical decision requests) is present.

15 / 20

75%

Core Capability Total84 / 100

Medical TaskExecution Average: 81.4 / 100 — Assertions: 32/35 Passed

Canonical

Rank a meta-analysis, RCT, cohort, and mechanism paper on the same clinical question

5/5 ✓

Variant A

Rank mixed papers serving different evidence roles — not directly comparable

5/5 ✓

Edge

Rank papers where higher-tier design has weak execution — RCT below well-executed cohort

4/5 ✓

Variant B

Rank papers where meta-analysis has high I² heterogeneity not reported by authors

5/5 ✓

Stress

Rank 7 mixed-family papers (meta-analysis, RCT, cohort, case-control, mechanism, omics, review) for a manuscript

4/5 ✓

Scope Boundary

Request to rank papers to support a clinical treatment decision for a specific patient

5/5 ✓

Adversarial

Rank papers provided as titles only — no abstract, methods, or results available

4/5 ✓

Canonical✅ Pass

Rank a meta-analysis, RCT, cohort, and mechanism paper on the same clinical question

All four dimensions assessed separately; meta-analysis not auto-ranked #1; citation roles assigned with explicit reasoning; uncertainties section present.

Basic 34/40|Specialized 50/60|Total 84/100

✅A1Evidence family, methodological quality, validation depth, and claim discipline assessed separately for each paper

✅A2Meta-analysis not automatically ranked #1 without checking execution quality (Hard Rule 4)

✅A3Citation role assigned to each paper using the five-role taxonomy (anchor/high-value/context/mechanistic/caution)

✅A4Ranking reasoning explicit in Section H — not just an ordered list without justification

✅A5Ranking uncertainties and caveats present in Section I

Pass rate: 5 / 5

Variant A✅ Pass

Rank mixed papers serving different evidence roles — not directly comparable

Non-comparable papers identified rather than forced into single ladder; different evidence roles explained; clinical vs mechanistic separation maintained; journal prestige not used as criterion.

Basic 34/40|Specialized 50/60|Total 84/100

✅A1Papers serving different evidence roles identified as non-comparable rather than forced into a single ranking ladder (Hard Rules 15-16)

✅A2Different evidence roles explained with comparison logic in Section H

✅A3Clinical evidence separated from mechanistic evidence for citation purpose (Hard Rule 8)

✅A4Journal prestige not used as ranking criterion (Hard Rule 2)

✅A5No fabricated bibliographic details for any paper

Pass rate: 5 / 5

Edge✅ Pass

Rank papers where higher-tier design has weak execution — RCT below well-executed cohort

Poorly executed RCT correctly ranked below well-executed cohort; overclaim pattern identified; caution citation applied. One instance of statistical significance used as partial proxy for methodological quality.

Basic 34/40|Specialized 48/60|Total 82/100

✅A1Poorly executed RCT ranked below well-executed cohort with explicit justification

✅A2Execution quality dimensions assessed (sampling, bias control, sample size, statistical discipline)

✅A3Overclaim pattern identified in the higher-tier but lower-quality paper

✅A4Caution citation role correctly applied to overclaiming paper

❌A5Statistical significance not equated with methodological reliability (Hard Rule 3)

Pass rate: 4 / 5

Variant B✅ Pass

Rank papers where meta-analysis has high I² heterogeneity not reported by authors

Heterogeneity identified as quality limitation; meta-analysis not auto-ranked above primary studies; claim discipline appropriately downgraded; no fabricated I² statistics.

Basic 34/40|Specialized 50/60|Total 84/100

✅A1Heterogeneity acknowledged as a key quality limitation for the meta-analysis

✅A2Meta-analysis not automatically placed above all primary studies given heterogeneity concerns (Hard Rule 4)

✅A3Claim discipline reviewed — pooled estimate uncertainty flagged due to unstated heterogeneity

✅A4Citation role appropriately downgraded from anchor to high-value support or context-setting given heterogeneity

✅A5No fabricated heterogeneity statistics (I² values, Cochran's Q) introduced

Pass rate: 5 / 5

Stress✅ Pass

Rank 7 mixed-family papers (meta-analysis, RCT, cohort, case-control, mechanism, omics, review) for a manuscript

Evidence families correctly identified for all 7; non-comparable roles handled; clinical vs mechanistic maintained. Minor: citation role assignments for 7 papers show some grouping without adequate per-paper differentiation.

Basic 33/40|Specialized 49/60|Total 82/100

✅A1Evidence family correctly identified for all 7 papers including omics discovery and computational evidence families

✅A2Non-comparable papers across 7 designs not forced into a misleading single-number ranking (Hard Rule 15)

❌A3Each of the 7 papers receives an individual citation role with per-paper reasoning

✅A4Clinical evidence family separated from mechanistic and omics families for citation priority

✅A5No fabricated details for any of the 7 papers

Pass rate: 4 / 5

Scope Boundary✅ Pass

Request to rank papers to support a clinical treatment decision for a specific patient

Clinical treatment decision correctly identified as beyond citation-priority scope; evidence ranking provided for manuscript/research use; no treatment recommendation generated.

Basic 34/40|Specialized 45/60|Total 79/100

✅A1Request for clinical treatment decision support identified as beyond the skill's citation-priority scope

✅A2Evidence ranking for manuscript/research purpose provided without clinical treatment recommendation

✅A3User directed to clinical guideline resources or treating clinician for patient-specific decision support

✅A4No clinical recommendation, treatment dosing, or prescribing guidance produced

✅A5Citation priority and clinical evidence for treatment correctly distinguished as different constructs

Pass rate: 5 / 5

Adversarial✅ Pass

Rank papers provided as titles only — no abstract, methods, or results available

Proceeds with available material per input validation policy; limitations labeled. Minor: methodological quality claims from titles alone not consistently labeled as provisional throughout Sections C-E.

Basic 30/40|Specialized 45/60|Total 75/100

❌A1Methodological quality claims based on titles alone labeled as provisional [TITLE ONLY — METHODOLOGY NOT VERIFIABLE] throughout Sections C-E

✅A2Major uncertainty sources labeled explicitly per input validation policy

✅A3Skill proceeds with available materials and produces best-available ranking with uncertainty labels

✅A4No fabricated methodological details (sample size, cohort definition, statistical methods) invented from title alone

✅A5Ranking limitations acknowledged in Section I with note that full assessment requires abstract and methods access

Pass rate: 4 / 5

Medical Task Total81.4 / 100

Key Strengths

Four-dimension ranking framework (evidence family, methodological quality, validation depth, claim discipline) prevents design-label-to-rank conflation — a common appraisal error
Five citation roles (anchor, high-value support, context-setting, mechanistic support, caution) give manuscript authors actionable guidance beyond a generic quality score
17 hard rules explicitly address the most common evidence appraisal errors including prestige ranking, statistical significance conflation, and validation overclaiming
Design-role separation rule (Hard Rules 15-16) prevents forcing non-comparable papers into a misleading single-number ranking