Evidence Insight

reference-finder

91 / 100 Total Score

Core Capability

88 / 100

Functional Suitability

11 / 12

Reliability

10 / 12

Performance & Context

8 / 8

Agent Usability

14 / 16

Human Usability

8 / 8

Security

10 / 12

Maintainability

10 / 12

Agent-Specific

17 / 20

Medical Task

20 / 20 Passed

97: You have a scientific paragraph and want suggested PubMed papers for each sentence

4/4

93: You need top-ranked references with title, DOI, PMID, year, and a short "why recommended" explanation

4/4

91: Sentence-level reference matching for scientific text

4/4

91: Returns the top N (default: 3) most relevant PubMed records per sentence

4/4

91: End-to-end case for sentence-level reference matching for scientific text

4/4

Veto Gates: Required pass for any deployment consideration

Skill Veto: ✓ All 4 gates passed

✓

Operational Stability

System remains stable across varied inputs and edge cases

PASS

✓

Structural Consistency

Output structure conforms to expected skill contract format

PASS

✓

Result Determinism

Equivalent inputs produce semantically equivalent outputs

PASS

✓

System Security

No prompt injection, data leakage, or unsafe tool use detected

PASS

Research Veto: ✅ PASS (Applicable)

Dimension	Result	Detail
Scientific Integrity	PASS	The archived evaluation did not indicate fabricated or unsupported scientific claims in reference-finder.
Practice Boundaries	PASS	Practice boundaries held because the package remained focused on source handling, lookup, or structured evidence use.
Methodological Ground	PASS	Methodological grounding was preserved through the documented inputs, transformations, and expected artifacts.
Code Usability	PASS	The legacy audit did not record a code-usability failure in the packaged analysis path.

Core Capability: 88 / 100 (8 Categories)

Functional Suitability

Functional suitability was softened by the legacy issue 'Improve stress-case output rigor': stress and boundary scenarios show weaker consistency.

11 / 12

92%

Reliability

Related legacy finding for reference-finder: 'Improve stress-case output rigor'. Stress and boundary scenarios show weaker consistency.

10 / 12

83%

Performance & Context

The legacy audit gave full marks to Performance & Context for this package.

8 / 8

100%

Agent Usability

The legacy audit deducted points for reference-finder in agent usability.

14 / 16

88%

Human Usability

No point loss was recorded for human usability in the legacy audit.

8 / 8

100%

Security

The archived evaluation left some headroom for reference-finder under security.

10 / 12

83%

Maintainability

A modest deduction remained in maintainability for reference-finder in the archived review.

10 / 12

83%

Agent-Specific

The archived deduction in the Agent-Specific category traces back to the legacy issue 'Improve stress-case output rigor': stress and boundary scenarios show weaker consistency.

17 / 20

85%

Core Capability Total: 88 / 100
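The category figures above can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only: the scores are copied from this report, while the `percent` helper and its round-half-up rule are assumptions, chosen because they reproduce the percentages shown (for example 14 / 16 = 87.5%, displayed as 88%).

```python
# Category scores as (earned, maximum), copied from the report above.
categories = {
    "Functional Suitability": (11, 12),
    "Reliability": (10, 12),
    "Performance & Context": (8, 8),
    "Agent Usability": (14, 16),
    "Human Usability": (8, 8),
    "Security": (10, 12),
    "Maintainability": (10, 12),
    "Agent-Specific": (17, 20),
}


def percent(earned, maximum):
    """Whole-number percentage, rounding halves up (assumed rounding rule)."""
    return int(earned * 100 / maximum + 0.5)


# Sum earned points and maximum points across all eight categories.
total_earned = sum(earned for earned, _ in categories.values())
total_max = sum(maximum for _, maximum in categories.values())

# Per-category percentages, keyed by category name.
percentages = {name: percent(e, m) for name, (e, m) in categories.items()}

print(f"Core Capability Total: {total_earned} / {total_max}")
for name, value in percentages.items():
    print(f"{name}: {value}%")
```

Under these assumptions the eight categories sum to 88 of 100 available points, matching the reported total.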

Medical Task: Execution Average 92.6 / 100; Assertions 20/20 Passed

Canonical

You have a scientific paragraph and want suggested PubMed papers for each sentence

4/4 ✓

Variant A

You need top-ranked references with title, DOI, PMID, year, and a short "why recommended" explanation

4/4 ✓

Edge

Sentence-level reference matching for scientific text

4/4 ✓

Variant B

Returns the top N (default: 3) most relevant PubMed records per sentence

4/4 ✓

Stress

End-to-end case for sentence-level reference matching for scientific text

4/4 ✓

Canonical: ✅ Pass

You have a scientific paragraph and want suggested PubMed papers for each sentence

The archived run treated "You have a scientific paragraph and want suggested PubMed papers..." as a bounded extraction workflow, keeping attention on source fields, fallback logic, and output shape.

Basic 36/40|Specialized 60/60|Total 97/100

✅ A1: The reference-finder output structure matches the documented deliverable

✅ A2: The instruction path remains actionable for the documented case

✅ A3: The output stays fully within the documented skill boundary

✅ A4: The response quality is acceptable for the documented path

Pass rate: 4 / 4

Variant A: ✅ Pass

You need top-ranked references with title, DOI, PMID, year, and a short "why recommended" explanation

This Variant A case stayed focused on extracting and normalizing evidence from the provided records instead of drifting into unsupported interpretation.

Basic 34/40|Specialized 59/60|Total 93/100

✅ A1: The reference-finder output structure matches the documented deliverable

✅ A2: The instruction path remains actionable for the documented case

✅ A3: The output stays fully within the documented skill boundary

✅ A4: The response quality is acceptable for the documented path

Pass rate: 4 / 4

Edge: ✅ Pass

Sentence-level reference matching for scientific text

This edge case stayed within the packaged analysis boundary and kept a reviewable task contract.

Basic 33/40|Specialized 58/60|Total 91/100

✅ A1: The reference-finder output structure matches the documented deliverable

✅ A2: The instruction path remains actionable for the documented case

✅ A3: The output stays fully within the documented skill boundary

✅ A4: The response quality is acceptable for the documented path

Pass rate: 4 / 4

Variant B: ✅ Pass

Returns the top N (default: 3) most relevant PubMed records per sentence

The archived run treated "Returns the top N (default: 3) most relevant PubMed records per sentence" as a bounded extraction workflow, keeping attention on source fields, fallback logic, and output shape.

Basic 32/40|Specialized 59/60|Total 91/100

✅ A1: The reference-finder output structure matches the documented deliverable

✅ A2: The instruction path remains actionable for the documented case

✅ A3: The output stays fully within the documented skill boundary

✅ A4: The response quality is acceptable for the documented path

Pass rate: 4 / 4
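To make the "top N (default: 3) per sentence" contract concrete, here is a minimal sketch of the output shape it implies. It does not reproduce find_refs.py, whose internals are not shown in this report. The `Reference` class, `split_sentences`, `top_n`, and the term-overlap scoring are all hypothetical stand-ins, and a real run would rank actual PubMed records rather than an in-memory list.

```python
import re
from dataclasses import dataclass


@dataclass
class Reference:
    """Hypothetical record carrying the fields named in the scenarios above."""
    title: str
    doi: str
    pmid: str
    year: int
    why_recommended: str = ""


def tokens(text):
    """Lowercased alphanumeric tokens, as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def split_sentences(paragraph):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", paragraph.strip())
    return [part for part in parts if part]


def top_n(sentence, candidates, n=3):
    """Rank candidates by title/sentence term overlap (assumed scoring rule)."""
    sentence_tokens = tokens(sentence)
    scored = []
    for ref in candidates:
        shared = sentence_tokens & tokens(ref.title)
        if shared:
            scored.append((len(shared), ref, sorted(shared)))
    # Stable sort: ties keep the original candidate order.
    scored.sort(key=lambda item: item[0], reverse=True)
    return [
        Reference(ref.title, ref.doi, ref.pmid, ref.year,
                  "shares terms: " + ", ".join(shared))
        for _, ref, shared in scored[:n]
    ]
```

Calling `top_n` once per item returned by `split_sentences` yields at most N records per sentence, each with title, DOI, PMID, year, and a short explanation, which is the structure the assertions above check against.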

Stress: ✅ Pass

End-to-end case for sentence-level reference matching for scientific text

"End-to-end case for sentence-level reference matching for..." remained tied to the documented analysis contract even when the preserved evidence centered on instructions instead of a full rerun.

Basic 29/40|Specialized 60/60|Total 91/100

✅ A1: The reference-finder output structure matches the documented deliverable

✅ A2: The instruction path remains actionable for the documented case

✅ A3: The output stays fully within the documented skill boundary

✅ A4: The response quality is acceptable for the documented path

Pass rate: 4 / 4

Medical Task Total: 92.6 / 100
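The 92.6 figure is consistent with a plain arithmetic mean of the five scenario totals reported above. The sketch below shows that calculation, assuming equal weighting per scenario (the report does not state the weighting). It uses the reported Total values as given, without re-deriving them from the Basic and Specialized sub-scores.

```python
from statistics import mean

# Scenario totals (out of 100), copied from the report above.
scenario_totals = {
    "Canonical": 97,
    "Variant A": 93,
    "Edge": 91,
    "Variant B": 91,
    "Stress": 91,
}

# Equal-weight average across scenarios (assumed weighting), to one decimal.
execution_average = round(mean(list(scenario_totals.values())), 1)

# Each scenario passed 4 of 4 assertions, so 5 scenarios give 20 of 20.
assertions_passed = 4 * len(scenario_totals)

print(f"Execution Average: {execution_average} / 100")
print(f"Assertions: {assertions_passed}/{assertions_passed} Passed")
```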

Key Strengths

Primary routing is Evidence Insight with execution mode B
Static quality score is 88/100 and dynamic average is 79.6/100
Assertions and command execution outcomes are recorded per input for human review
Execution verification summary: Script verification 1/1; adjustment=5. find_refs.py: OK