ebm-calculator

Evidence-Based Medicine diagnostic test calculator. Computes sensitivity, specificity, PPV, NPV, likelihood ratios, NNT, and pre/post-test probability from 2x2 contingency table inputs.

88 / 100 Total Score

Core Capability

90 / 100

Functional Suitability

11 / 12

Reliability

11 / 12

Performance & Context

7 / 8

Agent Usability

15 / 16

Human Usability

8 / 8

Security

11 / 12

Maintainability

12 / 12

Agent-Specific

15 / 20

Medical Task

19 / 20 Passed

Diagnostic mode: TP=90, FN=10, TN=85, FP=15 (90 / 100)

4/4

Diagnostic mode with prevalence adjustment (prevalence=0.1) (90 / 100)

4/4

NNT mode: control-rate=0.3, experimental-rate=0.2 (90 / 100)

4/4

Probability mode: pretest=0.15, lr=5.2 (90 / 100)

4/4

Negative TP value (TP=-5) (66 / 100)

3/4

Veto Gates: required pass for any deployment consideration

Skill Veto: ✓ All 4 gates passed

✓ Operational Stability: PASS

System remains stable across varied inputs and edge cases

✓ Structural Consistency: PASS

Output structure conforms to expected skill contract format

✓ Result Determinism: PASS

Equivalent inputs produce semantically equivalent outputs

✓ System Security: PASS

No prompt injection, data leakage, or unsafe tool use detected

Core Capability: 90 / 100 (8 categories)

Functional Suitability

All three modes work correctly; validation rules for negative values and prevalence range now documented; result variable initialization documented

11 / 12

92%

Reliability

Negative TP/FP/FN/TN validation documented with exact error message; prevalence range validation documented; Optional[float] type annotation fix documented

11 / 12

92%

Performance & Context

Token usage is proportional to input complexity; execution overhead is acceptable for clinical/medical data processing tasks, though there is room for compression.

7 / 8

88%

Agent Usability

Three modes clearly documented; validation rules section added; fallback template present; references/guidelines.md present

15 / 16

94%

Human Usability

When-to-Use and When-Not-to-Use sections are clearly stated; error scenarios and recovery paths are documented for typical clinical/medical data processing use cases.

8 / 8

100%

Security

No credential concerns; all inputs are numeric; no injection risk

11 / 12

92%

Maintainability

Clean class-based design; 177 lines; good docstrings; type hints present; validation rules documented

12 / 12

100%

Agent-Specific

Trigger description is precise; output format documented; validation rules documented; LSP type error documented for a future fix

15 / 20

75%

Core Capability Total: 90 / 100

Medical Task: Execution Average 87.2 / 100, Assertions 19/20 Passed

Canonical

Diagnostic mode: TP=90, FN=10, TN=85, FP=15

4/4 ✓

Variant A

Diagnostic mode with prevalence adjustment (prevalence=0.1)

4/4 ✓

Variant B

NNT mode: control-rate=0.3, experimental-rate=0.2

4/4 ✓

Variant C

Probability mode: pretest=0.15, lr=5.2

4/4 ✓

Edge

Negative TP value (TP=-5)

3/4 ⚠

Canonical: ✅ Pass

Diagnostic mode: TP=90, FN=10, TN=85, FP=15

sensitivity=0.9, specificity=0.85, ppv=0.8571, npv=0.8947, lr+=6.0, lr-=0.1176, accuracy=0.875. All values mathematically correct.

Basic 36/40 | Specialized 54/60 | Total 90/100

✅ A1: Sensitivity = TP / (TP + FN)

✅ A2: Specificity = TN / (TN + FP)

✅ A3: LR+ = sensitivity / (1 - specificity)

✅ A4: Output is valid JSON

Pass rate: 4 / 4
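
The canonical figures can be reproduced with a short Python sketch. This is not the skill's actual script: the function name `diagnostic_metrics`, the output keys, and the 4-decimal rounding are assumptions chosen to match the values reported above, and the sketch does not guard against zero denominators.

```python
import json


def diagnostic_metrics(tp: int, fn: int, tn: int, fp: int) -> dict:
    """Hypothetical helper: standard metrics from a 2x2 contingency table."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    metrics = {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": tp / (tp + fp),                   # sample-based PPV
        "npv": tn / (tn + fn),                   # sample-based NPV
        "lr+": sensitivity / (1 - specificity),  # positive likelihood ratio
        "lr-": (1 - sensitivity) / specificity,  # negative likelihood ratio
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
    }
    # Rounding to 4 decimals is an assumption that matches the report's precision.
    return {name: round(value, 4) for name, value in metrics.items()}


result = diagnostic_metrics(tp=90, fn=10, tn=85, fp=15)
print(json.dumps(result))
```

Serializing through `json.dumps` mirrors assertion A4, which requires the output to be valid JSON.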

Variant A: ✅ Pass

Diagnostic mode with prevalence adjustment (prevalence=0.1)

PPV adjusted via Bayes' theorem: (0.9*0.1)/((0.9*0.1)+(0.15*0.9)) = 0.4. Correct.

Basic 36/40 | Specialized 54/60 | Total 90/100

✅ A1: PPV uses Bayes' theorem when prevalence is provided

✅ A2: NPV uses Bayes' theorem when prevalence is provided

✅ A3: PPV is lower than sample-based PPV when prevalence < 0.5

✅ A4: Output is valid JSON

Pass rate: 4 / 4
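
The prevalence adjustment above can be sketched as follows. The function name `bayes_predictive_values` and its signature are hypothetical, not taken from the skill's script; the formulas are the standard Bayes' theorem forms for PPV and NPV.

```python
def bayes_predictive_values(sensitivity: float, specificity: float,
                            prevalence: float) -> tuple[float, float]:
    """Hypothetical helper: prevalence-adjusted PPV and NPV via Bayes' theorem."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    true_neg = specificity * (1 - prevalence)
    false_neg = (1 - sensitivity) * prevalence
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


# Sensitivity and specificity come from the canonical 2x2 table (TP=90, FN=10, TN=85, FP=15).
ppv, npv = bayes_predictive_values(sensitivity=0.9, specificity=0.85, prevalence=0.1)
sample_ppv = 90 / (90 + 15)
print(round(ppv, 4), round(sample_ppv, 4))
```

With a prevalence of 0.1 the adjusted PPV falls well below the sample-based PPV, which is the behavior assertion A3 checks.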

Variant B: ✅ Pass

NNT mode: control-rate=0.3, experimental-rate=0.2

ARR=0.1, RRR=0.3333, NNT=10.0. All correct.

Basic 36/40 | Specialized 54/60 | Total 90/100

✅ A1: ARR = control_rate - experimental_rate

✅ A2: NNT = 1 / ARR

✅ A3: RRR = ARR / control_rate

✅ A4: Output is valid JSON

Pass rate: 4 / 4
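
A minimal sketch of the NNT arithmetic checked above. The name `nnt_metrics`, the output keys, and the zero-ARR guard are assumptions for illustration; the report does not say how the skill's script handles equal event rates.

```python
def nnt_metrics(control_rate: float, experimental_rate: float) -> dict:
    """Hypothetical helper: ARR, RRR and NNT from two event rates."""
    arr = control_rate - experimental_rate  # absolute risk reduction
    if arr == 0:
        # Assumed behavior: NNT is undefined when the rates are equal.
        raise ValueError("ARR is zero; NNT is undefined.")
    rrr = arr / control_rate                # relative risk reduction
    nnt = 1 / arr                           # number needed to treat
    # Rounding hides binary floating-point noise (0.3 - 0.2 is not exactly 0.1).
    return {"arr": round(arr, 4), "rrr": round(rrr, 4), "nnt": round(nnt, 4)}


nnt_result = nnt_metrics(control_rate=0.3, experimental_rate=0.2)
print(nnt_result)
```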

Variant C: ✅ Pass

Probability mode: pretest=0.15, lr=5.2

pretest_odds=0.1765, posttest_odds=0.9176, posttest_prob=0.4785. Correct Fagan nomogram calculation.

Basic 36/40 | Specialized 54/60 | Total 90/100

✅ A1: Pretest odds = pretest_prob / (1 - pretest_prob)

✅ A2: Posttest odds = pretest_odds * LR

✅ A3: Posttest probability = posttest_odds / (1 + posttest_odds)

✅ A4: Output is valid JSON

Pass rate: 4 / 4
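
The three-step odds conversion in assertions A1 to A3 can be sketched as below. The function name `posttest_probability` and the returned keys are hypothetical; the sketch assumes a pretest probability strictly below 1.

```python
def posttest_probability(pretest_prob: float, lr: float) -> dict:
    """Hypothetical helper: pre-test to post-test probability via odds."""
    pretest_odds = pretest_prob / (1 - pretest_prob)    # probability -> odds
    posttest_odds = pretest_odds * lr                   # apply the likelihood ratio
    posttest_prob = posttest_odds / (1 + posttest_odds) # odds -> probability
    return {
        "pretest_odds": round(pretest_odds, 4),
        "posttest_odds": round(posttest_odds, 4),
        "posttest_prob": round(posttest_prob, 4),
    }


prob_result = posttest_probability(pretest_prob=0.15, lr=5.2)
print(prob_result)
```

Each intermediate value is rounded only on output, so rounding error does not carry from one step into the next.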

Edge: ⚠️ Warning

Negative TP value (TP=-5)

SKILL.md now documents that negative values must be rejected with the message 'Confusion matrix values must be non-negative.' The script still needs to enforce this at runtime.

Basic 26/40 | Specialized 40/60 | Total 66/100

✅ A1: Negative TP value rejection is documented with exact error message

✅ A2: Output is valid JSON even for edge inputs

✅ A3: Prevalence value outside 0-1 range rejection is documented

❌ A4: Script actually enforces negative value rejection at runtime

Pass rate: 3 / 4
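
One way the failing assertion A4 could be addressed is a runtime check along these lines. This is a sketch under assumptions, not the skill's code: the function name `validate_inputs` is hypothetical, the prevalence error message is invented for illustration (only the confusion-matrix message is quoted in the report), and treating 0 and 1 as valid prevalence values is a guess.

```python
from typing import Optional


def validate_inputs(tp: float, fn: float, tn: float, fp: float,
                    prevalence: Optional[float] = None) -> None:
    """Hypothetical validator: raise ValueError on invalid inputs."""
    if min(tp, fn, tn, fp) < 0:
        # Message text as documented in SKILL.md per the report.
        raise ValueError("Confusion matrix values must be non-negative.")
    if prevalence is not None and not 0 <= prevalence <= 1:
        # Assumed wording; the report does not quote the prevalence message.
        raise ValueError("Prevalence must be between 0 and 1.")


try:
    validate_inputs(tp=-5, fn=10, tn=85, fp=15)
except ValueError as exc:
    message = str(exc)
    print(message)
```

Calling such a check before any metric is computed would reject the TP=-5 edge case instead of producing a result from it.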

Medical Task Total: 87.2 / 100

Key Strengths

All three calculation modes (diagnostic, NNT, probability) produce mathematically correct results verified against EBM textbook formulas
Validation rules for negative confusion matrix values and out-of-range prevalence now documented with exact error messages
Bayes' theorem PPV/NPV adjustment with prevalence is correctly implemented; references/guidelines.md cites Sackett et al.
Clean class-based design with good docstrings; 177 lines; no external dependencies