ebm-calculator

Evidence-Based Medicine diagnostic test calculator. Computes sensitivity, specificity, PPV, NPV, likelihood ratios, NNT, and pre/post-test probability from 2x2 contingency table inputs.

Total Score: 88 / 100
Core Capability: 90 / 100
Functional Suitability: 11 / 12
Reliability: 11 / 12
Performance & Context: 7 / 8
Agent Usability: 15 / 16
Human Usability: 8 / 8
Security: 11 / 12
Maintainability: 12 / 12
Agent-Specific: 15 / 20
Medical Task: 19 / 20 Passed
Diagnostic mode: TP=90, FN=10, TN=85, FP=15 | Score 90 | Assertions 4/4
Diagnostic mode with prevalence adjustment (prevalence=0.1) | Score 90 | Assertions 4/4
NNT mode: control-rate=0.3, experimental-rate=0.2 | Score 90 | Assertions 4/4
Probability mode: pretest=0.15, lr=5.2 | Score 90 | Assertions 4/4
Negative TP value (TP=-5) | Score 66 | Assertions 3/4

Veto Gates: Required pass for any deployment consideration

Skill Veto: ✓ All 4 gates passed
Operational Stability: PASS. System remains stable across varied inputs and edge cases.
Structural Consistency: PASS. Output structure conforms to expected skill contract format.
Result Determinism: PASS. Equivalent inputs produce semantically equivalent outputs.
System Security: PASS. No prompt injection, data leakage, or unsafe tool use detected.

Core Capability: 90 / 100 (8 Categories)

Functional Suitability: 11 / 12 (92%)
All three modes work correctly; validation rules for negative values and prevalence range now documented; result variable initialization documented
Reliability: 11 / 12 (92%)
Negative TP/FP/FN/TN validation documented with exact error message; prevalence range validation documented; Optional[float] type annotation fix documented
Performance & Context: 7 / 8 (88%)
Token usage is proportional to input complexity. Execution overhead is acceptable for clinical/medical data processing tasks, though there is room for compression.
Agent Usability: 15 / 16 (94%)
Three modes clearly documented; validation rules section added; fallback template present; references/guidelines.md present
Human Usability: 8 / 8 (100%)
When-to-Use and When-Not-to-Use sections are clearly stated; error scenarios and recovery paths are documented for typical clinical/medical data processing use cases
Security: 11 / 12 (92%)
No credential concerns; all inputs are numeric; no injection risk
Maintainability: 12 / 12 (100%)
Clean class-based design; 177 lines; good docstrings; type hints present; validation rules documented
Agent-Specific: 15 / 20 (75%)
Trigger description is precise; output format documented; validation rules documented; LSP type error documented so it can be fixed
Core Capability Total: 90 / 100

Medical Task: Execution Average 87.2 / 100; Assertions 19/20 Passed

Canonical: ✅ Pass, 90 / 100
Diagnostic mode: TP=90, FN=10, TN=85, FP=15

sensitivity=0.9, specificity=0.85, ppv=0.8571, npv=0.8947, lr+=6.0, lr-=0.1176, accuracy=0.875. All values mathematically correct.

Basic 36/40 | Specialized 54/60 | Total 90/100
A1: Sensitivity = TP / (TP + FN)
A2: Specificity = TN / (TN + FP)
A3: LR+ = sensitivity / (1 - specificity)
A4: Output is valid JSON
Pass rate: 4 / 4
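
The canonical case can be reproduced with a minimal sketch like the one below. The function name `diagnostic_metrics`, the key names, and the 4-decimal rounding are assumptions made for illustration; this is not the skill's actual script, and it does not guard against empty rows or columns (division by zero).

```python
import json


def diagnostic_metrics(tp: int, fn: int, tn: int, fp: int) -> dict:
    """Sample-based diagnostic accuracy metrics from a 2x2 table (hypothetical helper)."""
    sensitivity = tp / (tp + fn)          # A1: TP / (TP + FN)
    specificity = tn / (tn + fp)          # A2: TN / (TN + FP)
    return {
        "sensitivity": round(sensitivity, 4),
        "specificity": round(specificity, 4),
        "ppv": round(tp / (tp + fp), 4),
        "npv": round(tn / (tn + fn), 4),
        "lr+": round(sensitivity / (1 - specificity), 4),   # A3
        "lr-": round((1 - sensitivity) / specificity, 4),
        "accuracy": round((tp + tn) / (tp + fn + tn + fp), 4),
    }


# A4: the result serializes to valid JSON.
print(json.dumps(diagnostic_metrics(tp=90, fn=10, tn=85, fp=15)))
```

With TP=90, FN=10, TN=85, FP=15 this yields the values reported above (for example ppv=0.8571 and lr+=6.0).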
Variant A: ✅ Pass, 90 / 100
Diagnostic mode with prevalence adjustment (prevalence=0.1)

PPV adjusted via Bayes theorem: (0.9*0.1)/((0.9*0.1)+(0.15*0.9)) = 0.4. Correct.

Basic 36/40 | Specialized 54/60 | Total 90/100
A1: PPV uses Bayes theorem when prevalence is provided
A2: NPV uses Bayes theorem when prevalence is provided
A3: PPV is lower than sample-based PPV when prevalence < 0.5
A4: Output is valid JSON
Pass rate: 4 / 4
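
The prevalence adjustment can be sketched as follows. The function name `predictive_values` and its signature are hypothetical; the formulas are the standard Bayes expressions for PPV and NPV, and the sketch assumes sensitivity and specificity have already been computed from the 2x2 table.

```python
def predictive_values(sensitivity: float, specificity: float,
                      prevalence: float) -> dict:
    """Prevalence-adjusted PPV and NPV via Bayes theorem (hypothetical helper)."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    true_neg = specificity * (1 - prevalence)
    false_neg = (1 - sensitivity) * prevalence
    return {
        "ppv": round(true_pos / (true_pos + false_pos), 4),
        "npv": round(true_neg / (true_neg + false_neg), 4),
    }


# Canonical table (sensitivity=0.9, specificity=0.85) at prevalence=0.1.
adjusted = predictive_values(0.9, 0.85, 0.1)
print(adjusted["ppv"])  # 0.4, versus the sample-based PPV of 0.8571
```

The drop from 0.8571 to 0.4 is what assertion A3 checks: the canonical sample has an implied prevalence of 0.5, so a lower prevalence lowers PPV.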
Variant B: ✅ Pass, 90 / 100
NNT mode: control-rate=0.3, experimental-rate=0.2

ARR=0.1, RRR=0.3333, NNT=10.0. All correct.

Basic 36/40 | Specialized 54/60 | Total 90/100
A1: ARR = control_rate - experimental_rate
A2: NNT = 1 / ARR
A3: RRR = ARR / control_rate
A4: Output is valid JSON
Pass rate: 4 / 4
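
A minimal sketch of the NNT arithmetic, assuming event rates are given as proportions. The function name `nnt_metrics`, the zero-ARR error, and its message are assumptions for illustration, not the skill's documented behavior.

```python
def nnt_metrics(control_rate: float, experimental_rate: float) -> dict:
    """ARR, RRR and NNT from two event rates (hypothetical helper)."""
    arr = control_rate - experimental_rate       # A1: absolute risk reduction
    if arr == 0:
        # Assumption: with no risk difference NNT is undefined, so fail loudly.
        raise ValueError("ARR is zero; NNT is undefined.")
    return {
        "arr": round(arr, 4),
        "rrr": round(arr / control_rate, 4),     # A3: relative risk reduction
        "nnt": round(1 / arr, 4),                # A2: number needed to treat
    }


print(nnt_metrics(control_rate=0.3, experimental_rate=0.2))
```

Rounding matters here: in floating point, 0.3 - 0.2 is slightly below 0.1, so the unrounded NNT is marginally above 10.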
Variant C: ✅ Pass, 90 / 100
Probability mode: pretest=0.15, lr=5.2

pretest_odds=0.1765, posttest_odds=0.9176, posttest_prob=0.4785. Correct Fagan nomogram calculation.

Basic 36/40 | Specialized 54/60 | Total 90/100
A1: Pretest odds = pretest_prob / (1 - pretest_prob)
A2: Posttest odds = pretest_odds * LR
A3: Posttest probability = posttest_odds / (1 + posttest_odds)
A4: Output is valid JSON
Pass rate: 4 / 4
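
The three steps of the probability mode can be sketched as below. The function name `posttest_probability` and the output key names are hypothetical; the sketch assumes a pretest probability strictly below 1.

```python
def posttest_probability(pretest_prob: float, lr: float) -> dict:
    """Probability -> odds -> apply LR -> probability (hypothetical helper)."""
    pretest_odds = pretest_prob / (1 - pretest_prob)    # A1
    posttest_odds = pretest_odds * lr                   # A2
    posttest_prob = posttest_odds / (1 + posttest_odds) # A3
    return {
        "pretest_odds": round(pretest_odds, 4),
        "posttest_odds": round(posttest_odds, 4),
        "posttest_prob": round(posttest_prob, 4),
    }


print(posttest_probability(pretest_prob=0.15, lr=5.2))
```

For pretest=0.15 and lr=5.2 this reproduces the reported 0.1765, 0.9176 and 0.4785.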
Edge: ⚠️ Warning, 66 / 100
Negative TP value (TP=-5)

SKILL.md now documents that negative values must be rejected with 'Confusion matrix values must be non-negative.' Script still needs to enforce this at runtime.

Basic 26/40 | Specialized 40/60 | Total 66/100
A1: Negative TP value rejection is documented with exact error message
A2: Output is valid JSON even for edge inputs
A3: Prevalence value outside 0-1 range rejection is documented
A4: Script actually enforces negative value rejection at runtime
Pass rate: 3 / 4
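
One way the missing runtime check could look is sketched below. The function name `validate_inputs` is hypothetical, and the prevalence error wording is an assumption, since the report quotes only the message for negative counts.

```python
from typing import Optional


def validate_inputs(tp: int, fn: int, tn: int, fp: int,
                    prevalence: Optional[float] = None) -> None:
    """Reject invalid inputs before any metric is computed (hypothetical helper)."""
    if min(tp, fn, tn, fp) < 0:
        # Message text as quoted from SKILL.md in this report.
        raise ValueError("Confusion matrix values must be non-negative.")
    if prevalence is not None and not 0 <= prevalence <= 1:
        # Assumed wording: the report does not quote the prevalence message.
        raise ValueError("Prevalence must be between 0 and 1.")


try:
    validate_inputs(tp=-5, fn=10, tn=85, fp=15)
except ValueError as exc:
    print(exc)
```

Calling a check like this at the top of the diagnostic mode would let the edge case fail with the documented message instead of producing metrics from a negative count.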
Medical Task Total: 87.2 / 100

Key Strengths

  • All three calculation modes (diagnostic, NNT, probability) produce mathematically correct results verified against EBM textbook formulas
  • Validation rules for negative confusion matrix values and out-of-range prevalence now documented with exact error messages
  • Bayes theorem PPV/NPV adjustment with prevalence is correctly implemented; references/guidelines.md cites Sackett et al.
  • Clean class-based design with good docstrings; 177 lines; no external dependencies