Other
ebm-calculator
Evidence-Based Medicine diagnostic test calculator. Computes sensitivity, specificity, PPV, NPV, likelihood ratios, NNT, and pre/post-test probability from 2x2 contingency table inputs.
88 / 100 Total Score
Core Capability
90 / 100
Functional Suitability
11 / 12
Reliability
11 / 12
Performance & Context
7 / 8
Agent Usability
15 / 16
Human Usability
8 / 8
Security
11 / 12
Maintainability
12 / 12
Agent-Specific
15 / 20
Medical Task
19 / 20 Passed
Diagnostic mode: TP=90, FN=10, TN=85, FP=15 | Score 90 | Assertions 4/4
Diagnostic mode with prevalence adjustment (prevalence=0.1) | Score 90 | Assertions 4/4
NNT mode: control-rate=0.3, experimental-rate=0.2 | Score 90 | Assertions 4/4
Probability mode: pretest=0.15, lr=5.2 | Score 90 | Assertions 4/4
Negative TP value (TP=-5) | Score 66 | Assertions 3/4
Veto Gates
Required pass for any deployment consideration
Skill Veto: ✓ All 4 gates passed
✓ Operational Stability
System remains stable across varied inputs and edge cases
PASS
✓ Structural Consistency
Output structure conforms to expected skill contract format
PASS
✓ Result Determinism
Equivalent inputs produce semantically equivalent outputs
PASS
✓ System Security
No prompt injection, data leakage, or unsafe tool use detected
PASS
Core Capability: 90 / 100 (8 Categories)
Functional Suitability
All three modes work correctly; validation rules for negative values and prevalence range now documented; result variable initialization documented
11 / 12
92%
Reliability
Negative TP/FP/FN/TN validation documented with exact error message; prevalence range validation documented; Optional[float] type annotation fix documented
11 / 12
92%
Performance & Context
Token usage scales with input complexity; execution overhead is acceptable for clinical data processing tasks, though there is still room for compression.
7 / 8
88%
Agent Usability
Three modes clearly documented; validation rules section added; fallback template present; references/guidelines.md present
15 / 16
94%
Human Usability
When-to-Use and When-Not-to-Use sections are clearly stated; error scenarios and recovery paths are documented for typical clinical/medical data processing use cases.
8 / 8
100%
Security
No credential concerns; all inputs are numeric; no injection risk
11 / 12
92%
Maintainability
Clean class-based design; 177 lines; good docstrings; type hints present; validation rules documented
12 / 12
100%
Agent-Specific
Trigger description is precise; output format documented; validation rules documented; LSP-reported type error documented so it can be fixed
15 / 20
75%
Core Capability Total: 90 / 100
Medical Task
Execution Average: 87.2 / 100 | Assertions: 19/20 Passed
Canonical: ✅ Pass | Score 90
Diagnostic mode: TP=90, FN=10, TN=85, FP=15
sensitivity=0.9, specificity=0.85, ppv=0.8571, npv=0.8947, lr+=6.0, lr-=0.1176, accuracy=0.875. All values mathematically correct.
Basic 36/40 | Specialized 54/60 | Total 90/100
✅ A1: Sensitivity = TP / (TP + FN)
✅ A2: Specificity = TN / (TN + FP)
✅ A3: LR+ = sensitivity / (1 - specificity)
✅ A4: Output is valid JSON
Pass rate: 4 / 4
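The Canonical assertions can be checked by hand with a minimal sketch like the one below. It is not the skill's actual script: the function name `diagnostic_metrics`, the output key names, and the four-decimal rounding are assumptions chosen to match the values reported above.

```python
import json


def diagnostic_metrics(tp: int, fn: int, tn: int, fp: int) -> dict:
    """Sample-based diagnostic metrics from a 2x2 table (hypothetical helper).

    No guard against zero denominators: a sketch only.
    """
    sensitivity = tp / (tp + fn)   # A1
    specificity = tn / (tn + fp)   # A2
    return {
        "sensitivity": round(sensitivity, 4),
        "specificity": round(specificity, 4),
        "ppv": round(tp / (tp + fp), 4),
        "npv": round(tn / (tn + fn), 4),
        "lr_positive": round(sensitivity / (1 - specificity), 4),  # A3
        "lr_negative": round((1 - sensitivity) / specificity, 4),
        "accuracy": round((tp + tn) / (tp + fn + tn + fp), 4),
    }


if __name__ == "__main__":
    # A4: the result serializes to valid JSON.
    print(json.dumps(diagnostic_metrics(tp=90, fn=10, tn=85, fp=15), indent=2))
```

With TP=90, FN=10, TN=85, FP=15 this reproduces the reported sensitivity 0.9, specificity 0.85, PPV 0.8571, NPV 0.8947, LR+ 6.0, LR- 0.1176, and accuracy 0.875.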
Variant A: ✅ Pass | Score 90
Diagnostic mode with prevalence adjustment (prevalence=0.1)
PPV adjusted via Bayes' theorem: (0.9*0.1)/((0.9*0.1)+(0.15*0.9)) = 0.4. Correct.
Basic 36/40 | Specialized 54/60 | Total 90/100
✅ A1: PPV uses Bayes' theorem when prevalence is provided
✅ A2: NPV uses Bayes' theorem when prevalence is provided
✅ A3: PPV is lower than sample-based PPV when prevalence < 0.5
✅ A4: Output is valid JSON
Pass rate: 4 / 4
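The prevalence adjustment in Variant A can be sketched as follows. The function name `predictive_values` and its signature are hypothetical; only the Bayes formulas themselves come from the assertions above.

```python
def predictive_values(sensitivity: float, specificity: float,
                      prevalence: float) -> tuple[float, float]:
    """Prevalence-adjusted PPV and NPV via Bayes' theorem (hypothetical helper)."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    true_neg = specificity * (1 - prevalence)
    false_neg = (1 - sensitivity) * prevalence
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


if __name__ == "__main__":
    ppv, npv = predictive_values(sensitivity=0.9, specificity=0.85, prevalence=0.1)
    print(round(ppv, 4), round(npv, 4))  # 0.4 0.9871
```

At 10% prevalence the adjusted PPV of 0.4 sits well below the sample-based PPV of 0.8571, which is what assertion A3 checks.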
Variant B: ✅ Pass | Score 90
NNT mode: control-rate=0.3, experimental-rate=0.2
ARR=0.1, RRR=0.3333, NNT=10.0. All correct.
Basic 36/40 | Specialized 54/60 | Total 90/100
✅ A1: ARR = control_rate - experimental_rate
✅ A2: NNT = 1 / ARR
✅ A3: RRR = ARR / control_rate
✅ A4: Output is valid JSON
Pass rate: 4 / 4
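A minimal sketch of the three NNT-mode formulas follows. The name `nnt_metrics`, the output keys, and the zero-ARR guard are assumptions for illustration; the report does not say how the real script handles a zero risk difference or a zero control rate.

```python
def nnt_metrics(control_rate: float, experimental_rate: float) -> dict:
    """ARR, RRR and NNT from two event rates (hypothetical helper)."""
    arr = control_rate - experimental_rate          # A1
    if arr == 0:
        # Assumed behavior: NNT is undefined when the rates are equal.
        raise ValueError("ARR is zero; NNT is undefined.")
    return {
        "arr": round(arr, 4),
        "rrr": round(arr / control_rate, 4),        # A3
        "nnt": round(1 / arr, 4),                   # A2
    }


if __name__ == "__main__":
    print(nnt_metrics(control_rate=0.3, experimental_rate=0.2))
    # {'arr': 0.1, 'rrr': 0.3333, 'nnt': 10.0}
```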
Variant C: ✅ Pass | Score 90
Probability mode: pretest=0.15, lr=5.2
pretest_odds=0.1765, posttest_odds=0.9176, posttest_prob=0.4785. Correct Fagan nomogram calculation.
Basic 36/40 | Specialized 54/60 | Total 90/100
✅ A1: Pretest odds = pretest_prob / (1 - pretest_prob)
✅ A2: Posttest odds = pretest_odds * LR
✅ A3: Posttest probability = posttest_odds / (1 + posttest_odds)
✅ A4: Output is valid JSON
Pass rate: 4 / 4
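The odds conversion checked in Variant C can be written out in a few lines. This is a sketch under assumptions: `posttest_probability` is a hypothetical name, and the dictionary keys simply mirror the field names quoted in the result above.

```python
def posttest_probability(pretest_prob: float, lr: float) -> dict:
    """Probability -> odds -> apply LR -> probability (hypothetical helper)."""
    pretest_odds = pretest_prob / (1 - pretest_prob)     # A1
    posttest_odds = pretest_odds * lr                    # A2
    posttest_prob = posttest_odds / (1 + posttest_odds)  # A3
    return {
        "pretest_odds": round(pretest_odds, 4),
        "posttest_odds": round(posttest_odds, 4),
        "posttest_prob": round(posttest_prob, 4),
    }


if __name__ == "__main__":
    print(posttest_probability(pretest_prob=0.15, lr=5.2))
    # {'pretest_odds': 0.1765, 'posttest_odds': 0.9176, 'posttest_prob': 0.4785}
```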
Edge: ⚠️ Warning | Score 66
Negative TP value (TP=-5)
SKILL.md now documents that negative values must be rejected with 'Confusion matrix values must be non-negative.' Script still needs to enforce this at runtime.
Basic 26/40 | Specialized 40/60 | Total 66/100
✅ A1: Negative TP value rejection is documented with exact error message
✅ A2: Output is valid JSON even for edge inputs
✅ A3: Prevalence value outside 0-1 range rejection is documented
❌ A4: Script actually enforces negative value rejection at runtime
Pass rate: 3 / 4
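One way the failing A4 check could be closed is a runtime guard along these lines. This is a sketch, not the skill's code: `validate_inputs` and `error_json` are hypothetical names, the prevalence message and the inclusive 0 to 1 bounds are assumptions, and only the non-negative message is quoted from SKILL.md as reported above.

```python
import json
from typing import Optional


def validate_inputs(tp: float, fn: float, tn: float, fp: float,
                    prevalence: Optional[float] = None) -> None:
    """Raise ValueError on invalid 2x2 inputs (hypothetical helper)."""
    if any(value < 0 for value in (tp, fn, tn, fp)):
        # Exact message quoted in the report.
        raise ValueError("Confusion matrix values must be non-negative.")
    if prevalence is not None and not 0 <= prevalence <= 1:
        # Assumed wording and assumed inclusive bounds.
        raise ValueError("Prevalence must be between 0 and 1.")


def error_json(exc: Exception) -> str:
    """Keep output valid JSON on rejection (assumed error shape)."""
    return json.dumps({"error": str(exc)})


if __name__ == "__main__":
    try:
        validate_inputs(tp=-5, fn=10, tn=85, fp=15)
    except ValueError as exc:
        print(error_json(exc))
```

Calling the guard before any arithmetic keeps a negative count from producing a plausible-looking but meaningless metric.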
Medical Task Total: 87.2 / 100
Key Strengths
- All three calculation modes (diagnostic, NNT, probability) produce mathematically correct results verified against EBM textbook formulas
- Validation rules for negative confusion matrix values and out-of-range prevalence now documented with exact error messages
- Bayes' theorem PPV/NPV adjustment with prevalence is correctly implemented; references/guidelines.md cites Sackett et al.
- Clean class-based design with good docstrings; 177 lines; no external dependencies