Data Analysis
km-survival-curve
Generate Kaplan-Meier survival curves with log-rank tests for biomarker or molecular subgroup stratification. Inputs: survival time, event status, stratification variable. Outputs: KM plot, log-rank p-value, median survival table, HR estimate.
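For readers unfamiliar with the estimator named above, the following is a minimal pure-Python sketch of the Kaplan-Meier product-limit calculation and the median survival lookup. It is an illustration only, not the skill's implementation (the report describes the skill as an R workflow); the function names `kaplan_meier` and `median_survival` are hypothetical.

```python
from collections import Counter

def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct event time (product-limit estimator).

    times  : follow-up durations
    events : 1 if the event was observed, 0 if censored
    """
    deaths = Counter(t for t, e in zip(times, events) if e == 1)
    exits = Counter(times)  # every subject leaves the risk set at their own time
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = deaths.get(t, 0)
        if d > 0:
            # Subjects censored at t still count as at risk at t.
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        at_risk -= exits[t]
    return curve

def median_survival(curve):
    """First time at which S(t) falls to 0.5 or below, or None if never reached."""
    for t, s in curve:
        if s <= 0.5:
            return t
    return None

# Toy cohort: five subjects, two censored (event = 0).
curve = kaplan_meier([1, 2, 3, 4, 5], [1, 0, 1, 1, 0])
print(curve, median_survival(curve))
```

In this toy cohort the curve steps down at times 1, 3, and 4, and the median is the first time the estimate reaches 0.5 or lower.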
92 / 100 Total Score
Core Capability: 92 / 100

| Category | Score |
|---|---|
| Functional Suitability | 11 / 12 |
| Reliability | 11 / 12 |
| Performance & Context | 8 / 8 |
| Agent Usability | 15 / 16 |
| Human Usability | 7 / 8 |
| Security | 11 / 12 |
| Maintainability | 11 / 12 |
| Agent-Specific | 18 / 20 |

Medical Task: 25 / 25 Passed

| Task | Score | Assertions |
|---|---|---|
| Default KM run on bundled sample1 | 94 | 5/5 |
| Wald route on bundled sample2 | 93 | 5/5 |
| Continuous risk column rejection | 85 | 5/5 |
| Custom-column CSV workflow | 91 | 5/5 |
| Large cohort with multiple plot overrides | 93 | 5/5 |

Veto Gates: required pass for any deployment consideration
Skill Veto: ✓ All 4 gates passed

| Gate | Criterion | Result |
|---|---|---|
| Operational Stability | System remains stable across varied inputs and edge cases | ✓ PASS |
| Structural Consistency | Output structure conforms to expected skill contract format | ✓ PASS |
| Result Determinism | Equivalent inputs produce semantically equivalent outputs | ✓ PASS |
| System Security | No prompt injection, data leakage, or unsafe tool use detected | ✓ PASS |

Research Veto: ✅ PASS (Applicable)

| Dimension | Result | Detail |
|---|---|---|
| Scientific Integrity | PASS | No fabricated PMIDs, trial results, p-values, or unsupported scientific claims were observed in any output. |
| Practice Boundaries | PASS | The skill stayed within data-analysis boundaries and did not issue diagnostic, prescriptive, or treatment advice. |
| Methodological Ground | PASS | The workflow applies Kaplan-Meier and documented p-value routes correctly, with explicit rejection of unsupported continuous risk columns. |
| Code Usability | PASS | The R workflow executed successfully on four valid cases, and the invalid case failed cleanly with a usable remediation message. |
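
The log-rank test named in the skill description can be illustrated with a small two-group sketch. This is a textbook formulation written for this report, not code taken from the skill; the function name `logrank_two_groups` is hypothetical, and the p-value uses the identity that a chi-square statistic with one degree of freedom has upper tail `erfc(sqrt(x / 2))`.

```python
import math

def logrank_two_groups(times, events, groups):
    """Two-group log-rank test. Returns (chi_square, p_value)."""
    labels = sorted(set(groups))
    if len(labels) != 2:
        raise ValueError("this sketch handles exactly two groups")
    ref = labels[0]
    event_times = sorted({t for t, e in zip(times, events) if e == 1})
    obs_minus_exp = 0.0
    variance = 0.0
    for t in event_times:
        # Risk-set sizes just before time t (overall and in the reference group).
        n = sum(1 for x in times if x >= t)
        n1 = sum(1 for x, g in zip(times, groups) if x >= t and g == ref)
        # Events at time t (overall and in the reference group).
        d = sum(1 for x, e in zip(times, events) if x == t and e == 1)
        d1 = sum(1 for x, e, g in zip(times, events, groups)
                 if x == t and e == 1 and g == ref)
        obs_minus_exp += d1 - d * n1 / n
        if n > 1:
            # Hypergeometric variance of d1 given the margins.
            variance += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    if variance == 0:
        raise ValueError("variance is zero; the test is undefined")
    chi2 = obs_minus_exp ** 2 / variance
    p_value = math.erfc(math.sqrt(chi2 / 2))  # chi-square upper tail, 1 df
    return chi2, p_value

chi2, p = logrank_two_groups([1, 2, 3, 4], [1, 1, 1, 1], ["A", "A", "B", "B"])
print(round(chi2, 4), round(p, 4))
```

The sketch recomputes the risk sets at every event time for clarity rather than speed, which is adequate for small illustrative cohorts.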
Core Capability: 92 / 100 (8 Categories)

| Category | Assessment | Score | % |
|---|---|---|---|
| Functional Suitability | The skill covers the documented KM workflow thoroughly; the main gap is that numeric-coded categorical groups can be rejected by the continuity heuristic. | 11 / 12 | 92% |
| Reliability | Validation and remediation are strong, but dependency failures are terse and failure-mode lifecycle details are not fully surfaced in the contract. | 11 / 12 | 92% |
| Performance & Context | No issues flagged. | 8 / 8 | 100% |
| Agent Usability | Instructions are clear and layered, though the time-conversion heuristic and failure lifecycle still require careful reading. | 15 / 16 | 94% |
| Human Usability | Examples are practical, but the skill is intentionally strict about input shape and does not include a clarification workflow for near-miss inputs. | 7 / 8 | 88% |
| Security | Input validation is strong and no sensitive operations are exposed, though output paths are accepted directly without extra policy constraints. | 11 / 12 | 92% |
| Maintainability | The skill is cleanly separated into scripts and references, but it does not bundle an explicit automated regression command beyond the manual validation examples. | 11 / 12 | 92% |
| Agent-Specific | Triggering, layering, composability, and escape hatches are strong; analytical idempotency is better documented than artifact-level stability. | 18 / 20 | 90% |

Core Capability Total: 92 / 100
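
The Functional Suitability note says that numeric-coded categorical groups can be rejected by the continuity heuristic. The report does not describe how that heuristic works, so the sketch below is purely hypothetical: it assumes a rule of the form "numeric column with more than `max_levels` distinct values", with `looks_continuous` and `max_levels` invented for illustration. It shows how such a rule could misclassify integer-coded groups, and how string labels would avoid the problem under the same assumption.

```python
def looks_continuous(values, max_levels=6):
    """Hypothetical heuristic: a numeric column with many distinct values.

    This is NOT the skill's actual rule; the threshold is an assumption.
    """
    try:
        numeric = [float(v) for v in values]
    except (TypeError, ValueError):
        return False  # non-numeric labels are treated as categorical
    return len(set(numeric)) > max_levels

risk_scores = [0.11, 0.52, 0.93, 1.4, 2.2, 3.1, 4.8]  # genuinely continuous
cluster_codes = [1, 2, 3, 4, 5, 6, 7, 8]              # categorical, coded as integers
cluster_labels = [f"C{code}" for code in cluster_codes]

print(looks_continuous(risk_scores))     # flagged as continuous
print(looks_continuous(cluster_codes))   # also flagged, although categorical
print(looks_continuous(cluster_labels))  # string labels pass as categorical
```

Under this assumed rule, recoding integer group codes as string labels before analysis would be one way to avoid a false rejection.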
Medical Task: Execution Average 91.2 / 100, Assertions 25/25 Passed

| Type | Task | Score | Assertions |
|---|---|---|---|
| Canonical | Default KM run on bundled sample1 | 94 | 5/5 ✓ |
| Variant A | Wald route on bundled sample2 | 93 | 5/5 ✓ |
| Edge | Continuous risk column rejection | 85 | 5/5 ✓ |
| Variant B | Custom-column CSV workflow | 91 | 5/5 ✓ |
| Stress | Large cohort with multiple plot overrides | 93 | 5/5 ✓ |

Canonical: ✅ Pass (94 / 100)
Default KM run on bundled sample1
Executed perfectly and produced both required artifacts.
Basic 38/40 | Specialized 56/60 | Total 94/100
✅ A1: Output validates and loads the requested dataset before fitting.
✅ A2: Output confirms the retained groups and sample count.
✅ A3: Output produces the required artifacts km-plot.pdf and session_info.txt.
✅ A4: Output stays within the skill scope of generating one KM figure and session metadata.
✅ A5: Output does not fabricate analytical claims beyond the run log.
Pass rate: 5 / 5
Variant A: ✅ Pass (93 / 100)
Wald route on bundled sample2
Alternate documented statistical route completed cleanly.
Basic 38/40 | Specialized 55/60 | Total 93/100
✅ A1: Output completes successfully on the alternate documented statistical path.
✅ A2: Output preserves the skill's single-figure contract.
✅ A3: Output handles a different valid cohort without extra user intervention.
✅ A4: Output does not exceed scope with unsupported modeling claims.
✅ A5: Output remains concise and operationally clear.
Pass rate: 5 / 5
Edge: ✅ Pass (85 / 100)
Continuous risk column rejection
Rejected unsupported continuous-looking risk input before fitting, with actionable remediation.
Basic 35/40 | Specialized 50/60 | Total 85/100
✅ A1: Output rejects a continuous-looking risk column before model fitting.
✅ A2: Output gives an actionable remediation path.
✅ A3: Output avoids stack traces or undefined behavior.
✅ A4: Output stays within scope by refusing unsupported grouping semantics.
✅ A5: Output does not produce misleading analytical artifacts after the validation failure.
Pass rate: 5 / 5
Variant B: ✅ Pass (91 / 100)
Custom-column CSV workflow
Custom-column CSV path succeeded and emitted the documented conversion warning.
Basic 37/40 | Specialized 54/60 | Total 91/100
✅ A1: Output supports the documented custom-column workflow.
✅ A2: Output warns when heuristic time conversion is applied.
✅ A3: Output still emits the required artifacts after conversion.
✅ A4: Output does not silently change units without notice.
✅ A5: Output stays within the skill's stated analysis scope.
Pass rate: 5 / 5
Stress: ✅ Pass (93 / 100)
Large cohort with multiple plot overrides
The high-parameter run remained stable and produced the expected artifacts.
Basic 38/40 | Specialized 55/60 | Total 93/100
✅ A1: Output handles a larger cohort with multiple plotting overrides.
✅ A2: Output honors explicit time-unit handling without conversion.
✅ A3: Output preserves the single-figure contract under heavier customization.
✅ A4: Output validates complex-but-supported plotting options without instability.
✅ A5: Output remains concise, safe, and reproducible in scope.
Pass rate: 5 / 5
Medical Task Total: 91.2 / 100
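
The report does not state how the 91.2 execution average is derived. The short check below assumes it is the unweighted mean of the five per-task totals, which is consistent with the listed numbers; it also confirms that each total equals its Basic plus Specialized parts. The helper name `execution_average` is hypothetical.

```python
# (Basic, Specialized, Total) per task, as listed in the report.
tasks = {
    "Canonical": (38, 56, 94),
    "Variant A": (38, 55, 93),
    "Edge": (35, 50, 85),
    "Variant B": (37, 54, 91),
    "Stress": (38, 55, 93),
}

def execution_average(task_scores):
    """Unweighted mean of the per-task totals (assumed aggregation rule)."""
    totals = [total for _, _, total in task_scores.values()]
    return sum(totals) / len(totals)

# Each total should equal Basic + Specialized.
consistent = all(basic + spec == total for basic, spec, total in tasks.values())
average = execution_average(tasks)
print(consistent, round(average, 1))
```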
Key Strengths
- The skill has a clear CLI contract with strong validation and actionable remediation messages.
- The script executed successfully across baseline, alternate, custom-column, and higher-parameter cases.
- Documentation, scripts, and references are well-layered and keep the core workflow easy to follow.
- The workflow stays tightly scoped to producing a single Kaplan-Meier figure and session metadata.