Data Analysis

km-survival-curve

Generate Kaplan-Meier survival curves with log-rank tests for biomarker or molecular subgroup stratification. Inputs: survival time, event status, stratification variable. Outputs: KM plot, log-rank p-value, median survival table, HR estimate.

92100Total Score
Core Capability
92 / 100
Functional Suitability
11 / 12
Reliability
11 / 12
Performance & Context
8 / 8
Agent Usability
15 / 16
Human Usability
7 / 8
Security
11 / 12
Maintainability
11 / 12
Agent-Specific
18 / 20
Medical Task
25 / 25 Passed
94Default KM run on bundled sample1
5/5
93Wald route on bundled sample2
5/5
85Continuous risk column rejection
5/5
91Custom-column CSV workflow
5/5
93Large cohort with multiple plot overrides
5/5

Veto GatesRequired pass for any deployment consideration

Skill Veto✓ All 4 gates passed
Operational Stability
System remains stable across varied inputs and edge cases
PASS
Structural Consistency
Output structure conforms to expected skill contract format
PASS
Result Determinism
Equivalent inputs produce semantically equivalent outputs
PASS
System Security
No prompt injection, data leakage, or unsafe tool use detected
PASS
Research Veto✅ PASS — Applicable
DimensionResultDetail
Scientific IntegrityPASSNo fabricated PMIDs, trial results, p-values, or unsupported scientific claims were observed in any output.
Practice BoundariesPASSThe skill stayed within data-analysis boundaries and did not issue diagnostic, prescriptive, or treatment advice.
Methodological GroundPASSThe workflow applies Kaplan-Meier and documented p-value routes correctly, with explicit rejection of unsupported continuous risk columns.
Code UsabilityPASSThe R workflow executed successfully on four valid cases, and the invalid case failed cleanly with a usable remediation message.

Core Capability92 / 1008 Categories

Functional Suitability
The skill covers the documented KM workflow thoroughly; the main gap is that numeric-coded categorical groups can be rejected by the continuity heuristic.
11 / 12
92%
Reliability
Validation and remediation are strong, but dependency failures are terse and failure-mode lifecycle details are not fully surfaced in the contract.
11 / 12
92%
Performance & Context
No issues flagged.
8 / 8
100%
Agent Usability
Instructions are clear and layered, though the time-conversion heuristic and failure lifecycle still require careful reading.
15 / 16
94%
Human Usability
Examples are practical, but the skill is intentionally strict about input shape and does not include a clarification workflow for near-miss inputs.
7 / 8
88%
Security
Input validation is strong and no sensitive operations are exposed, though output paths are accepted directly without extra policy constraints.
11 / 12
92%
Maintainability
The skill is cleanly separated into scripts and references, but it does not bundle an explicit automated regression command beyond the manual validation examples.
11 / 12
92%
Agent-Specific
Triggering, layering, composability, and escape hatches are strong; analytical idempotency is better documented than artifact-level stability.
18 / 20
90%
Core Capability Total92 / 100

Medical TaskExecution Average: 91.2 / 100 — Assertions: 25/25 Passed

94
Canonical
Default KM run on bundled sample1
5/5
93
Variant A
Wald route on bundled sample2
5/5
85
Edge
Continuous risk column rejection
5/5
91
Variant B
Custom-column CSV workflow
5/5
93
Stress
Large cohort with multiple plot overrides
5/5
94
Canonical✅ Pass
Default KM run on bundled sample1

Executed perfectly and produced both required artifacts.

Basic 38/40|Specialized 56/60|Total 94/100
A1Output validates and loads the requested dataset before fitting.
A2Output confirms the retained groups and sample count.
A3Output produces the required artifacts km-plot.pdf and session_info.txt.
A4Output stays within the skill scope of generating one KM figure and session metadata.
A5Output does not fabricate analytical claims beyond the run log.
Pass rate: 5 / 5
93
Variant A✅ Pass
Wald route on bundled sample2

Alternate documented statistical route completed cleanly.

Basic 38/40|Specialized 55/60|Total 93/100
A1Output completes successfully on the alternate documented statistical path.
A2Output preserves the skill's single-figure contract.
A3Output handles a different valid cohort without extra user intervention.
A4Output does not exceed scope with unsupported modeling claims.
A5Output remains concise and operationally clear.
Pass rate: 5 / 5
85
Edge✅ Pass
Continuous risk column rejection

Rejected unsupported continuous-looking risk input before fitting, with actionable remediation.

Basic 35/40|Specialized 50/60|Total 85/100
A1Output rejects a continuous-looking risk column before model fitting.
A2Output gives an actionable remediation path.
A3Output avoids stack traces or undefined behavior.
A4Output stays within scope by refusing unsupported grouping semantics.
A5Output does not produce misleading analytical artifacts after the validation failure.
Pass rate: 5 / 5
91
Variant B✅ Pass
Custom-column CSV workflow

Custom-column CSV path succeeded and emitted the documented conversion warning.

Basic 37/40|Specialized 54/60|Total 91/100
A1Output supports the documented custom-column workflow.
A2Output warns when heuristic time conversion is applied.
A3Output still emits the required artifacts after conversion.
A4Output does not silently change units without notice.
A5Output stays within the skill's stated analysis scope.
Pass rate: 5 / 5
93
Stress✅ Pass
Large cohort with multiple plot overrides

The high-parameter run remained stable and produced the expected artifacts.

Basic 38/40|Specialized 55/60|Total 93/100
A1Output handles a larger cohort with multiple plotting overrides.
A2Output honors explicit time-unit handling without conversion.
A3Output preserves the single-figure contract under heavier customization.
A4Output validates complex-but-supported plotting options without instability.
A5Output remains concise, safe, and reproducible in scope.
Pass rate: 5 / 5
Medical Task Total91.2 / 100

Key Strengths

  • The skill has a clear CLI contract with strong validation and actionable remediation messages.
  • The script executed successfully across baseline, alternate, custom-column, and higher-parameter cases.
  • Documentation, scripts, and references are well-layered and keep the core workflow easy to follow.
  • The workflow stays tightly scoped to producing a single Kaplan-Meier figure and session metadata.