Data Analysis

km-survival-curve

Generate Kaplan-Meier survival curves with log-rank tests for biomarker or molecular subgroup stratification. Inputs: survival time, event status, stratification variable. Outputs: KM plot, log-rank p-value, median survival table, HR estimate.
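
For readers unfamiliar with the estimator behind the plot, here is a minimal pure-Python sketch of the Kaplan-Meier product-limit calculation and the median-survival lookup. It is an illustration only: the skill is an R workflow whose code is not shown in this report, the function names kaplan_meier and median_survival are hypothetical, and the log-rank test and HR estimate are not covered.

```python
from collections import Counter


def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct event time.

    times  -- follow-up durations, one per subject
    events -- 1 if the event was observed, 0 if the subject was censored
    """
    deaths = Counter(t for t, e in zip(times, events) if e == 1)
    exits = Counter(times)      # every subject leaving the risk set at t
    at_risk = len(times)
    surv = 1.0
    curve = []
    for t in sorted(exits):
        d = deaths.get(t, 0)
        if d:
            # Product-limit step: multiply by the conditional survival at t.
            surv *= 1.0 - d / at_risk
            curve.append((t, surv))
        # Events and censorings at t both leave the risk set after t.
        at_risk -= exits[t]
    return curve


def median_survival(curve):
    """First event time at which S(t) is 0.5 or lower, else None."""
    for t, s in curve:
        if s <= 0.5:
            return t
    return None
```

With times [1, 2, 3, 4, 5] and events [1, 0, 1, 1, 0], the curve steps down at times 1, 3, and 4 (to 0.8, about 0.533, and about 0.267), so the median survival is 4. A cohort with no observed events yields an empty curve and no median.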

92 / 100 Total Score

Core Capability

92 / 100

Functional Suitability

11 / 12

Reliability

11 / 12

Performance & Context

8 / 8

Agent Usability

15 / 16

Human Usability

7 / 8

Security

11 / 12

Maintainability

11 / 12

Agent-Specific

18 / 20

Medical Task

25 / 25 Passed

Default KM run on bundled sample1: 94 / 100

5/5

Wald route on bundled sample2: 93 / 100

5/5

Continuous risk column rejection: 85 / 100

5/5

Custom-column CSV workflow: 91 / 100

5/5

Large cohort with multiple plot overrides: 93 / 100

5/5

Veto Gates

Required pass for any deployment consideration

Skill Veto: ✓ All 4 gates passed

✓

Operational Stability

System remains stable across varied inputs and edge cases

PASS

✓

Structural Consistency

Output structure conforms to expected skill contract format

PASS

✓

Result Determinism

Equivalent inputs produce semantically equivalent outputs

PASS

✓

System Security

No prompt injection, data leakage, or unsafe tool use detected

PASS

Research Veto: ✅ PASS (Applicable)

Dimension	Result	Detail
Scientific Integrity	PASS	No fabricated PMIDs, trial results, p-values, or unsupported scientific claims were observed in any output.
Practice Boundaries	PASS	The skill stayed within data-analysis boundaries and did not issue diagnostic, prescriptive, or treatment advice.
Methodological Ground	PASS	The workflow applies Kaplan-Meier and documented p-value routes correctly, with explicit rejection of unsupported continuous risk columns.
Code Usability	PASS	The R workflow executed successfully on four valid cases, and the invalid case failed cleanly with a usable remediation message.

Core Capability: 92 / 100 (8 Categories)

Functional Suitability

The skill covers the documented KM workflow thoroughly; the main gap is that numeric-coded categorical groups can be rejected by the continuity heuristic.

11 / 12

92%
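
The false-rejection risk noted above is easiest to see with a concrete rule. The sketch below is a hypothetical continuity heuristic, not the skill's actual check (which this report does not show); the function name looks_continuous and the max_levels threshold are assumptions made for illustration.

```python
def looks_continuous(values, max_levels=6):
    """Hypothetical rule: flag a column as continuous when it is numeric
    and has either fractional values or more than max_levels distinct values.
    """
    try:
        numeric = [float(v) for v in values]
    except (TypeError, ValueError):
        return False            # non-numeric labels are treated as categorical
    distinct = set(numeric)
    has_fraction = any(not x.is_integer() for x in distinct)
    return has_fraction or len(distinct) > max_levels
```

Under this assumed rule, a column of cluster IDs coded 0 through 9 is flagged as continuous even though it is categorical, which is the kind of numeric-coded grouping the finding describes. Recoding such a column to string labels before the run would sidestep a rule of this shape.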

Reliability

Validation and remediation are strong, but dependency failures are terse and failure-mode lifecycle details are not fully surfaced in the contract.

11 / 12

92%

Performance & Context

No issues flagged.

8 / 8

100%

Agent Usability

Instructions are clear and layered, though the time-conversion heuristic and failure lifecycle still require careful reading.

15 / 16

94%

Human Usability

Examples are practical, but the skill is intentionally strict about input shape and does not include a clarification workflow for near-miss inputs.

7 / 8

88%

Security

Input validation is strong and no sensitive operations are exposed, though output paths are accepted directly without extra policy constraints.

11 / 12

92%
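
One possible form of the extra policy constraint mentioned above is to confine output paths to an approved base directory. The following is a minimal sketch under that assumption; resolve_output_path and the base-directory policy are hypothetical and are not part of the evaluated skill.

```python
from pathlib import Path


def resolve_output_path(requested, base_dir):
    """Resolve `requested` under `base_dir` and refuse paths that escape it."""
    base = Path(base_dir).resolve()
    # An absolute `requested` replaces `base` in the join, so it is only
    # accepted when it already points inside the base directory.
    target = (base / requested).resolve()
    try:
        target.relative_to(base)
    except ValueError:
        raise ValueError(f"output path {requested!r} is outside {base}") from None
    return target
```

A request such as "plots/km-plot.pdf" resolves inside the base directory, while "../escape.pdf" raises ValueError before anything is written.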

Maintainability

The skill is cleanly separated into scripts and references, but it does not bundle an explicit automated regression command beyond the manual validation examples.

11 / 12

92%

Agent-Specific

Triggering, layering, composability, and escape hatches are strong; analytical idempotency is better documented than artifact-level stability.

18 / 20

90%

Core Capability Total: 92 / 100

Medical Task

Execution Average: 91.2 / 100; Assertions: 25/25 Passed

Canonical: ✅ Pass

Default KM run on bundled sample1

Executed without errors and produced both required artifacts.

Basic 38/40 | Specialized 56/60 | Total 94/100

✅ A1: Output validates and loads the requested dataset before fitting.

✅ A2: Output confirms the retained groups and sample count.

✅ A3: Output produces the required artifacts km-plot.pdf and session_info.txt.

✅ A4: Output stays within the skill scope of generating one KM figure and session metadata.

✅ A5: Output does not fabricate analytical claims beyond the run log.

Pass rate: 5 / 5

Variant A: ✅ Pass

Wald route on bundled sample2

Alternate documented statistical route completed cleanly.

Basic 38/40 | Specialized 55/60 | Total 93/100

✅ A1: Output completes successfully on the alternate documented statistical path.

✅ A2: Output preserves the skill's single-figure contract.

✅ A3: Output handles a different valid cohort without extra user intervention.

✅ A4: Output does not exceed scope with unsupported modeling claims.

✅ A5: Output remains concise and operationally clear.

Pass rate: 5 / 5

Edge: ✅ Pass

Continuous risk column rejection

Rejected unsupported continuous-looking risk input before fitting, with actionable remediation.

Basic 35/40 | Specialized 50/60 | Total 85/100

✅ A1: Output rejects a continuous-looking risk column before model fitting.

✅ A2: Output gives an actionable remediation path.

✅ A3: Output avoids stack traces or undefined behavior.

✅ A4: Output stays within scope by refusing unsupported grouping semantics.

✅ A5: Output does not produce misleading analytical artifacts after the validation failure.

Pass rate: 5 / 5

Variant B: ✅ Pass

Custom-column CSV workflow

Custom-column CSV path succeeded and emitted the documented conversion warning.

Basic 37/40 | Specialized 54/60 | Total 91/100

✅ A1: Output supports the documented custom-column workflow.

✅ A2: Output warns when heuristic time conversion is applied.

✅ A3: Output still emits the required artifacts after conversion.

✅ A4: Output does not silently change units without notice.

✅ A5: Output stays within the skill's stated analysis scope.

Pass rate: 5 / 5
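
The behavior checked in A2 and A4 (convert heuristically, but never silently) can be sketched as follows. This is a hypothetical illustration: the report does not disclose the skill's actual conversion rule, so the function name normalize_time, the 240-day threshold, and the days-to-months direction are all assumptions.

```python
import warnings

DAYS_PER_MONTH = 30.4375  # 365.25 / 12


def normalize_time(times, unit=None, day_threshold=240):
    """Return times in months; guess the unit only when none is given."""
    if unit == "months":
        return list(times)                      # explicit unit: no conversion
    if unit == "days":
        return [t / DAYS_PER_MONTH for t in times]
    if unit is not None:
        raise ValueError(f"unsupported time unit: {unit!r}")
    # Assumed heuristic: very large values are probably days, not months.
    if max(times) > day_threshold:
        warnings.warn(
            "time unit not specified; values look like days and were "
            "converted to months",
            UserWarning,
        )
        return [t / DAYS_PER_MONTH for t in times]
    return list(times)
```

Passing an explicit unit skips the guess entirely, which mirrors the Stress case assertion that explicit time-unit handling is honored without conversion.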

Stress: ✅ Pass

Large cohort with multiple plot overrides

The high-parameter run remained stable and produced the expected artifacts.

Basic 38/40 | Specialized 55/60 | Total 93/100

✅ A1: Output handles a larger cohort with multiple plotting overrides.

✅ A2: Output honors explicit time-unit handling without conversion.

✅ A3: Output preserves the single-figure contract under heavier customization.

✅ A4: Output validates complex-but-supported plotting options without instability.

✅ A5: Output remains concise, safe, and reproducible in scope.

Pass rate: 5 / 5

Medical Task Total: 91.2 / 100

Key Strengths

The skill has a clear CLI contract with strong validation and actionable remediation messages.
The script executed successfully across baseline, alternate, custom-column, and higher-parameter cases.
Documentation, scripts, and references are well-layered and keep the core workflow easy to follow.
The workflow stays tightly scoped to producing a single Kaplan-Meier figure and session metadata.