Data Analysis

rf-model-importance-analysis

Train random forest classifiers and rank biomarker features by importance using mean decrease in accuracy and Gini impurity. Inputs: feature matrix, class labels. Outputs: trained model, importance table, OOB error curve, partial dependence plots.

Total Score: 92 / 100

Core Capability

97 / 100

Functional Suitability

12 / 12

Reliability

11 / 12

Performance & Context

8 / 8

Agent Usability

15 / 16

Human Usability

7 / 8

Security

12 / 12

Maintainability

12 / 12

Agent-Specific

20 / 20

Medical Task

20 / 20 Passed

Bundled dataset full analysis: score 90, assertions 4/4

Custom importance metric and thresholds: score 89, assertions 4/4

Identical case and control labels: score 86, assertions 4/4

Plot-only rerender from existing model: score 87, assertions 4/4

Heavier forest on bundled data: score 89, assertions 4/4

Veto Gates: required pass for any deployment consideration

Skill Veto: ✓ All 4 gates passed

Gate	Result	Detail
Operational Stability	PASS	System remains stable across varied inputs and edge cases.
Structural Consistency	PASS	Output structure conforms to expected skill contract format.
Result Determinism	PASS	Equivalent inputs produce semantically equivalent outputs.
System Security	PASS	No prompt injection, data leakage, or unsafe tool use detected.

Research Veto: ✅ PASS (Applicable)

Dimension	Result	Detail
Scientific Integrity	PASS	No fabricated scientific claims, unverifiable statistics, or invented study results appeared in any audited execution.
Practice Boundaries	PASS	The skill stayed inside offline statistical execution boundaries and did not issue diagnostic, treatment, or clinical decision advice.
Methodological Ground	PASS	The workflow remained aligned with binary random-forest feature ranking, and the documentation clearly warns that preprocessing, imputation, and multiclass use are out of scope.
Code Usability	PASS	The CLI help check, packaged tests, canonical runs, plot-only rerender, stress run, and repeated seeded comparison all executed successfully in the audited environment.

Core Capability: 97 / 100 (8 Categories)

Functional Suitability

The skill fully covers its promised binary random-forest training, importance export, plot generation, plot-only reuse, troubleshooting, and test flows.

12 / 12

100%

Reliability

Validation, standardized errors, timeout control, and deterministic reruns are strong, although corrupted-bundle recovery guidance remains primarily documentation-based rather than explicitly smoke-tested.

11 / 12

92%

Performance & Context

Full score achieved. The skill uses progressive disclosure and a direct CLI workflow without unnecessary context or execution overhead.

8 / 8

100%

Agent Usability

The structure is clear and consistent, but first-time users must still scan a long option matrix before reaching the most common execution scenarios.

15 / 16

94%

Human Usability

The trigger language and examples are natural for technical users, but the skill remains intentionally strict about already-cleaned binary inputs.

7 / 8

88%

Security

Full score achieved. Output paths are constrained to the skill root, inputs are validated, execution is offline-only, and the workflow does not expose eval-like code execution.

12 / 12

100%
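The path constraint described in the Security category can be illustrated with a minimal sketch. The audited implementation is not shown in this report, so the function name `resolve_inside_root` and its behavior are assumptions for illustration only: a requested output path is resolved and rejected if it lands outside the skill root.

```python
from pathlib import Path


def resolve_inside_root(root: str, requested: str) -> Path:
    """Resolve `requested` against `root`; raise if the result escapes `root`.

    Hypothetical helper, not taken from the audited skill.
    """
    root_path = Path(root).resolve()
    # Joining with an absolute `requested` discards root_path, so absolute
    # paths outside the root are rejected by the check below as well.
    candidate = (root_path / requested).resolve()
    if candidate != root_path and root_path not in candidate.parents:
        raise ValueError(f"output path escapes skill root: {requested}")
    return candidate
```

Resolving before comparing is the important design choice here: it collapses `..` segments and symlinks, so a path such as `results/../../elsewhere` is judged by where it actually points.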

Maintainability

Full score achieved. The codebase is modular, references are separated cleanly, and the packaged tests verify both full-run and plot-only behavior.

12 / 12

100%

Agent-Specific

Full score achieved. Trigger precision, layered references, composability, deterministic re-runs, and explicit out-of-scope boundaries are all strong.

20 / 20

100%

Core Capability Total: 97 / 100
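As a quick arithmetic check, the eight category scores reported above can be summed and converted to percentages. The scores are copied from this report. The half-up rounding rule is an assumption, but it reproduces the listed percentages.

```python
# (earned, maximum) per category, as listed in this report.
category_scores = {
    "Functional Suitability": (12, 12),
    "Reliability": (11, 12),
    "Performance & Context": (8, 8),
    "Agent Usability": (15, 16),
    "Human Usability": (7, 8),
    "Security": (12, 12),
    "Maintainability": (12, 12),
    "Agent-Specific": (20, 20),
}

total_earned = sum(earned for earned, _ in category_scores.values())
total_max = sum(maximum for _, maximum in category_scores.values())

# Round half up (assumed rule), so 87.5 becomes 88.
percentages = {
    name: int(100 * earned / maximum + 0.5)
    for name, (earned, maximum) in category_scores.items()
}

print(f"{total_earned} / {total_max}")  # 97 / 100
```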

Medical Task: Execution Average 88.2 / 100, Assertions 20/20 Passed

Canonical

Bundled dataset full analysis

4/4 ✓

Variant A

Custom importance metric and thresholds

4/4 ✓

Edge

Identical case and control labels

4/4 ✓

Variant B

Plot-only rerender from existing model

4/4 ✓

Stress

Heavier forest on bundled data

4/4 ✓

Canonical: ✅ Pass

Bundled dataset full analysis

Executed cleanly and produced the full documented artifact set, including session metadata, model bundle, ranked tables, and both plots.

Basic 38/40 | Specialized 52/60 | Total 90/100

✅ A1: Output directory contains the documented artifact set.

✅ A2: rf_top_features.csv honors the requested reporting cap.

✅ A3: Reproducibility metadata is recorded.

✅ A4: The run stays within local-file execution boundaries.

Pass rate: 4 / 4

Variant A: ✅ Pass

Custom importance metric and thresholds

Handled a tuned forest and Gini-based ranking without instability, and the filtered output remained coherent and within the requested cap.

Basic 38/40 | Specialized 51/60 | Total 89/100

✅ A1: Custom modeling parameters execute successfully on the bundled data.

✅ A2: The importance metric switches to MeanDecreaseGini when requested.

✅ A3: The filtered feature table honors the requested threshold and cap.

✅ A4: Plot generation remains stable under custom settings.

Pass rate: 4 / 4
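The "threshold and cap" behavior checked in this scenario can be sketched as follows. This is not the skill's code: the function name `top_features`, the inclusive threshold, and the alphabetical tie-break are all assumptions, and the feature names and scores in the usage example are made up for illustration.

```python
from typing import Dict, List, Tuple


def top_features(
    importances: Dict[str, float], threshold: float, cap: int
) -> List[Tuple[str, float]]:
    """Keep features scoring at least `threshold`, ranked high to low, at most `cap` rows.

    Hypothetical helper; the inclusive threshold is an assumption.
    """
    if cap < 0:
        raise ValueError("cap must be non-negative")
    kept = [(name, score) for name, score in importances.items() if score >= threshold]
    # Sort by descending score; break ties by name so the order is deterministic.
    kept.sort(key=lambda pair: (-pair[1], pair[0]))
    return kept[:cap]


# Illustrative values only, not results from the audited runs.
example = {"geneA": 0.30, "geneB": 0.05, "geneC": 0.30, "geneD": 0.12}
print(top_features(example, threshold=0.10, cap=2))
```

Filtering before truncating matters: the cap then limits how many features that passed the threshold are reported, and never pulls in a feature below it.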

Edge: ✅ Pass

Identical case and control labels

The expected validation failure was returned immediately with a standardized message, which is correct behavior for this boundary case.

Basic 37/40 | Specialized 49/60 | Total 86/100

✅ A1: The CLI rejects identical case and control labels.

✅ A2: Validation happens before model training or result writing.

✅ A3: The failure message is standardized and actionable.

✅ A4: Boundary handling remains deterministic and safe.

Pass rate: 4 / 4
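A fail-fast check of this kind might look like the sketch below. The names `InputValidationError`, `validate_labels`, and `run_analysis`, as well as the message wording, are hypothetical and are not taken from the audited CLI. The point is only the ordering: validation runs before any training callback is invoked.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


class InputValidationError(ValueError):
    """Raised when arguments fail validation (hypothetical name)."""


def validate_labels(case_label: str, control_label: str) -> None:
    """Reject empty or identical case/control labels."""
    case, control = case_label.strip(), control_label.strip()
    if not case or not control:
        raise InputValidationError("case and control labels must be non-empty")
    if case == control:
        raise InputValidationError(
            f"case and control labels must differ (both are '{case}')"
        )


def run_analysis(case_label: str, control_label: str, train: Callable[[], T]) -> T:
    """Validate first, so a bad request never reaches training or result writing."""
    validate_labels(case_label, control_label)
    return train()
```

For example, `run_analysis("tumor", "tumor", train_fn)` raises `InputValidationError` without ever calling `train_fn`.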

Variant B: ✅ Pass

Plot-only rerender from existing model

The plot-only branch reused the saved model bundle and regenerated plots without retraining.

Basic 38/40 | Specialized 49/60 | Total 87/100

✅ A1: Plot-only mode succeeds without raw input-file arguments.

✅ A2: Plot-only mode reuses the existing model bundle from output_dir/data/rf_result.rds.

✅ A3: Plot outputs exist after the rerender.

✅ A4: Plot-only mode avoids unnecessary retraining side effects.

Pass rate: 4 / 4

Stress: ✅ Pass

Heavier forest on bundled data

A heavier forest configuration completed within the timeout and still produced an interpretable ranked table.

Basic 38/40 | Specialized 51/60 | Total 89/100

✅ A1: The heavier random-forest configuration completes successfully within the configured timeout.

✅ A2: The thresholded top-feature table honors the requested cap and threshold.

✅ A3: The ranked output remains interpretable under the heavier settings.

✅ A4: Stress execution preserves the documented artifact contract.

Pass rate: 4 / 4

Medical Task Total: 88.2 / 100
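The 88.2 figure is consistent with a plain arithmetic mean of the five scenario totals. The sketch below uses the Basic and Specialized scores reported above; treating the average as an unweighted mean is an assumption, although it reproduces the reported value.

```python
# (basic, specialized) scores per scenario, as listed in this report.
scenarios = {
    "Canonical": (38, 52),
    "Variant A": (38, 51),
    "Edge": (37, 49),
    "Variant B": (38, 49),
    "Stress": (38, 51),
}

totals = {name: basic + specialized for name, (basic, specialized) in scenarios.items()}
execution_average = round(sum(totals.values()) / len(totals), 1)

# Each scenario passed 4 of 4 assertions.
assertions_passed = 4 * len(scenarios)

print(totals)             # per-scenario totals out of 100
print(execution_average)  # 88.2
```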

Key Strengths

The skill is operationally stable across standard, tuned, plot-only, stress, and expected-failure workflows.
Input validation, path confinement, and offline-only execution give the CLI a strong safety profile.
Deterministic behavior is explicit and was confirmed by a repeated seeded run with byte-for-byte matching output.
Documentation, examples, troubleshooting, and packaged tests align closely with the actual implementation.
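A byte-for-byte comparison of two seeded runs, as mentioned in the strengths above, could be checked along these lines. The audit's actual comparison method is not described in this report, so the helper names `file_digest` and `runs_match` and the SHA-256 approach are assumptions for illustration.

```python
import hashlib
from pathlib import Path
from typing import Dict


def file_digest(path: Path) -> str:
    """SHA-256 hex digest of a file's raw bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def directory_digests(root: Path) -> Dict[str, str]:
    """Map each file's path relative to `root` to its digest."""
    return {
        p.relative_to(root).as_posix(): file_digest(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def runs_match(dir_a: str, dir_b: str) -> bool:
    """True if both output trees hold the same files with identical bytes."""
    return directory_digests(Path(dir_a)) == directory_digests(Path(dir_b))
```

Comparing digests keyed by relative path catches both changed content and missing or extra files. Outputs that embed timestamps would fail such a check even with a fixed seed, so session metadata may need to be excluded or normalized first.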