Data Analysis

rf-model-importance-analysis

Train random forest classifiers and rank biomarker features by importance using mean decrease in accuracy and mean decrease in Gini impurity. Inputs: feature matrix, class labels. Outputs: trained model, importance table, OOB error curve, partial dependence plots.

Total Score: 92 / 100
Core Capability
97 / 100
Functional Suitability
12 / 12
Reliability
11 / 12
Performance & Context
8 / 8
Agent Usability
15 / 16
Human Usability
7 / 8
Security
12 / 12
Maintainability
12 / 12
Agent-Specific
20 / 20
Medical Task
20 / 20 Passed
90 | Bundled dataset full analysis | 4/4
89 | Custom importance metric and thresholds | 4/4
86 | Identical case and control labels | 4/4
87 | Plot-only rerender from existing model | 4/4
89 | Heavier forest on bundled data | 4/4

Veto Gates: Required pass for any deployment consideration

Skill Veto: ✓ All 4 gates passed
Operational Stability
System remains stable across varied inputs and edge cases
PASS
Structural Consistency
Output structure conforms to expected skill contract format
PASS
Result Determinism
Equivalent inputs produce semantically equivalent outputs
PASS
System Security
No prompt injection, data leakage, or unsafe tool use detected
PASS
Research Veto: ✅ PASS (Applicable)
Dimension | Result | Detail
Scientific Integrity | PASS | No fabricated scientific claims, unverifiable statistics, or invented study results appeared in any audited execution.
Practice Boundaries | PASS | The skill stayed inside offline statistical execution boundaries and did not issue diagnostic, treatment, or clinical decision advice.
Methodological Ground | PASS | The workflow remained aligned with binary random-forest feature ranking, and the documentation clearly warns that preprocessing, imputation, and multiclass use are out of scope.
Code Usability | PASS | The CLI help check, packaged tests, canonical runs, plot-only rerender, stress run, and repeated seeded comparison all executed successfully in the audited environment.

Core Capability: 97 / 100 (8 Categories)

Functional Suitability
The skill fully covers its promised binary random-forest training, importance export, plot generation, plot-only reuse, troubleshooting, and test flows.
12 / 12
100%
Reliability
Validation, standardized errors, timeout control, and deterministic reruns are strong, although corrupted-bundle recovery guidance remains primarily documentation-based rather than explicitly smoke-tested.
11 / 12
92%
Performance & Context
Full score achieved. The skill uses progressive disclosure and a direct CLI workflow without unnecessary context or execution overhead.
8 / 8
100%
Agent Usability
The structure is clear and consistent, but first-time users still have to scan a long option matrix before reaching the most common execution scenarios.
15 / 16
94%
Human Usability
The trigger language and examples are natural for technical users, but the skill remains intentionally strict about already-cleaned binary inputs.
7 / 8
88%
Security
Full score achieved. Output paths are constrained to the skill root, inputs are validated, execution is offline-only, and the workflow does not expose eval-like code execution.
12 / 12
100%
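The path-confinement behavior noted under Security can be illustrated with a short sketch. This is not the skill's own code (the skill appears to be R-based); the function name `confine_to_root` and its arguments are hypothetical, and the check shown is one common way to keep output paths under a fixed root.

```python
from pathlib import Path


def confine_to_root(root, candidate):
    """Resolve `candidate` under `root` and reject anything that escapes it.

    Hypothetical helper for illustration only; the audited skill may
    implement its confinement differently.
    """
    root_path = Path(root).resolve()
    # Joining with an absolute candidate discards root_path, so absolute
    # paths outside the root are caught by the check below as well.
    target = (root_path / candidate).resolve()
    if target != root_path and root_path not in target.parents:
        raise ValueError(f"output path escapes the skill root: {candidate}")
    return target
```

Under these assumptions, `confine_to_root(root, "results/plots")` returns a path inside the root, while `confine_to_root(root, "../elsewhere")` raises `ValueError`.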
Maintainability
Full score achieved. The codebase is modular, references are separated cleanly, and the packaged tests verify both full-run and plot-only behavior.
12 / 12
100%
Agent-Specific
Full score achieved. Trigger precision, layered references, composability, deterministic re-runs, and explicit out-of-scope boundaries are all strong.
20 / 20
100%
Core Capability Total: 97 / 100
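The total and the per-category percentages can be rechecked from the eight scores listed above. The sketch below only reproduces that arithmetic; the dictionary layout and the half-up rounding rule are assumptions chosen to match the displayed percentages.

```python
# (earned, maximum) per category, copied from the report above.
CATEGORY_SCORES = {
    "Functional Suitability": (12, 12),
    "Reliability": (11, 12),
    "Performance & Context": (8, 8),
    "Agent Usability": (15, 16),
    "Human Usability": (7, 8),
    "Security": (12, 12),
    "Maintainability": (12, 12),
    "Agent-Specific": (20, 20),
}


def core_capability_total(scores):
    """Return (earned, maximum) summed over all categories."""
    earned = sum(e for e, _ in scores.values())
    maximum = sum(m for _, m in scores.values())
    return earned, maximum


def percent(earned, maximum):
    """Whole-number percentage, rounding halves up (assumed display rule)."""
    return int(100 * earned / maximum + 0.5)


if __name__ == "__main__":
    print(core_capability_total(CATEGORY_SCORES))
    for name, (earned, maximum) in CATEGORY_SCORES.items():
        print(f"{name}: {earned} / {maximum} = {percent(earned, maximum)}%")
```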

Medical Task: Execution Average 88.2 / 100; Assertions 20/20 Passed

Canonical (90 / 100) ✅ Pass
Bundled dataset full analysis

Executed cleanly and produced the full documented artifact set, including session metadata, model bundle, ranked tables, and both plots.

Basic 38/40 | Specialized 52/60 | Total 90/100
A1: Output directory contains the documented artifact set.
A2: rf_top_features.csv honors the requested reporting cap.
A3: Reproducibility metadata is recorded.
A4: The run stays within local-file execution boundaries.
Pass rate: 4 / 4
Variant A (89 / 100) ✅ Pass
Custom importance metric and thresholds

Handled a tuned forest and Gini-based ranking without instability, and the filtered output remained coherent and within the requested cap.

Basic 38/40 | Specialized 51/60 | Total 89/100
A1: Custom modeling parameters execute successfully on the bundled data.
A2: The importance metric switches to MeanDecreaseGini when requested.
A3: The filtered feature table honors the requested threshold and cap.
A4: Plot generation remains stable under custom settings.
Pass rate: 4 / 4
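Assertion A3 describes a thresholded, capped feature table. A minimal sketch of that filtering step follows; the function name, the inclusive threshold, and the sample values are all illustrative assumptions and are not taken from the skill or its outputs.

```python
def top_features(importance, threshold, cap):
    """Rank features by importance and keep at most `cap` above `threshold`.

    `importance` maps feature name -> importance score. Treating the
    threshold as inclusive (>=) is an assumption; the report does not say.
    """
    kept = [(name, score) for name, score in importance.items() if score >= threshold]
    # Sort by descending score, breaking ties by name for a stable order.
    kept.sort(key=lambda item: (-item[1], item[0]))
    return kept[:cap]


if __name__ == "__main__":
    # Made-up scores purely for demonstration.
    sample = {"gene_a": 4.2, "gene_b": 0.3, "gene_c": 2.7, "gene_d": 1.1}
    print(top_features(sample, threshold=1.0, cap=2))
```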
Edge (86 / 100) ✅ Pass
Identical case and control labels

The expected validation failure was returned immediately with a standardized message, which is correct behavior for this boundary case.

Basic 37/40 | Specialized 49/60 | Total 86/100
A1: The CLI rejects identical case and control labels.
A2: Validation happens before model training or result writing.
A3: The failure message is standardized and actionable.
A4: Boundary handling remains deterministic and safe.
Pass rate: 4 / 4
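The fail-fast ordering in A1 and A2 can be sketched as below. The exception class, function names, and message wording are hypothetical; the point is only that validation runs before any training callback is invoked.

```python
class LabelValidationError(ValueError):
    """Raised when the case/control labels cannot define a binary task."""


def validate_labels(case_label, control_label):
    """Reject identical case and control labels (hypothetical check)."""
    if case_label == control_label:
        raise LabelValidationError(
            f"case and control labels must differ, got {case_label!r} for both"
        )


def run_analysis(case_label, control_label, train):
    """Validate first, then train; `train` is any zero-argument callable."""
    validate_labels(case_label, control_label)  # fails before any work is done
    return train()
```

With this ordering, a call such as `run_analysis("tumor", "tumor", train)` raises before `train` is ever called, so no model or result files would be produced.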
Variant B (87 / 100) ✅ Pass
Plot-only rerender from existing model

The plot-only branch reused the saved model bundle and regenerated plots without retraining.

Basic 38/40 | Specialized 49/60 | Total 87/100
A1: Plot-only mode succeeds without raw input-file arguments.
A2: Plot-only mode reuses the existing model bundle from output_dir/data/rf_result.rds.
A3: Plot outputs exist after the rerender.
A4: Plot-only mode avoids unnecessary retraining side effects.
Pass rate: 4 / 4
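The branch described in A1, A2, and A4 amounts to a small decision: in plot-only mode, require the saved bundle and skip training. The sketch below illustrates that decision only. The bundle location `data/rf_result.rds` comes from the report, while the function name and return values are assumptions.

```python
from pathlib import Path


def choose_action(output_dir, plot_only):
    """Return ("rerender" | "train", bundle_path) for a run.

    Hypothetical dispatcher: plot-only mode needs an existing model bundle
    and never retrains; a normal run trains and writes the bundle.
    """
    bundle = Path(output_dir) / "data" / "rf_result.rds"
    if plot_only:
        if not bundle.is_file():
            raise FileNotFoundError(f"plot-only mode needs an existing bundle: {bundle}")
        return "rerender", bundle
    return "train", bundle
```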
Stress (89 / 100) ✅ Pass
Heavier forest on bundled data

A heavier forest configuration completed within the timeout and still produced an interpretable ranked table.

Basic 38/40 | Specialized 51/60 | Total 89/100
A1: The heavier random-forest configuration completes successfully within the configured timeout.
A2: The thresholded top-feature table honors the requested cap and threshold.
A3: The ranked output remains interpretable under the heavier settings.
A4: Stress execution preserves the documented artifact contract.
Pass rate: 4 / 4
Medical Task Total: 88.2 / 100
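The 88.2 figure is consistent with a plain mean of the five scenario totals, each of which is the sum of its Basic and Specialized sub-scores. The sketch below rechecks that arithmetic; treating the average as an unweighted mean is an assumption that happens to reproduce the reported value.

```python
from statistics import mean

# (basic, specialized) sub-scores per scenario, copied from the report.
SCENARIOS = {
    "Canonical": (38, 52),
    "Variant A": (38, 51),
    "Edge": (37, 49),
    "Variant B": (38, 49),
    "Stress": (38, 51),
}

# Scenario total = basic (out of 40) + specialized (out of 60).
totals = {name: basic + specialized for name, (basic, specialized) in SCENARIOS.items()}

# Unweighted mean across scenarios, rounded to one decimal place.
execution_average = round(mean(list(totals.values())), 1)

if __name__ == "__main__":
    print(totals)
    print(execution_average)
```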

Key Strengths

  • The skill is operationally stable across standard, tuned, plot-only, stress, and expected-failure workflows.
  • Input validation, path confinement, and offline-only execution give the CLI a strong safety profile.
  • Deterministic behavior is explicit and was confirmed by a repeated seeded run with byte-for-byte matching output.
  • Documentation, examples, troubleshooting, and packaged tests align closely with the actual implementation.
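A byte-for-byte comparison of two seeded runs, as mentioned in the determinism point above, could be checked along these lines. The helper names are hypothetical and this is not the audit's actual procedure: it hashes every file under two output directories and compares the resulting maps.

```python
import hashlib
from pathlib import Path


def digest_tree(directory):
    """Map each file's relative path to the SHA-256 of its bytes."""
    base = Path(directory)
    return {
        path.relative_to(base).as_posix(): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(base.rglob("*"))
        if path.is_file()
    }


def runs_match(first_dir, second_dir):
    """True when both directories hold the same files with identical bytes."""
    return digest_tree(first_dir) == digest_tree(second_dir)
```

Note that a check like this is strict: outputs that embed timestamps or absolute paths would differ even when the model and rankings are identical, so such fields would need to be excluded or fixed before comparing.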