Data Analysis

lightgbm-analysis

Train and evaluate LightGBM gradient boosting models for classification or regression with hyperparameter tuning. Inputs: feature matrix, labels. Outputs: trained model, feature importance ranking, SHAP summary plot, ROC or RMSE performance curves.

Total Score: 86 / 100
Core Capability: 90 / 100
  • Functional Suitability: 10 / 12
  • Reliability: 12 / 12
  • Performance & Context: 7 / 8
  • Agent Usability: 15 / 16
  • Human Usability: 6 / 8
  • Security: 11 / 12
  • Maintainability: 12 / 12
  • Agent-Specific: 17 / 20
Medical Task: 20 / 20 Passed
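  • Binary preset on dt_sample1.csv: 67 / 100 (4/4 assertions)
  • Regression preset on dt_sample1.csv: 92 / 100 (4/4 assertions)
  • TXT smoke test with Group target: 90 / 100 (4/4 assertions)
  • Split-importance binary preset: 72 / 100 (4/4 assertions)
  • Overwrite protection guardrail: 94 / 100 (4/4 assertions)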

Veto Gates: Required pass for any deployment consideration

Skill Veto: ✓ All 4 gates passed
Operational Stability: PASS
System remains stable across varied inputs and edge cases
Structural Consistency: PASS
Output structure conforms to expected skill contract format
Result Determinism: PASS
Equivalent inputs produce semantically equivalent outputs
System Security: PASS
No prompt injection, data leakage, or unsafe tool use detected
Research Veto: ✅ PASS (Applicable)
Dimension | Result | Detail
Scientific Integrity | PASS | No fabricated metrics, references, or claims were observed; all reported values came from generated artifacts.
Practice Boundaries | PASS | The skill stays inside model-training and artifact-export scope and does not make direct medical or prescriptive claims.
Methodological Ground | PASS | Weak-model cases were flagged as caution-only with rerun guidance rather than being presented as valid interpretation-ready findings.
Code Usability | PASS | The R entrypoint ran successfully across binary, regression, TXT-input, and overwrite-guardrail scenarios.

Core Capability: 90 / 100 (8 Categories)

Functional Suitability: 10 / 12 (83%)
The core workflow is complete, but the bundled binary presets can finish in diagnostic-only mode more often than the current example framing suggests.

Reliability: 12 / 12 (100%)
Clear typed validation, specific SKILL_* errors, and strong rerun guidance were observed across success and failure paths.

Performance & Context: 7 / 8 (88%)
Progressive disclosure is good, although the SKILL keeps many large examples inline that could be pushed further into references.

Agent Usability: 15 / 16 (94%)
The response contract is strong, but a concrete final answer template would reduce agent variance further.

Human Usability: 6 / 8 (75%)
Trigger language is understandable, but the skill remains fairly CLI-centric and rigid about input shape.

Security: 11 / 12 (92%)
Input validation and identifier warnings are strong, though exported artifacts and session metadata can still expose raw feature names unless the user reviews inputs carefully.

Maintainability: 12 / 12 (100%)
The implementation is cleanly split across entrypoint, utilities, model functions, references, and test data.

Agent-Specific: 17 / 20 (85%)
Trigger precision and escape hatches are strong; default overwrite behavior slightly weakens idempotency, and example positioning could better separate smoke tests from report-ready runs.

Core Capability Total: 90 / 100

Medical Task: Execution Average 83 / 100, Assertions 20/20 Passed
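
The headline figures follow from simple arithmetic on the component scores reported in this evaluation. The Python sketch below is a cross-check only. It assumes that category percentages are rounded half up and that the execution average is an unweighted mean of the five scenario scores; neither rule is stated by the scorer, they merely reproduce the published numbers.

```python
from fractions import Fraction

# (earned, maximum) per category, copied from the Core Capability breakdown.
core = {
    "Functional Suitability": (10, 12),
    "Reliability": (12, 12),
    "Performance & Context": (7, 8),
    "Agent Usability": (15, 16),
    "Human Usability": (6, 8),
    "Security": (11, 12),
    "Maintainability": (12, 12),
    "Agent-Specific": (17, 20),
}


def round_half_up(value):
    """Round a non-negative Fraction to the nearest integer, halves upward.

    Assumed rounding rule; it matches the percentages shown in the report.
    """
    return int(value + Fraction(1, 2))


core_total = sum(earned for earned, _ in core.values())
core_max = sum(maximum for _, maximum in core.values())
percentages = {
    name: round_half_up(Fraction(100 * earned, maximum))
    for name, (earned, maximum) in core.items()
}

# Scenario scores: Canonical, Variant A, Edge, Variant B, Stress.
scenario_scores = [67, 92, 90, 72, 94]
execution_average = sum(scenario_scores) / len(scenario_scores)

print(f"Core Capability: {core_total} / {core_max}")
print(f"Execution average: {execution_average:.0f} / 100")
```

The same check can be rerun whenever a category score changes, which helps catch a total that has drifted from its parts.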

Canonical: Binary preset on dt_sample1.csv (⚠️ Warning, 67 / 100)

Run completed and exported all artifacts, but the model collapsed to one class and correctly downgraded itself to caution-only.

Basic 28/40 | Specialized 39/60 | Total 67/100
A1: Output creates the documented metrics, importance, remediation, figure, and summary artifacts.
A2: Output flags weak-model conditions instead of silently presenting a report-ready ranking.
A3: Output includes actionable rerun guidance tied to the detected issue codes.
A4: Output stays within the skill's stated LightGBM training and export scope.
Pass rate: 4 / 4
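
The caution-only downgrade described for this run can be pictured as a check on the predicted labels. The sketch below is illustrative Python, not the skill's R code. The field names model_quality_flag and interpretation_status appear in this evaluation, but the function name, the flag values, and the hint text are hypothetical.

```python
from collections import Counter


def assess_binary_predictions(predicted_labels):
    """Downgrade a binary result when every prediction falls in one class.

    Hypothetical helper: the returned values are placeholders chosen for
    illustration, not the skill's actual status vocabulary.
    """
    labels = list(predicted_labels)
    if not labels:
        raise ValueError("no predictions to assess")

    class_counts = Counter(labels)
    collapsed = len(class_counts) < 2  # only one distinct class predicted

    if collapsed:
        return {
            "model_quality_flag": "single_class_predictions",
            "interpretation_status": "caution_only",
            "rerun_hint": "check class balance and training settings, then rerun",
        }
    return {
        "model_quality_flag": "ok",
        "interpretation_status": "interpretation_ready",
        "rerun_hint": None,
    }


collapsed_result = assess_binary_predictions([0, 0, 0, 0])
healthy_result = assess_binary_predictions([0, 1, 1, 0])
print(collapsed_result["interpretation_status"])
print(healthy_result["interpretation_status"])
```

Keeping the status separate from the artifacts lets a run finish and export everything while still signalling that the ranking should not be interpreted.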

Variant A: Regression preset on dt_sample1.csv (✅ Pass, 92 / 100)

Regression workflow executed cleanly and exported an interpretation-eligible result set.

Basic 37/40 | Specialized 55/60 | Total 92/100
A1: Output resolves the requested regression task correctly.
A2: Output produces the documented artifact set.
A3: Output reports an interpretation-ready status when no quality issues are detected.
A4: Output remains reproducible and traceable through saved metadata.
Pass rate: 4 / 4

Edge: TXT smoke test with Group target (✅ Pass, 90 / 100)

The tab-delimited TXT path worked correctly, and a repeated run produced identical metrics and importance tables.

Basic 36/40 | Specialized 54/60 | Total 90/100
A1: Output handles tab-delimited TXT input without parser failure.
A2: Output generates the requested lightweight binary ranking artifacts.
A3: Output reports an interpretation-ready result when no caution flags are present.
A4: Output is deterministic under the documented fixed seed.
Pass rate: 4 / 4
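
The determinism assertion amounts to running the same seeded procedure twice and comparing the outputs. The Python sketch below shows one way to do that with a seeded train/test split and a content fingerprint. It is an assumption-laden illustration rather than the skill's R implementation: seeded_split, fingerprint, the seed value, and the split fraction are all hypothetical.

```python
import hashlib
import json
import random


def seeded_split(row_ids, seed, test_fraction=0.25):
    """Shuffle row ids with a private, seeded generator and split them.

    Using random.Random(seed) avoids touching global random state, so the
    split depends only on the inputs and the seed.
    """
    rng = random.Random(seed)
    shuffled = list(row_ids)
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return {"train": sorted(shuffled[n_test:]), "test": sorted(shuffled[:n_test])}


def fingerprint(result):
    """Hash a JSON-serialisable result so two runs can be compared cheaply."""
    payload = json.dumps(result, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


# Two independent runs with the same (hypothetical) seed.
run_a = fingerprint(seeded_split(range(20), seed=42))
run_b = fingerprint(seeded_split(range(20), seed=42))
print("identical" if run_a == run_b else "different")
```

Comparing fingerprints of the exported metrics and importance tables, rather than eyeballing them, makes a repeated smoke test easy to automate.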

Variant B: Split-importance binary preset (✅ Pass, 72 / 100)

The split-importance workflow completed, but the bundled example again degraded to a caution-only binary result.

Basic 31/40 | Specialized 41/60 | Total 72/100
A1: Output generates the requested split-based ranking artifacts.
A2: Output explicitly downgrades interpretation status when the classifier collapses.
A3: Output supplies concrete rerun guidance for the detected classification failure mode.
A4: Output stays within the promised model-training and export scope.
Pass rate: 4 / 4

Stress: Overwrite protection guardrail (✅ Pass, 94 / 100)

The first run succeeded, and the second run correctly stopped with SKILL_OUTPUT_EXISTS instead of overwriting a populated directory.

Basic 38/40 | Specialized 56/60 | Total 94/100
A1: Output stops before overwriting a populated output directory when fail_if_output_exists is enabled.
A2: Output provides a specific and actionable overwrite-protection error.
A3: Output still creates the documented artifacts for the initial successful run.
A4: Output behavior matches the documented overwrite contract.
Pass rate: 4 / 4
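
The guardrail exercised in this scenario can be sketched as a pre-flight check on the output directory. The Python below is a hedged illustration, not the skill's R code. The error code SKILL_OUTPUT_EXISTS and the option name fail_if_output_exists come from this evaluation, while SkillError, prepare_output_dir, the file name, and the message wording are hypothetical.

```python
import tempfile
from pathlib import Path


class SkillError(Exception):
    """Hypothetical error type carrying a machine-readable code."""

    def __init__(self, code, message):
        super().__init__(f"{code}: {message}")
        self.code = code


def prepare_output_dir(path, *, fail_if_output_exists):
    """Create the output directory, refusing to reuse a populated one.

    The flag is keyword-only and has no default here because the
    evaluation does not state the skill's default value.
    """
    out = Path(path)
    populated = out.is_dir() and any(out.iterdir())
    if populated and fail_if_output_exists:
        raise SkillError(
            "SKILL_OUTPUT_EXISTS",
            f"{out} already contains files; choose a new directory "
            "or disable fail_if_output_exists",
        )
    out.mkdir(parents=True, exist_ok=True)
    return out


second_run_code = None
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "run1"
    # First run: directory is new, so artifacts are written normally.
    first = prepare_output_dir(target, fail_if_output_exists=True)
    (first / "metrics.csv").write_text("metric,value\n")
    # Second run: directory is populated, so the guard stops early.
    try:
        prepare_output_dir(target, fail_if_output_exists=True)
    except SkillError as err:
        second_run_code = err.code

print(second_run_code)
```

Raising before any file is written keeps the first run's artifacts intact, which is the behavior the assertions above describe.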

Medical Task Total: 83 / 100

Key Strengths

  • The CLI contract is explicit, with strong parameter validation and clear SKILL_* failure messages.
  • The implementation is reproducible: the workflow seeds its sampling and produced identical outputs on a repeated TXT smoke test.
  • Weak-model cases are handled responsibly through model_quality_flag, interpretation_status, remediation tables, and rerun hints.
  • The skill ships a genuinely runnable R workflow with bundled test data, references, and structured output artifacts.