Data Analysis

lightgbm-analysis

Train and evaluate LightGBM gradient boosting models for classification or regression with hyperparameter tuning. Inputs: feature matrix, labels. Outputs: trained model, feature importance ranking, SHAP summary plot, ROC or RMSE performance curves.

86 / 100 Total Score

Core Capability

90 / 100

Functional Suitability

10 / 12

Reliability

12 / 12

Performance & Context

7 / 8

Agent Usability

15 / 16

Human Usability

6 / 8

Security

11 / 12

Maintainability

12 / 12

Agent-Specific

17 / 20

Medical Task

20 / 20 Passed

Score	Scenario	Assertions
67	Binary preset on dt_sample1.csv	4/4
92	Regression preset on dt_sample1.csv	4/4
90	TXT smoke test with Group target	4/4
72	Split-importance binary preset	4/4
94	Overwrite protection guardrail	4/4

Veto Gates: Required pass for any deployment consideration

Skill Veto: ✓ All 4 gates passed

Gate	Description	Result
Operational Stability	System remains stable across varied inputs and edge cases	✓ PASS
Structural Consistency	Output structure conforms to expected skill contract format	✓ PASS
Result Determinism	Equivalent inputs produce semantically equivalent outputs	✓ PASS
System Security	No prompt injection, data leakage, or unsafe tool use detected	✓ PASS

Research Veto: ✅ PASS (Applicable)

Dimension	Result	Detail
Scientific Integrity	PASS	No fabricated metrics, references, or claims were observed; all reported values came from generated artifacts.
Practice Boundaries	PASS	The skill stays inside model-training and artifact-export scope and does not make direct medical or prescriptive claims.
Methodological Ground	PASS	Weak-model cases were flagged as caution-only with rerun guidance rather than being presented as valid interpretation-ready findings.
Code Usability	PASS	The R entrypoint ran successfully across binary, regression, TXT-input, and overwrite-guardrail scenarios.

Core Capability: 90 / 100 (8 Categories)

Functional Suitability

The core workflow is complete, but the bundled binary presets can finish in diagnostic-only mode more often than the current example framing suggests.

10 / 12

83%

Reliability

Clear typed validation, specific SKILL_* errors, and strong rerun guidance were observed across success and failure paths.

12 / 12

100%

Performance & Context

Progressive disclosure is good, although the SKILL keeps many large examples inline that could be pushed further into references.

7 / 8

88%

Agent Usability

The response contract is strong, but a concrete final answer template would reduce agent variance further.

15 / 16

94%

Human Usability

Trigger language is understandable, but the skill remains fairly CLI-centric and rigid about input shape.

6 / 8

75%

Security

Input validation and identifier warnings are strong, though exported artifacts and session metadata can still expose raw feature names unless the user reviews inputs carefully.

11 / 12

92%

Maintainability

The implementation is cleanly split across entrypoint, utilities, model functions, references, and test data.

12 / 12

100%

Agent-Specific

Trigger precision and escape hatches are strong; default overwrite behavior slightly weakens idempotency, and example positioning could better separate smoke tests from report-ready runs.

17 / 20

85%

Core Capability Total: 90 / 100
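The total and the per-category percentages can be re-derived from the category scores. The sketch below is only an arithmetic check on the figures in this report. The half-up rounding rule is an assumption that happens to reproduce the displayed percentages, and `summarize` is a hypothetical helper, not part of the skill.

```python
# Category scores as (earned, maximum), copied from the report above.
CATEGORY_SCORES = {
    "Functional Suitability": (10, 12),
    "Reliability": (12, 12),
    "Performance & Context": (7, 8),
    "Agent Usability": (15, 16),
    "Human Usability": (6, 8),
    "Security": (11, 12),
    "Maintainability": (12, 12),
    "Agent-Specific": (17, 20),
}


def summarize(scores):
    """Return (earned_total, max_total, {category: percent rounded half up})."""
    earned = sum(e for e, _ in scores.values())
    maximum = sum(m for _, m in scores.values())
    # floor(100 * e / m + 0.5) in integer arithmetic; assumed rounding rule.
    percents = {name: (200 * e + m) // (2 * m) for name, (e, m) in scores.items()}
    return earned, maximum, percents


earned, maximum, percents = summarize(CATEGORY_SCORES)
print(f"Core Capability Total: {earned} / {maximum}")
for name, pct in percents.items():
    print(f"{name}: {pct}%")
```

The eight category maxima add up to 100, so the earned total reads directly as a score out of 100.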

Medical Task: Execution Average 83 / 100, Assertions 20/20 Passed

Canonical

Binary preset on dt_sample1.csv

4/4 ⚠

Variant A

Regression preset on dt_sample1.csv

4/4 ✓

Edge

TXT smoke test with Group target

4/4 ✓

Variant B

Split-importance binary preset

4/4 ✓

Stress

Overwrite protection guardrail

4/4 ✓

Canonical: ⚠️ Warning

Binary preset on dt_sample1.csv

Run completed and exported all artifacts, but the model collapsed to one class and correctly downgraded itself to caution-only.

Basic 28/40 | Specialized 39/60 | Total 67/100

✅ A1: Output creates the documented metrics, importance, remediation, figure, and summary artifacts.

✅ A2: Output flags weak-model conditions instead of silently presenting a report-ready ranking.

✅ A3: Output includes actionable rerun guidance tied to the detected issue codes.

✅ A4: Output stays within the skill's stated LightGBM training and export scope.

Pass rate: 4 / 4
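The downgrade described in this scenario, where a binary model predicts only one class and the run is marked caution-only, could be sketched roughly as follows. This is an illustrative Python sketch, not the skill's R implementation. The field names `model_quality_flag` and `interpretation_status` appear in this report, but the values, the hint text, and the function `assess_binary_predictions` are hypothetical.

```python
def assess_binary_predictions(predicted_labels):
    """Flag a collapsed classifier: every prediction falls into one class."""
    labels = list(predicted_labels)
    if not labels:
        raise ValueError("no predictions to assess")
    if len(set(labels)) < 2:
        return {
            "model_quality_flag": "single_class_prediction",  # hypothetical value
            "interpretation_status": "caution_only",          # hypothetical value
            "rerun_hint": "review class balance and tuning, then rerun",  # placeholder text
        }
    return {
        "model_quality_flag": "ok",                           # hypothetical value
        "interpretation_status": "interpretation_ready",      # hypothetical value
        "rerun_hint": None,
    }


collapsed = assess_binary_predictions([0, 0, 0, 0])
healthy = assess_binary_predictions([0, 1, 1, 0])
print(collapsed["interpretation_status"], healthy["interpretation_status"])
```

The status is kept separate from the artifacts so that a run can still export everything while telling the reader not to treat the ranking as report-ready.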

Variant A: ✅ Pass

Regression preset on dt_sample1.csv

Regression workflow executed cleanly and exported an interpretation-eligible result set.

Basic 37/40 | Specialized 55/60 | Total 92/100

✅ A1: Output resolves the requested regression task correctly.

✅ A2: Output produces the documented artifact set.

✅ A3: Output reports an interpretation-ready status when no quality issues are detected.

✅ A4: Output remains reproducible and traceable through saved metadata.

Pass rate: 4 / 4

Edge: ✅ Pass

TXT smoke test with Group target

The tab-delimited TXT path worked correctly, and a repeated run produced identical metrics and importance tables.

Basic 36/40 | Specialized 54/60 | Total 90/100

✅ A1: Output handles tab-delimited TXT input without parser failure.

✅ A2: Output generates the requested lightweight binary ranking artifacts.

✅ A3: Output reports an interpretation-ready result when no caution flags are present.

✅ A4: Output is deterministic under the documented fixed seed.

Pass rate: 4 / 4
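A repeated-run determinism check like the one in this scenario can be approximated by hashing the exported tables from two runs and comparing the digests. The sketch below assumes each run writes its artifacts to its own directory. The file names and the helper `differing_artifacts` are hypothetical, and the demo writes placeholder content instead of real results.

```python
import hashlib
import os
import tempfile


def file_digest(path):
    """SHA-256 of a file's bytes, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def differing_artifacts(run_a, run_b, names):
    """Return the artifact names whose bytes differ between two run directories."""
    return [
        name for name in names
        if file_digest(os.path.join(run_a, name)) != file_digest(os.path.join(run_b, name))
    ]


def demo():
    """Write identical placeholder tables to two directories, then alter one."""
    names = ["metrics.csv", "importance.csv"]  # hypothetical artifact names
    with tempfile.TemporaryDirectory() as run_a, tempfile.TemporaryDirectory() as run_b:
        for run_dir in (run_a, run_b):
            for name in names:
                with open(os.path.join(run_dir, name), "w") as handle:
                    handle.write("column\nplaceholder\n")
        same = differing_artifacts(run_a, run_b, names)
        with open(os.path.join(run_b, "metrics.csv"), "a") as handle:
            handle.write("extra\n")
        changed = differing_artifacts(run_a, run_b, names)
    return same, changed


print(demo())
```

A byte-level comparison is stricter than the "semantically equivalent" wording of the determinism gate, so a failure here would call for a closer look at the tables before it is treated as a real regression.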

Variant B: ✅ Pass

Split-importance binary preset

The split-importance workflow completed, but the bundled example again degraded to a caution-only binary result.

Basic 31/40 | Specialized 41/60 | Total 72/100

✅ A1: Output generates the requested split-based ranking artifacts.

✅ A2: Output explicitly downgrades interpretation status when the classifier collapses.

✅ A3: Output supplies concrete rerun guidance for the detected classification failure mode.

✅ A4: Output stays within the promised model-training and export scope.

Pass rate: 4 / 4

Stress: ✅ Pass

Overwrite protection guardrail

The first run succeeded, and the second run correctly stopped with SKILL_OUTPUT_EXISTS instead of overwriting a populated directory.

Basic 38/40 | Specialized 56/60 | Total 94/100

✅ A1: Output stops before overwriting a populated output directory when fail_if_output_exists is enabled.

✅ A2: Output provides a specific and actionable overwrite-protection error.

✅ A3: Output still creates the documented artifacts for the initial successful run.

✅ A4: Output behavior matches the documented overwrite contract.

Pass rate: 4 / 4
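The guardrail exercised here, which stops with SKILL_OUTPUT_EXISTS when the output directory is already populated, might look roughly like the sketch below. It is written in Python for illustration only, since the skill itself is an R workflow. `SkillError` and `prepare_output_dir` are hypothetical names, and the `False` default for `fail_if_output_exists` is an assumption based on this report's remark about default overwrite behavior.

```python
import os
import tempfile


class SkillError(Exception):
    """Error carrying a machine-readable code such as SKILL_OUTPUT_EXISTS."""

    def __init__(self, code, message):
        super().__init__(f"{code}: {message}")
        self.code = code


def prepare_output_dir(path, fail_if_output_exists=False):
    """Create `path` if needed; refuse a populated directory when the guard is on."""
    populated = os.path.isdir(path) and bool(os.listdir(path))
    if populated and fail_if_output_exists:
        raise SkillError(
            "SKILL_OUTPUT_EXISTS",
            f"{path!r} already contains files; pick a new directory or disable the guard.",
        )
    os.makedirs(path, exist_ok=True)
    return path


def demo():
    """First run populates the directory; the second guarded run must stop."""
    with tempfile.TemporaryDirectory() as root:
        out_dir = os.path.join(root, "run_output")  # hypothetical directory name
        prepare_output_dir(out_dir, fail_if_output_exists=True)
        with open(os.path.join(out_dir, "metrics.csv"), "w") as handle:
            handle.write("metric,value\n")  # placeholder artifact, not real results
        try:
            prepare_output_dir(out_dir, fail_if_output_exists=True)
        except SkillError as err:
            return err.code
    return None


print(demo())
```

Checking for a populated directory, not just an existing one, lets a first run reuse an empty directory while still protecting earlier results.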

Medical Task Total: 83 / 100
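The 83 / 100 figure is consistent with an unweighted mean of the five scenario totals. The sketch below checks that, along with the Basic plus Specialized sums. Equal weighting is an assumption, since the report does not state how the average is formed, and `execution_average` is a hypothetical helper.

```python
# (basic, specialized, total) per scenario, copied from the report above.
SCENARIOS = {
    "Canonical": (28, 39, 67),
    "Variant A": (37, 55, 92),
    "Edge": (36, 54, 90),
    "Variant B": (31, 41, 72),
    "Stress": (38, 56, 94),
}


def execution_average(scenarios):
    """Verify each total equals basic + specialized, then return the unweighted mean."""
    for name, (basic, specialized, total) in scenarios.items():
        if basic + specialized != total:
            raise ValueError(f"{name}: {basic} + {specialized} != {total}")
    return sum(total for _, _, total in scenarios.values()) / len(scenarios)


average = execution_average(SCENARIOS)
print(f"Medical Task Total: {average:.0f} / 100")
```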

Key Strengths

The CLI contract is explicit, with strong parameter validation and clear SKILL_* failure messages.
The implementation is reproducible: the workflow seeds its sampling, and a repeated TXT smoke test produced identical outputs.
Weak-model cases are handled responsibly through model_quality_flag, interpretation_status, remediation tables, and rerun hints.
The skill ships a genuinely runnable R workflow with bundled test data, references, and structured output artifacts.