Data Analysis

experimental-data-analysis

91100Total Score
Core Capability
88 / 100
Functional Suitability
11 / 12
Reliability
10 / 12
Performance & Context
8 / 8
Agent Usability
14 / 16
Human Usability
8 / 8
Security
10 / 12
Maintainability
10 / 12
Agent-Specific
17 / 20
Medical Task
20 / 20 Passed
98You have experimental results in CSV form and need a reproducible end-to-end analysis workflow (clean → test → report)
4/4
94You need to compare two conditions (independent or paired) and determine statistical significance with effect sizes
4/4
92Reproducible, run-based execution that writes all artifacts into outputs/runs/<timestamp>/
4/4
92Data preparation guidance: missing values, outliers, and variable type identification (continuous/categorical; grouping factors)
4/4
92End-to-end case for Reproducible, run-based execution that writes all artifacts into outputs/runs/<timestamp>/
4/4

Veto GatesRequired pass for any deployment consideration

Skill Veto✓ All 4 gates passed
Operational Stability
System remains stable across varied inputs and edge cases
PASS
Structural Consistency
Output structure conforms to expected skill contract format
PASS
Result Determinism
Equivalent inputs produce semantically equivalent outputs
PASS
System Security
No prompt injection, data leakage, or unsafe tool use detected
PASS
Research Veto✅ PASS — Applicable
DimensionResultDetail
Scientific IntegrityPASSNo scientific-integrity problem was surfaced because the package did not claim more than the available records, article text, or script evidence supported.
Practice BoundariesPASSThe evaluated outputs stayed inside the Statistical analysis and reporting for experimental datasets and did not drift into unsupported interpretation beyond the available inputs.
Methodological GroundPASSMethodological grounding was preserved through the documented inputs, transformations, and expected artifacts.
Code UsabilityPASSThe legacy audit did not record a code-usability failure in the packaged analysis path.

Core Capability88 / 1008 Categories

Functional Suitability
The archived review left a small gap in how directly Statistical analysis and reporting for experimental datasets resolves into a finished analysis deliverable.
11 / 12
92%
Reliability
The legacy audit preserved a modest reliability gap around harder runs or more demanding inputs.
10 / 12
83%
Performance & Context
No point loss was recorded for performance context in the legacy audit.
8 / 8
100%
Agent Usability
The packaged analysis path is understandable, though the archived score suggests slightly clearer routing would help.
14 / 16
88%
Human Usability
No point loss was recorded for human usability in the legacy audit.
8 / 8
100%
Security
Security remained strong, though the archived review still left some room for clearer execution guardrails.
10 / 12
83%
Maintainability
The analysis package is maintainable overall, though the archived score suggests modest cleanup headroom.
10 / 12
83%
Agent-Specific
The package is strongly shaped for agent use, though the archived score still left a small gap in execution determinism.
17 / 20
85%
Core Capability Total88 / 100

Medical TaskExecution Average: 93.6 / 100 — Assertions: 20/20 Passed

98
Canonical
You have experimental results in CSV form and need a reproducible end-to-end analysis workflow (clean → test → report)
4/4
94
Variant A
You need to compare two conditions (independent or paired) and determine statistical significance with effect sizes
4/4
92
Edge
Reproducible, run-based execution that writes all artifacts into outputs/runs/<timestamp>/
4/4
92
Variant B
Data preparation guidance: missing values, outliers, and variable type identification (continuous/categorical; grouping factors)
4/4
92
Stress
End-to-end case for Reproducible, run-based execution that writes all artifacts into outputs/runs/<timestamp>/
4/4
98
Canonical✅ Pass
You have experimental results in CSV form and need a reproducible end-to-end analysis workflow (clean → test → report)

The You have experimental results in CSV form and need a reproducible... scenario completed within the documented Statistical analysis and reporting for experimental datasets boundary.

Basic 38/40|Specialized 60/60|Total 98/100
A1The experimental-data-analysis output structure matches the documented deliverable
A2The script execution path completed successfully for the documented case
A3The output stays fully within the documented skill boundary
A4The response quality is acceptable for the documented path
Pass rate: 4 / 4
94
Variant A✅ Pass
You need to compare two conditions (independent or paired) and determine statistical significance with effect sizes

The archived evaluation treated You need to compare two conditions (independent or paired) and... as a clean in-scope run.

Basic 36/40|Specialized 58/60|Total 94/100
A1The experimental-data-analysis output structure matches the documented deliverable
A2The script execution path completed successfully for the documented case
A3The output stays fully within the documented skill boundary
A4The response quality is acceptable for the documented path
Pass rate: 4 / 4
92
Edge✅ Pass
Reproducible, run-based execution that writes all artifacts into outputs/runs/<timestamp>/

The Reproducible, run-based execution that writes all artifacts into... scenario completed within the documented Statistical analysis and reporting for experimental datasets boundary.

Basic 35/40|Specialized 57/60|Total 92/100
A1The experimental-data-analysis output structure matches the documented deliverable
A2The script execution path completed successfully for the documented case
A3The output stays fully within the documented skill boundary
A4The response quality is acceptable for the documented path
Pass rate: 4 / 4
92
Variant B✅ Pass
Data preparation guidance: missing values, outliers, and variable type identification (continuous/categorical; grouping factors)

Data preparation guidance: missing values, outliers, and variable... remained well-aligned with the documented contract in the preserved audit.

Basic 34/40|Specialized 58/60|Total 92/100
A1The experimental-data-analysis output structure matches the documented deliverable
A2The script execution path completed successfully for the documented case
A3The output stays fully within the documented skill boundary
A4The response quality is acceptable for the documented path
Pass rate: 4 / 4
92
Stress✅ Pass
End-to-end case for Reproducible, run-based execution that writes all artifacts into outputs/runs/<timestamp>/

The End-to-end case for Reproducible, run-based execution that writes... scenario completed within the documented Statistical analysis and reporting for experimental datasets boundary.

Basic 31/40|Specialized 60/60|Total 92/100
A1The experimental-data-analysis output structure matches the documented deliverable
A2The script execution path completed successfully for the documented case
A3The output stays fully within the documented skill boundary
A4The response quality is acceptable for the documented path
Pass rate: 4 / 4
Medical Task Total93.6 / 100

Key Strengths

  • Primary routing is Data Analysis with execution mode B
  • Static quality score is 88/100 and dynamic average is 82.6/100
  • Assertions and command execution outcomes are recorded per input for human review
  • Execution verification summary: Script verification 1/2; adjustment=3. analyze_experiment.py: rc=1; init_run.py: OK