Data Analysis

meta-rob2-plot

89 / 100 Total Score
Core Capability
83 / 100
Functional Suitability
11 / 12
Reliability
10 / 12
Performance & Context
8 / 8
Agent Usability
13 / 16
Human Usability
7 / 8
Security
9 / 12
Maintainability
9 / 12
Agent-Specific
16 / 20
Medical Task
20 / 20 Passed
98 "Draw ROB2 risk-of-bias plots, including a Traffic Light Plot and a Summary Bar Plot. Input is a CSV file with ROB2 assessments for each study; output are two PNG plot files."
4/4
94 Step 2: Execute R script
4/4
92 Step 1: Validate input data
4/4
92 Step 3: Output results
4/4
92 Step 3: Output results
4/4

Veto Gates: Required pass for any deployment consideration

Skill Veto: ✓ All 4 gates passed
Operational Stability
System remains stable across varied inputs and edge cases
PASS
Structural Consistency
Output structure conforms to expected skill contract format
PASS
Result Determinism
Equivalent inputs produce semantically equivalent outputs
PASS
System Security
No prompt injection, data leakage, or unsafe tool use detected
PASS
Research Veto: ✅ PASS (Applicable)
Dimension | Result | Detail
Scientific Integrity | PASS | The archived review kept this workflow anchored to supplied data fields and observable execution behavior, not fabricated results.
Practice Boundaries | PASS | The evaluated outputs stayed inside the documented task ("Draw ROB2 risk-of-bias plots, including a Traffic Light Plot and a Summary Bar Plot. Input...") and did not drift into unsupported interpretation beyond the available inputs.
Methodological Ground | PASS | The workflow stayed grounded in its declared rubric or scale-selection logic rather than improvised criteria.
Code Usability | PASS | No code-usability failure was recorded for meta-rob2-plot in the legacy evaluation.

Core Capability: 83 / 100 (8 Categories)

Functional Suitability
The package fits its analysis task well, although the final artifact contract could still be sharpened slightly.
11 / 12
92%
Reliability
Reliability remained good, but the archived review still saw room for steadier behavior under edge conditions.
10 / 12
83%
Performance & Context
No points were lost for Performance & Context in the legacy audit.
8 / 8
100%
Agent Usability
The archived review left some headroom in how quickly an agent can lock onto the intended analysis path.
13 / 16
81%
Human Usability
The archived deduction in Human Usability traces back to a need for minor polish before wide rollout; no major defects were found.
7 / 8
88%
Security
Security remained strong, though the archived review still left some room for clearer execution guardrails.
9 / 12
75%
Maintainability
The analysis package is maintainable overall, though the archived score suggests modest cleanup headroom.
9 / 12
75%
Agent-Specific
Agent-specific quality remained high, with only modest headroom in structured prompting or edge handling.
16 / 20
80%
Core Capability Total: 83 / 100
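
The category percentages and the total above follow directly from the raw points. The sketch below recomputes them in Python; the half-up rounding rule is an assumption, since the report does not state how percentages were rounded.

```python
# Category scores as (points earned, points available), copied from the report.
CATEGORIES = {
    "Functional Suitability": (11, 12),
    "Reliability": (10, 12),
    "Performance & Context": (8, 8),
    "Agent Usability": (13, 16),
    "Human Usability": (7, 8),
    "Security": (9, 12),
    "Maintainability": (9, 12),
    "Agent-Specific": (16, 20),
}


def percent(earned, available):
    """Whole-number percentage, rounding halves up (assumed rounding rule)."""
    return int(100 * earned / available + 0.5)


def total(categories):
    """Return (points earned, points available) summed over all categories."""
    earned = sum(e for e, _ in categories.values())
    available = sum(a for _, a in categories.values())
    return earned, available


if __name__ == "__main__":
    for name, (e, a) in CATEGORIES.items():
        print(f"{name}: {e} / {a} ({percent(e, a)}%)")
    print("Core Capability Total: %d / %d" % total(CATEGORIES))
```

Under that rounding assumption, the recomputed values agree with the figures listed in this section.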

Medical Task: Execution Average 93.6 / 100; Assertions 20/20 Passed

98
Canonical: ✅ Pass
"Draw ROB2 risk-of-bias plots, including a Traffic Light Plot and a Summary Bar Plot. Input is a CSV file with ROB2 assessments for each study; output are two PNG plot files."

For "Draw ROB2 risk-of-bias plots, including a Traffic Light Plot and a...", the preserved evidence is lightweight but positive: the packaged validation command behaved as expected.

Basic 38/40|Specialized 60/60|Total 98/100
A1: The meta-rob2-plot output structure matches the documented deliverable
A2: The script execution path completed successfully for the documented case
A3: The output stays fully within the documented skill boundary
A4: The response quality is acceptable for the documented path
Pass rate: 4 / 4
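
The Summary Bar Plot named in the task shows, for each domain, the share of studies at each judgment level. The sketch below shows how that tabulation could be done before plotting. It is not the skill's own script: the column names (`Study`, `D1` to `D5`, `Overall`) are hypothetical, the sample rows are made up for illustration, and the judgment labels are the standard RoB 2 levels.

```python
import csv
import io
from collections import Counter

JUDGMENTS = ("Low", "Some concerns", "High")          # standard RoB 2 judgment levels
DOMAINS = ("D1", "D2", "D3", "D4", "D5", "Overall")   # hypothetical column names


def summary_shares(csv_text):
    """Return {domain: {judgment: percentage of studies}} for a summary bar plot."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    shares = {}
    for domain in DOMAINS:
        counts = Counter(row[domain].strip() for row in rows)
        shares[domain] = {j: 100 * counts[j] / len(rows) for j in JUDGMENTS}
    return shares


# Made-up assessments, for illustration only.
SAMPLE = """Study,D1,D2,D3,D4,D5,Overall
Trial A,Low,Low,Some concerns,Low,Low,Some concerns
Trial B,High,Low,Low,Low,Low,High
Trial C,Low,Low,Low,Low,Low,Low
Trial D,Low,Some concerns,Low,Low,Low,Some concerns
"""

if __name__ == "__main__":
    for domain, by_judgment in summary_shares(SAMPLE).items():
        print(domain, by_judgment)
```

Each domain's percentages would become one stacked horizontal bar in the summary plot. The function assumes at least one data row.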
94
Variant A: ✅ Pass
Step 2: Execute R script

For Step 2: Execute R script, the preserved evidence is lightweight but positive: the packaged validation command behaved as expected.

Basic 36/40|Specialized 58/60|Total 94/100
A1: The meta-rob2-plot output structure matches the documented deliverable
A2: The script execution path completed successfully for the documented case
A3: The output stays fully within the documented skill boundary
A4: The response quality is acceptable for the documented path
Pass rate: 4 / 4
92
Edge: ✅ Pass
Step 1: Validate input data

For Step 1: Validate input data, the preserved evidence is lightweight but positive: the packaged validation command behaved as expected.

Basic 35/40|Specialized 57/60|Total 92/100
A1: The meta-rob2-plot output structure matches the documented deliverable
A2: The script execution path completed successfully for the documented case
A3: The output stays fully within the documented skill boundary
A4: The response quality is acceptable for the documented path
Pass rate: 4 / 4
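
The report does not reproduce what "Validate input data" checks. As an illustration only, a minimal validator for a ROB2 CSV might confirm that the expected columns exist and that every judgment is one of the three standard RoB 2 levels. The column names below are hypothetical, and the function is a sketch, not the packaged validation command.

```python
import csv
import io

REQUIRED = ("Study", "D1", "D2", "D3", "D4", "D5", "Overall")  # hypothetical column names
ALLOWED = {"Low", "Some concerns", "High"}                     # standard RoB 2 judgment levels


def validate_rob2_csv(csv_text):
    """Return a list of problems; an empty list means the input looks usable."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    missing = [c for c in REQUIRED if c not in header]
    if missing:
        return [f"missing column(s): {', '.join(missing)}"]
    problems = []
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        if not (row["Study"] or "").strip():
            problems.append(f"line {line_no}: empty study name")
        for col in REQUIRED[1:]:
            value = (row[col] or "").strip()  # short rows yield None for missing cells
            if value not in ALLOWED:
                problems.append(f"line {line_no}: {col} has unexpected value {value!r}")
    return problems
```

Returning a list of messages instead of raising on the first error lets an agent report every problem in one pass, which suits an edge-case input with several defects.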
92
Variant B: ✅ Pass
Step 3: Output results

The Step 3: Output results path verified the packaged helper command without exposing a deeper execution issue.

Basic 34/40|Specialized 58/60|Total 92/100
A1: The meta-rob2-plot output structure matches the documented deliverable
A2: The script execution path completed successfully for the documented case
A3: The output stays fully within the documented skill boundary
A4: The response quality is acceptable for the documented path
Pass rate: 4 / 4
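
Since the documented deliverable is two PNG files, one lightweight way to check the output step is to confirm that each expected file exists and starts with the PNG file signature. The sketch below uses only the standard library. The file names in the usage note are hypothetical, as the report does not list the actual output paths.

```python
from pathlib import Path

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # the first eight bytes of every PNG file


def looks_like_png(path):
    """True if the path is an existing file that begins with the PNG signature."""
    p = Path(path)
    if not p.is_file():
        return False
    with p.open("rb") as fh:
        return fh.read(8) == PNG_SIGNATURE


def check_outputs(paths):
    """Map each expected output path to True (valid PNG header) or False."""
    return {str(p): looks_like_png(p) for p in paths}
```

A call such as `check_outputs(["traffic_light.png", "summary_bar.png"])` would then report which of the two plots was actually written. The check confirms only the header, not that the image content is correct.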
92
Stress: ✅ Pass
Step 3: Output results

For Step 3: Output results, the preserved evidence is lightweight but positive: the packaged validation command behaved as expected.

Basic 31/40|Specialized 60/60|Total 92/100
A1: The meta-rob2-plot output structure matches the documented deliverable
A2: The script execution path completed successfully for the documented case
A3: The output stays fully within the documented skill boundary
A4: The response quality is acceptable for the documented path
Pass rate: 4 / 4
Medical Task Total: 93.6 / 100
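
The 93.6 figure is consistent with a plain arithmetic mean of the five per-input totals. The short sketch below reproduces it, assuming equal weighting of the inputs; the report does not state the weighting explicitly.

```python
# Per-input totals copied from the five evaluated inputs above.
SCORES = {"Canonical": 98, "Variant A": 94, "Edge": 92, "Variant B": 92, "Stress": 92}


def execution_average(scores):
    """Unweighted mean of the per-input totals, rounded to one decimal place."""
    return round(sum(scores.values()) / len(scores), 1)


if __name__ == "__main__":
    print(execution_average(SCORES))
```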

Key Strengths

  • Primary routing is Data Analysis with execution mode B
  • Static quality score is 83/100 and dynamic average is 82.6/100
  • Assertions and command execution outcomes are recorded per input for human review
  • Execution verification summary: script verification 1/2 (adjustment=3). rob2_plot.py exited with rc=1; validate_skill.py: OK
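
A summary like "Script verification 1/2" can be produced by running each script and counting zero return codes. The sketch below shows that idea with Python's `subprocess` module. The helper names are hypothetical, and the two commands are stand-ins that simply exit with 1 and 0; they are not the real rob2_plot.py or validate_skill.py invocations, which the report does not show.

```python
import subprocess
import sys


def run_checks(commands):
    """Run each named command and return {name: return code}."""
    results = {}
    for name, argv in commands.items():
        results[name] = subprocess.run(argv, capture_output=True).returncode
    return results


def summarize(results):
    """Count commands that exited with return code 0."""
    ok = sum(1 for rc in results.values() if rc == 0)
    return f"Script verification {ok}/{len(results)}"


if __name__ == "__main__":
    # Stand-in commands: the first exits with rc=1, the second with rc=0.
    demo = {
        "failing_script": [sys.executable, "-c", "raise SystemExit(1)"],
        "passing_script": [sys.executable, "-c", "pass"],
    }
    results = run_checks(demo)
    print(summarize(results), results)
```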