Data Analysis

meta-rob2-plot

89 / 100 Total Score

Core Capability

83 / 100

Functional Suitability

11 / 12

Reliability

10 / 12

Performance & Context

8 / 8

Agent Usability

13 / 16

Human Usability

7 / 8

Security

9 / 12

Maintainability

9 / 12

Agent-Specific

16 / 20

Medical Task

20 / 20 Passed

98/100 | "Draw ROB2 risk-of-bias plots, including a Traffic Light Plot and a Summary Bar Plot. Input is a CSV file with ROB2 assessments for each study; output are two PNG plot files."

4/4

94/100 | Step 2: Execute R script

4/4

92/100 | Step 1: Validate input data

4/4

92/100 | Step 3: Output results

4/4

92/100 | Step 3: Output results

4/4

Veto Gates: Required pass for any deployment consideration

Skill Veto: ✓ All 4 gates passed

✓

Operational Stability

System remains stable across varied inputs and edge cases

PASS

✓

Structural Consistency

Output structure conforms to expected skill contract format

PASS

✓

Result Determinism

Equivalent inputs produce semantically equivalent outputs

PASS

✓

System Security

No prompt injection, data leakage, or unsafe tool use detected

PASS

Research Veto: ✅ PASS (Applicable)

Dimension	Result	Detail
Scientific Integrity	PASS	The archived review kept this workflow anchored to supplied data fields and observable execution behavior, not fabricated results.
Practice Boundaries	PASS	The evaluated outputs stayed inside the "Draw ROB2 risk-of-bias plots, including a Traffic Light Plot and a Summary Bar Plot. Input..." task boundary and did not drift into unsupported interpretation beyond the available inputs.
Methodological Ground	PASS	The workflow stayed grounded in its declared rubric or scale-selection logic rather than improvised criteria.
Code Usability	PASS	No code-usability failure was preserved for meta-rob2-plot in the legacy evaluation.

Core Capability: 83 / 100 (8 Categories)

Functional Suitability

The package fits its analysis task well, although the final artifact contract could still be sharpened slightly.

11 / 12

92%

Reliability

Reliability remained good, but the archived review still saw room for steadier behavior under edge conditions.

10 / 12

83%

Performance & Context

No point loss was recorded for performance context in the legacy audit.

8 / 8

100%

Agent Usability

The archived review left some headroom in how quickly an agent can lock onto the intended analysis path.

13 / 16

81%

Human Usability

The one-point deduction in human usability traces back to minor polish needed before wide rollout; no major defects were found.

7 / 8

88%

Security

Security remained strong, though the archived review still left some room for clearer execution guardrails.

9 / 12

75%

Maintainability

The analysis package is maintainable overall, though the archived score suggests modest cleanup headroom.

9 / 12

75%

Agent-Specific

Agent-specific quality remained high, with only modest headroom in structured prompting or edge handling.

16 / 20

80%
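The eight category scores above can be cross-checked against the stated total and percentages. The following is a minimal Python sketch, assuming the percentages are rounded half up (the report does not state its rounding rule). The names `CATEGORIES`, `percent`, and `summarize` are illustrative and not part of meta-rob2-plot.

```python
# Category scores as (earned, maximum), copied from the report above.
CATEGORIES = {
    "Functional Suitability": (11, 12),
    "Reliability": (10, 12),
    "Performance & Context": (8, 8),
    "Agent Usability": (13, 16),
    "Human Usability": (7, 8),
    "Security": (9, 12),
    "Maintainability": (9, 12),
    "Agent-Specific": (16, 20),
}


def percent(earned, maximum):
    """Percentage rounded half up, using integer arithmetic only.

    Assumption: the report rounds half up; this is not stated in the source.
    """
    return (200 * earned + maximum) // (2 * maximum)


def summarize(categories):
    """Return (total earned, total maximum, per-category percentages)."""
    total = sum(earned for earned, _ in categories.values())
    out_of = sum(maximum for _, maximum in categories.values())
    percents = {
        name: percent(earned, maximum)
        for name, (earned, maximum) in categories.items()
    }
    return total, out_of, percents


if __name__ == "__main__":
    total, out_of, percents = summarize(CATEGORIES)
    print(f"Core Capability Total: {total} / {out_of}")
    for name, value in percents.items():
        print(f"{name}: {value}%")
```

Under that rounding assumption, the sketch reproduces the 83 / 100 total and each listed percentage.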

Core Capability Total: 83 / 100

Medical Task: Execution Average 93.6 / 100, Assertions 20/20 Passed

Canonical

"Draw ROB2 risk-of-bias plots, including a Traffic Light Plot and a Summary Bar Plot. Input is a CSV file with ROB2 assessments for each study; output are two PNG plot files."

4/4 ✓

Variant A

Step 2: Execute R script

4/4 ✓

Edge

Step 1: Validate input data

4/4 ✓

Variant B

Step 3: Output results

4/4 ✓

Stress

Step 3: Output results

4/4 ✓

Canonical: ✅ Pass

"Draw ROB2 risk-of-bias plots, including a Traffic Light Plot and a Summary Bar Plot. Input is a CSV file with ROB2 assessments for each study; output are two PNG plot files."

For "Draw ROB2 risk-of-bias plots, including a Traffic Light Plot and a...", the preserved evidence is lightweight but positive: the packaged validation command behaved as expected.

Basic 38/40 | Specialized 60/60 | Total 98/100

✅ A1: The meta-rob2-plot output structure matches the documented deliverable

✅ A2: The script execution path completed successfully for the documented case

✅ A3: The output stays fully within the documented skill boundary

✅ A4: The response quality is acceptable for the documented path

Pass rate: 4 / 4

Variant A: ✅ Pass

Step 2: Execute R script

For Step 2: Execute R script, the preserved evidence is lightweight but positive: the packaged validation command behaved as expected.

Basic 36/40 | Specialized 58/60 | Total 94/100

✅ A1: The meta-rob2-plot output structure matches the documented deliverable

✅ A2: The script execution path completed successfully for the documented case

✅ A3: The output stays fully within the documented skill boundary

✅ A4: The response quality is acceptable for the documented path

Pass rate: 4 / 4

Edge: ✅ Pass

Step 1: Validate input data

For Step 1: Validate input data, the preserved evidence is lightweight but positive: the packaged validation command behaved as expected.

Basic 35/40 | Specialized 57/60 | Total 92/100

✅ A1: The meta-rob2-plot output structure matches the documented deliverable

✅ A2: The script execution path completed successfully for the documented case

✅ A3: The output stays fully within the documented skill boundary

✅ A4: The response quality is acceptable for the documented path

Pass rate: 4 / 4
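The report does not show what "Validate input data" checks, so the following is only a sketch of what such a step might look like for a ROB2 CSV. The column names (`Study`, `D1` to `D5`, `Overall`) and the accepted judgement labels are assumptions modelled on the five RoB 2 domains plus an overall judgement. They are not taken from meta-rob2-plot, and `validate_rob2_csv` is a hypothetical helper.

```python
import csv
import io

# Hypothetical schema: the report does not list the skill's real column names.
REQUIRED_COLUMNS = ["Study", "D1", "D2", "D3", "D4", "D5", "Overall"]
# Assumed judgement labels, compared case-insensitively.
ALLOWED_JUDGEMENTS = {"low", "some concerns", "high"}


def validate_rob2_csv(text):
    """Return a list of human-readable problems; an empty list means valid."""
    reader = csv.DictReader(io.StringIO(text))
    header = reader.fieldnames or []
    missing = [col for col in REQUIRED_COLUMNS if col not in header]
    if missing:
        return [f"missing columns: {', '.join(missing)}"]

    problems = []
    row_count = 0
    for row_count, row in enumerate(reader, start=1):
        # Short rows yield None for absent fields, so normalise to "".
        if not (row["Study"] or "").strip():
            problems.append(f"row {row_count}: empty Study")
        for col in REQUIRED_COLUMNS[1:]:
            value = (row[col] or "").strip().lower()
            if value not in ALLOWED_JUDGEMENTS:
                problems.append(
                    f"row {row_count}, {col}: unexpected value {row[col]!r}"
                )
    if row_count == 0:
        problems.append("no data rows")
    return problems


if __name__ == "__main__":
    sample = (
        "Study,D1,D2,D3,D4,D5,Overall\n"
        "Smith 2020,Low,Low,Some concerns,Low,High,High\n"
    )
    print(validate_rob2_csv(sample))
```

Returning a list of problems rather than raising on the first one lets a caller report every bad cell in a single pass, which suits an edge-case check like this one.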

Variant B: ✅ Pass

Step 3: Output results

The Step 3: Output results path verified the packaged helper command without exposing a deeper execution issue.

Basic 34/40 | Specialized 58/60 | Total 92/100

✅ A1: The meta-rob2-plot output structure matches the documented deliverable

✅ A2: The script execution path completed successfully for the documented case

✅ A3: The output stays fully within the documented skill boundary

✅ A4: The response quality is acceptable for the documented path

Pass rate: 4 / 4

Stress: ✅ Pass

Step 3: Output results

For Step 3: Output results, the preserved evidence is lightweight but positive: the packaged validation command behaved as expected.

Basic 31/40 | Specialized 60/60 | Total 92/100

✅ A1: The meta-rob2-plot output structure matches the documented deliverable

✅ A2: The script execution path completed successfully for the documented case

✅ A3: The output stays fully within the documented skill boundary

✅ A4: The response quality is acceptable for the documented path

Pass rate: 4 / 4
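The execution average and assertion count follow from the five case totals and pass rates listed above. This short Python sketch of that arithmetic uses the stated per-case totals; note that the Stress case's listed sub-scores (31 + 60) sum to 91 while its stated total is 92. The names `CASE_TOTALS`, `CASE_ASSERTIONS`, and `execution_summary` are illustrative only.

```python
# Stated per-case totals (out of 100), copied from the case blocks above.
CASE_TOTALS = {
    "Canonical": 98,
    "Variant A": 94,
    "Edge": 92,
    "Variant B": 92,
    "Stress": 92,
}

# (passed, total) assertions per case, from the "Pass rate" lines above.
CASE_ASSERTIONS = {name: (4, 4) for name in CASE_TOTALS}


def execution_summary(totals, assertions):
    """Return (mean case total rounded to 1 decimal, passed, total assertions)."""
    average = round(sum(totals.values()) / len(totals), 1)
    passed = sum(p for p, _ in assertions.values())
    total = sum(t for _, t in assertions.values())
    return average, passed, total


if __name__ == "__main__":
    average, passed, total = execution_summary(CASE_TOTALS, CASE_ASSERTIONS)
    print(f"Execution Average: {average} / 100")
    print(f"Assertions: {passed}/{total} Passed")
```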

Medical Task Total: 93.6 / 100

Key Strengths

Primary routing is Data Analysis with execution mode B
Static quality score is 83/100 and dynamic average is 82.6/100
Assertions and command execution outcomes are recorded per input for human review
Execution verification summary: Script verification 1/2; adjustment=3. rob2_plot.py: rc=1; validate_skill.py: OK
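A tally such as "Script verification 1/2" can be produced by running each script and counting zero return codes. This Python sketch assumes the usual convention that `rc=0` means success and any other value means failure. It does not model the `adjustment=3` figure, whose derivation the report does not explain. The `verify_scripts` helper and the demo commands are hypothetical stand-ins, not the skill's real scripts.

```python
import subprocess
import sys


def verify_scripts(commands):
    """Run each command and record its return code; rc == 0 counts as OK.

    `commands` maps a label to an argv list. Returns (return codes by label,
    "passed/total" summary string).
    """
    results = {}
    for name, argv in commands.items():
        completed = subprocess.run(argv, capture_output=True, text=True)
        results[name] = completed.returncode
    passed = sum(1 for rc in results.values() if rc == 0)
    return results, f"{passed}/{len(results)}"


if __name__ == "__main__":
    # Stand-in commands for illustration; one exits with 1, one with 0.
    demo = {
        "failing_script": [sys.executable, "-c", "import sys; sys.exit(1)"],
        "passing_script": [sys.executable, "-c", "pass"],
    }
    results, summary = verify_scripts(demo)
    for name, rc in results.items():
        print(f"{name}: {'OK' if rc == 0 else f'rc={rc}'}")
    print(f"Script verification {summary}")
```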