Scholar Evaluation
AIPOCH
Implements the ScholarEval framework to evaluate scholarly documents; trigger when the user provides a PDF/DOCX/TXT file or pasted text and requests critique, scoring, or quality assessment.
Evaluation Report (Total Score: 86 / 100)
- Core Capability: 87 / 100
- Functional Suitability: 11 / 12
- Reliability: 10 / 12
- Performance & Context: 8 / 8
- Agent Usability: 14 / 16
- Human Usability: 8 / 8
- Security: 9 / 12
- Maintainability: 10 / 12
- Agent-Specific: 17 / 20
- Medical Task: 15 / 20 (Passed)
SKILL.md
When to Use
- Evaluate a research paper, thesis, or proposal and produce a structured critique with scores.
- Generate actionable revision recommendations across core academic writing dimensions.
- Compare multiple drafts/versions of a manuscript using consistent rubric-based scoring.
- Assess submission readiness (e.g., for a conference/journal) and identify major weaknesses.
- Review a document provided as a PDF/DOCX/TXT file when the user expects automatic text extraction.
Key Features
- Automatic text extraction from PDF/DOCX/TXT via `scripts/extract_text.py` (intended as the first step for file inputs).
- ScholarEval rubric with 8 evaluation dimensions (see `references/evaluation_framework.md`).
- Per-dimension scoring (1–5) with qualitative feedback and concrete recommendations.
- Weighted score calculation via `scripts/calculate_scores.py` from a JSON score file.
- Produces a final report summarizing strengths, weaknesses, and next steps.
Dependencies
- Python 3.10+
- See `requirements.txt` for pinned Python package versions (install via `pip install -r requirements.txt`).
Example Usage
A) Evaluate a PDF/DOCX/TXT file (end-to-end)
- Extract text (run this first for file inputs):
python scripts/extract_text.py "paper.pdf"
- Create a scores JSON (example: `scores.json`):
{
"problem_formulation": 4,
"literature_review": 3,
"methodology": 4,
"data_quality": 3,
"analysis": 4,
"results": 3,
"writing_quality": 4,
"citations": 3
}
- Compute the weighted/aggregate score:
python scripts/calculate_scores.py --scores scores.json
- Use the extracted text plus the rubric to generate the evaluation report:
  - Apply the 8-dimension criteria from `references/evaluation_framework.md`.
  - Provide per-dimension justification, then summarize strengths/risks and prioritized revisions.
B) Evaluate pasted text (no extraction)
If the user pastes text directly (e.g., abstract, full paper text), skip extraction and evaluate immediately using the 8 dimensions and the 1–5 scale.
Implementation Details
File ingestion protocol (for PDF/DOCX/TXT)
- For any user-provided file, run:
  python scripts/extract_text.py "<filename-or-path>"
- The extraction script is designed to locate the file even if the full path is not provided.
- Use the extracted plain text as the sole input to the evaluation rubric and scoring.
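The file-location behaviour described above could be implemented along these lines. This is a hypothetical sketch, not the actual contents of `scripts/extract_text.py`; `resolve_input` is an illustrative name:

```python
from pathlib import Path


def resolve_input(name: str, root: str = ".") -> Path:
    """Return a usable path for `name`, searching under `root` if needed.

    Mirrors the behaviour described for scripts/extract_text.py: the user
    may pass a bare filename rather than a full path. (Illustrative sketch;
    the real script's logic may differ.)
    """
    p = Path(name)
    if p.exists():
        return p
    # Fall back to a recursive search for the bare filename.
    matches = list(Path(root).rglob(p.name))
    if not matches:
        raise FileNotFoundError(f"{name!r} not found under {root!r}")
    return matches[0]
```

A recursive `rglob` keeps the user-facing contract simple: any filename that exists somewhere under the working tree is accepted, at the cost of picking the first match when duplicates exist.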
Evaluation dimensions (8)
The framework evaluates:
- Problem Formulation
- Literature Review
- Methodology
- Data Quality
- Analysis
- Results
- Writing Quality
- Citations
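Before aggregation, a scores file can be sanity-checked against these eight dimension keys. A minimal sketch, with key names taken from the `scores.json` example above (`validate_scores` is an illustrative helper, not part of the shipped scripts):

```python
# The eight ScholarEval dimensions, as they appear in scores.json.
DIMENSIONS = {
    "problem_formulation", "literature_review", "methodology",
    "data_quality", "analysis", "results", "writing_quality", "citations",
}


def validate_scores(scores: dict) -> None:
    """Raise ValueError unless scores covers all 8 dimensions with ints in 1-5."""
    missing = DIMENSIONS - scores.keys()
    extra = scores.keys() - DIMENSIONS
    if missing or extra:
        raise ValueError(f"missing={sorted(missing)}, unexpected={sorted(extra)}")
    bad = {d: v for d, v in scores.items() if not (isinstance(v, int) and 1 <= v <= 5)}
    if bad:
        raise ValueError(f"scores outside 1-5: {bad}")
```

Failing fast here gives the user a precise error (which key is missing or out of range) before any aggregate score is reported.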
Detailed criteria and guidance are defined in:
references/evaluation_framework.md
Scoring scale (1–5)
- 1 — Poor: Major flaws; not usable as-is.
- 2 — Weak: Significant issues; major revision required.
- 3 — Average: Acceptable baseline; improvement needed.
- 4 — Good: Strong overall; minor issues.
- 5 — Excellent: High quality; clear impact and rigor.
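When rendering a report, the scale above can be captured as a small lookup (labels copied verbatim from the list; illustrative, not part of the shipped scripts):

```python
# Labels from the 1-5 scoring scale, for report rendering.
SCALE_LABELS = {
    1: "Poor",
    2: "Weak",
    3: "Average",
    4: "Good",
    5: "Excellent",
}
```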
Score calculation
- Raw per-dimension scores are stored in a JSON file and passed to:
python scripts/calculate_scores.py --scores <path_to_scores_json>
- The script computes an aggregate score, applying any configured weighting logic to the provided per-dimension metrics.
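A minimal sketch of what the aggregation might do, assuming equal default weights (the real `scripts/calculate_scores.py` may weight dimensions differently):

```python
# Hypothetical equal weights; the shipped script may configure these
# differently. With equal weights the aggregate is a plain mean.
DEFAULT_WEIGHTS = dict.fromkeys(
    ["problem_formulation", "literature_review", "methodology", "data_quality",
     "analysis", "results", "writing_quality", "citations"], 1.0)


def aggregate(scores: dict[str, int],
              weights: dict[str, float] = DEFAULT_WEIGHTS) -> float:
    """Weighted mean of per-dimension scores on the 1-5 scale."""
    total_w = sum(weights[d] for d in scores)
    return sum(v * weights[d] for d, v in scores.items()) / total_w
```

With the example `scores.json` above (alternating 4s and 3s across the eight dimensions), the equal-weight aggregate is 3.5 on the 1–5 scale; the scores dict itself would typically come from `json.load`.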