Agent Skills

Xgboost Analysis

AIPOCH

Use when building XGBoost models on tabular data and returning feature importance ranking outputs. Supports binary classification and regression with automatic task detection, train-test split, performance tables, feature importance ranking tables, and PNG importance plots.

12
0
FILES
xgboost-analysis/
skill.md
scripts
functions.R
main.R
run_analysis.R
utils.R
references
algorithm.md
cli-guide.md
troubleshooting.md
88100Total Score
View Evaluation Report
Core Capability
87 / 100
Functional Suitability
10 / 12
Reliability
11 / 12
Performance & Context
7 / 8
Agent Usability
14 / 16
Human Usability
7 / 8
Security
11 / 12
Maintainability
10 / 12
Agent-Specific
17 / 20
Medical Task
20 / 20 Passed
89Auto binary classification on dt_sample1
4/4
92Character-label classification on dt_sample3
4/4
84Missing target column rejection
4/4
88Frequency metric TXT export on dt_sample2
4/4
89Regression run on dt_sample1 using RIBC2 target
4/4

SKILL.md

XGBoost Modeling And Feature Importance Ranking

Use this skill to train an XGBoost model from a tabular dataset and export both feature importance ranking tables and feature importance plots.

Use This Skill When

  • You need a command-line XGBoost workflow in R for tabular data.
  • You need a reproducible train-test split, model training, and evaluation.
  • You need feature importance ranking outputs as both a table and a figure.
  • You need automatic one-hot encoding for categorical predictors.
  • Your data may contain a first unnamed sample ID column such as V1 that should not enter the model.

Do Not Use This Skill When

  • Your classification target has more than 2 classes.
  • Your input is not tabular CSV, TXT, or TSV data.
  • You need causal interpretation, mechanism claims, or policy, business, or clinical conclusions.
  • You only need narrative interpretation or triage of an existing result rather than model training.

Primary Command

Rscript scripts/main.R \
  --data_file <input_file> \
  --target_var <target_column> \
  --task_type <auto|classification|regression> \
  --output_dir <output_dir>

Prerequisites

  • Rscript is available in the shell.
  • Required R packages: optparse, data.table, Matrix, xgboost.
  • Install missing packages with Rscript -e 'install.packages(c("optparse", "data.table", "Matrix", "xgboost"), repos="https://cloud.r-project.org")'.

Core Arguments

ArgumentRequiredDescription
--data_fileYesInput CSV, TXT, or TSV file
--target_varYesTarget column used for modeling
--task_typeNoauto, classification, or regression. Default auto
--output_dirNoOutput directory, default ./XGBoost_Results
--ignore_varsNoComma-separated columns to exclude from predictors
--positive_classNoPositive class label for binary classification
--test_sizeNoTest set proportion between 0 and 1, default 0.2
--seedNoRandom seed, default 123
--nroundsNoMaximum boosting rounds, default 300
--max_depthNoTree depth, default 6
--etaNoLearning rate, default 0.1
--subsampleNoRow sampling ratio, default 0.8
--colsample_bytreeNoColumn sampling ratio, default 0.8
--min_child_weightNoMinimum child weight, default 1
--gammaNoMinimum split loss reduction, default 0
--lambdaNoL2 regularization, default 1
--alphaNoL1 regularization, default 0
--early_stopping_roundsNoEarly stopping rounds, default 20
--importance_metricNogain, cover, or frequency. Default gain
--top_nNoNumber of features to plot, default 20
--output_formatNoTable format: csv or txt, default csv
--output_prefixNoOutput filename prefix, default xgboost

Input Requirements

  • The input file must contain the target column.
  • Predictor columns can be numeric, integer, logical, character, or factor-like text.
  • Character and factor predictors are one-hot encoded automatically.
  • A first unnamed identifier column such as V1 is automatically excluded when it contains unique sample IDs.
  • Rows with missing target values are removed before training.
  • For classification, exactly 2 classes are required.
  • For regression, the target column must be numeric.
  • Each class should have at least 2 rows so both training and test sets can be created.

Example input:

,fustat,CAMK2N2,GGT6,GPR161,RAB26,RIBC2
TCGA-C5-A1M5,1,2.248291938,5.274690305,2.825215762,3.121114894,5.35318565
TCGA-EA-A5O9,0,3.346176843,5.404368414,2.604616977,0.629473197,4.429314674
TCGA-C5-A3HL,0,3.363100974,5.363314779,4.124799581,4.127228806,4.916596068

Minimal Workflow

  1. Confirm the input file exists and the target column name is correct.
  2. Run scripts/main.R with --data_file and --target_var.
  3. Check the output directory for table/feature_importance_* and figure/feature_importance_*.

If you omit --data_file or --target_var, the script exits with SKILL_MISSING_INPUT.

Outputs

Expected output structure:

<output_dir>/
├── table/
├── figure/
└── data/

Primary outputs:

  • table/<output_prefix>_feature_importance.csv
  • table/<output_prefix>_model_performance.csv
  • figure/<output_prefix>_feature_importance_<importance_metric>.png

Additional outputs:

  • session_info.txt

Feature importance table fields include:

  • Rank
  • Feature
  • Gain
  • Cover
  • Frequency
  • SelectedMetric
  • SelectedValue

Feature Importance Metrics

  • gain: Average contribution to loss reduction. Recommended for most ranking use cases.
  • cover: Relative sample coverage contributed by a feature.
  • frequency: How often the feature is used in splits.

Read These Files When Needed

NeedFile
XGBoost method details and importance interpretationreferences/algorithm.md
More CLI examplesreferences/cli-guide.md
Error diagnosisreferences/troubleshooting.md
Main execution entry pointscripts/main.R
Bundled test datatests/data/

Quick Examples

Auto-detected binary classification on dt_sample1.csv:

Rscript scripts/main.R \
  --data_file tests/data/dt_sample1.csv \
  --target_var fustat \
  --task_type auto \
  --output_dir tests/output_binary

Binary classification on dt_sample2.csv:

Rscript scripts/main.R \
  --data_file tests/data/dt_sample2.csv \
  --target_var fustat \
  --task_type classification \
  --importance_metric gain \
  --output_dir tests/output_gain

Character-label classification on dt_sample3.txt:

Rscript scripts/main.R \
  --data_file tests/data/dt_sample3.txt \
  --target_var Group \
  --task_type classification \
  --positive_class high \
  --top_n 15 \
  --output_dir tests/output_group

Validation

Rscript scripts/main.R --help
Rscript scripts/main.R \
  --data_file tests/data/dt_sample1.csv \
  --target_var fustat \
  --task_type classification \
  --output_dir tests/validation_output

After running analysis, verify that these files exist:

  • tests/validation_output/table/xgboost_feature_importance.csv
  • tests/validation_output/table/xgboost_model_performance.csv
  • tests/validation_output/figure/xgboost_feature_importance_gain.png

Common Errors

  • SKILL_FILE_NOT_FOUND: Input file path is wrong or inaccessible.
  • SKILL_MISSING_COLUMNS: The target column is missing.
  • SKILL_INVALID_DATA: Data is malformed, the target type is unsuitable, classification has more or fewer than 2 classes, or too few usable rows remain.
  • SKILL_INVALID_PARAMETER: An argument value is invalid.
  • SKILL_DEPENDENCY_MISSING: A required R package such as xgboost is unavailable.

If the issue is not obvious, read references/troubleshooting.md.