Batch Effect Correction

AIPOCH

Use when correcting batch effects in merged bulk expression matrices with sample-level batch metadata while preserving biological group structure and generating before-and-after QC plots. NOT for: single-cell integration, raw FASTQ processing, differential expression without batch labels, or datasets without biological groups.

FILES
batch-effect-correction/
  skill.md
  scripts/
    functions.R
    input_functions.R
    main.R
    output_utils.R
    plotting.R
    run_analysis.R
    utils.R
  references/
    algorithm.md
    cli-guide.md
    troubleshooting.md

Total Score: 89 / 100

Core Capability: 90 / 100

| Dimension | Score |
| --- | --- |
| Functional Suitability | 12 / 12 |
| Reliability | 11 / 12 |
| Performance & Context | 7 / 8 |
| Agent Usability | 15 / 16 |
| Human Usability | 8 / 8 |
| Security | 11 / 12 |
| Maintainability | 11 / 12 |
| Agent-Specific | 15 / 20 |

Medical Task: 24 / 25 Passed

| Test case | Score | Passed |
| --- | --- | --- |
| ComBat correction on 2-batch 2-group expression matrix | 92 | 5/5 |
| Custom metadata columns with log_transform disabled | 90 | 5/5 |
| Batch with only 1 sample (below ComBat minimum) | 87 | 5/5 |
| Expression matrix with extra QC sample columns not in metadata | 87 | 5/5 |
| Expression matrix with negative values and forced log-transform | 85 | 4/5 |

SKILL.md

Batch Effect Correction

Prerequisites

Run the following before the first analysis to install all required R packages:

Rscript -e "if (!requireNamespace('BiocManager', quietly=TRUE)) install.packages('BiocManager', repos='https://cloud.r-project.org'); BiocManager::install(c('sva', 'limma')); install.packages('ggplot2', repos='https://cloud.r-project.org')"

Note: sva and limma are Bioconductor packages and require BiocManager for installation. ggplot2 is a standard CRAN package.

The skill cannot run until these packages are installed. In new or bare R environments, always run the prerequisite step first.
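
A quick check along the following lines can confirm whether the environment is ready before the first run. This is a minimal sketch using only base R's requireNamespace(); the helper name missing_packages is hypothetical and is not taken from the skill's scripts.

```r
# Hypothetical helper (not part of the skill's scripts): return the names of
# packages that cannot be loaded from the current R library paths.
missing_packages <- function(pkgs) {
  available <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!available]
}

required <- c("sva", "limma", "ggplot2")
not_installed <- missing_packages(required)

if (length(not_installed) > 0) {
  message("Missing packages: ", paste(not_installed, collapse = ", "))
} else {
  message("All required packages are available.")
}
```

If anything is reported as missing, run the install command above and repeat the check.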


When to Read External Files

| Situation | File to Read | Purpose |
| --- | --- | --- |
| Need algorithm details | references/algorithm.md | ComBat workflow, assumptions, and QC logic |
| Need to run analysis | scripts/main.R | Execute: Rscript scripts/main.R --input_file ... --group_file ... |
| Encounter errors | references/troubleshooting.md | Common errors and solutions |
| Need CLI examples | references/cli-guide.md | Detailed CLI usage examples and baseline run record |
| Need test data | tests/data/ | Sample input files for testing |

Usage

Rscript scripts/main.R \
  --input_file ./expression_matrix.csv \
  --group_file ./sample_info.csv \
  --output_dir ./output/ \
  --batch_column batch \
  --group_column group \
  --sample_column sample \
  --log_transform auto \
  --timeout_seconds 600 \
  --seed 42

Arguments

| Short | Long | Type | Default | Description |
| --- | --- | --- | --- | --- |
| -i | --input_file | character | required | Expression matrix file (genes as rows, samples as columns) |
| -g | --group_file | character | required | Sample metadata file (sample ID, group, and batch columns) |
| -o | --output_dir | character | ./output/ | Output directory |
| -b | --batch_column | character | batch | Batch column name in metadata |
| -c | --group_column | character | group | Biological group column name in metadata |
| -n | --sample_column | character | sample | Sample ID column name in metadata |
| -l | --log_transform | character | auto | Log transform mode: auto, yes, no |
| -t | --timeout_seconds | integer | 600 | Elapsed time limit in seconds; use 0 to disable |
| -s | --seed | integer | 42 | Random seed for reproducibility |

Input Format

Expression Matrix (input_file)

Genes as rows, samples as columns, CSV format with gene ID in the first column.

"","Sample01","Sample02","Sample03"
"GeneA",5.12,4.87,6.03
"GeneB",8.44,8.11,7.95

Requirements:

  • Gene IDs must be unique and non-empty
  • Sample column names must be unique and non-empty
  • Expression values must be numeric and finite
  • Extra expression-matrix sample columns not present in metadata are allowed and will be ignored with a warning
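
The requirements above can be sketched as a small validator. This is an illustration under stated assumptions, not the skill's actual code: the function name validate_expression is hypothetical, and the mapping of each check to SKILL_INVALID_DATA or SKILL_INVALID_TYPE is inferred from the error table in this document.

```r
# Hypothetical validator. The SKILL_* prefixes follow this document's error
# table, but which code the real scripts raise for each check is an assumption.
validate_expression <- function(df) {
  gene_ids <- as.character(df[[1]])
  sample_ids <- colnames(df)[-1]

  check_ids <- function(ids, label) {
    if (anyNA(ids) || any(!nzchar(ids)) || anyDuplicated(ids) > 0) {
      stop("SKILL_INVALID_DATA: ", label, " must be unique and non-empty",
           call. = FALSE)
    }
  }
  check_ids(gene_ids, "gene IDs")
  check_ids(sample_ids, "sample column names")

  values <- df[-1]
  if (!all(vapply(values, is.numeric, logical(1)))) {
    stop("SKILL_INVALID_TYPE: expression values must be numeric", call. = FALSE)
  }
  mat <- as.matrix(values)
  if (!all(is.finite(mat))) {
    stop("SKILL_INVALID_TYPE: expression values must be finite", call. = FALSE)
  }
  rownames(mat) <- gene_ids
  mat
}

# The example matrix from this section, read with the original column names kept.
csv_text <- '"","Sample01","Sample02","Sample03"
"GeneA",5.12,4.87,6.03
"GeneB",8.44,8.11,7.95'
expr_df <- read.csv(text = csv_text, check.names = FALSE, stringsAsFactors = FALSE)
expr <- validate_expression(expr_df)
print(dim(expr))
```

Passing check.names = FALSE keeps sample names exactly as written in the file, which matters when they are later matched against metadata sample IDs.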

Sample Metadata (group_file)

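CSV with sample ID, biological group, and batch columns. The three-row excerpt below only illustrates the layout; a real file must meet the requirements listed after it.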

"sample","group","batch"
"Sample01","Control","Batch1"
"Sample02","Case","Batch1"
"Sample03","Case","Batch2"

Requirements:

  • Sample IDs must be unique and non-empty
  • At least 2 biological groups are required
  • At least 2 batches are required
  • Each group and each batch must contain at least 2 samples
  • Metadata may describe a subset of expression-matrix samples; the analysis will keep only metadata-matched samples and warn about ignored expression columns
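
A design check matching these requirements might look like the sketch below. The function name validate_design and the message wording are hypothetical; only the rules themselves (at least 2 groups, at least 2 batches, at least 2 samples in each) come from this document. Note that the three-row layout excerpt shown above would not pass, because Control and Batch2 each contain a single sample.

```r
# Hypothetical design validator; column-name defaults mirror the CLI defaults.
validate_design <- function(meta, sample_col = "sample",
                            group_col = "group", batch_col = "batch") {
  absent <- setdiff(c(sample_col, group_col, batch_col), colnames(meta))
  if (length(absent) > 0) {
    stop("SKILL_MISSING_COLUMNS: ", paste(absent, collapse = ", "), call. = FALSE)
  }

  ids <- as.character(meta[[sample_col]])
  if (anyNA(ids) || any(!nzchar(ids)) || anyDuplicated(ids) > 0) {
    stop("SKILL_INVALID_DATA: sample IDs must be unique and non-empty",
         call. = FALSE)
  }

  group_sizes <- table(meta[[group_col]])
  batch_sizes <- table(meta[[batch_col]])
  problems <- character(0)
  if (length(group_sizes) < 2) problems <- c(problems, "fewer than 2 groups")
  if (length(batch_sizes) < 2) problems <- c(problems, "fewer than 2 batches")
  if (any(group_sizes < 2)) problems <- c(problems, "a group has fewer than 2 samples")
  if (any(batch_sizes < 2)) problems <- c(problems, "a batch has fewer than 2 samples")

  if (length(problems) > 0) {
    stop("SKILL_INVALID_DATA: ", paste(problems, collapse = "; "), call. = FALSE)
  }
  invisible(TRUE)
}

# The layout excerpt from this section: rejected because of singleton cells.
example_meta <- data.frame(
  sample = c("Sample01", "Sample02", "Sample03"),
  group  = c("Control", "Case", "Case"),
  batch  = c("Batch1", "Batch1", "Batch2")
)
result <- tryCatch(validate_design(example_meta),
                   error = function(e) conditionMessage(e))
print(result)
```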

Output Files

| File | Description |
| --- | --- |
| corrected_expression_matrix.csv | Batch-corrected expression matrix |
| matched_sample_info.csv | Standardized metadata used in the analysis |
| batch_before_boxplot.pdf | Sample distribution boxplot before correction |
| batch_after_boxplot.pdf | Sample distribution boxplot after correction |
| batch_before_pca.pdf | PCA scatter plot before correction with batch-colored points |
| batch_after_pca.pdf | PCA scatter plot after correction with batch-colored points |
| batch_before_clustering.pdf | Hierarchical clustering before correction |
| batch_after_clustering.pdf | Hierarchical clustering after correction |
| session_info.txt | R session and package version info |

Workflow

Step 1: Validate Input

  • Check file existence and non-empty input files
  • Validate metadata column presence
  • Verify expression values are numeric and finite
  • Confirm at least 2 groups, 2 batches, and at least 2 samples per group/batch

Step 2: Align and Prepare Matrix

  • Reorder expression columns to match metadata sample order
  • Keep only metadata-matched samples; warn if the expression matrix contains extra samples absent from metadata
  • Decide whether log transformation is needed (auto, yes, or no)
  • Apply log2(x + 1) only when required
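
Step 2 can be sketched roughly as follows. All function names here are hypothetical, and the auto-detection rule (treating a maximum above 50 as a raw scale) is an assumed heuristic chosen for illustration; this document does not state the rule the skill actually uses. Treating any metadata ID missing from the matrix as SKILL_SAMPLE_MISMATCH is likewise an assumption.

```r
# Hypothetical helper: keep metadata-matched columns in metadata order.
align_to_metadata <- function(expr, sample_ids) {
  absent <- setdiff(sample_ids, colnames(expr))
  if (length(absent) > 0) {
    stop("SKILL_SAMPLE_MISMATCH: not found in expression matrix: ",
         paste(absent, collapse = ", "), call. = FALSE)
  }
  extra <- setdiff(colnames(expr), sample_ids)
  if (length(extra) > 0) {
    warning("Ignoring expression columns absent from metadata: ",
            paste(extra, collapse = ", "), call. = FALSE)
  }
  expr[, sample_ids, drop = FALSE]
}

# ASSUMED heuristic: values above `threshold` suggest an unlogged scale.
looks_raw_scale <- function(expr, threshold = 50) {
  max(expr, na.rm = TRUE) > threshold
}

apply_log_mode <- function(expr, mode = c("auto", "yes", "no")) {
  mode <- match.arg(mode)
  do_log <- switch(mode, yes = TRUE, no = FALSE, auto = looks_raw_scale(expr))
  if (do_log) log2(expr + 1) else expr
}

expr <- matrix(c(120, 30, 0, 800, 15, 64, 7, 7), nrow = 2,
               dimnames = list(c("GeneA", "GeneB"), c("S2", "QC1", "S1", "S3")))

# Report the warning as a message so the example keeps running.
aligned <- withCallingHandlers(
  align_to_metadata(expr, sample_ids = c("S1", "S2", "S3")),
  warning = function(w) {
    message("warning: ", conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)
prepared <- apply_log_mode(aligned, mode = "auto")
print(prepared)
```

Because log2(x + 1) is undefined for values below -1, forcing the transform on a matrix with negative values deserves a separate check before this step.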

Step 3: Run Batch Correction

  • Build the design matrix with biological group information
  • Run sva::ComBat() to remove batch-driven variation
  • Preserve modeled biological group structure during correction
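
The correction step might be sketched as below, on simulated data. The design formula ~ group is an assumption about how the skill builds its design matrix, and the simulated matrix and its 1.5-unit batch shift are invented purely for illustration. The ComBat call only runs when sva is installed.

```r
set.seed(42)

# Simulated, balanced design: 2 groups x 2 batches, 2 samples per cell.
meta <- data.frame(
  sample = paste0("S", 1:8),
  group  = factor(rep(c("Control", "Case"), times = 4)),
  batch  = factor(rep(c("Batch1", "Batch2"), each = 4))
)
n_genes <- 200
expr <- matrix(rnorm(n_genes * 8, mean = 8), nrow = n_genes,
               dimnames = list(paste0("Gene", seq_len(n_genes)), meta$sample))
# Invented batch shift so there is something to correct.
expr[, meta$batch == "Batch2"] <- expr[, meta$batch == "Batch2"] + 1.5

# Design matrix carrying the biological group (assumed formula: ~ group).
mod <- model.matrix(~ group, data = meta)

corrected <- NULL
if (requireNamespace("sva", quietly = TRUE)) {
  # Passing `mod` asks ComBat to protect the modeled group effect.
  corrected <- sva::ComBat(dat = expr, batch = meta$batch, mod = mod)
} else {
  message("sva is not installed; skipping the ComBat call.")
}
```

Supplying the group design through mod is what lets ComBat adjust for batch without flattening differences between biological groups, which is also why batch and group must not be fully confounded.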

Step 4: Normalize and Export Results

  • Apply limma::normalizeBetweenArrays() after ComBat
  • Write the corrected matrix and matched metadata
  • Save before/after QC plots and session information
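
The export part of this step could look like the following sketch. The function name export_results is hypothetical; the file names are the ones listed under Output Files. Plot generation is left out to keep the example dependency-free.

```r
# Hypothetical export helper using the output file names from this document.
export_results <- function(corrected, meta, output_dir) {
  dir.create(output_dir, recursive = TRUE, showWarnings = FALSE)
  # Row names (gene IDs) are written as the first, unnamed column.
  write.csv(corrected, file.path(output_dir, "corrected_expression_matrix.csv"))
  write.csv(meta, file.path(output_dir, "matched_sample_info.csv"),
            row.names = FALSE)
  # Record R and package versions for reproducibility.
  writeLines(capture.output(sessionInfo()),
             file.path(output_dir, "session_info.txt"))
  invisible(list.files(output_dir))
}

corrected <- matrix(c(7.1, 8.2, 6.9, 8.4), nrow = 2,
                    dimnames = list(c("GeneA", "GeneB"),
                                    c("Sample01", "Sample02")))
meta <- data.frame(sample = c("Sample01", "Sample02"),
                   group = c("Control", "Case"),
                   batch = c("Batch1", "Batch2"))
out_dir <- file.path(tempdir(), "batch_correction_demo")
written <- export_results(corrected, meta, out_dir)
print(written)
```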

Methods

ComBat

Empirical Bayes batch-effect correction using sva::ComBat(). Recommended when merged bulk expression datasets contain known batch labels and at least two biological groups.

Log Transformation

Supports auto, yes, and no. The auto mode applies log2(x + 1) only when the matrix appears to be on a raw (not yet log-transformed) scale.

normalizeBetweenArrays

Post-correction normalization with limma::normalizeBetweenArrays() to reduce remaining cross-sample distribution differences.
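
To show what this step does to sample distributions, here is a base-R illustration of quantile normalization, which is the default method limma::normalizeBetweenArrays() applies to a plain matrix. The function quantile_normalize is a hypothetical teaching sketch, not limma's implementation; in particular, limma treats tied values differently, so results on data with ties can differ.

```r
# Illustrative quantile normalization (NOT limma's implementation).
quantile_normalize <- function(mat) {
  ranks <- apply(mat, 2, rank, ties.method = "first")  # per-sample ranks
  sorted <- apply(mat, 2, sort)                        # per-sample sorted values
  reference <- unname(rowMeans(sorted))                # shared target distribution
  normalized <- apply(ranks, 2, function(r) reference[r])
  dimnames(normalized) <- dimnames(mat)
  normalized
}

mat <- matrix(c(5, 2, 3, 4,
                4, 1, 4, 2,
                3, 4, 6, 8),
              nrow = 4,
              dimnames = list(paste0("Gene", 1:4), paste0("Sample", 1:3)))
normalized <- quantile_normalize(mat)
print(normalized)

# The real call, run only when limma is available.
if (requireNamespace("limma", quietly = TRUE)) {
  print(limma::normalizeBetweenArrays(mat, method = "quantile"))
}
```

After normalization every sample shares the same set of values, which is why the post-correction boxplots line up.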

QC Visualization

Generates paired boxplots, PCA scatter plots with conditional batch ellipses, and hierarchical clustering plots before and after correction to assess whether batch-driven structure is reduced.
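
One way to put a number on "batch-driven structure is reduced" is to compare the distance between batch centroids on PC1 before and after correction. The sketch below uses simulated data and base R's prcomp(). The helper names are hypothetical, and center_by_batch is only a crude per-batch mean-centering stand-in for ComBat, used so the example runs without sva.

```r
set.seed(42)
meta <- data.frame(
  sample = paste0("S", 1:8),
  group  = rep(c("Control", "Case"), times = 4),
  batch  = rep(c("Batch1", "Batch2"), each = 4)
)
expr <- matrix(rnorm(200 * 8, mean = 8), nrow = 200,
               dimnames = list(paste0("Gene", 1:200), meta$sample))
expr[, meta$batch == "Batch2"] <- expr[, meta$batch == "Batch2"] + 1.5

# PCA on samples (rows of t(expr)); returns the first two components.
pca_scores <- function(expr, meta) {
  pca <- prcomp(t(expr), center = TRUE, scale. = FALSE)
  data.frame(sample = colnames(expr), batch = meta$batch,
             PC1 = pca$x[, 1], PC2 = pca$x[, 2])
}

# Distance between the two batch centroids along PC1.
batch_gap <- function(scores) {
  unname(abs(diff(tapply(scores$PC1, scores$batch, mean))))
}

# Crude stand-in for ComBat: subtract each gene's mean within each batch.
center_by_batch <- function(expr, batch) {
  for (b in unique(batch)) {
    cols <- batch == b
    expr[, cols] <- expr[, cols] - rowMeans(expr[, cols, drop = FALSE])
  }
  expr
}

gap_before <- batch_gap(pca_scores(expr, meta))
gap_after <- batch_gap(pca_scores(center_by_batch(expr, meta$batch), meta))
cat("PC1 batch gap before:", gap_before, "after:", gap_after, "\n")
```

A gap that shrinks toward zero corresponds to batches overlapping in the after-correction PCA plot. Whether group separation is preserved should be checked the same way with the group column.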


Agent Response Contract

After a successful run, report:

  1. Sample count retained after metadata matching and any subset filtering
  2. Batch count and group count used in the ComBat design matrix
  3. Log transformation applied (auto-detected, forced yes, or skipped)
  4. QC assessment: describe whether before/after PCA plots show reduced batch clustering
  5. Artifact paths: corrected_expression_matrix.csv, batch_after_pca.pdf, batch_after_clustering.pdf
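
The five report items could be assembled with a helper such as the one below. The function build_run_report and its arguments are hypothetical. Item 4 is left as a placeholder because it depends on inspecting the before and after plots.

```r
# Hypothetical report builder covering the five items listed above.
build_run_report <- function(meta, log_mode, log_applied, output_dir) {
  artifacts <- file.path(output_dir, c("corrected_expression_matrix.csv",
                                       "batch_after_pca.pdf",
                                       "batch_after_clustering.pdf"))
  c(
    sprintf("Samples retained: %d", nrow(meta)),
    sprintf("Batches: %d; groups: %d",
            length(unique(meta$batch)), length(unique(meta$group))),
    sprintf("Log transform: mode=%s, applied=%s", log_mode, log_applied),
    "QC assessment: <describe before/after PCA batch clustering>",
    paste("Artifacts:", paste(artifacts, collapse = ", "))
  )
}

meta <- data.frame(sample = paste0("S", 1:8),
                   group = rep(c("Control", "Case"), times = 4),
                   batch = rep(c("Batch1", "Batch2"), each = 4))
report <- build_run_report(meta, log_mode = "auto", log_applied = FALSE,
                           output_dir = "./output")
writeLines(report)
```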

Examples

Basic Usage

Rscript scripts/main.R \
  -i expression_matrix.csv \
  -g sample_info.csv \
  -o ./output

With Custom Metadata Columns

Rscript scripts/main.R \
  -i expression_matrix.csv \
  -g metadata.csv \
  -o ./output \
  -n sample_id \
  -c condition \
  -b platform_batch

Disable Log Transform and Timeout

Rscript scripts/main.R \
  -i expression_matrix.csv \
  -g sample_info.csv \
  -o ./output \
  -l no \
  -t 0 \
  -s 42

Error Handling

Common Errors

| Error | Cause | Solution |
| --- | --- | --- |
| SKILL_FILE_NOT_FOUND | Input file does not exist | Check file path |
| SKILL_EMPTY_FILE | Input file exists but contains no data | Recreate or re-export the file |
| SKILL_MISSING_COLUMNS | Metadata file is missing sample, group, or batch columns | Check header names or pass custom column names |
| SKILL_SAMPLE_MISMATCH | Metadata sample IDs do not match expression matrix columns | Verify sample names between files |
| SKILL_INVALID_DATA | Dataset fails minimum design checks (< 2 batches, < 2 groups, < 2 samples per batch/group) | Review group counts, batch counts, and ID validity |
| SKILL_INVALID_TYPE | Expression values are non-numeric or non-finite | Clean matrix values before running |
| SKILL_TIMEOUT | Run exceeded the configured time limit | Increase --timeout_seconds or set it to 0 |
| SKILL_DEPENDENCY_MISSING | Required R package is not installed | Install with: Rscript -e "BiocManager::install(c('sva','limma')); install.packages('ggplot2', repos='https://cloud.r-project.org')" |
| SKILL_RUNTIME_ERROR | Runtime I/O or filesystem error occurred | Check read/write permissions and environment |

IF error persists, READ: references/troubleshooting.md

Troubleshooting note: In environments where packages are not yet installed, SKILL_DEPENDENCY_MISSING will fire before file-validation or --help. Install dependencies first, then re-run to expose file-related errors or access --help.
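
For callers that need to branch on these codes, one possible pattern is to raise them as classed R conditions. The sketch below is an assumption about how such codes could be implemented, not a description of the skill's scripts. The helpers skill_error and read_input are hypothetical.

```r
# Hypothetical constructor: an error condition whose first class is the code.
skill_error <- function(code, message) {
  structure(
    list(message = paste0(code, ": ", message), call = NULL),
    class = c(code, "skill_error", "error", "condition")
  )
}

# Hypothetical reader that raises two of the codes from the table above.
read_input <- function(path) {
  if (!file.exists(path)) {
    stop(skill_error("SKILL_FILE_NOT_FOUND",
                     paste("Input file does not exist:", path)))
  }
  if (file.info(path)$size == 0) {
    stop(skill_error("SKILL_EMPTY_FILE",
                     paste("Input file contains no data:", path)))
  }
  read.csv(path, check.names = FALSE)
}

# Handlers can match the shared class and then inspect the specific code.
code <- tryCatch(
  read_input("no_such_file.csv"),
  skill_error = function(e) class(e)[1]
)
print(code)
```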


Input Validation

This skill accepts:

  1. A bulk RNA-seq or microarray expression matrix (CSV, genes as rows, samples as columns)
  2. A sample metadata file (CSV) with sample ID, biological group, and batch columns; at least 2 batches and 2 biological groups are required

If the user's request does not involve batch effect correction on merged bulk expression matrices (for example, integrating single-cell RNA-seq data, processing raw FASTQ files, running differential expression without batch labels, or analyzing datasets with only one batch), do not proceed with the workflow. Instead, respond:

"Batch Effect Correction is designed to remove batch-driven variation from merged bulk expression matrices using ComBat, while preserving biological group structure. Your request appears to be outside this scope. Please provide a multi-batch expression matrix with sample-level batch metadata, or use a more appropriate tool for single-cell integration, differential expression, or raw sequencing processing."


Testing

Test with Sample Data

# Check help (requires packages installed)
Rscript scripts/main.R --help

# Run with bundled test data
Rscript scripts/main.R \
  -i tests/data/expression_matrix_merged.csv \
  -g tests/data/sample_info.csv \
  -o tests/output/

Validation Commands

# Check corrected matrix exists
ls -la tests/output/corrected_expression_matrix.csv

# Check matched metadata exists
ls -la tests/output/matched_sample_info.csv

# Check PCA output exists
ls -la tests/output/batch_after_pca.pdf
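
The same check can be extended to all nine documented outputs from within R. This is a small sketch; the helper check_outputs is hypothetical and the file names are taken from the Output Files table.

```r
# File names as listed under Output Files.
expected_outputs <- c(
  "corrected_expression_matrix.csv", "matched_sample_info.csv",
  "batch_before_boxplot.pdf", "batch_after_boxplot.pdf",
  "batch_before_pca.pdf", "batch_after_pca.pdf",
  "batch_before_clustering.pdf", "batch_after_clustering.pdf",
  "session_info.txt"
)

# Hypothetical helper: one row per expected file with presence and size flags.
check_outputs <- function(output_dir, files = expected_outputs) {
  paths <- file.path(output_dir, files)
  sizes <- file.size(paths)  # NA for files that do not exist
  data.frame(file = files,
             present = file.exists(paths),
             non_empty = !is.na(sizes) & sizes > 0)
}

report <- check_outputs("tests/output")
print(report)
```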

Implementation Checklist

  • CLI parsing with optparse
  • set.seed() for reproducibility
  • requireNamespace() dependency checks
  • Session info recording
  • Time-limit support through setTimeLimit()
  • File reading instructions in SKILL.md
  • Modular script structure in scripts/
  • Test data provided
  • Error handling with SKILL_* codes
  • QC plots generated before and after correction
  • References in references/ directory
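
As an illustration of the time-limit item, the sketch below wraps a task with base R's setTimeLimit() and converts the resulting error into a SKILL_TIMEOUT message. The wrapper run_with_timeout is hypothetical and may differ from what scripts/ does. It treats 0 as "no limit", matching the --timeout_seconds description.

```r
# Hypothetical wrapper around setTimeLimit(); 0 disables the limit.
run_with_timeout <- function(task, timeout_seconds = 600) {
  if (timeout_seconds > 0) {
    setTimeLimit(elapsed = timeout_seconds, transient = TRUE)
    # Always clear the limit when leaving, whether the task succeeded or not.
    on.exit(setTimeLimit(elapsed = Inf, transient = FALSE), add = TRUE)
  }
  # Message R raises when the elapsed limit is hit (translated if localized).
  limit_msg <- gettext("reached elapsed time limit", domain = "R")
  tryCatch(
    task(),
    error = function(e) {
      if (grepl(limit_msg, conditionMessage(e), fixed = TRUE)) {
        stop("SKILL_TIMEOUT: run exceeded ", timeout_seconds, " seconds",
             call. = FALSE)
      }
      stop(e)  # re-raise anything that is not a timeout
    }
  )
}

fast <- run_with_timeout(function() sum(1:10), timeout_seconds = 5)
slow <- tryCatch(
  run_with_timeout(function() { Sys.sleep(3); "finished" },
                   timeout_seconds = 0.5),
  error = function(e) conditionMessage(e)
)
print(fast)
print(slow)
```

Time limits are only checked at points where R can be interrupted, so long-running compiled code may overrun the limit before the error is raised.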

Last updated: 2026-04-27 | Version: 1.1.0