Consensus Clustering Analysis

AIPOCH

Use when identifying stable sample subtypes from bulk expression matrices with ConsensusClusterPlus, including PAC-based model selection and consensus matrix/CDF visualization. NOT for: differential expression analysis, single-cell clustering workflows, or non-expression tables.

FILES
consensus-clustering-analysis/
  skill.md
  scripts/
    functions_analysis.R
    io_utils.R
    main.R
    run_analysis.R
    utils.R
    visualization.R
  references/
    algorithm.md
    cli-guide.md
    troubleshooting.md
Total Score: 94 / 100
Core Capability: 94 / 100

| Category | Score |
| --- | --- |
| Functional Suitability | 11 / 12 |
| Reliability | 11 / 12 |
| Performance & Context | 8 / 8 |
| Agent Usability | 15 / 16 |
| Human Usability | 7 / 8 |
| Security | 12 / 12 |
| Maintainability | 12 / 12 |
| Agent-Specific | 18 / 20 |

Medical Task: 20 / 20 Passed

| Case | Score | Passed |
| --- | --- | --- |
| Default case-group run | 96 | 4/4 |
| Alias-column group file | 96 | 4/4 |
| Single-gene custom list | 86 | 4/4 |
| Top-10 features with K=4 | 93 | 4/4 |
| Custom gene list with K=4 | 95 | 4/4 |

SKILL.md

Consensus Clustering Analysis

When to Use

Use this skill when you need to identify stable sample subtypes from a bulk expression matrix with ConsensusClusterPlus, compare candidate clustering settings with PAC, and export consensus matrix/CDF visualizations.

Do not use this skill for differential expression analysis, single-cell clustering, or non-expression tabular data.

When to Read External Files

| Situation | File to Read | Purpose |
| --- | --- | --- |
| Need algorithm details | references/algorithm.md | Consensus clustering, PAC scoring, and preprocessing assumptions |
| Need to run analysis | scripts/main.R | Execute: Rscript scripts/main.R --input_file ... --group_file ... |
| Encounter errors | references/troubleshooting.md | Common errors and solutions |
| Need CLI examples | references/cli-guide.md | Detailed CLI usage examples with verified local runs |

Usage

Rscript scripts/main.R \
  --input_file ./expression_matrix.csv \
  --group_file ./groups.csv \
  --disease_group case \
  --max_k 4 \
  --output_dir ./output/ \
  --gene_selection highly_variable \
  --top_n 5000 \
  --reps 1000 \
  --p_item 0.8 \
  --p_feature 1.0 \
  --timeout_seconds 3600 \
  --seed 42

Arguments

| Short | Long | Type | Default | Description |
| --- | --- | --- | --- | --- |
| -i | --input_file | character | required | Expression matrix file (genes as rows, samples as columns) |
| -g | --group_file | character | required | Group information file (sample ID + group columns) |
| -d | --disease_group | character | case | Group label retained for clustering |
| -k | --max_k | integer | 4 | Maximum cluster count to evaluate |
| -o | --output_dir | character | ./output/ | Output directory |
| -m | --gene_selection | character | highly_variable | Gene selection mode: highly_variable or custom |
| -n | --top_n | integer | 5000 | Number of top variable genes to keep |
| -l | --gene_list | character | NULL | Custom gene list file when gene_selection=custom |
| -c | --center_data | logical | TRUE | Median-center each gene before clustering |
| -r | --reps | integer | 1000 | Consensus resampling repetitions |
| | --p_item | double | 0.8 | Sample resampling proportion |
| | --p_feature | double | 1.0 | Feature resampling proportion |
| -t | --timeout_seconds | integer | 3600 | Elapsed timeout in seconds |
| -s | --seed | integer | 42 | Random seed for reproducibility |

Input Format

Expression Matrix (input_file)

A CSV, TSV, or TXT file with genes as rows and samples as columns. The first column holds the gene ID.

,Sample01,Sample02,Sample03
TSPAN6,1.8479,1.8318,3.8276
TNMD,0.0349,0.0533,1.3889
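
As a rough illustration of this layout, the base-R sketch below parses the example rows above into a numeric matrix. It reads inline text with read.csv() rather than using the skill's own reader (the checklist mentions data.table::fread()), so treat it as one possible way to load such a file, not as the code in scripts/io_utils.R.

```r
# Inline copy of the example matrix; a real run would pass a file path instead.
csv_text <- ",Sample01,Sample02,Sample03
TSPAN6,1.8479,1.8318,3.8276
TNMD,0.0349,0.0533,1.3889"

# row.names = 1 turns the first (gene ID) column into row names;
# check.names = FALSE keeps sample IDs exactly as written in the header.
expr_df <- read.csv(text = csv_text, row.names = 1, check.names = FALSE)
expr <- as.matrix(expr_df)

dim(expr)        # genes x samples
rownames(expr)   # gene IDs
colnames(expr)   # sample IDs
```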

Group File (group_file)

Delimited text file with sample ID and group columns.

sample,group
Sample01,case
Sample02,control
Sample03,case
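
To show how the group file relates to the matrix, here is a minimal sketch of keeping only the samples labelled with the requested disease group. The column names sample and group come from the example above; the variable names and the filtering logic are assumptions for illustration and may differ from what the skill's scripts do.

```r
groups <- data.frame(
  sample = c("Sample01", "Sample02", "Sample03"),
  group  = c("case", "control", "case"),
  stringsAsFactors = FALSE
)
expr <- matrix(
  c(1.8479, 0.0349, 1.8318, 0.0533, 3.8276, 1.3889),
  nrow = 2,
  dimnames = list(c("TSPAN6", "TNMD"), c("Sample01", "Sample02", "Sample03"))
)

disease_group <- "case"  # mirrors --disease_group

# Samples labelled with the requested group that are also present in the matrix.
keep <- intersect(groups$sample[groups$group == disease_group], colnames(expr))
expr_case <- expr[, keep, drop = FALSE]

colnames(expr_case)
```

With the three example samples, only Sample01 and Sample03 are retained for clustering.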

Gene List (gene_list)

Optional plain text or single-column CSV file with one gene symbol per line.

TNMD
DPM1
SCYL3

Output Files

| File | Description |
| --- | --- |
| Cluster_res.csv | PAC summary for each distance/algorithm combination with is_best marking the selected model |
| genes_for_clustering.csv | Selected genes and gene selection mode |
| samples_for_clustering.csv | Samples retained after disease-group filtering |
| result_<distance>_<algorithm>/ | Method-specific consensus outputs and PAC_scores.csv |
| Consensus Matrix Plot.pdf | Consensus matrix heatmap for the optimal model |
| CDF curve Plot.pdf | CDF curves for the optimal method |
| session_info.txt | R session and package version info |

Workflow

Step 1: Validate Input

  • Check file existence
  • Detect sample and group columns in the group file
  • Validate sample matching between expression matrix and group file
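
The validation steps above could look roughly like the sketch below. The alias lists, the detect_column() helper, and the message wording are all hypothetical; the skill's accepted column names are not listed in this document, and only the SKILL_* codes are taken from the error table.

```r
# Hypothetical alias sets; the names the skill actually accepts are not documented here.
sample_aliases <- c("sample", "sample_id", "id")
group_aliases  <- c("group", "condition", "class")

# Return the first column whose lower-cased name matches an alias, or stop.
detect_column <- function(df, aliases, what) {
  hit <- which(tolower(names(df)) %in% aliases)
  if (length(hit) == 0) {
    stop(sprintf("SKILL_MISSING_COLUMNS: no %s column found", what), call. = FALSE)
  }
  names(df)[hit[1]]
}

groups <- data.frame(Sample_ID = c("Sample01", "Sample02"),
                     Condition = c("case", "control"),
                     stringsAsFactors = FALSE)
sample_col <- detect_column(groups, sample_aliases, "sample")
group_col  <- detect_column(groups, group_aliases, "group")

# Sample matching: IDs in the group file that are missing from the matrix.
matrix_samples <- c("Sample01", "Sample02", "Sample03")
missing <- setdiff(groups[[sample_col]], matrix_samples)
if (length(missing) > 0) {
  stop("SKILL_SAMPLE_MISMATCH: ", paste(missing, collapse = ", "), call. = FALSE)
}
```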

Step 2: Prepare Clustering Matrix

  • Filter samples by the requested disease group
  • Select genes using highly_variable or custom
  • Median-center genes if requested
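
Median centering, the last step above, can be sketched in base R as follows. The toy matrix is made up for illustration, and the skill's own implementation may differ in details such as NA handling.

```r
expr <- matrix(
  c(1, 2, 6,
    10, 20, 30),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("geneA", "geneB"), c("S1", "S2", "S3"))
)

# Subtract each gene's (row's) median so every gene is centered at zero.
gene_medians <- apply(expr, 1, median)
centered <- sweep(expr, 1, gene_medians)

centered
```

After centering, each row has a median of zero, so clustering reflects relative differences across samples rather than each gene's baseline expression level.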

Step 3: Run Consensus Clustering

  • Evaluate supported distance and clustering algorithm combinations
  • Compute PAC scores across candidate K values
  • Select the optimal model by minimum PAC
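
Selecting the model with the minimum PAC can be illustrated with a small table. The PAC values below are made-up numbers, and every column name except is_best (which appears in Cluster_res.csv) is an assumption.

```r
# Hypothetical PAC table: one row per distance/algorithm/K combination.
pac <- data.frame(
  distance  = c("pearson", "pearson", "euclidean", "euclidean"),
  algorithm = c("hc", "hc", "km", "km"),
  k         = c(2, 3, 2, 3),
  pac       = c(0.21, 0.08, 0.15, 0.12),
  stringsAsFactors = FALSE
)

# Flag the row with the lowest PAC (the first one on ties).
pac$is_best <- seq_len(nrow(pac)) == which.min(pac$pac)
pac[pac$is_best, ]
```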

Step 4: Generate Outputs

  • Save result tables
  • Generate consensus matrix and CDF plots
  • Record session information for reproducibility

Methods

ConsensusClusterPlus

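ConsensusClusterPlus repeatedly subsamples the samples (--p_item) and features (--p_feature), clusters each subsample, and records how often each pair of samples lands in the same cluster. After --reps repetitions, the resulting consensus matrix measures cluster stability for each candidate K and each clustering setting.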
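
For orientation, a single ConsensusClusterPlus run on simulated data might look like the sketch below. The mapping from the CLI flags to the function arguments is an assumption, as are the chosen algorithm ("hc") and distance ("pearson"); the skill evaluates several combinations. The call is skipped if the package is not installed.

```r
set.seed(42)
# Toy matrix: 50 features x 20 samples with two simulated sample groups.
d <- matrix(rnorm(50 * 20), nrow = 50,
            dimnames = list(paste0("g", 1:50), paste0("s", 1:20)))
d[1:25, 1:10]   <- d[1:25, 1:10] + 3    # group 1: first 25 features raised
d[26:50, 11:20] <- d[26:50, 11:20] + 3  # group 2: last 25 features raised

if (requireNamespace("ConsensusClusterPlus", quietly = TRUE)) {
  res <- ConsensusClusterPlus::ConsensusClusterPlus(
    d,
    maxK = 3,              # assumed to correspond to --max_k
    reps = 20,             # assumed to correspond to --reps
    pItem = 0.8,           # assumed to correspond to --p_item
    pFeature = 1,          # assumed to correspond to --p_feature
    clusterAlg = "hc",
    distance = "pearson",
    seed = 42,             # assumed to correspond to --seed
    title = file.path(tempdir(), "ccp_demo"),  # output directory for plots
    plot = "pdf"
  )
  # res[[k]] holds the results for K = k, with k running from 2 to maxK.
  print(table(res[[2]]$consensusClass))
}
```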

PAC Score

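The proportion of ambiguous clustering (PAC) is computed as CDF(0.9) - CDF(0.1), where CDF is the empirical cumulative distribution of the lower-triangle entries of the consensus matrix. Consensus values near 0 or 1 mean a pair of samples is consistently separated or consistently grouped, so PAC measures the fraction of pairs in between. Lower PAC indicates more stable clustering.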
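
A minimal sketch of this calculation in base R is shown below. The function name pac_score and the toy consensus matrix are illustrative. How values exactly equal to 0.1 or 0.9 are counted depends on the CDF definition; this version uses ecdf(), which counts values less than or equal to the threshold.

```r
# PAC = CDF(upper) - CDF(lower) over the lower-triangle consensus values.
pac_score <- function(consensus, lower = 0.1, upper = 0.9) {
  vals <- consensus[lower.tri(consensus)]
  cdf <- ecdf(vals)
  cdf(upper) - cdf(lower)
}

# Toy 4-sample consensus matrix: pairs (1,2) and (3,4) always co-cluster,
# while pair (2,3) co-clusters in half of the resamples.
cm <- matrix(c(1, 1,   0,   0,
               1, 1,   0.5, 0,
               0, 0.5, 1,   1,
               0, 0,   1,   1), nrow = 4, byrow = TRUE)

pac_score(cm)
```

Here one of the six sample pairs has an intermediate consensus value, so the PAC is 1/6.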

Gene Selection

  • highly_variable: rank genes by median absolute deviation
  • custom: use the intersection of the provided gene list and matrix row names
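
Both selection modes can be sketched in a few lines of base R. The toy values are made up, and mad() applies R's default scaling constant (1.4826), which does not change the ranking; whether the skill uses the scaled or raw MAD is not stated here.

```r
expr <- matrix(
  c(1, 1, 1, 1,      # flat gene: MAD 0
    0, 2, 8, 10,     # most variable gene
    4, 5, 6, 7),     # moderately variable gene
  nrow = 3, byrow = TRUE,
  dimnames = list(c("DPM1", "TNMD", "SCYL3"), paste0("S", 1:4))
)

# highly_variable: keep the top_n genes ranked by median absolute deviation.
top_n <- 2
gene_mad <- apply(expr, 1, mad)
n_keep <- min(top_n, nrow(expr))
hv_genes <- names(sort(gene_mad, decreasing = TRUE))[seq_len(n_keep)]

# custom: keep listed genes that are present in the matrix row names.
gene_list <- c("TNMD", "SCYL3", "NOT_IN_MATRIX")
custom_genes <- intersect(gene_list, rownames(expr))

hv_genes
custom_genes
```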

Examples

Basic Usage

Rscript scripts/main.R \
  -i expression_matrix.csv \
  -g groups.csv \
  -d case \
  -k 3 \
  -r 20 \
  -o output/example_basic \
  -t 120

With a Custom Gene List

Rscript scripts/main.R \
  -i expression_matrix.csv \
  -g groups.csv \
  -d case \
  -m custom \
  -l genes.csv \
  -k 4 \
  -r 20 \
  -o output/example_custom \
  -t 120

Without Median Centering

Rscript scripts/main.R \
  -i expression_matrix.csv \
  -g groups.csv \
  -d case \
  -c FALSE \
  -k 3 \
  -r 20 \
  -o output/example_rawscale \
  -t 120

Error Handling

Common Errors

| Error | Cause | Solution |
| --- | --- | --- |
| SKILL_FILE_NOT_FOUND | Input file does not exist | Check file path and permissions |
| SKILL_MISSING_COLUMNS | Group file lacks sample/group columns | Verify column names in the group file |
| SKILL_SAMPLE_MISMATCH | Sample names do not match | Ensure group file sample IDs match matrix columns |
| SKILL_INVALID_PARAMETER | CLI value is invalid | Check allowed options and numeric ranges |
| SKILL_INVALID_DATA | Too few samples/genes remain after filtering | Lower max_k or review the input data |
| SKILL_TIMEOUT | Run exceeded the configured timeout | Increase timeout_seconds or reduce reps |
| SKILL_DEPENDENCY_MISSING | Required R package is not installed | Install missing packages before rerunning |

IF error persists, READ: references/troubleshooting.md
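
One way coded errors like these could be raised and caught in R is sketched below. The skill_stop() helper, the skill_error class, and the message format are hypothetical; only the SKILL_FILE_NOT_FOUND code comes from the table above, and the scripts may signal errors differently.

```r
# Hypothetical helper: raise an error condition tagged with a SKILL_* code.
skill_stop <- function(code, detail) {
  cond <- structure(
    list(message = paste0(code, ": ", detail), call = NULL),
    class = c(code, "skill_error", "error", "condition")
  )
  stop(cond)
}

# A path under the session's temp directory that has not been created.
input_file <- file.path(tempdir(), "does_not_exist.csv")

result <- tryCatch(
  {
    if (!file.exists(input_file)) {
      skill_stop("SKILL_FILE_NOT_FOUND", input_file)
    }
    "ok"
  },
  # Catch any coded error and return its message for reporting.
  skill_error = function(e) conditionMessage(e)
)
result
```

Tagging the condition with the code as a class would let a caller handle one specific code, for example with a SKILL_FILE_NOT_FOUND handler, while still catching the rest through the shared class.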


Testing

Smoke Check

# Check help
Rscript scripts/main.R --help

# Run analysis
Rscript scripts/main.R \
  -i tests/data/expression_matrix.csv \
  -g tests/data/groups.csv \
  -d case \
  -k 3 \
  -r 20 \
  -o output/example_basic \
  -t 120

Validation Commands

# Inspect selected model
cat output/example_basic/Cluster_res.csv

# Check output plots exist
ls -la output/example_basic
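
The same checks can be done from R, as in the sketch below. It assumes the files listed under Output Files are written directly into the output directory and that the is_best column can be read as a logical value; both are assumptions about the output layout.

```r
out_dir <- "output/example_basic"
expected <- c("Cluster_res.csv", "genes_for_clustering.csv",
              "samples_for_clustering.csv", "Consensus Matrix Plot.pdf",
              "CDF curve Plot.pdf", "session_info.txt")

# TRUE/FALSE per expected file.
present <- file.exists(file.path(out_dir, expected))
names(present) <- expected
present

# Show the selected model, if the summary table is there.
res_path <- file.path(out_dir, "Cluster_res.csv")
if (file.exists(res_path)) {
  res <- read.csv(res_path)
  print(res[as.logical(res$is_best), , drop = FALSE])
}
```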

Implementation Checklist

  • CLI parsing with optparse
  • set.seed() for reproducibility
  • requireNamespace() dependency checks
  • Session info recording
  • data.table::fread() input reading
  • File reading instructions in SKILL.md
  • Modular script structure (<150 lines per file)
  • Test data provided
  • Error handling with SKILL_* codes
  • Scripts in scripts/ directory
  • References in references/ directory

Last updated: 2026-04-17 | Version: 1.0.0