
Hierarchical Clustering Plot

AIPOCH

Use when building a sample-level hierarchical clustering dendrogram from a bulk expression matrix and sample annotation table, especially for QC, batch inspection, or sample similarity assessment. Trigger keywords: hierarchical clustering, dendrogram, sample QC, batch inspection, sample similarity. NOT for: differential expression testing, gene clustering heatmaps, single-cell clustering workflows.

FILES
hierarchical-clustering-plot/
  skill.md
  scripts/
    clustering_functions.R
    input_functions.R
    logging_utils.R
    main.R
    output_utils.R
    run_analysis.R
    runtime_utils.R
    validation_utils.R
  references/
    algorithm.md
    cli-guide.md
    troubleshooting.md
Total Score: 91 / 100

Core Capability: 95 / 100

| Category | Score |
|---|---|
| Functional Suitability | 12 / 12 |
| Reliability | 11 / 12 |
| Performance & Context | 8 / 8 |
| Agent Usability | 16 / 16 |
| Human Usability | 8 / 8 |
| Security | 11 / 12 |
| Maintainability | 12 / 12 |
| Agent-Specific | 17 / 20 |

Medical Task: 24 / 25 Passed

| Test case | Score | Checks passed |
|---|---|---|
| Default euclidean/complete clustering on 40-sample expression matrix | 92 | 5/5 |
| Average linkage with sample ID labels | 90 | 5/5 |
| Invalid distance method and missing required group file | 86 | 5/5 |
| Manhattan distance and ward.D2 linkage options | 88 | 5/5 |
| Timeout behavior and temp workspace cleanup | 83 | 4/5 |

SKILL.md

Hierarchical Clustering Plot

When to Use

Use this skill when you need a sample-level hierarchical clustering dendrogram from a bulk expression matrix and a sample annotation table.

  • Good fits: sample QC, batch inspection, sample similarity assessment, checking whether annotated sample groups cluster as expected.
  • Trigger keywords: hierarchical clustering, dendrogram, sample QC, batch inspection, sample similarity.
  • Not for: differential expression testing, gene clustering heatmaps, single-cell clustering workflows.

When to Read External Files

| Situation | File to Read | Purpose |
|---|---|---|
| Need algorithm details | references/algorithm.md | Distance calculation, linkage rules, and clustering assumptions |
| Need to run analysis or inspect CLI entrypoint behavior | scripts/main.R | Execute the workflow and inspect argument parsing, defaults, required flags, and sourced modules |
| Need workflow implementation details | scripts/run_analysis.R | See orchestration order, temp workspace handling, and output generation |
| Need logging or warning behavior | scripts/logging_utils.R | See standardized console log formatting and memory usage messages |
| Need file or parameter validation details | scripts/validation_utils.R | See path checks, output-directory checks, and scalar validation |
| Need timeout, temp workspace, or session info behavior | scripts/runtime_utils.R | See timeout control, temp cleanup, output copying, and session-info export |
| Need expression/group input handling | scripts/input_functions.R | See CSV loading, sample matching, and label extraction |
| Need clustering logic | scripts/clustering_functions.R | See distance calculation and hclust() generation |
| Need output-writing logic | scripts/output_utils.R | See CSV export and PDF rendering |
| Encounter errors, warnings, or unexpected clustering patterns | references/troubleshooting.md | Common failures, warning follow-up, and interpretation guidance |
| Need CLI examples or common parameter combinations | references/cli-guide.md | Detailed command patterns for standard, variant, and test runs |
| Need example input files or schema-concrete fixtures | tests/data/ | Inspect sample CSV layouts for expression and group inputs |
| Need expected output names or artifact formats | Output Files section and references/cli-guide.md | Confirm the files the workflow writes and inspect documented example previews |
| Need to run regression tests | tests/run_tests.R | Execute the automated test suite |
| Need exact test assertions or edge cases | tests/testthat/test-clustering.R | Inspect validation, reproducibility, and output checks |

Usage

Rscript scripts/main.R \
  --input_file ./expression_matrix.csv \
  --group_file ./sample_groups.csv \
  --output_dir ./output/ \
  --distance_method euclidean \
  --linkage_method complete \
  --label_column batch \
  --timeout_seconds 300 \
  --seed 42

Arguments

| Short | Long | Type | Default | Description |
|---|---|---|---|---|
| -i | --input_file | character | required | Expression matrix file (features as rows, samples as columns) |
| -g | --group_file | character | required | Sample annotation file (first column sample ID, one metadata column for labels) |
| -o | --output_dir | character | ./output/ | Output directory |
| -d | --distance_method | character | euclidean | Distance metric for dist(): euclidean, maximum, manhattan, canberra, binary, minkowski |
| -m | --linkage_method | character | complete | Linkage method for hclust(): complete, single, average, mcquitty, median, centroid, ward.D, ward.D2 |
| -l | --label_column | character | second column | Column used as dendrogram labels |
| -c | --label_cex | numeric | 0.8 | Dendrogram label size, must be > 0 |
| -t | --timeout_seconds | integer | 300 | Elapsed time limit in seconds, must be > 0 |
| -s | --seed | integer | 42 | Random seed for reproducibility |
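
The constraints in the table above could be enforced with a few base R checks. The following is a minimal sketch, not the skill's actual scripts/validation_utils.R: the helper names `validate_choice` and `validate_positive` and the exact message wording are assumptions made for illustration.

```r
# Allowed values, copied from the Arguments table.
DISTANCE_METHODS <- c("euclidean", "maximum", "manhattan",
                      "canberra", "binary", "minkowski")
LINKAGE_METHODS <- c("complete", "single", "average", "mcquitty",
                     "median", "centroid", "ward.D", "ward.D2")

# Hypothetical helper: fail unless `value` is a single string from `choices`.
validate_choice <- function(value, choices, name) {
  if (!is.character(value) || length(value) != 1L || !(value %in% choices)) {
    stop(sprintf("SKILL_INVALID_PARAMETER: --%s must be one of: %s",
                 name, paste(choices, collapse = ", ")), call. = FALSE)
  }
  invisible(value)
}

# Hypothetical helper for scalar numerics such as --label_cex
# and --timeout_seconds, which the table requires to be > 0.
validate_positive <- function(value, name) {
  if (!is.numeric(value) || length(value) != 1L || is.na(value) || value <= 0) {
    stop(sprintf("SKILL_INVALID_PARAMETER: --%s must be > 0", name),
         call. = FALSE)
  }
  invisible(value)
}

validate_choice("manhattan", DISTANCE_METHODS, "distance_method")
validate_choice("ward.D2", LINKAGE_METHODS, "linkage_method")
validate_positive(0.8, "label_cex")
```

Checking against an explicit list before calling dist() or hclust() keeps the failure message tied to the CLI flag rather than to an internal R error.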

Input Format

Expression Matrix (input_file)

Features as rows, samples as columns, CSV format with feature IDs in the first column.

,Sample01,Sample02,Sample03
TSPAN6,1.847876677,1.831755661,3.827625975
TNMD,0.034919984,0.053250385,1.388850793

Requirements:

  • The first column contains unique feature IDs.
  • All sample columns must be numeric.
  • Sample column names must be unique and non-empty.
  • At least two matched samples are required.
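
As an illustration of the requirements above, the sketch below reads the example CSV and checks it with base R. The function name `validate_expression` and the mapping of each check to a SKILL_* code are assumptions; the real logic lives in scripts/input_functions.R and may differ.

```r
# Example matrix in the documented layout: feature IDs first, then samples.
csv_text <- ",Sample01,Sample02,Sample03
TSPAN6,1.847876677,1.831755661,3.827625975
TNMD,0.034919984,0.053250385,1.388850793"

# check.names = FALSE keeps sample names exactly as written in the header.
expr_df <- read.csv(text = csv_text, check.names = FALSE,
                    stringsAsFactors = FALSE)

# Hypothetical helper: apply the documented checks and return a numeric
# feature-by-sample matrix.
validate_expression <- function(expr_df) {
  if (nrow(expr_df) == 0L) {
    stop("SKILL_EMPTY_DATA: no data rows", call. = FALSE)
  }
  feature_ids <- expr_df[[1]]
  samples <- expr_df[-1]
  sample_names <- names(samples)
  if (anyDuplicated(feature_ids) > 0L) {
    stop("SKILL_INVALID_DATA: duplicate feature IDs", call. = FALSE)
  }
  if (any(!nzchar(sample_names)) || anyDuplicated(sample_names) > 0L) {
    stop("SKILL_INVALID_DATA: sample names must be unique and non-empty",
         call. = FALSE)
  }
  if (!all(vapply(samples, is.numeric, logical(1)))) {
    stop("SKILL_INVALID_TYPE: all sample columns must be numeric",
         call. = FALSE)
  }
  mat <- as.matrix(samples)
  rownames(mat) <- feature_ids
  mat
}

expr_matrix <- validate_expression(expr_df)
dim(expr_matrix)  # 2 features by 3 samples
```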

Sample Annotation (group_file)

CSV with sample IDs in the first column. The second column is used by default for leaf labels unless --label_column is provided.

sample,batch
Sample01,batch1
Sample02,batch2
Sample03,batch1

Requirements:

  • Sample IDs must match expression matrix column names exactly.
  • The selected label column must exist and contain no empty values.
  • The file must contain at least one metadata column in addition to sample IDs.
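
The label-column rules above can be sketched as follows. This is an assumption-laden illustration rather than the skill's own code: `resolve_labels` is a hypothetical helper, and the choice of SKILL_* code for each failure is a guess based on the Common Errors table.

```r
groups <- data.frame(sample = c("Sample01", "Sample02", "Sample03"),
                     batch = c("batch1", "batch2", "batch1"),
                     stringsAsFactors = FALSE)

# Hypothetical helper: pick the label column (default: second column)
# and check the documented annotation requirements.
resolve_labels <- function(groups, label_column = NULL) {
  if (ncol(groups) < 2L) {
    stop("SKILL_MISSING_COLUMNS: need sample IDs plus one metadata column",
         call. = FALSE)
  }
  if (is.null(label_column)) {
    label_column <- names(groups)[2]
  }
  if (!(label_column %in% names(groups))) {
    stop(sprintf("SKILL_INVALID_PARAMETER: label column '%s' not found",
                 label_column), call. = FALSE)
  }
  labels <- as.character(groups[[label_column]])
  if (anyNA(labels) || any(!nzchar(trimws(labels)))) {
    stop("SKILL_INVALID_DATA: label column contains empty values",
         call. = FALSE)
  }
  labels
}

resolve_labels(groups)            # default: second column (batch)
resolve_labels(groups, "sample")  # equivalent to --label_column sample
```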

Output Files

| File | Description |
|---|---|
| hierarchical_clustering_plot.pdf | Sample dendrogram plot |
| sample_distance_matrix.csv | Pairwise sample distance matrix |
| clustering_order.csv | Leaf order shown in the dendrogram |
| matched_samples.csv | Sample-to-label table used for plotting |
| session_info.txt | R session and package version info |

Workflow

Step 1: Validate Input

WHEN checking file or parameter validation, READ: scripts/validation_utils.R

WHEN checking expression/group CSV handling, READ: scripts/input_functions.R

  • Check file existence
  • Reject empty files before parsing
  • Read the expression matrix and sample annotation CSV files
  • Validate required columns, unique IDs, and numeric expression values

Step 2: Align Samples

WHEN checking sample matching logic, READ: scripts/input_functions.R

  • Match sample IDs between the annotation file and expression matrix
  • Reorder matrix columns to the annotation file order
  • Select the label column used for plotting
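
A minimal sketch of the alignment step is shown below. It assumes exact, case-sensitive ID matching and keeps only samples present in both inputs; whether the real workflow drops unmatched samples or rejects them outright is not stated here, so treat that behavior, and the helper name `align_samples`, as assumptions.

```r
expr_matrix <- matrix(c(1.8, 0.03, 1.8, 0.05, 3.8, 1.4), nrow = 2,
                      dimnames = list(c("TSPAN6", "TNMD"),
                                      c("Sample01", "Sample02", "Sample03")))
# Annotation rows deliberately in a different order than the matrix columns.
groups <- data.frame(sample = c("Sample03", "Sample01", "Sample02"),
                     batch = c("batch1", "batch1", "batch2"),
                     stringsAsFactors = FALSE)

# Hypothetical helper: reorder matrix columns to the annotation file order.
align_samples <- function(expr_matrix, groups) {
  sample_ids <- as.character(groups[[1]])
  keep <- sample_ids[sample_ids %in% colnames(expr_matrix)]
  if (length(keep) < 2L) {
    stop("SKILL_SAMPLE_MISMATCH: fewer than two matched samples",
         call. = FALSE)
  }
  list(matrix = expr_matrix[, keep, drop = FALSE],
       groups = groups[match(keep, sample_ids), , drop = FALSE])
}

aligned <- align_samples(expr_matrix, groups)
colnames(aligned$matrix)  # now follows the annotation order
```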

Step 3: Build Hierarchical Clustering

WHEN interpreting distance or linkage behavior, READ: references/algorithm.md

WHEN checking clustering implementation, READ: scripts/clustering_functions.R

  • Transpose the expression matrix to sample-by-feature form
  • Compute pairwise sample distances with dist()
  • Build the dendrogram with hclust()
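
The three steps above map directly onto base R calls. The sketch below uses simulated data (an assumption made only for illustration) with the documented default methods; it is not a copy of scripts/clustering_functions.R.

```r
set.seed(42)

# Simulated feature-by-sample matrix: 5 features, 4 samples.
expr_matrix <- matrix(rnorm(5 * 4), nrow = 5,
                      dimnames = list(paste0("gene", 1:5),
                                      paste0("Sample0", 1:4)))

# dist() works on rows, so transpose to sample-by-feature first.
sample_dist <- dist(t(expr_matrix), method = "euclidean")

# Build the tree with the default linkage method.
hc <- hclust(sample_dist, method = "complete")

# Leaf order, left to right, as it would appear in the dendrogram.
hc$labels[hc$order]
```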

Step 4: Save Outputs

WHEN checking output staging and cleanup behavior, READ: scripts/run_analysis.R

WHEN checking PDF/CSV export behavior, READ: scripts/output_utils.R

WHEN checking timeout, session info, or final file copy behavior, READ: scripts/runtime_utils.R

  • Stage outputs in a temporary workspace
  • Export the pairwise distance matrix
  • Export the plotted leaf order
  • Render the dendrogram as PDF
  • Copy finalized outputs into the requested output directory
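
One way to stage outputs before publishing them is sketched below. The output file names come from the Output Files table, but the helper name `write_outputs`, the CSV column names, and the PDF dimensions are assumptions; the real behavior is defined in scripts/run_analysis.R, scripts/output_utils.R, and scripts/runtime_utils.R.

```r
# Hypothetical helper: write everything into a temp directory first,
# then copy the finished files into output_dir.
write_outputs <- function(sample_dist, hc, output_dir) {
  staging <- tempfile("hclust_stage_")
  dir.create(staging)
  # Remove the staging directory even if a later step fails.
  on.exit(unlink(staging, recursive = TRUE), add = TRUE)

  write.csv(as.matrix(sample_dist),
            file.path(staging, "sample_distance_matrix.csv"))
  write.csv(data.frame(position = seq_along(hc$order),
                       sample = hc$labels[hc$order]),
            file.path(staging, "clustering_order.csv"), row.names = FALSE)

  pdf(file.path(staging, "hierarchical_clustering_plot.pdf"),
      width = 8, height = 6)
  # Always close the device, even if plotting raises an error.
  tryCatch(plot(hc, cex = 0.8, main = "Sample dendrogram",
                xlab = "", sub = ""),
           finally = dev.off())

  dir.create(output_dir, recursive = TRUE, showWarnings = FALSE)
  staged <- list.files(staging, full.names = TRUE)
  ok <- file.copy(staged, output_dir, overwrite = TRUE)
  if (!all(ok)) {
    stop("SKILL_WRITE_ERROR: could not copy outputs", call. = FALSE)
  }
  invisible(file.path(output_dir, basename(staged)))
}

m <- matrix(c(1, 2, 3, 4, 10, 12), nrow = 2,
            dimnames = list(c("g1", "g2"), c("S1", "S2", "S3")))
d <- dist(t(m))
written <- write_outputs(d, hclust(d), file.path(tempdir(), "output"))
```

Staging means a failed or timed-out run does not leave a half-written set of files in the requested output directory.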

Methods

Distance Matrix

Sample distances are computed from the transposed expression matrix using base R dist().
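
The transposition matters because dist() measures distances between rows. A small sketch with made-up values (an assumption for illustration only) shows the difference:

```r
# Toy feature-by-sample matrix: 2 features, 3 samples.
expr_matrix <- matrix(1:6, nrow = 2,
                      dimnames = list(c("TSPAN6", "TNMD"),
                                      c("Sample01", "Sample02", "Sample03")))

# Without transposing, dist() compares the 2 features.
attr(dist(expr_matrix), "Size")     # 2

# After transposing, dist() compares the 3 samples.
attr(dist(t(expr_matrix)), "Size")  # 3

# Manhattan distance between Sample01 (1, 2) and Sample02 (3, 4).
as.matrix(dist(t(expr_matrix), method = "manhattan"))["Sample01", "Sample02"]  # 4
```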

Hierarchical Clustering

The clustering tree is built with base R hclust(). The default linkage method is complete, matching the source analysis script.


Examples

Basic Usage

Rscript scripts/main.R \
  -i tests/data/sample_expression_matrix.csv \
  -g tests/data/sample_groups.csv \
  -o ./output/ \
  -t 300

Use Sample IDs as Labels

Rscript scripts/main.R \
  -i tests/data/sample_expression_matrix.csv \
  -g tests/data/sample_groups.csv \
  -o ./output_sample_labels/ \
  -l sample

Use Average Linkage

Rscript scripts/main.R \
  -i tests/data/sample_expression_matrix.csv \
  -g tests/data/sample_groups.csv \
  -o ./output_average/ \
  -m average

Error Handling

Common Errors

| Error | Cause | Solution | Read More |
|---|---|---|---|
| SKILL_DEPENDENCY_MISSING | Required R package is not installed | Install the missing package and rerun | references/troubleshooting.md#skill_dependency_missing |
| SKILL_FILE_NOT_FOUND | Input file does not exist or output directory could not be created | Check the path and permissions | references/troubleshooting.md#skill_file_not_found |
| SKILL_EMPTY_FILE | Input file is empty | Re-export the CSV and confirm it contains data | references/troubleshooting.md#skill_empty_file |
| SKILL_EMPTY_DATA | CSV parsed successfully but contains no data rows | Confirm the CSV has at least one data row | references/troubleshooting.md#skill_empty_data |
| SKILL_PARSE_ERROR | CSV parsing failed | Check encoding, delimiters, and CSV structure | references/troubleshooting.md#skill_parse_error |
| SKILL_MISSING_COLUMNS | Expected columns or headers are missing | Check CSV headers and metadata columns | references/troubleshooting.md#skill_missing_columns |
| SKILL_INVALID_TYPE | Expression values or parameters have the wrong type | Ensure numeric fields are numeric | references/troubleshooting.md#skill_invalid_type |
| SKILL_SAMPLE_MISMATCH | Sample IDs do not match | Ensure the first column in group_file matches matrix column names | references/troubleshooting.md#skill_sample_mismatch |
| SKILL_INVALID_DATA | Expression or annotation data is malformed | Check duplicate IDs, missing labels, and numeric values | references/troubleshooting.md#skill_invalid_data |
| SKILL_INVALID_PARAMETER | Unsupported distance, linkage, or label parameter | Use one of the documented parameter values | references/troubleshooting.md#skill_invalid_parameter |
| SKILL_TIMEOUT | Analysis exceeded the time limit | Increase --timeout_seconds and rerun | references/troubleshooting.md#skill_timeout |
| SKILL_PLOT_ERROR | Plot device failed while writing PDF | Check output directory permissions and rerun | references/troubleshooting.md#skill_plot_error |
| SKILL_WRITE_ERROR | Output or intermediate files could not be written | Check output directory permissions and free disk space | references/troubleshooting.md#skill_write_error |
| SKILL_WARNING | Non-fatal warning occurred during execution | Inspect console warnings and verify output quality | references/troubleshooting.md#skill_warning |
| SKILL_MEMORY_WARNING | Memory usage exceeded the warning threshold | Reduce input size or rerun with more memory | references/troubleshooting.md#skill_memory_warning |

IF error persists, READ: references/troubleshooting.md
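
For readers who want to see how SKILL_* codes could be attached to R errors, here is a minimal sketch using condition classes and tryCatch(). The names `skill_error` and `run_with_error_codes` are hypothetical, and "UNCLASSIFIED" is a placeholder rather than a documented code; the skill's actual classification logic may be organized differently.

```r
# Hypothetical constructor: an error condition whose first class is the code.
skill_error <- function(code, message) {
  structure(class = c(code, "skill_error", "error", "condition"),
            list(message = message, call = NULL))
}

# Hypothetical runner: evaluate `expr`, log any error with its code,
# and return the code (or "OK" on success).
run_with_error_codes <- function(expr) {
  tryCatch(
    {
      force(expr)
      "OK"
    },
    skill_error = function(e) {
      message(sprintf("[%s] %s", class(e)[1], conditionMessage(e)))
      class(e)[1]
    },
    error = function(e) {
      # Placeholder label for errors raised without a SKILL_* class.
      message(sprintf("[UNCLASSIFIED] %s", conditionMessage(e)))
      "UNCLASSIFIED"
    }
  )
}

status <- run_with_error_codes(
  stop(skill_error("SKILL_FILE_NOT_FOUND", "input file does not exist"))
)
status
```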


Testing

Test with Sample Data

# Check help
Rscript scripts/main.R --help

# Run with sample data
Rscript scripts/main.R \
  -i tests/data/sample_expression_matrix.csv \
  -g tests/data/sample_groups.csv \
  -o ./output/

# Run unit tests (requires testthat and data.table)
Rscript tests/run_tests.R

Validation Commands

# Check main output plot exists
ls -la ./output/hierarchical_clustering_plot.pdf

# Count lines in the clustering order file
wc -l ./output/clustering_order.csv

Implementation Checklist

  • CLI parsing with optparse
  • set.seed() for reproducibility
  • Input validation (file existence, emptiness, types, required columns)
  • Try-catch based fatal error handling
  • Standardized SKILL_* error classification
  • Timeout control with setTimeLimit()
  • Standardized console-only logging
  • Base R clustering implementation
  • Session info recording with sink()
  • Temporary workspace cleanup with on.exit()
  • Memory usage reporting with gc()
  • File reading instructions in SKILL.md
  • Modular script structure across scripts/
  • Test template added under tests/testthat/
  • Test data provided
  • Error handling with SKILL_* codes
  • get_script_dir() defined before use
  • Scripts in scripts/ directory
  • References in references/ directory
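
Two of the checklist items, timeout control with setTimeLimit() and cleanup with on.exit(), can be combined as in the sketch below. The wrapper name `with_timeout` is hypothetical, and detecting the timeout by matching the English message "reached elapsed time limit" is an assumption that may not hold in a localized R session; scripts/runtime_utils.R is the authoritative implementation.

```r
# Hypothetical wrapper: evaluate `expr` under an elapsed-time limit.
with_timeout <- function(expr, timeout_seconds) {
  setTimeLimit(elapsed = timeout_seconds, transient = TRUE)
  # Always clear the limit again, whether expr succeeds or fails.
  on.exit(setTimeLimit(elapsed = Inf, transient = FALSE), add = TRUE)
  tryCatch(expr, error = function(e) {
    if (grepl("reached elapsed time limit", conditionMessage(e),
              fixed = TRUE)) {
      stop("SKILL_TIMEOUT: analysis exceeded the time limit", call. = FALSE)
    }
    stop(e)  # re-raise anything that is not a timeout
  })
}

# Finishes well inside the limit and returns the value of the expression.
with_timeout(sum(1:10), timeout_seconds = 5)
```

The limit is only checked at points where R can be interrupted, so long-running compiled code may overrun it.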

Last updated: 2026-04-16 | Version: 1.0.0