Hierarchical Clustering Plot
Use when building a sample-level hierarchical clustering dendrogram from a bulk expression matrix and sample annotation table, especially for QC, batch inspection, or sample similarity assessment. Trigger keywords: hierarchical clustering, dendrogram, sample QC, batch inspection, sample similarity. NOT for: differential expression testing, gene clustering heatmaps, single-cell clustering workflows.
SKILL.md
Hierarchical Clustering Plot
When to Use
Use this skill when you need a sample-level hierarchical clustering dendrogram from a bulk expression matrix and a sample annotation table.
- Good fits: sample QC, batch inspection, sample similarity assessment, checking whether annotated sample groups cluster as expected.
- Trigger keywords: hierarchical clustering, dendrogram, sample QC, batch inspection, sample similarity.
- Not for: differential expression testing, gene clustering heatmaps, single-cell clustering workflows.
When to Read External Files
| Situation | File to Read | Purpose |
|---|---|---|
| Need algorithm details | references/algorithm.md | Distance calculation, linkage rules, and clustering assumptions |
| Need to run analysis or inspect CLI entrypoint behavior | scripts/main.R | Execute the workflow and inspect argument parsing, defaults, required flags, and sourced modules |
| Need workflow implementation details | scripts/run_analysis.R | See orchestration order, temp workspace handling, and output generation |
| Need logging or warning behavior | scripts/logging_utils.R | See standardized console log formatting and memory usage messages |
| Need file or parameter validation details | scripts/validation_utils.R | See path checks, output-directory checks, and scalar validation |
| Need timeout, temp workspace, or session info behavior | scripts/runtime_utils.R | See timeout control, temp cleanup, output copying, and session-info export |
| Need expression/group input handling | scripts/input_functions.R | See CSV loading, sample matching, and label extraction |
| Need clustering logic | scripts/clustering_functions.R | See distance calculation and hclust() generation |
| Need output-writing logic | scripts/output_utils.R | See CSV export and PDF rendering |
| Encounter errors, warnings, or unexpected clustering patterns | references/troubleshooting.md | Common failures, warning follow-up, and interpretation guidance |
| Need CLI examples or common parameter combinations | references/cli-guide.md | Detailed command patterns for standard, variant, and test runs |
| Need example input files or schema-concrete fixtures | tests/data/ | Inspect sample CSV layouts for expression and group inputs |
| Need expected output names or artifact formats | ## Output Files and references/cli-guide.md | Confirm the files the workflow writes and inspect documented example previews |
| Need to run regression tests | tests/run_tests.R | Execute the automated test suite |
| Need exact test assertions or edge cases | tests/testthat/test-clustering.R | Inspect validation, reproducibility, and output checks |
Usage
Rscript scripts/main.R \
--input_file ./expression_matrix.csv \
--group_file ./sample_groups.csv \
--output_dir ./output/ \
--distance_method euclidean \
--linkage_method complete \
--label_column batch \
--timeout_seconds 300 \
--seed 42
Arguments
| Short | Long | Type | Default | Description |
|---|---|---|---|---|
-i | --input_file | character | required | Expression matrix file (features as rows, samples as columns) |
-g | --group_file | character | required | Sample annotation file (first column sample ID, one metadata column for labels) |
-o | --output_dir | character | ./output/ | Output directory |
-d | --distance_method | character | euclidean | Distance metric for dist(): euclidean, maximum, manhattan, canberra, binary, minkowski |
-m | --linkage_method | character | complete | Linkage method for hclust(): complete, single, average, mcquitty, median, centroid, ward.D, ward.D2 |
-l | --label_column | character | second column | Column used as dendrogram labels |
-c | --label_cex | numeric | 0.8 | Dendrogram label size, must be > 0 |
-t | --timeout_seconds | integer | 300 | Elapsed time limit in seconds, must be > 0 |
-s | --seed | integer | 42 | Random seed for reproducibility |
Input Format
Expression Matrix (input_file)
Features as rows, samples as columns, CSV format with feature IDs in the first column.
,Sample01,Sample02,Sample03
TSPAN6,1.847876677,1.831755661,3.827625975
TNMD,0.034919984,0.053250385,1.388850793
Requirements:
- The first column contains unique feature IDs.
- All sample columns must be numeric.
- Sample column names must be unique and non-empty.
- At least two matched samples are required.
Sample Annotation (group_file)
CSV with sample IDs in the first column. The second column is used by default for leaf labels unless --label_column is provided.
sample,batch
Sample01,batch1
Sample02,batch2
Sample03,batch1
Requirements:
- Sample IDs must match expression matrix column names exactly.
- The selected label column must exist and contain no empty values.
- The file must contain at least one metadata column in addition to sample IDs.
Output Files
| File | Description |
|---|---|
hierarchical_clustering_plot.pdf | Sample dendrogram plot |
sample_distance_matrix.csv | Pairwise sample distance matrix |
clustering_order.csv | Leaf order shown in the dendrogram |
matched_samples.csv | Sample-to-label table used for plotting |
session_info.txt | R session and package version info |
Workflow
Step 1: Validate Input
WHEN checking file or parameter validation, READ: scripts/validation_utils.R
WHEN checking expression/group CSV handling, READ: scripts/input_functions.R
- Check file existence
- Reject empty files before parsing
- Read the expression matrix and sample annotation CSV files
- Validate required columns, unique IDs, and numeric expression values
Step 2: Align Samples
WHEN checking sample matching logic, READ: scripts/input_functions.R
- Match sample IDs between the annotation file and expression matrix
- Reorder matrix columns to the annotation file order
- Select the label column used for plotting
Step 3: Build Hierarchical Clustering
WHEN interpreting distance or linkage behavior, READ: references/algorithm.md
WHEN checking clustering implementation, READ: scripts/clustering_functions.R
- Transpose the expression matrix to sample-by-feature form
- Compute pairwise sample distances with
dist() - Build the dendrogram with
hclust()
Step 4: Save Outputs
WHEN checking output staging and cleanup behavior, READ: scripts/run_analysis.R
WHEN checking PDF/CSV export behavior, READ: scripts/output_utils.R
WHEN checking timeout, session info, or final file copy behavior, READ: scripts/runtime_utils.R
- Stage outputs in a temporary workspace
- Export the pairwise distance matrix
- Export the plotted leaf order
- Render the dendrogram as PDF
- Copy finalized outputs into the requested output directory
Methods
Distance Matrix
Sample distances are computed from the transposed expression matrix using base R dist().
Hierarchical Clustering
The clustering tree is built with base R hclust(). The default linkage method is complete, matching the source analysis script.
Examples
Basic Usage
Rscript scripts/main.R \
-i tests/data/sample_expression_matrix.csv \
-g tests/data/sample_groups.csv \
-o ./output/ \
-t 300
Use Sample IDs as Labels
Rscript scripts/main.R \
-i tests/data/sample_expression_matrix.csv \
-g tests/data/sample_groups.csv \
-o ./output_sample_labels/ \
-l sample
Use Average Linkage
Rscript scripts/main.R \
-i tests/data/sample_expression_matrix.csv \
-g tests/data/sample_groups.csv \
-o ./output_average/ \
-m average
Error Handling
Common Errors
| Error | Cause | Solution | Read More |
|---|---|---|---|
SKILL_DEPENDENCY_MISSING | Required R package is not installed | Install the missing package and rerun | references/troubleshooting.md#skill_dependency_missing |
SKILL_FILE_NOT_FOUND | Input file does not exist or output directory could not be created | Check the path and permissions | references/troubleshooting.md#skill_file_not_found |
SKILL_EMPTY_FILE | Input file is empty | Re-export the CSV and confirm it contains data | references/troubleshooting.md#skill_empty_file |
SKILL_EMPTY_DATA | CSV parsed successfully but contains no data rows | Confirm the CSV has at least one data row | references/troubleshooting.md#skill_empty_data |
SKILL_PARSE_ERROR | CSV parsing failed | Check encoding, delimiters, and CSV structure | references/troubleshooting.md#skill_parse_error |
SKILL_MISSING_COLUMNS | Expected columns or headers are missing | Check CSV headers and metadata columns | references/troubleshooting.md#skill_missing_columns |
SKILL_INVALID_TYPE | Expression values or parameters have the wrong type | Ensure numeric fields are numeric | references/troubleshooting.md#skill_invalid_type |
SKILL_SAMPLE_MISMATCH | Sample IDs do not match | Ensure the first column in group_file matches matrix column names | references/troubleshooting.md#skill_sample_mismatch |
SKILL_INVALID_DATA | Expression or annotation data is malformed | Check duplicate IDs, missing labels, and numeric values | references/troubleshooting.md#skill_invalid_data |
SKILL_INVALID_PARAMETER | Unsupported distance, linkage, or label parameter | Use one of the documented parameter values | references/troubleshooting.md#skill_invalid_parameter |
SKILL_TIMEOUT | Analysis exceeded the time limit | Increase --timeout_seconds and rerun | references/troubleshooting.md#skill_timeout |
SKILL_PLOT_ERROR | Plot device failed while writing PDF | Check output directory permissions and rerun | references/troubleshooting.md#skill_plot_error |
SKILL_WRITE_ERROR | Output or intermediate files could not be written | Check output directory permissions and free disk space | references/troubleshooting.md#skill_write_error |
SKILL_WARNING | Non-fatal warning occurred during execution | Inspect console warnings and verify output quality | references/troubleshooting.md#skill_warning |
SKILL_MEMORY_WARNING | Memory usage exceeded the warning threshold | Reduce input size or rerun with more memory | references/troubleshooting.md#skill_memory_warning |
IF error persists, READ: references/troubleshooting.md
Testing
Test with Sample Data
# Check help
Rscript scripts/main.R --help
# Run with sample data
Rscript scripts/main.R \
-i tests/data/sample_expression_matrix.csv \
-g tests/data/sample_groups.csv \
-o ./output/
# Run unit tests (requires testthat and data.table)
Rscript tests/run_tests.R
Validation Commands
# Check main output plot exists
ls -la ./output/hierarchical_clustering_plot.pdf
# Inspect clustering order
wc -l ./output/clustering_order.csv
Implementation Checklist
- CLI parsing with
optparse -
set.seed()for reproducibility - Input validation (file existence, emptiness, types, required columns)
- Try-catch based fatal error handling
- Standardized
SKILL_*error classification - Timeout control with
setTimeLimit() - Standardized console-only logging
- Base R clustering implementation
- Session info recording with
sink() - Temporary workspace cleanup with
on.exit() - Memory usage reporting with
gc() - File reading instructions in SKILL.md
- Modular script structure across
scripts/ - Test template added under
tests/testthat/ - Test data provided
- Error handling with
SKILL_*codes -
get_script_dir()defined before use - Scripts in
scripts/directory - References in
references/directory
Last updated: 2026-04-16 | Version: 1.0.0