UMAP and t-SNE Analysis

Prerequisites

Run the following before the first analysis to install all required R packages:

Rscript scripts/install_dependencies.R

Alternative manual installation:

Rscript -e "install.packages(c('optparse','data.table','Rtsne','umap','ggplot2','vegan','R.utils'), repos='https://cloud.r-project.org')"

Note: R.utils is only required when --timeout > 0, but pre-installing it avoids environment drift across runs. testthat is installed by scripts/install_dependencies.R as the development test dependency.

The skill cannot run until these packages are installed. In new or bare R environments, always run the prerequisite step first.

When to Read External Files

Situation	File to Read	Purpose
Need algorithm details	`references/algorithm.md`	Dimensionality reduction methods, assumptions, parameter interpretation
Need to run analysis	`scripts/main.R`	Execute: `Rscript scripts/main.R --input_file ... --group_file ...`
Encounter errors	`references/troubleshooting.md`	Common errors and solutions
Need CLI examples	`references/cli-guide.md`	Detailed CLI usage examples
Need test data	`tests/data/`	Sample input files for testing

Usage

Rscript scripts/main.R \
  --input_file ./otu_table.csv \
  --group_file ./group_info.csv \
  --output_dir ./output/ \
  --method both \
  --seed 42

Arguments

Short	Long	Type	Default	Description
`-i`	`--input_file`	character	required	Abundance / OTU matrix file
`-g`	`--group_file`	character	required	Group information file
`-o`	`--output_dir`	character	`./output/`	Output directory
`-m`	`--method`	character	`both`	Method: `tsne`, `umap`, or `both`
	`--sample_id_col`	character	first column	Sample ID column in group file
	`--group_col`	character	second column	Group column in group file
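	`--perplexity`	numeric	`25`	t-SNE perplexity (roughly the effective number of neighbors considered per sample)
	`--theta`	numeric	`0.0`	t-SNE Barnes-Hut speed/accuracy trade-off; `0.0` runs exact t-SNE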
	`--pca`	logical	`FALSE`	Whether to use PCA before t-SNE
	`--check_duplicates`	logical	`FALSE`	Whether t-SNE should check duplicated rows
	`--normalize`	logical	`TRUE`	Whether to normalize data before UMAP
	`--norm_method`	character	`hellinger`	Normalization method for `vegan::decostand()`
	`--n_neighbors`	integer	`10`	UMAP neighborhood size
`-s`	`--seed`	integer	`42`	Random seed for reproducibility
`-t`	`--timeout`	integer	`0`	Timeout in seconds; `0` disables timeout

Dependency Baseline

The skill was validated with the exact package baseline recorded in dependencies.lock.tsv.

Package	Tested Version
`optparse`	`1.7.5`
`data.table`	`1.15.4`
`Rtsne`	`0.17`
`umap`	`0.2.10.0`
`ggplot2`	`3.4.0`
`vegan`	`2.7.3`
`R.utils`	`2.13.0`
`testthat`	`3.1.2`

Use dependencies.lock.tsv as the reproducibility baseline when validating a new environment.

Input Format

Abundance / OTU Matrix (`input_file`)

A CSV/TSV-style tabular file with features as rows and samples as columns. The first column holds the feature ID.

OTU_ID,S1,S2,S3,S4
OTU_1,10,3,0,5
OTU_2,2,8,1,0
OTU_3,0,0,6,9

Group File (`group_file`)

Tabular file with at least two columns: sample ID and group label.

SampleID,Group
S1,Control
S2,Control
S3,Treatment
S4,Treatment

Requirements:

At least 2 groups with at least 2 samples per group are required.
All sample IDs in the group file must exist in the matrix columns.
Single-group inputs will produce a SKILL_INVALID_PARAMETER error because dimensionality reduction without group contrast produces uninterpretable plots.
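
The requirements above can be checked before invoking the script. The following is a minimal sketch, assuming comma-separated files with the sample ID in the first column and the group label in the second (the documented defaults). The function name `preflight_check` and its messages are illustrative and are not part of the skill.

```shell
# Hypothetical pre-flight check; not part of scripts/main.R.
# Assumes comma-separated files, sample ID in column 1, group label in column 2.
preflight_check() {   # $1 = matrix CSV, $2 = group CSV
  awk -F',' '
    NR == FNR {                          # matrix: keep sample IDs from the header only
      if (FNR == 1) for (i = 2; i <= NF; i++) in_matrix[$i] = 1
      next
    }
    FNR == 1 { next }                    # skip the group-file header
    !($1 in in_matrix) { print "not in matrix: " $1; bad = 1 }
    { per_group[$2]++ }
    END {
      for (g in per_group) {
        n_groups++
        if (per_group[g] < 2) { print "fewer than 2 samples in group: " g; bad = 1 }
      }
      if (n_groups < 2) { print "fewer than 2 groups"; bad = 1 }
      if (!bad) print "OK"
      exit bad + 0
    }
  ' "$1" "$2"
}

# Demo with the example files shown above.
demo_dir="$(mktemp -d)"
printf 'OTU_ID,S1,S2,S3,S4\nOTU_1,10,3,0,5\nOTU_2,2,8,1,0\nOTU_3,0,0,6,9\n' > "$demo_dir/otu_table.csv"
printf 'SampleID,Group\nS1,Control\nS2,Control\nS3,Treatment\nS4,Treatment\n' > "$demo_dir/group_info.csv"
preflight_check "$demo_dir/otu_table.csv" "$demo_dir/group_info.csv"   # prints: OK
```

The function exits non-zero when any requirement fails, so it can gate the real run in a wrapper script.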

Output Files

File	Description
`table/tsne_coordinates.csv`	t-SNE coordinates with sample and group annotations
`table/umap_coordinates.csv`	UMAP coordinates with sample and group annotations
`plot/tsne_plot.pdf`	t-SNE scatter plot with group colors and ellipses
`plot/umap_plot.pdf`	UMAP scatter plot with group colors and ellipses
`data/session_info.txt`	R session and package version info
`data/analysis_data.rda`	Saved analysis object with aligned matrix, metadata, colors, and runtime parameters

Workflow

Step 1: Validate Input

Check that matrix file and group file exist
Resolve sample ID and group columns
Validate at least 2 groups and at least 2 samples per group
Ensure all group-file sample IDs exist in the matrix
Remove samples with zero total abundance after alignment if needed

Step 2: Prepare Matrix

Convert input table into numeric matrix
Align matrix columns to sample order from the group file
Transpose matrix so rows become samples and columns become features
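
To make the alignment and transposition concrete, here is a small awk sketch of the same reshaping. It is an illustration under stated assumptions (comma-separated input, sample ID in the first column of the group file), not the code used by scripts/main.R, and the function name `align_and_transpose` is hypothetical.

```shell
# Illustrative only: reorder matrix columns to the group-file sample order,
# then emit one row per sample with one column per feature.
align_and_transpose() {   # $1 = matrix CSV, $2 = group CSV
  awk -F',' -v OFS=',' '
    NR == FNR {
      if (FNR == 1) { for (i = 2; i <= NF; i++) col[$i] = i; next }
      nfeat++; feat[nfeat] = $1
      for (i = 2; i <= NF; i++) val[nfeat, i] = $i
      next
    }
    FNR == 1 {                           # group-file header: write the output header
      line = "SampleID"
      for (f = 1; f <= nfeat; f++) line = line OFS feat[f]
      print line
      next
    }
    ($1 in col) {                        # one row per sample, in group-file order
      line = $1
      for (f = 1; f <= nfeat; f++) line = line OFS val[f, col[$1]]
      print line
    }
  ' "$1" "$2"
}

prep_dir="$(mktemp -d)"
printf 'OTU_ID,S1,S2,S3,S4\nOTU_1,10,3,0,5\nOTU_2,2,8,1,0\nOTU_3,0,0,6,9\n' > "$prep_dir/otu_table.csv"
printf 'SampleID,Group\nS3,Treatment\nS1,Control\n' > "$prep_dir/group_info.csv"
align_and_transpose "$prep_dir/otu_table.csv" "$prep_dir/group_info.csv"
# SampleID,OTU_1,OTU_2,OTU_3
# S3,0,1,6
# S1,10,2,0
```

Samples that appear in the matrix but not in the group file are dropped, which mirrors the alignment described in this step.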

Step 3: Run Dimensionality Reduction

Run t-SNE if --method tsne or --method both
Run UMAP if --method umap or --method both
Apply fixed random seed for reproducibility

Step 4: Generate Visualizations

Plot sample embeddings
Color points by group
Draw group ellipses when enabled
Save PDF outputs

Methods

t-SNE

t-SNE is a non-linear dimensionality reduction method that preserves local neighborhood structure. It is useful for identifying local sample clustering patterns.
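
Perplexity is bounded by the sample count: Rtsne stops with an error when 3 * perplexity exceeds n - 1, where n is the number of samples. The sketch below computes the largest integer perplexity that passes that check. Whether scripts/main.R adjusts the value automatically is not stated in this document, so treat the helper names `max_perplexity` and `count_samples` as hypothetical conveniences.

```shell
# Largest integer perplexity that satisfies the Rtsne check 3 * perplexity <= n - 1.
max_perplexity() {   # $1 = number of samples
  echo $(( ($1 - 1) / 3 ))
}

# Hypothetical helper: count samples as non-empty data rows of the group file
# (assumes exactly one header line).
count_samples() {    # $1 = group CSV
  awk 'NR > 1 && NF > 0 { n++ } END { print n + 0 }' "$1"
}

max_perplexity 76   # prints 25
max_perplexity 31   # prints 10
```

By this bound, the default --perplexity 25 fits datasets of 76 or more samples, so smaller datasets may need a lower value such as the --perplexity 10 used in the examples.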

UMAP

UMAP is a manifold learning method that aims to preserve both local and some global structure. It is often faster than t-SNE and can produce stable low-dimensional embeddings when parameters are chosen appropriately.

Normalization

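When --normalize TRUE, the script applies vegan::decostand() with the selected --norm_method before UMAP. The default hellinger method takes the square root of relative abundances, which dampens the influence of highly abundant features in abundance-style ecological matrices.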
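
As a worked illustration of the default method, the Hellinger transform replaces each count with the square root of its share of the sample total. The awk sketch below applies it to the documented input layout (features as rows, samples as columns), so totals are taken per column. It assumes the transform is applied per sample and is only meant to show the arithmetic. The skill itself calls vegan::decostand() in R, and the function name `hellinger_csv` is hypothetical.

```shell
# Illustrative only: Hellinger transform = sqrt(count / sample total).
# Reads the file twice: pass 1 sums each sample column, pass 2 rescales.
hellinger_csv() {   # $1 = matrix CSV with features as rows, samples as columns
  awk -F',' -v OFS=',' '
    NR == FNR {                          # pass 1: per-sample (column) totals
      if (FNR > 1) for (i = 2; i <= NF; i++) total[i] += $i
      next
    }
    FNR == 1 { print; next }             # pass 2: header is passed through
    {
      for (i = 2; i <= NF; i++)
        $i = (total[i] > 0) ? sprintf("%.4f", sqrt($i / total[i])) : "NA"
      print
    }
  ' "$1" "$1"
}

norm_dir="$(mktemp -d)"
printf 'OTU_ID,S1,S2,S3,S4\nOTU_1,10,3,0,5\nOTU_2,2,8,1,0\nOTU_3,0,0,6,9\n' > "$norm_dir/otu_table.csv"
hellinger_csv "$norm_dir/otu_table.csv"
# OTU_ID,S1,S2,S3,S4
# OTU_1,0.9129,0.5222,0.0000,0.5976
# OTU_2,0.4082,0.8528,0.3780,0.0000
# OTU_3,0.0000,0.0000,0.9258,0.8018
```

Samples with a zero total have no defined relative abundance, so the sketch marks them NA.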

Agent Response Contract

After a successful run, report:

Method(s) run (tsne, umap, or both)
Sample count and group count processed
Key parameters used (perplexity for t-SNE, n_neighbors for UMAP)
Group separation quality (describe visible clustering from coordinate ranges if accessible)
Artifact paths: coordinate CSV(s) and plot PDF(s) produced

Examples

Basic Usage

Rscript scripts/main.R \
  -i otu_table.csv \
  -g group_info.csv \
  -o ./output \
  -m both

Only t-SNE

Rscript scripts/main.R \
  -i otu_table.csv \
  -g group_info.csv \
  -o ./output \
  -m tsne \
  --perplexity 10

Only UMAP with Custom Group Column

Rscript scripts/main.R \
  -i otu_table.csv \
  -g metadata.csv \
  -o ./output \
  -m umap \
  --sample_id_col SampleID \
  --group_col Treatment \
  --n_neighbors 15

Error Handling

Common Errors

Error	Cause	Solution
`SKILL_FILE_NOT_FOUND`	Input file does not exist	Check file path
`SKILL_MISSING_COLUMNS`	Group or matrix file lacks required columns	Verify file format
`SKILL_SAMPLE_MISMATCH`	Sample IDs in group file do not match matrix columns	Check sample naming consistency
`SKILL_EMPTY_DATA`	Matrix becomes empty after preprocessing	Check input values and filtering
`SKILL_INVALID_PARAMETER`	Invalid method, invalid parameter value, or single-group input	Adjust CLI arguments; ensure at least 2 groups are present
`SKILL_PACKAGE_NOT_FOUND`	Required R package is missing	Run `Rscript scripts/install_dependencies.R`; note that file errors will only surface after packages are installed
`SKILL_TIMEOUT`	Analysis exceeded the configured timeout	Increase `--timeout` or set `--timeout 0`

IF error persists, READ: references/troubleshooting.md

Troubleshooting note: In environments where packages are not yet installed, SKILL_PACKAGE_NOT_FOUND will fire before file-validation errors. Install dependencies first, then re-run to expose any file-related errors.
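
When runs are scripted, the error code can be pulled out of a captured log to pick the next action. This is a rough sketch that assumes the SKILL_* code appears verbatim in the captured stdout/stderr. The source does not document the exact message format, and the helper names `first_skill_error` and `suggest_fix` are hypothetical.

```shell
# Assumption (not confirmed by the source): the SKILL_* code appears verbatim
# in the captured output of a failed run.
first_skill_error() {   # $1 = log file
  grep -o 'SKILL_[A-Z_]*' "$1" | head -n 1
}

suggest_fix() {         # $1 = log file
  case "$(first_skill_error "$1")" in
    SKILL_PACKAGE_NOT_FOUND) echo "install dependencies, then re-run" ;;
    SKILL_TIMEOUT)           echo "increase --timeout or set --timeout 0" ;;
    "")                      echo "no SKILL_* code found" ;;
    *)                       echo "see references/troubleshooting.md" ;;
  esac
}

# Typical use: capture a run, then inspect the log.
#   Rscript scripts/main.R -i otu_table.csv -g group_info.csv -o ./output -m both > run.log 2>&1
#   suggest_fix run.log
```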

Input Validation

This skill accepts:

An abundance or OTU-style feature matrix (CSV/TSV, features as rows, samples as columns)
A group file with at least two groups (CSV/TSV, sample IDs and group labels)

If the user's request does not involve UMAP or t-SNE dimensionality reduction for group separation visualization — for example, asking to run differential expression testing, process single-cell RNA-seq with specialized pipelines, perform clustering without a group file, or impute missing values — do not proceed with the workflow. Instead respond:

"UMAP and t-SNE Analysis is designed to perform sample-level dimensionality reduction and visualization on abundance or OTU-style matrices. Your request appears to be outside this scope. Please provide a feature matrix and group file for UMAP/t-SNE, or use a more appropriate tool for differential expression testing, single-cell analysis, or clustering."

Testing

Test with Sample Data

Rscript scripts/install_dependencies.R

Rscript scripts/main.R --help

Rscript scripts/main.R \
  -i tests/data/otu_table.csv \
  -g tests/data/group_info.csv \
  -o tests/output/ \
  -m both

Rscript tests/test_skill.R

Rscript tests/run_smoke_test.R

Validation Commands

ls -la tests/output/
ls -la tests/output/table
ls -la tests/output/plot
ls -la tests/output/data
wc -l tests/output/table/tsne_coordinates.csv
wc -l tests/output/table/umap_coordinates.csv

The canonical sample data live in tests/data/. Use those files for examples, smoke tests, and regression checks. The canonical output layout is output_dir/table, output_dir/plot, and output_dir/data.
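
The listing commands above can be turned into a pass/fail check. The sketch below tests for the six files named in the Output Files table and assumes a run with --method both; the function name `check_outputs` is hypothetical.

```shell
# Hypothetical helper: verify the documented output layout under an output directory.
# Assumes --method both, so both t-SNE and UMAP artifacts are expected.
check_outputs() {   # $1 = output directory
  missing=0
  for f in table/tsne_coordinates.csv table/umap_coordinates.csv \
           plot/tsne_plot.pdf plot/umap_plot.pdf \
           data/session_info.txt data/analysis_data.rda; do
    if [ ! -s "$1/$f" ]; then          # -s: file exists and is not empty
      echo "missing or empty: $f"
      missing=1
    fi
  done
  return "$missing"
}

# Typical use after the sample-data run:
#   check_outputs tests/output && echo "all outputs present"
```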

Implementation Checklist

Last updated: 2026-04-27 | Version: 1.1.0
