GO/KEGG Enrichment Analysis

AIPOCH-AI

Automatically perform gene enrichment analysis and explain the results

FILES
go-kegg-enrichment/
skill.md
scripts
main.py
test_genes.txt
references
example_gene_list.txt
GO_KEGG_Reference.md
install_r_packages.R
requirements.txt

SKILL.md

Automated pipeline for Gene Ontology and KEGG pathway enrichment analysis with result interpretation and visualization.

Features

  • GO Enrichment: Biological Process (BP), Molecular Function (MF), Cellular Component (CC)
  • KEGG Pathway: Pathway enrichment with organism-specific mapping
  • Multiple ID Support: Gene symbols, Entrez IDs, Ensembl IDs, RefSeq
  • Statistical Methods: Hypergeometric test, Fisher's exact test, GSEA support
  • Visualizations: Bar plots, dot plots, enrichment maps, cnet plots
  • Result Interpretation: Automatic biological significance summary
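The hypergeometric test listed under Statistical Methods can be illustrated with a minimal Python sketch. This is not the skill's implementation (the Dependencies section indicates the analysis relies on clusterProfiler). The function name and the example counts are hypothetical.

```python
from math import comb


def hypergeom_enrichment_pvalue(k: int, n: int, K: int, N: int) -> float:
    """Upper-tail probability P(X >= k) for an over-representation test.

    N: number of genes in the background
    K: background genes annotated to the term
    n: number of genes in the input list
    k: input genes annotated to the term
    """
    if not (0 <= K <= N and 0 <= n <= N and 0 <= k <= min(n, K)):
        raise ValueError("inconsistent counts")
    total = comb(N, n)  # all ways to draw n genes from the background
    upper = min(n, K)
    # Sum the probabilities of drawing k, k+1, ..., upper annotated genes.
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total


# Hypothetical counts: 12 of 200 input genes hit a term that covers
# 150 of 18000 background genes.
p_value = hypergeom_enrichment_pvalue(k=12, n=200, K=150, N=18000)
```

A one-sided Fisher's exact test on the corresponding 2x2 table gives the same p-value, which is why the two methods are often listed together.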

Supported Organisms

| Common Name | Scientific Name | KEGG Code | OrgDB Package |
|---|---|---|---|
| Human | Homo sapiens | hsa | org.Hs.eg.db |
| Mouse | Mus musculus | mmu | org.Mm.eg.db |
| Rat | Rattus norvegicus | rno | org.Rn.eg.db |
| Zebrafish | Danio rerio | dre | org.Dr.eg.db |
| Fly | Drosophila melanogaster | dme | org.Dm.eg.db |
| Yeast | Saccharomyces cerevisiae | sce | org.Sc.sgd.db |
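
As an illustration of how the --organism value could be mapped to a KEGG code and an OrgDb package, here is a small Python lookup built from the table above. The names `Organism`, `ORGANISMS`, and `resolve_organism` are hypothetical and may not match what main.py does internally.

```python
from typing import NamedTuple


class Organism(NamedTuple):
    scientific_name: str
    kegg_code: str
    orgdb: str


# Values copied from the Supported Organisms table.
ORGANISMS = {
    "human": Organism("Homo sapiens", "hsa", "org.Hs.eg.db"),
    "mouse": Organism("Mus musculus", "mmu", "org.Mm.eg.db"),
    "rat": Organism("Rattus norvegicus", "rno", "org.Rn.eg.db"),
    "zebrafish": Organism("Danio rerio", "dre", "org.Dr.eg.db"),
    "fly": Organism("Drosophila melanogaster", "dme", "org.Dm.eg.db"),
    "yeast": Organism("Saccharomyces cerevisiae", "sce", "org.Sc.sgd.db"),
}


def resolve_organism(name: str) -> Organism:
    """Look up an organism by common name, ignoring case and whitespace."""
    key = name.strip().lower()
    try:
        return ORGANISMS[key]
    except KeyError:
        supported = ", ".join(sorted(ORGANISMS))
        raise ValueError(
            f"unsupported organism {name!r}; choose one of: {supported}"
        ) from None
```

Note that the yeast package follows a different naming pattern (org.Sc.sgd.db) from the other five, so deriving the package name from a template would fail for it.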

Usage

Basic Usage

# Run enrichment analysis with gene list
python scripts/main.py --genes gene_list.txt --organism human --output results/

Parameters

| Parameter | Description | Default | Required |
|---|---|---|---|
| --genes | Path to gene list file (one gene per line) | - | Yes |
| --organism | Organism code (human/mouse/rat/zebrafish/fly/yeast) | human | No |
| --id-type | Gene ID type (symbol/entrez/ensembl/refseq) | symbol | No |
| --background | Background gene list file | all genes | No |
| --pvalue-cutoff | P-value cutoff for significance | 0.05 | No |
| --qvalue-cutoff | Adjusted p-value (q-value) cutoff | 0.2 | No |
| --analysis | Analysis type (go/kegg/all) | all | No |
| --output | Output directory | ./enrichment_results | No |
| --format | Output format (csv/tsv/excel/all) | all | No |
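
The two cutoffs act on raw and multiple-testing-corrected values respectively. As a rough illustration of what a correction does, the sketch below implements the Benjamini-Hochberg adjustment in plain Python. Which correction main.py actually applies is not stated here, so treat this as an assumption; the function name is hypothetical.

```python
def benjamini_hochberg(pvalues: list[float]) -> list[float]:
    """Return BH-adjusted p-values in the same order as the input."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values
    # never increase as the raw p-value decreases.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw p-values for four terms.
raw = [0.01, 0.04, 0.03, 0.005]
adj = benjamini_hochberg(raw)
# Keep terms that pass both thresholds (defaults from the table above).
kept = [i for i in range(len(raw)) if raw[i] <= 0.05 and adj[i] <= 0.2]
```

With the default thresholds, a term is reported only if it passes both the raw and the adjusted cutoff, so tightening either one shrinks the result set.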

Advanced Usage

# GO enrichment only, restricted to the BP and MF ontologies
python scripts/main.py \
    --genes deg_upregulated.txt \
    --organism mouse \
    --analysis go \
    --go-ontologies BP,MF \
    --pvalue-cutoff 0.01 \
    --output go_results/

# KEGG enrichment with custom background
python scripts/main.py \
    --genes treatment_genes.txt \
    --background all_expressed_genes.txt \
    --organism human \
    --analysis kegg \
    --qvalue-cutoff 0.05 \
    --output kegg_results/

Input Format

Gene List File

TP53
BRCA1
EGFR
MYC
KRAS
PTEN

With Expression Values (for GSEA)

gene,log2FoldChange
TP53,2.5
BRCA1,-1.8
EGFR,3.2
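
The two input formats above could be read along the following lines. This is a sketch, not the parser in main.py: the function names are hypothetical, and dropping duplicate genes is an assumption made here for illustration.

```python
import csv
from pathlib import Path


def read_gene_list(path):
    """Read one gene per line; skip blank lines, drop duplicates, keep order."""
    seen = {}
    for line in Path(path).read_text().splitlines():
        gene = line.strip()
        if gene:
            seen.setdefault(gene, None)
    return list(seen)


def read_ranked_genes(path):
    """Read a CSV with 'gene' and 'log2FoldChange' columns.

    Returns (gene, value) pairs sorted by decreasing fold change,
    which is the ordering a ranked-list method such as GSEA expects.
    """
    with open(path, newline="") as handle:
        rows = [
            (row["gene"], float(row["log2FoldChange"]))
            for row in csv.DictReader(handle)
        ]
    return sorted(rows, key=lambda item: item[1], reverse=True)
```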

Output Files

output/
├── go_enrichment/
│   ├── GO_BP_results.csv       # Biological Process results
│   ├── GO_MF_results.csv       # Molecular Function results
│   ├── GO_CC_results.csv       # Cellular Component results
│   ├── GO_BP_barplot.pdf       # Visualization
│   ├── GO_MF_dotplot.pdf
│   └── GO_summary.txt          # Interpretation summary
├── kegg_enrichment/
│   ├── KEGG_results.csv        # Pathway results
│   ├── KEGG_barplot.pdf
│   ├── KEGG_dotplot.pdf
│   └── KEGG_pathview/          # Pathway diagrams
└── combined_report.html        # Interactive report

Result Interpretation

The tool automatically generates biological interpretation including:

  1. Top Enriched Terms: Significant GO terms/pathways ranked by enrichment ratio
  2. Functional Themes: Clustered biological themes from enriched terms
  3. Key Genes: Core genes driving enrichment in significant terms
  4. Network Relationships: Gene-term relationship visualization
  5. Clinical Relevance: Disease associations (for human genes)
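
To make "ranked by enrichment ratio" concrete, the sketch below computes fold enrichment, one common definition of the ratio, and sorts terms by it. It assumes the result rows carry clusterProfiler-style `GeneRatio` and `BgRatio` strings such as "12/200". The actual column names in the CSV files are not documented here, and the function names and example rows are hypothetical.

```python
from fractions import Fraction


def fold_enrichment(gene_ratio: str, bg_ratio: str) -> float:
    """(k/n) / (K/N), with both ratios given as 'a/b' strings."""
    return float(Fraction(gene_ratio) / Fraction(bg_ratio))


def rank_terms(rows):
    """Sort result rows (dicts) by decreasing fold enrichment."""
    return sorted(
        rows,
        key=lambda row: fold_enrichment(row["GeneRatio"], row["BgRatio"]),
        reverse=True,
    )


# Hypothetical rows for illustration only.
rows = [
    {"Description": "term A", "GeneRatio": "12/200", "BgRatio": "150/18000"},
    {"Description": "term B", "GeneRatio": "30/200", "BgRatio": "900/18000"},
]
ranked = rank_terms(rows)
```

A high ratio on a very small term can still be statistically weak, so the ratio is best read alongside the p-value and adjusted p-value columns.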

Technical Difficulty: HIGH

⚠️ AI Self-Acceptance Status: Manual review required

This skill requires:

  • R/Bioconductor environment with clusterProfiler
  • Multiple annotation databases (org.*.eg.db, or org.Sc.sgd.db for yeast)
  • KEGG REST API access
  • Complex visualization dependencies

Dependencies

Required R Packages

install.packages(c("BiocManager", "ggplot2", "dplyr", "readr"))
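BiocManager::install(c(
    "clusterProfiler",
    # Annotation databases: one per supported organism
    "org.Hs.eg.db", "org.Mm.eg.db", "org.Rn.eg.db",
    "org.Dr.eg.db", "org.Dm.eg.db", "org.Sc.sgd.db",
    "enrichplot", "pathview", "DOSE"
))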

Python Dependencies

pip install pandas numpy matplotlib seaborn rpy2

Example Workflow

  1. Prepare Input: Create gene list from DEG analysis
  2. Run Analysis: Execute main.py with appropriate parameters
  3. Review Results: Check generated CSV files and visualizations
  4. Interpret: Read auto-generated summary for biological insights

References

See references/ for:

  • clusterProfiler documentation
  • KEGG API guide
  • Statistical methods explanation
  • Visualization examples

Limitations

  • Requires internet connection for KEGG database queries
  • Large gene lists (>5000) may require increased memory
  • Some pathways may not be available for all organisms
  • KEGG API has rate limits (max 3 requests/second)
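
One way to stay under the request limit mentioned above is to space calls out on the client side. The sketch below is a minimal throttle, not code from main.py. The class name is hypothetical, and the `clock` and `sleep` parameters exist only so the behaviour can be tested without real waiting.

```python
import time


class RateLimiter:
    """Keep consecutive calls at least 1/max_per_second seconds apart."""

    def __init__(self, max_per_second: float = 3.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_second
        self._clock = clock
        self._sleep = sleep
        self._last_call = None

    def wait(self) -> None:
        """Block until enough time has passed since the previous call."""
        now = self._clock()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()  # re-read in case the sleep ran long
        self._last_call = now
```

Calling `limiter.wait()` immediately before each KEGG request keeps a loop over many pathways within the limit, at the cost of a longer total run time.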

Risk Assessment

| Risk Indicator | Assessment | Level |
|---|---|---|
| Code Execution | Python/R scripts executed locally | Medium |
| Network Access | Outbound queries to the KEGG REST API (see Limitations) | Low |
| File System Access | Read input files, write output files | Medium |
| Instruction Tampering | Standard prompt guidelines | Low |
| Data Exposure | Output files saved to workspace | Low |

Security Checklist

  • No hardcoded credentials or API keys
  • No unauthorized file system access (../)
  • Output does not expose sensitive information
  • Prompt injection protections in place
  • Input file paths validated (no ../ traversal)
  • Output directory restricted to workspace
  • Script execution in sandboxed environment
  • Error messages sanitized (no stack traces exposed)
  • Dependencies audited
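
The path-related items in the checklist (no ../ traversal, output restricted to the workspace) could be enforced with a check like the one below. It is a sketch under the assumption that all inputs and outputs live under a single workspace directory. The function name is hypothetical and this is not taken from main.py.

```python
from pathlib import Path


def resolve_inside(workspace, candidate) -> Path:
    """Return candidate as an absolute path, rejecting paths outside workspace.

    Relative candidates are interpreted relative to the workspace.
    Symlinks and '..' segments are resolved before the check.
    """
    root = Path(workspace).resolve()
    target = (root / candidate).resolve()
    if target != root and root not in target.parents:
        raise ValueError(f"path escapes workspace: {candidate!r}")
    return target
```

Resolving before comparing matters: a plain string check for "../" would miss absolute paths and symlinks that point outside the workspace.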

Prerequisites

# Python dependencies
pip install -r requirements.txt

Evaluation Criteria

Success Metrics

  • Successfully executes main functionality
  • Output meets quality standards
  • Handles edge cases gracefully
  • Performance is acceptable

Test Cases

  1. Basic Functionality: Standard input → Expected output
  2. Edge Case: Invalid input → Graceful error handling
  3. Performance: Large dataset → Acceptable processing time

Lifecycle Status

  • Current Stage: Draft
  • Next Review Date: 2026-03-06
  • Known Issues: None
  • Planned Improvements:
    • Performance optimization
    • Additional feature support