Agent Skills

Arboreto

AIPOCH

Infer gene regulatory networks (GRNs) from gene expression matrices using GRNBoost2 or GENIE3; use when analyzing bulk or single-cell RNA-seq to identify TF→target regulatory relationships.

FILES
arboreto/
  skill.md
  scripts/
    infer_network.py
  references/
    algorithms.md
    distributed_computing.md
Total Score: 85 / 100
Core Capability: 85 / 100

  • Functional Suitability: 11 / 12
  • Reliability: 9 / 12
  • Performance & Context: 8 / 8
  • Agent Usability: 14 / 16
  • Human Usability: 8 / 8
  • Security: 9 / 12
  • Maintainability: 9 / 12
  • Agent-Specific: 17 / 20

Medical Task: 15 / 20 Passed

  • You have a bulk RNA-seq expression matrix and want to infer transcription factor (TF) → target gene regulatory edges (score 85, 3/4).
  • You have single-cell RNA-seq data (after normalization/aggregation as needed) and want to recover putative regulatory interactions (score 85, 3/4).
  • GRN inference from gene expression data using GRNBoost2 (gradient boosting) or GENIE3 (random forest) (score 85, 3/4).
  • Scalable execution via Dask, from a single machine to multi-node clusters (score 85, 3/4).
  • End-to-end case for GRN inference from gene expression data using GRNBoost2 (gradient boosting) or GENIE3 (random forest) (score 85, 3/4).

SKILL.md

When to Use

  • You have a bulk RNA-seq expression matrix and want to infer transcription factor (TF) → target gene regulatory edges.
  • You have single-cell RNA-seq data (after normalization/aggregation as needed) and want to recover putative regulatory interactions.
  • You need GRN inference that can scale to large datasets using parallel/distributed execution.
  • You want to compare gradient-boosting–based GRN inference (GRNBoost2) versus random-forest–based inference (GENIE3).
  • You need a reproducible, scriptable pipeline to generate a ranked network edge list from expression data.

Key Features

  • GRN inference from gene expression data using GRNBoost2 (gradient boosting) or GENIE3 (random forest).
  • Scalable execution via Dask, from a single machine to multi-node clusters.
  • Command-line workflow for generating a GRN edge list from a tabular expression matrix.
  • Algorithm guidance and comparison: see references/algorithms.md.
  • Distributed setup notes: see references/distributed_computing.md.

Dependencies

  • arboreto
  • dask
  • distributed
  • pandas
  • scipy
  • scikit-learn
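
Before running the script, it can help to confirm that the packages listed above are importable. The following is a minimal sketch using only the standard library; the helper name `missing_dependencies` is hypothetical, and the only mapping that differs from the package name is scikit-learn, which is imported as `sklearn`.

```python
from importlib.util import find_spec

# Package name (as installed) -> top-level module name (as imported).
REQUIRED_MODULES = {
    "arboreto": "arboreto",
    "dask": "dask",
    "distributed": "distributed",
    "pandas": "pandas",
    "scipy": "scipy",
    "scikit-learn": "sklearn",
}


def missing_dependencies(modules=REQUIRED_MODULES):
    """Return the sorted package names whose module cannot be found."""
    return sorted(pkg for pkg, mod in modules.items() if find_spec(mod) is None)


missing = missing_dependencies()
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All dependencies are importable.")
```

The check only reports whether each module can be located; it does not verify versions.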

Example Usage

Run GRN inference from an expression matrix (TSV) and write the inferred network to an output file:

python scripts/infer_network.py \
  --input expression_data.tsv \
  --output network.tsv \
  --algo grnboost2

To use the alternative algorithm:

python scripts/infer_network.py \
  --input expression_data.tsv \
  --output network.tsv \
  --algo genie3
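
The contents of scripts/infer_network.py are not shown here, so the following is only a sketch of what an equivalent Python workflow might look like using arboreto's public `grnboost2` and `genie3` functions with a local Dask cluster. The function names `build_parser` and `infer_network`, the default algorithm, and the assumed TSV layout (header row of gene names, one row per sample or cell) are all assumptions, not the script's actual behavior.

```python
import argparse

ALGORITHMS = ("grnboost2", "genie3")


def build_parser():
    """Parser mirroring the three flags used in the commands above."""
    parser = argparse.ArgumentParser(description="Infer a GRN with arboreto.")
    parser.add_argument("--input", required=True, help="expression matrix (TSV)")
    parser.add_argument("--output", required=True, help="edge list to write (TSV)")
    # Assumption: grnboost2 as default; the real script may require --algo.
    parser.add_argument("--algo", choices=ALGORITHMS, default="grnboost2")
    return parser


def infer_network(input_path, output_path, algo="grnboost2"):
    """Load a TSV, run the chosen algorithm locally, write the edge list."""
    # Heavy imports are deferred so the parser works without them installed.
    import pandas as pd
    from arboreto.algo import genie3, grnboost2
    from distributed import Client, LocalCluster

    # Assumption: rows are samples/cells, columns are genes,
    # and the header row holds the gene names.
    expression = pd.read_csv(input_path, sep="\t")
    infer = {"grnboost2": grnboost2, "genie3": genie3}[algo]

    # Start a Dask cluster on this machine and hand its client to arboreto.
    with LocalCluster() as cluster, Client(cluster) as client:
        network = infer(expression_data=expression, client_or_address=client)

    network.to_csv(output_path, sep="\t", index=False)
    return network


# Example call (requires arboreto, dask, distributed, and pandas):
# args = build_parser().parse_args()
# infer_network(args.input, args.output, args.algo)
```

Both arboreto functions also accept a `tf_names` argument to restrict candidate regulators to a known TF list; when it is omitted, every gene is treated as a candidate regulator.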

Implementation Details

  • Input/Output

    • Input: a gene expression matrix (e.g., TSV) where rows typically represent observations (samples or cells) and columns represent genes; confirm the exact layout against scripts/infer_network.py.
    • Output: a ranked edge list representing inferred regulatory relationships (TF → target) with an importance/weight score.
  • Algorithms

    • GRNBoost2: uses gradient boosting to estimate feature importance of candidate regulators for each target gene; generally preferred for larger datasets due to speed and scalability.
    • GENIE3: uses random forests to compute regulator importance per target gene; a classic baseline for GRN inference.
    • For a detailed comparison and practical guidance, refer to references/algorithms.md.
  • Parallel/Distributed Execution

    • Computation is parallelized with Dask, enabling scaling from local multi-core execution to distributed clusters.
    • Cluster configuration and deployment considerations are documented in references/distributed_computing.md.
  • Key Parameters

    • --algo: selects the inference method (grnboost2 or genie3), affecting runtime and model behavior.
    • Additional runtime/cluster parameters, if the script exposes any, typically control Dask scheduling, worker counts, and resource usage; check scripts/infer_network.py for the exact flags.
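
The per-target regression idea described under Algorithms can be illustrated with a small sketch. This is not arboreto's implementation: it uses scikit-learn's `RandomForestRegressor` directly, runs serially without Dask, and the function name `per_target_importances` and the toy gene names are hypothetical. It fits one model per target gene on the candidate regulators and collects the feature importances into a ranked TF, target, importance edge list.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor


def per_target_importances(expression, tf_names, n_estimators=50, seed=0):
    """Fit one random forest per target gene and rank regulator importances.

    expression: DataFrame with samples/cells as rows and genes as columns.
    tf_names:   candidate regulator genes (must be columns of expression).
    """
    edges = []
    for target in expression.columns:
        # A gene is never used as a regulator of itself.
        regulators = [tf for tf in tf_names if tf != target]
        if not regulators:
            continue
        model = RandomForestRegressor(n_estimators=n_estimators, random_state=seed)
        model.fit(expression[regulators].to_numpy(), expression[target].to_numpy())
        for tf, score in zip(regulators, model.feature_importances_):
            edges.append((tf, target, float(score)))
    network = pd.DataFrame(edges, columns=["TF", "target", "importance"])
    return network.sort_values("importance", ascending=False).reset_index(drop=True)


# Toy data: gene_x is driven by tf_a; tf_b and tf_c are unrelated noise.
rng = np.random.default_rng(0)
tf_a = rng.normal(size=200)
toy = pd.DataFrame({
    "tf_a": tf_a,
    "tf_b": rng.normal(size=200),
    "tf_c": rng.normal(size=200),
    "gene_x": 3.0 * tf_a + rng.normal(scale=0.1, size=200),
})
toy_network = per_target_importances(toy, tf_names=["tf_a", "tf_b", "tf_c"])
print(toy_network[toy_network["target"] == "gene_x"])
```

Note that scikit-learn normalizes importances to sum to 1 within each fitted model, so in this sketch scores are most meaningful when compared among regulators of the same target. Swapping in a gradient-boosting regressor would give the analogous GRNBoost2-style variant of the same loop.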