Geniml

Machine learning toolkit for genomic interval (BED) data. Use it to tokenize BED collections and train embeddings for regions, cells, or labels; build consensus peak universes; or run similarity search and downstream ML on chromatin accessibility datasets.

FILES

geniml/
  skill.md
  references/
    bedspace.md
    consensus_peaks.md
    region2vec.md
    scembed.md
    utilities.md

SKILL.md

When to Use

  • You have many BED files and need numeric features for clustering, similarity search, or downstream supervised learning (e.g., ChIP-seq/ATAC-seq region sets).
  • You want unsupervised embeddings of genomic regions to compare region sets across experiments (Region2Vec).
  • You need joint embeddings of regions and metadata labels (e.g., tissue/cell type/condition) to enable cross-modal queries like Region → Label or Label → Region (BEDspace).
  • You are analyzing single-cell ATAC-seq and want cell embeddings for clustering/annotation and integration with Scanpy workflows (scEmbed).
  • You need a consensus peak set (“universe”) built from multiple BED files to standardize tokenization and region definitions across datasets (Universe construction).

Key Features

  • Region2Vec: Word2vec-style unsupervised embeddings for genomic regions from tokenized BED data.
  • BEDspace: StarSpace-based joint embedding space for region sets and metadata labels; supports similarity search and cross-modal retrieval.
  • scEmbed: Single-cell ATAC-seq embedding workflow (tokenize cells → train → encode cells) compatible with Scanpy.
  • Universe (Consensus Peaks) Builder: Generates reference peak sets using multiple statistical approaches (CC, CCF, ML, HMM).
  • Utilities:
    • Tokenization: Universe-based tokenization (hard/soft tokenization patterns).
    • Evaluation: Embedding quality metrics (e.g., silhouette, Davies–Bouldin).
    • BEDshift: Region randomization/null-model generation while preserving genomic context.
    • BBClient / caching: Faster repeated access to BED resources.
    • Text2BedNN: Neural search backend for genomic queries.
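
The evaluation metrics named above (silhouette, Davies–Bouldin) are standard clustering scores; a minimal sketch of applying them to a toy embedding matrix, assuming scikit-learn is available (geniml's own evaluation utilities wrap their own logic):

```python
import numpy as np
from sklearn.metrics import silhouette_score, davies_bouldin_score

# Toy embedding matrix (4 samples x 2 dims) with two well-separated labels.
X = np.array([[0.0, 0.1], [0.1, 0.0], [5.0, 5.1], [5.1, 5.0]])
labels = [0, 0, 1, 1]

print(silhouette_score(X, labels))     # near 1.0 for well-separated clusters
print(davies_bouldin_score(X, labels)) # near 0 for well-separated clusters
```

Higher silhouette and lower Davies–Bouldin both indicate embeddings whose clusters agree with the labels.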

Additional details are commonly documented in: references/region2vec.md, references/bedspace.md, references/scembed.md, references/consensus_peaks.md, references/utilities.md.

Dependencies

  • Python: 3.9+ (recommended)
  • geniml: latest from PyPI (or GitHub main)
  • Optional ML extras: geniml[ml] (typically pulls PyTorch and related ML dependencies)
  • Scanpy stack (for scEmbed workflows): scanpy (plus anndata, numpy, scipy)
  • StarSpace (for BEDspace training): external binary from https://github.com/facebookresearch/StarSpace
  • Universe coverage generation: uniwig (used to generate coverage tracks in universe workflows)

Example Usage

1) Install

# Base install
uv pip install geniml

# With ML extras (e.g., PyTorch and related dependencies)
uv pip install "geniml[ml]"

# Development version
uv pip install git+https://github.com/databio/geniml.git

2) End-to-end: Build a universe → tokenize BEDs → train Region2Vec → evaluate

# (A) Build coverage tracks (example pattern)
cat bed_files/*.bed > combined.bed
uniwig -m 25 combined.bed chrom.sizes coverage/

# (B) Build a universe (coverage cutoff method)
geniml universe build cc \
  --coverage-folder coverage/ \
  --output-file universe.bed \
  --cutoff 5 \
  --merge 100 \
  --filter-size 50

# (C) Tokenize BED files, train Region2Vec, and evaluate embeddings
from geniml.tokenization import hard_tokenization
from geniml.region2vec import region2vec
from geniml.evaluation import evaluate_embeddings

# 1) Tokenize BED files against the universe
hard_tokenization(
    src_folder="bed_files/",
    dst_folder="tokens/",
    universe_file="universe.bed",
    p_value_threshold=1e-9,
)

# 2) Train Region2Vec
region2vec(
    token_folder="tokens/",
    save_dir="model/",
    num_shufflings=1000,
    embedding_dim=100,
)

# 3) Evaluate (requires labels/metadata aligned to embeddings)
metrics = evaluate_embeddings(
    embeddings_file="model/embeddings.npy",
    labels_file="metadata.csv",
)

print(metrics)

3) Single-cell ATAC-seq: tokenize cells → train scEmbed → cluster with Scanpy

import scanpy as sc
from geniml.scembed import ScEmbed
from geniml.io import tokenize_cells

# 1) Load AnnData
adata = sc.read_h5ad("scatac_data.h5ad")

# 2) Tokenize cells using a universe
tokenize_cells(
    adata="scatac_data.h5ad",
    universe_file="universe.bed",
    output="tokens.parquet",
)

# 3) Train scEmbed
model = ScEmbed(embedding_dim=100)
model.train(dataset="tokens.parquet", epochs=100)

# 4) Encode cells and attach embeddings to AnnData
embeddings = model.encode(adata)
adata.obsm["scembed_X"] = embeddings

# 5) Standard Scanpy neighborhood graph + clustering + UMAP
sc.pp.neighbors(adata, use_rep="scembed_X")
sc.tl.leiden(adata)
sc.tl.umap(adata)

Implementation Details

Tokenization (Universe-based)

  • Goal: Convert genomic intervals into discrete “tokens” defined by a reference universe (consensus peak set).
  • Hard tokenization: Assigns intervals to universe bins/peaks deterministically (commonly used for Region2Vec/scEmbed pipelines).
  • Key parameter: p_value_threshold controls stringency of mapping/overlap significance (lower is stricter; overly strict thresholds can reduce coverage).
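
The hard-tokenization idea can be sketched in a few lines (a hypothetical simplification, not geniml's implementation, which also applies the p_value_threshold significance test): map each interval to the universe peak it overlaps, or drop it if none.

```python
# Minimal sketch of hard tokenization: map each interval to the first
# universe region it overlaps; intervals with no overlap are dropped.

def hard_tokenize(intervals, universe):
    """intervals/universe: lists of (chrom, start, end) tuples."""
    tokens = []
    for chrom, start, end in intervals:
        for i, (uchrom, ustart, uend) in enumerate(universe):
            # half-open interval overlap on the same chromosome
            if chrom == uchrom and start < uend and ustart < end:
                tokens.append(i)  # token = index of the universe peak
                break
    return tokens

universe = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 50, 150)]
reads = [("chr1", 150, 180), ("chr1", 900, 950), ("chr2", 40, 60)]
print(hard_tokenize(reads, universe))  # -> [0, 2]; chr1:900-950 is dropped
```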

Region2Vec (Region Embeddings)

  • Core idea: Treat each BED file (or region set) like a “document” and each universe peak like a “word”; learn embeddings using a word2vec-style objective.
  • Important knobs:
    • embedding_dim: dimensionality of learned vectors (e.g., 50–300).
    • num_shufflings: increases training signal by shuffling/co-occurrence augmentation; higher values increase runtime.
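
The document/word analogy and the role of num_shufflings can be sketched without the library: each tokenized BED file is a "sentence" of peak tokens, and each shuffling pass reorders it to generate fresh (center, context) co-occurrence pairs for a word2vec-style trainer. A hypothetical sketch of that pair-generation step:

```python
import random

# Sketch of Region2Vec training-data generation: each tokenized BED file is
# a "sentence"; shuffling it num_shufflings times yields varied skip-gram
# (center, context) pairs. The real trainer feeds these to a word2vec model.

def skipgram_pairs(token_files, num_shufflings=3, window=2, seed=0):
    rng = random.Random(seed)
    pairs = []
    for _ in range(num_shufflings):
        for tokens in token_files:
            sent = tokens[:]
            rng.shuffle(sent)  # new co-occurrence contexts each pass
            for i, center in enumerate(sent):
                for j in range(max(0, i - window), min(len(sent), i + window + 1)):
                    if j != i:
                        pairs.append((center, sent[j]))
    return pairs

files = [[0, 1, 2, 3], [2, 3, 4]]
print(len(skipgram_pairs(files)))  # 3 shuffles over 2 files -> 48 pairs
```

Higher num_shufflings multiplies the number of training pairs linearly, which is why it directly increases runtime.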

BEDspace (Joint Region + Label Embeddings)

  • Core idea: Learn a shared vector space for region sets and metadata labels using StarSpace, enabling:
    • Region → Label retrieval (predict likely labels for a query region set)
    • Label → Region retrieval (find region sets associated with a label)
  • Operational requirement: StarSpace must be installed and its path provided/configured for training.
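
StarSpace consumes plain-text training lines; a plausible preparation step (the exact layout geniml emits may differ) puts each region set's peak tokens on one line followed by its metadata labels marked with StarSpace's `__label__` prefix:

```python
# Hypothetical StarSpace input preparation: one line per region set, peak
# tokens first, then metadata labels with the __label__ prefix StarSpace
# expects. The file layout geniml actually produces may differ.

def starspace_lines(samples):
    """samples: list of (token_list, label_list); returns training lines."""
    lines = []
    for tokens, labels in samples:
        parts = [f"peak_{t}" for t in tokens]
        parts += [f"__label__{label}" for label in labels]
        lines.append(" ".join(parts))
    return lines

samples = [([0, 5, 9], ["liver"]), ([2, 5], ["brain", "cortex"])]
for line in starspace_lines(samples):
    print(line)
# peak_0 peak_5 peak_9 __label__liver
# peak_2 peak_5 __label__brain __label__cortex
```

Training on lines like these is what places region tokens and label tokens in the same vector space, enabling the two retrieval directions above.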

scEmbed (Single-cell Embeddings)

  • Core idea: Apply Region2Vec-like training on tokenized single-cell accessibility profiles to produce cell embeddings.
  • Best practice: Pre-tokenize cells (e.g., to Parquet) to reduce repeated preprocessing and speed up training.
  • Downstream: Use embeddings as adata.obsm[...] and run standard Scanpy steps (neighbors, Leiden, UMAP).

Universe Construction (Consensus Peaks)

  • Purpose: Create a stable reference peak set for tokenization and cross-dataset comparability.
  • Methods:
    • CC (Coverage Cutoff): threshold-based peak calling from coverage.
    • CCF (Coverage Cutoff Flexible): cutoff with flexible boundaries/confidence intervals.
    • ML (Maximum Likelihood): probabilistic modeling of peak positions.
    • HMM (Hidden Markov Model): state-based segmentation; typically most computationally intensive.
  • Typical parameters:
    • --cutoff: minimum coverage to call peaks (CC/CCF).
    • --merge: merge distance for nearby peaks.
    • --filter-size: minimum peak length to keep.
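
The CC method's core logic maps directly onto those three parameters; a simplified single-chromosome sketch (the real builder operates on uniwig coverage tracks and adds statistical refinements):

```python
# Simplified sketch of the coverage-cutoff (CC) idea for one chromosome:
# call peaks where coverage >= cutoff (--cutoff), merge peaks closer than
# merge_dist (--merge), and drop peaks shorter than min_size (--filter-size).

def cc_universe(coverage, cutoff=5, merge_dist=2, min_size=3):
    # 1) find runs of positions with coverage >= cutoff
    peaks, start = [], None
    for pos, cov in enumerate(coverage + [0]):  # sentinel flushes last run
        if cov >= cutoff and start is None:
            start = pos
        elif cov < cutoff and start is not None:
            peaks.append([start, pos])
            start = None
    # 2) merge peaks separated by less than merge_dist
    merged = []
    for p in peaks:
        if merged and p[0] - merged[-1][1] < merge_dist:
            merged[-1][1] = p[1]
        else:
            merged.append(p)
    # 3) filter out peaks shorter than min_size
    return [(s, e) for s, e in merged if e - s >= min_size]

cov = [0, 6, 7, 6, 0, 6, 6, 6, 6, 0, 0, 5, 0]
print(cc_universe(cov))  # -> [(1, 9)]: two runs merged, short peak dropped
```

CCF, ML, and HMM replace step 1 with progressively more statistical machinery, which is why HMM is the most computationally intensive.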