Agent Skills

Gtars

AIPOCH

A high-performance Rust toolkit (with Python bindings and a CLI) for genomic interval analysis; use it when you need fast overlap queries, coverage track generation, genomic tokenization for ML, reference sequence verification, or fragment processing.

70
6
FILES
gtars/
skill.md
references
cli.md
coverage.md
overlap.md
python-api.md
refget.md
tokenizers.md
85100Total Score
View Evaluation Report
Core Capability
83 / 100
Functional Suitability
11 / 12
Reliability
9 / 12
Performance & Context
7 / 8
Agent Usability
14 / 16
Human Usability
8 / 8
Security
10 / 12
Maintainability
8 / 12
Agent-Specific
16 / 20
Medical Task
20 / 20 Passed
91Overlap and set operations on genomic intervals (e.g., peak/promoter overlap, variant annotation, shared-feature detection)
4/4
87Coverage track generation from interval-like inputs (e.g., ATAC-seq/ChIP-seq/RNA-seq coverage for visualization in genome browsers)
4/4
85Rust performance with low overhead; designed for large genomic datasets
4/4
85Python bindings for integration into analysis notebooks/pipelines
4/4
85End-to-end case for Rust performance with low overhead; designed for large genomic datasets
4/4

SKILL.md

When to Use

  • Overlap and set operations on genomic intervals (e.g., peak/promoter overlap, variant annotation, shared-feature detection).
  • Coverage track generation from interval-like inputs (e.g., ATAC-seq/ChIP-seq/RNA-seq coverage for visualization in genome browsers).
  • Machine-learning preprocessing where genomic regions must be converted into discrete tokens (e.g., Transformer-style models, geniml-style pipelines).
  • Reference sequence management and verification (e.g., subsequence retrieval, digest calculation aligned with GA4GH refget concepts).
  • Single-cell fragment workflows (e.g., splitting fragments by barcode/cluster, scoring fragments against reference region sets).

Key Features

  • Rust performance with low overhead; designed for large genomic datasets.
  • Python bindings for integration into analysis notebooks/pipelines.
  • CLI tooling for batch processing and shell workflows.
  • Fast overlap detection via IGD-style indexing and interval operations.
  • Coverage track generation (WIG/BigWig workflows via the uniwig functionality).
  • Genomic tokenizers for ML-ready representations of genomic regions.
  • Reference sequence utilities (FASTA-backed stores, subsequence retrieval, digesting).
  • Fragment processing and scoring for common single-cell genomics tasks.

Additional module-specific guidance may be available in: references/overlap.md, references/coverage.md, references/tokenizers.md, references/refget.md, references/python-api.md, and references/cli.md.

Dependencies

  • Python package: gtars (version not specified in the source document)
  • Rust toolchain (for CLI install): cargo (version not specified)
  • Rust crate: gtars = "0.1" (as shown in the example)

Example Usage

Python: overlap analysis workflow (runnable)

import gtars

# Load two region sets
peaks = gtars.RegionSet.from_bed("chip_peaks.bed")
promoters = gtars.RegionSet.from_bed("promoters.bed")

# Find overlaps (peaks that overlap promoters)
overlapping_peaks = peaks.filter_overlapping(promoters)

# Export results
overlapping_peaks.to_bed("peaks_in_promoters.bed")

CLI: generate coverage tracks (runnable)

# Generate WIG coverage at a given resolution
gtars uniwig generate --input atac_fragments.bed --output coverage.wig --resolution 10

# Generate BigWig coverage for genome browser visualization
gtars uniwig generate --input atac_fragments.bed --output coverage.bw --format bigwig

Python: ML tokenization (runnable)

import gtars
from gtars.tokenizers import TreeTokenizer

# Load regions and build a tokenizer from BED
regions = gtars.RegionSet.from_bed("training_peaks.bed")
tokenizer = TreeTokenizer.from_bed_file("training_peaks.bed")

# Tokenize each region into a discrete representation
tokens = [tokenizer.tokenize(r.chromosome, r.start, r.end) for r in regions]

print(tokens[:5])

Implementation Details

  • Interval overlap & indexing: Overlap queries are designed around an IGD-like index to accelerate repeated interval queries (build once, query many). Typical parameters are chromosome, start, end; results are overlapping intervals or derived set operations.
  • Coverage generation (uniwig): Produces coverage tracks from interval/fragments input. Common knobs include output format (e.g., WIG vs BigWig) and resolution/binning for track granularity.
  • Tokenization: Tokenizers (e.g., TreeTokenizer) map genomic coordinates to discrete tokens suitable for ML pipelines. Token vocabularies are commonly derived from a BED-defined training region universe.
  • Reference sequence store: FASTA-backed reference access supports subsequence retrieval and digesting/verification workflows aligned with refget-style usage.
  • Fragment workflows: Fragment splitting and scoring operate on fragment-like inputs (often TSV/BED-style) and can be used for barcode/cluster partitioning and enrichment-style scoring against reference region sets.