Gene Database

AIPOCH

Query the NCBI Gene database via E-utilities and the NCBI Datasets API; use it when you need to search genes by symbol/ID and retrieve annotations (RefSeq, GO, location, phenotype) for single or batch gene lists.

FILES
gene-database/
  skill.md
  scripts/
    batch_gene_lookup.py
    fetch_gene_data.py
    query_gene.py
  references/
    api_reference.md
    common_workflows.md
Total Score
92 / 100
Core Capability
87 / 100
Functional Suitability
11 / 12
Reliability
10 / 12
Performance & Context
8 / 8
Agent Usability
14 / 16
Human Usability
8 / 8
Security
9 / 12
Maintainability
10 / 12
Agent-Specific
17 / 20
Medical Task
20 / 20 Passed
100: You have a gene symbol (e.g., BRCA1) and need the correct NCBI Gene ID for a specific organism (4/4)
97: You have an NCBI Gene ID and need consolidated metadata (aliases, RefSeq accessions, genomic location, GO, literature links) (4/4)
95: Symbol/name search with organism scoping using E-utilities (ESearch) (4/4)
94: Gene record retrieval by ID using E-utilities (EFetch/ESummary) in JSON/XML/text-oriented outputs (4/4)
94: End-to-end case for Symbol/name search with organism scoping using E-utilities (ESearch) (4/4)

SKILL.md

When to Use

  • You have a gene symbol (e.g., BRCA1) and need the correct NCBI Gene ID for a specific organism.
  • You have an NCBI Gene ID and need consolidated metadata (aliases, RefSeq accessions, genomic location, GO, literature links).
  • You need to annotate a gene panel (dozens to thousands of genes) with consistent identifiers and core annotations.
  • You want to search genes by biological context (GO terms, phenotype/disease keywords, pathway terms) and then retrieve details for the hits.
  • You are building a pipeline that must respect NCBI rate limits and handle retries for transient API failures.

Key Features

  • Symbol/name search with organism scoping using E-utilities (ESearch).
  • Gene record retrieval by ID using E-utilities (EFetch/ESummary) in JSON/XML/text-oriented outputs.
  • Streamlined, gene-focused retrieval using the NCBI Datasets API (metadata + sequences/links in a single workflow).
  • Batch lookup utilities with basic rate-limit awareness and output aggregation.
  • Supports common annotation fields: nomenclature/aliases, RefSeq transcripts/proteins, genomic location, GO annotations, phenotype/disease keywords, and related literature references.

Dependencies

  • Python 3.9+
  • requests >= 2.28
  • NCBI E-utilities (Entrez) HTTP API (public service)
  • NCBI Datasets HTTP API (public service)
  • Optional: NCBI API key (recommended for higher throughput)

Example Usage

The following examples assume the repository provides these scripts:

  • scripts/query_gene.py
  • scripts/fetch_gene_data.py
  • scripts/batch_gene_lookup.py

1) Search by symbol/name (E-utilities / ESearch)

python scripts/query_gene.py --search "BRCA1" --organism "human"

Example advanced query strings:

python scripts/query_gene.py --search "insulin[gene name] AND human[organism]"
python scripts/query_gene.py --search "dystrophin[gene name] AND muscular dystrophy[disease]"
python scripts/query_gene.py --search "human[organism] AND 17q21[chromosome]"
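For orientation, the request behind such a search can be sketched as follows. This is a minimal sketch, not the code in scripts/query_gene.py: the function names build_esearch_url and extract_gene_ids are hypothetical, and the sketch only builds the ESearch URL and reads the ID list from an already-decoded JSON response, so it never touches the network.

```python
from urllib.parse import urlencode

# Public E-utilities ESearch endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(term, retmax=20, api_key=None):
    """Return an ESearch URL that searches the Gene database for `term`.

    Hypothetical helper: `term` is passed through unchanged, so fielded
    queries such as "insulin[gene name] AND human[organism]" work as-is.
    """
    params = {"db": "gene", "term": term, "retmode": "json", "retmax": retmax}
    if api_key:
        # The API key is optional; it only raises the allowed request rate.
        params["api_key"] = api_key
    return f"{ESEARCH_URL}?{urlencode(params)}"


def extract_gene_ids(payload):
    """Pull the list of Gene IDs out of a decoded ESearch JSON response."""
    return list(payload.get("esearchresult", {}).get("idlist", []))
```

Fetching the URL with any HTTP client and passing the decoded JSON to extract_gene_ids yields the Gene IDs that the retrieval step takes as input.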

2) Retrieve gene information by Gene ID

Using E-utilities (format-oriented retrieval):

python scripts/query_gene.py --id 672 --format json

Using NCBI Datasets API (consolidated gene payload):

python scripts/fetch_gene_data.py --gene-id 672

Or by symbol + taxon:

python scripts/fetch_gene_data.py --symbol BRCA1 --taxon human
python scripts/fetch_gene_data.py --symbol TP53 --taxon "Homo sapiens" --output json
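The two lookup modes above correspond to two Datasets request shapes. The sketch below assumes the v2 REST paths /gene/id/{ids} and /gene/symbol/{symbols}/taxon/{taxon}; the helper names are hypothetical and the actual script may build its requests differently, so check references/api_reference.md before relying on these paths.

```python
from urllib.parse import quote

# Assumed base URL of the NCBI Datasets v2 REST API.
DATASETS_BASE = "https://api.ncbi.nlm.nih.gov/datasets/v2"


def gene_by_id_url(gene_ids):
    """Build a Datasets URL for one or more numeric Gene IDs."""
    # int() rejects anything that is not a plain integer ID.
    joined = ",".join(str(int(gene_id)) for gene_id in gene_ids)
    return f"{DATASETS_BASE}/gene/id/{joined}"


def gene_by_symbol_url(symbols, taxon):
    """Build a Datasets URL for gene symbols scoped to one taxon."""
    # Percent-encode each path segment so names like "Homo sapiens" are safe.
    joined = ",".join(quote(symbol, safe="") for symbol in symbols)
    return f"{DATASETS_BASE}/gene/symbol/{joined}/taxon/{quote(str(taxon), safe='')}"
```

For example, gene_by_id_url([672]) corresponds to the --gene-id 672 call, and gene_by_symbol_url(["TP53"], "Homo sapiens") to the symbol plus taxon call.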

3) Batch lookup for gene list annotation

From a file of symbols (organism required for symbol disambiguation):

python scripts/batch_gene_lookup.py --file gene_list.txt --organism human

From a comma-separated list of Gene IDs:

python scripts/batch_gene_lookup.py --ids 672,7157,5594 --output results.json
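A batch run typically starts by normalizing the input list and splitting it into request-sized chunks. The sketch below shows one way to do that; parse_id_list, chunked, and the chunk size are illustrative assumptions, not the behavior of scripts/batch_gene_lookup.py.

```python
def parse_id_list(raw):
    """Split a comma-separated string of Gene IDs, dropping blanks and duplicates.

    Order is preserved so results can be matched back to the input list.
    """
    seen = set()
    ids = []
    for token in raw.split(","):
        token = token.strip()
        if not token:
            continue
        if not token.isdigit():
            raise ValueError(f"not a numeric Gene ID: {token!r}")
        if token not in seen:
            seen.add(token)
            ids.append(token)
    return ids


def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]
```

Sending one request per chunk rather than one per gene keeps the request count, and therefore the pressure on the rate limit, low for large panels.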

Implementation Details

API selection guidance

  • Use E-utilities when you need:
    • complex Entrez query syntax (fielded queries, boolean logic),
    • cross-database patterns,
    • fine control over search and retrieval steps (ESearch → ESummary/EFetch).
  • Use NCBI Datasets API when you need:
    • a streamlined gene-centric retrieval path,
    • consolidated metadata (and often sequence-related links) with fewer round trips.

Query patterns (E-utilities)

Typical fielded query components include:

  • Symbol plus organism scoping (example): BRCA1[gene name] AND human[organism]
  • GO term searches (example): GO:0006915[biological process]
  • Phenotype/disease keywords (example): diabetes[phenotype] AND mouse[organism]
  • Pathway keywords (example): insulin signaling pathway[pathway]
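Query strings like these can be assembled programmatically instead of by hand. The following is a small sketch with hypothetical helper names (fielded, and_query); the optional phrase quoting is a design choice of this sketch, and the field tags themselves are taken from the examples above rather than validated against the Gene database.

```python
def fielded(value, field, phrase=False):
    """Format one Entrez clause such as BRCA1[gene name].

    With phrase=True the value is wrapped in double quotes so that a
    multi-word value is searched as a single phrase.
    """
    value = value.strip()
    if not value:
        raise ValueError("empty search value")
    if phrase:
        value = f'"{value}"'
    return f"{value}[{field}]"


def and_query(*clauses):
    """Join non-empty clauses with the Entrez boolean operator AND."""
    return " AND ".join(clause for clause in clauses if clause)
```

For instance, and_query(fielded("diabetes", "phenotype"), fielded("mouse", "organism")) reproduces the phenotype example from the list.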

Rate limits and API keys

  • Without an API key (typical defaults):
    • E-utilities: ~3 requests/sec
    • Datasets API: ~5 requests/sec
  • With an NCBI API key:
    • E-utilities and the Datasets API: up to ~10 requests/sec each

Obtain an API key from: https://www.ncbi.nlm.nih.gov/account/
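One way to stay under these limits is a small client-side throttle that spaces out requests. The sketch below is an illustration, not the mechanism used by the bundled scripts: RateLimiter and default_rate are hypothetical names, and the clock and sleep functions are injectable so the behavior can be checked without waiting.

```python
import time


def default_rate(service, has_api_key):
    """Approximate requests/sec per the limits listed above."""
    if has_api_key:
        return 10
    return {"eutils": 3, "datasets": 5}[service]


class RateLimiter:
    """Enforce a minimum interval between consecutive requests."""

    def __init__(self, requests_per_second, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / requests_per_second
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = 0.0  # earliest time the next request may start

    def wait(self):
        """Block until the next request is allowed; return seconds slept."""
        now = self._clock()
        delay = self._next_allowed - now
        if delay > 0:
            self._sleep(delay)
            now += delay
        self._next_allowed = now + self.min_interval
        return max(delay, 0.0)
```

Calling limiter.wait() immediately before each HTTP request is enough for a single-threaded pipeline; a multi-threaded one would additionally need a lock around wait().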

Error handling recommendations

  • Handle standard HTTP errors:
    • 400: invalid/malformed query or parameters
    • 404: Gene ID not found
    • 429: rate limit exceeded
  • Use exponential backoff with jitter for retries on 429/5xx.
  • Cache results for repeated lookups (especially in batch annotation workflows).
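The retry recommendation can be sketched as follows, using the "full jitter" variant of exponential backoff (one common choice among several). The names is_retryable, backoff_delay, and call_with_retries are hypothetical, and send stands for any zero-argument callable that performs the request and returns a (status, body) pair.

```python
import random
import time


def is_retryable(status):
    """Retry only on rate limiting (429) and server-side errors (5xx)."""
    return status == 429 or 500 <= status <= 599


def backoff_delay(attempt, base=0.5, cap=30.0, rng=random.random):
    """Full-jitter delay: a random value in [0, min(cap, base * 2**attempt))."""
    return rng() * min(cap, base * (2 ** attempt))


def call_with_retries(send, max_attempts=5, sleep=time.sleep, rng=random.random):
    """Call `send` until it returns a non-retryable status or attempts run out.

    Returns the last (status, body) pair, so the caller still sees the
    final error when every attempt fails.
    """
    for attempt in range(max_attempts):
        status, body = send()
        if not is_retryable(status):
            return status, body
        if attempt < max_attempts - 1:
            sleep(backoff_delay(attempt, rng=rng))
    return status, body
```

Client errors such as 400 and 404 are returned immediately, since repeating an invalid query or a missing Gene ID cannot succeed.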

Output/data formats

Depending on endpoint/script options, gene data may be returned as:

  • JSON (recommended for pipelines)
  • XML (legacy/verbose metadata)
  • Text summaries
  • Sequence-oriented formats such as FASTA or GenBank (when supported by the chosen endpoint/workflow)
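To illustrate why JSON suits pipelines, the sketch below flattens an ESummary-style JSON response into one row per gene. The key names (result, uids, name, description, chromosome, maplocation, otheraliases) are assumptions about the response shape, and summarize_esummary is a hypothetical helper; confirm the actual structure in references/api_reference.md before depending on it.

```python
def summarize_esummary(payload):
    """Flatten a decoded ESummary JSON response into a list of row dicts.

    Missing keys become None (or an empty alias list) instead of raising,
    so partially populated records do not abort a batch run.
    """
    result = payload.get("result", {})
    rows = []
    for uid in result.get("uids", []):
        doc = result.get(uid, {})
        aliases = [a.strip() for a in doc.get("otheraliases", "").split(",") if a.strip()]
        rows.append({
            "gene_id": uid,
            "symbol": doc.get("name"),
            "description": doc.get("description"),
            "chromosome": doc.get("chromosome"),
            "map_location": doc.get("maplocation"),
            "aliases": aliases,
        })
    return rows


# Hand-written, heavily trimmed sample in the assumed shape (not a captured response).
SAMPLE = {
    "result": {
        "uids": ["672"],
        "672": {
            "name": "BRCA1",
            "description": "BRCA1 DNA repair associated",
            "chromosome": "17",
            "maplocation": "17q21.31",
            "otheraliases": "BRCC1, FANCS, RNF53",
        },
    }
}
```

Rows in this form can be written directly to JSON or CSV for downstream annotation steps.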

Additional references

If present in the repository, consult:

  • references/api_reference.md for endpoint/parameter details and response structures
  • references/common_workflows.md for additional query patterns and end-to-end examples