Agent Skills
Ena Database
AIPOCH
Access the European Nucleotide Archive (ENA) via REST APIs and FTP/Aspera to search and retrieve sequences, raw reads (FASTQ), assemblies, and metadata when you have accession IDs or need metadata-driven discovery for genomics pipelines.
SKILL.md
When to Use
Use this skill when you need to:
- Download raw sequencing reads (FASTQ) for a run/experiment/study using ENA accessions (e.g., `ERR...`, `SRR...`, `PRJ...`).
- Find samples, runs, experiments, or assemblies by metadata filters (organism, platform, collection date, geography, etc.).
- Retrieve record metadata (XML/JSON/TSV) for reproducible reporting and pipeline inputs.
- Query taxonomic lineage/rank for organisms to drive filtering or grouping in analyses.
- Perform bulk discovery + bulk download workflows (search first, then fetch many files via FTP/Aspera/tools).
Key Features
- Multi-object ENA coverage: studies/projects, samples, experiments, runs, assemblies, sequences, analyses, taxonomy records.
- Two primary API styles:
- Portal API for advanced search and metadata export (JSON/TSV/CSV).
- Browser API for direct record retrieval by accession (XML).
- Multiple data formats: FASTQ, FASTA, BAM/CRAM, EMBL flat file, plus metadata in XML/JSON/TSV.
- Bulk transfer options: FTP/Aspera and command-line tooling patterns for large datasets.
- Cross-references and reference retrieval: ENA xref service and CRAM reference registry endpoints.
- Operational guidance: rate limiting awareness (HTTP 429) and best practices for robust pipelines.
For detailed endpoint and parameter documentation, see `references/api_reference.md`.
Dependencies
- Python >=3.9
- requests >=2.31.0
Optional (recommended for XML parsing when using the Browser API):
- lxml >=4.9.0
Example Usage
The following script is a complete, runnable example that:
- searches ENA for runs in a study via the Portal API (JSON), then
- fetches one run’s record via the Browser API (XML), and
- retrieves taxonomy lineage via the Taxonomy REST API.
    #!/usr/bin/env python3
    """Search ENA runs for a study, then fetch one run's XML and taxonomy record."""
    import sys
    import time

    import requests

    PORTAL_SEARCH = "https://www.ebi.ac.uk/ena/portal/api/search"
    BROWSER_XML = "https://www.ebi.ac.uk/ena/browser/api/xml"
    TAXONOMY = "https://www.ebi.ac.uk/ena/taxonomy/rest"

    # Reuse one session so connections are pooled across requests.
    SESSION = requests.Session()
    SESSION.headers.update({"User-Agent": "ena-database-skill/1.0"})


    def get_with_backoff(url, params=None, max_retries=6, timeout=30):
        """GET a URL, retrying with exponential backoff while the server returns HTTP 429."""
        delay = 1.0
        for attempt in range(max_retries):
            r = SESSION.get(url, params=params, timeout=timeout)
            if r.status_code != 429:
                r.raise_for_status()
                return r
            # Rate-limited: wait before the next attempt (no wait after the last one).
            if attempt < max_retries - 1:
                time.sleep(delay)
                delay *= 2
        # Still rate-limited after the final attempt: surface the 429 to the caller.
        r.raise_for_status()


    def search_runs_by_study(study_accession, limit=5):
        """Return a list of run records (dicts) for a study via the Portal API."""
        params = {
            "result": "read_run",
            # Text values are quoted in Portal API query expressions.
            "query": f'study_accession="{study_accession}"',
            "format": "json",
            "limit": limit,
            # Ask for a few useful fields; adjust as needed for your pipeline.
            "fields": "run_accession,study_accession,sample_accession,experiment_accession,tax_id,scientific_name,fastq_ftp",
        }
        r = get_with_backoff(PORTAL_SEARCH, params=params)
        return r.json()


    def fetch_run_xml(run_accession):
        """Return the Browser API XML record for a run as a string."""
        url = f"{BROWSER_XML}/{run_accession}"
        r = get_with_backoff(url)
        return r.text  # XML string


    def fetch_taxonomy_lineage(tax_id):
        """Return the Taxonomy REST API record for a taxon ID."""
        url = f"{TAXONOMY}/tax-id/{tax_id}"
        r = get_with_backoff(url)
        return r.json()


    def main():
        if len(sys.argv) < 2:
            print("Usage: python ena_example.py <STUDY_ACCESSION> (e.g., PRJEB1234)", file=sys.stderr)
            sys.exit(2)

        study = sys.argv[1]
        runs = search_runs_by_study(study_accession=study, limit=5)
        if not runs:
            print(f"No runs found for study {study}")
            return

        print(f"Found {len(runs)} runs for study {study}")
        first = runs[0]
        run_acc = first.get("run_accession")
        tax_id = first.get("tax_id")

        print("\nFirst run summary (Portal API JSON):")
        for k in ["run_accession", "sample_accession", "experiment_accession", "scientific_name", "tax_id", "fastq_ftp"]:
            print(f"  {k}: {first.get(k)}")

        if run_acc:
            xml = fetch_run_xml(run_acc)
            print("\nBrowser API XML (first 600 chars):")
            print(xml[:600])

        if tax_id:
            tax = fetch_taxonomy_lineage(tax_id)
            print("\nTaxonomy lineage (ENA Taxonomy REST API):")
            # The response may be a single record or a list with one record; handle both.
            rec = tax[0] if isinstance(tax, list) and tax else tax
            print(f"  scientificName: {rec.get('scientificName')}")
            print(f"  rank: {rec.get('rank')}")
            print(f"  lineage: {rec.get('lineage')}")


    if __name__ == "__main__":
        main()
Run:
    python ena_example.py PRJEB1234
Implementation Details
ENA data model (what you query and retrieve)
ENA organizes records into common object types used in pipelines:
- Study/Project: umbrella entity for a dataset; primary unit for citation.
- Sample: biological material metadata.
- Experiment: library prep + instrument metadata.
- Run: the actual sequencing output files (often FASTQ) for one run.
- Assembly: genome/transcriptome/metagenome assemblies.
- Sequence/Record: annotated sequences (e.g., EMBL records).
- Analysis: computational results derived from sequence data.
- Taxonomy: lineage and rank information.
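When a pipeline receives mixed accessions, it can help to route each one to the matching object type before querying. The sketch below is illustrative only: the prefix patterns are assumptions based on common INSDC accession conventions, not an exhaustive or authoritative list, and `classify_accession` is a hypothetical helper that is not part of this skill.

```python
import re

# Assumed prefix conventions (illustrative, not exhaustive):
#   PRJEB.../ERP... studies, SAMEA.../ERS... samples, ERX... experiments,
#   ERR... runs, ERZ... analyses, GCA_... assemblies (plus S*/D* mirrors).
ACCESSION_PATTERNS = [
    ("study", re.compile(r"^(PRJ[EDN][A-Z]\d+|[EDS]RP\d+)$")),
    ("sample", re.compile(r"^(SAM[EDN][A-Z]?\d+|[EDS]RS\d+)$")),
    ("experiment", re.compile(r"^[EDS]RX\d+$")),
    ("run", re.compile(r"^[EDS]RR\d+$")),
    ("analysis", re.compile(r"^[EDS]RZ\d+$")),
    ("assembly", re.compile(r"^GCA_\d+(\.\d+)?$")),
]


def classify_accession(accession):
    """Return the guessed ENA object type for an accession, or None if unrecognized."""
    acc = accession.strip().upper()
    for object_type, pattern in ACCESSION_PATTERNS:
        if pattern.match(acc):
            return object_type
    return None
```

A caller might use the result to pick the Portal API `result` type (for example `read_run` for runs), falling back to a Browser API lookup when the accession is not recognized.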
API selection guidance
- Portal API (`/ena/portal/api/search`): use for searching and exporting metadata at scale.
  - Typical outputs: `json`, `tsv`, `csv`.
  - Supports complex query expressions (see `references/api_reference.md`).
- Browser API (`/ena/browser/api/xml/{accession}`): use for direct retrieval by accession.
  - Output: XML (parse with an XML parser, not regex).
- Taxonomy REST API (`/ena/taxonomy/rest/...`): use for lineage/rank lookups.
- Cross-reference service: `https://www.ebi.ac.uk/ena/xref/rest/` for related records in external databases.
- CRAM reference registry: `https://www.ebi.ac.uk/ena/cram/` for reference sequence retrieval by checksum.
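To illustrate the advice about parsing Browser API output with an XML parser rather than regex, here is a minimal sketch using the standard library. The `SAMPLE_XML` document is a simplified, made-up stand-in for a run record (real records carry many more elements), and `summarize_run_xml` is a hypothetical helper; check the element names against an actual response before relying on them.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical run record; a real Browser API response is larger.
SAMPLE_XML = """<RUN_SET>
  <RUN accession="ERR0000001" alias="example-run">
    <IDENTIFIERS>
      <PRIMARY_ID>ERR0000001</PRIMARY_ID>
    </IDENTIFIERS>
    <EXPERIMENT_REF accession="ERX0000001"/>
  </RUN>
</RUN_SET>
"""


def summarize_run_xml(xml_text):
    """Extract accession, alias, and experiment reference from each RUN element."""
    root = ET.fromstring(xml_text)
    summaries = []
    for run in root.iter("RUN"):
        exp_ref = run.find("EXPERIMENT_REF")
        summaries.append({
            "run_accession": run.get("accession"),
            "alias": run.get("alias"),
            # The reference element may be absent, so guard against None.
            "experiment_accession": exp_ref.get("accession") if exp_ref is not None else None,
        })
    return summaries


if __name__ == "__main__":
    for summary in summarize_run_xml(SAMPLE_XML):
        print(summary)
```

The same function accepts either a string or bytes, so it can be fed `response.text` or `response.content` from `requests`. If you switch to `lxml`, passing bytes is the safer choice when the document starts with an XML encoding declaration.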
Query parameters and outputs (practical notes)
- Portal API core parameters (commonly used):
  - `result`: record type (e.g., `sample`, `read_run`, `assembly`)
  - `query`: filter expression (e.g., `study_accession="PRJEB1234"`, or `tax_tree(562)` for *Escherichia coli* and its descendants; `tax_tree` takes an NCBI taxon ID rather than a name)
  - `fields`: comma-separated fields to return (improves performance vs returning everything)
  - `format`: `json` / `tsv` / `csv`
  - `limit` (and pagination via `offset` where applicable)
- File retrieval:
  - For raw reads, prefer extracting file locations (e.g., `fastq_ftp`) from Portal results, then download via FTP/Aspera for scale.
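As a sketch of the file-retrieval note, the helper below turns a Portal API record into download URLs. It assumes that `fastq_ftp` holds semicolon-separated paths without a URL scheme (one per file, so paired-end runs have two) and that a parallel `fastq_md5` field was requested alongside it; both assumptions should be checked against real results. `fastq_urls` and the sample record are hypothetical.

```python
def fastq_urls(record, scheme="ftp"):
    """Return (url, md5) pairs for the FASTQ files listed in a Portal API record.

    md5 is None when checksums are missing or do not line up with the paths.
    """
    paths = [p for p in (record.get("fastq_ftp") or "").split(";") if p]
    md5s = [m for m in (record.get("fastq_md5") or "").split(";") if m]
    if len(md5s) != len(paths):
        # Do not guess which checksum belongs to which file.
        md5s = [None] * len(paths)
    return [(f"{scheme}://{path}", md5) for path, md5 in zip(paths, md5s)]


if __name__ == "__main__":
    # Made-up record shaped like a read_run result with two paired-end files.
    example = {
        "run_accession": "ERR0000001",
        "fastq_ftp": "ftp.example.org/fastq/ERR0000001_1.fastq.gz;"
                     "ftp.example.org/fastq/ERR0000001_2.fastq.gz",
        "fastq_md5": "aaa111;bbb222",
    }
    for url, md5 in fastq_urls(example):
        print(url, md5)
```

The resulting list can be written to a manifest and handed to a bulk downloader, with the checksums used to verify each file after transfer.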
Rate limiting and robustness
- ENA APIs are rate-limited (commonly documented as 50 requests/second). Exceeding limits returns HTTP 429.
- Implement:
- exponential backoff on 429,
- request consolidation (fetch multiple fields in one query),
- bulk download mechanisms for large datasets instead of per-accession loops.
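One way to consolidate requests is to look up many accessions per Portal API call instead of one call each. The sketch below only builds the query strings; it assumes the Portal API accepts quoted values joined with `OR`, which should be confirmed in `references/api_reference.md`. `build_batched_queries` and the default batch size are hypothetical choices, with the batch size kept modest so the request URL stays short.

```python
def chunked(items, size):
    """Yield consecutive slices of `items`, each with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_batched_queries(run_accessions, batch_size=50):
    """Group run accessions into OR-joined query expressions, one per batch."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    queries = []
    for batch in chunked(list(run_accessions), batch_size):
        clauses = [f'run_accession="{acc}"' for acc in batch]
        queries.append(" OR ".join(clauses))
    return queries


if __name__ == "__main__":
    for query in build_batched_queries(["ERR0000001", "ERR0000002", "ERR0000003"], batch_size=2):
        print(query)
```

Each returned string would be passed as the `query` parameter of a single search request, so 1,000 accessions become 20 requests at the default batch size rather than 1,000.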
Recommended pipeline pattern (search → resolve → download)
- Search with Portal API to obtain accessions and file URLs.
- Resolve any needed details (optional) via Browser API XML for specific accessions.
- Download large files via FTP/Aspera or tooling (rather than API streaming).
- Cache taxonomy lookups when processing many records to reduce repeated calls.
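The caching step can be sketched with `functools.lru_cache`. To keep the example independent of the network, the lookup function is injected: `make_cached_lookup` is a hypothetical helper, and in a real pipeline the `fetch` argument would be something like the `fetch_taxonomy_lineage` function from the example script.

```python
from functools import lru_cache


def make_cached_lookup(fetch, maxsize=1024):
    """Wrap `fetch(tax_id)` so each distinct taxon ID is fetched at most once."""

    @lru_cache(maxsize=maxsize)
    def _cached(tax_id):
        return fetch(tax_id)

    def lookup(tax_id):
        # Normalize so 9606 and "9606" share one cache entry.
        return _cached(str(tax_id).strip())

    lookup.cache_info = _cached.cache_info
    return lookup


if __name__ == "__main__":
    calls = []

    def fake_fetch(tax_id):
        # Stand-in for a real Taxonomy REST API call.
        calls.append(tax_id)
        return {"taxId": tax_id}

    lookup = make_cached_lookup(fake_fetch)
    for tax_id in [9606, "9606", 562, 9606]:
        lookup(tax_id)
    print(f"{len(calls)} fetches for 4 lookups")
```

Cached records are returned by reference, so callers should treat them as read-only or copy them before modifying.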