Agent Skills
Biopython
AIPOCH
A comprehensive toolbox for computational molecular biology; use it when you need programmatic sequence/structure parsing, batch bioinformatics pipelines, or automated NCBI/BLAST workflows.
73
6
FILES
86100Total Score
View Evaluation ReportCore Capability
85 / 100
Functional Suitability
11 / 12
Reliability
9 / 12
Performance & Context
7 / 8
Agent Usability
14 / 16
Human Usability
8 / 8
Security
11 / 12
Maintainability
9 / 12
Agent-Specific
16 / 20
Medical Task
20 / 20 Passed
91Batch-process DNA/RNA/protein sequences (translation, reverse complement, statistics) as part of a custom pipeline
4/4
87Parse, validate, convert, or stream large bioinformatics files (FASTA/FASTQ/GenBank/PDB/mmCIF) without loading everything into memory
4/4
85Sequence objects and utilities: Bio.Seq, Bio.SeqRecord, Bio.SeqUtils (GC fraction, molecular weight, translation, etc.)
4/4
85File I/O and format conversion: Bio.SeqIO, Bio.AlignIO for FASTA/FASTQ/GenBank and alignment formats
4/4
85End-to-end case for Sequence objects and utilities: Bio.Seq, Bio.SeqRecord, Bio.SeqUtils (GC fraction, molecular weight, translation, etc.)
4/4
SKILL.md
When to Use
Use this skill when you need to:
- Batch-process DNA/RNA/protein sequences (translation, reverse complement, statistics) as part of a custom pipeline.
- Parse, validate, convert, or stream large bioinformatics files (FASTA/FASTQ/GenBank/PDB/mmCIF) without loading everything into memory.
- Programmatically query and download records from NCBI (GenBank, PubMed, Gene, Protein) via
Bio.Entrez, respecting rate limits. - Automate BLAST searches (web or local) and parse results to extract top hits and metadata.
- Build or manipulate phylogenetic trees from alignments or distance matrices (e.g., NJ trees) for downstream analysis.
Note: For quick one-off queries, tools like gget may be more convenient; for multi-service API aggregation, bioservices may be a better fit.
Key Features
- Sequence objects and utilities:
Bio.Seq,Bio.SeqRecord,Bio.SeqUtils(GC fraction, molecular weight, translation, etc.). - File I/O and format conversion:
Bio.SeqIO,Bio.AlignIOfor FASTA/FASTQ/GenBank and alignment formats. - NCBI access:
Bio.Entrezforesearch,efetch,elink, and structured parsing viaEntrez.read. - BLAST:
Bio.Blast.NCBIWWWfor remote BLAST andBio.Blast.NCBIXMLfor XML parsing. - Structural bioinformatics:
Bio.PDBfor PDB/mmCIF parsing, hierarchy traversal, and geometry calculations. - Phylogenetics:
Bio.PhyloandBio.Phylo.TreeConstructionfor tree I/O, distances, and construction.
Reference guides (if present in this repository) can be consulted for deeper module-specific patterns:
references/sequence_io.mdreferences/alignment.mdreferences/databases.mdreferences/blast.mdreferences/structure.mdreferences/phylogenetics.mdreferences/advanced.md
Dependencies
- Python >= 3.8 (Biopython 1.85 supports Python 3)
biopython==1.85numpy>=1.20(required by Biopython)
Install:
python -m pip install "biopython==1.85" "numpy>=1.20"
Example Usage
A complete, runnable example that:
- parses a FASTA file,
- computes GC fraction,
- runs a remote BLAST (optional),
- fetches the top hit from NCBI,
- prints basic results.
Create example_biopython_pipeline.py:
from __future__ import annotations
import os
import time
from typing import Optional
from Bio import Entrez, SeqIO
from Bio.SeqUtils import gc_fraction
# Optional BLAST (remote). Comment out if you do not want network calls.
from Bio.Blast import NCBIWWW, NCBIXML
def configure_entrez() -> None:
"""
NCBI requires an email. An API key increases rate limits.
Set these via environment variables to avoid hardcoding secrets.
"""
email = os.environ.get("NCBI_EMAIL")
if not email:
raise RuntimeError("Set NCBI_EMAIL env var (required by NCBI). Example: export NCBI_EMAIL='[email protected]'")
Entrez.email = email
api_key = os.environ.get("NCBI_API_KEY")
if api_key:
Entrez.api_key = api_key
def read_first_fasta_record(path: str):
with open(path, "r", encoding="utf-8") as handle:
return next(SeqIO.parse(handle, "fasta"))
def blast_top_accession(sequence: str, program: str = "blastn", database: str = "nt") -> Optional[str]:
"""
Remote BLAST can be slow and rate-limited. For large-scale BLAST, prefer local BLAST+.
"""
result_handle = NCBIWWW.qblast(program, database, sequence)
blast_record = NCBIXML.read(result_handle)
if not blast_record.alignments:
return None
# Many BLAST titles include multiple identifiers; accession is usually available directly.
return blast_record.alignments[0].accession
def fetch_fasta_by_accession(accession: str) -> str:
with Entrez.efetch(db="nucleotide", id=accession, rettype="fasta", retmode="text") as handle:
return handle.read()
def main() -> None:
configure_entrez()
record = read_first_fasta_record("input.fasta")
seq = record.seq
print(f"ID: {record.id}")
print(f"Length: {len(seq)}")
print(f"GC fraction: {gc_fraction(seq):.2%}")
# Be polite to NCBI services in batch workflows.
time.sleep(0.34)
top_acc = blast_top_accession(str(seq))
if not top_acc:
print("No BLAST hits found.")
return
print(f"Top BLAST accession: {top_acc}")
time.sleep(0.34)
fasta_text = fetch_fasta_by_accession(top_acc)
print("Top hit FASTA:")
print(fasta_text)
if __name__ == "__main__":
main()
Run:
export NCBI_EMAIL="[email protected]"
# export NCBI_API_KEY="your_ncbi_api_key" # optional
python example_biopython_pipeline.py
Provide an input.fasta in the same directory, e.g.:
>demo
ATCGATCGATCGATCGATCG
Implementation Details
- Streaming I/O for large datasets: Prefer iterator-based parsing (
SeqIO.parse) to avoid loading entire files into memory. UseSeqIO.readonly when exactly one record is expected. - Entrez configuration and rate limits:
- Always set
Entrez.email(NCBI requirement). - Optionally set
Entrez.api_keyto increase request limits. - In batch jobs, add delays (e.g.,
time.sleep(0.34)as a conservative baseline) and implement retries for transient HTTP failures.
- Always set
- BLAST considerations:
NCBIWWW.qblast(...)is convenient but can be slow and is not ideal for high-throughput workloads.- Parse results with
NCBIXML.read(...)(single record) orNCBIXML.parse(...)(multiple records). - Filter hits by HSP metrics (e-value, identity) by iterating
alignment.hsps.
- Sequence statistics and transformations:
- Use
Bio.SeqUtils.gc_fraction(seq)for GC fraction (returns 0–1). - Use
seq.translate(table=...)with the correct genetic code table for reproducibility.
- Use
- Structure parsing (if used):
- Use
Bio.PDB.PDBParser(QUIET=True)to suppress warnings when appropriate. - Navigate the SMCRA hierarchy (Structure → Model → Chain → Residue → Atom) for robust traversal and geometry calculations.
- Use
- Reproducibility:
- Record key parameters (file formats, translation table, BLAST program/database, e-value thresholds, NCBI query terms).
- Cache downloaded records when iterating to avoid repeated network calls.