Agent Skills

Pdb Database

AIPOCH

Access the RCSB Protein Data Bank (PDB) to search, download, and programmatically retrieve 3D macromolecular structures and metadata; use when you need structure discovery (text/sequence/3D similarity) or automated structural data ingestion for structural biology and drug discovery workflows.

113
9
FILES
pdb-database/
skill.md
references
api_reference.md
86100Total Score
View Evaluation Report
Core Capability
84 / 100
Functional Suitability
11 / 12
Reliability
9 / 12
Performance & Context
7 / 8
Agent Usability
14 / 16
Human Usability
8 / 8
Security
10 / 12
Maintainability
9 / 12
Agent-Specific
16 / 20
Medical Task
20 / 20 Passed
92Find protein/nucleic acid 3D structures by keywords, organism, experimental method, or resolution
4/4
88Identify related structures via sequence similarity (e.g., homolog search for modeling)
4/4
86Text and attribute-based search over RCSB PDB entries
4/4
86Sequence similarity search with configurable thresholds (e-value, identity)
4/4
86End-to-end case for Text and attribute-based search over RCSB PDB entries
4/4

SKILL.md

When to Use

Use this skill when you need to:

  • Find protein/nucleic acid 3D structures by keywords, organism, experimental method, or resolution.
  • Identify related structures via sequence similarity (e.g., homolog search for modeling).
  • Identify related structures via 3D structure similarity (e.g., fold-level comparisons).
  • Download coordinates (PDB/mmCIF) for downstream analysis, visualization, docking, or modeling.
  • Run batch retrieval of metadata/coordinates to feed pipelines in drug discovery, protein engineering, or structural bioinformatics.

Key Features

  • Text and attribute-based search over RCSB PDB entries.
  • Sequence similarity search with configurable thresholds (e-value, identity).
  • Structure similarity search using an existing entry as a query.
  • Programmatic metadata retrieval via the RCSB Data API (schema-based or GraphQL).
  • Direct coordinate downloads in PDB and mmCIF formats.
  • Batch processing patterns for multiple PDB IDs.

Dependencies

  • rcsb-api (latest recommended; provides rcsbapi.search and rcsbapi.data)
  • requests>=2.0 (HTTP downloads)
  • biopython>=1.80 (optional; parsing/analyzing PDB coordinates)

Install (example):

uv pip install rcsb-api requests biopython

Example Usage

The following script is end-to-end runnable: it searches for a target, fetches metadata, downloads coordinates, and parses the structure.

#!/usr/bin/env python3
import pathlib
import requests

from rcsbapi.search import TextQuery, AttributeQuery
from rcsbapi.search.attrs import rcsb_entry_info
from rcsbapi.data import fetch, Schema

from Bio.PDB import PDBParser


def download_text(url: str, out_path: pathlib.Path) -> None:
    r = requests.get(url, timeout=60)
    r.raise_for_status()
    out_path.write_text(r.text, encoding="utf-8")


def main():
    out_dir = pathlib.Path("pdb_out")
    out_dir.mkdir(exist_ok=True)

    # 1) Search: hemoglobin entries with resolution < 2.0 Å
    q_text = TextQuery("hemoglobin")
    q_res = AttributeQuery(
        attribute=rcsb_entry_info.resolution_combined,
        operator="less",
        value=2.0,
    )
    query = q_text & q_res

    pdb_ids = list(query())[:5]
    if not pdb_ids:
        raise SystemExit("No results found.")
    pdb_id = pdb_ids[0]
    print(f"Selected PDB ID: {pdb_id}")

    # 2) Fetch entry metadata
    entry = fetch(pdb_id, schema=Schema.ENTRY)
    title = entry.get("struct", {}).get("title")
    method = (entry.get("exptl") or [{}])[0].get("method")
    resolution = (entry.get("rcsb_entry_info") or {}).get("resolution_combined")
    deposit_date = (entry.get("rcsb_accession_info") or {}).get("deposit_date")

    print("Metadata:")
    print(f"  Title: {title}")
    print(f"  Method: {method}")
    print(f"  Resolution: {resolution}")
    print(f"  Deposit date: {deposit_date}")

    # 3) Download coordinates (PDB and mmCIF)
    pdb_path = out_dir / f"{pdb_id}.pdb"
    cif_path = out_dir / f"{pdb_id}.cif"

    download_text(f"https://files.rcsb.org/download/{pdb_id}.pdb", pdb_path)
    download_text(f"https://files.rcsb.org/download/{pdb_id}.cif", cif_path)
    print(f"Downloaded: {pdb_path} and {cif_path}")

    # 4) Parse PDB coordinates (example: count atoms)
    parser = PDBParser(QUIET=True)
    structure = parser.get_structure(pdb_id, str(pdb_path))

    atom_count = sum(1 for _ in structure.get_atoms())
    chain_ids = sorted({chain.id for chain in structure.get_chains()})
    print("Parsed structure:")
    print(f"  Chains: {chain_ids}")
    print(f"  Atom count: {atom_count}")


if __name__ == "__main__":
    main()

Implementation Details

Search Modes and Query Composition

  • Text search uses free-text matching over entry annotations (titles, keywords, descriptions).
  • Attribute search filters by structured fields (e.g., organism, method, resolution).
  • Sequence similarity search typically supports:
    • evalue_cutoff: lower is more stringent (fewer, more confident hits).
    • identity_cutoff: fraction identity threshold (e.g., 0.9 for near-identical).
  • Structure similarity search uses an existing structure (e.g., an entry_id) as the geometric reference.
  • Queries can be combined with boolean logic:
    • query1 & query2 (AND)
    • query1 | query2 (OR)
    • ~query (NOT), where supported by the client

Data Retrieval (Schema vs GraphQL)

  • Schema-based fetch (e.g., Schema.ENTRY, Schema.POLYMER_ENTITY) is convenient for common objects and stable access patterns.
  • GraphQL fetch is best when you need a custom selection of fields in one request (reduce round-trips and payload).

Example GraphQL pattern:

from rcsbapi.data import fetch

query = """
{
  entry(entry_id: "4HHB") {
    struct { title }
    exptl { method }
    rcsb_entry_info { resolution_combined deposited_atom_count }
  }
}
"""
data = fetch(query_type="graphql", query=query)

Coordinate Downloads and Formats

  • PDB: legacy text format; widely supported but less expressive for large/complex structures.
  • mmCIF (PDBx): modern standard; preferred for completeness and large structures.

Direct download endpoints:

  • https://files.rcsb.org/download/{PDB_ID}.pdb
  • https://files.rcsb.org/download/{PDB_ID}.cif

Batch Processing Pattern

For batch metadata retrieval, iterate over IDs and call fetch(pdb_id, schema=Schema.ENTRY); handle exceptions per-ID to keep pipelines robust. For large batches, consider rate limiting and caching to avoid repeated downloads.

Reference Documentation

If present in this repository, consult:

  • references/api_reference.md for advanced endpoint usage, query patterns, schema notes, rate limits, and troubleshooting.