When to Use

Use this skill when you need to:

Find protein/nucleic acid 3D structures by keywords, organism, experimental method, or resolution.
Identify related structures via sequence similarity (e.g., homolog search for modeling).
Identify related structures via 3D structure similarity (e.g., fold-level comparisons).
Download coordinates (PDB/mmCIF) for downstream analysis, visualization, docking, or modeling.
Run batch retrieval of metadata/coordinates to feed pipelines in drug discovery, protein engineering, or structural bioinformatics.

Key Features

Text and attribute-based search over RCSB PDB entries.
Sequence similarity search with configurable thresholds (e-value, identity).
Structure similarity search using an existing entry as a query.
Programmatic metadata retrieval via the RCSB Data API (rcsbapi.data.DataQuery or hand-written GraphQL).
Direct coordinate downloads in PDB and mmCIF formats.
Batch processing patterns for multiple PDB IDs.

Dependencies

rcsb-api (latest recommended; provides rcsbapi.search and rcsbapi.data)
requests>=2.0 (HTTP downloads)
biopython>=1.80 (optional; parsing/analyzing PDB coordinates)

Install (example):

uv pip install rcsb-api requests biopython

Example Usage

The following script is end-to-end runnable: it searches for a target, fetches metadata, downloads coordinates, and parses the structure.

#!/usr/bin/env python3
import pathlib
import requests

from rcsbapi.search import TextQuery, AttributeQuery
from rcsbapi.data import DataQuery

from Bio.PDB import PDBParser


def download_text(url: str, out_path: pathlib.Path) -> None:
    """Download a text file (PDB or mmCIF) and save it to out_path."""
    r = requests.get(url, timeout=60)
    r.raise_for_status()
    out_path.write_text(r.text, encoding="utf-8")


def main():
    out_dir = pathlib.Path("pdb_out")
    out_dir.mkdir(exist_ok=True)

    # 1) Search: hemoglobin entries with resolution < 2.0 Å
    q_text = TextQuery(value="hemoglobin")
    q_res = AttributeQuery(
        attribute="rcsb_entry_info.resolution_combined",
        operator="less",
        value=2.0,
    )
    query = q_text & q_res

    pdb_ids = list(query())[:5]
    if not pdb_ids:
        raise SystemExit("No results found.")
    pdb_id = pdb_ids[0]
    print(f"Selected PDB ID: {pdb_id}")

    # 2) Fetch entry metadata (only the fields printed below)
    data_query = DataQuery(
        input_type="entries",
        input_ids=[pdb_id],
        return_data_list=[
            "struct.title",
            "exptl.method",
            "rcsb_entry_info.resolution_combined",
            "rcsb_accession_info.deposit_date",
        ],
    )
    entry = data_query.exec()["data"]["entries"][0]
    title = (entry.get("struct") or {}).get("title")
    method = (entry.get("exptl") or [{}])[0].get("method")
    resolution = (entry.get("rcsb_entry_info") or {}).get("resolution_combined")
    deposit_date = (entry.get("rcsb_accession_info") or {}).get("deposit_date")

    print("Metadata:")
    print(f"  Title: {title}")
    print(f"  Method: {method}")
    print(f"  Resolution: {resolution}")
    print(f"  Deposit date: {deposit_date}")

    # 3) Download coordinates (PDB and mmCIF)
    pdb_path = out_dir / f"{pdb_id}.pdb"
    cif_path = out_dir / f"{pdb_id}.cif"

    download_text(f"https://files.rcsb.org/download/{pdb_id}.pdb", pdb_path)
    download_text(f"https://files.rcsb.org/download/{pdb_id}.cif", cif_path)
    print(f"Downloaded: {pdb_path} and {cif_path}")

    # 4) Parse PDB coordinates (example: count atoms)
    parser = PDBParser(QUIET=True)
    structure = parser.get_structure(pdb_id, str(pdb_path))

    atom_count = sum(1 for _ in structure.get_atoms())
    chain_ids = sorted({chain.id for chain in structure.get_chains()})
    print("Parsed structure:")
    print(f"  Chains: {chain_ids}")
    print(f"  Atom count: {atom_count}")


if __name__ == "__main__":
    main()

Implementation Details

Search Modes and Query Composition

Text search uses free-text matching over entry annotations (titles, keywords, descriptions).
Attribute search filters by structured fields (e.g., organism, method, resolution).
Sequence similarity search takes two main thresholds:
- evalue_cutoff: maximum E-value for a hit; lower is more stringent (fewer, more confident hits).
- identity_cutoff: minimum sequence identity as a fraction between 0 and 1 (e.g., 0.9 keeps only near-identical sequences).
Structure similarity search uses an existing structure (e.g., an entry_id) as the geometric reference.
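
The two similarity modes can be sketched as the JSON requests accepted by the RCSB Search API (v2), which makes the role of each threshold explicit. This is a minimal sketch and not the rcsb-api client itself: the helper names sequence_search and structure_search are hypothetical, and the defaults shown (return types, assembly "1", strict_shape_match) are assumptions to adjust for your use case.

```python
import json


def sequence_search(sequence: str, evalue_cutoff: float = 0.1,
                    identity_cutoff: float = 0.0,
                    sequence_type: str = "protein") -> dict:
    """Build a Search API request for a sequence similarity search."""
    if not 0.0 <= identity_cutoff <= 1.0:
        raise ValueError("identity_cutoff is a fraction between 0 and 1")
    return {
        "query": {
            "type": "terminal",
            "service": "sequence",
            "parameters": {
                "value": sequence,
                "sequence_type": sequence_type,
                "evalue_cutoff": evalue_cutoff,
                "identity_cutoff": identity_cutoff,
            },
        },
        # Assumption: report hits per polymer entity rather than per entry.
        "return_type": "polymer_entity",
    }


def structure_search(entry_id: str, assembly_id: str = "1",
                     operator: str = "strict_shape_match") -> dict:
    """Build a Search API request that uses an existing entry as the 3D query."""
    if operator not in ("strict_shape_match", "relaxed_shape_match"):
        raise ValueError("unknown shape-match operator")
    return {
        "query": {
            "type": "terminal",
            "service": "structure",
            "parameters": {
                "value": {"entry_id": entry_id, "assembly_id": assembly_id},
                "operator": operator,
            },
        },
        "return_type": "entry",
    }


# N-terminal fragment of human hemoglobin alpha, used only as a sample query.
hba_fragment = "MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF"
request = sequence_search(hba_fragment, evalue_cutoff=1e-5, identity_cutoff=0.9)
print(json.dumps(request, indent=2))
```

Tightening evalue_cutoff or raising identity_cutoff only changes two numbers in the request, so it is cheap to start strict and relax the thresholds if too few hits come back.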
Queries can be combined with boolean logic:
- query1 & query2 (AND)
- query1 | query2 (OR)
- ~query (NOT), where supported by the client
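
To make the boolean logic concrete, the sketch below builds the nested request that an AND/OR/NOT combination roughly corresponds to in the RCSB Search API (v2) JSON format. It uses only the standard library; the helper names (full_text, attribute, group, run_search) are hypothetical, and the chosen attributes and page size are assumptions for illustration.

```python
import json
import urllib.request

SEARCH_URL = "https://search.rcsb.org/rcsbsearch/v2/query"


def full_text(value: str) -> dict:
    """Free-text terminal node."""
    return {"type": "terminal", "service": "full_text",
            "parameters": {"value": value}}


def attribute(name: str, operator: str, value, negation: bool = False) -> dict:
    """Attribute terminal node; negation=True plays the role of ~query."""
    return {"type": "terminal", "service": "text",
            "parameters": {"attribute": name, "operator": operator,
                           "value": value, "negation": negation}}


def group(logical_operator: str, *nodes: dict) -> dict:
    """Group node: 'and' plays the role of &, 'or' the role of |."""
    if logical_operator not in ("and", "or"):
        raise ValueError("logical_operator must be 'and' or 'or'")
    return {"type": "group", "logical_operator": logical_operator,
            "nodes": list(nodes)}


def run_search(query: dict, return_type: str = "entry", rows: int = 25) -> list:
    """POST the query and return the identifiers on the first page of hits."""
    payload = {"query": query, "return_type": return_type,
               "request_options": {"paginate": {"start": 0, "rows": rows}}}
    request = urllib.request.Request(
        SEARCH_URL, data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(request, timeout=60) as response:
        body = response.read()
    if not body:  # an empty body means nothing matched
        return []
    return [hit["identifier"] for hit in json.loads(body)["result_set"]]


# hemoglobin AND resolution < 2.0 AND NOT electron microscopy
query = group(
    "and",
    full_text("hemoglobin"),
    attribute("rcsb_entry_info.resolution_combined", "less", 2.0),
    attribute("exptl.method", "exact_match", "ELECTRON MICROSCOPY", negation=True),
)
print(json.dumps(query, indent=2))
```

Calling run_search(query) would send the request over the network; groups can be nested, so group("or", query, full_text("myoglobin")) widens the search without rewriting the inner conditions.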

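Data Retrieval (DataQuery vs GraphQL)

rcsbapi.data.DataQuery takes an input type (e.g., "entries"), a list of IDs, and a list of fields to return, then builds and runs the request for you; it is convenient for common objects and stable access patterns.
A hand-written GraphQL query sent to the Data API endpoint (https://data.rcsb.org/graphql) is best when you need full control over a custom selection of fields in one request (fewer round-trips, smaller payload).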

Example GraphQL pattern:

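import requests

query = """
{
  entry(entry_id: "4HHB") {
    struct { title }
    exptl { method }
    rcsb_entry_info { resolution_combined deposited_atom_count }
  }
}
"""
# POST the query to the RCSB Data API GraphQL endpoint
response = requests.post("https://data.rcsb.org/graphql", json={"query": query}, timeout=60)
response.raise_for_status()
data = response.json()["data"]["entry"]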
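
The GraphQL pattern above extends to several entries in one request through the entries(entry_ids: [...]) root field. The sketch below builds such a query and flattens the response into one row per entry; build_entries_query and summarize are hypothetical helper names, the field selection is an assumption, and the sample response is hand-written for illustration rather than fetched.

```python
import json


def build_entries_query(pdb_ids: list) -> str:
    """Return a GraphQL query selecting a few fields for several entries."""
    # A JSON list of strings is also a valid GraphQL list literal.
    ids = json.dumps([pdb_id.upper() for pdb_id in pdb_ids])
    return (
        "{ entries(entry_ids: %s) { rcsb_id struct { title } "
        "exptl { method } rcsb_entry_info { resolution_combined } } }" % ids
    )


def summarize(response: dict) -> list:
    """Flatten a GraphQL response into one dict per entry, tolerating gaps."""
    rows = []
    for entry in (response.get("data") or {}).get("entries") or []:
        resolution = (entry.get("rcsb_entry_info") or {}).get("resolution_combined")
        rows.append({
            "id": entry.get("rcsb_id"),
            "title": (entry.get("struct") or {}).get("title"),
            "method": (entry.get("exptl") or [{}])[0].get("method"),
            # resolution_combined is a list; keep the first value if present
            "resolution": resolution[0] if resolution else None,
        })
    return rows


# Hand-written stand-in for a response (illustrative values, not fetched).
sample = {"data": {"entries": [
    {"rcsb_id": "4HHB", "struct": {"title": "example title"},
     "exptl": [{"method": "X-RAY DIFFRACTION"}],
     "rcsb_entry_info": {"resolution_combined": [1.74]}},
    {"rcsb_id": "XXXX", "struct": None, "exptl": None, "rcsb_entry_info": None},
]}}

print(build_entries_query(["4hhb", "1a3n"]))
for row in summarize(sample):
    print(row)
```

The query string would be sent as {"query": ...} in a POST to https://data.rcsb.org/graphql; keeping the flattening step separate makes it easy to test without network access.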

Coordinate Downloads and Formats

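PDB: legacy fixed-column text format; widely supported, but it cannot hold more than 99,999 atoms or 62 chains, so some large entries have no .pdb file.
mmCIF (PDBx/mmCIF): the current archive standard; preferred for completeness and the safe choice for large structures.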

Direct download endpoints:

https://files.rcsb.org/download/{PDB_ID}.pdb
https://files.rcsb.org/download/{PDB_ID}.cif
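
A small helper can build these URLs and skip files that are already on disk. This is a sketch using only the standard library; coordinate_url and download_coordinates are hypothetical names, and the ID check assumes classic four-character PDB IDs (a digit followed by three alphanumerics).

```python
import re
import urllib.request
from pathlib import Path

DOWNLOAD_BASE = "https://files.rcsb.org/download"
# Assumption: classic four-character IDs such as 4HHB.
_PDB_ID = re.compile(r"[0-9][A-Za-z0-9]{3}")


def coordinate_url(pdb_id: str, fmt: str = "cif") -> str:
    """Return the direct download URL for an entry in 'pdb' or 'cif' format."""
    if fmt not in ("pdb", "cif"):
        raise ValueError("fmt must be 'pdb' or 'cif'")
    if not _PDB_ID.fullmatch(pdb_id):
        raise ValueError(f"not a four-character PDB ID: {pdb_id!r}")
    return f"{DOWNLOAD_BASE}/{pdb_id.upper()}.{fmt}"


def download_coordinates(pdb_id: str, out_dir, fmt: str = "cif") -> Path:
    """Download one coordinate file unless it already exists locally."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    target = out_dir / f"{pdb_id.upper()}.{fmt}"
    if not target.exists():
        url = coordinate_url(pdb_id, fmt)
        with urllib.request.urlopen(url, timeout=60) as response:
            target.write_bytes(response.read())
    return target


print(coordinate_url("4hhb"))
print(coordinate_url("4HHB", "pdb"))
```

Defaulting to mmCIF avoids a failed request for entries that are too large for the legacy format; download_coordinates("4HHB", "pdb_out") would fetch the file over the network.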

Batch Processing Pattern

For batch metadata retrieval, iterate over IDs (or chunks of IDs) and issue one Data API request for each, e.g. DataQuery(input_type="entries", input_ids=[pdb_id], ...); handle exceptions per ID so that a single failure does not abort the pipeline. For large batches, rate-limit requests and cache results to avoid repeated downloads.
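
The batch pattern can be sketched independently of the client by passing in the function that fetches one ID. fetch_batch and its parameters are hypothetical names; the fixed delay and in-memory cache are simple assumptions that a real pipeline might replace with a proper rate limiter and an on-disk cache.

```python
import time
from typing import Callable, Dict, Iterable, Optional, Tuple


def fetch_batch(
    pdb_ids: Iterable[str],
    fetch_one: Callable[[str], dict],
    delay_s: float = 0.2,
    cache: Optional[Dict[str, dict]] = None,
) -> Tuple[Dict[str, dict], Dict[str, str]]:
    """Fetch metadata for many IDs; return (results, errors) keyed by ID."""
    cache = {} if cache is None else cache
    results: Dict[str, dict] = {}
    errors: Dict[str, str] = {}
    for raw_id in pdb_ids:
        pdb_id = raw_id.upper()
        if pdb_id in cache:  # already fetched: no request, no delay
            results[pdb_id] = cache[pdb_id]
            continue
        try:
            cache[pdb_id] = fetch_one(pdb_id)
            results[pdb_id] = cache[pdb_id]
        except Exception as exc:  # record the failure and keep going
            errors[pdb_id] = f"{type(exc).__name__}: {exc}"
        time.sleep(delay_s)  # crude rate limit between requests
    return results, errors


def fake_fetch(pdb_id: str) -> dict:
    """Stand-in for a real metadata call, used only to show the control flow."""
    if pdb_id == "XXXX":
        raise LookupError("entry not found")
    return {"rcsb_id": pdb_id}


results, errors = fetch_batch(["4hhb", "XXXX", "4HHB"], fake_fetch, delay_s=0)
print(results)
print(errors)
```

Returning errors alongside results lets the caller decide whether to retry, log, or drop the failed IDs; passing the same cache dictionary to later calls avoids refetching entries across runs within one process.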

Reference Documentation

If present in this repository, consult:

references/api_reference.md for advanced endpoint usage, query patterns, schema notes, rate limits, and troubleshooting.
