Agent Skills
Cellxgene Census
AIPOCH
Programmatically query the CZ CELLxGENE Census (61M+ cells) when you need cross-tissue, disease, or cell-type expression data for population-scale queries and reference atlas comparisons.
Files: 6
Total Score: 86 / 100
Core Capability: 84 / 100
- Functional Suitability: 11 / 12
- Reliability: 9 / 12
- Performance & Context: 7 / 8
- Agent Usability: 14 / 16
- Human Usability: 8 / 8
- Security: 10 / 12
- Maintainability: 9 / 12
- Agent-Specific: 16 / 20
Medical Task: 20 / 20 Passed
- 92: Cross-tissue or cross-disease expression comparisons (e.g., macrophages across lung/liver/brain; COVID-19 vs control) (4/4)
- 88: Reference atlas lookups to contextualize findings from your own single-cell dataset (marker validation, expected expression patterns) (4/4)
- 86: Programmatic access to versioned CZ CELLxGENE Census data (human and mouse) (4/4)
- 86: Query cell (obs) metadata and gene (var) metadata with expressive filter syntax (4/4)
- 86: End-to-end case for programmatic access to versioned CZ CELLxGENE Census data (human and mouse) (4/4)
SKILL.md
When to Use
- Cross-tissue or cross-disease expression comparisons (e.g., macrophages across lung/liver/brain; COVID-19 vs control).
- Reference atlas lookups to contextualize findings from your own single-cell dataset (marker validation, expected expression patterns).
- Population-scale metadata exploration (what tissues/cell types/datasets exist; cell counts by cohort attributes).
- Large-scale expression statistics where results exceed RAM and require out-of-core iteration.
- Model training on curated atlas data (e.g., cell-type classifiers) using the experimental PyTorch integration.
Key Features
- Programmatic access to versioned CZ CELLxGENE Census data (human and mouse).
- Query cell (obs) metadata and gene (var) metadata with expressive filter syntax.
- Retrieve expression as AnnData for small/medium queries via get_anndata().
- Perform out-of-core expression access via SOMA axis_query() and chunked iteration.
- Optional experimental ML utilities (PyTorch dataloaders/datasets).
- Works well with scanpy workflows after loading AnnData.
Dependencies
- cellxgene-census (latest)
- tiledbsoma (latest; required for axis_query() workflows)
- pyarrow (latest; used for chunked table batches)
- anndata (latest; for get_anndata() results)
- scanpy (latest; optional, for downstream analysis)
- torch (latest; optional, for experimental ML integration)
Install:
uv pip install cellxgene-census
Optional (experimental ML helpers):
uv pip install "cellxgene-census[experimental]"
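Because every dependency above is listed as "latest", it can help to record which versions are actually installed before running an analysis. The following is a minimal sketch using only the Python standard library; the helper name installed_version is hypothetical and is not part of cellxgene-census.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Distribution names as listed in the Dependencies section.
packages = ["cellxgene-census", "tiledbsoma", "pyarrow", "anndata", "scanpy", "torch"]
report = {name: installed_version(name) for name in packages}

for name, version in report.items():
    print(f"{name}: {version or 'not installed'}")
```

Saving this report alongside the pinned census_version gives a fuller record of the environment a result came from.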
Example Usage
The following script is a complete, runnable example that:
- opens a pinned Census version,
- explores metadata,
- loads a small AnnData slice, and
- runs an out-of-core query to compute a simple statistic.
import numpy as np
import cellxgene_census
import tiledbsoma as soma


def main():
    # Pin a version for reproducibility (replace with a valid release if needed)
    census_version = "2023-07-25"
    with cellxgene_census.open_soma(census_version=census_version) as census:
        # 1) Explore summary info (the summary table is stored as label/value rows)
        summary = census["census_info"]["summary"].read().concat().to_pandas()
        total_cells = int(summary.set_index("label").loc["total_cell_count", "value"])
        print(f"Census version: {census_version}")
        print(f"Total cells: {total_cells:,}")

        # 2) Explore obs metadata (always filter primary data unless you want duplicates)
        obs = cellxgene_census.get_obs(
            census,
            "Homo sapiens",
            value_filter="tissue_general == 'brain' and is_primary_data == True",
            column_names=["cell_type", "tissue_general", "disease", "donor_id"],
        )
        print(f"Brain (primary) cells returned (metadata only): {len(obs):,}")
        print("Top cell types:")
        print(obs["cell_type"].value_counts().head(10))

        # 3) Small/medium query -> AnnData in memory
        adata = cellxgene_census.get_anndata(
            census=census,
            organism="Homo sapiens",
            obs_value_filter=(
                "cell_type == 'T cell' and disease == 'COVID-19' and is_primary_data == True"
            ),
            var_value_filter="feature_name in ['CD4', 'CD8A', 'FOXP3']",
            obs_column_names=["cell_type", "tissue_general", "disease", "donor_id", "sex"],
        )
        print(adata)
        print("AnnData X shape:", adata.X.shape)

        # 4) Large-scale pattern -> out-of-core iteration with axis_query()
        # Example: compute the mean of the stored (non-zero) expression values
        # for a few genes in brain. The query is closed when the block exits.
        n = 0
        s = 0.0
        with census["census_data"]["homo_sapiens"].axis_query(
            measurement_name="RNA",
            obs_query=soma.AxisQuery(
                value_filter="tissue_general == 'brain' and is_primary_data == True"
            ),
            var_query=soma.AxisQuery(
                value_filter="feature_name in ['FOXP2', 'TBR1', 'SATB2']"
            ),
        ) as query:
            for batch in query.X("raw").tables():
                # batch is a pyarrow.Table with at least: soma_data, soma_dim_0, soma_dim_1
                values = batch["soma_data"].to_numpy(zero_copy_only=False)
                n += values.size
                s += float(values.sum())

        mean_expr = s / n if n else np.nan
        print(f"Out-of-core mean expression (over returned entries): {mean_expr:.6g}")


if __name__ == "__main__":
    main()
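Step 4 of the script pools every returned entry across all three genes, and a sparse layer typically stores only non-zero values, so the printed figure is a mean over non-zero entries rather than a per-gene mean over all cells. A per-gene mean that counts zeros needs each gene's sum divided by the number of selected cells. The following is a minimal sketch of that accumulation on toy data, assuming each chunk has already been reduced to two plain sequences (gene coordinates and values); the helper per_gene_means and its arguments are hypothetical and not part of cellxgene-census or tiledbsoma.

```python
from collections import defaultdict


def per_gene_means(batches, n_cells):
    """Accumulate per-gene sums over chunks, then divide by the cell count.

    batches: iterable of (gene_coords, values) pairs, one pair per chunk.
    n_cells: number of cells selected by the obs filter (zeros included).
    """
    sums = defaultdict(float)
    for gene_coords, values in batches:
        for gene, value in zip(gene_coords, values):
            sums[gene] += float(value)
    if n_cells <= 0:
        return {}
    return {gene: total / n_cells for gene, total in sums.items()}


# Toy data standing in for two chunks of a sparse matrix:
# gene 7 has entries 2.0 and 4.0, gene 9 has one entry 3.0, over 4 cells.
toy_batches = [([7, 9], [2.0, 3.0]), ([7], [4.0])]
means = per_gene_means(toy_batches, n_cells=4)
print(means)
```

With real Census batches, the pairs could plausibly be taken from batch["soma_dim_1"].to_pylist() and batch["soma_data"].to_pylist(), and n_cells from the query's n_obs attribute; treat both as assumptions to check against the installed tiledbsoma version. Only one running sum per gene is held in memory, which is what keeps the approach out-of-core.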
Implementation Details
- Opening the Census
  - Use a context manager to ensure resources are released: with cellxgene_census.open_soma(...) as census: ...
  - For reproducibility, set census_version="YYYY-MM-DD"; otherwise the latest stable release is used.
- Data model (high level)
  - Census data is stored in SOMA collections.
  - census["census_info"] provides summary tables (e.g., datasets, counts).
  - census["census_data"][organism] provides the experiment for an organism (e.g., homo_sapiens).
- Filtering semantics
  - obs_value_filter filters cells (obs); var_value_filter filters genes (var).
  - Combine predicates with and/or; use in [...] for multi-value membership.
  - Best practice: include is_primary_data == True to avoid double-counting cells that appear in multiple source datasets.
- Choosing an access pattern
  - Use get_anndata() when the result is expected to fit in memory (commonly < ~100k cells, depending on gene count and sparsity).
  - Use axis_query() + query.X("raw").tables() for out-of-core iteration and incremental statistics.
- Expression layers / matrices
  - Examples commonly use X("raw") to access raw expression.
  - Chunk iteration yields Arrow tables with:
    - soma_data: expression values
    - soma_dim_0: obs (cell) coordinates
    - soma_dim_1: var (gene) coordinates
- Optional ML integration
  - The cellxgene_census.experimental.ml utilities provide PyTorch-friendly datasets/dataloaders for training workflows, typically driven by the same obs/var filtering concepts used elsewhere.
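The filtering semantics described above (equality predicates, in [...] membership, and the is_primary_data guard joined with and) can be assembled programmatically rather than typed by hand. The following is a minimal sketch that only builds the filter string; build_value_filter and its parameters are hypothetical names introduced for illustration, not functions of cellxgene-census.

```python
def build_value_filter(equals=None, any_of=None, primary_only=True):
    """Compose a value_filter string from simple criteria.

    equals: dict of column -> single value, rendered as column == value
    any_of: dict of column -> list of values, rendered as column in [...]
    primary_only: append is_primary_data == True (an obs column) when set
    """
    clauses = []
    for column, value in (equals or {}).items():
        clauses.append(f"{column} == {value!r}")
    for column, values in (any_of or {}).items():
        clauses.append(f"{column} in {list(values)!r}")
    if primary_only:
        clauses.append("is_primary_data == True")
    return " and ".join(clauses)


# Cell (obs) filter: one tissue, two disease states, primary data only.
obs_filter = build_value_filter(
    equals={"tissue_general": "brain"},
    any_of={"disease": ["COVID-19", "normal"]},
)

# Gene (var) filter: is_primary_data does not apply to genes, so switch it off.
var_filter = build_value_filter(
    any_of={"feature_name": ["CD4", "CD8A", "FOXP3"]},
    primary_only=False,
)

print(obs_filter)
print(var_filter)
```

The resulting strings can be passed as obs_value_filter and var_value_filter to get_anndata(), or as value_filter to get_obs() and soma.AxisQuery. The sketch covers only and-joined clauses; or-combinations would need explicit grouping.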