Name: Cellxgene Census
Author: AIPOCH

When to Use

Cross-tissue or cross-disease expression comparisons (e.g., macrophages across lung/liver/brain; COVID-19 vs control).
Reference atlas lookups to contextualize findings from your own single-cell dataset (marker validation, expected expression patterns).
Population-scale metadata exploration (what tissues/cell types/datasets exist; cell counts by cohort attributes).
Large-scale expression statistics where the queried data exceed RAM and require out-of-core iteration.
Model training on curated atlas data (e.g., cell-type classifiers) using the experimental PyTorch integration.

Key Features

Programmatic access to versioned CZ CELLxGENE Census data (human and mouse).
Query cell (obs) metadata and gene (var) metadata with expressive filter syntax.
Retrieve expression as AnnData for small/medium queries via get_anndata().
Perform out-of-core expression access via SOMA axis_query() and chunked iteration.
Optional experimental ML utilities (PyTorch dataloaders/datasets).
Works well with scanpy workflows after loading AnnData.

Dependencies

cellxgene-census (latest)
tiledbsoma (installed as a dependency of cellxgene-census; required for axis_query() workflows)
pyarrow (latest; used for chunked table batches)
anndata (latest; for get_anndata() results)
scanpy (latest; optional, for downstream analysis)
torch (latest; optional, for experimental ML integration)

Install:

uv pip install cellxgene-census

Optional (experimental ML helpers):

uv pip install "cellxgene-census[experimental]"

Example Usage

The following script is a complete, runnable example that:

- opens a pinned Census version,
- explores metadata,
- loads a small AnnData slice, and
- runs an out-of-core query to compute a simple statistic.

import numpy as np
import cellxgene_census
import tiledbsoma as soma

def main():
    # Pin a version for reproducibility (replace with a valid release if needed)
    census_version = "2023-07-25"

    with cellxgene_census.open_soma(census_version=census_version) as census:
        # 1) Explore summary info
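        summary = census["census_info"]["summary"].read().concat().to_pandas()
        # The summary table stores one (label, value) pair per row, not one column per statistic
        total_cells = int(summary.set_index("label").loc["total_cell_count", "value"])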
        print(f"Census version: {census_version}")
        print(f"Total cells: {total_cells:,}")

        # 2) Explore obs metadata (always filter primary data unless you want duplicates)
        obs = cellxgene_census.get_obs(
            census,
            "homo_sapiens",
            value_filter="tissue_general == 'brain' and is_primary_data == True",
            column_names=["cell_type", "tissue_general", "disease", "donor_id"],
        )
        print(f"Brain (primary) cells returned (metadata only): {len(obs):,}")
        print("Top cell types:")
        print(obs["cell_type"].value_counts().head(10))

        # 3) Small/medium query -> AnnData in memory
        adata = cellxgene_census.get_anndata(
            census=census,
            organism="Homo sapiens",
            obs_value_filter=(
                "cell_type == 'T cell' and disease == 'COVID-19' and is_primary_data == True"
            ),
            var_value_filter="feature_name in ['CD4', 'CD8A', 'FOXP3']",
            obs_column_names=["cell_type", "tissue_general", "disease", "donor_id", "sex"],
        )
        print(adata)
        print("AnnData X shape:", adata.X.shape)

        # 4) Large-scale pattern -> out-of-core iteration with axis_query()
        # Example: mean of the stored (non-zero) expression values, pooled over a few genes in brain.
        n = 0
        s = 0.0
        # Open the query as a context manager so its resources are released when iteration ends
        with census["census_data"]["homo_sapiens"].axis_query(
            measurement_name="RNA",
            obs_query=soma.AxisQuery(
                value_filter="tissue_general == 'brain' and is_primary_data == True"
            ),
            var_query=soma.AxisQuery(
                value_filter="feature_name in ['FOXP2', 'TBR1', 'SATB2']"
            ),
        ) as query:
            for batch in query.X("raw").tables():
                # batch is a pyarrow.Table with at least: soma_data, soma_dim_0, soma_dim_1
                values = batch["soma_data"].to_numpy(zero_copy_only=False)
                n += values.size
                s += float(values.sum())

        mean_expr = s / n if n else np.nan
        print(f"Out-of-core mean expression (over returned entries): {mean_expr:.6g}")

if __name__ == "__main__":
    main()

Implementation Details

Opening the Census
- Use a context manager to ensure resources are released:
  - with cellxgene_census.open_soma(...) as census: ...
- For reproducibility, set census_version="YYYY-MM-DD"; if it is omitted, the default "stable" alias resolves to the latest stable release, so results can change as new releases appear.
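As a minimal sketch of the pinning advice above, the helper below checks that a version tag is well-formed before opening the Census. The names is_valid_census_version and read_summary are hypothetical and not part of cellxgene_census. The check only covers the tag format (the aliases "stable" and "latest", or a YYYY-MM-DD date); it does not confirm that a release with that date exists.

```python
from datetime import date


def is_valid_census_version(tag: str) -> bool:
    """Return True for the aliases 'stable'/'latest' or a strict YYYY-MM-DD tag."""
    if tag in ("stable", "latest"):
        return True
    try:
        # Round-trip through isoformat() so only the exact YYYY-MM-DD form passes
        return len(tag) == 10 and date.fromisoformat(tag).isoformat() == tag
    except ValueError:
        return False


def read_summary(census_version: str = "2023-07-25"):
    """Open a pinned Census release and return its summary table as a DataFrame."""
    if not is_valid_census_version(census_version):
        raise ValueError(f"Unexpected census_version: {census_version!r}")
    # Imported lazily: the package and network access are only needed when this is called
    import cellxgene_census

    with cellxgene_census.open_soma(census_version=census_version) as census:
        return census["census_info"]["summary"].read().concat().to_pandas()
```

Calling read_summary() needs network access. To see which releases are actually published, cellxgene_census.get_census_version_directory() can be consulted first.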
Data model (high level)
- Census data is stored in SOMA collections.
- census["census_info"] provides summary tables (e.g., datasets, counts).
- census["census_data"][organism] provides the experiment for an organism (e.g., homo_sapiens).
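The layout above can be explored with a small recursive walk. This is a sketch that assumes SOMA collections behave like Python mappings (keys() plus indexing); the outline helper and the fake_census stand-in are hypothetical, used here so the snippet runs without network access.

```python
from collections.abc import Mapping


def outline(node, name="census", depth=0, max_depth=2):
    """Return indented lines naming a collection and its children, down to max_depth."""
    lines = ["  " * depth + name]
    # Only descend into mapping-like nodes; leaves (e.g., data frames) are just named
    if isinstance(node, Mapping) and depth < max_depth:
        for key in sorted(node.keys()):
            lines.extend(outline(node[key], key, depth + 1, max_depth))
    return lines


# Stand-in with the same top-level shape as described above (leaves are placeholders)
fake_census = {
    "census_info": {"summary": None, "datasets": None},
    "census_data": {"homo_sapiens": None, "mus_musculus": None},
}
print("\n".join(outline(fake_census)))
```

Passing an opened census object instead of fake_census should list the real children, provided the mapping assumption holds for the installed tiledbsoma version.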
Filtering semantics
- obs_value_filter filters cells (obs); var_value_filter filters genes (var).
- Combine predicates with and / or; use in [...] for multi-value membership.
- Best practice: include is_primary_data == True to avoid double-counting cells that appear in multiple source datasets.
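Filter strings are easy to mistype when assembled by hand. The sketch below builds one from keyword arguments, following the rules listed above: equality for single values, in [...] for lists, and predicates joined with and. build_value_filter is a hypothetical helper, not a Census API, and it does no special handling of values that contain quote characters.

```python
def build_value_filter(**conditions) -> str:
    """Join keyword conditions into a value_filter string combined with 'and'."""
    parts = []
    for column, wanted in conditions.items():
        if isinstance(wanted, (list, tuple)):
            # Multi-value membership: column in ['a', 'b']
            parts.append(f"{column} in {list(wanted)!r}")
        else:
            # Single value: column == 'a' (or == True for booleans)
            parts.append(f"{column} == {wanted!r}")
    return " and ".join(parts)


obs_filter = build_value_filter(tissue_general="brain", is_primary_data=True)
var_filter = build_value_filter(feature_name=["FOXP2", "TBR1", "SATB2"])
print(obs_filter)
print(var_filter)
```

The resulting strings can be passed as obs_value_filter and var_value_filter to get_anndata(), or as value_filter to soma.AxisQuery.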
Choosing an access pattern
- Use get_anndata() when the result is expected to fit in memory (commonly < ~100k cells, depending on gene count and sparsity).
- Use axis_query() + query.X("raw").tables() for out-of-core iteration and incremental statistics.
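One way to apply this rule is to count the matching cells from metadata alone before fetching any expression values. The sketch below assumes the rough 100k-cell guideline above as its default threshold. choose_access_pattern and count_cells are hypothetical helper names; count_cells relies on get_obs() as used in the example script.

```python
def choose_access_pattern(n_cells: int, max_in_memory_cells: int = 100_000) -> str:
    """Suggest 'get_anndata' for small results and 'axis_query' otherwise."""
    return "get_anndata" if n_cells < max_in_memory_cells else "axis_query"


def count_cells(census, organism: str, value_filter: str) -> int:
    """Count matching cells by reading only the soma_joinid column of obs."""
    # Imported lazily: the package and network access are only needed when this is called
    import cellxgene_census

    obs = cellxgene_census.get_obs(
        census,
        organism,
        value_filter=value_filter,
        column_names=["soma_joinid"],
    )
    return len(obs)


print(choose_access_pattern(5_000))
print(choose_access_pattern(2_000_000))
```

The threshold is only a starting point: memory use also depends on how many genes are selected and how sparse the matrix is, so it may need to be lowered for wide gene selections.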
Expression layers / matrices
- X("raw") returns the raw count matrix, which is the layer used in the examples here.
- Chunk iteration yields Arrow tables with:
  - soma_data: expression values
  - soma_dim_0: obs (cell) coordinates
  - soma_dim_1: var (gene) coordinates
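Because every chunk carries its own coordinates, statistics can be accumulated per gene without holding the full matrix in memory. The sketch below keeps running sums and counts keyed by soma_dim_1. per_gene_mean and arrow_batches are hypothetical helpers; the first works on plain Python sequences so it can be tried without Census access, and the second adapts Arrow tables to that shape.

```python
from collections import defaultdict


def per_gene_mean(batches):
    """Mean of returned values per var coordinate.

    batches: iterable of (var_coords, values) pairs of equal-length sequences.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for var_coords, values in batches:
        for gene, value in zip(var_coords, values):
            sums[gene] += value
            counts[gene] += 1
    return {gene: sums[gene] / counts[gene] for gene in sums}


def arrow_batches(tables):
    """Adapt pyarrow tables, e.g. from query.X("raw").tables(), to (coords, values) pairs."""
    for table in tables:
        yield table["soma_dim_1"].to_pylist(), table["soma_data"].to_pylist()


# Two small chunks standing in for Arrow batches
means = per_gene_mean([([0, 1, 0], [2.0, 4.0, 6.0]), ([1], [8.0])])
print(means)
```

With a real query, per_gene_mean(arrow_batches(query.X("raw").tables())) would give one mean per var coordinate, computed over the entries that the query returns. Converting to Python lists is simple but slow, so a vectorized version would be preferable for large result sets.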
Optional ML integration
- The cellxgene_census.experimental.ml utilities provide PyTorch-friendly datasets/dataloaders for training workflows, typically driven by the same obs/var filtering concepts used elsewhere.