Agent Skills
Pydeseq2
AIPOCH
Differential gene expression analysis for bulk RNA-seq count matrices using a DESeq2-like workflow in Python; use when you need Wald tests, FDR correction, and optional LFC shrinkage for condition/batch/covariate designs.
105
7
FILES
87100Total Score
View Evaluation ReportCore Capability
87 / 100
Functional Suitability
11 / 12
Reliability
10 / 12
Performance & Context
8 / 8
Agent Usability
14 / 16
Human Usability
8 / 8
Security
9 / 12
Maintainability
10 / 12
Agent-Specific
17 / 20
Medical Task
20 / 20 Passed
91Differential gene expression analysis for bulk RNA-seq count matrices using a DESeq2-like workflow in Python; use when you need Wald tests, FDR correction, and optional LFC shrinkage for condition/batch/covariate designs
4/4
87Differential gene expression analysis for bulk RNA-seq count matrices using a DESeq2-like workflow in Python; use when you need Wald tests, FDR correction, and optional LFC shrinkage for condition/batch/covariate designs
4/4
86End-to-end DESeq2-like workflow: normalization (size factors), dispersion estimation/shrinkage, LFC fitting, outlier handling
4/4
86Wald tests for differential expression with Benjamini–Hochberg FDR (padj)
4/4
86End-to-end case for End-to-end DESeq2-like workflow: normalization (size factors), dispersion estimation/shrinkage, LFC fitting, outlier handling
4/4
SKILL.md
When to Use
Use this skill when you need to run DESeq2-style differential expression in Python, especially in these scenarios:
- Case vs control bulk RNA-seq from a raw integer count matrix (e.g., treated vs control).
- Multi-factor designs to adjust for batch effects or covariates (e.g.,
~ batch + condition,~ age + condition). - DESeq2 migration when converting an R DESeq2 workflow into a Python pipeline.
- Pipeline integration where results must stay in Python objects (pandas/AnnData) for downstream QC, plots, or reporting.
- Requests mentioning “DESeq2”, “differential expression”, “Wald test”, “FDR/padj”, “volcano plot”, “MA plot”, or “PyDESeq2”.
Key Features
- End-to-end DESeq2-like workflow: normalization (size factors), dispersion estimation/shrinkage, LFC fitting, outlier handling.
- Wald tests for differential expression with Benjamini–Hochberg FDR (
padj). - Design formulas in Wilkinson/R-style notation (single-factor and multi-factor).
- Contrast-based comparisons:
[variable, test_group, reference_group]. - Optional Cook’s distance outlier filtering and refitting.
- Optional LFC shrinkage (apeGLM-style) for visualization/ranking.
- Works naturally with pandas and can interoperate with AnnData.
Dependencies
Minimum environment (as documented in the source material):
- Python 3.10–3.11
pydeseq2(install via pip/uv)pandas>= 1.4.3numpy>= 1.23.0scipy>= 1.11.0scikit-learn>= 1.1.1anndata>= 0.8.0 (optional, for AnnData I/O)
Optional plotting:
matplotlib(recommended)seaborn(optional)
Installation:
uv pip install pydeseq2
Example Usage
The following script is a complete, runnable example for a standard treated-vs-control analysis.
import pandas as pd
import numpy as np
from pydeseq2.dds import DeseqDataSet
from pydeseq2.ds import DeseqStats
# -----------------------------
# 1) Load inputs
# -----------------------------
# counts.csv is commonly stored as genes x samples; transpose to samples x genes.
counts_df = pd.read_csv("counts.csv", index_col=0).T
metadata = pd.read_csv("metadata.csv", index_col=0)
# Ensure sample alignment
common = counts_df.index.intersection(metadata.index)
counts_df = counts_df.loc[common]
metadata = metadata.loc[common]
# -----------------------------
# 2) Basic filtering
# -----------------------------
# Remove genes with very low total counts
min_total_counts = 10
genes_to_keep = counts_df.columns[counts_df.sum(axis=0) >= min_total_counts]
counts_df = counts_df[genes_to_keep]
# Drop samples with missing condition
metadata = metadata.dropna(subset=["condition"])
counts_df = counts_df.loc[metadata.index]
# -----------------------------
# 3) Fit DESeq2 model
# -----------------------------
dds = DeseqDataSet(
counts=counts_df,
metadata=metadata,
design="~ condition",
refit_cooks=True,
n_cpus=1,
)
dds.deseq2()
# -----------------------------
# 4) Wald test with contrast
# -----------------------------
ds = DeseqStats(
dds,
contrast=["condition", "treated", "control"],
alpha=0.05,
cooks_filter=True,
independent_filter=True,
)
ds.summary()
# -----------------------------
# 5) Results + optional shrinkage
# -----------------------------
res = ds.results_df.copy()
sig = res[res["padj"] < 0.05].sort_values("padj")
print(f"Significant genes (padj < 0.05): {len(sig)}")
# Optional: shrink LFC for visualization/ranking (p-values do not change)
ds.lfc_shrink()
res_shrunk = ds.results_df.copy()
# Export
res.to_csv("deseq2_results.csv")
res_shrunk.to_csv("deseq2_results_shrunk_lfc.csv")
sig.to_csv("significant_genes.csv")
# -----------------------------
# 6) Minimal volcano plot (optional)
# -----------------------------
try:
import matplotlib.pyplot as plt
plot_df = res.copy()
plot_df["neglog10_padj"] = -np.log10(plot_df["padj"].clip(lower=1e-300))
is_sig = plot_df["padj"] < 0.05
plt.figure(figsize=(9, 5))
plt.scatter(
plot_df.loc[~is_sig, "log2FoldChange"],
plot_df.loc[~is_sig, "neglog10_padj"],
s=10,
alpha=0.3,
c="gray",
label="Not significant",
)
plt.scatter(
plot_df.loc[is_sig, "log2FoldChange"],
plot_df.loc[is_sig, "neglog10_padj"],
s=10,
alpha=0.6,
c="red",
label="padj < 0.05",
)
plt.axhline(-np.log10(0.05), linestyle="--", color="blue", alpha=0.5)
plt.xlabel("Log2 Fold Change")
plt.ylabel("-Log10(adjusted p-value)")
plt.title("Volcano Plot")
plt.legend()
plt.tight_layout()
plt.savefig("volcano_plot.png", dpi=300)
except ImportError:
pass
Implementation Details
Inputs and orientation
- Counts matrix must be samples × genes with non-negative integer counts.
- Many files are stored as genes × samples; transpose with
.Tafter loading.
Design formula (Wilkinson/R-style)
- Use strings like:
~ condition(single factor)~ batch + condition(batch-adjusted)~ age + condition(continuous covariate)~ group + condition + group:condition(interaction)
- Put adjustment variables first (e.g.,
~ batch + condition) so the primary effect is interpreted cleanly.
What dds.deseq2() does (high level)
The fitting pipeline typically includes:
- Size factor estimation (library-size normalization)
- Gene-wise dispersion estimation
- Dispersion trend fitting and prior estimation
- MAP dispersion shrinkage
- Log2 fold change fitting under the specified design
- Cook’s distance outlier detection
- Optional refitting after outlier handling (
refit_cooks=True)
Statistical testing and multiple testing correction
DeseqStats(...).summary()runs Wald tests for the requested coefficient/contrast.- Output columns commonly include:
baseMean: mean normalized expressionlog2FoldChange,lfcSE,statpvalue: raw p-valuepadj: Benjamini–Hochberg FDR adjusted p-value
- Use
padj < alpha(commonly 0.05) for significance.
Contrast specification
- Format:
contrast=["variable", "test_group", "reference_group"] - Example:
["condition", "treated", "control"]tests treated relative to control.
LFC shrinkage (optional)
ds.lfc_shrink()applies shrinkage to log2FoldChange for more stable ranking/plots.- Shrinkage is intended for visualization and prioritization; statistical significance is still based on the (unshrunken) Wald test p-values.
Notes on bundled references/scripts
If your repository includes them, use:
references/api_reference.mdfor parameter/object details.references/workflow_guide.mdfor extended workflows and troubleshooting.scripts/run_deseq2_analysis.pyfor a CLI-style batch workflow (counts/metadata/design/contrast/output, optional plots).