CRISPR Screen Analyzer
Process CRISPR screening data to identify essential genes and hit candidates.
SKILL.md
status: beta
CRISPR Screen Analyzer
Analyze pooled CRISPR screening data from count matrix to hit identification. Covers QC assessment (Gini index, read depth, dropout, replicate correlation), log fold change calculation, z-score-based sgRNA scoring, and multi-threshold hit calling.
Quick Check
python -m py_compile scripts/main.py
python scripts/main.py --help
When to Use
- Identifying essential genes from viability screens (T0 vs T14)
- Calling drug resistance or sensitivity hits from treatment vs control comparisons
- Assessing screen quality before downstream pathway analysis
Upstream: fastqc-report-interpreter → crispr-screen-analyzer
Downstream: crispr-screen-analyzer → go-kegg-enrichment → hit-validation-planner
Workflow
- Confirm the user objective, required inputs, and non-negotiable constraints before doing detailed work.
- Validate that the request matches the documented scope and stop early if the task would require unsupported assumptions.
- Use the packaged script path or the documented reasoning path with only the inputs that are actually available.
- Return a structured result that separates assumptions, deliverables, risks, and unresolved items.
- If execution fails or inputs are incomplete, switch to the fallback path and state exactly what blocked full completion.
Fallback template: If scripts/main.py fails or required files are absent, report: (a) which file is missing, (b) which QC metrics can still be computed, (c) the manual command equivalent for the failed step.
Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
--counts, -c | string | Yes | sgRNA count matrix file |
--samples, -s | string | Yes | Sample annotation CSV file |
--control | string | No | Control sample names (comma-separated) |
--treatment, -t | string | No | Treatment sample names (comma-separated) |
--output, -o | string | No | Output directory |
--fdr | float | No | FDR threshold (default: 0.05; must be between 0 and 1) |
--seed | int | No | Random seed for reproducibility (default: 42) |
Usage
# QC assessment only
python scripts/main.py --counts sgrna_counts.txt --samples samples.csv --output qc_results
# Full differential analysis
python scripts/main.py \
--counts sgrna_counts.txt --samples samples.csv \
--control "Ctrl_1,Ctrl_2,Ctrl_3" \
--treatment "Drug_1,Drug_2,Drug_3" \
--output drug_screen --fdr 0.05 --seed 42
Key QC Thresholds
| Metric | Target | Action if Failed |
|---|---|---|
| Gini index | < 0.3 | Check MOI; consider repeating screen |
| Total reads | > 10M/sample | Increase sequencing depth |
| Zero-count sgRNAs | < 5% | Verify library quality at transduction |
| Replicate correlation | > 0.7 | Investigate batch effects; flag in output |
Note: Replicate correlation (Pearson or Spearman) is computed between replicate samples. Samples with correlation < 0.7 are flagged with a warning in the QC output. This metric is documented but not yet computed in the current script — implementation gap noted.
Hit Classification
| Category | Criteria | Interpretation |
|---|---|---|
| Essential | FDR < 0.05, LFC < −1 | Required for cell viability |
| Drug Sensitive | FDR < 0.05, LFC < −1 | Synthetic lethal with treatment |
| Drug Resistant | FDR < 0.05, LFC > 1 | Confers resistance |
Statistical Method Note
The scoring method uses z-score normalization at the sgRNA level (not true Robust Rank Aggregation). LFC uses pseudocount +1: log2((treatment+1)/(control+1)). FDR correction uses Benjamini-Hochberg. For true gene-level RRA, use MAGeCK or BAGEL2 as downstream tools.
Reproducibility
Set --seed (default: 42) to ensure reproducible results across runs. The seed is recorded in the output metadata file. The script must call np.random.seed(args.seed) at the start of main().
Stress-Case Output Structure
For complex multi-constraint requests, always produce all of these sections:
- Assumptions — list every assumption made about sample groupings, thresholds, and data quality
- Constraints — state which parameters are fixed vs configurable
- Risks — flag QC failures, low replicate counts, or potential batch effects
- Unresolved Items — list anything that requires human decision before proceeding
Output Requirements
Every response must make these explicit:
- Objective and deliverable
- Inputs used and assumptions introduced
- Workflow or decision path taken
- Core result: QC metrics, hit list, summary statistics
- Constraints, risks, caveats (e.g., single-sgRNA hits, off-target concerns)
- Unresolved items and next-step checks
Input Validation
This skill accepts: sgRNA count matrices and sample annotation files from pooled CRISPR screens.
If the request does not involve CRISPR screen data analysis — for example, asking to design sgRNA libraries, perform RNA-seq differential expression, or interpret non-CRISPR genomic data — do not proceed. Instead respond:
"
crispr-screen-analyzeris designed to analyze pooled CRISPR screening data. Your request appears to be outside this scope. Please provide an sgRNA count matrix and sample annotation file, or use a more appropriate tool for your task."
Error Handling
- If required inputs are missing, state exactly which fields are missing and request only the minimum additional information.
- If sample names in
--controlor--treatmentdo not match columns in the count matrix, report the mismatch and list available sample names. - If
--fdris not between 0 and 1, reject with:Error: --fdr must be between 0 and 1. - If the task goes outside the documented scope, stop instead of guessing or silently widening the assignment.
- If
scripts/main.pyfails, report the failure point, summarize what still can be completed safely, and provide a manual fallback. - Do not fabricate files, citations, data, search results, or execution outcomes.
Response Template
- Objective
- Inputs Received
- Assumptions
- Workflow
- Deliverable
- Risks and Limits
- Next Checks