ADME Property Predictor

AIPOCH-AI

Predict the absorption, distribution, metabolism, and excretion properties of candidate molecules, and evaluate their druggability.

Overview

Comprehensive pharmacokinetic prediction tool that assesses drug-likeness and ADME properties of small molecules using validated cheminformatics models, molecular descriptors, and structure-property relationships.

Key Capabilities:

  • Multi-Property Prediction: Absorption, Distribution, Metabolism, Excretion
  • Drug-Likeness Scoring: Lipinski's Rule of 5, Veber rules, QED score
  • Batch Processing: Analyze compound libraries efficiently
  • Structure-Based Insights: Identify liability hotspots and optimization opportunities
  • Comparative Analysis: Rank candidates by predicted PK profile

When to Use

✅ Use this skill when:

  • Screening compound libraries for drug-like properties in early discovery
  • Prioritizing lead compounds for advancement based on predicted PK
  • Identifying ADME liabilities requiring structural optimization
  • Comparing analogs to select candidates with optimal ADME profiles
  • Filtering virtual screening hits before synthesis
  • Generating ADME data for regulatory pre-submission packages
  • Teaching pharmacokinetics and drug design principles

❌ Do NOT use when:

  • Exact PK parameters needed for dosing → Use experimental PK studies
  • Biologics (antibodies, proteins) → Use antibody-pk-predictor
  • Natural products with complex structures → Models trained on synthetic small molecules
  • Prodrugs requiring metabolic activation → Use prodrug-activation-predictor
  • Prediction for clinical dosing decisions → CRITICAL: Experimental validation required
  • Assessing toxicity or safety → Use toxicity-structure-alert or admetox-predictor

Related Skills:

  • Upstream: chemical-structure-converter (structure preparation), lipinski-rule-filter (rule-based filtering)
  • Downstream: drug-candidate-evaluator (integrated scoring), molecular-dynamics-sim (detailed binding)

Integration with Other Skills

Upstream Skills:

  • chemical-structure-converter: Convert between SMILES, InChI, MOL formats
  • lipinski-rule-filter: Initial rule-based drug-likeness screening
  • chemical-structure-converter: Generate 3D conformers for structure-based predictions
  • smiles-de-salter: Remove salt counterions before analysis

Downstream Skills:

  • drug-candidate-evaluator: Multi-parameter optimization including ADME
  • toxicity-structure-alert: Assess safety alongside ADME
  • target-novelty-scorer: Evaluate target uniqueness for selected candidates
  • biotech-pitch-deck-narrative: Create investor materials with PK data

Complete Workflow:

Chemical Structure Converter (prepare structures) → 
  Lipinski Rule Filter (initial filtering) → 
    ADME Property Predictor (this skill, detailed PK) → 
      Drug Candidate Evaluator (integrated scoring) → 
        Toxicity Structure Alert (safety check)

Core Capabilities

1. Absorption (A) Prediction

Predict intestinal absorption, solubility, and permeability:

from scripts.adme_predictor import ADMEPredictor

predictor = ADMEPredictor()

# Predict absorption properties
absorption = predictor.predict_absorption(
    smiles="CC(=O)Oc1ccccc1C(=O)O",  # Aspirin
    properties=["all"]  # or specific: ["hia", "caco2", "solubility"]
)

print(absorption.summary())

Predicted Properties:

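| Property | Model | Units | Interpretation |
|---|---|---|---|
| HIA | ML + physicochemical | % | Human intestinal absorption; >80% good |
| Caco-2 | QSPR | 10⁻⁶ cm/s | Permeability; >70 high, <25 low |
| Solubility | QSPR | mg/mL | Aqueous solubility; >0.1 mg/mL acceptable |
| LogS | QSPR | log(mol/L) | Log of aqueous solubility; >-4 acceptable |
| Lipinski Pass | Rule-based | boolean | Meets the Rule of 5 criteria (MW ≤ 500 Da, LogP ≤ 5, H-bond donors ≤ 5, H-bond acceptors ≤ 10) |
| Veber Pass | Rule-based | boolean | PSA ≤ 140 Å², rotatable bonds ≤ 10 |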
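
The two rule-based rows can be illustrated with a minimal sketch. It assumes descriptors have already been computed elsewhere; the `Descriptors` class and both function names are hypothetical and are not part of this skill's scripts. The aspirin values are approximate and used only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Descriptors:
    """Precomputed descriptors (hypothetical container, not this skill's API)."""
    mol_weight: float          # Da
    logp: float                # octanol-water partition coefficient
    h_bond_donors: int
    h_bond_acceptors: int
    polar_surface_area: float  # Å²
    rotatable_bonds: int


def lipinski_violations(d: Descriptors) -> int:
    """Count violations of the four Rule of 5 criteria."""
    checks = [
        d.mol_weight <= 500,
        d.logp <= 5,
        d.h_bond_donors <= 5,
        d.h_bond_acceptors <= 10,
    ]
    return sum(not ok for ok in checks)


def veber_pass(d: Descriptors) -> bool:
    """Veber criteria: PSA <= 140 Å² and at most 10 rotatable bonds."""
    return d.polar_surface_area <= 140 and d.rotatable_bonds <= 10


# Approximate values for aspirin, for illustration only.
aspirin = Descriptors(180.16, 1.2, 1, 4, 63.6, 3)
print(lipinski_violations(aspirin), veber_pass(aspirin))
```

Counting violations rather than returning a single boolean leaves room for project-specific tolerance, such as accepting one violation.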

Best Practices:

  • ✅ Consider HIA and solubility together (high HIA but low solubility = dissolution-limited)
  • ✅ Caco-2 good for oral absorption prediction; poor for BBB penetration
  • ✅ Use both rule-based (Lipinski) and ML-based predictions for consensus
  • ✅ Check solubility at physiological pH (not just intrinsic)

Common Issues and Solutions:

Issue: Lipinski pass but poor solubility

  • Symptom: "Passes Rule of 5 but LogS = -5"
  • Solution: The Rule of 5 checks MW, LogP, and hydrogen-bond donor/acceptor counts, not solubility directly; use an explicit solubility prediction

Issue: Caco-2 predicts high absorption but HIA low

  • Symptom: "Caco-2 = 85 (high) but HIA = 60%"
  • Solution: Models have different training sets; Caco-2 is in vitro, HIA in vivo; HIA generally more reliable

2. Distribution (D) Prediction

Predict tissue distribution, protein binding, and brain penetration:

# Predict distribution properties
distribution = predictor.predict_distribution(
    smiles="CC(=O)Oc1ccccc1C(=O)O",
    properties=["vd", "ppb", "bbb"]
)

# Access specific predictions
vd = distribution.volume_of_distribution
bbb = distribution.blood_brain_barrier
ppb = distribution.plasma_protein_binding

Predicted Properties:

| Property | Model | Units | Interpretation |
|---|---|---|---|
| Vd | QSPR | L/kg | Volume of distribution; 0.1-10 typical |
| PPB | ML | % | Plasma protein binding; >90% high, <50% low |
| BBB | LogBB | unitless | Brain penetration; >0.3 penetrant |
| fu | Calculated | fraction | Free (unbound) fraction; 1 - PPB/100 |

Best Practices:

  • ✅ High PPB (>90%) lowers the free fraction, which may require higher doses, and can prolong half-life
  • ✅ Low Vd (<0.3) = mainly in plasma; high Vd (>3) = extensive tissue distribution
  • ✅ BBB penetration critical for CNS drugs; avoid for peripherally-acting drugs
  • ✅ fu (free fraction) drives pharmacological activity, not total concentration
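
The relationship fu = 1 - PPB/100 is simple, but it is worth seeing how sensitive the free fraction is at high binding. The sketch below is a standalone illustration; `unbound_fraction` is a hypothetical helper name, not a function from this skill.

```python
def unbound_fraction(ppb_percent: float) -> float:
    """Free fraction from plasma protein binding given in percent."""
    if not 0.0 <= ppb_percent <= 100.0:
        raise ValueError("PPB must be between 0 and 100 percent")
    return 1.0 - ppb_percent / 100.0


# A one-point change in PPB near the top of the scale halves the free fraction.
for ppb in (90.0, 98.0, 99.0):
    print(f"PPB {ppb:.0f}% -> fu = {unbound_fraction(ppb):.2f}")
```

Because small PPB errors translate into large relative errors in fu for highly bound compounds, predicted PPB above 95% deserves extra scrutiny.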

Common Issues and Solutions:

Issue: BBB predictions unreliable for certain chemotypes

  • Symptom: "BBB model gives conflicting predictions for peptides"
  • Solution: Models trained on small molecules; use specialized BBB predictors for peptides, macrocycles

Issue: PPB overestimated for acidic drugs

  • Symptom: "PPB predicted 95% but experimental is 70%"
  • Solution: Some models biased toward neutral/basic compounds; check model training set overlap

3. Metabolism (M) Prediction

Predict metabolic stability, CYP interactions, and liability sites:

# Predict metabolism properties
metabolism = predictor.predict_metabolism(
    smiles="CC(=O)Oc1ccccc1C(=O)O",
    include_site_prediction=True
)

# Check CYP interactions
cyp_profile = metabolism.cyp_profile
stability = metabolism.metabolic_stability

Predicted Properties:

| Property | Model | Output | Interpretation |
|---|---|---|---|
| CYP Inhibition | ML | IC50 or class | Potential DDI; <1 μM high risk |
| CYP Substrate | Classification | Boolean/Probability | Metabolized by specific CYP |
| Stability | ML | T1/2 or class | Microsomal/hepatocyte stability |
| Liability Sites | Reactivity models | Atom indices | Soft spots for metabolism |
| MAO Substrate | Classification | Boolean | Monoamine oxidase substrate |

Best Practices:

  • ✅ Screen for CYP3A4 inhibition early (most common DDI)
  • ✅ Check if compound is CYP substrate (for polymorphism concerns)
  • ✅ Identify metabolic hotspots for structural blocking
  • ✅ Consider species differences (human vs rodent metabolism)

Common Issues and Solutions:

Issue: False negatives for time-dependent inhibition (TDI)

  • Symptom: "No CYP inhibition predicted but TDI observed experimentally"
  • Solution: Standard models predict reversible inhibition; use specialized TDI predictors

Issue: Metabolic site prediction shows multiple hotspots

  • Symptom: "5 different atoms flagged as metabolic liabilities"
  • Solution: Prioritize by reactivity score; consider blocking highest-risk site first

4. Excretion (E) Prediction

Predict clearance routes and elimination kinetics:

# Predict excretion properties
excretion = predictor.predict_excretion(
    smiles="CC(=O)Oc1ccccc1C(=O)O",
    properties=["clearance", "half_life", "route"]
)

# Access predictions
clearance = excretion.clearance_ml_min_kg
t12 = excretion.half_life_hours
route = excretion.primary_route

Predicted Properties:

| Property | Model | Units | Interpretation |
|---|---|---|---|
| CL | QSPR | mL/min/kg | Clearance; <5 low, 5-15 moderate, >15 high |
| T1/2 | QSPR | hours | Half-life; 2-8h typical for oral drugs |
| Route | Classification | renal/biliary/mixed | Primary excretion pathway |
| LogD | QSPR | unitless | Distribution coefficient; affects clearance |

Best Practices:

  • ✅ Half-life determines dosing frequency; steady state is reached after roughly 5 half-lives (about 97% of the plateau)
  • ✅ Renal clearance predictable for polar compounds; hepatic less predictable
  • ✅ High clearance (>15) may require high doses or prodrug approach
  • ✅ Very long T1/2 (>24h) good for adherence but risk accumulation
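
As a rough cross-check between predicted Vd, clearance, and half-life, the standard one-compartment relationship T1/2 = ln(2) × Vd / CL can be sketched as follows. This assumes first-order, one-compartment kinetics, which is a simplification; the function names are hypothetical and independent of this skill's API.

```python
import math


def half_life_hours(vd_l_per_kg: float, cl_ml_min_kg: float) -> float:
    """One-compartment half-life from Vd (L/kg) and clearance (mL/min/kg)."""
    if cl_ml_min_kg <= 0:
        raise ValueError("clearance must be positive")
    cl_l_per_h_kg = cl_ml_min_kg * 60.0 / 1000.0  # mL/min/kg -> L/h/kg
    return math.log(2) * vd_l_per_kg / cl_l_per_h_kg


def fraction_of_steady_state(n_half_lives: float) -> float:
    """Fraction of the steady-state plateau reached after n half-lives."""
    return 1.0 - 0.5 ** n_half_lives


print(f"T1/2 = {half_life_hours(1.0, 10.0):.2f} h")
print(f"After 5 half-lives: {fraction_of_steady_state(5):.1%} of steady state")
```

If the predicted T1/2 differs greatly from the value implied by the predicted Vd and CL, that is a hint the individual models disagree for this compound.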

Common Issues and Solutions:

Issue: Clearance predictions highly variable

  • Symptom: "Same compound, different models give CL = 5 vs 20 mL/min/kg"
  • Solution: Allometry-based methods unreliable for novel scaffolds; use average of multiple models
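
Averaging across models, as suggested above, is more useful when the spread is reported alongside the mean. The sketch below is one possible way to do that; the model names and the 3-fold review threshold are assumptions chosen for illustration, not values used by this skill.

```python
from statistics import mean, median
from typing import Dict


def consensus_clearance(predictions: Dict[str, float], max_fold: float = 3.0) -> dict:
    """Summarize clearance predictions (mL/min/kg) from several models.

    max_fold is an assumed threshold: a larger max/min ratio is flagged for review.
    """
    values = list(predictions.values())
    if not values or min(values) <= 0:
        raise ValueError("need at least one positive prediction")
    fold_range = max(values) / min(values)
    return {
        "mean": mean(values),
        "median": median(values),
        "fold_range": fold_range,
        "needs_review": fold_range > max_fold,
    }


# Hypothetical model names, with the values from the symptom above.
summary = consensus_clearance({"model_a": 5.0, "model_b": 20.0})
print(summary)
```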

Issue: Route prediction contradicts structure

  • Symptom: "Highly polar compound predicted biliary, expected renal"
  • Solution: Check LogP/LogD; polar compounds (<0) usually renal; neutral/lipophilic (>1) usually hepatic
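
The LogD rule of thumb above can be turned into a quick plausibility check on the predicted route. This is a heuristic sketch only: the function names are hypothetical, and treating a "biliary" prediction as hepatic elimination is an assumption made for the comparison.

```python
def expected_route(logd: float) -> str:
    """Rule-of-thumb route from LogD: <0 renal, >1 hepatic, otherwise uncertain."""
    if logd < 0:
        return "renal"
    if logd > 1:
        return "hepatic"
    return "uncertain"


def route_conflict(predicted_route: str, logd: float) -> bool:
    """True when the predicted route contradicts the LogD heuristic."""
    expected = expected_route(logd)
    if expected == "renal":
        return predicted_route == "biliary"
    if expected == "hepatic":
        return predicted_route == "renal"
    return False  # heuristic gives no clear expectation


# The symptom above: a highly polar compound predicted as biliary.
print(route_conflict("biliary", -1.5))
```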

5. Integrated Drug-Likeness Scoring

Overall assessment combining all ADME properties:

# Generate comprehensive drug-likeness score
druglikeness = predictor.calculate_druglikeness(
    smiles="CC(=O)Oc1ccccc1C(=O)O",
    methods=["qed", "muegge", "golden_triangle"]
)

# Multi-parameter optimization
mpo_score = predictor.mpo_score(
    smiles="CC(=O)Oc1ccccc1C(=O)O",
    target_profile={"hia": ">80", "bbb": "<0.3", "t12": "2-8h"}  # criteria passed as strings
)

Scoring Methods:

| Method | Description | Range | Good Score |
|---|---|---|---|
| QED | Quantitative Estimate of Drug-likeness | 0-1 | >0.6 |
| Muegge | Bioavailability score | 0-6 | >4 |
| MPO | Multi-Parameter Optimization | 0-10 | >6 |

Best Practices:

  • ✅ Use QED as quick overall metric; MPO for property-weighted scoring
  • ✅ Don't rely solely on drug-likeness; efficacy and safety equally important
  • ✅ Compare to marketed drugs in same class for context
  • ✅ Track drug-likeness trends during optimization (should improve)
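
To make the idea of a target profile concrete, here is a deliberately simple sketch that scores a compound by the fraction of criteria it meets, scaled to 0-10. It is not how this skill's `mpo_score` is computed; the string syntax (">80", "<0.3", "2-8h"), the equal weighting, and the helper names are all assumptions for illustration.

```python
import re
from typing import Callable, Dict


def parse_criterion(spec: str) -> Callable[[float], bool]:
    """Turn '>80', '<0.3', or '2-8h' into a predicate on a numeric value."""
    spec = spec.strip().lower().rstrip("h%")  # drop a trailing unit suffix
    if spec.startswith(">"):
        limit = float(spec[1:])
        return lambda value: value > limit
    if spec.startswith("<"):
        limit = float(spec[1:])
        return lambda value: value < limit
    match = re.fullmatch(r"(\d+(?:\.\d+)?)-(\d+(?:\.\d+)?)", spec)
    if match:
        low, high = float(match.group(1)), float(match.group(2))
        return lambda value: low <= value <= high
    raise ValueError(f"unrecognized criterion: {spec!r}")


def simple_mpo(predicted: Dict[str, float], profile: Dict[str, str]) -> float:
    """Equal-weight score on a 0-10 scale: share of profile criteria that are met."""
    if not profile:
        raise ValueError("target profile is empty")
    met = sum(
        1 for name, spec in profile.items()
        if name in predicted and parse_criterion(spec)(predicted[name])
    )
    return 10.0 * met / len(profile)


profile = {"hia": ">80", "bbb": "<0.3", "t12": "2-8h"}
print(simple_mpo({"hia": 92.0, "bbb": -0.5, "t12": 3.0}, profile))
```

Real MPO schemes typically use smooth desirability functions and property weights rather than hard pass/fail counts, so near-misses are not penalized as heavily as in this sketch.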

Common Issues and Solutions:

Issue: Drug-likeness score conflicts with project needs

  • Symptom: "CNS drug has low QED (0.5) because high LogP needed for BBB"
  • Solution: Drug-likeness rules biased toward oral drugs; use category-specific models (CNS, oncology, etc.)

6. Batch Processing and Library Screening

Analyze compound libraries efficiently:

# Batch process library
results = predictor.batch_predict(
    input_file="library.smi",  # SMILES file
    properties=["all"],
    output_format="csv",
    n_workers=4  # Parallel processing
)

# Filter by criteria
filtered = results.filter(
    lipinski_pass=True,
    hia__gt=80,
    t12__between=(2, 8)
)

# Rank by multi-parameter score
ranked = results.rank(by="mpo_score", ascending=False)

Best Practices:

  • ✅ Process in batches of 1000-10000 for memory efficiency
  • ✅ Save intermediate results (crash recovery)
  • ✅ Apply filters sequentially (Lipinski first, then detailed ADME)
  • ✅ Check property distributions to identify outliers

Common Issues and Solutions:

Issue: Batch processing runs out of memory

  • Symptom: "Killed: Out of memory" with 50K compounds
  • Solution: Process in chunks; use generators instead of loading all into RAM
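
Chunked, generator-based reading can be sketched with the standard library alone. The example assumes a `.smi` file with one SMILES per line, optionally followed by a name; `iter_smiles` and `chunked` are hypothetical helpers, not functions from `batch_processor.py`.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def iter_smiles(lines: Iterable[str]) -> Iterator[str]:
    """Yield the SMILES token from each non-empty, non-comment line."""
    for line in lines:
        line = line.strip()
        if line and not line.startswith("#"):
            yield line.split()[0]


def chunked(items: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield lists of at most `size` items without materializing the whole input."""
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


# In practice `lines` would be an open file handle, e.g. open("library.smi").
lines = ["CCO ethanol", "", "c1ccccc1 benzene", "CC(=O)O"]
for batch in chunked(iter_smiles(lines), 2):
    print(batch)
```

Each chunk can then be predicted and written to disk before the next one is read, which also gives natural checkpoints for crash recovery.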

Issue: Some compounds fail prediction

  • Symptom: "30% of library returns NaN"
  • Solution: Check for invalid SMILES, unusual atoms, or molecules outside training set domain

Complete Workflow Example

From SMILES to prioritized candidates:

# Step 1: Predict ADME for single compound
python scripts/main.py \
  --smiles "CC(=O)Oc1ccccc1C(=O)O" \
  --properties all \
  --output aspirin_adme.json

# Step 2: Batch process compound library
python scripts/main.py \
  --input library.smi \
  --properties absorption,distribution \
  --format csv \
  --output library_adme.csv

# Step 3: Filter and rank
python scripts/main.py \
  --input library_adme.csv \
  --filter "lipinski_pass=True,hia>80" \
  --rank-by qed \
  --top-n 100 \
  --output top_candidates.csv

Python API Usage:

from scripts.adme_predictor import ADMEPredictor
from scripts.batch_processor import BatchProcessor

# Initialize
predictor = ADMEPredictor()
batch = BatchProcessor()

# Single compound analysis
aspirin = predictor.predict_all("CC(=O)Oc1ccccc1C(=O)O")
print(f"HIA: {aspirin.absorption.hia}%")
print(f"Half-life: {aspirin.excretion.half_life_hours} hours")

# Batch screening
results = batch.process(
    input_file="library.smi",
    predictor=predictor,
    properties=["absorption", "distribution", "excretion"],  # excretion is needed for the t12 filter below
    n_workers=4
)

# Filter good candidates
good_candidates = results[
    (results.lipinski_pass == True) &
    (results.hia > 80) &
    (results.bbb < 0.3) &
    (results.t12.between(2, 8))
]

Expected Output Files:

output/
├── aspirin_adme.json           # Single compound detailed results
├── library_adme.csv            # Batch screening results
└── top_candidates.csv          # Filtered and ranked candidates

Quality Checklist

Pre-Prediction Checks:

  • SMILES string is valid and canonical
  • Salt forms removed (if analyzing parent compound)
  • Tautomeric state appropriate for physiological pH
  • Stereochemistry specified (if relevant for activity)

During Prediction:

  • Compound within model applicability domain (check similarity to training set)
  • No unusual atoms or functional groups (models trained on typical drug-like space)
  • MW in range 100-800 Da (outside range predictions less reliable)
  • Predictions complete (no missing values for critical properties)

Post-Prediction Verification:

  • Drug-likeness scores in reasonable range (sanity check)
  • Individual properties internally consistent (e.g., high LogP predicts low solubility)
  • CRITICAL: Comparison to experimental data if available (validate model for chemotype)
  • Rankings align with medicinal chemistry intuition

Before Making Decisions:

  • CRITICAL: Predictions are NOT experimental data; use for prioritization only
  • Multiple orthogonal models give consistent results
  • Structural alerts checked (toxicity, reactivity)
  • Top candidates selected for experimental validation
  • Documentation of model versions and confidence intervals

For Regulatory Submissions:

  • Model validation documented (training set, test set performance)
  • Applicability domain clearly defined
  • Prediction uncertainty quantified
  • Experimental confirmation for key predictions

Common Pitfalls

Over-Reliance Issues:

  • Treating predictions as experimental facts → Poor decision making

    • ✅ Use predictions for prioritization; experimental validation required for lead optimization
  • Single model dependency → Miss model-specific biases

    • ✅ Compare multiple models; consensus predictions more reliable
  • Ignoring prediction confidence → False sense of certainty

    • ✅ Check confidence intervals; low confidence predictions need higher scrutiny

Input Issues:

  • Invalid or non-canonical SMILES → Wrong compound analyzed

    • ✅ Validate SMILES before prediction; use canonical forms
  • Analyzing salt forms → Properties skewed by counterion

    • ✅ Remove salts using smiles-de-salter; analyze free base/acid
  • Ignoring stereochemistry → Inaccurate predictions for chiral drugs

    • ✅ Specify stereochemistry explicitly; use 3D descriptors if available

Interpretation Issues:

  • Focusing on single property → Miss overall profile

    • ✅ Consider all ADME properties; use integrated scores like QED or MPO
  • Rigid cutoff application → Discard good candidates

    • ✅ Use cutoffs as guidelines; consider project-specific needs
  • Ignoring property correlations → Unrealistic optimization

    • ✅ Recognize trade-offs (e.g., increasing LogP improves BBB but reduces solubility)

Domain Issues:

  • Applying to biologics → Completely inappropriate

    • ✅ These models for small molecules only; use specialized tools for biologics
  • Extrapolating beyond training set → Unreliable predictions

    • ✅ Check applicability domain; novel scaffolds need experimental validation

Workflow Issues:

  • No experimental validation → Continue with false leads

    • ✅ Always validate top predictions experimentally
  • Not documenting model versions → Irreproducible results

    • ✅ Record software version, model versions, prediction dates

Troubleshooting

Problem: All predictions show "out of domain" warning

  • Symptoms: "Compound outside training set" for entire library
  • Causes: Library contains unusual chemotypes (peptidomimetics, macrocycles, etc.)
  • Solutions:
    • Use specialized models for non-traditional chemotypes
    • Check if input format correct (SMILES vs InChI)
    • Verify no strange atoms (metals, silicon, etc.)

Problem: Extreme predictions (negative solubility, >100% absorption)

  • Symptoms: "LogS = -15" or "HIA = 150%"
  • Causes: Model extrapolation errors; invalid input structures
  • Solutions:
    • Check input structure validity
    • Cap extreme values at physiologically plausible limits
    • Flag for manual review if outside typical ranges
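
Capping and flagging can be combined so that a capped value is never silently accepted. In the sketch below, the percentage bounds follow from the definitions of HIA and PPB, while the LogS bounds are an assumed plausibility window; `cap_prediction` and `PLAUSIBLE_RANGES` are hypothetical names.

```python
from typing import Tuple

# Percent-valued properties are bounded by definition; the LogS window is an assumption.
PLAUSIBLE_RANGES = {
    "hia": (0.0, 100.0),
    "ppb": (0.0, 100.0),
    "logs": (-10.0, 2.0),
}


def cap_prediction(name: str, value: float) -> Tuple[float, bool]:
    """Clamp a prediction to its plausible range and report whether it was changed."""
    low, high = PLAUSIBLE_RANGES[name]
    capped = min(max(value, low), high)
    return capped, capped != value


for prop, raw in (("hia", 150.0), ("logs", -15.0), ("hia", 85.0)):
    value, flagged = cap_prediction(prop, raw)
    print(prop, value, "REVIEW" if flagged else "ok")
```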

Problem: Batch processing extremely slow

  • Symptoms: "100 compounds taking 30 minutes"
  • Causes: Single-threaded execution; complex models
  • Solutions:
    • Enable parallel processing (--n-workers 4)
    • Use faster models for initial screening (e.g., simple QSAR models before heavier ML models)
    • Pre-filter with rule-based methods (Lipinski) before detailed ADME

Problem: Inconsistent predictions across runs

  • Symptoms: "Same compound, different predictions on re-run"
  • Causes: Random seed issues; stochastic models
  • Solutions:
    • Set random seeds for reproducibility
    • Use deterministic models when consistency critical
    • Average multiple predictions if stochastic models necessary

Problem: Properties contradict each other

  • Symptoms: "High LogP (4.5) but predicted very soluble"
  • Causes: Model inconsistencies; prediction errors
  • Solutions:
    • Check input structure (tautomeric form matters for both)
    • Lipophilic compounds (LogP > 3) typically have poor solubility
    • Use thermodynamic cycle checks if available

Problem: Cannot process certain file formats

  • Symptoms: "Error: Unsupported format" for SDF or MOL files
  • Causes: Format limitations; parser issues
  • Solutions:
    • Convert to SMILES using chemical-structure-converter
    • Check file encoding (UTF-8 vs Latin-1)
    • Verify structure validity with external tools

References

Available in references/ directory:

  • lipinski_rules.md - Detailed explanation of Rule of 5 and variants
  • qsar_models.md - Technical documentation of predictive models
  • adme_databases.md - Experimental ADME data sources for validation
  • property_ranges.md - Acceptable ranges for marketed drugs by class
  • model_validation.md - Validation statistics and applicability domains
  • cheminformatics_basics.md - Introduction to molecular descriptors

Scripts

Located in scripts/ directory:

  • main.py - CLI interface for ADME prediction
  • adme_predictor.py - Core prediction engine
  • absorption.py - Absorption property models
  • distribution.py - Distribution property models
  • metabolism.py - Metabolism prediction models
  • excretion.py - Excretion and clearance models
  • druglikeness.py - QED, MPO, and other scoring functions
  • batch_processor.py - Library screening and parallel processing
  • validator.py - Input validation and applicability domain checking

Performance and Resources

Prediction Speed:

| Task | Time | Hardware |
|---|---|---|
| Single compound | 0.5-2 sec | CPU |
| 100 compounds | 30-60 sec | CPU |
| 1000 compounds | 5-10 min | CPU |
| 1000 compounds | 2-3 min | 4-core parallel |
| 10,000 compounds | 30-60 min | 4-core parallel |

System Requirements:

  • RAM: 4 GB minimum; 8 GB for large libraries (>10K compounds)
  • Storage: 100 MB for models and dependencies
  • CPU: Multi-core recommended for batch processing
  • No GPU required: All models CPU-based

Optimization Tips:

  • Process libraries in batches of 5000-10000
  • Use rule-based filters (Lipinski) before expensive ML predictions
  • Cache results to avoid re-prediction
  • Parallel processing scales nearly linearly up to 8 cores

Limitations

  • Small Molecules Only: Models trained on drugs with MW 100-800 Da; unreliable for larger compounds
  • pH 7.4 Assumption: Most models predict properties at physiological pH
  • Human-Specific: Predictions for human PK; animal models may differ
  • Healthy Subject Assumption: Does not account for disease states, drug interactions
  • Single Compound: Does not predict formulation effects, salt form impact
  • Static Models: Do not account for induction, inhibition, or time-dependent changes
  • Training Set Bias: Underperforms for novel scaffolds not in training data
  • Qualitative Only: For Go/No-Go decisions; not for precise quantitative predictions
  • No Toxicity: ADME only; use separate tools for safety assessment

Model Accuracy (Typical):

  • LogP: R² = 0.85-0.95 (very good)
  • Solubility: R² = 0.65-0.80 (moderate)
  • HIA: Accuracy = 75-85% (good)
  • BBB: Accuracy = 70-80% (moderate)
  • Metabolic stability: R² = 0.60-0.75 (moderate)
  • T1/2: R² = 0.50-0.65 (challenging)

Version History

  • v1.0.0 (Current): Initial release with 20+ ADME endpoints, QED scoring, batch processing
  • Planned: Integration with PK simulation, population variability modeling, formulation effects

⚠️ CRITICAL DISCLAIMER: These predictions are computational estimates for prioritization and guidance only. They do NOT replace experimental ADME studies required for regulatory submissions or clinical decision-making. Always validate predictions with appropriate in vitro and in vivo assays before advancing compounds.

Parameters

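| Parameter | Type | Default | Description |
|---|---|---|---|
| --smiles | str | Required unless --input is given | SMILES string of the molecule |
| --properties | str | all | Properties to calculate, comma-separated (e.g., absorption,distribution) |
| --format | str | json | Output format |
| --input | str | Required unless --smiles is given | Input file: SMILES file (e.g., library.smi) or CSV with a SMILES column |
| --output | str | Required | Output file for results |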