Bio-Ontology Mapper
Map unstructured data to SNOMED CT or MeSH vocabularies
SKILL.md
Bio-Ontology Mapper
Overview
Biomedical terminology normalization tool that maps free-text clinical and scientific concepts to standardized ontologies for semantic interoperability and data harmonization.
Key Capabilities:
- Multi-Ontology Support: SNOMED CT, MeSH, ICD-10, LOINC, RxNorm
- Entity Extraction: NER for diseases, symptoms, procedures, drugs
- Fuzzy Matching: Handle typos, abbreviations, and synonyms
- Confidence Scoring: Reliability metrics for each mapping
- Batch Processing: Normalize large datasets efficiently
- Cross-Mapping: Translate between ontology systems
When to Use
✅ Use this skill when:
- Normalizing clinical notes for EHR integration
- Standardizing terminology for multi-site studies
- Mapping legacy data to modern ontologies
- Preparing data for clinical data warehouses
- Converting free-text to coded data for analysis
- Building semantic search for biomedical literature
- Teaching biomedical informatics principles
❌ Do NOT use when:
- Clinical diagnosis or decision support → Use clinical decision tools
- Real-time patient care → Latency too high for acute settings
- Replacing expert coding → Use for pre-coding, final review needed
- Processing PHI without de-identification → Ensure HIPAA compliance
Integration:
- Upstream:
clinical-data-cleaner(data preparation),ehr-semantic-compressor(text extraction) - Downstream:
clinical-data-cleaner(SDTM mapping),unstructured-medical-text-miner(NLP pipelines)
Core Capabilities
1. Entity Recognition and Mapping
Extract and map biomedical entities to ontologies:
from scripts.mapper import BioOntologyMapper
mapper = BioOntologyMapper()
# Map clinical text
result = mapper.map_text(
text="Patient has diabetes and hypertension, taking metformin",
ontologies=["snomed", "mesh", "rxnorm"],
confidence_threshold=0.7
)
for entity in result.entities:
print(f"{entity.text} → {entity.concept_id} ({entity.ontology})")
print(f" Preferred: {entity.preferred_term}")
print(f" Confidence: {entity.confidence:.2f}")
Supported Ontologies:
| Ontology | Domain | Use Case |
|---|---|---|
| SNOMED CT | Clinical | EHR interoperability |
| MeSH | Literature | PubMed indexing |
| ICD-10 | Billing | Diagnosis codes |
| LOINC | Labs | Test result standardization |
| RxNorm | Drugs | Medication normalization |
| HGNC | Genes | Gene name standardization |
2. Cross-Ontology Translation
Map concepts between different ontologies:
# Cross-map SNOMED to ICD-10
translation = mapper.cross_map(
source_id="22298006", # SNOMED: Myocardial infarction
source_ontology="snomed",
target_ontology="icd10"
)
print(f"ICD-10: {translation.target_id} - {translation.target_term}")
# Output: I21.9 - Acute myocardial infarction, unspecified
Cross-Mapping Coverage:
- SNOMED CT ↔ ICD-10-CM (clinical modifications)
- MeSH ↔ SNOMED CT (literature to clinical)
- RxNorm ↔ ATC (drug classifications)
- LOINC ↔ SNOMED (lab to clinical)
3. Batch Normalization
Process large datasets:
# Batch process CSV
results = mapper.batch_map(
input_file="clinical_terms.csv",
text_column="diagnosis_description",
ontologies=["snomed", "icd10"],
output_format="csv",
max_workers=4
)
# Results include:
# - Original term
# - Mapped concept ID
# - Confidence score
# - Alternative mappings (if ambiguous)
Performance:
- ~100 terms/second (with caching)
- ~20 terms/second (API lookup)
- Parallel processing for large datasets
4. Confidence Scoring and Validation
Assess mapping reliability:
scoring = mapper.score_mapping(
term="heart attack",
candidate="22298006", # Myocardial infarction
factors=["string_similarity", "context_match", "frequency"]
)
print(f"Overall confidence: {scoring.confidence:.2f}")
print(f"Breakdown: {scoring.factors}")
Scoring Factors:
- String similarity: Levenshtein distance, n-grams
- Context match: Surrounding words alignment
- Frequency: Common usage in corpus
- Semantic similarity: Vector embeddings
Common Patterns
Pattern 1: Clinical Note Normalization
Scenario: Convert free-text diagnoses to SNOMED codes.
# Normalize clinical notes
python scripts/main.py \
--input notes.csv \
--column diagnosis_text \
--ontology snomed \
--threshold 0.8 \
--output coded_diagnoses.csv
# Results: "heart attack" → 22298006 (Myocardial infarction)
Post-Processing:
- Review low-confidence mappings (<0.8)
- Handle ambiguous terms manually
- Validate against clinical context
Pattern 2: Literature Indexing
Scenario: Map research paper keywords to MeSH.
# Map keywords to MeSH
mesh_terms = mapper.map_to_mesh(
keywords=["cancer immunotherapy", "checkpoint inhibitors", "PD-1"],
include_tree_numbers=True,
include_qualifiers=True
)
for term in mesh_terms:
print(f"{term.input} → {term.descriptor}")
print(f" Tree: {term.tree_numbers}")
print(f" Entry terms: {term.synonyms}")
Pattern 3: Drug Name Normalization
Scenario: Standardize medication names across datasets.
# Normalize drug names
drugs = ["Tylenol", "Advil", "Motrin", "acetaminophen"]
for drug in drugs:
result = mapper.map_to_rxnorm(drug)
print(f"{drug} → {result.rxcui}: {result.name}")
# Tylenol → 161: Acetaminophen
# Advil → 5640: Ibuprofen
# Motrin → 5640: Ibuprofen
Pattern 4: EHR Data Harmonization
Scenario: Merge data from multiple hospital systems.
# Harmonize diagnoses from 3 hospitals
python scripts/main.py \
--batch \
--inputs "hospital_a.csv,hospital_b.csv,hospital_c.csv" \
--target-ontology snomed \
--cross-map-to icd10 \
--output harmonized_data.csv
Complete Workflow Example
From free-text to coded database:
from scripts.mapper import BioOntologyMapper
from scripts.validator import MappingValidator
# Initialize
mapper = BioOntologyMapper()
validator = MappingValidator()
# Step 1: Extract entities from text
clinical_note = "Patient has Type 2 diabetes and hypertension..."
entities = mapper.extract_entities(clinical_note)
# Step 2: Map to SNOMED
mappings = []
for entity in entities:
mapping = mapper.map_to_snomed(
entity.text,
context=clinical_note,
top_n=3
)
mappings.append(mapping)
# Step 3: Validate mappings
for mapping in mappings:
validation = validator.validate(
mapping,
check_clinical_plausibility=True
)
if not validation.is_valid:
print(f"Review needed: {mapping}")
# Step 4: Export to database format
db_records = [m.to_database_record() for m in mappings]
Quality Checklist
Pre-Mapping:
- Text preprocessed (lowercase, punctuation handled)
- Abbreviations expanded where possible
- Language identified (multilingual support)
During Mapping:
- Confidence threshold appropriate (>0.7 for clinical)
- Multiple candidates considered for ambiguous terms
- Context used for disambiguation
Post-Mapping:
- Low-confidence mappings flagged for review
- Unmapped terms logged
- CRITICAL: Clinical expert validation for high-stakes use
Before Production:
- Mapping accuracy validated on gold standard
- False positive rate acceptable (<5%)
- Recall acceptable for use case (>90%)
- API rate limits respected
Common Pitfalls
Mapping Errors:
-
❌ Abbreviation ambiguity → "MI" = Myocardial infarction OR Michigan
- ✅ Use context; flag for manual review
-
❌ Outdated terms → Old terminology not in current ontology
- ✅ Use historical mappings; update terminology
-
❌ False confidence → High score for wrong concept
- ✅ Always review top-3 candidates
Technical Issues:
-
❌ API failures → No local fallback
- ✅ Implement caching; use local reference files
-
❌ Version mismatches → Different ontology versions
- ✅ Track ontology version used
-
❌ PHI exposure → Sending patient data to external APIs
- ✅ De-identify before API calls; use local processing when possible
References
Available in references/ directory:
snomed_ct_guide.md- SNOMED CT hierarchy and relationshipsmesh_structure.md- MeSH tree structure and qualifiersontology_mappings.md- Crosswalks between systemsnlp_best_practices.md- Biomedical text processingapi_documentation.md- External service integrationvalidation_datasets.md- Gold standard test sets
Scripts
Located in scripts/ directory:
main.py- CLI interface for mappingmapper.py- Core ontology mapping engineextractor.py- Named entity recognitioncross_mapper.py- Ontology-to-ontology translationscorer.py- Confidence calculationbatch_processor.py- Large dataset handlingvalidator.py- Mapping quality checkscaching.py- Local storage for frequent lookups
Limitations
- Ambiguity: Many-to-many mappings common; context required
- Coverage: Rare diseases and new concepts may not be in ontologies
- Versioning: Ontology updates can change mappings over time
- Language: Best support for English; other languages limited
- Real-time: Not suitable for time-critical clinical applications
- API Dependency: Requires internet for most lookups (caching helps)
⚠️ Critical: Ontology mapping is for research and data integration, not clinical decision-making. Always validate mappings with domain experts before use in patient care contexts. Never process PHI without appropriate de-identification and compliance measures.
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
--term | str | Required | Single term to map |
--input | str | Required | Input file path |
--output | str | Required | Output file path |
--ontology | str | 'both' | |
--threshold | float | 0.7 | |
--format | str | 'json' | |
--use-api | str | Required | Use UMLS/MeSH APIs |
--api-key | str | Required |