Add mechanism-level embeddings and cell types space #244

cmungall · 2026-02-04T01:33:38Z

Summary

Add new cell types/anatomy embedding space for analyzing diseases by their involved cell types (CL) and tissues (UBERON)
Add mechanism-level embeddings that embed individual pathophysiology entries, enabling comparison of specific disease mechanisms across disorders
Add interactive mechanism comparison browser (app/embeddings/mechanisms.html) for visualizing mechanistic overlap between diseases

New Features

Cell Types Embedding Space

New celltypes space using embed_cells.j2 template
Extracts cell types from pathophysiology entries with CL ontology IDs
Added to disease embedding explorer dropdown

Mechanism Comparison Browser

Embeds each pathophysiology entry separately (1235 mechanisms from 304 diseases)
Search and select multiple diseases to compare
Each disease gets a unique color in the shared embedding space
Overlapping clusters indicate shared pathophysiology mechanisms
Toggle context (unselected diseases as gray) for reference
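
As a rough illustration of what "embeds each pathophysiology entry separately" involves, the flattening step might look like the sketch below. The input layout (a `name` field and a `pathophysiology` list on each disorder) and the helper name `flatten_mechanisms` are assumptions made for illustration, not the PR's actual `extract_mechanisms()`; only the `disease_name` and `_mechanism_name` keys appear in the reviewed code. The demo records are toy data.

```python
from typing import Dict, List


def flatten_mechanisms(disorders: List[dict]) -> List[Dict[str, str]]:
    """Turn every pathophysiology entry into its own record for embedding."""
    records: List[Dict[str, str]] = []
    for disorder in disorders:
        disease_name = disorder.get("name", "Unknown")  # assumed field name
        # `or []` also covers disorders whose pathophysiology is None or missing
        for entry in disorder.get("pathophysiology") or []:  # assumed field name
            records.append(
                {
                    "disease_name": disease_name,
                    "_mechanism_name": entry.get("name", "Unknown"),
                    "description": entry.get("description", ""),
                }
            )
    return records


if __name__ == "__main__":
    # Toy data, not taken from the knowledge base
    demo = [
        {"name": "Asthma", "pathophysiology": [{"name": "Mechanism A"}, {"name": "Mechanism B"}]},
        {"name": "COPD", "pathophysiology": [{"name": "Mechanism A"}]},
    ]
    for record in flatten_mechanisms(demo):
        print(record["disease_name"], "->", record["_mechanism_name"])
```

Each record would then be rendered through a template and embedded on its own, which is what allows mechanisms from different diseases to land near each other in the shared space.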

Bug Fixes

Fix .history.yaml files being incorrectly indexed
Fix datetime serialization for edit_history fields
Fix linkml-store default limit of 100 rows (was truncating results)
Add cache clearing on --recreate to prevent stale embeddings
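
The datetime fix can be pictured with a minimal sketch like the one below. It assumes the problem was `datetime` values inside nested `edit_history` structures that a JSON-style serializer rejects; the function name and the `timestamp` key are hypothetical, and the PR's own `_serialize_datetimes()` may differ.

```python
from datetime import date, datetime
from typing import Any


def serialize_datetimes(obj: Any) -> Any:
    """Recursively replace date/datetime values with ISO 8601 strings."""
    if isinstance(obj, (datetime, date)):
        return obj.isoformat()
    if isinstance(obj, dict):
        return {key: serialize_datetimes(value) for key, value in obj.items()}
    if isinstance(obj, (list, tuple)):
        return [serialize_datetimes(value) for value in obj]
    return obj  # strings, numbers, None, etc. pass through unchanged


if __name__ == "__main__":
    # "timestamp" is a hypothetical key used only for this demo
    record = {"edit_history": [{"timestamp": datetime(2026, 2, 4, 1, 33, 38)}]}
    print(serialize_datetimes(record))
```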

Commands

# Index mechanisms
just embed-index-mechanisms

# Export browser data  
just embed-mechanisms-data

# Or both:
just embed-mechanisms-all

Test plan

Verify just embed-index-grouped completes without errors
Verify just embed-app-data exports all 313 disorders per space
Verify just embed-mechanisms-all indexes 1235 mechanisms
Test mechanism browser: search "Asthma", add "COPD", verify mechanism overlap visualization
Verify cell types space works in disease explorer

🤖 Generated with Claude Code

New features:
- Cell types/anatomy embedding space (embed_cells.j2)
- Mechanism-level embeddings for comparing pathophysiology across diseases
- Interactive mechanism comparison browser (mechanisms.html)

Changes:
- Add celltypes dropdown to disease embedding explorer
- Add embed_mechanism.j2 template for individual pathophys entries
- Add index_mechanisms() and export_mechanisms_data() methods
- Add just targets: embed-index-mechanisms, embed-mechanisms-data, embed-mechanisms-all
- Consolidate embed-search and embed-similar into single targets with space arg
- Fix history file exclusion in load_disorders()
- Fix datetime serialization for edit_history fields
- Fix linkml-store default limit (100) issue for large collections
- Add cache clearing on recreate to prevent stale embeddings
- Add comprehensive embeddings documentation (docs/embeddings.md)

Co-Authored-By: Claude Opus 4.5 <[email protected]>

claude · 2026-02-04T01:33:54Z

Code Review for PR #244: Add mechanism-level embeddings and cell types space

Review Summary

This PR adds significant value to the dismech project by introducing mechanism-level embeddings and a cell types embedding space. The implementation is generally well-designed and follows good practices.

✅ Strengths

Architecture & Design:

Clean separation of concerns: New embedding spaces (celltypes, mechanisms) are well-integrated into existing architecture
Consistent patterns: Follows established conventions for indexing, caching, and export functions
Modular design: Jinja2 templates cleanly separate content extraction logic

Code Quality:

Good error handling: Proper validation of cache paths and data existence (e.g., src/dismech/embed.py:458-462)
Defensive programming: Safe dictionary access with .get() and default values throughout
Clear function signatures: Well-documented parameters and return types

User Experience:

Interactive browser: The mechanism comparison tool (app/embeddings/mechanisms.html) provides excellent visualization capabilities
Comprehensive commands: Well-organized justfile commands for different workflows
Good documentation: Clear explanations in docs/embeddings.md

⚠️ Areas for Improvement

Performance Considerations:

Large limit values: Multiple uses of limit=10000 (lines 181, 203, etc.) could cause performance issues as the dataset grows
- Recommendation: Consider implementing pagination or streaming for large datasets
- Impact: Low risk currently but may become problematic with growth
Memory usage: Loading all embeddings into memory for UMAP/t-SNE computation (src/dismech/embed.py:395)
- Recommendation: Consider batch processing for very large datasets
- Impact: Medium risk for scalability

Potential Bugs:

Race condition risk: Cache clearing in _clear_cache() could interfere with concurrent operations
- Recommendation: Add file locking or atomic operations
- Impact: Low risk in single-user scenarios
Missing validation: Nothing verifies that the 10-disease cap (selectedDiseases.length >= 10) is properly enforced in the UI
- Location: app/embeddings/mechanisms.html:413
- Impact: Minor - could allow > 10 selections
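
One way to read the "atomic operations" recommendation for cache clearing is rename-then-delete, sketched below. This is an assumption-laden illustration: the actual `_clear_cache()` is not shown in this review, the function name `clear_cache` is hypothetical, and the sketch does not know whether the real cache is a file or a directory, so it handles both.

```python
import os
import shutil
import tempfile
import uuid
from pathlib import Path


def clear_cache(cache_path: Path) -> None:
    """Remove a cache file or directory by renaming it first, then deleting."""
    # A unique sibling name, so the rename never collides with an existing path
    trash = cache_path.with_name(f"{cache_path.name}.stale-{uuid.uuid4().hex}")
    try:
        # On POSIX, a rename within one filesystem is atomic: the cache path
        # disappears in one step instead of being emptied file by file.
        os.rename(cache_path, trash)
    except FileNotFoundError:
        return  # nothing to clear, or another process got there first
    if trash.is_dir():
        shutil.rmtree(trash, ignore_errors=True)
    else:
        trash.unlink()


if __name__ == "__main__":
    root = Path(tempfile.mkdtemp())
    cache = root / "cache"
    cache.mkdir()
    (cache / "embeddings.bin").write_text("stale")
    clear_cache(cache)
    print(cache.exists())
```

Rename-then-delete narrows the window in which a concurrent reader could see a half-deleted cache; it does not replace file locking if several writers can rebuild the cache at once.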

Security & Robustness:

XSS potential: Direct string interpolation in HTML generation
- Location: mechanisms.html:441-446 - d.replace(/'/g, "\\'")
- Recommendation: Use proper HTML escaping or DOM manipulation
- Impact: Low risk with controlled data sources
Resource limits: No explicit limits on embedding computation
- Recommendation: Add timeout configuration for OpenAI API calls
- Impact: Low risk but could cause hangs

🔧 Minor Improvements

Code Organization:

Consider extracting color/styling constants to a shared module
The _serialize_datetimes() function could be moved to a utilities module

Documentation:

Add JSDoc comments to the mechanism browser JavaScript functions
Consider adding type hints to more functions in embed.py

🧪 Test Coverage

Missing Tests:

No unit tests for the new extract_mechanisms() function
No tests for the cell types embedding template
Browser interface lacks automated testing

Recommendations:

Add tests for mechanism extraction logic
Test error handling in embedding export functions
Consider end-to-end tests for the browser interface

📊 Performance Analysis

Efficient Patterns:

✅ Good use of caching to avoid redundant API calls
✅ Proper use of DuckDB for efficient embedding storage
✅ Lazy loading of large dependencies (UMAP, t-SNE)

Optimization Opportunities:

Consider pre-computing common dimensionality reductions
Implement progressive loading for large visualization datasets

🎯 Overall Assessment

This is a high-quality PR that adds valuable functionality while maintaining code quality standards. The implementation follows established patterns and provides comprehensive documentation. While there are some minor areas for improvement around performance and testing, these don't block the core functionality.

Recommendation: ✅ APPROVE with consideration for the minor improvements mentioned above in future iterations.

🔗 Specific Code References

Cache management: src/dismech/embed.py:154-159
Mechanism extraction: src/dismech/embed.py:117-140
Interactive visualization: app/embeddings/mechanisms.html:400-500
New templates: src/dismech/templates/embed_cells.j2, embed_mechanism.j2

Copilot

Pull request overview

This pull request adds two major features to the dismech embedding system: a new cell types/anatomy embedding space and a mechanism-level comparison browser. The PR also includes several important bug fixes related to file indexing, datetime serialization, and query limits.

Changes:

Adds celltypes embedding space for analyzing diseases by their involved cell types (CL ontology) and tissues (UBERON ontology)
Adds mechanism-level embeddings that embed individual pathophysiology entries, enabling comparison of specific disease mechanisms across disorders
Adds interactive mechanism comparison browser (app/embeddings/mechanisms.html) for visualizing mechanistic overlap between diseases
Fixes bugs: .history.yaml files being indexed, datetime serialization errors, and linkml-store's default 100-row limit truncating results
Adds cache clearing on --recreate flag to prevent stale embeddings

Reviewed changes

Copilot reviewed 7 out of 9 changed files in this pull request and generated 6 comments.

File	Description
src/dismech/templates/embed_mechanism.j2	New Jinja2 template for mechanism-level embeddings
src/dismech/templates/embed_cells.j2	New Jinja2 template for cell types/anatomy embeddings
src/dismech/embed.py	Core changes: datetime serialization, extract_mechanisms function, new indexing methods for celltypes and mechanisms, mechanism data export, CLI updates
project.justfile	New commands for mechanism indexing/export, updated existing commands to support new spaces
docs/embeddings.md	Comprehensive documentation for the entire embedding system including new features
app/embeddings/mechanisms.html	New interactive browser for mechanism-level comparison
app/embeddings/index.html	Added celltypes option to space dropdown

Copilot · 2026-02-04T01:40:13Z

app/embeddings/mechanisms.html

+        function addDisease(disease) {
+            if (selectedDiseases.includes(disease) || selectedDiseases.length >= 10) return;
+            selectedDiseases.push(disease);
+            updateChips();
+            updatePlot();
+        }


The UI silently prevents adding more than 10 diseases without providing user feedback. This could be confusing - users might click "Add" and wonder why nothing happens.

Consider adding visual feedback when the limit is reached, such as:

Disabling the search input and add button when at limit

Showing a message like "Maximum 10 diseases selected"

Providing a toast notification when attempting to add beyond the limit

This would improve user experience and make the limitation clear.

Copilot · 2026-02-04T01:40:14Z

src/dismech/embed.py

        coll.attach_indexer(indexer)
        # Index all objects in the collection
-        rows = coll.find().rows
+        rows = coll.find(limit=10000).rows


The hardcoded limit of 10000 could become a scalability issue as the knowledge base grows. Currently, the PR description mentions 313 disorders and 1235 mechanisms, which are well under this limit. However, consider making this configurable or at least documenting this limitation clearly.

As a best practice, consider either:

Making the limit a class constant (e.g., DEFAULT_QUERY_LIMIT = 10000) that can be overridden

Adding a check to warn if the result set is exactly at the limit, which might indicate truncation

Documenting this limitation in the method docstrings

This is particularly important for the find() calls where results are expected to return all items.
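
A minimal sketch of the first two suggestions (an overridable limit constant plus a truncation warning) follows. The wrapper `fetch_all` and its callable argument are hypothetical and used here so the example runs without linkml-store; `DEFAULT_QUERY_LIMIT` is the name proposed in the comment above.

```python
import warnings
from typing import Callable, List, Sequence

DEFAULT_QUERY_LIMIT = 10000  # constant name proposed in the review comment


def fetch_all(
    find_rows: Callable[[int], Sequence[dict]],
    limit: int = DEFAULT_QUERY_LIMIT,
) -> List[dict]:
    """Fetch up to `limit` rows and warn when the result may be truncated."""
    rows = list(find_rows(limit))
    if len(rows) >= limit:
        # Hitting the limit exactly is ambiguous: there may be more rows.
        warnings.warn(
            f"Query returned {len(rows)} rows, reaching the limit of {limit}; "
            "results may be truncated.",
            RuntimeWarning,
            stacklevel=2,
        )
    return rows


if __name__ == "__main__":
    data = [{"id": i} for i in range(5)]  # stand-in for a collection
    print(len(fetch_all(lambda n: data[:n], limit=10)))
```

With the collection from the diff above, the call would presumably be `fetch_all(lambda n: coll.find(limit=n).rows)`.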

Copilot · 2026-02-04T01:40:14Z

src/dismech/templates/embed_cells.j2

+{% if p.tissue_term %}  Tissue: {{ p.tissue_term.preferred_term | default(p.tissue_term.term.label if p.tissue_term.term else '') }}{% endif %}
+{% endfor %}
+{% if histopathology %}
+
+Histopathology:


The template references p.tissue_term on pathophysiology entries, but based on the data schema, tissue_term appears to be a field that exists in the datasets section, not in pathophysiology entries.

While the use of {% if p.tissue_term %} makes this safe (it won't error), this condition will likely never be true, making this line dead code. Consider removing it or documenting if there are indeed disorders with tissue_term in their pathophysiology entries.

If cell types and anatomy is the goal, the histopathology section seems more appropriate as it appears to have both cell_type_term and tissue_term fields.

Suggested change

-{% if p.tissue_term %}  Tissue: {{ p.tissue_term.preferred_term | default(p.tissue_term.term.label if p.tissue_term.term else '') }}{% endif %}
 {% endfor %}
 {% if histopathology %}

 Histopathology:

Copilot · 2026-02-04T01:40:14Z

app/embeddings/mechanisms.html

+            container.innerHTML = selectedDiseases.map((d, i) => `
+                <span class="chip" style="background: ${COLORS[i % COLORS.length]}">
+                    ${d}
+                    <span class="count">${counts[d] || 0}</span>
+                    <span class="remove" onclick="removeDisease('${d.replace(/'/g, "\\'")}')">&times;</span>
+                </span>
+            `).join('');
+
+            // Update legend
+            const legend = document.getElementById('legend');
+            legend.innerHTML = selectedDiseases.map((d, i) => `
+                <div class="legend-item">
+                    <span class="legend-swatch" style="background: ${COLORS[i % COLORS.length]}"></span>
+                    <span>${d}</span>


Potential XSS vulnerability: Disease names are inserted into HTML without escaping. If a disease name contains HTML special characters (e.g., <script>, <, >, &), it could lead to XSS attacks.

While disease names in the knowledge base are likely sanitized, it's a security best practice to always escape user-controlled or data-driven content before inserting it into HTML.

Consider using textContent property or a helper function to escape HTML entities. For example:

function escapeHtml(text) {
    const div = document.createElement('div');
    div.textContent = text;
    return div.innerHTML;
}
// Then use: ${escapeHtml(d)}

Or create the elements programmatically rather than using innerHTML with template strings.

Copilot · 2026-02-04T01:40:15Z

app/embeddings/mechanisms.html

+                suggestions.innerHTML = matches.map((d, i) =>
+                    `<div class="suggestion-item" data-disease="${d}">${d}</div>`
+                ).join('');


Potential XSS vulnerability: Disease names are inserted into HTML attributes and content without escaping. While the data-disease attribute has some protection through the dataset API, the disease name in the text content should be escaped.

Use a helper function to escape HTML entities or create elements programmatically to avoid potential XSS issues.

Suggested change

-                suggestions.innerHTML = matches.map((d, i) =>
-                    `<div class="suggestion-item" data-disease="${d}">${d}</div>`
-                ).join('');
+                suggestions.innerHTML = '';
+                matches.forEach((d) => {
+                    const item = document.createElement('div');
+                    item.className = 'suggestion-item';
+                    item.dataset.disease = d;
+                    item.textContent = d;
+                    suggestions.appendChild(item);
+                });

Copilot · 2026-02-04T01:40:15Z

src/dismech/embed.py

+            if idx is None:
+                continue
+
+            disease_name = mech.get("disease_name", "Unknown")
+            mechanism_name = mech.get("_mechanism_name", "Unknown")


Consider adding a warning or debug log when mechanisms are skipped because the text couldn't be matched to an embedding. This would help diagnose issues where the number of exported mechanisms is less than expected.

For example:

if idx is None:
    print(f" Warning: Could not find embedding for mechanism '{mechanism_name}' in disease '{disease_name}'")
    continue

This would make debugging easier if there are template rendering inconsistencies or data issues.

Suggested change

-            if idx is None:
-                continue
-
-            disease_name = mech.get("disease_name", "Unknown")
-            mechanism_name = mech.get("_mechanism_name", "Unknown")
+            disease_name = mech.get("disease_name", "Unknown")
+            mechanism_name = mech.get("_mechanism_name", "Unknown")
+            if idx is None:
+                print(f" Warning: Could not find embedding for mechanism '{mechanism_name}' in disease '{disease_name}'")
+                continue
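
A variant of the same idea using the standard `logging` module instead of `print` is sketched below, with a summary count at the end so a shortfall in exported mechanisms is easy to spot. The function name, the logger name, the `_text` key, and the text-to-index lookup are assumptions for illustration; only `disease_name` and `_mechanism_name` come from the reviewed code.

```python
import logging
from typing import Dict, List, Optional

logger = logging.getLogger("dismech.embed")  # hypothetical logger name


def match_mechanisms(mechanisms: List[dict], text_to_index: Dict[str, int]) -> List[dict]:
    """Pair each mechanism with its embedding index, logging any that are skipped."""
    matched: List[dict] = []
    skipped = 0
    for mech in mechanisms:
        disease_name = mech.get("disease_name", "Unknown")
        mechanism_name = mech.get("_mechanism_name", "Unknown")
        # "_text" is an assumed key holding the rendered template text
        idx: Optional[int] = text_to_index.get(mech.get("_text", ""))
        if idx is None:
            skipped += 1
            logger.warning(
                "Could not find embedding for mechanism %r in disease %r",
                mechanism_name,
                disease_name,
            )
            continue
        matched.append({"disease": disease_name, "mechanism": mechanism_name, "index": idx})
    if skipped:
        logger.warning("Skipped %d of %d mechanisms", skipped, len(mechanisms))
    return matched


if __name__ == "__main__":
    logging.basicConfig(level=logging.WARNING)
    demo = [
        {"disease_name": "Asthma", "_mechanism_name": "Mechanism A", "_text": "text-1"},
        {"disease_name": "COPD", "_mechanism_name": "Mechanism B", "_text": "text-2"},
    ]
    print(match_mechanisms(demo, {"text-1": 0}))
```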

Copilot AI review requested due to automatic review settings February 4, 2026 01:33

Copilot started reviewing on behalf of cmungall February 4, 2026 01:34

cmungall merged commit b9df67f into main Feb 4, 2026
11 checks passed

Copilot AI reviewed Feb 4, 2026
