@cmungall cmungall commented Feb 4, 2026

Summary

  • Add new cell types/anatomy embedding space for analyzing diseases by their involved cell types (CL) and tissues (UBERON)
  • Add mechanism-level embeddings that embed individual pathophysiology entries, enabling comparison of specific disease mechanisms across disorders
  • Add interactive mechanism comparison browser (app/embeddings/mechanisms.html) for visualizing mechanistic overlap between diseases

New Features

Cell Types Embedding Space

  • New celltypes space using embed_cells.j2 template
  • Extracts cell types from pathophysiology entries with CL ontology IDs
  • Added to disease embedding explorer dropdown

Mechanism Comparison Browser

  • Embeds each pathophysiology entry separately (1235 mechanisms from 304 diseases)
  • Search and select multiple diseases to compare
  • Each disease gets a unique color in the shared embedding space
  • Overlapping clusters indicate shared pathophysiology mechanisms
  • Toggle context (unselected diseases as gray) for reference
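
One way to read "overlapping clusters" numerically is to look for cross-disease mechanism pairs whose embedding vectors are nearly parallel. The sketch below is not part of this PR. It assumes each mechanism is available as a name mapped to a vector, and the mechanism names and 3-dimensional vectors are invented purely for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def closest_pairs(mechs_a, mechs_b, threshold=0.9):
    """Return (name_a, name_b, similarity) for cross-disease pairs at or above threshold."""
    pairs = [
        (name_a, name_b, cosine(vec_a, vec_b))
        for name_a, vec_a in mechs_a.items()
        for name_b, vec_b in mechs_b.items()
    ]
    # Most similar pairs first.
    return sorted((p for p in pairs if p[2] >= threshold), key=lambda p: -p[2])


# Toy vectors, invented for illustration only (real embeddings have far more dimensions).
asthma = {
    "airway inflammation": [0.9, 0.1, 0.0],
    "mucus hypersecretion": [0.1, 0.9, 0.1],
}
copd = {
    "chronic airway inflammation": [0.8, 0.2, 0.1],
    "alveolar destruction": [0.0, 0.1, 0.9],
}

shared = closest_pairs(asthma, copd)
for name_a, name_b, sim in shared:
    print(f"{name_a} ~ {name_b}: {sim:.3f}")
```

The 0.9 threshold is arbitrary. A 2D projection can distort distances, so a check in the original embedding space like this one may be a useful complement to the plot.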

Bug Fixes

  • Fix .history.yaml files being incorrectly indexed
  • Fix datetime serialization for edit_history fields
  • Fix linkml-store default limit of 100 rows (was truncating results)
  • Add cache clearing on --recreate to prevent stale embeddings
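
For reference, the history-file fix amounts to a filename filter at load time. The sketch below is a guess at the shape of that filter, not the code in `load_disorders()`. It assumes disorder files sit directly in one directory as `*.yaml`, and `disorder_files` is a hypothetical helper name.

```python
from pathlib import Path


def disorder_files(kb_dir):
    """Yield disorder YAML paths under kb_dir, skipping *.history.yaml sidecar files."""
    for path in sorted(Path(kb_dir).glob("*.yaml")):
        # "Asthma.history.yaml" also matches "*.yaml", so it must be excluded explicitly.
        if path.name.endswith(".history.yaml"):
            continue
        yield path
```

The explicit suffix check matters because a plain `*.yaml` glob also matches the history files.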

Commands

# Index mechanisms
just embed-index-mechanisms

# Export browser data  
just embed-mechanisms-data

# Or both:
just embed-mechanisms-all

Test plan

  • Verify just embed-index-grouped completes without errors
  • Verify just embed-app-data exports all 313 disorders per space
  • Verify just embed-mechanisms-all indexes 1235 mechanisms
  • Test mechanism browser: search "Asthma", add "COPD", verify mechanism overlap visualization
  • Verify cell types space works in disease explorer

🤖 Generated with Claude Code

New features:
- Cell types/anatomy embedding space (embed_cells.j2)
- Mechanism-level embeddings for comparing pathophysiology across diseases
- Interactive mechanism comparison browser (mechanisms.html)

Changes:
- Add celltypes dropdown to disease embedding explorer
- Add embed_mechanism.j2 template for individual pathophys entries
- Add index_mechanisms() and export_mechanisms_data() methods
- Add just targets: embed-index-mechanisms, embed-mechanisms-data, embed-mechanisms-all
- Consolidate embed-search and embed-similar into single targets with space arg
- Fix history file exclusion in load_disorders()
- Fix datetime serialization for edit_history fields
- Fix linkml-store default limit (100) issue for large collections
- Add cache clearing on recreate to prevent stale embeddings
- Add comprehensive embeddings documentation (docs/embeddings.md)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Copilot AI review requested due to automatic review settings February 4, 2026 01:33

claude bot commented Feb 4, 2026

Claude finished @cmungall's task


Code Review for PR #244: Add mechanism-level embeddings and cell types space

Review Checklist:

  • Read repository guidelines (CLAUDE.md)
  • Review main embedding implementation (src/dismech/embed.py)
  • Examine new HTML/JS browser interface files
  • Review Jinja2 templates for embeddings
  • Check justfile command additions
  • Analyze documentation additions
  • Assess code quality and best practices
  • Check for potential bugs and security issues
  • Evaluate performance considerations
  • Review test coverage
  • Provide final feedback summary

Review Summary

This PR adds significant value to the dismech project by introducing mechanism-level embeddings and a cell types embedding space. The implementation is generally well-designed and follows good practices.

Strengths

Architecture & Design:

  • Clean separation of concerns: New embedding spaces (celltypes, mechanisms) are well-integrated into existing architecture
  • Consistent patterns: Follows established conventions for indexing, caching, and export functions
  • Modular design: Jinja2 templates cleanly separate content extraction logic

Code Quality:

  • Good error handling: Proper validation of cache paths and data existence (e.g., src/dismech/embed.py:458-462)
  • Defensive programming: Safe dictionary access with .get() and default values throughout
  • Clear function signatures: Well-documented parameters and return types

User Experience:

  • Interactive browser: The mechanism comparison tool (app/embeddings/mechanisms.html) provides excellent visualization capabilities
  • Comprehensive commands: Well-organized justfile commands for different workflows
  • Good documentation: Clear explanations in docs/embeddings.md

⚠️ Areas for Improvement

Performance Considerations:

  1. Large limit values: Multiple uses of limit=10000 (lines 181, 203, etc.) could cause performance issues as the dataset grows

    • Recommendation: Consider implementing pagination or streaming for large datasets
    • Impact: Low risk currently but may become problematic with growth
  2. Memory usage: Loading all embeddings into memory for UMAP/t-SNE computation (src/dismech/embed.py:395)

    • Recommendation: Consider batch processing for very large datasets
    • Impact: Medium risk for scalability

Potential Bugs:

  1. Race condition risk: Cache clearing in _clear_cache() could interfere with concurrent operations

    • Recommendation: Add file locking or atomic operations
    • Impact: Low risk in single-user scenarios
  2. Missing validation: Nothing verifies that the 10-disease cap (selectedDiseases.length >= 10) is properly enforced in the UI

    • Location: app/embeddings/mechanisms.html:413
    • Impact: Minor - could allow > 10 selections
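
For the cache-clearing concern, one common mitigation is to rename the cache out of the way in a single step and delete the renamed copy afterwards. The sketch below illustrates that pattern only. It does not reproduce `_clear_cache()`, the helper name `clear_cache_atomically` is hypothetical, and it assumes the cache is a single file or directory on a local filesystem.

```python
import os
import shutil
import uuid
from pathlib import Path


def clear_cache_atomically(cache_path):
    """Move the cache aside with one rename, then delete the moved copy.

    Returns True if a cache was removed and False if none existed.
    """
    cache_path = Path(cache_path)
    # A unique sibling name keeps the rename on the same filesystem.
    doomed = cache_path.with_name(f"{cache_path.name}.old-{uuid.uuid4().hex}")
    try:
        os.rename(cache_path, doomed)
    except FileNotFoundError:
        return False
    if doomed.is_dir():
        shutil.rmtree(doomed, ignore_errors=True)
    else:
        doomed.unlink(missing_ok=True)
    return True
```

This narrows the window in which another process can observe a half-deleted cache. It does not coordinate with processes that already hold the cache open, so file locking would still be needed for that case.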

Security & Robustness:

  1. XSS potential: Direct string interpolation in HTML generation

    • Location: mechanisms.html:441-446 - d.replace(/'/g, "\\'")
    • Recommendation: Use proper HTML escaping or DOM manipulation
    • Impact: Low risk with controlled data sources
  2. Resource limits: No explicit limits on embedding computation

    • Recommendation: Add timeout configuration for OpenAI API calls
    • Impact: Low risk but could cause hangs

🔧 Minor Improvements

Code Organization:

  • Consider extracting color/styling constants to a shared module
  • The _serialize_datetimes() function could be moved to a utilities module
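
If the datetime helper were moved to a utilities module, a typed version might look like the sketch below. The body is an assumption about what such a helper does (recursively replacing date and datetime values with ISO 8601 strings) and is not copied from `_serialize_datetimes()`.

```python
from datetime import date, datetime
from typing import Any


def serialize_datetimes(obj: Any) -> Any:
    """Return a copy of obj with date/datetime values replaced by ISO 8601 strings."""
    if isinstance(obj, (datetime, date)):
        return obj.isoformat()
    if isinstance(obj, dict):
        return {key: serialize_datetimes(value) for key, value in obj.items()}
    if isinstance(obj, (list, tuple)):
        return [serialize_datetimes(item) for item in obj]
    return obj


# "timestamp" is an assumed field name inside edit_history, used only for illustration.
record = {"edit_history": [{"timestamp": datetime(2026, 2, 4, 1, 33)}]}
print(serialize_datetimes(record))
```

Returning a new structure instead of mutating the input keeps the helper safe to call on objects that are reused elsewhere.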

Documentation:

  • Add JSDoc comments to the mechanism browser JavaScript functions
  • Consider adding type hints to more functions in embed.py

🧪 Test Coverage

Missing Tests:

  • No unit tests for the new extract_mechanisms() function
  • No tests for the cell types embedding template
  • Browser interface lacks automated testing

Recommendations:

  • Add tests for mechanism extraction logic
  • Test error handling in embedding export functions
  • Consider end-to-end tests for the browser interface
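
As a starting point, tests for the extraction logic could take the shape below. The real signature of `extract_mechanisms()` is not shown in this thread, so `extract_mechanisms_sketch` is a hypothetical stand-in that would be replaced by an import from `dismech.embed`. The `disease_name` and `_mechanism_name` keys come from the export code in this PR. The `pathophysiology`, `name`, and `description` input keys are assumptions.

```python
def extract_mechanisms_sketch(disorder):
    """Hypothetical stand-in: return one flat record per pathophysiology entry."""
    records = []
    for entry in disorder.get("pathophysiology") or []:
        records.append(
            {
                "disease_name": disorder.get("name", "Unknown"),
                "_mechanism_name": entry.get("name", "Unknown"),
                "description": entry.get("description", ""),
            }
        )
    return records


def test_one_record_per_entry():
    disorder = {"name": "Asthma", "pathophysiology": [{"name": "A"}, {"name": "B"}]}
    records = extract_mechanisms_sketch(disorder)
    assert [r["_mechanism_name"] for r in records] == ["A", "B"]
    assert all(r["disease_name"] == "Asthma" for r in records)


def test_missing_pathophysiology_yields_nothing():
    assert extract_mechanisms_sketch({"name": "Asthma"}) == []
    assert extract_mechanisms_sketch({"name": "Asthma", "pathophysiology": None}) == []


def test_unnamed_entry_falls_back_to_unknown():
    records = extract_mechanisms_sketch({"name": "Asthma", "pathophysiology": [{}]})
    assert records[0]["_mechanism_name"] == "Unknown"
```

These functions follow the pytest naming convention, so they would be collected automatically if placed in a test module.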

📊 Performance Analysis

Efficient Patterns:

  • ✅ Good use of caching to avoid redundant API calls
  • ✅ Proper use of DuckDB for efficient embedding storage
  • ✅ Lazy loading of large dependencies (UMAP, t-SNE)

Optimization Opportunities:

  • Consider pre-computing common dimensionality reductions
  • Implement progressive loading for large visualization datasets

🎯 Overall Assessment

This is a high-quality PR that adds valuable functionality while maintaining code quality standards. The implementation follows established patterns and provides comprehensive documentation. While there are some minor areas for improvement around performance and testing, these don't block the core functionality.

Recommendation: ✅ APPROVE with consideration for the minor improvements mentioned above in future iterations.

🔗 Specific Code References

  • Cache management: src/dismech/embed.py:154-159
  • Mechanism extraction: src/dismech/embed.py:117-140
  • Interactive visualization: app/embeddings/mechanisms.html:400-500
  • New templates: src/dismech/templates/embed_cells.j2, embed_mechanism.j2

@cmungall cmungall merged commit b9df67f into main Feb 4, 2026
11 checks passed
Copilot AI left a comment

Pull request overview

This pull request adds two major features to the dismech embedding system: a new cell types/anatomy embedding space and a mechanism-level comparison browser. The PR also includes several important bug fixes related to file indexing, datetime serialization, and query limits.

Changes:

  • Adds celltypes embedding space for analyzing diseases by their involved cell types (CL ontology) and tissues (UBERON ontology)
  • Adds mechanism-level embeddings that embed individual pathophysiology entries, enabling comparison of specific disease mechanisms across disorders
  • Adds interactive mechanism comparison browser (app/embeddings/mechanisms.html) for visualizing mechanistic overlap between diseases
  • Fixes bugs: .history.yaml files being indexed, datetime serialization errors, and linkml-store's default 100-row limit truncating results
  • Adds cache clearing on --recreate flag to prevent stale embeddings

Reviewed changes

Copilot reviewed 7 out of 9 changed files in this pull request and generated 6 comments.

File | Description
src/dismech/templates/embed_mechanism.j2 | New Jinja2 template for mechanism-level embeddings
src/dismech/templates/embed_cells.j2 | New Jinja2 template for cell types/anatomy embeddings
src/dismech/embed.py | Core changes: datetime serialization, extract_mechanisms function, new indexing methods for celltypes and mechanisms, mechanism data export, CLI updates
project.justfile | New commands for mechanism indexing/export, updated existing commands to support new spaces
docs/embeddings.md | Comprehensive documentation for the entire embedding system including new features
app/embeddings/mechanisms.html | New interactive browser for mechanism-level comparison
app/embeddings/index.html | Added celltypes option to space dropdown

Comment on lines +412 to +417
function addDisease(disease) {
    if (selectedDiseases.includes(disease) || selectedDiseases.length >= 10) return;
    selectedDiseases.push(disease);
    updateChips();
    updatePlot();
}
Copilot AI Feb 4, 2026

The UI silently prevents adding more than 10 diseases without providing user feedback. This could be confusing - users might click "Add" and wonder why nothing happens.

Consider adding visual feedback when the limit is reached, such as:

  1. Disabling the search input and add button when at limit
  2. Showing a message like "Maximum 10 diseases selected"
  3. Providing a toast notification when attempting to add beyond the limit

This would improve user experience and make the limitation clear.

  coll.attach_indexer(indexer)
  # Index all objects in the collection
- rows = coll.find().rows
+ rows = coll.find(limit=10000).rows
Copilot AI Feb 4, 2026

The hardcoded limit of 10000 could become a scalability issue as the knowledge base grows. Currently, the PR description mentions 313 disorders and 1235 mechanisms, which are well under this limit. However, consider making this configurable or at least documenting this limitation clearly.

As a best practice, consider either:

  1. Making the limit a class constant (e.g., DEFAULT_QUERY_LIMIT = 10000) that can be overridden
  2. Adding a check to warn if the result set is exactly at the limit, which might indicate truncation
  3. Documenting this limitation in the method docstrings

This is particularly important for the find() calls where results are expected to return all items.
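
Options 1 and 2 could be combined in a small helper along these lines. This is a sketch only. `check_not_truncated` is a hypothetical name, `DEFAULT_QUERY_LIMIT` follows the constant name suggested above, and nothing here depends on linkml-store itself.

```python
import warnings

DEFAULT_QUERY_LIMIT = 10000  # constant name taken from the suggestion above


def check_not_truncated(rows, limit=DEFAULT_QUERY_LIMIT):
    """Warn when a result set has reached its limit, then return it unchanged."""
    if len(rows) >= limit:
        warnings.warn(
            f"Query returned {len(rows)} rows, which reaches the limit of {limit}; "
            "results may be truncated.",
            RuntimeWarning,
            stacklevel=2,
        )
    return rows
```

A call site would then wrap the existing query, for example `rows = check_not_truncated(coll.find(limit=DEFAULT_QUERY_LIMIT).rows)`. A result set of exactly the limit's size is not proof of truncation, which is why this warns instead of raising.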

Comment on lines +11 to +15
{% if p.tissue_term %} Tissue: {{ p.tissue_term.preferred_term | default(p.tissue_term.term.label if p.tissue_term.term else '') }}{% endif %}
{% endfor %}
{% if histopathology %}

Histopathology:
Copilot AI Feb 4, 2026

The template references p.tissue_term on pathophysiology entries, but based on the data schema, tissue_term appears to be a field that exists in the datasets section, not in pathophysiology entries.

While the use of {% if p.tissue_term %} makes this safe (it won't error), this condition will likely never be true, making this line dead code. Consider removing it or documenting if there are indeed disorders with tissue_term in their pathophysiology entries.

If capturing cell types and anatomy is the goal, the histopathology section seems more appropriate, as it appears to have both cell_type_term and tissue_term fields.

Suggested change
- {% if p.tissue_term %} Tissue: {{ p.tissue_term.preferred_term | default(p.tissue_term.term.label if p.tissue_term.term else '') }}{% endif %}
  {% endfor %}
  {% if histopathology %}
  Histopathology:

Comment on lines +441 to +454
container.innerHTML = selectedDiseases.map((d, i) => `
    <span class="chip" style="background: ${COLORS[i % COLORS.length]}">
        ${d}
        <span class="count">${counts[d] || 0}</span>
        <span class="remove" onclick="removeDisease('${d.replace(/'/g, "\\'")}')">&times;</span>
    </span>
`).join('');

// Update legend
const legend = document.getElementById('legend');
legend.innerHTML = selectedDiseases.map((d, i) => `
    <div class="legend-item">
        <span class="legend-swatch" style="background: ${COLORS[i % COLORS.length]}"></span>
        <span>${d}</span>
Copilot AI Feb 4, 2026

Potential XSS vulnerability: Disease names are inserted into HTML without escaping. If a disease name contains HTML special characters (e.g., <script>, <, >, &), it could lead to XSS attacks.

While disease names in the knowledge base are likely sanitized, it's a security best practice to always escape user-controlled or data-driven content before inserting it into HTML.

Consider using textContent property or a helper function to escape HTML entities. For example:

function escapeHtml(text) {
    const div = document.createElement('div');
    div.textContent = text;
    return div.innerHTML;
}

// Then use:
${escapeHtml(d)}

Or create the elements programmatically rather than using innerHTML with template strings.

Comment on lines +332 to +334
suggestions.innerHTML = matches.map((d, i) =>
    `<div class="suggestion-item" data-disease="${d}">${d}</div>`
).join('');
Copilot AI Feb 4, 2026

Potential XSS vulnerability: Disease names are inserted into an HTML attribute and into element content without escaping. Reading the value back through the dataset API is safe, but the interpolated data-disease attribute and the text content are both built through innerHTML and should be escaped.

Use a helper function to escape HTML entities or create elements programmatically to avoid potential XSS issues.

Suggested change
- suggestions.innerHTML = matches.map((d, i) =>
-     `<div class="suggestion-item" data-disease="${d}">${d}</div>`
- ).join('');
+ suggestions.innerHTML = '';
+ matches.forEach((d) => {
+     const item = document.createElement('div');
+     item.className = 'suggestion-item';
+     item.dataset.disease = d;
+     item.textContent = d;
+     suggestions.appendChild(item);
+ });

Comment on lines +507 to +511
if idx is None:
    continue

disease_name = mech.get("disease_name", "Unknown")
mechanism_name = mech.get("_mechanism_name", "Unknown")
Copilot AI Feb 4, 2026

Consider adding a warning or debug log when mechanisms are skipped because the text couldn't be matched to an embedding. This would help diagnose issues where the number of exported mechanisms is less than expected.

For example:

if idx is None:
    print(f"  Warning: Could not find embedding for mechanism '{mechanism_name}' in disease '{disease_name}'")
    continue

This would make debugging easier if there are template rendering inconsistencies or data issues.

Suggested change
- if idx is None:
-     continue
- disease_name = mech.get("disease_name", "Unknown")
- mechanism_name = mech.get("_mechanism_name", "Unknown")
+ disease_name = mech.get("disease_name", "Unknown")
+ mechanism_name = mech.get("_mechanism_name", "Unknown")
+ if idx is None:
+     print(f" Warning: Could not find embedding for mechanism '{mechanism_name}' in disease '{disease_name}'")
+     continue
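
A variant of the same idea uses the standard `logging` module and also reports how many mechanisms were skipped in total. This is a sketch under assumptions, not the export code. The `match_embeddings` helper, the `text` key, the `text_to_index` lookup, and the logger name are all hypothetical.

```python
import logging

logger = logging.getLogger("dismech.embed")  # logger name is an assumption


def match_embeddings(mechanisms, text_to_index):
    """Pair each mechanism with its embedding index, logging the ones that cannot be matched."""
    matched = []
    skipped = 0
    for mech in mechanisms:
        # Resolve names first so they are available for the warning message.
        disease_name = mech.get("disease_name", "Unknown")
        mechanism_name = mech.get("_mechanism_name", "Unknown")
        idx = text_to_index.get(mech.get("text"))
        if idx is None:
            skipped += 1
            logger.warning(
                "No embedding for mechanism %r in disease %r", mechanism_name, disease_name
            )
            continue
        matched.append((idx, disease_name, mechanism_name))
    if skipped:
        logger.warning("Skipped %d of %d mechanisms", skipped, len(mechanisms))
    return matched, skipped
```

Returning the skipped count lets a caller compare it against the expected total (for example the 1235 mechanisms mentioned in the PR description) without parsing log output.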
