-
Notifications
You must be signed in to change notification settings - Fork 1
Add mechanism-level embeddings and cell types space #244
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
New features: - Cell types/anatomy embedding space (embed_cells.j2) - Mechanism-level embeddings for comparing pathophysiology across diseases - Interactive mechanism comparison browser (mechanisms.html) Changes: - Add celltypes dropdown to disease embedding explorer - Add embed_mechanism.j2 template for individual pathophys entries - Add index_mechanisms() and export_mechanisms_data() methods - Add just targets: embed-index-mechanisms, embed-mechanisms-data, embed-mechanisms-all - Consolidate embed-search and embed-similar into single targets with space arg - Fix history file exclusion in load_disorders() - Fix datetime serialization for edit_history fields - Fix linkml-store default limit (100) issue for large collections - Add cache clearing on recreate to prevent stale embeddings - Add comprehensive embeddings documentation (docs/embeddings.md) Co-Authored-By: Claude Opus 4.5 <[email protected]>
|
Claude finished @cmungall's task —— View job Code Review for PR #244: Add mechanism-level embeddings and cell types spaceReview Checklist:
Review SummaryThis PR adds significant value to the dismech project by introducing mechanism-level embeddings and a cell types embedding space. The implementation is generally well-designed and follows good practices. ✅ StrengthsArchitecture & Design:
Code Quality:
User Experience:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This pull request adds two major features to the dismech embedding system: a new cell types/anatomy embedding space and a mechanism-level comparison browser. The PR also includes several important bug fixes related to file indexing, datetime serialization, and query limits.
Changes:
- Adds
celltypesembedding space for analyzing diseases by their involved cell types (CL ontology) and tissues (UBERON ontology) - Adds mechanism-level embeddings that embed individual pathophysiology entries, enabling comparison of specific disease mechanisms across disorders
- Adds interactive mechanism comparison browser (
app/embeddings/mechanisms.html) for visualizing mechanistic overlap between diseases - Fixes bugs:
.history.yamlfiles being indexed, datetime serialization errors, and linkml-store's default 100-row limit truncating results - Adds cache clearing on
--recreateflag to prevent stale embeddings
Reviewed changes
Copilot reviewed 7 out of 9 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| src/dismech/templates/embed_mechanism.j2 | New Jinja2 template for mechanism-level embeddings |
| src/dismech/templates/embed_cells.j2 | New Jinja2 template for cell types/anatomy embeddings |
| src/dismech/embed.py | Core changes: datetime serialization, extract_mechanisms function, new indexing methods for celltypes and mechanisms, mechanism data export, CLI updates |
| project.justfile | New commands for mechanism indexing/export, updated existing commands to support new spaces |
| docs/embeddings.md | Comprehensive documentation for the entire embedding system including new features |
| app/embeddings/mechanisms.html | New interactive browser for mechanism-level comparison |
| app/embeddings/index.html | Added celltypes option to space dropdown |
| function addDisease(disease) { | ||
| if (selectedDiseases.includes(disease) || selectedDiseases.length >= 10) return; | ||
| selectedDiseases.push(disease); | ||
| updateChips(); | ||
| updatePlot(); | ||
| } |
Copilot
AI
Feb 4, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The UI silently prevents adding more than 10 diseases without providing user feedback. This could be confusing - users might click "Add" and wonder why nothing happens.
Consider adding visual feedback when the limit is reached, such as:
- Disabling the search input and add button when at limit
- Showing a message like "Maximum 10 diseases selected"
- Providing a toast notification when attempting to add beyond the limit
This would improve user experience and make the limitation clear.
| coll.attach_indexer(indexer) | ||
| # Index all objects in the collection | ||
| rows = coll.find().rows | ||
| rows = coll.find(limit=10000).rows |
Copilot
AI
Feb 4, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The hardcoded limit of 10000 could become a scalability issue as the knowledge base grows. Currently, the PR description mentions 313 disorders and 1235 mechanisms, which are well under this limit. However, consider making this configurable or at least documenting this limitation clearly.
As a best practice, consider either:
- Making the limit a class constant (e.g.,
DEFAULT_QUERY_LIMIT = 10000) that can be overridden - Adding a check to warn if the result set is exactly at the limit, which might indicate truncation
- Documenting this limitation in the method docstrings
This is particularly important for the find() calls where results are expected to return all items.
| {% if p.tissue_term %} Tissue: {{ p.tissue_term.preferred_term | default(p.tissue_term.term.label if p.tissue_term.term else '') }}{% endif %} | ||
| {% endfor %} | ||
| {% if histopathology %} | ||
|
|
||
| Histopathology: |
Copilot
AI
Feb 4, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The template references p.tissue_term on pathophysiology entries, but based on the data schema, tissue_term appears to be a field that exists in the datasets section, not in pathophysiology entries.
While the use of {% if p.tissue_term %} makes this safe (it won't error), this condition will likely never be true, making this line dead code. Consider removing it or documenting if there are indeed disorders with tissue_term in their pathophysiology entries.
If cell types and anatomy is the goal, the histopathology section seems more appropriate as it appears to have both cell_type_term and tissue_term fields.
| {% if p.tissue_term %} Tissue: {{ p.tissue_term.preferred_term | default(p.tissue_term.term.label if p.tissue_term.term else '') }}{% endif %} | |
| {% endfor %} | |
| {% if histopathology %} | |
| Histopathology: | |
| {% endfor %} | |
| {% if histopathology %} | |
| Histopathology: | |
| Histopathology: |
| container.innerHTML = selectedDiseases.map((d, i) => ` | ||
| <span class="chip" style="background: ${COLORS[i % COLORS.length]}"> | ||
| ${d} | ||
| <span class="count">${counts[d] || 0}</span> | ||
| <span class="remove" onclick="removeDisease('${d.replace(/'/g, "\\'")}')">×</span> | ||
| </span> | ||
| `).join(''); | ||
|
|
||
| // Update legend | ||
| const legend = document.getElementById('legend'); | ||
| legend.innerHTML = selectedDiseases.map((d, i) => ` | ||
| <div class="legend-item"> | ||
| <span class="legend-swatch" style="background: ${COLORS[i % COLORS.length]}"></span> | ||
| <span>${d}</span> |
Copilot
AI
Feb 4, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Potential XSS vulnerability: Disease names are inserted into HTML without escaping. If a disease name contains HTML special characters (e.g., <script>, <, >, &), it could lead to XSS attacks.
While disease names in the knowledge base are likely sanitized, it's a security best practice to always escape user-controlled or data-driven content before inserting it into HTML.
Consider using textContent property or a helper function to escape HTML entities. For example:
function escapeHtml(text) {
const div = document.createElement('div');
div.textContent = text;
return div.innerHTML;
}
// Then use:
${escapeHtml(d)}Or create the elements programmatically rather than using innerHTML with template strings.
| suggestions.innerHTML = matches.map((d, i) => | ||
| `<div class="suggestion-item" data-disease="${d}">${d}</div>` | ||
| ).join(''); |
Copilot
AI
Feb 4, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Potential XSS vulnerability: Disease names are inserted into HTML attributes and content without escaping. While the data-disease attribute has some protection through the dataset API, the disease name in the text content should be escaped.
Use a helper function to escape HTML entities or create elements programmatically to avoid potential XSS issues.
| suggestions.innerHTML = matches.map((d, i) => | |
| `<div class="suggestion-item" data-disease="${d}">${d}</div>` | |
| ).join(''); | |
| suggestions.innerHTML = ''; | |
| matches.forEach((d) => { | |
| const item = document.createElement('div'); | |
| item.className = 'suggestion-item'; | |
| item.dataset.disease = d; | |
| item.textContent = d; | |
| suggestions.appendChild(item); | |
| }); |
| if idx is None: | ||
| continue | ||
|
|
||
| disease_name = mech.get("disease_name", "Unknown") | ||
| mechanism_name = mech.get("_mechanism_name", "Unknown") |
Copilot
AI
Feb 4, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider adding a warning or debug log when mechanisms are skipped because the text couldn't be matched to an embedding. This would help diagnose issues where the number of exported mechanisms is less than expected.
For example:
if idx is None:
print(f" Warning: Could not find embedding for mechanism '{mechanism_name}' in disease '{disease_name}'")
continueThis would make debugging easier if there are template rendering inconsistencies or data issues.
| if idx is None: | |
| continue | |
| disease_name = mech.get("disease_name", "Unknown") | |
| mechanism_name = mech.get("_mechanism_name", "Unknown") | |
| disease_name = mech.get("disease_name", "Unknown") | |
| mechanism_name = mech.get("_mechanism_name", "Unknown") | |
| if idx is None: | |
| print(f" Warning: Could not find embedding for mechanism '{mechanism_name}' in disease '{disease_name}'") | |
| continue |
Summary
app/embeddings/mechanisms.html) for visualizing mechanistic overlap between diseasesNew Features
Cell Types Embedding Space
celltypesspace usingembed_cells.j2templateMechanism Comparison Browser
Bug Fixes
.history.yamlfiles being incorrectly indexededit_historyfields--recreateto prevent stale embeddingsCommands
Test plan
just embed-index-groupedcompletes without errorsjust embed-app-dataexports all 313 disorders per spacejust embed-mechanisms-allindexes 1235 mechanisms🤖 Generated with Claude Code