Description
Create a comprehensive evaluation framework to assess all aspects of Semantica: GraphRAG systems, agentic systems, knowledge graph quality, information extraction accuracy, and ontology quality.
Problem Statement
Semantica powers GraphRAG and Agentic GraphRAG systems, but there's no standardized way to evaluate retrieval accuracy, reasoning quality, agent performance, knowledge graph quality, extraction accuracy, and ontology completeness.
Current Status: The semantica.evals module exists but is currently empty (marked as "Coming Soon"). This is a greenfield opportunity for contributors. Contributions are welcome!
Evaluation Categories
1. GraphRAG & RAG Evaluations
RAGAS Framework Metrics (Reference-Free): Context Precision, Context Recall, Context Entities Recall, Noise Sensitivity, Response Relevancy, Faithfulness, Multimodal Faithfulness, Multimodal Relevance
GraphRAG-Specific Metrics: Multi-hop Reasoning Accuracy, Hierarchical Knowledge Retrieval, Graph Traversal Quality, Hybrid Retrieval Balance, Context Expansion Quality (a simplified sketch follows this category)
Benchmarks: GraphRAG-Bench, MTRAG (110 conversations, 842 tasks), Microsoft GraphRAG Benchmarking Datasets
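As a rough starting point, here is a minimal sketch of two of the metrics above. It assumes binary relevance labels per retrieved chunk (RAGAS derives that judgment with an LLM) and exact path matching for multi-hop accuracy, so it is a simplification, not the RAGAS or GraphRAG-Bench implementation:

```python
from typing import Sequence

def context_precision(relevance: Sequence[int]) -> float:
    """Simplified context precision: average precision@k over the ranked
    retrieved chunks, where relevance[k] is 1 if chunk k supports the
    ground-truth answer."""
    hits, precisions = 0, []
    for k, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / hits if hits else 0.0

def multi_hop_accuracy(predicted_paths, gold_paths) -> float:
    """Fraction of questions whose predicted reasoning path matches the
    gold path hop-for-hop (exact match over entity/relation sequences)."""
    if not gold_paths:
        return 0.0
    matches = sum(tuple(p) == tuple(g) for p, g in zip(predicted_paths, gold_paths))
    return matches / len(gold_paths)

print(context_precision([1, 0, 1, 0]))                             # ~0.83
print(multi_hop_accuracy([["A", "r1", "B"]], [["A", "r1", "B"]]))  # 1.0
```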
2. Knowledge Graph Evaluations
Link Prediction & Triple Classification: MRR, Hits@K, Mean Rank, Accuracy, F1 (see the sketch after the benchmarks below)
Quality Dimensions (20 QDs Framework): Completeness, Consistency, Accuracy, Accessibility, Appropriate Amount, Believability, Consistent Representation, Ease of Understanding, Connectivity, Coverage, among others
ABECTO-Style Evaluation: Accuracy Assessment (compare overlapping RDF graphs without gold standard), Completeness Assessment, Quality Monitoring
Benchmarks: CODEX (Wikidata/Wikipedia-based successor to FB15K/FB15K-237), KGrEaT (downstream tasks), AYNEC (KG completion workflow)
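For the link-prediction metrics above, a minimal sketch assuming you already have the (filtered) rank of the true entity for each query; the helper names are illustrative, not an existing Semantica API:

```python
from typing import Iterable

def mrr(ranks: Iterable[int]) -> float:
    """Mean Reciprocal Rank over the ranks of the true entity,
    where rank 1 means the model scored the correct entity highest."""
    ranks = list(ranks)
    return sum(1.0 / r for r in ranks) / len(ranks) if ranks else 0.0

def hits_at_k(ranks: Iterable[int], k: int = 10) -> float:
    """Fraction of queries whose true entity appears in the top k candidates."""
    ranks = list(ranks)
    return sum(1 for r in ranks if r <= k) / len(ranks) if ranks else 0.0

# Ranks of the correct tail entity for four (head, relation, ?) queries.
ranks = [1, 3, 12, 2]
print(mrr(ranks))            # ~0.479
print(hits_at_k(ranks, 10))  # 0.75
```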
3. Information Extraction Evaluations
NER Metrics: Precision, Recall, F1-Score, Entity Type Accuracy, Boundary Detection, Cross-Domain Performance (an exact-match scorer is sketched below)
Relationship Extraction Metrics: Precision, Recall, F1-Score, End-to-End RE Evaluation, Entity Linking Accuracy, Triplet Extraction Quality, Relation Type Classification
Benchmarks: ACE05, TACRED/Re-TACRED (F1: 74.6% / 91.1%), CoNLL-2003, Custom Domain Datasets
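A possible exact-match scorer for the NER and triplet metrics above, assuming entities are compared as (doc_id, start, end, type) tuples; the same set arithmetic covers (head, relation, tail) triples:

```python
def entity_prf(predicted: set, gold: set) -> tuple:
    """Exact-match scoring: both the span boundaries and the entity type
    must match (CoNLL-style). Passing (head, relation, tail) tuples
    instead yields triplet extraction scores."""
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

gold = {(0, 0, 2, "ORG"), (0, 5, 6, "PER")}
pred = {(0, 0, 2, "ORG"), (0, 5, 6, "LOC")}  # right span, wrong type
print(entity_prf(pred, gold))  # (0.5, 0.5, 0.5)
```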
4. Agentic System Evaluations
Core Metrics: Success Rate, Process/Progress Rate, Tool Utilization (selection accuracy, parameter accuracy, action advancement), Fine-grained Process Metrics (a sketch follows this category)
Benchmarks: AgentQuest, AgentEval Benchmark Suite, SWE-bench, CVE-Bench
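A rough sketch of the core agentic metrics, assuming each episode is reduced to a boolean outcome plus a list of logged tool calls; the ToolCall record below is hypothetical, not an AgentQuest or Semantica type:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class ToolCall:
    tool: str           # tool the agent actually invoked
    expected_tool: str  # tool the gold trajectory expects at this step
    advanced: bool      # whether the call moved the task forward

def success_rate(episode_outcomes: List[bool]) -> float:
    """Fraction of episodes that reached the goal state."""
    return sum(episode_outcomes) / len(episode_outcomes) if episode_outcomes else 0.0

def tool_selection_accuracy(calls: List[ToolCall]) -> float:
    """Fraction of tool calls where the agent picked the expected tool."""
    return sum(c.tool == c.expected_tool for c in calls) / len(calls) if calls else 0.0

def action_advancement(calls: List[ToolCall]) -> float:
    """Process-level metric: fraction of calls that actually advanced the task."""
    return sum(c.advanced for c in calls) / len(calls) if calls else 0.0
```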
5. Ontology Evaluations
OQuaRE Framework: Completeness, Consistency, Accuracy, Utility, Adaptability
SHACL Validation: 69 SHACL-based quality metrics, Shape Validation, Reasoning Validation (HermiT/Pellet, F1 up to 0.99)
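For SHACL validation, one option is the third-party pyshacl package; a minimal sketch with placeholder file names (shapes.ttl would encode the quality metrics as shapes):

```python
# pyshacl is a third-party SHACL engine; file names below are placeholders.
from pyshacl import validate

conforms, report_graph, report_text = validate(
    "ontology.ttl",            # data/ontology graph under evaluation
    shacl_graph="shapes.ttl",  # SHACL shapes encoding the quality metrics
    inference="rdfs",          # run RDFS inference before validating
)
print(conforms)      # True if no shape violations
print(report_text)   # human-readable violation report
```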
6. Evaluation Datasets
GraphRAG: Question-answer pairs with ground truth reasoning paths, multi-hop reasoning scenarios, hierarchical knowledge retrieval tasks
Agentic: Agent task scenarios with expected behaviors, multi-turn conversation datasets, tool usage scenarios
Knowledge Graph: Annotated entities/relationships, gold standard KGs, domain-specific KG datasets
Benchmark: Standardized datasets, version-controlled benchmark data, cross-domain evaluation sets
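A hypothetical record schema for the ground-truth datasets above, to make the GraphRAG case concrete; field names are illustrative, not a fixed format:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class GraphRAGEvalExample:
    """One ground-truth record for a GraphRAG evaluation set."""
    question: str
    answer: str
    reasoning_path: List[str] = field(default_factory=list)  # ordered entity/relation hops
    gold_contexts: List[str] = field(default_factory=list)   # passages that support the answer
    hops: int = 1                                             # multi-hop depth
```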
7. Reporting & Visualization
Automated reports (HTML, PDF, JSON), metrics visualization (confusion matrices, ROC curves, precision-recall curves, performance trends), benchmark comparisons, CI/CD integration, statistical significance testing
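For machine-readable reports that CI can diff across runs, a minimal JSON writer sketch (run and metric names below are placeholders):

```python
import json
from datetime import datetime, timezone

def write_json_report(run_name: str, metrics: dict, path: str) -> None:
    """Dump one evaluation run as JSON so CI can diff it against earlier runs."""
    report = {
        "run": run_name,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "metrics": metrics,
    }
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(report, fh, indent=2)

write_json_report("graphrag-baseline", {"context_precision": 0.81, "faithfulness": 0.92}, "report.json")
```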
Files
Note: The semantica/evals/ module already exists but is currently empty. Contributors should implement:
- graphrag_evaluator.py - GraphRAG evaluations (RAGAS, GraphRAG-Bench)
- rag_evaluator.py - General RAG metrics
- agentic_evaluator.py - Agentic evaluations (AgentQuest, AgentEval)
- kg_evaluator.py - KG evaluations (link prediction, quality dimensions)
- extraction_evaluator.py - NER and relationship extraction
- ontology_evaluator.py - Ontology quality (OQuaRE, SHACL)
- benchmarks/ - Benchmark runners (GraphRAG-Bench, CODEX, KGrEaT, AYNEC, ACE05, TACRED, CoNLL)
- datasets/ - Evaluation datasets with ground truth
- metrics/ - Metric calculation utilities
- reporting/ - Report generation and visualization
Getting Started
Current State: semantica/evals/ exists but contains only a placeholder __init__.py. Greenfield implementation opportunity!
Reference Patterns: semantica/semantic_extract/extraction_validator.py, semantica/ontology/ontology_evaluator.py, semantica/context/agent_context.py, cookbook/use_cases/advanced_rag/
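One possible starting point is a small shared interface that each evaluator module implements, loosely following the validator/evaluator patterns referenced above; the names below are suggestions, not existing Semantica classes:

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, Iterable

class BaseEvaluator(ABC):
    """Possible shared interface for the evaluator modules listed under Files."""

    name: str = "base"

    @abstractmethod
    def evaluate(self, predictions: Iterable[Any], references: Iterable[Any]) -> Dict[str, float]:
        """Return a flat mapping of metric name -> score for one run."""

    def report(self, scores: Dict[str, float]) -> str:
        """Plain-text fallback; reporting/ would add HTML/PDF/JSON writers."""
        return "\n".join(f"{self.name}.{metric}: {value:.4f}" for metric, value in scores.items())
```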
References
RAG & GraphRAG
- RAGAS: https://github.com/vibrantlabsai/ragas | https://docs.ragas.io/ | https://www.ragas.io/
- GraphRAG-Bench: https://github.com/GraphRAG-Bench/GraphRAG-Benchmark | https://huggingface.co/datasets/GraphRAG-Bench/GraphRAG-Bench | https://arxiv.org/abs/2506.05690
- MTRAG: https://github.com/IBM/mt-rag-benchmark | https://arxiv.org/abs/2501.03468
- Microsoft GraphRAG: https://github.com/microsoft/graphrag-benchmarking-datasets
Knowledge Graph
- CODEX: https://github.com/tsafavi/codex | https://aclanthology.org/2020.emnlp-main.669/ | https://arxiv.org/abs/2009.07810
- KGrEaT: https://github.com/dwslab/kgreat | https://arxiv.org/abs/2308.10537
- AYNEC: https://github.com/DEAL-US/AYNEXT | https://link.springer.com/chapter/10.1007/978-3-030-21348-0_26
- ABECTO: https://www.semantic-web-journal.net/content/abecto-assessing-accuracy-and-completeness-rdf-knowledge-graphs-0 | https://arxiv.org/abs/2208.07779
Ontology
- OQuaRE: https://www.sciencedirect.com/science/article/abs/pii/S0957417412012146 | https://github.com/tecnomod-um/oquare-metrics | https://semantics.inf.um.es/ontology-metrics/doc-ws.html
Agentic Systems
- AgentQuest: https://github.com/nec-research/agentquest | https://aclanthology.org/2024.naacl-demo.19/
- AgentEval: https://github.com/Narabzad/AgentEval | https://arxiv.org/abs/2404.06411 | https://www.emergentmind.com/topics/agenteval-benchmark-suite
Information Extraction
- ACE05: https://catalog.ldc.upenn.edu/LDC2006T06
- TACRED/Re-TACRED: https://aclanthology.org/2020.emnlp-main.301/
- CoNLL-2003: https://www.clips.uantwerpen.be/conll2003/ner/
Labels: feature, evaluation, quality, graphrag