Pipeline for generating the knowledge graph integrating enriched metabolite data originally used for ENPKG, traits data from TRY, and interaction data from GloBI.
Notes
- If you want to build the METRIN-KG triples, skip to installation
- If you just want to build your own instance of the METRIN-KG SPARQL endpoint, skip to querying METRIN-KG
- Wikidata Data Acquisition: Fetches lineage and taxonomic data for up to 15 taxonomies from Wikidata using SPARQL.
- Taxonomy Matching: Matches taxa against the fetched Wikidata records. Taxa come from:
  - GloBI (Global Biotic Interactions)
  - TRY (Plant Trait Database)
- Knowledge Graph Generation: Generates RDF triples representing taxonomic alignments and traits for:
  - GloBI
  - TRY
  - EMI-KG (extension of ENPKG)
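The taxonomy matching step can be pictured with a minimal sketch. It assumes exact matching on normalized names, which may differ from what the pipeline actually does. The record keys `taxon_name` and `wd_id`, the helper names, and the placeholder identifiers are all hypothetical and not taken from the repository.

```python
import re


def normalize_name(name: str) -> str:
    """Lowercase a taxon name and collapse runs of whitespace."""
    return re.sub(r"\s+", " ", name).strip().lower()


def build_index(wikidata_records):
    """Map normalized taxon names to identifiers.

    `wikidata_records` is an iterable of dicts with the hypothetical
    keys "taxon_name" and "wd_id".
    """
    index = {}
    for record in wikidata_records:
        # Keep the first identifier seen for a name; a real pipeline
        # would need an explicit rule for homonyms.
        index.setdefault(normalize_name(record["taxon_name"]), record["wd_id"])
    return index


def match_taxa(source_names, index):
    """Return (matched, unmatched): a dict of name -> id and a list of names."""
    matched, unmatched = {}, []
    for name in source_names:
        wd_id = index.get(normalize_name(name))
        if wd_id is None:
            unmatched.append(name)
        else:
            matched[name] = wd_id
    return matched, unmatched


# Placeholder records; the identifiers are made up for illustration only.
records = [
    {"taxon_name": "Arabidopsis thaliana", "wd_id": "WD_PLACEHOLDER_1"},
    {"taxon_name": "Quercus robur", "wd_id": "WD_PLACEHOLDER_2"},
]
index = build_index(records)
matched, unmatched = match_taxa(["arabidopsis  thaliana", "Unknown taxon"], index)
print(matched)
print(unmatched)
```

Names that fail to match are returned separately so they can be reviewed by hand instead of being dropped silently.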
- Clone the repository
git clone https://github.com/earth-metabolome-initiative/metrin-kg.git
- Make sure you have pipenv installed. If not, install it via:
pip install pipenv
The code has been run only with python-3.12, but it may work with other versions of python-3.
- Once `pipenv` is installed, install the dependencies:
pipenv install
pipenv shell
- Download associated accessory data from METRIN-KG Zenodo repository and verbatim-interactions.tsv.gz (only) from GloBI Zenodo repository.
cd metrin-kg
# download METRIN-KG data (-O keeps "?download=1" out of the saved file name)
wget -O metrin-kg.tar.gz "https://zenodo.org/records/15689186/files/metrin-kg.tar.gz?download=1"
tar -xvf metrin-kg.tar.gz
mv metrin-kg-data data
# download GloBI data
wget -O verbatim-interactions.tsv.gz "https://zenodo.org/records/14640564/files/verbatim-interactions.tsv.gz?download=1"
mv verbatim-interactions.tsv.gz data/raw/
# download TRY data
wget -O TRYdb_40340.txt.gz "https://zenodo.org/records/17079465/files/TRYdb_40340.txt.gz?download=1"
mv TRYdb_40340.txt.gz data/raw/
- For supported arguments, run:
python main.py --help
- Run the pipeline via command-line
python main.py [OPTIONS]
Command-Line Options
| Option | Description |
|---|---|
| `--config` | Path to config file (default: config.txt) |
| `--run-wd-fetcher` | Fetch taxonomy data from Wikidata |
| `--run-ontology-match` | Match ontologies to GloBI or TRY terms |
| `--run-globi-match` | Match GloBI dataset with Wikidata taxonomies |
| `--run-trydb-match` | Match TRY dataset with Wikidata taxonomies |
| `--run-globi-kg` | Generate RDF Knowledge Graph for GloBI |
| `--run-trydb-kg` | Generate RDF Knowledge Graph for TRY |
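To show how the options listed above combine, here is a minimal `argparse` sketch that mirrors them. It is not the actual `main.py`: treating every `--run-*` option as a boolean switch is an assumption, and the helper names `build_parser` and `selected_steps` are hypothetical.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented options (illustrative only)."""
    parser = argparse.ArgumentParser(description="METRIN-KG pipeline (sketch)")
    parser.add_argument("--config", default="config.txt",
                        help="Path to config file")
    # Assumption: each --run-* option is a switch that enables one step.
    steps = [
        ("--run-wd-fetcher", "Fetch taxonomy data from Wikidata"),
        ("--run-ontology-match", "Match ontologies to GloBI or TRY terms"),
        ("--run-globi-match", "Match GloBI dataset with Wikidata taxonomies"),
        ("--run-trydb-match", "Match TRY dataset with Wikidata taxonomies"),
        ("--run-globi-kg", "Generate RDF Knowledge Graph for GloBI"),
        ("--run-trydb-kg", "Generate RDF Knowledge Graph for TRY"),
    ]
    for flag, help_text in steps:
        parser.add_argument(flag, action="store_true", help=help_text)
    return parser


def selected_steps(args: argparse.Namespace) -> list:
    """Return the names of the run_* switches that were enabled."""
    return [name for name, value in vars(args).items()
            if name.startswith("run_") and value]


args = build_parser().parse_args(["--run-globi-match", "--run-globi-kg"])
print(args.config)
print(selected_steps(args))
```

Because each switch is independent, any subset of steps can be requested in a single invocation.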
- Run the full pipeline:
python main.py --run-wd-fetcher --run-globi-match --run-trydb-match --run-globi-kg --run-trydb-kg --config config.txt
Note: This might take a while. If you only want to reproduce the KG, skip to point-8 directly. Note that if you have copied the data from the METRIN-KG Zenodo repository, all accessory files are already available.
- Run only Wikidata fetcher:
python main.py --run-wd-fetcher --config config.txt
Note: If you just want to reproduce the KG, you don't need to perform this step because the data directory already has the relevant files (if the METRIN-KG Zenodo contents are copied correctly).
- Run only GloBI/TRY taxonomy matching:
python main.py --run-globi-match --config config.txt
python main.py --run-trydb-match --config config.txt
Note: If you just want to reproduce the KG, you don't need to perform this step because the data directory already has the relevant files (if the METRIN-KG Zenodo contents are copied correctly).
- Run only ontology matching
This can be done for any of the datasets from GloBI (body part, life stages, and biological sex) and TRY (unit names). Specify the input and output files under the [ontology] header in config.txt
python main.py --run-ontology-match --config config.txt
Note: If you just want to reproduce the KG, you don't need to perform this step because the data directory already has the relevant files (if the METRIN-KG Zenodo contents are copied correctly).
- Generate knowledge graph - GloBI/TRY:
python main.py --run-globi-kg --config config.txt
python main.py --run-trydb-kg --config config.txt
Notes:
- For generating the sub knowledge graph of metabolites, follow the instructions here
- If you skip `--run-wd-fetcher`, make sure that the wd_* paths in config.txt point to valid, existing files. Each part of the pipeline can be run independently.
- Outputs
  a) Fetched taxonomy files from Wikidata (*.json)
  b) Matched taxa files for GloBI and TRY (*.tsv)
  c) RDF files representing the final knowledge graphs (*.ttl, *.rdf, etc.)
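The advice about wd_* paths can be checked before a long run. The sketch below assumes config.txt is an INI-style file readable by `configparser` (the text only confirms an [ontology] header). The section name, the two wd_* keys, and the helper `missing_wd_paths` are hypothetical.

```python
import configparser
import os
import tempfile


def missing_wd_paths(config_text: str, base_dir: str = "."):
    """Return (section, key, path) for each wd_* option whose file is absent."""
    # Interpolation is disabled so that "%" in a path is read literally.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)
    missing = []
    for section in parser.sections():
        for key, value in parser.items(section):
            if key.startswith("wd_") and not os.path.isfile(
                os.path.join(base_dir, value)
            ):
                missing.append((section, key, value))
    return missing


# Hypothetical config content; the real keys live in the repository's config.txt.
example_config = """
[wikidata]
wd_lineage_file = lineage.json
wd_names_file = names.json
"""

with tempfile.TemporaryDirectory() as tmp:
    # Create only one of the two files so the other is reported as missing.
    open(os.path.join(tmp, "lineage.json"), "w").close()
    result = missing_wd_paths(example_config, base_dir=tmp)

print(result)
```

An empty result means every wd_* entry points to an existing file, so the fetch step can be skipped safely.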
For querying METRIN-KG, you can use two methods:
a) the Qlever powered end-point hosted on earth-metabolome-initiative.org.
Want to generate your own instance of METRIN-KG SPARQL endpoint?
Follow the instructions on qlever-control and our fork of qlever-ui to install Qlever. You can find the qlever config file used to index METRIN-KG. Follow the commands below to generate your own instance of METRIN-KG on localhost.
qlever --qleverfile Qlever.metrin_kg get-data # download full METRIN-KG graph
qlever --qleverfile Qlever.metrin_kg index --overwrite-existing --parallel-parsing false # index KG
qlever --qleverfile Qlever.metrin_kg start # start the server on localhost
Once the Qlever index is generated and the server is started, you can query the endpoint using qlever-ui on your localhost. Once you are done querying METRIN-KG, don't forget to stop the server:
qlever --qleverfile Qlever.metrin_kg stop
Notes:
- Note that you will need Docker for running `qlever`. On Linux, Docker runs natively and takes up only a small amount of RAM, whereas on macOS Docker runs in a virtual machine and thus takes significant RAM. Therefore, on macOS, `qlever index` may sometimes fail and require more memory.
- For indexing the METRIN-KG data (`qlever index`), at least 31 GB of RAM is required. This is sufficient on Linux, but macOS may need more.
- The shell commands for `qlever get-data` inside the config file have been adapted for Ubuntu's terminal and macOS's iTerm2 default settings.
- The `qlever get-data` command will only download the triples (`ttl.gz` or `ttl`) and not the raw data used to generate the triples. For downloading the full METRIN-KG dataset including the raw data and the triples, please refer to Usage point-1.
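Once a local server is running, it can also be queried from a script instead of the UI. The sketch below builds a SPARQL query request in the GET form defined by the SPARQL 1.1 Protocol, without sending it. The endpoint URL and port are hypothetical (use the values from your own Qleverfile), and the query is a generic triple count, not one of the METRIN-KG examples.

```python
from urllib.parse import parse_qs, urlencode, urlsplit
from urllib.request import Request


def build_sparql_request(endpoint: str, query: str) -> Request:
    """Build a GET request carrying the query in the `query` parameter."""
    url = f"{endpoint}?{urlencode({'query': query})}"
    # Ask for the standard SPARQL JSON results media type.
    return Request(url, headers={"Accept": "application/sparql-results+json"})


# Hypothetical local endpoint; replace with the host and port of your instance.
ENDPOINT = "http://localhost:7000"
# Generic query that counts all triples; it makes no assumption about the schema.
QUERY = "SELECT (COUNT(*) AS ?n) WHERE { ?s ?p ?o }"

request = build_sparql_request(ENDPOINT, QUERY)
print(request.full_url)
# To send the request: urllib.request.urlopen(request).read()
```

Building the request separately from sending it makes it easy to inspect the encoded URL before contacting the server.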
This endpoint also provides direct access to class-overview (find the icon at the top-left corner). It also provides a way to suggest example queries to be accepted in the METRIN-KG examples set (find the icon 💾 at the top-left corner).
Note that for some queries, this endpoint might return a "The quota has exceeded" error. We are trying to resolve it. Updates soon...
For visualization of class overview and data schema, visit the sparql-editor powered endpoint and click on the class overview icon at the top-left corner of the page.
You can also open sparql_editor_metrin-kg.html in a browser to visualize the class overview. For instructions on how to generate this file, refer to the following GitHub repos: sparql-editor, sparql-examples, and our fork of void-generator.
Have a look at METRIN-KG wiki for how-to-use and how-to-contribute-to METRIN-KG.
For bugs, questions, or contributions, please open an issue or submit a pull request.
If you use METRIN-KG in your work, please cite
METRIN-KG: A knowledge graph integrating plant metabolites, traits and biotic interactions Disha Tandon, Tarcisio Mendes De Farias, Pierre-Marie Allard, Emmanuel Defossez bioRxiv 2025.08.20.671289; doi: https://doi.org/10.1101/2025.08.20.671289