rMAP-2.0 is a modular, containerized bioinformatics workflow for analyzing microbial genomic data and profiling AMR, mobilome, virulome, and phylogenomics, with support for MLST typing, variant calling, and BLASTn-based sequence similarity search. It bundles the required tools and dependencies to enable reproducible, scalable analysis of NGS data in research and public health settings.
rMAP-2.0 is optimized for profiling the resistome and other genomic features of ESKAPEE pathogens (Enterococcus faecium, Staphylococcus aureus, Klebsiella pneumoniae, Acinetobacter baumannii, Pseudomonas aeruginosa, Enterobacter species, and Escherichia coli) using whole-genome sequencing (WGS) paired-end reads.
- Overview
- Repository layout (current)
- Features
- Quick start / Test dataset (E. coli, Illumina PE)
- Prerequisites
- Install / download
- How to run
- Minimum sample requirements
- Sample input JSON
- Tools used (with Docker images)
- Outputs
- Databases (local BLAST + updates)
- Benchmarking
- Execution (Cromwell)
- Offline use & data sovereignty
- Releases & reproducibility
- Intended use & limitations
- Docker Desktop configuration for rMAP-2.0
- Troubleshooting
- Support / Issues
- Citation
- Authors & contributors
- License
- Acknowledgements
- Appendix
Version: 1.0 (see Releases for tagged versions)
Pipeline Type: WDL-based, Docker-enabled
Workflow Engine: Cromwell
rMAP-2.0 is a containerized, modular workflow for microbial genomics that integrates trimming, quality control, de novo assembly, annotation, variant calling, MLST typing, AMR profiling, mobile genetic element analysis, pangenome analysis, phylogeny, and tree visualization.
The workflow is written in Workflow Description Language (WDL), uses Docker containers for tool standardization, and runs on the Cromwell execution engine. The primary deliverable is a single consolidated, navigable HTML report (with per-module outputs preserved in the Cromwell execution directories).
This README reflects the current repository layout (as in the GitHub tree):
rMAP-2.0/
rMAP.wdl
README.md
docs/
test_data/
config/
databases/
workflow.png
- config/: example input JSONs (e.g., inputs_example.json) plus small reference FASTA artifacts used for testing/examples (e.g., species reference FASTAs and adapters.fa).
- databases/: small FASTA databases shipped for convenience (e.g., resfinder.fa, plasmidfinder.fa, vfdb.fa). Large reference bundles are distributed via Zenodo/releases.
- test_data/: a minimal Illumina paired-end FASTQ subset plus inputs_test.json for quick end-to-end validation.
- Adapter trimming with Trimmomatic
- Quality control using FastQC & MultiQC
- Genome assembly using MEGAHIT
- Genome annotation with Prokka
- Variant calling using Snippy
- MLST profiling for sequence typing
- Roary for pangenome construction
- Phylogenetic inference using FastTree
- AMR, virulence, & MGE detection with Abricate
- Sequence similarity search using BLAST
- Phylogenetic tree visualization with ETE3
- Generation of a consolidated interactive HTML report summarizing all key outputs
To support reproducibility and quick validation, the repository includes a small Illumina paired-end Escherichia coli test dataset (5 isolates) under test_data/, together with a matching input JSON: test_data/inputs_test.json.
The test_data cohort comprises five E. coli WGS datasets retrieved from NCBI/SRA (typical E. coli genome size ≈ 5.0 Mb, with expected strain-to-strain variation).
A hosted end-to-end test HTML report generated from this dataset is available here:
From the repository root:
java -jar cromwell.jar run rMAP.wdl --inputs test_data/inputs_test.json
After a successful run, Cromwell will write outputs under cromwell-executions/ (plus workflow logs). Key expected outputs include:
- QC outputs: FastQC per-sample + MultiQC summary
- Assembly outputs: assembled contigs (FASTA)
- Annotation outputs: Prokka annotations (e.g., GFF/GBK)
- Typing/AMR outputs: MLST & AMR profiling results
- Pangenome/phylogeny outputs (multi-isolate): Roary outputs & phylogenetic trees
- Final HTML report: merged interactive report generated at the end of the workflow
Note: pangenome & phylogeny are most meaningful with multiple isolates; this test dataset is provided to exercise the full end-to-end workflow quickly.
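After the test run finishes, the consolidated report has to be located inside the Cromwell execution tree. The helper below is a minimal sketch of one way to do that; the function name latest_report is hypothetical, and it assumes the report is called final_report.html, as listed in the Outputs table.

```shell
# Print the most recently modified final report under a Cromwell output root.
# Assumption: the report file is named final_report.html.
latest_report() {
  root="${1:-cromwell-executions}"
  # ls -t sorts the candidates newest-first; head keeps only the newest one.
  find "$root" -type f -name 'final_report.html' -exec ls -t {} + 2>/dev/null | head -n 1
}
```

Called without arguments from the repository root, it searches cromwell-executions/ and prints nothing if no report has been written yet.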
- Java 17 or newer (Oracle JDK)
- Cromwell (v84 or newer)
- Docker (installed & running)
Optional (only required if you build local databases yourself):
- BLAST+ (for indexing local databases)
Install via Conda: conda install -c bioconda blast
Input data:
- Paired-end FASTQ files (Illumina PE recommended)
- Reference genome (FASTA or GenBank)
- Adapter sequence file (FASTA or TXT)
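Because the workflow expects paired-end reads, it can help to confirm that every forward file has a mate before editing the inputs JSON. The following is a small sketch, not part of rMAP itself; the function name missing_mates is hypothetical, and it assumes the _R1.fastq.gz / _R2.fastq.gz naming used in the example JSON.

```shell
# List R1 files in a directory that lack a matching R2 mate.
# Assumption: mates are named *_R1.fastq.gz and *_R2.fastq.gz.
missing_mates() {
  dir="$1"
  for r1 in "$dir"/*_R1.fastq.gz; do
    [ -e "$r1" ] || continue                 # the glob matched nothing
    r2="${r1%_R1.fastq.gz}_R2.fastq.gz"      # derive the expected mate name
    [ -e "$r2" ] || printf '%s\n' "$r1"      # report R1 files without a mate
  done
}
```

An empty result means every R1 file found in the directory has a corresponding R2 file.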
git clone https://github.com/gmboowa/rMAP-2.0.git
cd rMAP-2.0
Download cromwell.jar from the Cromwell releases page (or use your site-provided Cromwell).
Place it in your working directory or provide its full path in commands below.
docker info >/dev/null && echo "Docker is running"
Docker Desktop → Settings → Resources → Advanced
- CPU limit → increase as needed (e.g., 8–15)
- Memory limit → increase (e.g., 12–24 GB if available)
- Swap → optional (2–4 GB is usually sufficient)
- Disk usage limit → increase if pulling many images / large databases
Apply changes (Docker may restart), then confirm:
docker info | grep -E "CPUs|Total Memory"
Edit your input JSON file (e.g., inputs.json) with paths to your:
- Paired-end reads
- Reference genome (FASTA or GenBank)
- Illumina adapter file
- Flags for toggling steps (true/false)
- Optional database configuration (local BLAST, custom AMR/VF DBs)
java -jar cromwell.jar run rMAP.wdl --inputs inputs.json
For the pipeline to execute successfully, the following tasks must be enabled at a minimum:
- Trimming
- Assembly
- Reporting
If you disable optional modules, ensure downstream modules do not depend on them.
rMAP uses Trimmomatic for adapter/quality trimming. By default, Trimmomatic is run with -phred33, which is the standard quality encoding for modern Illumina FASTQ files.
If you need flexibility (e.g., legacy data encoded as Phred+64), you can override the default via the inputs JSON parameter below:
{
"rMAP.trimmomatic_quality_encoding": "phred33"
}
Allowed values:
- "phred33" (default; recommended for Illumina FASTQ)
- "phred64" (legacy encoding; use only if your FASTQ is Phred+64)
If rMAP.trimmomatic_quality_encoding is not provided, rMAP defaults to phred33.
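If it is unclear which encoding a legacy FASTQ uses, a quick look at the quality strings can help. The sketch below uses a common heuristic and is not rMAP's own logic: any quality character below ASCII 59 can only occur in Phred+33 data, while its absence proves nothing. The function name guess_phred is hypothetical, and four-line FASTQ records are assumed.

```shell
# Guess the quality encoding of a FASTQ file (plain or gzipped).
# Only the first 40000 lines (10000 four-line records) are inspected.
guess_phred() {
  # gzip -cdf decompresses gzipped input and passes plain text through unchanged.
  if gzip -cdf "$1" | head -n 40000 | awk 'NR % 4 == 0' | LC_ALL=C grep -q '[!-:]'; then
    echo "phred33"
  else
    echo "ambiguous (possibly phred64)"
  fi
}
```

A result of "ambiguous" only means that no low-valued quality characters were seen in the sampled reads, so the file still needs to be checked against its provenance before choosing "phred64".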
Certain analysis modules require minimum sample numbers to function properly:
| Analysis module | Minimum samples | Required for | JSON parameter to disable |
|---|---|---|---|
| Pangenome analysis (Roary) | 2 | Core/accessory genome separation | "rMAP.do_pangenome": false |
| Phylogenetic analysis (core/accessory trees) | 4 | Meaningful tree topology | "rMAP.do_phylogeny": false |
Tip: rMAP will still run on smaller cohorts if you disable modules that require multi-sample context.
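The thresholds in the table can be turned into a small pre-flight check. This is an illustrative sketch only: the function name flags_to_disable is hypothetical, and the commented jq call assumes that rMAP.input_reads is a flat list with two entries (R1 and R2) per sample, as in the example JSON.

```shell
# Print the module flags that a cohort of N samples is too small for.
# Thresholds follow the table: Roary needs 2 samples, phylogeny needs 4.
flags_to_disable() {
  n="$1"
  [ "$n" -lt 2 ] && echo '"rMAP.do_pangenome": false'
  [ "$n" -lt 4 ] && echo '"rMAP.do_phylogeny": false'
  return 0
}
# Example: derive N from the inputs JSON, then call the function:
#   n=$(jq '."rMAP.input_reads" | length / 2 | floor' inputs.json)
#   flags_to_disable "$n"
```

For a cohort of three samples this prints only the phylogeny flag; for four or more it prints nothing.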
Validate JSON locally with jq or any JSON validator.
jq . inputs.json >/dev/null && echo "JSON OK"
Example JSON (update paths to your environment):
{
"rMAP.input_reads": [
"~/sample1_R1.fastq.gz",
"~/sample1_R2.fastq.gz",
"~/sample2_R1.fastq.gz",
"~/sample2_R2.fastq.gz"
],
"rMAP.adapters": "~/adapters.fa",
"rMAP.reference_genome": "~/reference.gbk",
"rMAP.reference_type": "genbank",
"rMAP.trimmomatic_quality_encoding": "phred33",
"rMAP.do_trimming": true,
"rMAP.do_quality_control": true,
"rMAP.do_assembly": true,
"rMAP.do_variant_calling": true,
"rMAP.do_annotation": true,
"rMAP.do_amr_profiling": true,
"rMAP.do_mlst": true,
"rMAP.do_pangenome": true,
"rMAP.do_phylogeny": true,
"rMAP.do_mge_analysis": true,
"rMAP.do_virulence": true,
"rMAP.do_reporting": true,
"rMAP.do_blast": true,
"rMAP.use_local_blast": true,
"rMAP.local_blast_db": "~/eskapee_db/eskapee_db",
"rMAP.local_amr_db": "~/resfinder.fa",
"rMAP.local_mge_db": "~/plasmidfinder.fa",
"rMAP.local_virulence_db": "~/vfdb.fa",
"rMAP.blast_max_target_seqs": 250,
"rMAP.blast_evalue": 0.000001,
"rMAP.blast_min_contig_length": 300,
"rMAP.virulence_min_cov": 60,
"rMAP.virulence_min_id": 80.0,
"rMAP.phylogeny_model": "-nt -gtr",
"rMAP.max_cpus": 8,
"rMAP.max_memory_gb": 16
}
Important: when using local BLAST, rMAP.local_blast_db must point to the BLAST database prefix (e.g., ~/eskapee_db/eskapee_db), not the FASTA file.
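A prefix mistake is easy to catch before launching the workflow. The check below is a minimal sketch with a hypothetical function name (is_blast_prefix); it looks for the nucleotide database files that makeblastdb typically writes next to the prefix.

```shell
# Return success if PREFIX looks like a nucleotide BLAST database prefix.
# Accepts single-volume (PREFIX.nsq), multi-volume (PREFIX.00.nsq)
# and alias (PREFIX.nal) layouts.
is_blast_prefix() {
  prefix="$1"
  for f in "$prefix".nsq "$prefix".[0-9][0-9].nsq "$prefix".nal; do
    [ -e "$f" ] && return 0
  done
  return 1
}
```

For example, is_blast_prefix "$HOME/eskapee_db/eskapee_db" should succeed for an unpacked snapshot, whereas passing the path of the FASTA file should fail. Using $HOME here avoids relying on ~, which is expanded by the shell rather than by whatever reads the JSON.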
| Step | Tool | Docker image |
|---|---|---|
| Trimming | Trimmomatic | staphb/trimmomatic:0.39 |
| QC | FastQC | staphb/fastqc:0.11.9 |
| Assembly | MEGAHIT | quay.io/biocontainers/megahit:1.2.9--h5ca1c30_6 |
| Annotation | Prokka | staphb/prokka:1.14.6 |
| Variant Calling | Snippy | staphb/snippy:4.6.0 |
| MLST | MLST | staphb/mlst:2.19.0 |
| Pangenome | Roary | gmboowa/roary-pillow:0.4 |
| Phylogeny | FastTree | staphb/fasttree:2.1.11 |
| Tree Visualization | ETE3 | gmboowa/ete3-render:1.18 |
| AMR/MGE/Virulence | Abricate | staphb/abricate:1.0.0 |
| BLAST | BLAST+ | gmboowa/blast-analysis:1.9.4 |
Cromwell typically writes outputs under:
cromwell-executions/
rMAP/
<workflow-id>/
call-TRIMMING/
execution/
stdout
stderr
rc
call-QUALITY_CONTROL/
call-ASSEMBLY/
...
Each call-* directory contains:
- execution/ – shell scripts & logs for the task
- stdout / stderr – standard output & error logs
- rc – return code for the task
- output files generated by the task (e.g., .fasta, .vcf, .tsv, .json, .html, etc.)
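Since every task records its exit status in an rc file, failed tasks can be found by scanning those files. The following is a small sketch under that assumption; the function name failed_tasks is hypothetical.

```shell
# List Cromwell task attempts whose return-code file (rc) is non-zero.
# Prints "<rc> <path-to-rc-file>" for each failing task.
failed_tasks() {
  root="${1:-cromwell-executions}"
  find "$root" -type f -name rc | while IFS= read -r f; do
    code=$(tr -d '[:space:]' < "$f")          # strip the trailing newline
    [ "$code" = "0" ] || printf '%s %s\n' "$code" "$f"
  done
}
```

The stderr file in the same directory as a reported rc file is usually the first place to look for the cause of the failure.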
| Module | Key output files |
|---|---|
| TRIMMING | Trimmed FASTQ files (*.fastq.gz) |
| QUALITY_CONTROL | MultiQC report + FastQC outputs (*.zip, *.html) |
| ASSEMBLY | Assembled contigs (*.fasta) |
| VARIANT_CALLING | Variant calls (*.vcf) |
| ANNOTATION | Prokka annotations (*.gff, *.gbk) |
| AMR_PROFILING | AMR profiles (*.txt, *.tsv) |
| MLST | MLST profiles (*.txt, *.tsv) |
| MGE_ANALYSIS | Plasmid/MGE predictions (*.txt, *.tsv) |
| VIRULENCE_ANALYSIS | Virulence gene predictions (*.txt, *.tsv) |
| BLAST_ANALYSIS | Top BLAST hits (*.tsv, *.xml) |
| PANGENOME | Roary outputs (gene_presence_absence.csv, core_gene_alignment.aln) |
| CORE_PHYLOGENY | Core genome tree + alignment (*.nwk, alignments) |
| ACCESSORY_PHYLOGENY | Accessory tree (*.nwk) |
| TREE_VISUALIZATION | Rendered trees (*.png, *.pdf) |
| MERGE_REPORTS | Consolidated HTML report + assets (final_report.html, assets/*, summaries) |
Interactive HTML reports for several ESKAPEE example cohorts are hosted here:
rMAP-2.0 supports fully offline operation by allowing users to run against local, versioned reference databases. For convenience and reproducibility, we provide a prebuilt ESKAPEE reference BLAST database snapshot and also document how to rebuild the database from public genomes (e.g., RefSeq) when users need a customized or refreshed reference set.
We distribute a ready-to-use ESKAPEE reference database snapshot via Zenodo:
Zenodo record: https://zenodo.org/records/18001238
# 1) Download the archive from Zenodo (or via your browser)
# Example filename (may vary): eskapee_db.tar.gz
# 2) Verify checksum (recommended; compare to the published .sha256 if provided)
sha256sum eskapee_db.tar.gz
# 3) Unpack
tar -xzvf eskapee_db.tar.gz
After extraction, you should see the BLAST database prefix files (e.g., .nsq/.nin/.nhr, etc.). Configure rMAP to use the DB prefix (not the FASTA), for example:
{
"rMAP.use_local_blast": true,
"rMAP.local_blast_db": "~/eskapee_db/eskapee_db"
}
This option is useful if you:
- require local policies/curation,
- want a different assembly level filter,
- need to refresh the database on your own schedule.
mkdir -p ~/refseq/bacteria/eskapee
cd ~/refseq/bacteria/eskapee
Install the tool if not already installed:
pip install ncbi-genome-download
Download RefSeq genomes for the 7 ESKAPEE genera (example filter: complete genomes):
ncbi-genome-download bacteria --genera "Escherichia,Klebsiella,Enterobacter,Acinetobacter,Pseudomonas,Staphylococcus,Enterococcus" --formats fasta --assembly-level complete --section refseq --output-folder eskapee_genomes
find eskapee_genomes -name "*.fna.gz" -print0 | xargs -0 cat > eskapee_db.fasta.gz
gunzip -f eskapee_db.fasta.gz
makeblastdb -in eskapee_db.fasta -dbtype nucl -parse_seqids -title "ESKAPEE_DB" -out eskapee_db
You should now have eskapee_db.nsq, eskapee_db.nin, eskapee_db.nhr, etc. Use the prefix in JSON:
{
"rMAP.use_local_blast": true,
"rMAP.local_blast_db": "~/refseq/bacteria/eskapee/eskapee_db"
}
If your DB is split into multiple volumes (e.g., eskapee_db.00.nsq), still use the common prefix path.
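When many genome FASTAs are concatenated as in the rebuild steps above, repeated sequence identifiers can occasionally slip in, and makeblastdb with -parse_seqids is generally expected to reject them. The pre-check below is a sketch under that assumption; the function name duplicate_ids is hypothetical.

```shell
# Print sequence identifiers that occur more than once in a FASTA file.
# The identifier is the header text up to the first whitespace, without ">".
duplicate_ids() {
  awk '/^>/ { id = substr($1, 2); if (seen[id]++ == 1) print id }' "$1"
}
```

Running duplicate_ids eskapee_db.fasta before makeblastdb prints each repeated identifier once, and prints nothing if all identifiers are unique.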
If you maintain a curated FASTA (eskapee_db.fasta) from a known list of assemblies:
mkdir -p databases/blast/eskapee
cp ~/eskapee_db.fasta databases/blast/eskapee/
cd databases/blast/eskapee
makeblastdb -in eskapee_db.fasta -dbtype nucl -parse_seqids -max_file_sz 3000000000 -out eskapee_db
tar -czvf eskapee_db.tar.gz eskapee_db.*
sha256sum eskapee_db.tar.gz > eskapee_db.tar.gz.sha256
Before running rMAP-2.0 with custom FASTA databases for AMR/plasmid/virulence detection, index each FASTA file with makeblastdb:
makeblastdb -in resfinder.fa -dbtype nucl -parse_seqids
makeblastdb -in plasmidfinder.fa -dbtype nucl -parse_seqids
makeblastdb -in vfdb.fa -dbtype nucl -parse_seqids
Then point rMAP-2.0 to these FASTAs in your inputs JSON:
{
"rMAP.local_amr_db": "~/resfinder.fa",
"rMAP.local_mge_db": "~/plasmidfinder.fa",
"rMAP.local_virulence_db": "~/vfdb.fa"
}
To support reproducible analyses, we plan to refresh and publish reference snapshots on a defined cadence:
- Hotfix updates: on-demand when major upstream reference updates or critical issues are identified
- For large batches, using a local ESKAPEE BLAST database may require substantial disk space (tens of GB depending on scope & assembly level).
- NCBI imposes usage limits on BLAST queries from a single IP address; local databases improve throughput, reproducibility, and compliance with query limits.
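The checksum comparison recommended for downloaded or self-packaged database archives can be scripted rather than compared by eye. This is a minimal sketch with a hypothetical function name (verify_sha256); it assumes the checksum file uses the "hash, two spaces, file name" layout that sha256sum writes.

```shell
# Verify FILE against the first hash recorded in SUMFILE.
# Falls back to "shasum -a 256" where sha256sum is unavailable (e.g., stock macOS).
verify_sha256() {
  file="$1"; sumfile="$2"
  expected=$(awk 'NR == 1 { print $1 }' "$sumfile")
  if command -v sha256sum >/dev/null 2>&1; then
    actual=$(sha256sum "$file" | awk '{ print $1 }')
  else
    actual=$(shasum -a 256 "$file" | awk '{ print $1 }')
  fi
  # An empty expected hash must never count as a match.
  [ -n "$expected" ] && [ "$expected" = "$actual" ]
}
```

A typical call would be verify_sha256 eskapee_db.tar.gz eskapee_db.tar.gz.sha256 && echo "checksum OK".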
We benchmarked rMAP-2.0 using three bacterial isolate WGS cohorts spanning increasing cohort sizes:
- Small / test_data: five Escherichia coli Illumina paired-end isolates (typical genome ≈ 5.0 Mb)
- Medium: 11 Pseudomonas aeruginosa genomes (typical genome ≈ 6.3 Mb)
- Large: 20 Klebsiella pneumoniae genomes (typical genome ≈ 5.5 Mb)
The E. coli cohort served as the standardized, end-to-end runtime benchmark for direct comparison with Bactopia, whereas the medium & large cohorts were used to assess scaling behavior & reporting for multi-isolate analyses, including pangenome reconstruction & core-gene phylogeny.
Interactive test reports generated by rMAP-2.0 are hosted on GitHub Pages:
- Test dataset (5 E. coli): https://gmboowa.github.io/rMAP-2.0/eskapee/test_data/
- Medium dataset (11 Pseudomonas aeruginosa cohort): https://gmboowa.github.io/rMAP-2.0/eskapee/pseudomonas/report.html
- Large dataset (20 Klebsiella pneumoniae cohort): https://gmboowa.github.io/rMAP-2.0/eskapee/klebsiella/report.html
rMAP-2.0 is executed with Cromwell using the default configuration for local runs. This repository does not ship backend configuration files (e.g., cromwell.*.conf) and does not require custom backend configuration for standard local execution.
Run the workflow using Cromwell defaults:
java -jar cromwell.jar run rMAP.wdl --inputs inputs.json
If you plan to run on HPC schedulers or cloud backends, those environments typically require site-specific Cromwell configuration (and/or institutional wrappers for containers). Because these settings vary by institution, they are intentionally not included in this repository.
rMAP-2.0 is designed to support data sovereignty by allowing analyses to run fully on-premises (workstation or HPC) with local inputs & local outputs—no data upload is required by the workflow. All results, intermediate files, and the final consolidated HTML report are written to your local/project storage under the Cromwell execution directories.
rMAP-2.0 uses Docker containers for tool standardization. After the first successful container pull, images are cached locally, so subsequent runs can proceed offline (provided the required images are already present on the machine/cluster).
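To prepare a machine for offline use, the images can be pulled ahead of time while a network connection is still available. The sketch below extracts image references from a Markdown table shaped like the "Tools used (with Docker images)" table; the function name images_from_table is hypothetical, and it assumes the image sits in the third column.

```shell
# Print the Docker image references found in the third column of a
# Markdown table (rows of the form "| Step | Tool | image:tag |").
images_from_table() {
  awk -F'|' 'NF >= 5 && $4 ~ "/" && $4 ~ ":" {
    gsub(/^[ \t]+|[ \t]+$/, "", $4)   # trim surrounding whitespace
    print $4
  }' "$1"
}
# Example (requires network access and a running Docker daemon):
#   images_from_table README.md | xargs -n 1 docker pull
```

Header and separator rows are skipped because they contain no "/" and ":" pair, so only rows that look like image references are printed.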
For sequence similarity screening, rMAP-2.0 supports offline BLAST by allowing users to point the workflow to local BLAST databases (e.g., the ESKAPEE reference DB snapshot or user-built databases). This enables high-throughput analyses without reliance on remote BLAST services & avoids network rate limits while preserving reproducibility through versioned database snapshots.
rMAP-2.0 is versioned and released to support reproducible, comparable analyses across machines (laptop/HPC/cloud) and over time.
Each release (e.g., vX.Y.Z) is an immutable snapshot of:
- Workflow source: rMAP.wdl and all referenced tasks/modules used for that version
- Executable example inputs: curated JSON templates, including the Quick start test dataset configuration (test_data/inputs_test.json)
- Prebuilt reference artifacts (optional):
  - a versioned ESKAPEE BLAST database tarball (or pointers to Zenodo snapshots)
  - corresponding checksums (sha256)
  - basic build metadata (date, scope, number of sequences)
- Documentation snapshot: README updates aligned to that release, including expected outputs & example report links
rMAP-2.0 relies on Docker images to standardize tool versions and ensure consistent outputs. For best reproducibility:
- Prefer pinned tags (avoid latest when possible)
- Keep the “Tools used (with Docker images)” table aligned to the current release
- Record for each run:
  - GitHub Release tag (e.g., vX.Y.Z)
  - container tags and ideally digests
  - database snapshot version (Zenodo record/version or local rebuild date)
Capture image digests used in a run:
docker image inspect --format='{{index .RepoDigests 0}}' <image:tag>
rMAP-2.0 is designed for end-to-end analysis of bacterial isolate whole-genome sequencing (WGS), with an emphasis on Illumina short-read paired-end data and standardized reporting for research and public health use cases (e.g., AMR profiling, MLST, assembly/annotation, pangenome & phylogeny). The workflow is most appropriate when samples represent single-organism isolates (or near-isolates) and when users want a reproducible, containerized pipeline with a consolidated HTML report.
- Metagenomics & mixed communities: rMAP-2.0 is not intended for complex metagenomic samples (e.g., stool, wastewater) where multiple organisms & uneven abundance require dedicated taxonomic profiling, binning & contamination-aware assembly workflows.
- Long-read–only datasets: rMAP-2.0 is optimized and validated for Illumina short-read PE inputs; long-read (ONT/PacBio) or hybrid assemblies may require additional tuning and are not the primary target in this release.
- Species/cohort composition: Some multi-isolate analyses (pangenome/phylogeny) assume broadly comparable genomes; mixed-species cohorts may yield reduced interpretability unless intentionally included (e.g., as outgroups).
- Container runtime constraints: rMAP-2.0 uses Docker for tool standardization. On some HPC systems where Docker is restricted, execution may require Apptainer/Singularity (or a site-approved container runtime).
Docker Desktop → Settings → Resources → Advanced
- Memory: set to 12–24 GB (more if you can)
- CPUs: set to 8 (or ~50–60% of your cores)
- Swap: 2–4 GB (small swap helps; large swap can slow jobs)
- Disk image size: 120–200 GB (store on your fastest disk)
- File sharing: enable VirtioFS (or gRPC-FUSE) if available for faster I/O
- Click Apply & Restart
General (recommended)
- Start Docker Desktop when you sign in (ensures the engine is up before runs)
- Kubernetes: off (unless you need it)
Verify resources inside a container
docker run --rm alpine sh -c 'echo "mem.max=$(cat /sys/fs/cgroup/memory.max 2>/dev/null || echo max)"; grep MemTotal /proc/meminfo'
docker info | grep -E "Total Memory|CPUs"
docker info
If this fails, start Docker Desktop (macOS/Windows) or your Docker service (Linux).
Cromwell and containers can produce large intermediate files. Confirm free space:
df -h
docker system df
You may need to increase Docker disk image size or clean unused images:
docker system prune -a
java -version
Ensure Java 17+.
Confirm your cromwell.jar is accessible and not corrupted:
ls -lh cromwell.jar
java -jar cromwell.jar --version
Ensure rMAP.local_blast_db points to the DB prefix and files exist:
ls -lh /path/to/eskapee_db*
On macOS, sed -i '' is required. Example:
find docs -type f -print0 | xargs -0 sed -i '' 's/example_data/test_data/g'
- Bug reports, feature requests, and questions:
https://github.com/gmboowa/rMAP-2.0/issues
When filing an issue, please include:
- OS + CPU architecture (e.g., macOS Intel, Linux x86_64)
- Java version (java -version)
- Cromwell version
- Docker version
- The command you ran
- The failing task name (call-...) and stderr log (if available)
If you use rMAP-2.0 in your work, please cite:
- rMAP: the Rapid Microbial Analysis Pipeline for ESKAPEE bacterial group whole-genome sequence data
Microbial Genomics (see journal page): https://www.microbiologyresearch.org/content/journal/mgen/10.1099/mgen.0.000583
Recommended repository citation (GitHub + release tag):
- rMAP-2.0 GitHub repository: https://github.com/gmboowa/rMAP-2.0
If using the prebuilt ESKAPEE reference DB snapshot, cite the Zenodo record:
This project is licensed under the MIT License.
- rMAP-2.0 builds on many excellent open-source bioinformatics tools. We acknowledge & thank the authors & maintainers of these tools and their communities.
- The workflow design emphasizes reproducibility, portability, and practical reporting for bacterial genomics in research & public health settings.
If you are performing MLST typing across many samples, we recommend downloading and setting up PubMLST schemes locally when operating at scale. A local installation can improve throughput, avoids dependency on internet connectivity, and supports reproducible analysis across species.
For each analysis (especially publications), record:
- rMAP-2.0 release tag (or commit SHA if no release)
- Inputs JSON used
- Database snapshot version (Zenodo or local rebuild date)
- Docker image tags and (ideally) digests
- Cromwell version and the exact command used
- Hardware summary (CPU/RAM)
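Part of this checklist can be captured automatically at launch time. The function below is an illustrative sketch, not a feature of rMAP-2.0: the name run_manifest and the tab-separated field names are hypothetical, and the database version, Docker digests, and Cromwell version still need to be added separately.

```shell
# Write a small provenance record for one run to stdout (tab-separated).
# Arguments: release tag (or commit SHA), inputs JSON path, exact command used.
run_manifest() {
  tag="$1"; inputs="$2"; cmd="$3"
  if command -v sha256sum >/dev/null 2>&1; then
    sum=$(sha256sum "$inputs" | awk '{ print $1 }')
  else
    sum=$(shasum -a 256 "$inputs" | awk '{ print $1 }')
  fi
  printf 'release\t%s\n' "$tag"
  printf 'inputs_json\t%s\n' "$inputs"
  printf 'inputs_sha256\t%s\n' "$sum"          # ties the record to the exact JSON
  printf 'command\t%s\n' "$cmd"
  printf 'date_utc\t%s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)"
  printf 'host\t%s\n' "$(uname -srm)"
  printf 'cpus\t%s\n' "$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo unknown)"
}
```

Redirecting the output to a file next to the results (for example, run_manifest vX.Y.Z inputs.json "java -jar cromwell.jar run rMAP.wdl --inputs inputs.json" > run_manifest.tsv) keeps the record with the analysis.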
