How to provide multiple datasets with different weightage for llava next pre-training script? #15185
gagangayari started this conversation in General
Replies: 1 comment
For multi-dataset blending in NeMo VLM training, you need to configure the data module with weighted datasets.

Configuration approach:

```python
from nemo.collections.vlm import LlavaNextDataModule

data_module = LlavaNextDataModule(
    paths=[
        "/data/dataset1",
        "/data/dataset2",
        "/data/dataset3",
    ],
    weights=[0.5, 0.3, 0.2],  # 50%, 30%, 20%
    global_batch_size=32,
    micro_batch_size=4,
    num_workers=8,
)
```

YAML config version:

```yaml
model:
  data:
    data_prefix:
      - 0.5
      - /data/dataset1
      - 0.3
      - /data/dataset2
      - 0.2
      - /data/dataset3
    # Weights are normalized automatically
```

For the pre-training script:

```shell
python scripts/vlm/llava_next_pretrain.py \
    --config-path=conf \
    --config-name=llava_next_pretrain \
    model.data.data_prefix="[0.5,/data/d1,0.3,/data/d2,0.2,/data/d3]"
```

Tips for data mixing:
Validation split:

```yaml
model:
  data:
    splits_string: "98,1,1"  # train, val, test
```

We train multimodal models at Revolution AI, and data mix ratios significantly impact quality. Start with equal weights, then tune based on eval results.
I am trying to run this llava-next pre-training script. However, I am unable to find how to provide a data mix for blending different datasets. Please help me with this.