How to provide multiple datasets with different weightage for llava next pre-training script? #15185
gagangayari started this conversation in General
Replies: 1 comment
For multi-dataset blending in NeMo VLM training, you need to configure the data module with weighted datasets.

Configuration approach:

```python
from nemo.collections.vlm import LlavaNextDataModule

data_module = LlavaNextDataModule(
    paths=[
        "/data/dataset1",
        "/data/dataset2",
        "/data/dataset3",
    ],
    weights=[0.5, 0.3, 0.2],  # 50%, 30%, 20%
    global_batch_size=32,
    micro_batch_size=4,
    num_workers=8,
)
```

YAML config version:

```yaml
model:
  data:
    data_prefix:
      - 0.5
      - /data/dataset1
      - 0.3
      - /data/dataset2
      - 0.2
      - /data/dataset3
    # Weights are normalized automatically
```

For the pre-training script:

```shell
python scripts/vlm/llava_next_pretrain.py \
    --config-path=conf \
    --config-name=llava_next_pretrain \
    model.data.data_prefix="[0.5,/data/d1,0.3,/data/d2,0.2,/data/d3]"
```

Tips for data mixing:
Validation split:

```yaml
model:
  data:
    splits_string: "98,1,1"  # train, val, test
```

We train multimodal models at Revolution AI, and data mix ratios significantly impact quality. Start with equal weights, then tune based on eval results.
I am trying to run this llava-next pre-training script. However, I am unable to find how to provide a data mix for blending different datasets. Please help me with this.