Description
I saw that you released GermanQuAD (https://huggingface.co/datasets/deepset/germanquad).
Could you share some details on how you prepared the dataset?
I want to train a monolingual Dutch QA model. I plan to use an existing pre-trained Dutch BERT base model and prepare some Dutch SQuAD-style samples myself. I have the following two ideas:
1. machine translation of the English SQuAD dataset (a rough sketch of what I have in mind is below).
2. human annotation of 500 Dutch question-passage-answer (Q-P-A) triplets in my domain.
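For idea 1, something like this minimal sketch is what I have in mind. The Helsinki-NLP/opus-mt-en-nl model and the squad_v2 slice are just assumptions for illustration, and the hard part, re-aligning the answer spans after translation, is only hinted at in the comments:

```python
from datasets import load_dataset
from transformers import pipeline

# EN->NL translation model; Helsinki-NLP/opus-mt-en-nl is just one possible
# public MT model, picked here for illustration.
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-nl")

# Small slice of SQuAD v2 for illustration; the full dataset would need batching.
squad = load_dataset("squad_v2", split="train[:5]")

for sample in squad:
    question_nl = translator(sample["question"])[0]["translation_text"]
    # Long contexts can exceed the MT model's max input length, so in practice
    # the context should be split into sentences before translation.
    context_nl = translator(sample["context"])[0]["translation_text"]
    # Caveat: SQuAD answer spans are character offsets into the English
    # context. After translation, the answer text has to be translated too
    # and re-located in context_nl; samples where it cannot be found back
    # are usually dropped.
    print(question_nl, "->", context_nl[:80])
```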
If I skip 2), would 1) already be enough?
If I add 2), would 500 Q-P-A samples be enough?
By "enough" I mean at least performing better than a multilingual model.
I've tried the multilingual model deepset/xlm-roberta-large-squad2 on my Dutch samples. It works to some extent, but the performance is not satisfactory, which is expected. I feel that training a monolingual Dutch QA model will do much better. Your advice and tips are much appreciated!
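For reference, this is roughly how I evaluated the multilingual baseline. A minimal sketch; the Dutch question and context here are invented for illustration:

```python
from transformers import pipeline

# The multilingual baseline, loaded via the question-answering pipeline.
qa = pipeline("question-answering", model="deepset/xlm-roberta-large-squad2")

result = qa(
    question="Waar is het hoofdkantoor gevestigd?",  # "Where is the headquarters located?"
    context=(
        "Het bedrijf werd in 1998 opgericht en het hoofdkantoor "
        "is gevestigd in Amsterdam."
    ),
)
print(result["answer"], result["score"])  # e.g. "Amsterdam" plus a confidence score
```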