Prepare non-English SQuAD  #1530

@yingzwang

Description

I saw that you released GermanQuAD. (https://huggingface.co/datasets/deepset/germanquad)
Could you share some details on how you prepared the dataset?
I want to train a monolingual Dutch QA model. I plan to use an existing pre-trained Dutch BERT base model and prepare some Dutch SQuAD samples myself. I have the following ideas:

  1. machine translation of English SQuAD.
  2. human annotation of 500 Dutch question-passage-answer (Q-P-A) triplets in my domain.

If I skip 2), would 1) already be enough?
If I add 2), would 500 Q-P-A samples be enough?
By "enough" I mean at least better than a multilingual model.

I've tried the multilingual model deepset/xlm-roberta-large-squad2 on my Dutch samples. It works to some extent, but the performance is not satisfactory, which is expected. I feel that training a monolingual Dutch QA model would be much better. Your advice and tips are much appreciated!
