Prepare non-English SQuAD  #1530

@yingzwang

Description

I saw that you released GermanQuAD. (https://huggingface.co/datasets/deepset/germanquad)
Could you share some details on how you prepared the dataset?
I want to train a monolingual Dutch QA model. I plan to use an existing pre-trained Dutch BERT base model and prepare some Dutch SQuAD samples myself. I have the following ideas:

  1. machine translation of English SQuAD.
  2. human annotation of 500 Dutch question-passage-answer (Q-P-A) triplets in my domain.

If I skip 2), would 1) already be enough?
If I add 2), would 500 Q-P-A samples be enough?
By "enough" I mean at least better than a multilingual model.

I've tried the multilingual model deepset/xlm-roberta-large-squad2 on my Dutch samples. It works to some extent, but the performance is not satisfactory, which is expected. I feel that training a monolingual Dutch QA model would be much better. Your advice and tips are much appreciated!
