Question Answering

September 3, 2021, Daniel Hládek, project

annotation, question-answer, nlp

Project Description

Create a clone of SQuAD 2.0 in the Slovak language
Set up annotation infrastructure with Prodigy
Perform and evaluate annotations of Wikipedia data

Auxiliary tasks:

Consider using machine translation
Train and evaluate a question answering model

People

Daniel Hládek (responsible researcher).
Tomáš Kuchárik (student, help with web app).
Ján Staš (BERT model).
Ondrej Megela, Oleh Bilykh, Matej Čarňanský (auxiliary tasks).
other students and annotators (annotations).

Finished Tasks

Raw Data Preparation

Input: Wikipedia

Output: a set of paragraphs

Obtaining and parsing the Wikipedia dump
Selecting feasible paragraphs

Done:

Wiki parsing script (Daniel Hládek)
PageRank script (Daniel Hládek)
selection of paragraphs: select all good paragraphs and shuffle
fix minor errors
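
The "select all good paragraphs and shuffle" step can be sketched roughly as follows. This is only an illustration, not the project's actual script: the function name select_paragraphs and the 500-character minimum length are assumptions (the limit is borrowed from the paragraph filter described in the SQuAD paper).

```python
import random


def select_paragraphs(paragraphs, min_chars=500, seed=0):
    """Keep paragraphs long enough for annotation, then shuffle them.

    min_chars is an assumed threshold; the real selection rule may differ.
    """
    good = [p.strip() for p in paragraphs if len(p.strip()) >= min_chars]
    # A seeded generator keeps the shuffled order reproducible.
    rng = random.Random(seed)
    rng.shuffle(good)
    return good
```

A fixed seed means the same dump always yields the same annotation order, which helps when the paragraph set has to be regenerated after fixing minor errors.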

To be done:

Select the largest articles (to be compatible with SQuAD).

Notes:

PageRank causes a bias towards geography; random selection might be the best option
75 best articles
167 good articles
Wiki Facts
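
For reference, PageRank over the article link graph can be sketched with a plain power iteration. This is a minimal sketch, not the project's PageRank script: the input format (a dictionary from page title to the list of titles it links to), the damping factor 0.85 and the fixed number of iterations are assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank.

    links: dict mapping a page title to the list of titles it links to
    (an assumed input format).
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives the teleport share first.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page without outgoing links spreads its rank evenly.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank
```

Pages that many articles link to collect the highest rank, which is one plausible reason for a topical bias such as the one noted above.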

Annotation Manual

Output: Recommendations for annotators

Done:

Web Page for annotators (Daniel Hládek)
Motivation video (Daniel Hládek)
Video with instructions (Daniel Hládek) bn application?

Question Annotation

An annotation recipe for Prodigy

Input: A set of paragraphs

Output: 5 questions for each paragraph

Done:

a data preparation script (Daniel Hládek)
annotation recipe for Prodigy (Daniel Hládek)
deployment at question.tukekemt.xyz (accessible only from the TUKE network) (Daniel Hládek)
answer annotation together with question (Daniel Hládek)
prepare final input paragraphs (dataset)

Annotation Web Application

Annotation work summary, web application

Input: Database of annotations

Output: Summary of work performed by each annotator

Done:

application template (Tomáš Kuchárik)
Dockerfile (Daniel Hládek)
web application for annotation analysis in Flask (Tomáš Kuchárik, Daniel Hládek)
application deployment (Daniel Hládek)
extract annotations from question annotation in SQuAD format (Daniel Hládek)
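
The SQuAD layout nests questions under paragraphs and paragraphs under articles. The sketch below shows one way flat annotation records could be grouped into that layout. The record keys (title, context, question, answer, id) and the function name to_squad are hypothetical, and the answer offset is located with a simple substring search, whereas a real export would more likely use the character offsets stored by the annotation tool.

```python
def to_squad(records, version="1.1"):
    """Group flat annotation records into the nested SQuAD JSON layout.

    Each record is assumed to be a dict with the hypothetical keys
    'title', 'context', 'question', 'answer' and 'id'.
    """
    articles = {}
    for rec in records:
        # First occurrence only; stored offsets would be more reliable.
        start = rec["context"].find(rec["answer"])
        if start < 0:
            continue  # the answer is not a span of the paragraph
        paragraphs = articles.setdefault(rec["title"], {})
        qas = paragraphs.setdefault(rec["context"], [])
        qas.append({
            "id": rec["id"],
            "question": rec["question"],
            "answers": [{"text": rec["answer"], "answer_start": start}],
        })
    data = [
        {
            "title": title,
            "paragraphs": [
                {"context": context, "qas": qas}
                for context, qas in paragraphs.items()
            ],
        }
        for title, paragraphs in articles.items()
    ]
    return {"version": version, "data": data}
```

The returned dictionary can be written to disk with json.dump(..., ensure_ascii=False) so that Slovak diacritics stay readable in the file.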

Annotation Validation

Input: annotated questions and paragraphs

Output: good annotated questions

Done:

Recipe for validation (binary annotation of paragraphs, questions and answers; text fields for correcting the question and the answer) (Daniel Hládek)
Deployment

Tasks in progress

Unanswerable question annotation

Input: validated questions and answers

Output: Unanswerable questions and answers

Done:

Annotation manual
Annotation interface
Database schema modifications
Modification of the database application
Export of validations
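
In SQuAD 2.0, an unanswerable question is stored with is_impossible set to true, an empty answers list, and optionally a plausible_answers list holding a span that looks like an answer but is not one. A minimal sketch of building such an entry follows; the function name and its parameters are hypothetical and do not describe the project's database schema.

```python
def make_unanswerable(qa_id, question, context, plausible_text=None):
    """Build a SQuAD 2.0 style entry for an unanswerable question.

    plausible_text is an optional distractor span from the paragraph.
    """
    plausible = []
    if plausible_text:
        start = context.find(plausible_text)
        if start >= 0:
            plausible.append({"text": plausible_text, "answer_start": start})
    return {
        "id": qa_id,
        "question": question,
        "is_impossible": True,
        "answers": [],
        "plausible_answers": plausible,
    }
```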

In progress:

Annotation process optimization

Final Data Export

Input: Validations and unanswerable questions

Output: Final database in SQuAD format

Done:

Preliminary export script

To be done:

Final export script
Database web visualization
Prepare development set
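
One possible way to prepare the development set is to split the exported data by whole articles, so that paragraphs of one article never appear in both parts. The sketch below assumes a dictionary in SQuAD layout; the 10 % development fraction, the seed and the function name split_by_article are illustrative assumptions.

```python
import random


def split_by_article(squad, dev_fraction=0.1, seed=0):
    """Split a SQuAD-style dict into (train, dev) by whole articles."""
    articles = list(squad["data"])
    rng = random.Random(seed)
    rng.shuffle(articles)
    # Keep at least one article in the development set when data exists.
    n_dev = max(1, round(len(articles) * dev_fraction)) if articles else 0
    dev, train = articles[:n_dev], articles[n_dev:]
    version = squad.get("version", "")
    return (
        {"version": version, "data": train},
        {"version": version, "data": dev},
    )
```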

Resources

Bibliography

Reading Wikipedia to Answer Open-Domain Questions, Danqi Chen, Adam Fisch, Jason Weston, Antoine Bordes (Facebook Research)
SQuAD: 100,000+ Questions for Machine Comprehension of Text https://arxiv.org/abs/1606.05250
WDAqua publications

Existing Datasets

SQuAD: The Stanford Question Answering Dataset (Rajpurkar et al., 2016)
WebQuestions
Freebase

Intern tasks

Week 1: Intro

Get acquainted with the project and the SQuAD dataset
Download the dataset and study the bibliography
Study the Prodigy annotation tool
Read SQuAD: 100,000+ Questions for Machine Comprehension of Text
Read Know What You Don't Know: Unanswerable Questions for SQuAD

Output:

Short report

Weeks 2-4: The System

Select and train a working question answering system

Output:

a deployment script with comments for a selected question answering system

Weeks 5-7: The Model

Take a working training recipe (it may use English data), either a commented script or a Jupyter notebook, and use it to train a model

Output:

a trained model
evaluation of the model (if possible)
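
Question answering models on SQuAD-style data are usually evaluated with exact match and token-level F1 between the predicted and the reference answer. The sketch below follows the idea of the official SQuAD evaluation but is not that script: it lowercases, strips ASCII punctuation and normalizes whitespace, and it deliberately skips the removal of English articles, on the assumption that this step is not meaningful for Slovak.

```python
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop ASCII punctuation and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return " ".join(text.split())


def exact_match(prediction, gold):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))


def f1_score(prediction, gold):
    """Token-level F1 between a predicted and a reference answer."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match (e.g. unanswerable questions).
        return float(pred_tokens == gold_tokens)
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

When a question has several reference answers, the usual practice is to take the maximum score over the references and then average over all questions.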