Corpus Stratifying

[Task of Domain Adoption and Customization]

Purpose

Prepare the annotated corpus for training the model and validating its performance.

Description

Given the amount and complexity of the annotated data, a test set for evaluating the system's performance has to be split off from the training data. The purpose is to establish a defined level of confidence in the measured results. Classic 70-20-10 splits may not be satisfactory.
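To make the link between test set size and confidence concrete, the following is a minimal sketch, not a prescribed method. It assumes the evaluation metric is a proportion such as accuracy, that test items are independent, and that the normal approximation to the binomial is acceptable. The function name required_test_size and the example margins are hypothetical and chosen only for illustration.

```python
import math
from statistics import NormalDist


def required_test_size(margin, confidence=0.95, expected_score=0.5):
    """Smallest n whose normal-approximation interval is within +/- margin.

    Assumption: the metric is a proportion (e.g. accuracy) measured on
    independent test items. expected_score=0.5 is the worst case.
    """
    if not 0 < margin < 1 or not 0 < confidence < 1:
        raise ValueError("margin and confidence must lie strictly between 0 and 1")
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # two-sided critical value
    variance = expected_score * (1 - expected_score)  # largest at 0.5
    return math.ceil(z * z * variance / (margin * margin))


if __name__ == "__main__":
    # Illustrative margins only; the right value depends on the project goal.
    for margin in (0.05, 0.02, 0.01):
        print(margin, required_test_size(margin))
```

Under these assumptions the required test size depends on the desired margin, not on the corpus size, which is one reason a fixed percentage split can be too small for a small corpus or wastefully large for a big one.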

Steps

  • Measure corpus metrics (inter-annotator agreement (IAA), …)

  • Define the amount of test data needed (depends on the goal)

  • Split the remaining data into training and development / validation sets.
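The steps above can be sketched as follows. This is an illustrative sketch using only the Python standard library, under stated assumptions: Cohen's kappa stands in for the IAA metric (it covers two annotators only), each item carries a single label to stratify on, and labels are sortable. The names cohen_kappa, stratified_split and label_of are hypothetical helpers, not part of any existing tool.

```python
import random
from collections import Counter, defaultdict


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected by chance from each annotator's label distribution.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


def stratified_split(items, label_of, test_frac, dev_frac, seed=0):
    """Split items into train/dev/test, keeping label proportions per split."""
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    by_label = defaultdict(list)
    for item in items:
        by_label[label_of(item)].append(item)
    train, dev, test = [], [], []
    for label in sorted(by_label):  # assumes labels are sortable
        group = by_label[label]
        rng.shuffle(group)
        n_test = round(len(group) * test_frac)
        n_dev = round(len(group) * dev_frac)
        # Test data is taken first; the remainder is divided into dev and train.
        test.extend(group[:n_test])
        dev.extend(group[n_test:n_test + n_dev])
        train.extend(group[n_test + n_dev:])
    return train, dev, test


if __name__ == "__main__":
    # Toy corpus: 200 items, 40 labelled "pos" and 160 labelled "neg".
    corpus = [(i, "pos" if i % 5 == 0 else "neg") for i in range(200)]
    train, dev, test = stratified_split(corpus, lambda x: x[1], 0.2, 0.1)
    print(len(train), len(dev), len(test))
```

Splitting per label keeps rare classes represented in every split, which a purely random split does not guarantee. The fractions 0.2 and 0.1 are placeholders; in practice the test fraction would follow from the amount of test data defined in the second step.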

Responsible Role