Corpus Stratifying¶

[Task of Domain Adoption and Customization]

Purpose¶

Prepare the annotated corpus to train the model and validate it’s performance.

Description¶

Given the amount and complexity of annotated data, a test set to evaluate the performance of the system has to be splitted from the training data. Purpose is to create a certain level of confidence. Classic 70-20-10 splits may not be satisfiing.

Steps¶

Measure corpus metrics (inter annotator agreement (IAA), …)
Define amount of necessary test data (depends on goal)
Split the remaining training data in training and development / validation splits.

Artifacts¶

Requires¶

Creates¶

Responsible Role¶

Data Scientist