Tests of the network extraction procedure

The test corpora

To test the original procedure, we used three stylistically and thematically different corpora: the PAP corpus, which consists of 51,574 press releases of the Polish Press Agency and contains over 2,900,000 words; a sub-corpus of the National Corpus of Polish, comprising 3,363 separate documents and over 860,000 words; and a literary corpus consisting of 10 short stories and the novel Lalka (The Doll) by the influential novelist Boleslaw Prus. All three corpora were lemmatized using a dictionary-based approach [KOR 12]. The procedure performed equally well on all three corpora. We then chose to carry out the tests described below on the largest of them, the PAP corpus.

Evaluation of the extracted sub-graph

To evaluate the quality of the extracted sub-graph, we use two separate criteria: first, the semantic consistency of the sub-graph, and second, how well the sub-graph matches the text collection.

