In this section, we present the experimental setup of our approach. We first describe the dataset, and then present the classification scheme and the algorithm employed in the experiment. The results and discussion are presented in the next section.
To test the effectiveness of sequential rules over part-of-speech tags and function words for authorship attribution, we used texts written by Balzac, Dumas, France, Gautier, Hugo, Maupassant, Proust, Sand, Sue and Zola. This choice was motivated by our particular interest in classic French literature of the 19th century and by the availability of electronic texts by these authors on the Project Gutenberg website and in the Gallica electronic library. We also wanted to cover the most important writing styles and trends of this period. For each of these 10 authors we collected four novels, giving 40 novels in total.
The next step was to divide these novels into smaller pieces of text in order to have enough data instances to train the attribution algorithm. Researchers working on authorship attribution for literary data have used different division strategies. For example, Hoover [HOO 03] took only the first 10,000 words of each novel as a single text, while Argamon and Levitan [ARG 05] treated each chapter of each book as a separate text. In our experiment, we chose to slice the novels into pieces the size of the smallest novel in the collection, measured in number of sentences. This choice respects the condition proposed by Eder [EDE 13], which specifies the smallest reasonable text size for achieving good attribution. More information about the dataset used in the experiment is presented in Table 8.2.
Table 8.2. Statistics for the dataset used in our experiment
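The slicing strategy described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `split_sentences` and `slice_novels` are hypothetical, the regular-expression sentence splitter is a naive stand-in for whatever segmenter was actually used, and dropping any trailing remainder shorter than a full slice is an assumption, since the text does not say how leftovers were handled.

```python
import re


def split_sentences(text):
    """Naive sentence splitter (assumption): break after '.', '!' or '?'
    when followed by whitespace. A real study would use a proper
    French sentence segmenter."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def slice_novels(novels):
    """Slice every novel into pieces as long as the smallest novel.

    novels: dict mapping (author, title) -> raw text.
    Returns (chunk_size, instances), where each instance is a tuple
    (author, title, list_of_sentences).
    """
    sentences = {key: split_sentences(text) for key, text in novels.items()}

    # Slice size = number of sentences in the smallest novel.
    chunk_size = min(len(s) for s in sentences.values())
    if chunk_size == 0:
        raise ValueError("every novel must contain at least one sentence")

    instances = []
    for (author, title), sents in sentences.items():
        # Assumption: a trailing remainder shorter than chunk_size is dropped.
        for start in range(0, len(sents) - chunk_size + 1, chunk_size):
            instances.append((author, title, sents[start:start + chunk_size]))
    return chunk_size, instances


# Toy usage with made-up texts (not taken from the corpus).
toy = {
    ("AuthorA", "novel1"): "One. Two. Three. Four. Five.",
    ("AuthorB", "novel2"): "Un. Deux.",
}
size, pieces = slice_novels(toy)
```

With this scheme every training instance has the same length in sentences, so the classifier never compares texts of very different sizes; the cost is that longer novels contribute more instances than shorter ones.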