Results and discussion

The attribution performance of the feature sets described in our experimental setup is summarized in Table 8.3 for features derived from function words and in Table 8.4 for those derived from Part-Of-Speech tags. On our corpus, function word frequencies and Part-Of-Speech tag 3-gram frequencies, which achieve nearly perfect attribution, generally outperform features based on sequential rules.

Our study shows that an SVM classifier combined with features extracted using sequential data mining techniques can achieve high attribution performance (e.g. F1 = 0.939 for Top 300 FW-SR). Up to a certain limit, adding more rules increases the attribution performance (e.g. F1 = 0.733 for Top 100 POS-SR compared with F1 = 0.880 for Top 800 POS-SR).

Contrary to our hypothesis, function word frequency features, which rely on the bag-of-words assumption and are therefore blind to sequential information, outperform features extracted using the sequential rule mining technique. The same can be said for the Part-Of-Speech tag 3-grams.
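To make the "blind to sequential information" point concrete, here is a minimal sketch of function word frequency features. The word list `FUNCTION_WORDS`, the whitespace tokenization and the two example sentences are illustrative assumptions, not the list or preprocessing used in the experiment.

```python
from collections import Counter

# Hypothetical, tiny function word list; the list used in the chapter is not given here.
FUNCTION_WORDS = ("the", "of", "and", "to", "in")


def function_word_frequencies(text, vocabulary=FUNCTION_WORDS):
    """Return the relative frequency of each function word in `text`."""
    tokens = text.lower().split()  # naive whitespace tokenization (assumption)
    counts = Counter(tokens)
    total = len(tokens)
    if total == 0:
        return [0.0] * len(vocabulary)
    return [counts[word] / total for word in vocabulary]


original = "the taste of the cake and the smell of tea"
shuffled = "tea of smell the and cake the of taste the"  # same words, different order

vec_original = function_word_frequencies(original)
vec_shuffled = function_word_frequencies(shuffled)

# Both texts map to the same feature vector: word order is discarded.
print(vec_original == vec_shuffled)
```

Because the vector only records how often each word occurs, any reordering of the same tokens yields identical features, which is exactly the information that sequential rules were expected to add.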

Feature set        P       R       F1
Top 100 FW-SR      0.901   0.886   0.893
Top 200 FW-SR      0.942   0.933   0.937
Top 300 FW-SR      0.940   0.939   0.939
FW frequencies     0.990   0.988   0.988

Table 8.3. Five-fold cross-validation for our dataset. SR refers to sequential rules and FW refers to function words

By taking a closer look at the sequential rules extracted from the Part-Of-Speech tag sequences, we found that these rules, especially the most frequent ones, are more likely to be language-grammar dependent (e.g. ADJ ⇒ NC, PONCT with sup = 63,569 and DET, NC, P ⇒ ADJ with sup = 63,370).
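As an illustration of what the support of such a rule measures, the sketch below counts the tag sequences in which a rule holds. It assumes the common reading of a sequential rule (antecedent ⇒ consequent) in which both sides are unordered sets of tags and the rule holds when every antecedent tag occurs before every consequent tag; the chapter's exact definition and mining algorithm are not restated here, and the tagged sentences are hypothetical.

```python
def rule_holds(sequence, antecedent, consequent):
    """True if some split of `sequence` has all antecedent items before it
    and all consequent items after it."""
    first_seen, last_seen = {}, {}
    for position, item in enumerate(sequence):
        first_seen.setdefault(item, position)
        last_seen[item] = position
    if not all(item in first_seen for item in antecedent | consequent):
        return False
    # Earliest position by which every antecedent item has appeared.
    antecedent_complete = max(first_seen[item] for item in antecedent)
    # Latest position from which every consequent item can still be found.
    consequent_start = min(last_seen[item] for item in consequent)
    return antecedent_complete < consequent_start


def support(sequences, antecedent, consequent):
    """Number of sequences in which the rule antecedent => consequent holds."""
    return sum(rule_holds(seq, antecedent, consequent) for seq in sequences)


# Hypothetical POS-tagged sentences (tag names follow those quoted in the text).
sentences = [
    ["DET", "NC", "P", "ADJ"],
    ["ADJ", "NC", "PONCT"],
    ["DET", "ADJ", "NC", "PONCT"],
    ["NC", "ADJ"],
]

sup_adj = support(sentences, {"ADJ"}, {"NC", "PONCT"})
sup_det = support(sentences, {"DET", "NC", "P"}, {"ADJ"})
print(sup_adj, sup_det)
```

Rules built from very common tag patterns hold in most sentences of any author, which is why high support alone says little about their discriminative power.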

To reduce this effect, we added a TF-IDF-like heuristic that measures the overall discriminative power of each sequential rule. The TF-IDF-like weight of a sequential rule R present in a text t is calculated as follows:

where supp_t(R) is the support of the rule R in the text t, N is the total support of all rules in the corpus and N_t is the total support of all rules in the text t.

Results given by this TF-IDF-like weighting (Table 8.5) are better than the original ones, but they still do not reach the performance of the state-of-the-art style markers. This suggests that future studies should add an adequate feature selection method to filter the rules and keep the most relevant ones.

When we analyze the attribution performance for each author separately, we notice a significant variance between authors (e.g. F1 = 1 for Proust compared with F1 = 0.673 for Dumas); some individual results are presented in Table 8.6. This is because some authors have a more characteristic style than others in the works used for the experiment. This property can be clearly visualized by carrying out a principal component analysis (see Figure 8.2) on the 40 books used in the dataset.

Feature set               P      R      F1
Top 200 POS-SR            0.72   0.70   0.71
Top 300 POS-SR            0.83   0.81   0.82
Top 400 POS-SR            0.84   0.83   0.83
Top 500 POS-SR            0.85   0.84   0.84
Top 600 POS-SR            0.87   0.85   0.86
Top 700 POS-SR            0.88   0.86   0.87
Top 800 POS-SR            0.88   0.87   0.88
POS 3-gram frequencies    0.99   0.99   0.99

Table 8.4. Five-fold cross-validation results for our dataset. SR refers to sequential rules and POS refers to Part-Of-Speech
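For reference, the Part-Of-Speech 3-gram frequency baseline in the last row can be sketched as follows. This is a minimal illustration that assumes relative frequencies over contiguous tag triples; the tag sequence is hypothetical and the chapter's exact normalization is not restated here.

```python
from collections import Counter


def pos_trigram_frequencies(tags):
    """Map each contiguous POS tag 3-gram to its relative frequency."""
    trigrams = [tuple(tags[i:i + 3]) for i in range(len(tags) - 2)]
    counts = Counter(trigrams)
    total = len(trigrams)
    # An input shorter than three tags has no 3-grams and yields an empty dict.
    return {gram: count / total for gram, count in counts.items()}


# Hypothetical tag sequence for a short passage.
tags = ["DET", "NC", "P", "DET", "NC", "P", "ADJ"]
freqs = pos_trigram_frequencies(tags)
print(freqs[("DET", "NC", "P")])
```

Unlike single function word counts, each 3-gram keeps the order of three adjacent tags, but nothing beyond that window.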

Feature set       P*     R*     F1*
Top 200 POS-SR    0.82   0.79   0.81
Top 300 POS-SR    0.86   0.84   0.85
Top 400 POS-SR    0.87   0.86   0.87
Top 500 POS-SR    0.89   0.88   0.88
Top 600 POS-SR    0.89   0.88   0.88
Top 700 POS-SR    0.91   0.90   0.90
Top 800 POS-SR    0.92   0.91   0.91

Table 8.5. Five-fold cross-validation results given by considering the TF-IDF-like weighting for our dataset. SR refers to sequential rules and POS refers to Part-Of-Speech

Even if these results are in line with previous works claiming that bag-of-words-based features are more relevant than sequence-based features for stylistic attribution [ARG 05], they show that style markers extracted using sequential rule mining techniques can be valuable for authorship attribution. We believe that our results open the door to a promising line of research that integrates sequential data mining techniques to extract more linguistically motivated style markers for computational stylistics and authorship attribution.

Author name    P      R      F1
Balzac         0.88   0.75   0.80
Dumas          0.65   0.69   0.67
France         0.92   0.96   0.93
Gautier        0.95   0.85   0.89
Hugo           0.88   0.95   0.91
Maupassant     1.00   0.85   0.91
Proust         1.00   1.00   1.00
Sand           0.92   0.90   0.91
Sue            0.86   0.86   0.86
Zola           0.98   1.00   0.99

Table 8.6. Individual 5-fold cross-validation results for each author evaluated for the Top 700 Part-Of-Speech tag sequential rules
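Per-author scores of this kind can be computed from the true and predicted labels of the test texts. The sketch below assumes the standard per-class definitions of precision, recall and F1; the labels are toy data, not the experiment's predictions. Note that scores averaged over cross-validation folds need not equal the harmonic mean of the averaged P and R, which may explain small differences in the table.

```python
from collections import Counter


def per_class_prf(y_true, y_pred):
    """Return {label: (precision, recall, f1)} using the standard definitions."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, guess in zip(y_true, y_pred):
        if truth == guess:
            tp[truth] += 1
        else:
            fp[guess] += 1  # predicted this author wrongly
            fn[truth] += 1  # missed the true author
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[label] = (precision, recall, f1)
    return scores


# Toy labels for illustration only.
y_true = ["Proust", "Proust", "Dumas", "Dumas", "Sue", "Sue"]
y_pred = ["Proust", "Proust", "Dumas", "Sue", "Sue", "Sue"]

scores = per_class_prf(y_true, y_pred)
for author, (p, r, f1) in scores.items():
    print(f"{author}: P={p:.2f} R={r:.2f} F1={f1:.2f}")
```

In this toy run one Dumas text is attributed to Sue, which lowers the recall for Dumas and the precision for Sue while leaving Proust untouched, the same kind of asymmetry visible between authors in the table.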

Although function words are not very relevant features for characterizing style, they are a reliable indicator of authorship. Owing to their high frequency in written text, function words are very difficult to control consciously and voluntarily, which makes them a more inherent trait and consequently minimizes the risk of false attribution. Moreover, unlike content words, they are largely independent of the topic or the genre of the text, and therefore we should not expect to find great differences in frequency across texts written by the same author on different topics [CHU 07]. Yet they rely on the bag-of-words assumption, which treats a text as a set of independent words.

As we have seen, the hypothesis stated as the basis for the experiment does not hold, at least for the corpus considered here. This can be taken as a clear argument that complex features such as sequential rules are not suitable for authorship attribution studies. In fact, there is a difference between the characterizing ability of a stylistic feature, on the one hand, and its discriminant power, on the other. The stylistic features most suitable for a discriminant task such as stylistic classification are those that operate at low linguistic levels, as function words do. These are consequently more difficult to interpret linguistically and do not necessarily enhance our knowledge of the style of the text from which they were extracted.

Figure 8.2. Principal components analysis of the 40 books (four books per author) in the dataset, Top 200 SR analyzed. For a color version of this figure, see www.iste.co.uk/sharp/cognitive.zip
