
## Perspective 2: CA and its mathematics/computation

Let us now turn to some of the more technical arguments regarding CA's input data and choice of measure.

## The issue of the corpus size

Let us begin with the issue of Bybee's "fourth factor", the corpus size in constructions. Yes, an exact number of constructions for a corpus cannot easily be generated because i. "a given clause may instantiate multiple constructions" (Bybee 2010: 98); ii. researchers will disagree on the number of constructions a given clause instantiates; iii. in a framework that does away with a separation of syntax and lexis, researchers will even disagree on the number of constructions a given word instantiates.

However, this is much less of a problem than it seems. First, this is a problem nearly all AMs have faced and addressed successfully. The obvious remedy is to choose a level of granularity close to the one of the studied phenomenon. For the last 30 years, collocational statistics have used the number of lexical items in the corpus as [...]

Second, CA rankings are remarkably robust. Bybee herself pointed out that different corpus sizes yield similar results, and a more systematic test supports that. I took Stefanowitsch & Gries's (2003) original results for the ditransitive construction and increased the corpus size used in the paper by a factor of ten (from 138,664 to 1,386,640), and I halved the observed frequencies used in the paper (with frequencies of 1 being set to 0, i.e. omitted). Then I computed four CAs:

- one with the original data;
- one with the original verb frequencies but the larger corpus size;
- one with the halved verb frequencies and the original corpus size;
- one in which both frequencies were changed.

In Figure 1, the pairwise correlations of the collostruction strengths of the verbs are computed (Spearman's rho) and plotted.

The question of which verb frequencies and corpus size to use turns out to be fairly immaterial: Even when the corpus size is decreased or increased by one order of magnitude and/or the observed frequencies of the words in the constructional slots are halved or doubled, the overall rankings of the words are robustly intercorrelated (all rho > 0.87). Thus, this 'issue' is unproblematic when the corpus size is approximated at some appropriate level of granularity and, trivially, used consistently within one analysis.
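
The logic of this robustness check can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the original analysis: the verb frequencies and the construction frequency below are hypothetical placeholders (only the corpus size 138,664 comes from the text), only the co-occurrence frequencies are halved, and collostruction strength is taken to be -log10 of a one-tailed Fisher-Yates exact p-value for attraction.

```python
import math
from itertools import combinations

def log_comb(n, k):
    """Natural log of the binomial coefficient C(n, k), via log-gamma."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)

def collostruction_strength(a, word_freq, cx_freq, corpus_size):
    """-log10 of the one-tailed Fisher-Yates exact p-value, i.e. P(X >= a)
    under the hypergeometric distribution (assumed operationalization)."""
    upper = min(word_freq, cx_freq)
    logs = [
        log_comb(word_freq, k)
        + log_comb(corpus_size - word_freq, cx_freq - k)
        - log_comb(corpus_size, cx_freq)
        for k in range(a, upper + 1)
    ]
    peak = max(logs)  # log-sum-exp keeps very small p-values from underflowing
    log_p = peak + math.log(sum(math.exp(x - peak) for x in logs))
    return -log_p / math.log(10)

def ranks(values):
    """Ranks starting at 1, with ties receiving their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result

def spearman(x, y):
    """Spearman's rho, computed as Pearson's r on the ranks."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    sxy = sum((p - mx) * (q - my) for p, q in zip(rx, ry))
    sxx = sum((p - mx) ** 2 for p in rx)
    syy = sum((q - my) ** 2 for q in ry)
    return sxy / math.sqrt(sxx * syy)

# HYPOTHETICAL data: verb -> (frequency in the construction, frequency in the corpus)
VERBS = {"give": (400, 1000), "tell": (120, 800), "send": (60, 300),
         "offer": (40, 200), "show": (45, 600), "bring": (8, 700), "earn": (1, 150)}
CX_FREQ = 1000     # hypothetical frequency of the construction
CORPUS = 138_664   # corpus size mentioned in the text

def run_ca(halve, corpus_size):
    """One CA; halved frequencies of 1 become 0 and the verb is omitted."""
    out = {}
    for verb, (a, word_freq) in VERBS.items():
        if halve:
            a //= 2
        if a == 0:
            continue
        out[verb] = collostruction_strength(a, word_freq, CX_FREQ, corpus_size)
    return out

scenarios = {
    "original": run_ca(False, CORPUS),
    "larger corpus": run_ca(False, CORPUS * 10),
    "halved freqs": run_ca(True, CORPUS),
    "both changed": run_ca(True, CORPUS * 10),
}

rhos = {}
for (n1, s1), (n2, s2) in combinations(scenarios.items(), 2):
    shared = sorted(set(s1) & set(s2))  # compare only verbs present in both CAs
    rhos[(n1, n2)] = spearman([s1[v] for v in shared], [s2[v] for v in shared])
    print(f"{n1} vs {n2}: rho = {rhos[(n1, n2)]:.3f}")
```

With real data, `VERBS` and `CX_FREQ` would be replaced by the frequencies from the study in question; the six pairwise rho values then correspond to the kind of comparison plotted in Figure 1.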
## The distribution of pFYE

Another aspect of how CA is computed concerns its 'response' to observed frequencies of word [...]

- the power law of learning (cf. Anderson 1982, cited by Bybee herself);
- word frequency effects are logarithmic (cf. Tryk 1986);
- forgetting curves are logarithmic (as in priming effects; cf. Gries 2005, Szmrecsanyi 2006), ...

Given such and other cases [...]

As for the former, it is easy to show that the AM used in most CAs, pFYE, is not a straightforward linear function of the observed frequencies of words in constructions but rather varies as a function of w's frequency in [...] frequency of [...]

I am not claiming that logged pFYE-values are the best way to model cognitive processes (for example, a square root transformation makes the values level off more like a learning curve), but clearly a type of visual curvature we know from many other cognitive processes is obtained. Also, pFYE values are highly correlated with statistics we know [...]

Let us now also at least briefly look at authentic data, some here and some further below (in Section 4.3.2). The first result is based on an admittedly small comparison of three different measures of collostruction strength: For the ditransitive construction, I computed three different CAs, one based on -log10 pFYE, one on an effect size (logged odds ratio), and one on Mutual Information [...] Comparing these results to each other and to Goldberg's (1995) analysis of the ditransitive suggests that, of these measures, the pFYE-values arguably fare best: [...]

Finally, there is Wiechmann's (2008) comprehensive study of how well more than 20 AMs predict experimental results regarding lexico-constructional co-occurrence. Raw co-occurrence frequency scores rather well, but this was in part because several outliers were removed. Crucially, pFYE ended up in second place, and the first-ranked measure, Minimum Sensitivity (MS), is theoretically problematic. Using the notation of Table 1, it is computed as shown in (5), i.e. as the minimum of two conditional probabilities:

(5) MS = min(p(word|construction), p(construction|word))

One problem here is that some collexemes' positions in the ranking order will be due to p(word|construction) while others' will be due to p(construction|word). Also, the value for [...] case it is p(word|construction) and in the other it is p(construction|word). This is clearly undesirable, which is why pFYE, while 'only' second, is more appealing. As an alternative, a unidirectional measure such as ΔP is more useful (cf. Gries to appear).
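
To make the concern about Minimum Sensitivity concrete, here is a minimal sketch in Python. It assumes that MS is the smaller of the two conditional probabilities named above, each estimated from raw frequencies; the function name, the labels, and all frequencies are hypothetical and chosen only for illustration.

```python
def minimum_sensitivity(a, word_freq, cx_freq):
    """Return (MS, label of the conditional probability that determines it).

    a         -- co-occurrence frequency of word and construction
    word_freq -- overall frequency of the word in the corpus
    cx_freq   -- overall frequency of the construction in the corpus
    """
    p_word_given_cx = a / cx_freq    # p(word|construction)
    p_cx_given_word = a / word_freq  # p(construction|word)
    if p_word_given_cx <= p_cx_given_word:
        return p_word_given_cx, "p(word|construction)"
    return p_cx_given_word, "p(construction|word)"

# HYPOTHETICAL collexemes of a construction that occurs 1,000 times
rare_verb = minimum_sensitivity(a=50, word_freq=100, cx_freq=1000)
frequent_verb = minimum_sensitivity(a=100, word_freq=2000, cx_freq=1000)

print(rare_verb)      # MS is determined by p(word|construction)
print(frequent_verb)  # MS is determined by p(construction|word)
```

In this made-up case both collexemes receive the same MS value, yet the value reflects a different direction of association in each case, which is the kind of ambiguity a unidirectional measure avoids.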
