Primary Data: Background and Informants
Data in Historical Sociolinguistics
The 'Bad-Data' Problem
It is true that researchers of the earlier varieties of a language cannot gather their data in the same way as a person studying present-day languages. The standard sociolinguistic methods, such as interview and elicitation, are automatically excluded. Recordings of spoken language are available only from the last century onwards. What serves as primary data necessarily represents written language and, as Labov says (1982: 20), 'the fragments of the literary record that remain are the result of historical accidents beyond the control of the investigator'.
In our experience, however, there is no need in historical linguistics to overstress what Labov calls 'bad data'. True, historical data can be characterized as 'bad' in many ways, but we would rather place the emphasis on making the best use of the data available (Nevalainen 1999b). This requires systematicity in data collection, extensive background reading and good philological work, in other words, tasks that are demanding and time-consuming but by no means unrealizable.
Labov's additional comment (1994: 11) that '[w]e usually know very little about the social position of the writers, and not much more about the social structure of the community' seems to us rather inaccurate if not misleading, at least as far as the history of English is concerned. Extensive studies of how people lived in the past have been carried out by historians, from general investigations to research on particular areas and communities as well as families and individuals. Integrating information gathered by historians into linguistic research has been one of the challenges of this study.1 We would not argue, however, that it is always possible to assess an individual writer's social position or the conditions of his or her community.
Rather than complaining about the quality of the information we have, we should regret the shortage of material concerning particular sections of society. Owing to widespread illiteracy in the past, it is not possible to gain access to the language of the lowest social strata and of most women, at least not in autograph form. The level of full literacy, comprising both reading and writing skills, varies a great deal between different historical periods but, on the whole, most of the material that has come down to us has been produced by upper- and middle-ranking male informants.
The whole field of historical linguistics has been revolutionized by the emergence of computer-assisted data processing techniques. Besides huge computerized corpora on present-day languages, there are also corpora on historical varieties. The first such corpus on the history of English, the Helsinki Corpus of English Texts, carefully compiled by a project team at the University of Helsinki in the 1980s, has paved the way for a number of new, second-generation corpora (e.g. Hickey et al. 1997; Meurman-Solin 2001; Beal et al. 2007). The trend has been from textually balanced multi-purpose corpora towards larger single-genre corpora, such as the one on correspondence used in this study. Computer technology has proved to be a great help in making the best use of the data available in historical sociolinguistics.