Corpus-Assisted Analysis of Language Ideological Debates on the Delfi News Portal, August 2013 - February 2015
Building the Corpora
We would like to emphasize that a corpus-assisted study can only tell us about the language in the corpus one is employing and, therefore, the composition of the corpus will necessarily affect the conclusions reached (Partington et al. 2013). However, in contrast to qualitative discourse analysis, where the starting point depends on the researcher’s standpoint and understanding of the meaning of the discourse, CADS, even in cases like ours where the researchers have to build corpora themselves, does not have to deal with the issue of biased research to the same extent, since the selection of texts in the early stages of the analysis is computerized and therefore free of the researcher’s prejudices (Baker and McEnery 2015, 6). Since there is no common and up-to-date corpus of Estonian media texts, we had to create a sample corpus (also known as a DIY, or do-it-yourself, corpus; Fitzsimmons-Doolan 2015) for the purpose of our analysis. We chose to create corpora based on texts published on the largest online news portal, Delfi. Delfi has a broad platform and ranks as the most popular Estonian news website among Estonian internet users, based on an analysis of internet traffic (GemiusAudience) and surveys of self-reported media trust and use (Vihalemm 2011). There is no universal measure of the representativeness of a corpus, although a corpus is considered balanced if it consists of a variety of texts (Fitzsimmons-Doolan 2015). Ours does, given the variety of texts published on Delfi¹: our corpora contain news reports, readers’ letters, interviews, features and editorials. Two specialized corpora, based on predetermined content-focus words (Fitzsimmons-Doolan 2015), were compiled through archival searches of online publications on delfi.ee for articles in Estonian and delfi.ru.ee for articles in Russian.
The search parameters covered two periods: from August 2013 to March 2014, giving us seven months before the annexation of Crimea in March 2014, and from March 2014 to the parliamentary elections in Estonia on March 1, 2015. Early in our research, we noticed a higher number of language ideological debate articles in the months before the elections for local governments in October 2013 and for the European Parliament in May 2014, and thus we wanted to test whether this also held true for the parliamentary elections on March 1, 2015. The chosen texts related primarily to the debate about the integration of the Russian-speaking minority in Estonia and were identified through keyword list searches of the repositories. All of the archival searches and cross-checks were based on the following phrases: “Russian/Estonian language”, “Estonian Russians”, “integration”, “citizenship” and “Russian school”. All of the selected articles had to mention “Russian language” and/or “Estonian language”.
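The selection rule described above can be illustrated with a minimal sketch. It is not the procedure we ran on the portal archives: the phrase lists are English stand-ins for the Estonian and Russian search phrases, and the function names `matches_selection` and `matched_phrases` are hypothetical.

```python
# English stand-ins for the search phrases (assumption: the real searches
# used the Estonian and Russian equivalents on the portal archives).
SEARCH_PHRASES = [
    "russian language",
    "estonian language",
    "estonian russians",
    "integration",
    "citizenship",
    "russian school",
]

# An article is kept only if it mentions at least one of these.
REQUIRED_PHRASES = ["russian language", "estonian language"]


def matches_selection(text):
    """Return True if the text mentions at least one required phrase."""
    lowered = text.lower()
    return any(phrase in lowered for phrase in REQUIRED_PHRASES)


def matched_phrases(text):
    """Return the search phrases found in the text, in list order."""
    lowered = text.lower()
    return [phrase for phrase in SEARCH_PHRASES if phrase in lowered]
```

Plain substring matching of this kind would miss inflected forms in Estonian and Russian, so a real filter would probably need stems or a list of case forms for each phrase.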
The search resulted in 210 articles in Russian and 289 articles in Estonian, which were saved separately as .txt files organized in sub-directories by the corresponding months and then included in two parallel corpora of approximately 200,000 and 300,000 tokens, respectively. The collected corpus can thus be characterized as a small, specialized corpus (Barker et al. 2008) that can be processed by computer corpus software in a preliminary way, and the evidence of which can be examined manually and individually, whereas important features of the context of the production of the texts may become lost in a large corpus (Clark 2007). Since we wanted to detect changes in language ideological discourse patterns before and after the annexation of Crimea in March 2014, our criteria for corpus compilation were the broader context, such as events in Ukraine and Estonia, and the narrower context of individual articles, namely language policy-related topics such as integration, language status, language acquisition and citizenship issues. The two corpora, in our evaluation, were both sufficiently representative for our study and large enough to justify a corpus-assisted approach.
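A directory layout of this kind lends itself to a simple per-month tally of articles and tokens. The following is a sketch under stated assumptions, not the software used in the study: the function name `corpus_summary`, the UTF-8 encoding, and the crude definition of a token as a run of word characters are all assumptions, and corpus tools typically tokenize differently.

```python
import re
from pathlib import Path

# Assumption: a token is any run of Unicode word characters.
TOKEN_PATTERN = re.compile(r"\w+")


def corpus_summary(root):
    """Map each month sub-directory name to (article count, token count)."""
    summary = {}
    month_dirs = sorted(p for p in Path(root).iterdir() if p.is_dir())
    for month_dir in month_dirs:
        articles = 0
        tokens = 0
        for txt_file in sorted(month_dir.glob("*.txt")):
            text = txt_file.read_text(encoding="utf-8")
            articles += 1
            tokens += len(TOKEN_PATTERN.findall(text))
        summary[month_dir.name] = (articles, tokens)
    return summary
```

Called once per corpus root (one for the Estonian files, one for the Russian files), the result gives the monthly article counts that a chart of articles over time would be drawn from.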
Although our analysis of changing language ideological debates in the media was cross-linguistic, we did not have to face the challenges common to cross-linguistic CADS, such as finding comparable search terms for collocation analysis (Taylor 2014). Our comparative interest was largely cultural, social and historical, and not linguistic, since we analyzed the Estonian and Russian corpora separately.
Fig. 1 Language policy-related articles in delfi.ee and delfi.ru.ee
However, since all of the corpus analysis was carried out in the original language, we faced the challenge of literal vs. functional translation and thus could not present the key-word-in-context (KWIC) concordance lines, which are central for illustrating language patterns, as the word order and even the node can get lost in translation (Taylor 2014). Furthermore, since both Estonian and Russian are languages with synthetic-inflectional structures, where nouns and adjectives are declined for case, automated functions such as “keyness” could only be used tentatively, and the results had to be verified by concordance and collocation analysis.
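To illustrate why inflection makes automated keyness tentative, the sketch below computes a log-likelihood keyness score for one word form across two corpora. This is an assumption for illustration only: the text above does not state which keyness statistic the corpus software applied, and the function name `log_likelihood` is hypothetical.

```python
import math


def log_likelihood(freq_a, size_a, freq_b, size_b):
    """Log-likelihood keyness of one word form in corpus A versus corpus B.

    freq_a, freq_b: occurrences of the form in each corpus.
    size_a, size_b: total tokens in each corpus.
    """
    total_freq = freq_a + freq_b
    total_size = size_a + size_b
    # Expected frequencies if the form were equally common in both corpora.
    expected_a = size_a * total_freq / total_size
    expected_b = size_b * total_freq / total_size
    score = 0.0
    # A zero observed frequency contributes nothing to the sum.
    if freq_a > 0:
        score += freq_a * math.log(freq_a / expected_a)
    if freq_b > 0:
        score += freq_b * math.log(freq_b / expected_b)
    return 2 * score
```

Because such a score is computed per surface form, the occurrences of a single noun are split across its case forms, and each form may fall below the threshold on its own. Summing the frequencies of all forms of a lemma before scoring, or checking the concordance lines by hand, is one way to compensate.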
¹ Delfi is a commercially run internet portal owned by the Estonian media company EkspressGroup. It operates in all three Baltic states and in Poland. Aside from the Estonian, Latvian and Lithuanian versions, the company offers English- and Russian-language versions of its portal in all three Baltic countries. Besides news and articles produced by Delfi, the portal also publishes summaries of the most important news and articles published elsewhere. Articles published on Delfi are freely accessible and cover a wide range of topics, from politics to fashion.