Identification of Partial Falsifications
Another method for identifying partial falsification is to detect similar response patterns across ordinal-scaled item batteries (Blasius and Thiessen 2012; Blasius and Thiessen 2013; Blasius and Thiessen 2015). Principal Component Analysis (PCA) as well as Categorical Principal Component Analysis (CatPCA) can be applied to identify similar response patterns based on the factor scores of all interviews. Suspicious interviewers are thought to produce identical response patterns particularly frequently (Blasius and Thiessen 2013). As before, the underlying inference is that similar response patterns result from the lower variance of falsified data (Blasius and Thiessen 2013; Blasius and Thiessen 2015). An assumption that is required here but seldom discussed is that answers are independently and identically distributed across items. In practice, item batteries often cover related topics, so that answers depend on each other and specific answer patterns occur more frequently in the population even absent any falsification. Nevertheless, if an interviewer is responsible for an unusually large number of similar response patterns, this may indicate falsification.
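The logic above can be sketched as follows. This is a minimal illustration, not the procedure of Blasius and Thiessen: it uses scikit-learn's standard PCA (not CatPCA), the function name and the rounding of component scores are our own choices, and the "repeat share" is simply the proportion of an interviewer's interviews whose component-score pattern duplicates another of their interviews.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

def flag_pattern_repetition(df, item_cols, interviewer_col, n_components=2):
    """Per interviewer, measure how often identical response patterns
    (identified via rounded principal-component scores) recur."""
    # Project the item battery onto its first principal components
    scores = PCA(n_components=n_components).fit_transform(df[item_cols])
    # Round so numerically near-identical score patterns collapse together
    keys = [tuple(row) for row in np.round(scores, 2)]
    out = df.assign(_pattern=keys).groupby(interviewer_col)["_pattern"].agg(
        n_interviews="size",
        n_distinct="nunique",
    )
    # Share of interviews that repeat a pattern already produced
    # by the same interviewer; 0 means all patterns are distinct
    out["repeat_share"] = 1 - out["n_distinct"] / out["n_interviews"]
    return out.sort_values("repeat_share", ascending=False)
```

An interviewer with a high `repeat_share` is not proven to have falsified anything; as noted above, correlated items can produce repeated patterns legitimately, so such scores only mark cases for closer review.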
Identification of Duplicate Records
Duplicate records can heavily influence survey data (Koczela et al. 2015; Kuriakose and Robbins 2016), for example, by biasing point estimates (e.g. regression coefficients) upward or downward and by distorting variance estimates, leading to erroneous tests of significance. Published accounts of a high incidence of duplicates in survey data sets (Kuriakose and Robbins 2016; Slomczynski, Powalko, and Krauze 2017) have resulted in recommendations for the use of duplicate analyses. A simple form of duplicate analysis that checks a data set for completely identical data rows (including missing values) is implemented in standard statistical software (e.g. Stata®, SAS®, SPSS®, among others) (Koczela et al. 2015; Kuriakose and Robbins 2016; Slomczynski, Powalko, and Krauze 2017). So-called "near duplicates" are particularly problematic, since changing even a single value is sufficient to evade such exact-match checks (Kuriakose and Robbins 2016). For this reason, "high-matching" methods were developed to identify records with an unusually high correspondence (between 85% and 99%) of response values (Koczela et al. 2015; Kuriakose and Robbins 2016). However, there is a strong risk that genuine data will be identified as near duplicates by coincidence, producing falsely suspected cases. High-matching methods are highly sensitive to various characteristics of a survey (e.g. number of questions, number of respondents, homogeneous subgroups), and thus are not generally applicable to every data set (Simmons et al. 2016). Whether "high-matching" methods should be used therefore requires careful consideration of these factors.
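A basic high-matching check can be sketched as below. Exact duplicates are already covered by standard tools (in pandas, for instance, `df.duplicated(keep=False)`); the sketch adds a simple pairwise comparison in which the function name, the 85% default threshold, and the decision to treat missing-equals-missing as a match are our illustrative choices, not a published specification.

```python
import numpy as np
import pandas as pd

def find_high_matches(df, threshold=0.85):
    """Flag pairs of rows whose share of identical values meets the
    threshold. NaN is counted as matching NaN, mirroring exact-duplicate
    checks that include missing values. O(n^2) pairwise comparison, so
    suitable only for moderate sample sizes."""
    values = df.to_numpy()
    n = len(df)
    pairs = []
    for i in range(n):
        for j in range(i + 1, n):
            a, b = values[i], values[j]
            same = (a == b) | (pd.isna(a) & pd.isna(b))
            share = same.mean()
            if share >= threshold:
                pairs.append((df.index[i], df.index[j], share))
    return pd.DataFrame(pairs, columns=["row_a", "row_b", "match_share"])
```

As the caveats above suggest, any flagged pair is only a candidate: with many short, homogeneous item batteries, high match shares arise by chance, so the threshold must be chosen with the survey's length and subgroup structure in mind.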