Interacting with Interviewers in Text and Voice Interviews on Smartphones
In the parts of the world with access to recent advances in mobile technologies, the daily communication habits of potential survey respondents have changed massively. An increasing majority of sample members are augmenting their face-to-face (FTF) and landline telephone interactions (if they even have a landline any more) by communicating in multiple modes - talking, emailing, text messaging, video chatting, posting to social media - on a single mobile device, and switching between those modes on the same device or across devices (smartphone, tablet, desktop computer, etc.). People are growing accustomed to choosing and switching between communication modes appropriate to their current setting (a noisy environment? an unstable network connection?), communication goals (a professional communication with co-workers? a private conversation with a family member?), chronic or temporary needs (wanting a lasting record of what was communicated, or no record? not wanting to be seen right now? needing to care for a child while communicating?), and interlocutor (a partner who never responds to voice calls or emails?). People with access to advanced technologies also are more frequently engaging in human-machine interactions, whether with bank automated teller machines, ticket kiosks, and self-check-out at grocery stores, or in conversations with automated phone agents for travel reservations and tech support, or in online help "chat."
Many potential survey respondents have thus gotten used to modes of interacting that have quite different dynamics than the FTF "doorstep" interviews and landline telephone interviews that have formed the backbone of survey measurement - and even than the online "self-administered" surveys that are a growing part of the landscape. How will these transformations affect survey responding and the effects of interviewers? How should we think about the role of interviewers when potential respondents may want to - or even assume they can - participate in surveys in the modes of interaction they now use daily? Will interviewers in new modes enhance participation and respondent motivation, as they often do in current modes (e.g., Villar and Fitzgerald 2017)? What signs of their "humanness" do interviewers produce in modes with less social presence than FTF, and how do these affect respondents' participation and the quality of the data they provide?
The study described here begins to address these questions by exploring the dynamics of interviewer-respondent interaction in a corpus of 634 US-based interviews on smartphones from Schober, et al. (2015). The interviews were carried out in four survey modes that work through two native iPhone apps with which all participants were likely to have experience (Phone and Messages), as opposed to through a study-specific survey app or as a web survey in the phone's browser; the study purposely limited itself to using a uniform interface for all respondents (iOS) rather than mixing mobile platforms to include Android, Windows, etc. The study varied the medium of interaction (voice vs. text messaging) and the interviewing agent (human interviewer vs. automated system), leading to four survey modes: Human Voice, Human Text, Automated Voice, and Automated Text. In each mode, respondents (who had screened into participating on their iPhone for a $20 (gift code) incentive and were randomly assigned to one of the modes) answered 32 questions selected from ongoing US social scientific surveys. (See Antoun, et al. 2016 for details about sample recruitment from online sources.)
The Human Voice and Human Text interviews were carried out by the same eight professional interviewers from the University of Michigan Survey Research Center, using a custom CATI interface for this study that supported voice and text interviews (see Schober, et al. 2015, supplementary materials, for more details). In the Human Voice interviews, interviewers read the questions and entered answers as they usually do. In the Human Text interviews, interviewers used the interface to select and send stored questions and prompts as well as to edit pre-existing prompts or type individualized messages. Respondents could text their answers with single characters rather than typing out full answers: "Y" or "N" (for "yes" or "no"), single letters ("a," "b," "c") corresponding to response options, or numbers for numerical questions. They could also text requests for clarification or help if they desired.
The Automated Voice mode was a custom-built Speech-IVR (Interactive Voice Response) dialogue system, implemented using AT&T's Watson speech recognizer and the Asterisk telephony gateway. Questions and prompts were presented by an audio-recorded human interviewer; respondents' spoken responses that were recognized were either accepted ("Got it," "Thanks") or presented for verification if the system's recognition confidence was lower than a minimum threshold ("I think you said 'nine' - is that right? Yes or no"). Responses that were not recognized led the system to re-present the question or response options (see Johnston, et al. 2013 for more details about the system and interface).
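The accept, verify, or re-present logic described above can be illustrated with a minimal sketch. This is not the actual Speech-IVR implementation: the threshold value, the `Recognition` type, and the function name are hypothetical, introduced only to make the three-way decision concrete.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical value: the source says only that a minimum threshold existed.
CONFIDENCE_THRESHOLD = 0.8


@dataclass
class Recognition:
    """Illustrative stand-in for a speech recognizer result."""
    text: Optional[str]   # None when the response was not recognized
    confidence: float     # recognizer confidence, assumed to lie in [0, 1]


def next_system_move(result: Recognition,
                     threshold: float = CONFIDENCE_THRESHOLD) -> str:
    """Choose the system's next move for one spoken response."""
    if result.text is None:
        # Unrecognized: re-present the question or response options.
        return "REPRESENT"
    if result.confidence < threshold:
        # Recognized with low confidence: ask the respondent to verify.
        return f"VERIFY: I think you said '{result.text}' - is that right? Yes or no"
    # Recognized with sufficient confidence: accept the answer.
    return "ACCEPT: Got it"
```

Under these assumptions, a response recognized as "nine" with confidence 0.55 would trigger the verification prompt, whereas the same response at 0.95 would simply be accepted.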
The Automated Text interviews were carried out via a custom text dialogue system, and the interface for respondents was the same as that for Human Text interviews. The system texted questions, accepted recognizable answers, prompted for an acceptable answer if what was provided was uninterpretable, and presented standard definitions of key survey terms if respondents requested help. The automated text interview introduction stated: "This is an automated interview from the University of Michigan," rather than the human text interview introduction: "My name is < first name, last name > from the University of Michigan."
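As a rough illustration of how such a text dialogue system might triage an incoming message, the sketch below classifies a reply as an acceptable answer, a help request, or uninterpretable. The question types, the help keywords, and the restriction of numerical answers to whole numbers are all assumptions made for the example; the source does not specify how the actual system recognized answers or help requests.

```python
# Hypothetical help keywords; the source does not say how help was requested.
HELP_KEYWORDS = {"help", "?"}


def handle_text_reply(reply: str, question_type: str,
                      options: tuple = ()) -> str:
    """Return the system's next move for one texted reply.

    question_type is one of "yes_no", "choice", or "numeric"
    (illustrative labels, not taken from the study).
    """
    token = reply.strip().lower()

    if token in HELP_KEYWORDS:
        # Respondent asked for help: present the standard definition.
        return "DEFINE"
    if question_type == "yes_no" and token in {"y", "n", "yes", "no"}:
        return "ACCEPT"
    if question_type == "choice" and token in options:
        # Single letters such as "a", "b", "c" mapped to response options.
        return "ACCEPT"
    if question_type == "numeric" and token.isascii() and token.isdigit():
        # Assumption: numerical answers are non-negative whole numbers.
        return "ACCEPT"
    # Anything else is uninterpretable: prompt for an acceptable answer.
    return "REPROMPT"
```

For instance, `handle_text_reply("Y", "yes_no")` would be accepted, while a reply of "maybe" to the same question would lead to a re-prompt.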
As reported in greater detail in Schober, et al. (2015; see also Conrad, et al. 2017a, 2017b), text interviews differed from voice interviews in a number of ways, as did automated and interviewer-administered interviews, with a range of independent and additive effects. Text interviews (compared to voice interviews) led to higher interview completion rates whether automated or interviewer-administered, but also (in interviewer-administered interviews) to a higher breakoff rate. Respondents in text interviews - both automated and interviewer-administered - produced higher-quality data on several fronts, giving more precise (fewer rounded) numerical answers and more differentiated answers to battery questions (less straight-lining). They also disclosed more sensitive behaviors, consistent with West and Axinn's (2015) finding of greater disclosure in text than voice interviews in a quite different population and survey in Nepal. Respondents in automated (compared to interviewer-administered) interviews completed the interview at a lower rate, and they had a higher rate of breakoff in both text and voice interviews. Respondents in automated interviews also reported more sensitive behaviors than those in interviewer-administered interviews (replicating the oft-reported finding of greater disclosure in self-administered interviews, e.g., Tourangeau and Smith 1996) - as a main effect independent of the increased disclosure due to texting. So the greatest level of disclosure was in Automated Text interviews, and the least in Human Voice interviews.
What accounts for this pattern of differences in participation and data quality in these modes? Any or all of the many differences in timing and behavior between text and voice, and automated and interviewer-administered, interviews - alone or in combination - are plausible contributing factors. One could hypothesize, for example, that respondents report with greater precision (less rounding) in text because texting (vs. voice) reduces the immediate time pressure to respond, so the respondent has more time to think or to look up answers. Or one could hypothesize that respondents disclose more in text because text messaging reduces the interviewer's "social presence" - there is no immediate evidence of any interviewer reaction, and thus reduced salience of the interviewer's ability to evaluate or be judgmental.
The experimental design of the study helps rule in or rule out at least some such alternative plausible accounts, because we can explore whether differences in interview timing or in particular interviewer or respondent behaviors across the modes correlate with particular data quality or participation outcomes. We focus on whether interviewer behaviors that provide concrete evidence of interviewers' humanness - speech disfluencies (evidence that speech is spontaneous and that the interviewer might be fallible), laughter, chat that goes beyond the survey script - and that vary in the different modes in the study correlate with or predict data quality. As potential correlates of interviewer-respondent rapport (e.g., Garbarski, Schaeffer, and Dykema 2016), these interviewer behaviors may affect respondents' engagement and motivation, and thus the quality of respondents' answers. Theorizing about how interviewers might affect data quality (see, e.g., Kreuter, Presser, and Tourangeau 2008; Lind, et al. 2013; Tourangeau and Smith 1996, among many others) has proposed a range of possible effects, both positive and negative: interviewers may indeed increase respondents' motivation, but they may also cause respondents to feel embarrassed and reduce their candor, or distract their attentional focus from the content of survey questions. We see empirically examining effects of concrete interviewer behaviors associated with the "human touch" on data quality as a useful step.
We used two main analytic strategies. First, if interviewer behavior differs between two modes but our measures of data quality don't differ, interviewer behavior is unlikely to be causal. For example, one might hypothesize that respondents provided more precise numerical answers (rounded less) in text than voice because text interviewers (human or automated) may have been less likely to laugh (text "LOL" or "haha"), and laughter might suggest that less precise answers are sufficient. But this can't be the simple explanation if respondents round just as much in Human Voice and Automated Voice interviews even though the Automated Voice system never laughs. Our second analytic strategy was to test whether interviewer behavior is associated with data quality. In combination, these two strategies help open a window onto what might account for observed mode differences.