Evaluating the performances of artificial systems
This chapter introduces the main proposals that have been developed in order to evaluate the performance of artificial systems (cognitively inspired or not) and to justify the ascription of faculties from the “cognitive” vocabulary (like “intelligence”) to such systems. After introducing the Turing Test, its problematic aspects, and some of the main modifications proposed (e.g., the Super Turing Test and other variations), we will analyze other frameworks like the Newell Test for a theory of cognition and other tasks and challenges that have been used - with different purposes - as a testbed for the evaluation of artificial systems. These tasks range from the RoboCup World Soccer to the DARPA Challenges for autonomous vehicles to the recently proposed Winograd Schema Challenge and the RoboCup@Home. We will analyze these proposals both in light of their possible explanatory role in the context of a computationally driven science of the mind and with respect to their actual capacity for evaluating the “intelligence” of artificial systems.
“Thinking” machines and Turing Test(s)
Determining to what extent an artificial system can be defined as “intelligent” in the way humans (or other animals) are has been a problematic issue since the earliest research on intelligent machines. Usually the arguments used in support of the idea that machines can be intelligent follow this schema - premise 1: an entity is intelligent if it shows a given behaviour X; premise 2: it is possible to build artificial systems (both embodied and not) that are able to manifest the behaviour X; conclusion: the machines able to exhibit that behaviour can be claimed to be “intelligent”. This argument has been subjected to different objections. The first one, called the “behaviouristic objection”, holds that the first premise is questionable: the fact that an artificial system is able to display a certain behaviour does not necessarily imply any understanding or intelligence about the actions or tasks that it is able to perform. A second objection, which we could call “technological pessimism”, questions the second premise: in this case there are doubts about the possibility of an artificial system actually being programmed to exhibit a target behaviour that is externally described as an “intelligent” one. In the following sections, these objections, along with others, will be reviewed by exploring some of the main proposals concerning the evaluation of artificial systems able to exhibit intelligent behaviour.
The first, in this line, was the proposal from Alan Turing in his famous paper “Computing machinery and intelligence” (Turing, 1950). The British mathematician and inventor of the abstract computing machine bearing his name suggested that, in order to determine what answer to provide to the question, “Can machines think?” it was possible to use a sort of “indirect” test, called the “Turing Test” (TT) or “Imitation game” (an overview of this topic is provided in Epstein, Roberts, and Beber, 2009). In this game three “players” are involved: two human beings (one working as an “interrogator” and the other being asked to provide the answer) and a computing machine, which also has the role of answering the questions posed by the human interrogator. Within this “game” the interrogator is assumed to be in a sort of “blind” situation: i.e., s/he does not see who/what (the computing machine or the other human being) is responding to the questions s/he is asking. Indeed, s/he is supposed to communicate with them only indirectly (e.g., through a video display and keyboard) by asking them questions and reading their answers.
The goal of the game, for the interrogator, is to discover as quickly as possible which is the human and which is the machine. To achieve this goal, the interrogator can ask any question. In this setting, the human player is assumed to behave in a way that would help the interrogator, while the machine is programmed to deceive the interrogator for as long as possible. According to Turing, indeed, the more the machine is able to resist and deceive the human interrogator, the more this can be seen as an indirect hint of its “intellectual” ability (a pictorial representation of the situation hypothesized in the TT is available in Figure 5.1). Here is an interrogator/machine conversation imagined by Turing in his paper:
INTERROGATOR: In the first line of your sonnet which reads, “Shall I compare thee to a summer’s day”, would not “a spring day” do as well or better?
COMPUTER: It wouldn’t scan.
INTERROGATOR: How about “a winter’s day”? That would scan all right.
COMPUTER: Yes, but nobody wants to be compared to a winter’s day.
INTERROGATOR: Would you say Mr. Pickwick reminded you of Christmas?
COMPUTER: In a way.
INTERROGATOR: Yet Christmas is a winter’s day, and I do not think Mr. Pickwick would mind the comparison.
COMPUTER: I don’t think you are serious. By a winter’s day one means a typical winter’s day, rather than a special one like Christmas.
FIGURE 5.1 A pictorial representation of the “Imitation game”.
As pointed out by Levesque (2017) in his recent book, we are today still far from building systems capable of this level of conversation. Turing’s point, however, was to investigate whether, assuming that we could build such a system, we could ascribe terms from the vocabulary of folk psychology, like “understanding”, “thinking”, and “intelligence”, to such machines.
As mentioned, Turing assumed that the machine, in order to obtain a more realistic effect in playing the Imitation game, is allowed to “cheat”, by occasionally making mistakes. For example, he explicitly wrote:
It is claimed that the interrogator could distinguish the machine from the human simply by setting them a number of problems in arithmetic. The machine would be unmasked because of its deadly accuracy. The reply to this is simple. The machine (programmed to play the game) would not attempt to give the right answers to the arithmetic problems. It would deliberately introduce mistakes in a manner calculated to confuse the interrogator.
(Turing, 1950: 448)
Similar cheating strategies were also used by the ELIZA chat-bot program, mentioned in Chapter 1 (footnote 14), developed by Joseph Weizenbaum in 1966, which attempted to mimic the dialogue capabilities of a psychotherapist by employing a number of simple strategies: (1) using keywords and pre-canned responses (for instance, answering, “Can you tell me more about your family?” when the human wrote, “Perhaps I could learn to get along with my mother...”); (2) parroting the human interrogator (e.g., if the human wrote, “My boyfriend made me come here”, the system would reply, “Your boyfriend made you come here?”); or (3) asking very general questions (e.g., “In what way?” or “Can you give a specific example?”). It is interesting that, although these strategies were so simple, humans were quick to attribute human-level intelligence to the program. In this respect, experiments with ELIZA can be viewed as the first attempts to deal with the TT. As a consequence, they also pointed out some of its limitations.
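The three strategies just listed can be illustrated with a minimal Python sketch. It is not Weizenbaum’s original program: the keyword table, the pronoun swaps, and the function names (`parrot`, `respond`) are hypothetical choices made here only to show how little machinery such strategies require.

```python
import re

# Hypothetical keyword table (strategy 1): a pattern and its pre-canned reply.
KEYWORD_RESPONSES = [
    (re.compile(r"\b(mother|father|sister|brother)\b", re.IGNORECASE),
     "Can you tell me more about your family?"),
]

# Hypothetical first-to-second person swaps used for parroting (strategy 2).
PRONOUN_SWAPS = {"my": "your", "me": "you", "i": "you", "am": "are"}

# Very general questions used as a fallback (strategy 3).
GENERIC_QUESTIONS = ["In what way?", "Can you give a specific example?"]


def parrot(sentence):
    """Echo the sentence back as a question, swapping person."""
    words = sentence.rstrip(".!?").split()
    swapped = [PRONOUN_SWAPS.get(word.lower(), word) for word in words]
    text = " ".join(swapped)
    return text[0].upper() + text[1:] + "?"


def respond(user_input, turn=0):
    """Pick a reply by trying the three strategies in order."""
    # Strategy 1: a keyword triggers a pre-canned response.
    for pattern, reply in KEYWORD_RESPONSES:
        if pattern.search(user_input):
            return reply
    # Strategy 2: parrot any sentence containing a first-person word.
    if re.search(r"\b(my|me|i)\b", user_input, re.IGNORECASE):
        return parrot(user_input)
    # Strategy 3: otherwise fall back on a very general question.
    return GENERIC_QUESTIONS[turn % len(GENERIC_QUESTIONS)]


if __name__ == "__main__":
    print(respond("Perhaps I could learn to get along with my mother..."))
    print(respond("My boyfriend made me come here"))
    print(respond("It is raining"))
```

Nothing in this sketch represents the meaning of what the human writes: every reply is obtained by surface pattern matching, which is why the readiness of users to attribute intelligence to such a program is relevant to the behaviouristic objection.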
The TT, in fact, has been interpreted in a number of different ways: as a way to provide a general definition of thought, or intelligence; as an operational criterion for ascribing intelligence to artificial systems; or as a test for determining the adequacy of simulative models of cognition. To these diverse interpretations correspond different critiques. The most important one concerns the behaviouristic objection mentioned before. In particular, this test has been criticized because it only refers to the manifest behaviour of a given system and no claim can be made about the internal mechanisms that have led to that behaviour. This makes the test an insufficient criterion for the empirical validation of a simulative model (see e.g., Cordeschi, 2002). Another well-known criticism, developed in the copious literature on this theme, concerns its excessive anthropocentrism. The TT, indeed, explicitly targets human and human-like “thinking” and, therefore, cannot be used to provide a universal criterion for attributing intelligence (this is also called the “chauvinistic objection”). Concerning this aspect, Turing himself clarified that he did not intend to propose the test as a way to define “intelligence” in a general sense. In his paper, Turing readily acknowledges that there could be intelligent beings that are unable to pass the test simply because they do not have a human-like intellect:
May not machines carry out something which ought to be described as thinking but which is very different from what a man does? This objection is a very strong one, but at least we can say that if, nevertheless, a machine can be constructed to play the imitation game satisfactorily, we need not be troubled by this objection.
Another well-known and strong objection raised against the TT concerns the fact that it is limited to linguistic behaviour (i.e., it is only a “language-based” experiment, while all the other cognitive faculties are not tested). This drawback has also reduced the role of the proposed test as a “general test for human intelligence” since, as the psychologist Howard Gardner pointed out in his theory of multiple intelligences, human beings have different kinds of modality-specific “intelligent abilities”, and verbal-linguistic abilities are only one of them (Gardner, 2011). Another problem, finally, concerns the subjective evaluation of the interrogator. Different human interrogators, indeed, could judge the same machine behaviour in different ways. Such criticisms, however, do not all have the same weight. As noticed in Frixione (2015), for example, while the chauvinistic objection can be considered less problematic when the TT is used within an empirical study of the (human) mind, the other ones - concerning the “linguistic”, the “behaviouristic”, and the “subjectivistic” bias - are much more serious. Given this state of affairs, different modified versions of the TT have been proposed. For example, Stevan Harnad (Harnad, 2001) proposed the so-called Total Turing Test (TTT), a version of the TT extended to take into account any kind of input and output and that, consequently, assumes that the “machine” is a robotic system, with sensors and actuators. Such a proposal, however, while it allows one to deal with the “linguistic objection”, does not make any progress on the subjectivistic and overall behaviouristic objections. The latter, in particular, still holds, since nothing can be said about the compliance (if any) of the computational mechanisms used by such an embodied system to determine its - possibly intelligent - behaviour (described as such by an external observer).
This fact, as a consequence, affects the use of this variant of the TT as a “general” test for intelligence. In addition, passing the TTT per se cannot be considered a sufficient condition for validating a simulative model of some cognitive phenomenon, since - as with the TT - the TTT does not explain any mental processing activity. This point has been stressed, for example, by Pylyshyn (1984), Newell and Simon, and Philip Johnson-Laird; on this, see Roberto Cordeschi (2002). In the context of a computational cognitive account of the sciences of the mind, alternative suggestions have been made to use the TT in such a way that the interrogator could unmask the machine and its possible non-compliance with “human-like” thinking and intelligence. It has been proposed, for example, to “test” the artificial system by asking it to solve behavioural tests for which there are already established results in the psychological literature. For example, the interrogator, in order to unmask the machine, could take advantage of her/his empirical knowledge about certain behavioural regularities and could find a way to use, in both the TT and TTT versions, her/his knowledge about, for example, the “semantic priming” effect (as proposed by French, 1990) or the conjunction fallacy (see footnote 3 in Chapter 3) or other well-known heuristics. This would allow her/him to see whether the artificial system replies in a way that is similar to a human’s response and whether it respects other behavioural parameters (e.g., response times). As reported in Frixione (2015), this possibility was considered by Robert French (1990) as a flaw of the test (at least with respect to its interpretation as a “general” test for the attribution of intelligence to an artificial system) since, in this way, it would be easier for the interrogator to discover whether her/his interlocutor is a human or a machine.
This critique has been rejected by scholars like Copeland (2000), who argues that, according to Turing’s original formulation, this kind of test would be illegitimate since “the specifications of the TT are clear: the interrogator is allowed only to put questions” (535). As a response to Copeland, however, one could argue that the testing of such “behavioural regularities” could easily be put in the form of questions during a completely open conversation. Of course, in the case of the TTT, the constraint according to which the interrogator is only allowed to put forward questions no longer holds (since communication is assumed to be possible via other channels as well).
However, in such a case, it is not clear what instrument the interrogator could use to check the provided answers. Despite such limitations, the proposal made by French is prima facie interesting from a cognitive perspective since it presupposes that, if a system is able to match human performance in dealing with these kinds of problems, it could deceive the interrogator by showing human-like compliance with respect to her/his expectations. Such “human-likeness” could also be measured by resorting to additional psychometric tests and evaluations. On deeper analysis, however, although a system able to pass the test under the described conditions would have to satisfy stronger constraints on its models, it is worth noting that this hypothetical human-like compliance in terms of “performance” would not necessarily imply that any underlying “structural” simulative model of cognition is actually running on the “machine” exhibiting that behaviour. As we saw in Chapter 3, indeed, the “performance match” is only one of the three criteria identified for ascribing an explanatory role to an artificial model. In addition, as Frixione points out (Frixione, 2015), even in this case, passing the TT or the TTT
should not be sufficient to validate a simulative model of cognition because it ignores: (i) any form of non-behavioral empirical evidence (such as, for example, evidence coming from the neurosciences); (ii) other relevant “cognitive virtues”, such as simplicity, consistency with other accepted theories, refusal of ad hoc solutions, and so on.
From this point of view, then, the behaviouristic problem of the TT, in all of its forms, remains. Even a “Super-simplified” TT, in which a single interrogator judges whether a machine is intelligent in a non-blind configuration (i.e., the interrogator knows that s/he is interacting with a machine), would not be unproblematic. Here, indeed, the problem is that the mere knowledge that we are dealing with a machine will bias our judgment as to whether that machine can think or not, as we may bring certain preconceptions to the table (see De Melo and Terada, 2019). For example, some interrogators could have higher expectations of the machine than of human beings, and their judgments could therefore be influenced by the fact that they “raise the bar for intelligence”, while others could be less demanding with machines. The “blind” situation eliminates, in principle, this risk.
Overall, I have indicated a selection, relevant for our purposes, of some of the objections raised against the TT and its variations. In particular, I have pointed out how the TT and the TTT cannot be considered suitable tests for a simulative model of cognition nor as a “general” test for intelligence due to their intrinsic anthropocentrism. Such tests can only provide an account of human-level performances in specific or integrated tasks, respectively (with the sole exception of the TT in its variation with behavioural experiments, which, in principle, can also offer some hints for evaluating the human-like compliance of the obtained performances). As mentioned, one of the major weaknesses of such tests is the fact that they are biased by a subjective evaluation procedure.
In the following part, we will introduce another influential critique of the TT (extendible to all its variants), proposed by the philosopher John Searle in his famous Chinese Room experiment. The experiment shows how such tests, being behavioural in nature, cannot be used for attributing an intrinsic “intelligence” (in the “human sense”) to the systems that may pass them. Nonetheless, they can be used for a superficial evaluation of the “performances” of such systems with respect to human performances: i.e., they could be useful to test their human-level capacities.