Comparison

The following table (Table 5.1) provides a summary comparison of the main evaluation approaches reviewed so far. For the reasons explained above concerning the different intrinsic status of challenges and competitions, such evaluation campaigns are not included.

The tests considered in the table are: the Turing Test (TT); the Turing Test version with the administration of behavioural experiments (TTwBE), such as those exploiting the priming effect; the Total Turing Test (TTT); the Newell Test (NT); the Winograd Schema Challenge (WSC); and the Minimal Cognitive Grid (MCG) proposed in the previous chapters. The MCG is included because - despite not explicitly proposing any kind of test - it represents, like the Newell Test, a concrete proposal for the evaluation of the structural accuracy of artificial systems that could in principle also be used to indirectly assess the degree of human-level artificial performance (in particular, thanks to the additional constraints considered for the “performance match” dimension).

The criteria considered for comparing the proposed approaches are those discussed above. The GT4I column indicates whether the approach can be considered a general test for “intelligence”. T4IHLI indicates whether the test allows the results of integrated artificial systems (i.e., systems performing more than one “intelligent” task, not narrow ones) to be compared with human performances (measured in terms of the percentage of overlapping positive results), so that this behavioural comparison can be used to assess the distance (if any) from human-level performance. T4IHLII indicates whether the test can be used to compare the performances of integrated artificial systems with those exhibited by humans by evaluating not only the “match” in terms of overlapping positive performances but also, as proposed in the MCG, other psychometric measures such as execution times and types of errors; as a consequence, this criterion should allow us to assess the distance (if any) from human-like performance. The columns T4SHLI (Test for Specific Human-Level Intelligence) and T4SHLII (Test for Specific Human-Like Intelligence) transfer, respectively, the human-level and the human-like considerations to a narrow setting, i.e., to non-multitasking systems. The CA column assesses whether the analyzed test can be used to evaluate the cognitive adequacy of artificial models (please note that tests able to detect the “human-likeness” of performances - considering measures other than just the percentage of success - may not be sufficient for evaluating the cognitive adequacy of a system). The SE column indicates whether the test resorts to subjective judgements. The QUALE and QUANE columns express the type of evaluation allowed by the test (qualitative and quantitative, respectively). Finally, the GE column indicates whether the test only provides Boolean (YES/NO) outcomes or whether it allows for graded rankings and evaluations (in either qualitative or quantitative terms).

Table 5.1

Tests                 | GT4I | T4IHLI | T4IHLII | T4SHLI | T4SHLII | CA  | SE  | QUALE | QUANE | GE
----------------------|------|--------|---------|--------|---------|-----|-----|-------|-------|-----
Turing test           | NO   | NO     | NO      | YES    | NO      | NO  | YES | YES   | NO    | NO
Turing test with BE   | NO   | NO     | NO      | YES    | YES     | NO  | YES | YES   | NO    | NO
Total Turing test     | NO   | YES    | NO      | NO     | NO      | NO  | YES | YES   | NO    | NO
Newell test           | NO   | MAYBE  | YES     | NO     | NO      | YES | YES | YES   | NO    | YES
Winog. schema chall.  | NO   | NO     | NO      | YES    | NO      | NO  | NO  | YES   | YES   | NO
Minimal cogn. grid    | NO   | MAYBE  | YES     | MAYBE  | YES     | YES | NO  | YES   | YES   | YES

Legend: GT4I = general test for intelligence; T4IHLI = test for integrated human-level intelligence; T4IHLII = test for integrated human-like intelligence; T4SHLI = test for specific human-level intelligence; T4SHLII = test for specific human-like intelligence; CA = cognitive adequacy; SE = subjective evaluation; QUALE = qualitative evaluation; QUANE = quantitative evaluation; GE = graded evaluation.

For each of the considered features, one option counts as the “positive” trait. For GT4I, the positive element is the eventual possibility of using the test as a “general” test for intelligence (including non-human intelligences), but none of the proposed approaches is able to tackle this issue. For the T4IHLI feature, “YES” is the positive option, since it describes the possibility of using the performances on the test to evaluate the human-level ability of integrated multitasking systems. The same holds for the T4IHLII, T4SHLI, and T4SHLII criteria. For the CA criterion (obviously not guaranteed by tests interested only in human-level comparisons of the performances of artificial systems) and for the QUALE, QUANE, and GE criteria, “YES” is likewise the positive option. In particular, for the last criterion (GE), the choice of “YES” as the positive option reflects the fact that being able to rank and grade the match between human and system performances makes it possible to expose more subtle differences and similarities, both between humans and machines and within the different classes of machines built to pass a given test. This kind of evaluation is not allowed by any of the variations of the TT (where, in the end, the interrogator must provide a binary YES/NO decision about the interlocutor) or by the Winograd Schema Challenge, while it is built into the Newell Test and the MCG. Finally, another comparative feature considered is SE. In this case, of course, “NO” is the positive option, since it indicates evaluation methods not based on subjective judgements.

As emerges from the table, the MCG has at its disposal a wider range of positive features than the other “evaluation toolkits”. While the YES/NO answers are self-explanatory (the MCG is a graded, non-subjective evaluation tool allowing both quantitative and qualitative analyses of the cognitive adequacy and the human-like performances of artificial systems in both single-task and multitasking settings), the MAYBE answers suggest that, as anticipated above, the MCG - and in particular its “performance match” dimension - could also be used as an indicator of the eventual human-level performances obtained by an integrated multitasking system or by narrow ones (and the same holds for the Newell Test with respect to the T4IHLI feature). Since, however, such an evaluation, despite being plausible, has not yet been tested, the corresponding cells have been filled with the MAYBE option rather than with the positive “YES”.
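As a rough illustration of how this comparison can be operationalised, the sketch below transcribes Table 5.1 into a small data structure and counts the “positive” traits of each approach (“YES” for every criterion except SE, where “NO” is the positive option). This is only a minimal illustrative sketch: the variable and function names are hypothetical and are not part of the MCG proposal or of any of the reviewed tests.

```python
# Illustrative sketch: Table 5.1 transcribed as a dictionary and scored by
# counting "positive" traits (YES everywhere except SE, where NO is positive).
# All names here are illustrative and do not come from the original proposal.

CRITERIA = ["GT4I", "T4IHLI", "T4IHLII", "T4SHLI", "T4SHLII",
            "CA", "SE", "QUALE", "QUANE", "GE"]

TABLE = {
    "Turing test":           ["NO", "NO",    "NO",  "YES",   "NO",  "NO",  "YES", "YES", "NO",  "NO"],
    "Turing test with BE":   ["NO", "NO",    "NO",  "YES",   "YES", "NO",  "YES", "YES", "NO",  "NO"],
    "Total Turing test":     ["NO", "YES",   "NO",  "NO",    "NO",  "NO",  "YES", "YES", "NO",  "NO"],
    "Newell test":           ["NO", "MAYBE", "YES", "NO",    "NO",  "YES", "YES", "YES", "NO",  "YES"],
    "Winograd schema chall.":["NO", "NO",    "NO",  "YES",   "NO",  "NO",  "NO",  "YES", "YES", "NO"],
    "Minimal cognitive grid":["NO", "MAYBE", "YES", "MAYBE", "YES", "YES", "NO",  "YES", "YES", "YES"],
}

def positive_traits(values):
    """Count positive traits: 'YES' for every criterion except SE, where 'NO' is positive."""
    count = 0
    for criterion, value in zip(CRITERIA, values):
        if criterion == "SE":
            count += value == "NO"
        else:
            count += value == "YES"
    return count

if __name__ == "__main__":
    # Rank the approaches by number of positive traits (MAYBE is not counted).
    for test, values in sorted(TABLE.items(), key=lambda kv: -positive_traits(kv[1])):
        print(f"{test:<24} positive traits: {positive_traits(values)}/10")
```

Under this simple count (which treats MAYBE as non-positive), the MCG ranks first, consistent with the observation above that it covers the widest range of positive features.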
