Measuring AI System Effectiveness

Note: This post is adapted from a version originally written for sela.co.il

What are we really trying to measure when we think about evaluating intelligence? Couched in human terms, it is the overall ability to think, reason and make rational decisions. In conventional psychology, the aim is to measure several different cognitive characteristics and form an overall view of an individual. Usually something like the Wechsler Adult Intelligence Scale (WAIS) is used for this, which subjects a person to a variety of different tasks; how well the subject performs in each task contributes to a final comprehensive assessment (Lichtenberger et al., 2012).

Artificial Intelligence (AI) has been defined as the ability of a computer system – using maths and logic and leveraging new information – to simulate human cognitive functions such as learning or problem solving in order to make decisions (Microsoft, 2021). In fact, the term Cognitive Intelligence has also been coined to emphasize the human-like nature of these computer-derived capabilities. It seems, however, that the comprehensive, aggregative nature of human intelligence testing does not carry over to the assessment of computer-based systems: AI is assessed at a more functional level, whereas human intelligence testing is more holistic.

The WAIS test is obviously not suitable for computer systems, which, as noted above, are largely uni-modal, focusing on a single human-like task or function at a time. When we talk about assessing AI, what we really mean is gauging how efficient and accurate a computer system is at performing a single – albeit perhaps complex – task. This could be, for example, navigating a self-driving car through a series of obstacles or summarizing a legal file into topics. These are rather involved multi-step tasks that humans can do, and that have come to be regarded as artificial intelligence.

Although the exact underlying biological processes that contribute to intelligence in humans are not entirely understood, research has narrowed the candidates down to four factors – brain size, sensory ability, speed and efficiency of neural transmission, and working memory capacity (Stangor & Walinga, in Cummings & Sanders, 2019).

These factors have clear parallels in the world of computer systems:

·  brain size is analogous to storage capacity
·  sensory ability is analogous to the number and type of incoming data sources
·  speed and efficiency of neural transmission are analogous to the complexity of the code, the number and clock speed of CPU cores, and the extent of distributed processing
·  working memory capacity is analogous to the available RAM.

The utilization of common statistical measures used to assess machine learning model performance – such as R², MSE, F1-score, precision and recall – is only the first step. These measures are applied at too low a level to count as assessments of AI; they are model-level assessments. What is required is a way to aggregate these low-level measures and standardize a series of tests that could be applied to gauge the accuracy, speed and effectiveness of AI systems in toto. In doing so, we would have a way to compare systems both against the equivalent human capability and against other similar cognitive models: for example, how well a system translates a benchmarked document, or how safely a car performs when being guided by a self-driving AI.
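To make the distinction between model-level and system-level measurement concrete, here is a minimal Python sketch. The precision, recall and F1 formulas are standard; the `aggregate_score` helper, the sub-task names and the weights are purely illustrative assumptions about how a roll-up might work, not an established benchmark.

```python
# Minimal sketch: compute standard model-level metrics, then roll them up
# into a single (hypothetical) system-level score. The weighting scheme and
# aggregate_score() are assumptions for illustration only.

def precision_recall_f1(y_true, y_pred, positive=1):
    """Standard precision, recall and F1 for a binary classification task."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

def aggregate_score(metrics, weights):
    """Hypothetical roll-up: a weighted average of normalized sub-task metrics."""
    return sum(metrics[name] * w for name, w in weights.items()) / sum(weights.values())

# Example: two sub-tasks of an imagined AI system, each with its own metric.
_, _, f1_vision = precision_recall_f1([1, 0, 1, 1, 0], [1, 0, 0, 1, 0])
translation_score = 0.72  # assumed score from a separate translation benchmark

system_score = aggregate_score(
    {"vision_f1": f1_vision, "translation": translation_score},
    {"vision_f1": 0.5, "translation": 0.5},  # equal weights are an assumption
)
print(f"Vision F1: {f1_vision:.2f}, system-level score: {system_score:.2f}")
```

The open question, of course, is not the arithmetic but the choice of sub-tasks, normalization and weights – which is exactly what a standardized testing regimen would have to agree on.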

These are measurable, but not usually at the sub-task level. When measuring intelligence, we don’t, for example, usually measure a person’s ability to discern colors, or to sort words of similar meaning into lists, though these would be tested if we were trying to get to the bottom of an evident deficit of some sort.

While there are increasing gains in this area, there does not seem to be anything like an all-embracing AI testing regimen at present. This is likely due to the difficulty of lumping different types of AI systems into a single performance assessment tool. Moreover, it is uncommon for the data science team innovating in one area – say, computer vision – to also deal with other types of models. Were they to do so, a more cohesive and overarching viewpoint could be arrived at more easily.

As AI solutions proliferate in the market, it is becoming increasingly important to be able to differentiate their reliability and accuracy. For example, how can we compare a Google self-driving solution with a Tesla one? Which platform offers the best off-the-shelf AI plugin for translating speech into text? Which computer vision solution provides the best accuracy and reliability for detecting potentially aggressive behavior in a crowd of protesters? The list of potential AI solutions is effectively endless and growing every day.

When discussing this subject with others, the topic of the Artificial Intelligence Maturity Model (AIMM) often comes up; however, it does not really bear on this discussion. An AIMM attempts to assess the level of AI integration within a particular company, not the capabilities of individual AI systems. It is true that the more integrated AI capabilities are within a company, the better the AIMM score, and it is likely that companies scoring higher on an AIMM assessment are also those that would have innovated some sort of AI task assessment solution. However, such solutions are likely to be focused on individual tasks rather than being holistic assessments as such.

So what should we, as an AI community, be doing – if anything – about generating holistic testing assessments? Do we necessarily need to aim for a catch-all testing regimen that encompasses multiple facets of artificial intelligence, in the same way that cognitive testing in humans assesses their level of intelligence?

I don’t believe we do. Although artificial intelligence as it stands today probably needs a consistent approach to testing, unless we are considering testing a humanoid robot designed to act like a human, there would be little to gain from combining the testing of different categories of AI into a single regimen.

What do you think?


References

Cummings, J. A. & Sanders, L. (2019). Introduction to Psychology. Saskatoon, SK: University of Saskatchewan Open Press. https://openpress.usask.ca/introductiontopsychology/
