From self-driving vehicles to security and surveillance systems to robot vacuums, artificially intelligent (AI) systems are increasingly integrating themselves into our lives. Many of these modern innovations rely on AIs trained in object recognition, identifying objects like vehicles, people, or obstacles. Safety requires that a system know its limitations and realize when it doesn't recognize something.
Just how well calibrated are the confidence and accuracy of the object-recognition AIs that power these technologies? Our team set out to assess the calibration of these AIs and compare it with the calibration of human judgment.
Artificial Intelligence Identifications & Confidence
Our study required a set of novel visual stimuli that we knew were not already posted online, and so could not be familiar to any of the systems or individuals we wanted to test. We therefore asked 100 workers on Amazon Mechanical Turk to take 15 pictures in and around their homes, each one featuring an object. After removing submissions that failed to follow these instructions, we were left with 1,208 images. We uploaded these photos to four AI systems (Microsoft Azure, Facebook Detectron2, Amazon Rekognition, and Google Vision), each of which labeled the objects it identified in each image and reported its confidence in each label. To compare these AI systems with human performance, we showed the same images to human participants and asked them to identify the objects present and report their confidence.
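To make this step concrete, here is a minimal sketch of how one of the four systems, Amazon Rekognition, can be queried for labels and confidence scores through its boto3 SDK. The image file name and the MaxLabels and MinConfidence settings are illustrative assumptions rather than the exact parameters used in the study; the other systems expose similar label-plus-confidence interfaces.

```python
# Minimal sketch: asking Amazon Rekognition for object labels and confidence.
# The image path and parameter values are illustrative assumptions only.
import boto3

rekognition = boto3.client("rekognition")

with open("living_room_photo.jpg", "rb") as f:  # hypothetical example image
    image_bytes = f.read()

response = rekognition.detect_labels(
    Image={"Bytes": image_bytes},
    MaxLabels=10,       # return at most 10 labels
    MinConfidence=0.0,  # keep even low-confidence labels
)

# Rekognition reports confidence on a 0-100 scale for each label it returns.
for label in response["Labels"]:
    print(f"{label['Name']}: {label['Confidence']:.1f}% confident")
```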
To measure the accuracy of the labels, we asked a different set of human judges to estimate the percentage of other people who would report that the labeled object is present in the image, and paid them based on these estimates. These judges assessed the accuracy of labels generated by both the human participants and the AIs.
AI vs. Humans: Confidence and Accuracy Calibration
Below is a calibration curve comparing the confidence and accuracy of AIs and humans at object recognition. Both humans and AIs are, on average, overconfident. Humans reported an average confidence of 75% but were only 66% accurate. AIs reported an average confidence of 46% and were 44% accurate.
Overconfidence is most pronounced at high levels of confidence, as the figure below shows.
Figure 1: Calibration curves for AIs and humans, together with the line of perfect calibration. The blue line represents perfect calibration, in which confidence matches accuracy. Points below the blue line indicate overconfidence, where confidence exceeds accuracy; points above it indicate underconfidence, where accuracy exceeds confidence.
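For readers curious how a curve like this is computed, the sketch below bins labels by reported confidence and compares mean confidence with mean judged accuracy within each bin; the gap between the two is the measure of overconfidence. The toy numbers and the 10-point bin width are assumptions for illustration, not our study data.

```python
# Minimal calibration-curve sketch: bin labels by reported confidence, then
# compare mean confidence with mean judged accuracy inside each bin.
# The toy values below are made up for illustration; they are not study data.
import numpy as np

confidence = np.array([0.95, 0.90, 0.82, 0.60, 0.45, 0.30, 0.85, 0.70])  # reported confidence (0-1)
accuracy   = np.array([0.80, 0.75, 0.60, 0.65, 0.40, 0.35, 0.70, 0.50])  # judged accuracy (0-1)

bins = np.arange(0.0, 1.01, 0.1)              # 10-percentage-point confidence bins
bin_index = np.digitize(confidence, bins) - 1

for b in np.unique(bin_index):
    in_bin = bin_index == b
    mean_conf = confidence[in_bin].mean()
    mean_acc = accuracy[in_bin].mean()
    gap = mean_conf - mean_acc                # positive gap = overconfidence
    print(f"bin {bins[b]:.1f}-{bins[b + 1]:.1f}: "
          f"confidence {mean_conf:.2f}, accuracy {mean_acc:.2f}, gap {gap:+.2f}")
```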
Identifications at a glance
Before we conclude from the analysis above that humans are more overconfident than AIs, we must note an important difference between them. Each AI generated a list of identified objects spanning a wide range of confidence levels. Human participants, by contrast, reported the objects they thought most likely to be present in the image. As a result, high-confidence labels were overrepresented among human-generated labels compared to AI-generated ones. Since the risk of overconfidence increases with confidence, comparing all labels could be misleading.
Figure 2: A bar graph of the distribution of confidence levels for labels generated by humans and AIs.
To make a more equivalent comparison, we repeated our analysis using only labels identified with confidence of 80% or greater. This analysis also showed that humans and AIs are both overconfident, but in this subset human judgments were no more overconfident than the AIs'. Humans and AIs were 94% and 90% confident, respectively, but only 70% and 63% accurate.
Table 2. Average confidence and accuracy for each object identifier, restricted to labels with confidence of 80% or greater.
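As a rough illustration of this subset analysis, the snippet below keeps only labels with confidence of 80% or greater and averages confidence and accuracy for each identifier. The column names and example rows are assumptions for illustration, not our actual dataset.

```python
# Illustrative sketch of the high-confidence subset analysis.
# Column names and example rows are assumptions, not the study data.
import pandas as pd

labels = pd.DataFrame({
    "identifier": ["human", "human", "Google Vision", "Amazon Rekognition"],
    "confidence": [0.94, 0.62, 0.91, 0.88],  # reported confidence (0-1)
    "accuracy":   [0.72, 0.50, 0.66, 0.58],  # judged accuracy (0-1)
})

# Keep only labels identified with confidence of 80% or greater.
high_conf = labels[labels["confidence"] >= 0.80]

# Average confidence and accuracy for each object identifier,
# plus the gap between them as a measure of overconfidence.
summary = high_conf.groupby("identifier")[["confidence", "accuracy"]].mean()
summary["overconfidence"] = summary["confidence"] - summary["accuracy"]
print(summary)
```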
One notable finding was that humans and AIs generated different kinds of labels. Below is an image that we used in our study. For this image, humans generated labels such as “remote” (85% confidence) and “buttons” (52%), whereas the AI generated labels with similar confidence such as “indoor” (87%) and “font” (75%).
Figure 3: An example image that humans and AIs were asked to identify objects in; the identifications and confidence levels shown here were produced by Google Vision.
Conclusions
The results support our prediction that artificially intelligent agents are vulnerable to being too sure of themselves, just like people. This matters for tools guided by artificial intelligence: autonomous vehicles, security and surveillance systems, and robot assistants. Because AIs are prone to overconfidence, users and operators should keep this in mind when relying on these tools. One response is to make a system more risk averse as the consequences of error grow, so that it is less likely to act on its imperfect beliefs. Another is to pair AIs with checks and verification systems that can catch errors. Provably safe AI systems must know their limitations and demonstrate well-calibrated confidence.
by Angelica Wang, Kimberly Thai, and Don A. Moore