Human-Centered and Contextual Assessment of Human-AI Decision-Making Interventions
Human-AI collaboration is increasingly promoted to improve high-stakes decision-making, such as medical and mental health diagnosis, yet its potential remains unrealized, with human-AI collaboration resulting, on average, in worse performance than either humans or AI alone. Explainable AI (XAI) is often proposed as a way to improve human-AI decision-making, but has so far failed to do so. In fact, explanations may reduce human-AI performance due to a lack of appropriate trust, whether from over-trusting an AI’s abilities or from an unfounded lack of trust. Several interventions have been proposed to address this challenge and enable appropriate trust. One approach, cognitive forcing, aims to engage decision-makers in analytic thinking about the explanations and has been shown to reduce overreliance. Taking this a step further, evaluative AI aims to support decision-making by allowing users to explore their own hypotheses, using model outputs and explanations to review evidence for and against each hypothesis, and has also been shown to reduce overreliance. However, these studies are not conclusive and require further supporting evidence to justify these approaches. Furthermore, contextual factors, such as user expertise, personality traits, and decision stakes, may influence the effectiveness of such interventions.
To assess the influence of such contextual factors on appropriate trust, robust and reproducible human-centered empirical studies are needed. However, human-centered evaluations often rely on proxy tasks, such as maze tasks or simple toy datasets, which do not accurately predict real-world outcomes. Application-grounded evaluations, on the other hand, provide valuable insights into human factors but are often domain-specific and require domain experts, limiting generalizability and making them costly. For these reasons, evaluation approaches that simulate real decision-making scenarios are needed to mitigate the practical constraints of fully application-grounded evaluation, making studies more scalable, generalizable, and accessible.
To address this challenge, we developed an application-grounded evaluation framework that supports large-scale online studies using authentic model predictions and explanations, rather than relying on proxies or Wizard-of-Oz methodologies. The framework is built around BLOCKIES, a parametric approach for generating datasets of simulated diagnostic tasks, offering fine-grained control over the traits and biases in the data used to train real-world models. These tasks are designed to be easy to learn but difficult to master, enabling participation from both experts and non-experts. This affords an evaluation paradigm in which researchers can systematically manipulate and observe contextual factors, making the study of human-AI decision-making more accessible.
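To make the idea of a parametric task generator concrete, the minimal sketch below shows one way such a generator could be structured; the function, trait names, and parameters are hypothetical illustrations for this abstract, not the actual BLOCKIES interface.

```python
import random
from dataclasses import dataclass


@dataclass
class Case:
    """A single simulated diagnostic case: observable traits plus a ground-truth label."""
    traits: dict
    diseased: bool


def generate_cases(n_cases: int,
                   p_disease: float = 0.5,
                   bias_strength: float = 0.0,
                   seed: int | None = None) -> list[Case]:
    """Generate simulated diagnostic cases with a controllable spurious correlation.

    The 'body_pattern' trait genuinely determines the label, while the
    'block_color' trait co-occurs with the label only as often as
    `bias_strength` dictates, letting experimenters dial a bias into the data.
    """
    rng = random.Random(seed)
    cases = []
    for _ in range(n_cases):
        diseased = rng.random() < p_disease
        # Diagnostic trait: deterministically tied to the ground-truth label.
        body_pattern = "irregular" if diseased else "regular"
        # Spurious trait: correlates with the label at a controllable rate.
        if rng.random() < bias_strength:
            block_color = "red" if diseased else "blue"
        else:
            block_color = rng.choice(["red", "blue"])
        traits = {"body_pattern": body_pattern,
                  "block_color": block_color,
                  "size": rng.randint(3, 9)}  # irrelevant noise trait
        cases.append(Case(traits=traits, diseased=diseased))
    return cases


# Example: a training set in which color spuriously predicts the disease 80% of the time.
train = generate_cases(n_cases=1000, bias_strength=0.8, seed=42)
```

Dialling a bias parameter in this way would, for instance, let researchers train models that make realistic but systematically incorrect predictions, the kind of model behavior needed to study overreliance.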
We validated the paradigm in a large-scale online study examining the influence of decision-making stakes on appropriate trust. Participants performed two rounds of a diagnostic task generated via BLOCKIES: in the first round, they made diagnoses without AI support, and in the second round, with AI support. Participants were assigned to either a high-stakes or a low-stakes condition. In the high-stakes condition, participants were offered a substantial monetary bonus contingent on strong performance. In contrast, participants in the low-stakes condition were offered a much smaller bonus, tied to a lower performance threshold. The results indicated that, when provided with AI support, participants in the high-stakes condition took significantly more time to make decisions than those in the low-stakes condition, yet made worse decisions, tending to overrely on incorrect model predictions. While this initial study demonstrates the usefulness of the framework for evaluating decision-making in online studies, it does not yet evaluate decision-making interventions aimed at improving human-AI collaboration.
In future studies, we plan to evaluate the influence of stakes on decision-making using state-of-the-art interventions, including cognitive forcing and evaluative AI. However, interfaces implementing these approaches rely on foundational XAI methods for generating model explanations. Therefore, our first study will compare the effects of different categories of XAI approaches, including feature-, concept-, and example-based explanations, on fostering appropriate trust in high-stakes contexts. We will then integrate these findings into the development of the aforementioned approaches. Our goal is to develop an interactive decision-support system that incorporates these insights to support real-world tasks, such as mental health assessment.
Beyond our own work, the evaluation framework makes large-scale studies more accessible to researchers by eliminating the need for costly domain experts or hard-to-obtain specialized datasets. Furthermore, its flexibility enables researchers to explore additional contextual factors beyond decision stakes, including user expertise, task complexity, and personality traits, among others. The framework is not intended to replace domain-specific evaluations but rather to complement them, offering a cost-effective tool for exploring contextual factors in human-AI collaboration within realistic settings involving the general population. By making the evaluation of human-AI collaboration more accessible and standardized, the framework offers researchers opportunities to rigorously assess decision-making interventions in large-scale studies and to enhance the reproducibility of their findings.
We will present an introduction to the framework and BLOCKIES dataset generator, along with results from our studies examining explanation types in high-stakes decision-making contexts.
Presentation “Human-Centered and Contextual Assessment of Human-AI Decision-Making Interventions” held at the 3rd TRR 318 Conference: Contextualizing Explanations, 18 June 2025, Bielefeld, Germany.