Human-Centered and Contextual Assessment of Human-AI Decision-Making Interventions
Human-AI collaboration is increasingly promoted to improve high-stakes decision-making, such as medical and mental health diagnosis, yet its potential remains unrealized, with human-AI collaboration resulting, on average, in worse performance than either humans or AI alone. Explainable AI (XAI) is often proposed as a way to improve human-AI decision-making, but has so far failed to do so. In fact, explanations may reduce human-AI performance due to a lack of appropriate trust, whether from over-trusting an AI’s abilities or from an unfounded lack of trust. Several interventions have been proposed to address this challenge and enable appropriate trust. One approach, cognitive forcing, aims to engage decision-makers in analytic thinking about the explanations and has been shown to reduce overreliance. Taking this a step further, evaluative AI aims to support decision-making by allowing users to explore their own hypotheses, using model outputs and explanations to review evidence for and against each hypothesis, and has also been shown to reduce overreliance. However, these studies are not conclusive and require further supporting evidence to justify these approaches. Furthermore, contextual factors, such as user expertise, personality traits, and decision stakes, may influence the effectiveness of such interventions.
To assess the influence of such contextual factors on appropriate trust, robust and reproducible human-centered empirical studies are needed. However, human-centered evaluations often rely on proxy tasks, such as maze tasks or simple toy datasets, which do not accurately predict real-world outcomes. Application-grounded evaluations, on the other hand, provide valuable insights into human factors but are often domain-specific and require domain experts, limiting generalizability and making them costly. For these reasons, evaluation approaches that simulate real decision-making scenarios are needed to mitigate the practical constraints of fully application-grounded evaluation, making studies more scalable, generalizable, and accessible.
To address this challenge, we developed an application-grounded evaluation framework that supports large-scale online studies using authentic model predictions and explanations, rather than relying on proxies or Wizard-of-Oz methodologies. The framework is built around BLOCKIES, a parametric approach for generating datasets of simulated diagnostic tasks, offering fine-grained control over the traits and biases in the data used to train real-world models. These tasks are designed to be easy to learn but difficult to master, enabling participation from both experts and non-experts. This affords an evaluation paradigm in which researchers can systematically manipulate and observe contextual factors, making the study of human-AI decision-making more accessible.
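To make the idea of a parametric task generator concrete, the minimal sketch below shows one way such a generator could be structured; the function, trait names, and parameters are hypothetical illustrations for this abstract, not the actual BLOCKIES interface.

```python
import random
from dataclasses import dataclass


@dataclass
class Case:
    """A single simulated diagnostic case: observable traits plus a ground-truth label."""
    traits: dict
    diseased: bool


def generate_cases(n_cases: int,
                   p_disease: float = 0.5,
                   bias_strength: float = 0.0,
                   seed: int | None = None) -> list[Case]:
    """Generate simulated diagnostic cases with a controllable spurious correlation.

    The 'body_pattern' trait genuinely determines the label, while the
    'block_color' trait co-occurs with the label only as often as
    `bias_strength` dictates, letting experimenters dial a bias into the data.
    """
    rng = random.Random(seed)
    cases = []
    for _ in range(n_cases):
        diseased = rng.random() < p_disease
        # Diagnostic trait: deterministically tied to the ground-truth label.
        body_pattern = "irregular" if diseased else "regular"
        # Spurious trait: correlates with the label at a controllable rate.
        if rng.random() < bias_strength:
            block_color = "red" if diseased else "blue"
        else:
            block_color = rng.choice(["red", "blue"])
        traits = {"body_pattern": body_pattern,
                  "block_color": block_color,
                  "size": rng.randint(3, 9)}  # irrelevant noise trait
        cases.append(Case(traits=traits, diseased=diseased))
    return cases


# Example: a training set in which color spuriously predicts the disease 80% of the time.
train = generate_cases(n_cases=1000, bias_strength=0.8, seed=42)
```

Dialling a bias parameter in this way would, for instance, let researchers train models that make realistic but systematically incorrect predictions, the kind of model behavior needed to study overreliance.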
We validated the paradigm in a large-scale online study examining the influence of decision-making stakes on appropriate trust. Participants performed two rounds of a diagnostic task generated via BLOCKIES: in the first round, they made diagnoses without AI support, and in the second round, with AI support. Participants were assigned to either a high-stakes or a low-stakes condition. In the high-stakes condition, participants were offered a substantial monetary bonus contingent on strong performance. In contrast, participants in the low-stakes condition were offered a much smaller bonus, tied to a lower performance threshold. The results indicated that, when provided with AI support, participants in the high-stakes condition took significantly more time to make decisions than those in the low-stakes condition, yet made worse decisions, tending to overrely on incorrect model predictions. While this initial study demonstrates the usefulness of the framework for evaluating decision-making in online studies, it does not yet evaluate decision-making interventions aimed at improving human-AI collaboration.
In future studies, we plan to evaluate the influence of stakes on decision-making using state-of-the-art interventions, including cognitive forcing and evaluative AI. However, interfaces implementing these approaches rely on foundational XAI methods for generating model explanations. Therefore, our first study will compare the effects of different categories of XAI approaches, including feature-, concept-, and example-based explanations, on fostering appropriate trust in high-stakes contexts. We will then integrate these findings into the development of the aforementioned approaches. Our goal is to develop an interactive decision-support system that incorporates these insights to support real-world tasks, such as mental health assessment.
Beyond our own work, the evaluation framework makes large-scale studies more accessible to researchers by eliminating the need for costly domain experts or hard-to-obtain specialized datasets. Furthermore, its flexibility enables researchers to explore additional contextual factors beyond decision stakes, including user expertise, task complexity, and personality traits, among others. The framework is not intended to replace domain-specific evaluations but rather to complement them, offering a cost-effective tool for exploring contextual factors in human-AI collaboration within realistic settings involving the general population. By making the evaluation of human-AI collaboration more accessible and standardized, the framework offers researchers opportunities to rigorously assess decision-making interventions in large-scale studies and to enhance the reproducibility of their findings.
We will present an introduction to the framework and BLOCKIES dataset generator, along with results from our studies examining explanation types in high-stakes decision-making contexts.
Presentation “Human-Centered and Contextual Assessment of Human-AI Decision-Making Interventions” held at the 3rd TRR 318 Conference: Contextualizing Explanations, 18 June 2025, Bielefeld, Germany.