Kevin Baum, Richard Uth, Holger Hermanns, Sophie Kerstan, Markus Langer, Anne Lauber-Rönsberg, Philip Meinel, Laura Stenzel, Sarah Sterz, Hanwei Zhang

The Principal’s Principles: Actionable (Personalized) AI Alignment as Underexplored XAI Application Context

Explainable Artificial Intelligence (XAI) has been proposed as a key element—or even a prerequisite—for addressing various challenges and fulfilling numerous societal desiderata. Yet there is one topic that is frequently debated but, with a few exceptions, rarely recognized as a relevant application context for XAI methods: the alignment of artificial intelligence agents (AIAs).

Background and Motivation 

In the foreseeable future, AIAs—ranging from software agents (such as OpenAI’s Operator or Google’s Project Mariner) to cyber-physical systems (like Tesla’s Optimus or 1X’s Neo)—will co-inhabit both our digital and physical environments. These agents will execute tasks delegated to them by humans (human principals) either directly or indirectly, often involving considerable technical autonomy. This scenario immediately raises the critical challenge of ensuring these agents act as they ought to, i.e., constrained by human intents and preferences or guided by norms from diverse domains—a bundle of challenges commonly known as the AI alignment problem.

While the exact formulation of the AI alignment challenge and the criteria for solving it remain debated, we argue that methods from XAI should—and inevitably will—play a central role. Concretely, understanding task delegation to AIAs as intent-driven interaction that establishes extended human agency raises questions closely linked to indirect human oversight and responsibility gaps—questions that are inherently associated with XAI research. In the following, we briefly outline three key aspects supporting our argument:

Personal Normative Alignment and Delegation as Extension of Agency 

We propose understanding the delegation of tasks to AIAs by human principals as a form of extension of agency via personal normative alignment, focusing on three factors: warranted trust, appropriate responsibility, and anticipatory control.

To this end, we propose breaking the problem of personal normative alignment down into a series of sub-tasks. Rather than embedding general normative principles or values directly into AIAs, the focus should be on enabling human principals to do the following (a minimal illustrative sketch follows the list):

  • co-create the formulation and explication of their normative expectations relative to foreseeable contexts, including comprehending the implications of their judgements; 

  • communicate these expectations to AIAs in an unambiguous, interactive manner; 

  • verify that the AIAs have correctly ‘understood’ these normative expectations and that these AIAs act reliably and robustly in accordance with these expectations.
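To make this decomposition concrete, here is a minimal, purely illustrative Python sketch of how the three sub-tasks could be modelled on the principal’s side. The names (NormativeExpectation, AlignmentRecord) and the probe-based verification step are hypothetical constructions of ours, not an existing API or an implemented method:

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class NormativeExpectation:
    """One explicated expectation of a human principal, scoped to a context."""
    context: str                   # a foreseeable application context, e.g. "shopping"
    rule: Callable[[dict], bool]   # permissibility predicate over a proposed action
    rationale: str                 # the principal's own justification for the rule

@dataclass
class AlignmentRecord:
    """Tracks the three sub-tasks for one principal-AIA pairing."""
    expectations: list = field(default_factory=list)

    def co_create(self, expectation: NormativeExpectation) -> None:
        # Sub-task 1: the principal formulates and explicates an expectation.
        self.expectations.append(expectation)

    def communicate(self) -> list:
        # Sub-task 2: serialize the expectations into an unambiguous form
        # that can be handed over to the AIA.
        return [{"context": e.context, "rationale": e.rationale}
                for e in self.expectations]

    def verify(self, agent_decision: Callable, probes: list) -> bool:
        # Sub-task 3: check that the AIA's verdicts match the principal's rules
        # on probe actions drawn from the foreseen contexts.
        return all(agent_decision(p) == e.rule(p)
                   for e in self.expectations
                   for p in probes
                   if p.get("context") == e.context)
```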

In combination, fulfilling these requirements makes it possible to establish a conceptual link between what AIAs do and the moral responsibility of their human principals for the AIAs’ behaviour: for a wide range of cases, the traditional conditions of control and epistemic access are fulfilled in the form of indirect control and responsibility. Achieving this, however, requires justifiability as a special kind of explainability.

The Role of XAI and Justifiability 

We argue that XAI technologies are a necessary foundation for meeting two of the three requirements outlined above. In particular, iterative XAI processes—likely based on contrastive and counterfactual methods—are crucial for the co-creation of a human principal’s normative expectations, especially in light of the potential consequences such expectations may entail once articulated.
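As a hypothetical illustration of such an iterative process, the sketch below stress-tests a freshly articulated rule against minimally perturbed ‘counterfactual twins’ of candidate actions; every contrastive pair is handed back to the principal for confirmation or revision. The toy budget rule and all function names are assumptions made for this example:

```python
def co_creation_loop(rule, candidate_actions, perturb):
    """Collect contrastive cases: pairs where an action and its minimally
    changed counterfactual twin receive different verdicts under the rule."""
    flagged = []
    for action in candidate_actions:
        twin = perturb(action)          # a minimally changed counterfactual variant
        if rule(action) != rule(twin):  # contrastive pair found
            flagged.append((action, twin))
    return flagged  # presented back to the principal for revision or confirmation

# Hypothetical usage: a budget rule and a perturbation nudging the amount.
rule = lambda a: a["amount"] <= 50
actions = [{"item": "snacks", "amount": 49}, {"item": "snacks", "amount": 10}]
perturb = lambda a: {**a, "amount": a["amount"] + 2}

for action, twin in co_creation_loop(rule, actions, perturb):
    print(f"Why is {action} permitted but {twin} not?")
```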

Moreover, we believe that justifications, a currently underexplored class of explanation techniques, will be central to verifying whether an AI system has correctly grasped the intent behind those expectations.

While explanations, broadly speaking, provide answers to why-questions, justifications explain (typically in terms of reasons) why something is right, appropriate, or acceptable according to a given normative standard. Justifications (or, at least, explanations from which the human principal can reasonably infer such justifications) are essential for enabling a human principal to assess whether an AIA has correctly ‘understood’ the principal’s normative expectations. Such justifications are, moreover, critical for assessing the agent’s trustworthiness and, thus, also for fostering appropriately calibrated, justified, and potentially even warranted trust.
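The contrast between explanation and justification can be made tangible with a small hypothetical sketch: instead of reporting causal factors behind a decision, the routine below cites the principal’s own rationales as reasons for why an action is or is not permissible relative to their explicated standard. The dict-based representation of expectations is again a simplifying assumption of ours:

```python
def justify(action, expectations):
    """Answer not merely *why* the AIA chose `action`, but why the action is
    (im)permissible relative to the principal's normative standard."""
    violated = [e for e in expectations if not e["rule"](action)]
    if violated:
        return {"permissible": False,
                "reasons": [e["rationale"] for e in violated]}
    return {"permissible": True,
            "reasons": [e["rationale"] for e in expectations]}

# Hypothetical standard: two expectations for a shopping agent.
expectations = [
    {"rule": lambda a: a["amount"] <= 50, "rationale": "stay within the weekly budget"},
    {"rule": lambda a: a["vendor"] != "X", "rationale": "avoid purchases from vendor X"},
]
print(justify({"amount": 60, "vendor": "X"}, expectations))
# -> both rationales are cited as reasons against the action
```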

Indirect Responsibility and Forward-Looking Human Control and Oversight

We claim that if the above conditions are met and all relevant application contexts have been taken into account, the result is a successfully personally normatively aligned AIA. We claim further that, as a result, at least some traditional conditions for moral responsibility are met: the epistemic condition is satisfied once the human principal has sufficient anticipatory understanding (through explanations and/or justifications) of how and why the AIA will or would act in specific contexts; the control condition may often be met indirectly through clearly communicated normative expectations, their assessment, and contextually sensitive anticipatory authorization. Therefore, personal normative alignment allows for appropriate responsibility attributions and offers a plausible account of indirect, anticipatory human control and oversight.

In other words: given the conditions above, all of an AIA A’s actions that take place in foreseen application contexts will ceteris paribus (especially in the absence of malfunction) be permissible according to all of the normative expectations of a human principal H that have been correctly ‘transferred’ to A. Thus, the AIA’s actions are (more or less) explicitly yet anticipatorily authorized and sanctioned by H in these foreseen contexts. To this extent, H bears (at least some kind of indirect) backward-looking responsibility for A’s actions, while bearing forward-looking responsibility via the direct responsibility to give the AIA normative guardrails through personal normative alignment. In this respect, the human principal becomes the locus of responsibility and the appropriate object of blame (and, of course, praise), because AIAs may be seen as their (metaphorical or actual) extension of action; this, in turn, allows for an indirect version of meaningful and effective forward-looking human control. (In sufficiently rich application domains, however, one cannot hope to consider all relevant contexts in advance; approximate measures of personal normative alignment and the safe exploration of new contexts are therefore another key issue for the overall approach.)
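Under the simplifying assumption that foreseen contexts and the principal’s rules can be enumerated, one could approximate the degree of personal normative alignment and guard against unforeseen contexts roughly as in the sketch below; this illustrates the idea only and is not a proposed metric:

```python
def alignment_score(agent_decision, expectations, probes):
    """Approximate personal normative alignment: the fraction of probe actions
    in foreseen contexts where the AIA's verdict matches the principal's rule."""
    matches = total = 0
    for context, rule in expectations.items():
        for probe in probes.get(context, []):
            total += 1
            matches += int(agent_decision(context, probe) == rule(probe))
    return matches / total if total else 0.0

def guard(context, expectations):
    """Safe-exploration guard: defer to the principal in unforeseen contexts."""
    if context not in expectations:
        raise PermissionError(f"Context {context!r} was not foreseen; ask the principal.")

# Hypothetical usage with one foreseen context and a slightly mislearned rule:
expectations = {"shopping": lambda a: a["amount"] <= 50}
probes = {"shopping": [{"amount": 30}, {"amount": 55}]}
agent = lambda ctx, a: a["amount"] <= 60
print(alignment_score(agent, expectations, probes))  # 0.5: one probe disagrees
```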

Presentation “The Principal’s Principles: Actionable (Personalized) AI Alignment as Underexplored XAI Application Context”, held at the 3rd TRR 318 Conference: Contextualizing Explanations on 17 June 2025 in Bielefeld, Germany
