Tackling the Generic Masculine: Evaluating Gender Neutrality of German AI-Generated Texts

Authors

  • Jasmin Schmidt, IU International University of Applied Sciences
  • Claudia Hess, IU International University of Applied Sciences https://orcid.org/0000-0001-9373-4019

DOI:

https://doi.org/10.34190/icgr.9.1.4667

Keywords:

Artificial Intelligence, Gender neutrality, large language models, fairness, Ethics, Responsible AI

Abstract

When using large language models (LLMs)—artificial intelligence (AI) systems trained to generate and interpret human language—gender-specific biases in AI-generated text represent a key challenge. Particularly in grammatically gendered languages such as German, this often results in text outputs using the so-called generic masculine as an allegedly gender-neutral default form. Consequently, generated text is neither neutral nor inclusive, and gender stereotypes are perpetuated. Organisations seeking to offer LLM-based systems that generate inclusive language by default typically rely on system prompts or specific model configurations to steer the model's responses. However, for German, methods for automatically, systematically, and objectively assessing whether such approaches enhance gender neutrality and inclusivity remain limited and underexplored. To address this gap, a framework was developed that applies the LLM-as-a-judge concept: one LLM systematically evaluates the outputs of another, enabling automated and replicable assessments of features such as gender neutrality and inclusivity. The paper presents the development and evaluation of a prototype of this framework, designed specifically for German, following a Design Science Research approach. The framework can be used to evaluate the effectiveness of model configurations or system prompts. To enable this in a systematic and replicable manner, a catalogue of 150 German prompts was developed, adapting and extending approaches from other languages. The outputs generated by an LLM in response to these prompts are then assessed by an evaluation module: linguistic analysis identifies gendered forms and grammatical structures, while scoring metrics quantify the degree of gender neutrality and inclusivity. To demonstrate the framework, it was applied in several test runs using an iteratively developed system prompt designed to elicit gender-neutral responses.
The resulting metrics allowed assessment of whether a given prompt effectively enhances the neutrality of generated outputs and reduces gender-specific bias. Potential applications of the framework in organisational settings, as well as its relevance for the development of responsible AI systems, are outlined.
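The abstract describes an evaluation module that combines linguistic analysis of gendered forms with scoring metrics. As a minimal illustration of this idea only — the patterns, the demo lexicon, and the scoring rule below are assumptions for demonstration, not the paper's actual linguistic analysis or its GenScore-DE metric — a rule-based neutrality score for German text might be sketched as:

```python
import re

# Illustrative patterns (assumptions, not the paper's rules):
# - inclusive/neutral forms: gender star/colon/underscore ("Mitarbeiter*innen")
#   and nominalised participles ("Mitarbeitende", "Studierende")
# - generic-masculine candidates: a small demo lexicon of role nouns,
#   excluding occurrences directly followed by a gender marker
INCLUSIVE = re.compile(r"\b\w+[*:_]innen\b|\b\w+(ende|enden)\b")
GENERIC_MASCULINE = re.compile(
    r"\b(Mitarbeiter|Lehrer|Studenten|Ärzte|Kunden)\b(?![*:_])"
)

def neutrality_score(text: str) -> float:
    """Share of detected gender-marked references realised in an
    inclusive or neutral form; 1.0 if no such references are found."""
    inclusive = len(INCLUSIVE.findall(text))
    masculine = len(GENERIC_MASCULINE.findall(text))
    total = inclusive + masculine
    return 1.0 if total == 0 else inclusive / total
```

In this toy scoring, "Die Mitarbeitenden treffen sich." would score 1.0, "Die Mitarbeiter treffen sich." 0.0, and a mixed sentence somewhere in between; a real implementation would require proper morphological analysis rather than regular expressions.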

Author Biographies

Jasmin Schmidt, IU International University of Applied Sciences

Jasmin Schmidt works on AI governance, with a focus on compliance with the EU AI Act and bias measurement. She develops methodological approaches for the quantitative assessment of fairness in AI systems (e.g. GenScore-DE) and for the risk-based classification of applications in public administration.

Claudia Hess, IU International University of Applied Sciences

Claudia Hess is a professor of Digital Transformation at IU International University of Applied Sciences. She teaches on the application of artificial intelligence, including ethical implications, and on digital transformation projects. She also conducts research on young women in STEM and works in industry as a consultant and coach.

Published

2026-04-25