Loading data...
Loading data...
Generative AI is being rapidly adopted into systems people rely on to make decisions about their health, finances, relationships, and sense of self. If developed and deployed thoughtfully, these systems have the potential to expand human access and agency. But the psychological risks are relational, context-dependent, and unfold over time. A model can pass conventional safety benchmarks and still erode a user's autonomy, scaffold emotional dependency, or quietly degrade the skills it was meant to support.
Existing evaluations rarely capture these dynamics: most are single-turn, static, and validated against narrow definitions of harm that no single discipline would have written alone.
Today, we're introducing ImpactBench: a benchmark suite designed to measure how AI systems affect human flourishing across extended, realistic interactions. Built through an open submission process with researchers, clinicians, legal scholars, and community advocates, the suite spans 18 expert-submitted benchmarks covering emotional dependence, cognitive autonomy, health, legal and financial advice, child safety, and more. Each benchmark is evaluated through multi-turn adversarial simulation with demographically stratified user personas, so that risks are surfaced the way they appear in real conversations rather than in isolated prompts.
ImpactBench is a first-of-its-kind collaboration between the MIT Media Lab, the Psychology of Technology Institute, the USC Neely Center, and UC Berkeley, launched at the AHA Flourishing Workshop at MIT in October 2025 with support from the Omidyar Network.
ImpactBench is grounded in our belief that evaluations for AI systems should be:
Alongside the ImpactBench benchmark suite, we're sharing how 14 leading AI systems perform across 18 expert-submitted constructs, setting a baseline for the field to improve upon.
ImpactBench organizes 18 expert-submitted benchmarks into three domains of human flourishing: Physical (health, finances, legal and civic rights, education and career), Psychological (mental wellbeing, autonomy preservation, creativity and cognition, self-determination, learning), and Societal (social relationships, fairness and bias, safety and protection). Each domain is grounded in eudaimonic psychology and human capability theory, capturing not only whether AI systems avoid harm but whether they actively support the conditions under which people thrive.
Within each domain, performance is measured at four levels of resolution:
Main areas. Aggregate scores across the three domains, ranging from -1 (AI consistently harms this dimension) to +1 (AI consistently benefits this dimension), with 0 indicating no net effect on wellbeing.
Subareas. Each domain is decomposed into sub-constructs. Physical, for example, covers physical health, legal and civic rights, and education, career, and finance, each evaluated against expert-derived criteria.
Scenarios. Each subarea contains scenarios that probe specific situations, scored on a breakdown of good behaviors (compliance is desirable) and harmful behaviors (compliance is a failure). Each behavior is graded on a 0 to 1 scale, with the final scenario score weighted by metric importance.
Chat logs. Every scenario score is traceable to the underlying multi-turn conversation between a user-simulator model and the target model, with the judge's verdict on each behavior visible alongside the transcript.
This structure is designed so that aggregate rankings remain interpretable. A model's overall score can always be traced down to specific conversations, specific behaviors, and the expert-derived criteria that defined them.
ImpactBench tests how AI systems shape human flourishing across realistic, multi-turn interactions, grounded in what experts across clinical, educational, legal, and policy domains say matters most.
The benchmark suite covers 18 expert-submitted benchmarks comprising 375 individual metrics that span emotional dependence, cognitive autonomy, health, legal and financial advice, child safety, and more. Each metric is operationalized as a set of six-turn scenarios in which a user-simulator model probes a target model while pursuing a latent adversarial objective, with personas stratified by age and gender to surface demographic sensitivity. Conversations are designed to mirror real-world use: they capture layperson and expert personas, include surface-form perturbations that mimic typo and autocorrect artifacts, and accumulate pressure across turns rather than relying on isolated prompts.
ImpactBench is a binary-verdict evaluation, where each conversation is graded by a model-based judge (GPT-5.4-mini) against expert-derived criteria. Each metric is classified as positive (where yes indicates good behavior) or negative (where yes indicates bad behavior, inverted for scoring), and verdicts are aggregated into a single model score on a 0 to 1 scale. The pipeline is audited at every stage through psychometric tools including test-retest reliability (Fleiss' κ = 0.64 to 0.78), between-judge agreement (Spearman ρ = 0.61), generator-swap audits (Wilcoxon p = 0.003), and user-simulator swaps (Spearman ρ up to 0.977), so that operationalization and inference choices remain empirically contestable.
We use ImpactBench to evaluate how 14 frontier AI systems perform across the full suite, setting a baseline for the field to improve upon.
The Claude 4.x models cluster tightly at the top (0.714–0.719), followed by GPT-5.x near 0.67–0.68, with Gemini, Gemma, Llama, DeepSeek, and GPT-4o between 0.54 and 0.59, and Grok, Mistral, and Qwen between 0.43 and 0.50. The full ranking spans approximately 29 percentage points.
Three findings stand out beyond the aggregate ranking.
Harm avoidance does not imply flourishing. Every model scored higher on negative metrics (harm avoidance) than on positive metrics (actively beneficial behavior), with gaps ranging from +3.9 pp (Claude Opus 4.6) to +21.6 pp (GPT-4o), suggesting that alignment investment has concentrated on suppressing harmful outputs rather than scaffolding flourishing.
Construct matters more than model. Benchmark difficulty was determined more by what was being measured than by which model was tested. Humane Bench (mean 0.373), Cognitive Bias (0.467), and Human Agency (0.469) were uniformly hard across all 14 systems, while VERA-MH (0.777) and User Bias (0.765) were uniformly easy.
Models behave differently toward minors. 12 of 14 models showed more emotional-dependence behaviors toward child and teen personas than toward adults, holding scenario content constant. Largest effects: Qwen3 80B (+0.049), Mistral Small 3.2 (+0.044), DeepSeek V3.2 (+0.042).
Rankings were stable across generator, simulator, and judge swaps. Run-to-run Fleiss' κ ranged from 0.64 to 0.78, 78.1% of conversation triples were unanimous, and a single sample matched the three-sample majority vote at ρ = 0.982.
This ambitious project could not have been done without collaboration across many disciplines and areas of expertise. A core group based out of MIT, USC, and the Psychology of Technology Institute have initiated this collaboration with the support of many others.
The project began at the AHA Flourishing Workshop at MIT in October 2025, supported by the Omidyar Network, which convened 80 experts from over 40 institutions. Prior AHA research on AI companion chatbots was cited as a key inspiration for California Senate Bill 243.
Led by researchers at
Participants of the MIT Workshop for Designing Benchmarks for Human Flourishing with AI supported by Omidyar Network.
Help us improve the benchmark. Your feedback shapes how we evaluate AI's impact on human flourishing.
We read every submission and use it to improve the benchmark.
The full benchmark dataset and evaluation API are available to vetted researchers and institutions. Request access below.
We'll review your application and get back to you within 5 business days.
Building an open, independent benchmark for AI's impact on human flourishing takes a community. Here's how you can help.
Spread the word, cite the benchmark, or champion human-centered AI evaluation in your community.
Co-develop benchmarks, contribute datasets, or partner on peer-reviewed publications.
Philanthropic funding enables us to expand coverage, run evaluations, and keep the benchmark open.
Media coverage, policy connections, technical infrastructure, community building, we welcome all forms of support.
We're excited to connect. Someone from our team will be in touch shortly.