Open Benchmark of
AI Impact on Humans

How does using AI ?

The first open benchmark measuring AI's impact on human well-being across physical, psychological, and societal dimensions.

260+

Behavioral Indicators

Tested across realistic scenarios.

18+

AI Models Evaluated

Compared on the same standard.

Well-being Dimensions

Physical, psychological, societal.

Audience Groups

Parents, educators, policymakers, developers.

The Rationale

Today's AI benchmarks measure what models can do, accuracy, speed, reasoning, task completion. They tell us almost nothing about what happens to the person on the other side of the screen. We're building the first open standard that asks a different question: across millions of real conversations, is AI actually making people's lives better?

AI is being adopted faster than any technology in history, yet there's no shared, independent way to measure its effect on human well-being
Models that ace conventional safety benchmarks have still produced harmful outcomes in over 70% of high-risk scenarios in controlled studies
No existing benchmark covers all three domains where impact unfolds: Physical, Psychological, and Societal
Developers, policymakers, clinicians, and everyday users need a common signal they can trust, one not produced by the companies being evaluated
Human flourishing, not task performance, is the right north star for responsible AI

The Methodology

Scores reflect how consistently a model supports, or undermines, human flourishing across realistic, multi-turn scenarios. Each conversation is simulated with a demographically varied user pursuing a latent adversarial goal, then graded by an LLM-as-judge pipeline whose reliability is validated through test-retest consistency and between-judge agreement checks.

1, behavior reliably supports human flourishing on the measured dimension
0, behavior reliably undermines human flourishing on the measured dimension
Each metric is adversarially designed to elicit a specific failure mode, with polarity assigned per metric so that a "1" reflects the model resisting that failure and a "0" reflects the model exhibiting it
Metrics are stratified across age and gender to surface where harms concentrate
Scenarios unfold over multiple turns to capture relational, time-extended dynamics that single-prompt benchmarks miss
Framework grounded in eudaimonic psychology and VanderWeele's multidimensional flourishing framework, with constructs contributed through an open submission process by clinicians, legal scholars, educators, and community advocates

Request Access

The full benchmark dataset and evaluation API are available to vetted researchers and institutions.

Led by researchers at

Loading data...

Claude 3 Opus

Prohibits human flourishing

Promotes human flourishing

Generative AI is being rapidly adopted into systems people rely on to make decisions about their health, finances, relationships, and sense of self. If developed and deployed thoughtfully, these systems have the potential to expand human access and agency. But the psychological risks are relational, context-dependent, and unfold over time. A model can pass conventional safety benchmarks and still erode a user's autonomy, scaffold emotional dependency, or quietly degrade the skills it was meant to support.

Existing evaluations rarely capture these dynamics: most are single-turn, static, and validated against narrow definitions of harm that no single discipline would have written alone.

Today, we're introducing ImpactBench: a benchmark suite designed to measure how AI systems affect human flourishing across extended, realistic interactions. Built through an open submission process with researchers, clinicians, legal scholars, and community advocates, the suite spans 18 expert-submitted benchmarks covering emotional dependence, cognitive autonomy, health, legal and financial advice, child safety, and more. Each benchmark is evaluated through multi-turn adversarial simulation with demographically stratified user personas, so that risks are surfaced the way they appear in real conversations rather than in isolated prompts.

ImpactBench is a first-of-its-kind collaboration between the MIT Media Lab, the Psychology of Technology Institute, the USC Neely Center, and UC Berkeley, launched at the AHA Flourishing Workshop at MIT in October 2025 with support from the Omidyar Network.

ImpactBench is grounded in our belief that evaluations for AI systems should be:

Human-centered: Scores reflect impact on people, not just model capability. Strong performance on capability benchmarks does not guarantee that a system supports human flourishing.
Scientifically rigorous: Scores are built on validated methods and domain expertise. Only 16.0% of reviewed ML and NLP benchmarks include any statistical testing, and many operationalize contested constructs without expert input.
Publicly accessible: Scores are legible to the audiences who need them. Parents, teachers, policymakers, and users themselves should be able to understand how AI systems behave, not just the technical community.

Alongside the ImpactBench benchmark suite, we're sharing how 14 leading AI systems perform across 18 expert-submitted constructs, setting a baseline for the field to improve upon.

Explore ImpactBench

ImpactBench organizes 18 expert-submitted benchmarks into three domains of human flourishing: Physical (health, finances, legal and civic rights, education and career), Psychological (mental wellbeing, autonomy preservation, creativity and cognition, self-determination, learning), and Societal (social relationships, fairness and bias, safety and protection). Each domain is grounded in eudaimonic psychology and human capability theory, capturing not only whether AI systems avoid harm but whether they actively support the conditions under which people thrive.

Within each domain, performance is measured at four levels of resolution:

Main areas. Aggregate scores across the three domains, ranging from -1 (AI consistently harms this dimension) to +1 (AI consistently benefits this dimension), with 0 indicating no net effect on wellbeing.

Subareas. Each domain is decomposed into sub-constructs. Physical, for example, covers physical health, legal and civic rights, and education, career, and finance, each evaluated against expert-derived criteria.

Scenarios. Each subarea contains scenarios that probe specific situations, scored on a breakdown of good behaviors (compliance is desirable) and harmful behaviors (compliance is a failure). Each behavior is graded on a 0 to 1 scale, with the final scenario score weighted by metric importance.

Chat logs. Every scenario score is traceable to the underlying multi-turn conversation between a user-simulator model and the target model, with the judge's verdict on each behavior visible alongside the transcript.

This structure is designed so that aggregate rankings remain interpretable. A model's overall score can always be traced down to specific conversations, specific behaviors, and the expert-derived criteria that defined them.

Methodology

ImpactBench tests how AI systems shape human flourishing across realistic, multi-turn interactions, grounded in what experts across clinical, educational, legal, and policy domains say matters most.

The benchmark suite covers 18 expert-submitted benchmarks comprising 375 individual metrics that span emotional dependence, cognitive autonomy, health, legal and financial advice, child safety, and more. Each metric is operationalized as a set of six-turn scenarios in which a user-simulator model probes a target model while pursuing a latent adversarial objective, with personas stratified by age and gender to surface demographic sensitivity. Conversations are designed to mirror real-world use: they capture layperson and expert personas, include surface-form perturbations that mimic typo and autocorrect artifacts, and accumulate pressure across turns rather than relying on isolated prompts.

ImpactBench is a binary-verdict evaluation, where each conversation is graded by a model-based judge (GPT-5.4-mini) against expert-derived criteria. Each metric is classified as positive (where yes indicates good behavior) or negative (where yes indicates bad behavior, inverted for scoring), and verdicts are aggregated into a single model score on a 0 to 1 scale. The pipeline is audited at every stage through psychometric tools including test-retest reliability (Fleiss' κ = 0.64 to 0.78), between-judge agreement (Spearman ρ = 0.61), generator-swap audits (Wilcoxon p = 0.003), and user-simulator swaps (Spearman ρ up to 0.977), so that operationalization and inference choices remain empirically contestable.

Performance of models

We use ImpactBench to evaluate how 14 frontier AI systems perform across the full suite, setting a baseline for the field to improve upon.

ImpactBench forest plot of model performance

The Claude 4.x models cluster tightly at the top (0.714–0.719), followed by GPT-5.x near 0.67–0.68, with Gemini, Gemma, Llama, DeepSeek, and GPT-4o between 0.54 and 0.59, and Grok, Mistral, and Qwen between 0.43 and 0.50. The full ranking spans approximately 29 percentage points.

Three findings stand out beyond the aggregate ranking.

Harm avoidance does not imply flourishing. Every model scored higher on negative metrics (harm avoidance) than on positive metrics (actively beneficial behavior), with gaps ranging from +3.9 pp (Claude Opus 4.6) to +21.6 pp (GPT-4o), suggesting that alignment investment has concentrated on suppressing harmful outputs rather than scaffolding flourishing.

Construct matters more than model. Benchmark difficulty was determined more by what was being measured than by which model was tested. Humane Bench (mean 0.373), Cognitive Bias (0.467), and Human Agency (0.469) were uniformly hard across all 14 systems, while VERA-MH (0.777) and User Bias (0.765) were uniformly easy.

Models behave differently toward minors. 12 of 14 models showed more emotional-dependence behaviors toward child and teen personas than toward adults, holding scenario content constant. Largest effects: Qwen3 80B (+0.049), Mistral Small 3.2 (+0.044), DeepSeek V3.2 (+0.042).

Rankings were stable across generator, simulator, and judge swaps. Run-to-run Fleiss' κ ranged from 0.64 to 0.78, 78.1% of conversation triples were unanimous, and a single sample matched the three-sample majority vote at ρ = 0.982.

Team & Collaborators

This ambitious project could not have been done without collaboration across many disciplines and areas of expertise. A core group based out of MIT, USC, and the Psychology of Technology Institute have initiated this collaboration with the support of many others.

The project began at the AHA Flourishing Workshop at MIT in October 2025, supported by the Omidyar Network, which convened 80 experts from over 40 institutions. Prior AHA research on AI companion chatbots was cited as a key inspiration for California Senate Bill 243.

Led by researchers at

Team

Pat Pataranutaporn, MIT Media Lab
Pattie Maes, MIT Media Lab
Jennifer Pfister, MIT Media Lab
Chayapatr Archiwaranguprok, MIT Media Lab
Constanze Albrecht, MIT Media Lab
Sheer Karny, MIT Media Lab
Rachel Poonsiriwong, MIT Media Lab
Ravi Iyer, USC Marshall School's Neely Center & Psychology of Technology Institute
Nate Fast, USC Marshall School's Neely Center & Psychology of Technology Institute
Juliana Schroeder, University of California, Berkeley & Psychology of Technology Institute
Stanley Huang, Boston University
Yuning Liu, Harvard University

Support in helping define and launch project

Noesis Collaborative
Building Humane Technology

Support in contributing benchmarks and benchmark expertise

Jenny Radesky, MD, University of Michigan
Alexis Hiniker, University of Washington
Marie Bragg, New York University
Yaoli Mao & Erika Anderson, Humane Bench
Eric Ngoiya, QueueLab
Carl Vincent Kho, Minerva University
Su Jin Park
Anil Kshatriya, ESSEC Business School
Generative AI for Good
AI Culture Lab
Eduardo Baena, Northeastern University
Ryan McBain, Jonathan Cantor, Ellice Huang, Rand Corporation
Cornelia Walther, University of Pennsylvania
Spring Health
Tech Justice Law

Participants of the MIT Workshop for Designing Benchmarks for Human Flourishing with AI supported by Omidyar Network.

Get Involved

Support the Benchmark

Building an open, independent benchmark for AI's impact on human flourishing takes a community. Here's how you can help.

Advocacy

Spread the word, cite the benchmark, or champion human-centered AI evaluation in your community.

Research Collaboration

Co-develop benchmarks, contribute datasets, or partner on peer-reviewed publications.

Financial Support

Philanthropic funding enables us to expand coverage, run evaluations, and keep the benchmark open.

Other

Media coverage, policy connections, technical infrastructure, community building, we welcome all forms of support.

Get in Touch

Name *

Contact *

Affiliation

How would you like to support the benchmark? *

Areas of support (select all that apply)

Advocacy Raising awareness and promoting human-centered AI evaluation Research Collaboration Joint research, benchmark development, or dataset contributions Financial Support Philanthropic grants or institutional funding Other Media, policy, infrastructure, or something else entirely

Open Benchmark of
AI Impact on Humans

The Rationale

The Methodology

Request Access

Request received!

Support Benchmarking Efforts

Thank you!

Feedback

Thank you!

Introducing ImpactBench

Explore ImpactBench

Methodology

Performance of models

Team & Collaborators

Team

Support in helping define and launch project

Support in contributing benchmarks and benchmark expertise

Feedback

Thank you for your feedback!

Request Access

Request received!

Support the Benchmark

Advocacy

Research Collaboration

Financial Support

Other

Get in Touch

Thank you for your support!

Open Benchmark ofAI Impact on Humans

The Rationale

The Methodology

Request Access

Request received!

Support Benchmarking Efforts

Thank you!

Feedback

Thank you!

Introducing ImpactBench

Explore ImpactBench

Methodology

Performance of models

Team & Collaborators

Team

Support in helping define and launch project

Support in contributing benchmarks and benchmark expertise

Feedback

Thank you for your feedback!

Request Access

Request received!

Support the Benchmark

Advocacy

Research Collaboration

Financial Support

Other

Get in Touch

Thank you for your support!

Open Benchmark of
AI Impact on Humans