Loading data...
Loading data...
Today's AI systems are evaluated on technical performance alone. No widely adopted framework measures whether AI actually benefits the people who use it. We're building the first open benchmark that does.
As AI is increasingly present in our daily lives, in how we choose to work and relate to one another, there has been no better time to evaluate AI's impact on our lives as a whole. Even top-performing AI models produce harmful outcomes in over 70% of high-risk scenarios. Real harms build up across long, multi-turn conversations, not single prompts. And while researchers have started consolidating AI benchmarks, none of them measure human flourishing.
We are researchers from MIT Media Lab, USC, and UC Berkeley collaborating to evaluate the impact of AI across the physical, societal, and psychological domains of our lives. Each metric is tested through multi-turn simulations with realistic, demographically varied users, and validated by domain experts on our team.
This work began at an AHA Flourishing Workshop held at MIT in October 2025 with support from the Omidyar Network, where we convened 80 interdisciplinary experts from over 40 institutions to address this gap. Through intensive collaboration, participants developed novel evaluation frameworks for assessing AI's impact on human flourishing across three core dimensions.
The workshop produced a report on benchmarks for human flourishing with AI and a community-driven initiative to design human-centered AI benchmarks, including PsychoRisk Bench (our MIT effort), Humane Bench, and more.
A comprehensive, publicly available set of evaluation scenarios spanning all six flourishing domains. Each benchmark follows a rigorous structure: a realistic multi-turn interaction, clear definitions of positive and negative model behaviors, and grounding in peer-reviewed literature from learning science, psychology, and human-computer interaction.
The team is developing original benchmarks and aggregating existing ones, including SycophancyBench, CoraBench (focused on children), and OpenAI's HealthBench, to build a unified, independent evaluation controlled by no single company.
Technical scores alone do not help parents, teachers, or policymakers make informed decisions. This deliverable translates benchmark results into accessible, human-readable profiles analogous to nutrition labels on food packaging or Common Sense Media ratings.
A parent asking "Is this AI safe for my child?" or a teacher asking "Does this tool support learning?" receives a clear, contextualized answer. The label adapts to its audience, offering different views for end users, AI developers, and policymakers.
Any benchmark relating to human psychological well-being or flourishing is eligible for inclusion, whether it addresses positive outcomes such as autonomy and competence, or negative outcomes such as codependence and psychological distress.
Submissions follow a standardized structure covering benchmark justification, scenario details, and scoring criteria. The project is designed as a living resource, welcoming contributions from interdisciplinary experts worldwide.
The initiative is seeking philanthropic support to execute a public launch. AHA's research has already shaped policy, the team's findings on AI companion chatbots were cited as key inspiration in California Senate Bill 243, protecting children from harmful AI relationships.
Building benchmarks that measure AI's impact on human flourishing requires a rigorous, multi-step evaluation process grounded in both psychological science and AI engineering.
Drawing on psychology, education research, clinical science, and human-computer interaction, the AHA team has developed a framework that evaluates AI across six core domains of human experience.
Does the AI contribute to users' subjective well-being, positive affect, and overall sense of life satisfaction?
Does the AI support mental health, encourage healthy behaviors, and avoid fostering anxiety, dependency, or harmful patterns?
Does the AI help users find meaning, pursue goals, and develop a sense of direction and personal significance?
Does the AI support the development of moral character, honesty, integrity, and ethical reasoning?
Does the AI strengthen human connections, support healthy relationships, and avoid replacing genuine social bonds?
Does the AI help users achieve economic security and practical stability rather than fostering dependence or vulnerability?
Framework informed by the Harvard Human Flourishing Program.
Not all benchmarks are equal. We evaluate each benchmark candidate against a rigorous quality framework before including it in the suite.
Multi-turn simulations that reflect real-world user interactions, judged representative by human experts.
Simulated inputs are consistently representative of the construct of interest, verified by human experts and LLM judges.
Inputs capture a broad, complete set of scenarios, measured through semantic diversity analysis.
Benchmarks can be run automatically with LLM-as-judge systems that approximate human judgment.
Benchmark differentiates between better and worse models, later models should score higher than earlier ones.
LLM-as-judge appropriately and consistently detects harm/benefit, validated against human rater judgments.
Benchmark submitters provide the construct definition, example interaction patterns, and expected model behaviors. The community identifies what patterns within user input messages are expected to elicit behaviors of interest.
LLMs generate additional related simulations from the seed examples, expanding coverage. For example: given 3 example messages expressing distress, the model generates 20 semantically related variations to ensure broad scenario coverage.
We ask the LLM whether generated examples match the intended input pattern, comparing against submitter intent and human expert judgments. This reveals whether examples capture the construct or drift beyond it.
The LLM evaluates whether model responses are harmful or helpful (depending on whether the construct is negatively or positively framed). Human raters independently judge the same outputs to validate LLM judge accuracy.
We compute test-retest reliability and sensitivity statistics across models. Scored metrics include: semantic diversity of conversations, variation in LLM-judge scores, and differentiation across model generations and providers.
We use a three-dimensional framework to ensure our benchmark suite provides balanced, comprehensive coverage. Benchmarks are mapped across: level of need (Physical → Psychological → Moral/Self-Actualization), valence (positive outcomes vs. prevention of harm), and scope (individual focus vs. societal focus).
| Level | Valence | Individual Focus | Society Focus |
|---|---|---|---|
| Moral / Self-Actualization | Positive | Supporting Individual Creativity | Design for Equity and Inclusion (HumaneBench) |
| Negative | Cognitive Offloading | Copyright & Intellectual Property | |
| Psychological | Positive | Foster Healthy Relationships · Enable Meaningful Choices | Prioritize Long-term Wellbeing · Social Navigation (Sotopia) |
| Negative | Attachment/Dependency (AAP) · Protect Dignity (HumaneBench) | Human-Like Simulation · Suicidality Assessment | |
| Physical | Positive | Enhance Human Capabilities (HumaneBench) | Improve Peace & Conflict Resolution |
| Negative | Respect User Attention (HumaneBench) | Energy Efficiency (QueueLab) |
This ambitious project could not have been done without collaboration across many disciplines and areas of expertise. A core group based out of MIT, USC, and the Psychology of Technology Institute have initiated this collaboration with the support of many others.
Led by researchers at
Participants of the MIT Workshop for Designing Benchmarks for Human Flourishing with AI supported by Omidyar Network.
Rather than defining yet another taxonomy, this benchmark allows anyone to define a construct that relates to human flourishing, and places those constructs within a unified, evidence-based framework.
Many dimensions of flourishing map naturally onto Maslow's hierarchy of human needs. A flourishing society supports users' physical needs (health, safety from harm), which enables people to pursue social and relational needs (belonging, meaningful relationships), which in turn creates space for psychological needs (autonomy, competence, relatedness), ultimately leading to self-actualization (purpose, meaning, moral values).
LLM benchmarks exist at every level of this hierarchy, and this framework lets us ensure comprehensive coverage across all levels.
Within each level, every construct can be considered along both positive and negative dimensions. The positive dimension refers to the extent to which AI systems actively enhance valued outcomes (e.g. fostering a sense of belonging). The negative dimension refers to the extent to which they prevent or mitigate harmful outcomes (e.g. avoiding shame and social exclusion).
This dual framing is consistent with established frameworks like Schwartz's taxonomy of values, which recognizes dimensions of conservation vs. openness to change and self-enhancement vs. self-transcendence.
AI actively enhances valued aspects of human flourishing, supporting creativity, strengthening relationships, fostering autonomy, building competence, enriching meaning.
AI prevents or mitigates harmful outcomes, avoiding dependency, protecting dignity, respecting attention, reducing anxiety, preventing exploitation or manipulation.
The sunburst visualization on the Explore tab maps AI model behavior scores onto this human flourishing framework. Each ring represents a level of the hierarchy:
Arc color indicates impact direction and magnitude:
All scores in this visualization are currently synthetic/simulated for demonstration purposes. The benchmark framework is open and intended to be populated with real evaluation data through rigorous multi-turn simulation studies. We invite researchers to contribute benchmarks through our open submission process.
Help us improve the benchmark. Your feedback shapes how we evaluate AI's impact on human flourishing.
We read every submission and use it to improve the benchmark.
The full benchmark dataset and evaluation API are available to vetted researchers and institutions. Request access below.
We'll review your application and get back to you within 5 business days.
Building an open, independent benchmark for AI's impact on human flourishing takes a community. Here's how you can help.
Spread the word, cite the benchmark, or champion human-centered AI evaluation in your community.
Co-develop benchmarks, contribute datasets, or partner on peer-reviewed publications.
Philanthropic funding enables us to expand coverage, run evaluations, and keep the benchmark open.
Media coverage, policy connections, technical infrastructure, community building, we welcome all forms of support.
We're excited to connect. Someone from our team will be in touch shortly.