Building Benchmarks from the Ground Up - Community-Centered Evaluation of LLMs in Healthcare Chatbot Settings
Hamna, Hamna, Gayatri Bhat, Sourabrata Mukherjee, et al. 2026. "Building Benchmarks from the Ground Up: Community-Centered Evaluation of LLMs in Healthcare Chatbot Settings." Proceedings of the 2026 CHI Conference on Human Factors in Computing Systems, April 13, 1–19. https://doi.org/10.1145/3772318.3791172.
Notes
In-text annotations
"We choose not to create gold-standard reference answers for the query set for both methodological and practical reasons. On one hand, developing high-quality reference answers in the healthcare domain requires substantial expert time and careful validation, which limits scalability even for a relatively small set of questions. This work cannot be reliably delegated to non-expert data workers. On the other hand, the questions generated through our community-driven process are inherently open ended and context dependent. They reflect lived experiences, locally relevant concerns, and diverse informational needs, and therefore do not lend themselves to a single correct answer in the traditional benchmarking sense. Introducing gold-standard responses would risk flattening this variability and reintroducing the institutional biases our approach seeks to avoid." (Page -)
"Recognizing that CSOs working closely with communities in high-stakes domains such as healthcare are often resource-constrained and balancing multiple urgent priorities, we designed the first phase of the evaluation pipeline to elicit their expertise through short and focused interviews." (Page 2)