How to do human evaluation - A brief introduction to user studies in NLP
Schuff, Hendrik, Lindsey Vanderlyn, Heike Adel, and Ngoc Thang Vu. 2023. “How to Do Human Evaluation: A Brief Introduction to User Studies in NLP.” Natural Language Engineering 29 (5): 1199–1222. https://doi.org/10.1017/S1351324922000535.
Notes
Considerations for human-centered NLP evaluations
- Ethical and legal considerations
  - privacy
  - informed consent
  - respect for participants
- Research questions and hypotheses
  - exploratory research questions
  - confirmatory research questions
- Variables
  - operationalize the measurements
  - types
    - independent
    - dependent
    - confounding
- Metrics
  - Likert scales
  - Visual analog scale
  - Direct comparisons
  - Ranked order comparisons
  - Error classification
  - Completion time
  - Bio signals
  - Qualitative analysis
- Level of measurement
  - nominal
  - ordinal
  - interval
  - ratio
- Experimental designs
  - within-subject
  - between-subject
- Crowdsourcing for NLP
  - fair compensation
  - platform rules
  - task description
  - incentives and response quality
  - pilot study
- Data collection
- Statistical evaluation of NLP
  - estimating the required sample size
  - choosing the correct statistical test
  - post hoc tests
  - multiple comparisons problem
  - worked example
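
Side note on the multiple comparisons item in the outline above: a minimal sketch of two standard corrections, plain Bonferroni and the Holm step-down procedure. This is not taken from the paper. The function names `bonferroni` and `holm_bonferroni` and the p-values are made up for illustration, and the paper's own worked example may use a different procedure.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject hypothesis i if its p-value is at most alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def holm_bonferroni(p_values, alpha=0.05):
    """Holm step-down: compare the k-th smallest p-value to alpha / (m - k + 1)."""
    m = len(p_values)
    # Indices of the hypotheses, ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):  # rank is 0-based, so k = rank + 1
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # the first non-rejection stops all later rejections
    return reject


# Hypothetical p-values from four pairwise system comparisons.
p_values = [0.01, 0.04, 0.02, 0.005]
bonferroni_result = bonferroni(p_values)
holm_result = holm_bonferroni(p_values)
print(bonferroni_result)
print(holm_result)
```

With these illustrative values, Bonferroni rejects only the two smallest p-values, while Holm rejects all four. Both control the family-wise error rate at `alpha`, and Holm is never less powerful than Bonferroni.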
In-text annotations
"On the other hand, there are task-specific NLP resources. For example, van der Lee et al. (2019, 2021), Belz, Mille, and Howcroft (2020) provide guidelines on human evaluation with a focus on natural language generation (NLG), Sedoc et al. (2019) present an evaluation methodology specifically for chatbots, and Iskender, Polzehl, and MΓΆller (2021) provide guidelines for human evaluation for summarization tasks." (Page 1201)
"this paper aims to provide an overview that focuses on commonalities of human evaluation across NLP without restriction to a single task and seeks a good balance between generality and relevance to foster an overall understanding of important aspects in human evaluation, how they are connected, and where to find more information." (Page 1201)