Evaluating Evaluation Metrics: A Framework for Analyzing NLG Evaluation Metrics Using Measurement Theory
Xiao, Ziang, Susu Zhang, Vivian Lai, and Q. Vera Liao. 2023. “Evaluating Evaluation Metrics: A Framework for Analyzing NLG Evaluation Metrics Using Measurement Theory.” In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, edited by Houda Bouamor, Juan Pino, and Kalika Bali, 10967–82. Singapore: Association for Computational Linguistics. https://doi.org/10.18653/v1/2023.emnlp-main.676.
Notes
- evaluating the evaluation metrics for NLG systems
- uses measurement theory
- test the individual capabilities
- reliability
    - how much a metric is subject to random measurement error and consistent across repeated measures
    - Example: “For human evaluations, the variability across raters, resulting from their subjectivity, inconsistency, errors, and so on.” (Xiao et al., 2023, p. 10969)
- validity
    - “benchmarking is valid only when the metric scores can inform their intended interpretations (e.g., model capability) and uses (e.g., predicting models’ real-world behavior)” (Xiao et al., 2023, p. 10971)
- operationalizes four key prerequisites (a toy sketch of two such checks follows this list)
    - Reliability
        - Metric Stability
        - Metric Consistency
    - Validity
        - Metric Construct Validity
        - Metric Concurrent Validity
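As a rough illustration of how two of these prerequisites might be checked in practice, here is a minimal Python sketch. It assumes metric stability is probed by scoring the same outputs across repeated runs and concurrent validity by correlating metric scores with human ratings on the same outputs; the function names (`metric_stability`, `concurrent_validity`), the choice of Pearson/Spearman correlation, and the toy data are illustrative assumptions, not the paper's exact operationalization.

```python
# Minimal sketch (not the paper's exact procedure): two illustrative checks
# on a candidate NLG metric, using assumed inputs described above.
import numpy as np
from scipy.stats import pearsonr, spearmanr


def metric_stability(repeated_scores: np.ndarray) -> float:
    """Assumed check: score the same outputs over repeated runs
    (rows = runs, columns = outputs) and report the mean pairwise
    correlation between runs; higher means less random error."""
    runs = repeated_scores.shape[0]
    corrs = [pearsonr(repeated_scores[i], repeated_scores[j])[0]
             for i in range(runs) for j in range(i + 1, runs)]
    return float(np.mean(corrs))


def concurrent_validity(metric_scores: np.ndarray, human_scores: np.ndarray) -> float:
    """Assumed check: rank correlation between the metric's scores and
    human judgments (an established criterion) on the same outputs."""
    return float(spearmanr(metric_scores, human_scores)[0])


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy data: 3 repeated scoring runs over 50 outputs, plus human ratings.
    base = rng.normal(size=50)
    repeated = np.stack([base + 0.1 * rng.normal(size=50) for _ in range(3)])
    human = base + 0.5 * rng.normal(size=50)
    print("stability:", metric_stability(repeated))
    print("concurrent validity:", concurrent_validity(repeated[0], human))
```

Correlation is only one possible choice here; the paper frames these prerequisites in measurement-theoretic terms, and other reliability coefficients (e.g., intraclass correlation for rater variability) could stand in for the pairwise correlations above.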
In-text annotations
"assess evaluation metrics by drawing from measurement theory in educational and psychological testing" (Page 10968)