Position - Evaluating Generative AI Systems Is a Social Science Measurement Challenge

Wallach, Hanna, Meera Desai, A. Feder Cooper, et al. 2025. “Position: Evaluating Generative AI Systems Is a Social Science Measurement Challenge.” arXiv:2502.00561. Preprint, arXiv, June 6. https://doi.org/10.48550/arXiv.2502.00561.

Notes

Measurement framework
- four levels
  - background concept
  - systematized concept or specific concept
  - measurement instruments
  - measurements
- four processes
  - Systematization
    - narrowing background concept to systematized concept
  - Operationalization
    - developing measurement instruments from systematized concept
  - Application
    - using the instruments to obtain measurements
  - Interrogation
    - interrogating the validity of all the levels
This framework matters because it helps pinpoint the broader social context and the assumptions under which a system is designed and evaluated.
Instruments and measurements that are valid in one context may not be valid in another, so validity has to be re-interrogated in each new context.
The authors recommend a set of lenses for interrogating validity, adapted from Messick (1987) by Jacobs & Wallach (2021): face validity, content validity, convergent validity, discriminant validity, predictive validity, hypothesis validity, and consequential validity.
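
The four levels, four processes, and seven lenses noted above can be sketched as a small data structure. This is only an illustrative sketch, not something the paper provides: the class name `MeasurementPlan`, its fields, and the helper methods are all assumptions made for these notes.

```python
from dataclasses import dataclass, field

# The seven validity lenses listed in the notes above.
VALIDITY_LENSES = (
    "face", "content", "convergent", "discriminant",
    "predictive", "hypothesis", "consequential",
)


@dataclass
class MeasurementPlan:
    """Hypothetical record of one concept across the four levels."""
    background_concept: str
    systematized_concept: str = ""
    instruments: list = field(default_factory=list)
    measurements: list = field(default_factory=list)
    # Maps a measurement context to the lenses already interrogated there.
    interrogated: dict = field(default_factory=dict)

    def next_process(self) -> str:
        """Name the process that fills the first empty level.

        Interrogation applies to every level; it is returned here only
        once the other three processes have produced something.
        """
        if not self.systematized_concept:
            return "systematization"
        if not self.instruments:
            return "operationalization"
        if not self.measurements:
            return "application"
        return "interrogation"

    def record_interrogation(self, context: str, lens: str) -> None:
        """Mark one lens as interrogated in one context."""
        if lens not in VALIDITY_LENSES:
            raise ValueError(f"unknown lens: {lens}")
        self.interrogated.setdefault(context, set()).add(lens)

    def open_lenses(self, context: str) -> list:
        """Lenses not yet interrogated in the given context."""
        done = self.interrogated.get(context, set())
        return [lens for lens in VALIDITY_LENSES if lens not in done]
```

Keying `interrogated` by context reflects the note that validity does not transfer: a lens checked in one context is still open in any other.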

In-text annotations

"1. Evaluating GenAI Systems" (Page 1)

"The process of evaluation2 necessarily requires information about the system’s capabilities (like its mathematical reasoning skills), behaviors (like regurgitating pieces of its training data), and impacts (like causing its users to feel harmed)." (Page 1)

"measurement reflects the amount of some concept of interest exhibited by that system (related to its capabilities, behaviors, or impacts) in some context of interest." (Page 1)

"measurement instruments are typically developed and used in the ML community, mean it is difficult to know precisely what these instruments are measuring and why, let alone whether they and their resulting measurements are accurate or useful—i.e., valid." (Page 2)

"it is difficult to know precisely what these instruments are measuring and why, let alone whether they and their resulting measurements are accurate or useful—i.e., valid." (Page 2)

"3. A Measurement Framework for GenAI" (Page 3)

"One formulation of measurement theory is the framework of Adcock & Collier (2001), a variant of which is shown in Figure 1. This variant distinguishes between four levels: the background concept or “broad constellation of meanings and understandings associated with [the] concept [of interest];” the systematized concept or “specific formulation of the concept[, which] commonly involves an" (Page 3)

"explicit definition;” the measurement instruments4 used to obtain measurements of the concept; and the measurements themselves (Adcock & Collier, 2001)" (Page 3)

"These levels are linked by four processes: systematization, operationalization, application, and interrogation." (Page 3)

"Without a systematized concept, many of these decisions are accessible only indirectly via the measurement instruments themselves, which may be hard for stakeholders other than ML researchers and practitioners to engage with. We therefore argue that adopting this framework can broaden the expertise involved in evaluating GenAI systems." (Page 4)

"Messick also argued that the consequences of measurement instruments and their resulting measurements are fundamental to their validity. A crucial implication of this perspective is that it is not possible to interrogate validity without considering the measurement context, including the reasons for measuring the concept and how the measurements will be used." (Page 4)

"Measurement instruments and measurements that have been demonstrated to be sufficiently valid6 in one context may not be valid in another, so validity must therefore be re-interrogated whenever measurement instruments are to be used in new contexts." (Page 4)

"We recommend using the following set of lenses to interrogate validity, adapted from Messick (1987) by Jacobs & Wallach (2021): face validity, content validity, convergent validity, discriminant validity, predictive validity, hypothesis validity, and consequential validity." (Page 4)

"4. Using the Measurement Framework" (Page 4)

"although the systematization process specifies how the concept of interest is connected to observable phenomena in the real world, it takes place at a theoretical leveli.e., it stops short of specifying measurement instruments." (Page 5)

"As we explained in Section 3, the separation of systematization and operationalization can enable stakeholders with different perspectives to participate in conceptual debates. One way to do this is to directly involve them in the systematization process, giving them an opportunity to advocate for the inclusion of particular meanings and understandings." (Page 5)

Open question: how can this kind of stakeholder participation be made possible in practice, given existing power structures?

"4.1.1. DEFINITIONS" (Page 5)

"both benchmarks’ measurement instruments involve crowdworkers, who, without a systematized concept, must rely on their own understandings of these high-level definitions, which may be contradictory (e.g., whether factually true generalizations about social groups are stereotypes or not). Had Nadeem et al. and Nangia et al. further systematized their high-level definitions, they may have proactively avoided many of the limitations identified by Blodgett et al. (2021)" (Page 5)

"4.1.2. INTERROGATION: SYSTEMATIZATION" (Page 5)

"content validity, which, in the case of conceptual debates, refers to the extent to which the systematized concept reflects the most salient aspects of the background concept." (Page 6)

"substantive validity of our systematized concept—i.e., whether the systematized concept fully specifies the observable phenomena that are connected to the concept of interest—perhaps by noting that the linguistic patterns do not account for differences in the acceptability of positive," (Page 6)

"consequential validity, which is concerned with the consequences of measurement, including the consequences of the systematization process and the systematized concept." (Page 6)

"4.2.1. MEASUREMENT INSTRUMENTS" (Page 6)

", the first step in the operationalization process is to specify how the observable phenomena will be represented by defining a set of variables—often called indicators8—that reflect the observable phenomena." (Page 6)

"Having defined both the indicators and how their values should be aggregated, the next step is to develop the measurement instruments—i.e., the operational procedures and artifacts used to obtain the measurements." (Page 6)

"4.2.2. INTERROGATION: OPERATIONALIZATION" (Page 7)

"structural validity—i.e., the extent to which the measurement instruments align with the relationships specified as part of the systematized concept—" (Page 7)

"convergent validity—i.e., the extent to which the measurement instruments yield measurements that are similar to measurements of the concept of interest, or other similar concepts, obtained using other, already validated measurement instruments—" (Page 7)

"5. Adopting the Measurement Framework" (Page 8)