"We compared the satisfaction survey we ran three months ago with the latest results, and the score moved significantly — but we cannot explain what actually changed." "An executive asked, 'Is that metric really measuring satisfaction?' and we had no answer." If you operate surveys continuously, you will inevitably hit the question: how do you guarantee measurement quality? The concepts that answer this question are reliability and validity, a domain that psychometrics and survey research have built up over more than 70 years.
In this article, we organize the four categories of reliability (internal consistency, test-retest, parallel-forms, and inter-rater), the calculation and thresholds of Cronbach's alpha, the three categories of validity (content, construct, and criterion), methods for verifying construct validity, the entry points into exploratory factor analysis (EFA) and confirmatory factor analysis (CFA), and the format of practitioner reports, grounded in the classics: Nunnally & Bernstein (1994), Cronbach (1951), Messick (1989), and Campbell & Fiske (1959). We position this piece as the upper-level hub article that provides the "evidence for measurement quality" presupposed by our guides on Likert scale design, matrix question pitfalls, survey pilot testing, and survey aggregation and significance testing.
1. Why "measurement quality" matters
In the field of business surveys, the workflow of "write questions → distribute → look at the aggregated results → make decisions" has become standard practice. Yet this flow tends to skip the prerequisite question: does the data you collected actually measure the concept you wanted to measure?
The three patterns of "the measurement trap"
When you operate without questioning measurement quality, these failures occur:
- The metric moves with the time period: "We measured with the same questions, but the score swings wildly each quarter" — low test-retest reliability
- Metrics contradict each other: "Satisfaction is up, but NPS is down" — construct validity is ambiguous
- No correlation with action: "We ran a training program, but training satisfaction shows zero correlation with business KPIs" — low criterion validity
These are design problems, not problems with respondents or operations. The role of reliability and validity verification is to establish, from both theoretical and statistical perspectives, what the questions are actually measuring.
Reliability and validity are distinct concepts
Reliability and validity are often confused, but they are distinct concepts, and both must hold.
- Reliability: When measured repeatedly under the same conditions, are the results stable?
- Validity: Does the measured value actually represent the construct you wanted to measure?
Nunnally & Bernstein (1994) Psychometric Theory organizes this as: "Reliability is a necessary but not sufficient condition for validity." In other words, if reliability is low, validity cannot be guaranteed, but even if reliability is high, validity is not guaranteed (you may be consistently wrong with the same bias).
2. The four categories of reliability
Reliability is a concept about the "stability" and "consistency" of measured values. There are four representative types.
The four categories of reliability
- Internal consistency: do multiple items administered at one time measure the same concept? Quantified by Cronbach's alpha.
- Test-retest reliability: does the same instrument yield stable results when the same respondents answer again after an interval?
- Parallel-forms reliability: do two equivalent versions of the instrument yield consistent results?
- Inter-rater reliability: do different raters score the same responses consistently?
In business surveys, the two most commonly used are (1) internal consistency (alpha) and (2) test-retest reliability.
3. Cronbach's alpha
Alpha is the representative indicator of internal consistency, proposed by Cronbach (1951) in Coefficient alpha and the internal structure of tests. It takes values from 0 to 1 and shows the degree to which multiple items measure the same concept.
The calculation idea
Alpha is defined as follows, where $k$ is the number of items, $\sigma_i^2$ is the variance of item $i$, and $\sigma_t^2$ is the variance of the total score:

$$\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k} \sigma_i^2}{\sigma_t^2}\right)$$
Intuitively, "the larger the covariance between items, the higher the alpha" and "the more items, the higher the alpha tends to be" — this level of understanding is sufficient for practitioner work. Hand calculation is not practical; compute it with R's psych::alpha(), Python's pingouin.cronbach_alpha(), SPSS Reliability Analysis, the JASP Reliability module, and similar tools.
Interpreting the thresholds
The thresholds presented by Nunnally (1978), still the standard reference today, are:
- alpha ≥ 0.9: Excellent (but may contain redundant items)
- alpha ≥ 0.8: Good
- alpha ≥ 0.7: Acceptable (the floor for exploratory research)
- alpha < 0.7: Needs improvement
- alpha < 0.5: Items likely do not measure the same concept
However, Cortina (1993), in What is coefficient alpha?, emphasizes that a high alpha does not guarantee unidimensionality. Because alpha mechanically rises with the number of items, you should not judge by alpha alone; combining it with factor analysis is the correct operational practice.
Factors that raise / lower alpha
- Increasing the number of items: mechanically raises alpha (but raises redundancy concerns)
- Raising inter-item correlation: carefully select items that target the same concept
- Including reverse-coded items: fine if reverse-scored correctly, but alpha plummets if the re-scoring step is forgotten (see the sketch after this list)
- High respondent homogeneity: variance shrinks and alpha can drop
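To make the reverse-coding pitfall concrete, here is a minimal sketch; the reverse-worded item q3 and the 1-to-5 Likert scale are assumptions for illustration.

```python
# Reverse-score a hypothetical reverse-worded item before computing alpha.
# Forgetting this step makes q3 correlate negatively with its sibling items,
# and alpha plummets.
import pandas as pd
import pingouin as pg

df = pd.read_csv("survey_responses.csv")
df["q3"] = 6 - df["q3"]   # on a 1-5 scale: 1<->5, 2<->4, 3 stays 3
alpha, _ = pg.cronbach_alpha(data=df[["q1", "q2", "q3", "q4", "q5"]])
print(round(alpha, 2))
```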
4. The three categories of validity
Validity is a concept about whether the measured value represents the concept you wanted to measure, and is traditionally divided into three categories. Messick (1989) later proposed a monistic view that integrates these into "Construct Validity," but the three-category framing is easier to handle in practice, so we organize this article around the three.
The three categories of validity
- Content validity: do the items cover the full range of the concept being measured? Verified through theory and expert review rather than statistics.
- Construct validity: does the metric behave as the theory of the concept predicts, correlating with related constructs and not with unrelated ones?
- Criterion validity: is the metric related to an external criterion, measured concurrently or in the future?
Why construct validity is the core
Of the three categories, the one most emphasized in modern psychometrics is construct validity. Cronbach & Meehl (1955), in Construct validity in psychological tests, showed that as long as we deal with unobservable latent variables (satisfaction, engagement, stress, etc.), the central question is whether we can actually measure the theoretically defined concept.
5. Methods for verifying construct validity
The main methods for verifying construct validity are the following four.
(1) Convergent Validity
Confirm that the metric has a high correlation with another indicator that is thought to measure the same construct. Example: confirm that the correlation between NPS and overall satisfaction is r ≥ 0.5.
(2) Discriminant Validity
Confirm that the metric has a low correlation with indicators measuring different constructs. Example: confirm that the correlation between job satisfaction and last night's sleep duration is low. Verified as a pair with convergent validity.
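As a rough sketch of this pair of checks using plain pandas correlations (all column names are hypothetical):

```python
# Convergent check: satisfaction vs. NPS should correlate highly.
# Discriminant check: satisfaction vs. sleep hours should not.
import pandas as pd

df = pd.read_csv("survey_responses.csv")
r_conv = df["satisfaction"].corr(df["nps"])          # expect r >= 0.5
r_disc = df["satisfaction"].corr(df["sleep_hours"])  # expect r near 0
print(f"convergent r = {r_conv:.2f}, discriminant r = {r_disc:.2f}")
```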
(3) MTMM Matrix (Multitrait-Multimethod Matrix)
A classical method proposed by Campbell & Fiske (1959) in Convergent and discriminant validation by the multitrait-multimethod matrix. Measure multiple concepts (traits) with multiple measurement methods, and evaluate convergence and discrimination in a single table. Oriented mainly toward academic surveys.
(4) Factor Analysis
The most practical method. Use Exploratory Factor Analysis (EFA) to discover how many factors a set of items collapses into, and use Confirmatory Factor Analysis (CFA) to verify whether the factor structure matches your hypothesis.
- EFA: Without assuming the number of factors, let the data drive the discovery of factor structure. Used when developing new scales.
- CFA: Hypothesize a factor structure and verify whether the data fits it. Used for validity verification of existing scales.
EFA can be performed in R's psych::fa(), Python's factor_analyzer, SPSS, or JASP. CFA requires structural equation modeling (SEM) tools such as R's lavaan, Python's semopy, or Mplus.
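For the Python route, a minimal EFA sketch with factor_analyzer; the 12 items and the three-factor assumption are illustrative choices, not recommendations.

```python
# Minimal EFA sketch: check sampling adequacy, then extract factors.
import pandas as pd
from factor_analyzer import FactorAnalyzer, calculate_kmo

df = pd.read_csv("survey_responses.csv")
items = df[[f"q{i}" for i in range(1, 13)]]   # 12 hypothetical items

_, kmo_total = calculate_kmo(items)           # overall sampling adequacy
fa = FactorAnalyzer(n_factors=3, rotation="promax")  # 3 factors assumed here
fa.fit(items)

loadings = pd.DataFrame(fa.loadings_, index=items.columns)
print(f"KMO = {kmo_total:.2f}")               # > 0.6 is conventionally desirable
print(loadings.round(2))                      # which items load on which factor
```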
Thresholds for fit indices
Representative fit indices used in CFA and their conventional thresholds:
- CFI (Comparative Fit Index): ≥ 0.95 (good)
- TLI (Tucker-Lewis Index): ≥ 0.95 (good)
- RMSEA (Root Mean Square Error of Approximation): ≤ 0.06 (good), ≤ 0.08 (acceptable)
- SRMR (Standardized Root Mean Square Residual): ≤ 0.08 (good)
These are the thresholds presented by Hu & Bentler (1999), still the standard reference today.
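To show where these indices come from in practice, a minimal CFA sketch with semopy; the two-factor model and item names are assumptions (semopy's calc_stats reports CFI, TLI, and RMSEA among other indices).

```python
# Minimal CFA sketch with semopy, using lavaan-style model syntax.
import pandas as pd
import semopy

df = pd.read_csv("survey_responses.csv")
desc = """
Satisfaction =~ q1 + q2 + q3
Engagement   =~ q4 + q5 + q6
"""

model = semopy.Model(desc)
model.fit(df)
stats = semopy.calc_stats(model)              # one-row DataFrame of fit indices
print(stats[["CFI", "TLI", "RMSEA"]].round(3))
```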
6. Verifying criterion validity
Criterion validity looks at whether the measured value is related to an external criterion that matters to the business, which gives it the greatest practical significance among the three categories of validity.
Concurrent Validity
Look at the correlation with an external criterion measured at the same time. Examples:
- The correlation between employee engagement score and the intention-to-leave rate at the same point
- The correlation between customer satisfaction and the churn rate at the same point
Predictive Validity
Look at whether you can predict a future external criterion. Examples:
- Whether this quarter's NPS correlates with the next quarter's revenue growth rate
- Whether this quarter's employee engagement predicts the attrition rate six months from now
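A hedged sketch of such a lagged check; the file names, the merge key, and the two-quarter lag are assumptions for illustration.

```python
# Predictive validity: correlate Q1 engagement with attrition observed later.
import pandas as pd

eng = pd.read_csv("engagement_q1.csv")    # employee_id, engagement
attr = pd.read_csv("attrition_q3.csv")    # employee_id, left_company (0/1)

merged = eng.merge(attr, on="employee_id")
r = merged["engagement"].corr(merged["left_company"])  # point-biserial r
print(f"predictive r = {r:.2f}")          # expect negative if the scale is valid
```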
When explaining the significance of measurement metrics to executives in a business survey context, whether you have data demonstrating predictive validity is the decisive factor for persuasion.
7. The format of practitioner reports
Once you have verified reliability and validity, how to report the results is the next challenge. The granularity required differs between academic papers and business reports.
Reporting format for academic papers
In academic papers (especially APA style), at minimum record the following in the Methods section:
- The number of items and alpha for each subscale (e.g., "Satisfaction scale, 5 items, alpha = .87")
- As needed, the correlation coefficient and interval for test-retest reliability (e.g., "Two-week test-retest reliability r = .82")
- If CFA was performed, a full set of fit indices (CFI / TLI / RMSEA / SRMR) and estimates (e.g., "CFI = .96, RMSEA = .05")
- Convergent and discriminant validity verification reported via a correlation matrix or Average Variance Extracted (AVE)
Reporting format for business reports
For reports to executives and business divisions, minimize technical jargon and write the conclusions needed for decision-making in three lines.
- "Is this metric stable over time?" (test-retest reliability) -> "Correlation with three months ago r = .85, stable"
- "What does this metric actually measure?" (construct validity) -> "Correlation with NPS r = .62, functions as a proxy for satisfaction"
- "Is this metric related to business?" (criterion validity) -> "Correlation with churn rate r = -.45, effective as a churn-prediction metric"
In business reports, rather than presenting detailed alpha or CFA figures, make the interpretation that connects directly to "what action can we take next" the centerpiece.
8. Implementation in Kicue
Kicue covers question distribution, response collection, and raw-data export; the statistical processing for reliability and validity verification is realistically performed in external tools.
What Kicue covers
- Multi-item scale question distribution: Multi-item measurement of constructs using Likert scales and matrix questions
- Operation of test-retest surveys: Re-distribute to the same respondents after an interval and export data linked by respondent ID (see the sketch after this list)
- Demographic / external-criterion data capture: Simultaneous capture of attribute information and behavioral metrics needed for reliability and validity verification
- Raw-data CSV export: Respondent-level data for ingestion into statistical analysis tools
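As a sketch of how two exported waves can be turned into a test-retest coefficient (file and column names are assumptions):

```python
# Test-retest reliability: join two waves on respondent ID, then correlate.
import pandas as pd

t1 = pd.read_csv("wave1.csv")   # respondent_id, scale_score
t2 = pd.read_csv("wave2.csv")

merged = t1.merge(t2, on="respondent_id", suffixes=("_t1", "_t2"))
r = merged["scale_score_t1"].corr(merged["scale_score_t2"])
print(f"test-retest r = {r:.2f}")   # higher = more stable over time
```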
What external tools cover
- Alpha calculation: R psych::alpha(), Python pingouin, SPSS, JASP
- Exploratory Factor Analysis (EFA): R psych::fa(), Python factor_analyzer, SPSS, JASP
- Confirmatory Factor Analysis (CFA) / SEM: R lavaan, Python semopy, Mplus
- Correlation analysis (convergent / discriminant / criterion-related): R / Python / Excel
- MTMM matrix construction: R / Python scripts
Verification recommended at the pilot stage
Reliability and validity verification is ideally performed at the pilot testing stage, before the main survey. If problems are discovered in the main survey, fixes are difficult, and comparison with historical data becomes impossible. The safer operation is to secure n = 100 to 200 in the pilot, confirm the structure with alpha and exploratory factor analysis, and then proceed to the main survey.
Reliability and validity verification is the most academic area of survey design, and the most likely to be deferred. But a metric that cannot answer "what is this measuring?" or "how does this relate to business?" cannot meet accountability obligations to executives and will not survive long-term operation.
The concepts of alpha, factor analysis, construct validity, and criterion validity organized in this article all originated in academic contexts, but they are also practitioner tools that ensure the operational continuity of business surveys. Rather than aiming for perfection from the start, begin by computing alpha once for your main scale, and measuring test-retest reliability once.
References
Reliability
- Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16(3), 297-334. https://doi.org/10.1007/BF02310555
- Cortina, J. M. (1993). What is coefficient alpha? An examination of theory and applications. Journal of Applied Psychology, 78(1), 98-104. https://doi.org/10.1037/0021-9010.78.1.98
- Nunnally, J. C., & Bernstein, I. H. (1994). Psychometric Theory (3rd ed.). McGraw-Hill. https://www.mheducation.com/highered/product/psychometric-theory-nunnally-bernstein/M9780070478497.html
Validity
- Cronbach, L. J., & Meehl, P. E. (1955). Construct validity in psychological tests. Psychological Bulletin, 52(4), 281-302. https://doi.org/10.1037/h0040957
- Campbell, D. T., & Fiske, D. W. (1959). Convergent and discriminant validation by the multitrait-multimethod matrix. Psychological Bulletin, 56(2), 81-105. https://doi.org/10.1037/h0046016
- Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13-103). American Council on Education and Macmillan.
Fit indices
- Hu, L., & Bentler, P. M. (1999). Cutoff criteria for fit indexes in covariance structure analysis: Conventional criteria versus new alternatives. Structural Equation Modeling, 6(1), 1-55. https://doi.org/10.1080/10705519909540118
If you want to start running surveys with measurement quality you can rely on, try the free survey tool Kicue. From multi-item composition with Likert scales and matrix questions, to respondent ID management for test-retest studies, and raw data CSV export for ingestion into R / Python / SPSS / JASP — you can build the foundation for reliability and validity verification in a single account.
