PEP 6305 Measurement in Health & Physical Education


Topic 12: Validity

Section 12.1


n   This Topic has 2 Sections.



n   “Validity” reading posted in Blackboard.



n   To discuss the concepts of measurement validity.

n   To introduce the principles of measurement validation.



n   Measurement validity is different from experimental validity, which was discussed in the “Research Design” section of Topic 1.

¨  Measurement validity investigates the qualities and merits of a measure.

¨  Experimental validity investigates the qualities and merits of an experiment or study.




n   A measure that is valid means that:

1) the measure is an indicator of the characteristic being measured, and

2) the interpretation of the measure is appropriate: your uses of the information (e.g., making a decision) are proper (accurate and unbiased) and lead to correct actions.


n   Validity is specific to a population and a context.

¨  The population comprises the people about whom we want to gather accurate information for some purpose.

¨  The context comprises the purpose for making a measurement, the conditions under which the measurement is conducted, and the application of the resulting measure (i.e., the use of the measure to fulfill the specific purpose).

¨  A measure that is valid in one population and context may not be valid in other populations or contexts.

¨  Thus, we do not validate tests; we validate the interpretations and uses of a test.

¨  For example, if a man’s body fat is measured to be 40%, there are several potential interpretations and uses:

n   Establishing test validity involves obtaining evidence of what the test measures and how it can be interpreted.

¨  This evidence is typically gathered in a series of studies.

¨  Some studies evaluate whether the measure is a valid indicator of the correct variable.

¨  Other studies evaluate whether the interpretation of the measure is valid.

¨  Every new use or interpretation of a measure requires gathering more evidence in a new series of validation studies.

n   The reading discusses three types of validity evidence:

¨  Content validity evidence.

¨  Criterion validity evidence.

¨  Construct validity evidence.


Content Validity Evidence: Logical Support


n   Content validity evidence (also called “logical validity evidence” or “face validity evidence”) involves using logic to determine if the test is valid.  Evidence of content validity can be illustrated by examples.

¨  Written Classroom Test - A content valid written test is one that measures what was taught in the classroom.

¨  40-Yard Sprint - A test used by many football coaches to measure speed is 40-yard sprint time.

¨  Firefighter and Police Tests - Individuals who want to become a firefighter or a police officer must pass demanding physical tests.

n   Making a logical argument to support including certain components in a test can be simple or complex.

¨  In general, the more complex the characteristic or behavior being measured, the more complex the test and the logical argument.

¨  An example of a simple characteristic to measure is arm flexion strength; subjects who can lift more weight by flexing their arm have more arm strength.

¨  An example of a complex characteristic to measure is the ability to conduct a scientific study, which faculty are often interested in evaluating in their graduate students. One way to evaluate this is to have a student identify a scientific problem, design a study to answer the problem, collect and analyze data, interpret the results, and then write a thesis or dissertation explaining how the study has addressed the problem. Obviously, this process is more complicated than having subjects lift a weight.


Criterion Validity Evidence: Reference to a Standard


n   Criterion Validity Evidence

¨  Criterion-related validity evidence involves examining how a test is related to a well-established criterion.

¨  Typically, product-moment correlation or regression is used to establish criterion-related validity.

¨  The degree to which the test and criterion are correlated is the degree to which the test is a valid indicator of the trait measured by the criterion.

¨  For example, the correlation between height and weight is about 0.60. Thus, to some extent height is a valid indicator of weight - taller individuals tend to be heavier than shorter individuals. If you know someone's height you have a rough idea of his or her weight relative to other people.

n   A high correlation between a test and a criterion is evidence of criterion validity.

n   In addition, high correlation indicates that the test can predict the criterion with good precision (the standard error of the estimate is low).
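The link between the correlation and the standard error of the estimate can be sketched in a few lines of Python. The scores below are invented for illustration only, and the function names are hypothetical helpers, not part of the reading. The sketch computes the product-moment correlation between a test and a criterion, then the standard error of the estimate from a simple linear regression of the criterion on the test.

```python
import math

def pearson_r(x, y):
    """Product-moment correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def standard_error_of_estimate(x, y):
    """Residual standard deviation when y is predicted from x by simple linear regression."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Sum of squared prediction errors around the regression line
    sse = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    return math.sqrt(sse / (n - 2))

# Hypothetical scores (assumed for illustration): a new test and an established criterion
test_scores = [10, 12, 14, 16, 18, 20, 22, 24]
criterion_scores = [31, 35, 36, 41, 44, 45, 50, 52]

r = pearson_r(test_scores, criterion_scores)
see = standard_error_of_estimate(test_scores, criterion_scores)
print(f"validity coefficient r = {r:.3f}")
print(f"standard error of the estimate = {see:.3f}")
```

With these made-up scores the correlation is high and the prediction errors are small relative to the spread of the criterion; scrambling the criterion scores would lower the correlation and raise the standard error of the estimate.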

n   Some examples of criterion variables.

¨  Expert judgment. The critical points in using expert judgments are (1) to use the judgment of several (>2) experts rather than a single expert and (2) to ensure that the experts’ judgments have high objectivity and high reliability (stability).

¨  Rank order of the subjects on some performance test or indicator. The key is to ensure that the rank order is itself a valid indicator of the characteristic.

¨  An established, accurate, and valid test of the same characteristic. This method is probably the most common in exercise science because we have accurate measures of many of the characteristics and properties in which we are interested.

n   Several important exercise science examples of tests that have been validated using criterion validity evidence are:

¨  Aerobic fitness.

¨  Body composition.


Relation of Reliability and Validity

n   A test cannot be valid if it is not reliable.

¨  If a test is not reliable, then it is not measuring anything. All of the variation in test scores is due to measurement error.

¨  If a test is not measuring anything, if it represents only measurement error, then it cannot be a valid indicator of anything.

¨  For instance, suppose I assigned grades by rolling a die, so that 1 = A, 2 = B, 3 = C, and so on. Would this be reliable--would it vary with your "true" ability? No, because the grades are assigned at random. Consequently, that grade cannot be a valid indicator of your performance in this course.  
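The die-rolling example can be simulated. The sketch below is an illustration under stated assumptions, not part of the reading: "true" ability is drawn from a normal distribution, each student is graded by two independent die rolls (standing in for two administrations of the "test"), and the seed and sample size are arbitrary choices.

```python
import math
import random

def pearson_r(x, y):
    """Product-moment correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(12345)   # fixed seed so the run is repeatable (arbitrary value)
n_students = 10_000          # assumed class size, large so the estimates are stable

# Hypothetical "true" ability for each student
true_ability = [rng.gauss(0, 1) for _ in range(n_students)]

# Grades assigned purely by die roll (1 = A, 2 = B, ...), on two separate occasions
grade_roll_1 = [rng.randint(1, 6) for _ in range(n_students)]
grade_roll_2 = [rng.randint(1, 6) for _ in range(n_students)]

# Reliability: do the two random gradings agree with each other?
reliability_estimate = pearson_r(grade_roll_1, grade_roll_2)
# Validity: does a random grading track true ability?
validity_estimate = pearson_r(grade_roll_1, true_ability)

print(f"reliability estimate = {reliability_estimate:.3f}")
print(f"validity estimate    = {validity_estimate:.3f}")
```

Both correlations should come out close to 0: a score that is pure measurement error agrees neither with itself on a second occasion nor with the characteristic it is supposed to indicate.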

n   The maximum possible validity coefficient is defined by the following equation:

       rxy ≤ √(rxx × ryy)

       where rxy is the validity (correlation) coefficient, rxx is the reliability of the x variable (test), and ryy is the reliability of the y variable (criterion).

n   The criterion validity coefficient cannot be higher than the square root of the product of the reliabilities. It can, however, be lower than that limit, because even if the x and y variables each had perfect reliability, the two variables may be completely unrelated to one another. To illustrate, we can measure your height with high reliability and measure your percent body fat with high reliability, both with reliability coefficients >0.90. But height is not correlated with percent body fat, so the validity coefficient (rxy) would be close to 0.

n   Thus, the reliabilities of both tests set an upper limit on the magnitude of the validity coefficient: the lower either reliability is, the lower the highest validity coefficient that can be obtained.

n   Here are some numerical examples:


       rxx = 0.90 and ryy = 0.90, so rxy ≤ 0.90


       rxx = 0.90 and ryy = 0.60, so rxy ≤ 0.73


       rxx = 0.70 and ryy = 0.60, so rxy ≤ 0.65


       rxx = 0.90 and ryy = 0.10, so rxy ≤ 0.30
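The limits in these examples come from taking the square root of the product of the two reliabilities. A minimal Python sketch of that arithmetic follows; the function name max_validity is a hypothetical helper introduced here for illustration.

```python
import math

def max_validity(r_xx, r_yy):
    """Upper limit on the validity coefficient rxy, given the reliability of the
    test (r_xx) and the reliability of the criterion (r_yy)."""
    if not (0.0 <= r_xx <= 1.0 and 0.0 <= r_yy <= 1.0):
        raise ValueError("reliabilities must be between 0 and 1")
    return math.sqrt(r_xx * r_yy)

# The four pairs of reliabilities used in the numerical examples
examples = [(0.90, 0.90), (0.90, 0.60), (0.70, 0.60), (0.90, 0.10)]
for r_xx, r_yy in examples:
    limit = max_validity(r_xx, r_yy)
    print(f"rxx = {r_xx:.2f} and ryy = {r_yy:.2f}, so rxy <= {limit:.2f}")
```

Note how the last pair behaves: a highly reliable test (0.90) paired with a very unreliable criterion (0.10) still has a low ceiling, because the limit depends on both reliabilities.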


n   Suppose your new test has an estimated reliability of 0. What is the highest that its correlation with a criterion variable could be?

n   The reliability of a test may not be known. If the validity coefficient is high, however, then both of the tests must be reliable.

n   For instance, suppose the correlation (validity coefficient) between skinfolds and percent fat measured by underwater weighing is about 0.90.

¨  Underwater weighing is known to be highly reliable (rxx > 0.95), but the reliability of skinfolds is not as well established.

¨  The validity coefficient of 0.90, however, provides strong evidence that skinfolds can be measured reliably.
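The reasoning in the skinfold example can be made explicit by rearranging the limit on the validity coefficient: if rxy cannot exceed the square root of rxx times ryy, then rxx must be at least rxy squared divided by ryy. The sketch below treats skinfolds as the test (x) and underwater weighing as the criterion (y); that labeling, and the function name, are assumptions made for illustration.

```python
def min_test_reliability(r_xy, r_yy):
    """Smallest test reliability (r_xx) consistent with an observed validity
    coefficient r_xy and a criterion reliability r_yy, from r_xy <= sqrt(r_xx * r_yy)."""
    if not (0.0 < r_yy <= 1.0):
        raise ValueError("criterion reliability must be greater than 0 and at most 1")
    return r_xy ** 2 / r_yy

observed_validity = 0.90  # skinfolds vs. underwater weighing, from the example

# Criterion reliability is stated to be above 0.95, so check both ends of that range
bound_if_criterion_095 = min_test_reliability(observed_validity, 0.95)
bound_if_criterion_perfect = min_test_reliability(observed_validity, 1.00)

print(f"skinfold reliability is at least {bound_if_criterion_perfect:.2f} "
      f"to {bound_if_criterion_095:.2f}")
```

Under these assumptions the skinfold reliability must be at least about 0.81 to 0.85, which is why a high validity coefficient is itself evidence that the test is reliable.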


Construct Validity Evidence


n   Construct validation is used primarily, although not exclusively, with measures of abstract characteristics rather than concrete characteristics.

¨  An abstract characteristic is one that cannot be directly observed. A concrete characteristic can be directly observed.

¨  Self-efficacy is a psychosocial characteristic. Compared to body weight, which is directly observable, self-efficacy cannot be seen or directly measured.

¨  However, if we develop new measures or indicators of body weight, we could still use the process of construct validation to confirm that those measures are valid.

¨  The variable being measured is also known as a construct, because it is constructed of the measures that indicate the relative or absolute quantity of the relevant characteristics.

¨  Thus, construct validity investigates whether a measure is a valid indicator of the variable (construct) of interest.

n   Construct validity evidence involves the scientific method:

¨  Identify and define the construct (variable) of interest and the population of interest.

¨  Develop or identify a theory that explains the construct, including how it is associated with observable traits or behaviors, associated with other constructs, and associated with current measures of similar and different characteristics.

¨  Sample the population and collect and analyze data, including the measure you are testing, to produce statistical evidence of these theoretical associations.

¨  Determine the extent to which this statistical evidence supports the theory. When the evidence supports the theory, you have a measure that has construct validity.

n   While this can be a complex procedure, the central issue is the extent to which test scores agree with what the theory predicts.

n   If you administer a test to a group of individuals and the results do not match what you would expect, this is evidence that the test lacks validity. The definition, the theory, or the test may be incorrect, but regardless of the reason, the test is not valid for the intended use (although it might be valid for other uses).

¨  Note that in this instance you are comparing groups that you know are different; you are not trying to determine whether they differ.

n   If the results support your expectations, this would support the conclusion that the test is valid.

n   This can be illustrated with an example.

¨  Swimming Test. Assume that you develop a test of swimming skills. Your construct is swimming ability. Your population is college students. You administer it to three groups of college students: college swimmers, advanced swimmers, and intermediate swimmers.

¨  Theory and logic would dictate that the order of the mean performance of the groups would be: highest, college swimmers; next, advanced swimmers; and lowest, intermediate swimmers.

¨  Construct validity evidence supporting the test would be that the group mean test scores are consistent with the performance grouping. Your test accurately discriminates between swimmers with very high, high, and moderate abilities, respectively.

¨  If the means were not consistent with the performance-defined groups, this would provide evidence that the test lacks construct validity. It would not be a valid measure of swimming ability in college students.
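The known-groups check in the swimming example can be sketched in Python. The scores are invented for illustration, and the function name is a hypothetical helper. The sketch only checks that the group means fall in the order the theory predicts; in practice you would also want a statistical test of whether the group differences are larger than chance.

```python
from statistics import mean

def means_follow_expected_order(scores_by_group, expected_order):
    """Return True if the group means strictly decrease in the expected order
    (first group highest, last group lowest)."""
    group_means = [mean(scores_by_group[group]) for group in expected_order]
    return all(higher > lower for higher, lower in zip(group_means, group_means[1:]))

# Order predicted by theory and logic: highest ability first
expected_order = ["college swimmers", "advanced swimmers", "intermediate swimmers"]

# Hypothetical swimming test scores (higher = better), assumed for illustration
swim_scores = {
    "college swimmers": [92, 88, 95, 90, 91],
    "advanced swimmers": [78, 82, 75, 80, 79],
    "intermediate swimmers": [61, 66, 58, 64, 60],
}

for group in expected_order:
    print(f"{group}: mean = {mean(swim_scores[group]):.1f}")

supports_validity = means_follow_expected_order(swim_scores, expected_order)
print("Group means match the expected order:", supports_validity)
```

If the function returned False for real data, the means would not be consistent with the performance-defined groups, which is the situation described above as evidence against construct validity.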

