PEP 6305 Measurement in Health & Physical Education

Topic 12: Validity

Section 12.1

n This Topic has 2 Sections.

Reading

n “Validity” reading posted in Blackboard.

Purpose

n To discuss the concepts of measurement validity.

n To introduce the principles of measurement validation.

Note

n Measurement validity is different from experimental validity, which was discussed in the “Research Design” section of Topic 1.

¨ Measurement validity investigates the qualities and merits of a measure.

¨ Experimental validity investigates the qualities and merits of an experiment or study.

Validity

n A measure that is valid means that:

1) the measure is an indicator of the characteristic being measured, and

2) the interpretation of the measure is appropriate: your uses of the information (e.g., making a decision) are proper (accurate and unbiased) and lead to correct actions.

n **Validity is specific to a population and a context.**

¨ The population comprises the people about whom we want to gather accurate information for some purpose.

¨ The context comprises the purpose for making a measurement, the conditions under which the measurement is conducted, and the application of the resulting measure (i.e., the use of the measure to fulfill the specific purpose).

¨ A measure that is valid in one population and context may not be valid in other populations or contexts.

¨ Thus, we do not validate tests; **we validate the interpretations and uses of a test**.

¨ For example, if a man’s body fat is measured to be 40%, there are several potential interpretations and uses:

40% of the man’s body is composed of fat.
The man is obese and should consider modification of diet and exercise behaviors.
The man is at risk for certain health problems and medical intervention is needed.
Validation studies investigate questions such as: Which, if any, of these interpretations and uses are appropriate? Why? What evidence would you need to support each of these interpretations and uses?

n Establishing test validity involves obtaining evidence of what the test measures and how it can be interpreted.

¨ This evidence is typically gathered in a series of studies.

¨ Some studies evaluate whether the measure is a valid indicator of the correct variable.

¨ Other studies evaluate whether the interpretation of the measure is valid.

¨ Every new use or interpretation of a measure requires gathering more evidence in a new series of validation studies.

n The reading discusses three types of validity evidence:

¨ Content validity evidence.

¨ Criterion validity evidence.

¨ Construct validity evidence.

Content Validity Evidence: Logical Support

n Content validity evidence (also called “logical validity evidence” or “face validity evidence”) involves using logic to determine if the test is valid. Evidence of content validity can be illustrated by examples.

¨ Written Classroom Test - A content valid written test is one that measures what was taught in the classroom.

A test with questions that sample the “population” of course information and skills has higher content validity. Tests that have questions not related to course information have low content validity.
The test questions measure the material defined by the course objectives.

¨ 40-Yard Sprint - A test used by many football coaches to measure speed is 40-yard sprint time.

Lower sprint times indicate higher speed.
Comparing sprint times is equivalent to comparing the speeds of the athletes, which is the goal. The test is clearly a measure of what is intended.

¨ Firefighter and Police Tests - Individuals who want to become a firefighter or a police officer must pass demanding physical tests.

These tests involve physical tasks that must be done as part of the job.
These work-specific tasks were identified by analyzing the work and tasks that are actually done on the job.
When the test includes the key tasks done at work, it has higher content validity.

n Making a logical argument to support including certain components in a test can be simple or complex.

¨ In general, the more complex the characteristic or behavior being measured, the more complex the test and the logical argument.

¨ An example of a simple characteristic to measure is arm flexion strength; subjects who can lift more weight by flexing their arm have more arm strength.

¨ An example of a complex characteristic to measure is the ability to conduct a scientific study, which faculty are often interested in evaluating in their graduate students. One way to evaluate this is to have a student identify a scientific problem, design a study to answer the problem, collect and analyze data, interpret the results, and then write a thesis or dissertation explaining how the study has addressed the problem. Obviously, this process is more complicated than having subjects lift a weight.

Criterion Validity Evidence: Reference to a Standard

n Criterion Validity Evidence

¨ Criterion-related validity evidence involves examining how a test is related to a well-established criterion.

¨ Typically, product-moment correlation or regression is used to establish criterion-related validity.

¨ The degree to which the test and criterion are correlated is the degree to which the test is a valid indicator of the trait measured by the criterion.

The correlation coefficient between the test and criterion is sometimes called the validity coefficient.

¨ For example, the correlation between height and weight is about 0.60. Thus, to some extent height is a valid indicator of weight - taller individuals tend to be heavier than shorter individuals. If you know someone's height you have a rough idea of his or her weight relative to other people.

n A high correlation between a test and a criterion is evidence of criterion validity.

n In addition, high correlation indicates that the test can predict the criterion with good precision (the standard error of the estimate is low).
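
The link between the validity coefficient and the precision of prediction can be sketched in a few lines of Python. This is a minimal illustration, not a procedure taken from the reading: the scores are made-up values, the function names are hypothetical, and the standard error of the estimate uses the common textbook form s_y × √(1 − r²), where s_y is the standard deviation of the criterion.

```python
import math

def pearson_r(test_scores, criterion_scores):
    """Product-moment correlation between a test and a criterion."""
    n = len(test_scores)
    mean_x = sum(test_scores) / n
    mean_y = sum(criterion_scores) / n
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(test_scores, criterion_scores))
    sxx = sum((x - mean_x) ** 2 for x in test_scores)
    syy = sum((y - mean_y) ** 2 for y in criterion_scores)
    return sxy / math.sqrt(sxx * syy)

def standard_error_of_estimate(criterion_scores, r):
    """Textbook approximation: SEE = s_y * sqrt(1 - r^2)."""
    n = len(criterion_scores)
    mean_y = sum(criterion_scores) / n
    s_y = math.sqrt(sum((y - mean_y) ** 2 for y in criterion_scores) / (n - 1))
    return s_y * math.sqrt(1 - r ** 2)

# Hypothetical data: a field test score and a laboratory criterion score
# for six subjects (illustrative values only, not real measurements).
field_test = [31.0, 35.5, 40.2, 44.8, 50.1, 54.3]
criterion = [30.2, 37.0, 39.5, 46.1, 48.9, 55.0]

r = pearson_r(field_test, criterion)
see = standard_error_of_estimate(criterion, r)
print(f"validity coefficient r = {r:.2f}")
print(f"standard error of the estimate = {see:.2f}")
```

As r approaches 1, the term √(1 − r²) shrinks toward 0, so predictions of the criterion from the test become more precise.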

n Some examples of criterion variables.

¨ Expert judgment. The critical points in using expert judgments are (1) to use the judgment of several (>2) experts rather than a single expert and (2) to ensure that the experts’ judgments have high objectivity and high reliability (stability).

¨ Rank order of the subjects on some performance test or indicator. The key is to ensure that the rank order is itself a valid indicator of the characteristic.

¨ An established, accurate, and valid test of the same characteristic. This method is probably the most common in exercise science because we have accurate measures of many of the characteristics and properties in which we are interested.

n Several important exercise science examples of tests that have been validated using criterion validity evidence are:

¨ Aerobic fitness.

The standard, valid measure of aerobic fitness is laboratory-measured oxygen use, in which subjects run on a treadmill while a machine collects and analyzes the air they breathe in and out. The machine reports a measure of VO2max (the maximum amount of oxygen the person can utilize) per minute per kilogram of body weight. While highly reliable, objective, and accurate, the lab test is not practical for testing a large number of subjects.
Several types of tests that are more feasible for testing a large number of subjects have been developed.
The validity of these more practical field tests has been established by testing subjects with both tests and examining the correlation between tests.
Maximum treadmill tests. The correlations between maximum treadmill time and laboratory determined VO2max range from 0.88 to 0.97.
Maximum distance run tests. Distance run tests (e.g., 1-mile run for time, 1.5-mile run for time and 12-minute run for distance) are correlated with laboratory determined VO2max. These correlations (i.e., validity coefficients) tend to range from 0.70 to 0.90.
One-mile submaximal walk or jog. These submaximal tests use submaximal walking or jogging in combination with heart rate response to exercise and other variables to estimate laboratory measured VO2max with a regression equation. The validity coefficients tend to be ≥ 0.87.
Non-exercise tests. Age, body composition, self-reported level of physical activity, and gender provide an excellent estimate of laboratory determined VO2max. The validity coefficients of these models are 0.79 and 0.85. These tests were validated at the University of Houston by Andrew Jackson, an emeritus professor in HHP, and Matthew Mahar, one of his doctoral students.
Think about these issues: How is each of the tests more feasible for testing a large number of subjects? Which is easiest? Which has the highest validity coefficient (the best prediction of the lab test)? Is there a tradeoff between simplicity of testing and accuracy?

¨ Body Composition

The most accurate methods of measuring body composition (i.e., percent body fat) are underwater weighing (fat floats, so the difference between weight on land and weight underwater is used to estimate fat weight) and dual-energy x-ray absorptiometry (DEXA; x-rays are used to identify different types of body tissue, including fat). DEXA is generally accepted as the most valid criterion of body composition. These methods can only be used in a laboratory, which is again impractical for evaluating large groups of people.
The two methods that are used in the field are skinfolds and body mass index (BMI). The evidence supporting skinfolds and BMI is criterion-related validity.
Skinfolds. The most common field tests of body composition are skinfolds. The thicknesses of skinfolds at various points on the body are entered into equations to predict body density, which is converted to percent body fat. The validity coefficients for skinfold fat are 0.85 for women and 0.90 for men. Skinfolds are preferred over BMI to predict body fat because the validity coefficients are higher, i.e., the skinfold test has higher criterion validity than does BMI.
BMI. The validity coefficients for equations that relate BMI to percent body fat are between 0.75 and 0.85.
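
The last step of the skinfold method, converting predicted body density to percent body fat, can be sketched as follows. This sketch assumes the widely used Siri equation (percent fat = 495 / density − 450, with density in g/cm³); the notes above do not say which conversion the reading uses, so treat the choice of equation and the function name as assumptions.

```python
def siri_percent_fat(body_density):
    """Convert body density (g/cm^3) to percent body fat.

    Assumes the Siri equation: percent fat = 495 / density - 450.
    """
    if body_density <= 0:
        raise ValueError("body density must be positive")
    return 495.0 / body_density - 450.0

# Hypothetical predicted densities (illustrative values only).
for density in (1.03, 1.05, 1.07):
    print(f"density {density:.2f} g/cm^3 -> {siri_percent_fat(density):.1f}% fat")
```

Lower body density corresponds to a higher percent fat, which is why an equation that predicts density from skinfold thickness can serve as a field estimate of body composition.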

Relation of Reliability and Validity

n A test cannot be valid if it is not reliable.

¨ If a test is not reliable, then it is not measuring anything. All of the variation in test scores is due to measurement error.

¨ If a test is not measuring anything, if it represents only measurement error, then it cannot be a valid indicator of anything.

¨ For instance, suppose I assigned grades by rolling a die, so that 1 = A, 2 = B, 3 = C, and so on. Would this be reliable--would it vary with your "true" ability? No, because the grades are assigned at random. Consequently, that grade cannot be a valid indicator of your performance in this course.

n The maximum possible validity coefficient is defined by the following equation:

r_xy ≤ √(r_xx × r_yy)

where r_xy is the validity (correlation) coefficient, r_xx is the reliability of the x variable (test), and r_yy is the reliability of the y variable (criterion).

n The criterion validity coefficient cannot be higher than the square root of the product of the reliabilities. It can, however, be lower than this limit, because even if the x and y variables each had perfect reliability, the two variables may be completely unrelated to one another. To illustrate, we can measure your height with high reliability and measure your percent body fat with high reliability, both with reliability coefficients >0.90. But height is not correlated with percent body fat, so the validity coefficient (r_xy) would be close to 0.

n Thus, the magnitude of the validity coefficient is directly related to the reliability of both tests.

n Here are some numerical examples:

r_xx = 0.90 and r_yy = 0.90, so r_xy ≤ 0.90

r_xx = 0.90 and r_yy = 0.60, so r_xy ≤ 0.73

r_xx = 0.70 and r_yy = 0.60, so r_xy ≤ 0.65

r_xx = 0.90 and r_yy = 0.10, so r_xy ≤ 0.30
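
The numerical examples above can be reproduced with a short Python sketch. It computes the upper limit described in this section, the square root of the product of the two reliabilities; the function name is a hypothetical label chosen for illustration.

```python
import math

def max_validity(r_xx, r_yy):
    """Upper limit on the validity coefficient r_xy given the
    reliability of the test (r_xx) and of the criterion (r_yy)."""
    return math.sqrt(r_xx * r_yy)

# The reliability pairs used in the numerical examples.
examples = [(0.90, 0.90), (0.90, 0.60), (0.70, 0.60), (0.90, 0.10)]
for r_xx, r_yy in examples:
    limit = max_validity(r_xx, r_yy)
    print(f"r_xx = {r_xx:.2f} and r_yy = {r_yy:.2f}, so r_xy <= {limit:.2f}")
```

Note how a single unreliable measure drags the limit down even when the other measure is highly reliable.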

n Suppose your new test has an estimated reliability = 0. What is the highest that its correlation with a criterion variable could be?

n The reliability of a test may not be known. If the validity coefficient is high, however, then both of the tests must be reliable.

n For instance, suppose the correlation (validity coefficient) between skinfolds and underwater measured percent fat is about 0.90.

¨ Underwater weighing, the criterion, is known to be highly reliable (r_yy > 0.95), but the reliability of skinfolds is not as well established.

¨ The validity coefficient of 0.90, however, provides strong evidence that skinfolds can be measured reliably.

If not, the validity coefficient could not be as high.
Can you show this using the equation above? What is the lowest that the reliability of skinfolds could be?
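
One way to work through this question is to rearrange the limit. Because the validity coefficient cannot exceed the square root of the product of the two reliabilities, squaring both sides gives r_xx ≥ r_xy² / r_yy. The sketch below applies that rearrangement; the function name is hypothetical, and the two criterion reliabilities tried (1.00 and 0.95) are assumptions chosen to bracket the "highly reliable" criterion described above.

```python
def min_reliability(r_xy, r_other):
    """Smallest reliability one measure can have, given the observed
    validity coefficient r_xy and the reliability of the other measure.

    Follows from r_xy <= sqrt(r_xx * r_yy), i.e. r_xx >= r_xy**2 / r_yy.
    """
    if not 0 < r_other <= 1:
        raise ValueError("reliability must be in the interval (0, 1]")
    return r_xy ** 2 / r_other

# Validity coefficient from the skinfold example, with two assumed
# values for the reliability of the underwater weighing criterion.
for r_criterion in (1.00, 0.95):
    lowest = min_reliability(0.90, r_criterion)
    print(f"criterion reliability {r_criterion:.2f}: "
          f"skinfold reliability >= {lowest:.3f}")
```

The less reliable the criterion is assumed to be, the higher the reliability of skinfolds must be to produce the same validity coefficient.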

Construct Validity Evidence

n Construct validation is used primarily, although not exclusively, with measures of abstract characteristics rather than concrete characteristics.

¨ An abstract characteristic is one that cannot be directly observed. A concrete characteristic can be directly observed.

¨ Self-efficacy is a psychosocial characteristic. Compared to body weight, which is directly observable, self-efficacy cannot be seen or directly measured.

¨ However, if we develop new measures or indicators of body weight, we could still use the process of construct validation to confirm that those measures are valid.

¨ The variable being measured is also known as a construct, because it is constructed of the measures that indicate the relative or absolute quantity of the relevant characteristics.

¨ Thus, construct validity investigates whether a measure is a valid indicator of the variable (construct) of interest.

n Construct validity evidence involves the scientific method:

¨ Identify and define the construct (variable) of interest and the population of interest.

¨ Develop or identify a theory that explains the construct, including how it is associated with observable traits or behaviors, associated with other constructs, and associated with current measures of similar and different characteristics.

¨ Sample the population and collect and analyze data, including the measure you are testing, to produce statistical evidence of these theoretical associations.

¨ Determine the extent to which this statistical evidence supports the theory. When the evidence supports the theory, you have a measure that has construct validity.

n While this can be a complex procedure, the central issue is the extent to which test scores agree with what the theory predicts.

n If you administer a test to a group of individuals and the results do not match what you would expect, this is evidence that the test lacks validity. The definition, the theory, or the test may be incorrect, but regardless of the reason, the test is not valid for the intended use (although it might be valid for other uses).

¨ Note that in this instance you are comparing groups that you know are different; you are not trying to determine whether they differ.

n If the results support your expectations, this would support the conclusion that the test is valid.

n This can be illustrated with an example.

¨ Swimming Test. Assume that you develop a test of swimming skills. Your construct is swimming ability. Your population is college students. You administer it to the following three groups of college students.

Individuals who just completed an intermediate swimming class.
Individuals who just completed an advanced swimming class that led to lifeguard certification.
Members of a nationally ranked college swimming team.

¨ Theory and logic would dictate that the order of the mean performance of the groups would be: highest, college swimmers; next, advanced swimmers; and lowest, intermediate swimmers.

¨ Construct validity evidence supporting the test would be that the group mean test scores are consistent with the performance grouping. Your test accurately discriminates between swimmers with very high, high, and moderate abilities, respectively.

¨ If the means were not consistent with the performance-defined groups, this would provide evidence that the test lacks construct validity. It would not be a valid measure of swimming ability in college students.
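
The comparison in the swimming example can be sketched in Python. This is only a descriptive check that the group means fall in the order the theory predicts; the scores, group labels, and function name are hypothetical, and an actual validation study would also test whether the group differences are statistically reliable.

```python
from statistics import mean

def means_follow_expected_order(groups, expected_order):
    """Return True if the group means strictly increase along
    expected_order (listed from lowest to highest expected ability)."""
    means = [mean(groups[name]) for name in expected_order]
    return all(lower < higher for lower, higher in zip(means, means[1:]))

# Hypothetical swimming test scores (illustrative values only).
scores = {
    "intermediate": [52, 58, 61, 55, 60],
    "advanced": [68, 72, 70, 75, 66],
    "college_team": [88, 91, 85, 93, 90],
}
expected = ["intermediate", "advanced", "college_team"]

for name in expected:
    print(f"{name}: mean = {mean(scores[name]):.1f}")
print("Means match the predicted order:",
      means_follow_expected_order(scores, expected))
```

If the function returned False for real data, the means would not be consistent with the performance-defined groups, which is the situation described above as evidence against construct validity.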
