PEP 6305 Measurement in Health & Physical Education
Topic 12: Validity
Section 12.1
n
This Topic has 2 Sections.
Reading
n
“Validity” reading posted in Blackboard.
Purpose
n
To discuss the concepts of measurement validity.
n
To introduce the principles of measurement validation.
Note
n
Measurement validity is different from
experimental validity, which was discussed in the “Research
Design” section of Topic 1.
¨
Measurement validity investigates the qualities and merits of a
measure.
¨
Experimental validity investigates the qualities and merits of an
experiment or study.
Validity
n
A valid measure means that:
1) the measure is an indicator of the characteristic being measured, and
2) the interpretation of the measure is appropriate: your uses of the information
(e.g., making a decision) are proper (accurate and unbiased) and lead to correct
actions.
n
Validity is specific to a population and a context.
¨
The population comprises the people about whom we
want to gather accurate information for some purpose.
¨
The context comprises the purpose for making a
measurement, the conditions under which the measurement is conducted, and
the application of the resulting measure (i.e., the use of the measure to
fulfill the specific purpose).
¨
A measure that is valid in one population and context may not
be valid in other populations or contexts.
¨
Thus, we do not validate tests; we validate the
interpretations and uses of a test.
¨
For example, if a man’s body fat is measured to be 40%, there are
several potential interpretations and uses:
-
40% of the man’s body is composed of fat.
-
The man is obese and should consider
modification of diet and exercise behaviors.
-
The man is at risk for certain health
problems and medical intervention is needed.
-
Validation studies investigate questions
such as: Which, if any, of these interpretations and
uses are appropriate? Why? What evidence would you need to support
each of these interpretations and uses?
n
Establishing test validity involves obtaining evidence
of what the test measures and how it can be interpreted.
¨
This evidence is typically gathered in a series of studies.
¨
Some studies evaluate whether the measure is a valid indicator of
the correct variable.
¨
Other studies evaluate whether the interpretation of the measure
is valid.
¨
Every new use or interpretation of a measure requires
gathering more evidence in a new series of validation studies.
n
The reading discusses three types of validity evidence:
¨
Content validity evidence.
¨
Criterion validity evidence.
¨
Construct validity evidence.
Content Validity Evidence: Logical Support
n
Content validity evidence (also called “logical validity
evidence” or “face validity evidence”) involves using logic to determine if
the test is valid. Evidence of content
validity can be illustrated by examples.
¨
Written Classroom Test - A content valid written test is
one that measures what was taught in the classroom.
-
A test with questions that sample the
“population” of course information and skills has higher content validity.
Tests that have questions not related to course information have low content
validity.
-
The test questions measure the material
defined by the course objectives.
¨
40-Yard Sprint - A test used by many football coaches to
measure speed is 40-yard sprint time.
-
Lower sprint times indicate higher speed.
-
Comparing sprint times is equivalent to
comparing the speeds of the
athletes, which is the goal. The test is clearly a measure of what is
intended.
¨
Firefighter and Police Tests - Individuals who want to
become a firefighter or a police officer must pass demanding physical tests.
-
These tests involve physical tasks that
must be done as part of the job.
-
Analyzing the work and tasks that are
actually done on the job
identified these work-specific tasks.
-
When the test includes the key tasks done
at work, it has higher content validity.
n
Making a logical argument to support including certain components
in a test can be simple or complex.
¨
In general, the more complex the characteristic or behavior being
measured, the more complex the test and the logical argument.
¨
An example of a simple characteristic to measure is arm flexion
strength; subjects who can lift more weight by flexing their arm have more arm
strength.
¨
An example of a complex characteristic to measure is the ability
to conduct a scientific study, which faculty are often interested in evaluating
in their graduate students. One way to evaluate this is to have a student
identify a scientific problem, design a study to answer the problem, collect and
analyze data, interpret the results, and then write a thesis or dissertation
explaining how the study has addressed the problem. Obviously, this process is
more complicated than having subjects lift a weight.
Criterion Validity Evidence: Reference to a Standard
n
Criterion Validity Evidence
¨
Criterion-related validity evidence involves examining how a test
is related to a well-established criterion.
¨
Typically, product-moment correlation or regression is used to
establish criterion-related validity.
¨
The degree to which the test and criterion are correlated is the
degree to which the test is a valid indicator of the trait measured by the
criterion.
-
The correlation coefficient between the
test and criterion is sometimes called the validity coefficient.
¨
For example, the correlation between height and weight is about
0.60. Thus, to some extent height is a valid indicator of weight - taller
individuals tend to be heavier than shorter individuals. If you know
someone's height you have a rough idea of his or her weight relative to other
people.
n
A high correlation between a test and a criterion is evidence of
criterion validity.
n
In addition, high correlation indicates that the test can predict
the criterion with good precision (the
standard error of the estimate is low).
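As a minimal sketch of how a validity coefficient and the standard error of the estimate might be computed, the Python below correlates a field test with a criterion measure. The data values, variable names, and function names are hypothetical and chosen only for illustration; they are not taken from the reading or from any published study.

```python
import math

def pearson_r(x, y):
    """Product-moment correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def standard_error_of_estimate(x, y):
    """Standard error of the estimate when y is predicted from x
    with simple linear regression."""
    n = len(y)
    mean_y = sum(y) / n
    syy = sum((b - mean_y) ** 2 for b in y)
    r = pearson_r(x, y)
    # The residual sum of squares equals (1 - r^2) times the total sum of squares.
    # max() guards against a tiny negative value caused by rounding.
    unexplained = max(0.0, 1.0 - r ** 2)
    return math.sqrt(unexplained * syy / (n - 2))

# Hypothetical scores for six subjects (illustration only).
run_distance_m = [1800, 2100, 2400, 2600, 2900, 3100]  # field test
lab_vo2max = [30.0, 36.5, 41.0, 44.5, 52.0, 54.5]      # criterion

print("validity coefficient:", round(pearson_r(run_distance_m, lab_vo2max), 3))
print("standard error of the estimate:",
      round(standard_error_of_estimate(run_distance_m, lab_vo2max), 2))
```

The closer the correlation is to 1, the smaller the unexplained share of criterion variance, and so the smaller the standard error of the estimate.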
n
Some examples of criterion variables:
¨
Expert judgment. The critical points in using expert
judgments are (1) to use the judgment of several (>2) experts rather than a
single expert and (2) to ensure that the experts’ judgments have high
objectivity and high reliability (stability).
¨
Rank order of the subjects on some performance test or
indicator. The key is to ensure that the rank order is itself a valid indicator
of the characteristic.
¨
An established, accurate, and valid test of the same
characteristic. This method is probably the most common in exercise science
because we have accurate measures of many of the characteristics and properties
in which we are interested.
n
Several important exercise science examples of tests that have
been validated using criterion validity evidence are:
¨
Aerobic fitness.
-
The standard, valid measure of aerobic fitness is
laboratory-measured oxygen use, in which subjects run on a treadmill while a
machine collects and analyzes the air they breathe in and out. The machine
reports a measure of VO2max (the maximum amount of oxygen the person can utilize) per
minute per kilogram of body weight. While highly reliable, objective, and
accurate, the lab test is not practical for testing a large number of
subjects.
-
Several types of tests that are more
feasible for testing a large number of subjects have been developed.
-
The validity of these more practical field
tests has been established by testing subjects with both tests and examining
the correlation between tests.
-
Maximum treadmill tests.
The correlations between maximum treadmill time and laboratory determined
VO2max range from 0.88 to 0.97.
-
Maximum distance run tests.
Distance run tests (e.g., 1-mile run for time, 1.5-mile run for time and
12-minute run for distance) are correlated with laboratory determined
VO2max. These correlations (i.e., validity coefficients) tend to range from
0.70 to 0.90.
-
One-mile submaximal walk or jog.
These submaximal tests use submaximal walking or jogging in combination with
heart rate response to exercise and other variables to estimate laboratory
measured VO2max with a regression equation. The validity coefficients tend
to be ≥ 0.87.
-
Non-exercise tests.
Age, body composition, self-reported level of physical activity, and gender
provided an excellent estimate of laboratory determined VO2max. The validity
coefficients of these models are 0.79 and 0.85. These tests were validated
at the University of Houston by Andrew
Jackson, an emeritus professor in HHP, and Matthew Mahar, one of his
doctoral students.
-
Think about these issues: How is each of the tests more
feasible for testing a large number of
subjects? Which is easiest? Which has the highest validity coefficient (the
best prediction of the lab test)? Is there a tradeoff between simplicity of
testing and accuracy?
¨
Body Composition
-
The most accurate methods of measuring body
composition (i.e., percent body fat) are underwater weighing (fat floats,
so the difference between weight on land and weight underwater is used to
estimate fat weight) and dual-energy x-ray absorptiometry (DEXA; x-rays are
used to identify different types of body tissue, including fat). DEXA is
generally accepted as the most valid criterion of body composition.
These methods can only be used in a
laboratory, which is again impractical for evaluating large groups of people.
-
The two methods that are used in the field
are skinfolds and body mass index (BMI). The evidence supporting skinfolds
and BMI is criterion-related validity.
-
Skinfolds.
The most common field tests of body composition are skinfolds. The thicknesses
of skinfolds at various points on the body are entered into equations to
predict body density, which is converted to percent body fat. The validity
coefficients for skinfold fat are 0.85 for women and 0.90 for men. Skinfolds
are preferred over BMI to predict body fat because the validity coefficients
are higher, i.e., the skinfold test has higher criterion validity than does
BMI.
-
BMI.
The validity coefficients for equations that relate BMI to percent body fat
are between 0.75 and 0.85.
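To make the two field measures concrete, here is a minimal sketch of the arithmetic involved. The reading does not say which conversion from body density to percent fat is used; the sketch assumes the widely used Siri equation (percent fat = 495 / density - 450, with density in g/cc) and the standard BMI definition (weight in kilograms divided by height in meters squared). The function names and input values are hypothetical.

```python
def siri_percent_fat(body_density):
    """Convert body density (g/cc) to percent body fat.
    Assumes the Siri equation; other conversions exist."""
    return 495.0 / body_density - 450.0

def body_mass_index(weight_kg, height_m):
    """BMI = weight (kg) divided by height (m) squared."""
    return weight_kg / height_m ** 2

# Hypothetical inputs for illustration only.
print(round(siri_percent_fat(1.05), 1))       # percent fat for a density of 1.05 g/cc
print(round(body_mass_index(70.0, 1.75), 1))  # BMI for 70 kg and 1.75 m
```

Note that BMI uses only height and weight, which may help explain why its validity coefficients are lower than those for skinfolds.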
Relation of Reliability and Validity
n
A test cannot be valid if it is not reliable.
¨
If a test is not reliable, then it is not measuring anything. All
of the variation in test scores is due to measurement error.
¨
If a test is not measuring anything, if it represents only
measurement error, then it cannot be a valid indicator of anything.
¨
For instance, suppose I assigned grades by rolling a die, so that 1 = A, 2 = B,
3 = C, and so on. Would this be reliable--would it vary with your "true"
ability? No, because the grades are assigned at random. Consequently, that grade
cannot be a valid indicator of your performance in this course.
n
The maximum possible validity coefficient is defined by the
following equation:
$$ r_{xy} \le \sqrt{r_{xx} \, r_{yy}} $$
where $r_{xy}$ is the validity (correlation) coefficient, $r_{xx}$
is the reliability of the x variable (test), and $r_{yy}$ is
the reliability of the y variable (criterion).
n
The criterion validity coefficient cannot be higher than the
square root of the product of the reliabilities. It can, however, be lower than
that limit, because even if the x and y variables each had
perfect reliability, the two variables may be completely unrelated to one
another. To illustrate, we can measure your height with high reliability
and measure your percent body fat with high reliability, both with
reliability coefficients >0.90. But height is not correlated with percent body
fat, so the validity coefficient ($r_{xy}$)
would be close to 0.
n
Thus, the maximum possible magnitude of the validity coefficient depends
directly on the reliability of both tests.
n
Here are some numerical examples:
$r_{xx}$ = 0.90 and $r_{yy}$ = 0.90, so $r_{xy}$ ≤ 0.90
$r_{xx}$ = 0.90 and $r_{yy}$ = 0.60, so $r_{xy}$ ≤ 0.73
$r_{xx}$ = 0.70 and $r_{yy}$ = 0.60, so $r_{xy}$ ≤ 0.65
$r_{xx}$ = 0.90 and $r_{yy}$ = 0.10, so $r_{xy}$ ≤ 0.30
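The numerical examples above can be checked with a short script. This is a minimal sketch that applies the stated limit (the square root of the product of the two reliabilities); the function name `max_validity` is hypothetical.

```python
import math

def max_validity(r_xx, r_yy):
    """Upper limit on the validity coefficient, given the reliability
    of the test (r_xx) and of the criterion (r_yy)."""
    return math.sqrt(r_xx * r_yy)

# The four reliability pairs from the numerical examples.
examples = [(0.90, 0.90), (0.90, 0.60), (0.70, 0.60), (0.90, 0.10)]
for r_xx, r_yy in examples:
    limit = max_validity(r_xx, r_yy)
    print(f"r_xx = {r_xx:.2f}, r_yy = {r_yy:.2f} -> r_xy <= {limit:.2f}")
```

Trying other values shows how quickly the limit drops when either reliability is low, even if the other is high.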
n
Suppose your new test has an estimated reliability = 0. What is
the highest that its correlation with a criterion variable
could be?
n
The reliability of a test may not be known. If the validity
coefficient is high, however, then both of the tests must be reliable.
n
For instance, suppose the correlation (validity coefficient)
between skinfolds and underwater measured percent fat is about 0.90.
¨
Underwater weighing is known to be highly reliable (reliability
> 0.95), but the reliability of skinfolds is not as well established.
¨
The validity coefficient of 0.90, however, provides strong evidence
that skinfolds can be measured reliably.
-
If not, the validity coefficient could not
be as high.
-
Can you show this using the equation above?
What is the lowest that the reliability of skinfolds
could be?
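One way to approach this question is to rearrange the limit on the validity coefficient so that it gives the smallest test reliability consistent with an observed validity coefficient: the test reliability must be at least the squared validity coefficient divided by the criterion reliability. The sketch below assumes that rearrangement; the function name and the default criterion reliability of 1.0 (the most favorable case) are illustrative choices, not values from the reading.

```python
def min_test_reliability(r_xy, r_yy=1.0):
    """Smallest test reliability consistent with an observed validity
    coefficient r_xy, given the criterion reliability r_yy.
    Follows from r_xy <= sqrt(r_xx * r_yy)."""
    return r_xy ** 2 / r_yy

# Observed validity coefficient of 0.90 for skinfolds.
print(round(min_test_reliability(0.90), 2))        # criterion assumed perfectly reliable
print(round(min_test_reliability(0.90, 0.95), 2))  # criterion reliability of 0.95
```

The less reliable the criterion, the more reliable the test must be to reach the same validity coefficient.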
Construct Validity Evidence
n
Construct validation is used primarily, although not exclusively,
with measures of abstract characteristics rather than concrete characteristics.
¨
An abstract characteristic is one that cannot be directly
observed. A concrete characteristic can be directly observed.
¨
Self-efficacy is a psychosocial characteristic. Compared to body
weight, which is directly observable, self-efficacy cannot be seen or directly
measured.
¨
However, if we develop new measures or indicators of body weight,
we could still use the process of construct validation to confirm that those
measures are valid.
¨
The variable being measured is also known as a construct, because
it is constructed of the measures that indicate the relative or absolute
quantity of the relevant characteristics.
¨
Thus, construct validity investigates whether a measure is a valid
indicator of the variable (construct) of interest.
n
Construct validity evidence involves the scientific method:
¨
Identify and define the construct (variable) of interest and the
population of interest.
¨
Develop or identify a theory that explains the construct,
including how it is associated with observable traits or behaviors,
associated with other constructs, and associated with
current measures of similar and different characteristics.
¨
Sample the population and collect and analyze data, including the
measure you are testing, to produce statistical evidence of these theoretical
associations.
¨
Determine the extent to which this statistical evidence supports
the theory. When the evidence supports the theory, you have a measure that has
construct validity.
n
While this can be a complex procedure, the central issue is the
extent to which test scores agree with what the theory predicts.
n
If you administer a test to a group of individuals and the
results do not match what you would expect, this is evidence that the test lacks
validity. The definition, the theory, or the test may be incorrect, but regardless
of the reason, the test is not valid for the intended use (although it
might be valid for other uses).
¨
Note that in this instance you are comparing groups that you
know are different; you are not trying to determine whether they differ.
n
If the results support your expectations, this would support the
conclusion that the test is valid.
n
This can be illustrated with an example.
¨
Swimming Test. Assume that you develop a test of swimming skills.
Your construct is swimming ability. Your population is college students. You
administer it to the following three groups of college students.
-
Individuals who just completed an
intermediate swimming class.
-
Individuals who just completed an advanced
swimming class that led to lifeguard certification.
-
Members of a nationally ranked college
swimming team.
¨
Theory and logic would dictate that the order of the mean performance of the
groups would be: highest, college swimmers; next, advanced swimmers; and lowest,
intermediate swimmers.
¨
Construct validity evidence supporting the test would be that the
group mean test scores are consistent with the performance grouping. Your test accurately
discriminates between swimmers with very high, high, and moderate abilities,
respectively.
¨
If the means were not consistent with the performance-defined
groups, this would provide evidence that the test lacks construct validity. It would
not be a valid measure of swimming ability in college students.
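The known-groups logic of the swimming example can be sketched in a few lines of Python. The scores below are hypothetical, as are the group labels and the function name. The sketch only checks whether the group means fall in the order the theory predicts; an actual validation study would presumably also test the group differences statistically.

```python
from statistics import mean

def means_follow_expected_order(scores_by_group, expected_low_to_high):
    """Return True if the group means increase strictly in the order
    that theory predicts (lowest-ability group first)."""
    means = [mean(scores_by_group[group]) for group in expected_low_to_high]
    return all(lower < higher for lower, higher in zip(means, means[1:]))

# Hypothetical swimming test scores (higher = better), illustration only.
scores = {
    "intermediate class": [52, 58, 61, 55, 60],
    "advanced class": [68, 72, 75, 70, 66],
    "college team": [88, 91, 85, 93, 90],
}
expected_order = ["intermediate class", "advanced class", "college team"]

print(means_follow_expected_order(scores, expected_order))
```

If the function returned False for real data, the group means would contradict the theory, which counts as evidence against the construct validity of the test.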