PEP 6305 Measurement in Health & Physical Education
Topic 11: Reliability
Section 11.1

This Topic has 3 Sections.
Reading

• Vincent & Weir, Statistics in Kinesiology, 4th ed., Chapter 13, “Quantifying Reliability”.

• Also, the "Reliability" PDF reading posted in Blackboard.
Purpose

• To discuss the principles of reliability and measurement error.

• To demonstrate the estimation of reliability and the standard error of measurement.
Objectivity

• Objectivity concerns how a test is scored. It depends on two factors:
  ◦ A defined scoring system.
  ◦ Individuals (called judges or raters) who have been trained to score the test.
• An objective test is one to which two or more competent judges assign the same value when scoring it; in other words, the judges agree on the rating.
  ◦ The training of the judges and the scoring system are important to achieving high reliability.
  ◦ For example, if multiple judges score an event such as gymnastics or diving, you would need a scoring system that specifies not only what is important but also how to assign points (a scale), and the scorers would have to be trained on what to observe and how to assign points.
• Objectivity is sometimes referred to as interrater reliability.
• An objective rating is always better than a subjective rating because less measurement error is introduced.
  ◦ Differences between multiple test scorers introduce measurement error.
  ◦ Increasing measurement error decreases objectivity (interrater reliability); decreasing objectivity, in turn, decreases validity (validity is the final course Topic).
Reliability

• Reliability concerns how accurately a test represents variation between subjects.

• Measurement theory: An observed score (X) consists of two components, a true score (T) and an error score (E):
  ◦ X = T + E
  ◦ T (true score) is the measure of the ability or characteristic that we are interested in.
  ◦ E (error score) is measurement error, which is anything that is NOT the thing we want to measure.
• There are several possible sources of measurement error:
  ◦ Measurement unit or scale – a test’s unit of measurement may be too large to measure the characteristic precisely. For example, if subjects are rated on a three-point scale (poor, average, excellent), then distinguishing between subjects within any of the three ratings would be impossible, although it is unlikely that all subjects who received the same rating have exactly the same ability: not everyone who is rated "excellent" performs exactly the same, so giving them all the same score introduces some error (deviation from their "true" score or ability).
  ◦ Subject inconsistency – a person could have a good or bad day when being tested, which means their performance differs from day to day.
  ◦ Poor test conditions – noise or other distractions during a written test, or a slippery surface during a running test. Poor conditions, rather than the subject's ability, affect the subject's score.
  ◦ Poorly constructed test – badly written test questions that no one can understand, so that the test takers end up guessing. Guessing is not a measure of academic ability.
  ◦ Poor test equipment – the equipment (e.g., the gas analyzer when measuring VO2) is inconsistent because it is malfunctioning or improperly calibrated.
• Theoretically, you could determine a person’s true score (T) by calculating the mean of an infinite number of tests.
  ◦ While it is not possible to administer a test an infinite number of times, it is important to understand that the mean of several administrations (or trials) is the most accurate representation of the subject's true score. Why? In general (but not always), the more trials you average, the more accurately (reliably) the average score represents the subject's true score.
  ◦ Measurement error is assumed to be random and normally distributed. Thus, the mean error over several trials will approach 0, with predictable variability on either side of 0.
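A small simulation can make this concrete. The sketch below (hypothetical Python, not part of the course materials; the true score and error spread are assumed values) draws normally distributed random error around a fixed true score and shows that the mean of more trials typically lands closer to the true score.

```python
import random
import statistics

random.seed(1)

TRUE_SCORE = 50.0   # T: the subject's true ability (assumed for illustration)
ERROR_SD = 5.0      # spread of the random, normally distributed measurement error

def observed_score():
    """One observed score X = T + E, with E drawn from Normal(0, ERROR_SD)."""
    return TRUE_SCORE + random.gauss(0, ERROR_SD)

for n_trials in (1, 2, 5, 25, 100):
    # Average n_trials observed scores, repeat many times, and see how far
    # that average typically falls from the true score.
    deviations = [
        abs(statistics.mean(observed_score() for _ in range(n_trials)) - TRUE_SCORE)
        for _ in range(2000)
    ]
    print(f"{n_trials:3d} trials: typical |mean score - true score| = {statistics.mean(deviations):.2f}")
```

The printed deviations shrink as the number of averaged trials grows, which is why the mean of several trials is a better estimate of the true score than any single trial.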
Interpreting Test Reliability

• A reliability coefficient represents the proportion of total variance that is measuring true score differences among the subjects.
  ◦ A reliability coefficient can range from 0.0 (all the variance is measurement error) to 1.00 (no measurement error). In reality, all tests have some error, so reliability is never 1.00.
  ◦ A test with high reliability (≥0.70) is desired, because lower reliability indicates that a large proportion of test variance is measurement error.
  ◦ If test reliability is 0 and test scores are used to assign grades, a student's grade would be assigned purely by chance, similar to flipping a coin or rolling dice!
• High reliability indicates that the test is measuring something; validity studies (Topic 12) determine what the test is measuring.
  ◦ High test reliability is required for test validity. A test cannot be valid if it is not reliable.
  ◦ Low reliability means most of the observed test variance is measurement error – due to chance.
  ◦ If test variance is largely due to chance, the test is not measuring anything.
  ◦ If a test is not measuring anything, it cannot be a valid measure of anything.
Determining Reliability

• Test reliability is always established for a defined population; the reliability of a test in one population may not be the same as in other populations.
  ◦ Test variance is central to reliability.
  ◦ Since a test score (X) consists of true and error components, the total variance (σx²) of a test administered to a group consists of true score variance (σt²) and error variance (σe²):

      σx² = σt² + σe²   (Test Variance = True Score Variance + Error Variance)
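A short hypothetical simulation (not from the reading; the score spreads are assumed values) can show this decomposition in action: each observed score is built as a true score plus independent random error, and the variance of the observed scores comes out approximately equal to the true score variance plus the error variance.

```python
import random
from statistics import pvariance

random.seed(2)

# Hypothetical group: each subject has a true score (T); each observed score is X = T + E.
true_scores = [random.gauss(50, 6) for _ in range(10_000)]   # true score SD ≈ 6, variance ≈ 36
errors = [random.gauss(0, 3) for _ in true_scores]           # error SD ≈ 3, variance ≈ 9
observed = [t + e for t, e in zip(true_scores, errors)]

print(f"true score variance ≈ {pvariance(true_scores):.1f}")  # ≈ 36
print(f"error variance      ≈ {pvariance(errors):.1f}")       # ≈ 9
print(f"total test variance ≈ {pvariance(observed):.1f}")     # ≈ 45, i.e., true + error
```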
• To illustrate, suppose the SD of a test administered to students was 2.0 (thus, total test variance = 2.0² = 4.0). All of the students guessed on every question, which means that getting the questions correct was due to luck or chance (σe² = 4.0). Since guessing is completely random and has nothing to do with ability (true score), there would be no true score variance (σt² = 0) and the components would be:
  ◦ Total Variance = True Score Variance + Error Variance: 4.0 = 0.0 + 4.0
• Test reliability (Rxx) is calculated from these variances:

      Rxx = (σx² − σe²) / σx² = σt² / σx²

  ◦ For this example, test reliability would be: Rxx = (4.0 − 4.0)/4.0 = 0/4.0 = 0.0
  ◦ The reliability of this test is 0, which means that all test variance was due to measurement error. The test did not measure anything.
• As another example of test reliability, let us assume we have a test with a total variance of 40 and an error variance of 4. If this were the case, we would have the following:
  ◦ The reliability of the test would be: Rxx = (40 − 4)/40 = 36/40 = 0.90.
  ◦ The test reliability would be 0.90: 90% of the test variance is attributable to true score differences; only 10% of the total test variance is due to measurement error. This test has good reliability for detecting differences among subjects in the ability or trait being measured.
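The two worked examples above can be reproduced with a few lines of code; the function name below is just for illustration.

```python
def reliability(total_variance: float, error_variance: float) -> float:
    """Reliability coefficient Rxx = (total variance - error variance) / total variance."""
    return (total_variance - error_variance) / total_variance

# Reproduce the two worked examples above.
print(reliability(4.0, 4.0))   # 0.0 -> all variance is measurement error
print(reliability(40.0, 4.0))  # 0.9 -> 90% of variance reflects true score differences
```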
Types of Reliability

Stability or “Test-Retest” Reliability

• Involves administering the same test on two or more different occasions.

• Typically the tests are administered within a 7-day period to ensure that the true score does not change over the testing period.

• This method can be used with any test, but is often used with tests that cannot be administered twice within the same day. An example would be an endurance test like the 1.5-mile run.

• The stability reliability of a scorer (i.e., comparing multiple scores assigned by a single judge) is called intrarater reliability. (How is this different from interrater reliability, or objectivity?)
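As a rough sketch (hypothetical data, not taken from the assigned reading), test-retest reliability is often estimated by relating the scores from the two administrations; an intraclass correlation is generally preferred because it is also sensitive to systematic change between days, but a simple Pearson correlation illustrates the idea.

```python
import statistics  # statistics.correlation requires Python 3.10+

# Hypothetical 1.5-mile run times (minutes) for the same six subjects on two days.
day1 = [11.2, 12.5, 10.8, 13.0, 12.1, 11.7]
day2 = [11.4, 12.3, 10.9, 13.2, 11.9, 11.8]

# Pearson correlation between the two administrations as a simple
# stability (test-retest) reliability estimate.
r = statistics.correlation(day1, day2)
print(f"Test-retest correlation: r = {r:.2f}")
```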
Internal Consistency Reliability

• This type of reliability involves getting multiple measures within a day, usually at a single testing session. Examples are:

• Written test. The items are the multiple measures. The person’s score is the sum of all items answered correctly.

• Psychological instrument. These survey or interview instruments consist of several items that are often scored with 1 to 5 points. The person responds on a 5-point scale describing how characteristic the described behavior is of the person responding. The person’s score is usually the sum of all items.

• Judge's Ratings. A judge rates the performances of several individuals. Some examples are: figure skaters, divers, gymnasts; allied health students performing clinical procedures; or high school students who try out for the drill team or cheerleading squad. The judge rates several aspects of the performance, such as the components of a figure skating routine or competitive dive; each aspect is rated independently of the other aspects. The final score is typically the sum or average of the judge's ratings. (Comparing the final scores between multiple judges is objectivity.)

• Multiple-trial test. Many motor performance tests can be administered several times within a day. Examples of this are isometric strength tests. The test requires the subject to exert maximum effort for six seconds. The recommended test procedure is to administer a warm-up trial at 50% effort and then two trials at maximum effort. The average of the two maximum trials is the individual's score.
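One common internal-consistency estimate for multi-item instruments is Cronbach's alpha. The sketch below (hypothetical data, not taken from the course materials) computes it from the item variances and the variance of the summed scores.

```python
from statistics import pvariance

# Rows = subjects, columns = items scored 1-5 on a hypothetical 4-item scale.
scores = [
    [4, 5, 4, 4],
    [2, 3, 2, 3],
    [5, 5, 4, 5],
    [3, 2, 3, 3],
    [4, 4, 5, 4],
]

k = len(scores[0])                                          # number of items
item_variances = [pvariance(col) for col in zip(*scores)]   # variance of each item
total_variance = pvariance([sum(row) for row in scores])    # variance of summed scores

# Cronbach's alpha: proportion of total score variance attributable to the common
# (true score) component shared by the items.
alpha = (k / (k - 1)) * (1 - sum(item_variances) / total_variance)
print(f"Cronbach's alpha = {alpha:.2f}")
```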