PEP 6305 Measurement in Health & Physical Education

Topic 11: Reliability
Section 11.1

• This Topic has 3 Sections.
 
Reading
• Vincent & Weir, Statistics in Kinesiology, 4th ed., Chapter 13, “Quantifying Reliability”.
• Also, the "Reliability" PDF reading posted in Blackboard.
Purpose
• To discuss the principles of reliability and measurement error.
• To demonstrate the estimation of reliability and the standard error of measurement.
 
Objectivity
 
• Objectivity concerns how a test is scored. It depends on two factors:
  ◦ A defined scoring system.
  ◦ Individuals (called judges or raters) who have been trained to score the test.
• An objective test is one to which two or more competent judges assign the same value when scoring it. In other words, objectivity is the degree to which the judges agree on the rating.
  ◦ The training of the judges and the scoring system are important to achieving high reliability.
  ◦ For example, if multiple judges score an event such as gymnastics or diving, you would need a scoring system that specifies not only which aspects of the performance are important but also how to assign points (a scale), and the judges would have to be trained on what to observe and how to assign those points.
• Objectivity is sometimes referred to as interrater reliability.
• An objective rating is always better than a subjective rating because less measurement error is introduced.
  ◦ Differences between multiple test scorers introduce measurement error.
  ◦ Increasing measurement error decreases objectivity (interrater reliability); decreasing objectivity, in turn, decreases validity (validity is the final course Topic).
 
Reliability
 
• Reliability concerns how accurately a test represents variation between subjects.
• Measurement theory: An observed score (X) consists of two components, a true score (T) and an error score (E):
  ◦ X = T + E
  ◦ T (true score) is the measure of the ability or characteristic that we are interested in.
  ◦ E (error score) is measurement error, which is anything that is NOT the thing we want to measure.
• There are several possible sources of measurement error:
  ◦ Measurement unit or scale – a test's unit of measurement may be too large to measure the characteristic precisely. For example, if subjects are rated on a three-point scale (poor, average, excellent), then distinguishing between subjects within any of the three ratings would be impossible, although it is unlikely that all subjects who received the same rating have exactly the same ability: not everyone who is rated "excellent" is performing exactly the same, so giving them all the same score introduces some error (deviation from their "true" score or ability).
  ◦ Subject inconsistency – a person could have a good or bad day when being tested, which means their performance differs from day to day.
  ◦ Poor test conditions – noise or other distractions when taking a written test, or a slippery surface when taking a running test. Poor conditions rather than the subject's ability affect the subject's score.
  ◦ Poorly constructed test – writing bad test questions that no one can understand, so that the test takers end up guessing. Guessing is not a measure of academic ability.
  ◦ Poor test equipment – the equipment (e.g., the gas analyzer when measuring VO2) is inconsistent because it is malfunctioning or improperly calibrated.
• Theoretically, you could determine a person's true score (T) by calculating the mean of an infinite number of tests.
  ◦ While it is not possible to administer a test an infinite number of times, it is important to understand that the mean of several administrations (or trials) is the most accurate representation of the subject's true score. Why? In general (but not always), the more trials you have, the more accurately (reliably) the average score represents the subject's true score.
  ◦ Measurement error is assumed to be random and normally distributed. Thus, the mean error over several trials will be 0, with predictable variability on either side of 0.
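To make this concrete, here is a minimal Python sketch (not part of the course materials) that simulates observed scores as X = T + E with random, normally distributed error. The true score of 50, error SD of 4, and trial counts are made-up values chosen only for illustration.

# Simulation sketch: averaging more trials brings the mean observed score
# closer to the true score, because random errors average toward 0.
import random

random.seed(1)

true_score = 50.0   # T: the ability we want to measure (hypothetical value)
error_sd = 4.0      # SD of the random, normally distributed error E (hypothetical)

for n_trials in (1, 5, 25, 100):
    # Each observed score X = T + E, with E drawn from a normal distribution
    scores = [true_score + random.gauss(0, error_sd) for _ in range(n_trials)]
    mean_score = sum(scores) / n_trials
    print(f"{n_trials:>3} trials: mean observed score = {mean_score:5.2f}")

With more trials, the printed means tend to settle near 50.0, which is the idea behind using the average of several trials as the best estimate of the true score.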
 
Interpreting Test Reliability
• A reliability coefficient represents the proportion of total variance that reflects true score differences among the subjects.
  ◦ A reliability coefficient can range from 0.0 (all the variance is measurement error) to 1.00 (no measurement error). In reality, all tests have some error, so reliability is never 1.00.
  ◦ A test with high reliability (≥0.70) is desired, because lower reliability indicates that a large proportion of test variance is measurement error.
  ◦ If test reliability is 0 and test scores are used to assign grades, a student's grade would be assigned purely by chance, similar to flipping a coin or rolling dice!
• High reliability indicates that the test is measuring something; validity studies (Topic 12) determine what the test is measuring.
  ◦ High test reliability is required for test validity. A test cannot be valid if it is not reliable.
  ◦ Low reliability means most of the observed test variance is measurement error – due to chance.
  ◦ If test variance is largely due to chance, the test is not measuring anything.
  ◦ If a test is not measuring anything, it cannot be a valid measure of anything.
 
Determining Reliability
• Test reliability is always established for a defined population; the reliability of a test in one population may not be the same as in other populations.
  ◦ Test variance is central to reliability.
  ◦ Since a test score (X) consists of true and error components, the total variance (σx²) of a test administered to a group consists of true score variance (σt²) and error variance (σe²):

        Test Variance = True Score Variance + Error Variance
        σx² = σt² + σe²
• To illustrate, suppose the SD of a test administered to students was 2.0 (thus, total test variance = 2.0² = 4.0). All of the students guessed on every question, which means that getting the questions correct was due to luck or chance (σe² = 4.0). Since guessing is completely random and has nothing to do with ability (true score), there would be no true score variance (σt² = 0) and the components would be:
  ◦ Total Variance = True Score Variance + Error Variance: 4.0 = 0.0 + 4.0
• Test reliability (Rxx) is calculated from these variances:

        Rxx = (σx² − σe²) / σx² = σt² / σx²

  ◦ For this example, test reliability would be: Rxx = [(4.0 − 4.0)/4.0] = 0/4.0 = 0.0
  ◦ The reliability of this test is 0, which means that all test variance was due to measurement error. The test did not measure anything.
• As another example of test reliability, let us assume we have a test with a total variance of 40 and an error variance of 4. In that case we would have the following:
  ◦ The reliability of the test would be: Rxx = [(40 − 4)/40] = 36/40 = 0.90.
  ◦ The test reliability would be 0.90: 90% of the test variance is attributable to true score differences; only 10% of the total test variance was due to measurement error. This test has good reliability for detecting differences among subjects in the ability or trait being measured.
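The same arithmetic can be written as a short Python sketch. The reliability() helper below is hypothetical (it is not from the reading); it simply applies Rxx = (σx² − σe²) / σx² to the two examples above.

# Reliability coefficient: proportion of total variance that is true score variance.
def reliability(total_variance: float, error_variance: float) -> float:
    return (total_variance - error_variance) / total_variance

# Example 1: all variance is error (everyone guessed) -> Rxx = 0.0
print(reliability(4.0, 4.0))

# Example 2: total variance 40, error variance 4 -> Rxx = 0.90
print(reliability(40.0, 4.0))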
 
Types of Reliability

Stability or “Test-Retest” Reliability
• Involves administering the same test on two or more different occasions.
• Typically the tests are administered within a 7-day period to ensure that the true score does not change during the testing period.
• This method can be used with any test, but it is often used with tests that cannot be administered twice within the same day. An example would be an endurance test like the 1.5-mile run.
• The stability reliability of a scorer (i.e., comparing multiple scores assigned by a single judge on different occasions) is called intrarater reliability. (How is this different from interrater reliability, or objectivity?)
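As an illustration only, the Python sketch below computes a Pearson correlation between two administrations of a test to the same hypothetical subjects; the scores are made up, and this simple correlation is not necessarily the reliability estimator used later in this Topic.

# Test-retest sketch: the same five subjects measured on two occasions.
import statistics

day1 = [52.0, 47.5, 60.2, 55.1, 49.8]   # hypothetical scores, first administration
day2 = [53.1, 46.9, 59.5, 56.0, 50.4]   # the same subjects, retested within a week

r = statistics.correlation(day1, day2)  # Pearson r (requires Python 3.10+)
print(f"test-retest correlation: r = {r:.2f}")

A high correlation indicates that subjects were ranked consistently across the two testing occasions.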
 
Internal Consistency Reliability
• This type of reliability involves getting multiple measures within a day, usually at a single testing session. Examples are:
• Written test. The items are the multiple measures. The person's score is the sum of all items answered correctly.
• Psychological instrument. These survey or interview instruments consist of several items that are often scored with 1 to 5 points. The person responds on a 5-point scale indicating how characteristic the described behavior is of them. The person's score is usually the sum of all items.
• Judge's Ratings. A judge rates the performances of several individuals. Some examples are: figure skaters, divers, or gymnasts; allied health students performing clinical procedures; or high school students trying out for the drill team or cheerleading squad. The judge rates several aspects of the performance, such as the components of a figure skating routine or competitive dive; each aspect is rated independently of the other aspects. The final score is typically the sum or average of the judge's ratings. (Comparing the final scores between multiple judges is objectivity.)
• Multiple-trial test. Many motor performance tests can be administered several times within a day. An example is an isometric strength test, which requires the subject to exert maximum effort for six seconds. The recommended test procedure is to administer a warm-up trial at 50% effort and then two trials at maximum effort. The average of the two maximum trials is the individual's score.
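For a sense of how internal consistency can be quantified, here is a brief Python sketch of Cronbach's alpha, one widely used internal-consistency coefficient (not necessarily the coefficient used in this course); the item scores are made up for illustration.

# Cronbach's alpha sketch: rows are subjects, columns are items (or trials)
# from a single testing session. Scores are hypothetical 1-to-5 ratings.
from statistics import pvariance

scores = [
    [4, 5, 4, 5],   # subject 1's ratings on four items
    [2, 3, 2, 2],   # subject 2
    [5, 5, 4, 5],   # subject 3
    [3, 3, 3, 4],   # subject 4
    [1, 2, 2, 1],   # subject 5
]

k = len(scores[0])                                    # number of items
item_vars = [pvariance(col) for col in zip(*scores)]  # variance of each item
total_var = pvariance([sum(row) for row in scores])   # variance of summed scores

alpha = (k / (k - 1)) * (1 - sum(item_vars) / total_var)
print(f"Cronbach's alpha = {alpha:.2f}")

Higher alpha values indicate that the items vary together, i.e., that the summed score reflects a consistent underlying trait rather than item-by-item noise.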
 
Click to go to the next section (Section 11.2)