
Errors in test results

The “quick guide” part 1

CHARLES DARR

Test scores reduce complex phenomena, such as achievement in mathematics or ability to comprehend a written text, to numbers. If a test has done a good job of testing what it was designed to assess, then test results can be useful, especially when we want to do things like track progress and make comparisons. However, it is important to understand that test scores are not error free. This Assessment News article is the first of two that provide a “quick guide” to some sources of error that can make our use of test results problematic and the implications this has for practice. In this first article, we look at the impact of measurement error on results for individuals and groups.

Measurement error and test results for individual students

No test is perfectly reliable. All test scores for individuals include a component of random error. The random error reflects the many things that can lead to variations in the way a student will perform on any one day. Test manuals will usually provide an estimate of the random error associated with scores on the test. Sometimes this is provided as an average error that can be applied to all scores and sometimes as more specific errors associated with each possible score. The random error is often reported as the standard error of measurement (SEM) and can be used to describe a range within which we can be reasonably sure (that is, there is a 68 percent chance) that the true score for a student actually lies. The true score can be thought of as the average of the scores a student would receive if they were able to sit the test again and again under standardised conditions, keeping no memory of each instance.

For example, the SEM for a student’s PAT: Mathematics scale score of 56 might be estimated to be 3.4 patm units. This means the student’s test result should be reported as 56 plus or minus 3.4 (56 ± 3.4) patm units. In other words, there is approximately a 68 percent chance the student’s true score lies in the range 52.6 to 59.4 patm units.
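To make the arithmetic concrete, here is a minimal sketch in Python of turning a score and its SEM into a range. The score of 56 and the SEM of 3.4 patm units come from the example above; the function name score_range and the 1.96 multiplier for an approximate 95 percent band are illustrative assumptions, not part of any PAT reporting tool.

```python
def score_range(score, sem, multiplier=1.0):
    """Return the (low, high) band around an observed score.

    With multiplier=1.0 the band covers roughly 68 percent of plausible
    true scores; with multiplier=1.96 it covers roughly 95 percent.
    """
    return (score - multiplier * sem, score + multiplier * sem)

observed_score, sem = 56.0, 3.4

low, high = score_range(observed_score, sem)
print(f"68% band: {low:.1f} to {high:.1f} patm units")      # 52.6 to 59.4

low95, high95 = score_range(observed_score, sem, multiplier=1.96)
print(f"95% band: {low95:.1f} to {high95:.1f} patm units")
```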

The amount of measurement error associated with a result is affected by things like the reliability of the test, and how well the test has “targeted” the student. This latter point means that if a test was too hard or too easy for a student there will be more error associated with the result (we have less information about their true achievement level if they get almost all the questions correct, or almost all incorrect).

We need to keep the size of the measurement error in mind when we compare two results, or compare a result against an expected level of achievement (for instance, a standard). It is good practice to think of a score as a range (taking the SEM into account), rather than a fixed point, and to communicate test results this way. As a rule of thumb, when we compare two test scores and these ranges overlap, in the absence of other information it is unwise to claim that any difference between the scores is something more than a chance event. For example, if a student’s PAT: Mathematics result in March was 56 ± 3.4 (52.6 to 59.4) patm units and in November was 59.0 ± 3.4 (55.6 to 62.4) patm units, we cannot be sure that the apparent progress was more than the kind of random variation in scores that might just occur by chance. However, had their end of year score been 63 ± 3.4 (59.6 to 66.4) patm units, we can have some confidence that their improved result does indicate real progress.
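As a rough illustration of this rule of thumb, the following Python sketch checks whether the ±1 SEM bands around two scores overlap. The March and November scores are taken from the example above; the function name ranges_overlap is an illustrative assumption, and an overlap means only that chance cannot be ruled out, not that no progress has occurred.

```python
def ranges_overlap(score_a, sem_a, score_b, sem_b):
    """Return True if the +/- 1 SEM bands around two scores overlap."""
    low_a, high_a = score_a - sem_a, score_a + sem_a
    low_b, high_b = score_b - sem_b, score_b + sem_b
    return low_a <= high_b and low_b <= high_a

march_score, november_score, sem = 56.0, 59.0, 3.4

if ranges_overlap(march_score, sem, november_score, sem):
    print("The bands overlap: the difference could just be chance.")
else:
    print("The bands do not overlap: we can be more confident the change is real.")
```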

Measurement error and groups

How do things change when we combine individual test scores to work out the average score for a group? First of all, measurement error becomes less of an issue. As we aggregate information, the measurement errors for individuals tend to cancel each other out—those students who “benefited” from the random error associated with their results are countered by those who were “disadvantaged”. By the time we have 30 or so students in a group, the measurement error associated with the average score for the group is small compared to the measurement error associated with an individual’s score. In the case of PAT: Mathematics, the standard error of measurement associated with the average score for a group of 30 students will be about 0.66 patm units.
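A short sketch shows why the error shrinks as scores are aggregated. Assuming each student’s measurement error is independent, the SEM of a group average is roughly the individual SEM divided by the square root of the group size; the individual SEM of 3.6 patm units used here is an illustrative assumption chosen so that the result comes out close to the 0.66 patm units quoted above.

```python
import math

def sem_of_group_mean(individual_sem, n_students):
    """Approximate SEM of a group average, assuming independent errors."""
    return individual_sem / math.sqrt(n_students)

# With an assumed individual SEM of 3.6 patm units and 30 students,
# the group-level SEM works out at about 0.66 patm units.
print(f"{sem_of_group_mean(3.6, 30):.2f} patm units")
```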

Although random measurement error is less of an issue at this group level, test results for a group can also be affected by systematic (nonrandom) errors. Systematic errors consistently “push” results up or down across a group of students and are sometimes referred to as bias. For example, if a group of students were asked to complete a test in a crowded and noisy classroom, their results would be liable to be biased downwards.

Systematic errors like this are difficult to quantify, and it is important to note that if a set of results has been systematically affected by error (that is, the results have been biased in one direction), combining the results to calculate a group average will not get rid of the systematic error. The average score will also be biased. One of the reasons tests are often administered under standardised conditions is to control for irrelevant factors that could introduce error and, across a class, bias. Here “standardised” means having set expectations about when, where and how the assessment should be administered. Being haphazard about the way an assessment is administered to students can make it difficult to use the results in any comparative way.1

Implications for practice

It is important to take measurement error into account when comparing test results, whether it is a recent result and a past result for the same student or average scores for a class before and after a unit of work. Concluding that any differences between results represent real differences in achievement, without first considering the error inherent in the results, can lead to poor decision making. Table 1 lists some of the times we compare test results and the possible consequences of not taking error into account. The list is by no means exhaustive, but is representative of the kinds of issues that can occur.

Concluding thoughts

The presence of measurement error in test results means we have to be careful when using results to reach conclusions and make decisions. As seen above, ignoring the error can lead to faulty decision making. The fact that results are subject to error doesn’t make them useless, however. In many cases, supplementing a test result with other “measures”, formal and informal, will lead to robust platforms for decision making. The strongest support for our conclusions and decisions comes when we have several measurements taken over time and we can begin to study trends. When patterns within a learning area emerge over time and/or over different measurements, we can become more “relaxed” about the imprecision of the results on any one measure and more confident that something other than error might be making an impact.

This “quick guide” to error doesn’t end here. In the next Assessment News article we will look at other sources of error that can affect test results, including sampling error.

Note

1. Sometimes we might have a good reason for using an assessment in a nonstandardised way. We just need to acknowledge this when using a result to draw conclusions and make decisions.

TABLE 1 SOME CONTEXTS WHERE MEASUREMENT ERROR MATTERS

Context: When deciding whether one test score indicates a student has achieved more highly than another.
How ignoring the error could mislead: One student is said to be doing better, when actually the reverse could be true.

Context: When looking to see if a second score indicates progress has been made since a first score was recorded.
How ignoring the error could mislead: Progress is said to have occurred when in fact it hasn’t.

Context: When comparing a score with a score that has been set as a performance standard.
How ignoring the error could mislead: We say a student is below a standard, when in fact they could have reached it.

Context: When using scores from a test to put people into groups (e.g., low, medium and high achievers), or when using scores from a test to choose who should be considered for a special programme.
How ignoring the error could mislead: We put people into the “wrong” groups. For instance, when results on a single test are used to place students into groups, many students could be arbitrarily placed in one or another just because they were advantaged (or disadvantaged) by measurement error. It is difficult to “isolate” the poorest or best performers in a learning area using one set of test results. Many of the students at the bottom of a score distribution on one test will in fact score higher on a second administration just because they are unlikely to have two “bad days” in a row.

Context: When reporting test scores to parents and caregivers.
How ignoring the error could mislead: The parent or caregiver uses an imprecise score to form an inaccurate impression of the student’s level of achievement.

CHARLES DARR is a senior researcher and manager of the assessment design and reporting team at the New Zealand Council for Educational Research.

Email: charles.darr@nzcer.org.nz