Test Item Performance: The Item Analysis
Table of Contents
Summary of Test Statistics
Test Frequency Distribution
Item Difficulty and Discrimination: Quintile Table
Interpreting Item Statistics
MERMAC - Test Analysis and Questionnaire Package
The ITEM ANALYSIS output consists of four parts: a summary of test statistics, a test frequency distribution, an item quintile table, and item statistics. This analysis can be processed for an entire class. To compare the item analysis for different test forms, the analysis can be processed by test form. The Division of Measurement and Evaluation staff is available to help instructors interpret their item analysis data.
Summary of Test Statistics
Part I of the ITEM ANALYSIS consists of a summary of the following statistics:
* * * MERMAC -- TEST ANALYSIS AND QUESTIONNAIRE PACKAGE * * *
SAMPLE ITEM ANALYSIS
SUMMARY OF TEST STATISTICS
| STATISTIC | DESCRIPTION | VALUE |
| NUMBER OF ITEMS | Number of items on the test. | 80 |
| MEAN SCORE | Arithmetic average; the sum of all scores divided by the number of scores. | 60.92 |
| MEDIAN SCORE | The raw score point that divides the raw score distribution in half; 50% of the scores fall above the median and 50% fall below. | 63.15 |
| STANDARD DEVIATION | Measure of the spread or variability of the score distribution. The higher the value of the standard deviation, the better the test is discriminating among student performance levels. | 12.24 |
| RELIABILITY (KR-20) | An estimate of test reliability indicating the internal consistency of the test. The range of the reliability is from 0.00 to 1.00. A reliability of .70 or better is desirable for classroom tests. | 0.915 |
| RELIABILITY (KR-21) | When item difficulties are approximately equal, an estimate of test reliability indicating the internal consistency of the test. The range of the reliability is from 0.00 to 1.00. A reliability of .70 or better is desirable for classroom tests. | 0.915 |
| S.E. OF MEASUREMENT | The accuracy of measurement expressed in the test score scale. The larger the standard error, the less precise the measure of student achievement. About two-thirds of the time, a test taker's obtained score falls within one standard error of measurement of the true score. | 3.58 |
| POSSIBLE LOW SCORE | The possible low score. | 0 |
| POSSIBLE HIGH SCORE | The possible high score. | 80 |
| OBTAINED LOW SCORE | The obtained low score. | 0 |
| OBTAINED HIGH SCORE | The obtained high score. | 80 |
| NUMBER OF SCORES | The number of answer sheets submitted for scoring. | 603 |
| BLANK SCORES [1] | Number of test scores that could not be computed. | 0 |
| INVALID SCORES | Number of test scores out of the range specified by the user. | 0 |
| VALID SCORES | Only those scores that fall within the range specified by the user are included in the analysis, so that the user has the option of disregarding certain scores. | 603 |
[1] Blank scores and invalid scores (those falling outside the specified range) are counted but omitted from the analysis.
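The reliability and standard-error entries above can be reproduced from their usual textbook formulas. The following is a minimal sketch, not the MERMAC implementation: the function names are hypothetical, and the use of the population variance is an assumption (the program's exact convention is not stated here).

```python
from math import sqrt
from statistics import mean, pvariance


def kr20(responses):
    """KR-20 from a matrix of 0/1 item scores (one row per student)."""
    k = len(responses[0])                      # number of items
    totals = [sum(row) for row in responses]   # total score per student
    var_total = pvariance(totals)              # population variance (assumed)
    # Sum of p*q over items, where p is the proportion answering correctly.
    pq = 0.0
    for i in range(k):
        p = mean(row[i] for row in responses)
        pq += p * (1 - p)
    return k / (k - 1) * (1 - pq / var_total)


def kr21(k, mean_score, variance):
    """KR-21: needs only the item count, mean, and variance of total scores."""
    return k / (k - 1) * (1 - mean_score * (k - mean_score) / (k * variance))


def sem(sd, reliability):
    """Standard error of measurement on the raw-score scale."""
    return sd * sqrt(1 - reliability)


# Values taken from the sample summary above.
print(round(kr21(80, 60.92, 12.24 ** 2), 3))  # about 0.914
print(round(sem(12.24, 0.915), 2))            # about 3.57
```

With the sample's 80 items, mean of 60.92, and standard deviation of 12.24, this sketch gives a KR-21 of about 0.914 and a standard error of about 3.57, close to the printed 0.915 and 3.58. The small differences may come from rounding of the printed inputs or from the variance convention.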
Test Frequency Distribution
Part II of the ITEM ANALYSIS program displays a test frequency distribution. The raw scores are ordered from high to low with corresponding statistics:
- Standard score--a linear transformation of the raw score that sets
the mean equal to 500 and the standard deviation equal to 100; in normal
score distributions for classes of 500 students or more, the standard score
range usually falls between 200 and 800 (plus or minus three standard deviations
of the mean); for classes with fewer than 30 students, the standard score
range usually falls within two standard deviations of the mean, i.e., a range
of 300 to 700.
- Percentile rank--the percentage of individuals who received a score
lower than the given score plus the percentage of half the individuals who
received the given score. This measure indicates a person's relative position
within a group.
- Percentage of people in the total group who received the given score.
- Frequency--in a test analysis, the number of individuals who receive
a given score.
- Cumulative frequency--in a test analysis, the number of individuals who score at or below a given score value.
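The definitions in this list translate directly into arithmetic. Here is a minimal sketch with hypothetical function names and a made-up four-student frequency table; how the program rounds or truncates its printed values is not specified here, so the sketch returns unrounded numbers.

```python
def standard_score(raw, mean, sd):
    """Linear transformation to a scale with mean 500 and SD 100."""
    return 500 + 100 * (raw - mean) / sd


def percentile_rank(freq, score):
    """Percent scoring below `score` plus half the percent scoring exactly `score`.

    `freq` maps each raw score to the number of students who obtained it.
    """
    n = sum(freq.values())
    below = sum(f for s, f in freq.items() if s < score)
    at = freq.get(score, 0)
    return 100 * (below + 0.5 * at) / n


def cumulative_frequency(freq, score):
    """Number of students scoring at or below `score`."""
    return sum(f for s, f in freq.items() if s <= score)


# Toy frequency table (illustrative only): raw score -> number of students.
freq = {1: 1, 2: 2, 3: 1}
print(percentile_rank(freq, 2))       # 50.0
print(cumulative_frequency(freq, 2))  # 3
print(standard_score(70, 60, 10))     # 600.0
```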
* * * MERMAC -- TEST ANALYSIS AND QUESTIONNAIRE PACKAGE * * *
SAMPLE ITEM ANALYSIS
TEST FREQUENCY DISTRIBUTION
RAW   STANDARD PER-                 CUM
SCORE SCORE    CENTILE PERCENT FREQ FREQ EACH * REPRESENTS 1 PERSON(S)
91 708 99 0.3 2 602 **
90 700 99 0.0 0 600
89 691 99 0.2 1 600 *
88 683 99 0.8 5 599 *****
87 675 99 0.3 2 594 **
86 666 98 1.0 6 592 ******
85 658 97 1.3 8 586 ********
84 649 96 1.2 7 578 *******
83 641 95 2.0 12 571 ************
82 632 93 1.7 10 559 **********
81 624 91 1.5 9 549 *********
80 615 90 1.5 9 540 *********
79 607 88 2.8 17 531 *****************
78 598 85 4.1 25 514 *************************
77 590 81 2.3 14 489 **************
76 562 79 4.0 24 475 ************************
75 573 75 2.2 13 451 *************
74 565 73 3.3 20 438 ********************
73 556 69 2.0 12 418 ************
72 548 67 3.8 23 406 ***********************
71 539 64 2.8 17 383 *****************
70 531 61 3.0 18 366 ******************
69 522 58 3.2 19 326 *******************
67 505 51 3.6 22 307 **********************
66 497 47 3.8 23 285 ***********************
65 489 43 2.7 16 262 ****************
64 480 41 3.2 19 246 *******************
63 472 38 2.5 15 227 ***************
62 463 35 3.2 19 212 *******************
61 455 32 2.5 15 193 ***************
60 446 30 1.8 11 178 ***********
59 438 28 2.3 14 167 **************
58 429 25 3.0 18 153 ******************
57 421 22 1.7 10 135 **********
56 413 21 3.2 12 106 ************
54 396 16 1.7 10 94 **********
53 387 14 1.5 9 84 *********
52 379 12 1.2 7 75 *******
51 370 11 2.0 12 68 ************
50 362 9 1.2 7 56 *******
49 353 8 1.3 8 49 ********
48 345 7 1.7 10 41 **********
Item Difficulty and Discrimination: Quintile Table
Part III of the ITEM ANALYSIS output, an item quintile table, can aid in the interpretation of Part IV of the output. Part IV compares the responses to each item with the total score distribution. A good item discriminates between students who scored high and students who scored low on the examination as a whole. In order to compare different student performance levels on the examination, the score distribution is divided into fifths, or quintiles. The first fifth includes students who scored between the 81st and 100th percentiles; the second fifth includes students who scored between the 61st and 80th percentiles, and so forth. When the score distribution is skewed, more than one-fifth of the students may have scores within a given quintile and, as a result, fewer than one-fifth of the students may score within another quintile. The table indicates the sample size, the proportion of the distribution, and the score ranges within each fifth.
* * * MERMAC -- TEST ANALYSIS AND QUESTIONNAIRE PACKAGE * * *
THE QUINTILE GRAPH AND MATRIX OF RESPONSES
APPEARING WITH EACH ITEM ARE BASED ON THE
STATISTICS INDICATED IN THE TABLE BELOW:
| QUINTILE | SAMPLE SIZE | PROPORTION | SCORE RANGE |
| 1ST | 128 | 0.21 | 77 - 92 |
| 2ND | 127 | 0.21 | 70 - 76 |
| 3RD | 121 | 0.20 | 64 - 69 |
| 4TH | 121 | 0.20 | 56 - 63 |
| 5TH | 106 | 0.18 | 24 - 55 |
Interpreting Item Statistics
Part IV of ITEM ANALYSIS presents item statistics that can help determine which items are good and which need improvement or deletion from the examination. The quintile graph on the left side of the output indicates the percent of students within each fifth who answered the item correctly. A good, discriminating item is one in which students who scored well on the examination chose the correct alternative more frequently than students who did not score well. Therefore, the graph should form a line going from the bottom left-hand corner to the top right-hand corner. Item 1 in the sample output shows an example of this type of positive linear relationship. Item 2 in the sample output also portrays a discriminating item; although few students correctly answered the item, the students in the first fifth answered it correctly more frequently than the students in the rest of the score distribution. Item 3 is a poor item: the graph indicates no relationship between the fifths of the score distribution and the percentage of correct responses by fifths. However, it is likely that this item was miskeyed by the instructor; note the response pattern for alternative B.
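The percentages plotted in the quintile graph can be recomputed from the matrix of responses. The sketch below uses the counts printed for item 1 in the sample output; the function and variable names are hypothetical.

```python
COLUMNS = ["A", "B", "C", "D", "E", "OMIT"]

# Item 1 matrix of responses by fifths, copied from the sample output.
ITEM1 = {
    "1ST": [0, 25, 1, 0, 102, 0],
    "2ND": [1, 45, 6, 0, 75, 0],
    "3RD": [1, 63, 5, 3, 49, 0],
    "4TH": [2, 76, 9, 0, 34, 0],
    "5TH": [11, 73, 13, 4, 5, 0],
}


def percent_correct_by_fifth(matrix, key):
    """Percent of students in each fifth who chose the keyed alternative."""
    idx = COLUMNS.index(key)
    return {fifth: 100 * row[idx] / sum(row) for fifth, row in matrix.items()}


for fifth, pct in percent_correct_by_fifth(ITEM1, "E").items():
    print(fifth, round(pct, 1))
# 1ST 79.7, 2ND 59.1, 3RD 40.5, 4TH 28.1, 5TH 4.7
```

The percentages fall steadily from the top fifth to the bottom fifth, which is the pattern described above for a discriminating item.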
A. Evaluating Item Distractors: Matrix of Responses
On the right-hand side of the output, a matrix of responses by fifths
shows the frequency of students within each fifth who answered each
alternative and who omitted the item. This information can help point
out what distractors, or incorrect alternatives, are not successful
because: (a) they are not plausible answers and few or no students
chose the alternative (see alternatives D and E, item 2), or (b) too
many students, especially students in the top fifths of the distribution,
chose the incorrect alternative instead of the correct response (see
alternative B, item 3). A good item will result in students in the
top fifths answering the correct response more frequently than students
in the lower fifths, and students in the lower fifths answering the
incorrect alternative more frequently than students in the top fifths.
The matrix of responses prints the correct response of the item on
the right-hand side and encloses the correct response in the matrix
in parentheses.
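The two distractor problems described above can be screened for mechanically. This is a rough sketch only: the function name, the 0.01 cutoff for "rarely chosen", and the rule that a distractor with a positive RPBI above the key's suggests a miskey are all illustrative assumptions, not rules from the program. The PROP and RPBI values are copied from items 2 and 3 of the sample output.

```python
def flag_distractors(prop, rpbi, key, min_prop=0.01):
    """Return a dict of distractors that deserve a closer look.

    prop, rpbi: dicts mapping each alternative to its PROP and RPBI.
    key: the keyed (correct) alternative.
    min_prop: hypothetical cutoff below which a distractor counts as implausible.
    """
    flags = {}
    for alt in prop:
        if alt == key:
            continue
        if prop[alt] < min_prop:
            flags[alt] = "rarely chosen"
        elif rpbi[alt] > 0 and rpbi[alt] > rpbi[key]:
            flags[alt] = "attracts high scorers; check the key"
    return flags


item2_prop = {"A": 0.23, "B": 0.57, "C": 0.19, "D": 0.00, "E": 0.00}
item2_rpbi = {"A": 0.43, "B": -0.33, "C": -0.05, "D": 0.00, "E": 0.00}
item3_prop = {"A": 0.12, "B": 0.72, "C": 0.02, "D": 0.08, "E": 0.05}
item3_rpbi = {"A": -0.24, "B": 0.45, "C": -0.16, "D": -0.17, "E": 0.13}

print(flag_distractors(item2_prop, item2_rpbi, key="A"))
# {'D': 'rarely chosen', 'E': 'rarely chosen'}
print(flag_distractors(item3_prop, item3_rpbi, key="E"))
# {'B': 'attracts high scorers; check the key'}
```

Any flag from a screen like this is only a prompt to read the item; the decision to revise a distractor still rests on its content.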
B. Item Difficulty: The PROP Statistic
The proportion (PROP) of students who answer each alternative and who omit the item is printed in the first row below the matrix. The item difficulty is the proportion of subjects in a sample who correctly answer the item. In order to obtain the maximum spread of student scores, it is best to use items with moderate difficulties. Moderate difficulty can be defined as the point halfway between a perfect score and a chance score. For a five-choice item, the moderate difficulty level is .60, or a range between .50 and .70 (because 100% correct is perfect and we would expect 20% of the group to answer the item correctly by blind guessing).
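Both calculations in this paragraph are short enough to check by hand. The sketch below, with hypothetical function names, computes the halfway point between chance and perfect performance and rebuilds the PROP row for item 1 from the counts in the sample output.

```python
def moderate_difficulty(n_choices):
    """Point halfway between the chance score (1/n_choices) and a perfect score (1.0)."""
    chance = 1 / n_choices
    return (1 + chance) / 2


def prop_row(matrix):
    """Proportion of all students choosing each column (alternatives, then OMIT)."""
    total = sum(sum(row) for row in matrix)
    return [sum(col) / total for col in zip(*matrix)]


# Item 1 matrix of responses by fifths (columns A, B, C, D, E, OMIT).
ITEM1 = [
    [0, 25, 1, 0, 102, 0],
    [1, 45, 6, 0, 75, 0],
    [1, 63, 5, 3, 49, 0],
    [2, 76, 9, 0, 34, 0],
    [11, 73, 13, 4, 5, 0],
]

print(round(moderate_difficulty(5), 2))        # 0.6
print([round(p, 2) for p in prop_row(ITEM1)])  # [0.02, 0.47, 0.06, 0.01, 0.44, 0.0]
```

The recomputed proportions match the PROP row printed for item 1, where the keyed alternative E has a difficulty of .44.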
Evaluating Item Difficulty. For the most part, items which are too easy or too difficult cannot discriminate adequately between student performance levels. Item 2 in the sample output is an exception; although the item difficulty is .23, the item is a good, discriminating one. In item 4, everyone correctly answered the item; the item difficulty is 1.00. Such an item does not discriminate at all between good and poor students, and therefore does not contribute statistically to the effectiveness of the examination. However, if one of the instructor's goals is to check that all students grasp certain basic concepts and if the examination is long enough to contain a sufficient number of discriminating items, then such an item may remain on the examination.
C. Item Discrimination: Point Biserial Correlation (RPBI)
Interpreting the RPBI Statistic. The point biserial correlation (RPBI) for each alternative and omit is printed below the PROP row. It indicates the relationship between the item response and the total test score within the group tested, i.e., it measures the discriminating power of an item. It is interpreted similarly to other correlation coefficients. Assuming that the total test score accurately discriminates among individuals in the group tested, high positive RPBIs for the correct responses represent the most discriminating items. That is, students who chose the correct response scored well on the examination, whereas students who did not choose the correct response did not score well. It is also useful to check the RPBIs for the item distractors, or incorrect alternatives. The opposite correlation between total score and choice of alternative is expected for the incorrect vs. the correct alternative. Where a high positive correlation is desired for the RPBI of a correct alternative, a high negative correlation is good for the RPBI of a distractor, i.e., students who answer with an incorrect alternative did not score well on the total examination. Due to restrictions incurred when correlating a continuous variable (total examination score) with a dichotomous variable (response vs. nonresponse of an alternative), the highest possible RPBI is .80 instead of the usual maximum value of 1.00 for a correlation. This maximum RPBI is directly influenced by the item difficulty level. The maximum RPBI value of .80 occurs with items of moderate difficulty level; the further the difficulty level deviates from the moderate difficulty level in either direction, the lower the ceiling on RPBI. For example, the maximum RPBI is about .58 for difficulty levels of .10 or .90.
Therefore, in order to maximize item discrimination, items of moderate difficulty level are preferred, although easy and difficult items still can be discriminating (see item 2 in the sample output).
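The statistic and its ceiling can both be sketched in a few lines. The function names are hypothetical. The ceiling formula assumes the total score is normally distributed and that the item splits the group perfectly at a cut score; that assumption is supplied here for illustration and reproduces the .80 and .58 figures quoted above. Returning 0.0 when everyone (or no one) chose an alternative mirrors the 0.00 printed for item 4, although the correlation is strictly undefined in that case.

```python
from math import sqrt
from statistics import NormalDist, mean, pstdev


def point_biserial(chose, totals):
    """Correlation between a 0/1 choice indicator and total test scores."""
    p = mean(chose)                  # proportion choosing the alternative
    sd = pstdev(totals)              # population SD of total scores
    if p in (0, 1) or sd == 0:
        return 0.0                   # undefined; reported as 0.0 by convention (assumed)
    m1 = mean(t for c, t in zip(chose, totals) if c)      # mean of choosers
    m0 = mean(t for c, t in zip(chose, totals) if not c)  # mean of non-choosers
    return (m1 - m0) / sd * sqrt(p * (1 - p))


def max_point_biserial(p):
    """Ceiling on RPBI at difficulty p, assuming normally distributed total scores."""
    nd = NormalDist()
    z = nd.inv_cdf(p)
    return nd.pdf(z) / sqrt(p * (1 - p))


print(round(point_biserial([0, 0, 1, 1], [1, 2, 3, 4]), 3))  # 0.894
for p in (0.1, 0.5, 0.9):
    print(p, round(max_point_biserial(p), 3))
# 0.1 0.585, 0.5 0.798, 0.9 0.585
```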
Evaluating Item Discrimination. When an instructor examines the item analysis data, the RPBI is an important indicator in deciding which items are discriminating and should be retained, and which items are not discriminating and should be revised or replaced by a better item (other content considerations aside). The quintile graph also illustrates this same relationship between item response and total scores. However, the RPBI is a more accurate representation of this relationship. An item with an RPBI of .25 or below should be examined critically for revision or deletion; items with RPBIs of .40 and above are good discriminators. Note that all items, not only those with RPBIs lower than .25, can be improved. An examination of the matrix of responses by fifths for all items may point out weaknesses, such as implausible distractors, that can be reduced by modifying the item.
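The two cutoffs above can be applied to the keyed-alternative RPBIs from the sample output as a quick first pass. The function name and the "intermediate" label for values between the cutoffs are assumptions made for this sketch.

```python
def classify_rpbi(rpbi):
    """Apply the .25 and .40 guidelines to the RPBI of the keyed alternative."""
    if rpbi <= 0.25:
        return "examine for revision or deletion"
    if rpbi >= 0.40:
        return "good discriminator"
    return "intermediate"  # label assumed; the guidelines do not name this band


# RPBI of the correct response for items 1-4 in the sample output.
sample = {1: 0.51, 2: 0.43, 3: 0.13, 4: 0.00}
for item, rpbi in sample.items():
    print(item, classify_rpbi(rpbi))
```

Items 1 and 2 come out as good discriminators, while items 3 and 4 are marked for examination, in line with the discussion of the sample items above.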
It is important to keep in mind that the statistical functioning of an item should not be the sole basis for deleting or retaining an item. The most important quality of a classroom test is its validity, the extent to which items measure relevant tasks. Items that perform poorly statistically might be retained (and perhaps revised) if they correspond to specific instructional objectives in the course. Items that perform well statistically but are not related to specific instructional objectives should be reviewed carefully before being reused.
* * * MERMAC -- TEST ANALYSIS AND QUESTIONNAIRE PACKAGE * * *
ITEM 1: E IS CORRECT RESPONSE
[Quintile graph: PERCENT OF CORRECT RESPONSE BY FIFTHS, horizontal axis 0 to 100]
MATRIX OF RESPONSES BY FIFTHS
| FIFTH | A | B | C | D | (E) | OMIT |
| 1ST | 0 | 25 | 1 | 0 | 102 | 0 |
| 2ND | 1 | 45 | 6 | 0 | 75 | 0 |
| 3RD | 1 | 63 | 5 | 3 | 49 | 0 |
| 4TH | 2 | 76 | 9 | 0 | 34 | 0 |
| 5TH | 11 | 73 | 13 | 4 | 5 | 0 |
| PROP | 0.02 | 0.47 | 0.06 | 0.01 | (0.44) | 0.00 |
| RPBI | -0.20 | -0.33 | -0.20 | -0.13 | (0.51) | 0.00 |
ITEM 2: A IS CORRECT RESPONSE
[Quintile graph: PERCENT OF CORRECT RESPONSE BY FIFTHS, horizontal axis 0 to 100]
MATRIX OF RESPONSES BY FIFTHS
| FIFTH | (A) | B | C | D | E | OMIT |
| 1ST | 83 | 35 | 10 | 0 | 0 | 0 |
| 2ND | 19 | 85 | 23 | 0 | 0 | 0 |
| 3RD | 17 | 67 | 37 | 0 | 0 | 0 |
| 4TH | 13 | 78 | 30 | 0 | 0 | 0 |
| 5TH | 6 | 84 | 16 | 0 | 0 | 0 |
| PROP | (0.23) | 0.57 | 0.19 | 0.00 | 0.00 | 0.00 |
| RPBI | (0.43) | -0.33 | -0.05 | 0.00 | 0.00 | 0.00 |
ITEM 3: E IS CORRECT RESPONSE
[Quintile graph: PERCENT OF CORRECT RESPONSE BY FIFTHS, horizontal axis 0 to 100]
MATRIX OF RESPONSES BY FIFTHS
| FIFTH | A | B | C | D | (E) | OMIT |
| 1ST | 2 | 125 | 0 | 1 | 0 | 0 |
| 2ND | 6 | 109 | 0 | 8 | 4 | 0 |
| 3RD | 14 | 86 | 4 | 7 | 10 | 0 |
| 4TH | 23 | 71 | 2 | 19 | 6 | 0 |
| 5TH | 29 | 45 | 8 | 15 | 8 | 1 |
| PROP | 0.12 | 0.72 | 0.02 | 0.08 | (0.05) | 0.00 |
| RPBI | -0.24 | 0.45 | -0.16 | -0.17 | (0.13) | -0.14 |
ITEM 4: E IS CORRECT RESPONSE
[Quintile graph: PERCENT OF CORRECT RESPONSE BY FIFTHS, horizontal axis 0 to 100]
MATRIX OF RESPONSES BY FIFTHS
| FIFTH | A | B | C | D | (E) | OMIT |
| 1ST | 0 | 0 | 0 | 0 | 128 | 0 |
| 2ND | 0 | 0 | 0 | 0 | 127 | 0 |
| 3RD | 0 | 0 | 0 | 0 | 121 | 0 |
| 4TH | 0 | 0 | 0 | 0 | 121 | 0 |
| 5TH | 0 | 0 | 0 | 0 | 106 | 0 |
| PROP | 0.00 | 0.00 | 0.00 | 0.00 | (1.00) | 0.00 |
| RPBI | 0.00 | 0.00 | 0.00 | 0.00 | (0.00) | 0.00 |

