Test Item Performance: The Item Analysis, Center for Innovation in Teaching and Learning, University of Illinois at Urbana-Champaign

Test Item Performance: The Item Analysis

Table of Contents

Summary of Test Statistics
Test Frequency Distribution
Item Difficulty and Discrimination: Quintile Table
Interpreting Item Statistics
MERMAC - Test Analysis and Questionnaire Package

The ITEM ANALYSIS output consists of four parts: a summary of test statistics, a test frequency distribution, an item quintile table, and item statistics. This analysis can be processed for an entire class. To compare the item analysis across different test forms, the analysis can instead be processed by test form. The Division of Measurement and Evaluation staff is available to help instructors interpret their item analysis data.
 

Summary of Test Statistics


Part I of the ITEM ANALYSIS consists of a summary of the following statistics:

* * *  MERMAC -- TEST ANALYSIS AND QUESTIONNAIRE PACKAGE  * * *
SAMPLE ITEM ANALYSIS
SUMMARY OF TEST STATISTICS

NUMBER OF ITEMS:
(Number of items on the test.)
  80
MEAN SCORE:
(Arithmetic average; the sum of all scores divided by the number of scores.)
  60.92
MEDIAN SCORE:
(The raw score point that divides the raw score distribution in half; 50% of the scores fall above the median and 50% fall below.)
  63.15
STANDARD DEVIATION:
(Measure of the spread or variability of the score distribution.  The higher the value of the standard deviation, the better the test is discriminating among student performance levels.)
  12.24
RELIABILITY (KR-20):
(An estimate of test reliability indicating the internal consistency of the test. The range of the reliability is from 0.00 to 1.00. A reliability of .70 or better is desirable for classroom tests.)
  0.915
RELIABILITY (KR-21):
(An estimate of test reliability indicating the internal consistency of the test, appropriate when item difficulties are approximately equal. The range of the reliability is from 0.00 to 1.00. A reliability of .70 or better is desirable for classroom tests.)
  0.915
S.E. OF MEASUREMENT:
(The accuracy of measurement expressed in the test score scale. The larger the standard error, the less precise the measure of student achievement. About two-thirds of the time, a test taker's obtained score falls within one standard error of measurement of his or her true score.)
  3.58
POSSIBLE LOW SCORE:
(The possible low score.)
  0
POSSIBLE HIGH SCORE:
(The possible high score.)
  80
OBTAINED LOW SCORE:
(The obtained low score.)
  0
OBTAINED HIGH SCORE:
(The obtained high score.)
  80
NUMBER OF SCORES:
(The number of answer sheets submitted for scoring.)
  603
BLANK SCORES [1]:
(Number of test scores that could not be computed.)
  0
INVALID SCORES:
(Number of test scores out of range specified by the user.)
  0
VALID SCORES:
(Only those scores that fall within the range specified by the user are included in the analysis, so the user has the option of disregarding certain scores.)
  603

[1] Blank and invalid scores (those falling outside the specified range) are counted but are omitted from the analysis.
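The reliability and standard error figures in the summary can be roughly checked with the usual textbook formulas. The sketch below is illustrative only: the function names are hypothetical, the use of the population variance (dividing by the number of scores) is an assumption, and it is not a description of how MERMAC itself computes these values.

```python
from math import sqrt


def kr20(responses):
    """KR-20 from a list of per-student lists of 0/1 item scores."""
    n = len(responses)
    k = len(responses[0])
    totals = [sum(row) for row in responses]
    mean_total = sum(totals) / n
    # Assumption: population variance of the total scores (divide by n).
    variance = sum((t - mean_total) ** 2 for t in totals) / n
    pq_sum = 0.0
    for i in range(k):
        p = sum(row[i] for row in responses) / n  # item difficulty
        pq_sum += p * (1 - p)
    return k / (k - 1) * (1 - pq_sum / variance)


def kr21(num_items, mean_score, sd):
    """KR-21 needs only the number of items, the mean, and the SD."""
    k = num_items
    return k / (k - 1) * (1 - mean_score * (k - mean_score) / (k * sd ** 2))


def standard_error_of_measurement(sd, reliability):
    """Standard error of measurement in raw score units."""
    return sd * sqrt(1 - reliability)


# Values taken from the sample summary above.
sample_kr21 = kr21(80, 60.92, 12.24)
sample_sem = standard_error_of_measurement(12.24, 0.915)
```

With the sample values, both results come out close to the printed figures (0.915 and 3.58); small differences are to be expected because the inputs shown in the summary are rounded.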


Test Frequency Distribution


Part II of the ITEM ANALYSIS program displays a test frequency distribution. The raw scores are ordered from high to low with corresponding statistics:

  1. Standard score--a linear transformation of the raw score that sets the mean equal to 500 and the standard deviation equal to 100. In normal score distributions for classes of 500 students or more, the standard score range usually falls between 200 and 800 (plus or minus three standard deviations of the mean); for classes with fewer than 30 students, the standard score range usually falls within two standard deviations of the mean, i.e., a range of 300 to 700.

  2. Percentile rank--the percentage of individuals who received a score lower than the given score plus the percentage of half the individuals who received the given score. This measure indicates a person's relative position within a group.

  3. Percentage of people in the total group who received the given score.

  4. Frequency--in a test analysis, the number of individuals who receive a given score.

  5. Cumulative frequency--in a test analysis, the number of individuals who score at or below a given score value.
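The five statistics listed above can be sketched in a few lines of code. This is a minimal illustration following the definitions given in the list; the function name and the dictionary keys are hypothetical, the population standard deviation is an assumption, and the scores are assumed not to be all identical (otherwise the standard score is undefined).

```python
from collections import Counter
from statistics import mean, pstdev


def frequency_table(scores):
    """Build one row per distinct raw score, ordered from high to low."""
    n = len(scores)
    m = mean(scores)
    sd = pstdev(scores)  # assumption: population standard deviation
    counts = Counter(scores)
    rows = []
    for raw in sorted(counts, reverse=True):
        freq = counts[raw]
        below = sum(c for s, c in counts.items() if s < raw)
        rows.append({
            "raw": raw,
            # Linear transformation to mean 500, standard deviation 100.
            "standard": 500 + 100 * (raw - m) / sd,
            # Percent scoring lower plus half the percent at this score.
            "percentile": 100 * (below + 0.5 * freq) / n,
            "percent": 100 * freq / n,
            "freq": freq,
            # Number of individuals scoring at or below this score.
            "cum_freq": below + freq,
        })
    return rows


rows = frequency_table([1, 2, 2, 3])
```

Because the percentile rank counts only half of the people at a given score, the highest score in a group never receives a percentile rank of 100.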


              * * *  MERMAC -- TEST ANALYSIS AND QUESTIONNAIRE PACKAGE  * * *
                                  SAMPLE ITEM ANALYSIS             
                              TEST FREQUENCY DISTRIBUTION

RAW SCORE   STANDARD SCORE   PERCENTILE   PERCENT   FREQ   CUM FREQ   EACH * REPRESENTS 1 PERSON(S)

92             717    99             0.2             1            603         *
91             708    99             0.3             2            602         **
90             700    99             0.0             0            600
89             691    99             0.2             1            600         *
88             683    99             0.8             5            599         *****
87             675    99             0.3             2            594         **
86             666    98             1.0             6            592         ******
85             658    97             1.3             8            586         ********
84             649    96             1.2             7            578         *******
83             641    95             2.0            12           571         ************
82             632    93             1.7            10           559         **********
81             624    91             1.5             9            549         *********
80             615    90             1.5             9            540         *********
79             607    88             2.8            17           531         *****************
78             598    85             4.1            25           514         *************************
77             590    81             2.3            14           489         **************
76             562    79             4.0            24           475         ************************
75             573    75             2.2            13           451         *************
74             565    73             3.3            20           438         ********************
73             556    69             2.0            12           418         ************
72             548    67             3.8            23           406         ***********************
71             539    64             2.8            17           383         *****************
70             531    61             3.0            18           366         ******************
69             522    58             3.2            19           326         *******************
67             505    51             3.6            22           307         **********************
66             497    47             3.8            23           285         ***********************
65             489    43             2.7            16           262         ****************
64             480    41             3.2            19           246         *******************
63             472    38             2.5            15           227         ***************
62             463    35             3.2            19           212         *******************
61             455    32             2.5            15           193         ***************
60             446    30             1.8            11           178         ***********
59             438    28             2.3            14           167         **************
58             429    25             3.0            18           153         ******************
57             421    22             1.7            10           135         **********
56             413    21             3.2            12           106         ************
54             396    16             1.7            10            94          **********
53             387    14             1.5             9             84          *********
52             379    12             1.2             7             75          *******
51             370    11             2.0            12            68          ************
50             362    9              1.2             7             56          *******
49             353    8              1.3             8             49          ********
48             345    7              1.7            10            41          **********


Item Difficulty and Discrimination: Quintile Table


Part III of the ITEM ANALYSIS output, an item quintile table, can aid in the interpretation of Part IV of the output. Part IV compares the item responses with the total score distribution for each item. A good item discriminates between students who scored high and students who scored low on the examination as a whole. In order to compare different student performance levels on the examination, the score distribution is divided into fifths, or quintiles. The first fifth includes students who scored between the 81st and 100th percentiles; the second fifth includes students who scored between the 61st and 80th percentiles, and so forth. When the score distribution is skewed, more than one-fifth of the students may have scores within a given quintile and, as a result, fewer than one-fifth of the students may score within another quintile. The table indicates the sample size, the proportion of the distribution, and the score range within each fifth.

* * *  MERMAC -- TEST ANALYSIS AND QUESTIONNAIRE PACKAGE * * *

THE QUINTILE GRAPH AND MATRIX OF RESPONSES
APPEARING WITH EACH ITEM ARE BASED ON THE
STATISTICS INDICATED IN THE TABLE BELOW:

QUINTILE   SAMPLE SIZE   PROPORTION   SCORE RANGE
1ST            128          0.21        77 - 92
2ND            127          0.21        70 - 76
3RD            121          0.20        64 - 69
4TH            121          0.20        56 - 63
5TH            106          0.18        24 - 55
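The proportions in the table follow directly from the sample sizes, and the assignment of a score to a fifth follows from its percentile rank. The sketch below illustrates both; the function names are hypothetical, and the handling of percentile ranks that fall exactly on a boundary is an assumption rather than documented MERMAC behavior.

```python
def fifth_for_percentile(percentile_rank):
    """Return 1 (top fifth) through 5 (bottom fifth) for a percentile rank."""
    # Assumption: a rank must exceed 80 to be in the first fifth,
    # exceed 60 to be in the second fifth, and so on.
    for fifth, lower_bound in enumerate((80, 60, 40, 20), start=1):
        if percentile_rank > lower_bound:
            return fifth
    return 5


def proportions(sample_sizes):
    """Proportion of the whole group that falls in each fifth."""
    total = sum(sample_sizes)
    return [round(n / total, 2) for n in sample_sizes]


# Sample sizes from the quintile table above (603 students in all).
sample_proportions = proportions([128, 127, 121, 121, 106])
```

Since every student with the same raw score shares one percentile rank, whole score groups move into a fifth together, which is why the fifths in the sample are not exactly equal in size.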
                   


Interpreting Item Statistics


Part IV of the ITEM ANALYSIS output presents item statistics that can help determine which items are good and which need improvement or deletion from the examination. The quintile graph on the left side of the output indicates the percent of students within each fifth who answered the item correctly. A good, discriminating item is one in which students who scored well on the examination chose the correct alternative more frequently than students who did not score well on the examination. Therefore, the points on the graph should form a line going from the bottom left-hand corner to the top right-hand corner of the graph. Item 1 in the sample output shows an example of this type of positive linear relationship. Item 2 in the sample output also portrays a discriminating item; although few students correctly answered the item, the students in the first fifth answered it correctly more frequently than the students in the rest of the score distribution. Item 3 is a poor item as keyed: the graph indicates no relationship between the fifths of the score distribution and the percentage of correct responses by fifths. However, it is likely that this item was miskeyed by the instructor; note the response pattern for alternative B.
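The points plotted on the quintile graph are simply the number of correct responses in each fifth divided by the size of that fifth. The sketch below recomputes them for items 1 and 3 from the sample output; the function names are hypothetical, and the "falls steadily" check is only one simple way to express the expected pattern.

```python
# Sizes of the 1st through 5th fifths, from the quintile table.
FIFTH_SIZES = [128, 127, 121, 121, 106]


def percent_correct_by_fifth(correct_counts, fifth_sizes=FIFTH_SIZES):
    """Percent of each fifth that chose the keyed response."""
    return [100 * c / n for c, n in zip(correct_counts, fifth_sizes)]


def falls_steadily(percents):
    """True if each fifth does at least as well as the fifth below it."""
    return all(a >= b for a, b in zip(percents, percents[1:]))


# Counts for the keyed response (E) in items 1 and 3 of the sample output.
item1 = percent_correct_by_fifth([102, 75, 49, 34, 5])
item3 = percent_correct_by_fifth([0, 4, 10, 6, 8])
```

For item 1 the percentages drop from about 80 in the first fifth to about 5 in the last, so the check passes; for item 3 the first fifth has no correct responses at all, so it fails.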



A. Evaluating Item Distractors: Matrix of Responses


On the right-hand side of the output, a matrix of responses by fifths shows the frequency of students within each fifth who answered each alternative and who omitted the item. This information can help point out which distractors, or incorrect alternatives, are not successful because: (a) they are not plausible answers and few or no students chose the alternative (see alternatives D and E, item 2), or (b) too many students, especially students in the top fifths of the distribution, chose the incorrect alternative instead of the correct response (see alternative B, item 3). A good item will result in students in the top fifths choosing the correct response more frequently than students in the lower fifths, and students in the lower fifths choosing the incorrect alternatives more frequently than students in the top fifths. The output prints the correct response of the item on the right-hand side and encloses the correct response in the matrix in parentheses.
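The two distractor problems described above can be screened for mechanically. The following sketch is one possible way to do so, not a documented MERMAC feature: the function name, the 5 percent cutoff for a rarely chosen alternative, and the simplification of comparing only the first and fifth fifths are all assumptions. The OMIT column is left out.

```python
# Sizes of the 1st through 5th fifths, from the quintile table.
FIFTH_SIZES = [128, 127, 121, 121, 106]


def flag_distractors(matrix, correct, fifth_sizes=FIFTH_SIZES, min_share=0.05):
    """Map each questionable distractor to a short reason.

    matrix maps an alternative to its counts for the 1st..5th fifths.
    """
    total = sum(fifth_sizes)
    flags = {}
    for alternative, counts in matrix.items():
        if alternative == correct:
            continue
        share = sum(counts) / total
        top_rate = counts[0] / fifth_sizes[0]
        bottom_rate = counts[-1] / fifth_sizes[-1]
        if share < min_share:  # assumed cutoff for "few or no students"
            flags[alternative] = "rarely chosen"
        elif top_rate > bottom_rate:
            flags[alternative] = "attracts high scorers"
    return flags


# Response counts copied from items 2 and 3 of the sample output.
ITEM2 = {"A": [83, 19, 17, 13, 6], "B": [35, 85, 67, 78, 84],
         "C": [10, 23, 37, 30, 16], "D": [0] * 5, "E": [0] * 5}
ITEM3 = {"A": [2, 6, 14, 23, 29], "B": [125, 109, 86, 71, 45],
         "C": [0, 0, 4, 2, 8], "D": [1, 8, 7, 19, 15], "E": [0, 4, 10, 6, 8]}

item2_flags = flag_distractors(ITEM2, correct="A")
item3_flags = flag_distractors(ITEM3, correct="E")
```

Applied to the sample data, this flags alternatives D and E of item 2 as rarely chosen and alternative B of item 3 as attracting high scorers, in line with the examples cited in the text. Any flag is a prompt to reread the item, not a verdict on it.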

B. Item Difficulty: The PROP Statistic


The proportion (PROP) of students who answer each alternative and who omit the item is printed in the first row below the matrix. The item difficulty is the proportion of subjects in a sample who correctly answer the item. In order to obtain the maximum spread of student scores, it is best to use items with moderate difficulties. Moderate difficulty can be defined as the point halfway between a perfect score and a chance score. For a five-choice item, the moderate difficulty level is .60, or a range between .50 and .70 (because 100% correct is perfect and we would expect 20% of the group to answer the item correctly by blind guessing).
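The halfway-point definition above generalizes to items with any number of alternatives. A minimal sketch, with a hypothetical function name, assuming that blind guessing succeeds with probability one divided by the number of choices:

```python
def moderate_difficulty(num_choices):
    """Point halfway between a perfect score (1.00) and the chance score."""
    chance = 1 / num_choices  # expected proportion correct by blind guessing
    return (1.0 + chance) / 2


five_choice = moderate_difficulty(5)   # the .60 cited in the text
four_choice = moderate_difficulty(4)
true_false = moderate_difficulty(2)
```

By the same reasoning, a four-choice item has a moderate difficulty of about .62 and a true-false item about .75.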


Evaluating Item Difficulty. For the most part, items which are too easy or too difficult cannot discriminate adequately between student performance levels. Item 2 in the sample output is an exception; although the item difficulty is .23, the item is a good, discriminating one. In item 4, everyone correctly answered the item; the item difficulty is 1.00. Such an item does not discriminate at all between good and poor students, and therefore does not contribute statistically to the effectiveness of the examination. However, if one of the instructor's goals is to check that all students grasp certain basic concepts, and if the examination is long enough to contain a sufficient number of discriminating items, then such an item may remain on the examination.


C. Item Discrimination: Point Biserial Correlation (RPBI)


Interpreting the RPBI Statistic. The point biserial correlation (RPBI) for each alternative and omit is printed below the PROP row. It indicates the relationship between the item response and the total test score within the group tested, i.e., it measures the discriminating power of an item. It is interpreted similarly to other correlation coefficients. Assuming that the total test score accurately discriminates among individuals in the group tested, high positive RPBIs for the correct responses represent the most discriminating items. That is, students who chose the correct response scored well on the examination, whereas students who did not choose the correct response did not score well on the examination. It is also useful to check the RPBIs for the item distractors, or incorrect alternatives. The opposite correlation between total score and choice of alternative is expected for the incorrect vs. the correct alternative. Where a high positive correlation is desired for the RPBI of a correct alternative, a high negative correlation is good for the RPBI of a distractor, i.e., students who answer with an incorrect alternative did not score well on the total examination. Due to restrictions incurred when correlating a continuous variable (total examination score) with a dichotomous variable (response vs. nonresponse of an alternative), the highest possible RPBI is .80 instead of the usual maximum value of 1.00 for a correlation. This maximum RPBI is directly influenced by the item difficulty level. The maximum RPBI value of .80 occurs with items of moderate difficulty level; the further the difficulty level deviates from the moderate difficulty level in either direction, the lower the ceiling on the RPBI. For example, the maximum RPBI is about .58 for difficulty levels of .10 or .90. Therefore, in order to maximize item discrimination, items of moderate difficulty level are preferred, although easy and difficult items still can be discriminating (see item 2 in the sample output).
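The RPBI and its ceiling can both be sketched with the standard point biserial formula. The code below is illustrative: the function names are hypothetical, the population standard deviation is an assumption, and returning 0.0 when everyone (or no one) chose the alternative is a convention adopted here to mirror item 4 in the sample output. The ceiling calculation additionally assumes that total scores are normally distributed and that the item splits the group perfectly at one cut point; under that assumption the ceiling peaks at a difficulty of .50.

```python
from math import sqrt
from statistics import NormalDist, mean, pstdev


def point_biserial(chose, totals):
    """Correlation between choosing an alternative (0/1) and total score."""
    n = len(totals)
    p = sum(chose) / n
    sd = pstdev(totals)  # assumption: population standard deviation
    if p in (0, 1) or sd == 0:
        return 0.0  # undefined; reported as 0.00 by convention here
    mean_chose = mean(t for c, t in zip(chose, totals) if c)
    mean_other = mean(t for c, t in zip(chose, totals) if not c)
    return (mean_chose - mean_other) / sd * sqrt(p * (1 - p))


def max_point_biserial(difficulty):
    """Ceiling on RPBI at a given difficulty, assuming normal total scores."""
    z = NormalDist().inv_cdf(difficulty)
    return NormalDist().pdf(z) / sqrt(difficulty * (1 - difficulty))


ceiling_at_half = max_point_biserial(0.50)
ceiling_at_extreme = max_point_biserial(0.10)
```

Under these assumptions the ceiling is roughly .80 at a difficulty of .50 and roughly .58 at .10 or .90, which agrees with the figures quoted above.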

Evaluating Item Discrimination. When an instructor examines the item analysis data, the RPBI is an important indicator in deciding which items are discriminating and should be retained, and which items are not discriminating and should be revised or replaced by a better item (other content considerations aside). The quintile graph also illustrates this same relationship between item response and total scores. However, the RPBI is a more accurate representation of this relationship. An item with an RPBI of .25 or below should be examined critically for revision or deletion; items with RPBIs of .40 and above are good discriminators. Note that all items, not only those with RPBIs lower than .25, can be improved. An examination of the matrix of responses by fifths for all items may point out weaknesses, such as implausible distractors, that can be reduced by modifying the item.
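The two cutoffs above translate into a simple screening rule. In the sketch below the function name and the label strings are hypothetical, and the label for values between .25 and .40 is an assumption, since the text does not name that middle band.

```python
def classify_rpbi(rpbi):
    """Screen the RPBI of an item's correct response against the cutoffs."""
    if rpbi <= 0.25:
        return "examine for revision or deletion"
    if rpbi >= 0.40:
        return "good discriminator"
    return "between cutoffs"  # assumed label; the text leaves this unnamed


# RPBIs of the keyed responses for items 1-4 of the sample output.
sample_rpbis = {1: 0.51, 2: 0.43, 3: 0.13, 4: 0.00}
screened = {item: classify_rpbi(r) for item, r in sample_rpbis.items()}
```

For the sample output, items 1 and 2 screen as good discriminators, while items 3 and 4 are flagged for a closer look. The flag only starts the review; the content considerations discussed next still decide whether an item stays.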

It is important to keep in mind that the statistical functioning of an item should not be the sole basis for deleting or retaining an item. The most important quality of a classroom test is its validity, the extent to which items measure relevant tasks. Items that perform poorly statistically might be retained (and perhaps revised) if they correspond to specific instructional objectives in the course. Items that perform well statistically but are not related to specific instructional objectives should be reviewed carefully before being reused.



References


Ebel, R. L., & Frisbie, D. A. (1986). Essentials of educational measurement (4th ed.). Englewood Cliffs, NJ: Prentice-Hall.

Gronlund, N. E., & Linn, R. L. (1990). Measurement and evaluation in teaching (6th ed.). New York: Macmillan.

Guilford, J. P. (1954). Psychometric methods. New York: McGraw-Hill.

Osterlind, S. J. (1989). Constructing test items. Norwell, MA: Kluwer Academic Publishers.

Thorndike, R. L., & Hagen, E. (1969). Measurement and evaluation in psychology and education (3rd ed.). New York: John Wiley & Sons. Chapters 4, 6.


                      

      * * *  MERMAC -- TEST ANALYSIS AND QUESTIONNAIRE PACKAGE  * * *                      

 

ITEM 1    PERCENT OF CORRECT RESPONSE BY FIFTHS             MATRIX OF RESPONSES BY FIFTHS    E IS CORRECT RESPONSE
                                                                      A      B      C      D    (E)   OMIT
1ST  +                                       *              1ST       0     25      1      0    102      0
2ND  +                             *                        2ND       1     45      6      0     75      0
3RD  +                   *                                  3RD       1     63      5      3     49      0
4TH  +             *                                        4TH       2     76      9      0     34      0
5TH  + *                                                    5TH      11     73     13      4      5      0
     +----+----+----+----+----+----+----+----+----+----+
     0   10   20   30   40   50   60   70   80   90  100    PROP   0.02   0.47   0.06   0.01 (0.44)   0.00
                                                            RPBI  -0.20  -0.33  -0.20  -0.13 (0.51)   0.00

 

ITEM 2    PERCENT OF CORRECT RESPONSE BY FIFTHS             MATRIX OF RESPONSES BY FIFTHS    A IS CORRECT RESPONSE
                                                                    (A)      B      C      D      E   OMIT
1ST  +                               *                      1ST      83     35     10      0      0      0
2ND  +      *                                               2ND      19     85     23      0      0      0
3RD  +     *                                                3RD      17     67     37      0      0      0
4TH  +    *                                                 4TH      13     78     30      0      0      0
5TH  +  *                                                   5TH       6     84     16      0      0      0
     +----+----+----+----+----+----+----+----+----+----+
     0   10   20   30   40   50   60   70   80   90  100    PROP (0.23)   0.57   0.19   0.00   0.00   0.00
                                                            RPBI (0.43)  -0.33  -0.05   0.00   0.00   0.00

 

ITEM 3    PERCENT OF CORRECT RESPONSE BY FIFTHS             MATRIX OF RESPONSES BY FIFTHS    E IS CORRECT RESPONSE
                                                                      A      B      C      D    (E)   OMIT
1ST  *                                                      1ST       2    125      0      1      0      0
2ND  +*                                                     2ND       6    109      0      8      4      0
3RD  +   *                                                  3RD      14     86      4      7     10      0
4TH  + *                                                    4TH      23     71      2     19      6      0
5TH  +  *                                                   5TH      29     45      8     15      8      1
     +----+----+----+----+----+----+----+----+----+----+
     0   10   20   30   40   50   60   70   80   90  100    PROP   0.12   0.72   0.02   0.08 (0.05)   0.00
                                                            RPBI  -0.24   0.45  -0.16  -0.17 (0.13)  -0.14

 

ITEM 4    PERCENT OF CORRECT RESPONSE BY FIFTHS             MATRIX OF RESPONSES BY FIFTHS    E IS CORRECT RESPONSE
                                                                      A      B      C      D    (E)   OMIT
1ST  +                                                 *    1ST       0      0      0      0    128      0
2ND  +                                                 *    2ND       0      0      0      0    127      0
3RD  +                                                 *    3RD       0      0      0      0    121      0
4TH  +                                                 *    4TH       0      0      0      0    121      0
5TH  +                                                 *    5TH       0      0      0      0    106      0
     +----+----+----+----+----+----+----+----+----+----+
     0   10   20   30   40   50   60   70   80   90  100    PROP   0.00   0.00   0.00   0.00 (1.00)   0.00
                                                            RPBI   0.00   0.00   0.00   0.00 (0.00)   0.00
