A Mathematics Item Bank for Indonesia

Hayat, Bahrul

     The idea of criterion-referenced assessment is to provide a direct interpretation of a test score in terms of the skills and competencies specified in a curriculum. Its main purpose is to integrate testing with instruction. Limitations of the traditional psychometric method prevent this idea from being achieved; the traditional method also provides no sound basis for validating an instructional sequence. The Rasch measurement model (Rasch, 1960), in contrast, has the potential to be used both for developing sound criterion-referenced measurement and for empirically validating the sequence of instructional materials for optimum learning.
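To make the criterion-referenced interpretation concrete, the dichotomous Rasch model places persons and items on one logit scale, so a score can be read directly against item difficulties. The following is a minimal illustrative sketch (the function name and values are ours, not the study's):

```python
import math

def rasch_prob(theta, b):
    """Probability of a correct response under the dichotomous Rasch model:
    P(X = 1 | theta, b) = exp(theta - b) / (1 + exp(theta - b)),
    where theta is person ability and b is item difficulty, both in logits."""
    return math.exp(theta - b) / (1.0 + math.exp(theta - b))

# Because persons and items share one scale, a person whose ability equals
# an item's difficulty has exactly a 50% chance of success on that item:
p = rasch_prob(theta=1.0, b=1.0)  # → 0.5
```

This shared scale is what allows a cut score or curriculum skill to be mapped to the items a person at that level can be expected to master.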
     In developing a calibrated item bank, the problem of obtaining appropriate empirical data for vertical equating is not new. The problem involves not only logistical matters but also the unavoidable presence of off-target items administered to different grade levels. The current research applied the Rasch measurement model to calibrate mathematics items for a national item bank, to be used for the Indonesian national secondary school (grades 7 to 9) examinations. Item difficulty estimates were obtained through one-step calibration, in which the difficulty estimates of items from different grades and test forms were placed directly on one common scale. One-step calibration not only made the calibration process more efficient but also increased the precision of the estimates. The stability of the item calibrations across different regional samples was also evaluated. The scale of the bank was given substance by specifying the mathematics skills implied by the items, so that the scale can be criterion-referenced. The content sequences defined by the item calibration were, in general, in agreement with the content orderings of the national mathematics curriculum.
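One-step (concurrent) calibration can be sketched as joint estimation over a sparse response matrix in which different grades answer overlapping test forms; common items link the forms so all difficulties land on one scale. The sketch below uses a simple alternating Newton update in the style of joint maximum likelihood; all names, the data-generating values, and the estimation details are illustrative assumptions, not the procedure used in the study:

```python
import math, random

def concurrent_calibrate(responses, n_persons, n_items, iters=50):
    """One-step calibration sketch: alternately update person abilities (theta)
    and item difficulties (b) on a sparse {(person, item): 0/1} response dict,
    so items from all forms are estimated together on one common logit scale."""
    theta = [0.0] * n_persons
    b = [0.0] * n_items
    by_person = {p: [] for p in range(n_persons)}
    by_item = {i: [] for i in range(n_items)}
    for (p, i), x in responses.items():
        by_person[p].append((i, x))
        by_item[i].append((p, x))
    for _ in range(iters):
        for p in range(n_persons):          # Newton step for each ability
            num = den = 0.0
            for i, x in by_person[p]:
                pr = 1.0 / (1.0 + math.exp(b[i] - theta[p]))
                num += x - pr
                den += pr * (1.0 - pr)
            if den > 0:
                theta[p] += max(-1.0, min(1.0, num / den))
        for i in range(n_items):            # Newton step for each difficulty
            num = den = 0.0
            for p, x in by_item[i]:
                pr = 1.0 / (1.0 + math.exp(b[i] - theta[p]))
                num += pr - x
                den += pr * (1.0 - pr)
            if den > 0:
                b[i] += max(-1.0, min(1.0, num / den))
        shift = sum(b) / n_items            # identify the scale: centre b at 0
        b = [bi - shift for bi in b]
        theta = [t - shift for t in theta]
    return theta, b

# Synthetic vertical-equating layout: two overlapping forms, item 1 common.
random.seed(0)
true_b = [-1.0, 0.0, 1.0]
responses = {}
for p in range(400):
    grade9 = p >= 200
    ability = random.gauss(0.5 if grade9 else -0.5, 1.0)
    for i in ([1, 2] if grade9 else [0, 1]):
        pr = 1.0 / (1.0 + math.exp(true_b[i] - ability))
        responses[(p, i)] = 1 if random.random() < pr else 0

theta, b = concurrent_calibrate(responses, 400, 3)
# The common item links the forms, so b recovers the intended
# ordering on one scale: b[0] < b[1] < b[2]
```

Because every form contributes to a single estimation run, no separate chain of pairwise equatings is needed, which is the efficiency gain the one-step approach provides.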
     Satisfactory results were obtained in using expert judgment for grade equating. Judgment equating and empirical equating provided identical estimates of both item difficulty and person ability. Separation statistics indicated that judgment equating defined the mathematics variable better than empirical equating did. Judges were also very successful in identifying items that did not fit the Rasch model. Judge background variables, including gender, experience, and education, had no effect on the judging process. Familiarity with the items and with the context of the items, however, significantly affected the judgments.
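The separation statistic referred to here is the standard Rasch index: the ratio of the "true" (error-corrected) spread of the measures to their average measurement error, with a higher value indicating a better-defined variable. A minimal sketch, assuming the usual definition (the function name and example values are hypothetical):

```python
import math

def separation(estimates, standard_errors):
    """Rasch separation index G = sqrt(true variance / error variance),
    where true variance = observed variance minus mean squared standard error.
    Also returns the corresponding reliability, G^2 / (1 + G^2)."""
    n = len(estimates)
    mean = sum(estimates) / n
    observed_var = sum((e - mean) ** 2 for e in estimates) / (n - 1)
    error_var = sum(se ** 2 for se in standard_errors) / n
    true_var = max(observed_var - error_var, 0.0)
    g = math.sqrt(true_var / error_var)
    return g, g * g / (1.0 + g * g)

# Well-spread measures with small standard errors give high separation:
g, rel = separation([-2.0, -1.0, 0.0, 1.0, 2.0], [0.5] * 5)
# → g = 3.0, rel = 0.9
```

On this index, a larger value for judgment equating than for empirical equating would mean the judged difficulties spread the items more widely relative to their error, i.e. they define the variable more sharply.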