Applied Assessment in the Glendale Union High School District: An Application of the Many-Faceted Rasch Model

Robert K. Hess, Arizona State University West
Marc S. Becker, Glendale Union High School District

The Glendale Union High School District (GUHSD), Glendale, AZ, has been dedicated to integrating assessment, curriculum, and instruction since 1972. The model employed by GUHSD utilizes both top-down and bottom-up perspectives. The core curriculum for the district has been aligned with the state standards and national skills, providing a direct linkage to state and national expectations. Emphasis has been placed upon developing higher-order thinking abilities in all students in all courses. The teachers within the district collectively developed the core curriculum and, working in content teams in coordination with central office curriculum leaders, developed the instructional material guides and the resultant district-wide testing instruments for the courses. GUHSD employs a curriculum-driven assessment program rather than an assessment-driven curriculum. In this fashion GUHSD students, teachers, administrators, and parents are assured that the assessments are valid representations of the instructional expectations placed upon the learners.

 

GUHSD consists of nine high schools with nearly 13,000 students attending the district. The total district-wide model of assessment combines performance-based and multiple choice instruments given annually to over 12,000 students in 24 different content areas. The results of the tests are disseminated to all teachers and administrators for evaluation of the curriculum and growth and development of the students. The data are used for program evaluation, instructional planning, and for aiding in the placement of students in courses best suited for the students’ needs and abilities.

 

Test Development

The GUHSD Instructional Management System is illustrated in Figure 1. At GUHSD the commitment is to a team-oriented program. Teachers are encouraged to review the tests and to make note of concerns and issues they or their students may have concerning any instrument employed in the district. Test development is seen as a continuous, ongoing process rather than a one-time event.

Figure 1: GUHSD Instructional Management System

 

Figure 2 illustrates the timeline for the introduction of a new or modified assessment instrument in GUHSD. The first stage in the process is the preliminary research. This stage is often initiated by the need for a new instrument to reflect new models and strategies of teaching, but it can also be initiated by a need to revamp an existing instrument. Year one of the program is spent in the development or revision of the instrument. Teams of teachers from the district are gathered together to map out instructional expectations and the best-case strategies for assessing those expectations. These strategies are then used to construct an assessment blueprint. The drafts of the instruments are field tested in two stages: (1) in one-on-one settings with students to improve clarity, purpose, directions, etc., and (2) in small-group testing situations simulating real testing conditions, with students permitted to provide written feedback concerning the instrument, the items, and the alignment of the instrument to instruction. Simultaneously, subject matter teachers are encouraged to review the potential instruments and to provide written evaluations, feedback, and recommendations.

Figure 2: GUHSD Timelines for Assessments

The second year of development is a piloting year. Piloting involves a larger scale of assessment, with entire classes of students and, if under development, alternative forms of the instruments. Approximately 25% to 40% of the targeted students throughout the district will participate in the pilot testing. The results of the pilot tests are utilized to create final forms of the tests for the third year, baseline implementation. The results of this round of testing are used to establish a baseline of student performance and to evaluate instructional needs and changes as a result of the testing. All tests are reviewed annually for curriculum alignment and are subject to revision as needed. It is not uncommon for a new form of a test to be under development while an existing instrument is being utilized.

 

Performance-Based Assessment

As of the 1996-97 school year, 20 of the 24 courses with district-wide testing involved a performance-based assessment component. The subject matter areas include English (research and composition), social studies (government, history, and interdisciplinary), science (physical and life sciences), and foreign languages (Spanish, French, and German), as well as several vocational areas currently under development. The tests all require an extended response activity or activities, which often take several days for the students to complete. The guidelines for these activities have been developed by the teachers and are employed uniformly across the course(s) throughout the district.

 

The tests are scored by two raters. The first rater (or judge) is usually the teacher of record for the students. The exceptions to this rule have been the teachers of Advanced Placement (AP) classes, who have been found to be far more rigid and severe in scoring their own students than when they score other, non-AP students, or in comparison with other teachers scoring the AP students. The second ratings occur at sessions under the direction of the district’s curriculum coordinators at a centralized location following the completion of the school year. In previous years, prior to 1995-96, a third reader would be employed to arbitrate any pass/fail disagreements between the first and second readers. With as many as 15% to 35% of the tests needing third readers (depending on the course being tested), the cost of this process was extraordinary, requiring as many as 500 third readings.

 

The scoring model employed by the district utilizes a four-point rubric for each trait (or item) scored. The four levels of achievement are defined as (4) Outstanding, (3) Highly Successful, (2) Successful, and (1) Not Yet Successful. The rubrics were designed by the teams of teachers utilizing a clear set of attributes for each trait as absolute references for determining student performance (cf. Wiggins, 1993). Table 1 presents a sample of the scoring rubric for the three scored traits in the Thinking Science assessment.

 

Table 1: Scoring Rubric for Thinking Science Assessment

The Thinking Science assessment is provided as an illustration of the type of performance activity employed in GUHSD. In this activity the students are given an observable phenomenon and asked to:

a. observe the phenomenon.

b. identify the problem for investigation.

c. develop a hypothesis.

d. design and conduct an experiment to test the hypothesis.

e. analyze the data resulting from the experiment.

f. write an overall evaluation of the experiment.

The timeframe for task completion is six days. A full set of instructions and guidelines is given to each student. GUHSD uses five different experimental conditions for the Thinking Science assessment, and students are randomly assigned to one of the five experiments. This procedure parallels the other performance assessments utilized in the district, with the exception that currently only Thinking Science employs optional prompts (in this case, experiments). The activity is scored for three traits: Data Collection, Analysis and Conclusions, and Evaluation (see Table 1 for the rubrics for these traits).

 

Issues of Concern

The application of district-wide performance-based assessment instruments created a set of problems and concerns. First, the raw score, summative model initially employed in the district assumed that all traits were of equal difficulty. In performance-based assessments some items or traits are more difficult than others, just as some items on multiple-choice tests are more difficult than others. Under the classical analysis and scoring used in summative techniques, all traits are treated as equal in difficulty. One strategy previously employed was to weight one trait more heavily than another, but the weights assigned were often quite arbitrary and at best a guess.

 

A second problem encountered was cost. Scoring over 50,000 different tests with two raters is quite expensive; add to this a need to score 15% or more of the tests with a third rater and the cost rises considerably. While the district employed a sound and practical training procedure for all raters, scoring tests on this scale outside the classroom was becoming prohibitive.

 

The decision to have classroom teachers act as first raters did reduce the cost somewhat, but then the issue of rater consistency and severity arose. The simple truth is that raters, no matter how well trained, bring their own beliefs and levels of severity to the scoring task. An analysis of the raw scores indicated that a high percentage of raters (80%) were within ± 1 point (on the four-point scale) of each other, but that their inter-rater correlations (first to second rater) were only .66 to .70. Furthermore, an analysis of persons acting as third raters found that the same judge gave students a much lower mean score when acting as a third rater than when acting as a first or second rater. Another problem emerged from a comparison of pass/fail rates: when students were rated by a third judge, indicating a need to arbitrate between a pass by one rater and a fail by another, the students were adjudicated as failing more than two-thirds of the time.
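The two agreement statistics cited above measure different things, which is why a high rate of adjacent agreement can coexist with a modest correlation. The following Python sketch computes both for a pair of rating lists. The scores are invented for illustration and are not GUHSD data, and the function names are hypothetical.

```python
from math import sqrt

def adjacent_agreement(first, second, tolerance=1):
    """Share of papers on which two raters differ by at most `tolerance` points."""
    pairs = list(zip(first, second))
    return sum(abs(a - b) <= tolerance for a, b in pairs) / len(pairs)

def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical ratings on the four-point rubric (illustration only).
first_rater = [1, 2, 2, 3, 3, 4, 2, 3, 1, 4]
second_rater = [2, 2, 3, 2, 4, 3, 1, 3, 2, 4]

agreement = adjacent_agreement(first_rater, second_rater)
correlation = pearson_r(first_rater, second_rater)
print(f"within one point: {agreement:.0%}, correlation: {correlation:.2f}")
```

In this invented sample every pair of ratings falls within one point, yet the correlation is only about .64, because on a four-point scale a one-point disagreement is large relative to the total spread of scores.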

The Many-Faceted Rasch Model (FACETS)

Primary Function of the Many-faceted Rasch Model

The many-faceted approach to analyzing performance-based models incorporates a strategy of analysis that not only examines the level of achievement demonstrated by the learner relative to the difficulty of the task but also permits this performance to be freed from the biasing effects that third parties (the judges or raters) bring into the scoring activity. Engelhard (1994) suggests that perhaps the best way to envision the work of the many-faceted Rasch model is to think of raters as being analogous to test forms of varying difficulty in a traditional testing format. Facets then permits the equating of these forms of the test in order to link the performance of the students on a common scale.

 

The many-faceted approach permits the calibration of a severity (or leniency) effect that differentiates judges in their application of knowledge, experience, and expertise, and allows removal of this influence from the score obtained by the learner (Linacre, 1989). A rater imposes his or her own unique standards and perceptions of a task and, as a consequence, will differ from other raters in leniency or severity when scoring an examinee's performance (Lunz, Wright, & Linacre, 1990). As Lunz et al. (1990) note, "… when judges vary in severity, raw scores are affected and decisions may be different" (p. 333). Calibrating for severity helps to ensure the objectivity and fairness of the results (Lunz et al., 1990). Using the many-faceted Rasch model, the raters who grade the performances are also calibrated for their severity in scoring, independent of the examinees' ability and the item difficulties (Linacre, 1989; Lunz et al., 1990). In other words, the severity biases that a judge may bring into a scoring event are equated for all learners. It therefore matters little to the learner which set of judges scores him or her, since the individual influence of the judges will be adjusted and accounted for when calibrating a measure of the student’s performance (Engelhard, 1993).
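For reference, the rating-scale form of the many-faceted model described by Linacre (1989) can be written as below. The symbols are notation chosen here for illustration and do not come from the GUHSD reports: B_n is the ability of student n, D_i the difficulty of trait i, C_j the severity of judge j, F_k the difficulty of the step from rubric category k-1 to category k, and P_nijk the probability that judge j awards student n category k on trait i.

```latex
\ln\!\left(\frac{P_{nijk}}{P_{nij(k-1)}}\right) = B_n - D_i - C_j - F_k
```

Because judge severity enters as its own additive term, it can be estimated separately from student ability and removed from the student's measure.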

 

A second source of bias identified by the many-faceted model is the degree of internal consistency exhibited by a judge. While judge severity may be accounted for in calibrating the performance of a learner, the judges must be consistent in their scoring; that is, if they tend to score more harshly or leniently than another judge, they must stay at that level. A judge is seen as inconsistent if he or she wavers or drifts in scoring (Lunz, Wright, & Linacre, 1990): for example, scoring harshly one set of papers that other judges agree should be scored so, but scoring leniently another set that other judges would score more severely. Another judge-centered problem arises when a judge scores all papers at the same level of performance, regardless of their distinct qualities; this is called a "halo effect" (Engelhard, 1994). The Facets program allows the identification of judges guilty of either of these actions. While neither of these effects can be accounted for in the calibration of the students’ performance, the offending judge can be identified, the work of the students can be rescored by more appropriate judges, and the offending judge can be retrained.
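One common way to quantify this kind of consistency in Rasch analysis is a mean-square fit statistic built from standardized residuals. The Python sketch below is a minimal illustration, assuming that the model-expected score and model variance for each rating have already been produced by a calibration run. The function name and all numbers are hypothetical, and the code does not reproduce the output of the Facets program.

```python
def outfit_mean_square(observed, expected, variance):
    """Mean of squared standardized residuals for one judge's ratings.

    observed: ratings the judge actually gave
    expected: model-expected ratings for the same papers
    variance: model variance of each rating
    """
    squared_z = [
        (x - e) ** 2 / v
        for x, e, v in zip(observed, expected, variance)
    ]
    return sum(squared_z) / len(squared_z)

# Hypothetical calibration output for four papers (illustration only).
expected = [2.0, 2.5, 3.0, 1.5]
variance = [0.5, 0.5, 0.5, 0.5]

steady_judge = [2, 3, 3, 1]    # stays close to the model expectation
erratic_judge = [4, 1, 1, 4]   # swings between lenient and harsh

print(outfit_mean_square(steady_judge, expected, variance))
print(outfit_mean_square(erratic_judge, expected, variance))
```

Values near 1 indicate ratings about as variable as the model predicts. Values well above 1 point to erratic scoring, and values well below 1 point to ratings that vary less than expected, as when a judge gives nearly every paper the same score.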

 

The ultimate outcome of an analysis of a performance instrument with FACETS is a set of measures on the learner, the traits utilized to assess the learner, and the judges used to rate the learner. The score obtained by the learner is generally expressed in a unit called a logit (a log-odds unit). The score reflects the ability of the student to perform on a trait of given difficulty appropriate to the learner’s instructional experiences. If learners are able to complete an appropriate task, their score will reflect this ability regardless of the specific set of items presented to them in the examination. Figure 3 illustrates the process of student performance determination as a result of the application of the many-faceted Rasch model.
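To make the logit scale concrete, the following Python sketch converts a set of logit measures into the probability of each rubric level under a rating-scale many-facet Rasch model. It is a minimal illustration rather than the Facets implementation: the function names, the step values, and the sample measures are all hypothetical.

```python
from math import exp

def category_probabilities(ability, difficulty, severity, steps):
    """Probability of each rubric level, lowest first; all arguments in logits.

    steps[k] is the difficulty of moving from level k+1 to level k+2,
    so a four-point rubric needs three steps.
    """
    # The lowest level has log-numerator 0; each higher level adds
    # (ability - difficulty - severity - step) for the step just crossed.
    log_numerators = [0.0]
    for step in steps:
        log_numerators.append(
            log_numerators[-1] + ability - difficulty - severity - step
        )
    denominator = sum(exp(v) for v in log_numerators)
    return [exp(v) / denominator for v in log_numerators]

def expected_score(probabilities, lowest_level=1):
    """Model-expected rating given the level probabilities."""
    return sum((lowest_level + k) * p for k, p in enumerate(probabilities))

# Hypothetical values: one student, one trait, two judges of differing severity.
steps = [-1.5, 0.0, 1.5]
lenient = category_probabilities(0.5, 0.0, -0.5, steps)
severe = category_probabilities(0.5, 0.0, 0.5, steps)

print(round(expected_score(lenient), 2), round(expected_score(severe), 2))
```

The same student on the same trait has a lower expected rating under the more severe judge, which is the effect the calibration removes when it reports the student's measure.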

 

Figure 3: GUHSD—Performance Assessment Model

 

Model Building Considerations

Using FACETS correctly requires more than just plugging in a set of scores and judge IDs. The many-faceted model requires that certain assumptions be met by the instrument and by the users of the Facets program. First, each trait must represent a unique action, activity, process, or product. Traits that measure multiple events are very hard for judges to score. Each trait must be distinct; otherwise some judges will focus on one component while others focus on another. This causes apparent patterns of subsetting among these sets of judges.

 

Another problem is the uncontrolled influence of an external source of bias in the judges, the learner, or the instrument, such as the reading difficulty of a prompt in a writing activity. This will also cause subsetting problems in the scoring. Similarly, if a single judge scores both sets of papers for a set of students, a subsetting problem also results.
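The structural form of subsetting, in which some judges and students share no ratings with the rest of the design, can be checked before any calibration is run. The Python sketch below treats the judging plan as a graph linking students to the judges who scored them and counts its disconnected pieces. It is an assumption-laden illustration with hypothetical names and data, and it does not detect subsetting that arises from how judges interpret a trait.

```python
def count_subsets(ratings):
    """Count disconnected groups in a judging plan.

    ratings: iterable of (student_id, rater_id) pairs, one per reading.
    A result of 1 means every judge is linked to every other judge
    through shared students; more than 1 signals subsetting.
    """
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for student, rater in ratings:
        root_student = find(("student", student))
        root_rater = find(("rater", rater))
        if root_student != root_rater:
            parent[root_student] = root_rater

    return len({find(node) for node in list(parent)})

# Hypothetical plans (illustration only).
linked_plan = [("s1", "A"), ("s1", "B"), ("s2", "B"), ("s2", "C")]
split_plan = [("s1", "A"), ("s1", "B"), ("s2", "C"), ("s2", "C")]

print(count_subsets(linked_plan), count_subsets(split_plan))
```

In the second plan judge C supplies both readings for student s2 and reads no one else, so that judge's severity cannot be compared with the severity of judges A and B.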

 

Related to the above is the problem of judge identification and the number of times a judge must rate papers to be viable in the model. While FACETS is able to account for missing elements in an analysis model, too many missing judges (whose scores are nonetheless important in the assessment of performance) will prevent the model from functioning properly. Also, numerous passes through previous data indicated that care must be taken that the judges rate a reasonable number of the learners. The figure that seems to work is either 2% of the learners or 30 learners, whichever is larger. This should ensure sufficient crossing among the judges for FACETS to calibrate the judges’ severity effects and adjust for their influence in calibrating the students’ performance relative to the item difficulties. The importance of the identification of judges and the number of papers cannot be overemphasized. The model is dependent upon sufficient crossing of the judges to calibrate the learners’ performance.
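The rule of thumb stated above (2% of the learners or 30, whichever is larger) is easy to apply when planning a scoring session. The Python sketch below is one way to do so. Rounding the 2% figure up to a whole student is an assumption made here, and the function names and sample data are hypothetical.

```python
def minimum_papers_per_judge(n_students, percent=2, floor=30):
    """Smallest number of students a judge should rate under the rule of thumb."""
    # Integer arithmetic rounds percent * n_students / 100 up to a whole student.
    share = (n_students * percent + 99) // 100
    return max(share, floor)

def judges_below_minimum(ratings, n_students):
    """Judges who rated fewer distinct students than the rule requires.

    ratings: iterable of (student_id, rater_id) pairs, one per reading.
    """
    students_by_judge = {}
    for student, rater in ratings:
        students_by_judge.setdefault(rater, set()).add(student)
    required = minimum_papers_per_judge(n_students)
    return sorted(
        rater
        for rater, students in students_by_judge.items()
        if len(students) < required
    )

# Student counts reported in this paper for two of the assessments.
print(minimum_papers_per_judge(2204))  # junior writing assessment
print(minimum_papers_per_judge(1211))  # Thinking Science assessment
```

For the 2204 juniors the 2% figure governs (45 students), while for the 1211 Thinking Science students the floor of 30 applies.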

 

Samples of Results

Two sets of results, in Tables 2 and 3, are drawn from the 1994-95 calibration studies performed in GUHSD and illustrate the type of data gathered comparing the summated technique and the many-faceted strategy. The first set represents juniors taking a multi-paragraph writing assessment. This test was scored by up to three raters, and the two raters agreeing on a level determination produced the final raw score for the student. The many-faceted model utilized all three raters (if needed).

 

The cross tabulation in Table 2 presents a picture of a comparison of the pass/fail rate using a summated model vs. using a calibrated FACETS model. One of the real strengths of the many-faceted model is the ability to isolate the effects due to raters and to determine their internal (self) consistency and their severity effect. While the many-faceted model will control for the latter, the ability to determine which raters lack internal consistency is very important.

 

Table 2: Multi-Paragraph Writing—Juniors

   

Measure                Value    Comments
Number of Students     2204
Number of Raters       37
Number of Items        4
Student Reliability    .86
Rater Reliability      .82      judge #99 rated over 28% of the students
Trait Reliability      .99

                       Facets
Regular                Fail     Pass     Percent
Fail                   459      273      32.8
Pass                   29       1407     67.2
Total                  488      1743     Difference
Percent                21.2     78.1     +11.1

Facets fails 29 students who would pass with the summated model and passes 273 students who would fail with the summated strategy.

Rater #99, who read 392 papers as the second or third reader, had an average score of 2.05 on those papers, compared with an average of 2.16 for all ratings. On first readings, #99 had an average score of 2.25 for 45 students. This rater was slightly severe overall but very severe when reading as the second or third rater. Considering the high number of students read by this rater, the effect was strong.

 

Table 3 is for the same activity (multi-paragraph writing) for sophomores. These data do not include any third raters. The cross tabulation in Table 3 shows a much stronger agreement between the summated scores and the many-faceted scores.

Table 3: Multi-Paragraph Writing—Sophomores

   

Measure                Value    Comments
Number of Students     1561
Number of Raters       30
Number of Items        4
Student Reliability    .83
Rater Reliability      .89      2 judges flagged as inconsistent; no real extremes
Item Reliability       .98

                       Facets
Regular                Fail     Pass     Percent
Fail                   273      59       19.8
Pass                   40       1302     80.2
Total                  313      1361     Difference
Percent                18.7     81.3     +1.1

Facets fails 40 students who would have passed with the summated model and passes 59 students who would have failed with the summated strategy.

This is one of the examinations in which only two readings took place. The match between the two systems is excellent. Facets hits on essentially the same proportion of pass/fails as the absolute standard. But it is important to remember that the influence of severe third raters is not present in this examination.
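The percentages in a cross tabulation of this kind follow directly from the four cell counts. The Python sketch below recomputes them from the sophomore counts reported for Table 3. The function and key names are hypothetical, and the overall agreement rate it prints is derived here from those counts rather than taken from the district's reports.

```python
def crosstab_summary(fail_fail, fail_pass, pass_fail, pass_pass):
    """Summarize a 2 x 2 pass/fail table.

    The first word of each argument is the summated (regular) decision,
    the second is the Facets decision.
    """
    total = fail_fail + fail_pass + pass_fail + pass_pass

    def percent(count):
        return round(100 * count / total, 1)

    return {
        "regular_fail_pct": percent(fail_fail + fail_pass),
        "regular_pass_pct": percent(pass_fail + pass_pass),
        "facets_fail_pct": percent(fail_fail + pass_fail),
        "facets_pass_pct": percent(fail_pass + pass_pass),
        "agreement_pct": percent(fail_fail + pass_pass),
    }

# Cell counts from Table 3 (sophomores).
summary = crosstab_summary(fail_fail=273, fail_pass=59, pass_fail=40, pass_pass=1302)
for name, value in summary.items():
    print(f"{name}: {value}")
```

With these counts the two methods reach the same pass/fail decision for roughly 94% of the tabulated cases, and the pass rate differs by about 1.1 percentage points.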

 

A Detailed Example: Thinking Science

The following example is drawn from the Thinking Science assessment for the 1996-97 school year. The assessment activity and scoring rubric have been previously described. A total of 1211 students were judged as possessing scoreable tests. The raw scores for two judges ranged from a minimum of 6 to a maximum of 24 (2 raters × 3 traits × 1 to 4 points per trait). The distribution of raw scores is presented in Figure 4.

 

 

Figure 4: Raw Score Distribution for Thinking Science Assessment

When the many-faceted Rasch model is applied employing students, raters, traits, and experiment as facets the distribution pictured in Figure 5 is obtained. When the many-faceted model accounts for rater severity, trait difficulty, and equates for experiment selected, the result is a much wider distribution of achievement.

 

Figure 5: Many-Faceted Distribution of Student Achievement for Thinking Science Assessment

One of the many advantages of the Rasch model is the ability to place all of the facets on a common logit scale. Figure 6 is a dot plot representation of the distribution of the scalings for students (PERSONS), RATERS, experiment (EXPER) and traits (ITEMS). As is common with Rasch analysis, positive scores for students indicate higher achievement while positive scores for raters, traits, and experiments indicate difficulty or severity influences.

 

Figure 6: Distribution of Facets for Thinking Science

The correlation between the raw summated scores and the calibrated IRT scores is strong (r=.95); however, as may be seen in Figure 7, the scatterplot indicates that the many-faceted model is making adjustments for the effects of rater, experiment and trait difficulty.

For raters, the range of logit scores (as a measure of severity) is 1.94 logits (standard deviation of .56) indicating a range of 3.46 standard deviations between the most severe of the raters and the most lenient. For the experiments the difference between the most difficult and the least difficult of the stations is 2.875 standard deviations. The traits possess the least influential effect, a difference of only 2.4 standard deviations between the most difficult trait (Data Collection) and the least difficult (Analysis & Conclusions). The index of separation, as a measure of explained variance, provides a reliability index of .81 for students, .98 for raters, .94 for experiments, and .89 for traits.
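The spread figure for raters can be reproduced from the two numbers given, and each reliability index can be converted to a separation index through the standard Rasch relation R = G² / (1 + G²). The Python sketch below shows both calculations. The separation values it prints are derived here and are not reported in the district's analysis, and the function names are hypothetical.

```python
from math import sqrt

def spread_in_sd(logit_range, standard_deviation):
    """Distance between the extreme elements of a facet, in standard deviations."""
    return logit_range / standard_deviation

def separation_from_reliability(reliability):
    """Separation index G implied by a separation reliability R = G^2 / (1 + G^2)."""
    return sqrt(reliability / (1 - reliability))

# Rater figures reported above: a range of 1.94 logits with a standard deviation of .56.
print(round(spread_in_sd(1.94, 0.56), 2))

# Reliability indices reported above for each facet.
reported = {"students": 0.81, "raters": 0.98, "experiments": 0.94, "traits": 0.89}
for facet, reliability in reported.items():
    print(facet, round(separation_from_reliability(reliability), 2))
```

A rater reliability of .98 corresponds to a separation of about 7, which means the raters differ in severity by far more than measurement error alone would produce.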

 

Figure 7: Scatterplot of Raw and Facets Achievement Scores on Thinking Science

District Reporting

While all of these iterations and calculations help to justify the use of the many-faceted Rasch model, without a practical application of them, the results mean little. The Appendix to this paper contains five sample reports generated from the analysis. Before generating the reports the student logit scores are transformed to a more usable scale. A modified WITS scale (Wright & Stone, 1979) is employed where 50 indicates meeting the standard of Successful (the standard deviation is 10). This is a scale more quickly understood by parents, students, and teachers.
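A rescaling of this kind can be sketched as a linear transformation of the logit measures. The Python example below assumes that the scale is centered on the logit value of the Successful cut score and stretched so that one standard deviation of the student logits equals 10 reporting points. The exact transformation used by the district is not given here, so the function name, parameters, and sample values are all hypothetical.

```python
def to_reporting_scale(logit, cut_logit, sd_logit, center=50.0, spread=10.0):
    """Map a logit measure onto a scale where `center` marks the cut score.

    cut_logit: logit value at which a student just meets the Successful standard
    sd_logit:  standard deviation of the student logit measures
    """
    return center + spread * (logit - cut_logit) / sd_logit

# Hypothetical values (illustration only): cut score at -0.40 logits, SD of 1.20 logits.
cut, sd = -0.40, 1.20
for measure in (-1.60, -0.40, 0.80):
    print(measure, round(to_reporting_scale(measure, cut, sd), 1))
```

Under these assumptions a student exactly at the cut score is reported as 50, and a student one standard deviation above it is reported as 60.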

The first report is a school within district summary. The second is the District Summary by Success level. The third report illustrates the individual summaries for each student taking the assessment. The fourth is an illustration of how inconsistent raters may be tracked by analyzing their residual scores. The fifth report illustrates the general summary reported for each content/course employing a performance-based assessment.

Summary

Glendale Union High School District has employed district-wide assessments for over 25 years. These assessments are both performance-based and multiple choice. The assessments are built employing an integrated approach with teachers as a primary source of their development.

1. The GUHSD Instructional Management System:

2. A three-year timeline is employed for the introduction of any new test or the modification of existing instruments.

3. The many-faceted Rasch model was selected for analyzing and determining student performance because:

4. The use of the many-faceted Rasch model has revealed several facts about GUHSD assessment.

 

 

References

 

Engelhard, G. (1992). Developing rater banks. Paper presented at the annual meeting of the American Educational Research Association, San Francisco, CA.

Engelhard, G. (1993). The measurement of writing ability with a many-faceted Rasch model. Applied Measurement in Education, 5, pp. 171-192.

 

Engelhard, G. (1994). Examining rater errors in the assessment of written composition with a many-faceted Rasch model. Journal of Educational Measurement, 31, pp. 93-112.

 

Linacre, J.M. (1989). Many-faceted Rasch Measurement. Chicago: MESA Press.

Lunz, M.E., Wright, B.D. and Linacre, J.M. (1990). Measuring the impact of judge severity on examination scores. Applied Measurement in Education, 3, pp. 331-345.

 

Wiggins, G.P. (1993). Assessing Student Performance: Exploring the purpose and limits of testing. San Francisco: Jossey-Bass.

 

Wright, B.D. and Masters, G.N. (1982). Rating scale analysis: Rasch measurement. Chicago: MESA Press.

 

Wright, B.D. and Stone, M.H. (1979). Best test design: Rasch measurement. Chicago: MESA Press.