Discussant’s Comments
Joe Ryan
Arizona State University West
When you get a set of papers like these, there are really two approaches. One is to go through each paper and beat up on people individually. The other is to beat up on people as a group. I'm going to do the group.
Actually I'm feeling bold today. An overflowing room at 8:15 in the morning is really hard to imagine, but apparently there are people starving for this. What I want to do, instead of going through each paper one at a time, is frame this session in a broader context that I think is very, very important. Frankly, I think it's extremely important that sessions like this do get scheduled at the National Council on Measurement in Education. I think one of the things that we can fall into in our measurement work is to get so preoccupied with the technical aspects of what we do that we forget that we're supposed to be informing educators. If it doesn't pay off sooner or later in the classroom, while it may help us in academic life proceed through tenure and promotion, and get us travel money from our districts and state departments to come to scenic Chicago, what is the point?
I have a couple of technical notes but I want to save those for the end if we have time because I don't think they are terribly important. What I would like to do is suggest the framework of testing and measurement really serving two worlds. I'm going to contrast these, although it's really a continuum and I don't want to reify this dichotomy, but let me try just for this framework purpose. One of the things that we do when we give tests is that we are doing it for purposes of accountability. We're investing millions and billions of dollars at the state level and at the national level, and we want to count heads. We want to know how well some people are doing relative to some set of standards that we think that we can understand. And that is one thing that we do -- NAEP is a very good example of that. We don't use NAEP to try to tell classroom teachers how their kids are doing on local curriculum. The commercially available testing programs are also particularly well suited for accountability purposes, and of course every effort is made to make them sensitive to local curriculum.
The other thing that we need to do, and I think it is personified in these papers, is that we need to try to obliterate the distinction between assessment, curriculum and instruction. As long as there are the testing people and the curriculum people and we have separate meetings, somebody is not doing what they are saying they are doing. We are not doing educational measurement. We're, maybe, doing measurement but it's not educational if it doesn't speak to curriculum and instruction. So I want to frame this into the context of how these papers really speak to the issues of curriculum and instruction as a major focus. And one of the aspects of that, of course, is equity and fairness. And although the people in their presentations didn't really emphasize that, one of the things that goes on in all of the papers is really a strong emphasis on equity and fairness. Look at the number of things they are doing in Mesa, and in all the other programs, in regard to that issue.
One of the presenters used the word "meaningful." I think it was actually Joe talking about making these things meaningful for educators. We have to think about the issue of meaning just for a second. Numbers don't mean anything. If I tell you the scores in a game were 7 to 2, 11 to 6, and 9 to 12, you don't know anything. The numbers have no meaning until I tell you who played. In fact, it helps to know what kind of game it was. I think what most of these papers are about is dealing with the question of what does this mean, and the meaning derives from looking at the numbers in context. One of the things that a couple of the people made very explicit, and I know it's true in all cases, is the question of who can make meaning for measurement numbers. For far too long I think the psychometric community presumed to have that competence. I doubt that very much, frankly. Bearing in mind that psychometrics has been very good to me for the last thirty years. If you actually said come back in ten years, ten years would be my 37th AERA. I'm not going to do that.
One of the things I have observed is that psychometricians just presume that because they can write the software and run the programs and get the data that they know how to interpret it. One of the things we saw in all of these cases, particularly Joe's, illustrated this for us. You take the numbers, you go to the school folks, and you say how could this have happened in terms of the dynamics with which you are familiar? Not in terms of betas, and gammas, alphas, and deltas, but in terms of kids, curricular scope and sequence, and emphasis.
Let me talk about each of the papers briefly in the context of curriculum and instruction. All of the presenters left some things out in their papers. All of these programs are built on curricular expectations. They didn't start with test items.
Bob has a reference in his paper. This is very early in their process, and he talks about how teachers were brought together for the first step to talk about what their expectations were for student learning. And although he put this up as an assessment development process, the first step is to ask teachers what do you expect kids to learn. The Portland program, which is the grandfather of all Rasch programs, I guess, is a curriculum referenced program. If you will look at the scoring rubrics in Bob's paper they are very curriculum referenced. Just let me read one to you to get the flavor. Under outstanding data collection, "accurately displays all data in both table and graph forms, data is reasonable, includes values, labels, titles, appropriate scaling." Fifteen years ago this could have been written as a behavioral objective for curriculum people. Now we are using it as a scoring rubric, and we're combining what we would have called the curriculum specification with an assessment specification for four points. If you look through these papers you see that going on all the way through.
Figure 1: Mean mathematics RIT scores across sixteen years

I love this picture, it's a great picture. And I'm going to take my guess at what I think it means.
Well, I think, I'm going to use this as an illustration that maybe it's time to ask teachers what is this thing? Is everyone clear what this is? This is a good news picture. We're making progress despite public criticism to the contrary. Everything seems to be getting better. Gage assured us that this was not a mistake. Well, I've had occasion on two different times in the last fifteen years to be an outside reviewer, evaluating their program. I don't think it's a mistake either -- these are what the numbers look like.
But what do you make of this? One of the things I think you make of it is a pretty stable curriculum. While Gage was emphasizing stability in the scale, I think that is a reflection of a school district defining what fourth grade math includes. In fact, they literally have this chart with model items all the way through. There is a stability in the curriculum, it's not stability in the scale. The scale is a reflection of the stability in the curriculum.
I have a couple of questions, and these are really questions that school folks, the teachers and the curriculum experts, might be asked. One is sort of a sequence or a score issue. If you look at the point around year 16, they are clearly well above the pre for the same grade 16 years previously. My question is, are the teachers starting at a different point in the curriculum? If they really are, if the kids are starting off with this higher level, has the entry point for instruction for fourth grade changed to accommodate that? If that is so, we shouldn't be surprised that the magnitude of growth within a year doesn't change any unless you add more days or more instructional time to math. If you keep moving up the sequence, you still have 180 days, five periods of math a week, probably, fifty minutes a -- fourth grade is less probably than that. I don't think that we should be surprised that the magnitude of growth hasn't changed; we have been moving up into the sequence. Math is -- everybody picks math because it's so easy to use. Do this with language arts, or science, and try to make sense out of it.
While this is up let me mention one of the technical notes that I had in mind, and that is to say all of these districts have population data. You go through grad school, you learn about measurement, you learn about all of this stuff. They've got every fourth grader who ever went through this school district on this test. These are mu’s and sigmas, they are not x bars and thetas. So interpreting change relative to the standard deviation is the thing I think we would do reflexively, to try to see, well, is this 1.5 differing from random? It isn't. That's a delta mu over 16 years; it really is 1.5. There is variation within grade, but then we know that kids vary quite a bit.
Joe O'Reilly's paper is another example with the interpretation at the school level, the school report. Principals don't applaud very much, by the way, as many of you might know. The school report interpretation they also do with gender and ethnicity. They also do an ability interpretation. All of those attest to fit issues. All of those are looking at observed minus expected from the model. They are doing it somewhat heuristically, rather than statistically, but then these aren't statistics, they are parameters anyway, we got every single kid in the district. These are all the kids.
I think the school level report Joe describes -- you got the idea, you place the school in terms of its mean performance. Then for each item you have a prediction from the model, then you look at the observed value, then look to see what is going on there. That's clearly an attempt to talk to principals and curriculum leaders about what portions of the curricula they seem to be emphasizing and in what areas they might need in-service or professional development. I think one of the things that you have to be a little careful of is that once you start to get a couple of plus, pluses, you're gonna turn up a couple of minus, minuses. You know, because the data are constrained. So you have to be a little careful about interpreting variation. Although I had occasion to look at some of this printout recently in another context. If you will look at the plus, plus, pluses, and the minus, minus, minuses, I mean these are folks that are way above or below what you thought -- ten or fifteen percent above, ten or fifteen percent below. If you have a school of 600 people, ten percent, or 60 of them, is two classrooms.
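The observed-minus-expected comparison described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the procedure used in the Mesa reports: the function names, the ability and difficulty values, and the observed proportions are all hypothetical, and plugging the school's mean ability into the Rasch formula is a simplification of averaging the model prediction over individual students.

```python
import math


def rasch_p(theta, b):
    """Rasch probability of a correct answer for ability theta and item difficulty b (both in logits)."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def item_residuals(school_theta, item_difficulties, observed_p):
    """Observed minus model-expected proportion correct, one value per item."""
    return [
        obs - rasch_p(school_theta, b)
        for b, obs in zip(item_difficulties, observed_p)
    ]


# Hypothetical school and items (illustrative values only).
school_theta = 0.5                      # school placed at its mean performance
item_difficulties = [-1.0, 0.5, 1.5]    # calibrated item difficulties
observed_p = [0.90, 0.50, 0.20]         # proportion of the school's students answering correctly

residuals = item_residuals(school_theta, item_difficulties, observed_p)
for b, r in zip(item_difficulties, residuals):
    flag = "+" if r > 0 else "-" if r < 0 else "="
    print(f"difficulty {b:+.1f}: observed - expected = {r:+.3f} {flag}")
```

A positive residual corresponds to a "plus" (more students succeeding than the model predicts for a school at that level) and a negative residual to a "minus."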
So one of the ways to think about this, and this was an interpretation question that came up somewhere else. How do you talk about these changes? If you have ten percent more kids doing well on a certain domain of items and you have -- what's a normal size school? [answer: 850]
Eight hundred fifty, I'll use ten percent, that will be good. You have 85 youngsters -- that's approximately three classrooms. So if you want to talk about how are things going, you can talk about having about three classrooms worth of kids doing much better on this objective than we thought. These procedures allow interpretation in a very understandable way. Other IRT models do so as well, but this does it in a very straightforward way: we can say to principals, you have about four classrooms worth of kids doing better on these items. Then you find out who the teachers are and what they are doing. I mean you can get back to that level. I think the school reports are useful, and I'm not surprised that principals applaud. It's our speaking to them in the language in which the curriculum is organized, around objectives.
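The classroom arithmetic above is simple enough to write down. The sketch below is only an illustration: the function name is hypothetical, and the class size of 28 is an assumption chosen because it makes 85 students come out to roughly three classrooms; the talk does not state a class size.

```python
def classrooms_equivalent(enrollment, percent, class_size=28):
    """Convert a percentage of a school's enrollment into students and classrooms.

    class_size is an assumed typical class size, not a figure from the talk.
    """
    students = enrollment * percent / 100
    return students, students / class_size


# The two examples from the talk: a school of 850 and a school of 600, ten percent each.
for enrollment in (850, 600):
    students, rooms = classrooms_equivalent(enrollment, 10)
    print(f"{enrollment} students, 10%: {students:.0f} students, about {rooms:.1f} classrooms")
```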
I think again in the C&I context, there's very little variation among the experiments or the items which are the traits. I think one of the things to look at is to ask teachers, maybe a priori, which of these experimental stations they felt would be relatively more difficult and relatively easier. To try to get some sense of whether the data are reflecting what people familiar with the curriculum would have expected might be going on. I guess it's sort of a content type of thing.
Again let me slip into one of my technical notes here. I think we have to be very careful in the application of the facets model, about the kind of inferences that we draw. These inferences are conditional inferences, conditional on having obliterated all sources of variation in the scoring. This is what your score would be. So if there were no rater variation, if there were no differences among experiments, if there were no differences in the degree of difficulty, your score would be this. Well, the fact is that there are differences among raters and there are differences among the experiments, and so on. And I think we have to be careful; it's reminiscent of the problems when people do analysis of covariance to make things go away that they wish weren't in their data. Then you end up with inferences that are conditional on some things that you created artifactually.
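For readers who want the model behind this caution written out, the many-facet Rasch model is usually stated in the form below. This is the common textbook statement, not something quoted from the papers; the symbols are labels chosen here, and whether the analysis under discussion used exactly this parameterization is an assumption.

```latex
% P_{nijk}: probability that student n is given category k (rather than k-1)
%           by rater j on task (experiment) i
% B_n: ability of student n        D_i: difficulty of task i
% C_j: severity of rater j         F_k: difficulty of the step from category k-1 to k
\log\!\left(\frac{P_{nijk}}{P_{nij(k-1)}}\right) = B_n - D_i - C_j - F_k
```

An adjusted score is then the score the model expects for a student with ability B_n when D_i and C_j are replaced by reference (for example, average) values. That substitution is what makes the inference conditional: it describes the score under a task and a rater that the student did not actually have.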
Now in the case that Bob describes, these are conditions we wish we had. We wish we had perfect rater agreement, but we don't. I would just urge serious caution in that regard. First of all I think we ought to think about it very carefully. The other is I think there are some ethical and potential legal problems here. You have a high stakes test with passing set at a certain number of points; just how well can you explain to the court that Joey really got a seven and he needed a six to pass, but he had an easy prompt and the two easiest raters on the planet, and so we are really going to fail him, because we know how to do all of the psychometric stuff? On the other hand, Joe had a hard prompt and the blue meanie scored his paper, so even though he had a three, we are adjusting his score up. I think we have to think through whether the system can deal with those kinds of statistical adjustments. This is not news; this is something we deal with all of the time statistically when we create conditions with our statistical procedures.
There was the issue how to report growth. As I said earlier, you can actually state your report of growth in terms of the number of students, additional numbers of students and how many classes this means.
I want to mention one last thing just briefly. And I don't mean this at the end as a throw away. It may be at the end because it is really important. The least we can do in testing and measurement is be fair. If that's all we ever do, that would be pretty decent.
I get very, very concerned at the proliferation of assessment devices, assessment protocols. Everybody in the world of course can do all kinds of assessment today. I get very, very concerned about equity issues. There is an increasing body of literature on equity related to students' abilities to perform on open ended tests. I think we have to be very careful about that. It is explicit in two of the papers; it wasn't the emphasis in Gage's paper, and he didn't happen to mention it. But all of these programs look, item by item, at the question: is there any variation between genders that is surprising to us, conditional on ability? Is there any variation by ethnicity, conditional on ability? I think the variation by ability, looking at ability as one of the aspects of diversity, is a very important aspect of diversity.
We tend to think sometimes of diversity as ethnicity, gender, perhaps region and so on, but variation in ability is one of the aspects of diversity with which we have to cope. And that is addressed in at least one of the papers, and I'm fairly certain it's in some of the others. So in sum, I think these are very important papers at NCME and I think they are very important anyway. For a long time, this kind of work was too technical to be at the National Council of Math Teachers (NCMT), far too technical for that. But not rigorous enough to be in Division D of AERA. And, in fact, for a long time it was difficult to get these applied papers even at NCME. I think it is very, very important that these papers have found a home here, and I'm pleased to have had a chance to talk to you all.
Discussant’s Comments
H.D. Hoover
University of Iowa
I have to outdo Joe Ryan here about the time we've been at these damn things. This is my 30th anniversary of AERA. In 1967, the first time I came to AERA, the most noticeable differences were, first, the weather was nicer. Second, and the main one, is that I remember paying 90 cents for a beer. And being a hillbilly, I'd never even imagined that I would pay 90 cents for a beer. And the reason that I remember that so vividly, I had left a dime laying there on the bar where I had this beer. I just laid it there, I didn't mean it as a tip for godsakes. I'd already spent all of my money. And, this woman bartender, I will always remember, took that dime and slides it over and grabs it and says, "last of the big time spenders." So I remember this vividly, and it was in Chicago. I'll have to see if that woman is still in my bar down the street.
Well I do want to start off by making some very complimentary statements here. The development task that obviously is taking place in all institutions is very impressive. As someone who builds tests I know what is involved. It's obvious that all three places here are doing an awful lot of work, not only in test development, but lots of work technically trying to do the best job they can to help people interpret the data that they are presenting to them. And, I think in most cases, appear to be really being very helpful. Now I would have to say that 90 percent, since this title was really dealing with taming the Rasch tiger, or something to that effect, that 90 percent of what has been presented here you could have done with the Classical Model. Nearly everything here we've had here today you could have done probably just as easily with your p-values. In some cases actually they get somewhat easier to interpret some of the results, because you are on a scale that most people understand a little bit better. But, still I am really very impressed with it.
Now what I want to do, I would though, since the issue is dealing with item response theory in the Rasch model, and I'm going to come back to a couple of specific examples here in a minute. I want to read two things that people have said, because, it happens that both of these people have been my boss. So that's why I'm going to quote both of them. You know, I'm still at the stage in my life where I could use a raise any time.
The first person I will quote is Bill Kaufmann. He says something, and I think we have to think about this, especially regarding a couple of things I want to look at here in a moment, especially regarding Gage's presentation.
"Recall that the Rasch model is based on the same basic information as the classical model. That is, the responses of individuals to test items, and there is nothing magical about the numbers that are produced by the application of the model." This is pretty important because even related to, Bob Hess's presentation, I think this is fairly important regarding some things on his paper, I hope I'll be able to refer to later. Then Bill closed by saying, "If you propose to make use of the Rasch model in generating your test scores, recognize that the better that you understand the model and its implication, the more likely you are to make sensible interpretations with the test scores you generate. Don't be mesmerized by the sales pitches of those who are marketing items pools or computer programs. Insist on understanding what it is you are buying before you close the deal."
I'll have to say this, I think everyone here was very cautionary in terms of what they wrote in their papers about interpreting the data, the underlying assumptions that were involved here. But I'm going to also quote a little bit and this is cheating. I told Bob that I was probably going to do this. This is actually from Bob Brennan's NCME presidential address [tomorrow] morning, but this is just a little piece of it. And, so, this is again something I think that is really important that we remember – "in general strong assumptions lead to strong results. This is particularly evident in IRT. Indeed, the assumptions of IRT hold many of the most intractable measurement problems (murmur). However, a claim that a model solves a thorny measurement problem is credible only to the extent that the assumptions engaged in addressing the problem can be shown to withstand serious challenge. Too frequently, in my opinion, we act as if assumptions are met without question. Such unrestrained confidence can usually lead to excessive, or at least unsubstantiated, public claims about what our models can accomplish in real life educational testing context."
Then he hits himself a little by saying "there is also an unfortunate tendency to act as if powerful models can compensate for design deficiencies or lack of data. For example, although Rasch has the theoretical capability of disentangling multiple sources of error variance, to do so well in practice requires relatively large numbers and condition or facets. The theory per se cannot compensate for lack of data or poor data collection designs. As our models become more complex the need for computer programs has mushroomed. While good programs are available for numerous models, none of them produces infallible estimates and probably all of them can produce silly results. It is always dangerous to have unquestioned confidence in the output of computer programs, and the more complicated the model the greater the danger." You know most of what we have been seeing up here today is stuff that is being dumped out of computer programs, and again I think we have to be a little careful in interpreting those.
In Bob Hess and Mark Becker’s paper they say, "Glendale Union employees have a curriculum driven assessment program rather than an assessment driven curriculum. In this fashion Glendale Union students, teachers, administrators, and parents are assured that the assessments are a valid representation of the instructional expectations placed upon the learners."
Well, I'm not sure that I buy that, just because you say well I'm doing this instead of that. Then all at once I'm assuring everybody that what I'm doing is valid, when validity actually is a much more complex issue than that. It deals with the nature of the test questions. Here in this case it deals with the appropriateness of the model for making the inferences that you are making. And a point that I want to make because I think that this is related to all three of these districts. All three programs appear to be very much related to the local curriculum. We hear this now all the time it seems, that the curriculum and the tests should all match together perfectly. Well, you know if you do that we run into some serious problems as to whether we really know what's going on. I mean one of the big reasons that we give tests, especially large scale tests, especially tests that I think in every case here serve some external monitoring function. If the curriculum is crappy, and your test matches it perfectly, you might have problems.
And we're overselling this idea that gee -- this matching together perfectly is just what we want, because then you truly are left maybe with a position where you really aren't finding anything out. And so, that's something we oversell and I think it is very important that we have at least some kind of, and I'm not up here trying to sell the ITBS. Hell, it does fine and people sell it fine, but I do think that this sort of external way we think of tests especially the large scale tests, whether they be state, or federal assessments.
I do think that this is something that we have to think about fairly seriously, and, in the Hess paper there are some specific questions. This goes back to the idea that your data is just still a function of kids' responses to items. It has nothing to do with the model. That's all the data you got -- is what these little kids did. Now before you said that you had these inter-rater reliabilities of .66 to .70. Then afterward we throw facets in here, and the inter-rater reliabilities go up to .80 to .86, .89, .89, let's see -- .86, .82, .89, and then .98. Is there a typo in there?!
Also, the point is made that people are going to feel better if we equate the difficulty of these tasks. Well, the primary issue in terms of tasks, I think maybe you are fooling people a little. The real issue with multiple tasks is really a fairness issue and if I happen to be a little kid that gets a task that I'm not very good at or don't understand very well or I don't find interesting, equating out some kind of little statistical adjustment really doesn't help me any. And so I think you're misleading the people a little to think that you are really making it much fairer. You are maybe making a little tiny difference in terms of this, but the real issue here is still this kid is taking one or two tasks, which in fact, the task by student interaction is still as big as it ever was.
Now Joe O’Reilly's paper. Heck, you should have said that comment in your written report about sows or cellars on the ITBS in your presentation. Because I was going to make fun of him, I will anyway. Because of the bowling ball thing, you know he was using that example -- bowling balls. In his paper he contrasted the fact that when they gave the Iowa's they had these sows in cellars, hogs I think would be the right word, you know. I want to tell you something, we don't have sows and stuff on the ITBS, or we'd be in big trouble. There is a lot of stuff there, if you can tell by listening to me, about hillbillies and rednecks, and we know about bowling balls, and would have done fine on that.
Now the primary thing here deals with how do I interpret this Mesa Scale Score [growth] Scale? I always loved this, because some of the first talks I ever gave to this conference deal with grade equivalents, which, and it's always because people hate themselves. It's so much fun to be on the other side, and so you know I talk. Well you know he kept trying to call those things grade equivalents, you know he just wouldn't ever do it.
And, eventually, that's what you're gonna have to do to interpret the mess because that's what you kept saying. But anyway, so that's my answer to your problem. These are GE's, and just call them that!
And I did notice one thing about your data that might be related to Gage's a little. The idea that these scales tend to always show this declining variability. Yours do too up on these graphs. And I'll have to say that's something about IRT that will always bewilder me, and I don't understand. Actually there's a paper on it, I see Haffee sitting back here. Omar Haffee is here and he's going to present a paper later that might help explain why that happens with IRT scales.
Figure 1: Mean mathematics RIT scores across sixteen years

The last one is Gage's data, which is actually neat (Fig. 1). Either K through 2 teachers have done an incredible job in Portland, and they should get all of the credit and all the raises obviously. Or the kids in Portland are just a lot smarter than they were ten or fifteen years ago. Now the real question I would ask myself about all this, is there other evidence supporting this from somewhere else? And some that I might think of as being at least more external to the system. And, when I look at this data I did see some cut things that make me wonder about the equating. If I take the last four years of each one of these grades I get a real distinct pattern here nearly in every case. Then if I take the first five and separate them out I see three distinct trends in this data. Remember this is "population" data. Now either some major population shifts occurred at these points. See here you get nice steady growth, then they are flat. And they bounce around, they're flat. Then all at once we get a jump. Now, by the way, these are cross sectional jumps. This isn't a jump that is occurring where these kids followed these, these followed these. Now I find that more impressive than those things right there. And so when I looked at the data I found it interesting and I would like to look at it more carefully. But, that to me is more noticeable.
Now what else did I have here? Well, I guess, though, the other thing I would ask myself about these trend issues. I think there are fundamental equating issues, and again, using the term equating is maybe not quite right here because we're just going to (murmur). We just keep putting these on the same scale. But it is really important, I think, in terms of how you are really doing this. It is pretty easy if you introduce new items regularly. You have this effect going on in the district where, as you introduce items and the ones that you are doing the linking with, if your new group of people this year are not familiar with the new items, but were with the old ones, it will always make the new ones look harder but maybe they really aren't harder. And so your equating or linking design here really is important. I don't know how it works out there, but I'm still not convinced that you don't have some effects in the way you are gathering your data that are helping to create this. But the other possibility I think is that essentially you just have people getting more and more familiar with what you are measuring.
That's enough, we need to have some time for questions. I enjoyed the papers and again I think they really do represent a lot of work on the part of the people in these school districts in terms of test development and analysis of data. But, as I said, I'm not sure most of this had much of anything to do with the Rasch model. And I don't care where the curve came from, it's just arithmetic. Thanks.