The pyrotechnics around the Common Core have obscured some of the more prosaic, but critical, questions it poses for K-12 reform. Field testing for the Common Core-aligned PARCC and Smarter Balanced assessments is now in full swing for millions of students, and dozens of states are planning to make big decisions based on next year’s results. Yet, four years after these testing consortia launched, I still can’t get answers to practical questions about whether the results will provide the kind of valid, reliable data needed to support transparency, accountability, and informed competition.
A couple of weeks ago, I flagged a few of these questions. Despite a flurry of assurances that PARCC and SBAC officials would respond, I’m still waiting to hear anything more than, “Trust us, we’re really smart.” Here are three questions that I can’t seem to get answered:
How will we compare the results of students who take the assessment using a variety of different devices? There will be variability in screen sizes, keyboards, and potentially in the visual display. Some students will be using certain kinds of devices for the first time. And many states will be administering tests to some number of students using paper and pencil in 2015, and likely beyond. What do we know about how to account for all this variation in order to produce valid, reliable results?
There are always questions about the consistency of testing conditions, but they get super-sized when the stakes climb and the variation is non-random. Limited access to the required devices means that all the usual questions get sharper.
How will PARCC and SBAC account for vastly different testing conditions? Depending on testing infrastructure, some schools will be able to assess students in their regular classroom while other schools will have to shuffle students around the building, to schools across town, or to independent testing centers. How much does this matter? What do we know about how to track and then account for the impact of such factors on outcomes?
How will we account for the fact that we’re apparently looking at testing windows that will stretch to four or even 12 weeks? This means that some schools will give the test many weeks after other schools. Students in schools that administer the test toward the end of the testing window will have had a lot more instructional time than students in schools that test at the beginning. The difference could amount to 10% or more of the instructional year. How is this going to be tracked and accounted for when comparing teachers, schools, programs, and vendors?
Right now, it appears likely that some schools and teachers will have substantially less time to teach the tested material and will have students taking the tests in adverse conditions, but that the testing consortia and state officials plan to treat the results as apples-to-apples comparisons of school quality. I can’t think of any well-managed enterprise, for-profit or nonprofit, that would consider that sensible.
The National Assessment of Educational Progress is credible in part because it’s careful about this stuff. When students take NAEP, they do so under identical, controlled testing conditions. If these questions about the Common Core tests don’t get answered, and soon, it’s going to gravely undermine the legitimacy of metrics on school and teacher performance. That’s bad for transparency, accountability, and efforts to evaluate teachers in smarter ways. The hurried push to introduce national, computer-based assessments raises important questions about the validity and reliability of results, and those questions could even put new teacher evaluation and pay systems in legal jeopardy.
The point is NOT that the above questions are necessarily deal-breakers, but that they should have been addressed during the design phase and that they need to be answered convincingly, and soon.