In a few recent posts I’ve talked about the difficulties with using criteria as the sole reference point for exams. I think these difficulties can be seen very clearly with national curriculum levels, which are used as a method of criterion-referencing internal assessments.
National curriculum levels are going to be abolished, but it is likely that many schools will continue to use them for their own internal assessments. One of the reasons many people like NC levels is that they provide a ‘shared language’ or a ‘common framework’ about what pupils can and can’t do. In this post, Chris Husbands runs through a lot of the problems that NC levels had, but concludes that they were still worth retaining because of this common language. He argues that national curriculum levels were valuable because ‘they provided a national standard…The loss of a common national framework – something which international visitors have generally admired – is a big price to pay.’ The NFER make a similar point here, criticising levels but lamenting the loss of a shared framework.
In this post I want to argue that in fact national curriculum levels did quite a poor job of providing a common language, and that a large part of the reason for this is that the way they were used for internal assessments relied very heavily on unstandardised criterion referencing.
In this post I spoke about some of the general difficulties with criterion-referenced exams. In this one I looked at how this was particularly applicable to English creative writing assessments. A common response to this is to say that it is particularly difficult to mark creative writing assessments because creative writing judgments are inherently subjective. There is some truth to this, of course, and in fact when I was first training, I did think that it would be possible to criterion reference maths and science questions in a much more accurate way. But actually, the same problems afflict maths and science. Tim Oates again:
Even a well-crafted statement of what you need to get an A grade can be loaded with subjectivity – even in subjects such as science. It’s genuinely hard to know how difficult a specific exam is.
For a very good and specific example of this, here’s Paul Bambrick-Santoyo in Driven by Data.
To illustrate this, take a basic standard taken from middle school math:
Understand and use ratios, proportions and percents in a variety of situations.
To understand why a standard like this one creates difficulties, consider the following premise. Six different teachers could each define one of the following six questions as a valid attempt to assess the standard of percent of a number. Each could argue that the chosen assessment question is aligned to the state standard and is an adequate measure of student mastery:
Identify 50% of 20.
Identify 67% of 81.
Shawn got 7 correct answers out of 10 possible answers on his science test. What percent of questions did he get correct?
J.J Redick was on pace to set an NCAA record in career free throw percentage. Leading into the NCAA tournament in 2004, he made 97 of 104 free throw attempts. What percentage of free throws did he make?
J.J Redick was on pace to set an NCAA record in career free throw percentage. Leading into the NCAA tournament in 2004, he made 97 of 104 free throw attempts. In his first tournament game, Redick missed his first five free throws. How far did his percentage drop from before the tournament game to right after missing those free throws?
J.J Redick and Chris Paul were competing for the best free-throw shooting percentage. Redick made 94% of his first 103 shots, while Paul made 47 out of 51 shots.
a) Which one had a better shooting percentage?
b) In the next game, Redick made only 2 of 10 shots while Paul made 7 of 10 shots. What are their new overall shooting percentages?
c) Who is the better shooter?
d) Jason argued that if Paul and J.J each made their next ten shots, their shooting percentages would go up the same amount. Is this true? Why or why not?
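To make the spread concrete, here is a quick sketch (mine, not Bambrick-Santoyo’s) working through the arithmetic behind the first five questions. The computational demand ranges from a single halving to a multi-step comparison of two ratios, even though every question can claim to ‘assess’ the same standard.

```python
# The arithmetic behind the first five questions above.
# All five are 'percent of a number' problems, but the work involved
# varies enormously.

# Q1: 50% of 20 -- a single halving
q1 = 0.50 * 20                 # 10.0

# Q2: 67% of 81 -- an awkward multiplication
q2 = 0.67 * 81                 # ≈ 54.27

# Q3: 7 correct out of 10, as a percentage
q3 = 7 / 10 * 100              # 70.0

# Q4: 97 made out of 104 free throws, as a percentage
q4 = 97 / 104 * 100            # ≈ 93.3

# Q5: the drop after missing five more attempts -- two percentages
# and a subtraction, a genuinely multi-step problem
before = 97 / 104 * 100        # ≈ 93.3
after = 97 / 109 * 100         # ≈ 89.0
q5 = before - after            # ≈ 4.3 percentage points

print(round(q1, 2), round(q2, 2), round(q3, 2), round(q4, 2), round(q5, 2))
```

A teacher could defend any one of these as a fair test of the standard, which is exactly the problem.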
Although the English curriculum’s maths descriptors are slightly more detailed, this point still holds. The level 5 descriptor for number and algebra is ‘pupils calculate fractional or percentage parts of quantities and measurements, using a calculator where appropriate.’ Level 6 is ‘pupils evaluate one number as a fraction or percentage of another. They understand and use the equivalences between fractions, decimals and percentages, and calculate using ratios in appropriate situations.’
Even though the levels give slightly more detail than Bambrick-Santoyo’s example of the New Jersey standard, his criticisms still hold. Using those level descriptors, you would classify both of these questions as being of equal difficulty: ‘What is 50% of 100?’ and ‘What is 87% of 437?’
Whether allowed access to a calculator or not, most pupils would get the first question right. Not nearly as many would get the second question right. Yet both of these, according to the levels, count as level 5 questions.
This poses a particular problem for the creators of national exams. They have to create exams that have different questions, but are of comparable difficulty. The criteria, or standards, or level descriptors, are meant to help guide them, but it is fiendishly difficult to do. As I said in my previous blog post:
Examiners are in an extremely difficult situation. Year after year, they have to create tests of comparable difficulty to those of the year before that are nevertheless sufficiently different from those of the year before. If they make the test close in structure and format to that of the year before, then it will be easier to compare across years, but it will also be easier for the succeeding year to do well at the test. If they depart more significantly from the structure and format of the year before, they ensure that the succeeding year don’t gain an unfair advantage, but they also make it much harder to compare between the two years.
Chris Husbands makes a very similar point here. He shows that in order to make consistent and comparable judgments, you need shared tasks. Individual questions are not judged as being easy or hard in reference to abstract criteria, but in reference to how pupils have performed on them. This is easy for international tests such as PISA to do, but much harder for national curriculum tests (my italics in the quotation that follows):
The aspiration is to hold the difficulty of the test constant over time, so that children with similar attributes do equally well in any year. It is not too difficult for PISA to achieve this – questions are kept private so that some can be re-used and the difficulty of new ones scaled against them, whilst the test is administered only every three years to a sample. But it will be impossible to keep KS2 questions private as teachers administer the tests, and they will do so to all pupils in all schools. If questions are not re-used then it will be difficult genuinely to scale the test each year to secure consistency. But if questions are re-used it will be difficult to make the test sufficiently different each year to avoid a repeat of the gradual grade improvement as teachers learn what is expected.
Bambrick-Santoyo concludes from his analysis of standards that ‘standards are meaningless until you define how you will assess them.’ I would agree, and I would even go further. Yes, standards are meaningless if left undefined. But they give the illusion of meaning, which makes them very confusing. If two people use the same word to refer to something different, there is huge potential for confusion.
It’s at this point that I would depart from Chris Husbands. Although he accepts in the paragraph above that you can’t have comparable standards without norming, he then goes on to argue that national curriculum levels – which are essentially criteria – could provide comparable standards. But I don’t think levels did give us this common national framework. They gave us the illusion of one.

One person was talking about a level 4 and thinking it meant ‘Identify 50% of 20’, whereas another person was talking about the same descriptor but thinking it meant ‘Identify 67% of 81’. In English, one person was defining ‘convincing’ persuasion in one way, and another person was defining it in quite another. One teacher might set an interim termly task asking pupils to infer from a short unseen poem; another might set a task asking pupils to infer from a novel they’d studied in class. The result was not shared standards, but the illusion of shared standards. It was often noted that a primary school level 4 was very different from a secondary school level 4. But I think the problem went wider than that: even between schools and teachers there was variation in what these levels really meant.

The fact is, if you want a shared language, you need shared content and tasks. If you keep the shared language but get rid of the shared content and tasks, you don’t actually have a shared language any more. It’s as if we were all using the word ‘cat’ and pronouncing it in the same way, but some of us were using it to refer to zebras, some to lions, some to dogs and some to domestic cats.
Where I would agree with Chris Husbands is that this was more of a problem at KS3 than at KS2. That is partly because the freedom for individual teachers and schools to select content was in practice greater at KS3, and partly because, when KS3 tests were abolished, they took with them the sole reference point for shared content in that key stage.
We don’t get a shared language through abstract criteria. We get a shared language through teaching shared content, doing shared tasks, and sharing pupil work with colleagues. Tom Sherrington’s post here gives some idea as to how this might work.