Back in 2013 I wrote a lengthy review of Measuring Up by Daniel Koretz. This book has had a huge influence on how I think about assessment. Last year I read Principled Assessment Design by Dylan Wiliam, which is equally good and very helpful for anyone looking to design a replacement for national curriculum levels. As with Koretz’s book, it packs a lot in – there are useful definitions and explanations of validity, reliability, and common threats to validity. There are two areas in particular I want to comment on here: norm-referencing and multiple-choice questions. These are two aspects of assessment which people are often quite prejudiced against, but Wiliam shows there is some evidence in favour of them.
Wiliam shows that in practice, norm-referencing is very hard to get away from. The alternative is criterion-referencing, where you set a criterion and judge whether pupils have met it or not. This sounds much fairer, but it is actually much harder to do than it sounds. Wiliam gives a couple of very good practical examples of this. Take the criterion ‘can compare two fractions to identify which is larger’. Depending on which fractions are selected, as many as 90% or as few as 15% of pupils will get the question right. How should we decide which pair of fractions to include in our assessment? One useful way would be to work out what percentage of pupils from a representative norm-group got each question right. That’s essentially norm-referencing. The criterion can only be given meaning through some use of norming. ‘As William Angoff observed four decades ago, “we should be aware of the fact that lurking behind the criterion-referenced evaluation, perhaps even responsible for it, is the norm-referenced evaluation” (Angoff, 1974, p.4).’
He also draws a distinction between norm-referencing and cohort-referencing. Norm-referencing is when you ‘interpret a student’s performance in the light of the performance of some well-defined group of students who took the same assessment at some other time.’ Cohort-referencing is when you interpret performance in the light of other pupils in a cohort – be that a class or a particular age cohort. This may not sound much of a distinction, but it is crucial: ‘The important point here is that while cohort-referenced assessment is competitive (if my peers do badly, I do better), norm-referenced assessment is not competitive. Each student is compared to the performance of a group of students who took the assessment at an earlier time, so in contrast to cohort-referenced assessment, sabotaging the performance of my peers does me no good at all.’
I hadn’t fully considered this before, but I think it is an extraordinarily important point to make because it could perhaps help to improve norm-referencing’s bad image. Often, people associate norm-referencing with an era when the top grade in public exams was reserved for a certain percentage of the pupils taking the exam. However, that’s not norm-referencing – it’s cohort-referencing. Norm-referencing doesn’t have any of these unfair or competitive aspects. Instead, it is simply about providing a measure of precision in the best way we can.
Wiliam also warns against over-reliance on rubrics which attempt to describe what quality looks like. He quotes a famous passage from Michael Polanyi which shows the limitations of attempting to describe quality. I’ve written at greater length about this here.
As with norm-referencing, multiple choice or selected-response questions often get a bad press. ‘Particularly in the UK, there appears to be a widespread belief that selected-response items should not be used in school assessment … It is true that many selected-response questions do measure only shallow learning, but well-designed selected-response items can probe student understanding in some depth.’
Wiliam gives some good examples of these types of question. For example:
What can you say about the means of the following two data sets?
Set 1: 10 12 13 15
Set 2: 10 12 13 15 0
A. The two sets have the same mean.
B. The two sets have different means.
C. It depends on whether you choose to count the zero.
As he says, ‘this latter option goes well beyond assessing students’ facility with calculating the mean and probes their understanding of the definition of the mean, including whether a zero counts as a data point or not.’ I would add that these types of questions can offer better feedback than open-response questions. If you deliberately design a question to include a common misconception as a distractor, and the pupil selects that common misconception, then you have learnt something really valuable – far more valuable than if they simply don’t answer an open-response question.
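For readers who want to check the arithmetic behind the distractors, here is a quick sketch (my own illustration, not from Wiliam’s book). The zero adds nothing to the sum but does add to the count, so it pulls the mean down – which is why option B is the right answer and option C encodes the misconception:

```python
# Wiliam's example question: does adding a zero change the mean?
set_1 = [10, 12, 13, 15]
set_2 = [10, 12, 13, 15, 0]

# A zero contributes nothing to the sum, but it IS a data point,
# so it increases the divisor and therefore lowers the mean.
mean_1 = sum(set_1) / len(set_1)  # 50 / 4 = 12.5
mean_2 = sum(set_2) / len(set_2)  # 50 / 5 = 10.0

print(mean_1, mean_2)  # 12.5 10.0
```

A pupil who picks C has revealed exactly where their understanding breaks down: they can calculate a mean, but they don’t yet treat zero as a genuine data point.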
Wiliam also notes (as he has done before, here) that one very simple way of stopping pupils guessing is to have more than one right answer. If there are five possible options and pupils know exactly one is right, they have a 1 in 5 chance of guessing the right answer. If they don’t know how many are right, they have only a 1 in 32 chance of guessing the right combination. In response to this I designed a lot of MC tests with this structure this year, and I can confirm that it significantly increases the challenge: pupils have to spend a lot more time thinking about all of the distractors. If you want to award marks for the test and record them, though, you have to think carefully about how this works. For example, if there are, say, three right answers and a pupil correctly identifies two, it can feel harsh for the pupil to get no credit, particularly compared to a pupil who has guessed one completely wrong answer. This isn’t a problem if the questions are used purely formatively, of course, but it is something to bear in mind. However, I can definitely vouch for Wiliam’s central point: multiple-choice questions can be made extremely difficult and challenging, and they can certainly test higher-order learning objectives. For more on this, see my series of posts on MCQs starting here.
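The 1-in-5 versus 1-in-32 figures fall straight out of counting: with one right answer a guesser picks from 5 options, but if any combination of the 5 options might be right, a guesser is effectively picking one of 2⁵ = 32 possible subsets. A small sketch of that counting argument (my illustration, not from the book):

```python
n_options = 5

# Exactly one option is right and the pupil knows it:
# a blind guess picks one of 5 options.
p_one_right = 1 / n_options

# Any number of options could be right, so a blind guess must
# pick an entire subset of the options. Each option is either
# selected or not, giving 2**5 = 32 possible response patterns.
n_patterns = 2 ** n_options
p_unknown_count = 1 / n_patterns

print(p_one_right, p_unknown_count)  # 0.2 0.03125
```

In other words, simply withholding how many answers are correct cuts the odds of a lucky guess by a factor of more than six, without changing the question content at all.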