Principled Assessment Design by Dylan Wiliam

Back in 2013 I wrote a lengthy review of Measuring Up by Daniel Koretz. This book has had a huge influence on how I think about assessment. Last year I read Principled Assessment Design by Dylan Wiliam, which is equally good and very helpful for anyone looking to design a replacement for national curriculum levels. As with Koretz’s book, it packs a lot in – there are useful definitions and explanations of validity, reliability, and common threats to validity. There are two areas in particular I want to comment on here: norm-referencing and multiple-choice questions. These are two aspects of assessment which people are often quite prejudiced against, but Wiliam shows there is some evidence in favour of them.

Norm-referencing
Wiliam shows that in practice, norm-referencing is very hard to get away from. The alternative is criterion-referencing, where you set a criterion and judge whether pupils have met it or not. This sounds much fairer, but it is actually much harder to do than it sounds. Wiliam gives a couple of very good practical examples of this. Take the criterion ‘can compare two fractions to identify which is larger’. Depending on which fractions are selected, as many as 90% or as few as 15% of pupils will get the question right. How should we decide which pair of fractions to include in our assessment? One useful way would be to work out what percentage of pupils from a representative norm-group got each question right. That’s essentially norm-referencing. The criterion can only be given meaning through some use of norming. ‘As William Angoff observed four decades ago, “we should be aware of the fact that lurking behind the criterion-referenced evaluation, perhaps even responsible for it, is the norm-referenced evaluation” (Angoff, 1974, p.4).’
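To make the norming idea concrete, here is a minimal sketch in Python of how the facility of each candidate item could be estimated from a representative norm group. The fraction pairs and response data are entirely hypothetical (they are not Wiliam's examples), but they show how the choice of which pair to include becomes an empirical question about the norm group.

```python
# A minimal sketch (hypothetical data) of estimating item facility from a
# representative norm group: the proportion of the group answering correctly.

def facility(responses):
    """Proportion of the norm group answering the item correctly."""
    return sum(responses) / len(responses)

# 1 = correct, 0 = incorrect, for a tiny, made-up norm group of ten pupils.
norm_group = {
    "Which is larger, 3/7 or 5/7?": [1, 1, 1, 1, 1, 1, 1, 1, 1, 0],
    "Which is larger, 3/4 or 4/5?": [0, 0, 1, 0, 0, 0, 0, 1, 0, 0],
}

for item, responses in norm_group.items():
    print(f"{item}  facility = {facility(responses):.0%}")
```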

He also draws a distinction between norm-referencing and cohort-referencing. Norm-referencing is when you ‘interpret a student’s performance in the light of the performance of some well-defined group of students who took the same assessment at some other time.’ Cohort-referencing is when you interpret performance in the light of other pupils in a cohort – be that a class or a particular age cohort. This may not sound much of a distinction, but it is crucial: ‘The important point here is that while cohort-referenced assessment is competitive (if my peers do badly, I do better), norm-referenced assessment is not competitive. Each student is compared to the performance of a group of students who took the assessment at an earlier time, so in contrast to cohort-referenced assessment, sabotaging the performance of my peers does me no good at all.’

I hadn’t fully considered this before, but I think it is an extraordinarily important point to make because it could perhaps help to improve norm-referencing’s bad image. Often, people associate norm-referencing with an era when the top grade in public exams was reserved for a certain percentage of the pupils taking the exam. However, that’s not norm-referencing – it’s cohort-referencing. Norm-referencing doesn’t have any of these unfair or competitive aspects. Instead, it is simply about providing a measure of precision in the best way we can.

Wiliam also warns against over-reliance on rubrics which attempt to describe what quality looks like. He quotes a famous passage from Michael Polanyi which shows the limitations of attempting to describe quality. I’ve written at greater length about this here.

Multiple-choice questions
As with norm-referencing, multiple choice or selected-response questions often get a bad press. ‘Particularly in the UK, there appears to be a widespread belief that selected-response items should not be used in school assessment…It is true that many selected-response questions do measure only shallow learning, but well-designed selected-response items can probe student understanding in some depth.’

Wiliam gives some good examples of these types of question. For example:

What can you say about the means of the following two data sets?

Set 1: 10 12 13 15
Set 2: 10 12 13 15 0

A. The two sets have the same mean.
B. The two sets have different means.
C. It depends on whether you choose to count the zero.

As he says, ‘this latter option goes well beyond assessing students’ facility with calculating the mean and probes their understanding of the definition of the mean, including whether a zero counts as a data point or not.’ I would add that these types of questions can offer better feedback than open-response questions. If you deliberately design a question to include a common misconception as a distractor, and the pupil selects that common misconception, then you have learnt something really valuable – far more valuable than if they simply don’t answer an open-response question.
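As a quick check of the arithmetic behind the item (a short Python sketch of mine, not something from the book), the zero does count as a data point and pulls the mean down, so option B is the intended answer:

```python
# Quick check of the arithmetic behind the two data sets above.
from statistics import mean

set_1 = [10, 12, 13, 15]
set_2 = [10, 12, 13, 15, 0]

print(mean(set_1))  # 12.5
print(mean(set_2))  # 10 -- the zero counts as a data point and lowers the mean
```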

Wiliam also notes (as he has done before, here) that one very simple way of stopping pupils guessing is to have more than one right answer. If there are five possible options and the pupils know exactly one is right, they have a 1 in 5 chance of guessing the right answer. If they don’t know how many are right, they have a 1 in 32 chance of guessing the right answer.

In response to this I designed a lot of multiple-choice tests with this structure this year, and I can confirm that it significantly increases the challenge: pupils have to spend a lot more time thinking about all of the distractors. If you want to award marks for the test and record them, you have to think carefully about how this works. For example, if there are, say, three right answers and a pupil correctly identifies two of them, it can feel harsh for that pupil to get no credit, particularly when compared to a pupil who has just guessed one completely wrong answer. This isn’t a problem if the questions are used completely formatively, of course, but it is something to bear in mind. However, I can definitely vouch for Wiliam’s central point: multiple-choice questions can be made to be extremely difficult and challenging, and they can certainly test higher-order learning objectives. For more on this, see my series of posts on MCQs starting here.
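To spell out the arithmetic behind those odds, and the marking dilemma that follows, here is a rough sketch of my own in Python (the answer key and the all-or-nothing marker are illustrations, not Wiliam's method):

```python
# Odds of guessing a five-option item correctly.
options = 5

# Exactly one option is right, and pupils know this: pick one of five.
print(f"One right answer, known to be one: 1 in {options}")

# Any number of the five options might be right and pupils don't know how many:
# each option is independently ticked or left blank, giving 2**5 answer patterns.
print(f"Unknown number of right answers:   1 in {2 ** options}")

# The marking dilemma under all-or-nothing scoring: a pupil who finds two of the
# three right answers scores the same as one who ticks a single wrong option.
def all_or_nothing(chosen, key):
    return 1 if set(chosen) == set(key) else 0

key = {"A", "C", "E"}                       # hypothetical answer key
print(all_or_nothing({"A", "C"}, key))      # 0 -- two of three right, no credit
print(all_or_nothing({"B"}, key))           # 0 -- one wrong guess, also no credit
```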


4 thoughts on “Principled Assessment Design by Dylan Wiliam”

  1. Andy Day

    You’re right – multiple-choice questions have a place and can be effective, but it depends on what you want to discern, requires very careful compilation and contains one key flaw. I devised 20 m/c Q tests for each end-of-unit geography assessment in Y7-9, some with more than one possible answer; in some cases it was indicated that ‘2’ were ‘correct’, at other times that ‘more than 1 is correct’ but without indicating how many (there were bonus marks for hitting on the right number, to deter students from selecting 4 out of 5).

    The value of using these m/c tests was that some students preferred them to extended writing; they had additional value if some of the Qs were reversed to ‘which of these are incorrect’ rather than ‘correct’, in that when going over them there was more explanation to be had in covering why a statement was ‘false’ than in trying to explain why it was ‘right’; and they were quick and easy for peers to mark and total within the lesson.

    The flaw (and this is why we always used them alongside an issue-based extended-answer question) is that on many occasions a student would challenge an ‘incorrect’ answer and talk it through. What you quickly realised was that they were thinking through a sequence of links that was entirely logical – but not in the same linear manner as I’d envisaged when writing the question. And that’s the key issue with m/c if it’s the only method of assessment – you are scoring whether students are thinking along the same lines, and with the same interpretation of words, as the teacher. Sometimes they’re not – and that’s why we always made sure we used them with open-ended questions too, to catch the insightful, but different, ways of thinking that will occur within a class and that we didn’t account for. The best assessments have to provide a route by which the assessor learns something too – a link they hadn’t thought of, an interpretation they hadn’t considered. Students require the opportunity to demonstrate a line of thinking and a comprehension that we haven’t catered for.

  2. Crispin Weston

    The point that “lurking behind the criterion-referenced evaluation is the norm-referenced evaluation” can also be turned on its head: “lurking behind the norm-referenced evaluation is the criterion-referenced evaluation”. The subject of your normative evaluation has to be defined if the assessment is to have any relevance to your learning objectives.

    The trouble here is that in referring to “criterion referencing”, people are generally conflating two things: first, the attempt to define the capability that is being tested; secondly, the assumption that this capability can be satisfactorily expressed as a binary value – i.e. it is either something that you can do or it is something that you can’t. In reality, most capabilities are possessed to some degree. We therefore need to understand the difference between the way in which we understand the capability that is being measured and the way in which the incremental mastery of that capability can be usefully expressed.

    Nor is the distinction between norm-referencing and cohort-referencing as simple as you make out, because what typically changes from one year to the next is not just the cohort but also the exam. In most cases, you cannot compare a student’s performance against a global norm on the particular questions that are being answered, but only against the rest of the cohort that answered those questions.

    So long as the cohort is drawn from all the students of a particular age across the country, I suspect that the difference between the average ability of one year’s cohort and the next is likely to be significantly smaller than the difference in difficulty between one year’s questions and the next year’s – even if that difference in difficulty might not be apparent until the results are in. That is how I would make the argument for what is generally called norm-referencing, which is in fact referencing against a large cohort.

    The lesson from this is that what matters is that the cohort should be as large as possible. This becomes very difficult to achieve when the assessment is devised by teachers for use with only a small population of students in a formative context. And this is what normally happens when people try to embed assessment within the teaching process, in the worthwhile attempt to merge the summative and formative purposes of assessment.

    This is where ed-tech again offers excellent solutions, the potential for which continues to be almost completely ignored by teachers, academics and officials alike. Practice exercises devised and tracked centrally but sequenced and assigned locally will be able to generate statistically useful quantities of outcome data, ensuring that we retain the statistical merits of large populations without suffering the disadvantages of one-shot sampling exercises.

    Such a scenario also allows for the optimization of questions and exercises in response to extensive piloting, the need for which is highlighted by the issues raised by Andy Day.

  3. Toni Soto (@ToniSoto_Vigo)

    After many hours reading this blog (English is not my mother tongue), following the Assessment tag and hundreds of comments, I can say that here I buy both your arguments (Daisy and Crispin). I really believe in the power of well-constructed MCQs (actually learning activities), which can benefit self-instruction or direct instruction, deliver quality feedback (formative assessment) and, with the support of learning analytics, report on what’s going on in educational systems.

    Thanks for sharing your thoughts with us. I do appreciate your contributions!

    Keep posting and commenting.

    Happy 2016 from Spain.

  4. Rufus

    Really enjoying reading your blogs and I’m excited about MCQs. If I gave a test of 5 questions, each with 4 choices and only 1 correct answer, I wonder what the odds are of guessing the right answers. I’m a mathematician, so I think these calculations are right:
    Guess none right: 243 out of 1024
    1: 405 out of 1024
    2: 270 out of 1024
    3: 90 out of 1024
    4: 15 out of 1024
    5: 1 out of 1024

    This says to me that guessing would typically help students to get 1 question right, and sometimes 2, but guessing would not help students to pass the test.
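    As a quick sanity check, a few lines of Python (assuming every question is answered independently at random) reproduce the same figures:

```python
# Distribution of correct guesses: 5 questions, 4 options, 1 correct answer each.
from math import comb

questions, options = 5, 4
total = options ** questions  # 1024 equally likely answer patterns

for k in range(questions + 1):
    # Choose which k questions are guessed right; each remaining question can be
    # guessed wrong in 3 ways.
    ways = comb(questions, k) * (options - 1) ** (questions - k)
    print(f"{k} right: {ways} out of {total}")
# Prints 243, 405, 270, 90, 15 and 1 -- matching the figures above.
```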

    Best, Rufus

