Intelligence Squared debate: Don’t end the tyranny of the test

On Thursday I spoke at an Intelligence Squared debate called ‘Let’s end the tyranny of the test: relentless school testing demeans education’. Together with Toby Young, I spoke against the motion; Tony Little and Tristram Hunt spoke for it.

There were a number of important points of agreement between the two sides. Tony Little told the story of Tom, a brilliant history student who got a U in his History A-level because his argument was essentially too sophisticated for the narrow exam rubric. I’ve known Toms in my time teaching, and I’ve also known the opposite – the student who gains top grades in exams through a mastery of the exam technique, as opposed to the subject itself. I completely agree with Tony Little that this is a real problem in our exam system. I’ve written about this in this collection of essays here, where I am critical of narrow exam syllabuses and textbooks that encourage a focus on exam structures, rather than the topic itself. For example, in this exam textbook on Germany 1919-1945, there are large sections called ‘Meeting the Examiner’ where the technicalities of the ‘8 mark question’ are discussed, but there is no detail at all on any aspect of German history outside that narrow period. In this textbook, it is more important to Meet the Examiner than it is to meet Bismarck. If you also have concerns over this phenomenon, I’d recommend reading Heather Fearn’s blogs on the subject here.

We also agreed with the proposition that blunt government targets were problematic and that teaching to the test was not a good thing (I have written more about this here), and they agreed with us that exams should always play an important part in education. However, the central area of disagreement, I felt, was in attitudes to exams. The proposition viewed them as essentially a necessary evil: in many ways inimical to good education, with a role and impact that needed to be reduced as far as possible. Tristram Hunt said that our real focus should not be on exams, but on ensuring equity in the early years, and on teacher quality. Tony Little said that our focus should not be on exams, but on ensuring a love of learning; he also said that exams ‘atomise knowledge’. It was clear that for both of them, exams were not always helpful for these other aims. Whilst I accept that in our current system this may be the case, I am convinced that exams have an important and indeed indispensable role to play in achieving these aims. I don’t think you can improve equity, teacher quality and a love of learning without some form of reliable feedback – and exams are the best and most accurate method of gathering feedback that we have. In many different areas of life, improvements and innovations are often brought about by improvements in measurement and feedback. Exams are our measurement system in education, and the fact that we sometimes misuse our measures at the moment is not an argument against measurement: as Sir Philip Sidney said, ‘shall the abuse of a thing make the right use odious?’

Where exams have gone wrong – as I fully accept they have – we have to reform them, not abolish them. And in many cases, the way they have gone wrong is that they have drifted away from being pure exams. For example, for me, one of the problems with the kinds of history exams Tony Little rightly criticised is that they are too dependent on human judgment against abstract criteria, which we know is a very ineffective method of assessment. I would like to see an element of multiple choice questions in such exams, which would help to eliminate such problems – but of course, multiple choice questions are exactly the type of question which receive even more criticism for ‘atomising knowledge’. Similarly, one way we could ensure greater equity in the early years is to introduce exams at KS1, rather than teacher assessments, since we have some evidence that teacher assessments at this age are biased against pupils from a low-income background – but again, if you suggest replacing teacher assessments with tests, you generally do not get a great response. So this, for me, was the difference between the two sides: we both acknowledged the flaws in the current exam system, and both had very similar aims for education; but our side felt that the problems could be solved by wiser use of exams, and perhaps even more of them, whereas the proposition felt that there should simply be fewer exams, that they should matter less, and that there should be more extended project-type assessments.

In my speech itself, I had two main arguments. First, I argued that exams were accurate and less subject to bias than other methods of assessment, such as teacher assessment and coursework. This is a well-established finding in the literature, but it is curiously little-known. Teacher assessment is frequently biased against disadvantaged pupils, yet people assume again and again that such assessment actually helps these pupils. I will write more about this in the future, but if you are interested in it, then my article in the Policy Exchange collection of essays here has more on this, and Rob Coe’s video here does too. I also argued that properly designed tests really do predict things of value. For example, this fascinating study by Benbow and Lubinski tracked top scorers on the American SAT at age 13. At age 38, many of them had gone on to achieve remarkable things, and not just in the predictable areas of business and academia – many of them excelled in the creative professions too. And the American SAT is the kind of test that many would criticise for ‘atomising knowledge’, or for just being, as David Baddiel says here, a kind of ‘cognitive trick’. This is not true, and it fundamentally misunderstands how tests work. Questions on tests do not have to look exactly like the kind of problems we face in real life in order to provide useful information about how we might do on such real-life problems.

My second argument was that testing was also a useful pedagogical technique. We know from Bjork’s research on the testing effect that straining to recall something, as we do in a test, actually helps us to recall it in future. We also know that practice testing is a very effective revision technique: much more effective than the more common approach of rereading notes and highlighting them. The frequently-heard line about how ‘weighing the pig isn’t the same as feeding it’ is false: in the case of education, weighing the pig is the same as feeding it. Testing actually helps you to learn.

We narrowly lost the debate, and afterwards I spoke to a number of pupils in the audience whose experiences of exams were similar to Tom’s above, and who were therefore understandably sceptical of the value of exams. It’s worrying that this is happening, and it makes reform of our exam system all the more important. This same pattern – misuse of exams leading to widespread mistrust of them – has also been seen in the US, and has been outlined brilliantly by Daniel Koretz in Measuring Up. Koretz is at pains to point out the incredibly valuable information we can get from apparently ‘narrow’ standardised tests, but he is also very critical of the No Child Left Behind Act, and of the teaching to the test and gaming it has encouraged. I completely agree with both his points, but it’s a combination of views that feels very rare: often, it feels as though if you are in favour of tests, you must be in favour of teaching to them; and if you are worried about how tests are being used, you must be in favour of abolishing them. It would be nice to open up space for a more Koretz-esque view of tests in this country. A confrontational debate may not be the best way of doing this, but I did enjoy it, and I hope it did at least pique some people’s interest.

The Commission on Assessment without Levels

I was a member of the Commission on Assessment without Levels, which met earlier this year to look at ways of supporting schools with the removal of national curriculum levels. The final report was published last week, and here are a few key points from it.

1. Assessment training is very weak

The Commission agreed with the Carter Review that teacher training and professional development in assessment was weak. It’s worth quoting the Carter Review at length on this.

“For example, our review of course materials highlighted that important concepts relating to evidence-based teaching (validity, reliability, qualitative and quantitative data, effect sizes, randomised controlled trials) appeared to be covered in only about a quarter of courses…there are significant gaps in both the capacity of schools and ITT providers in the theoretical and technical aspects of assessment. This is a great concern – particularly as reforms to assessment in schools mean that teachers have an increased role in assessment. There are also important links here with the notion of evidence-based teaching. The profession’s ability to become evidence-based is significantly limited by its knowledge and understanding of assessment – how can we effectively evaluate our own practice until we can securely assess pupil progress?”

It’s particularly frustrating that assessment training is so weak, because compared to many other aspects of teacher training, this is not hard to deliver. It should be relatively straightforward to design a taught course covering the topics above.

2. Performance descriptors have big weaknesses. Judging pupils against ‘can-do’ statements is popular, but flawed.

“Some assessment tools rely very heavily on statements of achievement drawn from the curriculum. For example, teachers may be required to judge pupils against a series of ‘can-do’ statements. Whilst such statements appear precise and detailed, they are actually capable of being interpreted in many different ways. A statement like ‘Can compare two fractions to identify which is larger’ sounds precise, but whether pupils can do this or not depends on which fractions are selected. The Concepts in Secondary Mathematics and Science (CSMS) project investigated the achievement of a nationally representative group of secondary school pupils, and found out that when the fractions concerned were 3/7 and 5/7, around 90% of 14-year-olds answered correctly, but when more typical fractions, such as 3/4 and 4/5, were used, 75% answered correctly. However, where the fractions concerned were 5/7 and 5/9, only around 15% answered correctly.”

I’ve written about this at length here.

3. Teacher assessment is not always fairer than tests

“Standardised tests (such as those that produce a reading age) can offer very reliable and accurate information, whereas summative teacher assessment can be subject to bias.”

4. Ofsted does not expect to see any one particular assessment system.

Here’s a link to a video of Sean Harford, another member of the commission and the National Director for Schools at Ofsted, making exactly this point.

5. A national item bank could be an innovative way of providing a genuine replacement for levels.

“Some schools use online banks of questions to help with formative assessment. Such banks of questions give meaning to the statements contained in assessment criteria and allow pupils to take ownership of their learning by seeing their strengths and weaknesses and improvement over time. Some commercial packages exist with pre-set questions, particularly for maths and science. Other products allow teachers to create their own questions, thus ensuring they align perfectly with the school curriculum.

One of the flaws with national curriculum levels was the way a summative measure came to dominate formative assessment. One way the government could support formative assessment without recreating the problems of levels would be to establish a national item bank of questions based on national curriculum content. Such an item bank could be used for low-stakes assessments by teachers and would help to create a shared language around the curriculum and assessment. It could build on the best practice of schools that are already pioneering this approach. Over time, the bank could also be used to host exemplar work for different subjects and age groups.”

New Zealand appears to have something similar.
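To make the idea concrete, here is a minimal sketch of what a single record in such an item bank might look like. The field names and content are entirely my own invention – the Commission did not specify any format – but the point is that each question is tied to a specific piece of national curriculum content and to evidence about how hard it actually is.

```python
# Hypothetical sketch of one record in a national item bank.
# All field names and values are invented for illustration only.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Item:
    curriculum_ref: str    # the national curriculum statement the item samples
    stem: str              # the question as the pupil sees it
    options: List[str]     # possible responses
    key: str               # the correct response
    facility: float        # proportion of a trial sample answering correctly
    exemplars: List[str] = field(default_factory=list)  # links to exemplar work

item = Item(
    curriculum_ref="KS2 science: describe the movement of the Earth relative to the Sun",
    stem="Roughly how long does the Earth take to orbit the Sun?",
    options=["One day", "One month", "One year"],
    key="One year",
    facility=0.8,  # invented figure: 80% of a trial sample answered correctly
)
print(f"{item.curriculum_ref}: facility {item.facility:.0%}")
```

A shared bank along these lines would also make the ‘shared language’ point more than a metaphor: two schools using the same items would know they were talking about the same thing.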

For more of my posts on assessment, see here.

“Certain things then follow from that”: Notes on ED Hirsch’s Policy Exchange lecture

On Thursday evening I had the privilege of hearing ED Hirsch give the Policy Exchange education lecture.  Hirsch in person was much like Hirsch the author: self-effacing, erudite, quietly compelling and wryly humorous.  He spoke about what the best kind of early education should look like, and stressed the egalitarian effect of teaching knowledge to young children. In order to become good readers, children have to develop a large vocabulary and a lot of knowledge about the world, and both vocabulary and background knowledge are ‘plants of slow growth’. That’s why it’s so important to start in the early years. We also cannot rely on search engines to teach pupils this vital knowledge, because “Google is not an equal opportunity fact-finder”: to look something up on the internet requires knowledge to begin with.

One of the most interesting moments came near the end of the lecture, when in response to a question, he said that if you acknowledge the research on reading comprehension, “certain things then follow from that.” This, I think, is one of the most pleasurable things about reading books by Hirsch. He is a master at compiling a logical case. He practises what he preaches about knowledge: every book or article of his that I have read marshals a vast array of evidence, carefully detailing each piece of research, and then clearly outlining the implications which flow from it. The links between experiments on chess players, the evolution of irregular verbs, and kindergarten resources on ancient Mesopotamia are not immediately clear, but Hirsch makes them so.  It’s like watching a master craftsman at work, or reading a clever detective novel where every clue and red herring falls neatly into place in the final chapter.

Hirsch’s own intellectual journey is similarly full of such unexpected meaning: he began as a literature professor writing about the Romantics, critiqued the popular literary theory of reader-response in Validity in Interpretation, and carried out original research on students’ ability to appreciate written style. It was this latter research which led, obliquely, to his interest in education, because the experiments also showed that students were unable to understand a text if they lacked knowledge of its subject. This intellectual journey reminds me of another famous Virginian: it was Thomas Jefferson who said that he was “bold in the pursuit of knowledge, never fearing to follow truth and reason to whatever results they led, and bearding every authority which stood in their way.” For Jefferson too, certain things followed from reason, and sometimes those things contradicted the prevailing authorities. The modern educational establishment is obviously not as warlike as George III, although sometimes it can feel as though it is equally blasé about reality.

One other point I found of particular interest was Hirsch’s discussion of some of the popular modern aims of education. He quoted the mottoes of school boards in Tucson, Milwaukee and Santa Fe – all variations on the theme of ‘children will develop organisational, critical thinking and problem-solving skills’. He argued that such vague, motherhood-and-apple-pie statements are so popular because they are very convenient for bureaucrats. I could not agree with this more. I have written at length lately about the problems with performance descriptors – those wishy-washy statements which are meant to ‘define’ what a pupil knows and can do. Despite the fact that most people recognise how useless they are, they have a zombie-like tenacity. Why is this? As Hirsch says, they are popular with bureaucrats, and I think this is because they offer an illusion of meaning, and also because the alternative to such statements is to specify knowledge, which always requires hard thinking and often leads to controversy. This brings me to the final reason why Hirsch’s work is so important: his work is always practical. The Core Knowledge Curriculum is used in thousands of schools in the US; the ‘What your nth-grader needs to know’ series are bestsellers; the Core Knowledge Language Arts resources have been adopted by New York City and have been the subject of a highly successful evaluation. Untold numbers of children have had a better education because of Hirsch.

This is the light in which we should view the famous list of facts at the end of Cultural Literacy. I know it’s popular even among people who are sympathetic to Hirsch to dismiss the list of facts and statements in Cultural Literacy as ‘simplistic’ or ‘naïve’. Not at all. It is precisely the concrete simplicity of the list which is so valuable. It is easy for academics to waffle on at length about the importance of knowledge, and then, at the crunch moment, resort to some vague statements of competency and skill which have absolutely no practical use.  The list may look simplistic, and it is simple, but it is the product of much abstract research. In Hirsch’s work, clarity of action proceeds from clarity of writing, which proceeds from clarity of thought.  Hirsch makes research tangible. That is his genius.

Guide to my posts about assessment

Over the last two years, I have written a number of posts about assessing without levels. Here’s a guide to them.

First of all, what were the problems with national curriculum levels that led to them being abolished? And were there any good things about them that we need to retain? It’s often claimed that one of the good things about NC levels was that they gave us a shared language – but I argue here, in Do national curriculum levels give us a shared language?, that they don’t. Instead, they provide us with the illusion of a common language, which is very misleading.

In Why national curriculum levels need replacing, I extended this argument, looking at research from the early 90s which showed that even then, national curriculum levels obscured the reality of pupils’ achievement.

One of the main reasons that NC levels were so confusing is that they were based on prose performance descriptors. Any replacement for NC levels which is based on similar descriptors is likely to run into similar problems. This post, Problems with performance descriptors, extends the argument against NC levels to all kinds of performance descriptors. And in The adverb problem, I look at how hard it is to define excellence in terms of adverbs or adjectives – what does it mean to say that a pupil writes effectively or brilliantly or originally?

David Didau responded to this argument accepting the major points, but also putting forward a replacement for levels that was based on performance descriptors. This was a pattern I was starting to see again and again – people would accept my criticisms of NC levels, but then produce a replacement which was essentially a rehashed version of levels. In this post, I tried to tackle this head on, by saying that Assessment is difficult, but it is not mysterious.

In response to that post, many people quite rightly claimed that if I wanted people to accept my critique of performance descriptors, I had to come up with some alternatives. This was the subject of the next two posts. In Assessment alternatives 1: using questions instead of criteria, I looked at how you could use banks of questions to replace descriptors. Instead of asking teachers if pupils can compare two fractions and see which is bigger, ask the pupils if 3/4 is bigger than 4/5, or if 5/7 is bigger than 5/9. And if you get a computer to mark the question for you, you’ve cut down on a lot of workload too. In order to use this kind of approach, multiple choice questions can be very helpful. I’ve written a three-part guide to them here.
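As a rough illustration of what computer marking of such a question bank might involve – the questions and pupil responses below are invented, and the code is only a sketch, not any particular product – the key point is that you get back a facility for each question, rather than a yes/no judgment against a descriptor:

```python
# Sketch of auto-marking a small bank of fraction-comparison questions.
# Questions and responses are invented; this illustrates the idea, nothing more.
from fractions import Fraction

# Each item asks: which of these two fractions is larger?
QUESTION_BANK = {
    "q1": (Fraction(3, 7), Fraction(5, 7)),
    "q2": (Fraction(3, 4), Fraction(4, 5)),
    "q3": (Fraction(5, 7), Fraction(5, 9)),
}

def correct_answer(pair):
    """The larger of the two fractions is the right answer."""
    return max(pair)

def facilities(responses):
    """responses: {pupil: {question_id: fraction chosen}} -> facility per question."""
    result = {}
    for qid, pair in QUESTION_BANK.items():
        marks = [resp.get(qid) == correct_answer(pair) for resp in responses.values()]
        result[qid] = sum(marks) / len(marks)
    return result

# Invented responses from two pupils, purely to show the mechanics.
responses = {
    "pupil_a": {"q1": Fraction(5, 7), "q2": Fraction(4, 5), "q3": Fraction(5, 9)},
    "pupil_b": {"q1": Fraction(5, 7), "q2": Fraction(3, 4), "q3": Fraction(5, 9)},
}

for qid, f in facilities(responses).items():
    print(f"{qid}: {f:.0%} answered correctly")
```

The same marking data also shows each pupil’s particular strengths and weaknesses, which is exactly the kind of formative information a ‘can-do’ descriptor cannot give you.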

Again, many people responded to that post by saying that whilst it might work for maths and science, it wasn’t really applicable to essay-based assessments that you get in English and the humanities. In Assessment alternatives 2: using pupil work instead of criteria, I suggested that for these types of assessment you could use pupil work instead of criteria. Instead of marking an essay against a set of criteria, compare essays against each other and against exemplar work.

In the next two posts I looked at some of the theoretical reasons why performance descriptors don’t work. First, in Marking essays and poisoning dogs, I look at research which shows that human judgment is relative, not absolute. This is the theory which underpins No More Marking, a really interesting new way of assessing the quality of complex tasks like essays.

Second, in Tacit knowledge, I look at the research around tacit knowledge, which shows that it is hard to describe expertise and excellence using words. When we try to do so, as with performance descriptors, the result is often not quality but absurdity. For example, pupils who have been coached with certain rules of mark schemes sometimes end up writing essays which read less like a good essay and more like the mark scheme itself.

The two books I’ve found most helpful in thinking about assessment are Measuring Up by Daniel Koretz, and Principled Assessment Design by Dylan Wiliam. My three-part review of Koretz’s book starts here, and my review of Wiliam’s book is here.

In this post, from March 2014, I listed different approaches that were being taken by different schools. I hope to be able to update this soon with further developments from the past 18 months.

From March-July 2015 I was a part of the government’s Commission on Assessment without Levels. My take on the final report can be found here.

Finally, I gave a speech at Research Ed in September 2015 summarising a lot of these ideas. The slides are here, and the video is below.

Research Ed 2015


Every Research Ed I’ve been to has been brilliant, and every single one has been better than the one before. Great conversations, great people, fascinating ideas – I loved it all. Here is my summary of the day.

Session One
I spoke about replacing national curriculum levels. You can see my slides here: REd 2015 DC. The video is here.

I have been thinking about this issue for about three years now. It is the hardest project I have ever worked on, but also one of the most rewarding. I am really grateful to the people who came along and heard me speak, who offered such helpful suggestions and comments, and who have said kind things about it on twitter. It was a pleasure to be able to present on this topic. I hope it was of some use.

I’ll add one thing to what I said yesterday. At the end, Joe Kirby asked a question about what the government could do to encourage the approaches to assessment I was advocating, and I said (more or less) very little. Broadly speaking, I am supportive of the move to abolish levels and hand responsibility for the replacements to schools. I think that in the long term, this will lead to better assessment. However, I did also begin my speech by saying that I thought that most of the systems I see schools designing are simply rehashed NC levels. A few people came up to me after the session and asked how I could think this and still be so confident about a school-led system, and I think later in the day Sam Freedman spoke about similar issues to do with the capacity and expertise that schools need in order to make a school-led system work. So how can I be so confident about the absence of government intervention when I am also saying that most schools are not really using their freedom to do anything new? My answer is threefold. First, most schools aren’t doing anything new, but by definition that means that they are basically carrying on with the status quo. So whilst things aren’t improving, they certainly aren’t getting worse (although of course in many cases schools are spending time and energy for little improvement, which does carry a cost). Second, most schools aren’t doing anything new, but a few are. In the medium to long term, if these new approaches prove successful, they could spread across the system. Third, where is the evidence that government intervention would improve things? There is a real risk that government would come up with a bad idea – after all, they are responsible for levels – and even if they came up with the perfect solution, there would still be serious doubts about whether it would be implemented as intended in every school in the country. So I think the right move is for government to step back and let schools innovate. And I say this as someone who has had some demoralising times over the past year or two looking at the various assessment systems on offer.

Session Three – Eric Kalenze
I heard Eric speak back in May at the New York Research Ed, when he was kind enough to give me a copy of his book, Education is Upside Down. It is absolutely fantastic. But then, I would say that, because it has an awful lot in common with my own book. I have been meaning to write about it for ages, but there is so much I want to say that I never finish the post. Essentially, Eric’s argument is that whilst education is in need of significant reform, most of the current reform movements focus too much on structures, and not enough on content. Reformers assume that the main problem with education is that teachers aren’t working hard enough, so the solution is to make teachers work harder and set up punitive accountability regimes. In actual fact, whilst there is a problem with education, it most certainly is not that teachers are not working hard enough – it is that the system is in the grip of the wrong ideas. He makes this argument – in the book and in his speeches – with some really helpful, clear and amusing analogies, including his central one about how the funnel of education is upside down. Rather than criticise the teacher for unskilful use of the funnel, we should turn the funnel the right side up! Eric’s speech was all about the US system, but it was fascinating to consider just how many parallels there are with the UK. No Child Left Behind is so similar to our exam and accountability regime, New York’s SEDL programme so similar to our SEAL, and, I would argue, the often misplaced priorities of reformers are similar in both countries too. More to follow on this.

Session Four – James Murphy
I loved this session, although I was slightly annoyed as I had planned to do something similar at a future Research Ed! James presented some of the theories behind Siegfried Engelmann’s Direct Instruction and taught us a Maori word using them.

Session Five – Amanda Spielman
I got to this session early so I could get a front-row seat and be able to read all the small print on the axes of the graphs that would undoubtedly be presented. I wasn’t disappointed. In every other session, you could get away with a little light tweeting, but not in this one. If your attention went for a minute, you’d have missed several links in the meticulously presented chain of argument, in the rapid deductions, swift as intuitions yet always founded on a logical basis, with which Spielman unravelled the problems entailed in the re-marking of exams. In Spielman’s presence, it’s hard not to feel as Watson, or even Lestrade, felt in the presence of Holmes. Of course, such a quiet air of mastery and pre-eminent intelligence can intimidate and even demoralise, but I guess we mere mortals have to soldier on, knowing that whilst we will never understand the nuance and beauty of exam re-marking in quite the same way, we can still make our own much smaller contributions. Though we may not be luminous, we can perhaps conduct light. On that note: she recommended reading Measuring Up by Daniel Koretz. My three-part review starts here.

Session Six – Katie Ashford
Katie Ashford gave a really practical and constructive session about teaching literacy. The discussion at the end was also really helpful, as we all discussed the best way to set up reading logs and check for understanding of vocabulary. This blog by Katie gives a good idea of the flavour of the session.

Session Seven – Rob Coe
Rob Coe finished the day off with some cheerful messages. He compared the state of education research in 1999 to the state today, arguing that there have been great improvements: more RCTs, more acknowledgement of the value of RCTs, more government funding for them, and more government interest in robust evaluation of policy. And he ended on a brilliantly positive note, saying that it was trailblazers like those in the audience who could make change happen.

As ever, there were regrets: I missed some amazing presentations, and missed the chance to talk to some great people too. But all in all it was a fantastic day, and I am hugely grateful to Tom Bennett, Helene Galdin O’Shea, Susie Wilson and all the team at SHHS for making this happen.


Twitter – pros and cons

If you haven’t already read Changing Schools, you can buy a copy here. As well as an essay by me on assessment, it features an even better one by Andrew Old on social media. Andrew interviews and cites a number of policymakers, bloggers and academics about the impact they think social media have had on education policy.  I’m one of the bloggers he interviewed for the essay, and his questions got me thinking – what is Twitter good for? What is it bad for? How can it help us – not just in education and policymaking, but in our lives in general? Here are my pros and cons.


Pro – allows you to find lots of interesting articles vs. Con – it’s a time sink
Twitter delivers a stream of relevant and important articles and research papers to your timeline. Unfortunately, you’ll be so busy reading the Twitter storm that erupts around them – who said what to whom about which – that you will never actually get round to reading the articles themselves.

Pro – allows you to gather rapid feedback vs. Con – it’s an echo chamber
If you have any kind of new idea or question, twitter allows you to gather very quick feedback from people who have a direct interest and involvement in the field. Unfortunately, those same people will very likely be highly unrepresentative of the rest of the field.

Pro – allows you to meet interesting people vs. Con – allows you to meet horrible people
Twitter means you can meet brilliant people you have lots in common with, whom you would never have met otherwise. It also means that you can meet horrible people you have very little in common with, whom you would never have met otherwise.

Pro – accelerates discovery of truth vs. Con – retards discovery of truth (always assuming that ‘truth’ is a thing)
Twitter is a (fairly) level playing-field where ideas can grapple. When this happens, Milton tells us, Truth wins. On the other hand, Twitter forces ideas to be compressed and simplified. But, as Donne tells us, the Truth often isn’t simple.

Principled Assessment Design by Dylan Wiliam

Back in 2013 I wrote a lengthy review of Measuring Up by Daniel Koretz. This book has had a huge influence on how I think about assessment. Last year I read Principled Assessment Design by Dylan Wiliam, which is equally good and very helpful for anyone looking to design a replacement for national curriculum levels. As with Koretz’s book, it packs a lot in – there are useful definitions and explanations of validity, reliability, and common threats to validity. There are two areas in particular I want to comment on here: norm-referencing and multiple-choice questions. These are two aspects of assessment which people are often quite prejudiced against, but Wiliam shows there is some evidence in favour of them.

Wiliam shows that in practice, norm-referencing is very hard to get away from. The alternative is criterion-referencing, where you set a criterion and judge whether pupils have met it or not. This sounds much fairer, but it is actually much harder to do than it sounds. Wiliam gives a couple of very good practical examples of this. Take the criterion ‘can compare two fractions to identify which is larger’. Depending on which fractions are selected, as many as 90% or as few as 15% of pupils will get the question right. How should we decide which pair of fractions to include in our assessment? One useful way would be to work out what percentage of pupils from a representative norm-group got each question right. That’s essentially norm-referencing. The criterion can only be given meaning through some use of norming. ‘As William Angoff observed four decades ago, “we should be aware of the fact that lurking behind the criterion-referenced evaluation, perhaps even responsible for it, is the norm-referenced evaluation” (Angoff, 1974, p.4).’

He also draws a distinction between norm-referencing and cohort-referencing. Norm-referencing is when you ‘interpret a student’s performance in the light of the performance of some well-defined group of students who took the same assessment at some other time.’ Cohort-referencing is when you interpret performance in the light of other pupils in a cohort – be that a class or a particular age cohort. This may not sound much of a distinction, but it is crucial: ‘The important point here is that while cohort-referenced assessment is competitive (if my peers do badly, I do better), norm-referenced assessment is not competitive. Each student is compared to the performance of a group of students who took the assessment at an earlier time, so in contrast to cohort-referenced assessment, sabotaging the performance of my peers does me no good at all.’

I hadn’t fully considered this before, but I think it is an extraordinarily important point to make because it could perhaps help to improve norm-referencing’s bad image. Often, people associate norm-referencing with an era when the top grade in public exams was reserved for a certain percentage of the pupils taking the exam. However, that’s not norm-referencing – it’s cohort-referencing. Norm-referencing doesn’t have any of these unfair or competitive aspects. Instead, it is simply about providing a measure of precision in the best way we can.
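A tiny worked sketch (the scores are made up) may help to show why the distinction matters: a norm-referenced interpretation compares a pupil’s score against a fixed group who took the test at some earlier time, so nothing their classmates do this year can change it, whereas a cohort-referenced interpretation depends directly on how this year’s peers perform.

```python
# Illustration of norm-referencing vs cohort-referencing, with invented scores.
from bisect import bisect_right

# Scores from a well-defined norm group who took the assessment at an earlier time.
NORM_GROUP = sorted([12, 15, 18, 22, 25, 27, 30, 33, 35, 40])

def norm_referenced_percentile(score):
    """Percentage of the fixed norm group scoring at or below this score.
    Unaffected by anything this year's cohort does."""
    return 100 * bisect_right(NORM_GROUP, score) / len(NORM_GROUP)

def cohort_rank(score, cohort_scores):
    """Position within this year's cohort (1 = top). This *does* change
    if my peers do better or worse, which is what makes it competitive."""
    return 1 + sum(1 for s in cohort_scores if s > score)

this_years_class = [20, 26, 31, 34, 28]
print(norm_referenced_percentile(28))     # 60.0 – fixed, however the class performs
print(cohort_rank(28, this_years_class))  # 3 – depends entirely on peers
```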

Wiliam also warns against over-reliance on rubrics which attempt to describe what quality looks like. He quotes a famous passage from Michael Polanyi which shows the limitations of attempting to describe quality. I’ve written at greater length about this here.

Multiple-choice questions
As with norm-referencing, multiple choice or selected-response questions often get a bad press. ‘Particularly in the UK, there appears to be a widespread belief that selected-response items should not be used in school assessment…It is true that many selected-response questions do measure only shallow learning, but well-designed selected-response items can probe student understanding in some depth.’

Wiliam gives some good examples of these types of question. For example:

What can you say about the means of the following two data sets?

Set 1: 10 12 13 15
Set 2: 10 12 13 15 0

A. The two sets have the same mean.
B. The two sets have different means.
C. It depends on whether you choose to count the zero.

As he says, ‘this latter option goes well beyond assessing students’ facility with calculating the mean and probes their understanding of the definition of the mean, including whether a zero counts as a data point or not.’ I would add that these types of questions can offer better feedback than open-response questions. If you deliberately design a question to include a common misconception as a distractor, and the pupil selects that common misconception, then you have learnt something really valuable – far more valuable than if they simply don’t answer an open-response question.

Wiliam also notes (as he has done before, here) that one very simple way of stopping pupils guessing is to have more than one right answer. If there are five possible options, and the pupils know one is right, they have a 1 in 5 chance of guessing the right answer. If they don’t know how many are right, they have a 1 in 32 chance of guessing the right answer. In response to this I actually designed a lot of MC tests with this structure this year, and I can confirm that it significantly increases the challenge. Pupils have to spend a lot more time thinking about all of the distractors. If you want to award marks for the test and record them, you have to think carefully about how this works. For example, if there are, say, three right answers and a pupil correctly identifies two, it can feel harsh for the pupil to get no credit, particularly when compared to a pupil who has just guessed one completely wrong answer. This isn’t a problem if the questions are used completely formatively, of course, but it is something to bear in mind. However, I can definitely vouch for Wiliam’s central point: multiple-choice questions can be made extremely difficult and challenging, and they can certainly test higher-order learning objectives. For more on this, see my series of posts on MCQs starting here.
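To finish with something concrete: the guessing arithmetic above, and one possible way of awarding partial credit. The scoring rule here is purely illustrative – it is not something Wiliam prescribes, just one option a school might consider.

```python
# The guessing arithmetic: with 5 options and exactly one right answer, a blind
# guess succeeds 1 time in 5; if any combination of options could be right,
# there are 2**5 = 32 possible response patterns, so a blind guess succeeds
# roughly 1 time in 32.
options = 5
print(f"One right answer: 1 in {options}")
print(f"Unknown number of right answers: 1 in {2 ** options}")

# One possible partial-credit rule, purely illustrative: a mark for each correct
# option selected, minus a mark for each incorrect option selected, floored at
# zero. A pupil who finds two of three right answers gets some credit; a wild
# guesser gains little.
def partial_credit(selected, key):
    hits = len(selected & key)          # correct options chosen
    false_alarms = len(selected - key)  # incorrect options chosen
    return max(hits - false_alarms, 0)

key = {"A", "C", "D"}                   # suppose three of the five options are right
print(partial_credit({"A", "C"}, key))  # 2 – two of the three right answers found
print(partial_credit({"B", "E"}, key))  # 0 – two wrong guesses
```

Whether negative marking of this kind is fair is itself debatable, which is part of the point: as soon as you record marks, the scoring rule becomes a design decision in its own right.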