Comparative judgment: 21st century assessment

In my previous posts I have looked at some of the flaws in traditional teacher assessment and assessments of character. This post is much more positive: it’s about an assessment innovation that really works.

One of the good things about multiple-choice and short answer questions is that they offer very high levels of reliability. They have clear right and wrong answers; one marker will give you exactly the same mark as another; and you can cover large chunks of the syllabus in a short amount of time, reducing the chance that a high or low score is down to a student getting lucky or unlucky with the questions that came up. One of the bad things about MCQs is that they often do not reflect the more authentic, real-world tasks pupils might go on to encounter, such as essays and projects. The problem with real-world tasks, however, is that they are fiendishly hard to mark reliably: it is much less likely that two markers will always agree on the grades they award. So you end up with a bit of a trade-off: MCQs give you high reliability, but sacrifice a bit of validity. Essays allow you to make valid inferences about the things you are really interested in, but you trade off reliability. And you have to be careful: trade off too much validity and you have highly reliable scores that don’t tell you anything anyone is interested in. Trade off too much reliability and the inferences you make are no longer valid either.

One way of dealing with this problem has been to keep the real world tasks, and to write quite prescriptive mark schemes. However, this runs into problems of its own: it reduces the real-world aspect of the task, and ends up stereotyping pupils’ responses to the task. Genuinely brilliant and original responses to the task fail because they don’t meet the rubric, while responses that have been heavily coached achieve top grades because they tick all the boxes. Again, we achieve a higher degree of reliability, but the reliable scores we have do not allow us to make valid inferences about the things we really care about (see the Esse Quam Videri blog on this here).  I have seen this problem a lot in national exams, and I think that these kinds of exams are actually more flawed than the often-maligned multiple choice questions. Real world tasks with highly prescriptive mark schemes are incredibly easy to game. Multiple-choice and short answer questions are actually not as easy to game, and do have high levels of predictive validity. I think the problem people have with MCQs is that they just ‘look’ wrong. Because they look so artificial, people have a hard time believing that they really can tell you something about how pupils will do on authentic tasks. But they can, and they do, and I would prefer them to authentic tasks that either don’t deliver reliability, or deliver reliability in such a way that compromises their validity.

Still, even a supporter of MCQs like me has to acknowledge – as I always have – that in subjects like English and history, you would not want an entire exam to be composed of MCQs and short answer questions. You would want some extended writing too. In the past, I have always accepted that the marking of such extended writing has to involve some of the trade-offs and difficult decisions outlined above. I’ve also always accepted that it has to be a relatively time-consuming process, involving human markers, extensive training, and frequent moderation.

However, a couple of years ago I heard about a new method of assessment called comparative judgment, which offers an elegant solution to the problem of assessing tasks such as essays and projects. Instead of writing prescriptive mark schemes, training markers in their use, getting them to mark a batch of essays or tasks and then coming back together to moderate, comparative judgment simply asks an examiner to make a series of judgments about pairs of tasks. Take the example of an essay on Romeo and Juliet: with comparative judgment, the examiner looks at two essays and decides which one is better. Then they look at another pair, and decide which one is better. And so on. It is relatively quick and easy to make such judgments – much easier and quicker than marking one individual essay. The organisation No More Marking offer their comparative judgment engine online here for free. You can upload essays or tasks to it, and set up the judging process according to your needs.
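
Mechanically, setting up the judging process just means deciding which pairs each judge sees. As a purely illustrative sketch – not No More Marking’s actual pairing algorithm, which may well pair adaptively rather than at random – here is one simple way to generate pairings so that every essay is judged the same number of times:

```python
import random

def make_pairings(essay_ids, rounds=10, seed=0):
    """Pair essays off for comparative judgment.

    Each round shuffles the essays and pairs neighbours, so every essay
    appears exactly once per round. With 100 essays and 10 rounds you get
    500 pairs, i.e. each essay is judged 10 times. (Illustrative only:
    real CJ engines may pair adaptively rather than at random.)
    """
    rng = random.Random(seed)
    pairs = []
    for _ in range(rounds):
        shuffled = list(essay_ids)
        rng.shuffle(shuffled)
        # Pair off neighbours; with an odd number of essays the last one
        # would simply sit out this round.
        pairs.extend(zip(shuffled[::2], shuffled[1::2]))
    return pairs

essays = [f"essay_{i:03d}" for i in range(100)]
pairs = make_pairings(essays)
print(len(pairs))  # 500 pairs to share out among the judges
```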

Let’s suppose you have 100 essays that need marking, and five teachers to do the marking. If each teacher commits to judging 100 pairs, you will have a total of 500 paired judgments. These judgments are enough for the comparative judgment engine to work out the rank order of all of the essays, and to associate a score with each one. In the words of the No More Marking CJ guide here: ‘when many such pairings are shown to many assessors the decision data can be statistically modelled to generate a score for each student.’ If you want your score to be a GCSE grade or other kind of national benchmark, then you can include a handful of pre-graded essays in your original 100. You will then be able to see how many essays did better than the C-grade sample, how many better than the B-grade sample, and so on. This method of marking also allows you to see how accurate each marker is. Again, in the words of the guide: ‘the statistical modelling also produces quality control measures, such as checking the consistency of the assessors. Research has shown the comparative judgement approach produces reliable and valid outcomes for assessing the open-ended mathematical work of primary, secondary and even undergraduate students.’
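
The guide does not spell out which statistical model sits behind the engine, but a Bradley-Terry-type model is one standard way to turn ‘which of these two is better?’ decisions into a score per essay: each essay gets a parameter, and the probability that one essay beats another depends on the difference between their parameters. The sketch below, with hypothetical essay names, is a minimal illustration of that idea rather than No More Marking’s implementation:

```python
import math
from collections import defaultdict

def fit_bradley_terry(judgements, iterations=500, lr=0.05, reg=0.01):
    """Estimate a score for each essay from (winner, loser) judgements.

    A minimal Bradley-Terry-style fit by gradient ascent: the probability
    that essay i beats essay j is modelled as sigmoid(theta_i - theta_j),
    so higher theta means the essay won more of its comparisons. This is
    an illustrative sketch, not No More Marking's actual engine.
    """
    essays = {e for pair in judgements for e in pair}
    theta = {e: 0.0 for e in essays}

    for _ in range(iterations):
        grad = defaultdict(float)
        for winner, loser in judgements:
            # Probability the winner beats the loser under current scores.
            p_win = 1.0 / (1.0 + math.exp(theta[loser] - theta[winner]))
            grad[winner] += 1.0 - p_win
            grad[loser] -= 1.0 - p_win
        for e in essays:
            # Small penalty keeps scores finite when an essay wins or
            # loses every comparison it appears in.
            theta[e] += lr * (grad[e] - reg * theta[e])
    return theta

# Toy run: three essays, three judgements, giving the rank order A > B > C.
scores = fit_bradley_terry([("essay_A", "essay_B"),
                            ("essay_A", "essay_C"),
                            ("essay_B", "essay_C")])
for essay, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(essay, round(score, 2))
```

Pre-graded anchor essays would simply sit in the same pool of scripts, with grade boundaries read off from their fitted scores; and one common way to check assessor consistency is to compare each judge’s decisions with the probabilities the fitted model predicts.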

I have done some trial judging with No More Marking, and at first, it feels a bit like voodoo. If, like most English teachers, you are used to laboriously marking dozens of essays against reams of criteria, then looking at two essays and answering the question ‘which is the better essay?’ feels a bit wrong – and far too easy. But it works. Part of the reason why it works is that it offers a way of measuring tacit knowledge. It takes advantage of the fact that amongst most experts in a subject, there is agreement on what quality looks like, even if it is not possible to define such quality in words. It eliminates the rubric and essentially replaces it with an algorithm. The advantage of this is that it also eliminates the problem of teaching to the rubric: to go back to our examples at the start, if a pupil produced a brilliant but completely unexpected response, they wouldn’t be penalised, and if a pupil produced a mediocre essay that ticked all the boxes, they wouldn’t get the top mark. And instead of teaching pupils by sharing the rubric with them, we can teach pupils by sharing other pupils’ essays with them – far more effective, as generally examples define quality more clearly than rubrics.

Ofqual have already used this method of assessment for a big piece of research on the relative difficulty of maths questions. The No More Marking website has case studies of how schools and other organisations are using it.  I think it has huge potential at primary school, where it could reduce a lot of the burden and administration around moderating writing assessments at KS1 & KS2.  On the No More Marking website, it says that ‘Comparative Judgement is the 21st Century alternative to the 18th Century practice of marking.’ I am generally sceptical of anything in education describing itself as ’21st century’, but in this case, it’s justified. I really think CJ is the future, and in 10 or 15 years’ time, we will look back at rubrics the way Marty McFly looks at Reagan’s acting.

25 thoughts on “Comparative judgment: 21st century assessment”

  1. jfin107

    Thank you ‘The Wing to Heaven’. This approach is how many think within the arts and arts education and allows for a child’s work to be viewed as ‘sui generis’, at least to start with, and for the child’s capacity for originality and swerve from normalisations to contribute to future expectations and revised standards of what might be considered valuable (worthy of assessment). After all, Kant gave us the notion of intersubjectivity in coming to judgements. The process of coming to consensus through comparisons is essentially dialogic, I would suggest, and about fusing horizons (Gadamer). All in all, a release from current orthodoxies. Thank you for your thoughts.

    Reply
  2. Nick von Behr

    Thanks for sharing this. I’m so glad it’s finally taken off. Ian Jones, the scientific adviser to No More Marking, was awarded a Royal Society fellowship for his research in this area. This was a scheme Michael Reiss, Ginny Page and I got off the ground. From small seeds come big trees.

    Reply
  3. julietgreen

    I’m surprised it was only a couple of years ago though. I do remember seeing this demonstrated about 20 years ago. It was a video about ranking as a means of assessing in art coursework. I wish I could find the original video. Very few colleagues have believed me since then, all through the dark days of fighting over levels.

    Reply
  4. Anthony Radice

    This makes complete sense to me. I always put a set of essays in rank order, and I’m confident about the ranking. It is much clearer than all the pseudo objective gobbledygook in the mark schemes. Capitalising on the ability to rank essays sounds like an excellent approach.

    Reply
  5. Pingback: Comparative judgment: 21st century assessment – connectededucation

  6. TheOtherDrX

    Interesting idea, which I sort-of do after assigning a mark and giving detailed feedback as part of moderation. Maybe my difference is in HE, where everything I award a mark to is essentially summative, and if it doesn’t have copious feedback, the students complain and NSS scores go through the floor. Therefore, don’t expect any buy-in from the HE sector given the current HE climate. However, when assessing which is the better essay, is it based on gut feeling or a mental note of all the likely points that would otherwise go on the rubric? When doing this for the purpose of moderation, it is mostly whether they hit the rubric and marks are accurately assigned to it. But there are always those that ‘break’ the mark scheme with unique qualities. The simple solution is that the mark scheme is disregarded. I can do that, and I often do.

    Reply
  7. Alison Hardy

    Comparative judgement as an assessment method was trialled and developed as part of the APU’s work in the 1980s, and then developed with TERU at Goldsmiths, initially focussing on Design & Technology by comparing portfolio work, but then developing it into an e-portfolio with students capturing their ideas etc. using videos and voice recording as well as through their finished work. Ross McGill outlines the process here: http://teachertoolkit.me/2014/09/04/rewarding-risk-how-e-scape-changes-learning-by-teachertoolkit/ Exam boards were involved in the development and it was trialled in science and geography.

    Reply
  8. manderson

    I love this idea, though I find the No More Marking website incredibly confusing to this non-psychometrician estadounidense.

    I also wonder if there’s a benefit to a mixed system of both ranking and “marking” utilizing a specific rubric. I usually find myself ranking papers first to get a ballpark estimate of where they fall, then aligning specifics to my rubric in order to provide feedback to the student. To ensure my students process this feedback more deeply, I combine passing back feedback with a reflection protocol.

    Reply
  9. Pingback: 15 Blogposts for 2015! #15for2015 | From the Sandpit....

  10. Pingback: The Blogosphere in 2015 | Pragmatic Education

  11. Crispin Weston

    Anyone who has seen “The Social Network” knows that it is the same principle that lay behind Mark Zuckerberg’s first app, in which you were invited to rank randomly selected pairs of pictures of girls as hotter or less hot. And if you find that objectionable, then this method has exactly the same problem. It emphasizes rank order which, as the advocates of Assessment for Learning point out, is extremely problematic: it involves ego rather than engaging effort, and tends to suggest innate ability (because rank order tends to stay fairly stable). As others in this thread have pointed out, it does not lend itself to informing formative feedback…unless you rank essays in respect of x (where x is one of those much-maligned rubrics).

    I can see the use of this technique to double check the consistency of marking (as suggested by Anthony Radice), or for CPD or moderation purposes, to develop a consistent appreciation of “the rubric” (an unsatisfactory term for the educational objective). But for routine marking, I doubt that it is practical. It also becomes less useful the narrower the sample from which work is compared (e.g. it is much more useful when you can compare work across schools and classes than when comparing work within a single class, because such comparisons will tend to confirm rather than challenge the teacher’s existing views of the learning objective).

    Finally, the obvious answer to the trade-off between reliability and validity is to sample more frequently, as well as assess the reliability (i.e. consistency) of teacher judgments. What a shame, then, that the report of the Assessment without Levels Commission, on which Daisy sat, explicitly argued against approaches that would encourage frequent sampling by drawing a hard and largely spurious distinction between formative and summative purposes of assessment and suggesting that data collected for formative purposes should be thrown away as soon as possible and certainly not used to inform summative judgments.

    Reply
  12. Pingback: Refining assessment without levels | Teaching: Leading Learning

  13. PRichards (@PhiRichards)

    Dear All,

    I left this comment on David Didau’s post about CJ but it wasn’t responded to. I have only just seen this post even though it was written before his. Can anyone (who has trialled CJ and has some time) try to answer my question please?

    What do you think you are using to judge, particularly at the top end of your ranking?

    I came up with a list of things (in order):
    Structure
    -letter formation/handwriting
    -punctuation including layout on the page, capitals, accuracy in full stops, use of paragraphs

    So, all the lower end of the essays fell down on those aspects. They are pretty easy to spot when glancing over work – even the full stops.

    Next, I think my brain processed:
    Content
    -missing words
    -informal or incorrect grammar
    -unclear sentences

    This was much harder to judge quickly without feeling that I had skipped over children’s work.

    I ask about this because it’s easy to judge whether a shade of colour is darker or lighter – as demonstrated on the CJ website – I’m only looking at one thing. To be crass/use an example mentioned earlier, it’s also straightforward to judge between an ‘ugly girl’ and a ‘pretty girl’ but it’s hard to judge between two beautiful people.

    But I was curious to understand what I was looking at when I was judging the writing.

    It’s important for two reasons. The first reason is that it affects how the essays at the upper end of the judgement are ranked. The second is that articulating what you think you are judging the writing on helps you to understand what makes good/bad writing (this is different to what makes a level 4 piece of writing).

    And that’s important because sometimes, we (teachers) get caught up in what OfSTED, QCA, STA or the DfE want us to teach rather than what is actually good. CJ forces you out of this mindset. In trialling it, I realised that the ‘bottom’ children were being taught to try and follow a simplified version of the genre being taught. But when I used CJ to look at their work, I realised that they’d never get out of the ‘bottom’ rank unless they actually used full stops accurately every single time, unless they actually indented, etc – and that isn’t quite the focus in writing lessons.

    Desperately curious,

    Anon

    Reply
    1. sxpmaths

      I think the formulation of the question is a critical aspect of using CJ effectively. The NMM site poses the question centrally above the pairs of responses. Do you want to judge pupils on the technical accuracy of their writing, or on their demonstrated understanding of a topic, for example? I don’t think you could judge both aspects in a single round of CJ. Equally (as Ofqual explored), you could have more than one round of CJ, with a different question each time but the same pupils’ work.

      The feedback question that others have raised is an interesting concern. I guess I see CJ as an approach that might be of interest to, say, A level examiners. In this case, prompt, reliable grading is needed without feedback to candidates.

      Personally, I’m also thinking about the potential of CJ as a pupil activity: requiring them to make judgements and thus be exposed to a number of pieces of work of varied quality. Would they distil ideas about what ‘best’ looks like?

      Reply
  14. Pingback: Faculty Diary

  15. Pingback: Testing isn’t evil, from Teach Meet Devon | MrHistoire.com

  16. Pingback: Comparative judgment: practical tips for in-school use | The Wing to Heaven

  17. Pingback: Comparative Judgement Day – exploring a different way of assessing writing | Site Title
