In my last few blog posts, I’ve looked at the problems with performance descriptors such as national curriculum levels. I’ve suggested two alternatives: defining these performance descriptors in terms of 1) questions and 2) example work. I discussed the use of questions here, and in this post I’ll discuss the use of pupil work.
Take the criterion: ‘Can compare two fractions to see which is bigger’. If we define this in terms of a series of closed questions – which is bigger: 5/7 or 5/9?; which is bigger: 1/7 or 5/7? – it gives us more clarity about exactly what the criterion means. It also means we can then calculate the percentage of pupils who get each question right and use that as a guide to the relative difficulty of each question. Clearly, this approach won’t work for more open questions. Take the criterion: ‘Can explain how language, structure and form contribute to writers’ presentation of ideas, themes and settings’. There are ways that we can interpret this criterion in terms of a closed question – see my posts on multiple choice questions here. But it is very likely that we would also like to interpret this criterion in terms of an open question, for example, ‘How does Dickens create suspense in chapter 1 of Great Expectations.’ We then need some more prose descriptions telling us how to mark it. Here are some from the latest AQA English Literature spec.
Band 5 – sophisticated, perceptive
sophisticated analysis of a wide range of aspects of language supported by impressive use of textual detail
Band 4 – assured, developed
assured analysis of a wide range of aspects of language supported by convincing use of textual detail
Band 3 – clear, consistent
clear understanding of a range of aspects of language supported by relevant and appropriate textual detail
Band 2 – some demonstration
some familiarity with obvious features of language supported by some relevant textual detail
Band 1 – limited demonstration
limited awareness of obvious features of language
So in this case, defining the question hasn’t actually moved us much further on, as there is no simple right or wrong answer to this question. We are still stuck with the vague prose descriptors – this time in a format where the change of a couple of adjectives (‘impressive’ is better than ‘convincing’ is better than ‘relevant and appropriate’) is enough to represent a whole band’s difference. I’ve written about this adverb problem here, and Christine Counsell has shown how you can cut up a lot of these descriptors and jumble them up, and even experienced practitioners can’t put them back together in the right order again. So how can we define these in a more concrete and meaningful way? A common response is to say that essays are just vaguer than closed questions and we just have to accept that. I accept that in the case of these types of open question, we will always have lower levels of marking reliability than in the case of closed questions like ‘What is 5/9 of 45?’ However, I still think there is a way we can help to define this criterion a bit more precisely. That way is to define the above band descriptors in terms of example pupil work. Instead of spending a lot of time excavating the etymological differences between ‘sophisticated’ and ‘assured’, we can look at a sample piece of work that has been agreed to be sophisticated, and one that has been agreed to be assured. The more samples of work, the better, and reading and discussing these would form a great activity for teacher training and professional development.
So again, we have something that sits behind the criterion, giving it meaning. Again, it would be very powerful if this could be done at a national scale – imagine a national bank of exemplar work by pupils of different ages, in different subjects, and at different grades. But even if it were not possible to do this nationally, it would still be valuable at a school level. Departments could build up exemplar work for all of their frequent essay titles, and use them in subsequent years to help with marking and moderation meetings. Just as creating questions is useful for teaching and learning, so the collection of samples of pupil work is helpful pedagogically too. Tom Sherrington gives an idea of what this might look like in this blog here.
Many departments and exam boards already do things like this, of course, and I suspect many teachers of subjects like English and History will tell you that moderation meetings are some of the most useful professional development you can do. The best moderation meetings I have been a part of have been structured around this kind of discussion and comparison of pupil work. I can remember one particularly productive meeting where we discussed how the different ways that pupils had expressed their thoughts actually affected the quality of the analysis itself. The discussions in that meeting formed the basis of this blog post, about the value of teaching grammar.
However, not all the moderation meetings I have attended have been as productive as this. The less useful type are those where discussion always focusses on finer points of the rubric. Often these meetings can descend into fairly sterile and unresolvable arguments about whether an essay is ‘thoughtful’ or ‘sophisticated’, or what ratio of sophisticated analysis to unsophisticated incoherence is needed to justify an overall judgment of ‘sophisticated’. (‘It’s a level 2 for AF6 technical accuracy, but a level 8 for AF1 imagination – so overall it deserves a level 4’).
So, if we accept the principle that the criteria in mark schemes need interpreting through exemplars, then I would suggest that discussions in moderation meetings should focus more on comparison of essays to other essays and exemplars, and less on comparison of essays to the mark scheme.
Just as with the criterion / question comparison, this does not mean that we have to get rid of the criteria. It means that we have to define the words of the criteria in terms of the sophistication of actual work, not the sophistication, or more often the sophistry, of the assessor’s interpretation of the mark scheme.
There are two interesting pieces of theory which have informed what I’ve written above. The first is about how humans are very bad at making absolute judgments, like those against criteria. We are much better at making comparative judgments. The No More Marking website has a great practical demonstration of this, as well as a bibliography. The second is the work of Michael Polanyi and Thomas Kuhn on tacit knowledge, and on the limitations of prose descriptions. Dylan Wiliam has written about the application of this to assessment in Embedded Formative Assessment.
Over the next couple of weeks I’ll try to blog a bit more about both of these topics.