Twitter – pros and cons

If you haven’t already read Changing Schools, you can buy a copy here. As well as an essay by me on assessment, it features an even better one by Andrew Old on social media. Andrew interviews and cites a number of policymakers, bloggers and academics about the impact they think social media have had on education policy.  I’m one of the bloggers he interviewed for the essay, and his questions got me thinking – what is Twitter good for? What is it bad for? How can it help us – not just in education and policymaking, but in our lives in general? Here are my pros and cons.


Pro – allows you to find lots of interesting articles vs. Con – it’s a time sink
Twitter delivers a stream of relevant and important articles and research papers to your timeline. Unfortunately, you’ll be so busy reading the Twitter storm that erupts around them, and who said what to whom about which, that you will never get round to reading the articles themselves.

Pro – allows you to gather rapid feedback vs. Con – it’s an echo chamber
If you have any kind of new idea or question, Twitter allows you to gather very quick feedback from people who have a direct interest and involvement in the field. Unfortunately, those same people will very likely be highly unrepresentative of the rest of the field.

Pro – allows you to meet interesting people vs. Con – allows you to meet horrible people
Twitter means you can meet brilliant people you have lots in common with whom you would never have met otherwise. It also means that you can meet horrible people you have very little in common with whom you would never have met otherwise.

Pro – accelerates discovery of truth vs. Con – retards discovery of truth (always assuming that ‘truth’ is a thing)
Twitter is a (fairly) level playing-field where ideas can grapple. When this happens, Milton tells us, Truth wins. On the other hand, Twitter forces ideas to be compressed and simplified. But, as Donne tells us, the Truth often isn’t simple.

Principled Assessment Design by Dylan Wiliam

Back in 2013 I wrote a lengthy review of Measuring Up by Daniel Koretz. This book has had a huge influence on how I think about assessment. Last year I read Principled Assessment Design by Dylan Wiliam, which is equally good and very helpful for anyone looking to design a replacement for national curriculum levels. As with Koretz’s book, it packs a lot in – there are useful definitions and explanations of validity, reliability, and common threats to validity. There are two areas in particular I want to comment on here: norm-referencing and multiple-choice questions. These are two aspects of assessment which people are often quite prejudiced against, but Wiliam shows there is some evidence in favour of them.

Wiliam shows that in practice, norm-referencing is very hard to get away from. The alternative is criterion-referencing, where you set a criterion and judge whether pupils have met it or not. This sounds much fairer, but it is actually much harder to do than it sounds. Wiliam gives a couple of very good practical examples of this. Take the criterion ‘can compare two fractions to identify which is larger’. Depending on which fractions are selected, as many as 90% or as few as 15% of pupils will get the question right. How should we decide which pair of fractions to include in our assessment? One useful way would be to work out what percentage of pupils from a representative norm-group got each question right. That’s essentially norm-referencing. The criterion can only be given meaning through some use of norming. ‘As William Angoff observed four decades ago, “we should be aware of the fact that lurking behind the criterion-referenced evaluation, perhaps even responsible for it, is the norm-referenced evaluation” (Angoff, 1974, p.4).’
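A minimal sketch in Python may make the norming step concrete. The response data, the fraction pairs and the function name `facility` are all invented for illustration and do not come from Wiliam’s book; the only idea taken from the text is working out the percentage of a norm group who get each question right.

```python
from typing import Dict, List

# Hypothetical responses from a small norm group: True means the pupil
# correctly identified the larger fraction. All figures are invented.
norm_group_responses: Dict[str, List[bool]] = {
    "1/7 vs 5/7": [True, True, True, True, True, True, True, True, True, False],
    "5/7 vs 5/9": [True, False, False, True, False, False, False, False, False, False],
}


def facility(responses: List[bool]) -> float:
    """Return the percentage of pupils who answered the item correctly."""
    return 100 * sum(responses) / len(responses)


for item, responses in norm_group_responses.items():
    print(f"{item}: {facility(responses):.0f}% correct")
```

With these made-up figures the same criterion looks easy on one pair of fractions and hard on the other, which is why the choice of items cannot be separated from how a norm group performs on them.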

He also draws a distinction between norm-referencing and cohort-referencing. Norm-referencing is when you ‘interpret a student’s performance in the light of the performance of some well-defined group of students who took the same assessment at some other time.’ Cohort-referencing is when you interpret performance in the light of other pupils in a cohort – be that a class or a particular age cohort. This may not sound much of a distinction, but it is crucial: ‘The important point here is that while cohort-referenced assessment is competitive (if my peers do badly, I do better), norm-referenced assessment is not competitive. Each student is compared to the performance of a group of students who took the assessment at an earlier time, so in contrast to cohort-referenced assessment, sabotaging the performance of my peers does me no good at all.’

I hadn’t fully considered this before, but I think it is an extraordinarily important point to make because it could perhaps help to improve norm-referencing’s bad image. Often, people associate norm-referencing with an era when the top grade in public exams was reserved for a certain percentage of the pupils taking the exam. However, that’s not norm-referencing – it’s cohort-referencing. Norm-referencing doesn’t have any of these unfair or competitive aspects. Instead, it is simply about providing a measure of precision in the best way we can.
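The difference between the two kinds of referencing can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the scores, the rule that a fixed number of places earn the top grade, and the function names `cohort_top_grade` and `norm_percentile`. It is not how any real exam board awards grades.

```python
from typing import List


def cohort_top_grade(my_score: int, peer_scores: List[int], top_places: int = 1) -> bool:
    """Cohort-referencing: the top grade goes to a fixed number of places
    among the pupils sitting the exam this year (a hypothetical rule)."""
    ranked = sorted(peer_scores + [my_score], reverse=True)
    cutoff = ranked[top_places - 1]
    return my_score >= cutoff


def norm_percentile(my_score: int, norm_group_scores: List[int]) -> float:
    """Norm-referencing: percentage of an earlier, fixed norm group
    who scored below my score."""
    below = sum(1 for s in norm_group_scores if s < my_score)
    return 100 * below / len(norm_group_scores)


my_score = 60
peers = [70, 65, 62, 55, 50, 45, 40, 38, 30]             # this year's cohort
sabotaged_peers = [s - 15 for s in peers]                # the same peers doing worse
norm_group = [20, 30, 40, 50, 55, 60, 65, 70, 80, 90]    # sat the paper earlier

print(cohort_top_grade(my_score, peers))            # False
print(cohort_top_grade(my_score, sabotaged_peers))  # True
print(norm_percentile(my_score, norm_group))        # 50.0
```

The cohort-referenced result flips when my peers do worse, whereas the norm-referenced figure never looks at this year’s peers at all, so sabotaging them changes nothing.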

Wiliam also warns against over-reliance on rubrics which attempt to describe what quality looks like. He quotes a famous passage from Michael Polanyi which shows the limitations of attempting to describe quality. I’ve written at greater length about this here.

Multiple-choice questions
As with norm-referencing, multiple choice or selected-response questions often get a bad press. ‘Particularly in the UK, there appears to be a widespread belief that selected-response items should not be used in school assessment…It is true that many selected-response questions do measure only shallow learning, but well-designed selected-response items can probe student understanding in some depth.’

Wiliam gives some good examples of these types of question. For example:

What can you say about the means of the following two data sets?

Set 1: 10 12 13 15
Set 2: 10 12 13 15 0

A. The two sets have the same mean.
B. The two sets have different means.
C. It depends on whether you choose to count the zero.

As he says, ‘this latter option goes well beyond assessing students’ facility with calculating the mean and probes their understanding of the definition of the mean, including whether a zero counts as a data point or not.’ I would add that these types of questions can offer better feedback than open-response questions. If you deliberately design a question to include a common misconception as a distractor, and the pupil selects that common misconception, then you have learnt something really valuable – far more valuable than if they simply don’t answer an open-response question.
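For anyone who wants to check the arithmetic behind the question, here is a short Python sketch. It treats the zero as a data point like any other, which is the understanding the question is probing; the variable names are mine.

```python
set_1 = [10, 12, 13, 15]
set_2 = [10, 12, 13, 15, 0]

# The totals are the same (50), but the zero adds a fifth data point,
# so the second total is divided by 5 rather than 4.
mean_1 = sum(set_1) / len(set_1)
mean_2 = sum(set_2) / len(set_2)

print(mean_1)  # 12.5
print(mean_2)  # 10.0
```

A pupil who chooses option A or C is probably treating the zero as ‘nothing’ and leaving the divisor at 4.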

Wiliam also notes (as he has done before, here) that one very simple way of stopping pupils guessing is to have more than one right answer. If there are five possible options, and the pupils know one is right, they have a 1 in 5 chance of guessing the right answer. If they don’t know how many are right, they have a 1 in 32 chance of guessing the right answer. In response to this I actually designed a lot of MC tests with this structure this year, and I can confirm that it significantly increases the challenge. Pupils have to spend a lot more time thinking about all of the distractors. If you want to award marks for the test and record the marks, you have to think carefully about how this works. For example, if there are, say, three right answers and a pupil correctly identifies two, it can feel harsh for the pupil to get no credit, particularly when compared to a pupil who has just guessed one completely wrong answer. This isn’t a problem if the questions are used completely formatively, of course, but it is something to bear in mind. However, I can definitely vouch for Wiliam’s central point: multiple choice questions can be made to be extremely difficult and challenging, and they can certainly test higher-order learning objectives. For more on this, see my series of posts on MCQs starting here.
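The guessing odds, and one possible way round the ‘no credit’ problem, can be sketched in Python. The odds follow from the numbers in the text; the partial-credit rule (right ticks minus wrong ticks, floored at zero, as a share of the right answers) is purely my own illustrative assumption, not something Wiliam proposes, and `partial_credit` is a hypothetical name.

```python
from typing import Set

OPTIONS = 5

# One right answer, and the pupil knows it: pick 1 of 5 at random.
p_single = 1 / OPTIONS        # 0.2, i.e. 1 in 5

# Unknown number of right answers: each option is either ticked or not,
# giving 2**5 = 32 possible patterns (including ticking nothing).
p_unknown = 1 / 2 ** OPTIONS  # 0.03125, i.e. 1 in 32


def partial_credit(selected: Set[str], correct: Set[str]) -> float:
    """Hypothetical scoring rule: right ticks minus wrong ticks,
    never below zero, as a fraction of the number of right answers."""
    hits = len(selected & correct)
    false_ticks = len(selected - correct)
    return max(0, hits - false_ticks) / len(correct)


correct = {"A", "C", "E"}
print(partial_credit({"A", "C"}, correct))       # two of three right answers found
print(partial_credit({"B"}, correct))            # one wrong guess: 0.0
print(partial_credit({"A", "C", "E"}, correct))  # full marks: 1.0
```

Under this rule the pupil who finds two of the three right answers is clearly separated from the pupil who guesses one wrong answer, at the cost of a more complicated mark scheme.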

Tacit knowledge

In my most recent blogs about assessment, I’ve looked at some of the practical problems with assessment criteria.  I think these practical problems are related to two theoretical issues: the nature of human judgment, which I’ve written about here, and tacit knowledge, which is what this post is about. In Michael Polanyi’s phrase, ‘we know more than we can tell’, and we certainly know more than we can tell in APP grids.

Take writing as an example. Teachers know what quality writing is, and when given examples of writing, teachers tend to agree on the relative quality of the examples. But it is fiendishly difficult to articulate exactly what makes one piece of writing better quality than another, and still harder to generate a set of rules which will allow a novice to identify or create quality. Sets of rules for creating quality writing or quality anything can descend into absurdity. Dylan Wiliam is fond of quoting the following from Michael Polanyi:

Maxims are rules, the correct application of which is part of the art which they govern. The true maxims of golfing or of poetry increase our insight into golfing or poetry and may even give valuable guidance to golfers and poets; but these maxims would instantly condemn themselves to absurdity if they tried to replace the golfer’s skill or the poet’s art. Maxims cannot be understood, still less applied by anyone not already possessing a good practical knowledge of the art. They derive their interest from our appreciation of the art and cannot themselves either replace or establish that appreciation.

‘These maxims would instantly condemn themselves to absurdity.’ This phrase goes through my mind whenever I read essays that have been self-consciously written to the rules of a mark scheme, rubric, or other kind of maxim. For example, I often read essays which have been quite obviously written to the rules of a PEE paragraph structure.

In this poem, the poet is angry. I know he is angry because it says the word ‘anger’. This shows me that he is angry.


In this extract, Dickens shows us that Magwitch is frightening. I know this because it says ‘bleak’ and this word shows me that Magwitch is very intimidating.

Or, at GCSE (and this is derived from an examiners’ report, here):

This article tells us that horse-racing is dangerous. We know it is dangerous because it is dangerous.

Or, at A-level, I have read essays where pupils repeat chunks of the assessment objectives, as if to flag up to the examiner that they are ticking this particular objective.

In The Darkling Thrush, Hardy uses an unusual form to shape meaning. He also uses a different structure and his language is very interesting, and overall, the form, structure and language shape meaning in this literary text.

Or, more commonly at primary, writing where every sentence begins with an adverbial word or phrase which barely makes any sense.

Forgettably, he crept through the darkness.

I think the absurdity here results from pupils having been given a rule or maxim which is of some help but which will not on its own create quality. Generally, it is a good idea to use evidence and explain your reasoning, to comment on form, structure and language, and to use adverbial sentence openers. But without concrete examples of how such rules operate in practice, they are of very limited value. And this is the problem with criteria and rubrics: they are full of prose descriptions of what quality is, but they will not actually help anyone who doesn’t already know what quality is to acquire it. Or, in Rob Coe’s words, criteria ‘are not meaningful unless you know what they already mean.’

I’ve argued before that over-reliance on criteria leads to confusion and inaccuracies with grading. But what we see here is even worse: reliance on criteria also leads to confusion in the classroom. Prose criteria are only helpful if you already understand the subject, so using them as a method to inculcate understanding is futile. And yet, we’re often recommended to share ‘success criteria’ with pupils, and rubrics are often rewritten in ‘pupil-friendly language’ which may be pupil friendly in that pupils can pronounce them, but are certainly not pupil friendly in that they can understand what they mean. This approach also leads to a ‘tick-box’ mentality, where pupils and teachers look to make sure that pupils have ticked off everything on the mark scheme. But again, this is unhelpful, because for something like writing, the question is not whether a pupil has used an adverbial opener or referred to historical context, but is more about how well they have used the adverbial opener, and how appropriate and insightful their reference to context is. The people for whom rubrics and criteria will be least helpful are novices: pupils and new teachers. And yet the people who end up relying on them the most, and who are often encouraged to rely on them as a means to acquire expertise, are pupils and new teachers. Rubrics on their own will not help them to acquire expertise, and in many cases, I worry that they may even inhibit the development of expertise.

Thomas Kuhn, who drew on Polanyi’s work, wrote about the problem of tacit knowledge in The Structure of Scientific Revolutions.

A phenomenon familiar to both students of science and historians of science provides a clue. The former regularly report that they have read through a chapter of their text, understood it perfectly, but nonetheless had difficulty solving a number of the problems at the chapter’s end…learning is not acquired by exclusively verbal means. Rather it comes as one is given words together with concrete examples of how they function in use; nature and words are learned together.

Kuhn is talking about science here; to adapt this for writing, we might say that examples and words are learned together. As I’ve argued here, it is not enough to provide descriptors describing quality writing: descriptors need to be accompanied with examples of what essays of this particular standard look like.

This idea of tacit knowledge can sometimes be interpreted to mean that pupils can never learn something explicitly and must just pick up expertise implicitly. That is not my interpretation at all, nor do I think it is borne out by Kuhn’s or Polanyi’s work. Kuhn does say that rules and prose descriptions on their own cannot bring understanding, but he does not suggest replacing them with aimless discovery. His suggestion is that it is the problem sets at the end of the prose textbook chapter which really bring meaning. The types of problem sets he is referring to are often quite artificial and isolated examples of the natural world, examples that have been deliberately isolated and selected to prove the textbook’s point. Polanyi also talks quite extensively about how the expert has spent hours focussing their attention on tiny details, and learning to recognise differences that completely elude the casual observer. This is not achieved through discovery, but through direction. It is not achieved quickly, but through thousands of hours of deliberate practice.

Similarly, to go back to the example of writing, I don’t think that we can just expect pupils to pick up notions of quality writing through discovery. What we need are examples of quality writing where the salient features are isolated and discussed, and where pupils have to respond in some way to them, just as the problem sets in a science textbook require certain responses.

If we want to explain what quality is, we need more than just prose descriptors. We need annotated examples, problem sets, and plenty of time.

Marking essays and poisoning dogs

This psychological experiment asked participants to judge the following actions.

(1) Stealing a towel from a hotel
(2) Keeping a dime you find on the ground
(3) Poisoning a barking dog

They had to give each action a mark out of 10 depending on how immoral the action was, on a scale where 1 is not particularly bad or wrong and 10 is extremely evil.

A second group were asked to do the same, but they were given the following three actions to judge.

(1”) Testifying falsely for pay
(2”) Using guns on striking workers
(3”) Poisoning a barking dog

I am sure you can guess the point of this. Item (3) and item (3”) are identical, and yet the two groups consistently differ on their ratings of these items. The latter group judge the action to be less evil than the former group. The reason is not hard to see: when you are thinking in terms of stealing towels and dimes, poisoning a barking dog seems heinous, but in the context of killing humans, it seems less so. The same principle has been observed in many other fields, and has led many psychologists to conclude that human judgment is essentially comparative, not absolute. There is a fantastic game on the No More Marking website which demonstrates exactly the same point. I recommend you click on the image below which links to the game, and play it right now, as it will illustrate this point better than any words can.


In brief, the game asks you to look at 8 different shades of blue individually and rank them from light to dark. It then asks you to make a series of comparisons between shades of blue. Everyone is much better at the latter than at the former. The No More Marking website also includes a bibliography about this issue.

Hopefully, you can also see the applications of this to assessment. This is one of the reasons why we need to define absolute criteria with comparative examples.

The interesting thing to note about the ‘evil’ and ‘blue’ examples is that the criteria are not that complex. One does not need special training or qualifications to be able to tell light blue from dark blue. The final judgment is one that everyone would agree with. Similarly, whilst judging evil is morally complex, it is not technically complex – everyone knows what it means. And yet, even in cases where the criteria are so clear, and so well understood, we still struggle. Imagine how much more we will struggle when the criteria are technically complex, as they are in exams.  If we aren’t very accurate when asked to judge evil or blue in absolute terms, what will we be like when asked to judge how well pupils have analysed a text? The other thing this suggests is that learning more about a particular topic, and learning more about how pupils respond to it, will not of itself make you a better marker. You could have great expertise and experience in teaching and reading essays on a certain topic, but if you continue to mark them in this absolute way, you will still struggle. Expertise in a topic, and experience in marking essays on that topic, are necessary but not sufficient conditions of being an accurate marker. However expert we are in a topic, we need comparative examples to guide us.

Unfortunately, over the last few years, the idea that we can judge work absolutely has become very popular. Pupils’ essays are ticked off against APP grids or mark schemes, and if they tick enough of the statements, then that means they have achieved a certain grade. But as we have seen, this approach is open to so much interpretation. Our brains are just not equipped to make judgments in this way. I also worry that such an approach has a negative impact on teaching, as teachers teach towards the mark scheme and pupils produce essays which start to sound like the mark scheme itself. Instead, what we need to do is to take a series of essays and compare them against each other. This is at the heart of the No More Marking approach, which also has the benefit of making marking less time-consuming.  If you aren’t ready to adopt No More Marking, you can still get some of the benefits of this approach by changing the way you mark and think about marking. Instead of judging essays against criteria, compare them to other essays.  Comparisons are at the heart of human judgment.
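To show what comparing essays against each other might look like in its very simplest form, here is a Python sketch that ranks essays by how often they are preferred in paired judgments. The essay labels and judgments are invented, and `rank_by_wins` is a hypothetical helper. This is emphatically not a description of how No More Marking works: a fuller treatment would fit a statistical model for paired comparisons, such as Bradley-Terry, rather than simply counting wins.

```python
from collections import Counter
from typing import List, Tuple

# Each tuple records one judgment: (preferred essay, other essay).
# The data is invented for illustration.
judgments: List[Tuple[str, str]] = [
    ("essay_B", "essay_A"),
    ("essay_B", "essay_C"),
    ("essay_C", "essay_A"),
    ("essay_B", "essay_D"),
    ("essay_C", "essay_D"),
    ("essay_A", "essay_D"),
]


def rank_by_wins(pairs: List[Tuple[str, str]]) -> List[str]:
    """Order essays from most to least often preferred.
    Ties are broken alphabetically, which is an arbitrary choice."""
    wins: Counter = Counter()
    for winner, loser in pairs:
        wins[winner] += 1
        wins[loser] += 0  # make sure essays that never win still appear
    return [essay for essay, _ in sorted(wins.items(), key=lambda kv: (-kv[1], kv[0]))]


print(rank_by_wins(judgments))
# ['essay_B', 'essay_C', 'essay_A', 'essay_D']
```

Notice that no judge ever has to decide what mark an essay deserves in absolute terms; each decision is only ‘which of these two is better?’, and the rank order falls out of the comparisons.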

I am grateful to Chris Wheadon from No More Marking for talking me through how his approach works. No More Marking are running a trial with FFT and Education Datalab which will explore how their approach can be used to measure progress across secondary. See here for more detail.

As an interesting aside, one of the seminal works in the research on the limitations of human judgment is George Miller’s paper The Magical Number Seven. I knew of this paper in a different context: it is also a seminal work in the field of working memory, and the limitations of working memory. Miller also wrote the excellent, and very practical, article ‘How Children Learn Words’, which shows how looking words up in dictionaries and other reference sources may not be the best strategy for learning new vocabulary. I’ve written a bit about this here.

My last few posts have all been about the flaws in using criteria, and alternatives to using them. In my last post, I mentioned that there were two pieces of theory which had influenced my thinking on this: research on human judgment, and on tacit knowledge. This blog post has looked at the research on human judgment, and in my next, I will consider how the idea of tacit knowledge also casts doubt on the use of criteria.

Wellington Festival of Education 2015 – review

On Thursday and Friday I went to Wellington Education Festival for the fifth year in a row. It’s an amazing event and I’ve come back from every one feeling inspired and excited.  Back in 2011 the festival was on a weekend and I remember sitting in a talk with Katharine Birbalsingh trying to guess which person Andrew Old was. We managed to narrow it down to two people (I sometimes wonder who the other person was) and then went and sat with him in a cafeteria and talked about how surreal it was to meet people you know from Twitter.

This year the festival was on weekdays and there were many more speakers, exhibitors and visitors, but there was no Andrew Old, sadly. I did still get to meet Cazzypot and Andrew Sabisky, two tweeters I’d never met before. I also got to hear Angela Duckworth speak in person, and listened to a fascinating conversation on intelligence with her, Sebastian Faulks, Robert Plomin, and Anthony Seldon. I heard Sir Andrew Carter speak sense about a school-led education system. I had some great conversations with so many people throughout the two days, and still left feeling as though there were so many more people I wanted to speak to and hear from. I also spoke on three panels – here’s a brief summary.

How can we make great teaching sustainable?

I spoke on this panel with Brian Sims, Director of Education at Ark, Sam Freedman of Teach First, and Rob Peal of West London Free School.  We talked about some of the issues brought up in the workload challenge, and of the difficulties in defining good teaching. I spoke about one thing the government could  do to help with workload: continued reform of Ofsted inspections. Currently, I think Ofsted inspections, perhaps unintentionally, lead to the assumption that if something is recorded on paper, it has happened, and if it hasn’t been recorded on paper, it hasn’t happened. Neither assumption is true. I also spoke about one thing schools and teachers could do to help with workload, which is to set priorities. Schools and teachers can’t do everything. They have to choose what is most important and focus on that. James Theo’s recent blog really sums up my feelings about this.

Changing Schools

I spoke on this panel, chaired by Rob Peal, with Jonathan Simons of Policy Exchange, James O’Shaughnessy of Floreat Education, and Brett Wigdortz of Teach First. Rob has recently edited a collection of essays on school reform which Jonathan, James and I have all contributed to. I gave a very brief précis of my essay in the book, on assessment, and explained why I thought assessment changes were the most significant reforms of the past five years, more so than some of the structural reforms which generate a lot of controversy. In particular I spoke about 1) the removal of coursework in many national exams 2) the removal of national curriculum levels and 3)  the reform of accountability measures. There’s more on my blog here about why getting rid of coursework is a good thing. Most of the debate in this session was about Jonathan’s very interesting and provocative defence of an interfering, activist Secretary of State for Education.

Is teaching an art or a science?

Claire Fox chaired this Battle of Ideas session, where I spoke with Rob Coe, Tom Bennett, Alka Sehgal Cuthbert and Alistair McConville. This was the debate I was looking forward to most, as I think it underpins many other debates in education. Why is it so hard to have a fair system of school inspections, lesson observations or performance-related pay? Why does greater expenditure not necessarily lead to better outcomes? Should we outsource decisions on curriculum and assessment to panels of experts? Should we have a College of Teaching? In all of these debates, and many more, we need a definition of what teaching is. Is it an art or a science?

In my opening statement, I read out an extract from a Diane Ravitch article where she compares education to medicine, and demonstrates the difference in their evidence bases.  I recommend reading the whole thing – it is brilliant. In the article, she recalls a time when she had to be treated in hospital for a serious medical condition, and imagines what would have happened if education experts had treated her, not medical experts. Here is a very short extract.

Instead, my new specialists began to argue over whether anything was actually wrong with me. A few thought that I had a problem, but others scoffed and said that such an analysis was tantamount to “blaming the victim.” Some challenged the concept of “illness,” claiming that it was a social construction, utterly lacking in objective reality. Others rejected the evidence of the tests used to diagnose my ailment; a few said that the tests were meaningless for females, and others insisted that the tests were meaningless for anyone under any circumstances. One of the noisier researchers maintained that any effort to focus attention on my individual situation merely diverted attention from gross social injustices; a just social order could not come into existence, he claimed, until anecdotal cases like mine were not eligible for attention and resources.

Rob Coe also compared education to medicine, and discussed the start of the evidence-based medicine movement in the early 90s. I feel that the appropriate medical comparison is much further back, however: in many ways, education is like medicine in the late 19th century, when medicine depended less on theory and evidence, and more on the subjective and intuitive understanding of the individual doctor. Ben Riley, of Deans for Impact, has an article here called ‘Can teacher educators learn from medical school reform?’ where he looks at the way the medical profession changed in the US in the early 20th century. I also quoted a great Keith Stanovich article which argues that the ‘adherence to a subjective, personalised view of knowledge is what continually leads to educational fads’. On the topic of education fads, Tom Bennett was, as ever, brilliant. However, there were others on the panel who felt that the comparison to medicine was not helpful: Claire, the chair, felt that the two fields were just too different to make a meaningful comparison, and also pointed out that evidence and science can’t help us decide whether to teach King Lear or computer coding. That decision is one that can only be made through reference to values. I think Alka also felt that the reliance on evidence and science was inimical to the aims of a liberal arts education.

There’s a lot more I could write about, and I’m sure the discussions will continue on Twitter and elsewhere. I’m grateful to all at Wellington for making the festival happen.

Assessment alternatives 2: using pupil work instead of criteria

In my last few blog posts, I’ve looked at the problems with performance descriptors such as national curriculum levels. I’ve suggested two alternatives: defining these performance descriptors in terms of 1) questions and 2) example work. I discussed the use of questions here, and in this post I’ll discuss the use of pupil work.

Take the criterion: ‘Can compare two fractions to see which is bigger’. If we define this in terms of a series of closed questions – which is bigger: 5/7 or 5/9?; which is bigger: 1/7 or 5/7? – it gives us more clarity about exactly what the criterion means. It also means we can then calculate the percentage of pupils who get each question right and use that as a guide to the relative difficulty of each question. Clearly, this approach won’t work for more open questions. Take the criterion: ‘Can explain how language, structure and form contribute to writers’ presentation of ideas, themes and settings’. There are ways that we can interpret this criterion in terms of a closed question – see my posts on multiple choice questions here. But it is very likely that we would also like to interpret this criterion in terms of an open question, for example, ‘How does Dickens create suspense in chapter 1 of Great Expectations?’ We then need some more prose descriptions telling us how to mark it. Here are some from the latest AQA English Literature spec.

Band 5 – sophisticated, perceptive
sophisticated analysis of a wide range of aspects of language supported by impressive use of textual detail

Band 4 – assured, developed
assured analysis of a wide range of aspects of language supported by convincing use of textual detail

Band 3 – clear, consistent
clear understanding of a range of aspects of language supported by relevant and appropriate textual detail

Band 2 – some demonstration
some familiarity with obvious features of language supported by some relevant textual detail

Band 1 – limited demonstration
limited awareness of obvious features of language

So in this case, defining the question hasn’t actually moved us much further on, as there is no simple right or wrong answer to this question. We are still stuck with the vague prose descriptors – this time in a format where the change of a couple of adjectives (‘impressive’ is better than ‘convincing’ is better than ‘relevant and appropriate’) is enough to represent a whole band’s difference. I’ve written about this adverb problem here, and Christine Counsell has shown how you can cut up a lot of these descriptors and jumble them up, and even experienced practitioners can’t put them back together in the right order again. So how can we define these in a more concrete and meaningful way? A common response is to say that essays are just vaguer than closed questions and we just have to accept that. I accept that in the case of these types of open question, we will always have lower levels of marking reliability than in the case of closed questions like ‘What is 5/9 of 45?’ However, I still think there is a way we can help to define this criterion a bit more precisely. That way is to define the above band descriptors in terms of example pupil work. Instead of spending a lot of time excavating the etymological differences between ‘sophisticated’ and ‘assured’, we can look at a sample piece of work that has been agreed to be sophisticated, and one that has been agreed to be assured. The more samples of work, the better, and reading and discussing these would form a great activity for teacher training and professional development.

So again, we have something that sits behind the criterion, giving it meaning. Again, it would be very powerful if this could be done at a national scale – imagine a national bank of exemplar work by pupils of different ages, in different subjects, and at different grades. But even if it were not possible to do this nationally, it would still be valuable at a school level. Departments could build up exemplar work for all of their frequent essay titles, and use them in subsequent years to help with marking and moderation meetings. Just as creating questions is useful for teaching and learning, so the collection of samples of pupil work is helpful pedagogically too. Tom Sherrington gives an idea of what this might look like in this blog here.

Many departments and exam boards already do things like this, of course, and I suspect many teachers of subjects like English and History will tell you that moderation meetings are some of the most useful professional development you can do. The best moderation meetings I have been a part of have been structured around this kind of discussion and comparison of pupil work. I can remember one particularly productive meeting where we discussed how the different ways that pupils had expressed their thoughts actually affected the quality of the analysis itself. The discussions in that meeting formed the basis of this blog post, about the value of teaching grammar.

However, not all the moderation meetings I have attended have been as productive as this. The less useful ones are those where discussion always focuses on the finer points of the rubric. Often these meetings descend into fairly sterile and unresolvable arguments about whether an essay is ‘thoughtful’ or ‘sophisticated’, or what ratio of sophisticated analysis to unsophisticated incoherence is needed to justify an overall judgment of ‘sophisticated’. (‘It’s a level 2 for AF6 technical accuracy, but a level 8 for AF1 imagination – so overall it deserves a level 4.’)

So, if we accept the principle that the criteria in mark schemes need interpreting through exemplars, then I would suggest that discussions in moderation meetings should focus more on comparison of essays to other essays and exemplars, and less on comparison of essays to the mark scheme.

criteria vs exemplars

Just as with the criterion / question comparison, this does not mean that we have to get rid of the criteria. It means that we have to define the words of the criteria in terms of the sophistication of actual work, not the sophistication, or more often the sophistry, of the assessor’s interpretation of the mark scheme.

There are two interesting pieces of theory which have informed what I’ve written above. The first is about how humans are very bad at making absolute judgments, like those against criteria. We are much better at making comparative judgments. The No More Marking website has a great practical demonstration of this, as well as a bibliography. The second is the work of Michael Polanyi and Thomas Kuhn on tacit knowledge, and on the limitations of prose descriptions. Dylan Wiliam has written about the application of this to assessment in Embedded Formative Assessment.
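To make the idea of comparative judgment a little more concrete, here is a minimal sketch in Python of how a set of 'this essay is better than that one' decisions could be turned into a rank order. It uses the Bradley-Terry model, a standard statistical model for paired comparisons; I am not claiming this is how any particular website or tool works. The function name, the essay labels and the judgments themselves are all invented for illustration.

```python
from collections import defaultdict


def rank_by_comparison(judgements, iterations=100):
    """Rank essays from a list of (winner, loser) pairs.

    Fits a Bradley-Terry model with the usual iterative update:
    an essay's score is its number of wins divided by a sum that
    depends on how strong its opponents currently look.
    """
    wins = defaultdict(int)
    # opponents[e][other] = how many times e and other were compared
    opponents = defaultdict(lambda: defaultdict(int))
    for winner, loser in judgements:
        wins[winner] += 1
        opponents[winner][loser] += 1
        opponents[loser][winner] += 1

    essays = list(opponents)
    scores = {e: 1.0 for e in essays}  # start with every essay equal
    for _ in range(iterations):
        new_scores = {}
        for e in essays:
            denom = sum(
                n / (scores[e] + scores[other])
                for other, n in opponents[e].items()
            )
            new_scores[e] = wins[e] / denom
        total = sum(new_scores.values())
        scores = {e: s / total for e, s in new_scores.items()}

    # Best essay first
    return sorted(essays, key=scores.get, reverse=True)


# Invented judgements: each pair of essays has been compared three times.
judgements = [
    ("A", "B"), ("A", "B"), ("B", "A"),
    ("B", "C"), ("B", "C"), ("C", "B"),
    ("A", "C"), ("A", "C"), ("C", "A"),
]
ranking = rank_by_comparison(judgements)
print(ranking)
```

Notice that nobody in this sketch ever has to decide whether an essay is 'thoughtful' or 'sophisticated'. Each judgment is only ever 'which of these two is better?', and the rank order falls out of many such decisions.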

Over the next couple of weeks I’ll try to blog a bit more about both of these topics.

Assessment alternatives 1: using questions instead of criteria

In many blog posts over the last couple of years, I’ve talked about the problems with prose descriptors such as national curriculum levels and grade descriptors. It’s often said that national curriculum levels and the like give us a shared language: actually, as I argue here, they create the illusion of a shared language. I’ve also suggested two possible alternatives: criteria must be defined not through prose but through (1) questions and (2) pupil work. In this post and the next, I’ll expand a bit more on what I mean by these.

Defining criteria through questions

As Dylan Wiliam shows in this pamphlet, even a tightly defined criterion like ‘can compare two fractions to decide which is bigger’ can be interpreted in very different ways. If the two fractions are 3/7 and 5/7, 90% of pupils answer it correctly; if they are 5/7 and 5/9, only 15% do. In my experience, a criterion such as ‘understand what a verb is’ will be met by nearly all pupils if it is defined by the following question.

Which of the following words can be used as a verb?
a) run
b) tree
c) car
d) person
e) apple

However, let’s imagine the question is the following:

In which sentences is ‘cook’ a verb?

a) I cook a meal.
b) He is a good cook.
c) The cook prepared a nice meal.
d) Every morning, they cook breakfast.
e) That restaurant has a great cook.

In this case, the percentage getting it right is much, much smaller. The problem when you rely solely on criteria is that some people define the criterion as the former, whereas others define it as the latter. And in some cases, criteria may be defined in even more unreliable ways than the above questions.

So, here’s the principle: wherever possible, define criteria through questions, through groups of questions and through question banks. If you must have criteria, have the question bank sitting behind each criterion. Instead of having teachers making a judgment about whether a pupil has met each criterion, have pupils answer questions instead. This is far more accurate, and also provides clarity for lesson and unit planning.

Writing questions can be burdensome, but you can share the burden, and once a question is written, you can reuse it, whereas judgments obviously have to be made individually. If you don’t have much technology, you can record results in an old-fashioned paper mark book or on a simple Excel spreadsheet. If you have access to more technology, then you can store questions on a computer database, get pupils to take them on a computer and have them automatically marked for you, which is a huge timesaver. Imagine if all the criteria on the national curriculum were underpinned by digital question banks of hundreds, or even thousands of questions, and if each question came with statistics about how well pupils did on it. It would have great benefits not just for the accuracy of assessment, but also for improving teaching and learning. These questions don’t have to be organised into formal tests and graded – in fact, I would argue they shouldn’t be. The main aim of them is to get reliable information on what a pupil can and cannot do. As Bodil Isaksen shows here, the type of data you get from this sort of system is really useful, as opposed to what she calls the ‘junk data’ you get from criteria-judgments.
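For anyone wondering what a very small digital question bank might look like, here is a rough sketch in Python. It is only an illustration: the class and field names are my own invention, the pupil answers are made up, and I have assumed that the intended answers to the two verb questions above are (a) and (a, d) respectively. Each question sits behind a criterion, is marked automatically, and keeps a running statistic of how many pupils got it right.

```python
from dataclasses import dataclass, field


@dataclass
class Question:
    """One closed question sitting behind a criterion (names are illustrative)."""
    criterion: str
    prompt: str
    correct: frozenset  # letters of the correct options
    results: list = field(default_factory=list)  # True/False per attempt

    def mark(self, answer):
        """Mark one pupil's answer automatically and record the result."""
        is_right = frozenset(answer) == self.correct
        self.results.append(is_right)
        return is_right

    def facility(self):
        """Proportion of attempts that were correct, or None if unanswered."""
        if not self.results:
            return None
        return sum(self.results) / len(self.results)


# Two questions behind the same criterion (assumed answer keys).
easy = Question(
    "understand what a verb is",
    "Which of the following words can be used as a verb?",
    frozenset("a"),
)
hard = Question(
    "understand what a verb is",
    "In which sentences is 'cook' a verb?",
    frozenset("ad"),
)

# Invented answers from four hypothetical pupils.
for answer in ["a", "a", "a", "b"]:
    easy.mark(answer)
for answer in ["ad", "a", "abd", "ade"]:
    hard.mark(answer)

for q in (easy, hard):
    print(f"{q.prompt} -> {q.facility():.0%} correct")
```

The same criterion produces very different success rates depending on which question defines it, and that is exactly the information a criterion-judgment on its own hides.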

Below is a comparison between the two options. Of course, strictly speaking, question-based judgments don’t entail abolishing criteria. You can still have criteria, but they have to be underpinned by the questions. The crucial difference between the two options is not the existence or otherwise of criteria, but the evidence each option produces.

[Image: comparison of criteria-based judgments and question-based judgments]

What about essays?
What about subjects where the criteria can’t be defined through questions? For example, we might set an essay question for English, but clearly, the question here is not the same as the above maths and grammar questions. With closed questions like the maths and grammar ones above, most of the effort and judgment comes before the fact, in the creation of the question. In the case of open questions like essays, the effort and judgment mostly come after the fact, in the marking of the response. So here, what is important is not just defining the question, but defining the response. That will be the subject of my next post.