Marking essays and poisoning dogs

This psychological experiment asked participants to judge the following actions.

(1) Stealing a towel from a hotel
(2) Keeping a dime you find on the ground
(3) Poisoning a barking dog

They had to give each action a mark out of 10 according to how immoral it was, where 1 means not particularly bad or wrong and 10 means extremely evil.

A second group were asked to do the same, but they were given the following three actions to judge.

(1”) Testifying falsely for pay
(2”) Using guns on striking workers
(3”) Poisoning a barking dog

I am sure you can guess the point of this. Item (3) and item (3”) are identical, and yet the two groups consistently differ on their ratings of these items. The latter group judge the action to be less evil than the former group. The reason is not hard to see: when you are thinking in terms of stealing towels and dimes, poisoning a barking dog seems heinous, but in the context of killing humans, it seems less so. The same principle has been observed in many other fields, and has led many psychologists to conclude that human judgment is essentially comparative, not absolute. There is a fantastic game on the No More Marking website which demonstrates exactly the same point. I recommend you click on the image below which links to the game, and play it right now, as it will illustrate this point better than any words can.

[Image: screenshot linking to the No More Marking shades-of-blue game]

In brief, the game asks you to look at 8 different shades of blue individually and rank them from light to dark. It then asks you to make a series of comparisons between shades of blue. Everyone is much better at the latter than at the former. The No More Marking website also includes a bibliography about this issue.

Hopefully, you can also see the applications of this to assessment. This is one of the reasons why we need to define absolute criteria with comparative examples.

The interesting thing to note about the ‘evil’ and ‘blue’ examples is that the criteria are not that complex. One does not need special training or qualifications to be able to tell light blue from dark blue. The final judgment is one that everyone would agree with. Similarly, whilst judging evil is morally complex, it is not technically complex – everyone knows what it means. And yet, even in cases where the criteria are so clear, and so well understood, we still struggle. Imagine how much more we will struggle when the criteria are technically complex, as they are in exams.  If we aren’t very accurate when asked to judge evil or blue in absolute terms, what will we be like when asked to judge how well pupils have analysed a text? The other thing this suggests is that learning more about a particular topic, and learning more about how pupils respond to it, will not of itself make you a better marker. You could have great expertise and experience in teaching and reading essays on a certain topic, but if you continue to mark them in this absolute way, you will still struggle. Expertise in a topic, and experience in marking essays on that topic, are necessary but not sufficient conditions of being an accurate marker. However expert we are in a topic, we need comparative examples to guide us.

Unfortunately, over the last few years, the idea that we can judge work absolutely has become very popular. Pupils’ essays are ticked off against APP grids or mark schemes, and if they tick enough of the statements, then that means they have achieved a certain grade. But as we have seen, this approach is open to so much interpretation. Our brains are just not equipped to make judgments in this way. I also worry that such an approach has a negative impact on teaching, as teachers teach towards the mark scheme and pupils produce essays which start to sound like the mark scheme itself. Instead, what we need to do is to take a series of essays and compare them against each other. This is at the heart of the No More Marking approach, which also has the benefit of making marking less time-consuming.  If you aren’t ready to adopt No More Marking, you can still get some of the benefits of this approach by changing the way you mark and think about marking. Instead of judging essays against criteria, compare them to other essays.  Comparisons are at the heart of human judgment.
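To make the comparative idea concrete, judgments of the form 'essay A is better than essay B' can be combined into a rank order with a simple statistical model. Below is a minimal, illustrative Python sketch using the Bradley-Terry model; the essay labels and judgment data are invented, and this is not a description of No More Marking's actual implementation.

```python
from collections import defaultdict

def bradley_terry(comparisons, iterations=100):
    """Estimate a strength score per item from pairwise judgments.

    comparisons: list of (winner, loser) tuples, one per judgment.
    Returns a dict mapping item -> strength (higher = judged better).
    """
    wins = defaultdict(int)          # total wins per item
    pair_counts = defaultdict(int)   # times each unordered pair was judged
    items = set()
    for winner, loser in comparisons:
        wins[winner] += 1
        pair_counts[frozenset((winner, loser))] += 1
        items.update((winner, loser))

    strength = {i: 1.0 for i in items}
    for _ in range(iterations):
        new = {}
        for i in items:
            # Standard iterative update: wins divided by expected comparisons
            denom = sum(
                pair_counts.get(frozenset((i, j)), 0) / (strength[i] + strength[j])
                for j in items if j != i
            )
            new[i] = wins[i] / denom if denom else strength[i]
        total = sum(new.values())
        strength = {i: s / total for i, s in new.items()}  # normalise
    return strength

# Three essays; each pair judged twice, by different judges.
judgments = [("A", "B"), ("A", "B"), ("B", "C"),
             ("B", "C"), ("A", "C"), ("C", "A")]
scores = bradley_terry(judgments)
ranking = sorted(scores, key=scores.get, reverse=True)  # best first
```

With these invented judgments, essay A wins the most comparisons and comes out top of the ranking; no judge ever had to assign an absolute mark to any essay.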

I am grateful to Chris Wheadon from No More Marking for talking me through how his approach works. No More Marking are running a trial with FFT and Education Datalab which will explore how their approach can be used to measure progress across secondary. See here for more detail.

As an interesting aside, one of the seminal works in the research on the limitations of human judgment is George Miller’s paper The Magical Number Seven. I knew of this paper in a different context: it is also a seminal work in the field of working memory, and the limitations of working memory. Miller also wrote the excellent, and very practical, article ‘How Children Learn Words’, which shows how looking words up in dictionaries and other reference sources may not be the best strategy for learning new vocabulary. I’ve written a bit about this here.

My last few posts have all been about the flaws in using criteria, and alternatives to using them. In my last post, I mentioned that there were two pieces of theory which had influenced my thinking on this: research on human judgment, and on tacit knowledge. This blog post has looked at the research on human judgment, and in my next, I will consider how the idea of tacit knowledge also casts doubt on the use of criteria.

Wellington Festival of Education 2015 – review

On Thursday and Friday I went to Wellington Education Festival for the fifth year in a row. It’s an amazing event and I’ve come back from every one feeling inspired and excited.  Back in 2011 the festival was on a weekend and I remember sitting in a talk with Katharine Birbalsingh trying to guess which person Andrew Old was. We managed to narrow it down to two people (I sometimes wonder who the other person was) and then went and sat with him in a cafeteria and talked about how surreal it was to meet people you know from Twitter.

This year the festival was on weekdays, there were many more speakers, exhibitors and visitors, but there was no Andrew Old, sadly. I did still get to meet Cazzypot and Andrew Sabisky, two tweeters I’d never met before. I also got to hear Angela Duckworth speak in person, and listened to a fascinating conversation on intelligence with her, Sebastian Faulks, Robert Plomin, and Anthony Seldon. I heard Sir Andrew Carter speak sense about a school-led education system. I had some great conversations with so many people throughout the two days, and still left feeling as though there were so many more people I wanted to speak to and hear from. I also spoke on three panels – here’s a brief summary.

How can we make great teaching sustainable?

I spoke on this panel with Brian Sims, Director of Education at Ark, Sam Freedman of Teach First, and Rob Peal of West London Free School.  We talked about some of the issues brought up in the workload challenge, and of the difficulties in defining good teaching. I spoke about one thing the government could  do to help with workload: continued reform of Ofsted inspections. Currently, I think Ofsted inspections, perhaps unintentionally, lead to the assumption that if something is recorded on paper, it has happened, and if it hasn’t been recorded on paper, it hasn’t happened. Neither assumption is true. I also spoke about one thing schools and teachers could do to help with workload, which is to set priorities. Schools and teachers can’t do everything. They have to choose what is most important and focus on that. James Theo’s recent blog really sums up my feelings about this.

Changing Schools

I spoke on this panel, chaired by Rob Peal, with Jonathan Simons of Policy Exchange, James O’Shaughnessy of Floreat Education, and Brett Wigdortz of Teach First. Rob has recently edited a collection of essays on school reform which Jonathan, James and I have all contributed to. I gave a very brief précis of my essay in the book, on assessment, and explained why I thought assessment changes were the most significant reforms of the past five years, more so than some of the structural reforms which generate a lot of controversy. In particular I spoke about 1) the removal of coursework in many national exams 2) the removal of national curriculum levels and 3)  the reform of accountability measures. There’s more on my blog here about why getting rid of coursework is a good thing. Most of the debate in this session was about Jonathan’s very interesting and provocative defence of an interfering, activist Secretary of State for Education.

Is teaching an art or a science?

Claire Fox chaired this Battle of Ideas session, where I spoke with Rob Coe, Tom Bennett, Alka Seghal-Cuthbert and Alistair McConville. This was the debate I was looking forward to most, as I think it underpins many other debates in education. Why is it so hard to have a fair system of school inspections, lesson observations or performance related pay?  Why does greater expenditure not necessarily lead to better outcomes? Should we outsource decisions on curriculum and assessment to panels of experts? Should we have a College of Teaching? In all of these debates, and many more, we need a definition of what teaching is. Is it an art or science?

In my opening statement, I read out an extract from a Diane Ravitch article where she compares education to medicine, and demonstrates the difference in their evidence bases.  I recommend reading the whole thing – it is brilliant. In the article, she recalls a time when she had to be treated in hospital for a serious medical condition, and imagines what would have happened if education experts had treated her, not medical experts. Here is a very short extract.

Instead, my new specialists began to argue over whether anything was actually wrong with me. A few thought that I had a problem, but others scoffed and said that such an analysis was tantamount to “blaming the victim.” Some challenged the concept of “illness,” claiming that it was a social construction, utterly lacking in objective reality. Others rejected the evidence of the tests used to diagnose my ailment; a few said that the tests were meaningless for females, and others insisted that the tests were meaningless for anyone under any circumstances. One of the noisier researchers maintained that any effort to focus attention on my individual situation merely diverted attention from gross social injustices; a just social order could not come into existence, he claimed, until anecdotal cases like mine were not eligible for attention and resources.

Rob Coe also compared education to medicine, and discussed the start of the evidence-based medicine movement in the early 90s. I feel that the appropriate medical comparison is much further back, however: in many ways, education is like medicine in the late 19th century, when medicine depended less on theory and evidence, and more on the subjective and intuitive understanding of the individual doctor. Ben Riley, of Deans for Impact, has an article here called ‘Can teacher educators learn from medical school reform?’ where he looks at the way the medical profession changed in the US in the early 20th century. I also quoted a great Keith Stanovich article which argues that the ‘adherence to a subjective, personalised view of knowledge is what continually leads to educational fads’. On the topic of education fads, Tom Bennett was, as ever, brilliant. However, there were others on the panel who felt that the comparison to medicine was not helpful: Claire, the chair, felt that the two fields were just too different to make a meaningful comparison, and also pointed out that evidence and science can’t help us decide whether to teach King Lear or computer coding. That decision is one that can only be made through reference to values. I think Alka also felt that the reliance on evidence and science was inimical to the aims of a liberal arts education.

There’s a lot more I could write about, and I’m sure the discussions will continue on Twitter and elsewhere. I’m grateful to all at Wellington for making the festival happen.

Assessment alternatives 2: using pupil work instead of criteria

In my last few blog posts, I’ve looked at the problems with performance descriptors such as national curriculum levels. I’ve suggested two alternatives: defining these performance descriptors in terms of 1) questions and 2) example work. I discussed the use of questions here, and in this post I’ll discuss the use of pupil work.

Take the criterion: ‘Can compare two fractions to see which is bigger’. If we define this in terms of a series of closed questions – which is bigger: 5/7 or 5/9?; which is bigger: 1/7 or 5/7? – it gives us more clarity about exactly what the criterion means. It also means we can then calculate the percentage of pupils who get each question right and use that as a guide to the relative difficulty of each question. Clearly, this approach won’t work for more open questions. Take the criterion: ‘Can explain how language, structure and form contribute to writers’ presentation of ideas, themes and settings’. There are ways that we can interpret this criterion in terms of a closed question – see my posts on multiple choice questions here. But it is very likely that we would also like to interpret this criterion in terms of an open question, for example, ‘How does Dickens create suspense in chapter 1 of Great Expectations?’ We then need some more prose descriptions telling us how to mark it. Here are some from the latest AQA English Literature spec.

Band 5 – sophisticated, perceptive
sophisticated analysis of a wide range of aspects of language supported by impressive use of textual detail

Band 4 – assured, developed
assured analysis of a wide range of aspects of language supported by convincing use of textual detail

Band 3 – clear, consistent
clear understanding of a range of aspects of language supported by relevant and appropriate textual detail

Band 2 – some demonstration
some familiarity with obvious features of language supported by some relevant textual detail

Band 1 – limited demonstration
limited awareness of obvious features of language

So in this case, defining the question hasn’t actually moved us much further on, as there is no simple right or wrong answer to this question. We are still stuck with the vague prose descriptors – this time in a format where the change of a couple of adjectives (‘impressive’ is better than ‘convincing’ is better than ‘relevant and appropriate’) is enough to represent a whole band’s difference. I’ve written about this adverb problem here, and Christine Counsell has shown how you can cut up a lot of these descriptors and jumble them up, and even experienced practitioners can’t put them back together in the right order again. So how can we define these in a more concrete and meaningful way? A common response is to say that essays are just vaguer than closed questions and we just have to accept that. I accept that in the case of these types of open question, we will always have lower levels of marking reliability than in the case of closed questions like ‘What is 5/9 of 45?’ However, I still think there is a way we can help to define this criterion a bit more precisely. That way is to define the above band descriptors in terms of example pupil work. Instead of spending a lot of time excavating the etymological differences between ‘sophisticated’ and ‘assured’, we can look at a sample piece of work that has been agreed to be sophisticated, and one that has been agreed to be assured. The more samples of work, the better, and reading and discussing these would form a great activity for teacher training and professional development.

So again, we have something that sits behind the criterion, giving it meaning. Again, it would be very powerful if this could be done at a national scale – imagine a national bank of exemplar work by pupils of different ages, in different subjects, and at different grades. But even if it were not possible to do this nationally, it would still be valuable at a school level. Departments could build up exemplar work for all of their frequent essay titles, and use them in subsequent years to help with marking and moderation meetings. Just as creating questions is useful for teaching and learning, so the collection of samples of pupil work is helpful pedagogically too. Tom Sherrington gives an idea of what this might look like in this blog here.

Many departments and exam boards already do things like this, of course, and I suspect many teachers of subjects like English and History will tell you that moderation meetings are some of the most useful professional development you can do. The best moderation meetings I have been a part of have been structured around this kind of discussion and comparison of pupil work. I can remember one particularly productive meeting where we discussed how the different ways that pupils had expressed their thoughts actually affected the quality of the analysis itself. The discussions in that meeting formed the basis of this blog post, about the value of teaching grammar.

However, not all the moderation meetings I have attended have been as productive as this. The less useful type are those where discussion always focusses on finer points of the rubric. Often these meetings can descend into fairly sterile and unresolvable arguments about whether an essay is ‘thoughtful’ or ‘sophisticated’, or what ratio of sophisticated analysis to unsophisticated incoherence is needed to justify an overall judgment of ‘sophisticated’. (‘It’s a level 2 for AF6 technical accuracy, but a level 8 for AF1 imagination – so overall it deserves a level 4′).

So, if we accept the principle that the criteria in mark schemes need interpreting through exemplars, then I would suggest that discussions in moderation meetings should focus more on comparison of essays to other essays and exemplars, and less on comparison of essays to the mark scheme.

[Image: criteria vs exemplars]

Just as with the criterion / question comparison, this does not mean that we have to get rid of the criteria. It means that we have to define the words of the criteria in terms of the sophistication of actual work, not the sophistication, or more often the sophistry, of the assessor’s interpretation of the mark scheme.

There are two interesting pieces of theory which have informed what I’ve written above. The first is about how humans are very bad at making absolute judgments, like those against criteria. We are much better at making comparative judgments. The No More Marking website has a great practical demonstration of this, as well as a bibliography. The second is the work of Michael Polanyi and Thomas Kuhn on tacit knowledge, and on the limitations of prose descriptions. Dylan Wiliam has written about the application of this to assessment in Embedded Formative Assessment.

Over the next couple of weeks I’ll try to blog a bit more about both of these topics.

Assessment alternatives 1: using questions instead of criteria

In many blog posts over the last couple of years, I’ve talked about the problems with prose descriptors such as national curriculum levels and grade descriptors. It’s often said that national curriculum levels and the like give us a shared language: actually, as I argue here, they create the illusion of a shared language. I’ve also suggested two possible alternatives: criteria must be defined not through prose but through (1) questions and (2) pupil work. In this post and the next, I’ll expand a bit more on what I mean by these.

Defining criteria through questions

As Dylan Wiliam shows in this pamphlet, even a tightly defined criterion like ‘can compare two fractions to decide which is bigger’ can be interpreted in very different ways. If the two fractions are 3/7 and 5/7, 90% of pupils answer correctly; if they are 5/7 and 5/9, only 15% do. In my experience, a criterion such as ‘understand what a verb is’ will be met by nearly all pupils if it is defined as the following question.

Which of the following words can be used as a verb?
a) run
b) tree
c) car
d) person
e) apple

However, let’s imagine the question is the following:

In which sentences is ‘cook’ a verb?

a) I cook a meal.
b) He is a good cook.
c) The cook prepared a nice meal.
d) Every morning, they cook breakfast.
e) That restaurant has a great cook.

In this case, the percentage getting it right is much, much smaller. The problem when you rely solely on criteria is that some people define the criterion as the former question, whereas others define it as the latter. And in some cases, criteria may be interpreted in even less reliable ways than either of the above questions.

So, here’s the principle: wherever possible, define criteria through questions, through groups of questions and through question banks. If you must have criteria, have the question bank sitting behind each criterion. Instead of having teachers making a judgment about whether a pupil has met each criterion, have pupils answer questions instead. This is far more accurate, and also provides clarity for lesson and unit planning.

Writing questions can be burdensome, but you can share the burden, and once a question is written, you can reuse it, whereas judgments obviously have to be made individually. If you don’t have much technology, you can record results in an old-fashioned paper mark book or on a simple Excel spreadsheet. If you have access to more technology, then you can store questions in a computer database, get pupils to take them on a computer and have them automatically marked for you, which is a huge timesaver. Imagine if all the criteria on the national curriculum were underpinned by digital question banks of hundreds, or even thousands, of questions, and if each question came with statistics about how well pupils did on it. It would have great benefits not just for the accuracy of assessment, but also for improving teaching and learning. These questions don’t have to be organised into formal tests and graded – in fact, I would argue they shouldn’t be. Their main aim is to get reliable information on what a pupil can and cannot do. As Bodil Isaksen shows here, the type of data you get from this sort of system is really useful, as opposed to what she calls the ‘junk data’ you get from criteria-judgments.
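As a toy illustration of the statistics such a question bank could hold, the facility value of a question – the percentage of pupils who answered it correctly – is trivial to compute from recorded results. The question names and results below are invented.

```python
# Hypothetical mark book: question id -> list of pupil results (True = correct).
mark_book = {
    "fractions_easy": [True, True, True, False, True],    # e.g. 3/7 vs 5/7
    "fractions_hard": [False, True, False, False, False], # e.g. 5/7 vs 5/9
}

def facility(results):
    """Percentage of pupils who answered the question correctly."""
    return 100 * sum(results) / len(results)

stats = {q: facility(r) for q, r in mark_book.items()}
# → {'fractions_easy': 80.0, 'fractions_hard': 20.0}
```

The same structure extends naturally to thousands of questions, and the facility values give exactly the guide to relative difficulty discussed above.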

Below is a comparison between the two options. Of course, strictly speaking, question-based judgments don’t entail abolishing criteria. You can still have criteria, but they have to be underpinned by the questions. The crucial difference between the two options is not the existence or otherwise of criteria, but the evidence each option produces.

[Image: table comparing the criteria-based and question-based options]

What about essays?
What about subjects where the criteria can’t be defined through questions? For example, we might set an essay question for English, but clearly, the question here is not the same as the above maths and grammar questions. With closed questions like the maths and grammar ones above, most of the effort and judgment comes before the fact, in the creation of the question. In the case of open questions like essays, the effort and judgment mostly come after the fact, in the marking of the answers. So here, what is important is not just defining the question, but defining the response. That will be the subject of my next post.

Assessment is difficult, but it is not mysterious

This is a follow-up to my blog from last week about performance descriptors.

In that blog, I made three basic points: 1) that we have conflated assessment and prose performance descriptors, with the result that people assume the latter basically is the former; 2) that prose performance descriptors are very unhelpful because they can be interpreted in so many different ways and 3) that there are other ways of assessing.

In response, David Didau wrote this post, in which he agreed with a lot of the things I said. I was pleased by this, because I really admire David’s work and think he has done great things in bringing important research to a wider audience. However, I was completely baffled by the end of the post, and I am going to explain why – if I’m a bit harsh, I’m sorry, and none of this changes the fact that I think most of David’s work is fantastic.

After agreeing with me about the vagueness of prose performance descriptors, he then suggested that as a replacement for prose performance descriptors, schools should use…wait for it…prose performance descriptors! Here is his example.

[Image: David Didau’s English assessment grid]


I am really astonished by this. The above grid has all the flaws of national curriculum levels, and offers no improvement. It reproduces all the errors I discuss in my previous post. To take just one example, what does a challenging assumption about the cultural value of a text look like? Come to think of it, what does a text look like? A pupil might be able to make a challenging assumption about the value of a shampoo advert, or a limerick, or a piece of graffiti, but struggle to make one about the value of War and Peace, a newspaper article, or an unseen poem. One teacher might interpret this criterion in the context of a short poem pupils have studied before, in which case many of their pupils might achieve it, whilst another might interpret it in the context of a lengthy unseen poem, in which case many of their pupils will not achieve it. The parts on inference are particularly baffling. We know that inference is not a formal skill: we know that pupils (and indeed adults) can make great inferences about a text about baseball, and poor ones about sentences like ‘I believed him when he said he had a lake house, until he said it was forty feet from the shore at high tide.’ (Both those examples from Dan Willingham – see here for more from him about inference and reading). In short, the above grid will result in teachers collecting ‘junk data’ of the type Bodil Isaksen discusses here.

In the rest of the post, David appears to be suggesting that such an approach is OK just as long as we accept that it has flaws and can never be precise. In his words, ‘You can never hope for precision from performance descriptors, but then precision will always be impossible to achieve.’ In making this argument, David has basically proven my first point, which is that we have become so used to prose performance descriptors that we have come to assume that they are assessment and few alternatives are possible. Of course, if performance descriptors were the only way we could assess, then perhaps we would just have to accept the imprecision. But they aren’t! There are other ways, ways which offer far greater precision and accuracy. One example of a far more precise method is the standardised test. Standardised tests may not be perfectly precise, but there is still a world of difference between them and performance descriptors. Let me give a very concrete example of this: back in 1995, Peter Pumfrey gave a standardised reading test to a group of 7-year-old pupils who had all been assessed at level 2. On that test, their reading ages ranged from 5.7 to 12.9.

And this, in short, is why I care so much about this, and why I think it is so important. There are pupils out there who are really struggling with basic skills. Flawed assessment procedures are hiding that fact from them and their teachers, and therefore stopping them getting the help they need to improve. Worse, the kinds of assessment processes which would help to identify these problems are being completely ignored.

It’s as though you saw one person measure the length of a room by guessing, and another measure the same distance with a tape marked in centimetres. Then, because neither method can give you a measurement to three decimal places, you conclude that ‘all measurement is imprecise and fundamentally mysterious’, so you’ll just use your best guess. Both methods may indeed be imprecise, but the latter is far less so than the former, and you would be much better advised to buy a carpet based on it.

Complexity is not the same as mystery. I worry that by saying that assessment is mysterious and that it is very difficult to get a handle on how pupils are doing, we legitimise a woolly approach where anything goes because we can’t really measure anything anyway. We can do a lot better than we are doing at the moment, and one of the first things we can do is to stop depending so much on generic prose descriptors.

I realise that this leaves open the question David posed at the start of his article – ‘OK smart arse, what should we do?’ In my last post, and in others from the past, I’ve repeatedly argued that the better approach is to define criteria in terms of a) actual questions / test papers and b) actual pupil work. For example, back in December 2013 I wrote here that ‘we don’t get a shared language through abstract criteria. We get a shared language through teaching shared content, doing shared tasks, and sharing pupil work with colleagues.’ I realise I need to expand on these points further, and I will do so in my next blog post.

Problems with performance descriptors

A primary teacher friend recently told me of some games she and her colleagues used to play with national curriculum levels. They would take a Michael Morpurgo novel and mark it using an APP grid, or they would take a pupil’s work and see how many different levels they could justify it receiving. These are jokes, but they reveal serious flaws. National curriculum levels are, and always have been, vague and unhelpful.

For example, compare:

Pupils’ writing is confident and shows appropriate and imaginative choices of style in a range of forms.

Pupils’ writing in a range of forms is lively and thoughtful.

The first is a description of performance at level 7, the second at level 4. That’s what I mean about vague and unhelpful, and that’s why my friend was able to justify the same piece of work receiving several different levels.

However, what is frustrating is that many of the replacements for national curriculum levels rely on precisely the same kind of vague performance descriptions. In fact, in many conversations I have with people, they cannot even begin to imagine an assessment system that doesn’t use some form of descriptor. For many people, descriptors simply are assessment, and if a school is to create its own assessment system, then the first – and possibly last – step must surely involve the creation of a new set of descriptors.  Unfortunately, the truth is very different: as I’ve written here, descriptors do not give us a common language but the illusion of a common language. They can’t be relied on to deliver accuracy or precision about how pupils are doing. In this post, I will recap the problems with descriptors; in the next, I will suggest some alternatives.

First, Tim Oates shows here that creating accurate prose descriptions of performance, even in subjects like maths and science, is fiendishly difficult.

Even a well-crafted statement of what you need to get an A grade can be loaded with subjectivity – even in subjects such as science. It’s genuinely hard to know how difficult a specific exam is.

Second, Dylan Wiliam shows here in Principled Assessment Design that even very precise descriptors can be interpreted in completely different ways.

Even in subjects like mathematics, criteria have a degree of plasticity. For example, a statement like ‘Can compare two fractions to identify which is larger’ sounds precise, but whether students can do this or not depends on which fractions are selected. The Concepts in Secondary Mathematics and Science (CSMS) project investigated the achievement of a nationally representative group of secondary school students, and found out that when the fractions concerned were 3/7  and 5/7  then around 90% of 14-year-olds answered correctly, but when more typical fractions, such as 3/4 and 4/5  were used, then 75% answered correctly. However, where the fractions concerned were 5/7 and 5/9 then only around 15% answered correctly (Hart, 1981).

Finally, Paul Bambrick-Santoyo makes a very similar point in Driven by Data. I’ve abridged the extract below.

To illustrate this, take a basic standard from middle school math:

Understand and use ratios, proportions and percents in a variety of situations.

To understand why a standard like this one creates difficulties, consider the following premise. Six different teachers could each define one of the following six questions as a valid attempt to assess the standard of percent of a number. Each could argue that the chosen assessment question is aligned to the state standard and is an adequate measure of student mastery:

Identify 50% of 20.

Identify 67% of 81.

Shawn got 7 correct answers out of 10 possible answers on his science test. What percent of questions did he get correct?

J.J. Redick was on pace to set an NCAA record in career free throw percentage. Leading into the NCAA tournament in 2004, he made 97 of 104 free throw attempts. What percentage of free throws did he make?

Bambrick-Santoyo goes on to give examples of two even more difficult questions. As with the Dylan Wiliam example, we can see that whilst 90-95% of pupils might get the first question right, many fewer would get the last one right.
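Working the four questions above through (my arithmetic, not Bambrick-Santoyo’s) shows that they all reduce to the same ‘percent of a number’ calculation, even though they plainly differ in difficulty:

```python
# All four questions above assess 'percent of a number',
# yet range from trivial to demanding.
def percent_of(p, n):
    return p / 100 * n

print(percent_of(50, 20))            # 10.0
print(round(percent_of(67, 81), 2))  # 54.27
print(round(7 / 10 * 100, 1))        # Shawn: 70.0% correct
print(round(97 / 104 * 100, 1))      # Redick: 93.3% made
```

A descriptor like ‘understand and use percents’ cannot distinguish between a pupil who can manage the first of these and one who can manage them all; only the specific questions pin the standard down.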

The vagueness and inaccuracy of descriptors are not just a problem with national curriculum levels. They are a problem with all forms of prose descriptors of performance. The problem is not a minor technical one that can be solved by better descriptor drafting, or more creative and thoughtful use of a thesaurus. It is a fundamental flaw. I worry when I see people poring over dictionaries trying to find the precise word that denotes performance in between ‘effective’ and ‘original’. You might find the word, but it won’t deliver the precision you want from it. Similarly, the words ’emerging’, ‘expected’ and ‘exceeding’ might seem like they offer clear and precise definitions, but in practice, they won’t.

So if the solution is not better descriptors, what is the answer? Very briefly, the answer is for the performance standards to be given meaning through a) questions and b) pupil work. I will expand on this in a later post.

What do exams and opinion polls have in common?

A lot.

Daniel Koretz, Professor of Education at Harvard University, uses polls as an analogy to explain to people how exams actually work. Opinion polls sample the views of a small number of people in order to try and work out the views of a much larger population. Exams are analogous, in that they feature a small sample of questions from a much wider ‘domain’ of knowledge and skill. In Measuring Up, Koretz says this:

The full range of skills or knowledge about which the test provides an estimate – analogous to the votes of the entire population of voters in [an opinion poll] – is generally called the domain by those in the trade. Just as it is not feasible for the pollster to obtain information from the entire population, it is not feasible for a test to measure an entire domain exhaustively, because the domains are generally too large. Instead we create an achievement test, which covers a small sample from the domain, just as the pollster selects a small sample from the population.

Since first reading Koretz’s book (see my review here) I’ve used the analogy quite a lot. I used it this week to explain something to a colleague. She stopped and looked at me like I was crazy. ‘Daisy’, she said, ‘I think you need to get a new analogy.’

I know what she means. After a week where opinion polls have been torn to shreds over their failure to predict the result of the 2015 UK general election, it seems perverse for me to keep using them as an analogy. But actually, the failure of the recent opinion polls makes the analogy all the more useful, because just as opinion polls can and do get things wrong, we need to acknowledge that the similar structure of exams means that they can and do get things wrong too. Exams are susceptible to the same kinds of errors that opinion polls are. In fact, in one way, exams are even more susceptible to error than opinion polls. In the case of opinion polls, we can check the validity of the poll result because eventually the domain is measured, in the form of the election. In the case of exams, there is no final equivalent measure of the domain. Imagine if there never was an election, and all we ever had were opinion polls of differing types. That’s what exams are like.
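Koretz’s sampling point can be made concrete with a toy simulation (the numbers here are entirely hypothetical): suppose a pupil genuinely knows 70% of a 1,000-question domain, and each exam draws just 20 questions at random. The pupil’s ‘true’ mastery never changes, but the exam score bounces around it:

```python
import random

random.seed(1)  # for reproducibility

DOMAIN_SIZE = 1000   # questions in the whole domain (hypothetical)
TRUE_MASTERY = 0.7   # the pupil really knows 70% of it
TEST_LENGTH = 20     # the small sample an exam actually draws

# 1 = the pupil would answer this domain question correctly.
domain = [1] * int(DOMAIN_SIZE * TRUE_MASTERY) + \
         [0] * int(DOMAIN_SIZE * (1 - TRUE_MASTERY))

# Sit five separate 'exams', each a fresh random sample of the domain.
for sitting in range(1, 6):
    exam = random.sample(domain, TEST_LENGTH)
    score = sum(exam) / TEST_LENGTH
    print(f"Exam {sitting}: scored {score:.0%} (true mastery 70%)")
```

Unlike an election, there is no moment at which the whole domain is ever measured, so these fluctuating sample scores are all we ever get.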

Plenty of reasons have been put forward for the failure of the polls in this general election. One of the most popular is the idea that, for whatever reason, people did not tell the pollsters who they were really planning to vote for. The analogy with tests would be where a pupil is, for whatever reason, not interested in answering the items on the test to the best of their ability. In Koretz’s words,

Just as the accuracy of a poll depends on respondents’ willingness to answer frankly, the accuracy of a test score depends on the attitudes of the test-takers – for example, their motivation to perform well.

Whilst this may have been the problem with the UK opinion polls, I don’t think it is a major problem with UK tests. Enough important outcomes depend on the tests for pupils to be motivated to do well on them. Of course, we should never forget that variation in performance on the day is always one of the major factors in exam-score unreliability, and pupils who feel under too much pressure to perform may fail to produce their best work. But by and large, I don’t think this is the major problem with tests at the moment. However, there are two aspects of this sample and domain structure which I think do cause serious problems.

Here is Koretz again:

In the same way that the accuracy of a poll depends on seemingly arcane details about the wording of survey questions, the accuracy of a test score depends on a host of often arcane details about the wording of items, the wording of ‘distractors’ (wrong answers to multiple choice items), the difficulty of the items, the rubric (criteria and rules) used to score students’ work, and so on…If there are problems with any of these aspects of testing, the results from the small sample of behaviour that constitutes the test will provide misleading estimates of students’ mastery of the larger domain. We will walk away believing that Dewey will beat Truman after all. Or, to be precise, we will believe that Dewey did beat Truman already.

Or, if I can update the analogy, we will believe that Ed Miliband is currently in negotiations with Nicola Sturgeon and Nick Clegg about forming the next government, and David Cameron is sunbathing in Ibiza. A great blog post here by Dan Hodges (written before the election) explains some of the ways in which some of the arcane details of opinion polls can be manipulated to get a certain result. Similarly, changes in arcane details of exam structure can change the value of the results we get from them. For me, there are two particular problems: poor test design, and teaching to the test. I’ve written about the problems of poor test design and teaching to the test on my blog before, here. I also have an article being published soon in Changing Schools where I discuss this at greater length. Here, I will add just one more point, about coursework. Coursework and controlled assessments are a perfect example of the problems with poor test design and teaching to the test.

The essential problem with coursework and controlled assessment is that they allow the teacher and pupils to know what the test sample is in advance. When I taught Great Expectations as a coursework text, I knew what the final ‘sample’ from the novel was – the final essay would be on the first chapter of the novel. To extend the opinion poll analogy, it’s as if the final result of the election depended not on a vote of the whole population, nor on an anonymous sample of the population, but on a sample of 1,000 voters whose names were known in advance to all of the political parties. Even if the sample were well chosen, this would clearly be problematic. There would be an obvious incentive to neglect the views of anyone not in that sample. Some political parties might not want to behave so dishonourably, but they would almost certainly be forced to, because if they didn’t, another party would do so and therefore win the ‘election’.

I think the analogies with coursework and teaching are clear. A teacher might not want to focus their instruction on the first chapter of Great Expectations, as they realise that doing so will not help pupils in the long run, particularly those pupils who want to study English Literature at A-level. But if they think that every other teacher is doing so, they may feel that they have no choice, as those pupils who do get the targeted instruction will probably get better grades. Thus, poor test design, in the shape of coursework and controlled assessment, encourages teaching to the test and distorts the validity of the final test score. Obviously, the whole problem is exacerbated by high-stakes testing, which places great weight on those test scores. But I think poor test design is a major feature here too, and hopefully the analogy with opinion polls makes it clearer why this is such a problem.