Ouroboros by Greg Ashman

I’m a bit late to this, but I just wanted to write about how much I enjoyed Ouroboros by Greg Ashman. It’s a very elegantly and sparely written account of Greg’s experiences of teaching in England and Australia, and of the education research which is relevant to his experiences. The central organising metaphor is the ouroboros, ‘an ancient symbol of a snake or dragon that is consuming its own tail.’ Ouroboros can be ‘a vicious metaphor to represent the antithesis of progress – we cannot move forward if we are going round and round. Moreover, Ouroboros adds something to the cycle. It represents the reinvention of old ideas as new ideas. Again and again.’

I found this metaphor very helpful when thinking about modern education. It is so demoralising to see the number of fads that get warmed over and served up as new. And the great fear is not only that bad ideas persist. Even worse, the constant recycling of bad ideas prevents the adoption of new ones, and makes teachers understandably cynical and mistrusting of innovation in general, even though real innovation is what we desperately need to break out of this cycle.

But ouroboros can be a more positive metaphor. ‘We can also view Ouroboros as a virtuous metaphor; a feed-back loop with information flowing from the effect back to the cause. When we teach, we do not speak into the void.’

Greg thinks that this kind of feedback loop is at the heart of good teaching. However, he also notes that attempts to promote the use of feedback over the last decade or so, under the name of Assessment for Learning, have led to disillusion. ‘In U.K. schools, formative assessment followed an unfortunate trajectory that hollowed-out much of the original purpose and has therefore left many teachers quite jaded.’ However, as he notes, ‘the basic principle is sound.’ And there is much good advice in this book about how to rescue the sound principles of formative assessment from the ‘bureaucratic barnacles’ that have grown up around it.

Highly recommended.

How to crack the Oxford History Aptitude Test

Recently, a friend of mine sent me a link to Oxford University’s History Aptitude Tests (HAT). These tests are designed for 18-year-olds applying for admission to Oxford. I really liked the look of them – the one I saw was interesting and challenging, covered a broad range of historical eras, and I can imagine it would be quite an interesting test to discuss in class too. However, I did also think that some of the advice that came with the tests wasn’t as helpful as it could have been. For example, here:

“The HAT is a test of skills, not substantive historical knowledge. It is designed so that candidates should find it equally challenging, regardless of what period(s) they have studied or what school examinations they are taking.”

I am not sure this is the case. The HAT does require substantive historical knowledge, and a candidate with knowledge of the eras on the test paper would not find it as challenging as a candidate with no such knowledge. Let’s have a look at some of the questions from this paper.

Question one
The first question features an extract from a book about the Comanche empire. The test advises that ‘you do not need to know anything about the subject to answer the questions below.’ I suppose that is true in the loosest sense, in that I do not need to know anything about physics in order to take an A-level in it. But of course, that isn’t the sense in which most examinees will be interested. I think you do need to know something about North American colonisation to do well on the questions below. There are two questions. One of them is ‘In your own words, write a single sentence identifying the main argument of the first paragraph’, and the second is ‘What does the author argue in this passage about recent attempts made by historians to integrate Native Americans into the history of colonialism in North America?’

At first glance, it may seem as if these really aren’t testing prior knowledge, but are instead testing an abstract skill of ‘summarising’ or ‘argument recognition’. In fact, even these questions are testing substantive historical knowledge. The passage and question from the HAT are uncannily similar to one of the classic experiments used to show why knowledge is so important for cognition. In 1978, E.D. Hirsch asked groups of students to read two passages of equal difficulty in terms of vocabulary and syntax. One was about friendship, and one was about Grant and Lee and the end of the Civil War. University students understood both passages equally well. Poorer students at the community college did just as well on the passage about friendship, but struggled on the one about the Civil War. Hirsch theorised that their weakness on this second passage was down to their lack of knowledge about the Civil War, not any lack in some innate ‘passage comprehension’ ability. Similar research has been carried out again and again, to the extent that researchers in this field say that reading is not a ‘formal skill’: it is dependent on background knowledge. Recently, Kate Hammond’s articles in Teaching History about the power of ‘substantive historical knowledge’ also speak to the importance of background knowledge for historical understanding. She shows how pupils who have historical knowledge that goes beyond the exam rubric and even the era being studied are often able to deploy such knowledge in a way that leads to better analysis. For example, if a pupil has knowledge of how minority parties operate within a democracy, this can lead to better analysis of the challenges that faced the Nazi party in the 1920s.

In the case of the HAT extract about the Comanche empire, students with knowledge of western colonialism and the nature of indigenous societies will understand the passage better, read it quicker, and summarise it more acutely. Pupils without that knowledge will not be able to employ their generic ‘summarising’ or ‘historical analysis’ mental muscles, because such muscles don’t exist. Instead, they will be puzzling over what a ‘Euro-American’ is or what the ‘colonization of the Americas’ entailed.

Question two
The next question is: Write an essay of 1.5 to 3 sides assessing and explaining who were the ‘winners’ and ‘losers’ in any historical event, process or movement. You may answer with reference to any society, period or place with which you are familiar.

Obviously, you will need ‘substantive historical knowledge’ to answer this question. The more knowledge of different eras you have, the more likely you are to find one that fits the bill for the question, and the more detailed knowledge you have of each individual era, the more likely you are to have something worthwhile to say about it.

Question three
The final question is based on a source from sixteenth-century Germany. The test advises: “You do not need to know anything about Germany in the sixteenth century to answer the question below, nor should you draw on any information from outside the source.”

As regards the first piece of advice, again, it’s true that you may not need to know anything to answer the question, but it will certainly help you if you do. But that’s better than the second piece of advice, which is actually cognitively impossible. The modern research on reading and cognition shows us that when we read, we make sense of the content by…drawing on information from outside the source.  No written text contains all the information we need to make sense of it. All texts depend to some extent on the reader supplying certain bits of information themselves.

When we look at the source itself, we can find plenty of examples of how knowledge from outside the text is impossible to avoid using, and is extremely helpful. First of all, there’s vocabulary: knowing the historical meanings of alms, peasants, lodgings and bathhouse would definitely be helpful. Second, there are references to concepts that have a particular meaning in medieval Europe: the phrase ‘put out of the city’, for example, makes a lot more sense if you understand something about medieval European cities, their rights and freedoms, and their geographical limits and defences. Similarly, there is a reference to epilepsy, which is now understood as a physical illness, but in 16th century Europe was seen as a sign of madness. All of this ‘information from outside the source’ would be hugely valuable in answering the question, and those students who have this information will do better than those who don’t.

I can see that this advice is well-meaning. I can see that it might be trying to ensure that candidates are not intimidated if they haven’t heard of a particular period of history, and perhaps also to demonstrate that this admissions test is fair to all pupils regardless of educational background – that even if you are a state school pupil who has only studied the Nazis and Tudors, you won’t be at a disadvantage to pupils from independent schools who have studied more historical topics. The test is attempting to uncover some kind of innate ‘historical aptitude’ which exists regardless of the number of history books you’ve read or historical ideas you have been exposed to. The only problem, of course, is that such innate historical aptitude doesn’t exist. Like many concepts we mistakenly describe as ‘skill’, the ability to analyse historical problems and sources is not something innate and discrete which resides mysteriously within us. It is learnt, and depends to a large degree on the amount of knowledge we have in long-term memory. Actually, in the case of history, this should be even easier to appreciate than in other fields of life. For example, there is no such thing as innate chess skill, but it does at least feel plausible that there might be a part of the brain devoted to the logic necessary for chess. There is no such thing as innate historical skill either, and it feels far less plausible that there is a part of the brain devoted to analysing the causes of the Second World War. The concept of ‘historical aptitude’ reminds me of G.K. Chesterton’s famous quotation:

 Education, they say, is the Latin for leading out or drawing out the dormant faculties of each person. Somewhere far down in the dim boyish soul is a primordial yearning to learn Greek accents or to wear clean collars; and the schoolmaster only gently and tenderly liberates this imprisoned purpose. Sealed up in the newborn babe are the intrinsic secrets of how to eat asparagus and what was the date of Bannockburn. The educator only draws out the child’s own unapparent love of long division; only leads out the child’s slightly veiled preference for milk pudding to tarts.

I spoke to a couple of friends who teach at independent schools and frequently prepare students for this assessment. They disagreed with the ideas that a) you couldn’t prepare students for it and b) it didn’t depend on knowledge. They said that you could prepare students for it by getting them to read lots about lots of different historical eras, and that the students who knew more history generally did better on it. Interestingly, however, they also said that it was because of these reasons that, like me, they quite liked the test. It wasn’t possible to game it in any way, and preparing students for this test generally involved activities which made them better historians, not just better test takers. And they felt the results generally did distinguish between candidates who were and were not good at history. I suspect in many cases, therefore, the advice on this paper is not the end of the world, as plenty of people are probably ignoring it. Still, both the friends I spoke to were at independent schools that put a lot of time and effort into cracking the Oxbridge admissions code. What about teachers at schools which don’t have a tradition of Oxbridge entry, or can’t devote as much time to reading the runes of these tests? Aren’t they more likely to take this advice at face value – and aren’t their pupils therefore more likely to do badly on such a test? Improving the advice on how to prepare for this test could help all students become better historians, but it could particularly help students from disadvantaged backgrounds.

What can teachers learn from high-performance sport? Plan for injury!

Yesterday I went to a brilliant day of professional development at Ark Globe Academy called Teach Like a Top Athlete: Coaching and Mastery Methods. I went to a workshop run by the amazing Jo Facer on Mastery Planning, and one by the equally amazing Dan Lavipour and Michael Slavinsky called What Can Teachers Learn From High Performance Sport? Dan is a former Olympic coach who now works in youth sports performance; Michael is a former French teacher and the Teaching and Learning Director at Researchers in Schools. Dan and Michael formed a great double act, as Dan talked us through some principles of high performance sport, and Michael drew out some of the analogies for the classroom. And there were tons of analogies. I think a lot about the links between sport and teaching, but these two took it to another level. There were dozens of things I could have chosen to blog about – deliberate practice, the theory of self-determination, the links between the conjugate periodization of training and linear exam courses – but the one that I am going to restrict myself to for now is what Dan and Michael had to say about planning for injury. In sport, injury happens. Netballers get ankle and knee problems; fast bowlers get stress fractures; footballers get hamstring issues. When you plan for injury, you work out what the common injuries are in your discipline and set up training plans that attempt to prevent such injuries.

So is there an analogy with teaching here? Obviously it’s not perfect, but I think there is. In our subjects, we can work out what the top 10 most common misconceptions or errors are, and set up our schemes of work to try and anticipate and prevent them. Here’s an example: I once did an analysis of recalled GCSE scripts in English which showed that ambiguous pronouns were a  major weakness and a real impediment to understanding. Pupils used ‘it’ and ‘they’ a lot, without always being clear who or what those pronouns were referring to.  Some targeted work on pronouns and antecedents could have helped improve clarity.

How can we identify such common misconceptions? In many subjects, we’ll already have a good idea, and in maths and the sciences, there are plenty of great resources out there listing them. But Michael suggested another profitable method: analysing examiners’ reports to see what issues seem to crop up again and again. This is something I started doing when I was researching my book, Seven Myths about Education. I included one example in the book: an examiners’ report which explained that many pupils thought a glacier was a wild tribe of humans from the north. In the report’s words:

Given the current interest in environmental issues, and the popularity of a particular type of film and television programme, it was surprising that a number of candidates seemed unaware of what a glacier is and some seemed to be convinced that the glaciers were some sort of tribe, presumably advancing from somewhere in the north.

There are other examiners’ reports which helpfully list the common writing errors made by pupils. This one, for example, from OCR:

Common errors included not marking sentence divisions, confusion over its and it’s, homophone errors (there/their/they’re and to/too), writing one word instead of two (infact, aswell, alot, incase, eachother) and writing two words instead of one (some one, no where, country side, your self, any thing, neighbour hood). A surprising number of candidates used capitals erratically: for example, they did not feature at the beginning of names but did appear randomly in the middle of words.

These reports also have interesting things to say about the use of PEE paragraphs, and mnemonic techniques like AFOREST. But my favourite type of examiners’ report is the one on unseen reading and writing exams. The unseen reading texts can be on any topic, and often the examiners’ report ends up lamenting the students’ lack of knowledge about some crucial aspect of the text. They provide perfect examples of how reading is not a skill, and why background knowledge is crucial for comprehension. Here are just a few examples of what I mean.

Most candidates were able to gain a mark for the next part of the question stating that the whale shark eats plankton. However a number of candidates offered no answer, perhaps they did not recognise plankton to be food, although the context should have made this clear.

The first part of the question simply asked candidates to note the distance Mike Perham had travelled on his round-world voyage. Most selected the correct distance: 30,000 miles, though some over-complicated the response by confusing the distance the report said Perham still had to cover on the final leg of his journey with the total distance. This led to some candidates saying the whole journey was 30,300, whilst others reported the voyage to be just 300 miles. (WJEC)

I thought that I might be apologising for how embarrassingly straightforward this question was but it proved to be inexplicably difficult as many of the candidates just could not focus their minds on the reasons why the Grand National is such a dangerous race. I know that comparison has always been difficult but this question was set up to make things as straightforward as possible. Still it seemed like an insurmountable hurdle, the examining equivalent of Becher’s Brook, at which large numbers fell dramatically. I cannot really explain why so many candidates got themselves into such a tangle with this question. Many of them went round in circles, asserting that the race was dangerous because it was dangerous. (WJEC)

However, what was very noticeable was that many candidates had very little idea of what was in these places or why someone might want to visit (except for Alton Towers of course). Specific attractions were often in very short supply and usually were just mentioned in passing before the article got to the serious business of shopping and eating. I have to admit that the idea of making a day trip to London or Manchester to shop in Primark or eat in KFC did not appeal massively, although it is true that teenagers may find such things irresistible. More seriously, I think a better sense of audience might have helped here, although the lack of knowledge about places is not easy to remedy. (WJEC)

Lack of knowledge in general is certainly not easy to remedy, especially in the short term when you are preparing for an exam. But if we took it back a couple of steps, and started to ‘plan for injury’ in schools, not just on the sports field, how might we try and address this lack of knowledge? What would we need to change? When and where would we need to begin? When you think about it like this, the advantages of a coherent and sequenced knowledge-based curriculum become very obvious.

Debating Education review

I spent yesterday at the Michaela Community School Debating Education event, which was absolutely brilliant. I spoke against the motion ‘Sir Ken is right: traditional education kills creativity’, and Guy Claxton spoke for it. Here are some of my notes from this debate, and the day.

It’s about methods, not aims

I agree with Sir Ken Robinson that creativity is the aim of education. However, where we disagree is on how you can best develop such creativity. Sir Ken praises High Tech High’s model of instruction, where instead of memorising, pupils are doing. Guy Claxton recommends, among other things, that to develop the skill of imagining, pupils should lie on the ground, look at the sky and then ‘close their eyes to imagine how the sky changes as a storm approaches.’ By contrast, I think the best way to develop creativity is through direct instruction, memorisation and deliberate practice (for a specific example of how memorisation leads to creativity in a scheme of work on A Midsummer Night’s Dream, see here). This might sound counter-intuitive, but such practices are actually more effective at developing creativity than just asking children to be creative. Robert Bjork has shown that performance isn’t the same as learning. K. Anders Ericsson has shown that what matters isn’t just practice, but deliberate practice: ‘mere repetition of an activity will not automatically lead to improvement’. Deliberate practice is when you isolate the component parts of a task and repeatedly practise them instead. So asking pupils to do creative tasks isn’t the best way of developing creativity. Asking them to memorise examples of rhetorical devices might not look creative, but it might be better at developing creativity. The question is not about finding a balance between memory and creativity, or between knowledge and skill. It’s about recognising that memory is the pathway to creativity, and that skill is composed of knowledge. As John Anderson said, ‘All that there is to intelligence is the simple accrual and tuning of many small bits of knowledge which in total make up complex cognition. The whole is no more than the sum of its parts, but it has a lot of parts.’

What we had in yesterday’s debate was not a false dichotomy. There was real disagreement. If Sir Ken and Guy set up a school and I set up a school, they would look very different, even though we both had the same aim. And because we have the same aim, the argument is not about whether I am in favour of creativity or not (I am), or whether Sir Ken is in favour of knowledge or not (I’m prepared to accept he is), or whether we just need a balance between the two. The argument is about whose methods are more successful at delivering our shared aim of creativity.

The other debates

I’m very grateful to all at Michaela for organising so many good debates. Bruno Reddy and Andrew Old debated the value of mixed ability teaching. James O’Shaughnessy and Joe Kirby had all the RE & philosophy teachers in the room getting excited with their discussion of ethics, character, and ancient Greek philosophers. Katie Ashford and John Blake argued about the perennially vexed issue of Ofsted. Finally, Jonny Porter and Francis Gilbert clashed over the reputation of Michael Gove, in front of an audience which may well have included nearly every teacher in England who agreed with him.

I particularly liked the way the day was structured as a series of debates. As one of the debaters, I can assure you that preparing for a debate of this type is much harder work than preparing for a panel discussion. But I think it does also result in a better event. At panel discussions, it’s really easy for everyone to speak for five minutes on their pet theme, regardless of what the topic actually is. Even if the chair is good, it’s often hard to really get to the heart of an issue. But with debates like these, you very quickly get to the important and controversial issues. There are plenty of false dichotomies in education, certainly. But there are some real ones too, and we shouldn’t be afraid to discuss them. We discussed the hell out of them yesterday!

Comparative judgment: 21st century assessment

In my previous posts I have looked at some of the flaws in traditional teacher assessment and assessments of character. This post is much more positive: it’s about an assessment innovation that really works.

One of the good things about multiple-choice and short answer questions is that they offer very high levels of reliability. They have clear right and wrong answers; one marker will give you exactly the same mark as another; and you can cover large chunks of the syllabus in a short amount of time, reducing the chance that a high or low score is down to a student getting lucky or unlucky with the questions that came up. One of the bad things about MCQs is that they often do not reflect the more authentic, real-world tasks pupils might go on to encounter, such as essays and projects. The problem with real-world tasks, however, is that they are fiendishly hard to mark reliably: it is much less likely that two markers will always agree on the grades they award. So you end up with a bit of a trade-off: MCQs give you high reliability, but sacrifice a bit of validity. Essays allow you to make valid inferences about things you are really interested in, but you trade off reliability. And you have to be careful: trade off too much validity and you have highly reliable scores that don’t tell you anything anyone is interested in. Trade off too much reliability and the inferences you make are no longer valid either.

One way of dealing with this problem has been to keep the real world tasks, and to write quite prescriptive mark schemes. However, this runs into problems of its own: it reduces the real-world aspect of the task, and ends up stereotyping pupils’ responses to the task. Genuinely brilliant and original responses to the task fail because they don’t meet the rubric, while responses that have been heavily coached achieve top grades because they tick all the boxes. Again, we achieve a higher degree of reliability, but the reliable scores we have do not allow us to make valid inferences about the things we really care about (see the Esse Quam Videri blog on this here).  I have seen this problem a lot in national exams, and I think that these kinds of exams are actually more flawed than the often-maligned multiple choice questions. Real world tasks with highly prescriptive mark schemes are incredibly easy to game. Multiple-choice and short answer questions are actually not as easy to game, and do have high levels of predictive validity. I think the problem people have with MCQs is that they just ‘look’ wrong. Because they look so artificial, people have a hard time believing that they really can tell you something about how pupils will do on authentic tasks. But they can, and they do, and I would prefer them to authentic tasks that either don’t deliver reliability, or deliver reliability in such a way that compromises their validity.

Still, even a supporter of MCQs like me has to acknowledge – as I always have – that in subjects like English and history, you would not want an entire exam to be composed of MCQs and short answer questions. You would want some extended writing too. In the past, I have always accepted that the marking of such extended writing has to involve some of the trade-offs and difficult decisions outlined above. I’ve also always accepted that it has to be a relatively time-consuming process, involving human markers, extensive training, and frequent moderation.

However, a couple of years ago I heard about a new method of assessment called comparative judgment which offers an elegant solution to the problem of assessing tasks such as essays and projects. Instead of writing prescriptive mark schemes, training markers in their use, getting them to mark a batch of essays or tasks and then come back together to moderate, comparative judgment simply asks an examiner to make a series of judgments about pairs of tasks. Take the example of an essay on Romeo and Juliet: with comparative judgment, the examiner looks at two essays, and decides which one is better. Then they look at another pair, and decide which one is better. And so on. It is relatively quick and easy to make such judgments – much easier and quicker than marking one individual essay.  The organisation No More Marking offer their comparative judgment engine online here for free. You can upload essays or tasks to it, and set up the judging process according to your needs.

Let’s suppose you have 100 essays that need marking, and five teachers to do the marking. If each teacher commits to judging 100 pairs, you will have a total of 500 pairwise judgments. These judgments are enough for the comparative judgment engine to work out the rank order of all of the essays, and associate a score with each one. In the words of the No More Marking CJ guide here: ‘when many such pairings are shown to many assessors the decision data can be statistically modelled to generate a score for each student.’ If you want your score to be a GCSE grade or other kind of national benchmark, then you can include a handful of pre-graded essays in your original 100. You will then be able to see how many essays did better than the C-grade sample, how many better than the B-grade sample, and so on. This method of marking also allows you to see how accurate each marker is. Again, in the words of the guide: ‘the statistical modelling also produces quality control measures, such as checking the consistency of the assessors. Research has shown the comparative judgement approach produces reliable and valid outcomes for assessing the open-ended mathematical work of primary, secondary and even undergraduate students.’
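No More Marking’s guide doesn’t spell out the statistical model, but pairwise ‘which is better?’ decisions of this kind are classically fitted with a Bradley-Terry model, in which each essay gets a strength parameter and the probability that one essay beats another depends on the ratio of their strengths. Here is a minimal sketch of that idea in Python (the function name and the simple iterative fitting routine are my own illustration, not No More Marking’s actual engine):

```python
from collections import defaultdict

def fit_bradley_terry(judgements, n_iters=100):
    """Fit Bradley-Terry strengths from pairwise 'which is better' decisions.

    judgements: list of (winner, loser) essay-id pairs.
    Returns a dict mapping each essay id to a strength score;
    sorting by score gives the rank order of the essays.
    """
    items = set()
    wins = defaultdict(int)          # total wins per essay
    pair_counts = defaultdict(int)   # comparisons per unordered pair
    for winner, loser in judgements:
        items |= {winner, loser}
        wins[winner] += 1
        pair_counts[frozenset((winner, loser))] += 1

    strength = {i: 1.0 for i in items}
    for _ in range(n_iters):
        new = {}
        for i in items:
            # Standard iterative update: wins_i divided by the sum,
            # over every pair involving i, of n_ij / (p_i + p_j).
            denom = 0.0
            for pair, n in pair_counts.items():
                if i in pair:
                    (j,) = pair - {i}
                    denom += n / (strength[i] + strength[j])
            new[i] = wins[i] / denom if denom else strength[i]
        # Normalise so the strengths stay on a stable scale.
        total = sum(new.values())
        strength = {i: s * len(items) / total for i, s in new.items()}
    return strength
```

Sorting the essays by fitted strength gives the rank order, and if a few pre-graded anchor essays are included among the scripts, the grade boundaries can be read off from where those anchors land in the ranking.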

I have done some trial judging with No More Marking, and at first, it feels a bit like voodoo. If, like most English teachers, you are used to laboriously marking dozens of essays against reams of criteria, then looking at two essays and answering the question ‘which is the better essay?’ feels a bit wrong – and far too easy. But it works. Part of the reason why it works is that it offers a way of measuring tacit knowledge. It takes advantage of the fact that amongst most experts in a subject, there is agreement on what quality looks like, even if it is not possible to define such quality in words. It eliminates the rubric and essentially replaces it with an algorithm. The advantage of this is that it also eliminates the problem of teaching to the rubric: to go back to our examples at the start, if a pupil produced a brilliant but completely unexpected response, they wouldn’t be penalised, and if a pupil produced a mediocre essay that ticked all the boxes, they wouldn’t get the top mark. And instead of teaching pupils by sharing the rubric with them, we can teach pupils by sharing other pupils’ essays with them – far more effective, as generally examples define quality more clearly than rubrics.

Ofqual have already used this method of assessment for a big piece of research on the relative difficulty of maths questions. The No More Marking website has case studies of how schools and other organisations are using it.  I think it has huge potential at primary school, where it could reduce a lot of the burden and administration around moderating writing assessments at KS1 & KS2.  On the No More Marking website, it says that ‘Comparative Judgement is the 21st Century alternative to the 18th Century practice of marking.’ I am generally sceptical of anything in education describing itself as ’21st century’, but in this case, it’s justified. I really think CJ is the future, and in 10 or 15 years’ time, we will look back at rubrics the way Marty McFly looks at Reagan’s acting.

Character assessment: a middle-class ramp?

My last two posts (here and here) have looked at how teacher assessments can be biased, and how tests can help to offset some of these biases. I’ve been quite sceptical of the possibility of improving teacher assessment so that it can become less biased: the more you try to reduce the bias in teacher assessment, the less it looks like teacher assessment.  Still, that’s not to say I am against all alternative forms of assessment. I think exams have many strengths, and are often unfairly maligned, but they have weaknesses too and we should always be looking to innovate to try and address such weaknesses. In this post and the next, I will look at two recent innovations in educational assessment: one which I think is hugely promising, and one which is less so. First, the less promising method.

Assessing character
Teaching character, or non-cognitive skills, is very popular at the moment, and for good reason. Children don’t just need academic skills to succeed in life; they need good character too. As E.D. Hirsch says here, character development has rightly been one of the major focuses of education from classical times.

Whilst we can probably all agree on the importance of teaching character in some form, assessing it is far more fraught. Angela Duckworth, whose research focusses on ‘grit’, or perseverance for long-term goals, has created very simple 12-item and 8-item ‘grit scales’, which ask you to answer a series of questions like this one:

Setbacks don’t discourage me.
a) Very much like me
b) Mostly like me
c) Somewhat like me
d) Not much like me
e) Not like me at all

Duckworth et al discuss the development, validation and limitations of the grit scale here. It’s obviously a self-report scale, with all of the problems such scales entail, but despite this limitation it can tell us some useful information about how ‘gritty’ individuals are, and the impact this will have on their success in other areas.
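To make the mechanics of such a scale concrete, here is a sketch of Likert-style scoring. The item wording above is from the scale, but the point values and the reverse-coded second item are illustrative assumptions, not Duckworth's exact scoring key:

```python
# Illustrative Likert scoring for a short self-report scale.
# Point values and the reverse-coded item are assumptions.
LIKERT = {"Very much like me": 5, "Mostly like me": 4, "Somewhat like me": 3,
          "Not much like me": 2, "Not like me at all": 1}

def grit_score(responses, reverse_items=()):
    """responses: {item_text: answer}. Reverse-coded items are flipped
    (6 - value) so that a higher score always means 'grittier'.
    Returns the mean item score on a 1-5 scale."""
    total = 0
    for item, answer in responses.items():
        value = LIKERT[answer]
        if item in reverse_items:
            value = 6 - value
        total += value
    return total / len(responses)

score = grit_score(
    {"Setbacks don't discourage me.": "Mostly like me",
     "I often set a goal but later choose to pursue a different one.": "Not much like me"},
    reverse_items={"I often set a goal but later choose to pursue a different one."},
)
# mean of 4 and the reversed (6 - 2) = 4.0
```

The simplicity is the point: the score is just an average of self-descriptions, which is exactly why it breaks down once anyone has an incentive to answer strategically.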

However, a self-report scale like this one is very obviously going to be of much less use in any more sophisticated or high-stakes assessments. For example, if you wanted to measure the success of a particular ‘character’ intervention, this scale is not going to allow you to measure whether a cohort’s grit has increased over time. Similarly, if anyone wanted to use a measure of grit or character for accountability purposes, the grit scale is not going to be able to do that either. As the teaching of character has become more popular, more people clearly want a grit scale that is capable of carrying out these kinds of functions. As a result, Duckworth has actually written a paper here outlining in detail why the grit scale cannot perform them, and why it shouldn’t be used in this way.

Of course, this doesn’t mean we should stop teaching character. And nor does it mean we have no way of measuring how effective our character education is. As Dan Willingham says here, we could always use measures of academic achievement to see how effective our character education is. The disadvantage is that of course other things will impact on academic achievement, not just the character education, but the advantage is that we are actually measuring one of the things we care about: ‘Indeed, there would not be much point in crowing about my ability to improve my psychometrically sound measure of resilience if such improvement meant nothing to education.’

However, whilst we shouldn’t stop teaching character, I think Duckworth’s paper and the problems surrounding the measurement of character mean we do have to be careful about how we assess it. To return to the theme of my previous posts, we know that teacher assessment is biased against pupils with poor behaviour, with SEN, from low-income backgrounds, and from ethnic minorities. There is every risk that subjective assessments of character might suffer from the same flaws.  In fact, I would argue that there is more of a risk. School-level maths is a fairly well-defined concept, and yet teacher assessments of it are biased. I don’t think ‘character’ or ‘grit’ are nearly as well-defined as school maths, so the risk of bias is even greater. Whilst Duckworth’s work on ‘grit’ is clearly defined, in general the recent interest in character education has served to expand the concept rather than define it more precisely. I am often struck by the number of different meanings ‘character’ seems to have, and how often people seem to use the term to mean ‘personal qualities that I have and/or approve of’. Given this, there is a real risk that subjective assessments of character would inadvertently tend to reinforce stereotypes about social classes, gender and ethnic groups, and end up disadvantaging pupils who are already disadvantaged.

Not only that, but if we loaded high-stakes outcomes onto character assessments – for example, giving assessments of character weight in university admissions – then there would be an incentive to game them, and it is not too far a stretch to think that middle-class parents would be the ones adept at gaming them for their children, while students from less privileged backgrounds would be disadvantaged. To put it bluntly, I’d worry that character assessments would become a middle-class ramp, a way for underachieving middle-class children to use their nice manners to compensate for their poor grades. Character assessments need a lot of improvement before they can be relied on in the same way as traditional academic exams.

Why is teacher assessment biased?

In my last post, I spoke about how disadvantaged pupils do better on tests than on teacher assessments – and also about how many people assume the opposite is the case. It’s interesting that today, we seem to think that teacher assessment will help the disadvantaged. In the late 19th and early 20th century, the meritocratic advantages of tests were better understood than they are today, as were the nepotism and prejudice that often resulted from other types of assessment. In the 19th century, the civil service reformers Stafford Northcote and Charles Trevelyan fought a long battle to make entry to the civil service based on test scores, rather than family connections. In the early 20th century, Labour educationalists such as RH Tawney and Beatrice and Sidney Webb fought for an education system based around exams, because they believed that only exams could ensure that underprivileged children were treated fairly. Since then, we have only gathered more evidence about the equalizing power of exams, but oddly, we seem to have forgotten these insights.

In my last post, I explained why it is that tests are fairer – they treat every pupil the same, and every pupil has to answer the same questions. However, whilst I gave plenty of evidence that teacher assessment was biased, I didn’t fully explain why this bias happens. As a result, quite a few people said that the solution was simply ‘better teacher assessment’, perhaps by introducing more moderation and CPD, as a group of teaching unions recommended doing in 2011. How realistic is that? Are the problems with teacher assessment really so insoluble that we have to resort to tests? And what exactly is the nature of these flaws?

Teacher assessment is biased not because it is carried out by teachers, but because it is carried out by humans. Tammy Campbell, the IoE researcher whose recent research showed bias in teacher assessments of 7-year-olds, is at pains to point this out. She says, ‘I want to stress that this isn’t something unique to teachers. It’s human nature. Humans use stereotypes as a cognitive shortcut and we’re all prone to it.’ A growing body of research reinforces Campbell’s point. We all have difficulties making certain complex judgments and decisions, and we resort to shortcuts when the mental strain becomes too great (see Daniel Kahneman’s work for more on this). Indeed, it is plausible to speculate that teacher assessment is biased precisely because it is so burdensome: when we are faced with difficult cognitive challenges we often default to stereotypes.

And teacher assessment really is burdensome. I thought I had it bad as a secondary teacher marking coursework for GCSE pupils, but I recently spoke to a primary colleague who told me about the hours they spend gathering evidence for KS1 assessments and cross-referencing it against the level descriptors – a task which, as I’ve said before, is at the limits of human cognitive capacity. When faced with such a difficult challenge, defaulting to stereotypes is in many ways a sensible attempt by our unconscious minds to reduce our workload. We know that on average pupils on free school meals do not attain as well, we know that the essay we are marking isn’t great, but it isn’t terrible, we know it sort of meets some of the criteria on the mark scheme, we need some more evidence in order to reach a final judgment, we could reread the essay and mark scheme but the mark scheme is hard to interpret…we also know that the pupil who wrote the essay is from the wrong side of the tracks. Done: it’s a below average essay. None of us want to admit that this is how our minds work, and for most of us, our minds don’t consciously work like this, but there is plenty of evidence that this is how our reasoning goes. That’s the nice story: here, Adrian Wooldridge gives the less charitable interpretation of the mental processes of assessors. He says that when you base assessment around ‘Oxford dons who pronounce mystically on whether a candidate possesses “a trace of alpha”’, don’t then be surprised when ‘a large number of those who show up favorably on the alpha detectors turn out to be Old Etonians.’

Is it possible to counter such bias in any way? Is it possible to ‘do better teacher assessment’? After all, whilst we humans are susceptible to bias, we are also self-aware. We know we make these errors, and in most of the fields where we make these errors, we have found ways around them. I once heard the scientific method described as a collection of practices designed to counteract human bias. Are there practices we can introduce to teacher assessment that would function like this, and counteract human bias? Yes, there are. We could anonymize the work that’s being marked. We could standardize tasks, conditions and marking protocols. We could carry out some statistical analysis of the marks different pupils get and that different teachers give. And once we had done all that, we would find that we had eliminated many of the biases associated with teacher assessment, but that we had also pretty much eliminated the teacher assessment and replaced it with a test. The flaws with teacher assessment are inherent in its very nature. Doing teacher assessment better basically means making it more test-like. The whole point of the test is that it, like the practices that characterize the scientific method, is essentially a method for countering human bias. As the Bew report says here, most of the attempts to reduce the bias of teacher assessment have failed, and those that have succeeded do so by making teacher assessment more test-like.
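The 'statistical analysis of marks' mentioned above can be sketched in a few lines. All data and group labels below are invented for illustration; the idea is simply to compare each group's average teacher-assessed mark with its average score under blind, standardised marking:

```python
# Minimal sketch of a statistical check on marks. Pupil data and the
# FSM / non-FSM grouping are hypothetical, illustrative values only.
from statistics import mean

pupils = [
    # (group, teacher_assessed_mark, blind_test_mark)
    ("FSM", 52, 58), ("FSM", 47, 55), ("FSM", 60, 63),
    ("non-FSM", 68, 66), ("non-FSM", 71, 70), ("non-FSM", 59, 60),
]

def assessment_gap_by_group(records):
    """Mean (teacher mark - blind test mark) per group. A consistently
    negative gap for one group suggests teacher assessment is
    under-rating that group relative to blind marking."""
    gaps = {}
    for group in {g for g, _, _ in records}:
        gaps[group] = mean(ta - test
                           for g, ta, test in records if g == group)
    return gaps

gaps = assessment_gap_by_group(pupils)
# a large negative gap for one group flags possible bias to investigate
```

Notice, though, what the check depends on: it only works because the blind-marked, standardised test exists as a benchmark – which is exactly the argument of this post, that de-biasing teacher assessment ends up reintroducing the test.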

Teacher assessment discriminates against poorer pupils and minorities, and generates significant workload for teachers. Tests are fairer and less burdensome. They deserve a better reputation than they have, and a greater role in our assessment system.

Whilst I’m in favour of tests, and sceptical about the possibility of improving teacher assessment, I still think there are other ways we could improve assessment. In my next few posts I will look at some recent assessment innovations and see if they offer any improvements on the status quo.