Debating Education review

I spent yesterday at the Michaela Community School Debating Education event, which was absolutely brilliant. I spoke against the motion ‘Sir Ken is right: traditional education kills creativity’, and Guy Claxton spoke for it. Here are some of my notes from this debate, and the day.

It’s about methods, not aims

I agree with Sir Ken Robinson that creativity is the aim of education. However, where we disagree is on how you can best develop such creativity. Sir Ken praises High Tech High’s model of instruction, where instead of memorising, pupils are doing. Guy Claxton recommends, among other things, that to develop the skill of imagining, pupils should lie on the ground, look at the sky and then ‘close their eyes to imagine how the sky changes as a storm approaches.’ By contrast, I think the best way to develop creativity is through direct instruction, memorisation and deliberate practice (for a specific example of how memorisation leads to creativity in a scheme of work on A Midsummer Night’s Dream, see here). This might sound counter-intuitive, but such practices are actually more effective at developing creativity than simply asking children to be creative. Robert Bjork has shown that performance isn’t the same as learning. K. Anders Ericsson has shown that what matters isn’t just practice, but deliberate practice: ‘mere repetition of an activity will not automatically lead to improvement’. Deliberate practice means isolating the component parts of a task and repeatedly practising them instead. So asking pupils to do creative tasks isn’t the best way of developing creativity. Asking them to memorise examples of rhetorical devices might not look creative, but it might be better at developing creativity. The question is not about finding a balance between memory and creativity, or between knowledge and skill. It’s about recognising that memory is the pathway to creativity, and that skill is composed of knowledge. As John Anderson said, ‘All that there is to intelligence is the simple accrual and tuning of many small bits of knowledge which in total make up complex cognition. The whole is no more than the sum of its parts, but it has a lot of parts.’

What we had in yesterday’s debate was not a false dichotomy. There was real disagreement. If Sir Ken and Guy set up a school and I set up a school, they would look very different, even though we both had the same aim. And because we have the same aim, the argument is not about whether I am in favour of creativity or not (I am), or whether Sir Ken is in favour of knowledge or not (I’m prepared to accept he is), or whether we just need a balance between the two. The argument is about whose methods are more successful at delivering our shared aim of creativity.

The other debates

I’m very grateful to all at Michaela for organising so many good debates. Bruno Reddy and Andrew Old debated the value of mixed-ability teaching. James O’Shaughnessy and Joe Kirby had all the RE & philosophy teachers in the room getting excited with their discussion of ethics, character, and ancient Greek philosophers. Katie Ashford and John Blake argued about the perennially vexed issue of Ofsted. Finally, Jonny Porter and Francis Gilbert clashed over the reputation of Michael Gove, in front of an audience which may well have included nearly every teacher in England who agreed with him.

I particularly liked the way the day was structured as a series of debates. As one of the debaters, I can assure you that preparing for a debate of this type is much harder work than preparing for a panel discussion. But I think it also results in a better event. At panel discussions, it’s really easy for everyone to speak for five minutes on their pet theme, regardless of what the topic actually is. Even if the chair is good, it’s often hard to really get to the heart of an issue. But with debates like these, you very quickly get to the important and controversial issues. There are plenty of false dichotomies in education, certainly. But there are some real ones too, and we shouldn’t be afraid to discuss them. We discussed the hell out of them yesterday!

Comparative judgment: 21st century assessment

In my previous posts I have looked at some of the flaws in traditional teacher assessment and assessments of character. This post is much more positive: it’s about an assessment innovation that really works.

One of the good things about multiple-choice and short answer questions is that they offer very high levels of reliability. They have clear right and wrong answers; one marker will give you exactly the same mark as another; and you can cover large chunks of the syllabus in a short amount of time, reducing the chance that a high or low score is down to a student getting lucky or unlucky with the questions that came up. One of the bad things about MCQs is that they often do not reflect the more realistic, real-world tasks pupils might go on to encounter, such as essays and projects. The problem with real-world tasks, however, is that they are fiendishly hard to mark reliably: it is much less likely that two markers will always agree on the grades they award. So you end up with a bit of a trade-off: MCQs give you high reliability, but sacrifice a bit of validity. Essays allow you to make valid inferences about the things you are really interested in, but you trade off reliability. And you have to be careful: trade off too much validity and you have highly reliable scores that don’t tell you anything anyone is interested in. Trade off too much reliability and the inferences you make are no longer valid either.
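To make that trade-off concrete, here is a minimal simulation sketch (all numbers are invented for illustration, not drawn from any real exam data): two markers mark the same thirty scripts; objective MCQ marking adds almost no marker noise, essay marking adds a lot, and the agreement between the two markers falls accordingly.

```python
import numpy as np

# Illustrative simulation of inter-marker reliability (all numbers invented).
rng = np.random.default_rng(42)

true_ability = rng.normal(50, 10, size=30)  # thirty pupils' 'true' scores

# MCQs: marking is objective, so both markers recover nearly the same score.
mcq_marker_1 = true_ability + rng.normal(0, 1, size=30)
mcq_marker_2 = true_ability + rng.normal(0, 1, size=30)

# Essays: each marker adds substantial independent judgment noise.
essay_marker_1 = true_ability + rng.normal(0, 8, size=30)
essay_marker_2 = true_ability + rng.normal(0, 8, size=30)

print("MCQ inter-marker correlation:  ",
      round(np.corrcoef(mcq_marker_1, mcq_marker_2)[0, 1], 2))
print("Essay inter-marker correlation:",
      round(np.corrcoef(essay_marker_1, essay_marker_2)[0, 1], 2))
```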

One way of dealing with this problem has been to keep the real-world tasks, and to write quite prescriptive mark schemes. However, this runs into problems of its own: it reduces the real-world aspect of the task, and ends up stereotyping pupils’ responses to the task. Genuinely brilliant and original responses to the task fail because they don’t meet the rubric, while responses that have been heavily coached achieve top grades because they tick all the boxes. Again, we achieve a higher degree of reliability, but the reliable scores we have do not allow us to make valid inferences about the things we really care about (see the Esse Quam Videri blog on this here). I have seen this problem a lot in national exams, and I think that these kinds of exams are actually more flawed than the often-maligned multiple choice questions. Real-world tasks with highly prescriptive mark schemes are incredibly easy to game. Multiple-choice and short answer questions are actually not as easy to game, and do have high levels of predictive validity. I think the problem people have with MCQs is that they just ‘look’ wrong. Because they look so artificial, people have a hard time believing that they really can tell you something about how pupils will do on authentic tasks. But they can, and they do, and I would prefer them to authentic tasks that either don’t deliver reliability, or deliver reliability in such a way that compromises their validity.

Still, even a supporter of MCQs like me has to acknowledge – as I always have – that in subjects like English and history, you would not want an entire exam to be composed of MCQs and short answer questions. You would want some extended writing too. In the past, I have always accepted that the marking of such extended writing has to involve some of the trade-offs and difficult decisions outlined above. I’ve also always accepted that it has to be a relatively time-consuming process, involving human markers, extensive training, and frequent moderation.

However, a couple of years ago I heard about a new method of assessment called comparative judgment which offers an elegant solution to the problem of assessing tasks such as essays and projects. Instead of writing prescriptive mark schemes, training markers in their use, getting them to mark a batch of essays or tasks and then come back together to moderate, comparative judgment simply asks an examiner to make a series of judgments about pairs of tasks. Take the example of an essay on Romeo and Juliet: with comparative judgment, the examiner looks at two essays, and decides which one is better. Then they look at another pair, and decide which one is better. And so on. It is relatively quick and easy to make such judgments – much easier and quicker than marking one individual essay. The organisation No More Marking offer their comparative judgment engine online here for free. You can upload essays or tasks to it, and set up the judging process according to your needs.

Let’s suppose you have 100 essays that need marking, and five teachers to do the marking. If each teacher commits to making 100 pairs of judgments, you will have a total of 500 pairs of judgments. These judgments are enough for the comparative judgment engine to work out the rank order of all of the essays, and associate a score with each one. In the words of the No More Marking CJ guide here: ‘when many such pairings are shown to many assessors the decision data can be statistically modelled to generate a score for each student.’ If you want your score to be a GCSE grade or other kind of national benchmark, then you can include a handful of pre-graded essays in your original 100. You will then be able to see how many essays did better than the C-grade sample, how many better than the B-grade sample, and so on. This method of marking also allows you to see how accurate each marker is. Again, in the words of the guide: ‘the statistical modelling also produces quality control measures, such as checking the consistency of the assessors. Research has shown the comparative judgement approach produces reliable and valid outcomes for assessing the open-ended mathematical work of primary, secondary and even undergraduate students.’
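The statistics behind this are worth sketching. Comparative judgment engines typically fit something like a Bradley-Terry model to the win/loss data: each essay gets a latent quality score, and the probability of one essay beating another depends on the difference between their scores. The code below is my own toy illustration of that idea, not No More Marking’s actual implementation; it recovers relative scores from a list of pairwise judgments by gradient ascent on the log-likelihood.

```python
import numpy as np

def fit_bradley_terry(n_essays, judgments, lr=0.1, n_iters=2000):
    """Fit a quality score per essay from pairwise judgments.

    judgments: list of (winner, loser) essay index pairs.
    Models P(i beats j) as sigmoid(score_i - score_j) and maximises
    the log-likelihood by gradient ascent.
    """
    scores = np.zeros(n_essays)
    for _ in range(n_iters):
        grad = np.zeros(n_essays)
        for winner, loser in judgments:
            # Probability the current scores assign to the observed outcome.
            p_win = 1.0 / (1.0 + np.exp(scores[loser] - scores[winner]))
            # The log-likelihood gradient nudges the winner up, the loser down.
            grad[winner] += 1.0 - p_win
            grad[loser] -= 1.0 - p_win
        # A small ridge penalty keeps scores finite for unbeaten essays.
        scores += lr * (grad - 0.01 * scores)
        scores -= scores.mean()  # only score differences matter
    return scores

# Toy example: 4 essays, 6 judgments, winner listed first in each pair.
judgments = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (3, 2)]
scores = fit_bradley_terry(4, judgments)
print(np.argsort(scores)[::-1])  # rank order, best essay first -> 0, 1, 3, 2
```

In the real engine, the same decision data also yields the quality-control measures the guide mentions, such as marker consistency; this toy version only recovers the rank order and relative scores.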

I have done some trial judging with No More Marking, and at first, it feels a bit like voodoo. If, like most English teachers, you are used to laboriously marking dozens of essays against reams of criteria, then looking at two essays and answering the question ‘which is the better essay?’ feels a bit wrong – and far too easy. But it works. Part of the reason why it works is that it offers a way of measuring tacit knowledge. It takes advantage of the fact that amongst most experts in a subject, there is agreement on what quality looks like, even if it is not possible to define such quality in words. It eliminates the rubric and essentially replaces it with an algorithm. The advantage of this is that it also eliminates the problem of teaching to the rubric: to go back to our examples at the start, if a pupil produced a brilliant but completely unexpected response, they wouldn’t be penalised, and if a pupil produced a mediocre essay that ticked all the boxes, they wouldn’t get the top mark. And instead of teaching pupils by sharing the rubric with them, we can teach pupils by sharing other pupils’ essays with them – far more effective, as generally examples define quality more clearly than rubrics.

Ofqual have already used this method of assessment for a big piece of research on the relative difficulty of maths questions. The No More Marking website has case studies of how schools and other organisations are using it. I think it has huge potential at primary school, where it could reduce a lot of the burden and administration around moderating writing assessments at KS1 & KS2. On the No More Marking website, it says that ‘Comparative Judgement is the 21st Century alternative to the 18th Century practice of marking.’ I am generally sceptical of anything in education describing itself as ’21st century’, but in this case, it’s justified. I really think CJ is the future, and in 10 or 15 years’ time, we will look back at rubrics the way Doc Brown reacted to the news that Ronald Reagan was President.

Character assessment: a middle-class ramp?

My last two posts (here and here) have looked at how teacher assessments can be biased, and how tests can help to offset some of these biases. I’ve been quite sceptical of the possibility of improving teacher assessment so that it can become less biased: the more you try to reduce the bias in teacher assessment, the less it looks like teacher assessment. Still, that’s not to say I am against all alternative forms of assessment. I think exams have many strengths, and are often unfairly maligned, but they have weaknesses too and we should always be looking to innovate to try and address such weaknesses. In this post and the next, I will look at two recent innovations in educational assessment: one which I think is hugely promising, and one which is less so. First, the less promising method.

Assessing character
Teaching character, or non-cognitive skills, is very popular at the moment, and for good reason. Children don’t just need academic skills to succeed in life; they need good character too. As E.D. Hirsch says here, character development has rightly been one of the major focuses of education from classical times.

Whilst we can probably all agree on the importance of teaching character in some form, assessing it is far more fraught. Angela Duckworth, whose research focusses on ‘grit’, or perseverance for long-term goals, has created very simple 12- and 8-item ‘grit scales’, which ask you to rate yourself against a series of statements like this one:

Setbacks don’t discourage me.
a) Very much like me
b) Mostly like me
c) Somewhat like me
d) Not much like me
e) Not like me at all

Duckworth et al discuss the development, validation and limitations of the grit scale here. It’s obviously a self-report scale, with all of the problems such scales entail, but despite this limitation it can tell us some useful information about how ‘gritty’ individuals are, and the impact this will have on their success in other areas.
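For what it’s worth, scoring a scale like this is purely mechanical: each response maps to a score from 1 to 5, some items are reverse-coded because agreement indicates less grit, and the final score is the mean. Here is a minimal sketch; the items and scoring details below are illustrative assumptions, so see Duckworth’s published scales for the real thing.

```python
# Minimal sketch of scoring a Likert self-report scale (illustrative only;
# see Duckworth's published grit scales for the real items and scoring).
RESPONSES = {"Very much like me": 5, "Mostly like me": 4,
             "Somewhat like me": 3, "Not much like me": 2,
             "Not like me at all": 1}

def grit_score(answers, reverse_items):
    """answers: {item_text: response_text}; reverse_items: items where
    agreement indicates LESS grit, so the 1-5 scale is flipped."""
    points = []
    for item, response in answers.items():
        p = RESPONSES[response]
        if item in reverse_items:
            p = 6 - p  # reverse-code: 5 becomes 1, 4 becomes 2, etc.
        points.append(p)
    return sum(points) / len(points)  # mean score, 1 (low) to 5 (high)

answers = {
    "Setbacks don't discourage me": "Mostly like me",
    "New ideas and projects sometimes distract me from previous ones":
        "Somewhat like me",
}
print(grit_score(answers, reverse_items={
    "New ideas and projects sometimes distract me from previous ones"}))
# -> (4 + (6 - 3)) / 2 = 3.5
```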

However, a self-report scale like this one is very obviously going to be of much less use in any more sophisticated or high-stakes assessment. For example, if you wanted to measure the success of a particular ‘character’ intervention, this scale is not going to allow you to measure whether a cohort’s grit has increased over time. One reason is reference bias: pupils rate themselves against the standards of those around them, so a school that successfully raises expectations may actually see self-reported grit fall. Similarly, if anyone wanted to use a measure of grit or character for accountability purposes, the grit scale is not going to be able to do that either. As the teaching of character has become more popular, more people clearly want a grit scale that is capable of carrying out these kinds of functions. As a result, Duckworth has actually written a paper here outlining in detail why the grit scale is not capable of these kinds of functions, and why it shouldn’t be used like this.

Of course, this doesn’t mean we should stop teaching character. And nor does it mean we have no way of measuring how effective our character education is. As Dan Willingham says here, we could always use measures of academic achievement to see how effective our character education is. The disadvantage is that of course other things will impact on academic achievement, not just the character education, but the advantage is that we are actually measuring one of the things we care about: ‘Indeed, there would not be much point in crowing about my ability to improve my psychometrically sound measure of resilience if such improvement meant nothing to education.’

However, whilst we shouldn’t stop teaching character, I think Duckworth’s paper and the problems surrounding the measurement of character mean we do have to be careful about how we assess it. To return to the theme of my previous posts, we know that teacher assessment is biased against pupils with poor behaviour, with SEN, from low-income backgrounds, and from ethnic minorities. There is every risk that subjective assessments of character might suffer from the same flaws. In fact, I would argue that there is more of a risk. School-level maths is a fairly well-defined concept, and yet teacher assessments of it are biased. I don’t think ‘character’ or ‘grit’ are nearly as well-defined as school maths, so the risk of bias is even greater. Whilst Duckworth’s work on ‘grit’ is clearly defined, in general the recent interest in character education has served to expand the concept rather than define it more precisely. I am often struck by the number of different meanings ‘character’ seems to have, and how often people seem to use the term to mean ‘personal qualities that I have and/or approve of’. Given this, there is a real risk that subjective assessments of character would inadvertently tend to reinforce stereotypes about social classes, gender and ethnic groups, and end up disadvantaging pupils who are already disadvantaged.

Not only that, but if we loaded high-stakes outcomes onto character assessments – for example, giving assessments of character weight in university admissions – then there would be an incentive to game such assessments and again, it is not too far a stretch to think that it would be middle-class parents who would be adept at gaming such assessments for their children, and that students from less privileged backgrounds might be disadvantaged by such assessments. To put it bluntly, I’d worry that character assessments would become a middle-class ramp, a way for underachieving middle-class children to use their nice manners to compensate for their poor grades. Character assessments need a lot of improvement before they can be relied on in the same way as traditional academic exams.

Why is teacher assessment biased?

In my last post, I spoke about how disadvantaged pupils do better on tests than on teacher assessments – and also about how many people assume the opposite is the case. It’s interesting that today, we seem to think that teacher assessment will help the disadvantaged. In the late 19th and early 20th century, the meritocratic advantages of tests were better understood than they are today, as were the nepotism and prejudice that often resulted from other types of assessment. In the 19th century, the civil service reformers Stafford Northcote and Charles Trevelyan fought a long battle to make entry to the civil service based on test scores, rather than family connections. In the early 20th century, Labour educationalists such as RH Tawney and Beatrice and Sidney Webb fought for an education system based around exams, because they believed that only exams could ensure that underprivileged children were treated fairly. Since then, we have only gathered more evidence about the equalising power of exams, but oddly, we seem to have forgotten these insights.

In my last post, I explained why it is that tests are fairer – they treat every pupil the same, and every pupil has to answer the same questions. However, whilst I gave plenty of evidence that teacher assessment was biased, I didn’t fully explain why this bias happens. As a result, quite a few people said that the solution was simply ‘better teacher assessment’, perhaps by introducing more moderation and CPD, as a group of teaching unions recommended doing in 2011. How realistic is that? Are the problems with teacher assessment really so insoluble that we have to resort to tests? And what exactly is the nature of these flaws?

Teacher assessment is biased not because it is carried out by teachers, but because it is carried out by humans. Tammy Campbell, the IoE researcher whose recent research showed bias in teacher assessments of 7-year-olds, is at pains to point this out. She says, ‘I want to stress that this isn’t something unique to teachers. It’s human nature. Humans use stereotypes as a cognitive shortcut and we’re all prone to it.’ A growing body of research reinforces Campbell’s point. We all have difficulties making certain complex judgments and decisions, and we resort to shortcuts when the mental strain becomes too great (see Daniel Kahneman’s work for more on this). Indeed, it is plausible to speculate that the reason why teacher assessment is biased is because it is so burdensome: when we are faced with difficult cognitive challenges we often default to stereotypes.

And teacher assessment really is burdensome. I thought I had it bad as a secondary teacher marking coursework for GCSE pupils, but I recently spoke to a primary colleague who told me about the hours they spend gathering evidence for KS1 assessments and cross-referencing it against the level descriptors – a task which, as I’ve said before, is at the limits of human cognitive capacity. When faced with such a difficult challenge, defaulting to stereotypes is in many ways a sensible attempt by our unconscious minds to reduce our workload. We know that on average pupils on free school meals do not attain as well, we know that the essay we are marking isn’t great, but it isn’t terrible, we know it sort of meets some of the criteria on the mark scheme, we need some more evidence in order to reach a final judgment, we could reread the essay and mark scheme but the mark scheme is hard to interpret…we also know that the pupil who wrote the essay is from the wrong side of the tracks. Done: it’s a below average essay. None of us want to admit that this is how our minds work, and for most of us, our minds don’t consciously work like this, but there is plenty of evidence that this is how our reasoning goes. That’s the nice story: here, Adrian Wooldridge gives the less charitable interpretation of the mental processes of assessors. He says that when you base assessment around ‘Oxford dons who pronounce mystically on whether a candidate possesses “a trace of alpha”’, don’t then be surprised when ‘a large number of those who show up favorably on the alpha detectors turn out to be Old Etonians.’

Is it possible to counter such bias in any way? Is it possible to ‘do better teacher assessment’? After all, whilst we humans are susceptible to bias, we are also self-aware. We know we make these errors, and in most of the fields where we make these errors, we have found ways around them. I once heard the scientific method described as a collection of practices designed to counteract human bias. Are there practices we can introduce to teacher assessment that would function like this, and counteract human bias? Yes, there are. We could anonymise the work that’s being marked. We could standardise tasks, conditions and marking protocols. We could carry out some statistical analysis of the marks different pupils get and that different teachers give. And once we had done all that, we would find that we had eliminated many of the biases associated with teacher assessment, but that we had also pretty much eliminated the teacher assessment and replaced it with a test. The flaws with teacher assessment are inherent in its very nature. Doing teacher assessment better basically means making it more test-like. The whole point of the test is that it, like the practices that characterise the scientific method, is essentially a method for countering human bias. As the Bew report says here, most of the attempts to reduce the bias of teacher assessment have failed, and those that have succeeded do so by making teacher assessment more test-like.
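The third of those checks – statistical analysis of the marks – is easy to sketch. The example below is entirely hypothetical: given each pupil’s teacher-assessed mark alongside a blind-marked test score, it checks whether the gap between the two differs systematically by pupil group, which is exactly the pattern the research in my last post describes.

```python
from collections import defaultdict

# Hypothetical data: (pupil group, teacher-assessed mark, blind test score).
# A systematic gap between the two columns within one group is a red flag.
records = [
    ("FSM",     52, 58), ("FSM",     47, 55), ("FSM",     60, 64),
    ("non-FSM", 61, 59), ("non-FSM", 55, 54), ("non-FSM", 68, 66),
]

gaps = defaultdict(list)
for group, teacher_mark, test_score in records:
    gaps[group].append(teacher_mark - test_score)

for group, gs in gaps.items():
    mean_gap = sum(gs) / len(gs)
    print(f"{group}: mean (teacher - test) gap = {mean_gap:+.1f}")

# A consistently negative gap for one group suggests teacher assessment
# is under-rating that group relative to blind-marked tests.
```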

Teacher assessment discriminates against poorer pupils and minorities, and generates significant workload for teachers. Tests are fairer and less burdensome. They deserve a better reputation than they have, and a greater role in our assessment system.

Whilst I’m in favour of tests, and sceptical about the possibility of improving teacher assessment, I still think there are other ways we could improve assessment. In my next few posts I will look at some recent assessment innovations and see if they offer any improvements on the status quo.

Tests are inhuman – and that is what is so good about them

One of the frequent complaints about tests is that they are a bit dehumanising. Every pupil is herded into an exam hall, there to answer exactly the same questions. The questions they answer are often rather artificial ones, stripped from real-world contexts and on occasion placed in formats, such as multiple choice, that they will be unlikely to encounter outside of an exam hall. If they feel ill, or if they had an argument with their parents the night before, then no special allowances can be made.

In contrast with this, assessment based on teacher judgment seems not just nicer, but much fairer. Exams offer a highly artificial snapshot of a pupil’s grasp of atomised knowledge at just one moment in time. Teachers have knowledge of a pupil that spans months, maybe years, and takes into account the pupil’s performance on a range of different tasks and topics. Teacher assessment also doesn’t necessitate the one-off do-or-die pressure of the exam hall. So it’s clear why many people see teacher assessment as an altogether more sensible arrangement than exams: not just nicer on the pupil, but fairer too.

However, we run up against the problem that whilst teacher assessment appears on the surface to be much nicer and fairer than exams, what we find when we look at the research is that in some very important ways, teacher assessment is not only less fair, it also produces outcomes that are not that nice either. Specifically, teacher assessment has a huge problem with bias. You can watch this video by Rob Coe which sums up a lot of this research. A screenshot from it is below.

[Screenshot from Rob Coe’s video]

Here are some quotations from some key research papers backing up these points.

Both high and medium weight evidence indicated the following: there is bias in teachers’ assessment (TA) relating to student characteristics, including behaviour (for young children), gender and special educational needs; overall academic achievement and verbal ability may influence judgement when assessing specific skills. (Harlen, 2004)

Studies of the National Curriculum Assessment (NCA) for students aged 6 and 7 in England and Wales in the early 1990s found considerable error and evidence of bias in relation to different groups of students (Shorrocks et al., 1993; Thomas et al., 1998). (ibid.)

It is argued that pupils are subjected to too many written tests, and that some should be replaced by teacher assessments… The results here suggest that might be severely detrimental to the recorded achievements of children from poor families, and for children from some ethnic minorities. (Burgess and Greaves, 2009)

Teachers tended to perceive low-income children as less able than their higher income peers with equivalent scores on cognitive assessments. (Campbell, 2015)

In short, as we can see, teacher assessment is biased against disadvantaged pupils. This bias isn’t conscious; it’s a feature of the flaws in all human judgment that people like Daniel Kahneman have written so much about. Tests, by contrast, avoid a lot of these flaws precisely because of all the dehumanising things about them which I outlined at the start. Every pupil is treated the same: they take the same questions in the same conditions at the same time, and it’s hard or even impossible to get special treatment. The wealthy pupil doesn’t get the chance to have their exam taken for them by their tutor. They don’t get the chance to redraft and redo it several times. Tests are also normally blind-marked, and structured, artificial questions like multiple-choice ones are easy to mark reliably and fairly.

As you can see, this research is very well-established, but it doesn’t seem to be very well-known. A couple of years ago, when there were discussions about reducing the amount of teacher assessment in national exams, many people actually asserted that reducing teacher assessment would penalise disadvantaged pupils. Here, for example, is Mary Bousted of the ATL:

 “We have serious concerns that the new-style GCSE will not give all children the chance to demonstrate what they have learned and will particularly disadvantage children with difficult home lives.”

And Ian Toone of Voice:

“Three-hour exams test academically able pupils’ ability to recall and present information under test conditions, but for very many young people, including those with special needs, coursework and teacher assessment are a better measure of their knowledge and abilities.”

But as we’ve seen, there is good evidence that teacher assessment does not help SEN pupils and those from low-income backgrounds. So we are in an odd situation, whereby we have solid research evidence that disadvantaged pupils do better on tests than on teacher assessments, but the popular understanding is that exactly the reverse is true. This is yet another example, I would argue, of the serious consequences of the lack of high-quality training in assessment.

So whilst on the surface teacher assessment seems a more human and fairer form of assessment, in practice it is often less fair. And what we also find is that the very things about exams which make them seem so inhuman are also the very things which help guarantee their fairness. If you want fairness, progress, equality and reliability, then human judgment may not be the best method.

Intelligence Squared debate: Don’t end the tyranny of the test

On Thursday I spoke at an Intelligence Squared debate called ‘Let’s end the tyranny of the test: relentless school testing demeans education’. Together with Toby Young, I spoke against the motion; Tony Little and Tristram Hunt spoke for it.

There were a number of important points of agreement between the two sides. Tony Little told the story of Tom, a brilliant history student who got a U in his History A-level because his argument was essentially too sophisticated for the narrow exam rubric. I’ve known Toms in my time teaching, and I’ve also known the opposite – the student who gains top grades in exams through a mastery of the exam technique, as opposed to the subject itself. I completely agree with Tony Little that this is a real problem in our exam system. I’ve written about this in this collection of essays here, where I am critical of narrow exam syllabuses and textbooks that encourage a focus on exam structures, rather than the topic itself. For example, in this exam textbook on Germany 1919-1945, there are large sections called ‘Meeting the Examiner’ where the technicalities of the ‘8 mark question’ are discussed, but there is no detail at all on any aspect of German history outside that narrow period. In this textbook, it is more important to Meet the Examiner than it is to meet Bismarck. If you also have concerns over this phenomenon, I’d recommend reading Heather Fearn’s blogs on the subject here.

We also agreed with the proposition that blunt government targets were problematic and that teaching to the test was not a good thing (I have written more about this here), and they agreed with us that exams should always play an important part in education. However, I felt the central area of disagreement was in attitudes to exams. The proposition viewed them as essentially a necessary evil, which were in many ways inimical to good education, and whose role and impact needed to be reduced as far as possible. Tristram Hunt said that our real focus should not be on exams, but on ensuring equity in the early years, and on teacher quality. Tony Little said that our focus should not be on exams, but on ensuring a love of learning; he also said that exams ‘atomise knowledge’. It was clear that for both of them, exams were not always helpful for these other aims. Whilst I accept that in our current system this may be the case, I am convinced that exams have an important and indeed indispensable role to play in achieving these aims. I don’t think you can improve equity, teacher quality and a love of learning without some form of reliable feedback – and exams are basically the best and most accurate method of gathering feedback that we have. In many different areas of life, improvements and innovations are often brought about by improvements in measurement and feedback. Exams are our measurement system in education, and the fact that we misuse our measures in some ways at the moment is not an argument against measurement: as Sir Philip Sidney said, ‘shall the abuse of a thing make the right use odious?’

Where exams have gone wrong – as I fully accept they have – we have to reform them, not abolish them. And in many ways they have gone wrong by drifting away from being pure exams. For example, for me, one of the problems with the kinds of history exams Tony Little rightly criticised is that they are too dependent on human judgment against abstract criteria, which we know is a very ineffective method of assessment. I would like to see an element of multiple-choice questions in such exams, which would help to eliminate such problems – but of course, multiple-choice questions are exactly the type of question which receives even more criticism for ‘atomising knowledge’. Similarly, one way we could ensure greater equity in the early years is to introduce exams at KS1, rather than teacher assessments, since we have some evidence that teacher assessments at this age are biased against pupils from low-income backgrounds – but again, if you suggest replacing teacher assessments with tests, you generally do not get a great response. So this, for me, was the difference between the two sides: we both acknowledged the flaws in the current exam system, and both had very similar aims for education, but our side felt that the problems could be solved by wiser use of exams, and perhaps even more of them, whereas the proposition felt that there should be fewer exams, that they should be of less importance, and that there should be more extended project-type assessments.

In my speech itself, I had two main arguments. First, I argued that exams were accurate and less subject to bias than other methods of assessment, such as teacher assessment and coursework. This is a well-established finding in the literature, but it is curiously little-known. Teacher assessment is frequently biased against disadvantaged pupils, but people assume again and again that actually, such assessment helps these pupils. I will write more about this in the future, but if you are interested in it, then my article in the Policy Exchange collection of essays here has more on this, and Rob Coe’s video here does too. I also argued that properly designed tests really do predict things of value. For example, this fascinating study by Benbow and Lubinski tracked top scorers on the American SAT at age 13. At age 38, many of them had gone on to achieve remarkable things, and not just in the predictable areas of business and academia – many of them excelled in the creative professions too. And the American SAT is the kind of test that many would criticise for ‘atomising knowledge’, or for just being, as David Baddiel says here, a kind of ‘cognitive trick’. This is not true, and it fundamentally misunderstands how tests work. Questions on tests do not have to look exactly like the kind of problems we face in real life in order to provide useful information about how we might do on such real-life problems.

My second argument was that testing was also a useful pedagogical technique. We know from Bjork’s research on the testing effect that straining to recall something, as we do in a test, actually helps us to recall it in future. We also know that practice testing is a very effective revision technique: much more effective than the more common approach of rereading notes and highlighting them. The frequently-heard line about how ‘weighing the pig isn’t the same as feeding it’ is false: in the case of education, weighing the pig is the same as feeding it. Testing actually helps you to learn.

We narrowly lost the debate, and afterwards I spoke to a number of pupils in the audience whose experiences of exams were similar to Tom’s above, and who were therefore understandably sceptical of the value of exams. It’s worrying that this is happening, and it makes reform of our exam system all the more important. This same pattern – misuse of exams leading to widespread mistrust of them – has also been seen in the US, and has been outlined brilliantly by Daniel Koretz in Measuring Up. Koretz is at pains to point out the incredibly valuable information we can get from apparently ‘narrow’ standardised tests, but he is also very critical of the No Child Left Behind Act, and of the teaching to the test and gaming it has encouraged. I completely agree with both his points, but it’s a combination of views that feels very rare: often, it feels as though if you are in favour of tests, you must be in favour of teaching to them; and if you are worried about how tests are being used, you must be in favour of abolishing them. It would be nice to open up space for a more Koretz-esque view of tests in this country. A confrontational debate may not be the best way of doing this, but I did enjoy it, and I hope it did at least pique some people’s interest.

The Commission on Assessment without Levels

I was a member of the Commission on Assessment without Levels, which met earlier this year to look at ways of supporting schools with the removal of national curriculum levels. The final report was published last week, and here are a few key points from it.

1. Assessment training is very weak

The Commission agreed with the Carter Review that teacher training and professional development in assessment was weak. It’s worth quoting the Carter Review at length on this.

“For example, our review of course materials highlighted that important concepts relating to evidence-based teaching (validity, reliability, qualitative and quantitative data, effect sizes, randomised controlled trials) appeared to be covered in only about a quarter of courses…there are significant gaps in both the capacity of schools and ITT providers in the theoretical and technical aspects of assessment. This is a great concern – particularly as reforms to assessment in schools mean that teachers have an increased role in assessment. There are also important links here with the notion of evidence-based teaching. The profession’s ability to become evidence-based is significantly limited by its knowledge and understanding of assessment – how can we effectively evaluate our own practice until we can securely assess pupil progress?”

It’s particularly frustrating that assessment training is so weak, because compared to a lot of other aspects of teacher training, it is not hard to deliver. It should be relatively straightforward to design a taught course covering the topics above.

2. Performance descriptors have big weaknesses. Judging pupils against ‘can-do’ statements is popular, but flawed.

“Some assessment tools rely very heavily on statements of achievement drawn from the curriculum. For example, teachers may be required to judge pupils against a series of ‘can-do’ statements. Whilst such statements appear precise and detailed, they are actually capable of being interpreted in many different ways. A statement like ‘Can compare two fractions to identify which is larger’ sounds precise, but whether pupils can do this or not depends on which fractions are selected. The Concepts in Secondary Mathematics and Science (CSMS) project investigated the achievement of a nationally representative group of secondary school pupils, and found out that when the fractions concerned were 3/7 and 5/7, around 90% of 14-year-olds answered correctly, but when more typical fractions, such as 3/4 and 4/5, were used, 75% answered correctly. However, where the fractions concerned were 5/7 and 5/9, only around 15% answered correctly.”

I’ve written about this at length here.

3. Teacher assessment is not always fairer than tests

“Standardised tests (such as those that produce a reading age) can offer very reliable and accurate information, whereas summative teacher assessment can be subject to bias.”

4. Ofsted does not expect to see any one particular assessment system.

Here’s a link to a video of Sean Harford, another member of the commission and the National Director for Schools at Ofsted, making exactly this point.

5. A national item bank could be an innovative way of providing a genuine replacement for levels.

“Some schools use online banks of questions to help with formative assessment. Such banks of questions give meaning to the statements contained in assessment criteria and allow pupils to take ownership of their learning by seeing their strengths and weaknesses and improvement over time. Some commercial packages exist with pre-set questions, particularly for maths and science. Other products allow teachers to create their own questions, thus ensuring they align perfectly with the school curriculum.

One of the flaws with national curriculum levels was the way a summative measure came to dominate formative assessment. One way the government could support formative assessment without recreating the problems of levels would be to establish a national item bank of questions based on national curriculum content. Such an item bank could be used for low-stakes assessments by teachers and would help to create a shared language around the curriculum and assessment. It could build on the best practice of schools that are already pioneering this approach. Over time, the bank could also be used to host exemplar work for different subjects and age groups.”

New Zealand appears to have something similar.
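To make the idea concrete, an item bank is essentially a store of questions tagged by curriculum statement, from which teachers can draw low-stakes checks. The sketch below is my own invention – the structure and field names are illustrative assumptions, not the Commission’s specification:

```python
import random

# Hypothetical item-bank structure: questions tagged by curriculum statement.
ITEM_BANK = [
    {"statement": "Can compare two fractions to identify which is larger",
     "question": "Which is larger: 3/4 or 4/5?", "answer": "4/5"},
    {"statement": "Can compare two fractions to identify which is larger",
     "question": "Which is larger: 5/7 or 5/9?", "answer": "5/7"},
    {"statement": "Can add two-digit numbers",
     "question": "What is 47 + 38?", "answer": "85"},
]

def low_stakes_quiz(statement, n_items=2):
    """Draw questions tagged with one curriculum statement, for a
    low-stakes formative check rather than a summative judgment."""
    pool = [item for item in ITEM_BANK if item["statement"] == statement]
    return random.sample(pool, min(n_items, len(pool)))

for item in low_stakes_quiz("Can compare two fractions to identify which is larger"):
    print(item["question"])
```

Note how the first two items instantiate the same ‘can-do’ statement at very different difficulties – which is exactly the CSMS point about fractions quoted above, and exactly what an item bank would make visible.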

For more of my posts on assessment, see here.