Category Archives: Assessment

Four and a half things you need to know about new GCSE grades

Last week I had a dream that I was explaining the new GCSE number grades to a class of year 11s. No matter how many times I explained it, they kept saying ‘so 1 is the top grade, right miss? And 3 is a good pass? And if I get 25 marks I am guaranteed a grade 3?’

Here are the four and a half things I think you need to know about the new GCSE number grades.

ONE: The new grading system will provide more information than the old one
When I taught in the 6th form, I felt that there were lots of pupils who had received the same grade in their English GCSE but who nevertheless coped very differently with the academic challenge of A-level. There are lots of reasons for this, but I think one is that grades C and B in particular are awarded to so many pupils. Nearly 30% of pupils receive a grade C in English and Maths, and there are clearly big differences between a pupil at the top of that grade and one at the bottom. With the new system, it looks as though the most common grade will be a 4, which only about 20% of pupils will get. With the old letter system, things had got a bit lop-sided: half of the available grades (A* to C, four grades out of eight) were used to distinguish the top two-thirds of candidates. In the new system, two-thirds of the available grades (4 to 9, six grades out of nine) will be awarded to the top two-thirds of candidates, which is fairer, provides more information, and will help 6th forms and employers distinguish between candidates.

TWO: We don’t know what the grade boundaries will be
Even with an established specification, it is really hard to predict in advance the relative difficulty of different questions, which is why grade boundaries can never be set in advance. This is even more the case with a new specification. We just don’t know how many marks will be needed to get a certain grade.

THREE: We do know roughly what the grade distribution will be like
Whilst we don’t know the number of marks needed to get a certain grade, we do know how many pupils will get a grade 4 and above (70%), and how many will get a grade 7 and above (16% in English, 20% in Maths). The new 4 grade is linked to the old C grade, and the new 7 to the old A. I’ve heard some people say that the new standards are a ‘complete unknown’. This isn’t the case. We know a lot about where the new standards will be, and this approach tells us far more in advance than the alternative approaches that could have been taken (see below).
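To make points two and three concrete, here is a deliberately simplified sketch in Python, using made-up marks and the illustrative proportions above. It shows the idea described in this post: the proportions achieving each grade are tied to last year’s letter grades, so the mark boundaries can only be worked out once this year’s marks are known. The real awarding process is more sophisticated (it also uses prior-attainment data and examiner judgement), so treat this purely as an illustration.

```python
import numpy as np

# Illustrative only: simulated raw marks for this year's cohort.
rng = np.random.default_rng(0)
marks = rng.normal(loc=45, scale=15, size=5000).clip(0, 100)

# Proportions carried over from the last letter-graded cohort
# (roughly 70% achieved C or above; the C and A links anchor grades 4 and 7).
prop_grade_4_and_above = 0.70   # linked to the old C and above
prop_grade_7_and_above = 0.18   # linked to the old A and above (illustrative)

# Boundaries are set so that the anchored proportions achieve each grade or above.
grade_4_boundary = np.quantile(marks, 1 - prop_grade_4_and_above)
grade_7_boundary = np.quantile(marks, 1 - prop_grade_7_and_above)

print(f"Grade 4 boundary: {grade_4_boundary:.0f} marks")
print(f"Grade 7 boundary: {grade_7_boundary:.0f} marks")
# If the paper turns out harder and marks are lower, the boundaries move down,
# but the proportion of pupils achieving a 4+ stays the same.
```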

FOUR: There’s an ‘ethical imperative’ behind this process
The ‘ethical imperative’ is the idea that no pupil will be disadvantaged by the fact that they were the first to take these new exams. (See pages 16-17 here). That’s why Ofqual have created a link between the last year of letter grades and the first year of number grades. Suppose these new specs really are so fiendishly hard that all the pupils struggle dramatically on them. Even so, 70% of pupils will still get a grade 4+. They are not going to be disadvantaged by the introduction of new and harder exams.

AND A HALF: Secondary teachers: if you don’t like this approach, just talk to a primary colleague about what they went through last year!
At Ark, I’ve been involved with the changes to Sats that happened last year, and the changes to GCSE grading that are happening this year. There was no ‘ethical imperative’ at primary last year, meaning we didn’t know until the results were published where the standard would be set. Whereas we know in advance with the new GCSE that about 70% of pupils will get a 4 or above, at primary we were left wondering whether 80% of pupils would pass, or 60%, or 20%! We didn’t have a clue! In the event, the proportion of pupils reaching the expected standard in reading fell sharply compared to previous years. Not only did this lead to a very stressful year for primary teachers, it also means that it is extremely hard to compare results from before and after 2016. One might argue that this matters less at primary, as pupils do not carry the results with them through life or get compared to pupils from previous years. But of course, the results of schools are compared over time, and a great deal depends on these comparisons. So I think an ethical imperative would have been welcome at primary too, and that the new GCSE grades have been designed in the fairest possible way for both schools and pupils.

What makes a good formative assessment?

This is part 5 of a series of blogs on my new book, Making Good Progress?: The future of Assessment for Learning. Click here to read the introduction to the series.

In the last two blog posts – here and here –  I’ve spoken about the importance of breaking down complex skills into smaller pieces. This has huge implications for formative assessments, where the aim is to improve a pupil’s performance, not just to measure it.

Although we typically speak of ‘formative assessment’ and ‘summative assessment’, actually, the same assessment can be used for both formative and summative purposes. What matters is how the information from an assessment is used. A test can be designed to give a pupil a grade, but a teacher can use the information from individual questions on the test paper to diagnose a pupil’s weaknesses and decide what work to give them next. In this case, the teacher is taking an assessment that has been designed for summative purposes, but using it formatively.

Whilst it is possible to reuse assessments in this way, it is also true that some types of assessment are simply better suited for formative purposes than others. Because complex skills can be broken down into smaller pieces, there is great value in designing assessments which try to capture progress against these smaller units.

Too often, however, formative assessments are simply mini-summative assessments – tasks that are really similar in style and substance to the final summative task, with the only difference being that they have been slightly reduced in size. So for example, if the final assessment is a full essay on the causes of the first world war, the formative assessment is one paragraph on how the assassination of Franz Ferdinand contributed to the start of the war. If the final summative assessment is an essay analysing the character of Bill Sikes, the formative assessment is an essay analysing Fagin. The idea is that the comments and improvements a teacher gives pupils on the formative essay will help them improve for the summative essay.

But I would argue that in order to improve at a complex task, sometimes we need to practise other types of task. Here is Dylan Wiliam commenting on this, in the context of baseball.

The coach has to design a series of activities that will move athletes from their current state to the goal state. Often coaches will take a complex activity, such as the double play in baseball, and break it down into a series of components, each of which needs to be practised until fluency is reached, and then the components are assembled together. Not only does the coach have a clear notion of quality (the well-executed double play), he also understands the anatomy of quality; he is able to see the high-quality performance as being composed of a series of elements that can be broken down into a developmental sequence for the athlete. (Embedded Formative Assessment, p.122)

Wiliam calls this series of activities ‘a model of progression’. When you break a complex activity down into a series of components, what you end up with often doesn’t look like the final activity. When you break down the skill of writing an essay into its constituent parts, what you end up with doesn’t look like an essay. I wrote about this about five years ago, setting out some of the activities that I felt could help pupils become good writers.

Once we’ve established a model of progression in a subject, then we can think about how to measure progress – and measuring progress is what the next post will be about.

Teaching knowledge or teaching to the test?

This is part 2 of a series of blogs on my new book, Making Good Progress?: The future of Assessment for Learning. Click here to read the introduction to the series.

For many people, teaching knowledge, teaching to the test and direct, teacher-led instruction are one and the same thing. Here is Fran Abrams from BBC Radio 4’s Analysis programme making this argument.

In fact, there’s been an increasing focus on knowledge, as English schools have become ever more exam driven.

And also Tom Sherrington, who writes the teacherhead blog.

If anything, we have a strong orientation towards exam preparation; exams are not as content free as some people suggest.

Teaching knowledge and teaching to the test are seen as similar things – but what I want to argue is that they’re actually very different.

I think teaching knowledge and direct teacher instruction are good things – but that teaching to the test is a really bad idea. I also think, perhaps slightly counter-intuitively, that teaching to the test is more likely to happen when you don’t focus on teaching knowledge. It’s when you try and teach generic skills that you end up teaching to the test.

First of all, what is teaching to the test and why is it bad? I’ve written at length about this here, but briefly, teaching to the test is bad because no test in the world can directly measure everything we want pupils to know and be able to do. Instead, tests select a smaller sample of material and use that to make an inference about everything else. If we focus teaching on the small sample, two bad things happen. One, the results a pupil gets are no longer a valid guide to their attainment in that subject. Two, we stop teaching important things that aren’t on the test, and start teaching peripheral things that are on the test. My favourite example of this is a history one. A popular exam textbook on inter-war Germany doesn’t mention Bismarck, and barely mentions Kaiser Wilhelm II. It does have lengthy sections on how to answer the 4-mark and 8-mark questions. That’s teaching to the test.
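To see why that inference breaks down, here is a toy simulation in Python (entirely made-up numbers, not something from the book): a domain of 100 things worth knowing, a test that samples 10 of them, one pupil taught across the whole domain and one coached only on the sampled items.

```python
import random

random.seed(1)
domain = list(range(100))                 # 100 things we want pupils to know
test_items = random.sample(domain, 10)    # the test can only sample 10 of them

pupil_a = set(random.sample(domain, 60))  # taught the whole domain, knows 60% of it
pupil_b = set(test_items)                 # coached only on what the test samples

def test_score(known):
    return sum(item in known for item in test_items) / len(test_items)

def domain_knowledge(known):
    return len(known) / len(domain)

for name, known in [("Pupil A", pupil_a), ("Pupil B", pupil_b)]:
    print(name, "test score:", test_score(known),
          "true domain knowledge:", domain_knowledge(known))
# Pupil B gets full marks despite knowing a tenth of the domain, so the
# test score no longer supports the inference it was designed to make.
```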

Direct instruction and teaching knowledge are very different from this. Direct instruction is about breaking a skill down into its smallest components, and getting pupils to practise them. Teaching knowledge is about identifying the really important knowledge pupils need to understand the world they live in, and teaching that.

A knowledge-based approach to teaching inter-war Germany would teach lots of key dates, facts and figures, not just about inter-war Germany itself, but also about, for example, the growth of nationalism in 19th century Europe.

One possible difficulty with the knowledge-based, direct instruction approach is identifying what knowledge you should teach, and in what way you should break down complex skills. For example, I’ve said that to understand inter-war Germany, you should teach 19th century Europe and Bismarck – but am I right? How do you decide what content you need? And given that we presumably expect pupils to be able to write historical essays, surely some direct instruction in the 4-mark question, say, is valuable? This question – what should we expect pupils to memorise – is the subject of the next post.

Why didn’t Assessment for Learning transform our schools?

This is part 1 of a series of blogs on my new book, Making Good Progress?: The future of Assessment for Learning. Click here to read the introduction to the series.

Giving feedback works. There is an enormous amount of evidence that shows this, much of it summarised in Black and Wiliam’s Inside the Black Box. The importance of giving feedback was the rationale behind the government-sponsored initiative of Assessment for Learning, or AfL. Yet, nearly twenty years after the publication of Inside the Black Box, and despite English teachers saying they give more feedback to pupils than teachers in nearly every comparable country do, most metrics show that English education has not improved much over the same period. Dylan Wiliam himself has said that ‘there are very few schools where all the principles of AfL, as I understand them, are being implemented effectively’.

How has this happened?

My argument is that what matters is not just the act of giving feedback, but the type and quality of the feedback. You can give all the feedback you like, but if it doesn’t help pupils to improve, it doesn’t matter. And over the past twenty years or so, the feedback teachers were encouraged to give was based on a faulty idea of how pupils learn: the idea that pupils can learn generic skills.

National curriculum levels, the assessing pupil progress grids, the interim frameworks and various ‘level ladders’ are all based on the assumption that there are generic skills of analysis, problem-solving, inference, mathematical awareness, scientific thinking and so on that can be taught and improved on. In these systems, all the feedback pupils get is generic. Teachers have been encouraged to use the language of the level descriptors to give feedback, meaning that pupils get abstract and generic comments like: ‘you need to develop explanation of inferred meanings drawing on evidence across the text’ or ‘you need to identify more features of the writer’s use of language’.

Unfortunately, we know that skill is not something that can be taught in the abstract. We all know people who are good readers, but their ability to read and infer is not an abstract skill: it is dependent on knowledge of vocabulary and background information about the text.

What this means is that whilst statements like ‘you need to identify more features of the writer’s use of language’ might be an accurate description of a pupil’s performance, these statements are not actually going to help them improve. What if the pupil didn’t know any features to begin with? What if the features they knew weren’t present in this text?

Generic feedback is descriptive, not analytic. It’s accurate, but it isn’t helpful. It tells pupils how they are doing, but it does not tell them how to get better. For that, they need something much more specific and curriculum-linked. In fact, in order to give pupils more helpful feedback, they need to do more helpful, specific and diagnostic tasks. If you try to teach generic skills, and only give generic feedback, you will end up always having to use assessments that have been designed for summative purposes. That is, you will end up over-testing and teaching to the test.

Teaching to the test, and the vexed question of whether it is a good or a bad thing, will be the subject of the next post.

Making Good Progress?: The future of Assessment for Learning

In February, my second book is going to be published by Oxford University Press. It’s called Making Good Progress?: The future of Assessment for Learning. 

It is the assessment follow-up to my first book, Seven Myths about Education, which was about education more generally. In Seven Myths about Education, I argued that a set of flawed ideas had become dominant in education even though there was little evidence to back them up. Broadly speaking, I argued that knowledge and teacher-led instruction had been given an undeserved bad reputation, and that the research evidence showed that knowledge, practice and direct instruction were more likely to lead to success than discovery and project-based learning.

The hardest questions I had to answer about the book were from people who really liked these ideas, and wanted to know how they could create an assessment system which supported them. Certain kinds of activities, lessons and assessment tasks simply didn’t work with national curriculum levels. For example, discrete grammar lessons, vocabulary quizzes, multiple choice questions, and historical narratives were hard, if not impossible, to assess using national curriculum levels. Many schools required every lesson, or every few lessons, to end with an activity which gave pupils a level: e.g. at the end of this lesson, to be a level 4a you need to have done x; to be a 5c you need to have done y; to be a 5b you need to have done z. This type of lesson structure had become so dominant as to feel completely natural and inevitable. But actually, it was the product of a specific set of questionable beliefs about assessment, and it imposed huge restrictions on what you could teach. In short, the assessment system was exerting a damaging influence on the curriculum, and that influence was all the more damaging for being practically invisible.

Over the last four years, in my work at Ark Schools, I have been lucky enough to have the time to think about these issues in depth, and to work on them with some great colleagues. Making Good Progress? is a summary of what I have learnt in that time. It isn’t a manual about one particular assessment system. But it does contain all the research and ideas I wish I had known about when I first started thinking about this. In the next seven blog posts, I will briefly summarise some of the ideas it contains. Here they are.

  1. Why didn’t AfL transform our schools?
  2. Teaching knowledge or teaching to the test?
  3. Is all practice good?
  4. How can we close the knowing-doing gap?
  5. What makes a good formative assessment?
  6. How can we measure progress in individual lessons?
  7. How do bad ideas about assessment lead to workload problems?

Research Ed 2016: evidence-fuelled optimism

One of the great things about the Research Ed conferences is that whilst their aim is to promote a sceptical, dispassionate and evidence-based approach to education, at the end of them I always end up feeling irrationally excited and optimistic. The conferences bring together so many great people and ideas that it’s easy to think educational nirvana is just around the corner. Of course, I also know from the many Research Ed sessions on statistics that this is selection bias at work: the 750+ people at Capital City Academy yesterday are an entirely unrepresentative sample of just about anything, and educational change is a slow and hard slog, not a skip into the sunlit uplands. Still, I am pretty sure there must be some research that says if you can’t feel optimistic at the start of September, you will never make it through November and February.

And there was some evidence that the community of people brought together by Research Ed really are making a difference, not just in England but in other parts of the world too. One of my favourite sessions of the day was the last one by Ben Riley of the US organisation Deans for Impact, who produced the brilliant The Science of Learning report. Ben thinks that English teachers are in the vanguard of the evidence-based education movement, and that we are way ahead of the US on this score.   One small piece of evidence for this is that a quarter of the downloads of The Science of Learning are from the UK. There clearly is a big appetite for this kind of stuff here. In the next few years, I am really hopeful that we will start to see more and more of the results and the impact of these new approaches.

Here’s a quick summary of my session yesterday, plus two others I attended.

My session

For the first time, I actually presented some original research at Research Ed, rather than talking about other people’s work. Over the last few months, I have been working with Dr Chris Wheadon of No More Marking on a pilot of comparative judgment of KS2 writing. We found that the current method of moderation using the interim frameworks has some significant flaws, and that comparative judgment delivers more reliable results with fewer distortions of teaching and learning. I will blog in more depth about this soon: it was only a small pilot, but it shows CJ has a lot of promise!
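For readers who want to see how comparative judgment turns pairwise decisions into a measurement scale, here is a minimal sketch of the Bradley–Terry model that CJ approaches typically rest on, with simulated scripts and judgments. This is an illustration of the general idea, not No More Marking’s actual implementation.

```python
import random
from collections import defaultdict
from itertools import combinations

random.seed(0)
true_quality = {f"script_{i}": i for i in range(1, 9)}  # hidden 'true' quality

# Simulate judges making repeated 'which is better?' decisions on pairs;
# better scripts win more often, but not always.
judgments = []  # (winner, loser) pairs
for a, b in combinations(true_quality, 2):
    for _ in range(4):
        p_a = true_quality[a] / (true_quality[a] + true_quality[b])
        judgments.append((a, b) if random.random() < p_a else (b, a))

wins = defaultdict(int)
pair_counts = defaultdict(int)
for winner, loser in judgments:
    wins[winner] += 1
    pair_counts[frozenset((winner, loser))] += 1

# Fit Bradley-Terry strengths with a simple iterative (MM) update.
scripts = list(true_quality)
strength = {s: 1.0 for s in scripts}
for _ in range(200):
    updated = {}
    for s in scripts:
        denom = sum(pair_counts[frozenset((s, t))] / (strength[s] + strength[t])
                    for t in scripts if t != s)
        updated[s] = (wins[s] + 0.01) / denom   # tiny prior avoids zero strengths
    total = sum(updated.values())
    strength = {s: v / total for s, v in updated.items()}

# The fitted strengths recover the quality order from nothing but comparisons.
for s in sorted(strength, key=strength.get, reverse=True):
    print(s, round(strength[s], 3))
```

The point of the sketch is that no judge ever assigns a mark or consults a descriptor: the scale emerges from many quick pairwise decisions.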

Heather Fearn

Heather (blog here) presented some research she has been working on about historical aptitude. What kinds of skills and knowledge do pupils need to be able to analyse historical texts they have never seen before, or comment on historical eras they have never studied? The Oxford Historical Aptitude Test (HAT) asks pupils to do just that, and I have blogged about it here before. In short, I think it is a great test with some bad advice, because it constantly tells pupils that they don’t have to know anything about history to be able to answer questions on the paper. Heather’s research showed how misleading this advice is. She got some of her pupils to answer questions on the HAT, then analysed their answers and looked at which other historical eras they had referred to in order to make sense of the new ones they encountered on the HAT. Pupils were much better at analysing eras, like Mao’s China, where comparisons to Nazi Germany were appropriate or helpful. When asked to analyse eras like 16th century Germany, they fell back on anachronisms such as talking about ‘the inner city’, because they didn’t really have a frame of reference for such eras.

This is a very brief summary of some complex research, but I took two implications from it: one for history teachers, and one for everyone. First, the more historical knowledge pupils have, the more sophisticated their analysis and the more easily they can understand new eras of history. Second, there are profound and worrying consequences of the relentless focus in history lessons on the Nazis. Heather noted that her pupils were great at talking about dictatorships and fascism in their work, but when they had to talk about democracy, they struggled because they just didn’t understand it – even though it is the political system they have grown up with. This seems to me to offer a potential explanation of Godwin’s Law: we understand new things by comparing them to old things; if we don’t know many ‘old things’, we will always be forcing the ‘new things’ into inappropriate boxes; if all we are taught is the Nazis, we will therefore end up comparing everything to them. I think this kind of research shows we need to teach the historical roots of democracy more explicitly – perhaps by focussing more on eras such as the ancient Greeks and the neglected Anglo-Saxons.

Ben Riley

Ben is the founder of Deans for Impact, a US teacher training organisation.  The Science of Learning, referenced above, is a report by them which focusses on the key scientific knowledge teachers need to understand how pupils learn. In this session, Ben presented some of their current thinking, which is more about how teachers learn. Their big idea is that ‘deliberate practice’ is just as valuable for teachers as it is for pupils. However, deliberate practice is a tricky concept, and one that requires a clear understanding of goals and methods. We might have a clear idea of how pupils make progress in mathematics. We have less of an idea of how they make progress in history (as Heather’s research above shows). And we probably have even less of a clear idea of how teachers make progress. Can we use deliberate practice in the absence of such understanding? Deans for Impact have been working with K Anders Ericsson, the world expert on expertise, to try and answer this question. I’ve been reading and writing a lot about deliberate practice over the last few months as part of the research for my new book, Making Good Progress?, which will be out in January. In this book, I focus on using it with pupils. I haven’t thought as much about its application to teacher education, but there is no doubt that deliberate practice is an enormously powerful technique which can lead to dramatic improvements in performance – so if we can make it work for teachers, we should.

“Best fit” is not the problem

I can remember taking part in marking moderation sessions using the Assessing Pupil Progress grids. We marked using ‘best fit’ judgments. At their worst, such ‘best fit’ judgments were really flawed. A pupil might produce a very inaccurate piece of writing that everyone agreed was a level 2 on Assessment Focus 6 – write with technical accuracy of syntax and punctuation in phrases, clauses and sentences. But then someone would point out how imaginative it was, and say that it deserved a much higher level for Assessment Focus 1 – write imaginative, interesting and thoughtful texts. Using a best fit judgment, therefore, a pupil would end up with the national average level even though they had produced work that had serious technical flaws. Another pupil might get the same level, but produce work that was much more accurate. Given this, it is easy to see why ‘best fit’ judgments have fallen out of favour. On the new primary interim writing frameworks, to get a certain grade you have to be ‘secure’ at all of the statements. So, we’ve moved from a best fit framework to a secure fit one. Another way of saying this is that we have moved from a ‘compensatory’ approach, where weaknesses in one area can be made up with strengths in another, to a ‘mastery’ approach, where pupils have to master everything to succeed.

The problem with the ‘secure’ or ‘mastery’ approach to assessment is that when it is combined with open tasks like extended writing it leads to tick-box approaches to teaching. However good an essay is, if it doesn’t tick a particular box, it can’t get a particular grade. However bad an essay is, if it ticks all the boxes, it gets a top grade. It is much harder than you might think to construct the tick boxes so that good essays are never penalised and bad essays are never rewarded. I’ve written about this problem before, here. This approach penalises ambitious and original writers. For example, if a pupil knows that to achieve a certain grade they have to spell every word correctly and can’t misspell one, then the tactical thing to do is to only use very basic words. Similarly with sentence structure, punctuation, grammar, etc. Thus, the pupil who writes a simple, boring but accurate story does better than the pupil who writes an interesting, thoughtful and ambitious story with a couple of errors. Teachers realise this is what is happening and adapt their teaching in response, focussing not just on the basics, but also, more damagingly, on actively eliminating anything that is even slightly more ambitious than the basics.

Regular readers might be surprised to hear me say this, since I have always made a point of the importance of accuracy, and of the importance of pupils learning to walk before they can run. I have also been very enthusiastic about mastery approaches to learning. So is this me changing my mind? No. I still think accuracy is extremely important, and that it enables creativity rather than stifling it. I still also think that mastery curriculums are the best type. My issue is not about the curriculum and teaching, but about assessment. Open tasks like extended writing are not the best way to assess accuracy or mastery. This is because, crucially, open tasks do not ask pupils to do the same thing in the same way. They introduce an element of discretion into the task. Pupil 1 might spell a lot of words wrong in her extended task, but that might be because she has attempted to use more difficult words. Pupil 2 might spell hardly any words wrong, but she may have used much easier words. Had you asked Pupil 2 to spell the words that Pupil 1 got wrong, she might not have been able to. So she isn’t a better speller, but she is credited as such. If you insist on marking open tasks in a secure fit way, this becomes a serious problem, as it leads to huge distortions of both assessment and teaching. Essentially, what you are doing is giving the candidate discretion to respond to the task in different ways, but denying the marker similar discretion. Better spellers are marked as though they are weaker spellers, because they have essentially set themselves a higher standard thanks to having a more ambitious vocabulary. Lessons focus on how to avoid difficult words rather than use them, how to avoid the apostrophe rather than use it correctly. If the secure fit approach to marking open tasks really did reward accuracy, I would be in favour of it. But it doesn’t reward accuracy. It rewards gaming. Michael Tidd’s great article in the TES here shows exactly how this process works. James Bowen also has an excellent article here looking at some of these problems.

It is obviously important that pupils can spell correctly. But open tasks are not the best way of assessing this. The best and fairest way of checking that pupils can spell correctly is to give them a test on just that. If all pupils are asked to spell the same set of words in the same conditions, you can genuinely tell who the better spellers are, and the test will also have a positive impact on teaching and learning as the only way to do well on the test is to learn the spellings.
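Here is a toy illustration of that argument, with made-up pupils and words: the same ‘no misspellings’ secure-fit rule rewards the pupil who takes no risks with vocabulary, whereas a common spelling test puts both pupils on the same footing.

```python
# What each pupil wrote in the open task: {intended word: what they actually wrote}.
pupil_1 = {"conscientious": "concientious",   # ambitious word, misspelled
           "silhouette": "silhouette",
           "manoeuvre": "manouvre",            # ambitious word, misspelled
           "rhythm": "rhythm",
           "liaison": "liaison"}
pupil_2 = {"cat": "cat", "dog": "dog", "run": "run", "big": "big", "sad": "sad"}

def meets_secure_fit_spelling(writing):
    """Secure fit: a single misspelling means the spelling statement is not met."""
    return all(written == intended for intended, written in writing.items())

print("Open task, secure fit - Pupil 1 passes:", meets_secure_fit_spelling(pupil_1))
print("Open task, secure fit - Pupil 2 passes:", meets_secure_fit_spelling(pupil_2))

# On a common spelling test both pupils attempt the SAME ambitious words,
# so the scores are genuinely comparable (illustrative outcomes).
common_test_scores = {"Pupil 1": 3 / 5, "Pupil 2": 1 / 5}
print("Common spelling test:", common_test_scores)
```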

One final point: the problems with ‘best fit’ judgments that I outlined at the start actually had less to do with ‘best fit’ and more to do with (drum roll) vague prose descriptors. The fundamental problem is getting any set of descriptors, with whatever kind of ‘fit’, to adequately represent quality.

The prose descriptors allowed pupils to be overmarked on particularly vague areas like AF1 – write imaginative texts – when in actual fact they were probably not doing all that well in those areas. I don’t think there are hundreds of thousands of pupils out there who write wonderfully imaginatively but can’t construct a sentence, or vice versa. It’s precisely because accuracy enables creativity that there aren’t millions of pupils out there struggling with the mechanics of writing but producing masterpieces nonetheless. I have made this point again and again with reference to evidence from cognitive psychology, but let me now give you a recent piece of assessment evidence that appears to point the same way. We have held quite a few comparative judgment sessions at Ark primary schools. You can read more about comparative judgment here, but essentially, it relies on best fit comparisons of pupils’ work, rather than ticks against criteria. You rely on groups of teachers making independent, almost instinctive judgments about which is the better piece of work. At the start of one of our comparative judgment sessions, one of the teachers said to me that he didn’t think we would get agreement at the end because we would all have different ideas of what good writing was. For him, good writing was all about creativity, and he was prepared to overlook flaws in technical accuracy in favour of really imaginative and creative writing. OK, I said, for today, I will judge as though the only thing that matters is technical accuracy. I will look solely for that, and disregard all else. At the end of the judging session, we both had a high level of agreement with the rest of the judging group. This is of course just one small data point, but as I say, I think it supports something which has been very well evidenced in cognitive psychology. The high level of agreement between all teachers at this comparative judgment session (and at all the others we have run) also shows us that judging writing, and even judging creativity, are perhaps not as subjective as we might think. It is not the judgments themselves that are subjective, but the prose descriptors we have created to rationalise the judgments.

Similarly, if the problem with best fit judgments wasn’t actually the best fit, but the prose descriptors, then keeping the descriptors but moving to secure fit judgments won’t solve the fundamental problem. And again, we have some evidence that this is the case too. Michael Tidd has collected new writing framework results from hundreds of schools nationally. The results are, in Michael’s words, ‘erratic’. They don’t follow any kind of typical or expected pattern, and they don’t even correlate with schools’ previous results.  Whatever the precise reason for this, it is surely some evidence that introducing a secure fit model based on prose descriptors is not going to solve our concerns around the validity and reliability of writing judgments.

In conclusion

  • If you want to assess a specific and precise concept and ensure that pupils have learned it to mastery, test that concept itself in the most specific and precise way possible and mark for mastery – expect pupils to get 90% or 100% of questions correct (see the short sketch after this list).
  • If you want to assess performance on more open, real world tasks where pupils have significant discretion in how they respond to the task, you cannot mark it in a ‘secure fit’ or ‘mastery’ way without risking serious distortions of both assessment accuracy and teaching quality. You have to mark it in a ‘best fit’ way. If the pupil has discretion in how they respond, so should the marker.
  • Prose descriptors will be inaccurate and distort teaching whether they are used in a best fit or secure fit way. To avoid these inaccuracies and distortions, use something like comparative judgment which allows for performance on open tasks to be assessed in a best fit way without prose descriptors.
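Finally, here is a minimal sketch of the ‘mark for mastery’ rule from the first bullet. The 90% threshold comes from that bullet; the quiz itself is made up.

```python
def has_mastered(question_results, threshold=0.9):
    """question_results: list of True/False, one per question on a focused quiz."""
    return sum(question_results) / len(question_results) >= threshold

# A ten-question quiz on one specific, precise concept (illustrative).
fractions_quiz = [True] * 9 + [False]          # 9 out of 10 correct
print(has_mastered(fractions_quiz))            # True: meets the 90% mastery bar
print(has_mastered([True] * 7 + [False] * 3))  # False: 70% is not mastery
```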