It is the assessment follow-up to my first book, Seven Myths about Education, which was about education more generally. In Seven Myths about Education, I argued that a set of flawed ideas had become dominant in education even though there was little evidence to back them up. Broadly speaking, I argued that knowledge and teacher-led instruction had been given an undeserved bad reputation, and that the research evidence showed that knowledge, practice and direct instruction were more likely to lead to success than discovery and project-based learning.
The hardest questions I had to answer about the book were from people who really liked these ideas, and wanted to know how they could create an assessment system which supported them. Certain kinds of activities, lessons and assessment tasks simply didn’t work with national curriculum levels. For example, discrete grammar lessons, vocabulary quizzes, multiple choice questions, and historical narratives were hard, if not impossible, to assess using national curriculum levels. Many schools required every lesson, or every few lessons, to end with an activity which gave pupils a level: e.g., at the end of this lesson, to be a level 4a, you need to have done x, to be a 5c, you need to have done y, to be a 5b, you need to have done z. This type of lesson structure had become so dominant as to feel completely natural and inevitable. But actually, it was the product of a specific set of questionable beliefs about assessment, and it imposed huge restrictions on what you could teach. In short, the assessment system was exerting a damaging influence on the curriculum, and that influence was all the more damaging for being practically invisible.
Over the last four years, in my work at Ark Schools, I have been lucky enough to have the time to think about these issues in depth, and to work on them with some great colleagues. Making Good Progress is a summary of what I have learnt in that time. It isn’t a manual about one particular assessment system. But it does contain all the research and ideas I wish I had known about when I first started thinking about this. In the next seven blog posts, I will briefly summarise some of the ideas it contains.
This was the fifth national Research Ed conference, and in my mind they’ve started becoming a bit like FA Cup Finals or Christmas – recurring events that start to blur into one. “Oh, South Hampstead – was that the one where Ben Riley from Deans for Impact visited and it all kicked off about grammars?” “No, that was Capital City 2016 – South Hampstead 2015 was the one where Eric Kalenze visited and where James Murphy taught us the Maori word for green.” Etc. Looking back at my notes from 2013, I find that Ben Goldacre warned then against the ‘energy-zappers’ who will criticise everything you do – too true.
The title of my talk was: Improving assessment: the key to education reform.
As ever, it is inspiring to meet so many people who are so committed to, and excited about, the cause of research in education, and to be able to talk and share ideas with them. I always come away from these conferences with my mind buzzing with new ideas. Research Ed has only been around for four years, but I cannot imagine the world of education without it. Here’s to many more brilliant conferences.
In the previous few posts, I’ve looked at the workload generated by traditional English mock marking, and at its low reliability, and I’ve suggested that comparative judgement can produce more reliable results and take less time. However, one question I frequently get about comparative judgement is: what about the feedback? Traditional marking may be time-consuming, but it often results in pupils getting personalised comments on their work. Surely this makes it all worthwhile? And beyond a grade, what kind of feedback can comparative judgement give you? This post is a response to those questions.
First, there’s a limit to the amount of formative feedback you can get from any summative assessment. That’s because summative assessments are not designed with formative feedback in mind: they are instead designed to give an accurate grade. So for the most useful kind of formative feedback, I think you need to set non-exam tasks. I write about this more in Making Good Progress.
Still, whilst formative feedback from summative assessments is limited, it does exist. When you read a set of exam scripts, there are obviously insights you’ll want to share back with your pupils, and similarly it’s always helpful to read examiners’ reports to get an idea of the misconceptions pupils commonly make. I think we need to do fewer mock exams, because their usefulness is limited, but clearly when we do do them, we want to get whatever use we can from them.
So what is the best way for a teacher to give feedback on mock performance? The dominant method at the minute seems to be written comments at the bottom of an exam script. This is extraordinarily time-consuming, as we’ve documented here, and as other bloggers have noted here, here and here. What I want to suggest in this post is that these kinds of comments are also very unhelpful. Dylan Wiliam sums up why perfectly:
‘I remember talking to a middle school student who was looking at the feedback his teacher had given him on a science assignment. The teacher had written, “You need to be more systematic in planning your scientific inquiries.” I asked the student what that meant to him, and he said, “I don’t know. If I knew how to be more systematic, I would have been more systematic the first time.” This kind of feedback is accurate — it is describing what needs to happen — but it is not helpful because the learner does not know how to use the feedback to improve. It is rather like telling an unsuccessful comedian to be funnier — accurate, but not particularly helpful, advice.’
This might seem like a funny and slightly flippant comment, but actually it expresses a profound philosophical point put forward in the work of philosophers such as Michael Polanyi and Thomas Kuhn, which is that words are not always that good at explaining new concepts to novices. Often, part of what a novice needs to learn is what some of these words like ‘systematic’, or, to use an example from Kuhn, ‘energy’, really mean. If pupils don’t know what these words really mean, they can get stuck in a circular loop, similar to the one you might have experienced as a child when you didn’t know the meaning of a word, so you looked it up in a dictionary, only to find you didn’t know any of the words in that definition, so you looked those up, only to find that you didn’t understand the words in those definitions, and so forth…
Much more helpful than written comments are actions: things that a pupil has to do next in order to improve their performance. These do not have to be individual to every pupil, and they do not have to be laboriously written at the bottom of every script. They can be communicated verbally in the next lesson, and they can be acted on in that lesson too.
How does all this fit in with comparative judgement? One objection people have to comparative judgement is that whilst it may give an accurate grade, it doesn’t give pupils a comment at the bottom of their script. We’ve heard of a couple of schools that, after judging a set of scripts, have then required staff to go back and write comments on the scripts too. This is totally unnecessary and unhelpful! Instead, we’d recommend combining comparative judgement with whole-class marking. Whole-class marking is a concept I first came across on blogs by Joe Kirby and Jo Facer at Michaela Community School. Instead of writing comments on a set of books, you can jot down the feedback you want to give on a single piece of paper. You can formalise this a bit more by developing a one-page marking proforma, which gives you a structure to record your insights as you mark or judge a set of scripts, and to help you plan a lesson in response to the scripts. Here’s an example we’ve put together based on some year 7 narrative writing. The parts in red are the parts that involve teacher and/or pupil actions.
Caveat: this is written out far more neatly and coherently than is necessary — we’ve only done this to illustrate how it works. These proformas can be much more messy, as in Toby French’s example here. What’s important is the thought process they support, and the record they will provide over time of actions and improvements. In short, combining comparative judgement with one-page marking proformas will drastically reduce the time it takes to mark a set of scripts, and will give your pupils far more useful feedback than a series of written comments.
Our aim with our Progress to GCSE English project is to use tools like the one above to allow schools to replace traditional mock marking with comparative judgement. We ran our first training days in July, and will be running more in the autumn term. To find out more, sign up to our mailing list here. Our primary project, Sharing Standards, takes a similar approach, and you can read more about it here.
Most of the answers were in the range of 10–30 minutes. People also pointed out that the time it took to mark mocks varied depending on whether you wrote lengthy comments at the bottom of each script or not.
My own experience of marking the old spec GCSE English Language papers was that it took me about 15 minutes to mark each paper, which included some fairly brief comments. I also found it difficult to mark for more than about 90 minutes / 2 hours in one go, and if I did try and mark for longer than that, I would get slower and need to take more frequent breaks.
If we take 15 minutes, therefore, as a relatively conservative estimate, that means that if you teach 28 pupils, it will take you 7 hours to mark those scripts. That doesn’t include any moderation. If we assume a 90 minute moderation session for each mock, plus 90 minutes to go back and apply the insights from moderation, that means we are looking at a total of 10 hours.
That’s for one English Language paper. There are two English Language papers, and two English Literature papers. So if you want pupils to do a complete set of English mocks, that’s a total of 40 hours of marking for the teacher.
With the old specification which included a lot of coursework, I think most English teachers spent the bulk of year 10 teaching and marking coursework essays, and didn’t get on to doing mocks until year 11. I was really pleased when coursework was abolished as I felt it would free up so much more time for teachers to plan and teach, instead of mark and administer coursework. However, it does appear as though a lot of this gained time has now been replaced with equally time-consuming mock marking, with mocks being introduced more and more in year 10. Many schools have three assessment points a year. If you were to do two mock papers three times a year in both year 10 and 11, then a teacher who taught one year 10 class and one year 11 class would spend 120 hours of the year marking GCSE mocks. That’s three normal working weeks, or nearly 10% of the contracted 1,265 annual hours of directed time.
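To make the arithmetic above easy to check, here it is as a short calculation. All the figures are the estimates used in this post, not measured data:

```python
# The mock-marking workload arithmetic from this post, as a checkable calculation.
# All figures are the estimates used above, not measured data.

MINUTES_PER_SCRIPT = 15
CLASS_SIZE = 28
MODERATION_HOURS = 1.5       # moderation session per mock paper
APPLY_INSIGHTS_HOURS = 1.5   # going back to apply the insights from moderation

marking_hours = MINUTES_PER_SCRIPT * CLASS_SIZE / 60
hours_per_paper = marking_hours + MODERATION_HOURS + APPLY_INSIGHTS_HOURS

PAPERS_PER_MOCK_SET = 4      # 2 Language papers + 2 Literature papers
full_mock_set_hours = hours_per_paper * PAPERS_PER_MOCK_SET

# Two mock papers, three assessment points a year, one Y10 and one Y11 class:
sittings_per_year = 2 * 3 * 2
annual_hours = hours_per_paper * sittings_per_year
DIRECTED_TIME = 1265         # contracted annual hours of directed time

print(f"{marking_hours:.0f}h marking, {hours_per_paper:.0f}h per paper, "
      f"{full_mock_set_hours:.0f}h per full mock set")
print(f"{annual_hours:.0f}h a year = {annual_hours / DIRECTED_TIME:.1%} of directed time")
```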
In our first No More Marking Progress to GCSE English training days last week, we looked at how schools could use comparative judgement to reduce the amount of time it took to mark an English mock paper. The exact amount of time it takes to judge a set of scripts using comparative judgement will depend on the ratio of English teachers to pupils in your school. But we think that at worst, using comparative judgement will halve the amount of time it takes to grade a set of GCSE English papers; that is, it will take 5 hours instead of 10. The best case scenario is that we can get it down to 2 hours. That includes built-in moderation, as well as time to discuss the results with your department and prepare whole-class formative feedback. You can read more about the pilot, and how to sign up for it, here.
Of course, workload is not the only issue we should consider when looking at planning assessment calendars and marking policies. At No More Marking, we like to evaluate the effectiveness of an assessment by looking at these three things.
Efficiency – what impact does the assessment have on workload?
Reliability – is the assessment consistent?
Validity – does the assessment allow us to make helpful inferences about pupils, and does it help pupils and teachers to improve?
In future blog posts, we’ll consider how reliable and valid traditional mock marking is. But for now, it’s clear that on the measure of efficiency, traditional mock marking doesn’t do that well.
Here’s a quick guide to some of my life after levels blog posts from the last five years.
It was definitely a good thing to abolish levels. As I argued here, here and here, they didn’t give us a shared language. Instead, they provided us with the illusion of a common language, which is actually very misleading. This is because they were based on prose performance descriptors, which can be interpreted in many different ways. Unfortunately, many replacements for NC levels were based around the same flawed prose descriptor model.
If prose descriptors don’t work, what does? One good idea is to define your standards really clearly as questions. For example, instead of saying ‘Pupils can compare fractions to see which is larger’, actually ask them ‘what’s bigger: 4/7 or 6/7? 2/3 or 3/4? 5/7 or 5/9?’ And don’t expect that if they get one of those questions right, they will get them all right!
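Incidentally, the three comparisons above each test a different strategy (same denominator, different everything, same numerator), which is exactly why one right answer doesn’t guarantee the others. For the record, here they are checked with exact rational arithmetic:

```python
# The three fraction comparisons, checked with exact rational arithmetic.
from fractions import Fraction

questions = [
    (Fraction(4, 7), Fraction(6, 7)),  # same denominator: compare numerators
    (Fraction(2, 3), Fraction(3, 4)),  # different both: needs a common denominator
    (Fraction(5, 7), Fraction(5, 9)),  # same numerator: smaller denominator wins
]

for a, b in questions:
    print(f"{a} vs {b}: {max(a, b)} is larger")
```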
This works well for maths, but what about things like essays? How do you mark those without a descriptor or a rubric? Another great idea is to use comparative judgement. I first wrote about this back in November 2015. It is basically the most exciting thing to happen to assessment ever. I am so excited about it that I am going to work for No More Marking who provide an online comparative judgement engine. If you haven’t read about it already, do! You can also watch this video of me talking about one of our pilot projects at Research Ed in 2016.
In February 2017, Oxford University Press published my own book on assessment, Making Good Progress?: The Future of Assessment for Learning. You can read more about it here. At the Wellington Festival of Education in 2016, I gave a talk which summarised the book’s thesis – you can see the video of this here.
I think the abolition of levels has given teachers the chance to take control of assessment, and has sparked debate, discussion and innovation around assessment which has been hugely valuable. Of course, things still aren’t perfect. National primary assessment has had a number of setbacks, and there are still lots of examples of ‘new’ assessment systems which are essentially rehashed levels. But overall I am really excited, both about the work that has happened in the last five years, and the potential for even further improvements in the next few years.
The primary interim frameworks are now in their second year, and their inconsistencies have been well-documented. Education Datalab have shown that last year there were inconsistencies between local authorities, while more recently the TES published an article revealing that many writing moderators were unable to correctly assess specimen portfolios. Here are five ways to help deal with the uncertainty.
1. Look outside your school or network
Teachers are great judges of their pupils’ work, but find it much harder to place those judgements on a national scale. So wherever possible, try to get exposure to work outside your school to get a clearer idea of where the national standard is.
2. Use what we know about results last year
The interim frameworks were used for the first time last year and, as noted, there are plenty of inconsistencies in how they were applied. However, we do now know that last year, nationally, 74% of pupils were awarded EXS+ in writing, and 15% GDS. This compares to 66% and 19% respectively in reading.
3. Check your greater depth (especially if you’re a school in a disadvantaged area)
There is particular evidence that greater depth is being applied inconsistently, and that schools with below average attainment overall are reluctant to award greater depth.
4. Remember that all achievement is on a continuum
Like all grades, ‘greater depth’ and ‘expected standard’ are just arbitrary lines. A pupil who just scrapes ‘expected standard’ actually has more in common with a pupil at the top of ‘working towards’ than they do with a pupil at the top of ‘expected standard’. Not everyone in the same grade will have exactly the same profile, and sometimes the differences between pupils getting the same grade will be greater than the differences between pupils with different grades.
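To illustrate, here is a minimal sketch with invented scaled scores and thresholds (the real KS2 scores and cut-offs differ):

```python
# Invented scaled scores and thresholds, purely to illustrate the point;
# the real KS2 score ranges and cut-offs differ.
EXPECTED, GREATER_DEPTH = 100, 120

def grade(score):
    if score >= GREATER_DEPTH:
        return "greater depth"
    if score >= EXPECTED:
        return "expected standard"
    return "working towards"

# One point apart, but different grades...
assert grade(99) != grade(100)
# ...while 19 points apart can mean the same grade.
assert grade(100) == grade(119)
```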
5. Use the Sharing Standards results
In March, 199 schools and 8,512 pupils took part in Sharing Standards: a trial using comparative judgement to assess Year 6 writing. The results are available here, together with exemplar portfolios. The results offer all four of the benefits above: they involve teacher judgement from across the country; they use information from last year’s results to set this year’s standard; this means they avoid the problem of school-level bias; and they allow you to see the distribution of scripts, not just the grade.
Some people have expressed surprise at the quality of the work at the greater depth threshold. But as we’ve seen, there is no national agreement about what greater depth is. It is true that the comparative judgement process does not use the interim frameworks, but it does have the same intention: to support professionals in assessing writing quality. In our follow-up survey with schools, 98% of the respondents said they are planning to use their results in their moderation process as they felt the results supported their internal assessment of writing standards. The Sharing Standards results are the only nationally standardised scale of Key Stage 2 writing, so it can’t hurt to take a look and see how thousands of pupils nationally are doing.
Last week I had a dream that I was explaining the new GCSE number grades to a class of year 11s. No matter how many times I explained it, they kept saying ‘so 1 is the top grade, right miss? And 3 is a good pass? And if I get 25 marks I am guaranteed a grade 3?’
Here are the four and a half things I think you need to know about the new GCSE number grades:
ONE: The new grading system will provide more information than the old one
When I taught in the 6th form, I felt that there were lots of pupils who had received the same grade in their English GCSE but who nevertheless coped very differently with the academic challenge of A-level. There are lots of reasons for this, but I think one is that grades C and B in particular are awarded to so many pupils. Nearly 30% of pupils receive a grade C in English and Maths, and there are clearly big differences between a pupil at the top of that grade and one at the bottom. With the new system, it looks as though the most common grade will be a 4, which only about 20% of pupils will get. With the old letter system, things had got a bit lop-sided: half the grades available were used to distinguish the top two-thirds of candidates. In the new system, two-thirds of the available grades will be awarded to the top two-thirds of candidates, which is fairer, provides more information, and will help 6th forms and employers distinguish between candidates.
TWO: We don’t know what the grade boundaries will be.
Even with an established specification, it is really hard to predict in advance the relative difficulty of different questions, which is why grade boundaries can never be set in advance. This is even more the case with a new specification. We just don’t know how many marks will be needed to get a certain grade.
THREE: We do know roughly what the grade distribution will be like
Whilst we don’t know the number of marks needed to get a certain grade, we do know how many pupils will get a grade 4 and above (70%), and how many will get a grade 7 and above (16% in English, 20% in Maths). The new 4 grade is linked to the old C grade, and the new 7 to the old A. I’ve heard some people say that the new standards are a ‘complete unknown’. This isn’t the case. We know a lot about where the new standards will be, and this approach lets us know a lot more than other approaches which could have been taken (see below).
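A rough sketch of how this kind of statistical linking works, under one big simplifying assumption: that boundaries are set purely so that fixed proportions of candidates reach each grade. (The real awarding process also uses prior-attainment data and examiner judgement; the mark distribution here is invented.)

```python
# Sketch: grade boundaries chosen so that fixed proportions of candidates
# reach each grade, whatever the raw marks look like. Marks are invented.
import random

random.seed(0)
marks = [random.gauss(40, 12) for _ in range(1000)]  # hypothetical raw marks

def boundary(marks, proportion_at_or_above):
    """Lowest mark that roughly `proportion_at_or_above` of candidates reach."""
    ranked = sorted(marks, reverse=True)
    return ranked[int(len(ranked) * proportion_at_or_above) - 1]

grade_4_boundary = boundary(marks, 0.70)  # ~70% get a grade 4 or above
grade_7_boundary = boundary(marks, 0.16)  # ~16% get a grade 7 or above (English)

share_4_plus = sum(m >= grade_4_boundary for m in marks) / len(marks)
print(f"grade 4 boundary: {grade_4_boundary:.0f} marks; {share_4_plus:.0%} at 4+")
```

The point the sketch makes is the one in the post: however hard the paper turns out to be, the boundaries move so that the same proportion of candidates reach each grade.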
FOUR: There’s an ‘ethical imperative’ behind this process
The ‘ethical imperative’ is the idea that no pupil will be disadvantaged by the fact that they were the first to take these new exams (see pages 16-17 here). That’s why Ofqual have created a link between the last year of letter grades and the first year of number grades. Suppose these new specs really are so fiendishly hard that all the pupils struggle dramatically on them. 70% of pupils will still get a grade 4+. They are not going to be disadvantaged by the introduction of new and harder exams.
AND A HALF: Secondary teachers: if you don’t like this approach, just talk to a primary colleague about what they went through last year!
At Ark, I’ve been involved with the changes to Sats that happened last year, and the changes to GCSE grading that are happening this year. There was no ‘ethical imperative’ at primary last year, meaning we didn’t know until the results were published what the standard would be. Whereas we know in advance with the new GCSE that about 70% of pupils will get a 4 or above, at primary we were left wondering if 80% would pass, if 60% would, or if 20% would! We didn’t have a clue! In the event, the standard for reading fell sharply compared to previous years. Not only did this lead to a very stressful year for primary teachers, it also means that it is extremely hard to compare results year on year from before and after 2016. One might argue that this matters less at primary, as pupils do not take the results with them through life or get compared to pupils from previous years. But of course, the results of schools are compared over time, and a great deal depends on these comparisons. So I think an ethical imperative would have been welcome at primary too, and that the new GCSE grades have been designed in the fairest possible way for both schools and pupils.
In July, I will be leaving my role at Ark Schools to work for No More Marking as Director of Education.
Over the last 6 months, No More Marking have been working with primary schools in England on a pilot of comparative judgement for year 6 writing called Sharing Standards. Comparative judgement is a quick and reliable method of marking open tasks like essays and stories. The easiest way to understand it is to try out the demo on the No More Marking website, but you can also read my explanation of it on this blog here.
The results of this pilot were published last Tuesday, and you can read the full report here.
Overall, 199 schools participated in the pilot, and a total of 8,512 writing portfolios were judged. 1,649 teachers in those schools did the judging, and the reliability of their judgements was 0.84. This allowed for the creation of a measurement scale featuring every portfolio, and then for the application of a national gradeset: Working Towards, Expected Standard and Greater Depth. The overview report on the No More Marking website features exemplars of the portfolios at each threshold. Here’s a piece from the portfolio that was judged as the best.
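For anyone curious how thousands of pairwise judgements turn into a single measurement scale, here is a minimal sketch using a simple Bradley-Terry model. The No More Marking engine is more sophisticated than this, and the scripts and judgements below are invented; this just shows the principle that "which is better?" decisions can be fitted to a scale:

```python
# A minimal Bradley-Terry sketch: each judgement records which of two
# scripts a teacher preferred, and we fit a scale value per script.
# Scripts and judgements are invented, purely for illustration.
import math

judgements = [  # (winner, loser) pairs
    ("A", "B"), ("A", "C"), ("B", "C"), ("A", "B"),
    ("B", "C"), ("A", "C"), ("C", "B"), ("A", "B"),
]

scripts = sorted({s for pair in judgements for s in pair})
ability = {s: 0.0 for s in scripts}  # one scale value per script

# Crude gradient ascent on the Bradley-Terry log-likelihood.
for _ in range(500):
    grad = {s: 0.0 for s in scripts}
    for winner, loser in judgements:
        p_win = 1 / (1 + math.exp(ability[loser] - ability[winner]))
        grad[winner] += 1 - p_win
        grad[loser] -= 1 - p_win
    for s in scripts:
        ability[s] += 0.1 * grad[s]

ranked = sorted(scripts, key=ability.get, reverse=True)
print(ranked)  # script A won every comparison, so it comes out on top
```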
80% of the judgements teachers made were comparisons of pupils in their own school; 20% were comparisons of pupils from other schools. This allowed for the creation of a national scale, but it also meant that it wasn’t possible for teachers to favour pupils from their own schools, as they were never asked to compare one of their own pupils directly against a pupil from another school.
The other nice thing about this structure was that it allowed teachers to see tasks and pupil work from other schools. I particularly noted the popularity of tasks that asked pupils to write from the point of view of a character in a novel, and the variety of novels selected as the basis for this task. And in discussions with teachers afterwards, it was interesting to try and pick out the aspects that made such types of writing more or less successful. Very often it was subtle uses of syntax or vocabulary that made the difference. For example, some pupils trying to capture the voice of Bruno in ‘The Boy in the Striped Pyjamas’ would mimic Bruno’s very precise and measured sentence structure. Others would get this right, but then fall down by using modern slang terms that just didn’t ring true.
And this brings me to the most exciting next step for comparative judgement. As Jon Brunskill writes here, once you have the fascinating data set of accurately graded portfolios, you can then ask: now what? Why are some pieces of writing better than others? What aspects of writing matter, and how can we teach them? Of course, good teachers have always been doing this, but it’s also always been made harder by the way that traditional methods of marking writing lead to disagreement and disputes. If you can’t get reliable agreement on what good writing is, it’s obviously going to be much harder to teach good writing.
Take a look at the exemplar portfolios here and start this process yourself! Next year, No More Marking will be running similar national assessment windows for all primary year groups. See here for more details about how to participate.