I’ve recently finished reading an excellent book about assessment by Daniel Koretz, a professor of education at Harvard University. It’s called Measuring Up: What Educational Testing Really Tells Us. It is very readable and clarifies a number of tricky issues about assessment. It’s yet another book which you would struggle to find on any English teacher training course but which I think provides answers to the questions a lot of teachers ask all the time.
As well as being interesting and useful for teachers, Koretz’s book should also be required reading for policymakers in the UK and US. Koretz is American and the book largely focuses on the American context. But it is startling to see just how similar the misuses of tests have been in both countries, and how similar the important issues are. Startling, and also quite depressing – policymakers and educationalists on both sides of the Atlantic have made almost exactly the same mistakes.
In the next few blog posts, I am going to summarise the main things I learnt from reading this book.
How useful are tests?
It is not difficult to find things that have gone wrong with the way tests are used in this country. Indeed, it might be harder to find things that are right about our assessment system. Teaching to the test, focussing resources on C/D borderline pupils, constant resits, the pressure of league tables, the variation between different exam boards – the list goes on. Given the pernicious effects of these developments, it is not hard to see why many people would want the abolition of league tables at the very least, and why some would call for the abolition of exams entirely. The debate often polarises around these two extremes: one side sees any criticism of the exam system as an attempt to evade accountability, and the other sees any defence of exams as being opposed to the true aims of education.
Koretz makes a number of arguments that will be very uncomfortable for policymakers, and tells a few anecdotes about his meetings with education policymakers and the errors and false assumptions many of them make. He argues that tests have been misused, that high-stakes tests have created perverse incentives, and that some tests are inimical to the true aims of education. He denies the possibility of an optimal test, and rejects the idea that one measure can ever tell us all we need to know about education. But he is also equally firm that tests can provide us with useful information. He is critical of those who dismiss tests as devices for ‘creating winners and losers’, pointing out rather that tests merely reveal winners and losers. In a sense, his is a kind of Nixon-to-China position – he is able to criticise the ways assessments have been used in such strong terms because his background in devising assessments means that no-one can doubt his commitment to them as providers of useful information. And indeed, Koretz does make it clear that ‘careful testing can in fact give us tremendously valuable information about student achievement that we would otherwise lack and it does rest on several generations of accumulated scientific research and development.’
Koretz himself uses the Nixon-to-China analogy to describe a fellow assessment expert, E.F. Lindquist. Lindquist devised many of the major American assessments, but also wrote a paper outlining the limitations such assessments had. So what are the strengths and limitations of tests? Tests can never directly measure what we want to measure. The ultimate aim of education is for our pupils to apply their skills in real-life contexts long after their education has finished. But this is exceptionally difficult to measure. We can’t have researchers following adults around to see whether they apply algebra in everyday life or whether they read 19th century novels before going to sleep at night. Even if you could, you would have great difficulty compiling a scale to measure this kind of performance. And of course if you waited this long to make an assessment then you wouldn’t be able to use the information from it to improve instruction. So an assessment can never be a direct measure of the aims of education. (An article here notes that this is an important difference between test scores and the types of measures often used to evaluate hospitals, such as patient survival rates. ‘Patient survival is not an indicator of the desired end-state, it is the desired end-state for heart surgery.’ By contrast, test scores are only indicators of the desired end-state.) In Koretz’s words:
Test scores usually do not provide a direct and complete measure of educational achievement. Rather they are incomplete measures, proxies for the more comprehensive measures that we would ideally use but that are generally unavailable to us.
For E.F. Lindquist, this led to certain conclusions about the structure of tests: 1) tests should focus on measuring what pupils have learnt from the curriculum, as this is the clearest proxy for educational achievement; 2) the point of a test is to elicit the behaviour you want to measure; 3) you need to standardise the test so that it is the same for everyone; 4) tests should isolate specific knowledge and skills, because if they don’t, you won’t know what caused the failure or success on the test; 5) given these limitations, we should not rely solely on test scores but should draw on other sources of information about a pupil’s achievement. Koretz suggests these other sources might include the kind of information colleges and universities require for admissions – teacher assessments, personal statements, persistence in extra-curricular activities, and so on.
Koretz notes that some of these conclusions are controversial. Point 4), that tests should isolate specific knowledge and skills, has certainly come under attack. A lot of modern tests don’t follow this principle, and instead aim to embed the use of knowledge and skills in more authentic, ‘real-world’ contexts. I’ll discuss this further in a later post.
In a few recent posts I’ve talked about the difficulties with using criteria as the sole reference point for exams. I think these difficulties can be seen very clearly with national curriculum levels, which are used as a method of criterion referencing internal assessments.
National curriculum levels are going to be abolished, but it is likely that many schools will continue to use them for their own internal assessments. One of the reasons many people like NC levels is that they provide a ‘shared language’ or a ‘common framework’ for what pupils can and can’t do. In this post, Chris Husbands runs through a lot of the problems that NC levels had, but concludes that they were still worth retaining because of this common language. He argues that national curriculum levels were valuable because ‘they provided a national standard…The loss of a common national framework – something which international visitors have generally admired – is a big price to pay.’ The NFER make a similar point here, criticising levels but lamenting the loss of a shared framework.
In this post I want to argue that in fact national curriculum levels did quite a poor job of providing a common language, and that a large part of the reason for this is that the way they were used for internal assessments relied very heavily on unstandardised criterion referencing.
In this post I spoke about some of the general difficulties with criterion-referenced exams. In this one I looked at how this was particularly applicable to English creative writing assessments. A common response to this is to say that creative writing assessments are particularly difficult to mark because creative writing judgments are inherently subjective. There is some truth to this, of course, and in fact when I was first training, I did think that it would be possible to criterion-reference maths and science questions much more accurately. But actually, the same problems afflict maths and science. Tim Oates again:
Even a well-crafted statement of what you need to get an A grade can be loaded with subjectivity – even in subjects such as science. It’s genuinely hard to know how difficult a specific exam is.
For a very good and specific example of this, here’s Paul Bambrick-Santoyo in Driven by Data.
To illustrate this, take a basic standard from middle school math:
Understand and use ratios, proportions and percents in a variety of situations.
To understand why a standard like this one creates difficulties, consider the following premise. Six different teachers could each define one of the following six questions as a valid attempt to assess the standard of percent of a number. Each could argue that the chosen assessment question is aligned to the state standard and is an adequate measure of student mastery:
Identify 50% of 20.
Identify 67% of 81.
Shawn got 7 correct answers out of 10 possible answers on his science test. What percent of questions did he get correct?
J.J. Redick was on pace to set an NCAA record in career free throw percentage. Leading into the NCAA tournament in 2004, he made 97 of 104 free throw attempts. What percentage of free throws did he make?
J.J. Redick was on pace to set an NCAA record in career free throw percentage. Leading into the NCAA tournament in 2004, he made 97 of 104 free throw attempts. In his first tournament game, Redick missed his first five free throws. How far did his percentage drop from before the tournament game to right after missing those free throws?
J.J. Redick and Chris Paul were competing for the best free-throw shooting percentage. Redick made 94% of his first 103 shots, while Paul made 47 out of 51 shots.
a) Which one had a better shooting percentage?
b) In the next game, Redick made only 2 of 10 shots while Paul made 7 of 10 shots. What are their new overall shooting percentages?
c) Who is the better shooter?
d) Jason argued that if Paul and J.J. each made their next ten shots, their shooting percentages would go up the same amount. Is this true? Why or why not?
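The later J.J. Redick questions involve some genuinely fiddly arithmetic, so here is a quick sketch in Python – my own illustration, nothing from Driven by Data – working through them. It makes concrete just how far apart these six ‘valid’ questions sit in difficulty:

```python
# A quick check of the arithmetic in the J.J. Redick questions above.
# Note: '94% of his first 103 shots' is taken at face value, so Redick's
# implied number of makes (96.82) is not a whole number.

def pct(made, attempts):
    """Free-throw percentage, expressed out of 100."""
    return 100 * made / attempts

# The percentage-drop question: before the tournament, then after five misses.
before = pct(97, 104)       # ~93.3%
after = pct(97, 104 + 5)    # ~89.0%
print(f"Drop: {before - after:.1f} percentage points")  # ~4.3

# Part a) of the final question: who went in with the better percentage?
redick_made, redick_att = 0.94 * 103, 103
paul_made, paul_att = 47, 51
print(pct(redick_made, redick_att), pct(paul_made, paul_att))  # 94.0 vs ~92.2

# Part b): update both players after the next game.
redick_made, redick_att = redick_made + 2, redick_att + 10
paul_made, paul_att = paul_made + 7, paul_att + 10

# Part d): ten more makes each does NOT raise the two percentages equally,
# because the players have different numbers of attempts.
for name, made, att in [("Redick", redick_made, redick_att),
                        ("Paul", paul_made, paul_att)]:
    gain = pct(made + 10, att + 10) - pct(made, att)
    print(f"{name}: +{gain:.2f} percentage points")  # ~+1.0 vs ~+1.6
```

A pupil who can identify 50% of 20 and a pupil who can reason through part d) are clearly at very different points, yet on this standard both can be said to have understood and used percents in a variety of situations.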
Although the maths level descriptors in England’s national curriculum are slightly more detailed, this point still holds. The level 5 descriptor for number and algebra is ‘pupils calculate fractional or percentage parts of quantities and measurements, using a calculator where appropriate.’ Level 6 is ‘pupils evaluate one number as a fraction or percentage of another. They understand and use the equivalences between fractions, decimals and percentages, and calculate using ratios in appropriate situations.’
Even though the levels give slightly more detail than Bambrick-Santoyo’s example of the New Jersey standard, his criticisms still hold. Using those level descriptors, you would classify both these questions as being of equal difficulty: ‘What is 50% of 100?’ and ‘What is 87% of 437?’
Whether allowed access to a calculator or not, most pupils would get the first question right. Not nearly as many would get the second question right. Yet both of these, according to the levels, count as level 5 questions.
This poses a particular problem for the creators of national exams. They have to create exams that have different questions, but are of comparable difficulty. The criteria, or standards, or level descriptors, are meant to help guide them, but it is fiendishly difficult to do. As I said in my previous blog post:
Examiners are in an extremely difficult situation. Year after year, they have to create tests of comparable difficulty to those of the year before that are nevertheless sufficiently different from those of the year before. If they make the test close in structure and format to that of the year before, then it will be easier to compare across years, but it will also be easier for the succeeding year to do well at the test. If they depart more significantly from the structure and format of the year before, they ensure that the succeeding year don’t gain an unfair advantage, but they also make it much harder to compare between the two years.
Chris Husbands makes a very similar point here. He shows that in order to make consistent and comparable judgments, you need shared tasks. Individual questions are not judged as being easy or hard in reference to abstract criteria, but in reference to how pupils have performed on them. This is easy for international tests such as PISA to do, but much harder for national curriculum tests (my italics).
The aspiration is to hold the difficulty of the test constant over time, so that children with similar attributes do equally well in any year. It is not too difficult for PISA to achieve this – questions are kept private so that some can be re-used and the difficulty of new ones scaled against them, whilst the test is administered only every three years to a sample. But it will be impossible to keep KS2 questions private as teachers administer the tests, and they will do so to all pupils in all schools. If questions are not re-used then it will be difficult genuinely to scale the test each year to secure consistency. But if questions are re-used it will be difficult to make the test sufficiently different each year to avoid a repeat of the gradual grade improvement as teachers learn what is expected.
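It is worth making concrete how re-using private questions allows scaling of this kind. Here is a toy sketch in Python of anchor-item equating in its very simplest form – my own illustration, and deliberately simplified: operational programmes such as PISA use item response theory models, not a crude mean shift:

```python
# A toy sketch of anchor-item equating. The anchor items are identical
# across years, so any change in anchor performance reflects a change in
# cohort ability; whatever change in total scores is NOT explained by the
# anchor is attributed to the new test form being easier or harder.
from statistics import mean

def equate(old_totals, old_anchors, new_totals, new_anchors, score):
    """Adjust a score on this year's test onto last year's scale."""
    ability_change = mean(new_anchors) - mean(old_anchors)
    total_change = mean(new_totals) - mean(old_totals)
    # A positive value means this year's test was easier than last year's.
    difficulty_change = total_change - ability_change
    return score - difficulty_change

# Hypothetical cohorts: equal ability on the anchor items, but totals are
# 5 points higher this year - so the new test was 5 points easier, and a
# raw 65 this year is equivalent to a 60 last year.
old_anchors, old_totals = [9, 10, 11], [58, 60, 62]
new_anchors, new_totals = [9, 10, 11], [63, 65, 67]
print(equate(old_totals, old_anchors, new_totals, new_anchors, 65))  # 60.0
```

Without the anchor items, there is no way to tell whether this year’s higher raw scores reflect a stronger cohort or an easier paper – which is exactly the bind the KS2 test-setters are in once every question has to be made public.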
Bambrick-Santoyo concludes from his analysis of standards that ‘standards are meaningless until you define how you will assess them.’ I would agree, and I would even go further. Yes, standards are meaningless if left undefined. But they give the illusion of meaning, which makes them very confusing. If two people use the same word to refer to something different, there is huge potential for confusion.
It’s at this point that I would depart from Chris Husbands. Although he accepts in the paragraph above that you can’t have comparable standards without norming, he then goes on to argue that national curriculum levels – which are essentially criteria – could provide comparable standards. But I don’t think levels did give us this common national framework. They gave us the illusion of one. One person was talking about a level 4 and thinking it meant ‘Identify 50% of 20’, whereas another person was talking about the same descriptor but thinking it meant ‘Identify 67% of 81’. In English, one person was defining ‘convincing’ persuasion in one way, and another person was defining it in quite another. One teacher might set an interim termly task asking pupils to infer from a short unseen poem, another might set a task asking pupils to infer from a novel they’d studied in class. The result was not shared standards, but the illusion of shared standards. It was often noted that a primary school level 4 was very different from a secondary school level 4. But I think the problem went wider than that: even between schools, and between teachers within a school, there was variation in what these levels really meant. The fact is, if you want a shared language, you need shared content and tasks. If you keep the shared language but get rid of the shared content and tasks, you don’t actually have a shared language any more. It’s as if we were all using the word ‘cat’ and pronouncing it in the same way, but some of us were using it to refer to zebras, some to lions, some to dogs and some to domestic cats.
Where I would agree with Chris Husbands is that this was more of a problem at KS3 than at KS2. That is because the freedom for individual teachers and schools to select content was in practice greater at KS3, and also because when KS3 tests were abolished, they took with them the sole reference point for shared content in that key stage.
We don’t get a shared language through abstract criteria. We get a shared language through teaching shared content, doing shared tasks, and sharing pupil work with colleagues. Tom Sherrington’s post here gives some idea as to how this might work.
The first issue is the slippery nature of standards. As Tim Oates says:
Even a well-crafted statement of what you need to get an A grade can be loaded with subjectivity – even in subjects such as science. It’s genuinely hard to know how difficult a specific exam is.
It’s genuinely hard to set exams based on criteria, and it is also hard to mark exams using criteria alone. Most of the exam criteria for essay-based questions suffer from what I call the ‘adverb problem’: often, the difference between the top few bands is the difference of an adverb – getting a band 3 may require you to write ‘well’; getting a band 4 requires you to write ‘fluently’; and so on. Here’s an extract from the WJEC exam board criteria for English.
Band 3: there is some use of devices to achieve particular effects
Band 4: devices to achieve particular effects are used consciously and effectively
Band 3: plot and characterisation are convincingly sustained
Band 4: plot and characterisation are effectively constructed and sustained
This last one is particularly interesting as I would think that ‘convincingly’ is better than ‘effectively’, but that’s by the by. The point is that these adverbs are extremely vague and their meaning lies in the eye of the beholder. The only way we can give these criteria any kind of concrete and specific meaning is by referring to actual samples of pupil work – which is of course what we all do on internal and external moderation days. Criteria are only given meaning by reference to individual pupil performance. That is, criteria are given meaning by norms.
Let me give another example of this. When I was first marking a set of creative writing coursework tasks, I came across one that I felt deserved the top band. One of the criteria for the top band was something like ‘appropriate and controlled variation of sentence structure’. This pupil’s sentence structure was fairly (there is another adverb for you) well controlled and appropriate, but there were moments where it wasn’t completely secure. Did it deserve the top band? I asked my mentor, and she told me what an exam moderator had said to her: that when awarding the top band, you have to take into account what it is realistic for a 15-year-old to achieve. I have no idea if this really is official exam board advice, but it seemed reasonable enough. On that basis, I gave it the top band. Yes, the piece of work was not flawless, but on the basis of what is realistic for 15-year-olds to achieve, it deserved the top band. (Of course, this does raise the question of how a teacher is supposed to know what is realistic for 15-year-olds to achieve – in this case, I had my mentor, who had experience of marking hundreds of exam scripts, backing up my judgment. In general, this is a reason why it is important for all teachers to have a good idea of the whole of the attainment range, and it is why exam moderation meetings often involve looking at sample scripts from across the range. Tom Sherrington makes a good case for why and how we should do more of this here.) The point is that here, we were giving criteria meaning by reference to norms. Suppose, for the sake of argument, that my norm in this case was not what it is realistic for 15-year-olds to achieve but what it is possible for Booker Prize winners to achieve. Clearly, I also expect them to vary their sentences in an appropriate and controlled way. That is, the same criterion applies to them. But if Booker Prize winners were the norm, I’d interpret that criterion in a completely different way. I could apply the same criterion to KS2 pupils, where I would again interpret it in a different way.
When judging performance on complex tasks, we don’t just want to see if the pupil can do it or not – we want to make a judgment about how well they can do it. You can judge whether a pupil is capable of doing something in isolation from the performance of others. You can judge a pupil very clearly against the criterion ‘is able to write a creative story’: either they can or they can’t, and you can judge that in isolation from other pupils. But we want exams to tell us more than that. For more complex tasks, performance isn’t pass/fail – it’s on a continuum. Most pupils can write a creative story – we want the exam to give us some idea of how well they do it. And that judgment is inevitably bound up with the performance of others. Criteria can help guide that judgment, but they can’t do the whole job on their own. The key point here is not different adverbs, but the comparative and superlative forms of the same adverb – not effectively, convincingly, fluently, but well, better, best.
In a previous post I looked at the vagueness of the 2007 national curriculum and some of the problems this posed for teachers. Often, when I discuss this with people, they say one of two things: first, that the curriculum wasn’t vague on content because it believed content shouldn’t be taught – it was vague on content in order to allow teachers to teach the content they thought appropriate; secondly, that in reality the exam specifications provided the detail that the curriculum left out – so that whilst the curriculum might have been vague, the exam specs filled in the gaps. I’ll discuss each point in turn.
To take the first point – that the curriculum freed teachers up to choose the knowledge they thought was appropriate. In actual fact the curriculum was not as neutral as it liked to pretend. Firstly, the curriculum guidance made it very clear that there was ‘less prescribed subject content’ in order to allow for ‘more of a focus on key concepts’. Friends of mine who attended information sessions about this curriculum say that this point was stressed even more at those meetings. Secondly, whilst the curriculum did not specify content, at KS1, 2 & 3 it did specify a particular method of assessment – national curriculum levels. These levels were also vague and couched in the language of generic skills. Thus, the curriculum did not really free up teachers to teach the knowledge they thought appropriate. It wanted teachers to reduce subject content and focus on the key concepts and skills outlined in the curriculum and the levels. In suggesting that reducing content would enhance conceptual understanding, it implied that high-level skills could somehow be worked on and acquired in the abstract. But that is not how the brain works.
The second point: it is true that in the absence of a curriculum providing detail on content, the exam will provide some of that detail instead. However, whilst exam boards can provide high-level specificity, they cannot provide – nor should they be asked to provide – finer detail than this. The exam board sets the high-level task – say, write an essay. But it is not the job of the exam board to specify the detailed knowledge that goes into making pupils able to do that task. Likewise, an exam board can say that essays will be marked for spelling, and they might even give some examples of what they mean by tricky spellings and spellings that they expect everyone to get right. But they do not provide a year-by-year list of spellings they expect pupils to know, broken down by difficulty. They expect pupils to be able to read an unseen non-fiction text and make inferences about it. But they do not – nor should they – specify the range of vocabulary, idioms and background knowledge that you need to do well on unseen reading comprehensions. So the exams add some detail to the curriculum, but they don’t – nor should they – add enough detail. And it is also worth pointing out that when exams were abolished at the end of key stage 3, even this small element of specificity was lost.
So the provision of detail was largely left to the individual teacher or school department, who had two main sources of guidance: the national curriculum and its levels, and the exam specifications. As we’ve seen, the national curriculum encouraged – and, in the case of levels, forced – teachers to reduce subject content in an attempt to focus on key concepts. And the exam specs provided high-level detail about the types of tasks pupils should be able to do, but not the comprehensive and sequenced detail necessary to excel at such tasks.
The result was – and to a large extent still is – that the curriculum became endless repetitions of exam tasks. This was because a) the past papers and exam specs were the only detail teachers were given, and b) the national curriculum promoted the idea that the kinds of skills exam papers look for were not dependent on an underpinning body of knowledge and could be acquired in the abstract.
Here is a specific example of this. Take a look at the KS3 English programme of study from the 2007 curriculum. There is a section on key processes which is essentially a statement of aims – by the end of this course, pupils should be able to interpret information, infer and deduce meanings, and so on. But that isn’t really a curriculum – it’s a statement of aims. As for how to achieve those aims, here is the specific content for reading from the KS4 curriculum:
Level 8 for reading is as follows: ‘Pupils’ responses show their appreciation of, and ability to comment on, a range of texts, and they evaluate how authors achieve their effects through the use of linguistic, structural and presentational devices. They select and analyse information and ideas, and comment on how these are conveyed in different texts. They explore some of the ways in which texts from different times and cultures have influenced literature and society.’ Nothing in the curriculum gives you a guide that will help you meet these criteria. There is no mention in the curriculum of the specific and detailed knowledge your pupils must have if they want to achieve level 8. The Core Knowledge Curriculum does provide this content – it has lists of vocabulary, word roots, sayings and phrases, and specified texts. It also makes it clear that the knowledge pupils learn in other lessons will contribute to their reading ability. It is an extremely useful document. It will help your pupils to achieve, as opposed to just describing what achievement looks like. The 2007 national curriculum and the accompanying levels are like Molière’s doctor, explaining that opium causes sleepiness because of its sleep-inducing properties.
The English GCSE exam gives you a bit more detail, but only a bit. Whichever exam board you pick, one of the final exam tasks is to read an unseen text and answer some questions on it. This is of course a perfectly reasonable and meaningful task. But again, it doesn’t give you the specific and detailed knowledge you need to succeed at that task. Taken together, the 2007 national curriculum, the NC levels and the terminal exams provide teachers with no detail about what to teach, and a lot of unhelpful – and in some cases plain wrong – guidance about creating that detail themselves.
All in all, this is a classic example of how flawed theory transmits itself into practice. The errors and flawed assumptions at the heart of the 2007 national curriculum have at every stage frustrated efforts to create good classroom practice.
A couple of weeks ago the Greater London Authority published a fascinating resource for anyone interested in London schools.
It’s the GLA London Schools Atlas. The GLA have overlaid a map of London with tons of data about all the capital’s state schools. You can see every school and click on it to see where it draws its pupils from. This gives you a lot of interesting information – for example, whether schools draw their pupils from local wards or from further away. You can also see that schools on the boundaries of boroughs often draw their pupils from neighbouring boroughs.
You can also switch off this school view and switch on a filter that shows you how many pupils in each area go to state-funded schools. In some areas, only 12-25% of pupils go to state-funded schools. And you can overlay the map with the GLA’s projections about pupil places.
Anyone interested in London schools will find this interesting, but I think anyone interested in London will love it too. About six or seven years ago the British Library put on a wonderful exhibition of London maps. They had reproductions of the old Booth maps of London, which were coloured according to wealth and poverty. The Booth maps of the late 19th century and John Snow’s cholera map of 1854 are both wonderful examples of how maps can convey a great deal of information in an intuitive and compelling way. The London Schools Atlas is firmly in this tradition. It is fascinating to be able to compare across time – to compare the private school data on this map with the Booth wealth map, for example.
GLA map coloured to show areas with higher private school attendance
Extract from Charles Booth’s poverty map of London (from here)
Of course, the GLA today have the advantage over Booth and Snow in that GIS makes this kind of thing much easier to do, and makes it far easier for you to interact with the data and personalise it for your needs. But I think it is fair to say that when you look at this amazing Schools Atlas, you are looking at a concept that was pioneered in those same London streets 150 years ago.
I was very sad to hear today that Doris Lessing had died. Doris Lessing has been my favourite author since I first read Mara and Dann when I was 15. I think The Golden Notebook is probably my favourite, but I am also very keen on the Children of Violence series, The Grass is Singing and The Sweetest Dream. The two volumes of collected African short stories are also completely brilliant. I taught some of the African short stories to year 11 classes. I think that ‘Little Tembi’ and ‘No Witchcraft for Sale’ went down best – I would definitely recommend these to any English teachers. (An extract from ‘Sunrise on the Veld’ was used in a WJEC exam paper in about 2007, I think.) For me, the great thing about Lessing’s writing was its unbelievable and sometimes quite frightening psychological insight, but she was also extremely acute about the politics and ideologies of the 20th century. She was a kind of combination of George Eliot and George Orwell.
I saw Lessing speak at Warwick in my first term there in 2004. I will always remember one thing she said in reply to a question about what she wished she could have known when she was younger. I was going to write it up from memory, but what do you know, there is a recording of it on the Warwick website. I did have a slight panic that I might have misremembered it and taken away completely the wrong point – but no, it is just as I had remembered it – in fact, if anything, it is even better than I remembered. Here it is, roughly typed up by me, with my emphasis in bold.
Questioner (I think this is Jeremy Treglown): The question is, is there anything, any wisdom she would like to impart to a younger generation, things she’d wished she’d known when she was in her twenties.
Lessing: The most important one is…an old person said to me when I was very young, this person said – don’t forget, this is the late 30s and 40s – what you have to remember is that it is what individuals do that is important, not a great power. What I was seeing when I looked out at the world was Hitler’s Germany was going to last for a thousand years, Mussolini’s Italy was going to last for a thousand years, the Soviet Union which was going to last for ever by definition, the British Empire, which showed no signs at that time of passing, all the European powerful empires, the racist structure: all these things seemed permanent and they’ve all gone as if they never were. If anyone had said to me aged 17, 18, 20, that all this is going to disappear as if it never was, I would have thought they were mad. So now when I look out to see certain fairly terrifying great powers, I think, hang on, how long are you going to last? And I offer this thought to you when you might be overpowered by these great structures which are much less powerful than you think they are. They are not monolithic and strong and all of one substance. They are riven, broken up in conflict and can disappear overnight, believe me.
I blogged recently about the difficulty of finding the right assessment system. Afterwards, in a throwaway remark on Twitter, I said that finding the right assessment format was a bit like finding the right system for exchange rates – fiendishly difficult because you want one system to serve a multitude of purposes.
There are other ways in which I think economics and assessment face similar problems. Goodhart’s Law was formulated by Charles Goodhart, a former Bank of England adviser, who observed that ‘as soon as the government attempts to regulate any particular set of financial assets, these become unreliable as indicators of economic trends.’ Extrapolating from that, you get the more general form of Goodhart’s Law: when a measure becomes a target, it loses all value as a measure.
Exactly the same has happened with the 5 A*-C including English and Maths measure. Once upon a time it probably was a fairly reliable measure of how well a school was doing. Since it became the government’s favourite target, it has become much less reliable as a measure. Schools have attempted to game the target using GCSE equivalencies, switching between exam boards, and focussing their attention on C/D borderline pupils.
Campbell’s Law expresses the same point in the field of social science. It was formulated by the US social scientist Donald T. Campbell and has been used in the US to explain the problems there with high-stakes testing. Campbell’s formulation is: ‘The more any quantitative social indicator (or even some qualitative indicator) is used for social decision-making, the more subject it will be to corruption pressures and the more apt it will be to distort and corrupt the social processes it is intended to monitor.’
It’s for this reason that I think accountability systems should use more than one measure. I suggested a possible model for this a while back in this post and in this one. In the comments on the first post, Mr Chas made the following wonderful point:
Once you get to 4 or five measures, as the writer here says, you can ‘cheat’ on maybe one, but then as soon as you do it kicks the others way out of kilter. Like trying to squeeze into jeans a size too small. The fat has to go SOMEWHERE, it just pops up in a different place!
Of course, since we wrote that, the government have adopted something similar – four key measures of accountability rather than one, plus an unspecified ‘destination measure’ looking at what pupils do after they leave school. I think this has to be an improvement on the previous system.