“Well, that didn’t work. Let’s do it again.”

11 November 2014

Suppose you toss a coin and it comes up heads. Do you conclude that it is a double-headed coin? No, you don’t. Suppose it comes up heads twice, and then a third time. Do you now conclude the coin is double-headed? Again, no you don’t. There is a one in eight chance (12.5 per cent) that a coin will come up heads three times in a row. So, though it is not that likely, it can and does happen.

So, if an impact evaluation finds that an intervention doesn’t work, should we discard that intervention? No, we shouldn’t. We should do it again. Our study is based on a sample, so there is a probability attached to the study findings. More specifically, the power of the study is the probability that we correctly conclude that a successful intervention is working (‘reject the null’). Power is typically set to be 80 per cent. That means that 20 per cent of the time we wrongly conclude that successful programmes don’t work.

Actually, it is worse than this. The true power for many impact evaluations is only around 50 per cent. So, if a programme is working, an underpowered study is no better than tossing a coin for finding that fact out! This is a rather distressing state of affairs. But it can be addressed in three ways: (1) realistic power calculations, (2) external replication, and (3) meta-analysis.

My colleagues, Ben and Eric, recently blogged on the importance of performing and reporting power calculations. And I would emphasise one of their points: it is crucial to have realistic assumptions for these calculations. That is frequently not the case. A main culprit is setting too large a minimum effect size – the larger you set this effect, the smaller the sample you need to detect it. But if the actual effect is smaller, then your study is underpowered. One reason this happens is that researchers believe project staff’s inflated views of programme impact, e.g. a 50 per cent increase in income (Really? Please include me in that project). So, if you use 50 per cent as your minimum effect but the true impact is ‘only’ 15 per cent, you have a greatly reduced chance of detecting it. A second factor is that researchers ignore the funnel of attrition. Far fewer people take part in the intervention than expected, so estimates of the treatment on the treated effect will be underpowered. This is why 3ie requires proposals it considers for funding to have detailed and well-grounded power calculations.
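To make the cost of an over-optimistic minimum effect size concrete, here is a rough sketch of the arithmetic. It assumes a simple two-arm trial analysed with a two-sided z-test at the 5 per cent level, and it treats the 50 per cent and 15 per cent impacts as standardised effect sizes of 0.50 and 0.15. That mapping, and the function names, are illustrative assumptions only, not a description of how any particular study was designed.

```python
from math import ceil, sqrt
from statistics import NormalDist

_norm = NormalDist()  # standard normal distribution


def sample_size_per_arm(effect, alpha=0.05, power=0.80):
    """Approximate sample size per arm for a two-arm trial.

    `effect` is the minimum detectable effect in standard deviation units
    (an illustrative assumption; real calculations also need clustering,
    attrition and take-up).
    """
    z_alpha = _norm.inv_cdf(1 - alpha / 2)
    z_beta = _norm.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_beta) / effect) ** 2)


def achieved_power(true_effect, n_per_arm, alpha=0.05):
    """Approximate power when the true effect differs from the planned one.

    Ignores the negligible chance of rejecting in the wrong direction.
    """
    z_alpha = _norm.inv_cdf(1 - alpha / 2)
    return _norm.cdf(true_effect * sqrt(n_per_arm / 2) - z_alpha)


# Sample size planned around the optimistic effect (0.50 SD)
n_planned = sample_size_per_arm(0.50)

# Power of that same study if the true effect is only 0.15 SD
power_if_small = achieved_power(0.15, n_planned)

# Sample size that would have been needed for the smaller effect
n_needed = sample_size_per_arm(0.15)

print(n_planned, round(power_if_small, 2), n_needed)
```

On these assumptions, a study sized for the optimistic effect needs only about 63 people per arm, but its power falls to roughly 13 per cent if the true effect is the smaller one. Detecting that smaller effect with 80 per cent power would take about 698 per arm.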

Second, as I hope is clear by now, false negatives are likely to be very common. Just because a study finds no significant impact doesn’t mean the intervention doesn’t work. To improve internal validity, one approach is to develop a theory of change, which may show an obvious reason why an intervention failed, as in the case of the recent randomised controlled trial that found that textbooks don’t affect learning outcomes if they are not given to the students! But for external validity, the answer is to do it again! This is external replication: trying the same programme, usually in a different place. But actually, doing it in the same place is scientifically more sound.

But I don’t mean you should just keep doing it again and again until you have one study that finds an impact and then say, “Ah ha, so it does work. All the other results are false negatives.” ‘Goal scoring’, that is counting how many studies find a significant impact and how many don’t, is simply an incorrect way of summarising these data. Of course findings will be mixed since they are based on a sample and not the entire population.

But meta-analysis turns the confused signal of ‘mixed findings’ from multiple studies into a clear signal: the intervention works or it does not. The findings from all the studies are pooled to get one overall estimate of the impact of the programme. Pooling draws on the power of the combined sample across studies to give a more precise impact estimate. It can even turn out that three poorly powered studies individually find no effect but, once they are combined in a meta-analysis, a positive effect is found.
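The point about three weak studies adding up to one clear finding can be sketched with a simple fixed-effect (inverse-variance) meta-analysis. The three effect estimates and standard errors below are made up purely for illustration, and the fixed-effect model assumes all studies estimate the same underlying effect. Where programmes are run in very different settings, a random-effects model may be the more appropriate choice.

```python
from math import sqrt
from statistics import NormalDist


def z_and_p(estimate, se):
    """Return the z statistic and two-sided p-value for one estimate."""
    z = estimate / se
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p


def pool_fixed_effect(studies):
    """Inverse-variance weighted average of (estimate, standard error) pairs."""
    weights = [1 / se ** 2 for _, se in studies]
    total = sum(weights)
    pooled = sum(w * est for w, (est, _) in zip(weights, studies)) / total
    pooled_se = sqrt(1 / total)
    return pooled, pooled_se


# Hypothetical studies: (effect estimate, standard error)
studies = [(0.20, 0.12), (0.15, 0.10), (0.25, 0.14)]

# Each study on its own falls short of significance at the 5 per cent level
for est, se in studies:
    z, p = z_and_p(est, se)
    print(f"study: estimate={est:.2f}  z={z:.2f}  p={p:.3f}")

# Pooled, the combined sample gives a much more precise estimate
pooled, pooled_se = pool_fixed_effect(studies)
z, p = z_and_p(pooled, pooled_se)
print(f"pooled: estimate={pooled:.3f}  se={pooled_se:.3f}  z={z:.2f}  p={p:.3f}")
```

With these illustrative numbers, none of the three studies reaches z = 1.96 on its own, yet the pooled estimate of about 0.19 has a z statistic of about 2.8.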

So, think about power and get it right. If something doesn’t work, try it again. And then take all the results and conduct a meta-analysis. Evidence can improve and even save lives. But if evidence is misused, it is just a waste of money.

“Well, that didn’t work. Let’s do it again.”
