Scaling Up What Works: Experimental Evidence on External Validity in Kenyan Education

Centre for Global Development Working Paper 321 3/27/13 Tessa Bold, Mwangi Kimenyi, Germano Mwabu, Alice Ng’ang’a, and Justin Sandefur
Available as pdf


The recent wave of randomized trials in development economics has provoked criticisms regarding external validity. We investigate two concerns—heterogeneity across beneficiaries and implementers—in a randomized trial of contract teachers in Kenyan schools. The intervention, previously shown to raise test scores in NGO-led trials in Western Kenya and parts of India, was replicated across all Kenyan provinces by an NGO and the government. Strong effects of shortterm contracts produced in controlled experimental settings are lost in weak public institutions: NGO implementation produces a positive effect on test scores across diverse contexts, while government implementation yields zero effect. The data suggests that the stark contrast in success between the government and NGO arm can be traced back to implementation constraints and political economy forces put in motion as the program went to scale.

Rick Davies comment: This study attends to two of the concerns I have raised in a  recent blog (My two particular problems with RCTs) – (a) the neglect of important internal variations in performance arising from a focus on average treatment effects, (b) the neglect of the causal role of contextual factors (the institutional setting in this case) which happens when the context is in effect treated as an externality.

It reinforces my view of the importance of a configurational view of causation.  This kind of analysis should be within the reach of experimental studies as well as methods like QCA. For years agricultural scientists have devised and used factorial designs (albeit using fewer factors than the number of conditions found in most QCA studies)

On this subject I came across this relevant quote from R A Fisher: “

If the investigator confines his attention to any single factor we may infer either that he is the unfortunate victim of a doctrinaire theory as to how experimentation should proceed, or that the time, material or equipment at his disposal is too limited to allow him to give attention to more than one aspect of his problem…..

…. Indeed in a wide class of cases (by using factorial designs) an experimental investigation, at the same time as it is made more comprehensive, may also be made more efficient if by more efficient we mean that more knowledge and a higher degree of precision are obtainable by the same number of observations.”

And also, from Wikipedia, another Fisher quote:

“No aphorism is more frequently repeated in connection with field trials, than that we must ask Nature few questions, or, ideally, one question, at a time. The writer is convinced that this view is wholly mistaken.”

And also

The precarious nature of knowledge – a lesson that we have not yet learned?

Is medical science built on shaky foundations? by Elizabeth Iorns New Scientist article (15 Sept 2012).

The following text is relevant to the debate about the usefullness of randomised control trials (RCTs)  in assessing the impact of development aid initiatives. RCTs are an essential part of medical science research, but they are by no means the only research methods used. The article continues…

“More than half of biomedical findings cannot be reproduced – we urgently need a way to ensure that discoveries are properly checked

REPRODUCIBILITY is the cornerstone of science. What we hold as definitive scientific fact has been tested over and over again. Even when a fact has been tested in this way, it may still be superseded by new knowledge. Newtonian mechanics became a special case of Einstein’s general relativity; molecular biology’s mantra “one gene, one protein” became a special case of DNA transcription and translation.

One goal of scientific publication is to share results in enough detail to allow other research teams to reproduce them and build on them. However, many recent reports have raised the alarm that a shocking amount of the published literature in fields ranging from cancer biology to psychology is not reproducible.

Pharmaceuticals company Bayer, for example, recently revealed that it fails to replicate about two-thirds of published studies identifying possible drug targets (Nature Reviews Drug Discovery, vol 10, p 712).

Bayer’s rival Amgen reported an even higher rate of failure – over the past  decade its oncology and haematology researchers could not replicate 47 of 53 highly promising results they examined (Nature, vol 483, p 531). Because drug companies scour the scientific literature for promising leads, this is a good way to estimate how much biomedical research cannot be replicated. The answer: the majority” (read the rest of the article here)

See also: Should Deworming Policies in the Developing World be Reconsidered? The sceptical findings of a systematic review of the impact of de-worming initiatives in schools. De-worming has been one of the methods found effective via RCTs, and widely publicised as an example of how RCTs can really find out what works. The quote below is from Paul Garner’s comments on the systematic review. The same web page also has rejoinders to Garner’s comments, which are also worth reading.

“The Cochrane review on community programmes to deworm children of intestinal helminths has just been updated. We want people to read it, particularly those with an influence on policy, because it is important to understand the evidence, but the message is pretty clear. For the community studies where you treat all school children (which is what WHO advocates) there were some older studies which show an effect on weight gain after a single dose of deworming medicine; but for the most part, the effects on weight, haemoglobin, cognition, school attendance, and school performance are  either absent, small, or not statistically significant. We also found some surprises: a trial published in the British Medical Journal reported that deworming led to better weight gain in a trial of more than 27,000 children, but in fact the statistical test was wrong and in reality the trial did not detect a difference. We found a trial that examined school performance in 2659 children in Vietnam  that did not demonstrate a difference on cognition or weight that has never been published even though it was completed in 2006. We also note that a trial of 1 million children from India, which measured mortality and data collection completed in 2004, has never been published. This challenges the principles of scientific integrity. However, I heard within the last week that the authors do intend to get the results into the public domain-which is where it belongs.

We want to see powerful interventions that help people out of poverty, but they need to work, otherwise we are wasting everyone’s time and money. Deworming schoolchildren to rid them of intestinal helminths seems a good idea in theory, but the evidence for it just doesn’t stack up. We want policy makers to look at the evidence and the message and consider if deworming is as good as it is cracked up to be.”

Taylor-Robinson et al. “Deworming drugs for soil-transmitted intestinal worms in children: effects on nutritional indicators, haemoglobin and school performance” Cochrane Database of Systematic Reviews 2012.

See also: Truth decay: The half-life of facts ,by Samuel Arbesman, New Scientist, 19 September 2012

IN DENTAL school, my grandfather was taught the number of chromosomes in a human cell. But there was a problem. Biologists had visualised the nuclei of human cells in 1912 and counted 48 chromosomes, and it was duly entered into the textbooks studied by my grandfather. In 1953, the prominent cell biologist Leo Sachs even said that “the diploid chromosome number of 48 in man can now be considered as an established fact”.

Then in 1956, Joe Hin Tjio and Albert Levan tried a new technique for looking at cells. They counted over and over until they were certain they could not be wrong. When they announced their result, other researchers remarked that they had counted the same, but figured they must have made a mistake. Tjio and Levan had counted only 46 chromosomes, and they were right.

Science has always been about getting closer to the truth, …

See also book by the same author “The Half-Life of Facts: Why Everything We Know Has an Expiration Date on Amazon. Published October 2012

See also: Why Most Biomedical Findings Echoed by Newspapers Turn out to be False: the Case of Attention Deficit Hyperactivity Disorder by François Gonon, Jan-Pieter Konsman, David Cohen and Thomas Boraud, Plos One, 2012

Summary: Newspapers biased toward reporting early studies that may later be refuted : 7 of top 10 ADHD studies covered by media later attenuated or refuted without much attention

Newspaper coverage of biomedical research leans heavily toward reports of initial findings, which are frequently attenuated or refuted by later studies, leading to disproportionate media coverage of potentially misleading early results, according to a report published Sep. 12 in the open access journal PLOS ONE.

The researchers, led by Francois Gonon of the University of Bordeaux, used ADHD (attention deficit hyperactivity disorder) as a test case and identified 47 scientific research papers published during the 1990’s on the topic that were covered by 347 newspaper articles. Of the top 10 articles covered by the media, they found that 7 were initial studies. All 7 were either refuted or strongly attenuated by later research, but these later studies received much less media attention than the earlier papers. Only one out of the 57 newspaper articles echoing on these subsequent studies mentioned that the corresponding initial finding has been attenuated. The authors write that, if this phenomenon is generalizable to other health topics, it likely causes a great deal of distortion in health science communication.

See alsoThe drugs dont work – a modern medical scandal. The doctors prescribing them don’t know that. Nor do their patients. The manufacturers know full well, but they’re not telling”  by Ben Goldacre, he Guardian Weekend, 22 September 2012 p21-29

Excerpt: “In 2010, researchers from Harvard and Toronto found all the trials looking at five major classes of drug – antidepressants, ulcer drugs and so on – then measured two key features: were they positive, and were they funded by industry? They found more than 500 trials in total: 85% of the industry-funded studies were positive, but only 50% of the government-funded trials were. In 2007, researchers looked at every published trial that set out to explore the benefits of a statin. These cholesterol-lowering drugs reduce your risk of having a heart attack and are prescribed in very large quantities. This study found 192 trials in total, either comparing one statin against another, or comparing a statin against a different kind of treatment. They found that industry-funded trials were 20 times more likely to give results favouring the test drug.

These are frightening results, but they come from individual studies. So let’s consider systematic reviews into this area. In 2003, two were published. They took all the studies ever published that looked at whether industry funding is associated with pro-industry results, and both found that industry-funded trials were, overall, about four times more likely to report positive results. A further review in 2007 looked at the new studies in the intervening four years: it found 20 more pieces of work, and all but two showed that industry-sponsored trials were more likely to report flattering results.

It turns out that this pattern persists even when you move away from published academic papers and look instead at trial reports from academic conferences. James Fries and Eswar Krishnan, at the Stanford University School of Medicine in California, studied all the research abstracts presented at the 2001 American College of Rheumatology meetings which reported any kind of trial and acknowledged industry sponsorship, in order to find out what proportion had results that favoured the sponsor’s drug.”

The results section is a single, simple and – I like to imagine – fairly passive-aggressive sentence: “The results from every randomised controlled trial (45 out of 45) favoured the drug of the sponsor.”

Read more in Ben Goldacre’s new bookBad Pharma: How drug companies mislead doctors and harm patients” published in Sept 2012

See also Reflections on bias and complexity May 29, 2012 by Ben Ramalingam, which talks about a paper in Nature, May 2012 by Daniel Sarewitz, titled “Beware the creeping cracks of bias: Evidence is mounting that research is riddled with systematic errors. Left unchecked, this could erode public trust…”