How should we understand “clinical equipoise” when doing RCTs in development?

World Bank Blogs

 Submitted by David McKenzie on 2013/09/02

While the blog was on break over the last month, a couple of posts caught my attention, each discussing whether it is ethical to run experiments on programs that we think we know will make people better off. First up, Paul Farmer, on the Lancet Global Health blog, writes:

“What happens when people who previously did not have access are provided with the kind of health care that most of The Lancet’s readership takes for granted? Not very surprisingly, health outcomes are improved: fewer children die when they are vaccinated against preventable diseases; HIV-infected patients survive longer when they are treated with antiretroviral therapy (ART); maternal deaths decline when prenatal care is linked to caesarean sections and anti-haemorrhagic agents to address obstructed labour and its complications; and fewer malaria deaths occur, and drug-resistant strains are slower to emerge, when potent anti-malarials are used in combination rather than as monotherapy.

It has long been the case that randomized clinical trials have been held up as the gold standard of clinical research… This kind of study can only be carried out ethically if the intervention being assessed is in equipoise, meaning that the medical community is in genuine doubt about its clinical merits. It is troubling, then, that clinical trials have so dominated outcomes research when observational studies of interventions like those cited above, which are clearly not in equipoise, are discredited to the point that they are difficult to publish”

This was followed by a post by Eric Djimeu on the 3ie blog asking what else development economics should be learning from clinical trials.

Evaluating simple interventions that turn out to be not so simple

Conditional Cash Transfer (CCT) programs have been cited in the past as examples of projects that are suitable for testing via randomised controlled trials. They are relatively simple interventions that can be delivered in a standardised manner. Or so it seemed.

Last year Lant Pritchett, Salimah Samji and Jeffrey Hammer wrote this interesting (if at times difficult to read) paper, “It’s All About MeE: Using Structured Experiential Learning (‘e’) to Crawl the Design Space” (the abstract is reproduced below). In the course of that paper they argued that CCT programs are not as simple as they might seem. Looking at three real-life examples, they identified at least 10 different characteristics of CCTs that need to be specified correctly in order for them to work as expected. Some of these involve binary choices (whether to do x or y) and some involve tuning a numerical variable. This means there were at least 2^10, i.e. 1024, different possible designs. They also pointed out that while changes to some of these characteristics make only a small difference to the results achieved, others, including some binary choices, can make quite major differences. In other words, overall it may well be a rugged rather than a smooth design space. The question then arises: how well are RCTs suited to exploring such spaces?
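The scale of the problem can be made concrete with a small sketch. Assuming ten binary design choices (the choice names below are hypothetical, not Pritchett et al.'s actual list), enumerating the space shows how quickly it grows, and a toy "rugged" outcome function illustrates why flipping a single design switch can change results dramatically:

```python
from itertools import product

# Ten hypothetical binary design choices for a CCT programme
# (illustrative names only -- not Pritchett et al.'s actual list).
choices = [
    "condition_on_attendance", "pay_mother", "pay_monthly",
    "verify_compliance", "public_enrolment", "include_secondary",
    "graduated_amounts", "penalise_noncompliance",
    "community_targeting", "electronic_payment",
]

designs = list(product([0, 1], repeat=len(choices)))
print(len(designs))  # 2**10 = 1024 possible designs

# A toy "rugged" outcome surface: one binary choice interacts strongly
# with another, so flipping a single switch changes results a lot.
def outcome(design):
    base = sum(design) * 0.5                # smooth part: each feature helps a bit
    if design[0] and not design[3]:         # conditioning without verification
        base -= 4.0                         # rugged part: a large penalty
    return base

a = (1, 0, 0, 1, 0, 0, 0, 0, 0, 0)   # condition on attendance + verify compliance
b = (1, 0, 0, 0, 0, 0, 0, 0, 0, 0)   # identical, except verification flipped off
print(outcome(a) - outcome(b))       # a one-bit change shifts the outcome by 4.5
```

An RCT comparing two arms explores exactly one edge of this 1024-point space, which is why a rugged surface is hard to crawl one trial at a time.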

Today the World Bank Development Blog posted an interesting confirmation of the point made in the Pritchett et al. paper, in a posting titled Defining Conditional Cash Transfer Programs: An Unconditional Mess. They point out, in effect, that the design space is even more complicated than Pritchett et al. describe. They conclude:

So, if you’re a donor or a policymaker, it is important not to frame your question to be about the relative effectiveness of “conditional” vs. “unconditional” cash transfer programs: the line between these concepts is too blurry. It turns out that your question needs to be much more precise than that. It is better to define the feasible range of options available to you first (politically, ethically, etc.), and then go after evidence of relative effectiveness of design options along the continuum from a pure UCT to a heavy-handed CCT. Alas, that evidence is the subject of another post…

So stay tuned for their next installment. Of course, you could quibble that even this conclusion is a bit optimistic, in that it talks about a continuum of design options when in fact it is a multi-dimensional space with both smooth and rugged parts.

PS: Here is the abstract of the Pritchett paper:

“There is an inherent tension between implementing organizations—which have specific objectives and narrow missions and mandates—and executive organizations—which provide resources to multiple implementing organizations. Ministries of finance/planning/budgeting allocate across ministries and projects/programmes within ministries, development organizations allocate across sectors (and countries), foundations or philanthropies allocate across programmes/grantees. Implementing organizations typically try to do the best they can with the funds they have and attract more resources, while executive organizations have to decide what and who to fund. Monitoring and Evaluation (M&E) has always been an element of the accountability of implementing organizations to their funders. There has been a recent trend towards much greater rigor in evaluations to isolate causal impacts of projects and programmes and more ‘evidence base’ approaches to accountability and budget allocations. Here we extend the basic idea of rigorous impact evaluation—the use of a valid counter-factual to make judgments about causality—to emphasize that the techniques of impact evaluation can be directly useful to implementing organizations (as opposed to impact evaluation being seen by implementing organizations as only an external threat to their funding). We introduce structured experiential learning (which we add to M&E to get MeE) which allows implementing agencies to actively and rigorously search across alternative project designs using the monitoring data that provides real time performance information with direct feedback into the decision loops of project design and implementation. Our argument is that within-project variations in design can serve as their own counter-factual and this dramatically reduces the incremental cost of evaluation and increases the direct usefulness of evaluation to implementing agencies.
The right combination of M, e, and E provides the right space for innovation and organizational capability building while at the same time providing accountability and an evidence base for funding agencies.” Paper available as pdf

I especially like this point about within-project variation (which I have argued for in the past): “Our argument is that within-project variations in design can serve as their own counter-factual and this dramatically reduces the incremental cost of evaluation and increases the direct usefulness of evaluation to implementing agencies.”


Scaling Up What Works: Experimental Evidence on External Validity in Kenyan Education

Centre for Global Development Working Paper 321 3/27/13 Tessa Bold, Mwangi Kimenyi, Germano Mwabu, Alice Ng’ang’a, and Justin Sandefur
Available as pdf


The recent wave of randomized trials in development economics has provoked criticisms regarding external validity. We investigate two concerns—heterogeneity across beneficiaries and implementers—in a randomized trial of contract teachers in Kenyan schools. The intervention, previously shown to raise test scores in NGO-led trials in Western Kenya and parts of India, was replicated across all Kenyan provinces by an NGO and the government. Strong effects of short-term contracts produced in controlled experimental settings are lost in weak public institutions: NGO implementation produces a positive effect on test scores across diverse contexts, while government implementation yields zero effect. The data suggest that the stark contrast in success between the government and NGO arm can be traced back to implementation constraints and political economy forces put in motion as the program went to scale.

Rick Davies comment: This study attends to two of the concerns I have raised in a recent blog post (My two particular problems with RCTs): (a) the neglect of important internal variations in performance arising from a focus on average treatment effects, and (b) the neglect of the causal role of contextual factors (the institutional setting in this case), which happens when the context is in effect treated as an externality.

It reinforces my view of the importance of a configurational view of causation. This kind of analysis should be within the reach of experimental studies as well as of methods like QCA. For years agricultural scientists have devised and used factorial designs (albeit using fewer factors than the number of conditions found in most QCA studies).

On this subject I came across this relevant quote from R A Fisher:

“If the investigator confines his attention to any single factor we may infer either that he is the unfortunate victim of a doctrinaire theory as to how experimentation should proceed, or that the time, material or equipment at his disposal is too limited to allow him to give attention to more than one aspect of his problem…

… Indeed in a wide class of cases (by using factorial designs) an experimental investigation, at the same time as it is made more comprehensive, may also be made more efficient if by more efficient we mean that more knowledge and a higher degree of precision are obtainable by the same number of observations.”

And also, from Wikipedia, another Fisher quote:

“No aphorism is more frequently repeated in connection with field trials, than that we must ask Nature few questions, or, ideally, one question, at a time. The writer is convinced that this view is wholly mistaken.”
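Fisher's point can be illustrated with a minimal factorial sketch: a 2×2 design lets the same set of observations estimate two main effects and their interaction at once, rather than one factor per experiment. The outcome data below are invented purely for illustration:

```python
import statistics

# Invented outcomes for a 2x2 factorial design: two binary factors
# (A, B), with three observations per cell.
data = {
    (0, 0): [10.1, 9.8, 10.0],
    (1, 0): [12.0, 11.9, 12.2],   # factor A raises the outcome
    (0, 1): [10.5, 10.4, 10.6],   # factor B raises it a little
    (1, 1): [12.6, 12.4, 12.5],
}

def cell_mean(a, b):
    return statistics.mean(data[(a, b)])

# Main effect of A: the A contrast averaged over both levels of B.
effect_A = ((cell_mean(1, 0) - cell_mean(0, 0)) +
            (cell_mean(1, 1) - cell_mean(0, 1))) / 2
# Main effect of B: the B contrast averaged over both levels of A.
effect_B = ((cell_mean(0, 1) - cell_mean(0, 0)) +
            (cell_mean(1, 1) - cell_mean(1, 0))) / 2
# Interaction: does the effect of A depend on the level of B?
interaction = ((cell_mean(1, 1) - cell_mean(0, 1)) -
               (cell_mean(1, 0) - cell_mean(0, 0)))

print(round(effect_A, 2), round(effect_B, 2), round(interaction, 2))
```

The same twelve observations yield estimates of both main effects (and a check on their interaction), which is exactly the efficiency gain Fisher is describing.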

And also

The precarious nature of knowledge – a lesson that we have not yet learned?

Is medical science built on shaky foundations? by Elizabeth Iorns, New Scientist article (15 September 2012).

The following text is relevant to the debate about the usefulness of randomised controlled trials (RCTs) in assessing the impact of development aid initiatives. RCTs are an essential part of medical science research, but they are by no means the only research method used. The article continues…

“More than half of biomedical findings cannot be reproduced – we urgently need a way to ensure that discoveries are properly checked

REPRODUCIBILITY is the cornerstone of science. What we hold as definitive scientific fact has been tested over and over again. Even when a fact has been tested in this way, it may still be superseded by new knowledge. Newtonian mechanics became a special case of Einstein’s general relativity; molecular biology’s mantra “one gene, one protein” became a special case of DNA transcription and translation.

One goal of scientific publication is to share results in enough detail to allow other research teams to reproduce them and build on them. However, many recent reports have raised the alarm that a shocking amount of the published literature in fields ranging from cancer biology to psychology is not reproducible.

Pharmaceuticals company Bayer, for example, recently revealed that it fails to replicate about two-thirds of published studies identifying possible drug targets (Nature Reviews Drug Discovery, vol 10, p 712).

Bayer’s rival Amgen reported an even higher rate of failure – over the past decade its oncology and haematology researchers could not replicate 47 of 53 highly promising results they examined (Nature, vol 483, p 531). Because drug companies scour the scientific literature for promising leads, this is a good way to estimate how much biomedical research cannot be replicated. The answer: the majority” (read the rest of the article here)

See also: Should Deworming Policies in the Developing World be Reconsidered? The sceptical findings of a systematic review of the impact of de-worming initiatives in schools. De-worming has been one of the methods found effective via RCTs, and widely publicised as an example of how RCTs can really find out what works. The quote below is from Paul Garner’s comments on the systematic review. The same web page also has rejoinders to Garner’s comments, which are also worth reading.

“The Cochrane review on community programmes to deworm children of intestinal helminths has just been updated. We want people to read it, particularly those with an influence on policy, because it is important to understand the evidence, but the message is pretty clear. For the community studies where you treat all school children (which is what WHO advocates) there were some older studies which show an effect on weight gain after a single dose of deworming medicine; but for the most part, the effects on weight, haemoglobin, cognition, school attendance, and school performance are either absent, small, or not statistically significant. We also found some surprises: a trial published in the British Medical Journal reported that deworming led to better weight gain in a trial of more than 27,000 children, but in fact the statistical test was wrong and in reality the trial did not detect a difference. We found a trial that examined school performance in 2659 children in Vietnam that did not demonstrate a difference on cognition or weight, and that has never been published even though it was completed in 2006. We also note that a trial of 1 million children from India, which measured mortality and whose data collection was completed in 2004, has never been published. This challenges the principles of scientific integrity. However, I heard within the last week that the authors do intend to get the results into the public domain – which is where it belongs.

We want to see powerful interventions that help people out of poverty, but they need to work, otherwise we are wasting everyone’s time and money. Deworming schoolchildren to rid them of intestinal helminths seems a good idea in theory, but the evidence for it just doesn’t stack up. We want policy makers to look at the evidence and the message and consider if deworming is as good as it is cracked up to be.”

Taylor-Robinson et al. “Deworming drugs for soil-transmitted intestinal worms in children: effects on nutritional indicators, haemoglobin and school performance” Cochrane Database of Systematic Reviews 2012.

See also: Truth decay: The half-life of facts, by Samuel Arbesman, New Scientist, 19 September 2012

IN DENTAL school, my grandfather was taught the number of chromosomes in a human cell. But there was a problem. Biologists had visualised the nuclei of human cells in 1912 and counted 48 chromosomes, and it was duly entered into the textbooks studied by my grandfather. In 1953, the prominent cell biologist Leo Sachs even said that “the diploid chromosome number of 48 in man can now be considered as an established fact”.

Then in 1956, Joe Hin Tjio and Albert Levan tried a new technique for looking at cells. They counted over and over until they were certain they could not be wrong. When they announced their result, other researchers remarked that they had counted the same, but figured they must have made a mistake. Tjio and Levan had counted only 46 chromosomes, and they were right.

Science has always been about getting closer to the truth, …

See also the book by the same author, “The Half-Life of Facts: Why Everything We Know Has an Expiration Date”, on Amazon. Published October 2012.

See also: Why Most Biomedical Findings Echoed by Newspapers Turn out to be False: the Case of Attention Deficit Hyperactivity Disorder by François Gonon, Jan-Pieter Konsman, David Cohen and Thomas Boraud, Plos One, 2012

Summary: Newspapers are biased toward reporting early studies that may later be refuted: 7 of the top 10 ADHD studies covered by the media were later attenuated or refuted without much attention.

Newspaper coverage of biomedical research leans heavily toward reports of initial findings, which are frequently attenuated or refuted by later studies, leading to disproportionate media coverage of potentially misleading early results, according to a report published Sep. 12 in the open access journal PLOS ONE.

The researchers, led by Francois Gonon of the University of Bordeaux, used ADHD (attention deficit hyperactivity disorder) as a test case and identified 47 scientific research papers published during the 1990s on the topic that were covered by 347 newspaper articles. Of the top 10 articles covered by the media, they found that 7 were initial studies. All 7 were either refuted or strongly attenuated by later research, but these later studies received much less media attention than the earlier papers. Only one of the 57 newspaper articles echoing these subsequent studies mentioned that the corresponding initial finding had been attenuated. The authors write that, if this phenomenon is generalizable to other health topics, it likely causes a great deal of distortion in health science communication.

See also “The drugs don’t work – a modern medical scandal. The doctors prescribing them don’t know that. Nor do their patients. The manufacturers know full well, but they’re not telling” by Ben Goldacre, The Guardian Weekend, 22 September 2012, p21-29.

Excerpt: “In 2010, researchers from Harvard and Toronto found all the trials looking at five major classes of drug – antidepressants, ulcer drugs and so on – then measured two key features: were they positive, and were they funded by industry? They found more than 500 trials in total: 85% of the industry-funded studies were positive, but only 50% of the government-funded trials were. In 2007, researchers looked at every published trial that set out to explore the benefits of a statin. These cholesterol-lowering drugs reduce your risk of having a heart attack and are prescribed in very large quantities. This study found 192 trials in total, either comparing one statin against another, or comparing a statin against a different kind of treatment. They found that industry-funded trials were 20 times more likely to give results favouring the test drug.

These are frightening results, but they come from individual studies. So let’s consider systematic reviews into this area. In 2003, two were published. They took all the studies ever published that looked at whether industry funding is associated with pro-industry results, and both found that industry-funded trials were, overall, about four times more likely to report positive results. A further review in 2007 looked at the new studies in the intervening four years: it found 20 more pieces of work, and all but two showed that industry-sponsored trials were more likely to report flattering results.

It turns out that this pattern persists even when you move away from published academic papers and look instead at trial reports from academic conferences. James Fries and Eswar Krishnan, at the Stanford University School of Medicine in California, studied all the research abstracts presented at the 2001 American College of Rheumatology meetings which reported any kind of trial and acknowledged industry sponsorship, in order to find out what proportion had results that favoured the sponsor’s drug.”

The results section is a single, simple and – I like to imagine – fairly passive-aggressive sentence: “The results from every randomised controlled trial (45 out of 45) favoured the drug of the sponsor.”

Read more in Ben Goldacre’s new book “Bad Pharma: How drug companies mislead doctors and harm patients”, published in September 2012.

See also Reflections on bias and complexity, 29 May 2012, by Ben Ramalingam, which discusses a paper in Nature (May 2012) by Daniel Sarewitz, titled “Beware the creeping cracks of bias”: “Evidence is mounting that research is riddled with systematic errors. Left unchecked, this could erode public trust…”

Test, Learn, Adapt: Developing Public Policy with Randomised Controlled Trials

Laura Haynes, Owain Service,  Ben Goldacre, David Torgerson. Cabinet Office. Behavioral Insights Team. 2012. Available as pdf

Executive Summary
Part 1 – What is an RCT and why are they important?
What is a randomised controlled trial?
The case for RCTs – debunking some myths:
1. We don’t necessarily know ‘what works’
2. RCTs don’t have to cost a lot of money
3. There are ethical advantages to using RCTs
4. RCTs do not have to be complicated or difficult to run
Part 2 – Conducting an RCT: 9 key steps
Step 1: Identify two or more policy interventions to compare
Step 2: Define the outcome that the policy is intended to influence
Step 3: Decide on the randomisation unit
Step 4: Determine how many units are required for robust results
Step 5: Assign each unit to one of the policy interventions using a robustly random method
Step 6: Introduce the policy interventions to the assigned groups
Step 7: Measure the results and determine the impact of the policy interventions
Step 8: Adapt your policy intervention to reflect your findings
Step 9: Return to step 1
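Steps 4 and 5 of the list above are the most mechanical, and can be sketched in a few lines. This is a minimal illustration with hypothetical numbers: a standard normal-approximation power calculation for a two-arm trial, followed by a seeded (and therefore auditable) random assignment of units:

```python
import math
import random

def n_per_arm(effect_size, alpha=0.05, power=0.8):
    """Approximate sample size per arm for a two-arm trial comparing
    means, using the standard normal-approximation formula."""
    z = {0.05: 1.96, 0.01: 2.576}[alpha]          # two-sided critical value
    z_beta = {0.8: 0.8416, 0.9: 1.2816}[power]    # quantile for desired power
    return math.ceil(2 * (z + z_beta) ** 2 / effect_size ** 2)

# Step 4: units needed to detect a 0.2 standard-deviation effect.
print(n_per_arm(0.2))   # -> 393 per arm

# Step 5: seeded random assignment of units to the two interventions,
# so the allocation is both genuinely random and reproducible.
units = [f"school_{i:03d}" for i in range(20)]    # hypothetical unit names
rng = random.Random(42)                            # fixed seed for auditability
shuffled = units[:]
rng.shuffle(shuffled)
treatment = sorted(shuffled[:10])
control = sorted(shuffled[10:])
assert set(treatment) | set(control) == set(units)
print(len(treatment), len(control))                # -> 10 10
```

The 393-per-arm figure makes concrete why Step 4 matters: detecting small effects requires far more units than intuition suggests, and halving the detectable effect size quadruples the required sample.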

Innovations in Monitoring and Evaluation ‘as if Politics Mattered’

Date: 17-18 October 2011
Venue: ANU, Canberra, Australia

Concept Note, Chris Roche & Linda Kelly, 4 August 2011

The Developmental Leadership Program (DLP)[1] addresses an important gap in international thinking and policy about the critical role played by leaders, elites and coalitions in the politics of development. At the core of DLP thinking is the proposition that political processes shape developmental outcomes at all levels and in all aspects of society: at national and sub-national levels and in all sectors and issue areas.

Initial findings of the DLP research program confirm that development is a political process and that leadership and agency matter. This is of course not new, but the DLP research provides important insights into how, in particular, leadership, elites and formal and informal coalitions can play a particularly important and under-recognized role in institutional formation (or establishing the ‘rules of the game’), policy reform and development processes[2].

International aid therefore needs to engage effectively with political processes. It needs to be flexible and be able to respond when opportunities open up. It needs to avoid the danger of bolstering harmful political settlements.

Furthermore, Monitoring & Evaluation (M&E) mechanisms need to be improved and made compatible with flexible programming, recognising the importance of ‘process’ as well as outcomes. Donors should invest in a range of monitoring and evaluation techniques and approaches that are more aligned with the kinds of non-linear and unpredictable processes which characterise the political processes that drive positive developmental outcomes. This is important because it can be argued that, at best, current approaches are simply not appropriate for monitoring the kinds of processes DLP research indicates are important; or, at worst, they offer few incentives to international assistance agencies to support the processes that actually lead to developmental outcomes.

RCTs for empowerment and accountability programmes

A GSDRC Helpdesk Research Report, Date: 01.04.2011, 14 pages, available as pdf.

Query: To what extent have randomised control trials been used to successfully measure the results of empowerment and accountability processes or programmes?
Enquirer: DFID
Helpdesk response
Key findings: This report examines the extent to which RCTs have been used successfully to measure empowerment and accountability processes and programmes. Field experiments present immense opportunities, but the report cautions that they are more suited to measuring short-term results with short causal chains and less suitable for complex interventions. The studies have also demonstrated divergent results, possibly due to different programme designs. The literature highlights that issues of scale, context, complexity, timeframe, coordination and bias in the selection of programmes also determine the degree of success reported. It argues that researchers using RCTs should make more effort to understand contextual issues, consider how experiments can be scaled up to measure higher-order processes, and focus more on learning. The report suggests strategies such as using qualitative methods, replicating studies in different contexts and using randomised methods with field activities to overcome the limitations in the literature.
1. Overview
2. General Literature (annotated bibliography)
3. Accountability Studies (annotated bibliography)
4. Empowerment Studies (annotated bibliography)


Micro-Methods in Evaluating Governance Interventions

This paper is available as a pdf.  It should be cited as follows: Garcia, M. (2011): Micro-Methods in Evaluating Governance Interventions. Evaluation Working Papers. Bonn: Bundesministerium für wirtschaftliche Zusammenarbeit und Entwicklung.

The aim of this paper is to present a guide to impact evaluation methodologies currently used in the field of governance. It provides an overview of a range of evaluation techniques – focusing specifically on experimental and quasi-experimental designs. It also discusses some of the difficulties associated with the evaluation of governance programmes and makes suggestions with the aid of examples from other sectors. Although it is far from being a review of the literature on all governance interventions where rigorous impact evaluation has been applied, it nevertheless seeks to illustrate the potential for conducting such analyses.

This paper has been produced by Melody Garcia, economist at the German Development Institute (Deutsches Institut für Entwicklungspolitik, DIE). It is a part of a two-year research project on methodological issues related to evaluating budget support funded by the BMZ’s evaluation division. The larger aim of the project is to contribute to the academic debate on methods of policy evaluation and to the development of a sound and theoretically grounded approach to evaluation. Further studies are envisaged.



Mechanism Experiments and Policy Evaluations

Jens Ludwig, Jeffrey R. Kling, Sendhil Mullainathan. Working Paper 17062, National Bureau of Economic Research, 1050 Massachusetts Avenue, Cambridge, MA 02138, May 2011. pdf copy available

A mechanism experiment is “an experiment that does not test a policy, but one which tests a causal mechanism that underlies a policy”

Randomized controlled trials are increasingly used to evaluate policies. How can we make these experiments as useful as possible for policy purposes? We argue greater use should be made of experiments that identify behavioral mechanisms that are central to clearly specified policy questions, what we call “mechanism experiments.” These types of experiments can be of great policy value even if the intervention that is tested (or its setting) does not correspond exactly to any realistic policy option.

RD comment: Well worth reading. Actually entertaining.

See also a blog posting about them by David McKenzie: What are “Mechanism Experiments” and should we be doing more of them? (6 June 2011)

Randomised controlled trials, mixed methods and policy influence in international development – Symposium

Thinking out of the black box. A 3ie-LIDC Symposium
Date: 17:30 to 19:30 Monday, May 23rd 2011
Venue: John Snow Lecture Theatre, London School of Hygiene and Tropical Medicine (LSHTM) Keppel Street, London, WC1E 7HT

Professor Nancy Cartwright, Professor of Philosophy, London School of Economics
Professor Howard White, Executive Director, 3ie
Chair: Professor Jeff Waage, Director, LIDC

Randomised Controlled Trials (RCTs) have moved to the forefront of the development agenda to assess development results and the impact of development programs. In the words of Esther Duflo – one of the strongest advocates of RCTs – RCTs allow us to know which development efforts help and which cause harm.

But RCTs are not without their critics, with questions raised about their usefulness, both to provide more substantive lessons about the program being evaluated and about whether the findings can be generalized to other settings.

This symposium brings perspectives from the philosophy of science, and a mixed method approach to impact analysis, to this debate.

For more information contact:

PS1: Nancy Cartwright wrote “Are RCTs the Gold Standard?” in 2007

PS2: The presentation by Howard White is now available here – but without audio.