The sources of algorithmic bias

“The foundations of algorithmic bias”, by Zachary Chase Lipton, 2016. pdf copy here. Original source here

“This morning, millions of people woke up and impulsively checked Facebook. They were greeted immediately by content curated by Facebook’s newsfeed algorithms. To some degree, this news might have influenced their perceptions of the day’s news, the economy’s outlook, and the state of the election. Every year, millions of people apply for jobs. Increasingly, their success might lie, in part, in the hands of computer programs tasked with matching applications to job openings. And every year, roughly 12 million people are arrested. Throughout the criminal justice system, computer-generated risk-assessments are used to determine which arrestees should be set free. In all these situations, algorithms are tasked with making decisions.

Algorithmic decision-making mediates more and more of our interactions, influencing our social experiences, the news we see, our finances, and our career opportunities. We task computer programs with approving lines of credit, curating news, and filtering job applicants. Courts even deploy computerized algorithms to predict “risk of recidivism”, the probability that an individual relapses into criminal behavior. It seems likely that this trend will only accelerate as breakthroughs in artificial intelligence rapidly broaden the capabilities of software.

Turning decision-making over to algorithms naturally raises worries about our ability to assess and enforce the neutrality of these new decision makers. How can we be sure that the algorithmically curated news doesn’t have a political party bias or job listings don’t reflect a gender or racial bias? What other biases might our automated processes be exhibiting that we wouldn’t even know to look for?”

Rick Davies Comment: This paper is well worth reading. It starts by explaining the basics (what an algorithm is and what machine learning is). Then it goes into detail about three sources of bias: (a) biased data, (b) bias by omission, and (c) surrogate objectives. It does not throw the baby out with the bathwater, i.e. condemn the use of algorithms altogether because of some bad practices and weaknesses in their use and design.


Many of the problems with bias in algorithms are similar to problems with bias in humans. Some articles suggest that we can detect our own biases and therefore correct for them, while for machine learning we cannot. But this seems far-fetched. We have little idea how the brain works. And ample studies show that humans are flagrantly biased in college admissions, employment decisions, dating behavior, and more. Moreover, we typically detect biases in human behavior post hoc by evaluating human behavior, not through an a priori examination of the processes by which we think.

Perhaps the most salient difference between human and algorithmic bias is that with human decisions, we expect bias. Take, for example, the well-documented racial biases among employers, who are less likely to call back workers with typically black names than those with white names but identical resumes. We detect these biases because we suspect that they exist, have decided that they are undesirable, and therefore vigilantly test for their existence.

As algorithmic decision-making slowly moves from simple rule-based systems towards more complex, human-level decision making, it’s only reasonable to expect that these decisions are susceptible to bias.

Perhaps, by treating this bias as a property of the decision itself and not focusing overly on the algorithm that made it, we can bring to bear the same tools and institutions that have helped to strengthen ethics and equality in the workplace, college admissions etc. over the past century.

See also:

  • How to Hold Algorithms Accountable, Nicholas Diakopoulos and Sorelle Friedler. MIT Technology Review, November 17, 2016. Algorithmic systems have a way of making mistakes or leading to undesired consequences. Here are five principles to help technologists deal with that.
  • Is the Gig Economy Rigged? by Will Knight, November 17, 2016. A new study suggests that racial and gender bias affect the freelancing websites TaskRabbit and Fiverr—and may be baked into underlying algorithms.



Two useful papers on the practicalities of doing Realist Evaluation

1. Punton, M., Vogel, I., & Lloyd, R. (2016, April). Reflections from a Realist Evaluation in Progress: Scaling Ladders and Stitching Theory. IDS. Available here 

2. Manzano, A. (2016). The craft of interviewing in realist evaluation. Evaluation, 22(3), 342–360. Available here.

Rick Davies comment: I have listed these two papers here because I think they both make useful contributions towards enabling people (myself and others) to understand how to actually do a Realist Evaluation. My previous reading of comments that Realist Evaluation (RE) is “an approach” or “a way of thinking” rather than a method has not been encouraging. Both of these papers provide practically relevant details. The Punton et al paper includes comments about the difficulties encountered, and where they deviated from current or suggested practice and why, which I found refreshing.

I have listed some issues of interest to me below, with reflections on the contributions of the two papers.

Interviews as sources of knowledge

Interviews with stakeholders about if, how and why a program works are a key resource in most REs (Punton et al). Respondents’ views are both sources of theories and sources of evidence for and against those theories, and there seems to be potential for mixing these up in a way that makes the process of theory elicitation and testing less explicit than it should be. Punton et al have partially addressed this by coding the status of views about reported outcomes as “observed”, “anticipated” or “implied”. The same approach could be taken with the recording of respondents’ views on the context and mechanisms involved.

Manzano makes a number of useful distinctions between RE and constructivist interview approaches. But one distinction seems unrealistic, so to speak: “…data collected through qualitative interviews are not considered constructions. Data are instead considered ‘evidence for real phenomena and processes’”. But respondents themselves, as shown in some quotes in the paper, will indicate that on some issues they are not sure, have forgotten, or are guessing. What is real here is that respondents are themselves making their best efforts to construct some sense out of a situation. So careful coding of the status of respondents’ views, as to whether they are theories or not, and if observations, what status these have, is important.

How many people to interview

According to Manzano there is no simple answer to this question, but it is clear that in the early stages of a RE the emphasis is on capturing a diversity of stakeholder views in such a way that the diversity of possible CMOs might be identified. So I was worried that the Punton et al paper referred to interviews being conducted in only 5 of the 11 countries where the BCURE program was operating. If some contextual differences are more influential than others, then I would guess that cross-country differences would be one such type of difference. I know that in all evaluations resources are in limited supply and choices need to be made. But this one puzzled me.

[Later edit] I think part of the problem here is the lack of what could be called an explicit search strategy. The problem is that the number of useful CMOs that could be identified is potentially equal to the number of people affected by a program, or perhaps even a multiple of that if they encountered the program on multiple occasions. Do you try to identify all of these, or do you stop when the number of new CMOs starts to drop off per extra x number of interviewees? Each of these is a kind of search strategy. One pragmatic way of limiting the number of possible CMOs to investigate might be to decide in advance on just how dis-aggregated an analysis of “what works for whom in what circumstances” should be. To do this one would need to be clear on what the unit of analysis should be. I partially agree and disagree with Manzano’s point that “the unit of analysis is not the person, but the events and processes around them; every unique program participant uncovers a collection of micro-events and processes, each of which can be explored in multiple ways to test theories”. From my point of view, the person, especially the intended beneficiaries, should be the central focus, and selected events and processes are relevant in as much as they impinge on these people’s lives. I would re-edit the phrase above as follows: “what works for whom in what circumstances”.

If the unit of analysis is some category of persons then my guess is that the smallest unit of analysis would be a group of people probably defined by a combination of geographic dimensions (e.g. administrative units) and demographic dimensions (e.g. gender, religion, ethnicity of people to be affected). The minimal number of potential differences between these units of analysis seems to be N-1 (where N = number of identifiable groups) as shown by this fictional example below, where each green node is a point of difference between groups of people. Each of these points of difference could be explained by a particular CMO.
[Figure: “CMO tree” – a branching tree in which each green node marks a point of difference between groups of people]
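The N-1 claim above can be checked with a small sketch. Assuming the groupings form a binary tree (the group names below are invented for illustration), counting branch points confirms that N leaf groups imply N-1 points of difference:

```python
# Count leaf groups and branching points in a nested grouping.
# Each branching point is a candidate location for a distinct CMO.
tree = {
    "North region": {"men": None, "women": None},
    "South region": {"men": None, "women": None},
}

def leaves_and_branches(node):
    """Return (number of leaf groups, number of branch points) below node."""
    if not isinstance(node, dict):
        return 1, 0                      # a leaf group, no branch point
    leaves = branches = 0
    for child in node.values():
        l, b = leaves_and_branches(child)
        leaves += l
        branches += b
    return leaves, branches + 1          # +1 for this branch point itself

n_groups, n_differences = leaves_and_branches(tree)
print(n_groups, n_differences)           # 4 groups, 3 points of difference
```

For any strictly binary grouping tree, branch points = leaves − 1; trees with wider branching have fewer, so N-1 is the maximum.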

 I have one reservation about this approach. It requires some form of prior knowledge about the groupings that matter. That is not unreasonable when evaluating a program that had an explicit goal about reaching particular people. But I am wondering if there is also a more inductive search option. [To be continued…perhaps]

How to interview

Both papers had lots of useful advice on how to interview, from a RE perspective. This is primarily from a theory elicitation and clarification perspective.

How to conceptualise CMOs

Both papers noted difficulties in operationalising the idea of CMOs, but also had useful advice in this area. Manzano broke the concept of Context down into sub-constructs such as characteristics of the patients, staff and infrastructure in the setting she was examining. Punton et al introduced a new category of Intervention, alongside Context and Mechanism. In a development aid context this makes a lot of sense to me. Both authors used interviewing methods that avoided any reference to “CMOs” as a technical term.

Consolidating the theories

After exploring what could be an endless variety of CMOs, a RE process needs to enter a consolidation phase. Manzano points out: “In summary, this phase gives more detailed consideration to a smaller number of CMOs which belong to many families of CMOs”. Punton et al refer to a process of abstraction that leads to more general explanations “which encompass findings from across different respondents and country settings”. This process sounds very similar in principle to the process of minimization used in QCA, which uses a more algorithm-based approach. To my surprise the Punton et al paper highlights differences between QCA and RE rather than potential synergies. A good point about their paper is that it explains this stage in more detail than Manzano’s, which is focused more specifically on interview processes.
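The parallel with QCA minimization can be made concrete with a toy sketch of the crisp-set case (conditions and configurations invented; real QCA software also tracks which cases each reduced expression covers). Configurations are strings of 1/0 over binary conditions; two configurations merge when they differ on exactly one condition, which becomes a “don’t care” (‘-’):

```python
from itertools import combinations

def mergeable(a, b):
    """Configurations merge if they differ on exactly one condition,
    and neither has a don't-care ('-') at that position."""
    diffs = [i for i, (x, y) in enumerate(zip(a, b)) if x != y]
    return len(diffs) == 1 and '-' not in (a[diffs[0]], b[diffs[0]])

def merge(a, b):
    return ''.join('-' if x != y else x for x, y in zip(a, b))

def minimise(configs):
    """Repeatedly merge pairs until no further reduction is possible."""
    configs = set(configs)
    while True:
        merged, used = set(), set()
        for a, b in combinations(sorted(configs), 2):
            if mergeable(a, b):
                merged.add(merge(a, b))
                used.update({a, b})
        if not merged:
            return configs
        configs = (configs - used) | merged

# Three observed configurations over two conditions reduce to two
# simpler expressions: "first condition absent" OR "second present".
print(sorted(minimise({'00', '01', '11'})))   # ['-1', '0-']
```

This is the core move of Quine-McCluskey-style Boolean reduction; the RE abstraction step does something similar by hand, which is why the lack of discussion of synergies is surprising.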

Testing the theories

The Punton et al paper does not go into this territory because of the early stage of the work it describes. Manzano makes more reference to this process, but mainly in the context of interviews that elicit people’s theories. This is the territory where more light needs to be shone in future, hopefully in follow-up papers by Punton et al. My continuing impression is that theory elicitation and testing are so bound up together that the process of testing is effectively not transparent and thus difficult to verify or replicate. But readers could point me to other papers where this view could be corrected…:-)


The Value of Evaluation: Tools for Budgeting and Valuing Evaluations

Barr, J., Rinnert, D., Lloyd, R., Dunne, D., & Hentinnen, A. (2016, August). The Value of Evaluation: Tools for Budgeting and Valuing Evaluations. ITAF & DFID.


Exec Summary (first part): “DFID has been at the forefront of supporting the generation of evidence to meet the increasing demand for knowledge and evidence about what works in international development. Monitoring and evaluation have become established tools for donor agencies and other actors to demonstrate accountability and to learn. At the same time, the need to demonstrate the impact and value of evaluation activities has also increased. However, there is currently no systematic approach to valuing the benefits of an evaluation, whether at the individual or at the portfolio level.


This paper argues that the value proposition of evaluations for DFID is context-specific, but that it is closely linked to the use of the evaluation and the benefits conferred to stakeholders by the use of the evidence that the evaluation provides. Although it may not always be possible to quantify and monetise this value, it should always be possible to identify and articulate it.


In the simplest terms, the cost of an evaluation should be proportionate to the value that an evaluation is expected to generate. This means that it is important to be clear about the rationale, purpose and intended use of an evaluation before investing in one. To provide accountability for evaluation activity, decision makers are also interested to know whether an evaluation was ‘worth it’ after it has been completed. Namely, did the investment in the evaluation generate information that is in itself more valuable and useful than using the funds for another purpose?


Against this background, this paper has been commissioned by DFID to answer two main questions:

1. What different methods and approaches can be used to estimate the value of evaluations before commissioning decisions are taken and what tools and approaches are available to assess the value of an already concluded evaluation?


2. How can these approaches be simplified and merged into a practical framework that can be applied and further developed by evaluation commissioners to make evidence-based decisions about whether and how to evaluate before commissioning and contracting?”


Rick Davies comment: The points I noted/highlighted…
  • “…there is surprisingly little empirical evidence available to demonstrate the benefits of evaluation, or to show they can be estimated” … “‘Evidence’ is thus usually seen as axiomatically ‘a good thing’”
  • “A National Audit Office (NAO) review (2013) of evaluation in government was critical across its sample of departments – it found that: ‘There is little systematic information from the government on how it has used the evaluation evidence that it has commissioned or produced’.”
  • “…there is currently no systematic approach to valuing the benefits of an evaluation, whether at the individual or at the portfolio level”
  • “…most ex-ante techniques may be too time-consuming for evaluation commissioners, including DFID, to use routinely”
  • “The concept of ‘value’ of evaluations is linked to whether and how the knowledge generated during or from an evaluation will be used and by whom.”


The paper proposes that:

  • “Consider selecting a sample of evaluations for ex-post valuation within any given reporting period”. Earlier it notes that “…a growing body of ex-post valuation of evaluations at the portfolio level, and their synthesis, will build an evidence base to inform evaluation planning and create a feedback loop that informs learning about commissioning more valuable evaluations”
  • “Qualitative approaches that include questionnaires and self-evaluation may offer some merits for commissioners in setting up guidance to standardise the way ongoing and ex-post information is collected on evaluations for ex-post assessment of the benefits of evaluations.”
  • “Consider using a case study template for valuing DFID evaluations”
  • “An ex-ante valuation framework is included in this paper (see section 4) which incorporates information from the examination of the above techniques and recommendations. Commissioners could use this framework to develop a tool, to assess the potential benefit of evaluations to be commissioned”


While I agree with all of these…

  • There is already a body of empirically-oriented literature on evaluation use dating back to the 1980s that should be given adequate attention. See my probably incomplete bibliography here. This includes a very recent 2016 study by USAID.
  • The use of case studies of the kind used by the Research Excellence Framework (REF), known as ‘Impact Case Studies’, makes sense. As this paper noted: “The impact case studies do not need to be representative of the spread of research activity in the unit rather they should provide the strongest examples of impact”. They are, in other words, a kind of “Most Significant Change” story, including the MSC-type requirement that there be “a list of sufficient sources that could, if audited, corroborate key claims made about the impact of the research”. Evaluation use is not a kind of outcome where it seems to make much sense investing a lot of effort into establishing “average effects”. Per unit of money invested it would seem to make more sense searching for the most significant changes (both positive and negative) that people perceive as the effects of an evaluation.
  • The ex-ante valuation framework is in effect a “loose” Theory of Change, which needs to be put to use and then tested for its predictive value! Interpreted in crude terms, presumably the more of the criteria listed in the Evaluation Decision Framework (on page 26) that are met by a given evaluation, the higher our expectations that the evaluation will be used and have an impact. There are stacks of normative frameworks around telling us how to do things, e.g. on how to have effective partnerships. However, good ideas like these need to be disciplined by some effort to test them against what happens in reality.

Process Tracing and Bayesian updating for impact evaluation

Befani, B., & Stedman-Bryce, G. (2016). Process Tracing and Bayesian updating for impact evaluation. Evaluation, 1356389016654584.


Abstract: Commissioners of impact evaluation often place great emphasis on assessing the contribution made by a particular intervention in achieving one or more outcomes, commonly referred to as a ‘contribution claim’. Current theory-based approaches fail to provide evaluators with guidance on how to collect data and assess how strongly or weakly such data support contribution claims. This article presents a rigorous quali-quantitative approach to establish the validity of contribution claims in impact evaluation, with explicit criteria to guide evaluators in data collection and in measuring confidence in their findings. Coined as ‘Contribution Tracing’, the approach is inspired by the principles of Process Tracing and Bayesian Updating, and attempts to make these accessible, relevant and applicable by evaluators. The Contribution Tracing approach, aided by a symbolic ‘contribution trial’, adds value to impact evaluation theory-based approaches by: reducing confirmation bias; improving the conceptual clarity and precision of theories of change; providing more transparency and predictability to data-collection efforts; and ultimately increasing the internal validity and credibility of evaluation findings, namely of qualitative statements. The approach is demonstrated in the impact evaluation of the Universal Health Care campaign, an advocacy campaign aimed at influencing health policy in Ghana.


Rick Davies comment: Unfortunately this paper is behind a paywall, but it may become more accessible in the future. If so, I recommend reading it, along with some related papers. These include a recent IIED paper on process tracing: Clearing the fog: new tools for improving the credibility of impact claims, by Barbara Befani, Stefano D’Errico, Francesca Booker, and Alessandra Giuliani. This paper is also about combining process tracing with Bayesian updating. The other is Azad, K. (n.d.), An Intuitive (and Short) Explanation of Bayes’ Theorem, which helped me a lot. Also worth watching out for are future courses on contribution tracing run by Pamoja. I attended their first three-day training event on contribution tracing this week. It was hard going, but by the third day I felt I was getting on top of the subject matter. It was run by Befani and Stedman-Bryce, the authors of the main paper above. Why am I recommending this reading? Because the combination of process tracing and Bayesian probability calculation strikes me as a systematic and transparent way of assessing evidence for and against a causal claim. The downside is the initial difficulty of understanding the concepts involved. As with some other impact assessment tools and methods, what you gain in rigor can then be put at risk by the difficulty of communicating how the method works, leaving non-specialist audiences having to trust your judgement – which is what the use of such methods tries to avoid in the first place. The other issue which I think needs more attention is how you aggregate or synthesize multiple contribution claims that are found to have substantial posterior probability. And niggling in the background is a thought: what about all the contribution claims that are found not to be supported – what happens to these?
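For readers who want the bare mechanics of the updating step, here is a minimal sketch (the probabilities are invented for illustration, not taken from the paper). Given a prior belief in a contribution claim, the chance of observing a piece of evidence if the claim is true (“sensitivity”), and the chance of observing it anyway if the claim is false (“type I error”), Bayes’ theorem gives the posterior:

```python
def update(prior, p_evidence_if_true, p_evidence_if_false):
    """Posterior probability of a contribution claim after observing evidence."""
    joint_true = p_evidence_if_true * prior
    joint_false = p_evidence_if_false * (1 - prior)
    return joint_true / (joint_true + joint_false)

# A highly specific piece of evidence (rarely seen unless the claim is true)
# moves an agnostic prior of 0.5 a long way:
posterior = update(0.5, 0.8, 0.05)
print(round(posterior, 3))   # 0.941
```

Multiple evidence items can be chained by feeding each posterior back in as the next prior, on the assumption that the items are conditionally independent; this is where the transparency of the method lies, since every input probability must be stated and defended.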

Pathways to change: Evaluating development interventions with qualitative comparative analysis (QCA)

by Barbara Befani, May 2016. Published by the Expert Group for Aid Studies, Stockholm.  Available in hardcover and as pdf

“Qualitative Comparative Analysis (QCA) is a method particularly well suited for systematic and rigorous comparisons and synthesis of information over a limited number of cases. In addition to a presentation of the method, this report provides an innovative step-wise guide on how to apply and quality-assure the application of QCA to real-life development evaluation, indicating the common mistakes and challenges for each step.”

Rick Davies Comment: This is an important publication, worth spending some time with. It is a detailed guide to the use of QCA, written specially for use by evaluators. Barbara Befani probably has more experience and knowledge of the use of QCA for evaluation purposes than anyone else, and this is where she has distilled all her knowledge to date. There are lots of practical examples of the use of QCA scattered throughout the book, used to support particular points about how QCA works. It is not an easy book to read but it is well worth the effort because there is so much that is of value; it is the kind of book you will probably return to many times. I have read pre-publication drafts of the book and the final version and will undoubtedly be returning to different sections again in the future. While this book presents QCA as a package, as is normal, there are many ideas and practices in the book which are useful in themselves. For some of my thoughts on this view of QCA you can see my comments made at the book launch a week ago (in very rough PowerPoint form: A response; The need-to-knows when commissioning evaluations)


A Review of Umbrella Fund Evaluation – Focusing on Challenge Funds

This is a Specialist Evaluation and Quality Assurance Service – Service Request Report, authored by Lydia Richardson with David Smith and Charlotte Blundy of TripleLine, in October 2015. A pdf copy is available

As this report points out, Challenge Funds are a common means of funding development aid projects but they have not received the evaluation attention they deserve. In this TripleLine study the authors collated information on 56 such funds. “One of the key findings was that of the 56 funds, only 11 (19.6%) had a document entitled ‘impact assessment’; of these, 7 have been published. Looking through these, only one (Chars Livelihood Programme) appears to be close to DFID’s definition of impact evaluation, although this programme is not considered to be a true challenge fund according to the definition outlined in the introduction. The others assess impact but do not necessarily fit DFID’s 2015 definition of impact evaluation”

Also noted later in the text:… “An email request for information on evaluation of challenge funds was sent to fund and evaluation managers. This resulted in just two responses from 11 different organisations. This verifies the finding that there is very little evaluation of challenge funds available in the public domain”….”Evaluation was in most cases not incorporated into the fund’s design”.

“This brief report focuses on the extent to which challenge funds are evaluable. It unpacks definitions of the core terms used and provides some analysis and guidance for those commissioning evaluations. The guidance is also relevant for those involved in designing and managing challenge funds to make them more evaluable”

1. Introduction
2. Methods used
2.1 Limitations of the review
3. Summary of findings of the scoping phase
3.1 Understanding evaluability
3.2 Typology for DFID Evaluations
4. Understanding the challenge fund aid modality
4.1 Understanding the roles and responsibilities in the challenge fund model
4.2 Understanding the audiences and purpose of the evaluation
4.3 Aligning the design of the evaluation to the design of the challenge fund
5. What evaluation questions are applicable?
5.1 Relevance
5.2 Efficiency
5.3 Effectiveness
5.4 Impact
5.5 Sustainability
6. The rigour and appropriateness of challenge fund evaluations
6.1 The use of theory of change
6.2 Is a theory based evaluation relevant and possible?
6.3 Measuring the counterfactual and assessing attribution
6.4 The evaluation process and institutional arrangements
6.5 Multi-donor funds
6.6 Who is involved?
7. How data can be aggregated
8. Working in fragile and conflict affected states
9. Trends
10. Gaps
11. Conclusions

Rick Davies Comment: While projects funded by Challenge Funds are often evaluated, sometimes as a requirement of their funding, it seems that the project selection and funding process itself is not given the same level of scrutiny. By this I mean the process whereby candidate projects are screened, assessed and then chosen, or not, for funding. This process involves consideration of multiple criteria, including adherence to legal requirements, strategic alignment and grantees’ capacity to implement activities and achieve objectives. This is where the Challenge Fund Theory of Change actually gets operationalised. It should be possible to test the (tacit and explicit) theory being implemented at this point by gathering data on the subsequent performance of the funded projects. There should be some form of consistent association between the attributes of highly rated project proposals (versus lowly rated proposals) and the scale of their achievements when implemented. If there is not, then it suggests that the proposal screening process is not delivering value and that random choice would be cheaper and just as effective. One experience I have had of this kind of analysis was not very encouraging. We could not find any form of consistent association between project attributes noted during project selection and the scale of subsequent achievement. But perhaps with more comprehensive recording of data collected at the project assessment stage the analysis might have delivered more encouraging results…
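One simple version of the test proposed above is a rank correlation between appraisal scores and subsequent achievement. The sketch below uses invented data and no tie handling; a coefficient near zero would suggest the screening process adds little over random choice:

```python
def ranks(xs):
    """Rank values from 1..n (assumes no tied values, for simplicity)."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0] * len(xs)
    for position, i in enumerate(order):
        r[i] = position + 1
    return r

def spearman(x, y):
    """Spearman rank correlation via the classic d-squared formula."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

appraisal_scores = [62, 85, 71, 90, 55]      # hypothetical selection-stage ratings
achievements = [3.1, 4.0, 2.5, 4.6, 2.0]     # hypothetical outcome measure
print(round(spearman(appraisal_scores, achievements), 2))   # 0.9
```

With real fund data the same calculation could be run per selection criterion, which would also show which parts of the appraisal rubric carry any predictive value.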

PS: This report was done using 14 person days, which is a tight budget given the time needed to collate data, let alone analyse it. A good report, especially considering these time constraints.


Published June 2015. The principal authors of this guide are Dr. David Wilkie, Dr. Michelle Wieland and Diane Detoeuf of WCS, with thanks to Dr. Rick Davies for many useful discussions and comments about adding the value of owned assets to the BNS (the modification). Available as pdf

“This manual is offered as a practical guide to implementing the Basic Necessities Survey (BNS) that was originally developed by Rick Davies, and was recently modified and then field tested by WCS. The modified Basic Necessities Survey is imperfect, in that it does not attempt to answer all questions that could be asked about the impact of conservation (or development) actions on people’s well-being. But it is the perfect core to a livelihoods monitoring program, because it provides essential information about people’s well-being from their perspective over time, and implementing a modified BNS is easy enough that it does not preclude gathering additional household information that a conservation project feels they need to adaptively manage their activities”

“This technical manual was developed to offer conservation practitioners with limited budgets and staff a simple, practical, low-cost, quantitative approach to measuring and tracking trends in people’s well-being, and to link these measures where possible to the use and conservation of natural resources.”

“This approach is not based on the assumption that people are doing well if they make more than 1-2 dollars per day, or are in poverty if they make less. Rather, it is based on the understanding that people themselves are best able to decide what constitutes well-being. The approach is based on a United Nations definition of poverty as a lack of basic necessities. More specifically, the approach asks communities to define what goods and services are necessary for a family to meet their basic needs. Examples of goods include material items such as: an axe, mobile phone, bed, or cook-stove. Services can include: access to clean drinking water within 15 minutes’ walk, reasonable walking distance to health care, children attending school, women participating in community decision making, or absence of domestic violence, etc. Families who do not own or have access to this basket of goods and services are, by community definition, not meeting a basic, minimum standard of well-being and thus are poor (i.e., living below the community-defined poverty line).”

Rick Davies comment: It has been gratifying to see WCS pick up on the value of the BNS and make its potential more widely known via this USAID publication. I would like to highlight two other potentially useful modifications/uses of the BNS. One is how to establish a community-defined poverty line within the distribution of BNS scores collected in a given community, thus enabling a “head count” measure of poverty. This is described on pages 31-37 of this 2007 report for the Ford Foundation in Vietnam. The other is how to extract from BNS data a simple prediction rule that succinctly summarises which survey responses best predict the overall poverty status of a given household. That method is described in the June 2013 issue of the EES Connections newsletter (pages 12-14).
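The core BNS scoring that both of these uses build on can be sketched in a few lines (the items, votes and households below are invented; the sketch assumes the standard BNS weighting, in which each item’s weight is the proportion of respondents rating it a necessity):

```python
# Each respondent votes on whether an item is a necessity (1/0)...
necessity_votes = {
    "bed":          [1, 1, 1, 0],
    "mobile phone": [1, 0, 1, 1],
    "radio":        [0, 0, 1, 0],
}
# ...and each surveyed household reports whether it has the item (1/0).
households = [
    {"bed": 1, "mobile phone": 1, "radio": 0},
    {"bed": 0, "mobile phone": 0, "radio": 1},
]

# Item weight = share of respondents calling it a necessity.
weights = {item: sum(v) / len(v) for item, v in necessity_votes.items()}

def poverty_score(household):
    """Weighted share of necessities the household has (0 = none, 1 = all)."""
    have = sum(weights[item] for item, has in household.items() if has)
    return have / sum(weights.values())

scores = [round(poverty_score(h), 2) for h in households]
print(scores)   # [0.86, 0.14] – the first household is much better off
```

A community-defined poverty line can then be drawn within the distribution of such scores, which is the “head count” extension described in the Vietnam report.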


Power calculation for causal inference in social science: Sample size and minimum detectable effect determination

Eric W Djimeu, Deo-Gracias Houndolo, 3ie Working Paper 26, March 2016. Available as pdf

1. Introduction
2. Basic statistics concepts: statistical logic
3. Power calculation: concept and applications
3.1. Parameters required to run power calculations
3.2. Statistical power and sample size determination
3.3. How to run power calculation: single treatment or multiple treatments?
4. Rules of thumb for power calculation
5. Common pitfalls in power calculation
6. Power calculations in the presence of multiple outcome variables
7. Experimental design
7.1. Individual-level randomisation
7.2. Cluster-level randomisation

1. Introduction

Since the 1990s, researchers have increasingly used experimental and quasi-experimental
primary studies – collectively known as impact evaluations – to measure the effects of
interventions, programmes and policies in low- and middle-income countries. However, we are
not always able to learn as much from these studies as we would like. One common problem is
that evaluation studies use sample sizes that are inappropriate for detecting whether
meaningful effects have occurred. To overcome this problem, it is necessary to conduct
power analysis during the study design phase to determine the sample size required to detect
the effects of interest. Two main concerns support the need to perform power calculations in
social science and international development impact evaluations: sample sizes can be too small
and sample sizes can be too large.

In the first case, power calculation helps to avoid the consequences of having a sample that is
too small to detect the smallest magnitude of interest in the outcome variable. Having a sample
size smaller than statistically required increases the likelihood of researchers concluding that
the evaluated intervention has no impact when the intervention does, indeed, cause a significant
change relative to a counterfactual scenario. Such a finding might wrongly lead policymakers to
cancel a development programme, or make counterproductive or even harmful changes in
public policies. Given this risk, it is not acceptable to conclude that an intervention has no
impact when the sample size used for the research is not sufficient to detect a meaningful
difference between the treatment group and the control group.

In the second case, evaluation researchers must be good stewards of resources. Data
collection is expensive and any extra unit of observation comes at a cost. Therefore, for cost-efficiency and value-for-money it is important to ensure that an evaluation research design does
not use a larger sample size than is required to detect the minimum detectable effect (MDE)
of interest. Researchers and funders should therefore use power calculations to determine the
appropriate budget for an impact evaluation study.

Sample size determination and power calculation can be challenging, even for researchers
aware of the problems of small sample sizes and insufficient power. 3ie developed this resource
to help researchers with their search for the optimal sample size required to detect an MDE in
the interventions they evaluate.

The manual provides straightforward guidance and explains the process of performing power
calculations in different situations. To do so, it draws extensively on existing materials to
calculate statistical power for individual and cluster randomised controlled trials. More
specifically, this manual relies on Hayes and Bennett (1999) for cluster randomised controlled
trials and documentation from Optimal Design software version 3.0 for individual randomised
controlled trials.
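The core calculation the manual covers, for an individually randomised trial with two equal arms comparing means, can be sketched with the standard two-sample formula (this is a generic illustration, not code from the manual or from Optimal Design):

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_arm(mde, sd, alpha=0.05, power=0.8):
    """Sample size per arm to detect a difference in means of `mde`,
    given outcome standard deviation `sd`, significance level `alpha`
    (two-sided) and desired statistical power."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the test
    z_beta = z.inv_cdf(power)           # quantile for the power target
    return ceil(2 * (z_alpha + z_beta) ** 2 * (sd / mde) ** 2)

# A standardised MDE of 0.2 standard deviations at 5% significance, 80% power:
print(sample_size_per_arm(mde=0.2, sd=1.0))  # 393 per arm
```

For cluster randomised designs of the kind Hayes and Bennett (1999) treat, this individual-level figure must additionally be inflated by a design effect that grows with cluster size and intra-cluster correlation.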

Evaluating the impact of flexible development interventions

ODI Methods Lab report, March 2016. Rick Davies. Available as pdf

“Evaluating the impact of projects that aim to be flexible and responsive is a challenge. One of the criteria for good impact evaluation is rigour – which, broadly translated, means having a transparent, defensible and replicable process of data collection and analysis. And its debatable apotheosis is the use of randomised control trials (RCTs). Using RCTs requires careful management throughout the planning, implementation and evaluation cycle of a development intervention. However, these requirements for control are the antithesis of what is needed for responsive and adaptive programming. Less demanding and more common alternatives to RCTs are theory-led evaluations using mixed methods. But these can also be problematic because ideally a good theory contains testable hypotheses about what will happen, which are defined in advance.

Is there a middle way, between relying on pre-defined testable theories of change and abandoning any hope altogether that they can cope with the open-ended nature of development?

Drawing on experiences of the Australia-Mekong NGO Engagement Platform and borrowing from the data-centred approaches of the commercial sector, this paper argues that there is a useful role for ‘loose’ theories of change and that they can be evaluable”

Key messages:

• For some interventions, tight and testable theories of change are not appropriate. For example, in fast-moving humanitarian emergencies or participatory development programmes, a more flexible approach is needed.

• However, it is still possible to have a flexible project design and to draw conclusions about causal attribution. This middle path involves ‘loose’ theories of change, where activities and outcomes may be known, but the likely causal links between them are not yet clear.

• In this approach, data is collected ‘after the event’ and analysed across and within cases, developing testable models for ‘what works’. More data will likely be needed than for projects with a ‘tight’ theory of change, as there is a wider range of relationships between interventions and outcomes to analyse. The theory of change still plays an important role, in guiding the selection of data types.

• While loose theories of change are useful to identify long term impacts, this approach can also support short cycle learning about the effectiveness of specific activities being implemented within a project’s lifespan.

Learning about Analysing Networks to Support Development Work?

Simon Batchelor, IDS Practice Paper in Brief. July 2011. Available as pdf

“Introduction: Everyone seems to be talking about networks. Networks and the analysis of networks are now big business. However, in the development sector, analysis of networks remains weak.

This paper presents four cases where social network analysis (SNA) was used in a development programme. It focuses not so much on the organisational qualities of networks or on the virtual networks facilitated by software, but on the analysis of connectivity in real world networks. Most of the cases are unintentional networks. What literature there is on network analysis within the development sector tends to focus on intentional networks and their quality. Our experience suggests there is considerable benefit to examining and understanding the linkages in unintentional networks, and this is a key part of this Practice Paper.

The four cases illustrate how social network analysis can:

• Identify investments in training, and enable effective targeting of capacity building.

• Analyse a policy environment for linkages between people, and enable targeted interventions.

• Analyse an emerging policy environment, and stimulate linkages between different converging sectors.

• Look back on and understand the flow of ideas, thereby learning about enabling an environment for innovation.

These cases, while not directly from the intermediary sector, potentially inform our work with the intermediary sector.”
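The first use listed, targeting capacity building at well-connected actors, can be illustrated with a minimal connectivity analysis. The tie data and names here are entirely hypothetical, and simple degree counting stands in for the fuller SNA measures the paper's cases would use:

```python
from collections import Counter

# Hypothetical "who seeks advice from whom" ties gathered in a survey
ties = [("Ana", "Ben"), ("Ana", "Cleo"), ("Dan", "Cleo"),
        ("Eve", "Cleo"), ("Ben", "Dan")]

# Degree = number of ties an actor appears in; the best-connected actors
# are natural candidates for targeted training or capacity building.
degree = Counter()
for a, b in ties:
    degree[a] += 1
    degree[b] += 1

print(degree.most_common(1))  # [('Cleo', 3)]
```

In a real application the same tie data would also support the other uses listed, such as spotting missing links between converging sectors or tracing how ideas flowed through the network.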