A Coffey How-To note, June 2015, by Carrie Baptist and Barbara Befani. Available as pdf
QCA is a case-based method that allows evaluators to identify different combinations of factors that are critical to a given outcome, in a given context. This allows for a more nuanced understanding of how different combinations of factors can lead to success, and of the influence context can have on success.
QCA allows evaluators to test theories of change and answer the question ‘what works best, why and under what circumstances’ in a way that emerges directly from the empirical analysis, that can be replicated by other researchers, and is generalizable to other contexts.
While it isn’t appropriate for use in all circumstances and has limitations, QCA also has certain unique strengths – including qualitatively assessing impact and identifying multiple pathways to achieving change – which make it a valuable addition to the evaluation toolkit.
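To make "combinations of factors" concrete, here is a minimal sketch of a crisp-set truth table, the kind of tabulation a QCA analysis typically starts from. The cases, the condition names (`funding`, `local_partner`) and the outcome values are all invented for illustration and are not taken from the how-to note. "Consistency" here simply means the share of cases with a given configuration that show the outcome.

```python
from collections import defaultdict

# Hypothetical cases: binary conditions plus a binary outcome (all invented).
cases = [
    {"funding": 1, "local_partner": 1, "outcome": 1},
    {"funding": 1, "local_partner": 1, "outcome": 1},
    {"funding": 1, "local_partner": 0, "outcome": 0},
    {"funding": 0, "local_partner": 1, "outcome": 1},
    {"funding": 0, "local_partner": 0, "outcome": 0},
    {"funding": 1, "local_partner": 0, "outcome": 1},
]
conditions = ["funding", "local_partner"]


def truth_table(cases, conditions):
    """Group cases by configuration; report case count and consistency."""
    groups = defaultdict(list)
    for case in cases:
        config = tuple(case[c] for c in conditions)
        groups[config].append(case["outcome"])
    return {
        config: {"n": len(outcomes), "consistency": sum(outcomes) / len(outcomes)}
        for config, outcomes in groups.items()
    }


table = truth_table(cases, conditions)
for config, row in sorted(table.items(), reverse=True):
    print(dict(zip(conditions, config)), row)
```

In this made-up data, the configuration with both conditions present always shows the outcome, while funding without a local partner does so only half the time. A real analysis would go on to minimise the consistent rows into simpler expressions, which this sketch does not attempt.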
Rick Davies comment: The availability of this sort of explanatory and introductory note is very timely, given the increased use of QCA for evaluation purposes. My only quibble with this how-to note is that the heart of the QCA process seems to have been left undescribed (see step 10, page 6), like the proverbial black box. For those looking for a more detailed exposition, keep an eye out for the extensive guide now being prepared by Barbara Befani, with support from the Expert Group for Aid Studies in Sweden (More details here). There is also an introductory posting on QCA on the Better Evaluation website.
MODERATE NEED, ACUTE NEED Valid categories for humanitarian needs assessments? Evidence from a recent needs assessment in Syria
26 MARCH 2015 by Aldo Benini
Needs assessments in crises seek to establish, among other elements, the number of persons in need. “Persons in need” is a broad and fuzzy concept; the estimates that local key informants provide are highly uncertain. Would refining the categories of persons in need lead to more reliable estimates?
The Syria Multi-Sectoral Needs Assessment (MSNA), in autumn 2014, provided PiN estimates for 126 of the 270 sub-districts of the country. It differentiated between persons in moderate and those in acute need. “Moderate Need, Acute Need – Valid Categories for Humanitarian Needs Assessments?” tests the information value of this distinction. The results affirm that refined PiN categories can improve the measurement of unmet needs under conditions that rarely permit exact classification. The note ends with some technical recommendations for future assessments.
Prepared by Elliot Stern for the Big Lottery Fund, Bond, Comic Relief and the Department for International Development, May 2015. Available as pdf
1. Introduction and scope
2. What is impact evaluation?
Defining impact and impact evaluation
Linking cause and effect
Explanation and the role of ‘theory’
Who defines impact?
Impact evaluation and other evaluation approaches
Main messages
3. Frameworks for designing impact evaluation
Designs that support causal claims
The design triangle
Evaluation questions
Evaluation designs
Programme attributes
Main messages
4. What different designs and methods can do
Causal inference: linking cause and effect
Main types of impact evaluation design
The contemporary importance of the ‘contributory’ cause
Revisiting the ‘design triangle’
Main messages
5. Using this guide
Drawing up terms of reference and assessing proposals for impact evaluations
Assessing proposals
Quality of reports and findings
Strengths of conclusions and recommendations
Using findings from impact evaluations
Main messages
Produced for DFID Evaluation Department by Lesley Groves, February 2015. Available as a pdf
The purpose of this paper is to analyse current practice of beneficiary feedback in evaluation and to stimulate further thinking and activity in this area. The Terms of Reference required a review of practice within DFID and externally. This is not a practical guide or How to Note, though it does make some recommendations on how to improve the practice of beneficiary feedback in evaluation. The paper builds on current UK commitments to increasing the voice and influence of beneficiaries in aid programmes. It has been commissioned by the Evaluation Department of the UK Department for International Development (DFID).
The paper builds on:
A review of over 130 documents (DFID and other development agencies), including policy and practice reports, evaluations and their Terms of Reference, web pages, blogs, journal articles and books;
Interviews with 36 key informants representing DFID, INGOs, evaluation consultants/consultancy firms and a focus group with 13 members of the Beneficiary Feedback Learning Partnership;
Analysis of 32 evaluations containing examples of different types of beneficiary feedback.
The research process revealed that the literature on beneficiary feedback in evaluation is scant. Yet it also revealed a strong appetite for developing a shared understanding and building on existing, limited practice.
Introduction
Part A: A Framework for a Beneficiary Feedback Approach to Evaluation
A.1 Drawing a line in the sand: defining beneficiary feedback in the context of evaluation
A.1.1 Current use of the term “beneficiary feedback”
A.1.2 Defining “Beneficiary”
A.1.3 Defining “Feedback”
A.2 Towards a framework for applying a “beneficiary feedback” approach in the context of evaluation
A.3 A working definition of beneficiary feedback in evaluation
Part B: Situating Beneficiary Feedback in Current Evaluation Practice
B.1 Situating beneficiary feedback in evaluation within DFID systems and evaluation standards
B.1.1 Applying a beneficiary feedback approach to evaluation within DFID evaluations
B.1.2 Inclusion of beneficiary feedback in evaluation policies, standards and principles
B.2 Learning from experience: Assessment of current practice
B.2.1 Existing analysis of current performance of beneficiary feedback in the development sector generally
B.2.2 Specific examples of beneficiary feedback in evaluation
Part C: Enhancing Evaluation Practice through a Beneficiary Feedback Approach
C.1 How a beneficiary feedback approach can enhance evaluation practice
C.2 Checklists for evaluation commissioners and practitioners
C.3 What are the obstacles to beneficiary feedback in evaluation and how can they
Rick Davies Comment: I am keen on the development and use of checklists, for a number of reasons. They encourage systematic attention to a range of relevant issues and make lack of attention to any of these more visible and accountable. But I also like Scriven’s comments on checklists:
“The humble checklist, while no one would deny its utility in evaluation and elsewhere, is usually thought to fall somewhat below the entry level of what we call a methodology, let alone a theory. But many checklists used in evaluation incorporate a quite complex theory, or at least a set of assumptions, which we are well advised to uncover; and the process of validating an evaluative checklist is a task calling for considerable sophistication. Indeed, while the theory underlying a checklist is less ambitious than the kind that we normally call a program theory, it is often all the theory we need for an evaluation.”
Scriven’s comments prompt me to ask, in the case of Lesley Groves’s checklists: if the attributes listed in the checklists are what we ideally should find in an evaluation, and many or all are in fact found to be present, then what outcome(s) might we expect to see associated with these features of an evaluation? On page 23 of her report she lists four possible desirable outcomes:
Generation of more robust and rigorous evaluations particularly to ensure unintended and negative consequences are understood;
Reduction of participation fatigue and beneficiary burden through processes that respect participants and enable them to engage in meaningful ways;
Supporting of development and human rights outcomes;
Making programmes more relevant and responsive.
With this list we are on our way to having a testable theory of how beneficiary feedback can improve evaluations.
The same chapter of the report goes even further, identifying the different types of outcomes that could be expected from different combinations of usages of beneficiary feedback, in a four by four matrix (see page 27).
[Spotted via tweet by Chris Roche]
Punton, M., Welle, K., 2015. Straws-in-the-wind, Hoops and Smoking Guns: What can Process Tracing Offer to Impact Evaluation? Available as pdf
See also the Annex Applying Process Tracing in Five Steps, also available as pdf
Abstract: “This CDI Practice Paper by Melanie Punton and Katharina Welle explains the methodological and theoretical foundations of process tracing, and discusses its potential application in international development impact evaluations. It draws on two early applications of process tracing for assessing impact in international development interventions: Oxfam Great Britain (GB)’s contribution to advancing universal health care in Ghana, and the impact of the Hunger and Nutrition Commitment Index (HANCI) on policy change in Tanzania. In a companion to this paper, Practice Paper 10 Annex describes the main steps in applying process tracing and provides some examples of how these steps might be applied in practice.”
Annex abstract: This Practice Paper Annex describes the main steps in applying process tracing, as adapted from Process-Tracing Methods: Foundations and Guidelines (Beach and Pedersen 2013). It also provides some examples of how these steps might be applied in practice, drawing on a case study discussed in CDI Practice Paper 10.
Rick Davies Comment: This is one of a number of recent publications now available on process tracing (see bibliography below). The good thing about this IDS publication is its practical orientation, on how to do process tracing. However, I think there are three gaps which concern me:
Not highlighting how process tracing (based on within-case investigations) can be complementary to cross-case investigations (which can be done using the Configurational or Regularity approaches in Box 1 of this paper). While within-case investigations can elaborate on the how-things-work question, across-case questions can tell us about the scale on which these things are happening (i.e. their coverage). The former is about mechanisms, the latter is about associations. A good causal claim will involve both association(s) and mechanism(s).
Not highlighting the close connection between conceptions of necessary and sufficient causes and the four types of tests the paper describes. The concepts of necessary and/or sufficient causes provide a means of connecting both levels of analysis: they can be used to describe what is happening at both levels (causal conditions and configurations in cross-case investigations and mechanisms in within-case investigations).
Not highlighting that there are two important elements to the tests, not just one (probability). One is the ability to disprove a proposition of sufficiency or necessity through the existence of a single contrary case, the other is the significance of the prior probability of an event happening. See more below…
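The first element, disproof by a single contrary case, can be sketched in a few lines. This is only an illustration using the standard logical definitions: a claim of necessity is contradicted by a case where the outcome occurs without the condition, and a claim of sufficiency by a case where the condition is present without the outcome. The three cases are hypothetical.

```python
def contradicts_necessity(case):
    """Outcome occurred without the condition, so necessity is disproved."""
    return case["outcome"] and not case["condition"]


def contradicts_sufficiency(case):
    """Condition was present without the outcome, so sufficiency is disproved."""
    return case["condition"] and not case["outcome"]


# Hypothetical cases (invented for illustration).
cases = [
    {"name": "A", "condition": True, "outcome": True},
    {"name": "B", "condition": True, "outcome": False},
    {"name": "C", "condition": False, "outcome": False},
]

# One contrary case is enough to reject a claim; none means it survives so far.
necessity_survives = not any(contradicts_necessity(c) for c in cases)
sufficiency_survives = not any(contradicts_sufficiency(c) for c in cases)

print("necessity survives:", necessity_survives)
print("sufficiency survives:", sufficiency_survives)
```

Here case B alone is enough to reject sufficiency, however many confirming cases like A are added, while the necessity claim survives because no case shows the outcome without the condition.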
The Better Evaluation website describes the relationship between the tests and the types of causes as follows (with some extra annotations here by myself)
‘Hoop’ test is failed when examination of a case shows the presence of a Necessary causal condition but the outcome of interest is not present.
Passing a common hoop condition is more persuasive than an uncommon one [This is the Bayesian bit referred to in the IDS paper – the significance of an event is affected by our prior assumptions about its occurrence]
‘Smoking Gun’ test is passed when examination of a case shows the presence of a Sufficient causal condition.
Passing an uncommon smoking gun condition is more persuasive than a common one [The Bayesian bit again]
‘Doubly Definitive’ test is passed when examination of a case shows that a condition is both Necessary and Sufficient support for the explanation. These tend to be rare.
Instead, the authors (possibly following other cited authors) make use of two related distinctions, between certainty and uniqueness, in place of necessity and sufficiency. I am not sure that this helps much. Certainty arises from something being a necessity, not the other way around
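The "Bayesian bit" in the annotations above can be illustrated with Bayes' rule. The sketch below compares how far the same starting belief moves when a smoking-gun observation is uncommon (rarely seen unless the explanation is true) versus common (often seen anyway). All the probabilities are assumed values chosen for illustration, not figures from the IDS paper or the Better Evaluation website.

```python
def posterior(prior, p_evidence_if_true, p_evidence_if_false):
    """Bayes' rule: probability of the hypothesis after observing the evidence."""
    numerator = p_evidence_if_true * prior
    denominator = numerator + p_evidence_if_false * (1 - prior)
    return numerator / denominator


prior = 0.5  # assumed starting belief in the explanation

# Evidence that would rarely turn up unless the explanation were true.
uncommon = posterior(prior, p_evidence_if_true=0.4, p_evidence_if_false=0.05)

# Evidence that often turns up whether or not the explanation is true.
common = posterior(prior, p_evidence_if_true=0.4, p_evidence_if_false=0.3)

print(f"after uncommon evidence: {uncommon:.2f}")
print(f"after common evidence:   {common:.2f}")
```

With these assumed numbers the uncommon observation lifts the belief far more than the common one, which is the sense in which passing an uncommon smoking gun condition is the more persuasive result.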
Rick Davies comment: The paper is interesting in the first instance because both the debate and practice about evidence based policy and practice seem to be much further ahead in the field of medicine than they are in the field of development aid (…broad generalisation that this is…). There are also parallels between different approaches in medicine and different approaches in development aid.
In medicine, one is rule based, focused on average effects when trying to meet common needs in populations, and the other is expertise focused on the specific and often unique needs of individuals.
In development aid, one is centrally planned and nationally rolled out services meeting basic needs like water supply or education, and the other is much more person centered participatory rural and other development programs.
The evidence based “quality mark” has been misappropriated by vested interests
The volume of evidence, especially clinical guidelines, has become unmanageable
Statistically significant benefits may be marginal in clinical practice
Inflexible rules and technology driven prompts may produce care that is management driven rather than patient centred
Evidence based guidelines often map poorly to complex multimorbidity
Box 2: What is real evidence based medicine and how do we achieve it?
Real evidence based medicine:
Makes the ethical care of the patient its top priority
Demands individualised evidence in a format that clinicians and patients can understand
Is characterised by expert judgment rather than mechanical rule following
Shares decisions with patients through meaningful conversations
Builds on a strong clinician-patient relationship and the human aspects of care
Applies these principles at community level for evidence based public health
Actions to deliver real evidence based medicine
Patients must demand better evidence, better presented, better explained, and applied in a more personalised way
Clinical training must go beyond searching and critical appraisal to hone expert judgment and shared decision making skills
Producers of evidence summaries, clinical guidelines, and decision support tools must take account of who will use them, for what purposes, and under what constraints
Publishers must demand that studies meet usability standards as well as methodological ones
Policy makers must resist the instrumental generation and use of “evidence” by vested interests
Independent funders must increasingly shape the production, synthesis, and dissemination of high quality clinical and public health evidence
The research agenda must become broader and more interdisciplinary, embracing the experience of illness, the psychology of evidence interpretation, the negotiation and sharing of evidence by clinicians and patients, and how to prevent harm from overdiagnosis
“This book argues that techniques falling under the label of process tracing are particularly well suited for measuring and testing hypothesized causal mechanisms. Indeed, a growing number of political scientists now invoke the term. Despite or perhaps because of this fact, a buzzword problem has arisen, where process tracing is mentioned, but often with little thought or explication of how it works in practice. As one sharp observer has noted, proponents of qualitative methods draw upon various debates – over mechanisms and causation, say – to argue that process tracing is necessary and good. Yet, they have done much less work to articulate the criteria for determining whether a particular piece of research counts as good process tracing (Waldner 2012: 65–68). Put differently, “there is substantial distance between the broad claim that ‘process tracing is good’ and the precise claim ‘this is an instance of good process tracing’” (Waldner 2011: 7).
This volume addresses such concerns, and does so along several dimensions. Meta-theoretically, it establishes a philosophical basis for process tracing – one that captures mainstream uses while simultaneously being open to applications by interpretive scholars. Conceptually, contributors explore the relation of process tracing to mechanism-based understandings of causation. Most importantly, we articulate best practices for individual process-tracing accounts – for example, criteria for how micro to go and how to deal with the problem of equifinality (the possibility that there may be multiple pathways leading to the same outcome).
Ours is an applied methods book – and not a standard methodology text – where the aim is to show how process tracing works in practice. If Van Evera (1997), George and Bennett (2005), Gerring (2007a), and Rohlfing (2012) set the state of the art for case studies, then our volume is a logical follow-on, providing clear guidance for what is perhaps the central within-case method – process tracing.
Despite all the recent attention, process tracing – or the use of evidence from within a case to make inferences about causal explanations of that case – has in fact been around for thousands of years. Related forms of analysis date back to the Greek historian Thucydides and perhaps even to the origins of human language and society. It is nearly impossible to avoid historical explanations and causal inferences from historical cases in any purposive human discourse or activity.
Although social science methodologists have debated and elaborated on formal approaches to inference such as statistical analysis for over a hundred years, they have only recently coined the term “process tracing” or attempted to explicate its procedures in a systematic way. Perhaps this is because drawing causal inferences from historical cases is a more intuitive practice than statistical analysis and one that individuals carry out in their everyday lives. Yet, the seemingly intuitive nature of process tracing obscures that its unsystematic use is fraught with potential inferential errors; it is thus important to utilize rigorous methodological safeguards to reduce such risks.
The goal of this book is therefore to explain the philosophical foundations, specific techniques, common evidentiary sources, and best practices of process tracing to reduce the risks of making inferential errors in the analysis of historical and contemporary cases. This introductory chapter first defines process tracing and discusses its foundations in the philosophy of social science. We then address its techniques and evidentiary sources, and advance ten best-practice criteria for judging the quality of process tracing in empirical research. The chapter concludes with an analysis of the methodological issues specific to process tracing on general categories of theories, including structural-institutional, cognitive-psychological, and sociological. Subsequent chapters take up this last issue in greater detail and assess the contributions of process tracing in particular research programs or bodies of theory”
Part I. Introduction:
1. Process tracing: from philosophical roots to best practices Andrew Bennett and Jeffrey T. Checkel
Part II. Process Tracing in Action:
2. Process tracing the effects of ideas Alan M. Jacobs
3. Mechanisms, process, and the study of international institutions Jeffrey T. Checkel
4. Efficient process tracing: analyzing the causal mechanisms of European integration Frank Schimmelfennig
5. What makes process tracing good? Causal mechanisms, causal inference, and the completeness standard in comparative politics David Waldner
6. Explaining the Cold War’s end: process tracing all the way down? Matthew Evangelista
7. Process tracing, causal inference, and civil war Jason Lyall
Part III. Extensions, Controversies, and Conclusions:
8. Improving process tracing: the case of multi-method research Thad Dunning
9. Practice tracing Vincent Pouliot
10. Beyond metaphors: standards, theory, and the ‘where next’ for process tracing Jeffrey T. Checkel and Andrew Bennett
Appendix. Disciplining our conjectures: systematizing process tracing with Bayesian analysis.
“When you perform a hypothesis test in statistics, a p-value helps you determine the significance of your results. Hypothesis tests are used to test the validity of a claim that is made about a population. This claim that’s on trial, in essence, is called the null hypothesis….(continue here...)
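As a small worked illustration of the idea in that excerpt, the sketch below computes an exact one-sided p-value for a coin-toss test. The scenario (8 heads in 10 tosses, null hypothesis of a fair coin) and the function name `binomial_p_value` are assumptions made for the example and are not part of the linked article.

```python
from math import comb


def binomial_p_value(successes, trials, p_null=0.5):
    """One-sided p-value: P(X >= successes) if the null hypothesis is true."""
    return sum(
        comb(trials, k) * p_null**k * (1 - p_null) ** (trials - k)
        for k in range(successes, trials + 1)
    )


# Hypothetical data: 8 heads in 10 tosses, null hypothesis "the coin is fair".
p = binomial_p_value(8, 10)
print(f"p-value = {p:.4f}")
```

The result is 56/1024, roughly 0.055: a fair coin would produce 8 or more heads in 10 tosses about one time in eighteen, so at the conventional 0.05 threshold this evidence would fall just short of rejecting the null hypothesis.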
As someone said, “Making predictions can be difficult, especially about the future”
Give your opinions on these predictions via the online poll at the bottom of this page, and see what others think
See also other writers’ predictions, accessible via links after the opinion poll
(1) Most evaluations will be internal.
The growth of internal evaluation, especially in corporations adopting environmental and social missions, will continue. Eventually, internal evaluation will overshadow external evaluation. The job responsibilities of internal evaluators will expand and routinely include organizational development, strategic planning, and program design. Advances in online data collection and real-time reporting will increase the transparency of internal evaluation, reducing the utility of external consultants.
(2) Evaluation reports will become obsolete.
After-the-fact reports will disappear entirely. Results will be generated and shared automatically—in real time—with links to the raw data and documentation explaining methods, samples, and other technical matters. A new class of predictive reports, preports, will emerge. Preports will suggest specific adjustments to program operations that anticipate demographic shifts, economic shocks, and social trends.
(3) Evaluations will abandon data collection in favor of data mining.
Tremendous amounts of data are being collected in our day-to-day lives and stored digitally. It will become routine for evaluators to access and integrate these data. Standards will be established specifying the type, format, security, and quality of “core data” that are routinely collected from existing sources. As in medicine, core data will represent most of the outcome and process measures that are used in evaluations.
(4) A national registry of evaluations will be created.
Evaluators will begin to record their studies in a central, open-access registry as a requirement of funding. The registry will document research questions, methods, contextual factors, and intended purposes prior to the start of an evaluation. Results will be entered or linked at the end of the evaluation. The stated purpose of the database will be to improve evaluation synthesis, meta-analysis, meta-evaluation, policy planning, and local program design. It will be the subject of prolonged debate.
(5) Evaluations will be conducted in more open ways.
Evaluations will no longer be conducted in silos. Evaluations will be public activities that are discussed and debated before, during, and after they are conducted. Social media, wikis, and websites will be re-imagined as virtual evaluation research centers in which like-minded stakeholders collaborate informally across organizations, geographies, and socioeconomic strata.
(6) The RFP will RIP.
The purpose of an RFP is to help someone choose the best service at the lowest price. RFPs will no longer serve this purpose well because most evaluations will be internal (see 1 above), information about how evaluators conduct their work will be widely available (see 5 above), and relevant data will be immediately accessible (see 3 above). Internal evaluators will simply drop their data—quantitative and qualitative—into competing analysis and reporting apps, and then choose the ones that best meet their needs.
(7) Evaluation theories (plural) will disappear.
Over the past 20 years, there has been a proliferation of theories intended to guide evaluation practice. Over the next ten years, there will be a convergence of theories until one comprehensive, contingent, context-sensitive theory emerges. All evaluators—quantitative and qualitative; process-oriented and outcome-oriented; empowerment and traditional—will be able to use the theory in ways that guide and improve their practice.
(8) The demand for evaluators will continue to grow.
The demand for evaluators has been growing steadily over the past 20 to 30 years. Over the next ten years, the demand will not level off due to the growth of internal evaluation (see 1 above) and the availability of data (see 3 above).
(9) The number of training programs in evaluation will increase.
There is a shortage of evaluation training programs in colleges and universities. The shortage is driven largely by how colleges and universities are organized around disciplines. Evaluation is typically found as a specialty within many disciplines in the same institution. That disciplinary structure will soften and the number of evaluation-specific centers and training programs in academia will grow.
(10) The term evaluation will go out of favor.
The term evaluation sets the process of understanding a program apart from the process of managing a program. Good evaluators have always worked to improve understanding and management. When they do, they have sometimes been criticized for doing more than determining the merit of a program. To more accurately describe what good evaluators do, evaluation will become known by a new name, such as social impact management.
Rick Davies Comment: I have highlighted interesting bits of text in red. The conclusions, also in red, are worth noting. And…make sure you check out the great (as often) xkcd comic at the end of the posting below :-)
“With the rapid expansion of impact evaluation evidence has come the cottage industry of the systematic review. Simply put, a systematic review is supposed to “sum up the best available research on a specific question.” We found 238 reviews in 3ie’s database of systematic reviews of “the effectiveness of social and economic interventions in low- and middle- income countries,” seeking to sum up the best evidence on topics as diverse as the effect of decentralized forest management on deforestation and the effect of microcredit on women’s control over household spending.
But how definitive are these systematic reviews really? Over the past two years, we noticed that there were multiple systematic reviews on the same topic: How to improve learning outcomes for children in low and middle income countries. In fact, we found six! Of course, these reviews aren’t precisely the same: Some only include randomized-controlled trials (RCTs) and others include quasi-experimental studies. Some examine only how to improve learning outcomes and others include both learning and access outcomes. One only includes studies in Africa. But they all have the common core of seeking to identify what improves learning outcomes.
Between them, they cover an enormous amount of educational research. They identify 227 studies that measure the impact of some intervention on learning outcomes in the developing world. 134 of those are RCTs. There are studies from around the world, with many studies from China, India, Chile, and – you guessed it – Kenya. But as we read the abstracts and intros of the reviews, there was some overlap, but also quite a bit of divergence. One highlighted that pedagogical interventions were the most effective; another that information and computer technology interventions raised test scores the most; and a third highlighted school materials as most important.
What’s going on? In a recent paper, we try to figure it out.
Differing Compositions. Despite having the same topic, these studies don’t study the same papers. In fact, they don’t even come close. Out of 227 total studies that have learning outcomes across the six reviews, only 3 studies are in all six reviews, per the figure below. That may not be surprising since there are differences in the inclusion criteria (RCTs only, Africa only, etc.). Maybe some of those studies aren’t the highest quality. But only 13 studies are even in the majority (4, 5, or 6) of reviews. 159 of the total studies (70 percent!) are only included in one review. 74 of those are RCTs and so are arguably of higher quality and should be included in more reviews. (Of course, there are low-quality RCTs and high-quality non-RCTs. That’s just an example.) The most comprehensive of the reviews covers less than half of the studies.
If we do a more parsimonious analysis, looking only at RCTs with learning outcomes at the primary level between 1990 and 2010 in Sub-Saharan Africa (which is basically the intersection of the inclusion criteria of the six reviews), we find 42 total studies, and the median number included in a given systematic review is 15, about one-third. So there is surprisingly little overlap in the studies that these reviews examine.
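The overlap counts described above (studies found in every review, studies found in only one) are straightforward to reproduce once each review's list of included studies is available. The sketch below uses three tiny invented reviews and made-up study IDs purely to show the counting; it does not use the data behind the figures quoted in the post.

```python
from collections import Counter

# Hypothetical reviews, each listing the IDs of the studies it includes.
reviews = {
    "review_1": {"s1", "s2", "s3"},
    "review_2": {"s1", "s4"},
    "review_3": {"s1", "s2", "s5", "s6"},
}

# For each study, count how many reviews include it.
inclusion = Counter(study for studies in reviews.values() for study in studies)

total_studies = len(inclusion)
in_every_review = sorted(s for s, n in inclusion.items() if n == len(reviews))
in_one_review = sorted(s for s, n in inclusion.items() if n == 1)
share_single = len(in_one_review) / total_studies

print("total studies:", total_studies)
print("in every review:", in_every_review)
print(f"share found in only one review: {share_single:.0%}")
```

Even in this toy example most studies appear in a single review, which is the pattern the authors report at much larger scale.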
What about categorization? The reviews also vary in how they classify the same studies. For example, a program providing merit scholarships to girls in Kenya is classified alternatively as a school fee reduction, a cash transfer, a student incentive, or a performance incentive. Likewise, a program that provided computer-assisted learning in India is alternatively classified as “computers or technology” or “materials.”
What drives the different conclusions? Composition or categorization? We selected one positive recommendation from each review and examined which studies were driving that recommendation. We then counted how many of those studies were included in other reviews. As the figure below shows, the proportion varies enormously, but the median value is 33%: In other words, another review would likely have just one third of the studies driving a major recommendation in a given review. So composition matters a lot. This is why, for example, McEwan finds much bigger results for computers than others do: The other reviews include – on average – just one third of the studies that drive his result.
At the same time, categorization plays a role. One review highlights the provision of materials as one of the best ways to improve test scores. But several of the key studies that those authors call “materials,” other authors categorize as “computers” or “instructional technology.” While those are certainly materials, not all materials are created equal.
The variation is bigger on the inside. Systematic reviews tend to group interventions into categories (like “incentives” or “information provision” or “computers”), but saying that one of these delivers the highest returns on average masks the fact the variation within these groups is often as big or bigger than the variation across groups. When McEwan finds that computer interventions deliver the highest returns on average, it can be easy to forget that the same category of interventions includes a lot of clunkers, as you can see in the forest plot from his paper, below. (We’re looking at you, One Laptop Per Child in Peru or in Uruguay; but not at you, program providing laptops in China. Man, there’s even heterogeneity within intervention sub-categories!) Indeed, out of 11 categories of interventions in McEwan’s paper, 5 have a bigger standard deviation across effect sizes within the category than across effect sizes in the entire review sample. And for another 5, the standard deviation within category is more than half the standard deviation of the full sample. This is an argument for reporting effectiveness at lower levels of aggregation of intervention categories.
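The within-category versus whole-sample comparison can be sketched as follows. The effect sizes and category contents are invented, and the use of the population standard deviation (`statistics.pstdev`) is an assumption, since the post does not say which estimator the paper used.

```python
from statistics import pstdev

# Hypothetical effect sizes grouped by intervention category (all invented).
effects = {
    "computers": [0.60, -0.10, 0.05, 0.45],
    "materials": [0.08, 0.10, 0.12],
    "incentives": [0.20, 0.25, 0.15],
}

# Standard deviation across every effect size in the sample.
all_effects = [e for values in effects.values() for e in values]
overall_sd = pstdev(all_effects)

# Ratio of each category's own standard deviation to the overall one.
ratios = {category: pstdev(values) / overall_sd for category, values in effects.items()}

for category, ratio in ratios.items():
    print(f"{category}: within-category SD is {ratio:.2f} x the overall SD")
```

A ratio above 1 means the category varies more internally than the whole sample does, so its average effect says little about any single intervention within it. In this made-up data that is the case for "computers" but not for "materials".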
What does this tell us? First, it’s worth investing in an exhaustive search. Maybe it’s even worth replicating searches. Second, it may be worthwhile to combine systematic review methodologies, such as meta-analysis (which is very systematic but excludes some studies) and narrative review (which is not very systematic but allows inclusion of lots of studies, as well as examination of the specific elements of an intervention category that make it work, or not work). Third, maintain low aggregation of intervention categories so that the categories can actually be useful.
Finally, and perhaps most importantly, take systematic reviews with a grain of salt. What they recommend very likely has good evidence behind it; but it may not be the best category of intervention, since chances are, a lot of evidence didn’t make it into the review.
Oh, and what are the three winning studies that made it into all six systematic reviews?