Five ways to ensure that models serve society: A manifesto

Saltelli, A., Bammer, G., Bruno, I., Charters, E., Di Fiore, M., Didier, E., Espeland, W. N., Kay, J., Lo Piano, S., Mayo, D., Pielke, R., Jr, Portaluri, T., Porter, T. M., Puy, A., Rafols, I., Ravetz, J. R., Reinert, E., Sarewitz, D., Stark, P. B., … Vineis, P. (2020). Five ways to ensure that models serve society: A manifesto. Nature, 582(7813), 482–484. https://doi.org/10.1038/d41586-020-01812-9

The five ways:

    1. Mind the assumptions
      • “One way to mitigate these issues is to perform global uncertainty and sensitivity analyses. In practice, that means allowing all that is uncertain — variables, mathematical relationships and boundary conditions — to vary simultaneously as runs of the model produce its range of predictions. This often reveals that the uncertainty in predictions is substantially larger than originally asserted” (see the code sketch after this list)
    2. Mind the hubris
      • “Most modellers are aware that there is a tradeoff between the usefulness of a model and the breadth it tries to capture. But many are seduced by the idea of adding complexity in an attempt to capture reality more accurately. As modellers incorporate more phenomena, a model might fit better to the training data, but at a cost. Its predictions typically become less accurate”
    3. Mind the framing
      • “Match purpose and context. Results from models will at least partly reflect the interests, disciplinary orientations and biases of the developers. No one model can serve all purposes.”
    4. Mind the consequences
      • “Quantification can backfire. Excessive regard for producing numbers can push a discipline away from being roughly right towards being precisely wrong. Undiscriminating use of statistical tests can substitute for sound judgement. By helping to make risky financial products seem safe, models contributed to derailing the global economy in 2007–08 (ref. 5).”
    5. Mind the unknowns
      • “Acknowledge ignorance. For most of the history of Western philosophy, self-awareness of ignorance was considered a virtue, the worthy object of intellectual pursuit”
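To make the “Mind the assumptions” point concrete, here is a minimal Monte Carlo sketch of a global uncertainty and sensitivity analysis. It is my own illustration, not from the paper: the toy model, the parameter ranges and the sample size are all invented. Every uncertain input is sampled simultaneously, and the spread of the outputs, rather than a single best-estimate run, is what gets reported.

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(42)
n_runs = 10_000

# Toy projection model: cases = population * attack_rate * (1 - intervention_effect).
# In a global analysis every uncertain input is sampled simultaneously,
# rather than varied one at a time around a single "best guess".
population = rng.uniform(0.9e6, 1.1e6, n_runs)           # boundary condition
attack_rate = rng.triangular(0.05, 0.15, 0.40, n_runs)   # uncertain variable
intervention_effect = rng.uniform(0.2, 0.6, n_runs)      # uncertain parameter

projected_cases = population * attack_rate * (1 - intervention_effect)

point_prediction = 1e6 * 0.15 * (1 - 0.4)                # a single "best estimate" run
lo, hi = np.percentile(projected_cases, [2.5, 97.5])
print(f"Point prediction: {point_prediction:,.0f} cases")
print(f"95% range from the global analysis: {lo:,.0f} to {hi:,.0f} cases")

# Crude sensitivity check: rank-correlate each input with the output to see
# which source of uncertainty dominates the spread of predictions.
for name, values in [("population", population),
                     ("attack_rate", attack_rate),
                     ("intervention_effect", intervention_effect)]:
    rho, _ = spearmanr(values, projected_cases)
    print(f"{name:>20}: Spearman rho = {rho:+.2f}")
```

The gap between the single point prediction and the 95% range is the sense in which “the uncertainty in predictions is substantially larger than originally asserted”; the rank correlations stand in, very crudely, for the variance-based measures (such as Sobol indices) used in formal global sensitivity analysis.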

“Ignore the five, and model predictions become Trojan horses for unstated interests and values”

“Models’ assumptions and limitations must be appraised openly and honestly. Process and ethics matter as much as intellectual prowess”

“Mathematical models are a great way to explore questions. They are also a dangerous way to assert answers. Asking models for certainty or consensus is more a sign of the difficulties in making controversial decisions than it is a solution, and can invite ritualistic use of quantification”

Evaluating the Future

A blog posting and (summarising) podcast, produced for the EU Evaluation Support Services, by Rick Davies, June 2020

The podcast is available here, on the Capacity4Dev website

The full text of the blog posting is available here as a PDF. Topics covered include:

    • Limitations of common evaluative thinking
    • Scenario planning
    • Risk vs uncertainty
    • Additional evaluation criteria
    • Meaningful differences
    • Other information sources

Story Completion exercises: An idea worth borrowing?

Yesterday, Theo Nabben, a friend and colleague of mine and an MSC trainer, sent me a link to a webpage full of information about a method called Story Completion: https://www.psych.auckland.ac.nz/en/about/story-completion.html

Background

Story Completion is a qualitative research method first developed in the field of psychology but subsequently taken up primarily by feminist researchers. It was originally of interest as a method of enquiring about psychological meanings, particularly those that people could not or did not want to explicitly communicate. However, it was subsequently re-conceptualised as a valuable method of accessing and investigating social discourses. These two different perspectives have been described as essentialist versus social constructionist.

Story completion is a useful tool for accessing meaning-making around a particular topic of interest. It is particularly useful for exploring (dominant) assumptions about a topic. This type of research can be framed as exploring either perceptions and understandings or social/discursive constructions of a topic.

This 2019 paper by Clarke et al. provides a good overview and is my main source of comments and explanations on this page.

How It Works

The researcher provides the participant with the beginning of the story, called the stem. Typically this is one sentence long but can be longer. For example…

“Catherine has decided that she needs to lose weight. Full of enthusiasm, and in order to prevent her from changing her mind, she is telling her friends in the pub about her plans and motivations.”

The researcher then asks the participant to extend that story by explaining – usually in writing – what happens next. Typically this storyline is about a third person (e.g. Catherine), not about the participant themselves.

In practice, this form of enquiry can take various forms as suggested by Figure 1 below.

Figure 1: Four different versions of a Story Completion inquiry

Analysis of responses can be done in two ways: (a) horizontally – comparisons across respondents; (b) vertically – changes over time within the narratives.

Here is a good how-to-do-it introduction to Story Completion: http://blogs.brighton.ac.uk/sasspsychlab/2017/10/15/story-completion/

And here is an annotated bibliography that looks very useful: https://cdn.auckland.ac.nz/assets/psych/about/our-research/documents/Resources%20for%20qualitative%20story%20completion%20(July%202019).pdf

How it could be useful for monitoring and evaluation purposes

Story Completion exercises could be a good way of identifying different stakeholders’ views of the possible consequences of an intervention. Variations in the text of the story stem could allow the exploration of consequences that might vary across gender or other social differences. Variations in the respondents being interviewed would allow exploration of differences in perspective on how a specific intervention might have consequences.

Of course, these responses will need interpretation and would benefit from further questioning. Participatory processes could be designed to enable this type of follow-up, rather than simply relying on third parties (e.g. researchers), however well informed they might be.

Variations could be developed where literacy is likely to be a problem. Voice recordings could be made instead, and small groups could be encouraged to collectively develop a response to the stem. There would seem to be plenty of room for creativity here.

Postscript

There is considerable overlap between the Story Completion method and the ParEvo participatory scenario planning process.

The commonality of the two methods is that they are both narrative-based. They both start with a story stem/seed designed by the researcher/facilitator. Then the respondent/participants add an extension onto that story stem describing what happens next. Both methods are future-orientated and largely other-orientated, in other words not about the storyteller themselves. And both processes pay quite a lot of attention, after the narratives are developed, to how those narratives can be analysed and compared.

Now for some key differences. With ParEvo the process of narrative development involves multiple people rather than one person. This means multiple alternative storylines can develop, some of which die out, some of which continue, and some of which branch into multiple variants. The other difference, already implied, is that the ParEvo process goes through multiple iterations, whereas the Story Completion process has only one iteration. So in the case of ParEvo the storylines accumulate multiple segments of text, with a new segment added with each iteration. Content analysis can be carried out with the results of both Story Completion and ParEvo exercises. But in the case of ParEvo it is also possible to analyse the structure of people’s participation and how it relates to the contents of the storylines.
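A small data-structure sketch may make the contrast concrete. This is hypothetical and purely illustrative, not ParEvo’s actual implementation: both methods start from a stem, but in Story Completion each respondent adds a single extension, while in ParEvo extensions are themselves extended over iterations, so storylines branch, continue or die out.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Segment:
    """One contribution: a piece of story text added by one contributor."""
    text: str
    author: str
    parent: Optional["Segment"] = None               # None marks the stem/seed
    children: List["Segment"] = field(default_factory=list)

    def extend(self, text: str, author: str) -> "Segment":
        child = Segment(text, author, parent=self)
        self.children.append(child)
        return child

    def storyline(self) -> List[str]:
        """Walk back to the stem to reconstruct one complete storyline."""
        node, parts = self, []
        while node is not None:
            parts.append(node.text)
            node = node.parent
        return list(reversed(parts))

# Story Completion: one stem, one extension per respondent, a single iteration.
stem = Segment("Catherine has decided that she needs to lose weight...", "researcher")
stem.extend("Her friends are supportive but sceptical...", "respondent 1")
stem.extend("One friend quickly changes the subject...", "respondent 2")

# ParEvo: a stem ("seed") whose extensions are themselves extended in later
# iterations, so storylines branch, continue, or die out.
seed = Segment("It is 2030 and the programme has just ended...", "facilitator")
a = seed.extend("Local committees keep the services running...", "participant A")
b = seed.extend("Funding collapses within a year...", "participant B")
a2 = a.extend("A neighbouring district copies the model...", "participant C")
# b receives no further extensions in later iterations: that branch dies out.

for leaf in (a2, b):
    print(" -> ".join(leaf.storyline()))
```

Content analysis works on the stem-to-leaf storylines in both cases; the additional participation analysis that ParEvo allows would draw on the author field and the branching structure itself.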

 

Brian Castellani’s Map of the Complexity Sciences

I have limited tolerance for “complexity babble”: that is, people talking about complexity in abstract, ungrounded and, in effect, practically inconsequential terms, and in ways that give no acknowledgement to the surrounding history of ideas.

So I really appreciate the work Brian has put into his “Map of the Complexity Sciences”, produced in 2018, and I think it deserves wider circulation. Note that this is one of a number of iterations, and more are likely in the future. Click on the image to go to a bigger copy.

And please note: in the bigger copy, each node is a hypertext link; clicking on it will take you to another web page providing detailed information about that concept or person. A lot of work has gone into the construction of this map, which deserves recognition.

Here is a discussion of an earlier iteration: https://www.theoryculturesociety.org/brian-castellani-on-the-complexity-sciences/

Process Tracing as a Practical Evaluation Method: Comparative Learning from Six Evaluations

By Alix Wadeson, Bernardo Monzani and Tom Aston
March 2020. PDF available here.

Rick Davies comment: This is the most interesting and useful paper I have yet seen written on process tracing and its use for evaluation purposes. A good mix of methodology discussion, practical examples and useful recommendations.

The Power of Experiments: Decision Making in a Data-Driven World


By Michael Luca and Max H. Bazerman, March 2020. Published by MIT Press

How organizations—including Google, StubHub, Airbnb, and Facebook—learn from experiments in a data-driven world.

Abstract

Have you logged into Facebook recently? Searched for something on Google? Chosen a movie on Netflix? If so, you’ve probably been an unwitting participant in a variety of experiments—also known as randomized controlled trials—designed to test the impact of changes to an experience or product. Once an esoteric tool for academic research, the randomized controlled trial has gone mainstream—and is becoming an important part of the managerial toolkit. In The Power of Experiments: Decision-Making in a Data-Driven World, Michael Luca and Max Bazerman explore the value of experiments and the ways in which they can improve organizational decisions. Drawing on real-world experiments and case studies, Luca and Bazerman show that going by gut is no longer enough—successful leaders need frameworks for moving between data and decisions. Experiments can save companies money—eBay, for example, discovered how to cut $50 million from its yearly advertising budget without losing customers. Experiments can also bring to light something previously ignored, as when Airbnb was forced to confront rampant discrimination by its hosts. The Power of Experiments introduces readers to the topic of experimentation and the managerial challenges that surround them. Looking at experiments in the tech sector and beyond, this book offers lessons and best practices for making the most of experiments.
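At their core, the experiments the book describes are randomized comparisons between a control and a treatment experience. Here is a minimal sketch of the analysis behind a typical online A/B test, using invented data rather than an example from the book:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Hypothetical experiment: users are randomly assigned to the current page (control)
# or a modified page (treatment); we record whether each user converts.
n = 20_000
control = rng.binomial(1, 0.050, n)     # assumed 5.0% baseline conversion rate
treatment = rng.binomial(1, 0.054, n)   # assumed true lift of 0.4 percentage points

diff = treatment.mean() - control.mean()
se = np.sqrt(control.var(ddof=1) / n + treatment.var(ddof=1) / n)
z = diff / se
p_value = 2 * stats.norm.sf(abs(z))

print(f"Control conversion:   {control.mean():.2%}")
print(f"Treatment conversion: {treatment.mean():.2%}")
print(f"Difference: {diff:.2%}  (z = {z:.2f}, p = {p_value:.3f})")
```

The managerial challenge the authors emphasise is less the arithmetic than deciding what to test, at what scale, and how to act on results like these.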

See also a World Bank blog review by David McKenzie.

The impact of impact evaluation

WIDER Working Paper 2020/20. Richard Manning, Ian Goldman, and Gonzalo Hernández Licona. PDF copy available

Abstract: In 2006 the Center for Global Development’s report ‘When Will We Ever Learn? Improving lives through impact evaluation’ bemoaned the lack of rigorous impact evaluations. The authors of the present paper researched international organizations and countries including Mexico, Colombia, South Africa, Uganda, and Philippines to understand how impact evaluations and systematic reviews are being implemented and used, drawing out the emerging lessons. The number of impact evaluations has risen (to over 500 per year), as have those of systematic reviews and other synthesis products, such as evidence maps. However, impact evaluations are too often donor-driven, and not embedded in partner governments. The willingness of politicians and top policymakers to take evidence seriously is variable, even in a single country, and the use of evidence is not tracked well enough. We need to see impact evaluations within a broader spectrum of tools available to support policymakers, ranging from evidence maps, rapid evaluations, and rapid synthesis work, to formative/process evaluations and classic impact evaluations and systematic reviews.

Selected quotes

4.1 Adoption of IEs

On the basis of our survey, we feel that real progress has been made since 2006 in the adoption of IEs to assess programmes and policies in LMICs. As shown above, this progress has not just been in terms of the number of IEs commissioned, but also in the topics covered, and in the development of a more flexible suite of IE products. There is also some evidence, though mainly anecdotal, that the insistence of the IE community on rigour has had some effect both in levering up the quality of other forms of evaluation and in gaining wider acceptance that ‘before and after’ evaluations with no valid control group tell one very little about the real impact of interventions. In some countries, such as South Africa, Mexico, and Colombia, institutional arrangements have favoured the use of evaluations, including IEs, although more uptake is needed.

There is also perhaps a clearer understanding of where IE techniques can or cannot usefully be applied, or combined with other types of evaluation.

At the same time, some limitations are evident. In the first place, despite the application of IE techniques to new areas, the field remains dominated by medical trials and interventions in the social sectors. Second, even in the health sector, other types of evaluation still account for the bulk of total evaluations, whether by donor agencies or LMIC governments.

Third, despite the increase in willingness of a few LMICs to finance and commission their own IEs, the majority of IEs on policies and programmes in such countries are still financed and commissioned by donor agencies, albeit in some cases with the topics defined by the countries, such as in 3ie’s policy windows. In quite a few cases, the prime objectives of such IEs are domestic accountability and/or learning within the donor agency. We believe that greater local ownership of IEs is highly desirable. While there is much that could not have been achieved without donor finance and commissioning, our sense is that—as with other forms of evaluation—a more balanced pattern of finance and commissioning is needed if IEs are to become a more accepted part of national evidence systems.

Fourth, the vast majority of IEs in LMICs appear to have ‘northern’ principal investigators. Undoubtedly, quality and rigour are essential to IEs, but it is important that IEs should not be perceived as a supply-driven product of a limited number of high-level academic departments in, for the most part, Anglo-Saxon universities, sometimes mediated through specialist consultancy firms. Fortunately, ‘southern’ capacity is increasing, and some programmes have made significant investments in developing this. We take the view that this progress needs to be ramped up very considerably in the interests of sustainability, local institutional development, and contributing over time to the local culture of evidence.

Fifth, as pointed out in Section 2.1, the financing of IEs depends to a troubling extent on a small body of official agencies and foundations that regard IEs as extremely important products. Major shifts in policy by even a few such agencies could radically reduce the number of IEs being financed.

Finally, while IEs of individual interventions are numerous and often valuable to the programmes concerned, IEs that transform thinking about policies or broad approaches to key issues of development are less evident. The natural tools for such results are more often synthesis products than one-off IEs, and to these we now turn

4.2 Adoption of synthesis products (building body of evidence)

Systematic reviews and other meta-analyses depend on an adequate underpinning of well-structured IEs, although methodological innovation is now using a more diverse set of sources. The take-off of such products therefore followed the rise in the stock of IEs, and can be regarded as a further wave of the ‘evidence revolution’, as it has been described by Howard White (2019). Such products are increasingly necessary, as the evidence from individual IEs grows.

As with IEs, synthesis products have diversified from full systematic reviews to a more flexible suite of products. We noted examples from international agencies in Section 2.1 and to a lesser extent from countries in Section 3, but many more could be cited. In several cases, synthesis products seek to integrate evidence from quasi-experimental evaluations (e.g. J-PAL’s Policy Insights) or other high-quality research and evaluation evidence.

The need to understand what is now available and where the main gaps in knowledge exist has led in recent years to the burgeoning of evidence maps, pioneered by 3ie but now produced by a variety of institutions and countries. The example of the 500+ evaluations in Uganda cited earlier shows the range of evidence that already exists, which should be mapped and used before new evidence is sought. This should be a priority in all countries.

The popularity of evidence maps shows that there is now a real demand to ‘navigate’ the growing body of IE-based evidence in an efficient manner, as well as to understand the gaps that still exist. The innovation happening also in rapid synthesis shows the demand for synthesis products—but more synthesis is still needed in many sectors and, bearing in mind the expansion in IEs, should be increasingly possible.

A broken system – why literature searching needs a FAIR revolution

Gusenbauer, Michael, and Neal R. Haddaway. ‘Which Academic Search Systems Are Suitable for Systematic Reviews or Meta-Analyses? Evaluating Retrieval Qualities of Google Scholar, PubMed, and 26 Other Resources’. Research Synthesis Methods, 2019.

Haddaway, Neal, and Michael Gusenbauer. 2020. ‘A Broken System – Why Literature Searching Needs a FAIR Revolution’. LSE (blog). 3 February 2020.

“… searches on Google Scholar are neither reproducible, nor transparent. Repeated searches often retrieve different results and users cannot specify detailed search queries, leaving it to the system to interpret what the user wants.

However, systematic reviews in particular need to use rigorous, scientific methods in their quest for research evidence. Searches for articles must be as objective, reproducible and transparent as possible. With systems like Google Scholar, searches are not reproducible – a central tenet of the scientific method. 

Specifically, we believe there is a very real need to drastically overhaul how we discover research, driven by the same ethos as in the Open Science movement. The FAIR data principles offer an excellent set of criteria that search system providers can adapt to make their search systems more adequate for scientific search, not just for systematic searching, but also in day-to-day research discovery:

  • Findable: Databases should be transparent in how search queries are interpreted and in the way they select and rank relevant records. With this transparency, researchers should be able to choose fit-for-purpose databases clearly based on their merits.
  • Accessible: Databases should be free-to-use for research discovery (detailed analysis or visualisation could require payment). This way researchers can access all knowledge available via search.
  • Interoperable: Search results should be readily exportable in bulk for integration into evidence synthesis and citation network analysis (similar to the concept of ‘research weaving’ proposed by Shinichi Nakagawa and colleagues). Standardised export formats help analysis across databases.
  • Reusable: Citation information (including abstracts) should not be restricted by copyright to permit reuse/publication of summaries/text analysis etc.”

Rick Davies comment: I highly recommend using Lens.org, a search facility mentioned in the second paper above.

Predict science to improve science

DellaVigna, Stefano, Devin Pope, and Eva Vivalt. 2019. ‘Predict Science to Improve Science’. Science 366 (6464): 428–29.

Selected quotes follow:

The limited attention paid to predictions of research results stands in contrast to a vast literature in the social sciences exploring people’s ability to make predictions in general.

We stress three main motivations for a more systematic collection of predictions of research results. 1. The nature of scientific progress. A new result builds on the consensus, or lack thereof, in an area and is often evaluated for how surprising, or not, it is. In turn, the novel result will lead to an updating of views. Yet we do not have a systematic procedure to capture the scientific views prior to a study, nor the updating that takes place afterward.

2. A second benefit of collecting predictions is that they can not only reveal when results are an important departure from expectations of the research community and improve the interpretation of research results, but they can also potentially help to mitigate publication bias. It is not uncommon for research findings to be met by claims that they are not surprising. This may be particularly true when researchers find null results, which are rarely published even when authors have used rigorous methods to answer important questions (15). However, if priors are collected before carrying out a study, the results can be compared to the average expert prediction, rather than to the null hypothesis of no effect. This would allow researchers to confirm that some results were unexpected, potentially making them more interesting and informative because they indicate rejection of a prior held by the research community; this could contribute to alleviating publication bias against null results.
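A minimal sketch of the comparison described above, using invented numbers purely for illustration: the estimated effect is read against the distribution of elicited expert predictions as well as against the usual null of zero.

```python
import numpy as np

# Hypothetical elicited predictions: each expert forecast the effect size
# (e.g. percentage-point change in the outcome) before seeing any results.
expert_priors = np.array([2.0, 3.5, 1.0, 4.0, 2.5, 3.0, 1.5, 2.0])

# Hypothetical study result.
estimated_effect = 0.3
standard_error = 0.9

# Conventional reading: is the effect different from zero?
z_vs_null = estimated_effect / standard_error

# Reading against the community's prior: is the effect different from expectations?
prior_mean = expert_priors.mean()
z_vs_prior = (estimated_effect - prior_mean) / standard_error

print(f"Mean expert prediction: {prior_mean:.2f}")
print(f"z against null of zero:      {z_vs_null:+.2f}  (looks like a dull 'null result')")
print(f"z against expert prediction: {z_vs_prior:+.2f}  (a clear departure from expectations)")
```

On these invented numbers the estimate is statistically indistinguishable from zero, yet clearly below what the experts expected, which is exactly the sense in which a “null result” can still be surprising and worth publishing.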

3. A third benefit of collecting predictions systematically is that it makes it possible to improve the accuracy of predictions. In turn, this may help with experimental design. For example, envision a behavioral research team consulted to help a city recruit a more diverse police department. The team has a dozen ideas for reaching out to minority applicants, but the sample size allows for only three treatments to be tested with adequate statistical power. Fortunately, the team has recorded forecasts for several years, keeping track of predictive accuracy, and they have learned that they can combine team members’ predictions, giving more weight to “superforecasters” (9). Informed by its longitudinal data on forecasts, the team can elicit predictions for each potential project and weed out those interventions judged to have a low chance of success or focus on those interventions with a higher value of information. In addition, the research results of those projects that did go forward would be more impactful if accompanied by predictions that allow better interpretation of results in light of the conventional wisdom.
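The “giving more weight to ‘superforecasters’” step could, in its simplest form, be an accuracy-weighted average of forecasts. The sketch below uses hypothetical track-record data; real aggregation schemes are more sophisticated.

```python
import numpy as np

# Hypothetical track record: mean absolute error of each team member's past forecasts.
past_error = {"ana": 0.8, "ben": 2.1, "chris": 1.2}

# Their forecasts (e.g. expected increase in minority applications, in percentage points)
# for one candidate intervention.
forecasts = {"ana": 4.0, "ben": 9.0, "chris": 5.0}

# Weight each forecaster by inverse historical error, so better past accuracy counts more.
weights = {name: 1.0 / err for name, err in past_error.items()}
total = sum(weights.values())
weights = {name: w / total for name, w in weights.items()}

simple_mean = np.mean(list(forecasts.values()))
weighted_mean = sum(weights[name] * forecasts[name] for name in forecasts)

print(f"Unweighted crowd forecast:  {simple_mean:.1f}")
print(f"Accuracy-weighted forecast: {weighted_mean:.1f}")
```

Interventions whose aggregated forecast (or expected value of information) falls below a chosen threshold would be the ones “weeded out” before scarce sample size is spent on them.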

Rick Davies comment: I have argued, for years, that evaluators should start by eliciting client, and other stakeholders, predictions of outcomes of interest that the evaluation might uncover (e.g. Bangladesh, 2004). But I can’t think of any instance where my efforts have been successful, yet. But I have an upcoming opportunity and will try once again, perhaps armed with these two papers.

See also: DellaVigna, Stefano, and Devin Pope. 2016. ‘Predicting Experimental Results: Who Knows What?’ National Bureau of Economic Research.

ABSTRACT
Academic experts frequently recommend policies and treatments. But how well do they anticipate the impact of different treatments? And how do their predictions compare to the predictions of non-experts? We analyze how 208 experts forecast the results of 15 treatments involving monetary and non-monetary motivators in a real-effort task. We compare these forecasts to those made by PhD students and non-experts: undergraduates, MBAs, and an online sample. We document seven main results. First, the average forecast of experts predicts quite well the experimental results. Second, there is a strong wisdom-of-crowds effect: the average forecast outperforms 96 per cent of individual forecasts. Third, correlates of expertise—citations, academic rank, field, and contextual experience—do not improve forecasting accuracy. Fourth, experts as a group do better than non-experts, but not if accuracy is defined as rank-ordering treatments. Fifth, measures of effort, confidence, and revealed ability are predictive of forecast accuracy to some extent, especially for non-experts. Sixth, using these measures we identify ‘superforecasters’ among the non-experts who outperform the experts out of sample. Seventh, we document that these results on forecasting accuracy surprise the forecasters themselves. We present a simple model that organizes several of these results and we stress the implications for the collection of forecasts of future experimental results.

See also: The Social Science Prediction Platform, developed by the same authors.

Twitter responses to this post:

Howard White @HowardNWhite: Ask decision-makers what they expect research findings to be before you conduct the research to help assess the impact of the research. Thanks to @MandE_NEWS for the pointer. https://socialscienceprediction.org

Marc Winokur @marc_winokur, replying to @HowardNWhite and @MandE_NEWS: For our RCT of DR in CO, the child welfare decision makers expected a “no harm” finding for safety, while other stakeholders expected kids to be less safe. When we found no difference in safety outcomes, but improvements in family engagement, the research impact was more accepted.