Lists of sources re … – Monitoring and Evaluation NEWS

EvalC3 Online is now available

For those of you interested in configurational analyses, the use of simple prediction modeling algorithms, and the linking of cross-case and within-case analyses, an online version of EvalC3 is now available and free to use here: https://evalc3online.org/authenticate

Along with extensive linked Help pages:

EvalC3 Online: Introduction: https://evaluatingcomplexity.org/tools/eval-c3
How to use EvalC3 Online: https://evaluatingcomplexity.org/resources/how-to-use-eval-c3-online

See also this large compilation of the Help pages on the EvalC3 website (soon to be closing)

A Guide to Evaluation of Value for Money in UK Public Services

King, J., & Hurrell, A. (2024). A Guide to Evaluation of Value for Money in UK Public Services.

Rick Davies comment: I have always appreciated this simple differentiation of VfM terms, as provided in an earlier publication by Julian King and colleagues

================text of the associated website==================

Assessing Value for Money in the UK

Beyond cost-benefit analysis

This guide (King & Hurrell, 2024) introduces the Value for Investment approach in the UK domestic policy context to meet needs that cost-benefit analysis alone cannot fulfil. In their approach, King and Hurrell challenge the notion that Value for Money and cost-benefit analysis are interchangeable terms. They explore the idea that conflating the two impedes our ability to make good resource allocation decisions.

Integrating multiple values, sources and methods for a more comprehensive assessment

The guide advocates for a more inclusive approach that integrates multiple values (social, economic, environmental, and cultural) and diverse evidence sources (qualitative and quantitative) for more comprehensive VfM assessment. An evaluation guided by the VfI system can draw on the strengths of CBA without privileging economic methods and metrics over wider evidence and criteria.

The Value for Investment approach uses mixed methods (integrating quantitative and qualitative evidence), employs evaluative reasoning (interpreting evidence through the lens of explicit criteria and standards – improving transparency in the rationale for evaluative judgements), and is participatory (involving stakeholders in co-design and analysis).

It is designed to be intuitive and practical to use by following a logical sequence of steps.

Complementing existing guidance

Our new Guide is designed to complement existing VfI guidance such as:

Assessing Value for Money – the Oxford Policy Management approach (King, Wate, Namukasa, Hurrell, Hansford, Ward, & Faramarzifar, 2023) and
Value for Investment – application and insights (King, Crocket, & Field, 2023).

This Guide focuses on the interface between cost-benefit analysis and the broader fields of evaluation and economics, including:

How cost-benefit analysis can make a valuable contribution to Value for Money assessment;
Why cost-benefit analysis alone may not be enough;
How to conduct a Value for Investment evaluation that matches methods to context and includes cost-benefit analysis where feasible and appropriate; and
Using Value for Investment as a viable alternative when cost-benefit analysis isn’t possible.

In setting out these arguments, we acknowledge the UK Green Book as HM Government’s central guide for appraisal and evaluation, and the Magenta Book as its central guide for evaluation methodologies and practices across the policy cycle.

Who is the guide for?

This new Guide aims to help those tasked with assessing Value for Money to design and deliver context-appropriate assessments that contribute to good resource allocation decisions and positive impacts.

The authors also offer training workshops on Value for Investment:

Training for UK public sector staff and consultants

Value for Investment training workshops are jointly offered to UK public sector staff and consultants by Verian and Julian King & Associates.
Training workshops can be provided online or in-person and can be customised to meet needs, typically ranging from three hours to two days duration. Workshops can be scheduled on request.

Workshops for evaluators

Value for Investment training workshops for evaluators are also offered periodically through the UK Evaluation Society through a collaboration between Oxford Policy Management, Julian King & Associates, and Verian.

Download the guide

==============end of text of associated website==================

Exploring the text analytic capacities of GPT4

On this page:

1. What GPT4 says it can do?

RD: What kinds of text analyses can I expect GPT4 to be able to do?

GPT4: While I cannot provide specific details about GPT-4’s capabilities beyond my knowledge cutoff, I can provide some insights based on GPT-3’s functionalities. GPT-4, being an advanced version, is expected to have improved performance and more accurate results. Here are some types of text analysis that a GPT-4 model could potentially handle:

1. Sentiment analysis: Determine the sentiment of a given text, whether it’s positive, negative, or neutral, and potentially provide a sentiment score.
2. Text summarization: Summarize long pieces of text into shorter, more concise versions while retaining the main ideas.
3. Named entity recognition (NER): Identify and classify entities within the text, such as names of people, organizations, locations, dates, and other types of entities.
4. Keyword extraction: Extract the most relevant keywords or phrases from a given text, providing a quick overview of the main topics.
5. Text classification: Categorize text into different topics, genres, or other predefined classes based on its content.
6. Relationship extraction: Identify relationships between entities within a text, such as a person’s affiliation with an organization or the connection between two events.
7. Language translation: Translate text between different languages while maintaining the original meaning and context.
8. Paraphrasing: Reword a given text while preserving its original meaning, which can be useful for rephrasing sentences or generating alternative versions of content.
9. Grammar and style corrections: Identify and correct grammatical errors, typos, and stylistic issues in a text.
10. Content generation: Create new, coherent, and contextually relevant content based on user inputs, prompts, or templates.

Please note that these are examples of text analysis tasks that an advanced language model like GPT-4 might be able to handle, but specific capabilities might vary depending on the model’s training and implementation.

2. What we can do using apps that interface with GPT4 et al

This is what I have been doing recently, partly through the use of a new app known as Pickaxe, which “sits on top of” GPT4 and its predecessors. PS: The number of these apps is large and growing very rapidly. I will try to provide some guides to them here shortly.

Using Pickaxe you can draft a “prompt”, or a whole structure of prompts, which will send a request to GPT, which then returns a finding. Nothing new here. But, in addition, you can “publish” your prompt, making it usable by others. Including the ability to customise the prompt to their more specific requirements.

Here below is a list of the Pickaxes I have developed so far…mainly oriented around my interests relating to qualitative analysis of text data. Warning… None of these is perfect. Inspect the results carefully and don’t make any major decisions on the basis of this information alone. Sometimes you may want to submit the same prompt multiple times, to look for variability in the results.

Please use the Comment facility to provide me with feedback on what is working, what is not and what else could be tried out. This is all very much a work in progress. For some background see this other recent post of mine: Using ChatGPT as a tool for the analysis of text data

Summarisation

Text summariser. The AI will read the text and provide three types of summary descriptions for each and all of the texts provided. Users can determine the brevity of the summaries.

Key word extraction. The AI will read the text and generate ranked lists of key words that best describe the contents of each and all of the texts provided.

Comparison

Text pile sorting. The AI will sort texts into two piles representing the most significant difference between them, within constraints defined by the user.

Text pair comparisons. The AI will compare two descriptions of events and identify commonalities and differences between them, within constraints defined by the user.

Text ranking. The AI will rank a set of texts, on one or more criteria provided by the user. An explanation will be given for the texts in the top and bottom rank positions

Extraction

Thematic coding assistant. You provide guidance for the automated search for a theme of interest to you. You provide a set of texts to be searched for this theme. The AI searches and finds texts that seem most relevant. You provide feedback to improve further searches.

PS: This Pickaxe needs testing against data generated by manual searches of the same set of text for the same themes. If you have any already coded text that could be used for such a test please let me know: rick.davies@gmail.com For more on how to do such a test see section 3 below.

Actor & relationship extraction. The AI will identify names of actors mentioned in texts, and kinds of relationships between them. The output will be in the form of two text lists and two matrices (affiliation and adjacency), in csv format.

Adjective Analysis Extraction. The AI will identify ranked lists of adjectives that are found in one or more texts, within constraints identified by the user.

Adverb extraction. The AI will identify a ranked list of adverbs that are found within a text, within constraints identified by the user.

Others of possible interest

Find a relevant journal…that covers the subject that you are interested in. Then have those journals ranked on widely recognised quality criteria. And presented in a table format

3. Evaluation of AI text analysis performance

It is worth thinking how we could usefully compare the performance of GPT4 to that of humans on text analysis tasks. This would be easiest with responses that generate multiple items, such as lists and rankings, which lend themselves to judgements about degrees of similarity/difference – the use of which is made clearer below.

There are three possibilities of interest:

1. A human and the AI might both agree that a text, or instance in a text, meets the search criteria built into a prompt. For example, it is an instance of the theme “conflict”. This is a True Positive (TP).
2. A human might judge that a text, or instance in a text, meets the search criteria built into a prompt, but the AI may not. This will be evident if the instance is on the list developed by the human but has not been included in the AI’s list. This is a False Negative (FN).
3. The AI might judge that a text, or instance in a text, meets the search criteria built into a prompt, but the human may not. This will be evident if the instance has been included in the AI’s list but is not on the list developed by the human. This is a False Positive (FP).

These possibilities can be represented in a kind of truth table known as a Confusion Matrix. Ideally both human and AI would agree in their judgements on which texts were relevant instances. In that case all the instances found by both parties would be in the True Positive (TP) cell, and all the rest of the texts would in effect be in the True Negative (TN) cell. (TP+TN)/(TP+FP+FN+TN) is a formula for measuring this form of performance, known as Classification Accuracy. This example would have 100% classification accuracy. But such findings are uncommon.
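To make the counting concrete, here is a minimal Python sketch of how the four cells and the Classification Accuracy formula could be computed from two lists of text IDs. The function names and the example IDs are hypothetical, made up for illustration only, and the human's coding is assumed to be the reference against which the AI's list is judged.

```python
def confusion_counts(all_ids, human_ids, ai_ids):
    """Count the four cells, treating the human's coding as the reference."""
    all_ids, human_ids, ai_ids = set(all_ids), set(human_ids), set(ai_ids)
    return {
        "TP": len(human_ids & ai_ids),            # both say it is an instance
        "FP": len(ai_ids - human_ids),            # only the AI says so
        "FN": len(human_ids - ai_ids),            # only the human says so
        "TN": len(all_ids - human_ids - ai_ids),  # neither says so
    }


def classification_accuracy(counts):
    """(TP+TN)/(TP+FP+FN+TN), as given in the text above."""
    total = sum(counts.values())
    return (counts["TP"] + counts["TN"]) / total if total else 0.0


# Hypothetical example: ten texts, IDs 1 to 10 (invented for illustration).
texts = range(1, 11)
human = {1, 2, 3, 4}     # texts the human coded as instances of the theme
ai = {2, 3, 4, 5, 6}     # texts the AI listed as instances of the theme

counts = confusion_counts(texts, human, ai)
accuracy = classification_accuracy(counts)
print(counts, accuracy)
```

In this made-up example the two lists agree on three texts, the AI adds two that the human did not code, and misses one that the human did.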

How would you identify the actual numbers in each of the cells above? This would have to be done by comparing the results returned by the AI to those already identified by the human. Some instances would be agreed upon as the same as those already identified, which we can treat as TPs. Others might strike the human as new and relevant instances that had not previously been identified (FNs). The human’s coding would then be updated so that such instances were now deemed TPs. Others would be seen as inappropriate and non-relevant instances (FPs).

If there were some FPs, what could be done? There are two possibilities:

1. The human could ask themselves how they can edit the AI prompt to improve its identification of these kinds of instances. In doing so they would be learning how to work better with the AI. This seems likely to be a common response, judging from a sample of the rapidly growing prompt literature that I have scanned so far.
2. The text of one or more identified FP instances could be inserted into the body of the prompt, as a source of additional guidance. Then the use of that prompt could be reiterated. In doing so the AI would be adapting its response in the light of human feedback. It would be doing the learning. This is a different kind of approach, which is happening already within GPT4, but probably much less often in the prompts designed by non-specialist human users.

After the second iteration of the prompt the incidence of FPs could be reviewed again. A third iteration could be prepared, including an updated feedback example generated by the AI’s second iteration. The process could be continued. Ideally the classification accuracy of the AI’s work would improve with each iteration. In practice progress may not be so smooth.

A wider perspective

What I have described is an evolutionary search strategy, involving variation, selection and reproduction:

1. Variation: A population of possibly relevant solutions is identified by the first iteration of the prompt. That is, a list of identified instances is generated.
2. Selection: The poorest fitting instance is selected as an example of what is not relevant, and inserted into the original prompt text with that label.
3. Reproduction: The revised prompt is reiterated, to generate a new and improved set of variant instances
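The three steps above can be sketched as a small loop. This is only an illustration under stated assumptions: `ask_model` stands in for whatever sends a prompt to GPT and returns a ranked list of instances (here it is a hypothetical stub, not a real API call), `is_relevant` stands in for the human's judgement, and the last False Positive in the ranked list is assumed to be the poorest fitting one.

```python
from typing import Callable, List, Tuple


def build_prompt(base_prompt: str, rejected: List[str]) -> str:
    """Append the human's False Positive examples to the original prompt."""
    if not rejected:
        return base_prompt + "\nFeedback: No feedback yet"
    examples = "\n".join(f"- {text}" for text in rejected)
    return (base_prompt
            + "\nFeedback: these instances are NOT relevant (False Positives):\n"
            + examples)


def refine(base_prompt: str,
           ask_model: Callable[[str], List[str]],
           is_relevant: Callable[[str], bool],
           max_rounds: int = 3) -> Tuple[List[str], List[str]]:
    rejected: List[str] = []
    instances: List[str] = []
    for _ in range(max_rounds):
        # Variation (and, on later rounds, reproduction): run the prompt.
        instances = ask_model(build_prompt(base_prompt, rejected))
        false_positives = [i for i in instances if not is_relevant(i)]
        if not false_positives:
            break
        # Selection: label the poorest fitting instance as not relevant.
        # Assumption: the list is ranked best first, so the last FP fits worst.
        rejected.append(false_positives[-1])
    return instances, rejected


# Hypothetical stub standing in for the AI: it drops anything listed as feedback.
def fake_model(prompt: str) -> List[str]:
    candidates = ["clash over land", "dispute about water",
                  "harvest festival", "new road opened"]
    return [c for c in candidates if f"- {c}" not in prompt]


def human_judgement(text: str) -> bool:
    return text in {"clash over land", "dispute about water"}


final, rejected = refine("Find instances of the theme 'conflict'.",
                         fake_model, human_judgement)
print(final, rejected)
```

A real model will not respond to feedback as obediently as the stub does, so in practice the number of rounds needed, and whether accuracy improves at all, would have to be checked each time.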

There is a similar process built into the design of Stable Diffusion, which is designed to generate images from text prompts. An initial text prompt generates four images, which are variants of the prompt idea. The user selects one and can then reiterate the use of the prompt, either in its original form or with some further edits. Again four new variants are generated.

There is however one significant difference, at least as I currently understand the Stable Diffusion process. The selection choice focuses on the most desirable variant and its reproduction into new variant forms, not the least desirable variant and its elimination from the population of reproduced variants. This embodies a “survival of the fittest” approach, not the “non-survival of the least-fitting”. In this respect it appears to be a more convergent process – though there do seem to be other user adjustable parameters which can affect how varied new variants are – see the “Guidance scale”.

Implications for the use of GPT4 et al

In GPT4 you can “regenerate” new responses to a given prompt, i.e. create a set of variants. But these are presented as a sequence of opportunities, not as a set that is available all at once, which can be viewed in comparison to each other. One possible way around this is to build into a prompt a request for N versions of the response, which would all be visible at the same time. The best of these could then be investigated, with any further prompts also making a request for N variants of a response. The main challenge with this approach seems likely to be space constraints, the size limit on the word length of the complete response. But this seems likely to increase over time, if computing capacity continues to increase.
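As a rough sketch of the "N versions in one response" idea, the following Python wraps a task in a request for N labelled versions and then splits a returned response back into its versions for side-by-side comparison. The "VERSION k:" label format and both function names are my own assumptions for illustration, and the sample response is invented rather than real GPT4 output.

```python
import re
from typing import List


def request_variants(task: str, n: int) -> str:
    """Wrap a task so that N versions are requested in a single response."""
    return (f"{task}\n\nPlease give {n} different versions of your response. "
            "Start each version on a new line with 'VERSION k:' "
            "where k is its number.")


def split_variants(response: str) -> List[str]:
    """Split a response into its versions using the 'VERSION k:' labels."""
    parts = re.split(r"^VERSION \d+:\s*", response, flags=re.MULTILINE)
    return [p.strip() for p in parts if p.strip()]


prompt = request_variants("Summarise the text below in one sentence.", 3)

# Invented sample response, for illustration only.
sample_response = (
    "VERSION 1: Short summary.\n"
    "VERSION 2: Longer summary\nwith two lines.\n"
    "VERSION 3: Bullet summary."
)
variants = split_variants(sample_response)
for number, text in enumerate(variants, start=1):
    print(number, text)
```

The space constraint mentioned above still applies: the more versions requested, the shorter each one has to be to fit within the response limit.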

4. How AI helped me to help AI improve its performance!

Okay, I know this is a very long post, but do continue reading. After the long ramble of ideas just above, I decided to ask GPT4 about the differences and similarities between genetic algorithm type searches and reinforcement learning approaches. I will copy and paste what happened next, below:

Me: Please describe the similarities and differences between reinforcement learning and genetic algorithms

GPT4: Reinforcement Learning (RL) and Genetic Algorithms (GA) are both optimization techniques that can be used to solve various problems, particularly in artificial intelligence and machine learning. While they share some similarities, they are fundamentally different in their approaches and underlying mechanisms.

Similarities:

Optimization: Both RL and GA are optimization methods aiming to find the best solution for a given problem. They iteratively adjust their solutions based on feedback or evaluation.
Exploration vs. Exploitation: Both techniques must balance exploration (searching new areas of the solution space) and exploitation (refining existing solutions) to achieve optimal results.
Stochastic Processes: Both methods involve randomness in their search procedures. In RL, agents often take random actions to explore the environment, while GA introduces randomness through mutation and crossover.
Population-based: Although not always the case in RL, some algorithms, like Evolutionary Strategies or Genetic Reinforcement Learning, employ a population-based approach similar to GA. These methods maintain multiple candidate solutions and update them over time based on their performance.

Differences:

Representation: RL represents solutions as policies or value functions that map states (or state-action pairs) to actions (or values). GA represents solutions as individuals in a population, typically encoded as strings or data structures, which directly or indirectly represent a solution to the problem.
Learning Mechanism: RL learns through interaction with an environment, where an agent takes actions, observes the resulting states and receives feedback in the form of rewards or penalties. GA is based on the principles of natural evolution, where the fittest individuals are selected to reproduce and generate offspring that inherit their traits.
Evaluation: In RL, the agent’s performance is assessed through the cumulative rewards obtained in an episode or over multiple episodes. In GA, a fitness function evaluates each individual in the population based on how well it solves the problem.
Update Mechanism: RL algorithms update their policies or value functions using methods like Temporal Difference (TD) learning, Q-learning, or Policy Gradient. GA updates its population through genetic operators like selection, crossover, and mutation.
Applicability: RL is particularly suited for sequential decision-making problems in which an agent interacts with an environment. GA is more versatile and can be applied to a wide range of optimization problems, including those without sequential decision-making or clear environment interactions.

In summary, reinforcement learning and genetic algorithms share some similarities in their optimization goals, exploration-exploitation balance, and stochastic processes. However, they differ in terms of representation, learning mechanisms, evaluation, update mechanisms, and applicability to various problem domains.

Me: Given your explanation just provided I will now give you an example of a package of prompt commands that has been used with GPT4 on another occasion. Please read it and tell me which of the two methods (reinforcement learning or GA) it is most similar to, and why. Or, if it is not similar to either please tell me and explain why it is not. The text of the prompt package is as follows:

1. The user wants you to find sections of text, which will be called ‘instances’, that contain this theme: [user enters data here]
2. Here is some extra clarification on the meaning of this theme: [user enters data here]
3. Here is the body of text to be searched: [user enters data here] If there are separate sections of text each of these will have an ID number preceded by the # symbol
4. Please list up to [user enters data here] instances that contain this theme, ranked in order of how well they match the theme. If only a shorter number can be found, display that number of instances only. If more than that number can be found report the number of extra instances that were found, at the end of the requested list of instances.
5. With each instance, display a maximum of [user enters data here] sentences of the relevant text. Include the hashtag number of the source text, at the end of this text, if it has been provided by the user. Follow the text of each sentence with a one sentence explanation for its inclusion.
6. Do not include [user enters data here]
7. After you have provided your first response, the user will provide you with feedback on where you made any errors. Specifically, this will be in the form of the instances which have been listed by you as instances of the theme, which the user thinks are not correct i.e. False Positives (FPs). If no feedback has yet been provided you will see this message ‘No feedback yet’. Please note that this does not mean all your listed instances are already correct.
8. When the user clicks “Submit” please redo your analysis using this feedback to improve your performance. By this we mean a reduction in the number of listed instances which are likely to be classed as False Positives
9. But do not use this feedback if the body of text to be searched, given at point 3 above, has been changed. Treat that as a new exercise.

GPT4: The given prompt package does not directly resemble either Reinforcement Learning (RL) or Genetic Algorithms (GA) in their traditional forms. However, it is somewhat more related to a Reinforcement Learning setting, given the presence of feedback and iterative improvement. Let me explain why:

The problem itself is an instance of text analysis and information extraction, which does not inherently require optimization techniques like RL or GA. Instead, it could be solved using natural language processing techniques, such as named entity recognition, topic modeling, or even machine learning classifiers.

The feedback mechanism mentioned in points 7 and 8 resembles a reinforcement learning setting, where an agent (in this case, the AI model) receives feedback from the user regarding its performance (False Positives in this case). The agent is expected to improve its performance using this feedback, similar to the way RL agents learn from rewards or penalties.

The iterative improvement process, mentioned in points 7, 8, and 9, is also characteristic of a reinforcement learning setup. An RL agent learns from its environment by interacting with it, receiving feedback, and updating its policy or value function accordingly.

However, it is essential to note that the prompt package does not fit the traditional reinforcement learning paradigm perfectly. The problem does not involve sequential decision-making or interaction with an environment where an agent takes actions and transitions to different states.

In conclusion, while the prompt package does not directly correspond to either Reinforcement Learning or Genetic Algorithms, it shares some similarities with Reinforcement Learning due to the presence of feedback and iterative improvement. A customized approach involving natural language processing and some form of feedback-based learning (possibly inspired by RL) would likely be more suitable for solving this problem.

Me: How would you change your analysis, if at all, if I told you that this prompt package is being used for identification of themes in a text, which inherently involves fuzzier category boundaries than is the case with named entity recognition. It could be argued that there is an optimisation process involved here, where the AI is seeking to find the most accurate model of that theme, i.e. one with least false positives and least false negatives, as identified when that model’s results are compared to the user’s own categorizations (which is implicitly happening when they provide feedback on the false positive instances)

AI: Given this new information, it is clearer that the prompt package involves an optimization process aimed at minimizing false positives and false negatives in theme identification. While the problem still does not fit perfectly into traditional Reinforcement Learning (RL) or Genetic Algorithm (GA) frameworks, it may be possible to adapt these methods to address the problem, especially when considering the feedback mechanism and iterative improvement process.

One approach could be to combine Reinforcement Learning with supervised learning techniques for natural language processing. In this case, the AI agent could learn a policy for selecting instances of the theme using features derived from the text. The agent’s performance would be evaluated based on feedback from the user (i.e., false positives). The agent would then iteratively update its policy to optimize its performance in identifying themes and minimizing both false positives and false negatives. [RD: This is a feature currently built into the prompt]

To further improve the model, active learning techniques could be employed, where the AI agent would actively request feedback from the user on instances it is uncertain about. This would enable the agent to learn more effectively from a smaller amount of feedback, refining its understanding of the theme and its fuzzy boundaries. [RD: This feature is not in the prompt but sounds very useful, if not more useful]

Alternatively, Genetic Algorithms could be used to optimize the feature extraction process or the parameters of a machine learning classifier designed to identify themes in the text. The population in the GA could represent different feature extraction methods or classifier parameters, and the fitness function would be based on the model’s ability to identify themes with minimal false positives and false negatives. The GA would then evolve the population through selection, crossover, and mutation to find the optimal solution. [RD: I am not sure how one would set this up, and whether it would be worth the effort]

In summary, while the problem of theme identification with fuzzy category boundaries still does not fit perfectly into traditional Reinforcement Learning or Genetic Algorithm frameworks, it is possible to adapt these methods to address the problem by leveraging the feedback mechanism and the iterative improvement process. Combining Reinforcement Learning with supervised learning techniques, using active learning, or employing Genetic Algorithms to optimize feature extraction or classifier parameters could potentially lead to an effective solution for this problem.

Me: I especially like your second and third paragraph. Regarding this idea “To further improve the model, active learning techniques could be employed, where the AI agent would actively request feedback from the user on instances it is uncertain about. This would enable the agent to learn more effectively from a smaller amount of feedback, refining its understanding of the theme and its fuzzy boundaries.”, could you please suggest some improvements to the wording of the prompt, which I will present again here: [same as already shown above]

AI: To incorporate active learning into the prompt package, you can add a step that allows the AI agent to request feedback from the user on uncertain instances. Here is the revised prompt package with this modification:

1. The user wants you to find sections of text, which will be called ‘instances’, that contain this theme: [user enters data here]
2. Here is some extra clarification on the meaning of this theme: [user enters data here]
3. Here is the body of text to be searched: [user enters data here] If there are separate sections of text, each of these will have an ID number preceded by the # symbol.
4. Please list up to [user enters data here] instances that contain this theme, ranked in order of how well they match the theme. If only a shorter number can be found, display that number of instances only. If more than that number can be found, report the number of extra instances that were found at the end of the requested list of instances.
5. With each instance, display a maximum of [user enters data here] sentences of the relevant text. Include the hashtag number of the source text at the end of this text if it has been provided by the user. Follow the text of each sentence with a one-sentence explanation for its inclusion.
6. Do not include [user enters data here]
7. In addition to the instances listed above, the AI agent may present up to [user enters data here] instances it is uncertain about. The user is requested to provide feedback on whether these instances are relevant to the theme or not.
8. After you have provided your first response, the user will provide you with feedback on where you made any errors. Specifically, this will be in the form of the instances which have been listed by you as instances of the theme, which the user thinks are not correct, i.e., False Positives (FPs). If no feedback has yet been provided, you will see this message ‘No feedback yet’. Please note that this does not mean all your listed instances are already correct.
9. When the user clicks “Submit,” please redo your analysis using this feedback to improve your performance. By this, we mean a reduction in the number of listed instances which are likely to be classed as False Positives.
10. But do not use this feedback if the body of text to be searched, given at point 3 above, has been changed. Treat that as a new exercise.

By adding point 7, the AI agent is given the opportunity to actively request feedback from the user on instances it is uncertain about, facilitating a more effective learning process with a smaller amount of feedback.

Me: well done, thank you!

Simple but not simplistic: Findings from a theory-driven retrospective evaluation of a small projects program

By Larry Dershem, Maya Komakhidze, Mariam Berianidze, in Evaluation and Program Planning 97 (2023) 102267. A link to the article, which will be active for 30 days. After that, contact the authors.

Why I like this evaluation – see below and the lesson I may have learned

Background and purpose: From 2010–2019, the United States Peace Corps Volunteers in Georgia implemented 270 small projects as part of the US Peace Corps/Georgia Small Projects Assistance (SPA) Program. In early 2020, the US Peace Corps/Georgia office commissioned a retrospective evaluation of these projects. The key evaluation questions were: 1) To what degree were SPA Program projects successful in achieving the SPA Program objectives over the ten years, 2) To what extent can the achieved outcomes be attributed to the SPA Program’s interventions, and 3) How can the SPA Program be improved to increase the likelihood of success of future projects.

Methods: Three theory-driven methods were used to answer the evaluation questions. First, a performance rubric was collaboratively developed with SPA Program staff to clearly identify which small projects had achieved intended outcomes and satisfied the SPA Program’s criteria for successful projects. Second, qualitative comparative analysis was used to understand the conditions that led to successful and unsuccessful projects and obtain a causal package of conditions that was conducive to a successful outcome. Third, causal process tracing was used to unpack how and why the conjunction of conditions identified through qualitative comparative analysis were sufficient for a successful outcome.

Findings: Based on the performance rubric, thirty-one percent (82) of small projects were categorized as successful. Using Boolean minimization of a truth table based on cross case analysis of successful projects, a causal package of five conditions was sufficient to produce the likelihood of a successful outcome. Of the five conditions in the causal package, the productive relationship of two conditions was sequential whereas for the remaining three conditions it was simultaneous. Distinctive characteristics explained the remaining successful projects that had only several of the five conditions present from the causal package. A causal package, comprised of the conjunction of two conditions, was sufficient to produce the likelihood of an unsuccessful project.

Conclusions: Despite having modest grant amounts, short implementation periods, and a relatively straightforward intervention logic, success in the SPA Program was uncommon over the ten years because a complex combination of conditions was necessary to achieve success. In contrast, project failure was more frequent and uncomplicated. However, by focusing on the causal package of five conditions during project design and implementation, the success of small projects can be increased.

Why I like this paper:

1. The clear explanation of the basic QCA process
2. The detailed connection made between the conditions being investigated and the background theory of change about the projects being analysed.
3. The section on causal process which investigates alternative sequencing of conditions
4. The within-case descriptions of modal cases (true positives) and the cases which were successful but not covered by the intermediate solution (false negatives), and the contextual background given for each of the conditions being investigated.
5. The investigation of the causes of the absence of the outcome, all too often not given sufficient attention in other studies/evaluations
6. The points made in the summary, especially about the possibility of causal configurations changing over time, and a proposal to include characteristics of the intermediate solution in the project proposal screening stage. It has bugged me for a long time how little attention is given to the theory embodied in project proposal screening processes, let alone to testing the details of these assessments against subsequent outcomes. I know the authors were not proposing this specifically here, but the idea of revising the selection process in the light of new evidence of prior performance is consistent with their proposal and makes a lot of sense
7. The fact that the data set is part of the paper and open to reanalysis by others (see below)

New lessons, at least for me, about satisficing versus optimising

It could be argued that the search for Sufficient conditions (individual or configurations of) is a minimalist ambition, a form of “satisficing” rather than optimising. In the above authors’ analysis their “intermediate solution”, which met the criteria of sufficiency, accounted for 5 of the 12 cases where the expected outcome was present.

A more ambitious and optimising approach would be to seek maximum classification accuracy (=(TP+TN)/(TP+FP+FN+TN)), even if this is at the initial cost of a few False Positives. In my investigation of the same data set there was a single condition (NEED) that was not sufficient, yet accounted for 9 of the same 12 cases. This was at the cost of some inconsistency, i.e. two false positives also being present when this single condition was present (Cases 10 & 25). This solution covered 75% of the cases with expected outcomes, versus 42% with the satisficing solution.
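To make the comparison concrete, here is a minimal Python sketch of the arithmetic, using only the counts quoted above (12 outcome-present cases; 5 true positives for the intermediate solution; 9 true positives and 2 false positives for the single condition). Two things are assumptions rather than facts from the paper: that the sufficient solution has no false positives, and the number of outcome-absent cases, which is a placeholder (`HYPOTHETICAL_N_ABSENT`) because it is not stated here.

```python
N_PRESENT = 12  # outcome-present cases, as reported in the text


def summarise(tp, fp, n_absent):
    """Return coverage, consistency and accuracy for one solution.

    n_absent (the number of outcome-absent cases) is NOT given in the text,
    so any value passed in is a placeholder.
    """
    fn = N_PRESENT - tp   # outcome-present cases the solution misses
    tn = n_absent - fp    # outcome-absent cases the solution correctly excludes
    total = N_PRESENT + n_absent
    return {
        "coverage": tp / (tp + fn),      # share of successes accounted for
        "consistency": tp / (tp + fp),   # share of matching cases that succeeded
        "accuracy": (tp + tn) / total,   # (TP+TN)/(TP+FP+FN+TN)
    }


HYPOTHETICAL_N_ABSENT = 20  # placeholder only, not taken from the data set

# fp=0 for the sufficient (satisficing) solution is an assumption.
satisficing = summarise(tp=5, fp=0, n_absent=HYPOTHETICAL_N_ABSENT)
optimising = summarise(tp=9, fp=2, n_absent=HYPOTHETICAL_N_ABSENT)

for name, result in (("satisficing", satisficing), ("optimising", optimising)):
    print(name, {key: round(value, 2) for key, value in result.items()})
```

Under these assumptions the single-condition solution classifies two more cases correctly (four more true positives, two fewer true negatives), whatever the number of outcome-absent cases turns out to be.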

What might need to be taken into account when considering this choice of whether to prefer optimising over satisficing? One factor to consider is the nature of the performance of the two false positive cases. Was it near the boundary of what would be seen as successful performance, i.e. a near miss? Or was it a really bad fail? Secondly, if it was a really bad fail, in terms of degree of failure, how significant was that for the lives of the people involved? How damaging was it? Thirdly, how avoidable was that failure? In the future is there a clear way in which these types of failure could be avoided, or not?

This argument relates to a point I have made on many occasions elsewhere. Different situations require different concerns about the nature of failure. An investor in the stock market can afford a high proportion of false positives in their predictions, so long as their classification accuracy is above 50% and they have plenty of time available. In the longer term they will be able to recover their losses and make a profit. But a brain surgeon can afford only an absolute minimum of false positives. If their patients die as a result of their wrong interpretation of what is needed, that life is unrecoverable, and no number of subsequent successful operations will make a difference. At the very most, they will have learnt how to avoid such catastrophic mistakes in the future.

So my argument here is let’s not be too satisfied with satisficing solutions. Let’s make sure that we have at the very least always tried to find the optimal solution (defined in terms of highest classification accuracy) and then looked closely at the extent to which that optimal solution can be afforded.

PS 1: Where there are “imbalanced classes”, i.e. a high proportion of outcome-absent cases (or vice versa), an alternate measure known as “balanced accuracy” is preferred. Balanced accuracy = ((TP/(TP+FN)) + (TN/(TN+FP)))/2.
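A small Python sketch of why this matters, using invented counts purely for illustration (they are not from the SPA data set): with 5 outcome-present and 95 outcome-absent cases, a "model" that predicts absence for every case looks excellent on plain accuracy and no better than a coin toss on balanced accuracy.

```python
def accuracy(tp, fp, fn, tn):
    """Plain classification accuracy: (TP+TN)/(TP+FP+FN+TN)."""
    return (tp + tn) / (tp + fp + fn + tn)


def balanced_accuracy(tp, fp, fn, tn):
    """Mean of the true positive rate and the true negative rate."""
    true_positive_rate = tp / (tp + fn)
    true_negative_rate = tn / (tn + fp)
    return (true_positive_rate + true_negative_rate) / 2


# Hypothetical, deliberately imbalanced counts: every case predicted "absent".
always_absent = {"tp": 0, "fp": 0, "fn": 5, "tn": 95}

plain = accuracy(**always_absent)
balanced = balanced_accuracy(**always_absent)
print(f"accuracy={plain:.2f}, balanced accuracy={balanced:.2f}")
```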

PS 2: If you have any examples of QCA studies that have compared sufficient solutions with non-sufficient but more (classification) accurate solutions, please let me know. They may be more common than I am assuming.

Using ChatGPT as a tool for the analysis of text data

Warning: This post is a bit of a ramble, with new sections being added to the end, as further thoughts come to mind.

1. Introduction to ChatGPT, by ChatGPT
2. See also this newer page Exploring the text analytic capacities of GPT4
  
  Introduction to ChatGPT, by ChatGPT
  
  I asked ChatGPT this question: “Describe ChatGPT in 200 words, for a general, non-specialist reader”
  
  The response: “ChatGPT is a language model developed by OpenAI. It is based on a deep learning technique called transformer, which allows it to generate human-like text. It has been trained on a large dataset of text from the internet, and can be fine-tuned for specific tasks such as language translation, text summarization, and text generation. ChatGPT is capable of understanding and generating text in multiple languages, and can respond to a wide range of questions, prompts, and commands. It can also generate creative content such as stories and poetry. The model can be used for various applications like virtual assistants, chatbots, automated writing and more. However, it is important to note that while ChatGPT is highly advanced, it is not a human and may not always understand context or nuances in language and may produce errors or biases present in the training data”.
  
  Here is an introductory video providing a beginner’s guide to how to access and use ChatGPT: https://www.youtube.com/watch?v=JTxsNm9IdYU
  
  Using ChatGPT to analyse bodies of text
  
  Why: This is my current interest, where I think ChatGPT is already useful in its current form
  
  The challenge: Evaluators often have to deal with large volumes of text data, including
  - official documents describing policies and programmes,
  - records of individual interviews and group discussions.
  Manual analysis of this material can be very time consuming. In recent years a number of different software packages have been developed which are useful for different forms of content analysis. These are generally described as text analytics, text mining and Natural Language Processing (NLP) methods. I have experimented with some of these methods, including clustering tools like Topic Modelling, sentiment analysis methods, and noun and key word extraction tools.
  
  From my limited experience to date, ChatGPT seems likely to leave many of these tools behind, primarily on criteria such as flexibility and usability. I am less certain on criteria such as transparency of process and replicability of results. I need to give these more of my attention.
  
  How it works – from a user perspective
  
  Here below is the user interface, seen after you have logged on. You can see the prompt I have written at the top of the white section. Then underneath is the ChatGPT response. I then have two options.
  - To click on “Regenerate Response” to create an alternative body of text to the one already shown. This can be done multiple times, until new variant responses are no longer generated. It is important to use this option because in your specific context one response may be more suitable than others, and ChatGPT won’t know the details of your context, unless it is described in the prompt
  - To create a new prompt, such as “Simplify this down to 200 words, using less technical language”. The dialogic process of writing prompts, reading results, writing prompts and reading results can go on as long as needed. A point to note here is that ChatGPT remembers the whole sequence of discussion, as context for the most current prompt. But you can start a new chat at any point, and when you do so the old one will remain listed in the left side panel. But it will no longer be part of ChatGPT’s current memory, when responding to the current prompt.
  There is a similarity between these two functions and March’s (1991) distinction between two complementary approaches to learning: Exploration and Exploitation, with regeneration being more exploratory and refined prompts being more exploitative.
  
  But bear in mind that ChatGPT is using data that was available up to 2021. It does not (yet) have real time access to data on the internet. When it does, that will be another major step forward. Fasten your seat belts!
  
  The design of prompts
  
  This is the key to the whole process. Careful design of prompts will deliver more rewards. The more clearly specified your request, the more likely you will see results which are useful.
  
  I will now list some of the prompts, and kinds of prompts, I have experimented with. These have all been applied to paragraphs of text generated by a ParEvo exercise (which I can’t quote here for privacy reasons).
    - Text summarisation
      - Summarize the following text in 300 words or less
      - Write a newspaper headline for the events described in each of the three paragraphs
    - Differentiation of texts
      - Identify the most noticeable differences between the events described in the following two paragraphs of text
      - Identify three differences between the two paragraphs of text
      - Pile sorting: Sort these three paragraphs of text into two piles of paragraphs, and describe what you think is the most significant difference between the two sets of paragraphs, in terms of the events they are describing.
    - Evaluation of content on predefined criteria
      - All three paragraphs describe imagined futures. Rank these three paragraphs in terms of their optimism, and explain why they have been ranked this way
      - All three paragraphs already provided above describe imagined futures. Rank these three paragraphs in terms of their realism, i.e. how likely it is that the events in the paragraphs could actually happen. Then explain why they have been ranked this way
    - Evaluation of content on unspecified criteria
      - For each of the three paragraphs provided above, list 5 adjectives that best describe the events in those paragraphs
    - Actor extraction
      - Describe the types of actors mentioned in each of the two piles. By actors I mean people, groups, organisations and states
      - Using three bullet points, list the people, groups, organisations and countries named in each of these three paragraphs of text.
    - Relationship extraction
      - Using the list of actors already generated, identify alliances (e.g. commonalities of interests) that exist between any of these actors. List these in order of the strength of evidence that an alliance exists
    - Network mapping
      - Create an affiliation matrix based on the above data about actors and their relationships, where actors are listed row by row, and the individual alliances are listed by column. The latter can be numbered or named. The cell contents will be 0 or 1, where 1 represents a specific actor’s membership of a specific alliance and 0 represents non-membership
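For readers unfamiliar with the affiliation matrix requested in the last prompt, here is a minimal Python sketch of the target structure. The actor and alliance names are invented placeholders, not results from the ParEvo texts, and the CSV layout is only one plausible format for passing the matrix on to other software.

```python
import csv
import io

# Hypothetical alliances and members, standing in for what ChatGPT might extract.
memberships = {
    "Alliance 1": ["Actor A", "Actor B"],
    "Alliance 2": ["Actor B", "Actor C"],
}


def affiliation_matrix(memberships):
    """Rows are actors, columns are alliances, cells are 1 (member) or 0."""
    alliances = list(memberships)
    actors = sorted({actor for members in memberships.values() for actor in members})
    rows = {
        actor: [1 if actor in memberships[alliance] else 0 for alliance in alliances]
        for actor in actors
    }
    return alliances, rows


def to_csv(alliances, rows):
    """Write the matrix as CSV text, with the actor name in the first column."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["actor"] + alliances)
    for actor, cells in rows.items():
        writer.writerow([actor] + cells)
    return buffer.getvalue()


alliances, rows = affiliation_matrix(memberships)
print(to_csv(alliances, rows))
```

A matrix built this way gives something to check ChatGPT's own version against, cell by cell.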
3. 2023 03 03 Postscript. Check out Pickaxe. Here you can design a simple (no code required) app that you and others can use to carry out specific AI-based inquiries, such as sentiment analysis or actor extraction. Try my Pickaxe AI Pile Sorting Exercise Vs 2
  
  Extracting causal theories – Steve Powell’s explorations
  
  Steve Powell, a colleague of mine (who developed the Causal Map app), has been testing the following prompts, to find out whether ChatGPT can usefully identify causal connections between events in a text:
    - Please look at this sentence and identify the cause and the effect.
    - Great, now can you just identify cause and effect pairs in this sentence
    - Please do that again but with only the cause and effect pairs you are most sure about
    - Thanks, can you do that again so that, when a causal factor (cause or effect) is part of a longer chain, you use the same name for it when you report it as effect in one pair and as cause in another pair?
  I have not quoted the original text that was presented for analysis, nor all the results. But here is the response to the last prompt:
    - Sure, here are the cause and effect pairs in the text that I am most sure about:
      Cause: “Organisation 1” Effect: “Changed behaviors in caring for health”
      Cause: “Changed behaviors in caring for health” Effect: “Considerably healthy”
      Cause: “Prohibitions on going to the hospital” Effect: “Going to church”
      Cause: “Use of mosquito nets” Effect: “Decrease in getting sick”
      Cause: “Changed behaviors in caring for health” Effect: “Taking three baths per day”
      Cause: “Changed behaviors in caring for health” Effect: “Washing hands after visiting the washrooms”
      Note that I’ve used the same name “Changed behaviors in caring for health” for the cause in the first and fifth pair and as the effect in the second pair.
  Steve also notes that… another important challenge I forgot to mention is to get it to abstract a little, so that it uses the same labels for similar concepts across paragraphs and cases. This is a task that I also felt needs further exploration through better designed prompts, to see if ChatGPT can do concept abstraction or not.
  
  He also reports that he will work on a prompt to get it to format its output for easier upload into Causal Map. That could be very useful for many people, including myself. I have been doing the same kind of prompt development with text descriptions of network relationships to be converted into a data format that can be imported into social network analysis software – see further above.
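Until such a prompt exists, the response quoted above is regular enough to be reformatted after the fact. The following Python sketch parses the `Cause: “…” Effect: “…”` lines into an edge list, lists the factors that appear as both cause and effect (the links in a longer chain), and writes a two-column CSV. The column names `cause` and `effect` are my own placeholders; this is not Causal Map's actual import format, which I have not checked.

```python
import csv
import io
import re

# The cause and effect pairs from the ChatGPT response quoted above.
RESPONSE = """\
Cause: “Organisation 1” Effect: “Changed behaviors in caring for health”
Cause: “Changed behaviors in caring for health” Effect: “Considerably healthy”
Cause: “Prohibitions on going to the hospital” Effect: “Going to church”
Cause: “Use of mosquito nets” Effect: “Decrease in getting sick”
Cause: “Changed behaviors in caring for health” Effect: “Taking three baths per day”
Cause: “Changed behaviors in caring for health” Effect: “Washing hands after visiting the washrooms”
"""

# Accept either curly or straight double quotes around each label.
PAIR = re.compile(r'Cause:\s*[“"](.+?)[”"]\s*Effect:\s*[“"](.+?)[”"]')


def extract_pairs(text):
    """Return a list of (cause, effect) tuples found in the text."""
    return PAIR.findall(text)


def chain_links(pairs):
    """Factors named both as a cause and as an effect."""
    causes = {cause for cause, _ in pairs}
    effects = {effect for _, effect in pairs}
    return sorted(causes & effects)


def to_csv(pairs):
    """Edge list as CSV text; the header names are placeholders."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["cause", "effect"])
    writer.writerows(pairs)
    return buffer.getvalue()


pairs = extract_pairs(RESPONSE)
links = chain_links(pairs)
print(links)
print(to_csv(pairs))
```

Listing the chain links is also a quick way to check whether ChatGPT really did reuse the same label for the same factor, as the last prompt asked.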
  
  PS: Read more of Steve’s thoughts and experiments with ChatGPT and causality on his blog here: Causal Map: Making Connections
  
  Other applications that may be useful to evaluators
  
  The one feature that seems to have received the least attention in the generalist media is how ChatGPT and its related product, OpenAI Codex, could have a major impact on software development. “Proficient in more than a dozen programming languages, Codex can now interpret simple commands in natural language and execute them on the user’s behalf”, where executing them means writing relevant functional computer code.
  
  In response to a query I made, ChatGPT replied:
  
  “As an AI language model, I can help users with coding suggestions and provide information on a wide range of programming languages. Some of the most common programming languages that I can assist with include:
  
  Python: a high-level, interpreted language known for its simplicity and ease of use, often used in data science and machine learning.
  
  Java: a popular object-oriented language used for developing large-scale applications.
  
  JavaScript: a client-side scripting language used for creating interactive web pages and web applications.
  
  C++: a powerful, high-performance language used for building operating systems, games, and other resource-intensive applications.
  
  Ruby: a dynamic, object-oriented language used for web development, especially with the Ruby on Rails framework.
  
  PHP: a server-side scripting language used for building dynamic websites and web applications.
  
  Swift: a modern programming language developed by Apple for iOS, macOS, watchOS, and tvOS development.
  
  C#: an object-oriented language developed by Microsoft, used for developing Windows desktop applications, games, and other applications.
  
  These are just a few examples of the many programming languages I can provide assistance with, and I can also provide guidance on other languages based on the user’s needs and preferences.”
  
  What it does not mention is that it can also provide advice on relevant Excel formulas, macros and VBA code, a capacity likely to be relevant to a wider group of evaluators.
  
  One point to note about this capacity is that testing the answers is straightforward in most cases. The code either works or does not, and if it does work it should be easy enough to identify whether the results are correct or not.
  
  There are a few videos available online that explain what can be done by combining use of ChatGPT and Excel.
  
  Quotes of note..
  
  “As the novelty of that surprise wears off, it is becoming clear that ChatGPT is less a magical wish-granting machine than an interpretive sparring partner”
  
  “Crypto was money without utility,” he argued, while tools such as ChatGPT are, “for now, utility without money.”
  
  “It’s going to be fascinating to see how people incorporate this second brain into their job,”
  
  “…you’re curious how GPT and other AI tools are going to change “the way people talk about talking, write about writing, and think about thinking.”
  
  “If the old line was “Learn to code,” what if the new line is “Learn to prompt”? Learn how to write the most clever and helpful prompts in such a way that gives you results that are actually useful.”
  
  “Your job won’t be replaced by AI but it may be replaced by someone who knows how to use AI better than you…”
  
  In summary…as of 22 February 2023
  
  Seeing ChatGPT as “…an interpretive sparring partner…” is a good approximate description. Another is that working with ChatGPT is (as others have already said) like working with an intern that has at least a Masters degree (or more) in every subject you need to be working with. The trouble is that this intern is not above bluffing and bullshitting when it can’t find anything better (i.e. more informed/detailed/accurate) to say. So you need to get past the understandable “Wow” reaction to its apparent intelligence and creativity, and lift your own game to the level where you are ready and able to critically review what ChatGPT has responded with. Then, through further dialogue with ChatGPT, get it to know when some of its answers are not acceptable and, through further feedback, to improve on its own performance thereafter.
  
  Which will of course mean you will then (again) need to get past any (additional) “Wow” reaction to its (additional) apparent intelligence and creativity, and lift your own game to another (additional) level where you are ready and able to critically review what ChatGPT has responded with…. :-) The ball comes back into your court very quickly. And it does not show evidence of tiring, no matter how long the dialogue continues.
  
  Lame criticisms – a miscellany with some responses
  1. But the data its responses are based on is biased. Yes, true. Welcome to the world. All of us see the world through a biased sample of the world and what it has to offer. With AI like ChatGPT we have an opportunity, not yet realised, to be able to see the nature of that bias…what kind of data has been included and what kind has been excluded.
  2. But it gets things wrong. Yes, true. Welcome to the world. So do we humans. When this seems to be happening we often then ask questions, and explore different approaches. ChatGPT builds in four options of this kind, as explained above: 1. Ask follow-up queries, 2. Regenerate a response, 3. Channel feedback via the thumbs up/down, 4. Start a new chat. The clue is in the name “chat”, i.e. dialogue, to use a fancier name.
  3. It is/is not sentient/conscious. I am just not sure if this is a helpful claim or debate. All we have access to is its behavior, not interior states, whatever shape or form they may take, if any. Again, perhaps, welcome to the world, of humans and other beings. We do know that AI, like ChatGPT, can be asked to respond in the style of x type of person or entity. As we also are, when we take on different social roles. In future, when its database is updated to include post November 2022 information, that will include data about itself and how various humans have reacted to and think about ChatGPT. It will have a form of self-knowledge, acquired via others. Like aspects of ourselves. But probably a lot more diverse and contradictory than the social feedback that individuals generally get. How that will affect its responses to human prompts thereafter, if at all, I have no idea. But it does take me into the realm of values or meta-rules, some of which it must already have, installed by its human designers, in order to prevent presently foreseeable harms. This takes us into the large and growing area of discussion around the alignment problem (Christian, 2020).
  PS: There seem to be significant current limitations to ChatGPT’s ability to build up self-knowledge from user responses. Each time a new chat is started no memory is retained of the contents of previous chats (which include users’ responses). Even within a current chat there appears to be a limit on how many prior prompts and associated responses (and the information they all contain) can be accessed by ChatGPT.

PS 2023 02 28 A new article on how to communicate with ChatGPT and the like: Tech’s hottest new job: AI whisperer. No coding required. Washington Post 25/02/2023

Free Coursera online course: Qualitative Comparative Analysis (QCA)

Highly recommended! A well organised and very clear and systematic exposition. Available at: https://www.coursera.org/learn/qualitative-comparative-analysis

About this Course

Welcome to this massive open online course (MOOC) about Qualitative Comparative Analysis (QCA). Please read the points below before you start the course. This will help you prepare well for the course and attend it properly. It will also help you determine if the course offers the knowledge and skills you are looking for.

What can you do with QCA?

QCA is a comparative method that is mainly used in the social sciences for the assessment of cause-effect relations (i.e. causation).
QCA is relevant for researchers who normally work with qualitative methods and are looking for a more systematic way of comparing and assessing cases.
QCA is also useful for quantitative researchers who like to assess alternative (more complex) aspects of causation, such as how factors work together in producing an effect.
QCA can be used for the analysis of cases on all levels: macro (e.g. countries), meso (e.g. organizations) and micro (e.g. individuals).
QCA is mostly used for research of small- and medium-sized samples and populations (10-100 cases), but it can also be used for larger groups. Ideally, the number of cases is at least 10.
QCA cannot be used if you are doing an in-depth study of one case

What will you learn in this course?

The course is designed for people who have no or little experience with QCA.
After the course you will understand the methodological foundations of QCA.
After the course you will know how to conduct a basic QCA study by yourself.

How is this course organized?

The MOOC takes five weeks. The specific learning objectives and activities per week are mentioned in appendix A of the course guide. Please find the course guide under Resources in the main menu.
The learning objectives with regard to understanding the foundations of QCA and practically conducting a QCA study are pursued throughout the course. However, week 1 focuses more on the general analytic foundations, and weeks 2 to 5 are more about the practical aspects of a QCA study.
The activities of the course include watching the videos, consulting supplementary material where necessary, and doing assignments. The activities should be done in that order: first watch the videos; then consult supplementary material (if desired) for more details and examples; then do the assignments.
There are 10 assignments. Appendix A in the course guide states the estimated time needed to make the assignments and how the assignments are graded. Only assignments 1 to 6 and 8 are mandatory. These 7 mandatory assignments must be completed successfully to pass the course.
Making the assignments successfully is one condition for receiving a course certificate. Further information about receiving a course certificate can be found here: https://learner.coursera.help/hc/en-us/articles/209819053-Get-a-Course-Certificate

About the supplementary material

The course can be followed by watching the videos. It is not absolutely necessary yet recommended to study the supplementary reading material (as mentioned in the course guide) for further details and examples. Further, because some of the covered topics are quite technical (particularly topics in weeks 3 and 4 of the course), we provide several worked examples that supplement the videos by offering more specific illustrations and explanation. These worked examples can be found under Resources in the main menu.
Note that the supplementary readings are mostly not freely available. Books have to be bought or might be available in a university library; journal publications have to be ordered online or are accessible via a university license.
The textbook by Schneider and Wagemann (2012) functions as the primary reference for further information on the topics that are covered in the MOOC. Appendix A in the course guide mentions which chapters in that book can be consulted for which week of the course.
The publication by Schneider and Wagemann (2012) is comprehensive and detailed, and covers almost all topics discussed in the MOOC. However, for further study, appendix A in the course guide also mentions some additional supplementary literature.
Please find the full list of references for all citations (mentioned in this course guide, in the MOOC, and in the assignments) in appendix B of the course guide.

Fadi Hirzalla

Assistant Professor / Senior Lecturer

Erasmus Graduate School of Social Sciences (EGSH), Erasmus University Rotterdam

Story Completion exercises: An idea worth borrowing?

Yesterday, Theo Nabben, a friend and colleague of mine and an MSC trainer, sent me a link to a webpage full of information about a method called Story Completion: https://www.psych.auckland.ac.nz/en/about/story-completion.html

Background

Story Completion is a qualitative research method first developed in the field of psychology but subsequently taken up primarily by feminist researchers. It was originally of interest as a method of enquiring about psychological meanings particularly those that people could not or did not want to explicitly communicate. However, it was subsequently re-conceptualised as a valuable method of accessing and investigating social discourses. These two different perspectives have been described as essentialist versus social constructionist.

Story completion is a useful tool for accessing meaning-making around a particular topic of interest. It is particularly useful for exploring (dominant) assumptions about a topic. This type of research can be framed as exploring either perceptions and understandings or social/discursive constructions of a topic.

This 2019 paper by Clarke et al. provides a good overview and is my main source of comments and explanations on this page.

How It Works

The researcher provides the participant with the beginning of the story, called the stem. Typically this is one sentence long but can be longer. For example…

“Catherine has decided that she needs to lose weight. Full of enthusiasm, and in order to prevent her from changing her mind, she is telling her friends in the pub about her plans and motivations.”

The participant is then asked by the researcher to extend that story, by explaining – usually in writing – what happens next. Typically this storyline is about a third person (e.g. a Catherine), not about the participant themselves.

In practice, this form of enquiry can take various forms as suggested by Figure 1 below.

Figure 1: Four different versions of a Story Completion inquiry

Analysis of responses can be done in two ways: (a) horizontally – comparisons across respondents, (b) vertically – changes over time within the narratives.

Here is a good how-to-do-it introduction to Story Completion: http://blogs.brighton.ac.uk/sasspsychlab/2017/10/15/story-completion/

And here is an annotated bibliography that looks very useful: https://cdn.auckland.ac.nz/assets/psych/about/our-research/documents/Resources%20for%20qualitative%20story%20completion%20(July%202019).pdf

How it could be useful for monitoring and evaluation purposes

Story Completion exercises could be a good way of identifying different stakeholders’ views of the possible consequences of an intervention. Variations in the text of the story stem could allow the exploration of consequences that might vary across gender or other social differences. Variations in the respondents being interviewed would allow exploration of differences in perspective on how a specific intervention might have consequences.

Of course, these responses will need interpretation and would benefit from further questioning. Participatory processes could be designed to enable this type of follow-up, rather than simply relying on third parties (e.g. researchers), as informed as they might be.

Variations could be developed where literacy is likely to be a problem. Voice recordings could be made instead, and small groups could be encouraged to collectively develop a response to the stem. There would seem to be plenty of room for creativity here.

Postscript

There is a considerable overlap between the Story Completion method and how the ParEvo participatory scenario planning process works.

The commonality of the two methods is that they are both narrative-based. They both start with a story stem/seed designed by the researcher/facilitator. Then the respondent/participants add an extension onto that story stem describing what happens next. Both methods are future-orientated and largely other-orientated, in other words not about the storyteller themselves. And both processes pay quite a lot of attention, after the narratives are developed, to how those narratives can be analysed and compared.

Now for some key differences. With ParEvo the process of narrative development involves multiple people rather than one person. This means multiple alternative storylines can develop, some of which die out, some of which continue, and some of which branch into multiple variants. The other difference, already implied, is that the ParEvo process goes through multiple iterations, whereas the Story Completion process has only one iteration. So in the case of ParEvo the storylines accumulate multiple segments of text, with a new segment added with each iteration. Content analysis can be carried out with the results of both Story Completion and ParEvo exercises. But in the case of ParEvo it is also possible to analyse the structure of people’s participation and how it relates to the contents of the storylines.

Participatory approaches to the development of a Theory of Change: Beginnings of a list

Background

There have been quite a few generic guidance documents written on the use of Theories of Change. These are not the main focus of this list. Nevertheless, here are those I have come across:

Klein, M (2018) Theory of Change Quality Audit, at https://changeroo.com/toc-academy/posts/expert-toc-quality-audit-academy

UNDG (2017) Theory of Change – UNDAF Companion Guidance, UNDG. https://undg.org/wp-content/uploads/2017/06/Theory-of-Change-UNDAF-Companion-Pieces.pdf

Van Es M, Guijt I and Vogel I (2015) Theory of Change Thinking in Practice. HIVOS. http://www.theoryofchange.nl/sites/default/files/resource/hivos_Theory of Change_guidelines_final_nov_2015.pdf.

Valters C (2015) Theories of Change: Time for a radical approach to learning in development. ODI. https://www.odi.org/sites/odi.org.uk/files/odi-assets/publications-opinion-files/9835.pdf.

Rogers P (2014) Theory of Change. Methodological Briefs Impact Evaluation No. 2. UNICEF. http://devinfolive.info/impact_evaluation/img/downloads/Theory_of_Change_ENG.pdf.

Vogel I (2012) Review of the use of ‘Theory of Change’ in international development. Review Report for DFID. http://www.dfid.gov.uk/r4d/pdf/outputs/mis_spc/DFID_Theory of Change_Review_VogelV7.pdf

Vogel I (2012) ESPA guide to working with Theory of Change for research projects. LTS/ITAD for ESPA. http://www.espa.ac.uk/files/espa/ESPA-Theory-of-Change-Manual-FINAL.pdf

Stein, D., & Valters, C. (2012). Understanding Theory of Change in International Development. The Asia Foundation. http://www2.lse.ac.uk/internationalDevelopment/research/JSRP/downloads/JSRP1.SteinValters.pdf

James, C. (2011, September). Theory of Change Review. A Report Commissioned by Comic Relief. http://www.theoryofchange.org/pdf/James_Theory of Change.pdf


Participatory approaches to ToC construction

Burbaugh B, Seibel M and Archibald T (2017) Using a Participatory Approach to Investigate a Leadership Program’s Theory of Change. Journal of Leadership Education 16(1): 192–205.

Katherine Austin-Evelyn and Erin Williams (2016) Mapping Change for Girls, One Post-It Note at a Time. Blog posting

Breuer E, Lee L, De Silva M, et al. (2016) Using theory of change to design and evaluate public health interventions: a systematic review. Implementation science: IS 11: 63. DOI: 10.1186/s13012-016-0422-6. Recommended

Breuer E, De Silva MJ, Fekadu A, et al. (2014) Using workshops to develop theories of change in five low and middle-income countries: lessons from the programme for improving mental health care (PRIME). International Journal of Mental Health Systems 8: 15. DOI: 10.1186/1752-4458-8-15.

De Silva MJ, Breuer E, Lee L, et al. (2014) Theory of Change: a theory-driven approach to enhance the Medical Research Council’s framework for complex interventions. Trials 15: 267. DOI: 10.1186/1745-6215-15-267.

Participatory Modelling: Beginnings of a list

What is Participatory Modelling?

Gray et al (2018) “The field of PM lies at the intersection of participatory approaches to planning, computational modeling, and environmental modeling”

Wikipedia: “Participatory modeling is a purposeful learning process for action that engages the implicit and explicit knowledge of stakeholders to create formalized and shared representation(s) of reality. In this process, the participants co-formulate the problem and use modeling practices to aid in the description, solution, and decision-making actions of the group. Participatory modeling is often used in environmental and resource management contexts. It can be described as engaging non-scientists in the scientific process. The participants structure the problem, describe the system, create a computer model of the system, use the model to test policy interventions, and propose one or more solutions. Participatory modeling is often used in natural resources management, such as forests or water.

There are numerous benefits from this type of modeling, including a high degree of ownership and motivation towards change for the people involved in the modeling process. There are two approaches which provide highly different goals for the modeling; continuous modeling and conference modeling.”

Recent references

Olazabal M, Neumann MB, Foudi S, et al. (n.d.) Transparency and Reproducibility in Participatory Systems Modelling: the Case of Fuzzy Cognitive Mapping. Systems Research and Behavioral Science 0(0). DOI: 10.1002/sres.2519.
Gray S, Voinov A, Paolisso M, et al. (2018) Purpose, processes, partnerships, and products: four Ps to advance participatory socio-environmental modeling. Ecological Applications 28(1): 46–61. DOI: 10.1002/eap.1627.
Hedelin B, Evers M, Alkan-Olsson J, et al. (2017) Participatory modelling for sustainable development: Key issues derived from five cases of natural resource and disaster risk management. Environmental Science & Policy 76: 185–196. DOI: 10.1016/j.envsci.2017.07.001.
Basco-Carrera L, Warren A, van Beek E, et al. (2017) Collaborative modelling or participatory modeling? A framework for water resources management. Environmental Modelling & Software 91: 95–110. DOI: 10.1016/j.envsoft.2017.01.014.
Eker S, Zimmermann N, Carnohan S, et al. (2017) Participatory system dynamics modelling for housing, energy and wellbeing interactions. Building Research & Information 0(0): 1–17. DOI: 10.1080/09613218.2017.1362919.
Voinov A, Kolagani N, McCall MK, et al. (2016) Modelling with stakeholders – Next generation. Environmental Modelling and Software 77: 196–220. DOI: 10.1016/j.envsoft.2015.11.016.
Voinov AA (2010) Participatory Modeling: What, Why, How? University of Twente. Available at: http://www2.econ.iastate.edu/tesfatsi/ParticipatoryModelingWhatWhyHow.AVoinov.March2010.pdf

See also Will Allen’s list of papers on participatory modelling

Dealing with missing data: A list

In this post “missing data” does not mean absence of whole categories of data, which is a common enough problem, but missing data values within a given data set.

While this is a common problem in almost all spheres of research/evaluation, it seems particularly common in more qualitative and participatory inquiry, where the same questions may not be asked of all participants/respondents. It is also likely to be a problem when data is extracted from documentary sources produced by different parties e.g. project completion reports.

Some types of strategies (from Analytics Vidhya):

Deletion:
1. Listwise deletion: All cases with any missing data are excluded from the analysis.
2. Pairwise deletion: An analysis is carried out with all cases in which the variables of interest are present. The sub-set of cases used will vary according to the sub-set of variables which are the focus of each analysis.
Substitution:
1. Mean/Mode/Median imputation: replacing the missing data for a given attribute by the mean or median (quantitative attribute) or mode (qualitative attribute) of all known values of that variable. Two variants:
  1. Generalized: calculated across all cases
  2. Similar case: calculated separately for different sub-groups e.g. men versus women
2. K Nearest Neighbour (KNN) imputation: The missing values of an attribute are imputed using the values found in the k cases that are most similar on their other attributes (where k = the number of most similar cases used).
3. Prediction model: Using the sub-set of cases with no missing values, a model is developed that best predicts the value of the attribute of interest. This is then applied to predict the missing values in the sub-set of cases with the missing values. Another variant, for continuous data:
  1. Regression substitution: Using multiple-regression analysis to estimate a missing value.
Error estimation (tbc)
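
For readers who find it easier to see these strategies in action, here is a minimal sketch in plain Python (standard library only). It is an illustration under stated assumptions rather than a recommended implementation: the data set is invented, the function names are hypothetical, the similarity measure used for KNN is a simple Euclidean distance on numeric attributes, and the prediction model strategy is not covered.

```python
from statistics import mean

# Invented toy data set: None marks a missing value.
cases = [
    {"id": 1, "sex": "F", "age": 30, "score": 7.0},
    {"id": 2, "sex": "F", "age": 32, "score": None},
    {"id": 3, "sex": "M", "age": 45, "score": 4.0},
    {"id": 4, "sex": "M", "age": 47, "score": 5.0},
    {"id": 5, "sex": "F", "age": None, "score": 9.0},
]


def listwise_delete(rows):
    """Listwise deletion: keep only cases with no missing values at all."""
    return [r for r in rows if all(v is not None for v in r.values())]


def pairwise_select(rows, variables):
    """Pairwise deletion: keep cases complete on the variables used in one analysis."""
    return [r for r in rows if all(r[v] is not None for v in variables)]


def mean_impute(rows, variable, group_by=None):
    """Mean imputation: overall mean ("generalized"), or the sub-group mean
    ("similar case") when group_by is given. Returns new rows; input is not changed."""
    filled = []
    for r in rows:
        if r[variable] is None:
            pool = rows if group_by is None else [x for x in rows if x[group_by] == r[group_by]]
            known = [x[variable] for x in pool if x[variable] is not None]
            r = {**r, variable: mean(known)}
        filled.append(r)
    return filled


def knn_impute(rows, variable, predictors, k=2):
    """KNN imputation: use the mean of the k most similar complete cases,
    where similarity is Euclidean distance on the numeric predictors (an assumption)."""
    donors = pairwise_select(rows, predictors + [variable])
    filled = []
    for r in rows:
        if r[variable] is None and all(r[p] is not None for p in predictors):
            def distance(d, target=r):
                return sum((d[p] - target[p]) ** 2 for p in predictors) ** 0.5
            nearest = sorted(donors, key=distance)[:k]
            r = {**r, variable: mean(d[variable] for d in nearest)}
        filled.append(r)
    return filled


complete_ids = [r["id"] for r in listwise_delete(cases)]
sex_score_ids = [r["id"] for r in pairwise_select(cases, ["sex", "score"])]
overall = mean_impute(cases, "score")[1]["score"]
by_sex = mean_impute(cases, "score", group_by="sex")[1]["score"]
by_knn = knn_impute(cases, "score", ["age"], k=2)[1]["score"]
```

With this toy data, case 2 has no score. The three substitution variants each fill it with a different value, because each one borrows from a different set of other cases: all cases, other women only, or the two cases closest in age. This is a useful reminder that the choice of strategy is itself an analytic decision that should be reported.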

References (please help me extend this list)

Note: I would like this list to focus on easily usable references i.e. those not requiring substantial knowledge of statistics and/or the subject of missing data

Gene Shackman’s list of 23+ references on missing data (updated 12/11/2016)
Wikipedia entry on Missing Data (2016)
www.missingdata.org.uk (2016) London School of Hygiene and Tropical Medicine
7 Ways To Handle Missing Data (2015) Jeff Sauro
Cochrane Collaboration (2011): General principles for dealing with missing data
Statistical analysis with missing data. (2002) By Roderick J. A. Little, Donald B. Rubin
