Considerations and Practical Applications for Using Artificial Intelligence (AI) in Evaluations

Cekova, D., Corsetti L., Ferretti, S. and Vaca, S. (2025). Considerations and Practical Applications for Using Artificial Intelligence (AI) in Evaluations. Technical Note. CGIAR Independent Advisory and Evaluation Service (IAES). Rome: IAES Evaluation Function. https://iaes.cgiar.org/evaluation

Executive Summary

The CGIAR 2030 Research and Innovation Strategy commits to organizational change through seven ways of working, including “Making the digital revolution central to our way of working”. In that context, Artificial Intelligence (AI) introduces both opportunities and risks to evaluation practice. Guided by the CGIAR-wide Evaluation Framework, integrating AI tools requires a governance approach that balances innovation with ethical responsibility, ensuring transparency, fairness, accountability, and inclusivity. This Technical Note encourages and guides CGIAR evaluators to ethically explore, negotiate, and experiment with AI tools:

Explore: Evaluators are invited to discover how AI, especially GenAI, can enhance evaluation efficiency, from scoping and data analysis to reporting. The Note provides practical guidance on AI applications and examples to support creative yet responsible exploration.

Negotiate: Integrating AI should be openly discussed with commissioners, stakeholders, and teams. The Note prioritizes jointly defining boundaries, expectations, and ethical parameters (transparency, accountability, and data sensitivity) at each phase of evaluation.

Use AI Responsibly: While AI tools are evolving, evaluators are encouraged to pilot and iterate their use. The document supports experimentation through practical tips, prompt examples, and tool selection criteria, all while emphasizing documentation and learning from each use case.

Effective AI governance is grounded in core principles:

  • Transparency requires clear documentation of AI tool usage, data sources, model limitations, and decision-making processes.
  • Accountability involves assigning responsibility for AI decisions and outputs and establishing oversight and redress mechanisms.
  • Fairness and inclusion must proactively mitigate bias and discrimination, with particular attention to underrepresented groups and data gaps.
  • Data privacy and security must align with applicable data protection regulations and ensure secure handling practices.
  • Human oversight ensures that evaluators retain control over processes and can intervene as needed.

In operationalizing ethical AI governance in CGIAR evaluations, due diligence is required in assessing AI tools for ethical alignment before deployment: reviewing the transparency of vendors, the documentation of models, and their intended use cases. Where relevant, components involving AI—especially those engaging human subjects or sensitive data—should undergo ethics review. AI applications must be adapted to the local and cultural contexts in which evaluations are conducted, as what is suitable in one setting may be inappropriate in another. Additionally, participants should be informed about the use of AI systems and the implications of data collection or processing to ensure informed consent.

Ethical AI governance should be embedded in the entire evaluation lifecycle. During the design phase, evaluators should define which AI tools will be used and why, and assess the associated risks. In data collection, AI tools should be used in ways that uphold data privacy and protection standards and avoid reinforcing harmful stereotypes or excluding groups. During the analysis phase, the role of AI in supporting interpretation should be documented, with an acknowledgment of limitations or biases. In dissemination, documentation, and reporting, AI’s contribution, limitations, and human validation should be disclosed. Through rapid adaptation of content across formats, languages, and complexity levels, AI opens possibilities for broader, more inclusive communication of findings. Finally, the follow-up phases should include a reflection on the ethical implications observed and how these lessons can improve future evaluations.

By embedding methodological flexibility into evaluation processes, AI adoption can contribute to integrity, equity, and learning in an era of rapid technological advancement. This Technical Note is a conversation starter – as a “Beta” version, it will evolve based on responsible real-world experimentation and continuous reflection. Evaluators are encouraged to be responsive to stakeholder input throughout the evaluation process, to ensure relevance, accuracy, and inclusivity.

EvalC3 Online is now available

For those of you interested in configurational analyses, the use of simple prediction modeling algorithms, and the linking of cross-case and within-case analyses, an online version of EvalC3 is now available and free to use here: https://evalc3online.org/authenticate

Along with extensive linked Help pages:

EvalC3 Online: Introduction: https://evaluatingcomplexity.org/tools/eval-c3
How to use EvalC3 Online: https://evaluatingcomplexity.org/resources/how-to-use-eval-c3-online

See also this large compilation of the Help pages on the EvalC3 website (soon to be closing).

A Guide to Evaluation of Value for Money in UK Public Services

King, J., & Hurrell, A. (2024). A Guide to Evaluation of Value for Money in UK Public Services.

Rick Davies comment: I have always appreciated this simple differentiation of VfM terms, as provided in an earlier publication by Julian King and colleagues.

================text of the associated website==================

Assessing Value for Money in the UK

Beyond cost-benefit analysis

This guide (King & Hurrell, 2024) introduces the Value for Investment approach in the UK domestic policy context to meet needs that cost-benefit analysis alone cannot fulfil. In their approach, King and Hurrell challenge the notion that Value for Money and cost-benefit analysis are interchangeable terms. They explore the idea that conflating the two impedes our ability to make good resource allocation decisions.

Integrating multiple values, sources and methods for a more comprehensive assessment

The guide advocates for a more inclusive approach that integrates multiple values (social, economic, environmental, and cultural) and diverse evidence sources (qualitative and quantitative) for more comprehensive VfM assessment. An evaluation guided by the VfI system can draw on the strengths of CBA without privileging economic methods and metrics over wider evidence and criteria.

The Value for Investment approach uses mixed methods (integrating quantitative and qualitative evidence), employs evaluative reasoning (interpreting evidence through the lens of explicit criteria and standards – improving transparency in the rationale for evaluative judgements), and is participatory (involving stakeholders in co-design and analysis).

It is designed to be intuitive and practical to use by following a logical sequence of steps.

Complementing existing guidance

Our new Guide is designed to complement existing VfI guidance such as:

This Guide focuses on the interface between cost-benefit analysis and the broader fields of evaluation and economics, including:

  • How cost-benefit analysis can make a valuable contribution to Value for Money assessment;
  • Why cost-benefit analysis alone may not be enough;
  • How to conduct a Value for Investment evaluation that matches methods to context and includes cost-benefit analysis where feasible and appropriate; and
  • Using Value for Investment as a viable alternative when cost-benefit analysis isn’t possible.

In setting out these arguments, we acknowledge the UK Green Book as HM Government’s central guide for appraisal and evaluation, and the Magenta Book as its central guide for evaluation methodologies and practices across the policy cycle.

Who is the guide for?

This new Guide aims to help those tasked with assessing Value for Money to design and deliver context-appropriate assessments that contribute to good resource allocation decisions and positive impacts.

The authors also offer training workshops on Value for Investment:

Training for UK public sector staff and consultants
  • Value for Investment training workshops are jointly offered to UK public sector staff and consultants by Verian and Julian King & Associates.
  • Training workshops can be provided online or in-person and can be customised to meet needs, typically ranging from three hours to two days duration. Workshops can be scheduled on request.
Workshops for evaluators
  • Value for Investment training workshops for evaluators are also offered periodically through the UK Evaluation Society through a collaboration between Oxford Policy Management, Julian King & Associates, and Verian.

Download the guide

==============end of text of associated website==================

Collated contents of the first global online conference on the use of MSC, June 19th and 26th 2024

 

Collated and summarised information on the first global online conference on the use of MSC, June 19th and 26th 2024

About the 2024 MSC Conference 

This inaugural event marked an important milestone for the MSC community as we gathered for the first time, across the globe, to discuss and explore the various facets of MSC practice.

The primary objective of the conference was to facilitate meaningful discussions, learn from each other’s experiences, expand awareness of MSC, strengthen community ties, and explore the potential of MSC as a tool to help address future global challenges. Specifically, the conference aimed to showcase the diversity of MSC applications across different locations, populations, and interventions. It also sought to share best practices and innovative approaches, while exploring MSC’s role in addressing contemporary issues such as power dynamics and technological changes.

Over the course of the two-day conference, 180 individuals attended, including 44 presenters, across 20 sessions. Participants’ feedback on their experiences of the event can be seen here. If you participated in the event but have not yet given your feedback, you can still provide it here.

The conference contents are now available, as listed below:

  1. The conference programme schedules for the 19th June and 26th June
  2. Individual presentations for the 19th and 26th June, including:
    1. Title
    2. Abstract
    3. PowerPoint presentations
    4. Claude AI summaries of audio records of session contents
    5. Presenters
    6. Contact information 
  3. Claude AI summaries of surrounding sessions
    1. Introductions for the 19th June and 26th June
    2. Wrap Up session for 19th June and 26th June
    3. Futures session for 19th June and 26th June
    4. Claude AI whole-day summaries for 19th June and 26th June

This event was organised pro bono by the MSC Conference Working Group (Rick Davies, Niki Wood, Cecilia Ruberto, Nur Hidayati, Kornelia Rassmann, Nick Andrews, Nitin Bajpai). For more information contact rick.davies@gmail.com, or the authors of individual presentations.

Please note: Updated versions of this document may become available here

Findings and recommendations from the Evaluation Methods Advisory Panel at WFP

WFP (2024) Annual Report from the Evaluation Methods Advisory Panel at WFP 2023 in Review. Rome, Italy. Available at: https://docs.wfp.org/api/documents/WFP-0000157165/download/?_ga=2.59050729.558308766.1710337216-1247327426.1696429757 (accessed 14 March 2024).
Declaration of interest: I was a member of the panel from 2022-23
Contents

Introduction
1. Approaches and methods
2. Evaluation guidance
3. Use of theory-based evaluation
4. Evaluability assessments and linkages with evaluation design
5. Triangulation, clarity, and transparency
6. Lessons to strengthen WFP’s evaluation function
Annex 1: Short biographies of members of the EMAP
Annex 2: Evaluation documents reviewed by the EMAP
Annex 3: Selection of evaluations for review by the EMAP

The Evaluation Methods Advisory Panel

Given the increase in the number of evaluations and the complex and diverse contexts in which the World Food Programme (WFP) operates, the WFP Office of Evaluation (OEV) has created an Evaluation Methods Advisory Panel (EMAP) to support improving evaluation methodology, approaches, and methods, and to reflect on international best practice and innovations in these areas. The Panel was launched in January 2022. Currently composed of six members (listed in Annex 1), it complements provisions in the WFP evaluation quality assurance system (EQAS).

Purpose and Scope

The aims of the Annual Review are to:

  • Reflect on evaluation approaches and methods used in evaluations,
    and progress towards improving and broadening the range of
    methodologies
  • Identify systemic and structural challenges
  • Derive lessons to increase quality and utility in future evaluations

The EMAP Annual Report covers most evaluations conducted by WFP’s evaluation function – Policy Evaluations (PEs), Complex Emergency Evaluations (CEEs), Strategic Evaluations (SEs), Decentralized Evaluations (DEs), and Country Strategic Plan Evaluations (CSPEs) – in 2022-2023 (see Annex 3). It is based on reviews undertaken by EMAP members (“the reviewers”), and discussions and workshops between the reviewers and WFP. EMAP has not examined system-wide and impact evaluations.

Process

Two approaches to the EMAP reviews were undertaken. In one strand of activities, EMAP members received a selection of completed CSPE and DE evaluation reports (ERs), and the related terms of reference (ToR) and inception reports (IRs), for their review. The other strand of EMAP activities was giving feedback on draft outputs for Policy Evaluations (PEs), Complex Emergency Evaluations (CEEs) and Strategic Evaluations (SEs).

Two EMAP advisers wrote this Annual Report; the process of preparing it entailed:

  • Review of the advice provided by EMAP on WFP evaluations during 2023.
  • Discussion of the draft annual report with OEV, Regional Evaluation Officers (REOs) and other EMAP advisors in a two-day workshop at WFP. This report incorporates key elements from these discussions.

As in 2022, the 2023 review faced the following limitations:

  • The review included 14 DEs, 10 CSPEs, 5 PEs, 3 CEEs and 3 SEs, but analysed outputs were at different stages of development. EMAP reviewers
    prepared review reports for DEs and CSPEs based on finalised ToRs,
    inception and evaluation reports. Conversely, for SEs, PEs and CEEs, the
    reviews examined draft concept notes, ToRs, IRs, ERs, and two literature
    reviews.
  • Not all EMAP reviews undertaken in 2023 were finalised in time for the
    synthesis process undertaken to prepare the Annual Report.
  • Most reviews followed a structure provided by WFP which varied by
    evaluation type. For instance, the DE review template included a section on
    overall evaluation approaches and methods which was not included in the
    CSPE review template. Some reviews did not use the templates provided but
    added comments directly to the draft reports.
  • Finally, reviewing written evaluation outputs presented challenges to
    explaining why something did or did not happen in an evaluation process.
  • Unlike in 2022, there was no opportunity for the EMAP to discuss the draft annual report as a panel before sharing it with OEV. The 2023 Annual Report was, however, discussed in a workshop with EMAP members and OEV staff, including regional evaluation officers, to validate the results and discuss potential ways forward across the different types of evaluations in WFP.

The Australian Centre for Evaluation – plans, context, critiques

The plan

The 2023–24 Budget includes $10 million over four years to establish an Australian Centre for Evaluation (ACE) in the Australian Treasury. The Australian Centre for Evaluation will improve the volume, quality, and impact of evaluations across the Australian Public Service (APS), and work in close collaboration with evaluation units in other departments and agencies.

The context

The critique(s)

    •  Risky behaviour — three predictable problems with the Australian Centre for Evaluation, by Patricia Rogers.  Some highlighted points, among many others of interest:
      • Three predictable problems
        • Firstly, the emphasis on impact evaluations risks displacing attention from other types of evaluation that are needed for accountable and effective government
        • Secondly, the emphasis on a narrow range of approaches to impact evaluation risks producing erroneous or misleading findings.
        • Thirdly, the focus on ‘measuring what works’ creates risks in terms of how evidence is used to inform policy and practice, especially in terms of equity.
          • These approaches are designed to answer the question “what works” on average, which is a blunt and often inappropriate guide to what should be done in a particular situation. “What works” on average can be ineffective or even harmful for certain groups; “what doesn’t work” on average might be effective in certain circumstances. ….This simplistic focus on “what works” risks presenting evidence-informed policy as being about applying an algorithm where the average effect is turned into a policy prescription for all.

Other developments

    • In September 2022 the Commonwealth Evaluation Community of Practice (CoP) was launched as a way of bringing people together to support and promote better practice evaluation across the policy cycle. The CoP Terms of Reference state that it is open to all Australian government officials with a role or interest in evaluation that can access community events, discussion boards and a SharePoint Workspace. According to the Department of Finance the CoP membership has grown to over 400 people with representatives from around 70 entities and companies.
      • It would be interesting to be a “fly on the wall” amidst such discussions

My own two pence worth

  • Not only do we need a diversity of evaluation approaches (vs “RCTs rule okay!”), we also need to get away from the idea of even one approach alone being sufficient for many evaluations – which are often asking multiple complex questions. We need more combinatorial thinking, rather than single-solution thinking: for example, combining “causes of an effect” analyses with “effects of a cause” analyses.
  • Getting away from “average effect” thinking (but not abandoning it altogether) is also an essential step forward. We need more attention to both positive and negative deviants from any averages. We also need more attention to configurational analyses, looking at packages of causes, rather than the role of multiple single factors treated as if they were isolated (which in reality they are not). As pointed out by Patricia, equity is important – not just effectiveness and efficiency – i.e. the different consequences for different groups need to be identified. Yes, the question is not “What works” but “what works for whom, in what ways, and under what circumstances”.
    • Re “This simplistic focus on ‘what works’ risks presenting evidence-informed policy as being about applying an algorithm where the average effect is turned into a policy prescription for all.” Yes, what we want to avoid (or minimise) is a society where, “While the rich get personalised one-to-one services, the rest get management by algorithm”.

Connecting Foresight and Evaluation

This posting has been prompted by an email shared by Petra Mikkolainen, Senior Consultant – Development Evaluation, with NIRAS, a Finnish consulting firm. It follows my attendance at a UNESCO workshop last week that also looked at bridging Foresight and Evaluation.

A good place to start is this NIRAS guide: 14 mental barriers to integrating futures-thinking in evaluations and how to overcome them

A new trend is emerging simultaneously in the field of evaluation and foresight: combining foresight with evaluation and evaluation with foresight. Evaluators realise that evaluation must become more future sensitive, while futures thinking experts consider that foresight should use more lessons from past events to strengthen the analysis of possible futures. This new mindset is useful, given that evaluation and foresight complement each other like two pieces of a puzzle. However, before we can move on with the debate, we must clarify what we mean by each concept and related key terms. This discussion paper serves as your quick guide to evaluation and foresight terminology.

Then there is “Evaluation must become future-sensitive – easy to implement ideas on how to do it in practice”.

Evaluation – by definition – assesses past events to give recommendations for future action. There is an underlying assumption that what has (or has not) worked in the past will also work (or will not) in the future. In other words, it is supposed that the context in which the past events occurred will remain the same. This idea seems problematic in the current world, where volatility, uncertainty, complexity, and ambiguity (VUCA) are the new normal. One solution is to integrate methods of foresight into the evaluation project cycle. This idea of combining evaluation and foresight is relatively new and untested in the sector. This discussion paper proposes ways this integration can be done in practice in different steps of the evaluation project cycle.

Then there is: 14 mental barriers to integrating futures-thinking in evaluations and how to overcome them

There are two types of basic human reactions to new things: (1) “Yes, let’s try it!” and (2) “No, I don’t want that!”. We might observe one of these experiences in our minds when thinking about integrating foresight concepts and tools into development evaluation to make it more valuable and responsive to support transformative change. The danger with the first response is a lack of critical thinking about whether the approach is relevant to the situation. On the other hand, the second response might prevent reaching new levels of learning and co-creation. In this blog, I explore 14 types of resistance to applying futures-thinking in evaluation and suggest solutions with an attitude of positive curiosity.

One of the foresight methods mentioned on page 10 of the second document is ParEvo:

The ParEvo tool developed by Rick Davies is a web-assisted programme for building future (or past) scenarios in a participatory manner (Davies, 2022). It has been used in evaluations, and as described by Davies “When used to look forward ParEvo can be seen as a form of participatory exploration of alternate futures. When used to look back it can be seen as a form of participatory public history”. The website includes plenty of information on its applications.

Exploring the text analytic capacities of GPT4

 

On this page:

    1. What GPT4 says it can do
    2. What we can do using apps that interface with GPT4 et al
    3. Evaluation of AI text analysis performance
    4. How AI helped me to help AI improve its performance!

1. What GPT4 says it can do

RD: What kinds of text analyses can I expect GPT4 to be able to do?
GPT4: While I cannot provide specific details about GPT-4’s capabilities beyond my knowledge cutoff, I can provide some insights based on GPT-3’s functionalities. GPT-4, being an advanced version, is expected to have improved performance and more accurate results. Here are some types of text analysis that a GPT-4 model could potentially handle:
    1. Sentiment analysis: Determine the sentiment of a given text, whether it’s positive, negative, or neutral, and potentially provide a sentiment score.
    2. Text summarization: Summarize long pieces of text into shorter, more concise versions while retaining the main ideas.
    3. Named entity recognition (NER): Identify and classify entities within the text, such as names of people, organizations, locations, dates, and other types of entities.
    4. Keyword extraction: Extract the most relevant keywords or phrases from a given text, providing a quick overview of the main topics.
    5. Text classification: Categorize text into different topics, genres, or other predefined classes based on its content.
    6. Relationship extraction: Identify relationships between entities within a text, such as a person’s affiliation with an organization or the connection between two events.
    7. Language translation: Translate text between different languages while maintaining the original meaning and context.
    8. Paraphrasing: Reword a given text while preserving its original meaning, which can be useful for rephrasing sentences or generating alternative versions of content.
    9. Grammar and style corrections: Identify and correct grammatical errors, typos, and stylistic issues in a text.
    10. Content generation: Create new, coherent, and contextually relevant content based on user inputs, prompts, or templates.

Please note that these are examples of text analysis tasks that an advanced language model like GPT-4 might be able to handle, but specific capabilities might vary depending on the model’s training and implementation.

2. What we can do using apps that interface with GPT4 et al

This is what I have been doing recently, partly through the use of a new app known as Pickaxe, which “sits on top of” GPT4 and its predecessors. PS: The number of these apps is large and growing very rapidly. I will try to provide some guides to them here shortly.

Using Pickaxe you can draft a “prompt”, or a whole structure of prompts, which will send a request to GPT, which then returns a finding. Nothing new here. But, in addition, you can “publish” your prompt, making it usable by others, including the ability to customise the prompt to their more specific requirements.

Here below is a list of the Pickaxes I have developed so far…mainly oriented around my interests relating to qualitative analysis of text data. Warning… None of these is perfect. Inspect the results carefully and don’t make any major decisions on the basis of this information alone. Sometimes you may want to submit the same prompt multiple times, to look for variability in the results.

Please use the Comment facility to provide me with feedback on what is working, what is not and what else could be tried out. This is all very much a work in progress. For some background see this other recent post of mine: Using ChatGPT as a tool for the analysis of text data

Summarisation

Text summariser: The AI will read the text and provide three types of summary descriptions for each and all of the texts provided. Users can determine the brevity of the summaries.

Key word extraction: The AI will read the text and generate ranked lists of key words that best describe the contents of each and all of the texts provided.

Comparison

Text pile sorting: The AI will sort texts into two piles representing the most significant difference between them, within constraints defined by the user.

Text pair comparisons: The AI will compare two descriptions of events and identify commonalities and differences between them, within constraints defined by the user.

Text ranking: The AI will rank a set of texts on one or more criteria provided by the user. An explanation will be given for the texts in the top and bottom rank positions.

Extraction

Thematic coding assistant: You provide guidance for the automated search for a theme of interest to you. You provide a set of texts to be searched for this theme. The AI searches and finds texts that seem most relevant. You provide feedback to improve further searches.

PS: This Pickaxe needs testing against data generated by manual searches of the same set of texts for the same themes. If you have any already coded text that could be used for such a test, please let me know: rick.davies@gmail.com. For more on how to do such a test see section 3 below.

Actor & relationship extraction: The AI will identify names of actors mentioned in texts, and kinds of relationships between them. The output will be in the form of two text lists and two matrices (affiliation and adjacency), in csv format.

Adjective Analysis Extraction: The AI will identify ranked lists of adjectives that are found in one or more texts, within constraints identified by the user.

Adverb extraction: The AI will identify a ranked list of adverbs that are found within a text, within constraints identified by the user.

Others of possible interest

Find a relevant journal… that covers the subject you are interested in, then have those journals ranked on widely recognised quality criteria and presented in a table format.

3. Evaluation of AI text analysis performance

It is worth thinking about how we could usefully compare the performance of GPT4 to that of humans on text analysis tasks. This would be easiest with responses that generate multiple items, such as lists and rankings, which lend themselves to judgements about degrees of similarity/difference – the use of which is made clearer below.
There are three possibilities of interest:
    1. A human and the AI might both agree that a text, or instance in a text, meets the search criteria built into a prompt. For example, it is an instance of the theme “conflict”.
    2. A human might agree that a text, or instance in a text, meets the search criteria built into a prompt, but the AI may not. This will be evident if this instance has not been included in the AI’s list but is on the list developed by the human.
    3. The AI might agree that a text, or instance in a text, meets the search criteria built into a prompt, but the human may not. This will be evident if this instance has been included in the AI’s list but is not on the list developed by the human.

These possibilities can be represented in a kind of truth table known as a Confusion Matrix. Ideally both human and AI would agree in their judgements on which texts were relevant instances. In that case all the instances found by both parties would be in the True Positive cell, and all the rest of the texts would in effect be in the True Negative cell. (TP+TN)/(TP+FP+FN+TN) is a formula for measuring this form of performance, known as Classification Accuracy. This example would have 100% classification accuracy. But such findings are uncommon.
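
As a concrete illustration, here is a minimal Python sketch of this comparison. The instance IDs and counts are invented for the example: the human’s codings are treated as one set, the AI’s returned instances as another, and classification accuracy is computed from the resulting confusion matrix counts.

```python
# Minimal sketch (illustrative only): comparing a human coder's list of relevant
# instances with an AI-generated list, and computing classification accuracy.
# Instance IDs and counts are hypothetical.

human_coded = {"T01", "T04", "T07", "T09"}          # instances the human judged relevant
ai_coded    = {"T01", "T04", "T05", "T09"}          # instances returned by the AI prompt
all_texts   = {f"T{i:02d}" for i in range(1, 13)}   # the full set of texts searched

tp = len(human_coded & ai_coded)                    # both agree: relevant
fp = len(ai_coded - human_coded)                    # AI says relevant, human does not
fn = len(human_coded - ai_coded)                    # human says relevant, AI misses it
tn = len(all_texts - human_coded - ai_coded)        # both agree: not relevant

classification_accuracy = (tp + tn) / (tp + fp + fn + tn)
print(f"TP={tp} FP={fp} FN={fn} TN={tn}  accuracy={classification_accuracy:.0%}")
```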

How would you identify the actual numbers in each of the cells above? This would have to be done by comparing the results returned by the AI to those already identified by the human. Some instances would be agreed upon as the same as those already identified – which we can treat as TPs. Others might strike the human as new and relevant, not having previously been identified (FNs); the human’s coding would then be updated so that such instances were now deemed TPs. Others would be seen as inappropriate and non-relevant instances (FPs).
If there were some FPs, what could be done? There are two possibilities:
    1. The human could ask themselves how they can edit the AI prompt to improve its identification of these kinds of instances. In doing so, the human would be learning how to work better with the AI. This seems likely to be a common response, judging from the sample of the rapidly growing prompt literature that I have scanned so far.
    2. The text of one or more identified FP instances could be inserted into the body of the prompt, as a source of additional guidance. Then the use of that prompt could be reiterated. In doing so the AI would be adapting its response in the light of human feedback. It would be doing the learning. This is a different kind of approach, which is happening already within GPT4, but probably much less often in the prompts designed by non-specialist human users.

After the second iteration of the prompt the incidence of FPs could be reviewed again. A third iteration could be prepared, including an updated feedback example generated by the AI’s second iteration. The process could be continued. Ideally the classification accuracy of the AI’s work would improve with each iteration. In practice progress may not be so smooth.
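
A minimal sketch of this human-in-the-loop iteration is shown below. The send_prompt() function is a hypothetical placeholder for whatever model or app is being used (it is not a real API); the point is only the structure of the loop: run the prompt, collect the instances the human rejects as false positives, feed them back, and re-run.

```python
# Sketch only: the structure of the iterative feedback loop described above.
# send_prompt() is a hypothetical placeholder, not a real API call.

def send_prompt(theme: str, texts: str, known_false_positives: list[str]) -> list[str]:
    """Hypothetical wrapper: returns the instances the model lists for the theme."""
    raise NotImplementedError("replace with a call to your chosen model or app")

def refine(theme: str, texts: str, max_iterations: int = 3) -> list[str]:
    false_positives: list[str] = []
    instances: list[str] = []
    for _ in range(max_iterations):
        # Re-run the prompt, feeding back any instances the human rejected last time.
        instances = send_prompt(theme, texts, false_positives)
        # Human review step: mark which returned instances are not really the theme.
        rejected = [x for x in instances
                    if input(f"Is this a false positive? {x} (y/n) ") == "y"]
        if not rejected:
            break                         # no false positives left; stop iterating
        false_positives.extend(rejected)  # carry the feedback into the next iteration
    return instances
```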

A wider perspective

What I have described is an evolutionary search strategy, involving variation, selection and reproduction:

    1. Variation: A population of possibly relevant solutions is identified by the first iteration of the prompt. That is, a list of identified instances is generated.
    2. Selection: The poorest fitting instance is selected as an example of what is not relevant, and inserted into the original prompt text with that label.
    3. Reproduction: The revised prompt is reiterated, to generate a new and improved set of variant instances.

There is a similar process built into the design of Stable Diffusion, which is designed to generate images from text prompts. An initial text prompt generates four images, which are variants of the prompt idea. The user selects one and can then reiterate the use of the prompt, either in its original form or with some further edits. Again four new variants are generated.

There is however one significant difference, at least as I currently understand the Stable Diffusion process. The selection choice focuses on the most desirable variant and its reproduction into new variant forms, not the least desirable variant and its elimination from the population of reproduced variants. This embodies a “survival of the fittest” approach, not the “non-survival of the least-fitting”. In this respect it appears to be a more convergent process – though there do seem to be other user-adjustable parameters which can affect how varied the new variants are – see the “Guidance scale”.

Implications for the use of GPT4 et al

In GPT4 you can “regenerate” new responses to a given prompt, i.e. create a set of variants. But these are presented as a sequence of opportunities, not as a set that is available all at once and can be viewed in comparison to each other. One possible way around this is to build into a prompt a request for N versions of the response – which would all be visible at the same time. The best of these could then be investigated, with any further prompts also making a request for N variants of a response. The main challenge with this approach seems likely to be space constraints, i.e. the size limit on the word length of the complete response. But this limit seems likely to increase over time, if computing capacity continues to increase.

4. How AI helped me to help AI improve its performance!

Okay, I know this is a very long post, but do continue reading. After the long ramble of ideas just above, I decided to ask GPT4 about the differences and similarities between genetic algorithm type searches and reinforcement learning approaches. I will copy and paste what happened next, below:

Me: Please describe the similarities and differences between reinforcement learning and genetic algorithms
GPT4: Reinforcement Learning (RL) and Genetic Algorithms (GA) are both optimization techniques that can be used to solve various problems, particularly in artificial intelligence and machine learning. While they share some similarities, they are fundamentally different in their approaches and underlying mechanisms.

Similarities:

  1. Optimization: Both RL and GA are optimization methods aiming to find the best solution for a given problem. They iteratively adjust their solutions based on feedback or evaluation.
  2. Exploration vs. Exploitation: Both techniques must balance exploration (searching new areas of the solution space) and exploitation (refining existing solutions) to achieve optimal results.
  3. Stochastic Processes: Both methods involve randomness in their search procedures. In RL, agents often take random actions to explore the environment, while GA introduces randomness through mutation and crossover.
  4. Population-based: Although not always the case in RL, some algorithms, like Evolutionary Strategies or Genetic Reinforcement Learning, employ a population-based approach similar to GA. These methods maintain multiple candidate solutions and update them over time based on their performance.

Differences:

  1. Representation: RL represents solutions as policies or value functions that map states (or state-action pairs) to actions (or values). GA represents solutions as individuals in a population, typically encoded as strings or data structures, which directly or indirectly represent a solution to the problem.
  2. Learning Mechanism: RL learns through interaction with an environment, where an agent takes actions, observes the resulting states and receives feedback in the form of rewards or penalties. GA is based on the principles of natural evolution, where the fittest individuals are selected to reproduce and generate offspring that inherit their traits.
  3. Evaluation: In RL, the agent’s performance is assessed through the cumulative rewards obtained in an episode or over multiple episodes. In GA, a fitness function evaluates each individual in the population based on how well it solves the problem.
  4. Update Mechanism: RL algorithms update their policies or value functions using methods like Temporal Difference (TD) learning, Q-learning, or Policy Gradient. GA updates its population through genetic operators like selection, crossover, and mutation.
  5. Applicability: RL is particularly suited for sequential decision-making problems in which an agent interacts with an environment. GA is more versatile and can be applied to a wide range of optimization problems, including those without sequential decision-making or clear environment interactions.

In summary, reinforcement learning and genetic algorithms share some similarities in their optimization goals, exploration-exploitation balance, and stochastic processes. However, they differ in terms of representation, learning mechanisms, evaluation, update mechanisms, and applicability to various problem domains.

Me: Given your explanation just provided I will now give you an example of a package of prompt commands that has been used with GPT4 on another occasion. Please read it and tell me which of the two methods (reinforcement learning or GA) it is most similar to, and why. Or, if it is not similar to either please tell me and explain why it is not. The text of the prompt package is as follows:

1. The user want you to find sections of text, which will be called ‘instances’, that contain this theme: [user enters data here]
2. Here is some extra clarification on the meaning of this theme: [user enters data here]
3. Here is the body of text to be searched: [user enters data here] If there are separate sections of text each of these will have an ID number preceded by the # symbol
4. Please list up to [user enters data here] instances that contain this theme, ranked in order of how well they match the theme. If only a shorter number can be found, display that number of instances only. If more than that number can be found report the number of extra instances that were found, at the end of the requested list of instances.
5. With each instance, display a maximum of [user enters data here] sentences of the relevant text.
Include the hashtag number of the source text, at the end of this text, if it has been provided by the user. Follow the text of each sentence with a one sentence explanation for its inclusion.
6. Do not include [user enters data here]
7. After you have provided your first response, the user will provide you with feedback on where you made any errors. Specifically, this will be in the form of the instances which have been listed by you as instances of the theme, which the user thinks are not correct i.e. False Positives (FPs). If no feedback has yet been provided you will see this message ‘No feedback yet’. Please note that this does not mean all your listed instances are already correct.
8. When the user clicks “Submit” please redo your analysis using this feedback to improve your performance. By this we mean a reduction in the number of listed instances which are likely to be classed as False Positives
9. But do not use this feedback if the body of text to be searched, given at point 3 above, has been changed. Treat that as a new exercise.

GPT4: The given prompt package does not directly resemble either Reinforcement Learning (RL) or Genetic Algorithms (GA) in their traditional forms. However, it is somewhat more related to a Reinforcement Learning setting, given the presence of feedback and iterative improvement. Let me explain why:

The problem itself is an instance of text analysis and information extraction, which does not inherently require optimization techniques like RL or GA. Instead, it could be solved using natural language processing techniques, such as named entity recognition, topic modeling, or even machine learning classifiers.

The feedback mechanism mentioned in points 7 and 8 resembles a reinforcement learning setting, where an agent (in this case, the AI model) receives feedback from the user regarding its performance (False Positives in this case). The agent is expected to improve its performance using this feedback, similar to the way RL agents learn from rewards or penalties.

The iterative improvement process, mentioned in points 7, 8, and 9, is also characteristic of a reinforcement learning setup. An RL agent learns from its environment by interacting with it, receiving feedback, and updating its policy or value function accordingly.

However, it is essential to note that the prompt package does not fit the traditional reinforcement learning paradigm perfectly. The problem does not involve sequential decision-making or interaction with an environment where an agent takes actions and transitions to different states.

In conclusion, while the prompt package does not directly correspond to either Reinforcement Learning or Genetic Algorithms, it shares some similarities with Reinforcement Learning due to the presence of feedback and iterative improvement. A customized approach involving natural language processing and some form of feedback-based learning (possibly inspired by RL) would likely be more suitable for solving this problem.

Me: How would you change your analysis, if at all, if I told you that this prompt package is being used for identification of themes in a text, which inherently involves fuzzier category boundaries than is the case with named entity recognition? It could be argued that there is an optimisation process involved here, where the AI is seeking to find the most accurate model of that theme, i.e. one with the least false positives and least false negatives, as identified when that model’s results are compared to the user’s own categorizations (which is implicitly happening when they provide feedback on the false positive instances).

AI: Given this new information, it is clearer that the prompt package involves an optimization process aimed at minimizing false positives and false negatives in theme identification. While the problem still does not fit perfectly into traditional Reinforcement Learning (RL) or Genetic Algorithm (GA) frameworks, it may be possible to adapt these methods to address the problem, especially when considering the feedback mechanism and iterative improvement process.

One approach could be to combine Reinforcement Learning with supervised learning techniques for natural language processing. In this case, the AI agent could learn a policy for selecting instances of the theme using features derived from the text. The agent’s performance would be evaluated based on feedback from the user (i.e., false positives). The agent would then iteratively update its policy to optimize its performance in identifying themes and minimizing both false positives and false negatives. [RD: This is a feature currently built into the prompt]

To further improve the model, active learning techniques could be employed, where the AI agent would actively request feedback from the user on instances it is uncertain about. This would enable the agent to learn more effectively from a smaller amount of feedback, refining its understanding of the theme and its fuzzy boundaries. [RD: This feature is not in the prompt but sounds very useful, if not more useful]

Alternatively, Genetic Algorithms could be used to optimize the feature extraction process or the parameters of a machine learning classifier designed to identify themes in the text. The population in the GA could represent different feature extraction methods or classifier parameters, and the fitness function would be based on the model’s ability to identify themes with minimal false positives and false negatives. The GA would then evolve the population through selection, crossover, and mutation to find the optimal solution. [RD: I am not sure how one would set this up, and whether it would be worth the effort]

In summary, while the problem of theme identification with fuzzy category boundaries still does not fit perfectly into traditional Reinforcement Learning or Genetic Algorithm frameworks, it is possible to adapt these methods to address the problem by leveraging the feedback mechanism and the iterative improvement process. Combining Reinforcement Learning with supervised learning techniques, using active learning, or employing Genetic Algorithms to optimize feature extraction or classifier parameters could potentially lead to an effective solution for this problem.

Me: I especially like your second and third paragraph. Regarding this idea “To further improve the model, active learning techniques could be employed, where the AI agent would actively request feedback from the user on instances it is uncertain about. This would enable the agent to learn more effectively from a smaller amount of feedback, refining its understanding of the theme and its fuzzy boundaries.”, could you please suggest some improvements to the wording of the prompt, which I will present again here: [same as already shown above]

AI: To incorporate active learning into the prompt package, you can add a step that allows the AI agent to request feedback from the user on uncertain instances. Here is the revised prompt package with this modification:

1. The user wants you to find sections of text, which will be called ‘instances’, that contain this theme: [user enters data here]
2. Here is some extra clarification on the meaning of this theme: [user enters data here]
3. Here is the body of text to be searched: [user enters data here] If there are separate sections of text, each of these will have an ID number preceded by the # symbol.
4. Please list up to [user enters data here] instances that contain this theme, ranked in order of how well they match the theme. If only a shorter number can be found, display that number of instances only. If more than that number can be found, report the number of extra instances that were found at the end of the requested list of instances.
5. With each instance, display a maximum of [user enters data here] sentences of the relevant text. Include the hashtag number of the source text at the end of this text if it has been provided by the user. Follow the text of each sentence with a one-sentence explanation for its inclusion.
6. Do not include [user enters data here]
7. In addition to the instances listed above, the AI agent may present up to [user enters data here] instances it is uncertain about. The user is requested to provide feedback on whether these instances are relevant to the theme or not.
8. After you have provided your first response, the user will provide you with feedback on where you made any errors. Specifically, this will be in the form of the instances which have been listed by you as instances of the theme, which the user thinks are not correct, i.e., False Positives (FPs). If no feedback has yet been provided, you will see this message ‘No feedback yet’. Please note that this does not mean all your listed instances are already correct.
9. When the user clicks “Submit,” please redo your analysis using this feedback to improve your performance. By this, we mean a reduction in the number of listed instances which are likely to be classed as False Positives.
10. But do not use this feedback if the body of text to be searched, given at point 3 above, has been changed. Treat that as a new exercise.

By adding point 7, the AI agent is given the opportunity to actively request feedback from the user on instances it is uncertain about, facilitating a more effective learning process with a smaller amount of feedback.

Me: well done, thank you!

Simple but not simplistic: Findings from a theory-driven retrospective evaluation of a small projects program

By Larry Dershem, Maya Komakhidze, Mariam Berianidze, in Evaluation and Program Planning 97 (2023) 102267.  A link to the article, which will be active for 30 days. After that, contact the authors.

Why I like this evaluation – see below – and the lesson I may have learned

Background and purpose: From 2010–2019, the United States Peace Corps Volunteers in Georgia implemented 270 small projects as part of the US Peace Corps/Georgia Small Projects Assistance (SPA) Program. In early 2020, the US Peace Corps/Georgia office commissioned a retrospective evaluation of these projects. The key evaluation questions were: 1) To what degree were SPA Program projects successful in achieving the SPA Program objectives over the ten years, 2) To what extent can the achieved outcomes be attributed to the SPA Program’s interventions, and 3) How can the SPA Program be improved to increase the likelihood of success of future projects.

Methods: Three theory-driven methods were used to answer the evaluation questions. First, a performance rubric was collaboratively developed with SPA Program staff to clearly identify which small projects had achieved intended outcomes and satisfied the SPA Program’s criteria for successful projects. Second, qualitative comparative analysis was used to understand the conditions that led to successful and unsuccessful projects and obtain a causal package of conditions that was conducive to a successful outcome. Third, causal process tracing was used to unpack how and why the conjunction of conditions identified through qualitative comparative analysis were sufficient for a successful outcome.

Findings: Based on the performance rubric, thirty-one percent (82) of small projects were categorized as successful. Using Boolean minimization of a truth table based on cross case analysis of successful projects, a causal package of five conditions was sufficient to produce the likelihood of a successful outcome. Of the five conditions in the causal package, the productive relationship of two conditions was sequential whereas for the remaining three conditions it was simultaneous. Distinctive characteristics explained the remaining successful projects that had only several of the five conditions present from the causal package. A causal package, comprised of the conjunction of two conditions, was sufficient to produce the likelihood of an unsuccessful project.

Conclusions: Despite having modest grant amounts, short implementation periods, and a relatively straightforward intervention logic, success in the SPA Program was uncommon over the ten years because a complex combination of conditions was necessary to achieve success. In contrast, project failure was more frequent and uncomplicated. However, by focusing on the causal package of five conditions during project design and implementation, the success of small projects can be increased.
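
To illustrate the truth-table step mentioned in the Findings, here is a minimal Python sketch with invented cases and condition names (not the SPA data, and not the authors’ full Boolean minimization). It only groups cases into truth-table rows and applies a strict consistency test for sufficiency.

```python
# Illustrative sketch only (invented cases and condition names):
# building a QCA-style truth table by grouping cases with the same configuration
# of conditions and checking whether that configuration consistently produced success.
from collections import defaultdict

cases = [
    # (case_id, {condition: 0/1}, outcome 0/1)
    ("P01", {"COND_A": 1, "COND_B": 1, "COND_C": 1}, 1),
    ("P02", {"COND_A": 1, "COND_B": 1, "COND_C": 1}, 1),
    ("P03", {"COND_A": 1, "COND_B": 0, "COND_C": 1}, 0),
    ("P04", {"COND_A": 0, "COND_B": 1, "COND_C": 0}, 0),
    ("P05", {"COND_A": 1, "COND_B": 0, "COND_C": 1}, 1),
]

rows = defaultdict(list)
for case_id, conditions, outcome in cases:
    config = tuple(sorted(conditions.items()))   # one truth-table row per configuration
    rows[config].append(outcome)

for config, outcomes in rows.items():
    consistency = sum(outcomes) / len(outcomes)  # share of cases with the outcome present
    sufficient = consistency == 1.0              # strict sufficiency test for that row
    print(config, f"n={len(outcomes)}", f"consistency={consistency:.2f}", f"sufficient={sufficient}")
```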

Why I like this paper:

1. The clear explanation of the basic QCA process
2. The detailed connection made between the conditions being investigated and the background theory of change about the projects being analysed.
3. The section on causal process tracing, which investigates alternative sequencing of conditions
4. The within-case descriptions of modal cases (true positives) and of the cases which were successful but not covered by the intermediate solution (false negatives), and the contextual background given for each of the conditions being investigated
5. The investigation of the causes of the absence of the outcome, all too often not given sufficient attention in other studies/evaluations
6. The points made in the summary, especially about the possibility of causal configurations changing over time, and the proposal to include characteristics of the intermediate solution in the project proposal screening stage. It has bugged me for a long time how little attention is given to the theory embodied in project proposal screening processes, let alone testing details of these assessments against subsequent outcomes. I know the authors were not proposing this specifically here, but the idea of revising the selection process in the light of new evidence of prior performance is consistent with it and makes a lot of sense
7. The fact that the data set is part of the paper and open to reanalysis by others (see below)

New lessons, at least for me..about satisficing versus optimising

It could be argued that the search for sufficient conditions (individual or configurations of them) is a minimalist ambition, a form of “satisficing” rather than optimising. In the above authors’ analysis their “intermediate solution”, which met the criteria of sufficiency, accounted for 5 of the 12 cases where the expected outcome was present.

A more ambitious and optimising approach would be to seek maximum classification accuracy (=(TP+TN)/(TP+FP+FN+TN)), even if this comes at the initial cost of a few False Positives. In my investigation of the same data set there was a single condition (NEED) that was not sufficient, yet accounted for 9 of the same 12 cases. This was at the cost of some inconsistency, i.e. two false positives also being present when this single condition was present (Cases 10 & 25). This solution covered 75% of the cases with expected outcomes, versus 42% with the satisficing solution.
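
To make the comparison concrete, here is a minimal sketch using the counts quoted above (12 outcome-present cases; a sufficient solution covering 5 of them with no false positives; a single non-sufficient condition covering 9 of them with 2 false positives). The total number of cases in the analysis is not stated here, so the 30 used below is purely an assumption for illustration.

```python
# Sketch of the satisficing vs optimising comparison argued above.
# TOTAL_CASES is an assumption for illustration only, not a figure from the paper.

def classification_accuracy(tp: int, fp: int, fn: int, tn: int) -> float:
    return (tp + tn) / (tp + fp + fn + tn)

TOTAL_CASES = 30          # illustrative assumption
OUTCOME_PRESENT = 12      # cases where the expected outcome was present

def accuracy_for(covered: int, false_positives: int) -> float:
    tp = covered                                        # outcome-present cases covered
    fn = OUTCOME_PRESENT - covered                      # outcome-present cases missed
    fp = false_positives                                # outcome-absent cases wrongly covered
    tn = TOTAL_CASES - OUTCOME_PRESENT - false_positives
    return classification_accuracy(tp, fp, fn, tn)

print("Satisficing (sufficient) solution:", accuracy_for(covered=5, false_positives=0))
print("Optimising single condition:      ", accuracy_for(covered=9, false_positives=2))
```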

What might need to be taken into account when considering this choice of whether to prefer optimising over satisficing? One factor to consider is the nature of the performance of the two false positive cases. Was it near the boundary of what would be seen as successful performance, i.e. a near miss? Or was it a really bad fail? Secondly, if it was a really bad fail, how significant was that degree of failure for the lives of the people involved? How damaging was it? Thirdly, how avoidable was that failure? In the future, is there a clear way in which these types of failure could be avoided, or not?

This argument relates to a point I have made on many occasions elsewhere. Different situations require different concerns about the nature of failure. An investor in the stock market can afford a high proportion of false positives in their predictions, so long as their classification accuracy is above 50% and they have plenty of time available. In the longer term they will be able to recover their losses and make a profit. But a brain surgeon can afford an absolute minimum of false positives. If their patients die as a result of a wrong interpretation of what is needed, those lives are unrecoverable, and no amount of subsequent successful operations will make a difference. At the very most, the surgeon will have learnt how to avoid such catastrophic mistakes in the future.

So my argument here is: let’s not be too satisfied with satisficing solutions. Let’s make sure that we have at the very least always tried to find the optimal solution (defined in terms of highest classification accuracy) and then looked closely at the extent to which that optimal solution can be afforded.

PS 1: Where there are “imbalanced classes”, i.e. a high proportion of outcome-absent cases (or vice versa), an alternative measure known as “balanced accuracy” is preferred. Balanced accuracy = (TP/(TP+FN) + TN/(TN+FP))/2.
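
A minimal sketch of that calculation, with invented counts chosen to show how balanced accuracy penalises missing most of a rare class even when plain accuracy looks high:

```python
# Balanced accuracy averages the recall on outcome-present cases (TP/(TP+FN))
# and on outcome-absent cases (TN/(TN+FP)). Counts below are illustrative only.

def balanced_accuracy(tp: int, fp: int, fn: int, tn: int) -> float:
    return (tp / (tp + fn) + tn / (tn + fp)) / 2

# 100 cases, only 10 with the outcome present, most of which are missed:
# plain accuracy is (2+89)/100 = 91%, but balanced accuracy is about 0.59.
print(balanced_accuracy(tp=2, fp=1, fn=8, tn=89))
```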

PS 2: If you have any examples of QCA studies that have compared sufficient solutions with non-sufficient but more (classification) accurate solutions, please let me know. They may be more common than I am assuming.

The Fallacy of AI Functionality

 

Evaluators should have a basic working knowledge of how to evaluate algorithms used to manage human affairs (law, finance, social services, etc.) because algorithm designs embody human decisions and can have large-scale consequences. For this reason I recommend:

Raji ID, Kumar IE, Horowitz A, et al. (2022) The Fallacy of AI Functionality. In: 2022 ACM Conference on Fairness, Accountability, and Transparency, Seoul Republic of Korea, 21 June 2022, pp. 959–972. ACM. DOI: 10.1145/3531146.3533158.
Deployed AI systems often do not work. They can be constructed haphazardly, deployed indiscriminately, and promoted deceptively. However, despite this reality, scholars, the press, and policymakers pay too little attention to functionality. This leads to technical and policy solutions focused on “ethical” or value-aligned deployments, often skipping over the prior question of whether a given system functions, or provides any benefits at all. To describe the harms of various types of functionality failures, we analyze a set of case studies to create a taxonomy of known AI functionality issues. We then point to policy and organizational responses that are often overlooked and become more readily available once functionality is drawn into focus. We argue that functionality is a meaningful AI policy challenge, operating as a necessary first step towards protecting affected communities from algorithmic harm.

CONTENTS
1. Introduction
2. Related work
3. The functionality assumption
4. The many dimensions of dysfunction
4.1 Methodology
4.2 Failure taxonomy
4.2.1 Impossible Tasks
Conceptually Impossible.
Practically Impossible
4.2.2 Engineering Failures
Model Design Failures
Model Implementation Failures
Missing Safety Features
4.2.3 Deployment Failures
Robustness Issues
Failure under Adversarial Attacks
Unanticipated Interactions
4.2.4 Communication Failures
Falsified or Overstated Capabilities
Misrepresented Capabilities
5 DEALING WITH DYSFUNCTION: OPPORTUNITIES FOR INTERVENTION ON FUNCTIONAL SAFETY
5.1 Legal/Policy Interventions
5.1.1 Consumer Protection
5.1.2 Products Liability Law.
5.1.3 Warranties
5.1.4 Fraud
5.1.5 Other Legal Avenues Already Being Explored
5.2 Organizational interventions
5.2.1 Internal Audits & Documentation.
5.2.2 Product Certification & Standards
5.2.3 Other Interventions
6 CONCLUSION : THE ROAD AHEAD