A Participatory Means Of Measuring Complex Change
A weighted checklist has:
- A list of items, each of which describes an attribute of an organisation or an event. The attribute may or may not be present (indicated by a 1 or 0), or it may be present in a degree measured in a simple scale (e.g. 0 to 3).
- A set of weights, which describes the relative importance of each item
- A summary score, based on the number of items identified as present, but adjusted by their individual weights.
Here is an example of a very simple customer satisfaction survey in the form of a weighted checklist (it was sent to me by a firm I used). In this case the survey respondents provided two sets of information: (a) their views on the importance of each item to them (the weights), in the second column, and (b) their views on how well the firm was doing on each criterion, according to their experience, in the third column.
Once the responses have been collected, weighted scores for individual respondents can then be calculated, along with an average score for all respondents. The process is as follows:
1. Multiply the importance rating x actual performance rating for each item
2. The sum of these is the actual raw score
3. Multiply the importance rating x highest possible performance rating for each item
4. The sum of these is the highest possible raw score
5. Divide the actual raw score (2) by the highest possible raw score (4), to get a percentage score for the respondent. A high percentage = high degree of satisfaction, and vice versa
6. Calculate the average percentage score for all the respondents
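The six steps above can be sketched in a few lines of Python. This is only a minimal illustration: the function name `percentage_score`, the three respondents and their ratings are all made up, and both rating scales are assumed to run from 0 to 3.

```python
from statistics import mean

def percentage_score(importance, performance, max_performance):
    """Steps 1 to 5: weighted percentage score for one respondent."""
    if len(importance) != len(performance):
        raise ValueError("need one importance and one performance rating per item")
    # Steps 1-2: importance x actual performance, summed = actual raw score
    actual_raw = sum(w * p for w, p in zip(importance, performance))
    # Steps 3-4: importance x highest possible performance, summed
    highest_raw = sum(w * max_performance for w in importance)
    if highest_raw == 0:
        raise ValueError("all importance ratings are zero")
    # Step 5: actual raw score as a percentage of the highest possible
    return 100 * actual_raw / highest_raw

# Hypothetical data: three respondents, four items, both scales 0 to 3
respondents = [
    {"importance": [3, 2, 1, 3], "performance": [2, 3, 1, 1]},
    {"importance": [1, 3, 3, 2], "performance": [3, 3, 2, 0]},
    {"importance": [2, 2, 2, 2], "performance": [1, 1, 1, 1]},
]

scores = [
    percentage_score(r["importance"], r["performance"], max_performance=3)
    for r in respondents
]
average_score = mean(scores)  # Step 6: average across all respondents
```

Because each respondent's score is divided by their own highest possible raw score, respondents who give generally high importance ratings do not dominate the average.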
This is a participatory form of weighted checklist, because respondents themselves determine the weights given to different items on the checklist. Other types of checklists use weightings solicited from experts. They are not the focus of the remainder of this paper. Judging from a Google search, these expert-weighted checklists are mainly used for staff performance appraisal purposes. For more information on these, see:
- Performance Appraisal Tips Help Page by Dexter Hansen. “Weighted Checklist. – The term used to describe a performance appraisal method where supervisors or personnel specialists familiar with the jobs being evaluated prepared a large list of descriptive statements about effective and ineffective behaviour on jobs.”
- Managing Employee Performance and Reward: Concepts, Practices, Strategies, by John Shields. 2007, page 170
- Managing Human Resources in Small & Mid-Sized Companies, by Diane Arthur. 1995, page 178
- Handbook of Public Personnel Administration, by Jack Rabin. 1995, page 337
1. When the event is complex and difficult to measure with any single indicator
Often people try to measure a change by finding a single measurable indicator that will capture the change. For complex changes, such as improvements in people’s participation or changes in organisational capacity, finding such an indicator can be a major challenge. Often the chosen indicator seems far too simplistic, such as using the number of people participating in x type of meetings as an indicator of participation.
2. Where there may be multiple measures in place, but a single aggregate measure is needed of overall performance
Sometimes Logical Framework descriptions of project designs will include more than one indicator to track a given change that is recognised to be complex. However, this response presents a further challenge: how to aggregate the evidence of change described by multiple indicators.
3. Where people’s views of the significance of what has happened differ
Users of a health centre may have different views on how well the health service is performing from those of the health centre staff, or of the senior managers of health services.
The matrix below can be used to describe where three different methods are most suitable (ordinary indicators, Most Significant Change stories, and weighted checklists)
What is different about (participatory) weighted checklists?
Weighted checklists separate out value data from observational data. In the example above, the second column asks about importance to you, the respondent. This is value data. The third column asks you about the company’s performance. This is observational data.
With conventional indicators, judgments about importance happen only once, when the choice is made to use a specific indicator or not. This happens at the planning stage, and is fixed thereafter. It is not possible to change the choice of indicator later on without losing continuity of the data that has been collected so far. With weighted checklists, the same set of observational data can be re-analysed with different sets of value data, reflecting the views of different stakeholders.
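As a rough sketch of this kind of re-analysis, the snippet below scores one set of observational data against two different sets of weights. The stakeholder groups, the ratings and the function name `weighted_percentage` are all hypothetical, and a 0 to 3 rating scale is assumed.

```python
def weighted_percentage(weights, observations, max_rating):
    """Weighted score as a percentage of the highest possible weighted score."""
    actual = sum(w * o for w, o in zip(weights, observations))
    highest = sum(w * max_rating for w in weights)
    return 100 * actual / highest

# Hypothetical observational data: one health centre rated on four items (0 to 3)
observations = [3, 1, 2, 0]

# Hypothetical value data: importance weights from two stakeholder groups
weights_by_group = {
    "service users": [3, 1, 2, 1],
    "health centre staff": [1, 3, 1, 3],
}

# The observations stay fixed; only the weights change between analyses
scores_by_group = {
    group: weighted_percentage(weights, observations, max_rating=3)
    for group, weights in weights_by_group.items()
}
```

In this made-up case the same health centre scores about 67% using the users' weights and about 33% using the staff's weights, without any new observational data being collected.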
Value data is meta-information: information about information. This can be of different kinds. In the simple example shown in the table above, respondents are asked about their preferences. Another survey could ask people which items they thought were basic rights, which all people should have access to. This is the basis of the design of the Basic Necessities Survey (BNS). Or, a survey could ask which items would be the most important cause of an overall outcome, e.g. improved community health. This was the subject of my posting on “Checklists as mini theories-of-change”. Because of these available choices, participatory processes used to elicit checklist weightings should always be clear about what type of judgments are being sought.
Value data can be worth analysing in itself. Different groups of stakeholders will usually vary in the extent to which their views agree with each other. We could measure and monitor this degree of alignment by looking at how participants’ ratings in the second column of the example above correlate with each other, using Excel. Social network diagrams could also be produced using the same data (in a participants x item ratings matrix) to show in more detail how various stakeholder groups are aligned with each other in their views. Of special importance in development project settings will be how the alignment of views between stakeholders changes over time. Is a stronger consensus developing or not?
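The correlation step could equally be done outside Excel. Here is a minimal Python sketch, assuming a small participants x item ratings matrix of importance ratings. The participant names and ratings are invented, and Pearson correlation is my choice of measure here; a rank correlation might suit ordinal ratings better.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equally long lists of ratings."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when all ratings are identical")
    return cov / sqrt(var_x * var_y)

# Hypothetical participants x items matrix of importance ratings (0 to 3)
importance = {
    "user_1": [3, 2, 1, 3, 0],
    "user_2": [3, 3, 1, 2, 0],
    "manager_1": [0, 1, 3, 1, 3],
}

# Pairwise alignment between participants; values near +1 mean similar priorities
alignment = {
    (a, b): pearson(importance[a], importance[b])
    for a, b in combinations(importance, 2)
}

# One possible summary of overall consensus, to be tracked from survey to survey
mean_alignment = sum(alignment.values()) / len(alignment)
```

The pairwise values in `alignment` could also serve as the link strengths in a social network diagram of the participants.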
Changes over time are likely to be important in other ways as well. If a survey asks for information about the perceived importance of different health services (as well as their actual availability) the increased expectations over time might be as important an indicator of development as any increased availability of services. Knowledge that the public were expecting more could affect the responsiveness of local politicians to their constituency’s concerns. Differences between what people report as available and what is reported to be available through other sources of information could also be informative. It could highlight a lack of public knowledge of what is available, or raise questions about the validity of officials’ claims about what is available.
Constructing the checklist
A common challenge to the use of methods like the BNS is “But who constructed the checklist? Surely the contents of this list, and what it omits, will affect the overall findings?” There are two ways of addressing this potential problem. The first is to ensure that the checklist contents are developed through a consultative process involving a range of stakeholders, especially those whose performance is being assessed. The other is to ensure that the checklist is long enough. The BNS checklist had around thirty items. The larger the checklist the less vulnerable the aggregate score will be to the accidental omission of individual items that could be important. But there will also need to be some limits to the size of the list, because respondents’ interests are likely to wane towards the end of a long list.
Long lists of items in a checklist can also present another challenge: how to assign weightings to all of them. One way around this problem is to group the items into categories, as in the table above, and then proceed with weightings in two stages: (a) for the four main categories first, then (b) for the individual items within each category.
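One way the two stages might be combined is sketched below. The rule used, that an item's final weight is its category's share multiplied by its own share within that category, is my assumption rather than something specified above, and the categories, items and ratings are invented (two categories are shown for brevity).

```python
def normalise(ratings):
    """Convert raw importance ratings into shares that sum to 1."""
    total = sum(ratings.values())
    return {name: value / total for name, value in ratings.items()}

# Stage (a): hypothetical importance ratings for the main categories
category_ratings = {"staff": 3, "facilities": 1}

# Stage (b): hypothetical importance ratings for items within each category
item_ratings = {
    "staff": {"courteous": 2, "knowledgeable": 2},
    "facilities": {"clean": 3, "accessible": 1},
}

category_weights = normalise(category_ratings)

# Assumed combination rule: category share x item share within the category
item_weights = {
    (category, item): category_weights[category] * share
    for category, items in item_ratings.items()
    for item, share in normalise(items).items()
}
```

With this rule the final item weights sum to 1 across the whole checklist, so respondents only ever compare a handful of things at a time, however long the full list is.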
Interpreting the checklist
In a survey form it will not be easy to elicit reasons why people have rated one item on a checklist as more important than another. But in workshop settings this can be easier. One way of eliciting these explanations from participating stakeholders is to do pair comparisons, asking “Why is this category of activities more important than this one?” Answers to this question help provide insight into people as consumers of services or citizens with rights or managers with theories-of-change about how their intervention should work.
The same problem exists with understanding the observational data, especially where there are rating scales rather than yes/no answer options. With yes/no options the main requirement is that the items on the list are clearly defined entities or events. With ratings there is the possibility of significant differences in response styles, that is, in how people use the ratings available. One common strategy is to provide the respondent with guidance on what would constitute a 0, 1, 2, or 3 on a rating scale.
There is some debate, however, about whether the weightings of items should be made visible to respondents before a survey, or only made visible later on, when results have been aggregated. This would not be an issue where the weightings themselves are obtained from the respondents, as is the case with the BNS survey. But it could have an influence on the survey results where weights are decided before the survey is implemented. It could lead to respondents, say health centre staff, deciding to improve one aspect of their service more than another, because they know it receives a higher weighting in the checklist. But that response may not necessarily be a bad thing, if those aspects of service are really more important than others. Being open about the weightings could give health centres some choices about how to improve their service, in contrast to performance measurement relying on one key indicator.
Where weightings are obtained from stakeholders (including respondents) via a workshop event after the survey, the effects might be less easy to predict. Participants might be inclined to argue for higher weightings for items they know they have done well on, and vice versa. Making their raw checklist scores visible during the workshop discussion could help make this tendency evident, but it is not likely to eradicate it. Structuring a debate around the proponents of different weightings might help force any apparently self-interested proposals to be justified.
I recommend THE LOGIC AND METHODOLOGY OF CHECKLISTS by Michael Scriven, Claremont Graduate University, updated in 2007. In the opening paragraph he says:
“The humble checklist, while no one would deny its utility in evaluation and elsewhere, is usually thought to fall somewhat below the entry level of what we call a methodology, let alone a theory. But many checklists used in evaluation incorporate a quite complex theory, or at least a set of assumptions, which we are well advised to uncover; and the process of validating an evaluative checklist is a task calling for considerable sophistication. Indeed, while the theory underlying a checklist is less ambitious than the kind that we normally call a program theory, it is often all the theory we need for an evaluation.”
This is a great paper, informative and a pleasure to read. Amongst other things, it gives a wider background to the use of checklists than I have provided above.
In the DFID “Guidance on using the revised Logical Framework” there is now a section on IMPACT WEIGHTING.
“Once you have defined your Outputs, you should assign a percentage for the contribution each is likely to make towards the achievement of the overall Purpose. The impact weights of all the Outputs must total 100% and each should be rounded to the nearest 5%. Impact Weightings for Outputs are intended to:
• Promote a more considered approach to the choice of Outputs at project design stage; and
• Provide a clearer link to how Output performance relates to project Purpose performance.”
It appears from the DFID Annual Report formats that Output achievement ratings are multiplied by these weightings to produce a weighted measure of achievement.
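A rough sketch of how such a calculation might look is given below. The two checks on the weights come from the guidance quoted above. Everything else is assumed for illustration: the function name `weighted_achievement`, the three Outputs, and especially the 0-to-1 achievement scale, which is not DFID's actual rating scheme.

```python
def weighted_achievement(outputs):
    """outputs: list of (impact_weight_percent, achievement_rating) pairs."""
    weights = [weight for weight, _ in outputs]
    if sum(weights) != 100:
        raise ValueError("impact weights must total 100%")
    if any(weight % 5 != 0 for weight in weights):
        raise ValueError("each impact weight should be rounded to the nearest 5%")
    # Multiply each achievement rating by its impact weight, then rescale
    return sum(weight * rating for weight, rating in outputs) / 100

# Hypothetical project with three Outputs; ratings on an assumed 0-to-1 scale
outputs = [(50, 0.8), (30, 0.5), (20, 1.0)]
overall = weighted_achievement(outputs)
```

Here the Output weighted at 50% contributes 40 of the 75 points behind the overall figure of 0.75, which shows how the weighting ties Purpose-level performance most closely to the Outputs judged to matter most.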
- The synthesis problem: Issues and methods in the combination of evaluation results into overall evaluative conclusions. By Michael Scriven, Claremont Graduate University, and E. Jane Davidson, CGU & Alliant University. A demonstration presented at the annual meeting of the American Evaluation Association, Honolulu, HI, November 2000
- Multi-criteria analysis: A manual. Department of Communities and Local Government. London. 2009 “This manual was commissioned by the Department for the Environment, Transport and the Regions in 2000 and remains, in 2009, the principal current central government guidance on the application of multi-criteria analysis (MCA) techniques. Since 2000 it has become more widely recognised in government that, where quantities can be valued in monetary terms, MCA is not a substitute for cost-benefit analysis, but it may be a complement; and that MCA techniques are diverse in both the kinds of problem that they address (for example prioritisation of programmes as well as single option selection) and in the techniques that they employ, ranging from decision conferencing to less resource intensive processes.”
Postscript 4 (14 March 2012)
Could you combine the use of weighted checklists and genetic algorithms (GAs), to discover candidate theories that have a good record of explaining variations in performance across a range of settings?
How it might work:
- An aggregate score on a weighted checklist is a function of individual item scores x their respective weights
- The aggregate score can be seen as a prediction, which can be compared against observed measures for its fitness, i.e. accuracy.
- The particular set of weights in a weighted checklist can be seen as a “theory” of what is needed for good performance
- Some theories (i.e. sets of weights) will be better than others in generating an accurate aggregate score
- GA software could find the best combination of weights that would generate the most accurate aggregate score, across a range of participants, each of whose performance is measured by a weighted checklist
- The theory embodied in that set of weights would then need to be tested by complementary means, including common sense judgement (e.g. is this particular combination of performance attributes likely to occur in real life?)
Excel has an add-in called Solver that enables GA searches for the combination of variables that best generates a target result. This Excel file shows a mock-up example describing 8 projects scored on 4 attributes. Solver found the combination of attribute weights that generated the most accurate prediction of actual performance (predicted values minus observed values = closest to zero). My guess is that the larger the number of cases (projects) and the larger the number of attributes, the less sensitive the results would be to small variations in attribute scores (e.g. due to measurement error).
Why use a GA? The number of possible combinations of weightings that might generate successful predictions is enormous even when there are only five values on four attributes, and this number grows very rapidly with each extra attribute. Evolutionary processes are well known for their capacity to explore very large combinatorial spaces.
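To make the idea concrete, here is a minimal sketch of a GA search over checklist weights in plain Python. It is not the Solver model from the Excel file. The mock data (8 projects scored on 4 attributes, with observed performance generated from a hidden set of weights so that the search can be checked), the function names, the five permitted weight values and the GA settings are all assumptions of mine. Fitness is taken to be the sum of absolute differences between predicted and observed scores.

```python
import random

def predict(weights, scores):
    """Aggregate checklist score: item scores x their respective weights."""
    return sum(w * s for w, s in zip(weights, scores))

def error(weights, projects):
    """Total absolute gap between predicted and observed performance."""
    return sum(abs(predict(weights, scores) - observed)
               for scores, observed in projects)

def evolve(projects, n_items, weight_values=(0, 1, 2, 3, 4),
           pop_size=30, generations=60, mutation_rate=0.2, seed=0):
    """Search for the set of weights (the 'theory') with the lowest error."""
    rng = random.Random(seed)
    population = [[rng.choice(weight_values) for _ in range(n_items)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        # Selection: the more accurate half survives unchanged
        population.sort(key=lambda w: error(w, projects))
        parents = population[: pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            mother, father = rng.sample(parents, 2)
            cut = rng.randrange(1, n_items)        # one-point crossover
            child = mother[:cut] + father[cut:]
            # Mutation: occasionally replace a weight with a random value
            child = [rng.choice(weight_values) if rng.random() < mutation_rate
                     else gene for gene in child]
            children.append(child)
        population = parents + children
    return min(population, key=lambda w: error(w, projects))

# Mock data: 8 projects scored 0 to 3 on 4 attributes. In real use the
# observed performance would come from an independent measure; here it is
# generated from hidden weights purely so the result can be checked.
hidden_weights = (3, 0, 1, 2)
attribute_scores = [
    [3, 1, 2, 0], [1, 3, 0, 2], [2, 2, 3, 1], [0, 1, 1, 3],
    [3, 3, 3, 3], [1, 0, 2, 2], [2, 3, 1, 0], [0, 2, 3, 1],
]
projects = [(scores, predict(hidden_weights, scores)) for scores in attribute_scores]

best_weights = evolve(projects, n_items=4)
best_error = error(best_weights, projects)
```

A `best_error` of zero would mean the search has found weights that reproduce the observed scores exactly. With noisy real-world data that is unlikely, and the interest would lie in which weights come closest and whether they make sense as a theory.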