The Availability of Research Data Declines Rapidly with Article Age

Summarised on SciDevNet as “Most research data lost as scientists switch storage tech”, from this source:

Current Biology, 19 December 2013
Copyright © 2014 Elsevier Ltd All rights reserved.
10.1016/j.cub.2013.11.014

Authors

, Arianne Y.K. Albert, Rose L. Andrew, Florence Débarre, Dan G. Bock, Michelle T. Franklin, Kimberly J. Gilbert, Jean-Sébastien Moore, Sébastien Renaut, Diana J. Rennison

Highlights

We examined the availability of data from 516 studies between 2 and 22 years old
The odds of a data set being reported as extant fell by 17% per year
Broken e-mails and obsolete storage devices were the main obstacles to data sharing
Policies mandating data archiving at publication are clearly needed

Summary

“Policies ensuring that research data are available on public archives are increasingly being implemented at the government [1], funding agency [2,3,4], and journal [5,6] level. These policies are predicated on the idea that authors are poor stewards of their data, particularly over the long term [7], and indeed many studies have found that authors are often unable or unwilling to share their data [8,9,10,11]. However, there are no systematic estimates of how the availability of research data changes with time since publication. We therefore requested data sets from a relatively homogenous set of 516 articles published between 2 and 22 years ago, and found that availability of the data was strongly affected by article age. For papers where the authors gave the status of their data, the odds of a data set being extant fell by 17% per year. In addition, the odds that we could find a working e-mail address for the first, last, or corresponding author fell by 7% per year. Our results reinforce the notion that, in the long term, research data cannot be reliably preserved by individual researchers, and further demonstrate the urgent need for policies mandating data sharing via public archives.”
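A rough way to see what "the odds fell by 17% per year" implies over the 2 to 22 year range: if the decline is treated as a constant proportional change in odds each year (the usual reading of such a figure, but an assumption here, as is everything else in the sketch), the odds are multiplied by 0.83 each year. The function names and the 90% baseline below are hypothetical, chosen only for illustration; they are not taken from the paper.

```python
def odds_multiplier(annual_decline: float, years: int) -> float:
    """Factor applied to the odds after `years`, assuming a constant
    proportional decline in odds each year (assumption, see lead-in)."""
    return (1.0 - annual_decline) ** years


def probability_after(baseline_probability: float,
                      annual_decline: float,
                      years: int) -> float:
    """Convert a baseline probability to odds, apply the compounded
    decline, and convert back to a probability."""
    baseline_odds = baseline_probability / (1.0 - baseline_probability)
    odds = baseline_odds * odds_multiplier(annual_decline, years)
    return odds / (1.0 + odds)


if __name__ == "__main__":
    # 17% per year: odds that a data set is reported as extant.
    # 7% per year: odds of finding a working e-mail address.
    for label, decline in [("data extant", 0.17), ("working e-mail", 0.07)]:
        factor = odds_multiplier(decline, 20)
        print(f"{label}: odds multiplied by {factor:.3f} over 20 years")

    # Hypothetical baseline: 90% of data sets extant at the start.
    # This starting value is NOT from the paper.
    print(f"illustrative probability after 20 years: "
          f"{probability_after(0.90, 0.17, 20):.2f}")
```

Under these assumptions, twenty years of a 17% annual decline multiplies the odds by roughly 0.024, which is why even a high starting availability ends up low. Note that a fall in odds is not the same as a fall in probability, so the 17% figure should not be read as "17% of data sets disappear each year".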

Rick Davies comment: I suspect the situation with data generated by development aid projects (and their evaluations) is much, much worse. I have been unable to get access to data generated within the last 12 months by one DFID co-funded project in Africa. I am now trying to see if data used in a recent analysis of the (DFID-funded) Chars Livelihoods Programme is available.

I am also making my own episodic attempts to make publicly available data sets that have been generated by my own work in the past. One is a large set of household survey data from Mogadishu in 1986, and another is household survey data from Vietnam generated in 1996 (baseline) and 2006 (follow-up). One of the challenges is finding a place on the internet that specialises in making such data available (especially development project data). Any ideas?

PS 2014 01 07: Missing raw data is not the only problem. Lack of contact information for the evaluators/researchers who were associated with the data collection is another. In their exemplary blog about their use of QCA, Raab and Stuppert comment on their search for evaluation reports:

“Most of the 74 evaluation reports in our first coding round do not display the evaluator’s or the commissioner’s contact details. In some cases, the evaluators remain anonymous; in other cases, the only e-mail address available in the report is a generic info@xyz.org. This has surprised us – in our own evaluation practice, we always include our e-mail addresses so that our counterparts can get in touch with us in case, say, they wish to work with us again”

PS 2014 02 01: Here is another interesting article about missing data and missing policies on making data available: Troves of Personal Data, Forbidden to Researchers (NYT, May 21, 2012)

“At leading social science journals, there are few clear guidelines on data sharing. “The American Journal of Sociology does not at present have a formal position on proprietary data,” its editor, Andrew Abbott, a sociologist at the University of Chicago, wrote in an e-mail. “Nor does it at present have formal policies enforcing the sharing of data.”

The problem is not limited to the social sciences. A recent review found that 44 of 50 leading scientific journals instructed their authors on sharing data but that fewer than 30 percent of the papers they published fully adhered to the instructions. A 2008 review of sharing requirements for genetics data found that 40 of 70 journals surveyed had policies, and that 17 of those were “weak.””

3 thoughts on “The Availability of Research Data Declines Rapidly with Article Age”

Muhammad Taher says:

23 December, 2013 at 7:26 AM

Quality of statistical data in aid programmes are often weak for many different reasons, including weaknesses in the data gathering and data entry processes and changes of staff. There is often a pathetic lack of openness among the project holders in acknowledging mistakes, while internal studies try to gloat over padded glory (based on erroneous data)! The research work of CLP that Rick has referred to seems to have avoided crucial external assessment reports (e.g., Impact Evaluation 2009), which were critical of the datasets maintained by the programme.

Murray says:

30 December, 2013 at 3:33 PM

While this topic relates to ‘public archives’, this has broader implications for development and is very timely for me. I have recently started in a new role with an NGO and found out that datasets from evaluations and baseline surveys are totally missing. All we have are the final reports (which I have some ‘questions’ over, hence the need to look back at the raw data).

It seems the datasets and the survey forms were only kept by the consultants. Why? I have no idea. I have immediately made a change to consultant contracts to ensure the NGO retains the data, in trust, for the community. (This is important, as it is the community who ‘owns’ the data – not the NGOs or the consultants). Next task is to try to trace the consultants to get the raw datasets.

I agree with Muhammad that “Quality of statistical data in aid programmes are often weak”. Actually, that’s probably a huge understatement. Some of the statistical practice I have observed globally makes me wonder about the education and/or training given to staff (and consultants!). At worst, it calls into question the ‘robustness’ of assessing and validating real impact. Many of the current approaches would not be acceptable in terms of appropriate scientific or academic rigour. I think we are too soft in development and don’t demand better accuracy (yes, I’m old-fashioned about that). After all, rubbish in equals rubbish out. There is a real integrity issue around this.

From my own analysis in my new role: in the only evaluation dataset I have, I found an error rate of something like 15% (and that is conservative). It doesn’t take a rocket scientist to know such errors must have an impact on the final report. The errors seem to be related to basic statistical issues around poor quality control, especially data entry and data cleaning. However, the survey form is probably the worst I have EVER seen – how was it ever ‘approved’…? Unfortunately, it’s been used for the past 3 years here for baselines and evaluations in multiple projects in the same sector. I strongly suspect that if I had the datasets from the baselines and evaluations, I would find similar error patterns.

How to solve this? Maybe all baseline and evaluation reports should either embed the data file in the Appendix or have a website where it is available? If people say the dataset is too large – well, that simply means too much unnecessary data was (most likely) collected. Sure, there may be times when ‘confidentiality’ and ‘security’ issues override this intent – however, they would be the exception to the rule. Most datasets can be easily cleaned to ensure anonymity of participants. I think the confidentiality argument is used too much and acts as a ‘convenient’ excuse not to be transparent.

rick davies says:

7 January, 2014 at 10:40 AM

Hi Murray
Thanks very much for your comment
Your proposal to change the consultants’ contract conditions sounds very worthwhile. It would be interesting to see how many other NGOs follow the same good practice, or not… Perhaps BOND could provide or obtain such information.
regards, rick davies
