Working with messy data sets? Two useful and free tools

I have just come across two useful tools for when you are working with someone else’s data sets, or with data sets from multiple sources and times. Or just your own data that was in a less than perfect state when you last left it :-)

  • OpenRefine: Previously known as Google Refine and now open source, with its own support and development community. You can explore the characteristics of a data set, clean it in quick and comprehensive moves, transform its layout and formats, and reconcile and match multiple data sets. There are documentation and videos to show you how to do all this. There is also a book, which you can purchase. The Wikipedia entry provides a good overview.
  • Tabula: This package allows you to extract tables of data from PDFs, a task which can otherwise be very tiresome, messy and error-prone.

And some other packages I have yet to explore

