Background
OpenRefine is a Java-based program that runs on your computer (not online).
Its interface opens in your web browser, but you do not need an internet connection to use it unless you want to bring in web-based data for cleaning. Once that data has been read into OpenRefine, no further internet connection is needed.
Features of OpenRefine
- Open source, collaboratively developed software (the OpenRefine source code is hosted on GitHub)
- A growing community of users worldwide, from novice to expert, ready to help
- Works with large datasets, for example those with more than 100,000 rows
- Can adjust memory allocation to accommodate larger datasets
Tasks you can use OpenRefine for
Clean & standardise data
- Identify where data is missing
- Fix inconsistencies such as date formats, name case format and order
- Find and correct errors, including misspellings, typos and stray whitespace
- Find and remove duplicate observations
- Identify and fix illegal values (data that does not fall within the accepted range for the variable)
- Map the meaning of the dataset to its structure (see ‘Tidy data’ by Hadley Wickham)
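OpenRefine carries out these cleaning tasks through its own interface. Purely to illustrate what the tasks mean, here is a minimal sketch in plain Python. The records, the column names and the accepted age range of 0 to 120 are assumptions made up for this example.

```python
# Hypothetical records; names and values are invented for illustration.
rows = [
    {"name": "  alice SMITH ", "age": "34"},
    {"name": "Alice Smith", "age": "34"},
    {"name": "Bob Jones", "age": ""},
    {"name": "Carol  Diaz", "age": "210"},
]


def clean_name(value):
    # Collapse runs of whitespace, trim both ends, and apply title case.
    return " ".join(value.split()).title()


cleaned = [{"name": clean_name(r["name"]), "age": r["age"].strip()} for r in rows]

# Identify where data is missing.
missing_age = [r["name"] for r in cleaned if r["age"] == ""]

# Flag illegal values: ages outside the assumed accepted range of 0 to 120.
illegal_age = [
    r["name"] for r in cleaned if r["age"] and not 0 <= int(r["age"]) <= 120
]

# Remove duplicate observations, keeping the first occurrence of each.
seen = set()
deduplicated = []
for r in cleaned:
    key = (r["name"], r["age"])
    if key not in seen:
        seen.add(key)
        deduplicated.append(r)
```

Note that the two "Alice Smith" rows only become recognisable as duplicates after the whitespace and case have been standardised, which is why cleaning usually comes before de-duplication.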
Extend & transform data
- Split columns or rows of data up into more granular parts
- Combine multiple datasets into one
- Combine values from two or more variables (concatenation)
- Add new variables (columns) or observations (rows) to a dataset
- Reshape rows and columns to view the data in a different arrangement
- Organise data
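As a rough illustration of what splitting, concatenating and combining mean in practice, the sketch below uses plain Python rather than OpenRefine itself. The two small datasets, the `|` separator and all column names are hypothetical.

```python
# Hypothetical datasets; every value here is invented for illustration.
people = [
    {"id": "1", "full_name": "Smith, Alice", "keywords": "maps|census"},
    {"id": "2", "full_name": "Jones, Bob", "keywords": "weather"},
]
cities = {"1": "Leeds", "2": "Cardiff"}  # a second dataset, keyed by id

transformed = []
for person in people:
    # Split one column into two more granular columns.
    surname, given = [part.strip() for part in person["full_name"].split(",")]

    # Split a multi-valued cell into one row per value.
    for keyword in person["keywords"].split("|"):
        transformed.append(
            {
                "id": person["id"],
                "given": given,
                "surname": surname,
                # Combine values from two variables (concatenation).
                "display_name": f"{given} {surname}",
                "keyword": keyword,
                # Combine datasets by looking up the matching id.
                "city": cities.get(person["id"], ""),
            }
        )
```

After the loop, the two original rows have become three, because the first person had two keywords and each keyword now sits on its own row.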
Explore data prior to analysis
- Sort by variables and values
- Aggregate: reorganise to get a summary of the data
- Filter: extract a subset by value
- Facet: summarise values to provide a big picture of your data or to identify outliers
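The four exploration tasks above can be sketched in a few lines of plain Python. This is only an illustration of the ideas, not of how OpenRefine implements them, and the records and column names are hypothetical.

```python
from collections import Counter

# Hypothetical observations; note the inconsistent "Fox" in the last row.
records = [
    {"species": "fox", "count": 3},
    {"species": "owl", "count": 1},
    {"species": "fox", "count": 5},
    {"species": "Fox", "count": 2},
]

# Sort: order the rows by a variable, largest count first.
by_count = sorted(records, key=lambda r: r["count"], reverse=True)

# Aggregate: total count per species.
totals = {}
for r in records:
    totals[r["species"]] = totals.get(r["species"], 0) + r["count"]

# Filter: extract the subset of rows with a given value.
only_fox = [r for r in records if r["species"] == "fox"]

# Facet: count how often each distinct value occurs.
facet = Counter(r["species"] for r in records)
```

Here the facet lists "fox" and "Fox" as separate values, which is the kind of inconsistency a facet makes easy to spot before analysis.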
Document & repeat steps
- Document all steps taken to process the data
- Create scripts to automate and repeat the processes on other datasets
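The idea of documenting steps and then repeating them on another dataset can be sketched as follows. This is a plain Python illustration under assumed names (`STEPS`, `apply_steps` and the two tiny datasets are all hypothetical), not a description of OpenRefine's own mechanism.

```python
def trim_whitespace(rows):
    # Strip leading and trailing whitespace from every cell.
    return [{key: value.strip() for key, value in row.items()} for row in rows]


def title_case_names(rows):
    # Rewrite the "name" column in title case, leaving other columns alone.
    return [{**row, "name": row["name"].title()} for row in rows]


# Each step pairs a human-readable description with the function that performs it.
STEPS = [
    ("Trim leading and trailing whitespace in every cell", trim_whitespace),
    ("Convert the name column to title case", title_case_names),
]


def apply_steps(rows, steps):
    """Run every step in order; return the result and a log of what was done."""
    log = []
    for description, step in steps:
        rows = step(rows)
        log.append(description)
    return rows, log


# Two hypothetical datasets processed with the same documented steps.
first_clean, first_log = apply_steps([{"name": " alice smith "}], STEPS)
second_clean, second_log = apply_steps([{"name": "BOB JONES"}], STEPS)
```

Because the steps are stored as data, the same list serves both as documentation (the log) and as a script that can be rerun unchanged on a new dataset.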
Explore the two figures below to see examples of messy and clean tabular data.
Messy data example
Clean data example