Background


OpenRefine software

OpenRefine is a Java-based program that runs locally on your computer rather than as an online service.

Its interface opens in your web browser, but no internet connection is needed to use it unless you want to bring in web-based data for cleaning. Once that data has been read into OpenRefine, no further internet connection is needed.

Features of OpenRefine

Tasks you can use OpenRefine for

Clean & standardise data
  • Identify where data is missing
  • Fix inconsistencies such as date formats, name case format and order
  • Find and correct errors including misspellings, typos and stray whitespace
  • Find and remove duplicate observations
  • Identify and fix illegal values (data that does not fall within the accepted range for the variable)
  • Map the meaning of the dataset to its structure (see ‘Tidy data’ by Hadley Wickham)
Extend & transform data
  • Split columns or rows of data up into more granular parts
  • Combine multiple datasets into one
  • Combine values from two or more variables (concatenation)
  • Add new variables (columns) or observations (rows) to a dataset
  • Reshape data from rows and columns to visualise data in a different arrangement
  • Organise data
Explore data prior to analysis
  • Sort by variables and values
  • Aggregate: reorganise to get a summary of the data
  • Filter: extract a subset by value
  • Facet: summarise values to provide a big picture of your data or to identify outliers
Document & repeat steps
  • Document all steps taken to process the data
  • Create scripts to automate and repeat the processes on other datasets
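
To make a few of the tasks above concrete, here is a minimal sketch in plain Python. It is not how OpenRefine works internally, and in OpenRefine you would normally do these steps through its browser interface. The records, the column names (name, city, age) and the accepted age range of 0 to 120 are all invented for illustration.

```python
from collections import Counter

# Hypothetical messy records; names, columns and values are invented.
rows = [
    {"name": "  alice SMITH ", "city": "Leeds", "age": "34"},
    {"name": "Alice Smith", "city": "leeds", "age": "34"},
    {"name": "Bob Jones", "city": "York", "age": ""},
    {"name": "Cara Lee", "city": "York ", "age": "340"},
]


def clean(row):
    """Trim and collapse whitespace, and standardise the case of text columns."""
    return {
        "name": " ".join(row["name"].split()).title(),
        "city": row["city"].strip().title(),
        "age": row["age"].strip(),
    }


cleaned = [clean(r) for r in rows]

# Remove duplicate observations, keeping the first occurrence of each.
seen = set()
unique = []
for r in cleaned:
    key = (r["name"], r["city"], r["age"])
    if key not in seen:
        seen.add(key)
        unique.append(r)

# Identify where data is missing.
missing_age = [r["name"] for r in unique if r["age"] == ""]

# Identify illegal values: ages outside the assumed accepted range 0 to 120.
illegal_age = [
    r["name"] for r in unique if r["age"] and not 0 <= int(r["age"]) <= 120
]

# Facet: count how often each value occurs in a column.
city_facet = Counter(r["city"] for r in unique)

print(unique)
print(missing_age, illegal_age, city_facet)
```

In this sketch the first two records only turn out to be duplicates after whitespace and case have been standardised, which is why cleaning usually comes before de-duplication.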

Explore the two figures below to see examples of messy and clean tabular data.

Messy data example

[Figure: Messy Data]

Clean data example

[Figure: Clean Data]
