Facets & Clustering
Facets
Faceting is a useful feature of OpenRefine and can help you:
- Get an overview of the data in a project
- Get counts of data in specific columns
- Help you identify missing, misspelled or inconsistent data.
A common use case for your data might be where you want to know how many times a particular value appears in a column in your data.
A ‘Facet’ groups all the like values that appear in a column, and then allows you to filter the data by these values and edit the values for a number of records in one go.
Activity – Looking at data through Facets
The first facet to explore is Text facet
. It groups all the identical text values in a column and lists each value and the number of records in which that value appears. Facet information always appears in the left-hand panel in the OpenRefine interface.
Here we will use faceting to look at the values represented in the Area
column.
- Scroll over to the
Area
column. - Click the down arrow and choose
Facet > Text facet
.
In the left panel, you will now see a box containing every unique value in the Area
column, along with a number representing how many times that value occurs in the column. There are also some blank values to identify later. At the top of the box you can sort the results by name and count. A great feature for exploring a large dataset.
- Try sorting this facet by name and by count. How are they sorted? (name is alphabetical and count is largest first).
- Which areas have the highest and lowest shark catches? (Townsville is the highest, Bribie Island the lowest).
- Close facet by clicking the
x
in top corner of the Facet panel.
Note: Always close facets when you are finished with them, so as not to affect future facets or results.
You can also amend data with Facets.
In this next activity you want to limit to a sub-set of this data, with records from the south of the Cairns Area
only.
Activity - amending data through Facets
- Scroll to
Area
Column. - Click the down arrow and choose
Facet > Text facet
. Unique values will be displayed in left hand panel. - Click on the choices in the facet, or hover over them,
edit
andinclude
functions appear. - Hover over Cairns and use
include
function to narrow results just to those records. - Check how many records display? (39).
- Now use
Exclude
to return to all records. - Use
include
function again for valuesCairns
. - Remove all these records from the dataset.
- Select
All column > Edit rows > Remove all matching rows
. - This value is now displayed in red and missing counts. Click
reset
. - How many records are now available? (429)
- Close the facet.
You can also edit values using the facets feature.
Activity – fixing errors with Facets
- Use faceting to look at the
Species Name
Column. - What are the results? 34 value choices. Do you see different representations of the same value?
- Hover over BULL WHALERS and choose
Edit
correct the spelling to BULL WHALER then selectApply
. This changes all five incorrect records. - Do this again for COMMON BLACKTIP WHALERS to correct the spelling for all records at once.
- Do you now have 32 value choies for the variable
Species Name
? - We will explore the other representations of values in another exercise.
- Close facet
The Species Name values are represented in all capital letters. Let’s change this using a Common transforms
to titlecase.
Activity - fix font case with `Common transforms`
- Select
Species Name > Edit cells > Common transforms > To titlecase
. Now the case format is consistent throught the dataset.
Watch these steps in the video below.
OpenRefine’s many transformations functions will be explored further soon.
More on Facets
As well as Text facets
, OpenRefine supports other types of facets. These include:
Numeric facets
: displays a graph and includes ‘drag and drop’ controls to set a start and end range to filter the data displayed. Explore these later in the lesson on Examining Numbers in OpenRefine.Timeline facets
: displays a histogram with values sorted chronologically. This only works on cells formatted as ‘date’ data type.- ` Customized facets` provide a range of different facets including:
Word facet
- to break down text into words and count the number of records each word appears inDuplicates facet
- this results in a binary facet of ‘true’ or ‘false’. Rows appear in the ‘true’ facet if the value in the selected column is an exact match for a value in the same column in another row.Text length facet
- creates a numeric facet based on the length (number of characters) of the text in each row for the selected column. This can be useful for spotting incorrect or unusual data in a field where specific lengths are expected (e.g., if the values are expected to be years, any row with a text length of more than 4 for that column is likely to be incorrect.)Facet by blank (null or empty)
- a binary facet of ‘true’ or ‘false’. Rows appear in the ‘true’ facet if they have no data present in that column. This is useful when looking for missing data.
What data is missing in `Water Temp (C)` column?
Use a Customized facet
to find out how many records are missing Water Temp (C) values. This facet shows true or false results.
- Select
Facet > Customized facets > Facet by blank (null or empty)
. - Hover over the false results and select
include
.
This result is a totals line which we don’t need, let’s remove it.
- Select
All column > Edit rows > Remove all matching rows
. - This value is now displayed in red and missing counts. Click
reset
. - How many records are now available? (428)
Clustering
Another very useful feature of OpenRefine is Clustering. In OpenRefine, clustering means ‘finding groups of different values that might be alternative representations of the same thing’. For example, the text strings ’New York’, ‘new york’ or ‘New Yrok’ very likely refer to the same concept.
Clustering is a very powerful tool for identifying and fixing datasets which contain misspelled or mistyped entries.
OpenRefine has several clustering algorithms built in. Let’s experiment with them.
Activity – fixing errors via Clustering
- Create a Text Facet for
Location
column, scroll through the list. You will notice a number of values that are likely mis-typed entries. - Click the
Cluster
button, on the top right of the facet. In the resulting pop-up window, different cluster methods and algorithms are available via drop down boxes. - the
key collision
method andfingerprint
keying function is the first displayed. - Try selecting different Methods and Keying Functions combinations, to see if merges are suggested.
- When a merge is suggested the value with the most rows is suggest as the new cell value, you can change that by clicking on one of the other options or typing a new value name.
- Select the new value names by ticking the boxes.
- Click the
Merge?
box beside the cluster, then clickMerge Selected and Re-cluster
to apply the corrections to the dataset. - There may be a few more clusters, to fix misspellings, typos, capitalisation, hyphens, etc.
- How many choices are now shown in the facet? Depending on your merges, there may be 63 choices remaining.
- Close the facet.
Note: Some merges are not necessary. The function cologne-phonetic
will find Miami Beach and Main Beach as potential variants. These are valid values, there is no need to merge these.
Watch this video to learn the Clustering function.
Different clustering algorithms
Detailed information on the different clustering algorithms and combinations is available here: https://docs.openrefine.org/manual/cellediting#clustering-methods.