Tidy data - a revisit
It would be useful to add the Distribution
variable from NESP_SharkSpeciesList
to QldShark_2017_Clean_v2
, but the data in that column is very messy with multiple values inside the cells. Do a Text facet
on the Distribution column
and explore.
To create a tidy dataset, where:
- Each variable forms a column
- Each observation forms a row
- Each cell has one value
- Each type of observational unit forms a table,
multi-value cells need to be split by the value.
This task is helpful where there are multiple values in a cell that are not organised consistently, such as when survey respondents can select multiple, controlled values to answer a question, or a notes field has free text. The Distribution
column is an example of this. Let’s add the column to QldShark_2017_Clean_v2
then tidy it.
Add a new column
- Go to
QldShark_2017_Clean_v2
- Select
Common name
columnEdit Column > Add column based on this column
- name the new Column
Distribution
-
enter this GREL expression:
cell.cross("NESP_SharkSpeciesList","Common name")[0].cells["Distribution"].value
382 rows of the shark captures now have a new variable & value of Distribution
added.
Now we need to tidy the new variable in QldShark_2017_Clean_v2
, then split the column into two columns with a value in each.
Split a multi-value column
- Go to
QldShark_2017_Clean_v2
- Select column
Distribution > Facet > Text Facet
- There are 6 choices. Two of the choices contain two values within the cells e.g. Australasia & Indonesia. The multiple values do not contain a common separator. This will need fixing to be able to split on.
Let’s change the blank choice to “NULL” as these records do not have any data.
- In
Distribution
Facet box hover over(blank)
- Select
edit
and typeNULL
>Apply
- At
Distribution column
- Select
Edit cells > Replace > Find: & Replace with: ;
- Select
Edit column > Split in several columns > by separator *;* > Split into *2* columns
ok
.
An additional Distribution column will be created with values only appearing where there had previously been two values in the original cell.
Watch these steps in this video.
Now let’s explore the column Notes on skies and wind
. It is messy and contains multiple values within the cells. Ultimately we will create multiple columns using GREL and a language known as Regular Expression or Regex, which searches for patterns in strings.
Add new columns using GREL & Regex
- Go to column
Notes on skies and wind
- Select
Facet > Text facet
and explore the 36 choices
Notice the variations in common separators, order of values and case format. We cannot split the variable into multiple columns using the method.
- Hover over the
(blank)
choice selectedit
and change(blank)
toNULL
- Select
Edit cells > Common transforms > to lower case
to make the case consistent throughout the cell (and reduce the choices to 31) - Select
Edit column> add column based on this column
- Type new column name
Overcast
-
Click inside expression box, enter GREL expression:
if(value.contains("overcast"),"overcast",value).replace(/.*[^overcast].*/,"")
This means… if (the value in the cell contains “overcast”, replace it with “overcast” value), then replace (anything that is not “overcast” that is found one or more times in the cell, with “” ie. a blank).
- Preview and ok
- Repeat steps above for the other Sky related value
fine
(can reuse expression fromhistory
tab)
Now let’s join up the two columns fine
and overcast
which are really values, into a new variable Skies
- Go to
fine
columnEdit column > Join columns...
- Select
fine
andovercast
from tick box list - Select
Write result in new column named...
- Type
Skies
andok
.
We now have a new variable Skies
. Perform and text facet to see the value choices.
- Hover over
(blank)
choice selectedit
and typeNULL
andApply
. Now each cell has data.
Next we can extract all the “wind” data into a new variable using the steps above.
- Select
Edit column> add column based on this column
- Type new column name
strong breeze
-
Click inside expression box, enter GREL expression:
if(value.contains("strong breeze"),"strong breeze",value).replace(/.*[^strong breeze].*/,"")
- Preview and ok
- Repeat steps above for the other “wind” related values that can be found in the
Text facet window
near gale
andlight winds
(can reuse expression fromhistory
tab).
Now let’s join up the 3 columns strong breeze
, near gale
, light winds
which are really values, into a new variable Wind
- Go to
near gale
columnEdit column > Join columns...
- Select
strong gale
,calm winds
,strong breeze
,near gale
,light winds
,gale force
,light breeze
from tick box list - Select
Write result in new column named...
- Type
Wind
andok
.
We now have a new variable Wind
. Perform a text facet to see the value choices.
- Hover over
(blank)
choice selectedit
and typeNULL
andApply
. Now each cell has data. - Go to
All
columnEdit columns > Re-order / remove columns...
- Drag and drop the following columns to remove:
fine
,overcast
,Notes on skies and wind
,near gale
,strong breeze
,light winds
ok
.
This great GREL cheat sheet from code4lib Toronto has more details on building expressions using Regex.
See how this works in the video below.
The final step is to export specific variables from this tidy dataset to a .csv file which can be parsed (understood by) the geo.json tool used in the next lesson.
Export selected columns to .csv file
- Go to
All
column>Edit Columns> Reorder Remove columns
- Drag and drop the columns not needed including
Species Code
,Date
,Area
,Location
,Genus
,Species
,Distribution
,Distribution1
, to theremove
box. Common name
,Latitude
,Longitude
,Length (m)
,Rainfall_mm
,Formatted Date
,Water Temp (C)
,Skies
,Wind
will remainOk
- Click
Export
button, top right hand corner - Select Comma-separated value dataset
- save file