Find a suitable variable


The aim of this lesson is to enhance the QldShark_2017_Clean_v2 dataset by adding new variables, to explore if rainfall occurred on the days that sharks were captured and to add scientific names to the common shark names, protection status and distribution values to the dataset where possible.

The Bureau of Meterology rainfall dataset for the Gold Coast Seaway BOM_GCS_Rain_2017.csv and the NESP_SharkSpeciesList.csv datasets could be a good matches. We will only match Gold Coast shark captures with the rainfall data in this lesson. It is possible to find and download the other relevant QLD coastal rainfall datasets from BOM.

Let’s add the rainfall variable first. We need to identify a common variable in each of the datasets that we can match on. A key variable. To do this we need to explore both datasets to see if a key or very similar variable is available that we can work with.

Let’s open another instance of OpenRefine to explore the BOM_GCS_Rain_2017.csv dataset first.

Add a new column with values from three columns
  • Select the Open button in the top right hand corner of OpenRefine to open a new instance of the software in your browser.
  • Create a new project using BOM_GCS_Rain_2017.csv dataset
  • Name it BOM_GCS_Rain_2017
  • Look at both datasets, what variables do they have in common?
    • Date column in QldShark_2017_Clean_v2
    • Year , Month and Day columns in BOM_GCS_Rain_2017.

These could be used as a key to match on. The unique key will be in the BOM_GCS_Rain_2017 dataset as each date only occurs once and can be matched and added to the same date as many times as appropriate in the QLDShark_2017_Clean_v2 data.

At present the date is represented in three columns, Year, Month and Day. We need to combine (concatinate) the values of the columns into one column, and add a specific string to the new column’s values, such as a common separator, that matches the format in QLDShark_2017_Clean_v2.

  • Go to BOM_GCS_Rain_2017 OpenRefine project
  • At Year column select Edit column > join columns
  • Tick Year , Month and Day boxes
  • add the separator - (dash) between the content of each column so it will match the date format in QldShark_2017_Clean_v2
  • at Write result in new column named add Date
  • OK.

The image below illustrates this step.

Add a column with values from three columns
Joining up columns
Change variable from text to date format

The final step required is to change the underlining text format of the new date variable in BOM_GCS_Rain_2017 to a standard ISO date format to match the date in QldShark_2017_Clean_v2.

  • In BOM_GCS_Rain_2017 OpenRefine project
  • Go to Date column Edit cells > Common transforms > To date

The cells values are now green and contain values that look like e.g. 2017-01-03T00:00:00Z in this format yyyy-mm-ddTnn:nn:nnZ.

We want to extract the Rainfall data which is in the column named Rainfall amount (millimetres). It is a long variable name, let’s change it.

  • Go to Rainfall amount (millimetres) column
  • select Edit column > Rename this column
  • add name Rainfall_mm
  • Ok.
Watch this video to work through the activity

<-- BACK | NEXT -->