Building a dataset

A dataset can be built either by finding and extracting existing text data, collating multiple resources or generating your own data from surveys, interviews or other investigations. Building a dataset can be as simple as collecting PDFs from your literature search, or as complex as finding then transcribing handwritten texts and converting them into a computational format. In humanities research your dataset may also be known as a corpus and through out this guide the terms are used interchangeably.

If you are not creating your own data the first step is to find the dataset or corpus of interest.

Finding data

Find data in GLAM collections
Find data in GLAM collections

To find data we suggest you start with our Finding data guide. This page links to datasets across the different discipline areas.

As the processes and tools for Digitisation, Optical Character Recognition (OCR) processing, indexing, encoding, and online hosting of primary source materials has improved we now have greater access, and new ways to find, explore, process and analyse text as data. This in practice means that we can now find and download analysable resources from a variety of sources. Our Text mining and analysis guide provides links to Griffith Libraries subscription and open source databases of resources you can mine and collate to create a dataset.

Explore the amazing data that can be extracted from Australian GLAM catalogues with historian and hacker Ass. Prof. Tim Sherratt, via the video below.

If the data you need is not available via the sources in the guides above, youi can make a direct enquiry with the platform or data owner of interest. Companies may make their data available to researchers by agreement such as Twitter’s Academic Research access service. Archives and other repositories will have links to datasets that can be downloaded for TDM, as will some Government websites.

Extracting data

Searching and downloading

The simplest method to access text data is to search GLAM databases and download individual resources or sets of results as required. This is useful if searching for small quantities of text. GLAM websites and Library database vendors provide useful search and download help for users.

  • When downloading resources, select text .txt file format which has all formatting removed and is readible by computers.

  • If this isn’t possible choose .pdf format which is accepted by some analysis programs or can be read by Optical Character Recognition (OCR) to make it readible by analysis tools.

Watch the video below to learn how to search then download newspaper articles from TROVE.

TROVE text download options include .txt and .pdf formats.

Using an API to extract data

An Application Programming Interface (API) is a software tool that allows two applications to communicate with each other. In the case of collecting web data, an API allows you to access a webpages database and then use that data in your own analysis. This process can require you to use a programming language such as Python or R, specialised software on your computer, or a cloud based application.

Using a supplied API is the easiest way to collect metadata sets (data about the text, e.g. a literature database records) or a large dataset of text from a database or website. Examples of API’s useful to text analysis researchers include:

Case study - the Prosecution Project - finding sources

The Prosecution Project is a collaborative research project on the history of the criminal trial in Australia from 1788 to 1960. Sources of text data included Supreme Court criminal trial registers, Police Gazettes notices, Prison and Convict Registers, along with press releases and other newspaper articles https://prosecutionproject.griffith.edu.au/.

The images below include example source materials like those used by the Prosecution Project:

Sources of text for the Prosecution Project
Sources of text for the Prosecution Project

Now you have found and accessed a dataset of text, let’s explore the next step in the workflow, preparing and formatting the data.


<-- BACK | NEXT -->