Activities and workflows

A text mining and analysis workflow is messy, iterative and complex, and it often resembles the diagram below. “Depending on the project, a researcher may repeat certain steps in small cycles, or return to previous steps, or do some exploratory steps to determine next steps.” (Green, et al. 2018)

Text mining workflow

Source: Green, et al. (2018)


Considerations

Consider the questions below when embarking on a text mining and analysis project.

Do you have a dataset, corpus or text source? Not yet? GLAM (galleries, libraries, archives and museums) catalogues, subscription databases, websites and social media are good sources if you aren’t collecting your own text data from interviews or surveys. Find sources via the Text mining guide.

Is the text you want to use already digitised?

If the text you are mining is in a recent PDF file, it should not need any further amendment. Older PDF files are often stored as images of the page, which computers cannot read as text. You will need to transcribe or convert these files to a machine-readable format. Handwritten texts will require Optical Character Recognition (OCR) tools or manual transcription. Methods for transcription and reformatting will be explored later in the tutorial.

What are the copyright or licensing requirements?

  • Some collections, particularly those published before a certain date, provide licences for reuse. An example is the historical newspapers digitised in Trove by the National Library of Australia.
  • If the collection is more recent, you can request permission from the copyright owners of subscription databases via the Library.
  • Griffith’s Information Policy Officer can provide advice on specific copyright and licensing use cases.

How much technical experience do you need? This depends on what you wish to achieve.

  • Some tools and methods are plug-and-play, while others require coding and web development experience. You can explore these technical requirements further in the upcoming lessons.
  • eResearch Services run regular coding workshops and provide great support via Hacky Hour.

Can you pay for associated costs?

  • You may need to pay subscription vendors or publishers for access to OCR text, and some text and audio transcription services charge fees. Investigate the potential costs before commencing a project.

Text mining and analysis workflow steps

The most common steps in the workflow include:

  • Finding and building the dataset, either by collating your own resources or accessing and using pre-existing data
  • Preparing, cleaning and formatting the text, audio or video for processing
  • Computationally processing text to extract data, either with plug-and-play tools or with code
  • Analysing text using computational methods
  • Visualising text using computational methods
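
For readers taking the coding route, the steps above can be sketched in a few lines of Python. This is a minimal illustration only, not a recommended toolchain: the two sample sentences, the tiny stop-word list and the function names (clean, tokenise, count_terms, text_bar_chart) are all assumptions made for this example, and it uses only the Python standard library.

```python
import re
from collections import Counter
from typing import List

# Step 1 (build the dataset): hypothetical sample texts standing in for a real corpus.
DOCUMENTS = [
    "The library digitised historical newspapers for researchers.",
    "Researchers mine newspapers to find patterns in historical text.",
]

# A deliberately tiny stop-word list, for illustration only.
STOP_WORDS = {"the", "for", "to", "in", "a", "of", "and"}


def clean(text: str) -> str:
    """Step 2 (prepare and clean): lowercase and replace non-letters with spaces."""
    return re.sub(r"[^a-z\s]", " ", text.lower())


def tokenise(text: str) -> List[str]:
    """Step 3 (process): split cleaned text into words and drop stop words."""
    return [word for word in clean(text).split() if word not in STOP_WORDS]


def count_terms(documents: List[str]) -> Counter:
    """Step 4 (analyse): count how often each word appears across all documents."""
    counts = Counter()
    for document in documents:
        counts.update(tokenise(document))
    return counts


def text_bar_chart(counts: Counter, top_n: int = 5) -> str:
    """Step 5 (visualise): draw the most frequent words as a plain-text bar chart."""
    lines = []
    for word, freq in counts.most_common(top_n):
        lines.append(f"{word:<12} {'#' * freq} ({freq})")
    return "\n".join(lines)


if __name__ == "__main__":
    print(text_bar_chart(count_terms(DOCUMENTS)))
```

A real project would replace each function with something more robust, such as a fuller stop-word list or a charting tool, but the order of the steps stays the same.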

Before exploring each of these activities, let’s identify your ethical obligations in the next lesson.
