Data analysis

Data analysis takes on many names and many forms. In the context of this resource, data analysis is the use of code or tools to process the data we have collected, cleaned and formatted for use. This is where we use our data to inform our decision making, identify themes and gaps in the research, and find evidence to support findings.

Methods for analysis

Once the text is cleaned and formatted it can be computationally processed using a number of different methods, depending on the analysis you want to undertake. Explore some of the popular methods below.

Natural language processing (NLP) techniques

Natural language processing (NLP) is the branch of artificial intelligence (AI) that trains a computer to understand, process, and generate language. Search engines, machine translation services, and voice assistants are all powered by this technology. Source: Bell & Olavsrud, 2021

The following NLP tasks break down text into analysable parts:

NLP method | Description | Example
Tokenization | Splits the text into sentences and sentences into words; changes to lowercase and removes punctuation. | Creates a ‘bag of words’ for analysis.
Stop word removal | Uses standard language stop word dictionaries, which can be amended. | Removes words such as “the”, “and”, “it”, “so”, “this”, “page”, “of”…
Lemmatization | Reduces words to their dictionary form (lemma): third person words are changed to first person and verbs in past and future tenses are changed into present. | Alters change, changing, changes, changed… to change
Word stemming | Words are reduced to their root form. | Changes victorious, victories, victory… to victor
Special characters removed | Characters that cannot be understood are removed. | * @ # ! »
Part-of-speech tagging | Labels each word in a text with its part of speech. | Her (pronoun) hat (noun) is (verb) grey (adjective).
Shallow parsing | Chunks phrases from unstructured text. | Identifies sentences, verb phrases, noun phrases.
Syntactic parsing | Finds structural relationships between words in a sentence. | Can, for example, identify a noun phrase as being formed by a determiner, followed by an adjective, followed by a noun.
Image: Bag of words (Natural Language Processing - tokenization)
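
The first few tasks in the table can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not how any particular tool implements them. The stop word list, the suffix rules and the sample sentence are all made up for the example, and the stemmer is deliberately crude compared with established stemmers.

```python
import re

# Tiny illustrative stop word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "and", "it", "so", "this", "of", "is", "a", "to"}

# Deliberately crude suffix rules, for illustration only.
SUFFIXES = ("ing", "es", "ed", "s")


def tokenize(text):
    """Lowercase the text and split it into word tokens, dropping punctuation.

    Only the letters a-z are kept, so digits and accented letters are lost.
    """
    return re.findall(r"[a-z]+", text.lower())


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def crude_stem(token):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


# Hypothetical sample sentence.
text = "The policy changes, and it changed the reporting of results."
tokens = tokenize(text)
content_tokens = remove_stop_words(tokens)
stems = [crude_stem(t) for t in content_tokens]
print(stems)
```

Note that the stems are not always real words ("changes" and "changed" both become "chang"). This is typical of stemming; lemmatization would instead return the dictionary form "change".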

Machine learning

Machine learning is a branch of AI in which a computer system is taught to recognise patterns in text without being explicitly programmed to do so. Machine learning can be either unsupervised (with minimal human intervention) or supervised (with more human intervention, typically through examples that people have labelled). Explore machine learning at ZDNet.com. Analysis using machine learning includes topic modelling and Naive Bayes classification, which are detailed below.
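
As a rough illustration of the supervised case, the sketch below trains a very small Naive Bayes classifier on hand-labelled examples and then labels a new piece of text. It uses only the Python standard library. The training sentences, the labels and the function names `train` and `classify` are hypothetical, and a real project would normally rely on an established machine learning library and far more data.

```python
import math
from collections import Counter, defaultdict


def train(labelled_docs):
    """Count words per label and documents per label."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    for text, label in labelled_docs:
        doc_counts[label] += 1
        word_counts[label].update(text.lower().split())
    return word_counts, doc_counts


def classify(text, word_counts, doc_counts):
    """Return the label with the highest log-probability for the text."""
    vocabulary = {w for counts in word_counts.values() for w in counts}
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, float("-inf")
    for label in doc_counts:
        # Log prior: how common this label is in the training data.
        score = math.log(doc_counts[label] / total_docs)
        label_total = sum(word_counts[label].values())
        for word in text.lower().split():
            # Add-one (Laplace) smoothing avoids zero probabilities.
            count = word_counts[label][word] + 1
            score += math.log(count / (label_total + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical, hand-labelled training examples (the "human intervention").
training = [
    ("great helpful library service", "positive"),
    ("helpful staff and great tools", "positive"),
    ("slow confusing website", "negative"),
    ("confusing and slow search", "negative"),
]
word_counts, doc_counts = train(training)
prediction = classify("great tools", word_counts, doc_counts)
print(prediction)
```

The labels supplied by a person are what make this supervised. An unsupervised method such as topic modelling would receive only the texts and would have to find the groupings itself.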

Common computational analysis tasks

Explore some common computational text analysis methods.

Text analysis tasks

Text pattern analysis

  • Word frequency analysis counts how often words occur; it is useful for historical exploration of language as well as topic identification.
  • Collocation identifies words commonly appearing near each other.
  • Concordance shows the context of (the words around) a given word or set of words.
  • N-grams find common two-word, three-word and longer phrases; see the Google Books Ngram Viewer.
  • Dictionary tagging locates a specific set of words in texts.
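
Several of these pattern analyses come down to counting and slicing a list of tokens. The sketch below is a minimal standard-library illustration, not the method used by any of the tools in this resource. The sample text and the function names are made up, and counting adjacent word pairs is only a rough stand-in for proper collocation measures.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of the letters a-z as tokens."""
    return re.findall(r"[a-z]+", text.lower())


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def concordance(tokens, keyword, window=2):
    """Return each occurrence of keyword with `window` tokens either side."""
    lines = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            lines.append(" ".join(left + [token] + right))
    return lines


def dictionary_tag(tokens, terms):
    """Return (position, token) for every token found in the set of terms."""
    return [(i, t) for i, t in enumerate(tokens) if t in terms]


# Hypothetical sample text.
text = "Open data supports open research. Open research needs open data."
tokens = tokenize(text)

word_frequency = Counter(tokens)            # word frequency analysis
bigram_counts = Counter(ngrams(tokens, 2))  # two-word phrases (bigrams)

print(word_frequency.most_common(3))
print(bigram_counts.most_common(2))
print(concordance(tokens, "research"))
print(dictionary_tag(tokens, {"data"}))
```

This sketch ignores sentence boundaries, so a bigram or concordance line can run across the end of one sentence into the next. Splitting the text into sentences first would avoid that.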

Analysing one or a number of texts of interest

Method | Description | Example
Topic modeling | Unsupervised machine learning to identify groups of terms that may be representative of a given topic, uncovering hidden themes. | Topic mapping for a literature review
Document classification | Uses machine learning, such as Naive Bayes classification, to classify documents based on information in the text. | Used in sentiment analysis and literature reviews.
Sentiment analysis | Used to determine whether text is positive, negative, or neutral. Used in research to see public sentiment, opinions, or emotions about products, ideas or policy, and can be undertaken via NLP or machine learning. | Tweet Sentiment Visualization App from NC State University.
Network analysis | Analysis of social or other structures comprising variables or actors (represented by nodes), and the relationships (edges) between the nodes. | Network Analysis 101
Named entity recognition | Generates a list of people, places, dates, times etc. | Booking.com user experience analysis
Stylometry | Statistical method of studying a linguistic style. | Used in forensic, attribution and genre analysis.
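
To make one of these methods concrete, the sketch below shows a very simple dictionary-based approach to sentiment analysis: each word is looked up in a list of scored words and the scores are added together. The word list, the scores and the sample sentences are hypothetical. Real sentiment lexicons contain thousands of entries, and the machine learning route instead learns from labelled examples.

```python
import re

# Hypothetical mini-lexicon; real lexicons contain thousands of scored words.
LEXICON = {
    "good": 1, "great": 1, "helpful": 1,
    "bad": -1, "slow": -1, "confusing": -1,
}


def sentiment(text):
    """Sum the lexicon scores of the words and map the total to a label."""
    words = re.findall(r"[a-z]+", text.lower())
    score = sum(LEXICON.get(word, 0) for word in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


# Hypothetical sample sentences.
samples = [
    "The new catalogue is great and helpful.",
    "Search is slow and confusing.",
    "The library opens at nine.",
]
for sample in samples:
    print(sample, "->", sentiment(sample))
```

An approach this simple does not handle negation ("not good") or sarcasm, which is one reason to check results by reading a sample of the texts.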

Learn more about these methods from:


Analysis tools

Griffith University subscription software
  • NVivo : performs cluster analysis, phrase nets, tag clouds, and sentiment analysis.
  • Leximancer : performs network analysis, topic modeling, sentiment analysis, and named entity recognition.

Login and installation are required for both. Training is available for Griffith researchers via Researcher Education & Development.

Platforms with prepared text and tools

The virtual research environments below have been developed to support digital text scholarship.

  • JSTOR text mining support : for metadata, n-grams, and word counts for most articles and book chapters, and for all research reports and pamphlets available via Griffith University’s subscription to JSTOR. Login required.
  • Gale Digital Scholar Lab : for document clustering, named entity recognition, n-grams, part-of-speech tagging, sentiment analysis and topic modelling, all available via Griffith University’s subscription. The lab is designed to use the Gale Primary Source archives, but you can use the analysis tools with your own data. Learn about it and Gale Primary Sources here. Includes online tutorials. Login required.
  • HathiTrust Research Center Analytics : supports large-scale computational analysis of the digital library works to facilitate non-profit and educational research. Individual researchers can sign up for free with their Griffith email and use out-of-copyright materials and analysis tools.
  • ProQuest TDM Studio : create an account with your university email address. Undertake geographic analysis, topic modelling or sentiment analysis of ProQuest’s collection of newspapers, dissertations and theses.

Open source (free) tools

The tools listed below enable users to undertake text analysis without the need to learn to code. The majority of the tools are based on Python or R code. You can use these tools for simple or exploratory data analysis and some visualisations. Some tools are downloadable to your computer and others are web interfaces, each with their own benefits and limitations.

Advantages | Disadvantages
Easier to learn than coding; good for high-level analysis. | Can be inflexible; may not be good for deeper analysis; web-based tools may not be appropriate for use with identifiable data.
  • Voyant Tools : web-based tool for frequency, distribution and collocation of terms, keywords in context, term clusters and more.
  • Sentiment Analyzer : web-based tool for analysing one source at a time.
  • Topic Modelling Tool : a GUI for MALLET topic modelling code.
  • Cytoscape : software platform for visualizing complex networks.
  • Stanford Named Entity Recognizer (NER) : for person, organisation and location recognition.
  • WordHoard : web application for the close reading and scholarly analysis of deeply tagged texts, from Northwestern University.
  • WORDij Semantic Network Tools : downloadable software for natural language processing. It can process unstructured text from sources such as social media, news, speeches, focus groups, interviews, email, and web sites.
  • CLAWS : part-of-speech tagger for corpus annotation of English text, developed by UCREL at Lancaster University.

Voyant Tools activity

Try this activity using Voyant Tools. Look at the results, and think about how you might use them for analysis and about the limitations of the tool.

Coding for text mining and analysis
  • R & R Studio : network analysis, topic modeling, classification/clustering, named entity recognition, sentiment analysis
  • Python : network analysis, topic modeling, classification/clustering, named entity recognition, sentiment analysis

Coding tutorials for text mining and analysis
