Data analysis

Data analysis takes on many names and many forms. In the context of this resource, data analysis is the use of code or tools to process the data we have collected, cleaned and formatted for use. This is where we use our data to inform our decision making, identify themes and gaps in the research, and find evidence to support findings.

Methods for analysis

Once the text is cleaned and formatted, it can be computationally processed using a number of different methods, depending on the analysis you want to undertake. Explore some of the popular methods below.

Natural language processing (NLP) techniques

Natural language processing (NLP) is the branch of artificial intelligence (AI) that trains computers to understand, process, and generate language. Search engines, machine translation services, and voice assistants are all powered by this technology. Source: Bell & Olavsrud, 2021

The following NLP tasks break down text into analysable parts:

NLP method	Description	Example
Tokenization	Splits the text into sentences and sentences into words; changes to lowercase and removes punctuation.	This creates a ‘bag of words’ for analysis
Stop word removal	Uses standard language stop word dictionaries which can be amended.	removes words such as “the”, “and”, “it”, “so”, “this”, “page”, “of”….
Lemmatization	Third person words are changed to first person and verbs in past and future tenses are changed into present.	Alters change, changing, changes, changed… to change
Word stemming	Words are reduced to their root form.	Changes victorious, victories, victory… to victor
Special characters removed	Characters that cannot be understood are removed.	* @ # ! »
Part-of-speech tagging	Categorises words in a text in correspondence with a particular part of speech.	Her `(pronoun)` hat `(noun)` is `(verb)` grey `(adjective)`.
Shallow parsing	Chunks phrases from unstructured text.	Identifies sentences, verb phrases, noun phrases.
Syntactic parsing	Finds structural relationships between words in a sentence.	Can for example identify a noun phrase as being formed by a determiner, followed by an adjective, followed by a noun.
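The first two tasks in the table can be sketched in a few lines of Python using only the standard library. This is a minimal illustration, not a recommended pipeline: the stop word list is a tiny made-up sample, the function names are hypothetical, and real projects usually rely on an established NLP library for these steps.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop word list for illustration only;
# standard stop word dictionaries are much longer and can be amended.
STOP_WORDS = {"the", "and", "it", "so", "this", "of", "is", "a", "to"}


def tokenize(text):
    """Lowercase the text and return its words, dropping punctuation."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens that are not in the stop word list."""
    return [token for token in tokens if token not in stop_words]


def bag_of_words(text):
    """Count how often each remaining word occurs in the text."""
    return Counter(remove_stop_words(tokenize(text)))


sample = "The hat is grey, and the grey hat is changing."
print(tokenize(sample))
print(bag_of_words(sample))
```

Lemmatization and stemming are left out of this sketch because doing them well needs language-specific rules or dictionaries.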

Bag of words — Natural Language Processing - tokenization

Machine learning

Machine learning is a branch of AI and a process of teaching a computer system to recognise patterns in text without explicit human programming. Machine learning can be either unsupervised (with minimal human intervention) or supervised (with more human intervention). Explore machine learning at Zdnet.com. Analysis using machine learning includes topic modelling and Naive Bayes classification, which are detailed below.

Common computational analysis tasks

Explore some common computational text analysis methods.

Text pattern analysis

Linguistic patterns	such as word frequencies are useful for historical exploration of language as well as topic identification
Collocation	identifies words commonly appearing near each other
Concordance	shows the context of (the words around) a given word or set of words
N-grams	finds common two-word, three-word and longer phrases; see Google Books Ngram Viewer
Dictionary tagging	locates a specific set of words in texts
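Several of these pattern analyses reduce to counting and slicing a list of tokens. The sketch below, in plain Python, shows word frequency, n-grams, a simple concordance (keyword in context) and dictionary tagging. The sample sentence and all function names are made up for illustration, and counting adjacent word pairs is only a rough proxy for collocation.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its words, dropping punctuation."""
    return re.findall(r"[a-z]+", text.lower())


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def concordance(tokens, keyword, window=2):
    """Show each occurrence of keyword with `window` words on either side."""
    lines = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            lines.append(" ".join(left + [f"[{token}]"] + right))
    return lines


def dictionary_tag(tokens, dictionary):
    """Return (position, word) for every token found in the dictionary."""
    return [(i, token) for i, token in enumerate(tokens) if token in dictionary]


text = "The river rose and the river fell, but the old river town stayed."
tokens = tokenize(text)

print(Counter(tokens).most_common(3))             # word frequency
print(Counter(ngrams(tokens, 2)).most_common(2))  # most common two-word phrases
print(concordance(tokens, "river"))               # keyword in context
print(dictionary_tag(tokens, {"rose", "town"}))   # dictionary tagging
```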

Analysing one or a number of texts of interest

Method	Description	Example
Topic modeling	Unsupervised machine learning to identify groups of terms that may be representative of a given topic, uncovering hidden themes.	Topic mapping for a literature review
Document classification	Uses machine learning, such as Naive Bayes classification, to classify documents based on information in the text.	Used in sentiment analysis and literature reviews.
Sentiment analysis	Used to determine whether text is positive, negative, or neutral. Used in research to see public sentiment, opinions, or emotions about products, ideas or policy, and can be undertaken via NLP or machine learning.	Tweet Sentiment Visualization App from NC State University.
Network analysis	Analysis of social or other structures comprising variables or actors (represented by nodes), and the relationships (edges) between the nodes	Network Analysis 101
Named entity recognition	Generates a list of people, places, dates, times etc.	Booking.com user experience analysis
Stylometry	Statistical method of studying a linguistic style.	Used in forensic, attribution and genre analysis.
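To make document classification more concrete, here is a minimal sketch of a multinomial Naive Bayes classifier with add-one smoothing, written in plain Python and applied to a sentiment task. The class name, the training sentences and the labels are all invented for illustration. It shows the idea only: real research would use far more training data and usually an established machine learning library.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and return its words, dropping punctuation."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.doc_counts = Counter(labels)        # number of documents per class
        self.word_counts = defaultdict(Counter)  # word counts per class
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.doc_counts.items():
            counts = self.word_counts[label]
            denominator = sum(counts.values()) + len(self.vocab)
            # Log prior of the class plus log likelihood of each known word.
            score = math.log(n_docs / total_docs)
            for word in tokenize(text):
                if word in self.vocab:  # words never seen in training are skipped
                    score += math.log((counts[word] + 1) / denominator)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Made-up training data, far too small for real use.
train_texts = [
    "I love this helpful library guide",
    "great clear and useful tutorial",
    "this tool is slow and confusing",
    "terrible interface and poor results",
]
train_labels = ["positive", "positive", "negative", "negative"]

model = NaiveBayes().fit(train_texts, train_labels)
print(model.predict("clear and useful guide"))
print(model.predict("slow and confusing interface"))
```

This is a supervised method: the classifier can only learn the categories that a person has already labelled in the training texts.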

Learn more about these methods from:

Australian Text Analytics Platform Methods Guide
An Introduction to Text Mining: Research Design, Data Collection, and Analysis ebook
Prof. Miriam Posner’s Topic Modelling online tutorials
Demystifying Networks an introduction for HASS scholars.
Introduction to Sentiment analysis fun and informative video.
Article on comparison of machine learning methods for text-based sentiment analysis
Article by Berger et al. on text analysis methods used in Marketing

After applying these computational processing and analysis models and methods, the data will be ready for the most important and interesting stage: your analysis and interpretation of the results.

Analysis tools

Griffith University subscription software

NVivo : performs cluster analysis, phrase nets, tag clouds, and sentiment analysis.
Leximancer : performs network analysis, topic modeling, sentiment analysis, and named entity recognition.

Login and installation are required for both. Training is available for Griffith researchers via Researcher Education & Development.

Platforms with prepared text and tools

The virtual research environments below have been developed to support digital text scholarship.

JSTOR text mining support : for metadata, n-grams, and word counts for most articles and book chapters, and for all research reports and pamphlets available via Griffith University’s subscription to JSTOR. Login required.
Gale Digital Scholar Lab : for document clustering, named entity recognition, Ngrams, Parts of Speech, Sentiment Analysis, Topic Modelling all available via Griffith University’s subscription. The lab is designed to use the Gale Primary Source archives, but you can use the analysis tools with your own data. Learn about it and Gale Primary Sources here. Includes online tutorials. Login required.
Hathi Trust Research Center Analytics : supports large-scale computational analysis of the digital library works to facilitate non-profit and educational research. Individual researchers can sign up for free with their Griffith email and use out of copyright materials and analysis tools.
Proquest TDM Studio : create an account with your university email address. Undertake geographic analysis, topic modelling or sentiment analysis of Proquest’s collection of newspapers, dissertations and theses.

Open source (free) tools

The tools listed below enable users to undertake text analysis without the need to learn to code. The majority of the tools are based on Python or R code. You can use these tools for simple or exploratory data analysis and some visualisations. Some tools are downloadable to your computer, others are web interfaces, each with their own benefits and limitations.

Advantages	Disadvantages
Easier to learn than coding, good for high level analysis	Can be inflexible, may not be good for deeper analysis, web based tools may not be appropriate for use with identifiable data.

Voyant Tools : web based online tool for frequency, distribution and collocation of terms, keywords in context, term clusters and more.
Sentiment Analyzer : web based tool for analysing one source at a time.
Topic Modelling Tool : a GUI for MALLET modelling code.
Cytoscape : software platform for visualizing complex networks.
Stanford Named Entity Recognizer (NER) : for person, organisation and location recognition.
WordHoard : web application for the close reading and scholarly analysis of deeply tagged texts, from Northwestern University.
WORDij Semantic Network Tools : downloadable software for natural language processing. It can process unstructured text from sources such as social media, news, speeches, focus groups, interviews, email, and web sites.
CLAWS part of speech tagger : for corpus annotation of English text, developed by UCREL at Lancaster University.

Voyant tools activity

Try this activity using Voyant Tools. Look at the results and think about how you might use the tool for analysis, and what its limitations are.

Coding for text mining and analysis

R & R Studio : network analysis, topic modeling, classification/clustering, named entity recognition, sentiment analysis
Python : network analysis, topic modeling, classification/clustering, named entity recognition, sentiment analysis

Coding tutorials for text mining and analysis

Beginner R & Python workshops are available from Griffith’s eResearch services throughout the year.
Constellate tutorials : a series of lessons to help you learn about programming in Python, text analysis, and the Constellate platform for JSTOR.
Programming Historian : novice-friendly, peer-reviewed tutorials to help humanists learn a wide range of digital tools, techniques, and workflows. These include lessons in R and Python.
GLAM Workbench tutorials : learn how to use the GLAM Workbench, Jupyter Notebook and Python to extract and analyse data from Australia’s galleries, libraries, archives and museums.
Top 5 Unknown Sentiment Analysis Projects On Github To Help You Through Your NLP Projects
Language Technology and Data Analysis Laboratory (LADAL) Tutorials provides online text analysis tutorials in R.

Note: Name the software used for your research in your methods notes. Attribute software developers by citation, e.g.

Sinclair, Stéfan and Geoffrey Rockwell, 2016. Voyant Tools. Web. http://voyant-tools.org/.
