LibGuides: Love Data Week: Text Analysis

Text Analysis

Computational text analysis, computer-aided text analysis, text mining, and text and data mining (TDM) are broad terms for searching, organizing, and analyzing large amounts of text data.

TDM can reveal new patterns or information in a large body of work, leading to new knowledge or a larger evidence-based practice. It enables researchers to analyze thousands of documents and terabytes of data, allowing a comprehensive look at research questions.

The methods used to process corpora vary widely between disciplines, and are based on insights from machine learning, statistics, computational linguistics, sociology, and many other fields.

Much of the content of this page comes from the University of Pennsylvania's Text Analysis Guide by Jajwalya Karajgikar.

Methods

Common methods of text analysis include:

Sentiment Analysis: Sentiment analysis uses natural language processing techniques to identify and extract subjective information, such as opinions and emotions, from text. It is commonly used to determine the overall sentiment of social media posts, customer reviews, and other text data.
Text Classification: Text classification involves categorizing text data into predefined classes or categories based on the content of the text and is frequently used for tasks such as spam filtering, topic identification, and sentiment classification.
Topic Modeling: Topic modeling is a statistical method used to identify topics or themes that occur in a collection of documents, allowing hidden patterns and relationships within text data to be discovered. It is widely applied in fields such as social sciences and humanities.
Named Entity Recognition: Named Entity Recognition (NER) is the process of identifying and extracting named entities from text, such as names of people, places, and organizations. It is commonly used for information extraction, retrieval, and data analysis.
Text Clustering: Text clustering groups similar documents together based on their content. It is frequently used to identify patterns and similarities in large text datasets, particularly in fields such as marketing and customer service.
Text Summarization: Text summarization involves creating a concise summary of a longer text document and can be used to quickly understand the main points and themes of a large document or set of documents.
Text Mining: Text mining involves extracting useful information from unstructured text data using techniques such as natural language processing, machine learning, and information retrieval to discover patterns, relationships, and trends in large text datasets.
Named Entity Disambiguation: Named entity disambiguation links each named entity mentioned in a text to the real-world entity it refers to. It distinguishes between different entities that share a similar name and recognizes different names for the same entity, thereby reducing ambiguity in text data.
Word Frequencies: Word frequency analysis involves counting the number of times each word appears in a text document or corpus to identify common words or phrases, which can provide insights into the content of the text data.
Visualization: Text visualization involves creating visual representations of text data, such as word clouds, topic models, and graphs, to identify patterns, trends, and relationships in the data and communicate insights to stakeholders in a clear and concise manner.
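
As an illustration of the simplest method in the list above, word frequency analysis, here is a minimal sketch in Python using only the standard library. The sample sentence, the tiny stop-word list, and the function names are assumptions made for this example; a real project would normally use a fuller stop-word list and a more careful tokenizer.

```python
import re
from collections import Counter

# Hypothetical stop-word list, kept tiny for illustration only.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it"}


def tokenize(text):
    """Lowercase the text and split it into runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def word_frequencies(text, stop_words=STOP_WORDS):
    """Count how often each non-stop word occurs in the text."""
    return Counter(token for token in tokenize(text) if token not in stop_words)


# Made-up sample text, not taken from any real corpus.
sample = "The whale is the whale, and the sea is the sea."
freqs = word_frequencies(sample)

# Print the words from most to least frequent.
for word, count in freqs.most_common():
    print(word, count)
```

Removing stop words before counting is a design choice: it keeps very common function words from dominating the results, but it can also hide patterns in questions where those words matter (for example, studies of authorial style).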

Some Online Text Analysis Tools

Annotation Studio
A suite of collaborative web-based annotation tools.
AntConc
A free tool for analyzing large amounts of text.
CATMA
A tool for text markup and analysis.
FromThePage
Software for transcribing handwritten documents online.
Google N-gram Viewer
An online search engine that charts the frequencies of any set of comma-delimited search terms using a yearly count of the terms' occurrences found in sources printed between 1500 and 2008 that are available in Google Books.
HathiTrust Bookworm
An N-Gram viewer for the HathiTrust library with faceted search capabilities.
Mallet
A toolkit for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications.
Overview
A tool that lets users search, analyze, and cull large volumes of texts or documents.
Serendip
A system for visually exploring topic models generated on large corpora of documents.
spaCy
A Python package for natural language processing.
Voyant
A web-based reading and analysis environment for digital texts.
WordSeer
A text analysis environment that combines visualization, information retrieval, sensemaking and natural language processing to make the contents of text navigable, accessible, and useful.
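
To make the idea behind the N-gram viewers listed above more concrete, the following sketch charts a term by year in the same general spirit: it counts the term's occurrences per publication year and divides by the total number of words for that year. The toy corpus, the function name, and the normalization by total tokens are all assumptions for illustration; this is not the actual implementation of Google N-gram Viewer or HathiTrust Bookworm.

```python
from collections import Counter

# Hypothetical toy corpus of (publication year, text) pairs.
corpus = [
    (1900, "the telegraph and the railway"),
    (1900, "the railway reached the coast"),
    (1950, "the telephone replaced the telegraph"),
]


def yearly_relative_frequency(corpus, term):
    """Return {year: occurrences of term / total tokens published that year}."""
    term_counts = Counter()
    totals = Counter()
    for year, text in corpus:
        tokens = text.lower().split()  # naive whitespace tokenization
        totals[year] += len(tokens)
        term_counts[year] += tokens.count(term.lower())
    return {year: term_counts[year] / totals[year] for year in sorted(totals)}


trend = yearly_relative_frequency(corpus, "railway")
for year, share in trend.items():
    print(year, share)
```

Dividing by the yearly total matters because the amount of published text differs from year to year; raw counts alone would mostly reflect how much text each year contains.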