This week’s readings introduced the idea of topic modeling as a digital humanities tool. The concept of Latent Dirichlet Allocation (LDA), the primary example of topic modeling in the readings, is credited to David Blei, Andrew Ng, and Michael I. Jordan.
I felt that no one text provided a good definition of topic modeling. In “Words Alone: Dismantling Topic Models in the Humanities,” Benjamin Schmidt refers to topic models as “clustering algorithms that create groupings based on the distributional properties of words across documents.”
In the same edition of Journal of Digital Humanities, Andrew Goldstone and Ted Underwood call topic modeling a “technique that automatically identifies groups of words that tend to occur together in a large collection of documents.”
In Maryland Institute for Technology in the Humanities’ overview of topic modeling, they provide attributes of topic modeling projects as opposed to a concrete definition (their 5 elements of topic modeling projects are corpus, technique, unit of analysis, post processing, and visualization).
According to Schmidt, LDA was originally designed for data retrieval, not for exploring literary or historical corpora. And he expresses concern with the uncontextualized use of topic modeling in the digital humanities field.
He acknowledges that topics are easier to study than individual words when trying to understand a massive text corpora. However, he also expresses that “simplifying topic models for humanists who will not (and should not) study the underlying algorithms creates an enormous potential for groundless–or even misleading–insights.”
His concerns primarily stem from two assumptions that are made when using a topic modeling approach: 1) topics are coherent, and 2) topics are stable. Schmidt then proposes contextualizing the topics in the word usage/frequency of the documents.
Although Schmidt stays positive and realistic (he supports topic modeling; he just wants digital humanists to understand its limitations), the underlying point that I got from the reading is that perhaps that digital humanists are meddling in things they shouldn’t be (at least, not yet).
Schmidt hints that the people who can use topic modeling the most successfully are those who understand the algorithms, at least on a basic level. And this makes sense. That’s the reality for any tool.
This brought me back to the debates about whether or not digital humanists need to know how to code (I feel like I keep coming back to this topic). If we can’t agree that digital humanists need to know how to code, how can we agree or disagree that digital humanists need to be able to understand the algorithms of topic modeling?
The concept of topic modeling is mildly confusing, but still attainable. The algorithms, however, are straight up intimidating. The Wikipedia page for LDA shows a ton of variables and equations that would take more time and effort to understand than I am capable of giving.
Maybe if we discussed this in class, we would come to same conclusion as we did with the need to code for digital humanists: they shouldn’t have to be experts, but they should know enough to talk about it with an expert. But who are the experts in topic modeling? Statisticians, perhaps?
I think that digital humanists who wish to conduct research across a large number of texts could benefit from studying statistics. I’m starting to realize just how many hats digital humanists must (or at least should) wear!