
alexries606

Portfolio for English 606: Topics in Humanities Computing


Math for Humanists? (Week 10)

This week’s readings introduced the idea of topic modeling as a digital humanities tool. The concept of Latent Dirichlet Allocation (LDA), the primary example of topic modeling in the readings, is credited to David Blei, Andrew Ng, and Michael I. Jordan.

I felt that no one text provided a good definition of topic modeling. In “Words Alone: Dismantling Topic Models in the Humanities,” Benjamin Schmidt refers to topic models as “clustering algorithms that create groupings based on the distributional properties of words across documents.”

In the same issue of the Journal of Digital Humanities, Andrew Goldstone and Ted Underwood call topic modeling a “technique that automatically identifies groups of words that tend to occur together in a large collection of documents.”

The Maryland Institute for Technology in the Humanities’ overview of topic modeling offers attributes of topic modeling projects rather than a concrete definition (its five elements of a topic modeling project are corpus, technique, unit of analysis, post-processing, and visualization).
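To make these definitions more concrete for myself, here is a minimal sketch of topic modeling in Python using the gensim library (my own illustration, not something from the readings; the toy documents are invented):

```python
from gensim import corpora, models

# A tiny invented corpus with two rough "themes" (seafaring and politics),
# already tokenized into lists of words.
docs = [["ship", "sea", "captain", "storm"],
        ["election", "vote", "senate", "campaign"],
        ["sea", "voyage", "ship", "sailor"],
        ["senate", "bill", "vote", "law"]]

dictionary = corpora.Dictionary(docs)           # word <-> id mapping
bow = [dictionary.doc2bow(d) for d in docs]     # bag-of-words counts

# Fit an LDA model that looks for 2 topics (groups of co-occurring words).
lda = models.LdaModel(bow, num_topics=2, id2word=dictionary,
                      passes=50, random_state=0)

for topic_id, words in lda.print_topics(num_words=4):
    print(topic_id, words)
```

On a corpus this tiny the output isn’t very meaningful, but the shape of the technique is visible: what comes back is groups of words that tend to occur together, not labeled “topics” in any interpretive sense.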

According to Schmidt, LDA was originally designed for information retrieval, not for exploring literary or historical corpora, and he expresses concern about the uncontextualized use of topic modeling in the digital humanities field.

He acknowledges that topics are easier to study than individual words when trying to understand a massive text corpus. However, he also warns that “simplifying topic models for humanists who will not (and should not) study the underlying algorithms creates an enormous potential for groundless–or even misleading–insights.”

His concerns stem primarily from two assumptions made when using a topic modeling approach: 1) that topics are coherent, and 2) that topics are stable. Schmidt then proposes contextualizing the topics in the word usage and frequency of the underlying documents.
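The stability assumption is the easiest one to poke at yourself. In this sketch (again with an invented toy corpus, and gensim standing in for whatever tool a real project would use), fitting the “same” model with two different random seeds can return differently ordered and differently weighted topics:

```python
from gensim import corpora, models

# Invented toy corpus, already tokenized.
docs = [["ship", "sea", "captain"], ["vote", "senate", "law"],
        ["sea", "voyage", "sailor"], ["senate", "bill", "campaign"]]
dictionary = corpora.Dictionary(docs)
bow = [dictionary.doc2bow(d) for d in docs]

# Two runs that differ only in the random seed; the topics that come
# back are not guaranteed to line up with each other.
for seed in (1, 2):
    lda = models.LdaModel(bow, num_topics=2, id2word=dictionary,
                          passes=50, random_state=seed)
    print("seed", seed)
    for topic_id, words in lda.print_topics(num_words=3):
        print("  topic", topic_id, ":", words)
```

Nothing about either run is wrong; they just aren’t guaranteed to agree, which is exactly why Schmidt wants topics checked against the word usage of the actual documents.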

Although Schmidt stays positive and realistic (he supports topic modeling; he just wants digital humanists to understand its limitations), the underlying point I took from the reading is that perhaps digital humanists are meddling in things they shouldn’t be (at least, not yet).

Schmidt hints that the people who can use topic modeling most successfully are those who understand the algorithms, at least at a basic level. And this makes sense. That’s the reality for any tool.

This brought me back to the debates about whether or not digital humanists need to know how to code (I feel like I keep coming back to this topic). If we can’t agree that digital humanists need to know how to code, how can we agree or disagree that digital humanists need to be able to understand the algorithms of topic modeling?

The concept of topic modeling is mildly confusing, but still graspable. The algorithms, however, are straight-up intimidating. The Wikipedia page for LDA shows a ton of variables and equations that would take more time and effort to understand than I can give.
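For the curious (and the brave), the core of those equations is the model’s joint distribution over a document’s topic mixture, topic assignments, and words, which I’m transcribing here from Blei, Ng, and Jordan’s paper:

```latex
% For one document: \theta is the topic mixture (drawn from a Dirichlet
% distribution with parameter \alpha), z_n is the topic assigned to the
% n-th word, and \beta holds each topic's word probabilities.
p(\theta, z, w \mid \alpha, \beta)
  = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)
```

Even in this compact form, it’s easy to see why Schmidt assumes most humanists will not (and should not) study the underlying machinery.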

Maybe if we discussed this in class, we would come to the same conclusion as we did about the need for digital humanists to code: they shouldn’t have to be experts, but they should know enough to talk about it with an expert. But who are the experts in topic modeling? Statisticians, perhaps?

I think that digital humanists who wish to conduct research across a large number of texts could benefit from studying statistics. I’m starting to realize just how many hats digital humanists must (or at least should) wear!

Describing Images with Images (Week 8)

In “How to Compare One Million Images” [UDH], Lev Manovich discusses the challenge the DH field faces in accounting for the enormous amount of data that exists and continues to grow. He introduces the Software Studies Initiative’s key method for the analysis and visualization of large sets of images, video, and interactive visual media (251).

There are two parts to this approach: 1) “automatic digital image analysis that generates numerical descriptions of various visual characteristics of the images,” and 2) “visualizations that show the complete image set organized by these characteristics” (251).
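To get a rough sense of what those two parts might look like in practice, here is a small Python sketch (my own, not the Software Studies Initiative’s actual tools) that computes two simple visual features per image and then plots the whole set by those features; the “images” folder is hypothetical:

```python
from pathlib import Path

import numpy as np
import matplotlib.pyplot as plt
from PIL import Image

# Part 1: automatic image analysis -> numerical descriptions.
features = []
for path in sorted(Path("images").glob("*.jpg")):   # hypothetical folder
    hsv = np.asarray(Image.open(path).convert("HSV"), dtype=float)
    brightness = hsv[..., 2].mean()   # mean of the value channel
    saturation = hsv[..., 1].mean()   # mean of the saturation channel
    features.append((brightness, saturation))

# Part 2: a visualization that shows the complete set organized by
# those characteristics (a plain scatter plot standing in for an image plot).
xs, ys = zip(*features)
plt.scatter(xs, ys, s=5)
plt.xlabel("mean brightness")
plt.ylabel("mean saturation")
plt.show()
```

A plain scatter plot is obviously a pale stand-in for Manovich’s image plots, where the images themselves become the points, but the two-step structure is the same.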

The approach he outlines addresses problems that DH researchers struggle with when they use traditional methods: scalability, registering subtle differences, and adequately describing visual characteristics. It also does a better job of accounting for entropy, the degree of uncertainty in the data.
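Entropy can be computed concretely; one common version (and an assumption on my part about what a project would actually measure) is the Shannon entropy of an image’s greyscale histogram:

```python
import numpy as np
from PIL import Image

def greyscale_entropy(path):
    """Shannon entropy (in bits) of an image's greyscale histogram.
    Near 0 for a flat, single-tone image; higher when pixel values
    are spread across many levels (more 'uncertainty', more detail)."""
    grey = np.asarray(Image.open(path).convert("L"))
    counts = np.bincount(grey.ravel(), minlength=256)
    p = counts / counts.sum()
    p = p[p > 0]                      # drop empty bins before taking logs
    return float(-(p * np.log2(p)).sum())

print(greyscale_entropy("manga_page.png"))   # hypothetical file name
```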

For me, this idea of entropy echoes Johanna Drucker’s concern in “Humanities Approaches to Graphical Display” [DITDH] with the binary representations required by traditional scientific approaches to graphical displays.

I think the connection lies in the separation Drucker describes between science’s realist approach and the humanities’ constructivist approach, and in the need for the DH field to forge its own path in statistical displays of capta.

Note: although I agree with Drucker’s characterization of data as capta (something that is taken and constructed rather than recorded and observed), I will use the term data throughout the rest of this post for simplicity.

I think Manovich’s approach to handling large sets of data makes sense and is a viable option for the DH field, as long as researchers can afford the necessary software and have the technical expertise to use it. As Manovich explains, a project like comparing a million manga pages (or even 10,000) would be exceptionally difficult without computer software that can measure differences between images.

For example, tagging can be problematic because even with a closed vocabulary, tags can vary. As mentioned earlier, the human eye cannot account for the subtle differences among a large number of images.

Most DH projects rely on sampling (comparing, say, 1,000 out of 100,000 images), but sampling can be very problematic: there is always the possibility that the sample will not accurately represent the entire data set. This is something that every field, in both the sciences and the humanities, has to deal with.
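A quick simulation makes the worry tangible. Here I invent a collection of 100,000 items in which exactly 20% share some trait, then draw a few samples of 1,000; the estimates wobble around the true value (the numbers and the trait are made up purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented collection: 100,000 items, exactly 20% of which share a trait.
population = np.zeros(100_000, dtype=bool)
population[:20_000] = True
print("true proportion:", population.mean())       # 0.2

# Five samples of 1,000 items each; every estimate drifts a little
# from the true proportion, and none is guaranteed to hit it.
for _ in range(5):
    sample = rng.choice(population, size=1_000, replace=False)
    print("sample estimate:", sample.mean())
```

Measuring every image directly, as Manovich proposes, sidesteps that risk entirely.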

Manovich’s scatter plots, line graphs, and image plots are beautiful and interesting, and I found them surprisingly easy to read and understand for being so nontraditional. Describing images with images just makes sense.

