Portfolio for English 606: Topics in Humanities Computing


March 2016

Math for Humanists? (Week 10)

This week’s readings introduced the idea of topic modeling as a digital humanities tool. The concept of Latent Dirichlet Allocation (LDA), the primary example of topic modeling in the readings, is credited to David Blei, Andrew Ng, and Michael I. Jordan.

I felt that no one text provided a good definition of topic modeling. In “Words Alone: Dismantling Topic Models in the Humanities,” Benjamin Schmidt refers to topic models as “clustering algorithms that create groupings based on the distributional properties of words across documents.”

In the same issue of the Journal of Digital Humanities, Andrew Goldstone and Ted Underwood call topic modeling a “technique that automatically identifies groups of words that tend to occur together in a large collection of documents.”
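Goldstone and Underwood’s phrase “words that tend to occur together” can be made concrete even without LDA. Below is a toy sketch (not LDA itself) that counts word pairs appearing in the same document; the documents and the threshold are invented for illustration:

```python
from collections import Counter
from itertools import combinations

# Toy corpus: each "document" is a list of words (invented examples).
docs = [
    ["ship", "sea", "captain", "voyage"],
    ["sea", "ship", "storm", "captain"],
    ["court", "king", "crown", "law"],
    ["king", "court", "law", "throne"],
]

# Count how often each pair of words appears in the same document.
pair_counts = Counter()
for doc in docs:
    for pair in combinations(sorted(set(doc)), 2):
        pair_counts[pair] += 1

# Pairs that co-occur in more than one document hint at a "topic".
frequent = {pair for pair, n in pair_counts.items() if n > 1}
print(sorted(frequent))
```

Even this crude count separates a “seafaring” cluster from a “royalty” cluster; LDA does something far more sophisticated, but the intuition is the same.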

In its overview of topic modeling, the Maryland Institute for Technology in the Humanities (MITH) provides attributes of topic modeling projects rather than a concrete definition (its five elements of topic modeling projects are corpus, technique, unit of analysis, post-processing, and visualization).

According to Schmidt, LDA was originally designed for information retrieval, not for exploring literary or historical corpora, and he expresses concern about the uncontextualized use of topic modeling in the digital humanities field.

He acknowledges that topics are easier to study than individual words when trying to understand a massive text corpus. However, he also warns that “simplifying topic models for humanists who will not (and should not) study the underlying algorithms creates an enormous potential for groundless–or even misleading–insights.”

His concerns primarily stem from two assumptions that are made when using a topic modeling approach: 1) topics are coherent, and 2) topics are stable. Schmidt then proposes contextualizing the topics in the word usage/frequency of the documents.

Although Schmidt stays positive and realistic (he supports topic modeling; he just wants digital humanists to understand its limitations), the underlying point that I got from the reading is that perhaps digital humanists are meddling in things they shouldn’t be (at least, not yet).

Schmidt hints that the people who can use topic modeling the most successfully are those who understand the algorithms, at least on a basic level. And this makes sense. That’s the reality for any tool.

This brought me back to the debates about whether or not digital humanists need to know how to code (I feel like I keep coming back to this topic). If we can’t agree that digital humanists need to know how to code, how can we agree or disagree that digital humanists need to be able to understand the algorithms of topic modeling?

The concept of topic modeling is mildly confusing, but still attainable. The algorithms, however, are straight up intimidating. The Wikipedia page for LDA shows a ton of variables and equations that would take more time and effort to understand than I am capable of giving.
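For what it’s worth, the generative story behind those equations is simpler than the notation suggests: each document gets its own blend of topics, and each word is drawn from one of them. A minimal sketch of that story (the topic word lists and all parameter values here are invented for illustration, and real LDA infers topics from text rather than generating it):

```python
import random

random.seed(0)

# Two invented "topics": each is a distribution over a tiny vocabulary.
topics = {
    "seafaring": ["ship", "sea", "captain", "voyage"],
    "royalty":   ["king", "court", "crown", "law"],
}

def sample_dirichlet(alpha, k):
    """Draw topic proportions from a symmetric Dirichlet(alpha)."""
    draws = [random.gammavariate(alpha, 1) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(n_words, alpha=0.5):
    """LDA's generative story: pick a topic mix, then a topic per word."""
    names = list(topics)
    mix = sample_dirichlet(alpha, len(names))
    words = []
    for _ in range(n_words):
        topic = random.choices(names, weights=mix)[0]
        words.append(random.choice(topics[topic]))
    return words

doc = generate_document(8)
print(doc)  # eight words drawn from a document-specific blend of topics
```

The intimidating equations mostly describe how to run this process in reverse: given only the words, recover the hidden topic mixtures.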

Maybe if we discussed this in class, we would come to the same conclusion as we did with the need for digital humanists to code: they shouldn’t have to be experts, but they should know enough to talk about it with an expert. But who are the experts in topic modeling? Statisticians, perhaps?

I think that digital humanists who wish to conduct research across a large number of texts could benefit from studying statistics. I’m starting to realize just how many hats digital humanists must (or at least should) wear!


The Last 5% (Week 9)

Fyfe’s “Electronic Errata” chapter in DITDH examines the decreasing value of correction and copy editors in scholarly publication. This has been the trend for quite some time, as seen with the elimination of the reading boy, but it has become more pronounced with the increase of online and digital publishing.

According to Fyfe, despite the importance and increasing relevancy of this topic, copy editing and fact checking are often omitted in research on digital publishing.

Fyfe’s claims about the decreasing value of correction did not surprise me. Although I do not have a great sample that I can compare current published scholarly materials against (I entered the University in 2007, and most of my academic reading consisted of textbooks and literature until I began my graduate program in 2014), I have seen many scholarly books and articles, both online and in print, with spelling and grammar errors.

Once, in a book about editing, I found a really bad mistake that seems to have resulted from copying and pasting, which is always a dangerous affair.

We seem to live in a society so concerned with producing that editing is often pushed to the side, and sometimes forgotten altogether, which is somehow both concerning and relieving to me.

It’s concerning, because copy editing is a viable career option for me. Also, I believe that errors, even tiny spelling errors, are distracting and unprofessional.

When it comes to writing articles for the IT knowledge base, I tend to focus on the small details, such as making sure all uses of “drop-down” have a hyphen or that all hyperlinks open the linked website in a new tab.

‘Enforcing’ these rules is challenging when there are multiple people writing and editing articles. With thousands of articles, it’s even challenging for me to remember how I wrote a certain word or formatted a table in an earlier article. To create consistency, I created the style guide, but it is underused.
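A consistency rule like the “drop-down” hyphen is also the kind of thing a small script can enforce across many articles, which takes some of the burden off memory and the underused style guide. A sketch (the rule, the article text, and the function name are my own invented example):

```python
import re

# Style rule: always "drop-down", never "dropdown" or "drop down".
BAD_FORMS = re.compile(r"\b(dropdown|drop down)\b", re.IGNORECASE)

def check_article(text):
    """Return every style violation found in an article's text."""
    return [m.group(0) for m in BAD_FORMS.finditer(text)]

article = "Open the dropdown menu, then use the drop-down to pick a site."
print(check_article(article))  # ['dropdown']
```

Run over a whole knowledge base, a checker like this won’t catch everything, but it makes the mechanical part of consistency cheap.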

It’s important to be detail-oriented, but focusing on the small things gives me less time to focus on the large things, which, in the end, probably matter more. Most readers looking for information will care more about accuracy and usefulness. More importantly, readers care that information is simply available.

Sometimes with time-sensitive articles, I focus less on perfect wordsmithing and grammar simply to get that information to readers as quickly as possible (I do always read through my work at least once, just not as carefully, and if I get hung up on a certain word or phrase, I leave it alone). For this reason, I am relieved that the last 5% is not as necessary as it used to be.

However, I think it also depends on the formality of the publication. I do think that scholarly publications should be held to a higher standard when it comes to copy editing and especially fact checking. Fyfe’s conversation about crowdsourcing copy editing is interesting, but unrealistic. It could work well in some academic circles if one or more of the scholars in the circle are what one might call grammar nazis, but I think the task of copy editing should fall mostly to the author.

As I say this, I realize that it would create a lot of extra work for authors to have to learn and apply style guides, which may change from journal to journal. But it is the author’s reputation on the line, and they should be responsible for their work.

However, I also think that one or two mistakes should be forgiven, so long as the content is rich.

Describing Images with Images (Week 8)

In “How to Compare One Million Images” [UDH], Lev Manovich discusses the challenge the DH field faces in accounting for the crazy amount of data that exists and continues to grow. He introduces the Software Studies Initiative’s key method for analysis and visualization of large sets of images, video, and interactive visual media (251).

There are two parts of this approach: 1) “automatic digital image analysis that generates numerical descriptions of various visual characteristics of the images,” and 2) “visualizations that show the complete image set organized by these characteristics” (251).

His outlined approach addresses problems that DH researchers struggle with when they use traditional approaches. These include scalability, registering subtle differences, and adequately describing visual characteristics. The approach also accounts more for entropy, the degree of uncertainty in the data.
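Manovich’s first step, numerical descriptions of visual characteristics, can be as simple as summarizing an image’s pixel values, and the Shannon entropy below is one standard way to measure how varied (uncertain) those values are. A sketch using invented “images” represented as plain lists of grayscale intensities (Manovich’s actual software works on real image files):

```python
import math
from collections import Counter

def brightness(pixels):
    """Mean pixel intensity: one simple numerical descriptor."""
    return sum(pixels) / len(pixels)

def entropy(pixels):
    """Shannon entropy of the intensity distribution, in bits."""
    counts = Counter(pixels)
    n = len(pixels)
    return sum(-(c / n) * math.log2(c / n) for c in counts.values())

flat = [128] * 16                 # a uniform gray "image": no uncertainty
varied = list(range(0, 256, 16))  # 16 distinct intensities: maximal spread

print(entropy(flat), entropy(varied))  # prints 0.0 4.0
```

Once every image in a collection is reduced to a handful of numbers like these, plotting a million of them becomes feasible.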

For me, this idea of entropy echoes with Johanna Drucker’s concern in “Humanities Approaches to Graphical Display” [DITDH] with the binary representations required for traditional scientific approaches to graphical displays.

I think the connection lies in the separation Drucker describes between science’s realist approach and the humanities’ constructivist approach, and in the need for the DH field to forge its own path in statistical displays of capta.

Note: although I agree with Drucker’s characterization of data as capta (something that is taken and constructed rather than recorded and observed), I will use the term data throughout the rest of this post for simplicity.

I think Manovich’s approach for handling large sets of data makes sense and is a viable option for the DH field, as long as researchers can afford the necessary computer programs and have the necessary technical expertise. As Manovich explains, a project like comparing a million manga pages (or even 10,000) would be exceptionally difficult without computer software that can measure differences between images.

For example, tagging can be problematic because even with a closed vocabulary, tags can vary. As mentioned earlier, the human eye cannot account for the subtle differences among a large number of images.

Most DH projects utilize sampling (comparing 1,000 out of 100,000 images), but sampling data can be very problematic. When sampling from a large data set, there is always the possibility that the sample will not accurately represent the entire data set. This is something that every field, both in the sciences and humanities, has to deal with.
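The sampling worry can be made concrete with a quick simulation: draw several small samples from a skewed collection and watch the estimated proportion wander. The population below is invented for illustration:

```python
import random

random.seed(1)

# Invented collection: 10% of 100,000 "images" carry some visual trait.
population = [1] * 10_000 + [0] * 90_000

# Estimate the trait's proportion from ten different 1,000-image samples.
proportions = [sum(random.sample(population, 1000)) / 1000
               for _ in range(10)]
print(min(proportions), max(proportions))
```

Each sample gives a slightly different answer, and none is guaranteed to match the true 10%; that spread is exactly the risk any sampled DH study carries.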

Manovich’s scatter plots, line graphs, and image plots are beautiful and interesting and I thought they were surprisingly simple to read and understand for being so nontraditional. Describing images with images just makes sense.
