I know absolutely nothing about R, so in order to get started, I decided to follow along with a tutorial video by Julia Silge I had watched on topic modeling in R using something called “tidytext” principles. You can check out Silge’s blog post and video here.
However, I wasn’t able to load my TEI copies of Capital into R (and actually crashed R to the point of having to reinstall it and edit the folders so I could run R from the command line). So I decided to follow the tutorial more closely by using the Gutenberg online copy of Jude the Obscure.
I found Silge’s video and the accompanying blog post incredibly useful. When I got stuck or a strange error appeared, I could easily Google the issue or rewind the video to make sure I was using the correct symbols for each step. In the process I learned a lot about the coding nuances of R. For example, when working with dplyr, each line in a sequence must end in %>% up until the last line, as %>% acts almost like a chain linking each subsequent function together.
The first chart I created was a tf-idf chart for each part of the book, and I will admit that I was not entirely sure what tf-idf meant. Julia Silge briefly explained the tf-idf concept, but I found this page useful because it provided a formula in plain English.
- TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document)
- IDF(t) = log_e((Total number of documents) / (Number of documents with term t in it))
- Value (i.e. length of the bar in the chart) = TF * IDF
In other words, tf-idf represents the importance of a term inside a document. tf-idf shows which words appear most frequently in each part that are relatively less frequent in other parts. For example, the word “pig” is very important in the first part of the book, but doesn’t come up often in the remainder of the book.
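To make the formula concrete, here is a minimal sketch of the calculation done by hand. It is written in Python rather than R so that it runs without any of the tutorial's packages, and the three toy "parts" are made up for illustration; they are not the text of the novel. It also shows one consequence of the formula: a term that appears in every document has an IDF of log_e(1) = 0, so its tf-idf is 0 however often it occurs.

```python
import math
from collections import Counter

# Hypothetical toy "parts"; these are NOT the actual text of the novel.
parts = {
    "Part First": "the pig and the boy".split(),
    "Part Second": "the city and the stone".split(),
    "Part Third": "the school and the letter".split(),
}

def tf_idf(term, part_name):
    """tf-idf of `term` in one part, following the plain-English formula.

    Assumes `term` occurs in at least one part (otherwise the IDF
    denominator would be zero).
    """
    tokens = parts[part_name]
    tf = Counter(tokens)[term] / len(tokens)
    docs_with_term = sum(1 for toks in parts.values() if term in toks)
    idf = math.log(len(parts) / docs_with_term)  # natural log
    return tf * idf

pig_score = tf_idf("pig", "Part First")  # "pig" occurs in only one part
the_score = tf_idf("the", "Part First")  # "the" occurs in every part

print(pig_score, the_score)
```

Here "pig" gets a positive score because it is concentrated in one part, while "the" scores exactly 0 because it appears in all three.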
As soon as I ran the code in R, I saw a bunch of results that came out as 0, which didn’t seem correct since the video didn’t show a bunch of terms with 0s. So I used the formula above to test a few of the results by hand and found that they were correct: any term that appears in all six parts has an IDF of log_e(6/6) = 0, so its tf-idf is 0 no matter how often it occurs. From here, I used ggplot2 to create this chart:
The next step was to create a topic model, which was fairly easy to do after making the tf-idf chart. As Silge explained each step, I felt I understood what I was doing and why the chart turned out the way that it did.
Finally, I created a gamma graph, which Silge explained shows that there is roughly one document (part of the story) per topic. I was unsure at the time why this should matter, but I managed to create the graph and it looked like the one in the video so I considered this a success.
Once I finished the tutorial and began writing this blog post, I realized running all of these lines of code by following a tutorial really isn’t that different from using a GUI, at least in terms of what I learned. I generated all of these charts and really didn’t need to look at what any of the packages and formulas I was using meant or how they worked (though I did look into tf-idf because I initially thought I had done something incorrect). This isn’t to say I didn’t learn anything from this exercise. Certainly, it was helpful for me to better understand the syntax and minutiae of coding with R, and the resulting charts I generated were technically clear and accurate. But this isn’t good enough from a distant reading standpoint. Going back to one of the two quotes I provided in my last post, “In distant reading and cultural analytics the fundamental issues of digital humanities are present: the basic decisions about what can be measured (parameterized), counted, sorted, and displayed are interpretative acts that shape the outcomes of the research projects. The research results should be read in relation to those decisions, not as statements of self-evident fact about the corpus under investigation.” I don’t really understand any of the decisions I made aside from the fact that I wanted 6 topics and that I divided the book into 6 parts.
Certainly this isn’t a terribly impactful experiment. If I screwed up the statistics and misrepresented the data, thereby misrepresenting the book as a whole, there won’t be any negative consequences. It just means readers of this blog might think the word “pig” or “gillingham” is more important to the story than it actually is. But let’s say I wasn’t working with a single nineteenth-century novel (and admitting my gaps in knowledge as I proceeded) and was instead working with data about race or gender or a corpus of nineteenth-century novels. Let’s say I wanted to derive meaning from this data and failed to do so accurately. Let’s say I didn’t understand the malleable nature of my data points, that race or gender or even vocabulary are not set in stone but are instead social constructs that can be interpreted in many ways. Let’s say I created some charts following Silge’s tutorial and (incorrectly) determined white male writers have a “better” or “larger” vocabulary than women writers of color, and that I used these charts to determine which novels I should read in the future to better understand the nineteenth-century canon. That would definitely be a problem. It would be worse if anyone who read this blog post was convinced by my incorrect results and acted in accordance with them.
So let’s go back to the beginning to figure out what each of these visualizations means, not just to do my due diligence to Thomas Hardy, but also to distant reading and text analysis as a whole. I’m already fairly clear on what the tf-idf chart means and how it was created, so I’ve decided not to delve into that any further. Let’s start with the topic modeling chart.
How does stm topic modeling work?
Over the week, I read a bunch of articles about how STM works, and I still don’t really understand what it does. I understand that it is different than LDA, and that the Topic Modeling GUI I used in the past uses LDA. The most helpful article, Structural Topic Models for Open‐Ended Survey Responses, has this to say about the differences:
“there are three critical differences in the STM as compared to the LDA model described above: (1) topics can be correlated; (2) each document has its own prior distribution over topics, defined by covariate X rather than sharing a global mean; and (3) word use within a topic can vary by covariate U. These additional covariates provide a way of “structuring” the prior distributions in the topic model, injecting valuable information into the inference procedure.”
Alright, so structural topic modeling means using metadata to impact the topics that are modeled. In the article the authors use the examples of gender and control versus variable groups in a sociology experiment as their metadata/covariates. But did I do that? I created a document-feature matrix (dfm) to input into the stm function. Is that structured? What is a dfm? Well, Google all you want, it’s hard to find an answer. At first, it didn’t seem to be the same thing as a document-term matrix. quanteda describes dfm this way: “quanteda’s functions for tokenizing texts and forming multiple tokenized documents into a document-feature matrix are both extremely fast and extremely simple to use.” Tidytextmining.com explains dfm this way: “The tidy method works on these document-feature matrices as well, turning them into a one-token-per-document-per-row table.”
When you print jude_dfm in R, you get:
Document-feature matrix of: 6 documents, 10,221 features (66.9% sparse)
Okay, so I intentionally created 6 documents (i.e. the 6 parts of the book). But what are the 10,221 features and what does 66.9% sparse mean?
When you look up “sparse matrix R”, the top 5 results give you “User friendly construction of a compressed, column-oriented, sparse matrix, inheriting from class CsparseMatrix (or TsparseMatrix if giveCsparse is false), from locations (and values) of its non-zero entries.” I don’t know what this means, and using the term being defined three times in its own definition isn’t exactly helpful. A post by Erwin Kalvelagen provided me with this answer: “A sparse matrix is a matrix where most of the elements are zero.” Excellent. That makes a lot of sense. So, 66.9% of the cells in my matrix are zero. Now I need to understand why the majority of my matrix is empty. What does that mean? Likely, it has something to do with the 10K features.
I returned to Silge’s tutorial to find out that in this example, features simply means words. Okay, so that means cast_dfm() creates a matrix of documents and words and the sparse element is determined by which words do not appear in each part. So going back to our tf-idf chart, the term “pig” would probably have a high count in part 1, but have a count of 0 in each other part, making the matrix sparse, with ⅚ of the parts not containing the word “pig.”
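The idea of a documents-by-words matrix and its sparsity can be sketched in a few lines. This is a Python illustration with made-up toy documents, not what cast_dfm() or quanteda do internally; it only assumes that sparsity is the share of cells in the matrix that are zero.

```python
from collections import Counter

# Hypothetical toy documents standing in for parts of a book.
docs = {
    "Part First": "pig pig boy".split(),
    "Part Second": "boy city".split(),
    "Part Third": "city stone stone".split(),
}

# The "features" are simply the distinct words across all documents.
features = sorted({tok for toks in docs.values() for tok in toks})

# One row per document, one column per feature, each cell a word count.
matrix = [[Counter(toks)[f] for f in features] for toks in docs.values()]

cells = len(matrix) * len(features)
zeros = sum(row.count(0) for row in matrix)
sparsity = zeros / cells  # fraction of cells that are zero

print(features)
for name, row in zip(docs, matrix):
    print(name, row)
print(f"{sparsity:.1%} sparse")
```

In this toy case half of the cells are zero, so the matrix would be described as 50% sparse. A word like "pig" that only occurs in one part contributes a zero to every other row, which is how a large vocabulary drives the sparsity up.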
At this point, though, I’m even more confused. Why would I use a dfm when dtm means document-term matrix?
R-Statistics explains features this way, “Finding the most important predictor variables (of features) that explains major part of variance of the response variable is key to identify and build high performing models.”
The Simple Features CRAN-R page says,
“A feature is thought of as a thing, or an object in the real world, such as a building or a tree. As is the case with objects, they often consist of other objects. This is the case with features too: a set of features can form a single feature. A forest stand can be a feature, a forest can be a feature, a city can be a feature. A satellite image pixel can be a feature, a complete image can be a feature too.
“Features have a geometry describing where on Earth the feature is located, and they have attributes, which describe other properties. The geometry of a tree can be the delineation of its crown, of its stem, or the point indicating its centre. Other properties may include its height, color, diameter at breast height at a particular date, and so on.”
This description isn’t super clear, and it comes from a package for spatial data rather than text, but it seems as if features are akin to a row of data in a table, while terms are a single cell.
Text Mining with R by Silge (the tutorial’s author) explains, “cast() turns a tidy one-term-per-row data frame into a matrix. tidytext provides three variations of this verb, each converting to a different type of matrix: cast_sparse() (converting to a sparse matrix from the Matrix package), cast_dtm() (converting to a DocumentTermMatrix object from tm), and cast_dfm() (converting to a dfm object from quanteda).”
Okay, so it seems Silge used dfm simply because it fit with quanteda. Had we installed tm (which is a text mining framework created for R), we probably would have used a dtm.
I am unsure what the advantage of using quanteda is over using tm. A paper by K. Welbers et al. says, “The performance and flexibility of quanteda’s dfm format lends us to recommend it over the tm equivalent.” I could test out these claims of performance and flexibility, but I consider this an experiment for another day.
So now I understand what the cast_dfm() function does–what it creates–but not exactly how it interacts with the stm package to create a topic model. At this point, I’m stuck. I cannot find an explanation of this relationship that I understand online, so I’m going to move on and see if I can glean any answers from the next steps.
The next step was to create the beta chart which, according to Brandon Stewart, shows a “list containing the log of the word probabilities for each topic.” Basically, the beta chart shows a bar for each of the top 10 terms in each topic. The “top” is determined by the beta value of the term. Benjamin Soltoff explains that beta values are determined by “the probability of that term being generated from that topic.” In the tutorial, Silge says the beta value shows which words contribute to which topic.
When you print the td_beta, you get:
print(td_beta)
# A tibble: 61,326 x 3
   topic term             beta
   <int> <chr>           <dbl>
 1     1 _agamemnon_  1.58e-21
 2     2 _agamemnon_  3.14e-21
 3     3 _agamemnon_  1.72e-21
 4     4 _agamemnon_  7.83e-22
 5     5 _agamemnon_  1.09e- 4
 6     6 _agamemnon_  2.16e-20
 7     1 _all         8.80e-22
 8     2 _all         1.74e-21
 9     3 _all         1.63e- 4
10     4 _all         6.20e-22
# ... with 61,316 more rows
Great, so now I understand that the topic model that I created uses a document-feature matrix to determine which words belong in the 6 generated topics (I don’t know how this works, but I know it happens), and then I created a chart showing the top 10 words in each topic.
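As a rough sketch of how a "top terms per topic" chart can be derived from a table like td_beta, the Python below groups rows by topic and keeps the terms with the highest beta. It uses only the ten rows shown in the printout above, so the "top" terms it finds are just the winners among those ten rows, not the real top terms of the model. The function name top_terms is hypothetical, and this is not the code from the tutorial. The last line checks one thing the printout implies: 6 topics times 10,221 features gives the 61,326 rows of the tibble.

```python
from collections import defaultdict

# The ten rows shown in the printout above: (topic, term, beta).
rows = [
    (1, "_agamemnon_", 1.58e-21),
    (2, "_agamemnon_", 3.14e-21),
    (3, "_agamemnon_", 1.72e-21),
    (4, "_agamemnon_", 7.83e-22),
    (5, "_agamemnon_", 1.09e-4),
    (6, "_agamemnon_", 2.16e-20),
    (1, "_all", 8.80e-22),
    (2, "_all", 1.74e-21),
    (3, "_all", 1.63e-4),
    (4, "_all", 6.20e-22),
]

def top_terms(rows, n):
    """Return the n highest-beta terms for each topic (hypothetical helper)."""
    by_topic = defaultdict(list)
    for topic, term, beta in rows:
        by_topic[topic].append((beta, term))
    return {
        topic: [term for beta, term in sorted(pairs, reverse=True)[:n]]
        for topic, pairs in sorted(by_topic.items())
    }

top = top_terms(rows, n=1)
print(top)

# One beta value per (topic, feature) pair.
expected_rows = 6 * 10221
print(expected_rows)
```

Among these ten rows, "_all" only beats "_agamemnon_" in topic 3, where its beta is many orders of magnitude larger than in the other topics.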
The next chart I created was a gamma chart, which shows which topics contribute to which documents. Based on my graph above, there is 1 part that belongs in each topic. Silge explains the chart this way, “How likely is it that this document belongs to this topic?”
When I print the td_gamma results, I get a table that looks like this:
print(td_gamma)
# A tibble: 36 x 3
   document    topic     gamma
   <chr>       <int>     <dbl>
 1 Part First      1 0.0000185
 2 Part Second     1 0.0000204
 3 Part Third      1 0.0000330
 4 Part Fourth     1 0.0000261
 5 Part Fifth      1 1.000
 6 Part Sixth      1 0.0000179
 7 Part First      2 0.0000185
 8 Part Second     2 0.0000214
 9 Part Third      2 0.0000352
10 Part Fourth     2 1.000
# ... with 26 more rows
This table, along with the graph above, shows that each part of the story belongs to a single topic (mostly). Silge explained that this result isn’t especially common, but with such a small number of documents and topics, it isn’t an unlikely outcome. However, I am wondering whether this is where the structural part of the stm comes in. Have the topics been generated based on the document divisions? Again, this is an experiment I will run another day after I’ve learned more about STM and topic modeling in general.
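One way to read the gamma values is to ask, for each document, which topic holds most of its probability. The Python sketch below does this with only the ten td_gamma rows printed above, so it can only confirm the two pairings visible there; the 0.5 threshold is my own assumption for "belongs to this topic", not something taken from the tutorial.

```python
# The ten rows shown in the td_gamma printout: (document, topic, gamma).
rows = [
    ("Part First", 1, 0.0000185),
    ("Part Second", 1, 0.0000204),
    ("Part Third", 1, 0.0000330),
    ("Part Fourth", 1, 0.0000261),
    ("Part Fifth", 1, 1.000),
    ("Part Sixth", 1, 0.0000179),
    ("Part First", 2, 0.0000185),
    ("Part Second", 2, 0.0000214),
    ("Part Third", 2, 0.0000352),
    ("Part Fourth", 2, 1.000),
]

# Assumed threshold: a document "belongs" to a topic if gamma > 0.5.
dominant = [(doc, topic) for doc, topic, gamma in rows if gamma > 0.5]
print(dominant)
```

From these rows alone, Part Fifth sits almost entirely in topic 1 and Part Fourth in topic 2, which matches the one-part-per-topic pattern in the graph.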
All in all, I would say that I haven’t exactly fulfilled the distant reading criteria for sound analysis, but I’ve certainly improved my understanding of the models I created. One reassuring thing I discovered while I was updating my reading list was that a lot of the sources I used to understand this tutorial were very recently created. This means that there is potential for someone to further unpack the underlying assumptions made during this tutorial and about structural topic modeling in general. My next steps are to learn more about structural topic modeling (and hope that more information comes out about it in the coming months) and test out some of the experiments I’ve detailed in this blog post.