For the last couple of weeks, I have been working on IRB documents for my research project involving student interviews. The good thing about the IRB process is that it really made me plan out my project so that I now know every detail of how I will do each step.
Part of my project will almost definitely involve topic modeling and sentiment analysis. However, when I wrote my first draft of the IRB Approval Form Responses, I realized I didn’t actually know very much about topic modeling or sentiment analysis, and what little knowledge I did have wasn’t going to cut it for this review process. So I sat down to try to read about the process of topic modeling and how it can be used.
I don’t know if you’ve read my About page or any of my other blog posts, but let me reiterate that I do not have a computer science or a statistics background. I consider myself fairly capable when it comes to math. I’ve always had a good sense for math and logic, in the same way that, although I’m not a cartographer, I’m usually pretty confident in my ability to point out the cardinal directions. However, the articles and most of the blog posts I was reading were incomprehensible to me.
No matter how many articles I read or reread, I just wasn’t getting how the whole thing worked. Where do you do topic modeling or sentiment analysis on a computer? Is it code? Is it a program? Do I need to download something? How should I format the files I want to analyze? What kind of code is used in topic modeling?
Learning how to do this stuff on your own is like trying to bake a cake for the first time in your life when your only directions are in Russian and you’ve never eaten cake before. I have a general idea of what I hope to end up with, but I don’t even know where to flip in this cookbook to find a cake recipe.
I found myself reading the same sentence over and over again, so I did what I always do when I don’t understand how to do something and reading isn’t helping: I got on YouTube and watched some videos. (Everything I’ve watched and read, or at least everything I comprehended, is on my reading list page.) The videos weren’t all useful, but just watching someone type out the code and show a result helped. So I watched a bunch of these videos, and after finding one that was particularly clear and useful (it was actually on sentiment analysis, but that’s beside the point), I downloaded all of the things the YouTuber had listed in the description section of his video. I quickly realized that many of the programs have been updated since the video was made and don’t look the same or don’t have the same features anymore, so I couldn’t follow along with his tutorial as I had planned. I was kind of back to square one, or maybe square two, since I at least had a better idea of what kind of information I could get out of doing data analysis like this.
Next step? Complain about how hard this is! Feeling like I’d hit only dead ends, I explained my predicament to Ben, who sent me an article I’d previously given up on. The thing about a lot of these tutorial articles is that they start by telling you to go back a step if you don’t already know how to use the command line or Bash or R or whatever tool they’re going to use throughout the tutorial. And that makes a lot of sense to me. I wouldn’t suggest to you that the best way to learn how to bake is by using a recipe in a Russian cookbook, and if I did, it would be cruel of me not to tell you to brush up on your Russian first. But if you keep going backward, further and further away from the thing you want to do, you end up watching videos about how computers work instead of writing an IRB document. I’m certainly not arguing that learning how computers work is a bad thing or that I shouldn’t spend my time learning the basics of computer science, but I’m also on a schedule. There are only so many skills I can learn in a week or a month or a year. But I had a lead on one. The article Ben suggested was much clearer to me once I had watched all of the YouTube videos, and out of all the tutorial blogs, if Ben said this one was a good way to start, I could trust it would get me somewhere. That article linked to another that I could understand, and another (see reading list). Eventually, I found the GUI Topic Modeling Tool, which is a GUI for MALLET, and then I really got on a roll.
Let me say briefly, I understand the trepidation around GUIs. If you don’t know how something works (e.g., what the program does and how it processes data), you might take all of the results at face value, as if they were concrete and official rather than the social construct that they actually are. It’s like how most of us know vaccines prevent diseases and get vaccinated, but few of us know how to make a vaccine. Generally, this works out. We don’t all need to know how to make a vaccine, because very few of us are mixing up our own vaccines at home and claiming they cure anything. Digital scholars who don’t know how the statistical model behind their data works aren’t going to accidentally give people mercury poisoning with a homemade polio vaccine, but they could confuse their audience with claims that might not be substantiated by the data. So if you don’t love the GUI Topic Modeling Tool, I get it.
The thing about GUIs, to me, is that they can be like Google Translate for your Russian cookbook. Sometimes (maybe often) you’re going to end up with some total nonsense, but if it’s all you’ve got to get started, that’s what you’re going to use. The MALLET GUI is really, really useful. It allowed me to work backward, so that I could tweak variables in a format I understand and compare the results. I know what the number of topics should do, but what does the number of iterations mean? How does it change the results from a data set I know super well? How does changing the number of topic words printed impact my understanding of the results? I don’t need to know how to tell the computer to change that variable yet. I need to understand what the variable is in the first place.
I ended up with some data that I could play around with and visualize in Power BI. Miriam Posner’s blog post about interpreting this data was super helpful at this step. Now I feel more prepared to read all of those blogs about the statistical model behind topic modeling. I have a better grasp of the variables and results. I’m also going to try running MALLET from the command line. To return to that damn cake metaphor for the 100th time in this post, I know what all of the ingredients are. I’ve tasted some cake. Next, I have to learn how to set the oven and turn on the mixer. (Do you hate me yet?)
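For anyone curious what that command-line step might look like, here is a minimal Python sketch. It is not taken from any of the tutorials mentioned above. It assumes MALLET is installed and available on your PATH as `mallet`, and that the texts have already been imported into a hypothetical file called `interviews.mallet`. The helper name `build_train_topics_command` and the output file names are made up for illustration, and the numbers are placeholders rather than recommendations. The sketch only builds and prints the command; it shows how the three GUI settings discussed above (number of topics, number of iterations, number of topic words printed) appear to map onto options of MALLET’s `train-topics` command.

```python
import shlex


def build_train_topics_command(input_file, num_topics=10, num_iterations=1000,
                               num_top_words=20, mallet_bin="mallet"):
    """Build (but do not run) a MALLET train-topics command as a list of arguments.

    input_file and the two output file names are hypothetical placeholders.
    """
    return [
        mallet_bin, "train-topics",
        "--input", input_file,                      # corpus already imported into MALLET's format
        "--num-topics", str(num_topics),            # how many topics to look for
        "--num-iterations", str(num_iterations),    # how many sampling passes to make
        "--num-top-words", str(num_top_words),      # how many words to print per topic
        "--output-topic-keys", "topic_keys.txt",    # hypothetical output file: top words per topic
        "--output-doc-topics", "doc_topics.txt",    # hypothetical output file: topic mix per document
    ]


if __name__ == "__main__":
    cmd = build_train_topics_command(
        "interviews.mallet", num_topics=15, num_iterations=500, num_top_words=10
    )
    # Print the command in a form that could be pasted into a terminal.
    print(shlex.join(cmd))
```

Changing one argument at a time and comparing the output files is the command-line equivalent of tweaking one box in the GUI and rerunning it.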
Best of all, using the MALLET GUI helped me grasp enough of the concepts behind topic modeling and the data it generates that I could complete my IRB documents. Now I just need to submit them!