Latent Topics

This page lets you explore topics "discovered" by Latent Dirichlet Allocation (LDA) in texts filtered by some of our draft models. We first used these models to classify texts in the Reddit corpus. We then ran LDA topic modeling on each category of texts, asking it to find n topic groups, where n was chosen by trying different values and keeping the one with the lowest perplexity. We then repeated this process on the texts in each topic, discovering subtopics, until we bottomed out.
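To make the "lowest perplexity" step concrete, here is a minimal sketch of the selection logic, not the code we actually ran. It assumes perplexity is computed as the exponential of the negative mean per-token log-likelihood. The `score` callable is hypothetical: it stands in for whatever fits LDA with n groups and returns the resulting perplexity.

```python
import math
from typing import Callable, Iterable


def perplexity(token_log_probs: Iterable[float]) -> float:
    """Perplexity = exp(-mean log-likelihood per token); lower is better."""
    log_probs = list(token_log_probs)
    if not log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(log_probs) / len(log_probs))


def choose_num_topics(candidates: Iterable[int],
                      score: Callable[[int], float]) -> int:
    """Return the candidate n with the lowest perplexity.

    `score` is a hypothetical stand-in: it should fit LDA with n groups
    and return that model's perplexity.
    """
    return min(candidates, key=score)


# Made-up perplexities for illustration only; these are not real results.
example_scores = {2: 310.0, 3: 280.5, 4: 295.2}
best_n = choose_num_topics(example_scores, example_scores.__getitem__)
```

With the made-up scores above, `best_n` is 3. The same selection would then be repeated on the texts assigned to each group to look for subtopics.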

Keep in mind that the process of generating topics is not deterministic, so different runs over the same data will produce different topic groupings. Consequently, this is a tool for exploring the topic space, not for creating a definitive list of topics. The hope is that it helps you get a feel for what the corpus is saying without having to read all 75,000 entries. If you see a lot of topics in a category, it probably means folks are talking about a lot of different things OR they talk about the same things in a lot of different ways. Hopefully, you'll find topics that fit your existing mental models for a category or that make sense upon reflection.

The topics that follow rely on filtering texts with models that are still learning. The better the underlying model, the more likely the topics are to be "sensible," so topics for models with lower precision and recall will likely be lower quality. You can learn more about a given model version's performance under Draft Models, with the exception of 2019-04-09, for which no data is displayed.


The closer together two groups are, the more "similar" they are. You can get a feel for what words are important to a group by clicking on the group and playing with the relevance slider. A setting of 1 ranks terms by how likely they are to be present in a group, and a setting of 0 ranks them by their "lift" (how much more likely a term is in the group than in the corpus as a whole). A setting of 0.6 is probably a good place to start, as there's some research suggesting that it is close to the sweet spot for recognizing groups. It can also be useful to mouse over individual words to the left of the bar graph. This will reveal how prevalent they are in individual groups.
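For the curious, the slider's behavior can be sketched in a few lines. This assumes the visualization uses the relevance measure from Sievert and Shirley's LDAvis work, which blends the log probability of a term within a group with the log of its lift. The function names and the toy probabilities below are ours, purely for illustration.

```python
import math
from typing import Dict, List


def relevance(p_term_given_topic: float, p_term: float, lam: float) -> float:
    """lam * log p(w|t) + (1 - lam) * log(p(w|t) / p(w))."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must be between 0 and 1")
    lift = p_term_given_topic / p_term
    return lam * math.log(p_term_given_topic) + (1.0 - lam) * math.log(lift)


def rank_terms(topic_probs: Dict[str, float],
               corpus_probs: Dict[str, float],
               lam: float = 0.6) -> List[str]:
    """Sort a group's terms from most to least relevant at a slider value."""
    return sorted(
        topic_probs,
        key=lambda term: relevance(topic_probs[term], corpus_probs[term], lam),
        reverse=True,
    )


# Toy numbers: "the" is common everywhere, "gpu" is rare outside this group.
topic_probs = {"the": 0.05, "gpu": 0.02}
corpus_probs = {"the": 0.06, "gpu": 0.001}

by_probability = rank_terms(topic_probs, corpus_probs, lam=1.0)
by_lift = rank_terms(topic_probs, corpus_probs, lam=0.0)
```

At 1, the common word "the" ranks first because it is simply more frequent in the group. At 0, "gpu" ranks first because it is far more concentrated in the group than in the corpus overall. Values in between trade off the two.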
