Topic modeling is a technique for extracting hidden topics from large volumes of text. Because it is unsupervised, it can discover latent semantic structure in text without labeled data. We used the Latent Dirichlet Allocation (LDA) implementation in gensim to determine the dominant topics in misinformative YouTube videos. LDA models each document as a mixture of topics and each topic as a distribution over words, with Dirichlet priors on both. Given a fixed number of topics, the algorithm iteratively adjusts the topic distribution within each document and the keyword distribution within each topic until it reaches a good composition of topics and keywords.
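
For reference, the two Dirichlet-based components described above can be written as the standard LDA generative model. The notation here is the conventional textbook one and is an assumption on our part, not something taken from our implementation: K is the number of topics, alpha and beta are the Dirichlet prior parameters, theta_d is the topic mixture of document d, and varphi_k is the word distribution of topic k.

```latex
\begin{align*}
\theta_d &\sim \mathrm{Dirichlet}(\alpha)
  && \text{topic mixture of document } d \\
\varphi_k &\sim \mathrm{Dirichlet}(\beta)
  && \text{word distribution of topic } k,\quad k = 1, \dots, K \\
z_{d,n} \mid \theta_d &\sim \mathrm{Categorical}(\theta_d)
  && \text{topic assigned to the } n\text{-th word of document } d \\
w_{d,n} \mid z_{d,n}, \varphi &\sim \mathrm{Categorical}(\varphi_{z_{d,n}})
  && \text{the observed word}
\end{align*}
```

Fitting the model means inferring theta and varphi from the observed words w alone, which is why the number of topics K has to be supplied up front.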

The LDA model was built from the cleaned captions of the YouTube videos collected from Twitter. Choosing a good number of topics matters because it determines whether the resulting topics are meaningful and interpretable. A higher number can yield more granular sub-topics, but if it is too high, the same keywords repeat across multiple topics. To choose the number of topics, we evaluated each candidate model with a coherence score. Coherence measures the degree of semantic similarity between the high-scoring words in a topic, which helps distinguish topics that are interpretable from those that are not. We measured the coherence of LDA models built with 2, 8, 14, 20, 26, 32, and 38 topics and plotted the scores below. As the chart shows, the model with 8 topics has the highest coherence, with a score of 0.369, so we selected it as our optimal model.

The dashboard below visualizes the topics produced by the optimal LDA model and their associated keywords. Each bubble in the left-hand plot represents a topic; the larger the bubble, the more prevalent the topic. Highlighting a bubble updates the words and bars on the right-hand side, which show the salient keywords that form the selected topic.

After identifying the dominant topics across the entire corpus, we also found the dominant topic in each sentence of the caption texts and the most representative caption texts for each topic. Lastly, we looked at the topic distribution among video captions and found that Topics 2 and 4 were the most dominant, appearing in 53% and 42% of video caption texts, respectively. Topic 2 focused on the origins of COVID-19, with keywords like coronavirus, China, Wuhan, and lab, while Topic 4 concerned COVID-19 vaccines, with keywords such as vaccine, booster, and dose. Topic modeling thus gives deeper insight into the topics covered in YouTube videos linked on Twitter. The dominant topics we discovered suggest that contentious subjects that are often sources of misinformation are frequently spread on social media platforms.
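
A tally of dominant topics like the one above could be computed along these lines. This is a sketch under assumptions: it treats a caption's dominant topic as its single highest-probability topic, the helper names are hypothetical, and the example numbers are made up for illustration rather than taken from our data. The input shape, a list of (topic_id, probability) pairs per document, matches what gensim's `LdaModel.get_document_topics` returns for one document.

```python
from collections import Counter


def dominant_topic(doc_topics):
    """Return the topic id with the highest probability for one document.

    `doc_topics` is a list of (topic_id, probability) pairs.
    """
    return max(doc_topics, key=lambda pair: pair[1])[0]


def dominant_topic_shares(all_doc_topics):
    """Map each topic id to the fraction of documents it dominates."""
    counts = Counter(dominant_topic(doc) for doc in all_doc_topics)
    total = len(all_doc_topics)
    return {topic: n / total for topic, n in counts.items()}


# Illustrative, made-up topic distributions for four captions.
example = [
    [(2, 0.7), (4, 0.3)],
    [(2, 0.2), (4, 0.8)],
    [(2, 0.6), (4, 0.4)],
    [(2, 0.9), (4, 0.1)],
]
shares = dominant_topic_shares(example)
```

In this toy input, topic 2 dominates three of the four captions and topic 4 dominates one.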
