Data

Creating Datasets

This work analyzes the captions of YouTube videos linked in tweets related to public health. The data was compiled by first gathering individual tweets. Because Twitter does not provide direct access to all tweets, we compiled daily tweet ids from Panacea Lab’s Covid-19 Twitter Chatter dataset, spanning March 22, 2020 to September 30, 2021. Using the Twitter API and the Twarc2 Python library, we sampled tweet ids and hydrated them into complete tweet objects, which include the text, author id, URLs, date, and other useful information. We then filtered these tweets to those whose text contained key terms relating to public health and that contained at least one link to a YouTube video. The tweets that passed both filters were compiled into a dataset of hydrated tweet objects, and the YouTube video ids were stored separately.
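The filtering step can be sketched as follows. This is a minimal illustration, not the exact pipeline: the key terms, the function names, and the simplified tweet shape (a dict with a `text` string and a list of already-expanded `urls`) are all assumptions made for the example.

```python
from urllib.parse import urlparse, parse_qs

# Hypothetical key terms for illustration; the actual term list is not given in the text.
KEY_TERMS = ("vaccine", "public health", "quarantine")


def mentions_key_term(text, terms=KEY_TERMS):
    """Return True if any key term appears in the tweet text (case-insensitive)."""
    lowered = text.lower()
    return any(term in lowered for term in terms)


def youtube_video_id(url):
    """Return the video id from a YouTube URL, or None if the URL is not one."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    # Normalize common subdomains such as www.youtube.com and m.youtube.com.
    for prefix in ("www.", "m."):
        if host.startswith(prefix):
            host = host[len(prefix):]
    if host == "youtu.be":
        # Short links carry the id as the first path segment.
        return parsed.path.strip("/").split("/")[0] or None
    if host == "youtube.com":
        if parsed.path == "/watch":
            return parse_qs(parsed.query).get("v", [None])[0]
        parts = parsed.path.strip("/").split("/")
        if len(parts) >= 2 and parts[0] in ("embed", "shorts"):
            return parts[1]
    return None


def filter_tweets(tweets):
    """Keep tweets that mention a key term and link to at least one YouTube video.

    Returns the kept tweets and, separately, the list of video ids found in them.
    """
    kept, video_ids = [], []
    for tweet in tweets:
        if not mentions_key_term(tweet["text"]):
            continue
        ids = [vid for vid in map(youtube_video_id, tweet["urls"]) if vid]
        if ids:
            kept.append(tweet)
            video_ids.extend(ids)
    return kept, video_ids
```

Calling `filter_tweets` on a list of such dicts returns the matching tweets and a flat list of video ids, mirroring the separate storage of tweet objects and video ids described above. Links in real tweets are usually shortened, so the sketch assumes they have been expanded beforehand.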

The resulting set of video ids was then used to collect YouTube data through the YouTube Data API and the YouTube web client. Fetching the video captions (subtitles) required special consideration. First, not every video has captions provided by its creator; in that case, YouTube often auto-generates captions, which are not completely accurate. Second, when the captions are in another language, the YouTube API can translate them into a specified language. In both situations, this dataset uses the available English captions, whether they are auto-generated or translated. The YouTube API was also used to retrieve other information about each video from its video id. Our dataset contains the video title, date posted, like count, comment count, view count, description, video tags, and category, which can be useful in determining whether a video contributes to the spread of misinformation.
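The caption fallback can be illustrated with a small selection function. This is a sketch under stated assumptions: the track representation (dicts with `language` and `is_auto` keys), the function names, and the priority order (creator-provided English, then auto-generated English, then translation of another track) are hypothetical and are not taken from the YouTube API or from the original pipeline.

```python
def is_english(language):
    """Return True for 'en' and regional variants such as 'en-US'."""
    code = language.lower()
    return code == "en" or code.startswith("en-")


def pick_caption_track(tracks):
    """Choose which caption track to use for a video.

    `tracks` is a list of dicts with a 'language' code and an 'is_auto' flag.
    Returns (track, source), where source is 'creator', 'auto-generated',
    or 'translated', or (None, None) when the video has no captions at all.
    """
    # Assumed priority 1: English captions written by the video creator.
    for track in tracks:
        if is_english(track["language"]) and not track["is_auto"]:
            return track, "creator"
    # Assumed priority 2: English captions auto-generated by the platform.
    for track in tracks:
        if is_english(track["language"]) and track["is_auto"]:
            return track, "auto-generated"
    # Assumed priority 3: a non-English track to be translated into English,
    # preferring a creator-provided track as the translation source.
    if tracks:
        source = next((t for t in tracks if not t["is_auto"]), tracks[0])
        return source, "translated"
    return None, None
```

Recording the returned source label alongside each transcript would make it possible to check later whether results differ between creator-written, auto-generated, and translated captions.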

Missingness

One key consideration for the dataset was missingness. Between the date a tweet was posted and the date we accessed it, the tweet may have been removed from the platform, and the same holds for YouTube videos. Users can remove their own content, and the platforms can also remove any content they believe violates their policies. With misinformation on the rise, social media platforms have developed policies to combat it and will sometimes remove content that contains false or misleading information. As a result, missing tweets and videos are not trivial and cannot be ignored in the conclusions we draw.

After fetching the tweet objects from the Twitter API, we conducted exploratory data analysis to gain an overall understanding of the available tweets. First, we counted the tweet ids that were successfully hydrated into tweet objects and those that could no longer be fetched because the tweets had been removed from the platform. To do so, we randomly sampled 10,000 tweet ids from the full tweet id dataset and built a 95% confidence interval for the proportion of missing tweets, which resulted in an interval of (0.19, 0.37). We are not able to verify the reason a given tweet is unavailable, but there is a higher chance that it was removed for violating Twitter’s policies, including those against promoting misinformation. Similarly, any YouTube videos linked in the missing tweets have a stronger likelihood of being related to misinformation.
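One common way to build such an interval is the normal-approximation (Wald) interval for a proportion, sketched below. The text does not say how the interval was constructed, so this formula is an assumption and need not reproduce the reported interval. The function names and the `is_missing` callback are hypothetical.

```python
import math
import random


def proportion_ci(missing, n, z=1.96):
    """Normal-approximation confidence interval for a proportion.

    z=1.96 corresponds to a 95% confidence level.
    """
    p = missing / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    # Clamp to [0, 1] because a proportion cannot fall outside that range.
    return max(0.0, p - half_width), min(1.0, p + half_width)


def estimate_missing(all_ids, is_missing, sample_size=10_000, seed=0):
    """Sample tweet ids without replacement and estimate the missing proportion.

    `all_ids` is a list of tweet ids; `is_missing` is a hypothetical callback
    that returns True when an id could not be hydrated into a tweet object.
    """
    rng = random.Random(seed)
    sample = rng.sample(all_ids, min(sample_size, len(all_ids)))
    missing = sum(1 for tweet_id in sample if is_missing(tweet_id))
    return proportion_ci(missing, len(sample))
```

In practice, `is_missing` would be backed by the result of the hydration step, for example by checking whether an id appears among the tweet objects returned by the API.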