We performed sentiment analysis on our YouTube videos to gauge whether there was a difference in engagement between positive- and negative-sentiment videos. We then wanted to see whether there was a relationship between a video's sentiment and whether it was spreading misinformation.
We trained a Logistic Regression model to predict the sentiment of videos. To label our videos for training, we researched what ratio of likes to views marks a video as successful. On average, a like count of 4-10% of total views or above is considered the baseline for a good video. We took the upper limit of 10% and labeled every video with a like-to-view ratio above 10% as positive; any video that fell below the threshold was labeled negative. We then fed this labeled data into the model with an 80/20 train/test split.
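Concretely, the labeling rule reduces to a single thresholded ratio. This sketch mirrors the pipeline code at the end of this section:
# like-to-view ratio above 10% => positive (+1), otherwise negative (-1)
df['perc_liked'] = df['like_count'] / df['view_count']
df['sentiment'] = df['perc_liked'].apply(lambda r: +1 if r > 0.10 else -1)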
We created two models following this labeling logic. The first model predicted sentiment from video titles; the second predicted sentiment from video descriptions. Once the models were trained, we plotted the positive- and negative-sentiment videos to compare how misinformation was distributed between them. The description-based model, with an accuracy of 0.83, performed better than the title-based model, which had 0.75 accuracy. Looking further into the descriptions, we isolated the terms that appeared most frequently in the positive- and negative-sentiment videos. Videos from general news channels seemed to draw a more positive reaction from viewers than videos about specific platforms such as Instagram, Twitter, Facebook, tv9kannada, or CTV News. The terms "China" and "omicron" appeared more often in negatively received videos, while "African Diaspora News" and non-English words appeared heavily in positively received ones.
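In the pipeline below we isolate these frequent terms with word clouds. As a rough alternative sketch, the same frequencies can be tabulated directly (top_terms is an illustrative helper of ours, not part of the pipeline):
from collections import Counter

# count the most common non-stopword tokens across a set of texts
def top_terms(texts, stop, n=10):
    counts = Counter()
    for text in texts:
        counts.update(w for w in text.lower().split() if w not in stop)
    return counts.most_common(n)

# e.g. compare top_terms(positive.description.dropna(), stopwords)
#      against top_terms(negative.description.dropna(), stopwords)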
We found that positive-sentiment videos had a slightly higher rate of misinformation (25%) than negative-sentiment videos (~17%). These results raise an interesting question about how people react to videos. Are they reacting more positively when given specific types of information, or are people being exposed to misinformation content that is more likely to yield a positive reaction?
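These rates can be read off a one-line contingency table; this sketch is a check on the pie charts at the end of the section, not part of the original pipeline:
# fraction of misinformation within each sentiment class (rows sum to 1)
print(pd.crosstab(df['sentiment'], df['misinformation'], normalize='index'))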
To summarize, this is our pipeline for sentiment analysis:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
color = sns.color_palette()
%matplotlib inline
import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.tools as tls
import plotly.express as px
import nltk
nltk.download('stopwords', quiet=True)  # make sure the stopword corpus is available
from nltk.corpus import stopwords
# load the YouTube metadata and cast the engagement columns to numeric
df = pd.read_csv('../../data/youtube_metadata.csv')
df = df.drop(0)
df["like_count"] = pd.to_numeric(df["like_count"])
df["comment_count"] = pd.to_numeric(df["comment_count"])
df["view_count"] = pd.to_numeric(df["view_count"])
# misinformation label for each of the 28 videos
df['misinformation'] = ['False', 'True', 'False', 'False', 'False', 'False', 'False', 'False', 'False', 'False',
                        'True', 'False', 'False', 'False', 'True', 'False', 'False', 'False', 'False', 'False',
                        'False', 'False', 'False', 'False', 'False', 'True', 'False', 'True']
df.head()
# distributions of the misinformation labels and video categories (bar plots, since both columns are categorical)
plt.figure(figsize=(12,6))
df.misinformation.value_counts().plot(kind='bar')
plt.figure(figsize=(12,6))
df.category.value_counts().plot(kind='bar')
# Video Likes
fig = px.histogram(df, x="like_count", nbins=10)
fig.update_traces(marker_color="turquoise",marker_line_color='rgb(8,48,107)',
marker_line_width=1.5)
fig.update_layout(title_text='Video Likes')
fig.show()
plt.figure(figsize=(5,5))
sns.boxplot(y='like_count',data=df)
# re-plot with extreme outliers (likes above 200k) removed
fig = px.histogram(df[df['like_count'] < 200000], x="like_count", nbins=10)
fig.update_traces(marker_color="turquoise", marker_line_color='rgb(8,48,107)',
                  marker_line_width=1.5)
fig.update_layout(title_text='Video Likes (outliers removed)')
fig.show()
# Video Comments
fig = px.histogram(df, x="comment_count", nbins=10)
fig.update_traces(marker_color="turquoise",marker_line_color='rgb(8,48,107)',
marker_line_width=1.5)
fig.update_layout(title_text='Video Comments')
fig.show()
plt.figure(figsize=(5,5))
sns.boxplot(y='comment_count',data=df)
# re-plot with extreme outliers (comment counts above 10k) removed
fig = px.histogram(df[df['comment_count'] < 10000], x="comment_count", nbins=10)
fig.update_traces(marker_color="turquoise", marker_line_color='rgb(8,48,107)',
                  marker_line_width=1.5)
fig.update_layout(title_text='Video Comments (outliers removed)')
fig.show()
# Video Views
fig = px.histogram(df, x="view_count", nbins=10)
fig.update_traces(marker_color="turquoise",marker_line_color='rgb(8,48,107)',
marker_line_width=1.5)
fig.update_layout(title_text='Video Views')
fig.show()
plt.figure(figsize=(5,5))
sns.boxplot(y='view_count',data=df)
# re-plot with extreme outliers (views above 1M) removed
fig = px.histogram(df[df['view_count'] < 1000000], x="view_count", nbins=10)
fig.update_traces(marker_color="turquoise", marker_line_color='rgb(8,48,107)',
                  marker_line_width=1.5)
fig.update_layout(title_text='Video Views (outliers removed)')
fig.show()
# Create stopword list:
from wordcloud import WordCloud
stopwords = set(stopwords.words('english'))
stopwords.update(["br", "href", "https", 'http', 'www', 'com', 'de', 'da',
                  'twitter', 'youtube', 'follow', 'subscribe', 'u', 'que'])
# word cloud over all video descriptions (dropna() guards against missing descriptions)
textt = " ".join(review for review in df.description.dropna())
wordcloud = WordCloud(stopwords=stopwords).generate(textt)
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.savefig('wordcloud11.png')
plt.show()
# compute the like-to-view ratio and apply the 10% threshold from above
df['perc_liked'] = df['like_count']/df['view_count']
df = df[df['perc_liked'] != .5]
df['sentiment'] = df['perc_liked'].apply(lambda likes : +1 if likes > .1 else -1)
df['perc_liked'].hist()
positive = df[df['sentiment'] == 1]
negative = df[df['sentiment'] == -1]
# WORD CLOUD - POSITIVE SENTIMENT
stopwords.update(["br", "href", "good", "great", "e", "US", "us", "Us"])
## "good" and "great" removed because they also appeared in the negative-sentiment videos
pos = " ".join(review for review in positive.description.dropna())
wordcloud2 = WordCloud(stopwords=stopwords).generate(pos)
plt.imshow(wordcloud2, interpolation='bilinear')
plt.axis("off")
plt.show()
# WORD CLOUD - NEGATIVE SENTIMENT
neg = " ".join(review for review in negative.description.dropna())
wordcloud3 = WordCloud(stopwords=stopwords).generate(neg)
plt.imshow(wordcloud3, interpolation='bilinear')
plt.axis("off")
plt.savefig('wordcloud33.png')
plt.show()
df['sentimentt'] = df['sentiment'].replace({-1: 'negative', 1: 'positive'})
fig = px.histogram(df, x="sentimentt")
fig.update_traces(marker_color="indianred",marker_line_color='rgb(8,48,107)',
marker_line_width=1.5)
fig.update_layout(title_text='Video Sentiment')
fig.show()
def remove_punctuation(text):
    final = "".join(u for u in text if u not in ("?", ".", ";", ":", "!", '"'))
    return final
df = df.dropna(subset=['description'])
df['description'] = df['description'].apply(remove_punctuation)
dfNew = df[['description','sentiment']]
dfNew.head()
# random 80/20 split into train and test data
index = df.index
df['random_number'] = np.random.rand(len(index))  # uniform in [0, 1), so the 0.8 cutoff gives ~80/20
train = df[df['random_number'] <= 0.8]
test = df[df['random_number'] > 0.8]
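For reference, a more standard way to get the same 80/20 split is sklearn's train_test_split; this sketch is an alternative, not what the pipeline above uses:
from sklearn.model_selection import train_test_split

# reproducible 80/20 split; random_state is an arbitrary choice here
train, test = train_test_split(df, test_size=0.2, random_state=42)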
# count vectorizer over the video descriptions:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(token_pattern=r'\b\w+\b')
train_matrix = vectorizer.fit_transform(train['description'])
test_matrix = vectorizer.transform(test['description'])
# Logistic Regression
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
X_train = train_matrix
X_test = test_matrix
y_train = train['sentiment']
y_test = test['sentiment']
lr.fit(X_train,y_train)
predictions = lr.predict(X_test)
# find accuracy, precision, recall (sklearn expects y_true first, then predictions):
from sklearn.metrics import confusion_matrix, classification_report
print(confusion_matrix(y_test, predictions))
print(classification_report(y_test, predictions))
# word cloud over all video titles (stopword list reused from above)
textt = " ".join(review for review in df.video_title)
wordcloud = WordCloud(stopwords=stopwords).generate(textt)
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.savefig('wordcloud_titles.png')  # separate filename so the description cloud saved above is not overwritten
plt.show()
For the titles, we see that Public Restriction, COVID, Omicron, detained, dose, and variant are all heavily represented.
# WORD CLOUD - POSITIVE SENTIMENT VIDEO TITLES
stopwords.update(["br", "href", "good", "great"])
## "good" and "great" removed because they also appeared in the negative-sentiment titles
pos = " ".join(review for review in positive.video_title)
wordcloud2 = WordCloud(stopwords=stopwords).generate(pos)
plt.imshow(wordcloud2, interpolation='bilinear')
plt.axis("off")
plt.show()
# WORD CLOUD - NEGATIVE SENTIMENT VIDEO TITLES
neg = " ".join(review for review in negative.video_title)
wordcloud3 = WordCloud(stopwords=stopwords).generate(neg)
plt.imshow(wordcloud3, interpolation='bilinear')
plt.axis("off")
plt.savefig('wordcloud_titles_neg.png')  # separate filename so the description cloud saved above is not overwritten
plt.show()
# reuse remove_punctuation from above, this time on the video titles
df = df.dropna(subset=['video_title'])
df['video_title'] = df['video_title'].apply(remove_punctuation)
dfNew = df[['video_title','sentiment']]
dfNew.head()
# random 80/20 split into train and test data
index = df.index
df['random_number'] = np.random.rand(len(index))
train = df[df['random_number'] <= 0.8]
test = df[df['random_number'] > 0.8]
# count vectorizer over the video titles (this model predicts from titles, not descriptions):
vectorizer = CountVectorizer(token_pattern=r'\b\w+\b')
train_matrix = vectorizer.fit_transform(train['video_title'])
test_matrix = vectorizer.transform(test['video_title'])
# Logistic Regression
lr = LogisticRegression()
X_train = train_matrix
X_test = test_matrix
y_train = train['sentiment']
y_test = test['sentiment']
lr.fit(X_train,y_train)
predictions = lr.predict(X_test)
# find accuracy, precision, recall (y_true first, then predictions):
print(confusion_matrix(y_test, predictions))
print(classification_report(y_test, predictions))
Using the video descriptions gives us a better model for determining sentiment than the video titles: descriptions yield a model with 0.83 overall accuracy, whereas titles yield 0.75.
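To pull out the single accuracy number without reading it off the classification report, sklearn's accuracy_score gives it directly; this is a sketch, not part of the original notebook:
from sklearn.metrics import accuracy_score

# overall accuracy of the current model on the held-out test set
print("accuracy:", accuracy_score(y_test, predictions))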
# misinformation share within each sentiment class
pos_counts = df[df['sentiment'] == 1]['misinformation'].value_counts()
plt.pie(pos_counts, labels=pos_counts.index, autopct='%1.1f%%')
plt.title('Misinformation In Positive Sentiment Videos')
plt.show()
neg_counts = df[df['sentiment'] == -1]['misinformation'].value_counts()
plt.pie(neg_counts, labels=neg_counts.index, autopct='%1.1f%%')
plt.title('Misinformation In Negative Sentiment Videos')
plt.show()