Introduction Data Misinformation Model Topic Modeling Sentiment Analysis References

Sentiment Analysis of Youtube Videos by Their Video Descriptions & Titles

We performed sentiment analysis on our YouTube videos to gauge if there was a difference in the engagement of positive and negative sentiment videos. We then wanted to see if there was a relation in the sentiment of the videos and whether the videos were spreading misinformation or not.

We trained a Logistic Regression model to predict the sentiment of videos. In order to label our videos for training, we researched what percentage of likes on a video deems it successful. On average, receiving likes that total 4-10% of the total views or above is seen as the baseline for a good video. We decided to take the upper limit of 10% and label all videos with a ratio of likes to views higher than 10% as positive. Any video with a ratio that fell below the threshold was labeled negative. We then fed this labeled data into the model with a 80/20 train test split.

We created two models following this labeling logic. The first model dealt with predicting sentiment based on video titles. The second model predicted sentiment using the video descriptions. Once we finished our models, we plotted our positive and negative sentiment videos to see how the distribution of misinformation spread compared. The model based on video descriptions, with an accuracy of 0.83, performed better than the one run on video titles, which had 0.75 accuracy. Looking into the video descriptions further, we were able to isolate the terms that appeared most frequently in both the positive and negative sentiment videos. Videos about general news channels seemed to have a more positive reaction from viewers than videos about specific platforms like Instagram, Twitter, Facebook, tv9kannada, or CTV news. China and omicron as terms have a higher appearance in videos that were taken negatively. African Diaspora News and non-English words appear in high amounts in positively received videos.

We found that the videos with positive sentiment had a slightly higher rate of misinformation (25%) vs negative videos containing misinformation (~17%). These results raise an interesting question of how people react to videos. Are they reacting more positively when given specific types of information or are people being exposed to misinformation content that is more likely to yield a positive reaction?

To summarize, this is our pipeline for sentiment analysis:

  1. Look into the distribution of likes, comments, and views (removing outliers)
  2. Create a WordCloud for the descriptions (removing stopwords)
  3. Separate the positive and negative sentiment videos and create WordClouds for each
  • Determine a ratio of likes to views to act as a threshold for positive vs negative sentiment
  1. Run a Logistic Regression Model to predict sentiment training on current video data
  2. Repeat process on Video Titles
In [372]:
import pandas as pd
import numpy as np
In [373]:
import matplotlib.pyplot as plt
import seaborn as sns
color = sns.color_palette()
%matplotlib inline
import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.tools as tls
import plotly.express as px
In [374]:
import nltk
from nltk.corpus import stopwords
In [375]:
# df = temp
df = pd.read_csv('../../data/youtube_metadata.csv')
df = df.drop(0)
df["like_count"] = pd.to_numeric(df["like_count"])
df["comment_count"] = pd.to_numeric(df["comment_count"])
df["view_count"] = pd.to_numeric(df["view_count"])
df['misinformation'] = ['False', 'True', 'False', 'False', 'False', 'False', 'False', 'False', 'False', 'False', 
                        'True', 'False', 'False', 'False', 'True', 'False', 'False', 'False', 'False', 'False', 
                        'False', 'False', 'False', 'False', 'False', 'True', 'False', 'True']
# df.iloc[27]
# len(df)
In [376]:
df.head()
Out[376]:
video_title user date_posted like_count comment_count view_count description topic_details video_tags category misinformation
1 “You’re being detained”: One Woman's Horrific ... UCln1G-zTVJQlnnnwGrvU8PA 2021-12-11T02:59:47Z 1 1.0 138 Cathy Blais found out her flight home on Egypt... ['https://en.wikipedia.org/wiki/Society'] ['omicron', 'covid', 'quarantine', 'africa', '... News & Politics False
2 I Found The Source of the Coronavirus UCSMrMr8KL6ADMstvYpbYCYA 2021-12-10T10:41:02Z 38 8.0 1155 I Found The Source of the Coronavirus, I have ... ['https://en.wikipedia.org/wiki/Society'] ['china', 'coronavirus', 'laowhy86', 'advchina... News & Politics True
3 In person conferences and my opinion UC2dd23_rKliwoPcCYANq7AA 2021-07-23T13:31:33Z 2 2.0 80 Fall conference season is coming up, be safe. ['https://en.wikipedia.org/wiki/Lifestyle_(soc... NaN Entertainment False
4 ¿Por que solo hay un 35% de vacunados en Parag... UCMxkaLLyBC8NoPYbPzNvRLg 2021-12-01T18:31:15Z 1 1.0 20 Carlos Kiese en su programa Deportivo de Radio... ['https://en.wikipedia.org/wiki/Society'] NaN Music False
5 ÓMICRON: SÍNTOMAS y CLAVES de la nueva variant... UC1FZF46zaC26_CynNc786_g 2021-12-02T23:43:17Z 11017 1450.0 1177469 La #variante #ómicron del #coronavirus tras se... ['https://en.wikipedia.org/wiki/Society'] ['variante omicron', 'que es la variante omicr... News & Politics False
In [377]:
plt.figure(figsize=(12,6))
df.misinformation.hist()
Out[377]:
<AxesSubplot:>
In [378]:
plt.figure(figsize=(12,6))
df.category.hist()
Out[378]:
<AxesSubplot:>
In [379]:
# Video Likes
fig = px.histogram(df, x="like_count", nbins=10)
fig.update_traces(marker_color="turquoise",marker_line_color='rgb(8,48,107)',
                  marker_line_width=1.5)
fig.update_layout(title_text='Video Likes')
fig.show()
0100k200k300k400k500k600k0510152025
Video Likeslike_countcount

This histogram doesn't show us too much about the data. We can see that a lot of the data is grouped within 100k likes

  • We will remove outliers to represent that distribution better
In [380]:
plt.figure(figsize=(5,5))
sns.boxplot(y='like_count',data=df)
Out[380]:
<AxesSubplot:ylabel='like_count'>

There are outlier values at around 500000

  • We will remove these to get a better distribution of the like counts below
In [381]:
fig = px.histogram(df[df['like_count'] <200000], x="like_count", nbins=10)
fig.update_traces(marker_color="turquoise",marker_line_color='rgb(8,48,107)',
                  marker_line_width=1.5)
fig.update_layout(title_text='Video Likes')
fig.show()
In [382]:
# Video Comments
fig = px.histogram(df, x="comment_count", nbins=10)
fig.update_traces(marker_color="turquoise",marker_line_color='rgb(8,48,107)',
                  marker_line_width=1.5)
fig.update_layout(title_text='Video Comments')
fig.show()
02k4k6k8k10k12k14k16k18k05101520
Video Commentscomment_countcount
In [383]:
plt.figure(figsize=(5,5))
sns.boxplot(y='comment_count',data=df)
Out[383]:
<AxesSubplot:ylabel='comment_count'>
In [384]:
fig = px.histogram(df[df['comment_count'] <10000], x="comment_count", nbins=10)
fig.update_traces(marker_color="turquoise",marker_line_color='rgb(8,48,107)',
                  marker_line_width=1.5)
fig.update_layout(title_text='Video Comments')
fig.show()
0100020003000400050006000700005101520
Video Commentscomment_countcount
In [385]:
# Video Views
fig = px.histogram(df, x="view_count", nbins=10)
fig.update_traces(marker_color="turquoise",marker_line_color='rgb(8,48,107)',
                  marker_line_width=1.5)
fig.update_layout(title_text='Video Views')
fig.show()
In [386]:
plt.figure(figsize=(5,5))
sns.boxplot(y='view_count',data=df)
Out[386]:
<AxesSubplot:ylabel='view_count'>
In [387]:
fig = px.histogram(df[df['view_count'] <1000000], x="view_count", nbins=10)
fig.update_traces(marker_color="turquoise",marker_line_color='rgb(8,48,107)',
                  marker_line_width=1.5)
fig.update_layout(title_text='Video Views')
fig.show()
0100k200k300k400k500k600k700k800k900k05101520
Video Viewsview_countcount

Create a Wordcloud for All Youtube Videos

In [388]:
# Create stopword list:
from nltk.corpus import stopwords
import seaborn as sns
from wordcloud import WordCloud 
# from wordcloud import WordCloud 

# stopwords = set(STOPWORDS)
stopwords = set(stopwords.words('english'))
stopwords.update(["br", "href", "https", 'http', 'www', 'com', 'de', 'da', 
                  'twitter', 'youtube', 'follow', 'subscribe', 'u', 'que'])
textt = " ".join(review for review in df.description)
wordcloud = WordCloud(stopwords=stopwords).generate(textt)
In [389]:
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.savefig('wordcloud11.png')
plt.show()

We see that Facebook and Instagram are referenced a lot

  • Other notable terms are CTV News, tv9kannada, china, wolff, covid, omicron, medical, health
  • this may indicate that in the time period, omicron outweighs any other variant in discussion of covid on twitter -> youtube

Next, we want to see if the top words change between videos with positive and negative sentiment

  • We will determine the sentiment labels based on the pratio of likes to views and video recieves
  • a video with a ratio greater than 0.1 will be seen as positive
In [390]:
# df = df[df['like_count'] != 3]
# df['sentiment'] = df['Score'].apply(lambda likes : +1 if likes/> 3 else -1)
df['perc_liked'] = df['like_count']/df['view_count']
In [391]:
df = df[df['perc_liked'] != .5]
df['sentiment'] = df['perc_liked'].apply(lambda likes : +1 if likes > .1 else -1)
In [392]:
df['perc_liked'].hist()
Out[392]:
<AxesSubplot:>
In [393]:
positive = df[df['sentiment'] == 1]
negative = df[df['sentiment'] == -1]
In [394]:
# WORD CLOUD - POSITIVE SENTIMENT
# stopwords = set(stopwords.words('english'))
stopwords.update(["br", "href","good","great", "e", "US", "us", "Us"]) 
## good and great removed because they were included in negative sentiment
pos = " ".join(review for review in positive.description)
wordcloud2 = WordCloud(stopwords=stopwords).generate(pos)
plt.imshow(wordcloud2, interpolation='bilinear')
plt.axis("off")
plt.show()
In [395]:
# WORD CLOUD - NEGATIVE SENTIMENT
neg = " ".join(review for review in negative.description)
wordcloud3 = WordCloud(stopwords=stopwords).generate(neg)
plt.imshow(wordcloud3, interpolation='bilinear')
plt.axis("off")
plt.savefig('wordcloud33.png')
plt.show()
In [396]:
df['sentimentt'] = df['sentiment'].replace({-1 : 'negative'})
df['sentimentt'] = df['sentimentt'].replace({1 : 'positive'})
fig = px.histogram(df, x="sentimentt")
fig.update_traces(marker_color="indianred",marker_line_color='rgb(8,48,107)',
                  marker_line_width=1.5)
fig.update_layout(title_text='Video Sentiment')
fig.show()
negativepositive0510152025
Video Sentimentsentimenttcount
In [397]:
def remove_punctuation(text):
    final = "".join(u for u in text if u not in ("?", ".", ";", ":",  "!",'"'))
    return final
df['Text'] = df['category'].apply(remove_punctuation)
df = df.dropna(subset=['description'])
df['description'] = df['description'].apply(remove_punctuation)
In [398]:
dfNew = df[['description','sentiment']]
dfNew.head()
Out[398]:
description sentiment
1 Cathy Blais found out her flight home on Egypt... -1
2 I Found The Source of the Coronavirus, I have ... -1
3 Fall conference season is coming up, be safe -1
4 Carlos Kiese en su programa Deportivo de Radio... -1
5 La #variante #ómicron del #coronavirus tras se... -1
In [399]:
# random split train and test data
index = df.index
df['random_number'] = np.random.randn(len(index))
train = df[df['random_number'] <= 0.8]
test = df[df['random_number'] > 0.8]
In [400]:
# count vectorizer:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(token_pattern=r'\b\w+\b')
train_matrix = vectorizer.fit_transform(train['description'])
test_matrix = vectorizer.transform(test['description'])
In [401]:
# Logistic Regression
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
In [402]:
X_train = train_matrix
X_test = test_matrix
y_train = train['sentiment']
y_test = test['sentiment']
In [403]:
lr.fit(X_train,y_train)
Out[403]:
LogisticRegression()
In [404]:
predictions = lr.predict(X_test)
In [405]:
# find accuracy, precision, recall:
from sklearn.metrics import confusion_matrix,classification_report
new = np.asarray(y_test)
confusion_matrix(predictions,y_test)
Out[405]:
array([[5, 1],
       [0, 0]])
In [406]:
print(classification_report(predictions,y_test))
              precision    recall  f1-score   support

          -1       1.00      0.83      0.91         6
           1       0.00      0.00      0.00         0

    accuracy                           0.83         6
   macro avg       0.50      0.42      0.45         6
weighted avg       1.00      0.83      0.91         6

/opt/conda/lib/python3.9/site-packages/sklearn/metrics/_classification.py:1248: UndefinedMetricWarning:

Recall and F-score are ill-defined and being set to 0.0 in labels with no true samples. Use `zero_division` parameter to control this behavior.

/opt/conda/lib/python3.9/site-packages/sklearn/metrics/_classification.py:1248: UndefinedMetricWarning:

Recall and F-score are ill-defined and being set to 0.0 in labels with no true samples. Use `zero_division` parameter to control this behavior.

/opt/conda/lib/python3.9/site-packages/sklearn/metrics/_classification.py:1248: UndefinedMetricWarning:

Recall and F-score are ill-defined and being set to 0.0 in labels with no true samples. Use `zero_division` parameter to control this behavior.

Let's Repeat the Above Analysis on Video Titles Alone

In [407]:
# stopwords = set(stopwords.words('english'))
stopwords.update(["br", "href", "https", 'http', 'www', 'com', 'de', 'da', 
                  'twitter', 'youtube', 'follow', 'subscribe', 'u', 'que'])
textt = " ".join(review for review in df.video_title)
wordcloud = WordCloud(stopwords=stopwords).generate(textt)
In [408]:
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.savefig('wordcloud11.png')
plt.show()

For the titles, we see that Publiic Restriction, COVID, Omicron, detained, dose, variant are all heavily found

In [409]:
# WORD CLOUD - POSITIVE SENTIMENT VIDEO TITLES
# stopwords = set(stopwords.words('english'))
stopwords.update(["br", "href","good","great"]) 
## good and great removed because they were included in negative sentiment
pos = " ".join(review for review in positive.video_title)
wordcloud2 = WordCloud(stopwords=stopwords).generate(pos)
plt.imshow(wordcloud2, interpolation='bilinear')
plt.axis("off")
plt.show()
In [438]:
# WORD CLOUD - NEGATIVE SENTIMENT VIDEO TITLES
neg = " ".join(review for review in negative.video_title)
wordcloud3 = WordCloud(stopwords=stopwords).generate(neg)
plt.imshow(wordcloud3, interpolation='bilinear')
plt.axis("off")
plt.savefig('wordcloud33.png')
plt.show()
In [439]:
def remove_punctuation(text):
    final = "".join(u for u in text if u not in ("?", ".", ";", ":",  "!",'"'))
    return final
df['Text'] = df['description'].apply(remove_punctuation)
df = df.dropna(subset=['video_title'])
df['video_title'] = df['video_title'].apply(remove_punctuation)
In [440]:
dfNew = df[['description','sentiment']]
dfNew.head()
Out[440]:
description sentiment
1 Cathy Blais found out her flight home on Egypt... -1
2 I Found The Source of the Coronavirus, I have ... -1
3 Fall conference season is coming up, be safe -1
4 Carlos Kiese en su programa Deportivo de Radio... -1
5 La #variante #ómicron del #coronavirus tras se... -1
In [441]:
# random split train and test data
index = df.index
df['random_number'] = np.random.randn(len(index))
train = df[df['random_number'] <= 0.8]
test = df[df['random_number'] > 0.8]
In [442]:
# count vectorizer:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(token_pattern=r'\b\w+\b')
train_matrix = vectorizer.fit_transform(train['description'])
test_matrix = vectorizer.transform(test['description'])
In [443]:
# Logistic Regression
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
In [444]:
X_train = train_matrix
X_test = test_matrix
y_train = train['sentiment']
y_test = test['sentiment']
In [445]:
lr.fit(X_train,y_train)
Out[445]:
LogisticRegression()
In [446]:
predictions = lr.predict(X_test)
In [447]:
# find accuracy, precision, recall:
from sklearn.metrics import confusion_matrix,classification_report
new = np.asarray(y_test)
confusion_matrix(predictions,y_test)
Out[447]:
array([[6, 2],
       [0, 0]])
In [448]:
print(classification_report(predictions,y_test))
              precision    recall  f1-score   support

          -1       1.00      0.75      0.86         8
           1       0.00      0.00      0.00         0

    accuracy                           0.75         8
   macro avg       0.50      0.38      0.43         8
weighted avg       1.00      0.75      0.86         8

/opt/conda/lib/python3.9/site-packages/sklearn/metrics/_classification.py:1248: UndefinedMetricWarning:

Recall and F-score are ill-defined and being set to 0.0 in labels with no true samples. Use `zero_division` parameter to control this behavior.

/opt/conda/lib/python3.9/site-packages/sklearn/metrics/_classification.py:1248: UndefinedMetricWarning:

Recall and F-score are ill-defined and being set to 0.0 in labels with no true samples. Use `zero_division` parameter to control this behavior.

/opt/conda/lib/python3.9/site-packages/sklearn/metrics/_classification.py:1248: UndefinedMetricWarning:

Recall and F-score are ill-defined and being set to 0.0 in labels with no true samples. Use `zero_division` parameter to control this behavior.

Using the dideo descriptions gives us a better model for determining sentiment than video titles. We see that descriptions yields a model with 0.83 overall accuracy whereas titles yields 0.75 overall accuracy.

In [449]:
# plt.pie(df[df['sentiment'] == 1])
plt.pie(df[df['sentiment'] == 1]['misinformation'].value_counts(), labels = ['False', 'True'], autopct='%1.1f%%')
plt.title('Misinformation In Positive Sentiment Videos')
Out[449]:
Text(0.5, 1.0, 'Misinformation In Positive Sentiment Videos')
In [450]:
plt.pie(df[df['sentiment'] == -1]['misinformation'].value_counts(), labels = ['False', 'True'], autopct='%1.1f%%')
plt.title('Misinformation In Negative Sentiment Videos')
Out[450]:
Text(0.5, 1.0, 'Misinformation In Negative Sentiment Videos')
In [ ]:
 
In [ ]:
 
In [ ]: