We performed sentiment analysis on our YouTube videos to gauge whether there was a difference in engagement between positive- and negative-sentiment videos. We then wanted to see whether there was a relationship between a video's sentiment and whether it was spreading misinformation.
We trained a Logistic Regression model to predict the sentiment of videos. To label our videos for training, we researched what ratio of likes to views marks a video as successful. On average, a like count of 4-10% of total views or above is considered the baseline for a good video. We took the upper limit of 10% and labeled every video with a like-to-view ratio above 10% as positive; any video that fell below the threshold was labeled negative. We then fed this labeled data into the model with an 80/20 train/test split.
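Concretely, the labeling rule reduces to a single thresholded ratio. This sketch mirrors the pipeline code at the end of this section:
# like-to-view ratio above 10% => positive (+1), otherwise negative (-1)
df['perc_liked'] = df['like_count'] / df['view_count']
df['sentiment'] = df['perc_liked'].apply(lambda r: +1 if r > 0.10 else -1)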
We created two models following this labeling logic. The first model predicted sentiment from video titles; the second predicted sentiment from video descriptions. Once the models were trained, we plotted the positive- and negative-sentiment videos to compare how misinformation was distributed between them. The description-based model, with an accuracy of 0.83, performed better than the title-based model, which had 0.75 accuracy. Looking further into the descriptions, we isolated the terms that appeared most frequently in the positive- and negative-sentiment videos. Videos from general news channels seemed to draw a more positive reaction from viewers than videos about specific platforms such as Instagram, Twitter, Facebook, tv9kannada, or CTV News. The terms "China" and "omicron" appeared more often in negatively received videos, while "African Diaspora News" and non-English words appeared heavily in positively received ones.
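In the pipeline below we isolate these frequent terms with word clouds. As a rough alternative sketch, the same frequencies can be tabulated directly (top_terms is an illustrative helper of ours, not part of the pipeline):
from collections import Counter

# count the most common non-stopword tokens across a set of texts
def top_terms(texts, stop, n=10):
    counts = Counter()
    for text in texts:
        counts.update(w for w in text.lower().split() if w not in stop)
    return counts.most_common(n)

# e.g. compare top_terms(positive.description.dropna(), stopwords)
#      against top_terms(negative.description.dropna(), stopwords)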
We found that positive-sentiment videos had a slightly higher rate of misinformation (25%) than negative-sentiment videos (~17%). These results raise an interesting question about how people react to videos. Are they reacting more positively when given specific types of information, or are people being exposed to misinformation content that is more likely to yield a positive reaction?
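These rates can be read off a one-line contingency table; this sketch is a check on the pie charts at the end of the section, not part of the original pipeline:
# fraction of misinformation within each sentiment class (rows sum to 1)
print(pd.crosstab(df['sentiment'], df['misinformation'], normalize='index'))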
To summarize, this is our pipeline for sentiment analysis:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
color = sns.color_palette()
%matplotlib inline
import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.tools as tls
import plotly.express as px
import nltk
nltk.download('stopwords', quiet=True)  # make sure the stopword corpus is available
from nltk.corpus import stopwords
# load the YouTube metadata and cast the engagement columns to numeric
df = pd.read_csv('../../data/youtube_metadata.csv')
df = df.drop(0)
df["like_count"] = pd.to_numeric(df["like_count"])
df["comment_count"] = pd.to_numeric(df["comment_count"])
df["view_count"] = pd.to_numeric(df["view_count"])
# misinformation label for each of the 28 videos
df['misinformation'] = ['False', 'True', 'False', 'False', 'False', 'False', 'False', 'False', 'False', 'False',
                        'True', 'False', 'False', 'False', 'True', 'False', 'False', 'False', 'False', 'False',
                        'False', 'False', 'False', 'False', 'False', 'True', 'False', 'True']
df.head()
# distributions of the misinformation labels and video categories (bar plots, since both columns are categorical)
plt.figure(figsize=(12,6))
df.misinformation.value_counts().plot(kind='bar')
plt.figure(figsize=(12,6))
df.category.value_counts().plot(kind='bar')
# Video Likes
fig = px.histogram(df, x="like_count", nbins=10)
fig.update_traces(marker_color="turquoise",marker_line_color='rgb(8,48,107)',
marker_line_width=1.5)
fig.update_layout(title_text='Video Likes')
fig.show()
plt.figure(figsize=(5,5))
sns.boxplot(y='like_count',data=df)
# re-plot with extreme outliers (likes above 200k) removed
fig = px.histogram(df[df['like_count'] < 200000], x="like_count", nbins=10)
fig.update_traces(marker_color="turquoise", marker_line_color='rgb(8,48,107)',
                  marker_line_width=1.5)
fig.update_layout(title_text='Video Likes (outliers removed)')
fig.show()
# Video Comments
fig = px.histogram(df, x="comment_count", nbins=10)
fig.update_traces(marker_color="turquoise",marker_line_color='rgb(8,48,107)',
marker_line_width=1.5)
fig.update_layout(title_text='Video Comments')
fig.show()
plt.figure(figsize=(5,5))
sns.boxplot(y='comment_count',data=df)
# re-plot with extreme outliers (comment counts above 10k) removed
fig = px.histogram(df[df['comment_count'] < 10000], x="comment_count", nbins=10)
fig.update_traces(marker_color="turquoise", marker_line_color='rgb(8,48,107)',
                  marker_line_width=1.5)
fig.update_layout(title_text='Video Comments (outliers removed)')
fig.show()
# Video Views
fig = px.histogram(df, x="view_count", nbins=10)
fig.update_traces(marker_color="turquoise",marker_line_color='rgb(8,48,107)',
marker_line_width=1.5)
fig.update_layout(title_text='Video Views')
fig.show()
plt.figure(figsize=(5,5))
sns.boxplot(y='view_count',data=df)
# re-plot with extreme outliers (views above 1M) removed
fig = px.histogram(df[df['view_count'] < 1000000], x="view_count", nbins=10)
fig.update_traces(marker_color="turquoise", marker_line_color='rgb(8,48,107)',
                  marker_line_width=1.5)
fig.update_layout(title_text='Video Views (outliers removed)')
fig.show()
# Create stopword list:
from wordcloud import WordCloud
stopwords = set(stopwords.words('english'))
stopwords.update(["br", "href", "https", 'http', 'www', 'com', 'de', 'da',
                  'twitter', 'youtube', 'follow', 'subscribe', 'u', 'que'])
# word cloud over all video descriptions (dropna() guards against missing descriptions)
textt = " ".join(review for review in df.description.dropna())
wordcloud = WordCloud(stopwords=stopwords).generate(textt)
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.savefig('wordcloud11.png')
plt.show()
# compute the like-to-view ratio and apply the 10% threshold from above
df['perc_liked'] = df['like_count']/df['view_count']
df = df[df['perc_liked'] != .5]
df['sentiment'] = df['perc_liked'].apply(lambda likes : +1 if likes > .1 else -1)
df['perc_liked'].hist()
positive = df[df['sentiment'] == 1]
negative = df[df['sentiment'] == -1]
# WORD CLOUD - POSITIVE SENTIMENT
stopwords.update(["br", "href", "good", "great", "e", "US", "us", "Us"])
## "good" and "great" removed because they also appeared in the negative-sentiment videos
pos = " ".join(review for review in positive.description.dropna())
wordcloud2 = WordCloud(stopwords=stopwords).generate(pos)
plt.imshow(wordcloud2, interpolation='bilinear')
plt.axis("off")
plt.show()
# WORD CLOUD - NEGATIVE SENTIMENT
neg = " ".join(review for review in negative.description.dropna())
wordcloud3 = WordCloud(stopwords=stopwords).generate(neg)
plt.imshow(wordcloud3, interpolation='bilinear')
plt.axis("off")
plt.savefig('wordcloud33.png')
plt.show()
df['sentimentt'] = df['sentiment'].replace({-1: 'negative', 1: 'positive'})
fig = px.histogram(df, x="sentimentt")
fig.update_traces(marker_color="indianred",marker_line_color='rgb(8,48,107)',
marker_line_width=1.5)
fig.update_layout(title_text='Video Sentiment')
fig.show()
def remove_punctuation(text):
    final = "".join(u for u in text if u not in ("?", ".", ";", ":", "!", '"'))
    return final
df = df.dropna(subset=['description'])
df['description'] = df['description'].apply(remove_punctuation)
dfNew = df[['description','sentiment']]
dfNew.head()
# random 80/20 split into train and test data
index = df.index
df['random_number'] = np.random.rand(len(index))  # uniform in [0, 1), so the 0.8 cutoff gives ~80/20
train = df[df['random_number'] <= 0.8]
test = df[df['random_number'] > 0.8]
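For reference, a more standard way to get the same 80/20 split is sklearn's train_test_split; this sketch is an alternative, not what the pipeline above uses:
from sklearn.model_selection import train_test_split

# reproducible 80/20 split; random_state is an arbitrary choice here
train, test = train_test_split(df, test_size=0.2, random_state=42)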
# count vectorizer over the video descriptions:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(token_pattern=r'\b\w+\b')
train_matrix = vectorizer.fit_transform(train['description'])
test_matrix = vectorizer.transform(test['description'])
# Logistic Regression
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
X_train = train_matrix
X_test = test_matrix
y_train = train['sentiment']
y_test = test['sentiment']
lr.fit(X_train,y_train)
predictions = lr.predict(X_test)
# find accuracy, precision, recall (sklearn expects y_true first, then predictions):
from sklearn.metrics import confusion_matrix, classification_report
print(confusion_matrix(y_test, predictions))
print(classification_report(y_test, predictions))
# word cloud over all video titles (stopword list reused from above)
textt = " ".join(review for review in df.video_title)
wordcloud = WordCloud(stopwords=stopwords).generate(textt)
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.savefig('wordcloud_titles.png')  # separate filename so the description cloud saved above is not overwritten
plt.show()
For the titles, we see that Public Restriction, COVID, Omicron, detained, dose, and variant are all heavily represented.
# WORD CLOUD - POSITIVE SENTIMENT VIDEO TITLES
stopwords.update(["br", "href", "good", "great"])
## "good" and "great" removed because they also appeared in the negative-sentiment titles
pos = " ".join(review for review in positive.video_title)
wordcloud2 = WordCloud(stopwords=stopwords).generate(pos)
plt.imshow(wordcloud2, interpolation='bilinear')
plt.axis("off")
plt.show()
# WORD CLOUD - NEGATIVE SENTIMENT VIDEO TITLES
neg = " ".join(review for review in negative.video_title)
wordcloud3 = WordCloud(stopwords=stopwords).generate(neg)
plt.imshow(wordcloud3, interpolation='bilinear')
plt.axis("off")
plt.savefig('wordcloud_titles_neg.png')  # separate filename so the description cloud saved above is not overwritten
plt.show()
# reuse remove_punctuation from above, this time on the video titles
df = df.dropna(subset=['video_title'])
df['video_title'] = df['video_title'].apply(remove_punctuation)
dfNew = df[['video_title','sentiment']]
dfNew.head()
# random 80/20 split into train and test data
index = df.index
df['random_number'] = np.random.rand(len(index))
train = df[df['random_number'] <= 0.8]
test = df[df['random_number'] > 0.8]
# count vectorizer over the video titles (this model predicts from titles, not descriptions):
vectorizer = CountVectorizer(token_pattern=r'\b\w+\b')
train_matrix = vectorizer.fit_transform(train['video_title'])
test_matrix = vectorizer.transform(test['video_title'])
# Logistic Regression
lr = LogisticRegression()
X_train = train_matrix
X_test = test_matrix
y_train = train['sentiment']
y_test = test['sentiment']
lr.fit(X_train,y_train)
predictions = lr.predict(X_test)
# find accuracy, precision, recall (y_true first, then predictions):
print(confusion_matrix(y_test, predictions))
print(classification_report(y_test, predictions))
Using the video descriptions gives us a better model for determining sentiment than the video titles: descriptions yield a model with 0.83 overall accuracy, whereas titles yield 0.75.
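To pull out the single accuracy number without reading it off the classification report, sklearn's accuracy_score gives it directly; this is a sketch, not part of the original notebook:
from sklearn.metrics import accuracy_score

# overall accuracy of the current model on the held-out test set
print("accuracy:", accuracy_score(y_test, predictions))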
# misinformation share within each sentiment class
pos_counts = df[df['sentiment'] == 1]['misinformation'].value_counts()
plt.pie(pos_counts, labels=pos_counts.index, autopct='%1.1f%%')
plt.title('Misinformation In Positive Sentiment Videos')
plt.show()
neg_counts = df[df['sentiment'] == -1]['misinformation'].value_counts()
plt.pie(neg_counts, labels=neg_counts.index, autopct='%1.1f%%')
plt.title('Misinformation In Negative Sentiment Videos')
plt.show()