Will Silver Wagman
Final Data Science Project Tutorial
CMPS-6790 | Professor Mattei
For this project, I will be working independently to analyze the sentiment of characters' lines throughout the entire series of The Office. I aim to explore how the tone and sentiment evolved over its nine seasons. By examining these changes, I hope to uncover whether shifts in sentiment were connected to the show's popularity or influenced viewer demographics. This analysis will offer insights into how the mood of the series transformed over time and how those shifts may have impacted its audience.
Datasets
I have selected the following datasets to use for this project:
1. The Office Transcripts Dataset (Primary Dataset)
This dataset contains the complete dialogue from every episode of The Office, broken down by season, episode, scene, and character. It includes a total of 59,909 lines, with metadata such as the character speaking the line, and whether the scene is deleted or not.
Question to Answer: How does the sentiment of characters’ lines change from season 1 to the final season? I aim to analyze the progression of sentiment across seasons to determine if the show became more humorous, emotional, or serious over time.
Link to source: https://www.reddit.com/r/datasets/comments/b30288/every_line_from_every_episode_of_the_office/
Direct link to google sheet dataset: https://docs.google.com/spreadsheets/d/18wS5AAwOh8QO95RwHLS95POmSNKA2jjzdt0phrxeAE0/edit?gid=747974534#gid=747974534
2. IMDB Ratings Dataset
I plan on curating my own dataset by scraping the episode ratings for every episode of The Office from IMDB. The dataset I create will include the episode number, season, and the average rating given by viewers.
Question to Answer: Do episodes with higher positive sentiment scores correspond to higher IMDB ratings? I plan on investigating whether episodes with a more positive or negative tone tend to have higher or lower viewer ratings.
3. Viewer Demographic Data
I hope to find viewer demographic data, which would ideally include information about The Office's viewers such as age, gender, location, and other metadata.
Question to Answer: Did changes in sentiment throughout the series affect the viewer demographics? For example, if the sentiment became more positive or emotional, did the show attract a younger audience?
I'm going to start with the first step of my initial ETL process: Extract
I'm going to load the dataset and verify the structure and integrity of the data. This will include checking for missing/null values and ensuring the relevant columns are present and formatted correctly.
import pandas as pd
from google.colab import drive
drive.mount('/content/drive')
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
df = pd.read_csv('/content/drive/MyDrive/the-office-lines-csv.csv')
df.head()
 | id | season | episode | scene | line_text | speaker | deleted |
---|---|---|---|---|---|---|---|
0 | 1 | 1 | 1 | 1 | All right Jim. Your quarterlies look very good... | Michael | False |
1 | 2 | 1 | 1 | 1 | Oh, I told you. I couldn't close it. So... | Jim | False |
2 | 3 | 1 | 1 | 1 | So you've come to the master for guidance? Is ... | Michael | False |
3 | 4 | 1 | 1 | 1 | Actually, you called me in here, but yeah. | Jim | False |
4 | 5 | 1 | 1 | 1 | All right. Well, let me show you how it's done. | Michael | False |
The transcript dataset contains the following columns:
id: Unique identifier for each line.
season: The season of The Office in which the line occurs.
episode: The episode number within that season.
scene: The scene number within the episode.
line_text: The actual line spoken in the episode.
speaker: The character who spoke the line.
deleted: Whether the scene is a deleted scene or not (True/False).
# Check for missing values
missing_values = df.isnull().sum()
missing_values
 | 0 |
---|---|
id | 0 |
season | 0 |
episode | 0 |
scene | 0 |
line_text | 0 |
speaker | 0 |
deleted | 0 |
# Check the data types of each column
data_types = df.dtypes
data_types
 | 0 |
---|---|
id | int64 |
season | int64 |
episode | int64 |
scene | int64 |
line_text | object |
speaker | object |
deleted | bool |
The dataset appears to be clean and shows no missing values in any of the columns. The relevant columns are also present and correctly formatted:
season, episode, and scene are all integers. line_text (the dialogue) and speaker (the character speaking) are stored as strings (object type). deleted is a boolean indicating whether a scene is deleted or not.
# The dataset includes an 'id' column which I'm going to drop because Pandas, by default, creates an index for me.
df.drop(columns=['id'], inplace=True)
df.head()
 | season | episode | scene | line_text | speaker | deleted |
---|---|---|---|---|---|---|
0 | 1 | 1 | 1 | All right Jim. Your quarterlies look very good... | Michael | False |
1 | 1 | 1 | 1 | Oh, I told you. I couldn't close it. So... | Jim | False |
2 | 1 | 1 | 1 | So you've come to the master for guidance? Is ... | Michael | False |
3 | 1 | 1 | 1 | Actually, you called me in here, but yeah. | Jim | False |
4 | 1 | 1 | 1 | All right. Well, let me show you how it's done. | Michael | False |
I'll move on to the next two steps of my initial ETL process: Transform & Load
Since I'm working with text for the transcripts dataset, I'll need to extract some numerical value from the lines. This will involve applying sentiment analysis to the dialogue lines in The Office dataset. I'm going to use the VADER (Valence Aware Dictionary and sEntiment Reasoner) sentiment analysis tool, which is specifically designed to analyze the sentiment of text data.
# Installing the vaderSentiment library
!pip install vaderSentiment
Requirement already satisfied: vaderSentiment in /usr/local/lib/python3.10/dist-packages (3.3.2)
Requirement already satisfied: requests in /usr/local/lib/python3.10/dist-packages (from vaderSentiment) (2.32.3)
Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packages (from requests->vaderSentiment) (3.4.0)
Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests->vaderSentiment) (3.10)
Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests->vaderSentiment) (2.2.3)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests->vaderSentiment) (2024.8.30)
# Now that the VADER library is installed, I'm going to import the sentiment analyzer
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
Now, I will apply sentiment analysis to the line_text column in the dataset. For each line, VADER will assign a sentiment score (ranging from -1 for very negative to +1 for very positive). I will add a new column to the dataset called sentiment to store these scores.
# Initializing the VADER sentiment analyzer.
analyzer = SentimentIntensityAnalyzer()
# Now to apply the sentiment analysis to each line of dialogue
df['sentiment'] = df['line_text'].apply(lambda line: analyzer.polarity_scores(line)["compound"])
df[["line_text", "sentiment"]].head(20)
 | line_text | sentiment |
---|---|---|
0 | All right Jim. Your quarterlies look very good... | 0.4927 |
1 | Oh, I told you. I couldn't close it. So... | 0.0000 |
2 | So you've come to the master for guidance? Is ... | 0.0000 |
3 | Actually, you called me in here, but yeah. | 0.4215 |
4 | All right. Well, let me show you how it's done. | 0.2732 |
5 | [on the phone] Yes, I'd like to speak to your ... | 0.6712 |
6 | I've, uh, I've been at Dunder Mifflin for 12 y... | 0.2225 |
7 | Well. I don't know. | 0.2732 |
8 | If you think she's cute now, you should have s... | 0.4588 |
9 | What? | 0.0000 |
10 | Any messages? | 0.0000 |
11 | Uh, yeah. Just a fax. | 0.2960 |
12 | Oh! Pam, this is from Corporate. How many time... | 0.4574 |
13 | You haven't told me. | 0.0000 |
14 | It's called the wastepaper basket! Look at tha... | 0.0000 |
15 | People say I am the best boss. They go, 'God w... | 0.9745 |
16 | [singing] Shall I play for you? Pa rum pump um... | 0.0516 |
17 | My job is to speak to clients on the phone abo... | -0.4019 |
18 | Whassup! | 0.0000 |
19 | Whassup! I still love that after seven years. | 0.6696 |
Now that each line has a sentiment score, I will group the data by the season column and calculate the average sentiment score for each season. This will help me understand how the sentiment of The Office evolves over time.
# Grouping by season and calculate the average sentiment score for each season.
sentiment_by_season = df.groupby('season')['sentiment'].mean()
print(sentiment_by_season)
season
1    0.164965
2    0.142028
3    0.136380
4    0.140098
5    0.127326
6    0.129451
7    0.137139
8    0.138774
9    0.141170
Name: sentiment, dtype: float64
To better visualize the changes in sentiment across the seasons, I'm going to plot the average sentiment score for each season using a line chart.
import matplotlib.pyplot as plt
plt.plot(sentiment_by_season.index, sentiment_by_season.values, marker='o')
plt.xlabel('Season')
plt.ylabel('Average Sentiment Score')
plt.title('Average Sentiment Across Seasons of The Office')
plt.grid(True)
plt.show()
Interesting Stat: The season with the highest average sentiment is season 1, indicating that this season had the most consistently positive tone in dialogue. Season 5, on the other hand, had the lowest sentiment score, possibly corresponding with a more serious or negative tone.
However, as someone who has watched the show multiple times over, my initial reaction is that these outcomes may not necessarily be reflective of the actual sentiment of the show. For example, The Office generally maintains a balance between humor and more serious moments across the seasons, and the sentiment scores from VADER don't fully capture the nuances of these tones. I have a strong feeling that the VADER sentiment analysis tool may not be the most effective method for my specific purposes in analyzing the sentiment of the dialogue throughout the show. I plan on exploring alternative sentiment analysis tools to achieve more accurate results.
I'm thinking that I will create a Python script that sends the transcribed lines from the 'line_text' column to OpenAI's API along with a prompt asking it to assign a sentiment score to each line. Since there are 59,909 lines in the dataset, I plan on batch processing these API calls.
For the purpose of completing this first milestone, I will proceed with some light EDA using the values from the 'sentiment' column that I created even though I plan on changing the method in which I apply a sentiment score to each line.
# 1. Number of lines per season
lines_per_season = df.groupby('season')['line_text'].count()
# 2. Total number of unique characters
unique_characters = df['speaker'].nunique()
# 3. Top 5 characters with the most lines
top_characters = df['speaker'].value_counts().head(5)
# 4. Average sentiment score by character (top 5)
top_characters_sentiment = df[df['speaker'].isin(top_characters.index)].groupby('speaker')['sentiment'].mean()
# 5. Distribution of sentiment scores (histogram).
import matplotlib.pyplot as plt
plt.hist(df['sentiment'], bins=20, edgecolor='black')
plt.xlabel('Sentiment Score')
plt.ylabel('Frequency')
plt.title('Distribution of Sentiment Scores for Lines in The Office')
plt.grid(True)
plt.show()
lines_per_season, unique_characters, top_characters, top_characters_sentiment
(season
1    1996
2    7492
3    7483
4    5642
5    8170
6    7630
7    7302
8    7083
9    7111
Name: line_text, dtype: int64,
797,
speaker
Michael    12137
Dwight      7529
Jim         6814
Pam         5375
Andy        3968
Name: count, dtype: int64,
speaker
Andy       0.154255
Dwight     0.098032
Jim        0.151209
Michael    0.174460
Pam        0.149948
Name: sentiment, dtype: float64)
Light EDA Findings: Here are 4 summary statistics that relate to the questions I am asking of the dataset:
Number of Lines per Season:
Relevance: This indicates the overall length and density of each season, which may correlate with character development, sentiment analysis, and audience engagement.
Total Number of Unique Characters:
Relevance: The diversity in characters is important because it can show how different perspectives and character arcs influence the tone and sentiment throughout the show.
Top 5 Characters by Number of Lines:
Relevance: These are the key characters that contribute most to the dialogue. As such, it would be logical to say that their sentiment scores are particularly relevant to understanding the overall tone of the show.
Average Sentiment Score by Character (in the Top 5):
Relevance: This is interesting as it may reflect character traits.
Michael's generally positive sentiment could align with his role as the comedic center of the show, while Dwight's more neutral sentiment aligns with his intense and often serious personality.
Visual Analysis: Distribution of Sentiment Scores
The histogram (above) shows the distribution of sentiment scores (according to the VADER index) across all lines in The Office.
Relevance: The distribution is centered around neutral sentiment, but we can see that positive sentiment (greater than 0) occurs more frequently than negative sentiment (less than 0). This would align with the comedic nature of the show but also indicates that a wide range of sentiments are present, which could be key to understanding its emotional dynamics. However, I plan on changing the way the sentiment score is applied to the values in the 'line_text' column as I have determined that the VADER index may not be accurate for my use case. Before making my decision, I will first take a look under the hood of VADER to see how it works and what it's doing when it applies a sentiment analysis score to text.
Before starting Milestone 2, I wanted to address my concerns about using the VADER model for sentiment scoring. During Milestone 1, I used this model to get hands-on experience applying a pre-built scoring mechanism; however, I did not conduct a thorough examination of what it does 'under the hood'.
Thus, since the sentiment scoring outcomes I found during my EDA in Milestone 1 didn't seem to be accurate, I decided to conduct an investigation into VADER and its accuracy in this project's use case.
The first thing I did was read through the line_text column alongside my newly created "sentiment" column to judge whether VADER's score for each line seemed accurate.
Note: An instance's "line_text" value may not be a single sentence; it may be a few sentences of text.
I discovered that some of the sentiment scores VADER assigned to text in the line_text column seemed wildly inaccurate. In many cases, VADER assigned positive scores to text that was obviously neutral or even negative (and vice versa). For example, for the text
"Oh! Pam, this is from Corporate. How many times have I told you? There's a special filing cabinet for things from corporate."
VADER assigned a score of 0.46, which is clearly too positive. For reference, here are the typical threshold values used with the VADER model (https://github.com/cjhutto/vaderSentiment); a small sketch of applying these thresholds follows the list:
• positive sentiment: compound score >= 0.05
• neutral sentiment: (compound score > -0.05) and (compound score < 0.05)
• negative sentiment: compound score <= -0.05
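To make these thresholds concrete, here is a minimal sketch (my own helper function, not part of the VADER library) that maps a compound score to one of the three labels:
# Applying the standard VADER thresholds to turn a compound score into a label
def vader_label(compound):
    if compound >= 0.05:
        return "positive"
    if compound <= -0.05:
        return "negative"
    return "neutral"
# The "special filing cabinet" line above scored ~0.46, so VADER would label it:
print(vader_label(0.46))  # -> 'positive', even though the line reads as exasperated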
Because I found many instances of sentiment-scoring inaccuracy, I decided against using VADER. Instead, I am going to send each instance's line_text value to OpenAI's API for sentiment scoring. I will detail this in the coming cells, but first, a quick thought:
The VADER model was created to score short snippets of text, specifically from social media. Under the hood, VADER analyzes sentiment using a lexicon-based approach, where each word in the text is matched to a dictionary of sentiment-labeled words with predefined scores. It then combines these scores, adjusting for factors like punctuation, capitalization, and modifiers to capture the overall sentiment intensity of the text. This is probably a decent way to capture sentiment of short text, but since I'm feeding it longer snippets of text (some line_text values are 3+ sentences in length), accuracy seems to decrease. This makes sense because more nuance gets introduced as word count increases and so VADER's dictionary-based approach simply won't do a good job of capturing sentiment in longer sections of text.
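To see this behavior concretely, here is a small sketch comparing VADER's full score breakdown (neg/neu/pos/compound) on a short, emphatic snippet versus a longer, mixed passage; both example sentences are made up for illustration and are not from the transcripts.
# Comparing VADER's full output on short vs. longer, mixed text
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
demo_analyzer = SentimentIntensityAnalyzer()
short_text = "This is GREAT!!!"  # capitalization and exclamation marks boost intensity
long_text = ("Well, the party was fine, I guess. The cake was great, "
             "but things got weird and everyone looked miserable by the end.")
print(demo_analyzer.polarity_scores(short_text))
print(demo_analyzer.polarity_scores(long_text))
# Positive and negative words in the longer passage tend to cancel into one muted
# compound number, which is part of why these scores felt off for multi-sentence dialogue.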
Thus, before I start Milestone 2, I'm going to re-score sentiment for each instance in the transcripts dataset.
For this I will be sending all line_text values to OpenAI's API for sentiment scoring.
Here are the details I'm using:
Model: gpt-4o-2024-08-06
Meta: Structured Outputs functionality — all 59,909 instances of speech need processing, and I need consistently formatted sentiment scores. With structured outputs, I receive the response as JSON, so I don't have to parse through the API's response text and can quickly extract the sentiment scores. Here is the documentation: https://platform.openai.com/docs/guides/structured-outputs/introduction
Meta: I am using batch processing because of API rate limits and because time is a factor. After some experimentation, I found that a batch size of 500 line_text values works well for my purposes.
Meta: I am also using parallel processing since batch processing by itself was too slow. I implemented a progress bar with metadata including a time-to-completion estimate and brought down the estimated time to finish the entire 59,909 lines from 12-ish hours to ~35 minutes.
Meta Meta: I chose to use 4 parallel workers (this number was sort of arbitrary).
I originally wrote the OpenAI API line processing script as a .py file in VSCode. Since I've already saved the output CSV locally and uploaded it to Google Drive, I'll share the code here in the following code cell instead of re-running it in this notebook (as this would take a while...).
Note: I've commented out the code so that I can run the notebook without executing it (it would fail anyway, because the API key was exported locally and isn't available within this environment).
#import pandas as pd
#import time
#from pathlib import Path
#from pydantic import BaseModel
#from openai import OpenAI
#from typing import Optional, List
#import json
#from tqdm import tqdm
#import concurrent.futures
#import numpy as np
#class LineScore(BaseModel):
# line_index: int
# sentiment: float
#class BatchSentimentScore(BaseModel):
# scores: List[LineScore]
#class DialogueProcessor:
# def __init__(self, input_file: str, output_file: str, batch_size: int = 500, n_workers: int = 4):
# self.client = OpenAI()
# self.input_file = input_file
# self.output_file = output_file
# self.batch_size = batch_size
# self.n_workers = n_workers
# def get_batch_sentiment(self, batch_data: tuple) -> dict:
# texts, indices = batch_data
# try:
# # Create a numbered list of texts for the model
# numbered_texts = [f"Text {i}: {text}" for i, text in zip(indices, texts)]
# all_texts = "\n\n".join(numbered_texts)
# completion = self.client.beta.chat.completions.parse(
# model="gpt-4o-2024-08-06",
# messages=[
# {"role": "system", "content": """You are a sentiment analysis expert. Analyze each text and return a sentiment score between -1.0 (most negative) and 1.0 (most positive).
# Use 0.0 for neutral sentiment. Consider context, tone, and subtle emotional nuances in the dialogue.
# You must return a score for every text provided."""},
# {"role": "user", "content": all_texts},
# ],
# response_format=BatchSentimentScore,
# )
# return {score.line_index: score.sentiment for score in completion.choices[0].message.parsed.scores}
# except Exception as e:
# print(f"Error processing batch: {str(e)}")
# return {}
# def process_file(self):
# try:
# print(f"Reading input file: {self.input_file}")
# df = pd.read_csv(self.input_file)
# print(f"Found {len(df)} lines to process")
# # split data into batches.
# indices = list(range(len(df)))
# texts = df["line_text"].tolist()
# # Creating batches
# batch_indices = [indices[i:i + self.batch_size] for i in range(0, len(indices), self.batch_size)]
# batch_texts = [texts[i:i + self.batch_size] for i in range(0, len(texts), self.batch_size)]
# batches = list(zip(batch_texts, batch_indices))
# results = {}
# processed_count = 0
# print(f"Processing {len(batches)} batches with {self.n_workers} workers")
# with concurrent.futures.ThreadPoolExecutor(max_workers=self.n_workers) as executor:
# # Submit all batches to the thread pool
# future_to_batch = {executor.submit(self.get_batch_sentiment, batch): i
# for i, batch in enumerate(batches)}
# # Process completed batches as they finish
# for future in tqdm(concurrent.futures.as_completed(future_to_batch),
# total=len(batches),
# desc="Processing batches"):
# batch_results = future.result()
# results.update(batch_results)
# processed_count += len(batch_results)
# # Save intermediate results every 2000 lines
# if processed_count % 2000 == 0:
# self.save_results(df, results)
# # Now to save the final results
# self.save_results(df, results)
# print("Processing completed successfully")
# except Exception as e:
# print(f"Error during file processing: {str(e)}")
# if 'results' in locals() and 'df' in locals():
# self.save_results(df, results)
# raise
# def save_results(self, df: pd.DataFrame, results: dict):
# try:
# output_df = df.copy()
# output_df["sentiment_score"] = pd.Series(results)
# output_df.to_csv(self.output_file, index=False)
# print(f"Results saved to {self.output_file}")
# except Exception as e:
# print(f"Error saving results: {str(e)}")
#def main():
# # Initializing processor with optimized parameters
# processor = DialogueProcessor(
# input_file="/Users/will5206/Desktop/JUST_TEXT_LINES_THE_OFFICE - the-office-lines-csv.csv",
# output_file="processed-sentiment-lines-59909.csv",
# batch_size=500, # Process 500 lines at once
# n_workers=4 # Use 4 parallel workers
# )
# # Process the file
# processor.process_file()
#if __name__ == "__main__":
# main()
To reiterate, I ran this code locally on my machine and then uploaded the output CSV file to my Google Drive, which I will now load into a pandas dataframe.
The output CSV file contains two columns: line_text (the dialogue) and sentiment_score, the newly created column which now houses the sentiment scores for every line_text value.
# Loading the output file from the openai-api-line-processing script
openai_processed_lines = pd.read_csv("/content/drive/MyDrive/processed-sentiment-lines-59909.csv")
openai_processed_lines.drop("line_text", axis=1, inplace=True)
openai_processed_lines.head()
 | sentiment_score |
---|---|
0 | 0.1 |
1 | -0.8 |
2 | -0.6 |
3 | -0.4 |
4 | -0.2 |
Since I still have the original transcripts dataframe ('df'), I will delete the VADER sentiment score column and replace it with the new sentiment_score column from the newly created openai_processed_lines data frame.
# Dropping the old 'sentiment' column
df.drop("sentiment", axis=1, inplace=True)
# Add 'sentiment_score' from 'openai_processed_lines' to 'df' based on row index
df['sentiment_score'] = openai_processed_lines['sentiment_score']
df.head()
# One thing I forgot to do in Milestone 1 was to remove all rows of scenes that were deleted from the show as indicated
# by a value of 'TRUE' in the 'deleted' column.
df = df[df['deleted'] != True]
Now that we have new sentiment scores, I'll re-run some initial EDA.
# Grouping by season and calculate the average sentiment score for each season.
new_sentiment_by_season = df.groupby('season')['sentiment_score'].mean()
print(new_sentiment_by_season)
season
1    0.044336
2   -0.000909
3    0.012983
4   -0.008380
5   -0.027901
6   -0.021494
7    0.024541
8    0.009646
9    0.024871
Name: sentiment_score, dtype: float64
The mean sentiment values for each season seem to be even more neutral (scores closer to zero) when using OpenAI's API to score sentiment compared to using VADER.
After looking through many lines of dialogue and their respective GPT-4o scores, I found issues similar to those in VADER's scoring.
Ultimately, I think that scoring individual text dialogue values in line_text for sentiment is likely quite difficult to do with high accuracy. I wonder if this is in part because I am scoring dialogue without regard for the context in which it appears. I will save this problem for another day and continue to use the new sentiment scores from OpenAI's API.
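If I do revisit this, one idea would be to score each line together with the few lines that precede it in the same scene. The helper below is purely a hypothetical sketch of how that context string could be built from the existing columns; it is not something I am running for this milestone.
# Hypothetical sketch: build a small context window (previous lines in the same scene)
# that could be sent along with each line for context-aware sentiment scoring.
def build_context(frame, row_idx, n_context=3):
    row = frame.loc[row_idx]
    scene_lines = frame[(frame['season'] == row['season']) &
                        (frame['episode'] == row['episode']) &
                        (frame['scene'] == row['scene'])]
    prior = scene_lines.loc[:row_idx].iloc[:-1].tail(n_context)
    context = " / ".join(f"{s}: {t}" for s, t in zip(prior['speaker'], prior['line_text']))
    return f"[Context: {context}] {row['speaker']}: {row['line_text']}"
# Example usage (index 3 is Jim's "Actually, you called me in here, but yeah."):
# print(build_context(df, 3))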
import matplotlib.pyplot as plt
plt.plot(new_sentiment_by_season.index, new_sentiment_by_season.values, marker='o')
plt.xlabel('Season')
plt.ylabel('Average Sentiment Score')
plt.title('Average Sentiment Across Seasons of The Office')
plt.grid(True)
plt.show()
There seems to be more variation here in comparison to this same graph as produced by VADER's scores.
I'm now going to re-run the rest of my initial EDA, but with the new sentiment_score column values.
# 1. Number of lines per season
lines_per_season = df.groupby('season')['line_text'].count()
# 2. Total number of unique characters
unique_characters = df['speaker'].nunique()
# 3. Top 5 characters with the most lines
top_characters = df['speaker'].value_counts().head(5)
# 4. Average sentiment score by character (top 5)
top_characters_sentiment = df[df['speaker'].isin(top_characters.index)].groupby('speaker')['sentiment_score'].mean()
# 5. Distribution of sentiment scores (histogram).
import matplotlib.pyplot as plt
plt.hist(df['sentiment_score'], bins=20, edgecolor='black')
plt.xlabel('Sentiment Score')
plt.ylabel('Frequency')
plt.title('Distribution of Sentiment Scores for Lines in The Office')
plt.grid(True)
plt.show()
lines_per_season, unique_characters, top_characters, top_characters_sentiment
(season
1    1536
2    6051
3    7448
4    5642
5    8170
6    7630
7    7302
8    7083
9    7111
Name: line_text, dtype: int64,
790,
speaker
Michael    11574
Dwight      7167
Jim         6609
Pam         5205
Andy        3968
Name: count, dtype: int64,
speaker
Andy       0.027235
Dwight    -0.048481
Jim        0.045383
Michael   -0.006531
Pam        0.043709
Name: sentiment_score, dtype: float64)
For EDA stats 1-4 from the previous cell, I'm going to create a graph of each for easy visualization.
# 1. Number of lines per season
plt.figure(figsize=(8, 6))
lines_per_season.plot(kind='bar', color='skyblue', edgecolor='black')
plt.xlabel('Season')
plt.ylabel('Number of Lines')
plt.title('Number of Lines per Season')
plt.grid(axis='y')
plt.show()
# 2. Total number of unique characters (displayed as a simple text output)
print(f"Total number of unique characters: {unique_characters}")
# 3. Top 5 characters with the most lines
plt.figure(figsize=(8, 6))
top_characters.plot(kind='bar', color='salmon', edgecolor='black')
plt.xlabel('Character')
plt.ylabel('Number of Lines')
plt.title('Top 5 Characters with the Most Lines')
plt.grid(axis='y')
plt.show()
# 4. Average sentiment score by character (top 5)
plt.figure(figsize=(8, 6))
top_characters_sentiment.plot(kind='bar', color='lightgreen', edgecolor='black')
plt.xlabel('Character')
plt.ylabel('Average Sentiment Score')
plt.title('Average Sentiment Score by Character (Top 5)')
plt.grid(axis='y')
plt.show()
Total number of unique characters: 790
My comments on these EDA graphs:
Number of lines per season: This looks correct. Season 1 had only a fraction of the number of episodes that the other seasons had.
Total number of unique characters (displayed as a simple text output): This wasn't a graph, rather just an interesting stat.
Top 5 characters with the most lines: As someone who has seen the show, this makes sense. Michael, Dwight, and Jim have the most screen time.
Average sentiment score by character (top 5): Now this is an interesting graph. Dwight's average sentiment score is the lowest (and negative) among the top five characters, and yet he is one of the most liked, if not the most liked, characters in the whole show.
Distribution of Sentiment Scores for Lines in The Office (located in the previous code cell output): It looks like there is a bit more variation in the sentiment score versus frequency graph in comparison to that produced with VADER scores, but the overall trend is very similar.
Since the mean sentiment score by season seems to be pretty neutral throughout all the seasons, I want to take a look at a plot of sentiment score averages by episode.
# Grouping by season and episode, then calculate the mean sentiment score for each episode
sentiment_by_episode = df.groupby(['season', 'episode'])['sentiment_score'].mean().reset_index()
# Plotting the average sentiment score by episode
plt.figure(figsize=(12, 6))
plt.plot(sentiment_by_episode.index, sentiment_by_episode['sentiment_score'], marker='o', linestyle='-', color='blue')
plt.xlabel('Episode Index (by Season)')
plt.ylabel('Average Sentiment Score')
plt.title('Average Sentiment Score by Episode in The Office')
plt.grid(True)
plt.show()
Now that I've finished running EDA on the transcripts dataframe with the new sentiment values assigned to each dialogue instance, I can begin Milestone 2.
# Since we're introducing a new dataset into the mix, I'm going to give 'df', the transcripts dataframe,
# a more specific name to keep the dataframes easy to distinguish
transcripts_df = df
transcripts_df.head()
 | season | episode | scene | line_text | speaker | deleted | sentiment_score |
---|---|---|---|---|---|---|---|
0 | 1 | 1 | 1 | All right Jim. Your quarterlies look very good... | Michael | False | 0.1 |
1 | 1 | 1 | 1 | Oh, I told you. I couldn't close it. So... | Jim | False | -0.8 |
2 | 1 | 1 | 1 | So you've come to the master for guidance? Is ... | Michael | False | -0.6 |
3 | 1 | 1 | 1 | Actually, you called me in here, but yeah. | Jim | False | -0.4 |
4 | 1 | 1 | 1 | All right. Well, let me show you how it's done. | Michael | False | -0.2 |
For Milestone 2, I'm going to scrape individual episode ratings from IMDB's website.
The IMDb ratings data offers insight into viewer reception and popularity for each episode of The Office across its nine seasons. By analyzing IMDb ratings alongside sentiment data from episode dialogues, we can explore potential relationships between viewer ratings and the emotional tone of the show (note: these relationships would be more accurate if the sentiment scores were more accurate, but alas). For example, I can investigate whether episodes with more positive sentiment scores (based on dialogue) tend to receive higher ratings, suggesting a correlation between upbeat or humorous tones and viewer preference. Additionally, variations in ratings could also reflect audience responses to shifts in character development, storyline focus, or humor style.
In addressing the broader question of how sentiment evolved throughout The Office, IMDb ratings provide a quantitative measure of audience engagement and satisfaction. Understanding if higher-rated episodes coincide with specific emotional tones can provide insights into what resonated most with viewers, potentially explaining why certain seasons or episodes were more impactful or memorable.
After exploring their site, I found a "Ratings" page for the show The Office on the IMDB website at the following link: https://www.imdb.com/title/tt0386676/ratings/
There is a nice table that displays all episodes and their ratings. The rows are the seasons (in order) and the columns are episode numbers (in order).
I will now scrape the site and put the episode ratings into a new pandas dataframe.
import requests
from bs4 import BeautifulSoup
import pandas as pd
# imdb site url containing episode ratings
url = "https://www.imdb.com/title/tt0386676/ratings/"
# Sending request to the website
response = requests.get(url)
response.status_code
403
I'm getting a status code of 403 so I will try specifying an agent header.
# imdb site url containing episode ratings
url = "https://www.imdb.com/title/tt0386676/ratings/"
# headers to mimic a browser request
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36",
"Accept-Language": "en-US,en;q=0.9",
"Referer": "https://www.google.com"
}
# Sending request to the website
response = requests.get(url, headers=headers)
response.status_code
200
This worked! Now to scrape the site.
soup = BeautifulSoup(response.content, 'html.parser')
# Lists to store extracted data
seasons = []
episodes = []
ratings = []
# Find all elements with the `a` tag containing the season, episode, and rating information in `aria-label`
for rating_div in soup.select('td.ratings-heatmap__table-data a[aria-label]'):
# Extract the aria-label attribute
aria_label = rating_div['aria-label']
# Parsing the information: "Season X Episode Y, Rating Z.Z"
parts = aria_label.split(", ")
if len(parts) == 2:
season_episode, rating = parts
season = int(season_episode.split(" ")[1])
episode = int(season_episode.split(" ")[3])
rating_value = float(rating.split(" ")[1])
# Append extracted data to lists
seasons.append(season)
episodes.append(episode)
ratings.append(rating_value)
# Create a DataFrame from the lists
imdb_episode_ratings_df = pd.DataFrame({
'Season': seasons,
'Episode': episodes,
'Rating': ratings
})
# verifying the table - there IS an issue! See next text cell.
imdb_episode_ratings_df.head()
 | Season | Episode | Rating |
---|---|---|---|
0 | 1 | 1 | 8.1 |
1 | 1 | 2 | 7.3 |
2 | 1 | 3 | 7.6 |
3 | 1 | 4 | 7.8 |
4 | 1 | 5 | 8.2 |
After closely examining the scraped table, the structure seems correct, but it did NOT collect all of the data in the table on the IMDB page. I discovered that this is because there is a button (right arrow button) that you have to press in order to see the rest of the ratings since they are cut off. Go to the link and see for yourself (the table is halfway down the page): https://www.imdb.com/title/tt0386676/ratings/
Because of this, I'm going to try to use Selenium to "click" the button for me so I can retrieve the rest of the data.
Update: I tried using Selenium but ran into some issues, so I'm going to manually insert the rest of the data. Yikes.
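For reference, here is a rough sketch of the Selenium approach I attempted (commented out like the API script above, since it isn't run in this notebook). The CSS selector for the right-arrow button is a placeholder rather than IMDb's real class name, and locating that element reliably is where I ran into trouble.
#from selenium import webdriver
#from selenium.webdriver.common.by import By
#from selenium.webdriver.support.ui import WebDriverWait
#from selenium.webdriver.support import expected_conditions as EC
#driver = webdriver.Chrome()
#driver.get("https://www.imdb.com/title/tt0386676/ratings/")
## Keep clicking the right-arrow button until it no longer appears, then grab the full page HTML
#while True:
#    try:
#        next_button = WebDriverWait(driver, 5).until(
#            EC.element_to_be_clickable((By.CSS_SELECTOR, "button.next-arrow"))  # placeholder selector
#        )
#        next_button.click()
#    except Exception:
#        break
#html = driver.page_source
#driver.quit()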
missing_episodes = [
{'Season': 2, 'Episode': 16, 'Rating': 8.1},
{'Season': 2, 'Episode': 17, 'Rating': 8.3},
{'Season': 2, 'Episode': 18, 'Rating': 8.2},
{'Season': 2, 'Episode': 19, 'Rating': 8.0},
{'Season': 2, 'Episode': 20, 'Rating': 8.2},
{'Season': 2, 'Episode': 21, 'Rating': 8.6},
{'Season': 2, 'Episode': 22, 'Rating': 9.3},
{'Season': 3, 'Episode': 16, 'Rating': 8.8},
{'Season': 3, 'Episode': 17, 'Rating': 8.3},
{'Season': 3, 'Episode': 18, 'Rating': 8.9},
{'Season': 3, 'Episode': 19, 'Rating': 8.6},
{'Season': 3, 'Episode': 20, 'Rating': 8.5},
{'Season': 3, 'Episode': 21, 'Rating': 8.6},
{'Season': 3, 'Episode': 22, 'Rating': 9.0},
{'Season': 3, 'Episode': 23, 'Rating': 9.2},
{'Season': 5, 'Episode': 16, 'Rating': 7.8},
{'Season': 5, 'Episode': 17, 'Rating': 8.5},
{'Season': 5, 'Episode': 18, 'Rating': 8.1},
{'Season': 5, 'Episode': 19, 'Rating': 8.2},
{'Season': 5, 'Episode': 20, 'Rating': 8.1},
{'Season': 5, 'Episode': 21, 'Rating': 8.5},
{'Season': 5, 'Episode': 22, 'Rating': 8.5},
{'Season': 5, 'Episode': 23, 'Rating': 9.2},
{'Season': 5, 'Episode': 24, 'Rating': 8.1},
{'Season': 5, 'Episode': 25, 'Rating': 8.7},
{'Season': 5, 'Episode': 26, 'Rating': 8.9},
{'Season': 6, 'Episode': 16, 'Rating': 7.8},
{'Season': 6, 'Episode': 17, 'Rating': 8.3},
{'Season': 6, 'Episode': 18, 'Rating': 8.3},
{'Season': 6, 'Episode': 19, 'Rating': 7.5},
{'Season': 6, 'Episode': 20, 'Rating': 7.6},
{'Season': 6, 'Episode': 21, 'Rating': 8.4},
{'Season': 6, 'Episode': 22, 'Rating': 7.6},
{'Season': 6, 'Episode': 23, 'Rating': 7.7},
{'Season': 6, 'Episode': 24, 'Rating': 8.0},
{'Season': 6, 'Episode': 25, 'Rating': 7.6},
{'Season': 6, 'Episode': 26, 'Rating': 7.8},
{'Season': 7, 'Episode': 16, 'Rating': 9.3},
{'Season': 7, 'Episode': 17, 'Rating': 7.3},
{'Season': 7, 'Episode': 18, 'Rating': 9.3},
{'Season': 7, 'Episode': 19, 'Rating': 7.5},
{'Season': 7, 'Episode': 20, 'Rating': 8.9},
{'Season': 7, 'Episode': 21, 'Rating': 9.8},
{'Season': 7, 'Episode': 22, 'Rating': 7.4},
{'Season': 7, 'Episode': 23, 'Rating': 8.5},
{'Season': 7, 'Episode': 24, 'Rating': 8.5},
{'Season': 8, 'Episode': 16, 'Rating': 7.9},
{'Season': 8, 'Episode': 17, 'Rating': 7.6},
{'Season': 8, 'Episode': 18, 'Rating': 7.6},
{'Season': 8, 'Episode': 19, 'Rating': 6.3},
{'Season': 8, 'Episode': 20, 'Rating': 6.8},
{'Season': 8, 'Episode': 21, 'Rating': 6.7},
{'Season': 8, 'Episode': 22, 'Rating': 6.8},
{'Season': 8, 'Episode': 23, 'Rating': 7.4},
{'Season': 8, 'Episode': 24, 'Rating': 7.5},
{'Season': 9, 'Episode': 16, 'Rating': 7.7},
{'Season': 9, 'Episode': 17, 'Rating': 7.3},
{'Season': 9, 'Episode': 18, 'Rating': 7.7},
{'Season': 9, 'Episode': 19, 'Rating': 7.8},
{'Season': 9, 'Episode': 20, 'Rating': 7.8},
{'Season': 9, 'Episode': 21, 'Rating': 9.0},
{'Season': 9, 'Episode': 22, 'Rating': 9.4},
{'Season': 9, 'Episode': 23, 'Rating': 9.8},
]
# Converting the list of missing episodes to a DataFrame
missing_df = pd.DataFrame(missing_episodes)
# Concatenate the existing and missing data
imdb_episode_ratings_df = pd.concat([imdb_episode_ratings_df, missing_df], ignore_index=True)
# Sort by Season and Episode for a cleaner view.
imdb_episode_ratings_df = imdb_episode_ratings_df.sort_values(by=['Season', 'Episode']).reset_index(drop=True)
imdb_episode_ratings_df.head()
 | Season | Episode | Rating |
---|---|---|---|
0 | 1 | 1 | 8.1 |
1 | 1 | 2 | 7.3 |
2 | 1 | 3 | 7.6 |
3 | 1 | 4 | 7.8 |
4 | 1 | 5 | 8.2 |
# Checking dtypes
imdb_episode_ratings_df.dtypes
 | 0 |
---|---|
Season | int64 |
Episode | int64 |
Rating | float64 |
Now that I have this new episode ratings data table, I'm going to merge it into the main transcripts_df.
# First to make column names match between dataframes for clean merging
transcripts_df.rename(columns={'season': 'Season', 'episode': 'Episode'}, inplace=True)
# Now to merge the transcripts_df with imdb_episode_ratings_df on 'Season' and 'Episode' columns
merged_df = pd.merge(transcripts_df, imdb_episode_ratings_df, on=['Season', 'Episode'], how='left')
# verify the table
merged_df.head()
#merged_df.iloc[644]
#merged_df.iloc[644:680]
 | Season | Episode | scene | line_text | speaker | deleted | sentiment_score | Rating |
---|---|---|---|---|---|---|---|---|
0 | 1 | 1 | 1 | All right Jim. Your quarterlies look very good... | Michael | False | 0.1 | 8.1 |
1 | 1 | 1 | 1 | Oh, I told you. I couldn't close it. So... | Jim | False | -0.8 | 8.1 |
2 | 1 | 1 | 1 | So you've come to the master for guidance? Is ... | Michael | False | -0.6 | 8.1 |
3 | 1 | 1 | 1 | Actually, you called me in here, but yeah. | Jim | False | -0.4 | 8.1 |
4 | 1 | 1 | 1 | All right. Well, let me show you how it's done. | Michael | False | -0.2 | 8.1 |
Now that the main transcripts dataframe has both sentiment data as well as IMDb ratings for each episode, I want to start the exploratory data analysis.
I want to start by investigating the relationship between average sentiment score per episode and episode ratings to see if episodes with higher sentiment scores tend to have higher IMDb ratings.
To do this I'm going to calculate the average sentiment score per episode and compare it to the IMDb rating for that episode. This will help to see if there is a correlation between positive sentiment in an episode's lines and its popularity/viewer reception.
# Calculate average sentiment score per episode
average_sentiment_per_episode = merged_df.groupby(['Season', 'Episode']).sentiment_score.mean().reset_index()
average_sentiment_per_episode = average_sentiment_per_episode.rename(columns={'sentiment_score': 'Average_Sentiment_Score'})
# Merge with IMDb ratings
sentiment_rating_df = pd.merge(average_sentiment_per_episode, imdb_episode_ratings_df, on=['Season', 'Episode'])
# Plot Average Sentiment Score vs. IMDb Rating
plt.figure(figsize=(10, 6))
plt.scatter(sentiment_rating_df['Average_Sentiment_Score'], sentiment_rating_df['Rating'], alpha=0.6, color='b')
plt.title('Average Sentiment Score vs. IMDb Episode Rating')
plt.xlabel('Average Sentiment Score')
plt.ylabel('IMDb Rating')
plt.grid(True)
plt.show()
Immediately, I'm not seeing any strong correlation. Let's calculate the correlation.
# Calculate the Pearson correlation coefficient between Average Sentiment Score and IMDb Rating
correlation = sentiment_rating_df['Average_Sentiment_Score'].corr(sentiment_rating_df['Rating'])
correlation
0.0702305298487174
This Pearson correlation of 0.07 suggests only a very weak positive relationship between sentiment scores and IMDb ratings: episodes with higher average sentiment scores tend to have only marginally higher ratings, and the relationship is not strong or particularly predictive.
However, I do want to reiterate that the original sentiment scores are likely not highly accurate — but I'm continuing with the investigation for the purpose of doing this hands-on project.
Given the weak correlation found between sentiment and episode ratings, I want to examine if there might be other patterns or subgroup behaviors within the data that provide context or additional insights into viewer ratings. Rather than abandoning the overall question, I'm going to pivot to consider additional influences on ratings, such as how sentiment and rating patterns may vary across seasons or character-specific contributions to sentiment in high- versus low-rated episodes.
Exploring these might allow us to identify whether particular seasons or key character arcs contribute to notable sentiment shifts or episode ratings, potentially offering deeper context to the relationship between sentiment and viewer ratings.
In this second analysis, I'm going to examine how sentiment and rating patterns vary across seasons. First, I'm going to aggregate the episode-level data to the season level and normalize both the average sentiment score and the average rating to a common scale (between 0 and 1) before plotting them together. This way, we can see relative trends rather than directly comparing raw values.
from sklearn.preprocessing import MinMaxScaler
# Build the season-level table used below: average episode sentiment and average rating per season
seasonal_sentiment_ratings = sentiment_rating_df.groupby('Season').agg(
    avg_sentiment_score=('Average_Sentiment_Score', 'mean'),
    avg_rating=('Rating', 'mean')
)
# Initialize the scaler
scaler = MinMaxScaler()
# Normalize avg_sentiment_score and avg_rating
seasonal_sentiment_ratings[['avg_sentiment_score', 'avg_rating']] = scaler.fit_transform(
    seasonal_sentiment_ratings[['avg_sentiment_score', 'avg_rating']]
)
# Plot the normalized values
plt.figure(figsize=(10, 6))
plt.plot(seasonal_sentiment_ratings.index, seasonal_sentiment_ratings['avg_sentiment_score'], label="Normalized Average Sentiment Score", marker='o')
plt.plot(seasonal_sentiment_ratings.index, seasonal_sentiment_ratings['avg_rating'], label="Normalized Average Rating", marker='o', color="orange")
plt.xlabel("Season")
plt.ylabel("Normalized Score")
plt.title("Normalized Average Sentiment Score and Rating per Season")
plt.legend()
plt.show()
Given that the normalized sentiment and rating scores appear to move similarly across seasons, I'm going to quantify their relationship using correlation on the normalized values. This will provide a measure of the strength and direction of the association between average sentiment and rating trends, even though they were originally on different scales.
# correlation between the normalized avg_sentiment_score and avg_rating
correlation = seasonal_sentiment_ratings['avg_sentiment_score'].corr(seasonal_sentiment_ratings['avg_rating'])
print(f"The correlation between normalized average sentiment scores and ratings is: {correlation}")
The correlation between normalized average sentiment scores and ratings is: -0.23883641950296838
This negative correlation of -0.24 suggests a slight inverse relationship between the sentiment scores and episode ratings across seasons. This implies that as the sentiment score increases, the ratings tend to decrease slightly, though the relationship isn’t strong.
Since these first two explorations revealed only weak correlations between overall sentiment and episode ratings, it's worth examining sentiment at a more granular level. Specifically, I'm going to shift focus to investigate whether the sentiment scores of key characters are more closely related to episode ratings. The rationale here is that individual characters often evoke distinct emotional responses from viewers, and these responses may impact how episodes are rated. By analyzing the average sentiment scores for major characters in each episode and comparing them to ratings, we can explore whether characters' emotional tones might be linked to viewer reception.
For this third analysis, I'll calculate the average sentiment score for a selection of main characters (Michael, Jim, and Dwight) per episode. I'll then plot these averages against episode ratings and compute correlations to assess any potential relationships.
# top 3 characters by line count
top_characters = transcripts_df['speaker'].value_counts().nlargest(3).index
# Filter merged_df to only include these top characters
filtered_df = merged_df[merged_df['speaker'].isin(top_characters)].copy()  # .copy() avoids SettingWithCopyWarning when adding columns below
# normalize sentiment and rating
filtered_df['normalized_sentiment'] = (filtered_df['sentiment_score'] - filtered_df['sentiment_score'].mean()) / filtered_df['sentiment_score'].std()
filtered_df['normalized_rating'] = (filtered_df['Rating'] - filtered_df['Rating'].mean()) / filtered_df['Rating'].std()
# Group by character and season to get average normalized sentiment and rating
character_season_avg = filtered_df.groupby(['speaker', 'Season']).agg({
'normalized_sentiment': 'mean',
'normalized_rating': 'mean'
}).reset_index()
import matplotlib.pyplot as plt
# separate plots for each character
for character in top_characters:
character_data = character_season_avg[character_season_avg['speaker'] == character]
plt.figure(figsize=(10, 6))
plt.plot(character_data['Season'], character_data['normalized_sentiment'], marker='o', label='Sentiment')
plt.plot(character_data['Season'], character_data['normalized_rating'], marker='s', label='Rating', linestyle='--')
plt.xlabel('Season')
plt.ylabel('Normalized Scores')
plt.title(f'Normalized Sentiment and Rating for {character} Over Seasons')
plt.legend()
plt.show()
I'm now going to calculate the correlations between normalized sentiment and ratings for each of the top characters.
character_correlations = {}
# Calculate correlation for each character
for character in top_characters:
character_data = character_season_avg[character_season_avg['speaker'] == character]
# Calculate Pearson correlation between normalized sentiment and normalized rating
correlation = character_data['normalized_sentiment'].corr(character_data['normalized_rating'])
character_correlations[character] = correlation
print(f"Correlation for {character}: {correlation:.2f}")
# results
character_correlations
Correlation for Michael: 0.87
Correlation for Dwight: -0.19
Correlation for Jim: -0.19
{'Michael': 0.8678056678330158, 'Dwight': -0.18761254430475763, 'Jim': -0.19424055657114186}
Michael (0.87): The high positive correlation suggests a strong association between Michael's season-level sentiment and episode ratings. When Michael's sentiment scores are high, episode ratings tend to be higher as well.
Dwight (-0.19) and Jim (-0.19): The weak negative correlations for Dwight and Jim imply little meaningful relationship between their sentiment scores and episode ratings.
Again, I wish I had highly accurate sentiment scores, but alas.
For this fourth analysis, I'm going to focus on understanding the overall distribution and trends within the ratings data to provide insights into what drives episode popularity independently of sentiment.
This includes the following:
Distribution of Ratings Across All Episodes: I'll plot a histogram of episode ratings to see the distribution shape, revealing if ratings are skewed towards higher or lower values.
Seasonal Trends in Ratings: I'll calculate the average rating for each season and plot it over time to observe any overarching trends. This will help to see if ratings generally increased or decreased as the show progressed, independent of character sentiment or dialogue content.
Standard Deviation and Range of Ratings: Summary statistics like the mean, standard deviation, and range of ratings will help quantify the overall consistency or variability in episode popularity.
This will allow us to determine if there are specific patterns in viewer ratings across seasons that could suggest other underlying factors influencing episode reception.
import seaborn as sns
# 1. Distribution of ratings across all episodes
plt.figure(figsize=(10, 6))
sns.histplot(imdb_episode_ratings_df['Rating'], bins=15, kde=True)
plt.title("Distribution of Episode Ratings")
plt.xlabel("Episode Rating")
plt.ylabel("Frequency")
plt.show()
# 2. Seasonal average ratings trend
season_avg_ratings = imdb_episode_ratings_df.groupby('Season')['Rating'].mean()
plt.figure(figsize=(10, 6))
season_avg_ratings.plot(kind='line', marker='o')
plt.title("Average Episode Rating by Season")
plt.xlabel("Season")
plt.ylabel("Average Rating")
plt.show()
# 3. Summary stats for ratings
rating_mean = imdb_episode_ratings_df['Rating'].mean()
rating_std = imdb_episode_ratings_df['Rating'].std()
rating_range = imdb_episode_ratings_df['Rating'].max() - imdb_episode_ratings_df['Rating'].min()
print("Summary Statistics for Episode Ratings:")
print(f"Mean Rating: {rating_mean}")
print(f"Standard Deviation: {rating_std}")
print(f"Range: {rating_range}")
Summary Statistics for Episode Ratings:
Mean Rating: 8.062234042553191
Standard Deviation: 0.6354158253729384
Range: 3.500000000000001
My comments on these summary stats for the ratings data:
Mean Rating: With an average rating of 8.06, we can see that the show consistently held high appeal — this makes sense, as everyone I know loves the show.
Standard Deviation (0.64): This relatively low standard deviation suggests that episode ratings are fairly consistent, without much variability.
Range (3.5): The range shows that the difference between the lowest- and highest-rated episodes is 3.5 points.
Given the strong correlation between Michael's sentiment and episode ratings observed at the seasonal level, a more granular analysis at the episode level could uncover more detailed patterns in how his emotional tone relates to individual episode ratings.
I want to examine the normalized sentiment of Michael's lines for each episode alongside the normalized episode ratings to assess whether fluctuations in his sentiment closely follow or impact the ratings at a finer level. This could reveal whether specific emotional tones or shifts in his character’s sentiment are associated with the most popular episodes, providing insights into his influence on viewer reception episode by episode.
# filtering the data to only include rows where Michael is the speaker
michael_df = merged_df[merged_df['speaker'] == 'Michael']
# grouping by Season and Episode to get the average sentiment score for Michael's lines per episode
michael_episode_sentiment = michael_df.groupby(['Season', 'Episode'])['sentiment_score'].mean().reset_index()
# merging Michael's episode-level sentiment with the overall episode ratings
michael_episode_sentiment = michael_episode_sentiment.merge(imdb_episode_ratings_df, on=['Season', 'Episode'])
# normalize the sentiment and rating scores for comparison.
michael_episode_sentiment['normalized_sentiment'] = (michael_episode_sentiment['sentiment_score'] - michael_episode_sentiment['sentiment_score'].mean()) / michael_episode_sentiment['sentiment_score'].std()
michael_episode_sentiment['normalized_rating'] = (michael_episode_sentiment['Rating'] - michael_episode_sentiment['Rating'].mean()) / michael_episode_sentiment['Rating'].std()
# Michael's normalized sentiment score per episode vs normalized rating plot
plt.figure(figsize=(14, 7))
plt.plot(michael_episode_sentiment['normalized_sentiment'], label='Normalized Sentiment Score (Michael)', color='blue', marker='o')
plt.plot(michael_episode_sentiment['normalized_rating'], label='Normalized Episode Rating', color='green', marker='o')
plt.xlabel('Episode (Chronological)')
plt.ylabel('Normalized Score')
plt.title("Comparison of Michael's Sentiment per Episode and Episode Ratings (Normalized)")
plt.legend()
plt.show()
It's a bit hard to see a correlation from the graph so I'm going to calculate a Pearson correlation between Michael’s normalized sentiment per episode and the normalized episode ratings.
# Pearson correlation between Michael's normalized sentiment and normalized episode ratings
correlation = michael_episode_sentiment['normalized_sentiment'].corr(michael_episode_sentiment['normalized_rating'])
print(f"The Pearson correlation between Michael's normalized sentiment per episode and the normalized episode ratings is: {correlation}")
The Pearson correlation between Michael's normalized sentiment per episode and the normalized episode ratings is: 0.1486110479217692
This positive correlation of 0.15 suggests a slight (and not very strong) relationship between Michael's sentiment per episode and the corresponding episode ratings.
This episode-level correlation shows some consistency with our earlier findings. Ultimately, the relatively low correlation also points to the likelihood that episode ratings are influenced by a broader range of factors beyond just sentiment from Michael’s lines, warranting further multi-character or narrative-focused analyses — Again, I wish I had a more accurate sentiment scoring mechanism because I don't fully trust any of these analyses.
I'm curious to explore whether there is a correlation between the average sentiment score per scene and episode ratings. This could help identify whether certain scenes, perhaps intense or emotionally positive/negative ones, correlate with overall episode popularity. We might also be able to see if episodes whose scenes carry higher sentiment tend to be rated higher overall, which could reveal the impact of specific scene dynamics on viewer reception.
# grouping by Season, Episode, and Scene to calculate the average sentiment score per scene.
scene_sentiment_df = transcripts_df.groupby(['Season', 'Episode', 'scene']).agg({
'sentiment_score': 'mean'
}).reset_index()
# calculating the average scene sentiment per episode.
episode_scene_sentiment_df = scene_sentiment_df.groupby(['Season', 'Episode']).agg({
'sentiment_score': 'mean'
}).rename(columns={'sentiment_score': 'avg_scene_sentiment'}).reset_index()
# merging with the main dataframe containing episode ratings
merged_scene_ratings_df = pd.merge(
imdb_episode_ratings_df,
episode_scene_sentiment_df,
how='inner',
left_on=['Season', 'Episode'],
right_on=['Season', 'Episode']
)
# Calculate the correlation between average scene sentiment and episode rating
scene_rating_correlation = merged_scene_ratings_df['avg_scene_sentiment'].corr(merged_scene_ratings_df['Rating'])
print(f"Correlation between average scene sentiment and episode rating: {scene_rating_correlation}")
Correlation between average scene sentiment and episode rating: 0.1673474050295768
This correlation (0.17) is not strong enough to convince me of any meaningful relationship between average scene sentiment and episode rating.
Another potential correlation I want to explore is average sentiment score by character in each episode in relation to the respective episode ratings. This analysis would help investigate if certain characters, depending on their sentiment in an episode (e.g., Michael, Jim, Dwight, or others), have an impact on the episode’s popularity.
# Filter for main characters (e.g., Michael, Jim, Dwight) to analyze their sentiment individually
key_characters = ['Michael', 'Jim', 'Dwight']
# filter dataframe for only the key characters
character_sentiment_df = transcripts_df[transcripts_df['speaker'].isin(key_characters)]
# finding the average sentiment score for each character per episode
character_avg_sentiment_df = character_sentiment_df.groupby(['Season', 'Episode', 'speaker']).agg({
'sentiment_score': 'mean'
}).reset_index()
# Pivot to create separate columns for each character's sentiment
character_avg_sentiment_pivot = character_avg_sentiment_df.pivot(
index=['Season', 'Episode'],
columns='speaker',
values='sentiment_score'
).reset_index()
# rename columns for clarity
character_avg_sentiment_pivot.columns.name = None # Remove pivoted level name
character_avg_sentiment_pivot = character_avg_sentiment_pivot.rename(
columns={'Michael': 'Michael_sentiment', 'Jim': 'Jim_sentiment', 'Dwight': 'Dwight_sentiment'}
)
# merge with episode ratings dataframe
merged_character_ratings_df = pd.merge(
imdb_episode_ratings_df,
character_avg_sentiment_pivot,
how='inner',
on=['Season', 'Episode']
)
# Calculate correlation for each character's sentiment with episode rating
michael_rating_corr = merged_character_ratings_df['Michael_sentiment'].corr(merged_character_ratings_df['Rating'])
jim_rating_corr = merged_character_ratings_df['Jim_sentiment'].corr(merged_character_ratings_df['Rating'])
dwight_rating_corr = merged_character_ratings_df['Dwight_sentiment'].corr(merged_character_ratings_df['Rating'])
print(f"Correlation between Michael's sentiment and episode rating: {michael_rating_corr}")
print(f"Correlation between Jim's sentiment and episode rating: {jim_rating_corr}")
print(f"Correlation between Dwight's sentiment and episode rating: {dwight_rating_corr}")
Correlation between Michael's sentiment and episode rating: 0.1486110479217692
Correlation between Jim's sentiment and episode rating: 0.029158344566363167
Correlation between Dwight's sentiment and episode rating: 0.07765156230632707
These correlations provide a nuanced view of how each character's sentiment might contribute to episode popularity. All three are weak, but Michael's sentiment (about 0.15) shows a noticeably stronger correlation with episode ratings than Jim's (0.03) or Dwight's (0.08).
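To eyeball these relationships rather than rely on a single number, I could plot each character's per-episode sentiment against the episode rating. This is a sketch using matplotlib and the merged_character_ratings_df built above.
import matplotlib.pyplot as plt
# one scatter panel per character: sentiment on x, episode rating on y
fig, axes = plt.subplots(1, 3, figsize=(15, 4), sharey=True)
for ax, col in zip(axes, ['Michael_sentiment', 'Jim_sentiment', 'Dwight_sentiment']):
    ax.scatter(merged_character_ratings_df[col], merged_character_ratings_df['Rating'], alpha=0.5)
    ax.set_xlabel(col)
    ax.set_title(col.replace('_sentiment', ''))
axes[0].set_ylabel('Episode rating')
plt.tight_layout()
plt.show()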
Model Idea 1: Ensemble Voting Regressor to Predict Episode Ratings Based on Michael’s Sentiment
This model would aim to assess Michael Scott's influence on episode ratings by using an ensemble approach to predict ratings based on the sentiment of his lines in each episode. Since we’ve observed a moderate correlation between Michael’s sentiment and episode ratings, the ensemble model could help capture potential nonlinear and complex interactions between Michael’s sentiment and ratings.
An Ensemble Voting Regressor could be used to predict episode ratings by combining models like linear regression, decision trees, and support vector regression. By aggregating predictions from these different models, the ensemble method can capture a range of patterns in the data that each model might individually overlook.
Features of the model:
• Michael’s normalized sentiment per episode
• Episode season and episode number (for temporal context)
• Additional features, such as average scene sentiment or interactions with other main characters like Jim and Dwight, to capture more of the episode’s dynamics.
Reason for this model choice: This would allow me to test if Michael's sentiment, especially when combined with other episode features, provides a consistent impact across models and improves predictive accuracy. This model could help confirm if Michael’s influence on sentiment and ratings aligns with audience reception and if his character has a unique effect on episode ratings compared to others.
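A minimal sketch of what this ensemble could look like with scikit-learn's VotingRegressor is shown below. It assumes the merged_character_ratings_df and episode_scene_sentiment_df dataframes built earlier; the hyperparameters (tree depth, SVR settings, number of folds) are placeholders rather than tuned values.
from sklearn.ensemble import VotingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score
# assemble the feature table: Michael's sentiment, average scene sentiment, and temporal context
model_df = pd.merge(
    merged_character_ratings_df,
    episode_scene_sentiment_df,
    how='inner',
    on=['Season', 'Episode']
)
features = ['Michael_sentiment', 'avg_scene_sentiment', 'Season', 'Episode']
model_df = model_df.dropna(subset=features + ['Rating'])
X = model_df[features]
y = model_df['Rating']
# combine a linear model, a shallow decision tree, and a scaled SVR into one voting ensemble
voting_reg = VotingRegressor(estimators=[
    ('lr', LinearRegression()),
    ('tree', DecisionTreeRegressor(max_depth=4, random_state=42)),
    ('svr', make_pipeline(StandardScaler(), SVR(kernel='rbf', C=1.0))),
])
# 5-fold cross-validated R^2 as a first sanity check on predictive power
scores = cross_val_score(voting_reg, X, y, cv=5, scoring='r2')
print(f"Mean cross-validated R^2: {scores.mean():.3f} (std {scores.std():.3f})")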
Model Idea 2: Character Sentiment Influence Model on Episode Popularity
This model would attempt to predict episode popularity (high vs. low rating, as a binary or multi-class label) based on sentiment scores for key characters like Michael, Jim, and Dwight. The model would investigate whether sentiment dynamics (especially Michael’s) are a reliable predictor of episode success by separating high- and low-rated episodes.
A classification model would be appropriate here. It would use the normalized sentiment scores for each character as features, with the episode rating thresholded to define “high” or “low” popularity. This could provide insights into which characters’ sentiments most influence episode ratings and, by extension, episode popularity.
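A sketch of this classification approach could look like the following, using a random forest and a median-rating threshold to define "high" vs. "low" popularity. The threshold choice and the specific classifier are assumptions for illustration, and the feature importances would only be a rough indication of which character's sentiment the model leans on.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
character_cols = ['Michael_sentiment', 'Jim_sentiment', 'Dwight_sentiment']
clf_df = merged_character_ratings_df.dropna(subset=character_cols + ['Rating'])
# label episodes as "high" (1) or "low" (0) popularity using the median rating as the cutoff
threshold = clf_df['Rating'].median()
y = (clf_df['Rating'] >= threshold).astype(int)
X = clf_df[character_cols]
clf = RandomForestClassifier(n_estimators=200, random_state=42)
# 5-fold cross-validated accuracy as a baseline check
scores = cross_val_score(clf, X, y, cv=5, scoring='accuracy')
print(f"Mean cross-validated accuracy: {scores.mean():.3f}")
# fit on all episodes to inspect which character's sentiment carries the most weight
clf.fit(X, y)
for name, importance in zip(character_cols, clf.feature_importances_):
    print(f"{name}: {importance:.3f}")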
%%shell
jupyter nbconvert --to html '/content/drive/MyDrive/Colab Notebooks/FinalProject2024.ipynb'