Predicting Bob’s Burgers Episode Ratings

Setup Code (Click to Expand)

# import packages
import optuna
import textwrap
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

from great_tables import GT
from plotnine import (
    aes,
    element_blank,
    element_line,
    element_text,
    facet_wrap,
    geom_bar,
    geom_histogram,
    geom_hline,
    geom_point,
    geom_smooth,
    ggplot,
    labs,
    scale_x_continuous,
    scale_fill_gradient,
    theme,
    theme_minimal,
    theme_set,
    theme_update,
)
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import (
    ExtraTreesRegressor,
    GradientBoostingRegressor,
    RandomForestRegressor,
)
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import make_scorer, root_mean_squared_error
from sklearn.model_selection import cross_val_score, KFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder

# import data
bobs_burgers = (
    pd.read_csv(
        "https://raw.githubusercontent.com/rfordatascience/tidytuesday/"
        "master/data/2024/2024-11-19/episode_metrics.csv"
    )
    .merge(
        pd.read_csv(
            "https://raw.githubusercontent.com/poncest/bobsburgersR/refs/"
            "heads/main/data-raw/IMDb_Wikipedia_Bobs_Burgers_Data_Clean.csv"
        ),
        on=["season", "episode"],
    )
    .assign(
        wikipedia_directed_by=lambda df: df["wikipedia_directed_by"]
        .str.replace(r"\bBoohwan\b(?! Lim)", "Boohwan Lim", regex=True)
        .str.replace("Brian LoSchiavo", "Brian Loschiavo")
        .str.replace("Mathew Long", "Matthew Long")
    )
    .assign(
        wikipedia_written_by=lambda df: df["wikipedia_written_by"]
        .str.replace("Teleplay by : ", "")
        .str.replace("Story by : ", " & ")
        .str.replace("Lizzie Molyneux-Logelin", "Lizzie Molyneux")
        .str.replace(r"\bLizzie\b(?! Molyneux)", "Lizzie Molyneux", regex=True)
    )
    .pipe(lambda df: df.astype({col: "category" for col in df.select_dtypes("object")}))
)

# set plot theme
theme_set(theme_minimal(base_size=11, base_family="Poppins"))
theme_update(
    panel_grid_major=element_line(color="#e5e5e5", size=1),
    panel_grid_minor=element_blank(),
    panel_spacing_x=0.01,
    panel_spacing_y=0.01,
    axis_title_x=element_text(color="#4d4d4d", margin={"t": 5}, size=11),
    axis_title_y=element_text(color="#4d4d4d", margin={"r": 5}, size=11),
    axis_text=element_text(color="#7f7f7f", size=10),
    axis_text_x=element_text(angle=30, vjust=1, hjust=0.75),
    axis_ticks=element_line(color="#e5e5e5", size=0.4),
    axis_ticks_length=1,
    legend_position="top",
    legend_title=element_blank(),
    legend_text=element_text(
        size=11, va="center", margin={"t": 5, "r": 5, "b": 0, "l": 5}
    ),
    legend_box_margin=5,
    legend_key_width=20,
    legend_key_spacing=10,
    plot_title=element_text(
        ha="left", color="black", family="Lora", size=15, margin={"t": 5, "b": 5}
    ),
    plot_subtitle=element_text(
        ha="left",
        color="#4d4d4d",
        family="Lora",
        lineheight=1.2,
        size=11.5,
        margin={"t": 5, "r": 0, "b": 10, "l": 0},
    ),
    plot_caption=element_text(
        color="#7f7f7f", size=9, ha="right", margin={"t": 10, "r": 0, "b": 0, "l": 0}
    ),
    strip_text=element_text(size=11, margin={"t": 0, "r": 0, "b": 10, "l": 0}),
)

# suppress all optuna logging outputs
optuna.logging.set_verbosity(optuna.logging.CRITICAL)
# to switch back - optuna.logging.set_verbosity(optuna.logging.INFO)

Biologists have been wrestling with the challenge of predicting the structure of proteins for over 50 years. Doing so would enable scientists to better understand diseases and accelerate the development of medications (Heaven 2022). Experimental methods for predicting protein structures can be arduous, proving inadequate for predicting the structure of the over 200 million proteins discovered in nature (Service 2020). On the other hand, the performance of computational methods fell short of the experimental. That was until DeepMind’s AlphaFold2 was released at CASP14 (Jumper et al. 2021). DeepMind combined a vast repository of training data from the Protein Data Bank with innovations in deep learning and immense computational resources, achieving landmark performance and solving one of science’s grand challenges.

AlphaFold is a testament to the potential value created by combining massive volumes of data and computational power (when incentives align and intentions are good). The right data can help us solve all kinds of problems, from predicting protein structures to less grand but no less important issues, such as predicting the IMDb user ratings of Bob’s Burgers episodes.

While the big brains at the world’s best universities are busy trying to change the world, I’m creating models no one needs or asks for. I will make a machine learning model that predicts IMDb user ratings for Bob’s Burgers episodes, using features that summarise the episode (episode and season number, date aired, writer, director, etc.) and metrics that describe the dialogue in the episode itself, seeing how far I can stretch the limited sample size.

I don’t know what the TV equivalent is to “football isn’t played on a spreadsheet,” but by the end of this post, I hope everyone is so mad they are telling me to “watch the ~~games~~ TV.”

Data Exploration

The Bob’s Burgers data comes from Week 47 of 2024’s TidyTuesday. Most of the data is originally from Steven Ponce’s bobsburgersR R package, including many self-explanatory features that describe episode details. But the TidyTuesday data also includes dialogue metrics that are not as clear. Table 1 below defines the six dialogue metrics in the dataset.

Table Code (Click to Expand)

(
    pd.DataFrame(
        {
            "Features": [
                "dialogue_density",
                "avg_length",
                "sentiment_variance",
                "unique_words",
                "question_ratio",
                "exclamation_ratio",
            ],
            "Description": [
                "The number of non-blank lines in this episode.",
                "The average number of characters (technically codepoints) per line "
                "of dialogue.",
                "The variance in the numeric AFINN sentiment of words in this episode.",
                "The number of unique lowercase words in this episode.",
                "The proportion of lines of dialogue that contain at least one "
                "question mark ('?').",
                "The proportion of lines of dialogue that contain at least one "
                "exclamation point ('!').",
            ],
        }
    ).pipe(
        lambda df: (
            GT(df).tab_source_note(source_note="Data: {bobsburgersR} (via TidyTuesday)")
        )
    )
)

Table 1: Bob’s Burgers Episode Dialogue Metrics

Features	Description
dialogue_density	The number of non-blank lines in this episode.
avg_length	The average number of characters (technically codepoints) per line of dialogue.
sentiment_variance	The variance in the numeric AFINN sentiment of words in this episode.
unique_words	The number of unique lowercase words in this episode.
question_ratio	The proportion of lines of dialogue that contain at least one question mark ('?').
exclamation_ratio	The proportion of lines of dialogue that contain at least one exclamation point ('!').
Data: {bobsburgersR} (via TidyTuesday)

There are further details about the data in the TidyTuesday repository, including the following quote:

We would like to emphasise that you should not draw conclusions about causation in the data. There are various moderating variables that affect all data, many of which might not have been captured in these datasets. As such, our suggestion is to use the data provided to practice your data tidying and plotting techniques, and to consider for yourself what nuances might underlie these relationships.

They are not wrong! If I was doing this properly, I’d spend more time trying to understand the data, including extracting more information from the dialogue. I’d also spend some time thinking about what moderating variables they are referring to and including those where possible. While the results might not be groundbreaking, I’m sure the model would be more precise. Instead, I’m just having fun. I’m a silly goose.

I should still do some exploratory analysis first. I’m not a complete monster. Below, we plot the outcome distribution using a colour gradient scale to represent seasons. IMDb ratings are approximately normally distributed, with most values between 7 and 8.

Plot Code (Click to Expand)

(
    ggplot(bobs_burgers, aes("rating", fill="factor(season)"))
    + geom_bar()
    + geom_hline(yintercept=0, colour="#343a40")
    + labs(
        title="Distribution of Bob's Burgers Episode Ratings by Season",
        subtitle=(
            textwrap.fill(
                "The distribution of IMDb user ratings for Bob's Burgers episodes, "
                "with a gradient scale representing season (blue = earlier seasons; "
                "grey = later seasons). The earlier seasons of the show are generally "
                "favoured.",
                width=92,
            )
        ),
        x="IMDb Rating",
        y="",
        caption="Visualisation: Paul Johnson | Data: {bobsburgersR} (via TidyTuesday)",
    )
    + scale_fill_gradient(low="#026E99", high="#D2D2D2", guide=None)
)

The earlier seasons (blue) are generally higher-rated, though some of the highest-rated episodes come from the later seasons (grey). Five episodes have ratings of 9+: two in season 6, two in season 13, and one in season 14. The highest-rated episode, The Plight Before Christmas (S13E10), gets a 9.6.

We can also look at the correlation between all the numeric features in the dataset and the target, IMDb ratings, in Table 2.

Table Code (Click to Expand)

(
    bobs_burgers.select_dtypes(include="number")
    .drop(columns="rating")
    .corrwith(bobs_burgers["rating"])
    .sort_values(ascending=False)
    .reset_index()
    .rename(columns={0: "Correlation", "index": "Feature"})
    .pipe(
        lambda df: GT(df, rowname_col="Feature")
        .fmt_number(columns="Correlation", decimals=2)
        .cols_align(columns="Correlation", align="center")
        .tab_source_note(source_note="Data: {bobsburgersR} (via TidyTuesday)")
        .tab_options(table_width="100%")
    )
)

Table 2: Correlations Between Numeric Features & IMDb User Ratings

	Correlation
wikipedia_viewers	0.31
exclamation_ratio	0.28
dialogue_density	0.17
question_ratio	0.11
sentiment_variance	0.09
avg_length	0.07
episode	−0.06
unique_words	−0.15
season	−0.39
episode_overall	−0.39
year	−0.40
Data: {bobsburgersR} (via TidyTuesday)

Several features, including a couple dialogue metrics, appear to be making minimal contributions to the variance in episode ratings. However, ratings are moderately correlated with viewers, exclamation ratio, season, the overall episode number, and the year the episode was aired.

The remarkably similar correlations between ratings and the season, overall episode, and year suggest that those three features capture the same thing—time. The plot below visualises how ratings have changed over time.

Plot Code (Click to Expand)

(
    ggplot(bobs_burgers, aes(x="episode_overall", y="rating"))
    + geom_point(size=2, stroke=0.5, fill="white", color="#343a40", shape="o")
    + geom_smooth(method="lm", size=1, se=False, color="#026E99")
    + scale_x_continuous(breaks=[0, 50, 100, 150, 200, 250])
    + labs(
        title="Bob's Burgers Episode Ratings Over Time",
        subtitle=(
            textwrap.fill(
                "Comparing how Bob's Burgers episode ratings have changed over time. "
                "There has been a steady decline in ratings, with no episode receiving "
                "an average rating below seven until season 10.",
                width=80,
            )
        ),
        x="Episode",
        y="IMDb Rating",
        caption="Visualisation: Paul Johnson | Data: {bobsburgersR} (via TidyTuesday)",
    ) 
)

While episode ratings are noisy, there is a visible decline over time. The average ratings for the first eight seasons are between 7.7 and 8, dropping in season nine (seen around 150 episodes in the above plot) and hovering around 7.5 for the remaining seasons.

At the other end of Table 2, a moderate positive correlation exists between ratings and viewership, which is visualised below.

Plot Code (Click to Expand)

(
    ggplot(bobs_burgers, aes(x="wikipedia_viewers", y="rating"))
    + geom_point(size=2, stroke=0.5, fill="white", color="#343a40", shape="o")
    + geom_smooth(method="lm", size=1, se=False, color="#026E99")
    + labs(
        title="Bob's Burgers Episode Ratings by Viewers",
        subtitle=(
            textwrap.fill(
                "Comparing the association between IMDb user ratings of Bob's Burgers "
                "episodes and US viewers when first aired. While it is noisy, there "
                "does appear to be a positive correlation between viewing figures and "
                "the episode rating.",
                width=91,
            )
        ),
        x="Viewers (Millions)",
        y="IMDb Rating",
        caption="Visualisation: Paul Johnson | Data: {bobsburgersR} (via TidyTuesday)",
    )
)

One almighty outlier pushes the x-axis out a lot further: S1E1, watched by over nine million people, almost three million more viewers than any other episode. It turns out this was the first episode of Bob’s Burgers, and I think this hints at the likely explanation for the strong correlation between viewers and ratings¹. The viewership has been steadily declining since the first season. The causal mechanism may go in the other direction (the quality of episodes declining and leading to a decline in the audience), but it’s just as likely the correlation is just capturing time effects.

Is everything a function of time? Maybe the overall episode number is the only feature we need? Table 2 might not have given us much hope in the predictive power of the dialogue metrics, but they might be our only hope if we want to include any features that are not just a variety of ways of measuring the effect of time.

The dialogue metrics are intended to describe the episode itself. The assumption is that these metrics will capture characteristics that define the episode and offer some signal about the episode quality (and, therefore, ratings). Below, each dialogue metric is plotted against IMDb ratings.

Plot Code (Click to Expand)

(
    bobs_burgers.melt(
        id_vars=["rating"],
        value_vars=[
            "dialogue_density",
            "avg_length",
            "sentiment_variance",
            "unique_words",
            "question_ratio",
            "exclamation_ratio",
        ],
        var_name="label",
        value_name="value",
    )
    .assign(
        label=lambda df: df["label"]
        .str.replace("_", " ")
        .str.title()
        .str.replace("Avg", "Average", case=False)
    )
    .pipe(
        lambda df: ggplot(df, aes(x="value", y="rating", group=1))
        + geom_point(
            size=2, stroke=0.5, alpha=0.8, fill="white", color="#343a40", shape="o"
        )
        + geom_smooth(method="lm", size=1, se=False, color="#026E99")
        + facet_wrap("~label", scales="free_x", nrow=3)
        + labs(
            title="Bob's Burgers Episode Ratings by Episode Dialogue Metrics",
            subtitle=(
                textwrap.fill(
                    "Comparing the association between Bob's Burgers IMDb user ratings "
                    "by the dialogue metrics for each episode. While all six metrics "
                    "are noisy, the exclamation ratio and the number of unique words do "
                    "appear to have correlations with the episode's rating.",
                    width=88,
                )
            ),
            x="",
            y="IMDb Rating",
            caption=(
                "Visualisation: Paul Johnson | "
                "Data: {bobsburgersR} (via TidyTuesday)"
            ),
        )
        + theme(
            figure_size=(7, 9),
            panel_spacing_x=0.02,
            panel_spacing_y=0.04,
            plot_subtitle=element_text(
                margin={"t": 5, "r": 0, "b": 25, "l": 0},
            ),
        )
    )
)

There appears to be some signal here, but it’s mostly noise. Metrics like dialogue density appear to offer nothing, but exclamation ratio and unique words have stronger associations with ratings. I can strain to convince myself there are some interesting patterns in the question ratio and sentiment variance plots, but that may be explained by motivated reasoning more than any meaningful signal in these features.

Finally, in my desperation, I turn to the categorical features. Specifically, the episode directors and writers could contain predictive value². Table 3 & Table 4 show the total episodes and average rating for each director and writer credited on at least five Bob’s Burgers episodes.

Director Ratings
Writer Ratings

Table Code (Click to Expand)

(
    bobs_burgers.assign(
        wikipedia_directed_by=lambda df: (
            df["wikipedia_directed_by"].str.split(" & ")
        )
    )
    .explode("wikipedia_directed_by")
    .groupby("wikipedia_directed_by")
    .agg(
        episodes=("wikipedia_directed_by", "size"), 
        average_rating=("rating", "mean")
        )
    .query("episodes >= 5")
    .sort_values(by="average_rating", ascending=False)
    .reset_index()
    .pipe(
        lambda df: GT(df, rowname_col="wikipedia_directed_by")
        .fmt_number(columns="average_rating", decimals=2)
        .cols_label(
            episodes="Total Episodes Directed", 
            average_rating="Average Episode Rating"
        )
        .cols_align(columns=["episodes", "average_rating"], align="center")
        .tab_source_note(source_note="Data: {bobsburgersR} (via TidyTuesday)")
        .tab_options(table_width="100%")
    )
)

Table 3: IMDb User Ratings by Director

	Total Episodes Directed	Average Episode Rating
Jennifer Coyle	19	8.07
Kyounghee Lim	15	8.00
Boohwan Lim	15	7.95
Bernard Derriman	9	7.89
Mauricio Pardo	6	7.87
Anthony Chun	8	7.86
Tyree Dillihay	32	7.86
Don MacKinnon	12	7.85
Wes Archer	7	7.77
Brian Loschiavo	23	7.76
Ian Hamilton	11	7.65
Chris Song	47	7.60
Matthew Long	18	7.53
Ryan Mattos	26	7.49
Tom Riggin	19	7.36
Simon Chong	9	7.31
Data: {bobsburgersR} (via TidyTuesday)

Table Code (Click to Expand)

(
    bobs_burgers.assign(
        wikipedia_written_by=lambda df: (
            df["wikipedia_written_by"].str.split(" & ")
        )
    )
    .explode("wikipedia_written_by")
    .groupby("wikipedia_written_by")
    .agg(
        episodes=("wikipedia_written_by", "size"), 
        average_rating=("rating", "mean")
        )
    .query("episodes >= 5")
    .sort_values(by="average_rating", ascending=False)
    .reset_index()
    .pipe(
        lambda df: GT(df, rowname_col="wikipedia_written_by")
        .fmt_number(columns="average_rating", decimals=2)
        .cols_label(
            episodes="Total Episodes Written", 
            average_rating="Average Episode Rating"
        )
        .cols_align(columns=["episodes", "average_rating"], align="center")
        .tab_source_note(source_note="Data: {bobsburgersR} (via TidyTuesday)")
        .tab_options(table_width="100%")
    )
)

Table 4: IMDb User Ratings by Writer

	Total Episodes Written	Average Episode Rating
Loren Bouchard	7	8.30
Lizzie Molyneux	29	7.87
Wendy Molyneux	29	7.87
Steven Davis	28	7.86
Kelvin Yu	25	7.86
Greg Thompson	21	7.80
Dan Fybel	27	7.73
Nora Smith	18	7.73
Jon Schroeder	25	7.66
Rich Rinaldi	28	7.63
Scott Jacobson	26	7.60
Holly Schlesinger	28	7.58
Katie Crown	11	7.28
Data: {bobsburgersR} (via TidyTuesday)

There is plenty of variance in the outcomes across directors and writers. It is reasonable to assume that these should impact the ratings and improve model performance. However, including directors and writers in the model will require inducing considerable sparsity in the data (assuming that one-hot encoding is used).

Data Preparation

A brief exploration of the data has not given me confidence that it is sufficient for developing a cromulent predictive model (maybe the TidyTuesday folks were on to something). But I’m committed to the bit. I’ve never heard of sunken costs.

I will carry out some basic preprocessing but won’t do any feature engineering or add additional data from elsewhere. I will keep all the dialogue metrics in the model because meaningful interactions between them could add some value³. However, I will drop the season-episode number and year features because they offer minimal additional value over the season and overall episode number features. Finally, I’m following my instincts and dropping the viewership feature. I think it is just serving as another proxy for time and isn’t needed.

There is a lot more that could be done here. Never underestimate how many different feature engineering methods exist⁴, nor the potential performance gains they can achieve. I think a lot could be gained from the transcripts of each episode’s dialogue, but this is outside the scope of this post (because it’s hard and I’m lazy).

Train/Test Split

I should have split the data into training and test sets before doing the exploratory work, but with only 271 observations, I don’t think we have enough data to justify splitting it up beforehand.

Data Code (Click to Expand)

X_train, X_test, y_train, y_test = train_test_split(
    bobs_burgers.drop(["rating"], axis=1),
    bobs_burgers["rating"],
    test_size=0.3,
    random_state=42,
)

The data has been split 70/30 into training and testing. I also decided there wasn’t enough to include a validation set.

Data Preprocessing

The model pipeline includes several preprocessing steps. I have split the features into categorical and numeric, and missing values are imputed using slightly different methods depending on the data type.

Class Code (Click to Expand)

class ExplodeAndEncode(TransformerMixin, BaseEstimator):
    def __init__(self, min_frequency=3):
        # initialise transformer with min frequency for encoding
        self.min_frequency = min_frequency
        self.ohe_directors = OneHotEncoder(
            handle_unknown="ignore",
            sparse_output=False,
            min_frequency=self.min_frequency,
        )
        self.ohe_writers = OneHotEncoder(
            handle_unknown="ignore",
            sparse_output=False,
            min_frequency=self.min_frequency,
        )

    def fit(self, X, y=None):
        # check input type
        if not isinstance(X, pd.DataFrame):
            raise ValueError("Input must be a DataFrame.")

        # rename columns for consistency
        X = X.rename(
            columns={
                "wikipedia_directed_by": "director",
                "wikipedia_written_by": "writer",
            }
        )

        # fit one-hot encoders on expanded and encoded data
        self.ohe_directors.fit(self._expand_and_encode(X["director"]))
        self.ohe_writers.fit(self._expand_and_encode(X["writer"]))

        return self

    def transform(self, X, y=None):
        # check input type
        if not isinstance(X, pd.DataFrame):
            raise ValueError("Input must be a DataFrame.")

        # rename columns for consistency
        X = X.rename(
            columns={
                "wikipedia_directed_by": "director",
                "wikipedia_written_by": "writer",
            }
        )

        # expand and encode director and writer columns
        directors_expanded = self._expand_and_encode(X["director"])
        writers_expanded = self._expand_and_encode(X["writer"])

        # apply one-hot encoding
        transformed_directors = self.ohe_directors.transform(directors_expanded)
        transformed_writers = self.ohe_writers.transform(writers_expanded)

        # create dataframes with meaningful column names
        directors_df = (
            pd.DataFrame(
                transformed_directors,
                index=directors_expanded.index,
                columns=[
                    f"director_{col}"
                    for col in self.ohe_directors.get_feature_names_out()
                ],
            )
            .groupby(level=0)
            .sum()  # aggregate back to original index level
        )

        writers_df = (
            pd.DataFrame(
                transformed_writers,
                index=writers_expanded.index,
                columns=[
                    f"writer_{col}" for col in self.ohe_writers.get_feature_names_out()
                ],
            )
            .groupby(level=0)
            .sum()  # aggregate back to original index level
        )

        # merge transformed features and fill missing values with zero
        transformed_df = directors_df.join(writers_df, how="outer").fillna(0)

        return transformed_df.values  # return as numpy array

    def _expand_and_encode(self, series):
        # ensure series is categorical
        if not isinstance(series.dtype, pd.CategoricalDtype):
            series = series.astype("category")

        # add 'unknown' category if missing
        if "Unknown" not in series.cat.categories:
            series = series.cat.add_categories(["Unknown"])

        # fill missing values with 'unknown'
        series = series.fillna("Unknown")

        # split multi-value strings into separate rows
        exploded = series.str.split(" & ").explode()

        # standardise formatting (lowercase, replace spaces with underscores)
        exploded = exploded.str.lower().str.replace(" ", "_")

        return exploded.to_frame()  # return as a dataframe

With the help of ChatGPT⁵, I created a custom transformer class that explodes and formats the categorical columns—episode writers and directors—before one-hot encoding each of the individual writers/directors per episode.

Data Code (Click to Expand)

categorical_features = ["wikipedia_written_by", "wikipedia_directed_by"]
numeric_features = [
    "season",
    "episode_overall",
    "dialogue_density",
    "unique_words",
    "question_ratio",
    "exclamation_ratio",
    "avg_length",
    "sentiment_variance",
]

categorical_transformer = Pipeline(
    steps=[("explode_encode", ExplodeAndEncode(min_frequency=5))]
)

numeric_transformer = Pipeline(steps=[("imputer", IterativeImputer())])

col_transformer = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
    ]
)

preprocessor = Pipeline(steps=[("col_transformer", col_transformer)])

To prevent the newly created binary features from markedly expanding the data’s dimensionality and sparsity, I have restricted the one-hot encoding to those credited with a minimum of five episodes.

Model Training

I’ll start with a baseline model and then a relatively simple model to compare against. This is always a good starting point, especially when there’s a reasonable chance your model will be garbage. You need to make sure you are at least able to build a model that performs better than taking the most basic baseline predictions.

The baseline model predicts that all test set values will equal the mean IMDb ratings in the training set.

Model Code (Click to Expand)

baseline_model = DummyRegressor(strategy="mean")

baseline_model.fit(X_train, y_train)

baseline_preds = baseline_model.predict(X_test)

baseline_rmse = root_mean_squared_error(y_test, baseline_preds)

The RMSE for the baseline model is 0.430. This gives us a target to try and beat. If we can’t do better than this, we should pack it in and not spend a moment more on this nonsense. Remember to blame the tools, though. It’s certainly not my fault.

I have also built a linear regression to compare against the baseline and help evaluate whether a more complex model is worth the effort. Linear regressions are straightforward. Despite this, they are capable of remarkable performance and typically generalise very well.

Model Code (Click to Expand)

pipeline = Pipeline(
    steps=[("preprocessor", preprocessor), ("regressor", LinearRegression())]
)

pipeline.fit(X_train, y_train)

lm_preds = pipeline.predict(X_test)

lm_rmse = root_mean_squared_error(y_test, lm_preds)

The linear regression returns a score of 0.415, which beats the baseline. We can use both these models as a good starting point, but the linear regression performance is the new target to beat.

Model Selection

I decided to stick with algorithms provided by scikit-learn, choosing random forests (Breiman 2001), extremely randomised trees (Extra Trees in scikit-learn) (Geurts, Ernst, and Wehenkel 2006), and gradient boosting (Friedman 2001).

Random forests⁶ and extra trees are bagging ensembles combining multiple decision trees by averaging their predictions. The key difference between the two is that random forests use bootstrapped samples of the data for each tree instead of the entire sample (though you can choose to use this feature in scikit-learn’s extra trees implementation, too), and extra trees split tree nodes randomly, to induce more variance between trees. Gradient boosting⁷ combines an ensemble of weak models (typically decision trees) in a sequential, additive fashion. The sequential process, called boosting, allows each model to learn from all that precede it.

I am testing the performance of these algorithms by tuning a couple key hyperparameters for each, split over five folds using k-fold cross-validation. The goal is that this should identify the algorithm that gives us the best opportunity for maximising performance. I am using Optuna for hyperparameter optimisation (here and throughout the rest of this post)⁸. Optuna’s default optimisation strategy is Bayesian optimisation using tree-structured parzen estimators, and I’ve always had good luck with it.

Model Code (Click to Expand)

rmse_scorer = make_scorer(root_mean_squared_error, greater_is_better=False)


def objective(trial):
    regressor_name = trial.suggest_categorical(
        "regressor", ["RandomForest", "ExtraTrees", "GradientBoosting"]
    )

    if regressor_name == "RandomForest":
        n_estimators = trial.suggest_int("n_estimators", 100, 1000, step=100)
        max_features = trial.suggest_float("max_features", 0.1, 1.0, step=0.1)
        model = RandomForestRegressor(
            n_estimators=n_estimators, max_features=max_features
        )

    elif regressor_name == "ExtraTrees":
        n_estimators = trial.suggest_int("n_estimators", 100, 1000, step=100)
        max_features = trial.suggest_float("max_features", 0.1, 1.0, step=0.1)
        model = ExtraTreesRegressor(
            n_estimators=n_estimators, max_features=max_features
        )

    else:
        n_estimators = trial.suggest_int("n_estimators", 100, 1000, step=100)
        learning_rate = trial.suggest_float("learning_rate", 0.01, 0.3, step=0.01)
        model = GradientBoostingRegressor(
            n_estimators=n_estimators, learning_rate=learning_rate
        )

    pipeline = Pipeline([("preprocesser", preprocessor), ("regressor", model)])

    kf = KFold(n_splits=5, shuffle=True, random_state=42)
    scores = cross_val_score(
        pipeline, X_train, np.ravel(y_train), cv=kf, scoring=rmse_scorer
    )

    return -scores.mean()


select_model = optuna.create_study(direction="minimize")
select_model.optimize(objective, n_trials=100)

I’m not especially concerned with model performance during this step; it’s just a comparison between algorithms. The study suggests RandomForest is the best of the bunch.

Hyperparameter Tuning

Now, we can get knee-deep in filth, searching for the best hyperparameter values for optimising our random forest. The setup is more or less the same as in the previous section; here, we are digging deeper into one algorithm and beefing up the total number of trials in the study.

The study below includes several more hyperparameters. The search space for the hyperparameters is relatively modest. I have tried to keep the model simple to reflect the data, and I have iteratively refined the study to narrow the hyperparameter values and speed up the tuning. Finally, I’ve also tried to set some constraints to help the model avoid overfitting.

Model Code (Click to Expand)

def objective(trial):

    rf_params = {
        "n_estimators": trial.suggest_int("n_estimators", 100, 250),
        "max_features": trial.suggest_float("max_features", 0.2, 0.5),
        "max_samples": trial.suggest_float("max_samples", 0.7, 1.0),
        "max_depth": trial.suggest_int("max_depth", 10, 15),
        "min_samples_split": trial.suggest_int("min_samples_split", 20, 25),
        "criterion": trial.suggest_categorical(
            "criterion", ["poisson", "friedman_mse", "squared_error"]
        ),
    }

    model = RandomForestRegressor(**rf_params, random_state=42)

    pipeline = Pipeline([("preprocessor", preprocessor), ("regressor", model)])

    kf = KFold(n_splits=5, shuffle=True, random_state=42)
    scores = cross_val_score(
        pipeline, X_train, np.ravel(y_train), cv=kf, scoring=rmse_scorer
    )

    return -scores.mean()


tuned_rf = optuna.create_study(direction="minimize")
tuned_rf.optimize(objective, n_trials=500)

The tuned random forest produces an RMSE of 0.395. Comparisons with the baselines don’t work because they are fit to the test data, while this is the performance of the best trial on the training data⁹. The hope is that how the tuning is set up will make the model’s performance on the training data generalisable. Given the small sample we’re working with, there could be plenty of variance, so while our model shows promise, we can’t draw conclusions yet.

Table 5 below shows the tuned hyperparameter values.

Table Code (Click to Expand)

(
    GT(
        pd.DataFrame(
            tuned_rf.best_trial.params.items(), columns=["Hyperparameter", "Value"]
        ).assign(
            Value=lambda df: df["Value"].apply(
                lambda x: round(x, 2) if isinstance(x, (int, float)) else x
            )
        )
    )
    .cols_align(columns="Value", align="center")
    .tab_source_note(source_note="Data: {bobsburgersR} (via TidyTuesday)")
    .tab_options(table_width="100%")
)

Table 5: Optimised Random Forest Hyperparameters

Hyperparameter	Value
n_estimators	118
max_features	0.34
max_samples	0.72
max_depth	14
min_samples_split	25
criterion	friedman_mse
Data: {bobsburgersR} (via TidyTuesday)

The hyperparameters are mostly tailored to optimise the training data, with some thought towards what is appropriate given the sample size. I have constrained the maximum features and samples to make the model more generalisable.

Model Evaluation

Having tuned the hyperparameters, we should hopefully have a model that gives us the best performance on our test set.

Model Code (Click to Expand)

best_params = tuned_rf.best_params

best_rf = RandomForestRegressor(**best_params, random_state=42)

pipeline = Pipeline([("preprocessor", preprocessor), ("regressor", best_rf)])

pipeline.fit(X_train, np.ravel(y_train))

rf_preds = pipeline.predict(X_test)

rf_rmse = root_mean_squared_error(np.ravel(y_test), rf_preds)

The model improves on the baseline (0.430) and the linear regression (0.415) to achieve an RMSE of 0.367. That is a ~15% improvement on the baseline and ~11% on the linear model¹⁰. This is not nothing!

Visualising Model Error

Data Code (Click to Expand)

test_errors = (
    pd.DataFrame(rf_preds.T, columns=["prediction"])
    .assign(actual=y_test.values)
    .assign(error=lambda df: df["prediction"] - df["actual"])
    .assign(abs_error=lambda df: abs(df["error"]))
    .join(X_test.reset_index(drop=True))
)

Our final model beats both baselines, but I’m sure there remains room for improvement. If we wanted to iteratively fine-tune the model to squeeze out some extra drops of joy, the best place to start would be to understand what the model is getting wrong. There are a few ways to go about this. I will gather the predicted ratings, actual ratings, and prediction error (raw and absolute values) and visualise the model’s performance to see what jumps out.

The distribution of prediction error, plotted below, serves as a quick check that there is no obvious indication that the model is misbehaving. If there is some wild skew in errors or several substantial outliers, this might be cause for concern, but the errors appear to be distributed approximately normally.

Plot Code (Click to Expand)

(
    ggplot(test_errors, aes("error"))
    + geom_histogram(binwidth=0.1, colour="#343a40")
    + geom_hline(yintercept=0, colour="#343a40")
    + labs(
        title="Distribution of Bob's Burgers Episode Ratings Prediction Error",
        subtitle=(
            textwrap.fill(
                "The distribution of prediction errors for predicted IMDb user ratings "
                "for each episode of Bob's Burgers in the test dataset. Errors peak at "
                "just above zero but the negative errors have a longer tail.",
                width=90,
            )
        ),
        x="Prediction Error",
        y="",
        caption="Visualisation: Paul Johnson | Data: {bobsburgersR} (via TidyTuesday)",
    )
)

A histogram visualising the distribution of prediction errors for the random forest predicting Bob's Burgers IMDb user ratings. While the majority of errors appear to be positive and the distribution peaks just above zero, the negative errors have a longer tail.

With less than 100 observations in the test dataset, it requires a bit of squinting to make some assumptions about the distribution we’d see with a larger sample. Still, nothing concerning jumps out at a glance. The distribution peaks slightly above zero, but the negative errors have a longer tail. There appears to be one outlier that the model struggled with, underestimating the rating by more than one (which is quite a lot when considering the ratings are out of 10).

We can also plot the predicted and actual ratings below to compare the model’s misses more directly.

Plot Code (Click to Expand)

(
    ggplot(test_errors, aes(x="prediction", y="actual"))
    + geom_point(size=2, stroke=0.5, fill="white", color="#343a40", shape="o")
    + geom_smooth(method="lm", size=1, se=False, color="#026E99")
    + labs(
        title="Predicted vs Actual Bob's Burgers Episode Ratings",
        subtitle=(
            textwrap.fill(
                "Comparing predicted Bob's Burgers ratings and their actual IMDb user "
                "ratings for each episode in the test dataset. On average, predictions "
                "appear to slightly overrate episodes.",
                width=90,
            )
        ),
        x="Predicted Rating",
        y="IMDb Rating",
        caption="Visualisation: Paul Johnson | Data: {bobsburgersR} (via TidyTuesday)",
    )
)

A regression plot visualising the relationship between predicted and actual ratings for the random forest model predicting Bob's Burgers IMDb user ratings. On average, pedictions seem to overrate episode ratings.

Visualising predictions against actual ratings shows that the model slightly overrates episodes. While there’s plenty of noise, nothing jumps out as an obvious problem. Besides a single pesky outlier, nothing in these first two plots points to areas to try and squeeze out additional performance improvements.

Finally, we can visualise the relationship between prediction error and actual ratings. We are interested in the magnitude of the error, not the direction, so I’ve used absolute prediction error. This highlights the target values the model struggled to predict, and the trend in the error distribution is immediately apparent.

Plot Code (Click to Expand)

(
    ggplot(test_errors, aes(x="actual", y="abs_error"))
    + geom_point(size=2, stroke=0.5, fill="white", color="#343a40", shape="o")
    + geom_smooth(size=1, se=False, color="#026E99")
    + labs(
        title="Absolute Prediction Error by Bob's Burgers Episode Ratings",
        subtitle=(
            textwrap.fill(
                "Comparing absolute prediction error of predicted Bob's Burgers' "
                "episode ratings and actual IMDb user ratings. Errors are larger "
                "at the extreme ends of the ratings.",
                width=85,
            )
        ),
        x="IMDb Rating",
        y="Absolute Predcition Error",
        caption="Visualisation: Paul Johnson | Data: {bobsburgersR} (via TidyTuesday)",
    )
)

A regression plot visualising the relationship between IMDb ratings and the absolute prediction error for the random forest model predicting Bob's Burgers ratings. While the model performs relatively well, it struggles at the extreme ends of the ratings.

The model struggles with values at the extreme ends of the target distribution. Most observations in the dataset fall between 7 and 8, and the mean absolute error for episodes in this range in the test set is ~0.2. In fact, the trough in the prediction errors is between 7.5 and 8 and rises quickly once the rating value moves in either direction.

The observation with a rating of over 9 is driving a big spike. The episode responsible is —The Haunting (S6E3). I suspect an indicator that an episode is a “special” of some description (episodes relevant to public holidays, like Christmas and Halloween, extended episodes, or any one-off episodes) would minimise the issues the model seems to be having with very high ratings.

To me, the issue appears to be the sample size. The model handles the well-populated target value ranges but struggles once it gets into the ranges with tiny sample sizes. We just need more episodes.

Final Thoughts

Was this a good use of my time? No, but did we learn something? Also no. Perhaps there’s a moral to this story? Still, no. There is no big reveal. I’m just this stupid. My sincere apologies.

This is the product of my playing around with a fun dataset seeing what I can produce. A more concerted effort to predict Bob’s Burgers episode ratings would involve more exploratory work, including bringing in additional data, particularly the episode dialogue data. Dimensionality reduction methods, such as principal component analysis (PCA), would have added much value. The process of splitting out and one-hot encoding the writers and directors induced a bunch of sparsity that is probably hampering the model. Using PCA would simplify the data, and given the high chance that many features are not moving the needle that much, I suspect it would do so without losing much predictive value.

Beyond the gains that could be achieved with the data, I think it’s likely that other gradient boosting algorithms would probably beat out the random forest we’ve used here. I chose to limit myself to the models that scikit-learn provides for brevity (and for a bit of a change of pace), but XGBoost, LightGBM, and CatBoost are incredibly popular for a reason. They regularly beat their competition when dealing with structured data.

Despite being limited in scope, I hope this post’s limitations are also illustrative. If not, then at least I had fun. I guess we can’t turn TV shows into spreadsheets just yet.

Acknowledgments

Preview image by S. Tsuchiya on Unsplash.

Support

If you enjoyed this blog post and would like to support my work, you can buy me a coffee or a beer or give me a tip as a thank you.

References

Breiman, Leo. 2001. “Random Forests.” Machine Learning 45: 5–32.

Friedman, Jerome H. 2001. “Greedy Function Approximation: A Gradient Boosting Machine.” Annals of Statistics, 1189–1232.

Geurts, Pierre, Damien Ernst, and Louis Wehenkel. 2006. “Extremely Randomized Trees.” Machine Learning 63: 3–42.

Heaven, Will Douglas. 2022. “This Is the Reason Demis Hassabis Started DeepMind.” MIT Technology Review. https://www.technologyreview.com/2022/02/23/1045016/ai-deepmind-demis-hassabis-alphafold/.

Jumper, John, Richard Evans, Alexander Pritzel, Tim Green, Michael Figurnov, Olaf Ronneberger, Kathryn Tunyasuvunakool, et al. 2021. “Highly Accurate Protein Structure Prediction with AlphaFold.” Nature 596 (7873): 583–89. https://www.nature.com/articles/s41586-021-03819-2.

Service, Robert F. 2020. “‘The Game Has Changed.’ AI Triumphs at Solving Protein Structures.” Science. https://www.science.org/content/article/game-has-changed-ai-triumphs-solving-protein-structures.

Footnotes

My initial reaction to viewership’s strong correlation with ratings was that this made sense, but as soon as I tried to explain why, I realised, I had the direction of causality mixed up.↩︎
The other categorical features are the episode title and synopsis, which might include some helpful signal but would require some effort to extract. The date the episode aired is also categorical, but only because I haven’t bothered to convert it (having done some digging and found that there doesn’t appear to be any seasonality or anything to work with).↩︎
If I were spending more time on this, I would explore any interactions, but I think we can get away with chucking all the potentially valuable features into the mixer here and saving some time.↩︎
I’d highly recommend Emil Hvitfeldt’s Feature Engineering A-Z, a comprehensive resource on the endless possibilities of feature engineering.↩︎
I haven’t had much experience building custom transformers like this, so I needed a helping hand.↩︎
For an intuition into random forests (and, by extension, extra trees, too), I’d recommend MLU-Explain’s explainer.↩︎
For an intuition into gradient boosting (and all its variants), I’d recommend Terence Parr & Jeremy Howard’s How to Explain Gradient Boosting↩︎
There are other hyperparameter optimisation libraries, such as HyerOpt, and scikit-learn, which offers tuning functionality of its own, but Optuna is the most feature-rich. Perhaps the focus should be on methods, not libraries, but in my experience, Optuna’s methods also provide excellent performance.↩︎
A validation set or a more involved process like nested cross-validation would add some value. I don’t think we have enough data for a validation set, and I decided a fully nested pipeline was overkill, but I’m sure it would be beneficial.↩︎
Some simple preprocessing steps could improve the linear regression, potentially making it competitive with our final model. If I intended to put this into production, I’d be tempted to run the linear regression alongside the random forest to see if a simpler model would ultimately come out on top.↩︎

Reuse

CC BY-SA 4.0

Citation

For attribution, please cite this work as:

Johnson, Paul. 2025. “Predicting Bob’s Burgers Episode Ratings.” February 7, 2025. https://paulrjohnson.net/blog/2025-02-07-predicting-bobs-burgers-ratings/.