FINAL PROJECT EXAMPLE: Movie Ratings
Introduction
In the film industry, predictive modeling and machine learning are increasingly used to help studios make data-driven decisions and predict the success of films. From forecasting box office revenue to optimizing marketing campaigns, these models provide insights that can significantly influence a film’s profitability. Additionally, machine learning algorithms can analyze viewer preferences and behaviors, enabling studios to tailor content to specific audiences.
A key data source for the film industry is the Rotten Tomatoes website, an online aggregator that compiles critic and audience reviews into a “Tomatometer” score. This score, distinguishing films as “Fresh” or “Rotten,” heavily influences viewer perceptions and can drive box office success, as many potential viewers consult these ratings before choosing to see a film.
The aim is to assess how different factors, including the Rotten Tomatoes score, influence the domestic gross of a MAINSTREAM film (not a pilot run) for Lion Studios, using a training dataset of 117 movies released in 2011. This dataset includes the following variables.
| Variable Name | Description |
|---|---|
| Name | Film name |
| DomesticGross (million USD) | Gross revenue in the US by the end of 2011, in millions of dollars |
| RottenTomatoes (%) | Percentage rating from critics' reviews on Rotten Tomatoes |
| AudienceScore (%) | Percentage audience rating from opening weekend surveys |
| TheatersOpenWeek | Number of cinemas showing the movie on opening weekend. Films shown in fewer than 1000 cinemas are considered 'pilot/test' showings that should not be included in the analysis |
| BOAverageOpenWeek (USD) | Average box office revenue per theater on opening weekend, in dollars |
| Profitability (categorical) | Loss: the film cost more to make than it gained as profit. Average: 100-300% of the budget spent was gained as profit. High: greater than 300% of the budget spent was gained as profit |
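This report calls several R packages without showing the `library()` calls, so for completeness here is a minimal setup sketch. The data file name `movies2011.csv` is an assumption for illustration only.

# Packages used throughout this report
library(tidyverse)    # dplyr::filter, ggplot2, readr
library(ggstatsplot)  # gghistostats
library(corrplot)     # correlation matrix plot
library(GGally)       # ggpairs
library(olsrr)        # ols_plot_*, ols_test_*, ols_regress, ols_step_best_subset
library(plotly)       # ggplotly
library(hrbrthemes)   # theme_ipsum
# Load the training data (file name assumed for illustration)
movies <- read_csv("movies2011.csv")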
A sample of the data can be seen here.
head(as.data.frame(movies))
## Name DomesticGross RottenTomatoes AudienceScore
## 1 The Dilemma 0.03 6 53
## 2 Hanna 0.97 62 57
## 3 Friends With Benefits 4.40 66 55
## 4 Scream 4 4.46 46 66
## 5 Dolphin Tale 5.31 83 84
## 6 One Day 6.93 28 46
## TheatersOpenWeek BOAverageOpenWeek Profitability
## 1 3 2972 Loss
## 2 22 4890 Loss
## 3 106 6111 Loss
## 4 265 3856 Loss
## 5 247 7174 Loss
## 6 2003 1730 Loss
Exploratory data analysis
The training data for this project includes information on 117 movies released in 2011:
Object of analysis: An individual film
Population: All films released by Lion Studios from 2011-present
Response variable: DomesticGross (millions of USD)
It is difficult in this case to assess whether the dataset is truly representative of the film industry now. With the rise of digital technology, streaming and COVID, the present situation may look very different. Equally, there have been concerns about whether Rotten Tomatoes scores are gender neutral, as many more men than women appear to vote, skewing the results towards films men are likely to watch. That said, this is the data available for this study, so we shall continue as requested.
Data wrangling
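The numerical summary below comes from a standard summary() call on the still-unfiltered data:

# Numerical summary of every variable before filtering
summary(movies)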
## Name DomesticGross RottenTomatoes AudienceScore
## Length:117 Min. : 0.03 Min. : 4.00 Min. :25.00
## Class :character 1st Qu.: 24.82 1st Qu.:30.00 1st Qu.:49.00
## Mode :character Median : 43.85 Median :49.00 Median :59.00
## Mean : 71.57 Mean :52.45 Mean :61.44
## 3rd Qu.: 98.80 3rd Qu.:76.00 3rd Qu.:76.00
## Max. :381.01 Max. :97.00 Max. :93.00
## TheatersOpenWeek BOAverageOpenWeek Profitability
## Min. : 3 Min. : 1513 Average:58
## 1st Qu.:2555 1st Qu.: 3782 High :40
## Median :2996 Median : 5715 Loss :19
## Mean :2839 Mean : 8407
## 3rd Qu.:3417 3rd Qu.: 8995
## Max. :4375 Max. :93230
The studio (Dr G) asked me to assess only mainstream movies rather than pilot or test showings, which they defined as films shown in over 1000 theatres in their opening week. It is clear from the summary that there are pilot films in the dataset, so I am filtering down to the correct, mainstream-film subset.
movies <- filter(movies,TheatersOpenWeek > 1000)
Response Variable Analysis
gghistostats(DomesticGross,data=movies,results.subtitle=FALSE)
After filtering, the domestic gross of films in the sample ranges from around $400,000 up to $381 million USD (Harry Potter and the Deathly Hallows II), with a mean domestic gross of $76.41 million and a median of $51.93 million USD. As shown in the histogram, the data are highly right-skewed, with a few blockbusters such as Harry Potter making substantially more money than most other films released in 2011.
Correlation Matrix
Correlation matrices are a quick and easy way of assessing whether there is a linear relationship between many pairs of variables. However, it is important to note that they only "see" linear relationships; a complex pattern such as a curve will not be reflected in the statistics.
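As a tiny illustration of that limitation (made-up numbers, not the movies data):

# A perfectly predictable but curved relationship can have ~zero linear correlation
x <- seq(-3, 3, by = 0.1)
y <- x^2
cor(x, y) # ~0, even though y is fully determined by x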
# remove the name column and the factor column
movies.numeric <- movies[ , sapply(movies,is.numeric)]
corrplot(cor(movies.numeric), type = "lower", tl.col = "black", tl.srt = 45)
The matrix suggests a strong relationship between the domestic gross and almost all the other variables provided, especially the number of theaters showing the film in its opening week. We can examine this in more detail (and include the Profitability categorical variable) by looking at the individual scatterplots.
# remove the names variable
ggpairs(movies[,-1])
Here we can see that, for most predictors, the relationship with domestic gross is somewhat complex. It also seems that profitability might play a role, as the quick extra plot below suggests.
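To see the categorical effect more clearly, we can colour the strongest single relationship by the Profitability category (an extra exploratory plot, not requested by the studio):

# Colour the theaters vs gross scatterplot by profitability category
ggplot(movies, aes(x = TheatersOpenWeek, y = DomesticGross, colour = Profitability)) +
  geom_point() +
  xlab("Number of theaters") + ylab("Domestic Gross (million USD)")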
Single Predictor models
As requested, we shall now compute and compare two single-predictor simple linear regression models. Also as requested, I have chosen two predictor variables I consider to be 'interesting' from my summary analysis above:

- The number of theaters showing the film in its opening week (TheatersOpenWeek). I chose this because it has the highest correlation with the response out of all the predictor variables.
- The average box office revenue per theater opening weekend, in dollars (BOAverageOpenWeek). I chose this because of its reasonably strong positive correlation and, from the ggpairs matrix, the potential that investigating some outliers might improve predictability further.
Model 1A: TheatersOpenWeek
# Scatterplot
myplot <- ggplot(movies, aes(x=TheatersOpenWeek, y=DomesticGross)) +
geom_point() +
xlab("Number of theaters") + ylab("Domestic Gross Profit")+
geom_smooth(method=lm , color="blue", se=TRUE) +
theme_ipsum()
ggplotly(myplot)
LM_NumTheatres <- lm(DomesticGross ~ TheatersOpenWeek, data = movies)
In this case, as shown in the scatterplot above, our fitted model for this sample is:
\[ \widehat{DomesticGross} = -189.48250 + 0.08705\,(TheatersOpenWeek) \]
LINE assumptions check
# Model Diagnostics
ols_plot_resid_stud_fit(LM_NumTheatres)
- Linearity: FAIL
- Equal variance/normality: probably fail
As shown in the residual plot, this model has clear issues with non-linearity (e.g. a curve would clearly fit better than a line). Therefore I will apply a transformation to fix the issue before moving on.
Transformations
I used the code below to assess several transformations of both the predictor and the response. I first applied some predictor transformations to account for the non-linearity. It became increasingly clear that there was also non-constant variance, so I then applied transformations to the response, and tried transforming both.
# Make some transformations
# This code-chunk has been set to eval=FALSE, but you can see what I did
# I don't want to mess up my movie dataset, so I'm trying this on a temporary dataset
tmpmovie <- movies
#===============================
# MAKE TRANSFORMATION COLUMNS
#===============================
# Log
tmpmovie$Log_TheatersOpenWeek <- log(tmpmovie$TheatersOpenWeek)
tmpmovie$Log_DomesticGross <- log(tmpmovie$DomesticGross)
# Sqrt
tmpmovie$Sqrt_TheatersOpenWeek <- sqrt(tmpmovie$TheatersOpenWeek)
tmpmovie$Sqrt_DomesticGross <- sqrt(tmpmovie$DomesticGross)
# Inverse e.g. 1/x
tmpmovie$Inv_TheatersOpenWeek <- 1/(tmpmovie$TheatersOpenWeek)
tmpmovie$Inv_DomesticGross <- 1/(tmpmovie$DomesticGross)
# Box-Cox, a more flexible family covering log and 1/x: (x^lambda - 1) / lambda
# I chose lambda = 0.222 from the peak of the boxcox plot. (NOT IN EXAM)
boxcox_result <- MASS::boxcox(LM_NumTheatres)
tmpmovie$box_TheatersOpenWeek <- (tmpmovie$TheatersOpenWeek^0.222 - 1) / 0.222
#===============================
# TEST SOME MODELS
# I'm just looking at the residual plots to see which transformation fixes
# linearity & equal variance. I ran each line with Ctrl-Enter, looking for
# the simplest option that does the job.
# PREDICTOR TRANSFORMATIONS.
#===============================
ols_plot_resid_stud_fit(lm(DomesticGross ~ Log_TheatersOpenWeek, data = tmpmovie))
ols_plot_resid_stud_fit(lm(DomesticGross ~ Sqrt_TheatersOpenWeek,data = tmpmovie))
ols_plot_resid_stud_fit(lm(DomesticGross ~ Inv_TheatersOpenWeek, data = tmpmovie))
ols_plot_resid_stud_fit(lm(DomesticGross ~ box_TheatersOpenWeek, data = tmpmovie))
ols_plot_resid_stud_fit(lm(DomesticGross ~ poly(TheatersOpenWeek,2),data = tmpmovie))
ols_plot_resid_stud_fit(lm(DomesticGross ~ poly(TheatersOpenWeek,3),data = tmpmovie))
# RESPONSE TRANSFORMATIONS.
#===============================
ols_plot_resid_stud_fit(lm(Log_DomesticGross ~ TheatersOpenWeek, data = tmpmovie))
ols_plot_resid_stud_fit(lm(Sqrt_DomesticGross ~ TheatersOpenWeek,data = tmpmovie))
ols_plot_resid_stud_fit(lm(Inv_DomesticGross ~ TheatersOpenWeek, data = tmpmovie))
# BOTH TRANSFORMATIONS.
#===============================
ols_plot_resid_stud_fit(lm(Log_DomesticGross~Log_TheatersOpenWeek, data = tmpmovie))
ols_plot_resid_stud_fit(lm(Sqrt_DomesticGross~Sqrt_TheatersOpenWeek,data = tmpmovie))
ols_plot_resid_stud_fit(lm(Inv_DomesticGross~Inv_TheatersOpenWeek, data = tmpmovie))
rm(tmpmovie)
There was a clear winner: the log transformation of the response, so I will move forward from there. This makes sense given how skewed the response variable was.
# Log response transformation
movies$Log_DomesticGross <- log(movies$DomesticGross)
ols_plot_resid_stud_fit(lm(Log_DomesticGross ~ TheatersOpenWeek,data=movies))
Model 1B: Log(DomesticGross)
Now, let’s assess our new and more complex model:
# New scatterplot with theatres open week vs log_domgross
ggplot(movies, aes(x=TheatersOpenWeek, y=Log_DomesticGross)) +
xlab("Number of Theatres") + ylab("Domestic Gross Profit (million USD)")+
geom_point() +
geom_smooth(method="lm")
# linear model
LM_Log_NumTheatres <- lm(Log_DomesticGross ~ TheatersOpenWeek,data=movies)
LM_Log_NumTheatres
##
## Call:
## lm(formula = Log_DomesticGross ~ TheatersOpenWeek, data = movies)
##
## Coefficients:
## (Intercept) TheatersOpenWeek
## 0.430433 0.001158
In this case, as shown in the scatterplot and model output above, our model suggests that for this sample:
\[ \widehat{\ln(DomesticGross)} = 0.430433 + 0.001158\,(TheatersOpenWeek) \]
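Because the response is log transformed, predictions from this model must be back-transformed with exp() to return to millions of dollars. For example (an illustrative calculation, not one of the studio's requests):

# Predicted domestic gross for a film opening in 3000 theaters
exp(0.430433 + 0.001158 * 3000) # ~ 49.6 million USD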
LINE assessment
residplot1 <- ols_plot_resid_stud_fit(LM_Log_NumTheatres,print_plot=FALSE)
residplot1$plot + ggtitle("log transformation of response only")
ols_plot_resid_qq(LM_Log_NumTheatres)
# Normality test
ols_test_normality(LM_Log_NumTheatres)
## -----------------------------------------------
## Test Statistic pvalue
## -----------------------------------------------
## Shapiro-Wilk 0.9944 0.9439
## Kolmogorov-Smirnov 0.0524 0.9283
## Cramer-von Mises 11.4842 0.0000
## Anderson-Darling 0.1465 0.9661
## -----------------------------------------------
# F test for variance
ols_test_f(LM_Log_NumTheatres)
##
## F Test for Heteroskedasticity
## -----------------------------
## Ho: Variance is homogenous
## Ha: Variance is not homogenous
##
## Variables: fitted values of Log_DomesticGross
##
## Test Summary
## -------------------------
## Num DF = 1
## Den DF = 106
## F = 2.734183
## Prob > F = 0.1011804
- Linearity: PASS - No evidence a curve would fit the mean of y|x better than a line.
- Independence: PASS - Hard to assess, but no obvious issues.
- Normality: PASS - The QQ plot looks straight, and according to several normality significance tests there is a very high probability (p-values mostly above 0.9) of seeing residuals like these if the sample came from an underlying population with normally distributed residuals.
- Equal variance: PASS (just) - The residual plot looks relatively random, and according to the F test there is a 10% chance of seeing a result like this if the sample came from a population with constant residual variance.
Check for Influential Outliers
Now that model 1B satisfies the LINE conditions, I will make one final check for influential outliers, as shown below.
ols_plot_resid_lev(LM_Log_NumTheatres)
ols_plot_cooksd_bar(LM_Log_NumTheatres,threshold=1)
Although there are high leverage points (unusually far from the mean of x) and some outliers, none of them are influential, so we can move on to model 2.
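For reference, we can also list the films with the largest Cook's distance by name (a small optional check using base R's cooks.distance(); none exceed the usual threshold of 1):

# Show the three films with the largest Cook's distance
cd <- cooks.distance(LM_Log_NumTheatres)
head(movies[order(-cd), c("Name", "TheatersOpenWeek", "DomesticGross")], 3)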
Model 2: BOAverageOpenWeek
My second model assesses whether domestic gross (millions of dollars) can be modelled using the average box office revenue per theater opening weekend (in dollars), variable BOAverageOpenWeek.
# Scatterplot
ggplot(movies, aes(x=BOAverageOpenWeek, y=DomesticGross)) +
geom_point() +
xlab("Opening week profit ") + ylab("Domestic Gross Profit (million USD)")+
geom_smooth(method=lm , color="blue", se=TRUE) +
theme_ipsum()
LM_BOAvOpen <- lm(DomesticGross ~ BOAverageOpenWeek, data = movies)
LM_BOAvOpen
##
## Call:
## lm(formula = DomesticGross ~ BOAverageOpenWeek, data = movies)
##
## Coefficients:
## (Intercept) BOAverageOpenWeek
## -3.56700 0.01062
In this case, as shown in the scatterplot above, our model suggests that for this sample:
\[ \widehat{DomesticGross} = -3.56700 + 0.01062\,(BOAverageOpenWeek) \]
LINE assumptions check
# Model Diagnostics
ols_plot_resid_stud_fit(LM_BOAvOpen)
- Linearity: PASS - No evidence a curve would fit the mean of y|x better than a line.
- Equal variance: FAIL - The residual plot shows significant "fanning out" (the points are close to the line for low predicted values and far from it for higher ones).
This makes physical sense; there is probably a baseline ticket price for cinemas to break even, so for low profits, there’s not much room for variability. But for a very profitable movie or blockbuster, many different types of cinema are showing the film, with different levels of luxury.
Model 2B: Log-Log Transformation
Again, we will try a transformation to fix the LINE violations. After exploring the issue using the code above, it seems this model needs a log-log transformation (i.e. both the response and the predictor are log transformed).
# Log transformations of both response and predictor
movies$Log_DomesticGross <- log(movies$DomesticGross)
movies$Log_BOAverageOpenWeek <- log(movies$BOAverageOpenWeek)
# Scatterplot
ggplot(movies, aes(x=Log_BOAverageOpenWeek, y=Log_DomesticGross)) +
geom_point() +
xlab("Opening week profit") + ylab("Domestic Gross Profit")+
geom_smooth(method=lm , color="blue", se=TRUE) +
theme_ipsum()
LM_BOAvOpen_loglog <- lm(Log_DomesticGross ~ Log_BOAverageOpenWeek, data = movies)
ols_plot_resid_stud_fit(LM_BOAvOpen_loglog)
LM_BOAvOpen_loglog
##
## Call:
## lm(formula = Log_DomesticGross ~ Log_BOAverageOpenWeek, data = movies)
##
## Coefficients:
## (Intercept) Log_BOAverageOpenWeek
## -7.077 1.270
Our model is now:
\[ \widehat{\ln(DomesticGross)} = -7.077 + 1.270\,\ln(BOAverageOpenWeek) \]
LINE assessment
residplot1 <- ols_plot_resid_stud_fit(LM_BOAvOpen_loglog,print_plot=FALSE)
residplot1$plot + ggtitle("log-log transformation")
ols_plot_resid_qq(LM_BOAvOpen_loglog)
# Normality test
ols_test_normality(LM_BOAvOpen_loglog)
## -----------------------------------------------
## Test Statistic pvalue
## -----------------------------------------------
## Shapiro-Wilk 0.9941 0.9286
## Kolmogorov-Smirnov 0.0456 0.9781
## Cramer-von Mises 19.907 0.0000
## Anderson-Darling 0.231 0.7992
## -----------------------------------------------
# F test for variance
ols_test_f(LM_BOAvOpen_loglog)
##
## F Test for Heteroskedasticity
## -----------------------------
## Ho: Variance is homogenous
## Ha: Variance is not homogenous
##
## Variables: fitted values of Log_DomesticGross
##
## Test Summary
## -------------------------
## Num DF = 1
## Den DF = 106
## F = 1.222969
## Prob > F = 0.2712824
- Linearity: PASS - No evidence a curve would fit the mean of y|x better than a line.
- Independence: PASS - Hard to assess, but no obvious issues.
- Normality: PASS - The QQ plot looks straight, and according to several normality significance tests there is a high probability (p-values of roughly 0.8 and above) of seeing residuals like these if the sample came from an underlying population with normally distributed residuals.
- Equal variance: PASS - The residual plot looks relatively random, and according to the F test there is a 27% chance of seeing a result like this if the sample came from a population with constant residual variance.
Check for Influential Outliers
Now that model 2B satisfies the LINE conditions, I will make one final check for influential outliers, as shown below.
ols_plot_resid_lev(LM_BOAvOpen_loglog)
ols_plot_cooksd_bar(LM_BOAvOpen_loglog,threshold=1)
Although there are high leverage points (unusually far from the mean of x) and some outliers, none of them are influential, so we can move on to comparing the models.
Model comparison
We now have two valid models predicting the domestic gross (LM_Log_NumTheatres and LM_BOAvOpen_loglog), where:
Model 1:
\[ \widehat{\ln(DomesticGross)} = 0.430433 + 0.001158(TheatersOpenWeek) \]
Model 2:
\[ \widehat{\ln(DomesticGross)} = -7.077 + 1.270\,\ln(BOAverageOpenWeek) \]
where DomesticGross is the gross revenue in the US by the end of 2011, in millions of dollars; TheatersOpenWeek is the number of cinemas showing the movie on opening weekend; and BOAverageOpenWeek is the average box office revenue per theater opening weekend (in dollars).
The model summaries for each model can be seen as follows:
ols_regress(LM_Log_NumTheatres)
## Model Summary
## ---------------------------------------------------------------
## R 0.763 RMSE 0.561
## R-Squared 0.581 MSE 0.321
## Adj. R-Squared 0.577 Coef. Var 14.267
## Pred R-Squared 0.566 AIC 187.629
## MAE 0.443 SBC 195.675
## ---------------------------------------------------------------
## RMSE: Root Mean Square Error
## MSE: Mean Square Error
## MAE: Mean Absolute Error
## AIC: Akaike Information Criteria
## SBC: Schwarz Bayesian Criteria
##
## ANOVA
## ---------------------------------------------------------------------
## Sum of
## Squares DF Mean Square F Sig.
## ---------------------------------------------------------------------
## Regression 47.209 1 47.209 147.234 0.0000
## Residual 33.988 106 0.321
## Total 81.197 107
## ---------------------------------------------------------------------
##
## Parameter Estimates
## -------------------------------------------------------------------------------------------
## model Beta Std. Error Std. Beta t Sig lower upper
## -------------------------------------------------------------------------------------------
## (Intercept) 0.430 0.297 1.451 0.150 -0.158 1.019
## TheatersOpenWeek 0.001 0.000 0.763 12.134 0.000 0.001 0.001
## -------------------------------------------------------------------------------------------
and
ols_regress(LM_BOAvOpen_loglog)
## Model Summary
## --------------------------------------------------------------
## R 0.947 RMSE 0.279
## R-Squared 0.897 MSE 0.079
## Adj. R-Squared 0.896 Coef. Var 7.093
## Pred R-Squared 0.892 AIC 36.664
## MAE 0.220 SBC 44.710
## --------------------------------------------------------------
## RMSE: Root Mean Square Error
## MSE: Mean Square Error
## MAE: Mean Absolute Error
## AIC: Akaike Information Criteria
## SBC: Schwarz Bayesian Criteria
##
## ANOVA
## ---------------------------------------------------------------------
## Sum of
## Squares DF Mean Square F Sig.
## ---------------------------------------------------------------------
## Regression 72.797 1 72.797 918.681 0.0000
## Residual 8.400 106 0.079
## Total 81.197 107
## ---------------------------------------------------------------------
##
## Parameter Estimates
## ---------------------------------------------------------------------------------------------------
## model Beta Std. Error Std. Beta t Sig lower upper
## ---------------------------------------------------------------------------------------------------
## (Intercept) -7.077 0.365 -19.366 0.000 -7.801 -6.352
## Log_BOAverageOpenWeek 1.270 0.042 0.947 30.310 0.000 1.187 1.353
## ---------------------------------------------------------------------------------------------------
The studio asked us to compare the models against several metrics:

- \(R^2\) - how much variation is explained by the model:
  - Model LM_Log_NumTheatres: \(R^2\) = 0.581
  - Model LM_BOAvOpen_loglog: \(R^2\) = 0.897
  - The second model performs better in terms of \(R^2\); in other words, the average box office revenue per theater opening weekend (in dollars) explains nearly 90% of the variation in \(\ln(DomesticGross)\) in our sample.
- \(MSE\) - the mean squared error:
  - Model LM_Log_NumTheatres: MSE = 0.321
  - Model LM_BOAvOpen_loglog: MSE = 0.079
  - The second model performs better in terms of MSE, as it has smaller errors than the first (a fair comparison because both models have the response variable in the same units).
- LINE assumptions/residuals - is the model valid?
  - We spent some time finding transformations that make each model valid, so the two models perform equally well on this metric.
- AIC - Akaike's Information Criterion - a modification of the sum of squared errors (SSE) that takes into account the number of predictors (p) and the number of observations (n):

\[ AIC = n\ln(SSE) - n\ln(n) + 2p \]

Given that both models have n = 108 and p = 1, it stands to reason that this ranking will reflect the SSE, as the quick check below confirms.
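As a sanity check, here is a hand computation of the simplified formula using the residual sums of squares from the ANOVA tables above. Note that R's AIC() includes extra constant likelihood terms, so the absolute values differ, but the model ranking is the same.

# Simplified AIC from the ANOVA tables above
n <- 108; p <- 1
sse1 <- 33.988 # residual SS for LM_Log_NumTheatres
sse2 <- 8.400  # residual SS for LM_BOAvOpen_loglog
n * log(sse1) - n * log(n) + 2 * p # ~ -122.9 (model 1)
n * log(sse2) - n * log(n) + 2 * p # ~ -273.8 (model 2, smaller = better)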
AIC(LM_Log_NumTheatres,LM_BOAvOpen_loglog)
## df AIC
## LM_Log_NumTheatres 3 187.62881
## LM_BOAvOpen_loglog 3 36.66406
So this metric also suggests that the second model is better, as it has a lower value of AIC.
Multiple regression
Best Subsets model selection
Finally, we move on to the use of multiple predictors, specifically using the best subsets model selection approach. Given that both of my individual models use \(\ln(DomesticGross)\), I will also use this as my response variable here.
# apply a model to all the variables
fullmodel <- lm(Log_DomesticGross ~ RottenTomatoes+ AudienceScore+TheatersOpenWeek+BOAverageOpenWeek+Profitability, data=movies)
# Now run best subsets selection
ols_step_best_subset(fullmodel)
## Best Subsets Regression
## --------------------------------------------------------------------------------------------
## Model Index Predictors
## --------------------------------------------------------------------------------------------
## 1 BOAverageOpenWeek
## 2 TheatersOpenWeek BOAverageOpenWeek
## 3 TheatersOpenWeek BOAverageOpenWeek Profitability
## 4 AudienceScore TheatersOpenWeek BOAverageOpenWeek Profitability
## 5 RottenTomatoes AudienceScore TheatersOpenWeek BOAverageOpenWeek Profitability
## --------------------------------------------------------------------------------------------
##
## Subsets Regression Summary
## -----------------------------------------------------------------------------------------------------------------------------------
## Adj. Pred
## Model R-Square R-Square R-Square C(p) AIC SBIC SBC MSEP FPE HSP APC
## -----------------------------------------------------------------------------------------------------------------------------------
## 1 0.6504 0.6471 0.6078 108.5112 168.1788 -140.7628 176.2252 28.9221 0.2728 0.0026 0.3628
## 2 0.7455 0.7406 0.7121 52.7197 135.9018 -172.5831 146.6303 21.2593 0.2023 0.0019 0.2691
## 3 0.8006 0.7928 0.7669 23.2289 113.5568 -195.8307 129.6496 16.8192 0.1630 0.0015 0.2148
## 4 0.8338 0.8256 0.7995 5.0407 95.8839 -211.9276 114.6588 14.1558 0.1384 0.0013 0.1824
## 5 0.8338 0.8240 0.7948 7.0000 97.8403 -209.8283 119.2974 14.2901 0.1410 0.0013 0.1857
## -----------------------------------------------------------------------------------------------------------------------------------
## AIC: Akaike Information Criteria
## SBIC: Sawa's Bayesian Information Criteria
## SBC: Schwarz Bayesian Criteria
## MSEP: Estimated error of prediction, assuming multivariate normality
## FPE: Final Prediction Error
## HSP: Hocking's Sp
## APC: Amemiya Prediction Criteria
I am also going to try a version with Log_BOAverageOpenWeek, as that transformation seemed to improve my individual model's skill.
# apply a model to all the variables
fullmodel.log <- lm(Log_DomesticGross ~ RottenTomatoes+ AudienceScore+TheatersOpenWeek+Log_BOAverageOpenWeek+Profitability, data=movies)
# Now run best subsets selection
ols_step_best_subset(fullmodel.log)
## Best Subsets Regression
## ------------------------------------------------------------------------------------------------
## Model Index Predictors
## ------------------------------------------------------------------------------------------------
## 1 Log_BOAverageOpenWeek
## 2 TheatersOpenWeek Log_BOAverageOpenWeek
## 3 AudienceScore TheatersOpenWeek Log_BOAverageOpenWeek
## 4 AudienceScore TheatersOpenWeek Log_BOAverageOpenWeek Profitability
## 5 RottenTomatoes AudienceScore TheatersOpenWeek Log_BOAverageOpenWeek Profitability
## ------------------------------------------------------------------------------------------------
##
## Subsets Regression Summary
## ------------------------------------------------------------------------------------------------------------------------------
## Adj. Pred
## Model R-Square R-Square R-Square C(p) AIC SBIC SBC MSEP FPE HSP APC
## ------------------------------------------------------------------------------------------------------------------------------
## 1 0.8966 0.8956 0.8918 59.2926 36.6641 -271.4104 44.7105 8.5581 0.0807 8e-04 0.1074
## 2 0.9145 0.9129 0.9086 32.9461 18.0718 -289.6967 28.8004 7.1405 0.0679 6e-04 0.0904
## 3 0.9301 0.9281 0.9232 10.3262 -1.6830 -308.3433 11.7276 5.8944 0.0566 5e-04 0.0753
## 4 0.9357 0.9325 0.9267 5.5001 -6.6882 -314.5467 12.0867 5.4760 0.0536 5e-04 0.0705
## 5 0.9360 0.9322 0.9258 7.0000 -5.2217 -312.8903 16.2354 5.5030 0.0543 5e-04 0.0715
## ------------------------------------------------------------------------------------------------------------------------------
## AIC: Akaike Information Criteria
## SBIC: Sawa's Bayesian Information Criteria
## SBC: Schwarz Bayesian Criteria
## MSEP: Estimated error of prediction, assuming multivariate normality
## FPE: Final Prediction Error
## HSP: Hocking's Sp
## APC: Amemiya Prediction Criteria
From the tables above, it seems that using Log_BOAverageOpenWeek improves the skill, and that several other variables also appear to be important.
Final model
The analysis above suggests that the following model has the highest adjusted \(R^2\) (0.9325) and the lowest AIC (-6.6882).
\[ \widehat{\ln(DomesticGross)} = \beta_0 + \beta_1(AudienceScore) + \beta_2(TheatersOpenWeek) + \beta_3\ln(BOAverageOpenWeek) + \beta_4(Profitability) \]
finalmodel <- lm(Log_DomesticGross ~ AudienceScore+TheatersOpenWeek+Log_BOAverageOpenWeek+Profitability, data=movies)
finalmodel
##
## Call:
## lm(formula = Log_DomesticGross ~ AudienceScore + TheatersOpenWeek +
## Log_BOAverageOpenWeek + Profitability, data = movies)
##
## Coefficients:
## (Intercept) AudienceScore TheatersOpenWeek
## -5.5944812 0.0066017 0.0003627
## Log_BOAverageOpenWeek ProfitabilityHigh ProfitabilityLoss
## 0.9244820 0.0910764 -0.1571944
Therefore, we see the following relationships.

For movies that made a loss:

\[ \widehat{\ln(DomesticGross)} = -5.594 + 0.0066(AudienceScore) + 0.00036(TheatersOpenWeek) + 0.9245\,\ln(BOAverageOpenWeek) - 0.1572 \]

which simplifies to:

\[ \widehat{\ln(DomesticGross)} = -5.752 + 0.0066(AudienceScore) + 0.00036(TheatersOpenWeek) + 0.9245\,\ln(BOAverageOpenWeek) \]

For movies with an average level of profit (the baseline category, so there is no adjustment term):

\[ \widehat{\ln(DomesticGross)} = -5.594 + 0.0066(AudienceScore) + 0.00036(TheatersOpenWeek) + 0.9245\,\ln(BOAverageOpenWeek) \]

And for movies that made a high level of profit:

\[ \widehat{\ln(DomesticGross)} = -5.594 + 0.0066(AudienceScore) + 0.00036(TheatersOpenWeek) + 0.9245\,\ln(BOAverageOpenWeek) + 0.091 \]

which simplifies to:

\[ \widehat{\ln(DomesticGross)} = -5.503 + 0.0066(AudienceScore) + 0.00036(TheatersOpenWeek) + 0.9245\,\ln(BOAverageOpenWeek) \]
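These category-specific intercepts can be read straight off the coefficient vector (a small arithmetic check):

# Category-specific intercepts from the fitted coefficients
b <- coef(finalmodel)
b["(Intercept)"] + b["ProfitabilityLoss"] # Loss:    ~ -5.752
b["(Intercept)"]                          # Average: ~ -5.594
b["(Intercept)"] + b["ProfitabilityHigh"] # High:    ~ -5.503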
LINE assessment
residplot1 <- ols_plot_resid_stud_fit(finalmodel)
ols_plot_resid_qq(finalmodel)
# Normality test
ols_test_normality(finalmodel)
## -----------------------------------------------
## Test Statistic pvalue
## -----------------------------------------------
## Shapiro-Wilk 0.9867 0.3652
## Kolmogorov-Smirnov 0.0555 0.8935
## Cramer-von Mises 22.9808 0.0000
## Anderson-Darling 0.3394 0.4935
## -----------------------------------------------
# F test for variance
ols_test_f(finalmodel)
##
## F Test for Heteroskedasticity
## -----------------------------
## Ho: Variance is homogenous
## Ha: Variance is not homogenous
##
## Variables: fitted values of Log_DomesticGross
##
## Test Summary
## --------------------------
## Num DF = 1
## Den DF = 106
## F = 5.367239
## Prob > F = 0.02244135
- Linearity: PASS - No evidence a curve would fit the mean of y|x better than a line.
- Independence: PASS - Hard to assess, but no obvious issues.
- Normality: PASS - The QQ plot shows a few small issues in the tails, but the significance tests suggest these could easily be explained by random chance.
- Equal variance: PASS (just) - The residual plot looks relatively random, but the F test suggests only about a 2% chance of seeing a result like this if the sample came from a population with constant residual variance. I wonder if this is due to the two potential outliers (row 108, Harry Potter and the Deathly Hallows II, and row 106, The Thing).
Check for Influential Outliers
Now that the final model satisfies the LINE conditions, I will make one final check for influential outliers, as shown below:
ols_plot_resid_lev(finalmodel)
ols_plot_cooksd_bar(finalmodel,threshold=1)
This shows that although there are several outliers and high leverage points, none are influential enough to unduly affect the fit.
Studio questions
The studio had several questions about the new model, which I answer below:
1. Show your final model is more skillful than the other models you made
We can do this simply by using AIC, where the smallest value indicates the most skillful model (assuming the LINE conditions hold). Note that I only compare models whose response is \(\ln(DomesticGross)\), as AIC comparisons require the response variable to be identical across models. As shown below, our final model has the lowest AIC.
AIC(finalmodel,fullmodel.log, fullmodel, LM_BOAvOpen_loglog, LM_Log_NumTheatres)
## df AIC
## finalmodel 7 -6.688178
## fullmodel.log 8 -5.221659
## fullmodel 8 97.840334
## LM_BOAvOpen_loglog 3 36.664060
## LM_Log_NumTheatres 3 187.628806
2. Conduct a partial F test to assess if your final model is significantly better than your best ‘single variable’ model
anova(finalmodel, LM_BOAvOpen_loglog)
## Analysis of Variance Table
##
## Model 1: Log_DomesticGross ~ AudienceScore + TheatersOpenWeek + Log_BOAverageOpenWeek +
## Profitability
## Model 2: Log_DomesticGross ~ Log_BOAverageOpenWeek
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 102 5.2210
## 2 106 8.3996 -4 -3.1785 15.524 5.97e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
- H0: The reduced model (one predictor) is all we need.
- H1: The full model (finalmodel) explains enough extra variability to justify its use.
- Test statistic: F = 15.524
- P-value: 5.97e-10 (very small)

This suggests that the full model reduces the residual error by such a large amount that it would be very unlikely to occur by chance. Therefore we have enough evidence to reject H0 and choose the full model, as the hand computation below also confirms.
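The F statistic can be reproduced by hand from the anova() table above:

# Manual partial F computation from the anova() output
rss_reduced <- 8.3996; rss_full <- 5.2210
df_diff <- 4; df_full <- 102
F_stat <- ((rss_reduced - rss_full) / df_diff) / (rss_full / df_full)
F_stat # ~ 15.52, matching anova()
pf(F_stat, df_diff, df_full, lower.tail = FALSE) # the p-value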
3. In your final model, which predictors have the highest effect size and significance?
We can answer this by looking at the table of coefficients:
ols_regress(finalmodel)
## Model Summary
## --------------------------------------------------------------
## R 0.967 RMSE 0.220
## R-Squared 0.936 MSE 0.051
## Adj. R-Squared 0.933 Coef. Var 5.700
## Pred R-Squared 0.927 AIC -6.688
## MAE 0.169 SBC 12.087
## --------------------------------------------------------------
## RMSE: Root Mean Square Error
## MSE: Mean Square Error
## MAE: Mean Absolute Error
## AIC: Akaike Information Criteria
## SBC: Schwarz Bayesian Criteria
##
## ANOVA
## ---------------------------------------------------------------------
## Sum of
## Squares DF Mean Square F Sig.
## ---------------------------------------------------------------------
## Regression 75.976 5 15.195 296.858 0.0000
## Residual 5.221 102 0.051
## Total 81.197 107
## ---------------------------------------------------------------------
##
## Parameter Estimates
## ---------------------------------------------------------------------------------------------------
## model Beta Std. Error Std. Beta t Sig lower upper
## ---------------------------------------------------------------------------------------------------
## (Intercept) -5.594 0.372 -15.039 0.000 -6.332 -4.857
## AudienceScore 0.007 0.001 0.128 4.630 0.000 0.004 0.009
## TheatersOpenWeek 0.000 0.000 0.239 6.524 0.000 0.000 0.000
## Log_BOAverageOpenWeek 0.924 0.057 0.689 16.358 0.000 0.812 1.037
## ProfitabilityHigh 0.091 0.050 0.050 1.808 0.074 -0.009 0.191
## ProfitabilityLoss -0.157 0.075 -0.059 -2.106 0.038 -0.305 -0.009
## ---------------------------------------------------------------------------------------------------
The variable with the largest (standardized) effect size is Log_BOAverageOpenWeek: if we increase ln(BOAverageOpenWeek) by 1, then ln(DomesticGross) goes up by 0.924, holding the other predictors fixed. Because both variables are log transformed, this is easiest to read as an elasticity: a 1% increase in BOAverageOpenWeek is associated with roughly a 0.92% increase in DomesticGross.
All the variables appear to be significant predictors of domestic gross (p < 0.05), apart from ProfitabilityHigh. This suggests that Profitability could perhaps be coded simply as "loss" or "profit", which might improve the model further; a sketch of that idea follows.
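As a quick hedged sketch of that suggestion (illustrative only; its results are not part of this report):

# Collapse Profitability to two levels and refit the final model
movies$MadeProfit <- factor(ifelse(movies$Profitability == "Loss", "Loss", "Profit"))
finalmodel2 <- lm(Log_DomesticGross ~ AudienceScore + TheatersOpenWeek +
                    Log_BOAverageOpenWeek + MadeProfit, data = movies)
ols_regress(finalmodel2) # compare with the table above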
4. We have a new film that has been released. What does your model suggest its domestic gross will be, and what is the 95% uncertainty interval on your estimate?
The movie is Kung Fu Panda 4:

- RottenTomatoes: 78
- AudienceScore: 88
- Opening week revenue: 5400 USD per cinema
- Number of theatres: 3300 cinemas showed the film in week 1
- Profitability: Average
We can now use our model to predict the domestic gross:
newdata <- data.frame (Name="Kung Fu Panda 4", RottenTomatoes=78,
AudienceScore=88, TheatersOpenWeek = 3300,
Log_BOAverageOpenWeek=log(5400),
Profitability = "Average")
newdata$Profitability <- as.factor(newdata$Profitability)
Ln_DomesticGross <- predict(finalmodel,newdata=newdata,
interval="prediction",level = 0.95)
# Back transform
DomesticGross <- exp(Ln_DomesticGross)
DomesticGross
## fit lwr upr
## 1 62.09127 39.11226 98.57078
Our model suggests that the film will make approximately $62.1 million USD during its year of release. In terms of the 95% prediction interval, we estimate that Kung Fu Panda 4 will make between $39.1 million and $98.6 million USD during its year of release.