Predicting IMDB ratings for movies

Python

I consider myself an enthusiastic, if not particularly knowledgeable, cinephile. When debating between multiple movies to watch, my partner and I always do the “IMDB test” - the movie we choose is the one with the highest average rating on IMDB (I plan to create a movie recommender system to replace this method at some point, but that’s another blog post). I would like to see how easy it is to predict a movie’s average rating, and which predictors have the most effect.

The dataset I used is the popular Movies dataset found on Kaggle. The aim is to try different models and find the one with the lowest root mean squared error (RMSE).
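For reference, RMSE is the square root of the mean squared difference between predicted and actual ratings - a minimal sketch of how it can be computed with scikit-learn, assuming hypothetical y_true and y_pred arrays:

import numpy as np
from sklearn import metrics

# RMSE: square root of the average squared prediction error
rmse = np.sqrt(metrics.mean_squared_error(y_true, y_pred))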

Once I have identified the best performing model, I will create a Flask app that lets users assemble their own movie from the given feature options (cast, crew etc.) and predicts the rating and the revenue it would generate. It would also be worth hooking this up to the OMDb API to get the most up-to-date movie information.

import pandas as pd
import numpy as np
import pickle

from matplotlib import pyplot as plt
from matplotlib_venn import venn2
import seaborn as sns

from sklearn import linear_model as lm, metrics, tree, ensemble, model_selection as ms, feature_selection, svm
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV

%matplotlib inline

pd.options.mode.chained_assignment = None 

np.random.seed(42)

import warnings
warnings.filterwarnings("ignore")
sns.set(rc={
    'figure.figsize': (12, 8),
    'font.size': 14
})

# Set palette
sns.set_palette("husl")
pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows', 5000)

Import and prepare the datasets I’ll be using

The Movies dataset comprises multiple CSVs whose columns contain nested, JSON-style data. I will extract the elements I want to use from each CSV and put them into a tabular format, then combine the resulting tables into one master table.

credits.csv

credits = pd.read_csv("/Users/jasminepengelly/Desktop/projects/predicting_movie/movie_revenue_predictor/credits.csv")
credits.head(2)
cast crew id
0 [{'cast_id': 14, 'character': 'Woody (voice)',... [{'credit_id': '52fe4284c3a36847f8024f49', 'de... 862
1 [{'cast_id': 1, 'character': 'Alan Parrish', '... [{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de... 8844

There is a lot of nesting, so I’ll use a for loop to separate out the relevant elements.

import ast  # safer than eval for parsing the stringified lists of dicts

all_casts = []
all_crews = []
for i in range(credits.shape[0]):
    # Parse each stringified list and tag every entry with its movie id
    # so the credits can be joined back to the metadata later
    cast = ast.literal_eval(credits['cast'][i])
    for x in cast:
        x['movie_id'] = credits['id'][i]
    crew = ast.literal_eval(credits['crew'][i])
    for x in crew:
        x['movie_id'] = credits['id'][i]
    all_casts.extend(cast)
    all_crews.extend(crew)

all_casts = pd.DataFrame(all_casts)
all_crews = pd.DataFrame(all_crews)
all_casts.head()
cast_id character credit_id gender id movie_id name order profile_path
0 14 Woody (voice) 52fe4284c3a36847f8024f95 2 31 862 Tom Hanks 0 /pQFoyx7rp09CJTAb932F2g8Nlho.jpg
1 15 Buzz Lightyear (voice) 52fe4284c3a36847f8024f99 2 12898 862 Tim Allen 1 /uX2xVf6pMmPepxnvFWyBtjexzgY.jpg
2 16 Mr. Potato Head (voice) 52fe4284c3a36847f8024f9d 2 7167 862 Don Rickles 2 /h5BcaDMPRVLHLDzbQavec4xfSdt.jpg
3 17 Slinky Dog (voice) 52fe4284c3a36847f8024fa1 2 12899 862 Jim Varney 3 /eIo2jVVXYgjDtaHoF19Ll9vtW7h.jpg
4 18 Rex (voice) 52fe4284c3a36847f8024fa5 2 12900 862 Wallace Shawn 4 /oGE6JqPP2xH4tNORKNqxbNPYi7u.jpg
all_crews.head()
credit_id department gender id job movie_id name profile_path
0 52fe4284c3a36847f8024f49 Directing 2 7879 Director 862 John Lasseter /7EdqiNbr4FRjIhKHyPPdFfEEEFG.jpg
1 52fe4284c3a36847f8024f4f Writing 2 12891 Screenplay 862 Joss Whedon /dTiVsuaTVTeGmvkhcyJvKp2A5kr.jpg
2 52fe4284c3a36847f8024f55 Writing 2 7 Screenplay 862 Andrew Stanton /pvQWsu0qc8JFQhMVJkTHuexUAa1.jpg
3 52fe4284c3a36847f8024f5b Writing 2 12892 Screenplay 862 Joel Cohen /dAubAiZcvKFbboWlj7oXOkZnTSu.jpg
4 52fe4284c3a36847f8024f61 Writing 0 12893 Screenplay 862 Alec Sokolow /v79vlRYi94BZUQnkkyznbGUZLjT.jpg

Cast

cast = all_casts[['name', 'order', 'gender', 'movie_id']]
cast.head()
name order gender movie_id
0 Tom Hanks 0 2 862
1 Tim Allen 1 2 862
2 Don Rickles 2 2 862
3 Jim Varney 3 2 862
4 Wallace Shawn 4 2 862

Crew

crew = all_crews[['name', 'job', 'gender', 'movie_id']]
crew.head()
name job gender movie_id
0 John Lasseter Director 2 862
1 Joss Whedon Screenplay 2 862
2 Andrew Stanton Screenplay 2 862
3 Joel Cohen Screenplay 2 862
4 Alec Sokolow Screenplay 0 862

This worked well - I now have the features I would like to use for my models in an easy-to-use format.

movies_metadata.csv

There is a lot of data in this CSV that is not relevant to my features, so I will extract only the columns I want to use.

movies_metadata = pd.read_csv("/Users/jasminepengelly/Desktop/projects/predicting_movie/movie_revenue_predictor/movies_metadata.csv", low_memory = False)
movies_metadata.head(2)
adult belongs_to_collection budget genres homepage id imdb_id original_language original_title overview popularity poster_path production_companies production_countries release_date revenue runtime spoken_languages status tagline title video vote_average vote_count
0 False {'id': 10194, 'name': 'Toy Story Collection', ... 30000000 [{'id': 16, 'name': 'Animation'}, {'id': 35, '... http://toystory.disney.com/toy-story 862 tt0114709 en Toy Story Led by Woody, Andy's toys live happily in his ... 21.946943 /rhIRbceoE9lR4veEXuwCC2wARtG.jpg [{'name': 'Pixar Animation Studios', 'id': 3}] [{'iso_3166_1': 'US', 'name': 'United States o... 1995-10-30 373554033.0 81.0 [{'iso_639_1': 'en', 'name': 'English'}] Released NaN Toy Story False 7.7 5415.0
1 False NaN 65000000 [{'id': 12, 'name': 'Adventure'}, {'id': 14, '... NaN 8844 tt0113497 en Jumanji When siblings Judy and Peter discover an encha... 17.015539 /vzmL6fP7aPKNKPRTFnZmiUfciyV.jpg [{'name': 'TriStar Pictures', 'id': 559}, {'na... [{'iso_3166_1': 'US', 'name': 'United States o... 1995-12-15 262797249.0 104.0 [{'iso_639_1': 'en', 'name': 'English'}, {'iso... Released Roll the dice and unleash the excitement! Jumanji False 6.9 2413.0

Again, there will be a lot of unpicking required here. I’ll intuitively select the features I would like to use and clean them - for example, dummifying the belongs_to_collection column.

metadata = movies_metadata[['title', 'id', 'budget',  'revenue', 'runtime', 'vote_average', 
                            'vote_count', 'belongs_to_collection']].copy()
# Convert belongs_to_collection into a binary flag: 1 if the film is part
# of a collection, 0 if not
metadata['belongs_to_collection'].fillna('0', inplace = True)
collection = metadata[['belongs_to_collection']].copy()
collection[collection != '0'] = '1'
metadata['belongs_to_collection'] = collection
metadata.head()
title id budget revenue runtime vote_average vote_count belongs_to_collection
0 Toy Story 862 30000000 373554033.0 81.0 7.7 5415.0 1
1 Jumanji 8844 65000000 262797249.0 104.0 6.9 2413.0 0
2 Grumpier Old Men 15602 0 0.0 101.0 6.5 92.0 1
3 Waiting to Exhale 31357 16000000 81452156.0 127.0 6.1 34.0 0
4 Father of the Bride Part II 11862 0 76578911.0 106.0 5.7 173.0 1

I also want to include the movie genre as a feature, but this is nested. I will extract the relevant column and transform it into a workable format. Since movies can belong to more than one genre, I will have to take this into account by creating a tmp column which will act as an indicator of whether a movie belongs to a given genre - this will come in handy later.

g = movies_metadata[['id', 'genres']]
genlist = []

for i in range(g.shape[0]):
    # Parse the stringified list of genre dicts, as with the credits above
    gen = ast.literal_eval(g['genres'][i])
    for each in gen:
        each['id'] = g['id'][i]
    genlist.extend(gen)
genre = pd.DataFrame(genlist)
genre = genre.rename(columns={'name':'genre'})
genre = genre[['id', 'genre']]
genre['tmp'] = 1
genre.head()
id genre tmp
0 862 Animation 1
1 862 Comedy 1
2 862 Family 1
3 8844 Adventure 1
4 8844 Fantasy 1

I will now create a pivot table out of my genre DataFrame, filling all the null cells (i.e. the ones that do not contain 1s) with 0. This means every movie’s genres are indicated by 1s and 0s.

I can now merge my genres data frame to my metadata DataFrame.

pivot = genre.pivot_table('tmp', 'id', 'genre', fill_value=0)
flattened = pd.DataFrame(pivot.to_records())
metadata_genre = pd.merge(metadata, flattened, on = 'id', how = 'left')
metadata_genre.head()
title id budget revenue runtime vote_average vote_count belongs_to_collection Action Adventure Animation Aniplex BROSTA TV Carousel Productions Comedy Crime Documentary Drama Family Fantasy Foreign GoHands History Horror Mardock Scramble Production Committee Music Mystery Odyssey Media Pulser Productions Rogue State Romance Science Fiction Sentai Filmworks TV Movie Telescene Film Group Productions The Cartel Thriller Vision View Entertainment War Western
0 Toy Story 862 30000000 373554033.0 81.0 7.7 5415.0 1 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 Jumanji 8844 65000000 262797249.0 104.0 6.9 2413.0 0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 Grumpier Old Men 15602 0 0.0 101.0 6.5 92.0 1 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 Waiting to Exhale 31357 16000000 81452156.0 127.0 6.1 34.0 0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 Father of the Bride Part II 11862 0 76578911.0 106.0 5.7 173.0 1 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

Now I’ll add the cast to the main dataframe, selecting the lead and the supporting actor as two of the features I would like to use.

lead = cast[cast['order'] == 0]
lead = lead[['movie_id', 'name']]
lead = lead.rename(columns={'movie_id':'id'})
lead = lead.rename(columns={'name':'lead'})
lead.head(2)
id lead
0 862 Tom Hanks
13 8844 Robin Williams
print("Number of rows before dropping those with null values:",len(metadata_genre))
metadata_genre.dropna(inplace = True)
print("Number of rows after dropping those with null values:",len(metadata_genre))
Number of rows before dropping those with null values: 45466
Number of rows after dropping those with null values: 42839
metadata_genre['id'] = metadata_genre['id'].astype('int64')
metadata_genre_lead = pd.merge(metadata_genre, lead, on = 'id', how = 'left')
supporting = cast[cast['order'] == 1]
supporting = supporting[['movie_id', 'name']]
supporting = supporting.rename(columns={'movie_id':'id'})
supporting = supporting.rename(columns={'name':'supporting'})
metadata_genre_lead_supporting = pd.merge(metadata_genre_lead, supporting, on = 'id', how = 'left')

Now I’ll add the director.

director = crew[crew['job'] == 'Director']
director = director[['movie_id', 'name']]
director = director.rename(columns={'movie_id':'id', 'name':'director'})
dataset = pd.merge(metadata_genre_lead_supporting, director, on = 'id', how = 'left')
print("Number of rows before dropping those with null values:",len(dataset))
dataset.dropna(inplace = True)
print("Number of rows after dropping those with null values:",len(dataset))
Number of rows before dropping those with null values: 47632
Number of rows after dropping those with null values: 37580

Time for some cleaning. I’ll remove all rows with zero revenue or budget, and those where the vote count was 50 or below. I’ll also convert each column to the appropriate data type and delete the duplicates.

final_dataset = dataset.loc[(dataset['budget'] != '0') & (dataset['revenue'] != 0)]
final_dataset['budget'] = final_dataset['budget'].astype('float64')
final_dataset['belongs_to_collection'] = final_dataset['belongs_to_collection'].astype('int64')
final_dataset = final_dataset[final_dataset['vote_count'] > 50]
print("Before duplicates dropped:", len(final_dataset))
final_dataset.drop_duplicates(inplace=True)
print("After duplicates dropped:", len(final_dataset))
Before duplicates dropped: 4676
After duplicates dropped: 4632
final_dataset.head()
title id budget revenue runtime vote_average vote_count belongs_to_collection Action Adventure Animation Aniplex BROSTA TV Carousel Productions Comedy Crime Documentary Drama Family Fantasy Foreign GoHands History Horror Mardock Scramble Production Committee Music Mystery Odyssey Media Pulser Productions Rogue State Romance Science Fiction Sentai Filmworks TV Movie Telescene Film Group Productions The Cartel Thriller Vision View Entertainment War Western lead supporting director
0 Toy Story 862 30000000.0 373554033.0 81.0 7.7 5415.0 1 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 Tom Hanks Tim Allen John Lasseter
1 Jumanji 8844 65000000.0 262797249.0 104.0 6.9 2413.0 0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 Robin Williams Jonathan Hyde Joe Johnston
5 Heat 949 60000000.0 187436818.0 170.0 7.7 1886.0 0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 Al Pacino Robert De Niro Michael Mann
8 Sudden Death 9091 35000000.0 64350171.0 106.0 5.5 174.0 0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 Jean-Claude Van Damme Powers Boothe Peter Hyams
9 GoldenEye 710 58000000.0 352194034.0 130.0 6.6 1194.0 1 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 Pierce Brosnan Sean Bean Martin Campbell

Dealing with multiple directors

Though I managed to drop the duplicates above, I can see that some films have multiple directors. This leaves duplicated films that weren’t removed above, because their director columns differ.

My original intention was to dummify the directors and merge them with the original data frame. However, the dataset became too large to process on my laptop, so I’ll try again with the least frequently appearing directors removed to make the dataset more manageable.

final_dataset['dir_count'] = final_dataset.groupby('director')['director'].transform('count')
final_dataset = final_dataset[final_dataset["dir_count"] >= 5]
final_dataset_wo_dir = final_dataset.drop("director", axis=1).drop_duplicates(keep="first")
directors = final_dataset[["id", "director"]]

I’ll now export the final_dataset_wo_dir for use in my next analysis, which will require the same cleaned features.

final_dataset_wo_dir.to_csv("movies_wo_dir.csv")

Exploratory data analysis

Now it’s time to look at all of my features, keeping an eye out for things like multicollinearity that could affect my model.

First, I’ll define my predictors (without director) and my response variable.

X = ['budget', 'runtime', 'vote_count', 'belongs_to_collection', 'Action', 'Adventure', 
              'Animation', 'Aniplex', 'BROSTA TV', 'Carousel Productions', 'Comedy', 'Crime', 'Documentary', 'Drama',
              'Family', 'Fantasy', 'Foreign', 'GoHands', 'History', 'Horror', 'Mardock Scramble Production Committee',
              'Music', 'Mystery', 'Odyssey Media', 'Pulser Productions', 'Rogue State', 'Romance', 'Science Fiction',
              'Sentai Filmworks', 'TV Movie', 'Telescene Film Group Productions', 'The Cartel', 'Thriller', 
              'Vision View Entertainment', 'War', 'Western', 'lead', 'supporting', 'revenue']

y = 'vote_average'

Now I’ll check for correlation between the predictors.

sns.heatmap(final_dataset_wo_dir.drop(['title', 'id'], axis=1).corr(), vmin=-1, vmax=1, center=0, cmap=sns.diverging_palette(10, 220, sep=80, n=7))

Correlation Matrix

vote_average, my response variable, isn’t highly correlated with any other variable, so there are not currently any stand-out features that I think will be particularly predictive.

Within my features, there are four stronger correlations: revenue and budget, revenue and vote_count, vote_count and budget, and Family and Animation.

I’ll look at these relationships in greater detail below.
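For reference, the strongest pairs can also be surfaced programmatically by ranking the absolute off-diagonal correlations - a minimal sketch (corr_pairs is a name introduced here, not part of the original analysis):

corr = final_dataset_wo_dir.drop(['title', 'id'], axis=1).corr()
# Keep each pair once via the upper triangle, then rank by absolute value
corr_pairs = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1)).stack()
corr_pairs.reindex(corr_pairs.abs().sort_values(ascending=False).index).head()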

revenue and budget

sns.regplot(x='revenue', y='budget', data=final_dataset_wo_dir)
plt.title('Revenue vs Budget')
plt.xlabel('Revenue')
plt.ylabel('Budget')

Revenue vs Budget

Since the points are very clustered together, I will visualise this on a double-log scale to see if I can get a better view of the relationship.

rev_vs_bug = sns.regplot(x='revenue', y='budget', data=final_dataset_wo_dir)
plt.title('Revenue vs Budget')
plt.xlabel('Revenue')
plt.ylabel('Budget')
rev_vs_bug.set(xscale="log", yscale="log")

Revenue vs Budget double log scale

The correlation coefficient between revenue and budget is strong at 0.69, although the relationship is not perfectly linear. It also makes intuitive sense that movies with larger budgets attract larger audiences.

revenue and vote_count

sns.regplot(x='revenue', y='vote_count', data=final_dataset_wo_dir)
plt.title('Revenue vs Vote')
plt.xlabel('Revenue')
plt.ylabel('Vote')

Revenue vs Vote Count

rev_vs_vote = sns.regplot(x='revenue', y='vote_count', data=final_dataset_wo_dir)
plt.title('Revenue vs Vote')
plt.xlabel('Revenue')
plt.ylabel('Vote')
rev_vs_vote.set(xscale="log", yscale="log")

Revenue vs Vote Count double log

Again, the relationship is better illustrated on a double-log scale and is not perfectly linear. The correlation coefficient between revenue and vote_count is even stronger at 0.74. As with revenue and budget, the larger the audience, the more votes can be expected.

vote_count and budget

sns.regplot(x='budget', y='vote_count', data=final_dataset_wo_dir)
plt.title('Vote Count vs Budget')
plt.xlabel('Budget')
plt.ylabel('Vote Count')

Vote Count vs Budget

votecount_vs_budget = sns.regplot(x='budget', y='vote_count', data=final_dataset_wo_dir)
plt.title('Vote Count vs Budget')
plt.xlabel('Budget')
plt.ylabel('Vote Count')
votecount_vs_budget.set(xscale="log", yscale="log")

Vote Count vs Budget

Although the correlation coefficient here is a more moderate 0.52, the relationship is again not exactly linear. Films with a larger production budget tend to generate more votes, which makes sense.

Family and Animation

# Both
both = len(final_dataset_wo_dir[(final_dataset_wo_dir['Family'] == 1) & (final_dataset_wo_dir['Animation'] == 1)])

# Animation
ani = len(final_dataset_wo_dir[final_dataset_wo_dir['Animation'] == 1])

# Family
fam = len(final_dataset_wo_dir[final_dataset_wo_dir['Family'] == 1])
venn2(subsets = (fam-both, ani-both, both), 
      set_labels = ('Family', 'Animation'))
plt.show()

Venn Diagram

There are, in total, 78 films of genre ‘Animation’ and 186 films of genre ‘Family’. While most Animation films are also classified as Family, the reverse does not hold. Since that is the case, I will keep both predictors in my models.

The highest and lowest rated films

Highest rated

final_dataset_wo_dir[['title', 'vote_average']].sort_values(by = 'vote_average', ascending = False).head(10)
title vote_average
849 The Godfather 8.5
13316 The Dark Knight 8.3
537 Schindler's List 8.3
5766 Spirited Away 8.3
2985 Fight Club 8.3
1199 One Flew Over the Cuckoo's Nest 8.3
300 Pulp Fiction 8.3
1225 The Godfather: Part II 8.3
1223 Psycho 8.3
297 Leon: The Professional 8.2

Lowest rated

final_dataset_wo_dir[['title', 'vote_average']].sort_values(by = 'vote_average', ascending = True).head(10)
title vote_average
13786 Disaster Movie 3.1
12287 Epic Movie 3.2
6795 Gigli 3.5
11468 Date Movie 3.6
13179 Meet the Spartans 3.8
14420 Street Fighter: The Legend of Chun-Li 3.9
19506 Jack and Jill 4.0
22997 The Canyons 4.1
30203 The Boy Next Door 4.1
1557 Speed 2: Cruise Control 4.1

Dummy variables

As well as dummifying the variables in final_dataset_wo_dir, I also need to add the directors manually. I’ll do this by creating a pivot table with the directors as columns, the movie id as the rows and 0 or 1 as the values.

Since the feature importance of dummy variables relies on the baseline of the first variable that is dropped, I’ll make a note of who that is in lead, supporting and director for later analysis.

dropped_dummies = final_dataset_wo_dir[['lead', 'supporting']].head(1)
dropped_dummies
lead supporting
0 Tom Hanks Tim Allen
dummies = pd.get_dummies(final_dataset_wo_dir, columns=['lead', 'supporting'], drop_first=True)
dummies.head()
(output truncated - the dummified frame contains one 0/1 indicator column per lead and supporting actor, alongside the original numeric and genre columns)

5 rows × 1847 columns

# Build a 0/1 pivot with one row per film and one column per director
directors.drop_duplicates(inplace=True)
directors["value"] = 1
unique_dir_names = list(set(directors.director))
master = pd.DataFrame(columns=unique_dir_names)

for n in list(set(directors.id)):
    t = directors[directors["id"] == n]
    pivoted = t.pivot(index="id", columns="director", values="value")
    master = master.append(pivoted)  # NB: newer pandas versions replace .append with pd.concat
master.fillna(0, inplace=True)
master.reset_index(level=0, inplace=True)
final = pd.merge(master, dummies, left_on = 'index', right_on = 'id')
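As an aside, the same 0/1 table can probably be built in one step with pd.crosstab, which avoids the row-by-row append (a sketch only - director_dummies and final_alt are names introduced here, and the counts are all 0 or 1 because duplicates were dropped above):

# One-shot alternative: cross-tabulate film ids against director names
director_dummies = pd.crosstab(directors['id'], directors['director']).reset_index()
final_alt = pd.merge(director_dummies, dummies, on='id')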

Finally, I’ll drop the first director as a reference, as well as the index and id columns used for joining. The director I am dropping is “Aaron Seltzer”, which is worth noting for when I interpret the coefficients.

final.drop(["id", "index", "Aaron Seltzer", "dir_count", "title"], axis=1, inplace=True)
final.drop_duplicates(inplace=True)
master.to_csv("director_dummies.csv")

Pre-processing

Since the final product of this modelling is a Flask app that allows someone to input factors about a film before it’s produced and get a predicted rating and revenue, some features will have to be dropped. For example, a user would not know the vote_count before the film is created. Some of the features I am removing are the most correlated with the response variable, so I will lose some predictive power.

I’ll begin by defining my train/test split. Then I’ll standardise the remaining predictor variables, since it’s good practice when working with linear regressions. I’ll fit and transform the scaler on my training set and only transform my testing set.

X = final.drop(["vote_average", "revenue", "vote_count"], axis=1)
y = final.vote_average
scaler = StandardScaler()
X_train, X_test, y_train, y_test = ms.train_test_split(X, y, test_size=0.3, random_state=42)
X_train = pd.DataFrame(scaler.fit_transform(X_train), columns=X.columns)
X_test = pd.DataFrame(scaler.transform(X_test), columns=X.columns)
print("Length of training sets:",len(X_train), len(X_test))
print("Length of testing sets:",len(y_train), len(y_test))
Length of training sets: 1296 556
Length of testing sets: 1296 556

Modelling

Baseline score

I need a baseline score against which to compare all my models moving forward. This is the score one would get by simply predicting the mean value of y for every observation. If a model outperforms this baseline, I know it is doing something useful.

# Predict the training-set mean for every test observation
y_pred_mean = [y_train.mean()] * len(y_test)

print("Baseline score RMSE: ",'{0:0.2f}'.format(np.sqrt(metrics.mean_squared_error(y_test, y_pred_mean))))
Baseline score RMSE:  0.83

The baseline score is an RMSE of 0.83, meaning the mean-only prediction is off by about 0.83 marks out of 10 on average. I’ll use this as a benchmark for the performance of my models.

Ultimately, I will choose the model with the best cross-validated score of all, as this will be the one that generalises well. I will also take into account the training and test scores, since these indicate over- or underfitting.

Function to generate model scores

Since I’ll be trying out many different models, I’ll build functions that return all the relevant information for efficiency: one for simple models and another for models using regularisation.

# Function to return simple model metrics
def get_model_metrics(X_train, y_train, X_test, y_test, model, parametric=True):
    """This function takes the train-test splits as arguments, as well as the algorithm 
    being used, and returns the training score, the test score (both RMSE), the 
    cross-validated scores and the mean cross-validated score. It also returns the appropriate 
    feature importances depending on whether the optional argument 'parametric' is equal to 
    True or False."""
    
    model.fit(X_train, y_train)
    train_pred = np.around(model.predict(X_train),1)
    test_pred = np.around(model.predict(X_test),1)
    
    print('Training RMSE:', '{0:0.2f}'.format(np.sqrt(metrics.mean_squared_error(y_train, train_pred))))
    print('Testing RMSE:', '{0:0.2f}'.format(np.sqrt(metrics.mean_squared_error(y_test, test_pred))))
    cv_scores = -ms.cross_val_score(model, X_train, y_train, cv=5, scoring='neg_mean_squared_error')
    print('Cross-validated RMSEs:', np.sqrt(cv_scores))
    print('Mean cross-validated RMSE:', '{0:0.2f}'.format(np.sqrt(np.mean(cv_scores))))
    
    if parametric == True:
        print(pd.DataFrame(list(zip(X_train.columns, model.coef_, abs(model.coef_))), 
                 columns=['Feature', 'Coef', 'Abs Coef']).sort_values('Abs Coef', ascending=False).head(10))
    else:
        print(pd.DataFrame(list(zip(X_train.columns, model.feature_importances_)), 
                 columns=['Feature', 'Importance']).sort_values('Importance', ascending=False).head(10))
    
    return model
# Function to return regularised model metrics
def regularised_model_metrics(X_train, y_train, X_test, y_test, model, grid_params, parametric=True):
    """This function takes the train-test splits as arguments, as well as the algorithm being 
    used and the parameters, and returns the best cross-validated training score, the test 
    score, the best performing model and its parameters, and the feature importances."""
    
    # NB: error_score only sets the value assigned when a fit fails; the CV metric
    # itself is controlled by the scoring parameter, so best_score_ below is the
    # estimator's default score (R^2 for regressors) rather than a negative MSE
    gridsearch = GridSearchCV(model,
                              grid_params,
                              n_jobs=-1, cv=5, verbose=1, error_score='neg_mean_squared_error')
    
    gridsearch.fit(X_train, y_train)
    print('Best parameters:', gridsearch.best_params_)
    print('Cross-validated training score:', '{0:0.2f}'.format(abs(gridsearch.best_score_)))
    best_model = gridsearch.best_estimator_
    print('Testing RMSE:', '{0:0.2f}'.format(np.sqrt(metrics.mean_squared_error(y_test, best_model.predict(X_test)))))
    
    if parametric == True:
        print(pd.DataFrame(list(zip(X_train.columns, best_model.coef_, abs(best_model.coef_))), 
                 columns=['Feature', 'Coef', 'Abs Coef']).sort_values('Abs Coef', ascending=False).head(10))
    else:
        print(pd.DataFrame(list(zip(X_train.columns, best_model.feature_importances_)), 
                 columns=['Feature', 'Importance']).sort_values('Importance', ascending=False).head(10))
    
    return best_model

Linear regression

Simple
lm_simple = get_model_metrics(X_train, y_train, X_test, y_test, lm.LinearRegression())
lm_simple
Training RMSE: 0.16
Testing RMSE: 28456150374210.60
Cross-validated RMSEs: [8.93832333e+12 2.50381131e+13 3.51881686e+13 1.54939005e+13
 2.66140868e+13]
Mean cross-validated RMSE: 24055679197777.84
                          Feature          Coef      Abs Coef
287             lead_Aileen Quinn  1.236628e+13  1.236628e+13
563             lead_J.J. Johnson  7.765516e+12  7.765516e+12
98                Jason Friedberg -7.324091e+12  7.324091e+12
33                Brian Helgeland -7.206760e+12  7.206760e+12
595        lead_Jason Schwartzman -5.903357e+12  5.903357e+12
1807   supporting_Olivia Williams  5.676954e+12  5.676954e+12
276                    The Cartel  4.742263e+12  4.742263e+12
44                 Clyde Geronimi -4.666503e+12  4.666503e+12
1954  supporting_Shannyn Sossamon  4.071398e+12  4.071398e+12
284             lead_Adam Sandler  4.013993e+12  4.013993e+12

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

The test and mean cross-validated scores are both terrible - way above baseline - while the passable training score indicates huge overfitting. This is not too surprising for an unregularised model with this many dummy columns, so introducing some regularisation should help.
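For context, ridge regression minimises the least-squares loss plus an L2 penalty on the coefficients:

    sum((y - Xw)^2) + alpha * ||w||^2

Larger values of alpha shrink the coefficients towards zero, which tames exactly this kind of overfitting. The lasso (further below) swaps in an L1 penalty, alpha * ||w||_1, which can zero coefficients out entirely, and ElasticNet mixes the two penalties via its l1_ratio parameter.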

Regularised - Ridge

For every model I use regularisation on, I will get a list of the parameters so I can select which ones to gridsearch.

ridge = lm.Ridge()
list(ridge.get_params())
['alpha',
 'copy_X',
 'fit_intercept',
 'max_iter',
 'normalize',
 'random_state',
 'solver',
 'tol']
ridge_params = {'alpha': np.logspace(-10, 10, 10),
               'fit_intercept': [True, False],
               'solver': ['auto', 'svd', 'cholesky', 'lsqr', 'sparse_cg', 'sag', 'saga']}

ridge_model = regularised_model_metrics(X_train, y_train, X_test, y_test, ridge, ridge_params)
Fitting 5 folds for each of 140 candidates, totalling 700 fits


[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:   34.3s
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:  3.3min
[Parallel(n_jobs=-1)]: Done 442 tasks      | elapsed:  8.1min
[Parallel(n_jobs=-1)]: Done 700 out of 700 | elapsed:  9.6min finished


Best parameters: {'alpha': 2154.4346900318865, 'fit_intercept': True, 'solver': 'sparse_cg'}
Cross-validated training score: 0.20
Testing RMSE: 0.73
                      Feature      Coef  Abs Coef
247                   runtime  0.057142  0.057142
258                     Drama  0.035586  0.035586
255                    Comedy -0.035077  0.035077
246                    budget -0.034000  0.034000
249                    Action -0.025765  0.025765
25               Billy Wilder  0.024681  0.024681
468         lead_Eddie Murphy -0.024559  0.024559
187         Quentin Tarantino  0.024217  0.024217
41          Christopher Nolan  0.024203  0.024203
1594  supporting_Katie Holmes -0.021963  0.021963

The cross-validated training score and the test score here look a lot more promising - although the model is overfitting, the test score is still better than baseline. The coefficients make a bit more sense, but I’ll see if the other models reflect this.

Regularised - Lasso
lasso = lm.Lasso()
list(lasso.get_params())
['alpha',
 'copy_X',
 'fit_intercept',
 'max_iter',
 'normalize',
 'positive',
 'precompute',
 'random_state',
 'selection',
 'tol',
 'warm_start']
lasso_params = {'alpha': np.logspace(-10, 10, 10),
               'fit_intercept': [True, False]}

lasso_model = regularised_model_metrics(X_train, y_train, X_test, y_test, lasso, lasso_params)
Fitting 5 folds for each of 20 candidates, totalling 100 fits


[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:   23.2s


Best parameters: {'alpha': 0.07742636826811278, 'fit_intercept': True}
Cross-validated training score: 0.18
Testing RMSE: 0.75
                            Feature      Coef  Abs Coef
247                         runtime  0.204818  0.204818
246                          budget -0.100973  0.100973
258                           Drama  0.048917  0.048917
255                          Comedy -0.037615  0.037615
249                          Action -0.034441  0.034441
251                       Animation  0.028058  0.028058
25                     Billy Wilder  0.024367  0.024367
468               lead_Eddie Murphy -0.003947  0.003947
41                Christopher Nolan  0.003654  0.003654
1380  supporting_Herbert Grönemeyer  0.000000  0.000000


[Parallel(n_jobs=-1)]: Done 100 out of 100 | elapsed:   27.5s finished

The cross-validated training score is better here, although the test score is worse. The coefficients are also similar to those from the ridge regression.

Regularised - ElasticNet
elastic = lm.ElasticNet()
list(elastic.get_params())
['alpha',
 'copy_X',
 'fit_intercept',
 'l1_ratio',
 'max_iter',
 'normalize',
 'positive',
 'precompute',
 'random_state',
 'selection',
 'tol',
 'warm_start']
elastic_params = {'alpha': np.logspace(-10, 10, 10),
                 'l1_ratio': np.linspace(0.05, 0.95, 10),
                 'fit_intercept': [True, False]}

elastic_model = regularised_model_metrics(X_train, y_train, X_test, y_test, elastic, 
                                          elastic_params)
Fitting 5 folds for each of 200 candidates, totalling 1000 fits


[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:   37.9s
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:  1.6min
[Parallel(n_jobs=-1)]: Done 442 tasks      | elapsed:  4.6min
[Parallel(n_jobs=-1)]: Done 792 tasks      | elapsed:  5.0min
[Parallel(n_jobs=-1)]: Done 1000 out of 1000 | elapsed:  5.3min finished


Best parameters: {'alpha': 0.07742636826811278, 'fit_intercept': True, 'l1_ratio': 0.25}
Cross-validated training score: 0.29
Testing RMSE: 0.68
               Feature      Coef  Abs Coef
247            runtime  0.218141  0.218141
246             budget -0.162801  0.162801
255             Comedy -0.086776  0.086776
25        Billy Wilder  0.074144  0.074144
251          Animation  0.069016  0.069016
258              Drama  0.056174  0.056174
41   Christopher Nolan  0.054816  0.054816
249             Action -0.044436  0.044436
239       Wes Anderson  0.044365  0.044365
187  Quentin Tarantino  0.043444  0.043444

The best testing score yet, with less overfitting. All the linear regression models seem to have performed similarly. I’ll see how other models perform now.

Tree models

Simple decision tree
dt = get_model_metrics(X_train, y_train, X_test, y_test, tree.DecisionTreeRegressor(), 
                       parametric=False)
dt
Training RMSE: 0.00
Testing RMSE: 0.84
Cross-validated RMSEs: [0.84988687 0.79734114 0.79138703 0.78999536 0.85444556]
Mean cross-validated RMSE: 0.82
                   Feature  Importance
247                runtime    0.214021
246                 budget    0.212694
258                  Drama    0.027402
251              Animation    0.015236
259                 Family    0.015190
260                Fantasy    0.013420
67                Eli Roth    0.011142
250              Adventure    0.011036
272        Science Fiction    0.010173
248  belongs_to_collection    0.008883

DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=None, splitter='best')

I expected a single decision tree to overfit, and we can see here from the perfect training score that it has. The test RMSE is above the baseline, so I wouldn’t use this model.

Random forest
rf = get_model_metrics(X_train, y_train, X_test, y_test, ensemble.RandomForestRegressor(), 
                       parametric=False)
rf
Training RMSE: 0.27
Testing RMSE: 0.70
Cross-validated RMSEs: [0.65723342 0.62455729 0.64259098 0.67142504 0.66986168]
Mean cross-validated RMSE: 0.65
                   Feature  Importance
247                runtime    0.225637
246                 budget    0.219449
258                  Drama    0.014884
249                 Action    0.013076
255                 Comedy    0.012790
251              Animation    0.011984
250              Adventure    0.011676
248  belongs_to_collection    0.011356
259                 Family    0.010790
272        Science Fiction    0.009512

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
           oob_score=False, random_state=None, verbose=0, warm_start=False)

I believe this to be the best model so far - its mean cross-validated RMSE of 0.65 beats the comparable scores from the other models.

Gridsearched Random forest
rrf = ensemble.RandomForestRegressor()
list(rrf.get_params())
['bootstrap',
 'criterion',
 'max_depth',
 'max_features',
 'max_leaf_nodes',
 'min_impurity_decrease',
 'min_impurity_split',
 'min_samples_leaf',
 'min_samples_split',
 'min_weight_fraction_leaf',
 'n_estimators',
 'n_jobs',
 'oob_score',
 'random_state',
 'verbose',
 'warm_start']
rrf_params = {'bootstrap': [True, False],
             'max_depth': np.linspace(5, 50, 5),
             'min_samples_split': np.linspace(0.01, 1, 5),
             'n_estimators': [40, 50, 60]}

rrf_model = regularised_model_metrics(X_train, y_train, X_test, y_test, rrf, rrf_params,
                                     parametric=False)
Fitting 5 folds for each of 150 candidates, totalling 750 fits


[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:   13.6s
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:   57.2s
[Parallel(n_jobs=-1)]: Done 442 tasks      | elapsed:  2.3min
[Parallel(n_jobs=-1)]: Done 750 out of 750 | elapsed:  5.0min finished


Best parameters: {'bootstrap': True, 'max_depth': 50.0, 'min_samples_split': 0.01, 'n_estimators': 50}
Cross-validated training score: 0.34
Testing RMSE: 0.68
                   Feature  Importance
247                runtime    0.243007
246                 budget    0.232748
258                  Drama    0.018196
251              Animation    0.014074
249                 Action    0.011856
264                 Horror    0.011726
259                 Family    0.010169
255                 Comedy    0.009754
250              Adventure    0.009498
248  belongs_to_collection    0.007684

I expected this model to give one of the best performances, so I am not surprised to see it produce a strong cross-validated score. The test score is slightly better than that of the simple random forest, making this the best model so far.

Bagged decision trees
bagdt = ensemble.BaggingRegressor()
bagdt.fit(X_train, y_train)

print('Training RMSE:', '{0:0.2f}'.format(np.sqrt(metrics.mean_squared_error(y_train, bagdt.predict(X_train)))))
print('Testing RMSE:', '{0:0.2f}'.format(np.sqrt(metrics.mean_squared_error(y_test, bagdt.predict(X_test)))))
cv_scores = -ms.cross_val_score(bagdt, X_train, y_train, cv=5, scoring='neg_mean_squared_error')
# NB: these cross-validated scores are MSEs - no square root has been taken
print('Cross-validated MSEs:', cv_scores)
print('Mean cross-validated MSE:', '{0:0.2f}'.format(np.mean(cv_scores)))
Training RMSE: 0.28
Testing RMSE: 0.72
Cross-validated MSEs: [0.44400885 0.43938494 0.3979722  0.473639   0.46053475]
Mean cross-validated MSE: 0.44

This performs very well - a mean cross-validated MSE of 0.44 corresponds to an RMSE of roughly 0.66 - although the gridsearched random forest outperforms it.

Support Vector Machine

LinearSVR
lin = svm.LinearSVR() 

lin_params = {
    'C': np.logspace(-3, 2, 5),
    'loss': ['epsilon_insensitive','squared_epsilon_insensitive'],
    'fit_intercept': [True,False],
    'max_iter': [1000]
}

lin_model = regularised_model_metrics(X_train, y_train, X_test, y_test, lin, lin_params)
Fitting 5 folds for each of 20 candidates, totalling 100 fits


[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:    9.9s
[Parallel(n_jobs=-1)]: Done 100 out of 100 | elapsed:  1.5min finished


Best parameters: {'C': 0.001, 'fit_intercept': True, 'loss': 'epsilon_insensitive', 'max_iter': 1000}
Cross-validated training score: 76.13
Testing RMSE: 5.23
                              Feature          Coef      Abs Coef
179                    Peter Farrelly -2.351553e-15  2.351553e-15
793                  lead_Naomi Watts  1.901257e-15  1.901257e-15
90                     Hayao Miyazaki -1.775381e-15  1.775381e-15
52                      David Frankel  1.720209e-15  1.720209e-15
27                     Bobby Farrelly  1.552279e-15  1.552279e-15
92                        J.J. Abrams  1.449619e-15  1.449619e-15
1148  supporting_Catherine Zeta-Jones  1.289997e-15  1.289997e-15
1900        supporting_Robin Williams  1.251860e-15  1.251860e-15
67                           Eli Roth  1.186930e-15  1.186930e-15
102               Jean-Jacques Annaud  1.105900e-15  1.105900e-15

The scores here are terrible. This is my first foray into support vector machines, and I may not be optimising them correctly. However, these scores - along with the time the models take to run and the vanishingly small coefficients returned - make me feel they may not be the most appropriate models for this scenario.

RBF
rbf = svm.SVR(kernel='rbf')

rbf_params = {
    'C': np.logspace(-3, 3, 5),
    'gamma': np.logspace(-4, 1, 5),
    'kernel': ['rbf']}

rbf = GridSearchCV(rbf, rbf_params, n_jobs=-1, cv=5, verbose=1, error_score='neg_mean_squared_error')
rbf.fit(X_train, y_train)
print('Best parameters:', rbf.best_params_)
print('Cross-validated training score:', '{0:0.2f}'.format(abs(rbf.best_score_)))
print('Testing RMSE:', '{0:0.2f}'.format(np.sqrt(metrics.mean_squared_error(y_test, rbf.best_estimator_.predict(X_test)))))
Fitting 5 folds for each of 25 candidates, totalling 125 fits


[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:  1.0min
[Parallel(n_jobs=-1)]: Done 125 out of 125 | elapsed:  3.0min finished


Best parameters: {'C': 1.0, 'gamma': 0.0001, 'kernel': 'rbf'}
Cross-validated training score: 0.14
Testing RMSE: 0.76

This gives by far the best cross-validated training score so far - although the test score isn’t the best, the model appears to generalise reasonably well.

Poly
poly = svm.SVR(kernel='poly')

poly_params = {
    'C': np.linspace(0.01, 0.2, 10),
    'gamma': np.logspace(-5, 2, 10),
    'degree': [2]}

poly = GridSearchCV(poly, poly_params, n_jobs=-1, cv=5, verbose=1, error_score='neg_mean_squared_error')
poly.fit(X_train, y_train)
print('Best parameters:', poly.best_params_)
print('Cross-validated training score:', '{0:0.2f}'.format(abs(poly.best_score_)))
print('Testing RMSE:', '{0:0.2f}'.format(np.sqrt(metrics.mean_squared_error(y_test, poly.best_estimator_.predict(X_test)))))
Fitting 5 folds for each of 100 candidates, totalling 500 fits


[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:  1.0min
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:  4.4min
[Parallel(n_jobs=-1)]: Done 442 tasks      | elapsed: 10.2min
[Parallel(n_jobs=-1)]: Done 500 out of 500 | elapsed: 11.5min finished


Best parameters: {'C': 0.09444444444444444, 'degree': 2, 'gamma': 0.01291549665014884}
Cross-validated training score: 0.08
Testing RMSE: 0.80

This model has the best cross-validated training score, so although its test score isn’t the greatest, it’s the one I will select for my Flask app.

Conclusion

Now the plan is to refit the chosen model on the full scaled dataset, confirm its cross-validated RMSE, export the features for reuse, and pickle the model for the Flask app.

X_scaled = pd.concat([X_train, X_test])
y_concat = pd.concat([y_train, y_test])
# Export the scaled features for reuse, then refit the chosen model on the
# full dataset and sanity-check its cross-validated RMSE
X_scaled.to_csv("X_ratings.csv")
final_model = poly.best_estimator_
final_model.fit(X_scaled, y_concat)
cv_scores = -ms.cross_val_score(final_model, X_scaled, y_concat, cv=5, scoring='neg_mean_squared_error')
print('Cross-validated RMSEs:', np.sqrt(cv_scores))
print('Mean cross-validated RMSE:', '{0:0.2f}'.format(np.mean(np.sqrt(cv_scores))))
Cross-validated RMSEs: [0.74717048 0.75255535 0.76234458 0.73663802 0.78108789]
Mean cross-validated RMSE: 0.76
with open('model_ratings.pkl', 'wb') as f:
    pickle.dump(final_model, f)
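Looking ahead to the Flask app, here is a minimal sketch of how the pickled model might be served. Everything here is hypothetical - the route, the payload format and the use of the exported X_ratings.csv to recover the training column order - and a real app would also need to apply the same scaling to user input:

import pickle

import pandas as pd
from flask import Flask, jsonify, request

app = Flask(__name__)

# Load the pickled model and recover the training column order
with open('model_ratings.pkl', 'rb') as f:
    model = pickle.load(f)
columns = pd.read_csv('X_ratings.csv', index_col=0).columns

@app.route('/predict', methods=['POST'])
def predict():
    # Expect a JSON object of {feature_name: value}; unspecified features default to 0
    payload = request.get_json()
    row = pd.DataFrame([{col: payload.get(col, 0) for col in columns}])
    return jsonify({'predicted_rating': round(float(model.predict(row)[0]), 1)})

if __name__ == '__main__':
    app.run(debug=True)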