Predicting profit generated by movies
Python
In my last post, I tested several models that predicted movie ratings. This time, I’ll try to predict the gross profit a movie might generate from the same features. The aim is to create a Flask app that will allow a user to see the predicted rating and gross profit for a movie they create based on certain feature choices (i.e. cast, director, genre).
Since all the data cleaning was done in my last blog post, I can move straight into the exploratory data analysis.
Set up and import data
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
import pickle
from sklearn import linear_model as lm, metrics, tree, ensemble, svm
from sklearn.model_selection import train_test_split, KFold, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler
%matplotlib inline
pd.options.mode.chained_assignment = None
pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows', 5000)
np.random.seed(42)
sns.set(rc={
'figure.figsize': (12, 8),
'font.size': 14
})
# Set palette
sns.set_palette("husl")
movies = pd.read_csv("/Users/jasminepengelly/Desktop/projects/predicting_movie/movies_wo_dir.csv")
movies.drop("Unnamed: 0", axis=1, inplace=True)
movies["gross_profit"] = movies["revenue"] - movies["budget"]
movies.head()
 | title | id | budget | revenue | runtime | vote_average | vote_count | belongs_to_collection | Action | Adventure | Animation | Aniplex | BROSTA TV | Carousel Productions | Comedy | Crime | Documentary | Drama | Family | Fantasy | Foreign | GoHands | History | Horror | Mardock Scramble Production Committee | Music | Mystery | Odyssey Media | Pulser Productions | Rogue State | Romance | Science Fiction | Sentai Filmworks | TV Movie | Telescene Film Group Productions | The Cartel | Thriller | Vision View Entertainment | War | Western | lead | supporting | dir_count | gross_profit
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Toy Story | 862 | 30000000.0 | 373554033.0 | 81.0 | 7.7 | 5415.0 | 1 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Tom Hanks | Tim Allen | 5 | 343554033.0 |
1 | Jumanji | 8844 | 65000000.0 | 262797249.0 | 104.0 | 6.9 | 2413.0 | 0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Robin Williams | Jonathan Hyde | 7 | 197797249.0 |
2 | Heat | 949 | 60000000.0 | 187436818.0 | 170.0 | 7.7 | 1886.0 | 0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | Al Pacino | Robert De Niro | 10 | 127436818.0 |
3 | Sudden Death | 9091 | 35000000.0 | 64350171.0 | 106.0 | 5.5 | 174.0 | 0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | Jean-Claude Van Damme | Powers Boothe | 10 | 29350171.0 |
4 | GoldenEye | 710 | 58000000.0 | 352194034.0 | 130.0 | 6.6 | 1194.0 | 1 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | Pierce Brosnan | Sean Bean | 8 | 294194034.0 |
Exploratory data analysis
From my previous analysis, I know there is a relatively high correlation between budget, revenue and vote_count. This time around, I’ll focus on the relationships between gross profit and the other variables.
First, I’ll define the variables that I’m using.
X = ['budget', 'runtime', 'vote_count', 'belongs_to_collection', 'Action', 'Adventure',
'Animation', 'Aniplex', 'BROSTA TV', 'Carousel Productions', 'Comedy', 'Crime', 'Documentary', 'Drama',
'Family', 'Fantasy', 'Foreign', 'GoHands', 'History', 'Horror', 'Mardock Scramble Production Committee',
'Music', 'Mystery', 'Odyssey Media', 'Pulser Productions', 'Rogue State', 'Romance', 'Science Fiction',
'Sentai Filmworks', 'TV Movie', 'Telescene Film Group Productions', 'The Cartel', 'Thriller',
'Vision View Entertainment', 'War', 'Western', 'lead', 'supporting', 'vote_average']
y = 'gross_profit'
sns.heatmap(movies.drop(['title', 'id', 'dir_count'], axis=1).corr(), vmin=-1, vmax=1, center=0, cmap=sns.diverging_palette(10, 220, sep=80, n=7))
[Correlation heatmap of the numeric movie features, including gross_profit]
My response variable, gross_profit, is highly correlated with budget, revenue and vote_count. budget and revenue make sense, since all three are directly related, but vote_count is less intuitive - perhaps popular films simply attract more votes, so films with a high budget or large revenue end up with higher vote counts.
Within my features, I already identified that the correlations between Family and Animation, and between vote_count and budget, are not strong enough to worry about. This bodes well for my models - there should be no serious multicollinearity, and no need for PCA here.
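If you want the exact numbers behind those claims, a quick spot-check (not part of the original notebook, but only standard pandas) pulls them straight from the dataframe:
# Numeric spot-check of the pairwise correlations discussed above
movies[['Family', 'Animation', 'vote_count', 'budget', 'gross_profit']].corr().round(2)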
Highest revenue films
high_rev = movies[['title', 'revenue']].sort_values(by = 'revenue', ascending = False).head(10)
high_rev['revenue'] = high_rev['revenue'].map('${:,.2f}'.format)
high_rev
 | title | revenue
---|---|---|
1439 | Avatar | $2,787,965,087.00 |
1766 | Star Wars: The Force Awakens | $2,068,223,624.00 |
328 | Titanic | $1,845,034,188.00 |
1778 | Furious 7 | $1,506,249,360.00 |
1540 | Harry Potter and the Deathly Hallows: Part 2 | $1,342,000,000.00 |
1875 | Beauty and the Beast | $1,262,886,337.00 |
1881 | The Fate of the Furious | $1,238,764,765.00 |
1535 | Transformers: Dark of the Moon | $1,123,746,996.00 |
994 | The Lord of the Rings: The Return of the King | $1,118,888,979.00 |
1614 | Skyfall | $1,108,561,013.00 |
Most profitable films
high_gp = movies[['title', "gross_profit"]].sort_values(by = "gross_profit", ascending = False).head(10)
high_gp["gross_profit"] = high_gp["gross_profit"].map('${:,.2f}'.format)
high_gp
 | title | gross_profit
---|---|---|
1439 | Avatar | $2,550,965,087.00 |
1766 | Star Wars: The Force Awakens | $1,823,223,624.00 |
328 | Titanic | $1,645,034,188.00 |
1778 | Furious 7 | $1,316,249,360.00 |
1540 | Harry Potter and the Deathly Hallows: Part 2 | $1,217,000,000.00 |
1875 | Beauty and the Beast | $1,102,886,337.00 |
994 | The Lord of the Rings: The Return of the King | $1,024,888,979.00 |
1881 | The Fate of the Furious | $988,764,765.00 |
1535 | Transformers: Dark of the Moon | $928,746,996.00 |
1614 | Skyfall | $908,561,013.00 |
Lowest revenue films
low_rev = movies[['title', 'revenue']].sort_values(by = 'revenue', ascending = True).head(10)
low_rev['revenue'] = low_rev['revenue'].map('${:,.2f}'.format)
low_rev
 | title | revenue
---|---|---|
576 | Angela's Ashes | $13.00 |
1271 | Death at a Funeral | $46.00 |
628 | The Idiots | $7,235.00 |
1604 | 5 Days of War | $17,479.00 |
590 | City Lights | $19,181.00 |
1464 | Valhalla Rising | $30,638.00 |
480 | Following | $48,482.00 |
1678 | The Canyons | $56,825.00 |
1628 | Byzantium | $89,237.00 |
1796 | Manglehorn | $143,101.00 |
Biggest losses
low_gp = movies[['title', "gross_profit"]].sort_values(by = "gross_profit", ascending = True).head(10)
low_gp["gross_profit"] = low_gp["gross_profit"].map('${:,.2f}'.format)
low_gp
 | title | gross_profit
---|---|---|
1673 | The Lone Ranger | $-165,710,090.00 |
1011 | The Alamo | $-119,180,039.00 |
1884 | Valerian and the City of a Thousand Planets | $-107,447,384.00 |
513 | The 13th Warrior | $-98,301,101.00 |
7 | Cutthroat Island | $-87,982,678.00 |
1365 | Australia | $-80,445,998.00 |
578 | Supernova | $-75,171,919.00 |
1080 | A Sound of Thunder | $-74,010,360.00 |
1128 | The Great Raid | $-69,833,498.00 |
1674 | R.I.P.D. | $-68,351,500.00 |
Import directors and get dummy variables
directors = pd.read_csv("/Users/jasminepengelly/Desktop/projects/predicting_movie/director_dummies.csv")
directors.drop("Unnamed: 0", axis=1, inplace=True)
final = pd.merge(directors, movies, left_on = 'index', right_on = 'id')
final.drop(["id", "index", "title", "dir_count"], axis=1, inplace=True)
dummies = pd.get_dummies(final, columns=['lead', 'supporting'], drop_first=True)
Pre-processing
Since the final product of this modelling is a Flask app that lets someone input details about a film before it’s produced and get back a predicted rating and gross profit, some features will have to be dropped. For example, a user wouldn’t know the vote_count before the film is made. Some of the features I’m removing are among the most correlated with the response variable, so I will be losing some predictive power.
I’ll begin by defining my train-test split. Then, as in my previous blog post, I’ll standardise the remaining predictor variables, since that’s good practice when working with linear regression models.
X = dummies.drop(["revenue", "vote_count", "gross_profit", "vote_average"], axis=1)
y = dummies["gross_profit"]
scaler = StandardScaler()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
X_train = pd.DataFrame(scaler.fit_transform(X_train), columns=X.columns)
X_test = pd.DataFrame(scaler.transform(X_test), columns=X.columns)
print("Length of training sets:",len(X_train), len(X_test))
print("Length of testing sets:",len(y_train), len(y_test))
Length of X train and test sets: 1320 567
Length of y train and test sets: 1320 567
Modelling
Baseline score
I need a baseline score against which to compare all my models moving forward. It represents the error you would get by simply predicting the mean value of y for every film. If my models outperform this baseline, I know they are doing something useful.
y_pred_mean = [y_train.mean()] * len(y_test)
print("Dumb model RMSE: ",'${:,.2f}'.format(np.sqrt(metrics.mean_squared_error(y_test, y_pred_mean))))
Dumb model RMSE: $226,516,599.51
$226 million will be the benchmark RMSE for my model’s success. It’s still a very wide margin to be out by, so I’m hoping I can beat this.
Function to generate model scores
Since I’ll be trying out many different models, I’ll build functions that return all the relevant metrics for efficiency: one for simple models and another for models that use regularisation.
If these were functions I used regularly, I would put them in a script and import them. However, I wanted them stated explicitly here for you to see.
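For illustration, if these helpers lived in a (hypothetical) model_helpers.py next to the notebook, the import would simply be:
# Hypothetical module layout: both helper functions saved in model_helpers.py
from model_helpers import get_model_metrics, regularised_model_metrics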
# Function to return simple model metrics
def get_model_metrics(X_train, y_train, X_test, y_test, model, parametric=True):
"""This function takes the train-test splits as arguments, as well as the algorithm
being used, and returns the training score, the test score (both RMSE), the
cross-validated scores and the mean cross-validated score. It also returns the appropriate
feature importances depending on whether the optional argument 'parametric' is equal to
True or False."""
model.fit(X_train, y_train)
train_pred = np.around(model.predict(X_train),1)
test_pred = np.around(model.predict(X_test),1)
print('Training RMSE:', '${:,.2f}'.format(np.sqrt(metrics.mean_squared_error(y_train, train_pred))))
print('Testing RMSE:', '${:,.2f}'.format(np.sqrt(metrics.mean_squared_error(y_test, test_pred))))
cv_scores = -cross_val_score(model, X_train, y_train, cv=5, scoring='neg_mean_squared_error')
print('Cross-validated RMSEs:', np.sqrt(cv_scores))
print('Mean cross-validated RMSE:', '${:,.2f}'.format(np.sqrt(np.mean(cv_scores))))
    if parametric:
print(pd.DataFrame(list(zip(X_train.columns, model.coef_, abs(model.coef_))),
columns=['Feature', 'Coef', 'Abs Coef']).sort_values('Abs Coef', ascending=False).head(10))
else:
print(pd.DataFrame(list(zip(X_train.columns, model.feature_importances_)),
columns=['Feature', 'Importance']).sort_values('Importance', ascending=False).head(10))
return model
# Function to return regularised model metrics
def regularised_model_metrics(X_train, y_train, X_test, y_test, model, grid_params, parametric=True):
"""This function takes the train-test splits as arguments, as well as the algorithm being
used and the parameters, and returns the best cross-validated training score, the test
score, the best performing model and its parameters, and the feature importances."""
    # NB: no 'scoring' argument is passed here, so GridSearchCV optimises the estimator's
    # default metric (R^2 for regressors); 'error_score' only sets the value used if a fit fails.
    gridsearch = GridSearchCV(model,
                              grid_params,
                              n_jobs=-1, cv=5, verbose=1, error_score='neg_mean_squared_error')
gridsearch.fit(X_train, y_train)
print('Best parameters:', gridsearch.best_params_)
    print('Best cross-validated score (R^2):', '{:,.2f}'.format(abs(gridsearch.best_score_)))
best_model = gridsearch.best_estimator_
print('Testing RMSE:', '${:,.2f}'.format(np.sqrt(metrics.mean_squared_error(y_test, best_model.predict(X_test)))))
    if parametric:
print(pd.DataFrame(list(zip(X_train.columns, best_model.coef_, abs(best_model.coef_))),
columns=['Feature', 'Coef', 'Abs Coef']).sort_values('Abs Coef', ascending=False).head(10))
else:
print(pd.DataFrame(list(zip(X_train.columns, best_model.feature_importances_)),
columns=['Feature', 'Importance']).sort_values('Importance', ascending=False).head(10))
return best_model
Linear regression
Simple
lr = get_model_metrics(X_train, y_train, X_test, y_test, lm.LinearRegression())
lr
Training RMSE: $43,093,674.65
Testing RMSE: $6,850,145,181,265,346,691,072.00
Cross-validated RMSEs: [4.17892339e+21 9.90531764e+21 1.53491007e+22 1.13727100e+22
5.58558345e+21]
Mean cross-validated RMSE: $10,116,431,047,916,374,720,512.00
Feature Coef Abs Coef
1088 supporting_Bijou Phillips 1.347614e+21 1.347614e+21
276 Telescene Film Group Productions -1.259300e+21 1.259300e+21
266 Mardock Scramble Production Committee 1.202241e+21 1.202241e+21
665 lead_Judi Dench -1.150649e+21 1.150649e+21
1951 supporting_Seth Green -1.078695e+21 1.078695e+21
584 lead_James Woods -1.075888e+21 1.075888e+21
428 lead_Cuba Gooding Jr. 1.031542e+21 1.031542e+21
25 Bill Condon 1.029315e+21 1.029315e+21
1212 supporting_Dan Stevens -1.012743e+21 1.012743e+21
1011 supporting_Adrienne Barbeau -9.823532e+20 9.823532e+20
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
Both my scores are terrible here, and the vast difference between them shows the level of overfitting. Time for some regularisation.
Regularised - Ridge
ridge = lm.Ridge()
ridge_params = {'alpha': np.linspace(600, 800, 5),
'fit_intercept': [True, False],
'solver': ['auto', 'svd', 'cholesky', 'lsqr', 'sparse_cg', 'sag', 'saga']}
ridge_model = regularised_model_metrics(X_train, y_train, X_test, y_test, ridge, ridge_params)
Fitting 5 folds for each of 70 candidates, totalling 350 fits
[Parallel(n_jobs=-1)]: Done 42 tasks | elapsed: 9.3s
[Parallel(n_jobs=-1)]: Done 192 tasks | elapsed: 35.6s
[Parallel(n_jobs=-1)]: Done 350 out of 350 | elapsed: 1.1min finished
Best parameters: {'alpha': 700.0, 'fit_intercept': True, 'solver': 'saga'}
Best cross-validated score (R^2): 0.36
Testing RMSE: $185,680,100.81
Feature Coef Abs Coef
249 belongs_to_collection 2.131550e+07 2.131550e+07
247 budget 2.086358e+07 2.086358e+07
82 George Lucas 1.426045e+07 1.426045e+07
248 runtime 1.330278e+07 1.330278e+07
437 lead_Daniel Radcliffe 1.128897e+07 1.128897e+07
1918 supporting_Rupert Grint 1.100804e+07 1.100804e+07
1976 supporting_Stanley Tucci 9.935829e+06 9.935829e+06
1395 supporting_Ian McKellen 9.889183e+06 9.889183e+06
251 Adventure 9.849955e+06 9.849955e+06
491 lead_Emma Watson 9.219124e+06 9.219124e+06
This test score is already better than the baseline, so I know we are moving in the right direction. With the most predictive features removed, it’s interesting to see that belongs_to_collection seems to be even more influential than budget.
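As a rough sanity check on that observation (not something I ran as part of the modelling), comparing average gross profit for films inside and outside a collection is a one-liner:
# Mean gross profit for films outside (0) and inside (1) a collection
movies.groupby('belongs_to_collection')['gross_profit'].mean().map('${:,.0f}'.format)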
Regularised - Lasso
lasso = lm.Lasso(tol=5)
lasso_params = {'alpha': np.logspace(10, 100, 5),
'fit_intercept': [True, False]}
lasso_model = regularised_model_metrics(X_train, y_train, X_test, y_test, lasso, lasso_params)
Fitting 5 folds for each of 10 candidates, totalling 50 fits
[Parallel(n_jobs=-1)]: Done 42 tasks | elapsed: 2.6s
Best parameters: {'alpha': 10000000000.0, 'fit_intercept': True}
Best cross-validated score (R^2): 0.02
Testing RMSE: $226,516,599.51
Feature Coef Abs Coef
0 Aaron Seltzer -0.0 0.0
1402 supporting_Irene Miracle 0.0 0.0
1400 supporting_Ingrid Bergman 0.0 0.0
1399 supporting_Ichirō Nagai -0.0 0.0
1398 supporting_Ice Cube 0.0 0.0
1397 supporting_Iben Hjejle -0.0 0.0
1396 supporting_Ian McShane 0.0 0.0
1395 supporting_Ian McKellen 0.0 0.0
1394 supporting_Ian McDiarmid 0.0 0.0
1393 supporting_Ian Holm 0.0 0.0
[Parallel(n_jobs=-1)]: Done 50 out of 50 | elapsed: 3.0s finished
After a lot of tuning and increasing the alpha, lasso still didn’t perform as well as ridge. The coefficients also make less intuitive sense.
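If I revisit the lasso, a more gradual alpha sweep bracketing the value the grid search settled on (a hypothetical grid, not one I ran here) might be easier to reason about than jumping from 1e10 straight to 1e100:
# Hypothetical finer alpha grid around the alpha chosen above (1e10)
lasso_params_alt = {'alpha': np.logspace(8, 12, 9),
                    'fit_intercept': [True, False]}
# lasso_model_alt = regularised_model_metrics(X_train, y_train, X_test, y_test,
#                                             lm.Lasso(tol=5), lasso_params_alt)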
Regularised - ElasticNet
elastic = lm.ElasticNet()
elastic_params = {'alpha': np.linspace(1, 10, 10),
'l1_ratio': np.linspace(0.05, 0.95, 10),
'fit_intercept': [True, False]}
elastic_model = regularised_model_metrics(X_train, y_train, X_test, y_test, elastic,
elastic_params)
Fitting 5 folds for each of 200 candidates, totalling 1000 fits
[Parallel(n_jobs=-1)]: Done 42 tasks | elapsed: 4.7s
[Parallel(n_jobs=-1)]: Done 192 tasks | elapsed: 19.4s
[Parallel(n_jobs=-1)]: Done 442 tasks | elapsed: 34.3s
[Parallel(n_jobs=-1)]: Done 792 tasks | elapsed: 53.1s
Best parameters: {'alpha': 2.0, 'fit_intercept': True, 'l1_ratio': 0.65}
Best cross-validated score (R^2): 0.36
Testing RMSE: $186,283,469.11
Feature Coef Abs Coef
247 budget 1.991908e+07 1.991908e+07
249 belongs_to_collection 1.967256e+07 1.967256e+07
82 George Lucas 1.289770e+07 1.289770e+07
248 runtime 1.227358e+07 1.227358e+07
437 lead_Daniel Radcliffe 1.076488e+07 1.076488e+07
1918 supporting_Rupert Grint 1.046807e+07 1.046807e+07
251 Adventure 9.455739e+06 9.455739e+06
1395 supporting_Ian McKellen 9.207659e+06 9.207659e+06
1976 supporting_Stanley Tucci 9.097251e+06 9.097251e+06
1212 supporting_Dan Stevens 8.682729e+06 8.682729e+06
[Parallel(n_jobs=-1)]: Done 1000 out of 1000 | elapsed: 1.1min finished
This gives me very similar results to the ridge linear regression, but still doesn’t perform as well. So far, the ridge linear regression is the best performing.
Tree models
Simple decision tree
dt = get_model_metrics(X_train, y_train, X_test, y_test, tree.DecisionTreeRegressor(),
parametric=False)
dt
Training RMSE: $0.00
Testing RMSE: $189,435,992.34
Cross-validated RMSEs: [1.62986560e+08 1.50676461e+08 1.16390236e+08 1.28971798e+08
1.60756615e+08]
Mean cross-validated RMSE: $145,114,517.27
Feature Importance
247 budget 0.330288
249 belongs_to_collection 0.124512
248 runtime 0.096301
491 lead_Emma Watson 0.026973
742 lead_Mark Hamill 0.013669
37 Bryan Singer 0.012596
799 lead_Neel Sethi 0.012294
42 Christopher Nolan 0.011744
278 Thriller 0.011406
2028 supporting_Toni Collette 0.010859
DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
max_leaf_nodes=None, min_impurity_decrease=0.0,
min_impurity_split=None, min_samples_leaf=1,
min_samples_split=2, min_weight_fraction_leaf=0.0,
presort=False, random_state=None, splitter='best')
Simple decision trees tend to overfit, so I am not surprised by the train and test scores being what they are. However, I am surprised that the cross-validated score is quite good.
Random forest
rf = get_model_metrics(X_train, y_train, X_test, y_test, ensemble.RandomForestRegressor(),
parametric=False)
rf
Training RMSE: $51,290,496.58
Testing RMSE: $171,985,754.15
Cross-validated RMSEs: [1.30170546e+08 1.29842827e+08 9.17430304e+07 1.19111377e+08
1.27713356e+08]
Mean cross-validated RMSE: $120,597,293.25
Feature Importance
247 budget 0.324819
249 belongs_to_collection 0.119175
248 runtime 0.073327
25 Bill Condon 0.012158
226 Steven Spielberg 0.012097
742 lead_Mark Hamill 0.011958
2028 supporting_Toni Collette 0.011653
1395 supporting_Ian McKellen 0.010731
799 lead_Neel Sethi 0.010293
251 Adventure 0.009515
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
oob_score=False, random_state=None, verbose=0, warm_start=False)
The best performing model so far, with feature importances that make intuitive sense.
Regularised Random forest
rrf = ensemble.RandomForestRegressor()
rrf_params = {'bootstrap': [True, False],
'max_depth': np.linspace(5, 50, 5),
'min_samples_split': np.linspace(0.1, 1, 5),
'n_estimators': [10, 15, 20]}
rrf_model = regularised_model_metrics(X_train, y_train, X_test, y_test, rrf, rrf_params,
parametric=False)
Fitting 5 folds for each of 150 candidates, totalling 750 fits
[Parallel(n_jobs=-1)]: Done 42 tasks | elapsed: 3.4s
[Parallel(n_jobs=-1)]: Done 192 tasks | elapsed: 16.4s
[Parallel(n_jobs=-1)]: Done 442 tasks | elapsed: 42.7s
[Parallel(n_jobs=-1)]: Done 750 out of 750 | elapsed: 1.7min finished
Best parameters: {'bootstrap': False, 'max_depth': 5.0, 'min_samples_split': 0.1, 'n_estimators': 10}
Best cross-validated score (R^2): 0.35
Testing RMSE: $188,255,673.77
Feature Importance
247 budget 0.628325
249 belongs_to_collection 0.270686
2028 supporting_Toni Collette 0.023607
248 runtime 0.015528
742 lead_Mark Hamill 0.014858
1371 supporting_Harrison Ford 0.014858
226 Steven Spielberg 0.013352
678 lead_Kathryn Beaumont 0.009393
2046 supporting_Verna Felton 0.009393
1395 supporting_Ian McKellen 0.000000
A good test score, but not quite as good as the random forest with the default parameters.
Bagged decision trees
bagdt = ensemble.BaggingRegressor()
bagdt.fit(X_train, y_train)
print('Training RMSE:', '${:,.2f}'.format(np.sqrt(metrics.mean_squared_error(y_train, bagdt.predict(X_train)))))
print('Testing RMSE:', '${:,.2f}'.format(np.sqrt(metrics.mean_squared_error(y_test, bagdt.predict(X_test)))))
cv_scores = -cross_val_score(bagdt, X_train, y_train, cv=5, scoring='neg_mean_squared_error')
print('Cross-validated MSEs:', cv_scores)
print('Mean cross-validated MSE:', '{:,.2f}'.format(np.mean(cv_scores)))
Training RMSE: $47,551,713.19
Testing RMSE: $167,557,918.72
Cross-validated MSEs: [1.61524815e+16 1.63382135e+16 8.80421032e+15 1.40081532e+16
1.70218650e+16]
Mean cross-validated MSE: 14,464,984,700,217,268.00
Note that these cross-validated scores are MSEs rather than RMSEs - I haven’t taken the square root here - so they aren’t directly comparable with the figures for the other models.
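For a like-for-like comparison with the other models, the square root needs to be taken per fold before averaging - something like this (not run here):
# Convert the per-fold MSEs above into RMSEs comparable with the other models
bagdt_rmse = np.sqrt(cv_scores)
print('Cross-validated RMSEs:', bagdt_rmse)
print('Mean cross-validated RMSE:', '${:,.2f}'.format(bagdt_rmse.mean()))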
Support Vector Machine
LinearSVR
lin = svm.LinearSVR()
lin_params = {
'C': np.logspace(-3, 2, 5),
'loss': ['epsilon_insensitive','squared_epsilon_insensitive'],
'fit_intercept': [True,False],
'max_iter': [1000]
}
lin_model = regularised_model_metrics(X_train, y_train, X_test, y_test, lin, lin_params)
Fitting 5 folds for each of 20 candidates, totalling 100 fits
[Parallel(n_jobs=-1)]: Done 42 tasks | elapsed: 4.6s
[Parallel(n_jobs=-1)]: Done 100 out of 100 | elapsed: 53.0s finished
Best parameters: {'C': 0.001, 'fit_intercept': True, 'loss': 'squared_epsilon_insensitive', 'max_iter': 1000}
Best cross-validated score (R^2): 0.07
Testing RMSE: $188,783,310.28
Feature Coef Abs Coef
249 belongs_to_collection 2.347037e+07 2.347037e+07
247 budget 2.172464e+07 2.172464e+07
82 George Lucas 1.620963e+07 1.620963e+07
248 runtime 1.470311e+07 1.470311e+07
437 lead_Daniel Radcliffe 1.180886e+07 1.180886e+07
1918 supporting_Rupert Grint 1.154390e+07 1.154390e+07
1976 supporting_Stanley Tucci 1.089841e+07 1.089841e+07
1395 supporting_Ian McKellen 1.066900e+07 1.066900e+07
251 Adventure 1.028434e+07 1.028434e+07
2028 supporting_Toni Collette 1.008356e+07 1.008356e+07
Although the test RMSE comes in below the baseline, it still isn’t as good as the random forest’s.
RBF
rbf = svm.SVR(kernel='rbf')
rbf_params = {
'C': np.logspace(-3, 2, 5),
'gamma': np.logspace(-3, 2, 5),
'kernel': ['rbf']}
rbf = GridSearchCV(rbf, rbf_params, n_jobs=-1, cv=5, verbose=1, error_score='neg_mean_squared_error')
rbf.fit(X_train, y_train)
print('Best parameters:', rbf.best_params_)
print('Best cross-validated score (R^2):', '{:,.2f}'.format(abs(rbf.best_score_)))
print('Testing RMSE:', '${:,.2f}'.format(np.sqrt(metrics.mean_squared_error(y_test, rbf.best_estimator_.predict(X_test)))))
Fitting 5 folds for each of 25 candidates, totalling 125 fits
[Parallel(n_jobs=-1)]: Done 42 tasks | elapsed: 1.5min
[Parallel(n_jobs=-1)]: Done 125 out of 125 | elapsed: 4.1min finished
Best parameters: {'C': 100.0, 'gamma': 0.001, 'kernel': 'rbf'}
Best cross-validated score (R^2): 0.14
Testing RMSE: $240,436,736.27
This has performed worse than the LinearSVR, so I won’t be using this.
Poly
poly = svm.SVR(kernel='poly')
poly_params = {
'C': np.logspace(-3, 2, 3),
'gamma': np.logspace(-5, 2, 3),
'degree': [2]}
poly = GridSearchCV(poly, poly_params, n_jobs=-1, cv=5, verbose=1, error_score='neg_mean_squared_error')
poly.fit(X_train, y_train)
print('Best parameters:', poly.best_params_)
print('Best cross-validated score (R^2):', '{:,.2f}'.format(abs(poly.best_score_)))
print('Testing RMSE:', '${:,.2f}'.format(np.sqrt(metrics.mean_squared_error(y_test, poly.best_estimator_.predict(X_test)))))
Fitting 5 folds for each of 9 candidates, totalling 45 fits
[Parallel(n_jobs=-1)]: Done 45 out of 45 | elapsed: 1.5min finished
Best parameters: {'C': 0.31622776601683794, 'degree': 2, 'gamma': 100.0}
Best cross-validated score (R^2): 0.14
Testing RMSE: $202,171,501.80
I did play with including 3 as a degree hyperparameter, but the gridsearch took ages to run and the results didn’t improve much.
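For reference, a hypothetical version of that expanded grid would just add the extra degree:
# Hypothetical grid including cubic kernels - these take much longer to fit
poly_params_cubic = {'C': np.logspace(-3, 2, 3),
                     'gamma': np.logspace(-5, 2, 3),
                     'degree': [2, 3]}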
Conclusion
The best performing model is the random forest. I will pickle it, export the scaled features as a CSV, and use both in my Flask app.
X_scaled = pd.concat([X_train, X_test])
y_concat = pd.concat([y_train, y_test])
X_scaled.to_csv("X_profit.csv")
rf.fit(X_scaled, y_concat)
cv_scores = -cross_val_score(rf, X_scaled, y_concat, cv=5, scoring='neg_mean_squared_error')
print('Cross-validated RMSEs:', np.sqrt(cv_scores))
print('Mean cross-validated RMSE:', '${:,.2f}'.format(np.mean(np.sqrt(cv_scores))))
Cross-validated RMSEs: [1.25429824e+08 9.88043543e+07 1.11233629e+08 1.53454944e+08
1.52458683e+08]
Mean cross-validated RMSE: $128,276,286.95
with open('model_profit.pkl', 'wb') as f:
pickle.dump(rf, f)
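To give an idea of where this is heading, here is a minimal sketch of how the pickled model might be served from the Flask app. The route name and JSON payload format are hypothetical, and the real app will also load the ratings model from my previous post:
# Minimal sketch of serving the pickled profit model from Flask.
# The route name and payload format here are hypothetical.
import pickle

import pandas as pd
from flask import Flask, jsonify, request

app = Flask(__name__)

with open('model_profit.pkl', 'rb') as f:
    model = pickle.load(f)

# X_profit.csv gives the column order the model expects
feature_columns = pd.read_csv('X_profit.csv', index_col=0).columns

@app.route('/predict_profit', methods=['POST'])
def predict_profit():
    # Expect a JSON object of {feature_name: value}; unspecified features default to 0.
    # NB: in the real app the inputs would need the same StandardScaler transformation
    # applied as the training data.
    payload = request.get_json()
    row = pd.DataFrame([payload]).reindex(columns=feature_columns, fill_value=0)
    return jsonify({'predicted_gross_profit': float(model.predict(row)[0])})

if __name__ == '__main__':
    app.run(debug=True)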