Project Title


Predicting Box Office Returns for Movies Adapted from Books



Introduction

An “adaptation” is a film made from a pre-existing work in a different medium, usually literary or theatrical in nature. In the global film industry, adaptations are not just ubiquitous but actively encouraged, since they are believed to reliably earn more than comparable original films. In fact, film archivists have estimated that up to 50% of all Hollywood films are adaptations, a set of films that consistently contains some of the highest-grossing pictures at the box office. Notable examples are the serial adaptations of novels by authors such as J.R.R. Tolkien and J.K. Rowling, which enjoyed great commercial success. As an avid reader and movie watcher myself, I was keen to investigate which properties of books most influence film industry agents to buy the rights to a novel and go through with the adaptation process.

The idea behind this project is to discover which features of books are most indicative of a film adaptation’s box office revenue. Using Python and a variety of libraries, including Pandas and scikit-learn, I employed several data processing and machine learning techniques to evaluate the significance of different book features in predicting revenue. In the end, the classifier with the highest replicable accuracy was a logistic regression model with hyperparameters tuned via GridSearchCV, which reached 36% accuracy.

An important aside is that this is my first end-to-end data science project, so there is enormous scope for optimization throughout. Despite many setbacks, I believe I gained a lot of practical data science exposure working on this project and would be interested in developing it further in the future.

Data Collection and Analysis

I used two external datasets in my project as well as one dataset I webscraped myself. The two external datasets were “Goodreads-books” (metadata for roughly 50,000 books listed on Goodreads.com) and “The Movies Dataset” (metadata for the 45,000 movies listed in the Full MovieLens Dataset), both sourced from Kaggle. Because I could not find a single dataset containing metadata about both books and movies, I had to look for separate datasets for books and for movies. Lastly, I used Selenium to webscrape a dataset from the Mid-Continent Public Library (https://apps.mymcpl.org/botb/book/browse/0-9), which contains a curated list of title matches between books and the movies based on them. The webscraped dataset had only 4,500 rows, which was unsurprising given that it was manually curated. I scraped this dataset to estimate how many book adaptations had a different title than the source book: after removing duplicates, approximately 40% of the remaining 3,900 book-movie pairings had different titles, which would turn out to be a defining issue in my project. All three datasets were downloaded as CSV files and loaded into the notebook environment as separate pandas dataframes.
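The duplicate removal and different-title check can be sketched with pandas on a toy stand-in for the scraped data (the column names and rows here are assumptions for illustration, not the real dataset's):

```python
import pandas as pd

# Toy stand-in for the scraped MCPL data; these column names are assumptions.
pairs = pd.DataFrame({
    "book_title": ["Dune", "Dune", "The Shining",
                   "Do Androids Dream of Electric Sheep?"],
    "movie_title": ["Dune", "Dune", "The Shining", "Blade Runner"],
})

# Drop exact duplicate pairings, then measure how often the titles differ.
deduped = pairs.drop_duplicates()
mismatch_rate = (deduped["book_title"].str.lower()
                 != deduped["movie_title"].str.lower()).mean()
```

On the real 4,500-row scrape, the same two steps produced the ~3,900 deduplicated rows and ~40% mismatch figure mentioned above.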

Before merging my dataframes together, I performed a preliminary EDA on the features in my Goodreads and movies dataframes. Given the nature of my separated data, I realized that merging the two dataframes later on would pose numerous problems, so I decided to invest significant effort in EDA and visualizations. Figure 1.1 below includes histograms and a box-and-whiskers plot of six quantitative variables from my Goodreads data. The genre data in both dataframes were encoded categorically; moreover, each book and movie was tagged with multiple genres rather than just one. After creating a set of all unique genres (using regex and memoization for efficiency), I re-tagged each book and movie with one of its listed genres if any of those genres fell in the top 10% of represented genre tags; otherwise, I labeled its genre as “Non-Popular Genre.” I then created a bar chart of the frequencies of the new genre tags among the books (Fig 1.2), and did the same for the movies dataframe (Fig 1.3). These three analyses yielded several insights. Figure 1.1 shows that variables such as num_pages and ratings_count are not normally distributed and must be normalized before being fed into the model. Figure 1.2’s sidenoted bullet points reveal how a minority of genres are grossly overrepresented within the entire pool of genre tags. Figure 1.3 tells a similar story, though not to the same extent as Figure 1.2.
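The genre re-tagging step can be sketched as follows, assuming pipe-delimited genre strings and using a top-2 cutoff in place of the real top-10% threshold (the data and delimiter are invented for illustration):

```python
from collections import Counter

import pandas as pd

# Toy frame; in the real data each row carried a delimited genre string.
books = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "genres": ["Fantasy|Adventure", "Fantasy", "Romance|Drama", "Fantasy|Drama"],
})

# Count every genre tag across all rows.
tag_counts = Counter(g for row in books["genres"] for g in row.split("|"))

# Keep only the most frequent tags (top 10% in the real project; top 2 here).
popular = {g for g, _ in tag_counts.most_common(2)}

def retag(genre_string):
    """Label a row with its first popular genre, else 'Non-Popular Genre'."""
    for g in genre_string.split("|"):
        if g in popular:
            return g
    return "Non-Popular Genre"

books["genre"] = books["genres"].map(retag)
```

The resulting single-label genre column is what gets plotted in Figures 1.2 and 1.3.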







Figures 1.1, 1.2, and 1.3 were the most important for the project at hand; however, I decided to perform a few more analyses out of my own interest. Figure 1.4 below shows the relative distributions of genres and the number of movies across release years. Figure 1.5 shows (a) the distribution of user votes out of 5 stars for movies and (b) a scatterplot of the association between a movie’s popularity and its revenue. Figure 1.6 is a line graph of how a movie’s average vote changes with its runtime. There is a clear positive trend in the volume of movies produced in the last 20 years, and, surprisingly, the relative frequencies of the various genres stay roughly proportional across years (Figure 1.4). I also found it surprising that the distributions of vote averages for both books and movies were by and large normal (Figures 1.1, 1.5). Another interesting avenue for future research is the relationship between movie runtime and popularity, as there is a clear positive relationship in my data between runtime and vote average (Figure 1.6).







Modeling

This project was my first introduction to computational and statistical modeling. I learned to work with scikit-learn, including its GridSearchCV utility, for implementing and optimizing models. I framed the problem as multi-class classification: I binned expected revenue into equally-dense (equal-frequency) bins and predicted the bin. Initially I used Random Forests, an ensemble learning method for classification, regression, and other tasks that operates by constructing a multitude of decision trees at training time. Pros of Random Forests include robustness to outliers, good performance on nonlinear data, a lower risk of overfitting, efficient runtimes on large datasets, and generally competitive accuracy against other classification algorithms. Cons include a known bias toward high-cardinality categorical variables, slow training times, and being generally unsuitable for linear problems with many sparse features.

After fitting my random forest to the data, I performed hyperparameter tuning using GridSearchCV, an sklearn utility that searches the hyperparameter space for the best cross-validation score. Figure 2.1 below shows the optimal hyperparameters found by this procedure.
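A minimal sketch of the GridSearchCV step on synthetic data; the grid values here are illustrative assumptions, not the actual grid or optimal parameters from Figure 2.1:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic 5-class data standing in for the merged book/movie features.
X, y = make_classification(n_samples=300, n_features=6, n_informative=4,
                           n_classes=5, random_state=0)

# Exhaustively score every grid combination with 5-fold cross-validation.
param_grid = {"n_estimators": [100, 300], "max_depth": [None, 10]}
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

best_params = search.best_params_   # analogous to Figure 2.1
best_score = search.best_score_
```

Note that GridSearchCV's cost grows multiplicatively with the grid, so keeping each hyperparameter to a few candidate values matters on larger datasets.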



Finally, I compared several classifier models to see whether random forests were the best fit for my data. Using sklearn’s cross_validate function, I measured accuracy, recall, precision, and F1 for Logistic Regression, Support Vector Classifier, Decision Tree, Random Forest, and Gaussian Naive Bayes models. Figure 2.2 shows a table of the consolidated results.
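The comparison can be sketched with cross_validate as below, on synthetic data; micro-averaged scorers are used so that multi-class precision and recall are well defined (the exact scoring setup of the original run is an assumption):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic 5-class data standing in for the merged book/movie features.
X, y = make_classification(n_samples=300, n_features=6, n_informative=4,
                           n_classes=5, random_state=0)

models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "SVC": SVC(),
    "DecisionTree": DecisionTreeClassifier(random_state=0),
    "RandomForest": RandomForestClassifier(random_state=0),
    "GaussianNB": GaussianNB(),
}

scoring = ["accuracy", "precision_micro", "recall_micro", "f1_micro"]
results = {}
for name, est in models.items():
    cv = cross_validate(est, X, y, cv=5, scoring=scoring)
    # Mean of each metric across the 5 folds, as in the Figure 2.2 table.
    results[name] = {m: cv[f"test_{m}"].mean() for m in scoring}
```

One caveat worth knowing: in single-label multi-class problems, micro-averaged precision, recall, and F1 are all mathematically identical to accuracy.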



While the spread of the metrics was relatively small (between 33% and 36%), Logistic Regression edged out every other model slightly; in fact, Random Forests were tied for worst model in the lineup. Nonetheless, I am wary of these results, since it seems odd that, for each model, every metric takes the same value. One possible explanation, if micro-averaged scores were used, is that micro-averaged precision, recall, and F1 all reduce to plain accuracy in single-label multi-class classification, which would produce exactly this pattern.

Conclusions

Given the inherent complexity of my data, and the fact that no up-to-date dataset yet exists with information about both books and their associated movies, I am surprised I was able to achieve a decent accuracy for my models. Since I binned the target variable into 5 equally-dense bins, a score of 36% indicates performance better than random guessing, which would yield 20% accuracy in theory. Nonetheless, these results are tenuous due to the large number of likely confounding factors in my work, such as the fuzzy merge between datasets based on titles, and the limited amount of data I could use in general.
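My actual fuzzy merge code is not shown here, but a minimal stand-in using only the standard library's difflib might look like the following (the titles and similarity cutoff are illustrative assumptions):

```python
import difflib

import pandas as pd

books = pd.DataFrame({"book_title": ["The Fellowship of the Ring", "Dune"]})
movies = pd.DataFrame({"movie_title": [
    "The Lord of the Rings: The Fellowship of the Ring", "Dune", "Titanic"]})

def fuzzy_match(title, candidates, cutoff=0.6):
    """Return the closest candidate title above the similarity cutoff, else None."""
    hits = difflib.get_close_matches(title.lower(),
                                     [c.lower() for c in candidates],
                                     n=1, cutoff=cutoff)
    return hits[0] if hits else None

# Attach each book's best-matching movie title (None when nothing is close enough).
books["matched_movie"] = books["book_title"].map(
    lambda t: fuzzy_match(t, movies["movie_title"].tolist()))
```

Any such threshold-based matcher is itself a confounder: too low a cutoff produces false pairings, too high a cutoff drops the ~40% of adaptations with retitled films.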

Additionally, a number of variables could be added to such models in the future, such as sentiment analysis of books’ cover blurbs, the author’s writing experience, the film studio, and much more.

While this is a very rudimentary analysis of the topic, I believe that data science applied across different types of media, such as books and movies, could be highly profitable for stakeholders in the film and publishing industries, specifically those producing adaptations of stories from books. By being able to predict revenue for adaptations, they can better budget a film’s production and achieve better returns on investment.


Semester

Spring 2021

Researcher

Sushant Vema