

Korean Drama Recommendation System



Introduction

In the age of big data, many organizations generate growth by analyzing their data. These companies hold information about their products and services, buyers and suppliers, and consumer preferences. When was the last time you saw an ad on the side of your browser for a product you had recently looked up?

Combining my love for entertainment (specifically gaming and watching Korean dramas) with my interest in working with consumer information, I decided to research how to build a K-drama recommendation system. To do this, I studied different types of filtering methods and implemented one of them, content-based filtering. The goal of this project is to understand the processes behind recommending a product.

For more information and supplemental detail about this project, feel free to go through my slide deck: https://docs.google.com/presentation/d/1L2oTeLCcCUhKpN-N1U_s9oXZq3NeXnXy9SNoO9Nekig/edit?usp=sharing

Data Collection & Data Cleaning

Since I am implementing a content-based filtering method, the main features I want to extract are genre, directors, actors, and plot. In the first part of this project, I used Selenium to crawl and scrape the genre, directors, and actors for each Korean drama on Wikipedia’s list of Korean dramas. Because plot information on Wikipedia is inconsistent and its pages change frequently, I extracted the plot from each drama’s IMDb page instead. The full dataset has 1,351 observations and 4 features; the observation count can vary because Wikipedia is constantly edited.

To clean the directors, actors, and genre features, I used regex to match whitespace characters and replace them with commas. I also stripped extraneous information from the title feature by replacing it with empty strings. Next, I did some basic text preprocessing such as lowercasing and removing punctuation, specifically the character ‘-’, since Korean names tend to use it to separate different parts of the name. After cleaning the data, I kept only the top 3 directors, actors, and genres for each drama. I also wanted to focus on dramas with plot information, so the resulting dataset has about 700 observations. Despite losing almost half of my data, I decided to continue with the project, since the recommendations do not suffer too much from this loss (it is rare for people to watch that many dramas!).
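The cleaning steps for a list-like feature can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function name `clean_list_field`, the assumption that scraped entries are separated by newlines or tabs, and the placeholder names in the example are all hypothetical.

```python
import re

def clean_list_field(raw, top_n=3):
    """Turn a scraped multi-line cell into at most `top_n` normalized names."""
    # Assumption: the scraper returns one entry per line (or tab-separated).
    # Replace those whitespace separators with commas.
    with_commas = re.sub(r"\s*[\n\t]+\s*", ",", raw.strip())
    names = []
    for part in with_commas.split(","):
        # Lowercase and drop the hyphen that often joins parts of a Korean name.
        name = part.strip().lower().replace("-", "")
        # Collapse any leftover runs of spaces inside a name.
        name = re.sub(r"\s+", " ", name)
        if name:
            names.append(name)
    # Keep only the first `top_n` entries, as listed on the page.
    return names[:top_n]

print(clean_list_field("Lee Min-ho\nKim Go-eun\nWoo Do-hwan\nJung Eun-chae"))
```

Keeping the first three entries assumes the source page lists the most prominent people first, which may not hold for every drama.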



Modeling/Building the System

I implemented two different approaches to content-based filtering. For the first approach, I used only the plot feature and recommended the K-dramas whose plots are most similar. To do this, I transformed the text into feature vectors using scikit-learn’s TfidfVectorizer and then computed the cosine similarity matrix to measure how similar the plots of each pair of dramas are. I chose TfidfVectorizer because it down-weights words that occur in many plots, which reduces their influence on the final similarity score.
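To make the idea concrete, here is a small standard-library sketch of TF-IDF weighting followed by cosine similarity. It is not the scikit-learn code used in the project; it hand-rolls the same general idea (smoothed inverse document frequency, then L2 normalization) so the arithmetic is visible. The function names and the three toy plots are made up for illustration.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Lowercase and keep word tokens of two or more characters.
    return re.findall(r"\b\w\w+\b", text.lower())

def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, scaled to unit length."""
    counts = [Counter(tokenize(d)) for d in docs]
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for c in counts for term in c)
    # Smoothed IDF: terms found in many documents get a smaller weight.
    idf = {t: math.log((1 + n) / (1 + d)) + 1 for t, d in df.items()}
    vectors = []
    for c in counts:
        vec = {t: tf * idf[t] for t, tf in c.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors

def cosine_similarity_matrix(vectors):
    """Pairwise cosine similarity; inputs are unit length, so a dot product is enough."""
    def dot(a, b):
        return sum(w * b.get(t, 0.0) for t, w in a.items())
    return [[dot(a, b) for b in vectors] for a in vectors]

# Toy plots, invented for this example.
plots = [
    "a prosecutor hunts a serial killer in seoul",
    "a detective hunts a killer across seoul",
    "two chefs fall in love at a seaside restaurant",
]
sim = cosine_similarity_matrix(tfidf_vectors(plots))
for row in sim:
    print([round(score, 2) for score in row])
```

In this toy example the two crime plots score higher with each other than either does with the romance plot, and each plot has a similarity of 1 with itself.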



After computing the similarity, I created a function that takes in a title as a parameter and outputs the top 10 most similar K-dramas by sorting the similarity scores in descending order.
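A function of this shape might look like the sketch below. The name `recommend`, its parameters, the placeholder titles, and the hand-written similarity matrix are assumptions for illustration; the project's own function may differ, for example in how it handles unknown titles.

```python
def recommend(title, titles, similarity, top_n=10):
    """Return up to `top_n` titles most similar to `title`, best match first."""
    if title not in titles:
        raise ValueError(f"Unknown title: {title!r}")
    idx = titles.index(title)
    # Rank every other drama by its similarity to the query, highest first.
    ranked = sorted(
        (j for j in range(len(titles)) if j != idx),
        key=lambda j: similarity[idx][j],
        reverse=True,
    )
    return [titles[j] for j in ranked[:top_n]]

# Placeholder titles and scores, invented for this example.
titles = ["Drama A", "Drama B", "Drama C", "Drama D"]
similarity = [
    [1.0, 0.2, 0.7, 0.1],
    [0.2, 1.0, 0.3, 0.5],
    [0.7, 0.3, 1.0, 0.0],
    [0.1, 0.5, 0.0, 1.0],
]
print(recommend("Drama A", titles, similarity, top_n=2))
```

The query drama is excluded by index, since its similarity with itself is always the highest score in its row.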



For the second approach, I wanted to use directors, actors, genres, and keywords to recommend K-dramas. To extract keywords from the plot, I used rake-nltk, a Python implementation of the RAKE (Rapid Automatic Keyword Extraction) algorithm, which detects keywords by analyzing how frequently each word appears and how often it co-occurs with other words in the text. Next, I standardized the keywords with NLTK’s LancasterStemmer. I then built a bag-of-words model by combining all the features into one and converted the text into a matrix of token counts using CountVectorizer. I used CountVectorizer here instead of TfidfVectorizer because I do not want to down-weight an actor or director just because they have participated in relatively more dramas.
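The bag-of-words step can be illustrated with a standard-library sketch that uses raw token counts in place of CountVectorizer. Keyword extraction and stemming are assumed to have happened already; the record layout, the helper names, the placeholder people, and the choice to collapse multi-word names into single tokens are all assumptions made for this example.

```python
import math
from collections import Counter

def bag_of_words(drama):
    """Combine directors, actors, genres, and keywords into one token-count vector."""
    tokens = []
    for field in ("directors", "actors", "genres", "keywords"):
        # Collapse spaces so a multi-word name counts as a single token.
        tokens.extend(value.replace(" ", "") for value in drama[field])
    return Counter(tokens)

def cosine(a, b):
    """Cosine similarity between two count vectors (0.0 if either is empty)."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Placeholder records, invented for this example; keywords are already stemmed.
d1 = {"directors": ["director one"], "actors": ["actor one", "actor two"],
      "genres": ["romance", "comedy"], "keywords": ["chef", "restaur"]}
d2 = {"directors": ["director two"], "actors": ["actor one", "actor three"],
      "genres": ["romance"], "keywords": ["chef"]}
d3 = {"directors": ["director three"], "actors": ["actor four"],
      "genres": ["thriller"], "keywords": ["detect"]}

print(round(cosine(bag_of_words(d1), bag_of_words(d2)), 2))
print(round(cosine(bag_of_words(d1), bag_of_words(d3)), 2))
```

Because the vectors hold plain counts, a shared actor contributes the same amount to the score no matter how many other dramas that actor appears in, which is the reason given above for preferring counts over TF-IDF weights here.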





Again, after computing the similarity, I created another function that takes in a title as a parameter and outputs the top 10 most similar K-dramas by sorting the similarity scores in descending order. Comparing the two sets of recommendations shows both similarities and differences between the two approaches.



Tableau Dashboard

Finally, I used TabPy to execute Python scripts and saved functions from Tableau's table calculations. This lets us display the recommendations on a Tableau dashboard as needed.



Conclusion

All in all, this project was a success. My main objective was to implement a content-based filtering method that can recommend K-dramas given the title of a Korean drama. As a next step, I would look for a better data source, since the dynamic nature of Wikipedia makes data collection inefficient. In addition, I could build a website that presents the recommendations in a more visually appealing way. Lastly, I could implement other filtering methods given new data such as ratings and user history.