IMDB Movie Recommendation Project

Neeraj Somani
10 min read · Mar 22, 2021

IMDB (Internet Movie Database) is an Amazon company, and it has one of the biggest datasets when it comes to audio-video entertainment content, whether movies, TV series, short films, or documentaries.

A few days back I came across this CloudGuruChallenge, published in October 2020. The goal of the project is to use any technology tools to analyze data and make recommendation predictions: course recommendations, movie recommendations, song recommendations, etc.

In this project I decided to use the IMDB dataset to build a movie recommender. Please read on to understand how I completed this project, and follow along with my code, which is available in the GitHub repo.

Code: GitHub repository link

Goal: Build a movie recommendation engine

Tools and technologies used: Pandas, NumPy, Matplotlib, Seaborn, and scikit-learn. (Side note: as you can see, the original challenge is to use ML technologies on AWS or another cloud platform, but we are using these basic tools for now and will move up to cloud technologies in the next article.)

How am I going to make this different from the many other available ML / data science projects? Well, in this blog I am going to explain a simple approach that I have been practicing for the last couple of months to tackle ML projects. You can apply these steps to almost any ML project and tweak them to your specific needs; at a high level, following them will carry you to your project goal to a great extent.

High-level steps that we are going to follow:

  1. Define the problem statement — understand the basics of the problem you want to solve, and break it down into smaller problems.
  2. Data extraction / collection — collect any amount of supporting data.
  3. Analyze and visualize data patterns — this helps you understand the data you collected.
  4. Data transformation — data cleaning, data filtering, data wrangling, feature engineering, and more.
  5. Save / load data — save your transformed data for machine learning models.
  6. Apply machine learning models — verify your results, predict, and solve problems.
  7. Presentation — build a dashboard or report that showcases the solution to the problem.

Data Science Paradigm

There is no single prescribed way to start a machine learning project, or any other data science project, but most data science projects and tasks start with collecting a good amount of data. That is obvious: without data, what else can a data scientist do? On the other hand, it is sometimes important to understand the problem statement before even collecting data. As a matter of fact, you should know what you want to collect, and why.

Each of the above steps has many sub-steps; there is a lot involved in the “Data Science” paradigm. Let’s explore this in more detail with the help of this project.

Step 1 — Define the problem statement:

Most of the time, for small practice projects like this one, the problem statement is already defined for you; in the real world, though, you need to work it out by looking at the business need and the available data trends, or even by creatively observing out-of-the-box scenarios. Hence this is a very important skill to have. For instance, in this project we know we are planning to recommend movies, but the bigger question is: in what different ways can you recommend movies? For example, based on genre, language, or region, based on the last movie watched, or based on popular actors, directors, or writers.

I hope you got the idea. Let me define the problem statement of this project for you.

We are going to recommend movies based on genre within a specific language or region. Let’s see how we are going to do it.

Step 2 — Data Extraction / Collection:

This part was the easiest one for me in this project, because the data was already selected for us before the project even started. But, as I explained, that will not always be the case: most of the time you need to gather requirements in order to collect the needed dataset.

IMDB dataset link and its definitions: https://www.imdb.com/interfaces/

As you can see at the link above, it’s not only the data that matters; the definition of the data is also important. Based on the data definitions and the problem statement, we identified the following three data files to use in this project. You will understand more as we move along. Hence, this is also an important skill to have.

title.basics.tsv.gz — contains basic information about movie titles

title.akas.tsv.gz — contains additional information about movie titles (the “also known as” data)

title.ratings.tsv.gz — this will help us separate highly rated movies from low-rated ones and recommend accordingly.

Step 3 — Analyze and Visualize Data Patterns

For this part I recommend you open my notebook from the GitHub repo. I will explain my thought process and tell you why and what I did.

  • First, I read all the files and took a count of the rows in each one.
import pandas as pd

# Each IMDB dump is a gzipped, tab-separated file; '\N' marks missing values.
# Example load: df_titleBasics = pd.read_csv('title.basics.tsv.gz', sep='\t', low_memory=False)
print(df_titleBasics.shape)   # (7656314, 9)
print(df_akas.shape)          # (25290774, 8)
print(df_ratings.shape)       # (1125873, 3)
print(df_crew.shape)          # (7656314, 3)
print(df_episods.shape)       # (5556745, 4)
print(df_principals.shape)    # (43272187, 6)
print(df_name_basics.shape)   # (10747436, 6)
  • Now, it’s time to analyze and decide what is helpful for our recommendation engine and what can be neglected. This is the analytical skill that you gain by practicing these kinds of projects again and again.

If you are familiar with the basics of ML models, you know that we need to convert text columns into numerical columns, avoid high cardinality in our features, avoid bias in our dataset, and remove data elements that have no values (nulls) or minimal impact on the overall conclusions. With this in mind, we will analyze our data against questions like the following (a quick code sketch follows the list):

  • How many records are there for each titleType? And under each genre?
  • How many records are null in each column?
  • Which columns connect directly to our problem statement?
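
As a minimal sketch of how the first two questions can be answered (assuming the dataframes loaded earlier and the column names from the IMDB docs):

# Counts per titleType, and per genre (genres is a comma-separated string).
print(df_titleBasics['titleType'].value_counts())
print(df_titleBasics['genres'].str.split(',').explode().value_counts())
# Nulls per column ('\N' is IMDB's null marker).
print(df_titleBasics.replace('\\N', pd.NA).isna().sum())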

Based on the answers to these questions, below is the list of columns from each file that will be helpful for our recommendation engine:

title.basics.tsv.gz — tconst, primaryTitle, titleType, startYear, genres

Why? Key reasons — titleType tells us which rows are movies vs. TV series, and primaryTitle gives us the best-known title of each movie. The startYear field lets us identify movies released in the same period of time. Genre is the most important field for our recommendation.

title.akas.tsv.gz — titleId, region, language

Why? Key reasons — I decided to use only Hindi- and English-language movies for this project, as those are the only titles I am familiar with. Hence, the regions will mostly be the USA and India.

title.ratings.tsv.gz — tconst, averageRating, numVotes

Why? Key reasons — all fields in this file are numeric, so they can help with our prediction without many changes to the dataset.
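
As a quick sanity check (a sketch assuming the df_ratings frame loaded earlier), a plain numeric summary already hints at the outliers that the plots below make visible:

# Summary statistics; a max numVotes far above the median points to the
# outliers we handle in the transformation step.
print(df_ratings[['averageRating', 'numVotes']].describe())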

A few plots from the analysis (see the notebook); they show:

  • number of votes vs. movie rating, and the outliers in our dataset
  • title counts per year
  • number of votes per year
  • number of titles per genre

Step 4 — Data Transformation (Data Cleaning, Data Filtering, Data Wrangling, Feature Engineering, and many more)

Let’s understand this by again considering each file separately. Remember to clean and filter as much data as possible before you combine files and make your data large: it’s always faster and easier to operate on a small set of data than on a large, combined, or complex one.

title.basics.tsv.gz — tconst, primaryTitle, startYear, genres

Filter operations — on genres, startYear, and titleType. Removed all rows where genres or startYear is null (‘\N’), and kept only ‘movie’ records based on titleType.

Datatype correction — converted the startYear field from string to numeric.

Drop unnecessary columns — as explained earlier, we keep only the columns required for this recommendation engine.

Removed duplicate rows — while analyzing we found that some titles had duplicate rows, hence this operation.

Converted the ‘genres’ string column to integers — this can be done using a technique called one-hot encoding; if you are not familiar with it, please read the details at this link. A sketch of all these transformations follows.
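
Putting these steps together, here is a minimal sketch (the variable names are my own; the notebook’s actual code may differ):

# Filter: keep movies whose genres and startYear are not null ('\N').
basics = df_titleBasics[(df_titleBasics['genres'] != '\\N') &
                        (df_titleBasics['startYear'] != '\\N') &
                        (df_titleBasics['titleType'] == 'movie')].copy()
# Datatype correction, column selection, and de-duplication.
basics['startYear'] = pd.to_numeric(basics['startYear'])
basics = basics[['tconst', 'primaryTitle', 'startYear', 'genres']].drop_duplicates()
# One-hot encode the comma-separated genres column.
basics = pd.concat([basics.drop(columns='genres'),
                    basics['genres'].str.get_dummies(sep=',')], axis=1)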

title.akas.tsv.gz — titleId, region, language

Filter operations — on language and region, keeping only records in the “hindi” and “english” languages and from the India and United States regions.

Drop unnecessary columns — as above, we keep only the required columns.

Converted the ‘language’ and ‘region’ string columns to integers — as explained earlier, we use the same one-hot encoding technique for this task.

Removed duplicate rows — as with the previous file, some titles had duplicate rows. A sketch of these steps follows.
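
A minimal sketch of these transformations (note that the raw file stores ISO codes, e.g. ‘hi’/‘en’ for language and ‘IN’/‘US’ for region, so verify the exact values against your download):

# Filter to Hindi/English titles from India/US, keep needed columns, de-dupe.
akas = df_akas[df_akas['language'].isin(['hi', 'en']) &
               df_akas['region'].isin(['IN', 'US'])]
akas = akas[['titleId', 'region', 'language']].drop_duplicates()
# One-hot encode the two string columns.
akas = pd.get_dummies(akas, columns=['region', 'language'])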

title.ratings.tsv.gz — tconst, averageRating, numVotes

This file has all its columns in numeric form, but as we saw in the analysis section, there are a few outliers in this dataset that could bias our model. Hence, we used the IQR technique to handle these outliers; please read this article if you are not familiar with the IQR technique.
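
Here is a minimal IQR sketch applied to numVotes (which columns to clip, and the 1.5 multiplier, are assumptions; adjust to match the notebook):

# Drop rows whose numVotes falls outside the 1.5 * IQR whiskers.
q1, q3 = df_ratings['numVotes'].quantile([0.25, 0.75])
iqr = q3 - q1
ratings = df_ratings[df_ratings['numVotes'].between(q1 - 1.5 * iqr,
                                                    q3 + 1.5 * iqr)]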

Step 5 — Save / Load Data (Save your transformed data for Machine Learning models)

Wow… we are getting closer to our ML modeling. I know all the above steps feel like a lot, but once you are familiar with these concepts, it all comes together as a simple but effective approach to tackling ML projects.

As you can imagine, we have modified each individual file as needed; it’s time to merge them into a single combined dataset. From the dataset definitions it’s clear that ‘tconst’ and ‘titleId’ are the same field. Hence, we merge our dataframes on these columns and keep only the titleId column in our final dataset.
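
A minimal merge sketch, using the transformed frames from the sketches above (again, the names are my own):

# tconst (basics, ratings) and titleId (akas) identify the same title.
final_data = basics.merge(akas, left_on='tconst', right_on='titleId')
final_data = final_data.merge(ratings, on='tconst')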

Also, remember to concatenate the ‘titleId’ and ‘primaryTitle’ columns (these are still string columns) and make the result the index of our rows. This keeps the dataset handy for further inspection. Below are the lines from my notebook that perform this step.

final_data.index = final_data['titleId'] + " " + final_data['primaryTitle']
final_data.drop(['titleId', 'primaryTitle', 'tconst'], axis=1, inplace=True)

Remember, many tasks overlap between the definitions of these steps when it comes to coding; below is a great example. Treat the steps as a high-level approach, but organize the code however makes it most readable and maintainable.

There is one feature engineering task left to perform now. Why? Because it needs to run on the entire dataset, which is why we waited until everything was combined. So, what is it?

It’s called “standardization” of your data. This is important because we want to avoid abnormalities in our dataset and predictions; if you are not familiar with the concept, please read about standardization here. Various techniques can be used; here we chose “MinMaxScaler”, which, strictly speaking, performs min-max normalization, scaling each feature to the [0, 1] range (you can read more about it here). Here is the code snippet from our notebook.

from sklearn.preprocessing import MinMaxScaler

# Scale every feature to [0, 1], keeping the original columns and index.
scaler = MinMaxScaler()
df_titles_scaled = pd.DataFrame(scaler.fit_transform(final_data))
df_titles_scaled.columns = final_data.columns
df_titles_scaled.index = final_data.index

Step 6 — Apply Machine Learning Models (verify your results, predict and solve problems)

Finally, it’s time to apply ML models and make some predictions.

Now, the important question is: which ML model should you use, and why? Answering this needs some basic understanding of the categories of ML algorithms. We are going to use a clustering algorithm in this project; if you are not familiar with clustering, please read more about it here.

Specifically, we are planning to use the K-means clustering algorithm.

Clustering is an unsupervised machine learning task that involves automatically discovering natural groupings in data. Clustering algorithms interpret only the input data and find natural groups, or clusters, in feature space.

Hence, by applying this technique, we can group movies with the same genre, language, and region into the same clusters. This helps with recommendation: when a user watches a movie, we can suggest other titles from the same cluster, as sketched below.
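
A minimal K-means sketch on the scaled data (n_clusters=10 and random_state are assumed values; the notebook may tune the cluster count, e.g. with the elbow method):

from sklearn.cluster import KMeans

# Assign every title to one of k clusters in the scaled feature space.
kmeans = KMeans(n_clusters=10, random_state=42, n_init=10)
df_titles_scaled['cluster'] = kmeans.fit_predict(df_titles_scaled)
# Recommend other titles from the cluster of a (hypothetical) watched movie.
watched = df_titles_scaled.index[0]
same = df_titles_scaled['cluster'] == df_titles_scaled.loc[watched, 'cluster']
print(df_titles_scaled[same].index[:10])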

Below is a screenshot fetching all movies from cluster 0; we can clearly see that all of them are drama-genre movies from the India region in the Hindi language.

Step 7 — Presentation (Build a Dashboard or report or blog that can showcase the solution for the problem)

This is the final step, used to showcase insights from the work you did in the project. In my case, the entire blog you have read so far is my step 7. You could equally prepare a PowerPoint presentation, a video series, or an interactive dashboard; there is no limit to the experimentation.

I hope you enjoyed this step-by-step procedure and learned how any data analysis, data engineering, or data science project can be tackled. The idea is simple: divide the big process into small processes and arrange them so that the overall project becomes easy to achieve and maintain.


Neeraj Somani

Data Analytics Engineer, crossing paths with Data Science, Data Engineering, and DevOps. Bringing you lots of exciting projects in the form of stories. Enjoy-Love.