Introduction
I recently enrolled in the Python for Data Science (UCSanDiegoX DSE200x) course on edX. In the course I got to explore the "MovieLens" dataset, which you can download here.
With this dataset I am going to attempt to answer two questions.
- Which movies had the highest average rating over the last 10 years?
- Which genre (Animation, Sci-Fi or Horror) had the highest average rating across all of the years in the dataset?
I will attempt to answer question 2 in part 2.
Acquire and prepare the data
The dataset contains four files. The files are named "links.csv", "movies.csv", "ratings.csv", "tags.csv". We shall explore them one by one by printing the first five rows in each dataset. If you are interested in the code you can check out my notebook here.
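Here is a minimal sketch of that first look, assuming the four files sit together in one folder. The helper name `preview_files` and the folder name in the usage comment are made up for illustration and are not taken from the notebook.

```python
from pathlib import Path

import pandas as pd

# File names as listed above; the folder they live in is an assumption.
FILES = ["links.csv", "movies.csv", "ratings.csv", "tags.csv"]


def preview_files(data_dir, n=5):
    """Return {file name: first n rows} for each MovieLens CSV in data_dir."""
    previews = {}
    for name in FILES:
        df = pd.read_csv(Path(data_dir) / name)
        previews[name] = df.head(n)
    return previews


# Usage (hypothetical folder name):
# for name, head in preview_files("ml-latest-small").items():
#     print(name)
#     print(head)
```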
As it turns out, I will only need two datasets to solve my two problems: movies.csv and ratings.csv. The title in the movies.csv file contains the release year, so we need to pull it out into a separate "year" column. We also need to create a human-readable date column from the "timestamp" column in the ratings.csv file. Now let us look at our modified datasets.
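One way to do both steps is sketched below. The rows are made-up samples that only mimic the shape of the two files, and the sketch assumes that the year sits in parentheses at the end of the title and that the timestamps are seconds since the Unix epoch.

```python
import pandas as pd

# Made-up sample rows that mimic the shape of movies.csv and ratings.csv.
movies = pd.DataFrame({
    "movieId": [1, 2, 3],
    "title": ["Toy Story (1995)", "Heat (1995)", "Inside Out (2015)"],
    "genres": ["Animation|Children", "Action|Crime", "Animation|Comedy"],
})
ratings = pd.DataFrame({
    "userId": [1, 2],
    "movieId": [1, 3],
    "rating": [4.0, 4.5],
    "timestamp": [1112486027, 1434000000],
})

# Pull a four-digit year in parentheses from the end of the title.
# Titles without a year end up as missing values.
year = movies["title"].str.extract(r"\((\d{4})\)\s*$", expand=False)
movies["year"] = pd.to_numeric(year).astype("Int64")

# Assumption: the timestamps are seconds since the Unix epoch.
ratings["date"] = pd.to_datetime(ratings["timestamp"], unit="s")

print(movies.head())
print(ratings.head())
```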
We need to get the average rating per movie. After that we need to combine the two DataFrames. The movies DataFrame shows the movie titles but not the ratings. The ratings DataFrame shows the ratings but not the movie titles. Both DataFrames share a unique identifier: "movieId". We will use "movieId" as our lookup value and pull the ratings data into the movies DataFrame, thereby creating a new merged DataFrame.
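The averaging and the lookup can be sketched with `groupby` and `merge`. The rows are again made-up samples, and the choice of an inner join (which drops movies that have no ratings) is an assumption, not necessarily what the notebook does.

```python
import pandas as pd

# Made-up sample rows; the real data comes from movies.csv and ratings.csv.
movies = pd.DataFrame({
    "movieId": [1, 2, 3],
    "title": ["Toy Story (1995)", "Heat (1995)", "Inside Out (2015)"],
    "year": [1995, 1995, 2015],
})
ratings = pd.DataFrame({
    "userId": [1, 2, 1, 2],
    "movieId": [1, 1, 2, 3],
    "rating": [4.0, 5.0, 3.0, 4.5],
    "timestamp": [1112486027, 1112484676, 1112484819, 1434000000],
})

# One row per movie: the mean of all of its ratings.
avg_ratings = ratings.groupby("movieId", as_index=False)["rating"].mean()

# Look up each movie's average rating by its movieId.
# how="inner" keeps only movies that have at least one rating.
merged = movies.merge(avg_ratings, on="movieId", how="inner")

print(merged)
```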
Let's remove the "movieId", "userId" and "timestamp" columns as they will not be needed.
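A short sketch of that cleanup, on a made-up merged DataFrame. Depending on how the merge was done, "userId" and "timestamp" may already be gone, so `errors="ignore"` is used here to skip any column that is not present.

```python
import pandas as pd

# Made-up stand-in for the merged DataFrame.
merged = pd.DataFrame({
    "movieId": [1, 2],
    "title": ["Toy Story (1995)", "Heat (1995)"],
    "year": [1995, 1995],
    "rating": [4.5, 3.0],
    "userId": [1, 2],
    "timestamp": [1112486027, 1112484819],
})

# Drop the columns that are no longer needed; missing ones are skipped.
merged = merged.drop(columns=["movieId", "userId", "timestamp"], errors="ignore")

print(merged.columns.tolist())
```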
Analyze data and communicate results
Now that we have the information we want, we can start to analyze the data. We can sort the movies DataFrame by "year" in descending order, so the most recent year is displayed first. We can see the latest year is 2015.
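Sorting by year could look like the sketch below, where `merged` is a made-up stand-in for the real DataFrame.

```python
import pandas as pd

# Made-up stand-in for the merged DataFrame.
merged = pd.DataFrame({
    "title": ["Movie A (1999)", "Movie B (2015)", "Movie C (2006)"],
    "year": [1999, 2015, 2006],
    "rating": [4.6, 3.9, 4.1],
})

# Most recent year first.
by_year = merged.sort_values("year", ascending=False)

# The latest year in the data.
latest_year = merged["year"].max()

print(by_year.head())
print(latest_year)
```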
After that we can sort by highest average rating. Now I can answer the first question: which movies had the highest average rating over the last 10 years? The results are listed below.
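For reference, the filter and sort behind this step can be sketched as follows. The titles and ratings here are placeholders, not the actual results, and reading "the last 10 years" as the ten years up to and including the latest year in the data is an assumption.

```python
import pandas as pd

# Placeholder rows, not real results.
merged = pd.DataFrame({
    "title": ["Movie A (2015)", "Movie B (2010)", "Movie C (2006)",
              "Movie D (2005)", "Movie E (1999)"],
    "year": [2015, 2010, 2006, 2005, 1999],
    "rating": [3.9, 4.4, 4.1, 4.8, 4.6],
})

# Assumption: "last 10 years" means the latest year and the nine before it.
latest_year = merged["year"].max()
first_year = latest_year - 9

# between() includes both endpoints.
recent = merged[merged["year"].between(first_year, latest_year)]

# Highest average rating first.
top_recent = recent.sort_values("rating", ascending=False)

print(top_recent)
```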
FYI, I have not seen a single one of the movies listed above. Let's continue to part 2.