MovieLens Dataset

  • 100,000 ratings (scale of 1-5)
  • 1682 movies
  • 943 users
We first partition the dataset by randomly selecting 50% of the users as our “base” dataset and the remaining 50% as our “test” dataset. The partition splits the full set of users into two disjoint subsets, so no user appears in both.
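
The split described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function name `split_users`, the `seed` parameter, and the assumption that user IDs run from 1 to 943 are all hypothetical.

```python
import random

def split_users(user_ids, base_fraction=0.5, seed=0):
    """Randomly partition unique user IDs into disjoint base and test sets."""
    users = sorted(set(user_ids))      # de-duplicate, fix order for reproducibility
    rng = random.Random(seed)          # seeded so the split can be repeated
    rng.shuffle(users)
    cut = int(len(users) * base_fraction)
    return set(users[:cut]), set(users[cut:])

# Assumed ID range: 943 users numbered 1..943.
base_users, test_users = split_users(range(1, 944))
```

Splitting by user rather than by rating keeps every rating of a given user on one side of the partition.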



  • Global dataset split - % users in “base” dataset / % users in “test” dataset
  • K - number of clusters
  • N - number of most similar neighbors selected within a cluster
  • P - number of passes of the K-means algorithm
  • M - minimum number of ratings required by new user
  • Lambda
  • Lambda = 0: basic Pearson correlation-coefficient algorithm
  • Lambda = 1: basic cluster-based CF algorithm using the average rating of the cluster for similarity computation
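
As a rough illustration of the Lambda = 0 endpoint, the sketch below computes a Pearson correlation between two users over the movies both have rated. The function name `pearson`, the dict-based rating representation, and the choice to take means over co-rated movies only (one common variant) are assumptions. How intermediate Lambda values combine the two endpoints is not specified here, so it is not sketched.

```python
from math import sqrt

def pearson(ratings_a, ratings_b):
    """Pearson correlation over co-rated movies (dicts: movie_id -> rating)."""
    common = ratings_a.keys() & ratings_b.keys()
    if len(common) < 2:
        return 0.0                     # not enough overlap to correlate
    mean_a = sum(ratings_a[m] for m in common) / len(common)
    mean_b = sum(ratings_b[m] for m in common) / len(common)
    num = sum((ratings_a[m] - mean_a) * (ratings_b[m] - mean_b) for m in common)
    var_a = sum((ratings_a[m] - mean_a) ** 2 for m in common)
    var_b = sum((ratings_b[m] - mean_b) ** 2 for m in common)
    den = sqrt(var_a * var_b)
    return num / den if den else 0.0   # constant ratings give zero variance
```

For example, `pearson({1: 1, 2: 2, 3: 3}, {1: 2, 2: 4, 3: 6})` compares two users whose ratings rise together across the same three movies.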


zMovie Recommendation Engine





Effectiveness of algorithm measured by Mean Absolute Error (MAE)

MAE is the sum of the absolute errors |predicted rating - actual rating| over all items in the test set, divided by the size of the test set.
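
The MAE computation can be written in a few lines. This is a minimal sketch; the function name `mean_absolute_error` and the parallel-list inputs are assumptions made for illustration.

```python
def mean_absolute_error(predicted, actual):
    """Average of |predicted - actual| over all test-set ratings."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(p - a) for p, a in zip(predicted, actual))
    return total / len(actual)
```

A lower MAE means the predicted ratings sit closer to the ratings the test users actually gave.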
"Optimal" parameter values exist for the following reasons:
K - Too few clusters results in grouping dissimilar users whereas too many clusters defeats the purpose of grouping “like” users.
N - The top N most similar neighbors to an active user are picked from within that user's cluster. N should be neither too small (not representative of the cluster) nor too large (including everyone in the cluster would be meaningless and inefficient).
P - After a certain number of passes of the K-means algorithm, the process of recalculating centroids and reclustering the “base” dataset becomes less and less valuable (lower incremental improvement in clustering of dataset).
M - There is a minimum number of movies that a new user must rate in order to receive reasonable predictions / recommendations. An M that is too small makes it harder to assign the new user to an existing cluster.
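
To make the roles of K and P concrete, here is a minimal K-means sketch that runs a fixed number of passes. It is an illustration under stated assumptions, not the project's implementation: each user is assumed to be a dense rating vector with missing ratings already filled in, distance is assumed to be squared Euclidean, and the names `kmeans`, `vectors`, and `seed` are hypothetical.

```python
import random

def kmeans(vectors, k, passes, seed=0):
    """Cluster rating vectors into k groups using a fixed number of passes."""
    rng = random.Random(seed)
    # Start from k distinct users chosen at random as the initial centroids.
    centroids = [list(v) for v in rng.sample(vectors, k)]
    assignment = [0] * len(vectors)
    for _ in range(passes):
        # Assignment step: attach each user to the nearest centroid.
        for i, v in enumerate(vectors):
            assignment[i] = min(
                range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(v, centroids[c])),
            )
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [vectors[i] for i in range(len(vectors)) if assignment[i] == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids
```

Each extra pass repeats the same two steps, so once assignments stop changing, further passes add cost without changing the clusters.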




For our given dataset (~471 users in each half of the 50/50 split):
  • K = 10
  • N = 20
  • P = 7
  • M = 10
  • Lambda = 0

Overall, we have developed a user-based / model-based collaborative filtering algorithm that clusters users in order to address the scalability and computation issues of traditional memory-based CF systems. We primarily tuned the algorithm to identify the parameter values that give the best results (lowest MAE). However, several limitations remain because the recommender system is user-based (e.g. it does not account for the uniqueness of individual movies). For future work, we would like to investigate the effects of metadata on individual movies (e.g. content-based factors such as genre, actors, year produced) as well as additional social networking effects (e.g. tags on movies supplied by users in the global database, recommendations made by a user’s friends, and so on).
