Recommendation Engine with Neo4j

Collaborative Filtering Recommendation Engine with Neo4j & Python

Introduction

In a previous blog, we created a graph database in Neo4j using the DVDRENTAL database which you can download from here.

Here, we will continue our work and build a recommender system using Cypher, the query language of Neo4j. We will use Python to connect to the database and execute the queries. Our system will recommend movies to target users based on the preferences of the users who have a similar taste. I.e. for each target user we will identify the most similar users to them. Then we will use the preferences of these users to generate recommendations for the target user.

This algorithm is Collaborative Filtering.

Github link for the project and data: https://github.com/MNoorFawi/recommendation-engine-with-neo4j

Collaborative Filtering

Image source

Collaborative filtering (CF) is a technique commonly used to build personalized recommendation systems. Some popular websites that use CF technology include Amazon, Netflix, and IMDB. CF algorithm makes predictions about a user’s interests by compiling preferences from similar users.

In order to build our recommender system, we will follow the following steps:

  1. Select a similarity metric to quantify similarity among users in data.
  2. For each target user, compute similarity between them and the rest of users.
  3. Select the top k nearest neighbors based on the similarity metric.
  4. Identify Movies rented by the top k neighbors that the target user has not rented.
  5. Rank these movies by the number of renting neighbors.
  6. Recommend the top n movies to the target user.

The similarity metric we will use here is Jaccard Similarity Coefficient or Jaccard Index. Also known as Intersection over Union.

Jaccard Index

Jaccard Index between two sets A and B is the ratio of the number of elements in the intersection of A and B over the number of elements in the union of A and B.

Our Graph Database

This is how our customers and the movies they rented look like in the graph database.

Let’s now run our database and connect it to python. Then we will query the number of movies per each genre.

from command line run “neo4j console”

from pprint import pprint
from py2neo import Graph
g = Graph("http://localhost:7474/db/data/", password = "password")

query = """
MATCH (cat:Category)<-[g:OF_GENRE]-()
RETURN cat.Name AS genre, COUNT(g) AS number_of_movies
ORDER BY number_of_movies DESC;
"""
g.run(query).to_data_frame()

    | genre       | number_of_movies
----+-------------+------------------
  1 | Sports      |               74
  2 | Foreign     |               73
  3 | Family      |               69
  4 | Documentary |               68
  5 | Animation   |               66
  6 | Action      |               64
  7 | New         |               63
  8 | Drama       |               62
  9 | Games       |               61
 10 | Sci-Fi      |               61
 11 | Children    |               60
 12 | Comedy      |               58
 13 | Travel      |               57
 14 | Classics    |               57
 15 | Horror      |               56
 16 | Music       |               51

Great! We have connected to the database. Let’s now choose a random user and try to make recommendations for them. We will use 25 nearest neighbors to recommend 5 movies to the target user.

Recommendation Engine Model

We first get the most similar users to our target user. Jaccard Index = Movies in common (Intersection) / Movies in total (Union) .

query = """
  // get target user and their neighbors pairs and count 
  // of distinct movies that they have rented in common
  MATCH (c1:Customer)-[:RENTED]->(f:Film)<-[:RENTED]-(c2:Customer)
  WHERE c1 <> c2 AND c1.customerID = $cid
  WITH c1, c2, COUNT(DISTINCT f) as intersection

  // get count of all the distinct movies that they have rented (Union)
  MATCH (c:Customer)-[:RENTED]->(f:Film)
  WHERE c in [c1, c2]
  WITH c1, c2, intersection, COUNT(DISTINCT f) as union

  // compute Jaccard index
  WITH c1, c2, intersection, union, (intersection * 1.0 / union) as jaccard_index
        
  // get top k nearest neighbors based on Jaccard index
  ORDER BY jaccard_index DESC, c2.customerID
  WITH c1, COLLECT([c2.customerID, jaccard_index, intersection, union])[0..$k] as neighbors
     
  WHERE SIZE(neighbors) = $k   // return users with enough neighbors
  RETURN c1.customerID as customer, neighbors"""

neighbors = {}
for i in g.run(query, cid = "13", k = 25).data():
    neighbors[i["customer"]] = i["neighbors"]

print("# customer13's 25 nearest neighbors: customerID, jaccard_index, intersection, union")
pprint(neighbors)

# customer13's 25 nearest neighbors: customerID, jaccard_index, intersection, union
{'13': [['93', 0.08695652173913043, 4, 46],
        ['211', 0.07142857142857142, 4, 56],
        ['379', 0.06521739130434782, 3, 46],
        ['578', 0.06521739130434782, 3, 46],
        ['134', 0.06382978723404255, 3, 47],
        ['8', 0.06382978723404255, 3, 47],
        ......
        ['464', 0.05, 2, 40],
        ['555', 0.047619047619047616, 2, 42]]}

Now, let’s see the top 5 movies that we can recommend to our target user. In addition, we will look at the number of neighbors who rented these movies.

# get the list of the nearest neighbors IDs
nearest_neighbors = [neighbors["13"][i][0] for i in range(len(neighbors["13"]))]

query = """
  // get top n recommendations for customer 13 from neighbors
  MATCH (c1:Customer),
    (neighbor:Customer)-[:RENTED]->(f:Film) // all movies rented by neighbors
  WHERE c1.customerID = $cid
    AND neighbor.customerID in $nearest_neighbors
    AND not (c1)-[:RENTED]->(f) // filter for movies that our user hasn't rented
        
  WITH c1, f, COUNT(DISTINCT neighbor) as countnns // times rented by nns
  ORDER BY c1.customerID, countnns DESC               
  RETURN c1.customerID as customer, COLLECT([f.Title, countnns])[0..$n] as recommendations"""

recommendations = {}
for i in g.run(query, cid = "13", nearest_neighbors = nearest_neighbors, n = 5).data():
    recommendations[i["customer"]] = i["recommendations"]
    
print("# customer13's recommendations: Movie, number of rentals by neighbors")
pprint(recommendations)

# customer13's recommendations: Movie, number of rentals by neighbors
{'13': [['Goodfellas Salute', 5],
        ['Pacific Amistad', 4],
        ['Streetcar Intentions', 4],
        ['Chill Luck', 4],
        ['Whisperer Giant', 4]]}

Perfect!

Command Line Tool

Now, we have successfully built our recommender system. It can recommend movies to target customers. Why not writing it in a script that we can run in command line using customer IDs as arguments and return recommendations per customer?

## Our Recommender System Script (dvd_recommender.py)
import sys
from pprint import pprint
from py2neo import Graph

cid = sys.argv[1:]

g = Graph("http://localhost:7474/db/data/", password = "password")

def cf_recommender(graph, cid, nearest_neighbors, num_recommendations):

    query = """
      MATCH (c1:Customer)-[:RENTED]->(f:Film)<-[:RENTED]-(c2:Customer)
      WHERE c1 <> c2 AND c1.customerID = $cid
      WITH c1, c2, COUNT(DISTINCT f) as intersection
           
      MATCH (c:Customer)-[:RENTED]->(f:Film)
      WHERE c in [c1, c2]
      WITH c1, c2, intersection, COUNT(DISTINCT f) as union

      WITH c1, c2, intersection, union, 
        (intersection * 1.0 / union) as jaccard_index

      ORDER BY jaccard_index DESC, c2.customerID
      WITH c1, COLLECT(c2)[0..$k] as neighbors
      WHERE SIZE(neighbors) = $k                                              
      UNWIND neighbors as neighbor
      WITH c1, neighbor

      MATCH (neighbor)-[:RENTED]->(f:Film)         
      WHERE not (c1)-[:RENTED]->(f)                        
      WITH c1, f, COUNT(DISTINCT neighbor) as countnns
      ORDER BY c1.customerID, countnns DESC                            
      RETURN c1.customerID as customer, 
        COLLECT(f.Title)[0..$n] as recommendations"""

    recommendations = {}
    # cid = [str(c) for c in cid]
    for c in cid:
        for i in graph.run(query, cid = c, k = nearest_neighbors, n = num_recommendations).data():
            recommendations[i["customer"]] = i["recommendations"]
    return recommendations

pprint(cf_recommender(g, cid, 25, 5))

Let’s now run the system from command line!

$ python dvd_recommender.py 13 11 19 91 

{'13': ['Goodfellas Salute', 
        'Pacific Amistad',
        'Streetcar Intentions',
        'Chill Luck',
        'Whisperer Giant'],
 '11': ['Sweethearts Suspects',
        'Tights Dawn',
        'Island Exorcist',
        'Jason Trap',
        'Earth Vision'],
 '19': ['Fatal Haunted',
        'Crossroads Casualties',
        'Ridgemont Submarine',
        'Wonderland Christmas',
        'Uptown Young'],
 '91': ['Forrester Comancheros',
        'Anaconda Confessions',
        'Bear Graceland',
        'Greatest North',
        'Hanover Galaxy']}

Share

Leave a Reply

Your email address will not be published. Required fields are marked *