Cracking Recommender Systems with Heterogeneous Graph Learning — Part 1

Lokesh Sharma
8 min read · Aug 14, 2023


So, here’s the scoop: HeteroGraphs hold a world of potential, and regular ol’ graphs just can’t cut it. Conventional homogeneous graphs struggle to handle the complexity of diverse relationships and edge types. It’s time for the big guns: advanced architectures that can wrangle datasets with several edge and relation types. In this article we will learn how to create a heterograph from tabular datasets.

Now, let’s break it down:

Recommendation as a Link Prediction Task

Dataset Ingestion

We’ll start by downloading and loading the flat CSV files from MovieLens so that we can recommend movies to users. Since we haven’t delved into ontologies yet, we’ll manually model the CSV files as a property-labeled graph, extracting as much information as possible 😬

# Imports required by the utilities below
import os

import numpy as np
import pandas as pd


# Define utility functions
def log_dataframe_details(df: pd.DataFrame, filename: str) -> None:
    print(f"\n{filename} - {len(df)} records")
    print('-' * 50)
    print(f'Index: {df.index.name}\t Unique values in columns')
    for column in df.columns:
        print(f'{column}: {len(df[column].unique())}')
    print('=' * 100)


def load_csv(filename: str, index_col: str = 'movieId', verbose: bool = False) \
        -> pd.DataFrame:
    """
    Load a CSV file into a Pandas DataFrame, and perform optional preprocessing.

    :param filename: The name of the CSV file to load (without the .csv extension).
    :param index_col: The column to be used as the DataFrame index. Default is 'movieId'.
    :param verbose: If True, print DataFrame details after loading. Default is False.
    :return: The loaded and optionally preprocessed DataFrame.
    """
    # Construct the file path using the current working directory
    filepath = os.path.join(os.getcwd(), 'ml-latest', f'{filename}.csv')
    # Read the CSV file into a DataFrame and set the index column
    df = pd.read_csv(filepath, index_col=index_col).sort_index()
    # Remove the 'timestamp' column if present
    if 'timestamp' in df.columns:
        df.drop(columns=['timestamp'], inplace=True)
    if verbose:  # Optionally, print DataFrame details if verbose mode is enabled
        log_dataframe_details(df, filename)
    return df


# Load flat files into memory
movies = load_csv(filename='movies')
ratings = load_csv(filename='ratings')
tags = load_csv(filename='tags')
genome_scores = load_csv(filename='genome-scores', verbose=True)

# Not relevant
# links = load_csv(filename='links')
# genome_tags = load_csv(filename='genome-tags', index_col='tagId')
Tag Relevance Dataset for MovieIds

The genome_scores dataset spills the beans on how relevant different tags are to movies. These tags aren't thrown around randomly; each tag carries a numerical relevance value (on a scale of 0 to 1). In this dataset, we've got two main columns:
- tagId: a unique identifier for each tag
- relevance: the numerical value mentioned above, representing how closely a tag relates to a particular movie. The higher the value, the stronger the connection.

print(genome_scores.head())
print(f'Unique tag ids: {len(genome_scores.tagId.unique())}')
print(f'Unique movies tagged: {len(genome_scores.index.unique())}')
Now, let’s dive into some stats. We’ve got a total of 1128 unique tag ids and 16376 unique movies. Here’s where things get clever: among the 86K movies in the dataset, only 16376 have tag relevance scores. Let us keep our dataset tight and tidy by focusing on those movies. Why? Because when we encode movies as 1128-length numerical vectors, we don't want any missing values in the mix.
Tags Dataset

The "movieId" tells us which movie received the tag, while the "userId" reveals the daring tagger. And finally, the "tag" is the graffiti itself – a word or phrase that captures the user's thoughts about the movie. There are a total of 25280 unique user ids – each with their own style of tagging. Now, here's the twist: Among 330K potential users, only 25280 have joined the tagging party. Let us encode users based on their tag choices - If two users tend to use similar tags, we'll assume they have similar tastes. It's not a guaranteed truth, but hey, we're working with the data we've got. So, we're creating a subset of the datasets that focuses on the users and movies that have tags in common.


# Filter dataset based on movieId present in genome-scores
moviesId = np.unique(genome_scores.index)
# Consider only movies which have a relevance scores in genome_scores
tags = tags[tags.index.isin(moviesId)]
usersId = np.unique(tags.userId)

print(f'Filtering datasets for: {len(moviesId)} movies & {len(usersId)} users')
movies = movies[movies.index.isin(moviesId)]
ratings = ratings[(ratings.index.isin(moviesId)) & (ratings.userId.isin(usersId))]

log_dataframe_details(movies, filename='movies')
log_dataframe_details(ratings, filename='ratings')
log_dataframe_details(tags, filename='tags')

HeteroGraphs Unveiled

We’ll explore the concept of heterographs and showcase their distinct advantages. We’ll also build one in the context of the MovieLens dataset, with additional edge properties to enhance its capabilities.

Heterographs offer a unique way to represent complex real-world scenarios in a flexible data format. In this type of graph, nodes and edges can belong to different categories, allowing for more accurate modeling of diverse relationships. For instance, consider a recommendation system with users, products, and various interactions as nodes and edges. This complexity goes beyond what a single homogeneous graph, with a single node and edge type, can handle.

PyTorch Geometric provides efficient tools for working with heterogeneous graphs. However, while the library is powerful, there’s a need for more comprehensive examples and discussions on effectively modeling tabular datasets as heterogeneous graphs for GNN training. The article intends to bridge this gap by utilizing the Movielens dataset and carefully crafting a recommendation system for movies.

The advantages of heterogeneous graphs are profound, showcasing their impact on advanced graph-based analyses like personalized recommendations, link prediction, and community detection. Here’s how these graphs can significantly enhance existing recommender systems:

  • Capturing Complex Relationships: Recommender systems thrive on understanding various interactions between users and items. Heterogeneous graphs excel at precisely capturing these intricate relationships, leading to superior modeling of user preferences.
  • Contextual Recommendations: Heterogeneous graphs enable context-aware recommendations by incorporating factors like time, location, and device type. This results in highly personalized suggestions.
  • Addressing the Cold Start Problem: When dealing with new users or items with limited interaction history, heterogeneous graphs integrate additional information, such as user demographics or item attributes. This results in more informed and accurate recommendations.
  • Mitigating Data Sparsity: Traditional methods often struggle with data sparsity. Heterogeneous graphs, unlike approaches based on adjacency matrices, directly model relationships between source and target nodes as (source, relation, target) triplets, alleviating data sparsity concerns (see the sketch after this list).
  • Facilitating Cross-Domain Recommendations: Across diverse domains like music and movies, heterogeneous graphs seamlessly enable cross-domain recommendations. This leverages shared user behaviors and attributes for more comprehensive suggestions.
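
To make the triplet idea concrete, here is a minimal sketch (assuming PyTorch Geometric is installed; the 'user'/'item'/'buys' names are illustrative, not part of our dataset) of how a heterograph keys every relation by a (source, relation, target) triplet:

import torch
from torch_geometric.data import HeteroData

# Toy heterograph with two node types and one relation type (illustrative names)
toy = HeteroData()
toy['user'].node_id = torch.arange(3)  # 3 user nodes
toy['item'].node_id = torch.arange(2)  # 2 item nodes
# The edge store is keyed by the ('user', 'buys', 'item') triplet;
# each column of edge_index is one (source, target) pair
toy['user', 'buys', 'item'].edge_index = torch.tensor([[0, 1, 2],
                                                       [0, 0, 1]])
print(toy)  # shows the node stores plus the triplet-keyed edge store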

Let us do some more preprocessing on the available textual details with a sentence-transformer architecture.

# Imports required by the encoder below
import torch
from sentence_transformers import SentenceTransformer


class TextEncoder:
    """
    A class for encoding text using a SentenceTransformer model.
    """
    def __init__(self, model='all-MiniLM-L6-v2', device=None):
        """
        :param model: Name of the SentenceTransformer model to use.
        :param device: Device to use for model inference. Default is None.
        """
        self.device = device
        self.model = SentenceTransformer(model, device=self.device)

    @torch.no_grad()
    def __call__(self, values: list):
        """
        Encode a list of text values into embeddings.

        :param values: List of text values to encode.
        :return: Encoded embeddings as a PyTorch tensor.
        """
        x = self.model.encode(values,
                              show_progress_bar=True,
                              convert_to_tensor=True,
                              device=self.device)
        return x.cpu()


# Check if CUDA is available, and set the device accordingly
device = 'cuda:0' if torch.cuda.is_available() else 'cpu'
# Create an instance of the TextEncoder class with the determined device
encoder = TextEncoder(device=device)
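
As a quick smoke test (the sample strings below are made up), the default all-MiniLM-L6-v2 model produces 384-dimensional embeddings:

# Hypothetical sample inputs; all-MiniLM-L6-v2 embeddings are 384-dimensional
sample = encoder(['A space adventure', 'A heartfelt romantic comedy'])
print(sample.shape)  # torch.Size([2, 384])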

Tabular to Graph Conversion

Next comes an in-depth look at how we transform tabular data into a heterograph format. We’ve filtered through the data universe and chosen our stars: 16376 movies and 24683 users.


# HeteroData comes from PyTorch Geometric
from torch_geometric.data import HeteroData

# Create a data object of type `torch_geometric.data.HeteroData`
graph = HeteroData()
# Identify node types ['movie', 'users'] using a single string.
# Node ids run well past 255, so use torch.long; torch.uint8 would silently overflow.
graph['movie'].node_id = torch.tensor(moviesId, dtype=torch.long)
graph['users'].node_id = torch.tensor(usersId, dtype=torch.long)
print(graph)
# Initialize feature vectors for movie nodes and user nodes
print("> Encoding Movie Titles...")
title_encoded = encoder(movies.title.values)
print("> Encoding Genres...")
genres_encoded = encoder(movies.genres.values)

# Group genome scores by movieId and create a dictionary with relevance lists
genome_scores_dict = genome_scores.groupby(
    genome_scores.index.name)['relevance'].apply(list).to_dict()
genome_scores_dict = dict(sorted(genome_scores_dict.items()))
genome_scores_encoded = torch.tensor(list(genome_scores_dict.values()))


print('Movie nodes feature matrices:')
print(f'Title: {title_encoded.shape}')
print(f'Genre: {genres_encoded.shape}')
print(f'Genome: {genome_scores_encoded.shape}')
graph['movie'].title = title_encoded
graph['movie'].genres = genres_encoded
graph['movie'].genome_scores = genome_scores_encoded
print(graph)
Node Features for Movie Entity Type

# Group user tags by userId and concatenate them
users_tags = tags.groupby(tags.userId)['tag'].apply(lambda x: ', '.join(x))
print("> Encoding User Tags...")
users_tags_encoded = encoder(users_tags.values)

print('User nodes feature matrices:')
print(f'Tags Used: {users_tags_encoded.shape}')
graph['users'].tags = users_tags_encoded
print(graph)
Node Features for Users Entity Type

# Create edges and edge properties for user-rating-movie relationships
src_node_ids = torch.tensor(ratings.userId.values, dtype=torch.long)
dst_node_ids = torch.tensor(ratings.index.values, dtype=torch.long)
user_rating_movie_edge_index = torch.stack([src_node_ids, dst_node_ids], dim=0)
user_rating_movie_edge_attr = torch.tensor(ratings.rating.values, dtype=torch.float32)

# Set edge information for user-rating-movie relationships
graph['users', 'ratings', 'movie'].edge_index = user_rating_movie_edge_index
graph['users', 'ratings', 'movie'].edge_attr = user_rating_movie_edge_attr
print(graph)

But what’s a network without connections? That’s where our edges come into play. They’re like bridges that link nodes together, revealing the relationships within our data galaxy.

In particular, let’s focus on the (users, ratings, movie) relationship – it's a triad of connections that form a bridge between users, their ratings, and the movies they've rated. The edge_index tells us who's linked to whom, while the edge_attr gives us the ratings themselves.
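
To make that concrete, here's a tiny, hypothetical inspection snippet: column k of edge_index holds one (user, movie) pair, and edge_attr[k] holds the matching rating.

# Hypothetical sanity check: each column of edge_index is one (user, movie) pair
k = 0
u = user_rating_movie_edge_index[0, k].item()  # source node: userId
m = user_rating_movie_edge_index[1, k].item()  # target node: movieId
r = user_rating_movie_edge_attr[k].item()      # the rating on MovieLens' 0.5-5 scale
print(f'user {u} rated movie {m} with a rating of {r}')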

# Create edges and edge properties for user-tag-movie relationships
user_tag_movie_edge_attr = tags.groupby(
    [tags.userId, tags.index])['tag'].apply(lambda x: ', '.join(x))

src_node_ids = torch.tensor(user_tag_movie_edge_attr.index.get_level_values(
    'userId').values, dtype=torch.long)
dst_node_ids = torch.tensor(user_tag_movie_edge_attr.index.get_level_values(
    'movieId').values, dtype=torch.long)
user_tags_movie_edge_index = torch.stack([src_node_ids, dst_node_ids], dim=0)
print("> Encoding edges between users, tags, and movies...")
user_tag_movie_edge_attr = encoder(user_tag_movie_edge_attr.values)

# Set edge information for user-tag-movie relationships
graph['users', 'tags', 'movie'].edge_index = user_tags_movie_edge_index
graph['users', 'tags', 'movie'].edge_attr = user_tag_movie_edge_attr
print(graph)
Edge Features Representation

Graph Preservation

And there we have it: the end of our journey through the captivating world of heterographs and PyTorch Geometric.

To summarize, heterogeneous graphs within PyTorch Geometric offer a robust framework for modeling intricate relationships in various domains. In the context of recommender systems, they surpass the limitations of conventional data modeling structures. By accurately representing diverse interactions, tackling the cold start problem, providing context-aware recommendations, managing data sparsity, and enabling cross-domain recommendations, heterogeneous graphs emerge as a vital tool for building precise and effective recommender systems in a wide array of business sectors. Recommendation systems are just the tip of the iceberg. Think about it — social graphs, e-commerce wonders, and a treasure trove of user ratings. We’ve learnt how to create heterogeneous graphs, how to breathe life into flat files, and how to lay the groundwork for revolutionary tricks.

# Now the 'graph' object contains the processed data and relationships between nodes and edges.
# It can be used for various graph-related tasks.
# Save the dataset
filepath = os.path.join(os.getcwd(), 'movielens_hetero.pt')
torch.save(graph, filepath)
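
Later, the saved heterograph can be reloaded in one line. A minimal sketch (note: on recent PyTorch releases, torch.load may need weights_only=False to deserialize a HeteroData object):

# Reload the saved heterograph for downstream use
# (weights_only=False may be required on newer PyTorch versions)
graph = torch.load(filepath, weights_only=False)
print(graph)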

Before we bid adieu, let’s talk about a tiny challenge: scalability. Sure, we’ve got a network of 50k nodes and a staggering 10 million edges. It's like having a bustling metropolis of data. But fear not, for the next part dives into the magic of LinkNeighborLoaders, RandomSamplers, and more. We'll be creating batches that would make any chef jealous – batches that fit right into our GPU memory and perform the data splits needed for GNN training and a downstream link prediction task. Stay tuned!

Eager to run the cells? Access the Google Colab Notebook!


Written by Lokesh Sharma

Curious minds are exploring the potential of knowledge graphs in GIS technologies. If topography and graphs interest you too, join me in this journey!
