
Slackersnooze

Hacker News: Personalized


Welcome to Slackersnooze.

Slackersnooze is a clone of Hacker News, a popular news aggregator for hacker types. I enjoy occasionally browsing its posts, but I'm only interested in a small subset of them. To spend less time on the site, I wanted an easier way to surface the posts that actually interest me, which is why I built a recommendation system for it. At the same time, I treated it as a learning opportunity to work with technologies I hadn't really used before: Python 3, NumPy/SciPy, and GloVe word vectors. I kept the idea relatively simple to implement:

To recommend by headline, I needed some way of conceptually representing each word in a headline, which is why I chose to work with GloVe vectors. However, some words are more important than others (e.g., "docker" vs. "the"), so I weight each word's vector by its tf-idf score. To vectorize an entire headline, I simply sum the vectors of its words, each weighted by its tf-idf score.
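The weighted sum above can be sketched as follows. This is a minimal illustration, not the project's actual code: `glove` and `idf` are hypothetical stand-ins for the real word-vector and tf-idf lookups.

```python
from collections import Counter

import numpy as np

def headline_vector(headline, glove, idf, dim=300):
    """Vectorize a headline: sum each word's GloVe vector, weighted by tf-idf.

    `glove` maps word -> dim-d vector; `idf` maps word -> idf score.
    Both are hypothetical stand-ins for the real lookups.
    """
    words = headline.lower().split()
    vec = np.zeros(dim)
    for word, count in Counter(words).items():
        if word in glove:
            tf = count / len(words)  # term frequency within this headline
            vec += tf * idf.get(word, 0.0) * glove[word]
    return vec
```

Words without a GloVe vector (typos, rare tokens) simply contribute nothing to the sum.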

Inserting GloVe vectors

I used the cased Common Crawl dataset, which provides vectors for 2.2 million words. At 5.5 GB, loading the file into memory on every start is prohibitively expensive. Instead, I inserted the vectors into Postgres, with one row per word holding its 300-dimensional vector as a Postgres array.
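A sketch of how that load might work: the GloVe text format is one word per line followed by its space-separated floats, so it can be streamed line by line without ever holding the whole file in memory. The table and column names here are hypothetical, as is the exact insert statement.

```python
# Hypothetical schema: glove(word TEXT PRIMARY KEY, vec REAL[])
INSERT_SQL = ("INSERT INTO glove (word, vec) VALUES (%s, %s) "
              "ON CONFLICT (word) DO NOTHING")

def parse_glove(lines):
    """Yield (word, vector) pairs from GloVe's plain-text format:
    each line is the word followed by space-separated floats."""
    for line in lines:
        parts = line.rstrip("\n").split(" ")
        yield parts[0], [float(x) for x in parts[1:]]

# With a psycopg2 cursor, the load would look roughly like:
#   with open("glove.840B.300d.txt", encoding="utf-8") as f:
#       for word, vec in parse_glove(f):
#           cur.execute(INSERT_SQL, (word, vec))
```

Python lists adapt directly to Postgres arrays in psycopg2, which is what makes the one-row-per-word layout convenient.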

Polling the HN API

To keep up to date with current articles, I poll the HN API for the top 500 articles every 5 minutes, fetch each article's metadata (posting user, number of comments, number of points, id, title, date posted, and article URL), and upsert it into Postgres, updating the comment count and score if the article has already been inserted. Each poll also updates the word frequencies that feed the tf-idf scores, so they evolve over time: each unique word per headline has its count incremented by one.
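The polling step could look roughly like this. The HN API endpoints are real; the table name and columns in the upsert are hypothetical stand-ins for the actual schema.

```python
import json
from urllib.request import urlopen

HN = "https://hacker-news.firebaseio.com/v0"

# Hypothetical upsert: update score/comments when the story already exists.
UPSERT_SQL = """
INSERT INTO stories (id, title, author, score, comments, url, posted)
VALUES (%(id)s, %(title)s, %(by)s, %(score)s, %(descendants)s, %(url)s,
        to_timestamp(%(time)s))
ON CONFLICT (id) DO UPDATE
    SET score = EXCLUDED.score, comments = EXCLUDED.comments
"""

def poll_top_stories(limit=500):
    """Fetch the top story ids, then each story's metadata."""
    ids = json.load(urlopen(f"{HN}/topstories.json"))[:limit]
    return [json.load(urlopen(f"{HN}/item/{i}.json")) for i in ids]

def headline_word_counts(title):
    """Each unique word in a headline counts once toward its document frequency."""
    return {word: 1 for word in set(title.lower().split())}
```

Counting each word at most once per headline is what makes the accumulated counts a document frequency, which is what tf-idf's idf term needs.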

Copying design

To make it look like Hacker News, I reuse the same CSS file and images. Since I'm only reimplementing the sorting of the top articles, I can get away with using most of the same HTML, but I need to template out the HTML for each headline so I can display the title along with its metadata. For templating, I'm using Python's Jinja2 library.
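A stripped-down sketch of what a per-headline template might look like; the real markup mirrors HN's actual HTML and class names, which are only approximated here.

```python
from jinja2 import Template

# Hypothetical per-headline row; the real template reuses HN's own markup.
row = Template(
    '<tr class="athing"><td class="title">'
    '<a href="{{ url }}">{{ title }}</a></td></tr>\n'
    '<tr><td class="subtext">{{ points }} points by {{ user }} | '
    '<a href="item?id={{ id }}">{{ comments }} comments</a></td></tr>'
)

html = row.render(title="Show HN: Slackersnooze", url="https://example.com",
                  points=42, user="pg", id=1, comments=7)
```

Rendering one row per article and concatenating them slots straight into the otherwise-static HN page shell.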

Tracking users' clicks

A user can click an article's headline link or its comments link, and either click signals interest in that article's topic. To track these clicks, I added two redirect endpoints that first record the user's click in the database, then redirect to the article's link or its comments section, depending on which was clicked.
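The post doesn't say which web framework the site uses, so here's a sketch of the two redirect endpoints using Flask as a stand-in; the in-memory `clicks` list stands in for the database write, and the article URL would really come from the stories table.

```python
from flask import Flask, redirect

app = Flask(__name__)
clicks = []  # stand-in for the real database write

@app.route("/out/<int:story_id>")
def out(story_id):
    """Record the click, then send the user on to the article itself."""
    clicks.append(("article", story_id))
    # The real article URL would be looked up in the database.
    return redirect(f"https://example.com/article/{story_id}")

@app.route("/comments/<int:story_id>")
def comments(story_id):
    """Record the click, then send the user to the HN comments page."""
    clicks.append(("comments", story_id))
    return redirect(f"https://news.ycombinator.com/item?id={story_id}")
```

Pointing every headline and comments link at these endpoints is what lets the click land in the database before the browser leaves the site.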

Re-sorting top articles

This is the fun part. When a new user arrives at Slackersnooze, we give them a new cookie. Since that user hasn't clicked on any articles yet, we give a default sort based on an article's popularity, i.e., the number of points HN users have given it.

After a user has clicked their first article, we re-sort based on the distance between each article's vector and the combination of articles the user has clicked on. At first, I just averaged the clicked articles' vectors together, but this started to break down when a user had diverse interests (i.e., large variance). A fellow RC'er introduced me to the Mahalanobis distance, and I quickly found that SciPy has an implementation that will magically do it for me.
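The re-sort described above can be sketched with `scipy.spatial.distance.mahalanobis`; the function name and the pseudo-inverse guard are my additions, since with few clicks and 300 dimensions the sample covariance is singular.

```python
import numpy as np
from scipy.spatial.distance import mahalanobis

def rank_articles(article_vecs, clicked_vecs):
    """Sort article indices by Mahalanobis distance from the user's clicks.

    The clicked articles define a mean and a covariance, so articles near the
    whole cloud of interests rank first, even when interests are diverse.
    """
    clicked = np.asarray(clicked_vecs)
    mean = clicked.mean(axis=0)
    # pinv guards against a singular covariance (few clicks, 300 dims).
    vi = np.linalg.pinv(np.cov(clicked, rowvar=False))
    dists = [mahalanobis(v, mean, vi) for v in article_vecs]
    return np.argsort(dists)
```

Unlike a plain average-then-Euclidean ranking, the inverse covariance stretches the distance along directions where the user's clicks vary little, so diverse interests don't collapse into one misleading midpoint.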

Deploying

Because AWS is cheap(ish), I decided to deploy this with Elastic Beanstalk and Docker. A Postgres RDS instance, an autoscaling group, and a load balancer all come for free, which is nice. I purchased the domain from Namecheap and used Route 53 to hook it up.