Reddit's empire no longer founded on a flawed algorithm
About a month ago on January 12 2014, reddit made a seemingly minor
change to its hot
algorithm. This algorithm is responsible for
determining which stories appear at the top of the front page and each
individual subreddit. You can see why this algorithm is crucial to the
company as it decides which submissions live and prosper as well as
which ones wither and die.
The algorithm was often criticized as being flawed, the most recent occurence happening on December 9 2013 (reddit thread). Reddit’s response was always that there’s was more to it and that the “flaw” was by-design.
The hot sort might seem a bit weird, but it’s intentionally set up that way.
You can see most of the responses this criticism on this GitHub issue.
So why did they suddenly change? What was the impact? Let’s look at the code.
Old algorithm
cpdef double _hot(long ups, long downs, double date):
"""The hot formula. Should match the equivalent function in postgres."""
s = score(ups, downs)
order = log10(max(abs(s), 1))
if s > 0:
sign = 1
elif s < 0:
sign = -1
else:
sign = 0
seconds = date - 1134028003
return round(order + sign * seconds / 45000, 7)
New algorithm
cpdef double _hot(long ups, long downs, double date):
"""The hot formula. Should match the equivalent function in postgres."""
s = score(ups, downs)
order = log10(max(abs(s), 1))
if s > 0:
sign = 1
elif s < 0:
sign = -1
else:
sign = 0
seconds = date - 1134028003
return round(sign * order + seconds / 45000, 7)
Notice the difference?
- return round(order + sign * seconds / 45000, 7)
+ return round(sign * order + seconds / 45000, 7)
Let’s take a look at the impact on a couple examples.
Impact of score
The score (upvotes - downvotes) is used to compute the order
value.
old new
# score > 0
_hot(1, 0, 1262304000) 2850.0 2850.0
_hot(1000, 500, 1262304000) 2852.69897 2852.69897
# score < 0
_hot(0, 1, 1262304000) -2851.0 2850.0
_hot(1000, 1500, 1262304000) -2848.30103 2847.30103
# score == 0
_hot(1000, 1000, 1262304000) 0.0 2850.0
score > 0
We notice right away that submissions with a positive score haven’t
changed at all. It makes sense since the change only moved where sign
was applied and in the case of positive scores, the sign is always 1.
score < 0
Here we notice a drastic change in the results. Submissions that previously had a negative ranking now have a positive one. Furthermore, the lower the score, the lower the submission will be ranked. That was not the case before. Yes, that’s right, older submissions with a negative score would rank higher with the old algorithm than newer submissions with the same score. In practice this wasn’t such a problem because the interesting, positive submissions would still be ranked before them anyway.
score == 0
Finally, when the score was 0, all submissions would be ranked at 0, no matter what. With the new algorithm, the score has no impact on the rank, only it’s age as we’ll see next.
Impact of age
1262304000
and 1353107345
are simply timestamps represented in
seconds since epoch. Higher is more recent.
old new
# score > 0
_hot(1000, 500, 1262304000) 2852.69897 2852.69897
_hot(1000, 500, 1353107345) 4870.69897 4870.69897
# score < 0
_hot(1000, 1500, 1262304000) -2848.30103 2847.30103
_hot(1000, 1500, 1353107345) -4866.30103 4865.30103
# score == 0
_hot(1000, 1000, 1262304000) 0.0 2850.0
_hot(1000, 1000, 1353107345) 0.0 4868.0
score > 0
Again, no change here for the same reasons.
score < 0
As before, the rankings are now positive with the new algorithm. But the biggest change is that newer submissions now rank higher than older ones. You’d think that would have been the case before, but submissions with a negative scores would quickly disappear from the front page with the previous flawed algorithm. Now, these stories have at least a chance to show up if their score isn’t too abysmal.
score == 0
The rankings for submissions with a score of 0 serve as the baseline for submissions with a similar date. Furthermore, there’s now a way to compare the rank of these submissions.
Bird’s-eye view
If we distant ourselves from these results, this is what we see:
old algorithm:
positive submissions
neutral (score == 0) submissions
negative submissions
new algorithm:
positive submissions within a time frame
neutral submissions within a time frame
negative submissions within a time farme
positive submissions within an older time frame
...
Of course in practice things aren’t as tidy as this, but the point to remember is that neutral and negative submissions can now appear before older positive submissions. You’re still unlikely to see submissions with a score of 0 on your front page. However, if you look at the next pages or visit a less active subreddit, at least now the positively ranked submissions won’t drown out the more recent, less well received submissions.