Don’t Play With The Odds: How The Birthday Paradox Hit Us While Coding Solid

birthday-paradox-hash

Imagine a classroom full of students. How many students does it take so that we reach a 50% probability that 2 of them share the same birthday? There are 365 days in a year, and we’re looking for a 50% probability, then it must be 365/2 = ~182 students, right? Not even close! It actually only takes 23 students. This is known as the birthday paradox. The principle extends further: when considering a class of 70, the probability jumps to 99.9% for two students to share the same birthday.

While coding Solid, we faced the same paradox. Only, instead of 23 students, we were hosting 150k meetings. Shortened SHA1s hashes were the culprit. If you’re using hashes too in your code, you may find yourself in the same situation. Let’s dig a little deeper for you…

The setup - How and why we use SHA1s on Wisembly and Solid

Wisembly is an app to plan, manage and follow up on large-scale meetings. Typically, these don’t happen every day for our clients, so we deal with significantly less meetings than with Solid, which connects to each users’ calendar and manages a lot more meetings per users.

For Wisembly, we use hashes with 7 alphanumeric characters ([a-zA-Z0-9]{7}) for every object we handle through our API, and they are unique inside a meeting namespace (unicity constraint for the couple {meeting, hash} meaning that we could handle having two similar hashes but in different meetings)

As for us, we have unique auto-increment IDs that we use internally. These hashes are used externally by our app and our api to obfuscate internal incremental IDs and so have to be unique inside each meeting we manage. They’re generated by our javascript application and given to our API that has strong MySQL unicity constraint behind it.
We never had a single collision in 5 years of running with this system, mainly because these hashes live among others hashes in small series (we handle 3k to 5k objects maximum by meeting for the largest data sets). This leads to a 0.000001 % chance to have a collision

Solid is a new product, built on a new javascript framework for us. Yet, it shares its API and its backend with Wisembly. There’s only one main difference: because this time we handled remote objects we did not directly create and have already an ID in their own referential (which differs between a provider to another and are not consistent), we decided to normalize them by doing an injective SHA1 hashing pass on them and take the 7 first characters as hash, like git does with its short commit hashes. We ended then with hashes for Solid meetings that were 7 hexadecimal characters ([a-f0-9]{7}).

Doing so, these hashes would fit our backend API requirement which expects them to be 7 alphanumeric digits, and the world was pretty happy with it.

birthday-paradox-hexa-alpha-inclusion

Meetings are referred to by these hashes in the URL, but we strictly check under our applicative security firewall that you are logged in and figure in the attendees list to show you the sensitive content. So, if you’re trying to access a meeting you don’t belong to, for example if a friend directly gave you the link, we’ll show you an error instead, and we’ll keep you out of the meeting.

birthday-paradox-meeting-forbidden
Keep out, creeper, this meeting isn’t yours!

To sum up, the main differences with Wisembly are:

- Like we saw just before, hexadecimal hashes offer a way smaller number of possibilities than with alphanumeric hashes.

- All the meetings’ hashes lives in the same dataset, not namespaced by meeting anymore like we could do in Wisembly
=> The dataset is much, much bigger

- All the meetings are persisted into Redis’ cache to minimize API exchanges with Google and Office 365, before being persisted definitely into MySQL in a relational state when content / modification is done on the Solid side. Indeed, there’s no need to overload our databases with meetings that didn’t go through Solid.
=> We did not perform a unicity constraint check on hashes in Redis keys, unlike we did for MySQL.

It means that if two users have a different meeting on their respective calendar that share the same 7 first digits of their “unique” HMAC-SHA1 generated hash, when Solid syncs with one user’s calendar, it stores the data into Redis, overwriting the other user’s already synced meeting. And it does so over and over again each time one syncs his or her calendar with Solid and put it in Redis’ cache. Which happens a lot.

Never having had a single collision on Wisembly, we were pretty confident this would never happen on Solid either. We also looked at our git repositories. Like you probably do, we have an awful lot of commits, which means we have tons of 7-char short hashes. And we never had any problem with it. There are ([a-f, 0-9]) 16^6) = 17 Million possibilities for our meetings’ hashes. Surely, it would be a long time before we handled so many meetings on Solid! So we thought we were fine…

SPOILER ALERT: WE WEREN’T FINE. Soon enough, the birthday paradox knocked on our door. Let’s take a look at the maths and probabilities at hand.

What happened - Doing the maths

birthday-paradox-chalkboard

When Solid was in the very early stage of its closed beta, it was only used by the Wisembly team and a few other startup friends, rounding up to about 100 users then. This meant at this time approximately 150k meetings in our Redis cache (we’ve since optimized the caching mechanism and only keep ~56k meetings for these same users, caching only 2 weeks worth of meetings, before and after the current sync date, with a 14 days TTL, instead of fetching foolishly all the meetings returned by Google or Office365), and a lot fewer in MySQL (hundreds).

One day, a Wisembly team member saw a meeting appear in his list that belonged to another Solid user. Thankfully, and as we’ve seen above, we perform a check before letting users see the contents of a meeting. So, no critical data was shown outside of the name of the meeting itself. If the user pushed further and tried to access the meeting, he’d have seen the error page we showed above. The very next day, two more users found themselves in the same embarrassing situation. At first, we were flabbergasted that such a thing could happen: 3 collisions almost all at once, while we never had a single issue with short commits in all this time, even on our biggest git repository which has 16k+ commits.

So we looked at the probability for this to happen:

Probability (using the reverse problem formula, which is a pretty good approximation):

birthday-paradox-probabilities-formula

birthday paradox probability table

Here’s what we can take away from this table:

- With 150k meetings, we had a bigger than 99.99% probability of collisions happening… Who would have thought?!! No wonder we saw 3 collisions happen in just 2 days!

- If we did not rely on short SHA1 and instead generated and used a 7 alphanum digits hash, we would have dropped the probability to have a collision under 1%. But this doesn’t mean we would’ve eradicated the issue. As you can see in the table, the birthday paradox would’ve hit us, no matter what. Better sooner than later is our motto when bugfixing in a private beta.

So, why doesn’t it break on git short commits? Simple: it has a way to prevent any collapsing from happening: “If you pass -abbrev-commit to the git log command, the output will use shorter values but keep them unique; it defaults to using seven characters but makes them longer if necessary to keep the SHA1 unambiguous.” You can use them too in Github urls, but there, Github is smart enough to detect a short hash collision and pick the most recent full SHA1 commit associated.

For git, this feature is only a nice to have, and the commits are ultimately referred by their full SHA-1. In our case, it was a mandatory procedure since it’s what defined our meetings’ URLs and expected to be unique!

Luckily, we were quick to spot the problem and quick to react, so nobody got hurt in the end. Here’s what we did.

The solution - Fixing things fast while keeping an eye on the roadmap

birthday-paradox-roadmap

Our very first move was to deactivate the Redis cache in order to prevent any more collision from happening while we were fixing the issue.

We moved from 7 hexa chars to real 7 alphanum chars hashes. So in the end, we had to create a hash map to link randomly generated alphanum hashes with the unique IDs from connected calendars, which we wanted to avoid doing in the first place. We leveraged the intermediate hash map table in Redis cache to ensure the hash’s unicity: if the freshly generated new short hash isn’t unique, then the whole hash is created again (until new generation is unique) and voilà! We have our unique 7 alphanum chars hashes, and no collision. This is not optimal but will work for quite some time and is enough for us right now.

Looking to the future: Re-creating our hashes every time we see a collision for the short, 7 characters hash works for now. But what will happen when we have 100x more users? What about when we’ll have 100,000x more? Then, our hashes will possibly be re-generated and still collapse with another one, so they’ll be generated again. Add an order of magnitude more meetings and we’ll find ourselves in a situation where three generations of hashes will be necessary for a problematic meeting. Then we may get to four, and so forth. To avoid this, we’ll have to switch to a more robust answer. At this point, we’ll turn to the cream of the crops in avoiding collisions in short hashes: URL shortening services. Back in 2007, Jeff Atwood wrapped his head around it over at Coding Horror and posited that “URL shortening services are doing a simple iteration across every permutation of the available characters they have as the URLs come in.” This is quite possibly the failproof, fully scalable answer to our birthday problem.

For now, our fix is more than enough. We’re fine moving forward and pushing new product updates. Join the fun and sign up for Solid the beta!

Does this use case ring a bell for you? Have you ever encountered the birthday paradox in your projects? Found yourself in sticky situations because of the way you handled short hashes? We’d love to hear about it, so let us know in the comments!