Earlier today Matt Gemmell posted a really good article explaining hashing and how it can be used for privacy. The basic premise of the article is that data need not be stored in its raw form to enable comparisons to be made. This is a direct response to the recent discovery that the Path social networking app uploads the user’s entire iPhone address book to its servers.
I felt a little demonstration that you can do yourself might help illustrate Matt’s point and give people a better understanding of what he is talking about. As Matt mentioned, there are many different hashing algorithms; the one most used (or at least the one I use the most) is called Message Digest 5 (MD5 for short). You can read more about MD5 here, but if you accept that hashing algorithms change personal information into “Gibberish”, you do not need to understand the details of the algorithm to see this in practice.
For those of you with a Mac, it is relatively straightforward to hash a piece of text with MD5. Go to Applications, then Utilities, and open up Terminal (this is where you can feel like a geek). If you type
md5 -s "some_text"
you will get
MD5 ("some_text") = 32d3f9b84bf99ae5faecc315d389c894
where 32d3f9b84bf99ae5faecc315d389c894 is our “Gibberish” representation of “some_text”. As an alternative to using Terminal, you can generate an MD5 hash for a piece of text at this website. Alright, let’s get started using Matt’s example to walk through how to use hashing for privacy.
Bob is a member of Path, a social networking site. His address book contains two friends, Jane and John. Jane’s email is email@example.com and John’s email is firstname.lastname@example.org. When Bob joined Path, the Path application wanted to upload the emails of both Jane and John. However, in this example Path followed Matt’s advice and decided to use hashing to protect the privacy of their users. So before uploading the email addresses they used MD5 to hash them, resulting in the following:
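If you do not have a Mac and would rather not paste text into a website, a few lines of Python will do the same job. This is only a small sketch: the function name md5_hex is my own invention, and I am assuming the text is encoded as UTF-8 before hashing, which for plain text like this gives the same bytes that Terminal hashes.

```python
import hashlib


def md5_hex(text: str) -> str:
    """Return the MD5 hash of text as 32 hexadecimal characters."""
    # Hash functions work on bytes, so the text is encoded first (UTF-8 assumed).
    return hashlib.md5(text.encode("utf-8")).hexdigest()


# Same input as the Terminal command: md5 -s "some_text"
print(md5_hex("some_text"))
```

Running it should print the same “Gibberish” that Terminal gave for “some_text”.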
- md5("email@example.com") = "3e706175a85e3bc0a4dd52317d87285d"
- md5("firstname.lastname@example.org") = "65e2d14a9b1ec95f06730d7956c90e65"
Now the Path client uploads only the resulting hashes, and those hashes are all that Path stores (try hashing these emails yourself; you should see that your Gibberish matches mine).
Having heard Bob talking about Path, Jane decides that she would really like to see what all of the fuss is about and signs up. Path asks for her own email address, email@example.com, which she gives as part of the sign-up process, and they hash it to get 3e706175a85e3bc0a4dd52317d87285d. Now they search their database to see if any members have a contact with this email hash, which one does. They find Bob, thanks to the hashes uploaded from his contacts list earlier, and since she appears to know him they recommend that the two become “friends”.
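To make the client side of this concrete, here is roughly what hashing an address book before upload could look like in Python. I have no idea how Path actually does it, so treat this as a sketch: hash_contacts and md5_hex are names I made up, and the list simply holds the two example addresses from above.

```python
import hashlib


def md5_hex(text: str) -> str:
    """Return the MD5 hash of text as 32 hexadecimal characters."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def hash_contacts(emails):
    """Return the hashes to upload in place of the raw email addresses."""
    return [md5_hex(email) for email in emails]


# Bob's address book, using the two example addresses from this post.
address_book = ["email@example.com", "firstname.lastname@example.org"]

# Only this list of hashes would leave the phone; the addresses stay put.
upload = hash_contacts(address_book)
for email_hash in upload:
    print(email_hash)
```

The important part is that nothing in the uploaded list looks like an email address any more.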
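The matching step on the server can be sketched the same way. Everything here is hypothetical: ContactIndex, store_contacts and suggest_friends are names I invented, and a Python dictionary stands in for whatever database a real service would use. The point is only that the lookup works entirely on hashes.

```python
import hashlib
from collections import defaultdict


def md5_hex(text: str) -> str:
    """Return the MD5 hash of text as 32 hexadecimal characters."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


class ContactIndex:
    """Hypothetical in-memory stand-in for the service's database."""

    def __init__(self):
        # Maps a contact hash to the members whose address books contained it.
        self._owners_by_hash = defaultdict(set)

    def store_contacts(self, member, contact_hashes):
        """Record the hashes a member uploaded from their address book."""
        for contact_hash in contact_hashes:
            self._owners_by_hash[contact_hash].add(member)

    def suggest_friends(self, new_member_email):
        """Hash the new member's email and return members who know that hash."""
        email_hash = md5_hex(new_member_email)
        return sorted(self._owners_by_hash.get(email_hash, set()))


index = ContactIndex()

# Bob joins first; only hashes of his contacts reach the server.
index.store_contacts(
    "Bob",
    [md5_hex("email@example.com"), md5_hex("firstname.lastname@example.org")],
)

# Jane signs up with her own address and is matched against stored hashes.
print(index.suggest_friends("email@example.com"))  # ['Bob']
```

An address that nobody has in their contacts simply produces a hash that is not in the index, so no suggestion is made.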
I hope this helps add a little more to Matt’s original article and will allow some of you to try it out.
Note: As kindly pointed out by Richard Buckle, MD5 has been considered broken for several years, in that pairs of different documents can be deliberately crafted to produce the same hash (a collision). Someone could therefore construct two different strings that share a hash, which, as I am sure you can see from the example, would be problematic. This also raises some serious security questions that I won’t go into here, but you can read about them on Wikipedia should you be interested. Instead, it is recommended that a hashing algorithm from the SHA-2 family is used for security applications. If you would like to see this in action, you can follow the previous example, replacing the MD5 command with the following one, which uses SHA-256 (printf is used rather than echo so that no trailing newline is added to the text being hashed):
printf '%s' "some_text" | shasum -a 256
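If you have been following along in Python rather than Terminal, the change is just as small. Again this is a sketch with a made-up function name, sha256_hex, and the text is assumed to be UTF-8. It hashes exactly the characters in the string, with no trailing newline.

```python
import hashlib


def sha256_hex(text: str) -> str:
    """Return the SHA-256 hash of text as 64 hexadecimal characters."""
    # Same idea as MD5, just a different algorithm from hashlib.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


print(sha256_hex("some_text"))
```

The Gibberish is twice as long as with MD5 (64 characters instead of 32), but the matching in the example works exactly the same way.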