Editor's note: we'll explore present and future applications of cryptocurrency and blockchain technologies at our upcoming Radar Summit: Bitcoin & the Blockchain on Jan. 27, 2015, in San Francisco.
A few data scientists are starting to play around with cryptocurrency data, and as bitcoin and related technologies gain traction, I expect more to wade in. As the space matures, there will be many interesting applications based on analytics over the transaction data produced by these technologies. The blockchain (the distributed ledger that contains all bitcoin transactions) is publicly available, and the underlying data set is of modest size. Data scientists can work with this data once it's loaded into familiar data structures, but producing insights requires some domain knowledge and expertise.
I recently spoke with Sarah Meiklejohn, a lecturer at UCL, and an expert on computer security and cryptocurrencies. She was part of an academic research team that studied pseudo-anonymity (pseudonymity) in bitcoin. In particular, they used transaction data to compare potential anonymity to the actual anonymity achieved by users. A bitcoin user can use many different public keys, but careful research led to a few heuristics that allowed them to cluster addresses belonging to the same user:
In theory, a user can go by many different pseudonyms. If that user is careful and keeps the activity of those different pseudonyms separate, completely distinct from one another, then they can really maintain a level of, maybe not anonymity, but again, cryptographically it's called pseudo-anonymity. So, if they are a legitimate businessman on the one hand, they can use a certain set of pseudonyms for that activity, and then if they are dealing drugs on Silk Road, they might use a completely different set of pseudonyms for that, and you wouldn't be able to tell that that's the same user.
It turns out in reality, though, the way most users and services were using bitcoin was really not following any of the guidelines that you would need to follow in order to achieve this notion of pseudo-anonymity. So, basically, what we were able to do is develop certain heuristics for clustering together different public keys, or different pseudonyms. I'm happy to get into the technical details, but I'm not sure how relevant they are. The point is that, if you think these are good heuristics, then basically they provided evidence that a certain set of pseudonyms belonged to the same owner. And that owner could be a single individual or it could be an entire service, like Bitstamp or another exchange.
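To make the clustering idea concrete, here is a minimal Python sketch. It assumes some heuristic (not specified in the interview) has already produced groups of public keys believed to share an owner, and it merges overlapping groups with a union-find structure. The function name `cluster_keys` and the input format are hypothetical, chosen only for illustration; this is not the research team's code.

```python
from collections import defaultdict


def cluster_keys(linked_groups):
    """Merge public keys into clusters of presumed common ownership.

    linked_groups: iterable of lists of keys that a (hypothetical)
    heuristic claims belong to the same owner.
    Returns a sorted list of sorted clusters.
    """
    parent = {}

    def find(key):
        # Register unseen keys as their own root, then walk up to the root.
        parent.setdefault(key, key)
        while parent[key] != key:
            parent[key] = parent[parent[key]]  # path halving
            key = parent[key]
        return key

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for group in linked_groups:
        group = list(group)
        for key in group:
            find(key)  # make sure singleton groups are recorded too
        for other in group[1:]:
            union(group[0], other)

    clusters = defaultdict(set)
    for key in list(parent):
        clusters[find(key)].add(key)
    return sorted(sorted(members) for members in clusters.values())


# Illustrative input: A-B and B-C are linked, so A, B, and C end up together.
evidence = [["A", "B"], ["B", "C"], ["D"], ["E", "F"]]
print(cluster_keys(evidence))
```

Because links are transitive, a single shared key is enough to merge two groups, which is why a cluster may grow to cover an entire service rather than one individual.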
In the course of their research, Sarah and her collaborators realized that addresses used to collect excess bitcoins (change addresses) provided a good clustering mechanism:
If you think about making change with physical cash, if I walk into a physical store and I hand the clerk a $20 bill, and my thing only costs $5, then I'm going to get $15 back in change, right? And in bitcoin, that process of making change is actually completely transparent, so you can observe the change public key in the blockchain.
What we tried to do is distinguish change addresses, as we called them, from the legitimate recipient in the transaction. So, in my example in the store, you'd see two public keys as the outputs in that transaction: one of them would receive $5, and the other would receive $15. What we tried to do is develop a heuristic for distinguishing that $15 part of the transaction from the legitimate $5 recipient. That turned out to be much trickier, but that really was the bulk of the work in the project, just trying to make that heuristic as safe as possible.
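The store example can be sketched in Python. The interview does not spell out the rule the team used, so the rule below is an assumption made purely for illustration: an output key is treated as change only when it is the single output that has never appeared earlier in the ledger. The transaction layout (dicts with `inputs` and `outputs`) and the names `find_change_key` and `label_change` are likewise hypothetical.

```python
def find_change_key(tx, seen_keys):
    """Return the output key that looks like change, or None if unsure.

    Assumed rule (illustrative only): change goes to a key that has
    never been seen before, while the real recipient's key has.
    """
    fresh = [key for key, _amount in tx["outputs"] if key not in seen_keys]
    # Stay conservative: label change only when the transaction has at
    # least two outputs and exactly one of them is brand new.
    if len(tx["outputs"]) >= 2 and len(fresh) == 1:
        return fresh[0]
    return None


def label_change(transactions):
    """Walk transactions in ledger order and label a change key for each."""
    seen_keys = set()
    labels = []
    for tx in transactions:
        labels.append(find_change_key(tx, seen_keys))
        # Only after labeling do this transaction's keys count as seen.
        seen_keys.update(tx["inputs"])
        seen_keys.update(key for key, _amount in tx["outputs"])
    return labels


# Toy ledger: the store's key appears in an earlier transaction, so in the
# second one only "alice_change" is new (amounts mirror the 5 / 15 split).
ledger = [
    {"inputs": ["bob"], "outputs": [("store", 3), ("alice", 20)]},
    {"inputs": ["alice"], "outputs": [("store", 5), ("alice_change", 15)]},
]
print(label_change(ledger))
```

Returning `None` whenever the pattern is ambiguous reflects the emphasis on keeping the heuristic safe: a wrong change label would link keys belonging to different owners.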
Once they settled on heuristics with which to cluster addresses, the research project still required a data set for testing their theories. This entailed conducting and following transactions through the bitcoin ledger:
Image courtesy of Sarah Meiklejohn.