Machine Learning Can Identify the Authors of Anonymous Code

Researchers who study stylometry, the statistical analysis of linguistic style, have long known that writing is a unique, individualistic process.

Automated tools can now accurately identify the author of a forum post, for example, as long as they have adequate training data to work with.

Rachel Greenstadt, an associate professor of computer science at Drexel University, and Aylin Caliskan, Greenstadt's former PhD student and now an assistant professor at George Washington University, have found that code, like other forms of stylistic expression, is not anonymous. At the DefCon hacking conference Friday, the pair will present a number of studies they've conducted using machine learning techniques to de-anonymize the authors of code samples.

Think of every aspect that exists in natural language: There are the words you choose, the way you put them together, sentence length, and so on. Greenstadt and Caliskan then narrowed the features to only those that actually distinguish developers from each other, trimming the list from hundreds of thousands to around 50.
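The narrowing step described above can be illustrated with a small sketch. The article does not say which criterion the researchers used to rank features, so this example assumes information gain, one common choice, and the feature names and toy data are hypothetical. It scores each feature by how much it reduces uncertainty about the author and keeps only the top few.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of author labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(values, labels):
    """Drop in author-label entropy after grouping samples by a feature's value."""
    total = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def select_features(samples, labels, keep):
    """Return the names of the `keep` features with the highest information gain."""
    names = list(samples[0].keys())
    gains = {name: information_gain([s[name] for s in samples], labels) for name in names}
    return sorted(gains, key=gains.get, reverse=True)[:keep]


# Hypothetical toy data: one dict of binary style features per code sample.
authors = ["alice", "alice", "bob", "bob"]
samples = [
    {"uses_ternary": 1, "deep_nesting": 1, "has_main": 1},
    {"uses_ternary": 1, "deep_nesting": 0, "has_main": 1},
    {"uses_ternary": 0, "deep_nesting": 1, "has_main": 1},
    {"uses_ternary": 0, "deep_nesting": 0, "has_main": 1},
]

best = select_features(samples, authors, keep=1)
print(best)
```

In this toy set only `uses_ternary` separates the two authors; `has_main` appears in every sample and `deep_nesting` is split evenly, so neither carries any information about who wrote the code.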

Instead, they create "abstract syntax trees," which reflect code's underlying structure, rather than its arbitrary components. Their technique is akin to prioritizing someone's sentence structure, instead of whether they indent each line in a paragraph.
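To make the idea of an abstract syntax tree concrete, here is a minimal sketch using Python's standard `ast` module. This is only an illustration of the concept, not the researchers' tooling; the helper name `node_type_counts` and the sample snippets are assumptions. It shows that whitespace and comments leave the tree untouched, while a different way of structuring the same computation changes it.

```python
import ast
from collections import Counter


def node_type_counts(source: str) -> Counter:
    """Count how often each AST node type appears in a piece of Python source."""
    tree = ast.parse(source)
    return Counter(type(node).__name__ for node in ast.walk(tree))


# Same structure, different layout: extra blank line, a comment, odd spacing.
compact = "def total(xs):\n    return sum(x * 2 for x in xs)\n"
spread_out = (
    "def total(xs):\n"
    "\n"
    "        # double every element\n"
    "        return sum(   x * 2\n"
    "                      for x in xs   )\n"
)

# Same result, different structure: an explicit loop instead of a generator.
loop_version = (
    "def total(xs):\n"
    "    t = 0\n"
    "    for x in xs:\n"
    "        t += x * 2\n"
    "    return t\n"
)

layout_matters = node_type_counts(compact) != node_type_counts(spread_out)
structure_matters = node_type_counts(compact) != node_type_counts(loop_version)
print(layout_matters, structure_matters)
```

Counting node types is a deliberately crude feature; it ignores how nodes are nested. It is enough, though, to show why tree-based features survive reformatting while layout-based ones do not.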

'People should be aware that it's generally very hard to 100 percent hide your identity in these kinds of situations.'

