[From my post on Harvard Business Review]
Cybersecurity has become one of the CEO’s biggest worries, according to several surveys. Companies are investing billions in protecting their systems and training their employees. The worldwide cybersecurity market was estimated at $77 billion in 2015 and is projected to reach $170 billion by 2020. However, the field has mostly focused on protecting systems from vulnerabilities in software and hardware. Today’s threats are no longer confined to those two places. As organizations come to rely more and more on data-driven algorithms, risk is increasingly present in the data itself.
The pattern classification systems that machine-learning algorithms rely on may themselves exhibit vulnerabilities that hackers and other malefactors can exploit. One common vulnerability is that an attacker can estimate what data the machine-learning algorithm was trained on and then manipulate the inputs to the algorithm to suit her needs.
For example, search-engine-optimization (SEO) companies have long guessed how a search engine’s machine-learning algorithm was trained and manipulated their website content to boost their position in the search rankings. Senders of junk email try to fool spam-filtering algorithms by adding unrelated words or sentences to their messages so they resemble legitimate email – hence the number of spam messages that start out “Hey” or “How are you?”, with the incriminating words obfuscated by deliberate misspellings.
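To make the spam example concrete, here is a minimal sketch of the evasion trick, assuming a toy word-score filter; the words, scores, and threshold are all illustrative and not any real filter’s values:

```python
# Hypothetical toy spam filter: each word carries a spam score, and a
# message is flagged when its average per-word score exceeds a threshold.
SPAM_SCORES = {"viagra": 3.0, "winner": 2.0, "free": 1.5,
               "hey": -0.5, "how": -0.5, "are": -0.2, "you": -0.2}
THRESHOLD = 0.5  # messages scoring above this are flagged as spam

def spam_score(message: str) -> float:
    """Average per-word score; higher means more spam-like."""
    words = message.lower().split()
    return sum(SPAM_SCORES.get(w, 0.0) for w in words) / len(words)

original = "free viagra winner"
# The attacker pads the message with benign-looking filler words,
# dragging the average score below the threshold.
padded = "hey how are you " * 5 + original

print(spam_score(original) > THRESHOLD)  # True: flagged
print(spam_score(padded) > THRESHOLD)    # False: slips past the filter
```

Real spam filters are far more sophisticated, but the underlying weakness is the same: any classifier that averages evidence over the whole message can be diluted with innocuous content.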
Most of us see examples like this every day, and they mostly seem like a minor annoyance – we click on a bad link, or have to delete a few extra emails. But this kind of fraud can have more serious consequences. The credit card industry, which has adopted many machine-learning and other statistical, data-driven techniques to identify fraud, has been exposed to such threats for many years. If an attacker knows a shopper’s usual purchasing pattern, he can create a series of fraudulent purchases that deviate only slightly from the norm and thus go undetected by the anomaly detector. For example, an attacker can see what was previously bought at Home Depot and buy products with similar prices at Amazon.
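A minimal sketch of why such stealthy fraud works, assuming a simple z-score anomaly detector on purchase amounts; the history, threshold, and amounts are illustrative assumptions, not a real fraud system:

```python
# Hypothetical anomaly detector: flag a purchase whose amount is more
# than 3 standard deviations from the shopper's historical mean.
import statistics

history = [42.0, 55.0, 48.0, 51.0, 46.0, 53.0]  # shopper's usual purchases
mu = statistics.mean(history)
sigma = statistics.stdev(history)

def is_anomalous(amount: float, z_threshold: float = 3.0) -> bool:
    return abs(amount - mu) / sigma > z_threshold

blatant_fraud = 900.0                 # far from the norm: flagged
stealthy_fraud = [52.0, 49.5, 54.0]   # mimics the usual pattern: invisible

print(is_anomalous(blatant_fraud))                     # True
print(any(is_anomalous(a) for a in stealthy_fraud))    # False
```

The stealthy purchases each sit well inside the detector’s tolerance, so the attacker can repeat them indefinitely without ever tripping the alarm.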
Algorithm fraud can also influence elections. The Proceedings of the National Academy of Sciences (PNAS), the official scientific journal of the National Academy of Sciences, has published research detailing how search engine manipulation can impact voters. The most notable experiment was conducted with Indian voters in the midst of India’s 2014 Lok Sabha elections. The results clearly showed that biased search rankings could shift the voting preferences of undecided voters by 20% or more. Specifically, the researchers found that the order in which candidates appear in search results can have a significant impact on voters’ perceptions.
Another weakness of machine-learning algorithms is that most of them make the common assumption that the data used to train the algorithm and the data to which the algorithm is applied are generated in the same way (or what statisticians call “sampled from the same distribution”). When that assumption is violated, the algorithm can be fooled.
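A minimal sketch of how violating the same-distribution assumption fools a classifier, assuming a simple midpoint-threshold rule on one feature; the distributions, seed, and shift are illustrative assumptions:

```python
# Hypothetical one-feature classifier: the decision threshold is fitted
# as the midpoint between the two class means of the *training* data.
import random

random.seed(0)

train0 = [random.gauss(0.0, 0.5) for _ in range(500)]  # class 0 near 0.0
train1 = [random.gauss(2.0, 0.5) for _ in range(500)]  # class 1 near 2.0
threshold = (sum(train0) / len(train0) + sum(train1) / len(train1)) / 2

def classify(x: float) -> int:
    return int(x > threshold)

# Test data drawn the same way as training: accuracy is high.
same = [(random.gauss(0.0, 0.5), 0) for _ in range(200)] + \
       [(random.gauss(2.0, 0.5), 1) for _ in range(200)]
acc_same = sum(classify(x) == y for x, y in same) / len(same)

# An adversary shifts class-0 inputs toward the boundary: the assumption
# that test data follows the training distribution breaks, and so does
# the classifier.
shifted = [(random.gauss(1.5, 0.5), 0) for _ in range(200)]
acc_shifted = sum(classify(x) == y for x, y in shifted) / len(shifted)

print(round(acc_same, 2), round(acc_shifted, 2))
```

Nothing about the classifier changed between the two evaluations; only the data-generating process did, which is exactly the gap this class of attacks exploits.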
Recently, such an attack was carried out against biometric systems. Most biometric systems allow clients’ profiles to adapt to natural changes over time – face recognition software, for example, updates little by little as your face ages and changes. But a malicious adversary can exploit this adaptability. By presenting a sequence of fake biometric traits to the sensor, an attacker can gradually shift your stored profile until it is fully replaced with a different one, eventually allowing the attacker to impersonate you or another targeted client.
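A minimal sketch of this template-poisoning idea, assuming a toy adaptive system where the stored profile is a feature vector that drifts toward any accepted sample; the vectors, acceptance radius, and update rule are all illustrative assumptions:

```python
# Hypothetical adaptive biometric template: samples within ACCEPT_DIST of
# the stored profile are accepted and blended into it (rate ALPHA).
victim = [1.0, 1.0]    # victim's enrolled biometric template
attacker = [5.0, 5.0]  # attacker's own biometric features
ACCEPT_DIST = 1.0      # samples within this distance update the profile
ALPHA = 0.5            # adaptation rate of the template

def dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

profile = list(victim)
steps = 0
while dist(profile, attacker) > ACCEPT_DIST:
    # Present a fake just inside the acceptance radius, stepping toward
    # the attacker's features; each fake is individually accepted.
    d = dist(profile, attacker)
    fake = [p + ACCEPT_DIST * (a - p) / d for p, a in zip(profile, attacker)]
    profile = [(1 - ALPHA) * p + ALPHA * f for p, f in zip(profile, fake)]
    steps += 1

print(steps, profile)  # after a handful of steps, the template matches the attacker
```

No single fake sample looks suspicious; it is the sequence, exploiting the system’s own adaptation, that replaces the victim’s profile.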
As our use of smart, connected devices grows, we are also deploying machine-learning algorithms in more and more aspects of life – in cars, phones, credit card readers, wearable devices, and many more. Protecting the algorithms that run these devices from “statistical hacking,” or “adversarial machine learning,” is consequently becoming a bigger need. Some other interesting directions can also be found in this paper.
There is a lot that can be done – from building learning systems that combine multiple classifiers to introducing randomization into the algorithms themselves.
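A minimal sketch of how combining classifiers with randomization raises the attacker’s cost, assuming toy threshold classifiers for a spam score; the thresholds, subset size, and crafted input are illustrative assumptions:

```python
# Hypothetical randomized ensemble defense: on each query, a random
# subset of threshold classifiers is consulted, and the input is flagged
# if any of them votes "spam".
import random

random.seed(1)

thresholds = [0.4, 0.5, 0.6]  # three simple threshold "classifiers"

def ensemble_flags(score: float) -> bool:
    chosen = random.sample(thresholds, 2)  # fresh random subset per query
    return any(score > t for t in chosen)

# The attacker reverse-engineered a single fixed threshold of 0.5 and
# crafted an input to sit just under it.
crafted = 0.45
flags = [ensemble_flags(crafted) for _ in range(1000)]
print(sum(flags) / len(flags))  # evasion now succeeds only sometimes
```

Against one fixed threshold the crafted input would never be flagged; against the randomized ensemble it is caught a substantial fraction of the time, so the attacker can no longer rely on a single reverse-engineered model.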
At a time when artificial intelligence algorithms drive everything from public opinion to business decision-making to how many steps you take each day, it’s worth asking: how secure are those algorithms? And what can I do to make them more secure?
One thought on “Hacking to Machine Learning Algorithms”
Nice post. Popularizing this issue is very important. Too bad that today adversarial learning & classification is still a research niche. Both researchers and stakeholders should become more aware of this (dark) side of the data-driven economy.