Hacking Machine Learning Algorithms

[From my post on Harvard Business Review]

Cybersecurity has become one of the CEO’s biggest worries, according to several surveys. Companies are investing billions in protecting their systems and training their employees. The worldwide cybersecurity market was estimated at $77 billion in 2015 and is projected to reach $170 billion by 2020. However, the field has mostly focused on protecting systems from vulnerabilities in software and hardware. Today’s threats are no longer confined to those two places. As organizations have come to rely more and more on data-driven algorithms, risks are increasingly present in the data itself.

The pattern classification systems that machine-learning algorithms rely on may themselves exhibit vulnerabilities that can be exploited by hackers or other malefactors. One common vulnerability arises when an attacker estimates what data the machine-learning algorithm was trained on and then manipulates the input to the algorithm to suit her needs.

For example, search-engine-optimization (SEO) companies have long guessed how search engines’ ranking algorithms were trained and manipulated their website content to boost their position in the search rankings. Senders of junk email try to fool spam-filtering algorithms by adding unrelated words or sentences to their messages so they resemble legitimate email – hence the number of spam messages that start out “Hey” or “How are you?”, with the telltale words obfuscated by deliberate misspellings.
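To make the spam example concrete, here is a toy sketch of this “good word” attack. The word weights below are invented for illustration, not taken from any real filter:

```python
# Toy Naive Bayes-style spam score: each word carries a log-likelihood
# ratio log P(word|spam)/P(word|ham); positive means "spammy".
# All weights here are illustrative assumptions.
WORD_WEIGHTS = {
    "free": 2.0, "winner": 2.5, "viagra": 3.0, "cash": 1.5,
    "hey": -1.0, "meeting": -1.5, "thanks": -1.2, "how": -0.8,
    "are": -0.5, "you": -0.5,
}

def spam_score(message: str) -> float:
    """Sum per-word log-likelihood ratios; unknown words are neutral."""
    return sum(WORD_WEIGHTS.get(w, 0.0) for w in message.lower().split())

def is_spam(message: str, threshold: float = 0.0) -> bool:
    return spam_score(message) > threshold

original = "free cash winner"
# Attacker pads the same payload with words common in legitimate mail
padded = "hey how are you thanks meeting " * 2 + "free cash winner"

print(is_spam(original))  # True: the payload alone scores as spam
print(is_spam(padded))    # False: benign padding drags the score below threshold
```

Real filters use thousands of features, but the principle is the same: dilute the spammy signal with words the filter has learned to associate with legitimate mail.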

Most of us see examples like this every day, and they mostly seem like a minor annoyance – we click on a bad link, or have to delete a few extra emails. But this kind of fraud can have more serious consequences. The credit card industry, which has adopted many machine-learning approaches and other statistical, data-driven techniques to identify fraud, has been exposed to such threats for many years. If an attacker knows a shopper’s usual pattern, he can create a series of fraudulent purchases that deviate only slightly from the norm and thus go undetected by the anomaly detector. For example, an attacker who sees what was previously bought at Home Depot can buy products with similar prices at Amazon.
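Here is a minimal illustration of the idea – invented numbers, not any card network’s actual system. A simple z-score anomaly detector over purchase amounts flags a crude theft but passes a fraudulent charge that mimics the shopper’s norm:

```python
# Toy z-score anomaly detector over a shopper's purchase amounts.
# A charge is flagged when it sits far from the historical mean.
import statistics

history = [42.0, 55.0, 38.0, 61.0, 47.0, 52.0, 44.0, 58.0]  # usual spend
mu = statistics.mean(history)
sigma = statistics.stdev(history)

def is_flagged(amount: float, z_threshold: float = 3.0) -> bool:
    """Flag a purchase whose amount deviates strongly from the norm."""
    return abs(amount - mu) / sigma > z_threshold

print(is_flagged(5000.0))  # True: a crude theft attempt stands out
print(is_flagged(53.0))    # False: a fraud charge mimicking the norm passes
```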

Algorithm fraud can also influence elections. The official scientific journal of the National Academy of Sciences (PNAS) has published research detailing how search engine manipulation can impact voters. The most notable experiment was conducted with Indian voters in the midst of India’s 2014 Lok Sabha elections. The results clearly showed that biased search rankings could shift the voting preferences of undecided voters by 20% or more. Specifically, the researchers found that the order in which candidates appear in search results can have a significant impact on perception.

Another weakness of machine-learning algorithms is that most of them make the common assumption that the data used to train the algorithm and the data to which the algorithm is applied are generated in the same way (or what statisticians call “sampled from the same distribution”). When that assumption is violated, the algorithm can be fooled.
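To make that assumption concrete, here is a minimal synthetic sketch (all numbers invented): a one-feature threshold classifier fit on one distribution loses most of its accuracy once the test data drifts.

```python
# Train a simple threshold classifier where class A centers at 0 and
# class B at 4, then apply it to test data where class B has drifted to 1.
import random

random.seed(0)  # deterministic toy data

def sample(center, n):
    """Draw n points from a unit-variance Gaussian around `center`."""
    return [random.gauss(center, 1.0) for _ in range(n)]

train_a, train_b = sample(0.0, 500), sample(4.0, 500)
# Midpoint-of-means threshold (optimal for equal-variance Gaussians)
threshold = (sum(train_a) / 500 + sum(train_b) / 500) / 2

def accuracy(xs_a, xs_b):
    """Fraction of points the fitted threshold classifies correctly."""
    correct = sum(x <= threshold for x in xs_a) + sum(x > threshold for x in xs_b)
    return correct / (len(xs_a) + len(xs_b))

# Test data drawn the same way as training: the assumption holds
print(accuracy(sample(0.0, 500), sample(4.0, 500)))  # high accuracy
# Class B has drifted from 4 to 1: the assumption is violated, and most
# of class B now falls on the wrong side of the threshold
print(accuracy(sample(0.0, 500), sample(1.0, 500)))  # much lower accuracy
```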

Recently, such an attack was carried out on biometric systems. Most biometric systems allow clients’ profiles to adapt to natural changes over time – face recognition software, for instance, updates little by little as your face ages and changes. But a malicious adversary can exploit this adaptability. By presenting a sequence of fake biometric traits to the sensor, an attacker can gradually update your profile until it is fully replaced with a different one, eventually allowing the attacker to impersonate you or other targeted clients.
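A toy one-dimensional sketch of such template poisoning (the “feature” and all parameters are illustrative assumptions, not a real biometric pipeline):

```python
# An adaptive system averages each accepted sample into its stored
# template. The attacker submits a chain of samples, each just within the
# match tolerance, walking the template from the victim's to their own.
VICTIM, ATTACKER = 0.0, 10.0
TOLERANCE, ALPHA = 1.0, 0.5  # match radius and adaptation rate

template = VICTIM

def present(sample):
    """Accept samples near the current template and adapt toward them."""
    global template
    if abs(sample - template) <= TOLERANCE:
        template = (1 - ALPHA) * template + ALPHA * sample
        return True
    return False

print(present(ATTACKER))  # False: a direct impersonation attempt is rejected
while template < ATTACKER - TOLERANCE:  # poisoning phase
    present(template + TOLERANCE)       # each step barely within tolerance
print(present(ATTACKER))  # True: the template has drifted to the attacker
```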

As we increase our usage of smart, connected devices, we’re also deploying machine-learning algorithms more and more, in all aspects of life – from cars, to phones, to credit card readers, to wearable devices, and many more. Protecting the algorithms that run these devices from “statistical hacking,” or “adversarial machine learning,” is consequently becoming a bigger need. Some other fun directions can also be found in this paper.

There is a lot that can be done – from building learning algorithms with multiple classifiers to the use of randomization.
At a time when artificial intelligence algorithms drive everything from public opinion to business decision-making to how many steps you take each day, it’s worth asking – how secure are those algorithms? And what can I do to make them more secure?
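As a toy illustration of the randomization idea above (the three model weightings are my own invented example, not a published defense): an input crafted to evade one known model is still frequently caught when the serving model is chosen at random.

```python
# Keep several differently-weighted classifiers and pick one at random
# per query, so the attacker cannot tune against a single fixed model.
import random

random.seed(1)

# Three hypothetical linear models over (spammy_words, benign_words) counts
MODELS = [(2.0, -1.0), (1.5, -0.4), (3.0, -0.7)]

def classify(spammy, benign, model):
    w_s, w_b = model
    return w_s * spammy + w_b * benign > 0  # True = flagged as spam

def randomized_classify(spammy, benign):
    return classify(spammy, benign, random.choice(MODELS))

# Attacker tunes padding to evade MODELS[0]: with 3 spammy words, model 0
# needs more than 6 benign words to evade, but model 2 needs more than 12.
evasive = (3, 7)
print(classify(*evasive, MODELS[0]))  # False: evades the one known model
caught = sum(randomized_classify(*evasive) for _ in range(1000))
print(caught > 0)  # True: the randomized ensemble still flags it often
```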

How Victor Hugo Can Predict Ebola and Help a Business Succeed

[From my post on Harvard Business Review]

There’s no doubt that our world faces complex challenges, from a warming climate to violent uprisings to political instability to outbreaks of disease. The number of these crises currently unfolding – in combination with persistent economic uncertainty – has led many leaders to lament the rise of volatility, uncertainty, complexity, and ambiguity. Resilience and adaptability, it seems, are our only recourse.

But what if such destabilizing events could be predicted ahead of time? What actions could leaders take if early warning signs were easier to spot? Only this decade have we finally reached the critical amounts of data and computing power needed to create such tools.

“What is history? An echo of the past in the future,” wrote Victor Hugo in The Man Who Laughs. Although future events have unique circumstances, they typically follow familiar past patterns. Advances in computing, data storage, and data science algorithms allow those patterns to be seen.

A system whose development I’ve led over the past seven years harvests large-scale digital histories, encyclopedias, social and real-time media, and human web behavior to calculate real-time estimations of likelihoods of future events. Essentially, our system combines 150 years of New York Times articles, the entirety of Wikipedia, and millions of web searches and web pages to model the probability of potential outcomes against the context of specific conditions. The algorithm generalizes sequences of historical events extracted from these massive datasets, automatically trying all possible cause-effect combinations and finding statistical correlations.
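In greatly simplified form, the core counting idea might be sketched like this. The toy event “storylines” below are invented for illustration; the real system works over millions of extracted events and far richer context:

```python
# Extract (earlier_event, later_event) pairs from historical event
# sequences and estimate how often an earlier event is followed by a
# later one. Data here is a hypothetical miniature corpus.
from collections import Counter
from itertools import combinations

# Each list is one simplified historical storyline, in time order
sequences = [
    ["drought", "storm", "cholera"],
    ["drought", "storm", "cholera"],
    ["drought", "storm", "cholera"],
    ["drought", "storm"],
    ["storm", "flood"],
    ["election", "storm", "flood"],
]

pair_counts = Counter()
event_counts = Counter()
for seq in sequences:
    event_counts.update(set(seq))
    pair_counts.update(combinations(seq, 2))  # ordered earlier->later pairs

def p_effect_given_cause(cause, effect):
    """P(effect appears later in a sequence | cause appeared earlier)."""
    return pair_counts[(cause, effect)] / event_counts[cause]

print(p_effect_given_cause("drought", "cholera"))   # 0.75
print(p_effect_given_cause("election", "cholera"))  # 0.0
```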

For instance, recently my fellow data scientists and I developed algorithms that accurately predicted the first cholera outbreak in 130 years. The pattern our system inferred was that cholera outbreaks in landlocked areas are more likely to occur following storms, especially when preceded by a long drought up to two years before. The pattern only occurs in low-GDP countries with a low concentration of water in the area. This is extremely surprising, as cholera is a waterborne disease and one would expect it to strike areas with a high water concentration. (One possible explanation might lie in how cholera infections are treated: if prompt rehydration treatment is supplied, cholera mortality rates drop from 50% to less than 1%. Therefore, it might be that in areas with enough clean water the epidemic did not break out.)

The implication of such predictions, automatically inferred by an ever-updating statistical system, is that medical teams can be alerted as much as two years in advance that there’s a risk of a cholera epidemic in a specific location, and can send in clean water and save lives.

Other epidemics can be predicted in a similar way. Ebola is still rare enough that statistical patterns are tough to infer. Nevertheless, using human casualty knowledge mined from medical publications, in conjunction with recurring events, a prominent pattern for Ebola outbreaks does emerge.

Several publications have reported a connection between both the current and the previous Ebola outbreaks and fruit bats. But what causes the fruit bats to come into contact with humans?

The first Ebola outbreaks occurred in 1976 in Zaire and Sudan. A year before that, a volcano erupted in the area, leading many to look for gold and diamonds. Those actions caused deforestation. Our algorithm inferred, from encyclopedias and other databases, that deforestation causes animal migration – including the migration of fruit bats.

We have used the same approach to model the likelihood of outbreaks of violence. Our system predicted riots in Syria and Sudan, and their locations, by noticing that riots are more likely in non-democratic regions with growing GDPs yet low per-person income, when the subsidy on a common product is lifted, raising its price and triggering student protests and clashes with police.

The algorithm also predicted genocide by identifying that such events happen with higher probability if leaders or prominent people in the country dehumanize the minority, specifically when they refer to minority members as pests. One such example is the genocide in Rwanda. Years before 4,000 Tutsis were murdered in Kivumu, Hutu leaders such as Kivumu mayor Gregoire Ndahimana referred to the minority Tutsis as inyenzi (cockroaches). From this and other historical data, our algorithm inferred that genocide probability almost quadruples if: a) a person or a group describes a minority group (as defined by census and UN data) as either a non-mammal or as a disease-spreading animal, such as mice, and b) the speaker makes these statements 3-5 years before the event, having been reported in the news a minimum of a few dozen times and having a local-language Wikipedia entry about them.
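Restated as code, the inferred pattern is essentially a conjunction of conditions. The field names and the threshold below are my paraphrase of the text, not the system’s actual representation:

```python
# Hypothetical rule-style predicate encoding conditions (a) and (b)
# from the inferred pattern above.
def elevated_genocide_risk(statement) -> bool:
    """True when both dehumanization and speaker-prominence conditions hold."""
    # (a) a minority group is described as a non-mammal or
    #     disease-spreading animal
    dehumanizing = (
        statement["targets_minority"]
        and statement["comparison"] in {"non-mammal", "disease-spreading animal"}
    )
    # (b) the speaker is prominent: reported in the news "a few dozen"
    #     times and has a local-language Wikipedia entry
    prominent_speaker = (
        statement["speaker_news_mentions"] >= 24
        and statement["speaker_has_local_wiki_entry"]
    )
    return dehumanizing and prominent_speaker

example = {
    "targets_minority": True,
    "comparison": "disease-spreading animal",  # e.g. "cockroaches", "mice"
    "speaker_news_mentions": 40,
    "speaker_has_local_wiki_entry": True,
}
print(elevated_genocide_risk(example))  # True
```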

After an empirical analysis of thousands of events happening in the last century, we’ve observed that our system identifies 30%-60% of upcoming events with 70%-90% accuracy. That’s no crystal ball. But it’s far, far better than what humans have had before.

What would it mean to NGOs, construction companies, and health organizations to know that droughts followed by storms can lead to cholera? What would it mean to mining companies, regulators, environmental organizations, and government leaders to know that mining leads to deforestation, and that deforestation leads to fruit bat migrations, and that fruit bat migrations may increase the risk of an Ebola outbreak? And what would we all do with the information that certain linguistic choices and policy changes can result in widespread violence? How might we all start thinking about risk differently?

Yes, “big data” and sophisticated analytics do allow companies to improve their profit margins considerably. But combining the knowledge obtained from mining millions of news articles, thousands of encyclopedia articles, and countless websites to provide a coherent, cause-and-effect analysis has much more potential than just increasing sales. It can allow us to automatically anticipate heretofore unpredictable crises, think more strategically about risk, and arm humanity with insight about the future based on lessons from the relevant past. It means we can do something about the volatility, uncertainty, complexity, and ambiguity surrounding us. And it means that the next time there’s a riot or an outbreak, leaders won’t be blindsided.

Algorithmically Predicting Riots and Civil Unrest

Many governments invest substantial funds in predicting civil unrest.
Lately, I have been consulted frequently about how riots around the world can be predicted.
I want to share my personal view of how this can be achieved, through a real prediction example I experienced while using the prediction algorithms presented in a paper I co-authored with Eric Horvitz a couple of years ago.

September 22nd, 2013 – It is my morning routine to check the new predictions made by the system I have been coding during my PhD and to see how the prediction algorithm can be improved. The main prediction on the screen is written in plain text: “Mass unrest and instability in Sudan”. Clicking on the prediction reveals the predicted sequence of events – protests will start, then youth will be killed by police, leading to government instability and mass unrest. The main trigger is marked on the screen, quoting a newspaper title from the last few days: “government lifted its gas subsidies.” The historical pattern inferred by the algorithm is written next to it – when the government of a country with rising GDP but high population poverty lifts subsidies on common products, student and youth protests start; if a youth is killed, mass riots begin. Clicking on the pattern reveals the past events that contributed to this hypothesis. The first news title, from January 1st, 2012, describes Nigeria lifting subsidies on oil; the next title in the pattern is “Muyideen Mustapha, 23, was reportedly the first person to be killed during the nationwide protests over the lifting of petrol subsidies.” (Twitter, @ocupynigeria), and a few days later, on January 9:

“Tens of thousands of Nigerians took to the streets in cities across the country on Monday to protest a sudden sharp rise in oil prices after the government abruptly ended fuel subsidies” (NYT) and “Nigeria fuel protests: two killed and dozens wounded as police open fire” (The Guardian). The last event in the historical pattern is “Nigeria’s oil economics fuel deadly protests” (CNN, January 11th, 2012).

While scrolling through all the past events, I notice the Egyptian revolution, which started when subsidies on bread were lifted, leading to student riots, the death of a student, and then mass riots – and much more.
On October 1st, The Economist publishes: “..protesting the lifting of fuel subsidies has left dozens of people dead in the capital, Khartoum, and around the country… single bullet, however, that hit a 26-year-old pharmacist in the chest during a protest in Buri… sent shock waves through the heart of Mr Bashir’s regime… The student-led protests are also expected to continue.”

The machine-learning algorithm that enabled this prediction learns about the future by generalizing sequences of historical events extracted from massive numbers of data points, including 150 years of NYT articles, millions of web searches, and dynamic web pages. Using unstructured, open ontologies (e.g., Wikipedia), the approach models the context-conditional probabilities of potential outcomes by “reading the news” and finding patterns in it.
I believe that generalizing this approach – combining it with social media reports across several languages and the many internal data sources that governments and intelligence offices collect – will help bring much higher automation and predictability to the task of predicting civil unrest across large regions of the world.