Hacking Machine Learning Algorithms

[From my post on Harvard Business Review]

Cybersecurity has become one of the CEO’s biggest worries, according to several surveys. Companies are investing billions in protecting their systems and training their employees. The worldwide cybersecurity market was estimated at $77 billion in 2015 and is projected to reach $170 billion by 2020. However, the field has mostly focused on protecting systems from vulnerabilities in software and hardware. Today’s threats are no longer confined to those two places. As organizations have come to rely more and more on data-driven algorithms, risks are increasingly present in the data itself.

The pattern classification systems that machine-learning algorithms rely on may themselves exhibit vulnerabilities that can be exploited by hackers or other malefactors. One common vulnerability arises when an attacker can estimate what data the machine-learning algorithm was trained on, and can thus manipulate the input to the algorithm to suit her needs.

For example, search-engine-optimization (SEO) companies have long guessed how a search engine’s machine-learning algorithm was trained and manipulated their website content to boost their results in the search ranking. Senders of junk email try to fool the spam-filtering algorithm by adding unrelated words or sentences to their messages so they resemble legitimate email – hence the number of spam messages that start out “Hey” or “How are you?”, with the suspicious words obfuscated by deliberate misspelling.

Most of us see examples like this every day, and they mostly seem like a minor annoyance – we click on a bad link, or have to delete a few extra emails. But there can be more serious consequences to this kind of fraud. The credit card industry, which has adopted many machine-learning approaches and other statistical, data-driven techniques to identify fraud, has been exposed to such threats for many years. If an attacker knows the usual pattern of a shopper, he can create a series of fraudulent purchases that deviate only slightly from the norm, and thus not be detected by the anomaly detector. For example, an attacker can see what was previously bought in Home Depot and buy products with similar prices at Amazon.
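To make the credit-card example concrete, here is a minimal sketch of how purchases that deviate only slightly from a shopper’s norm can slip past an anomaly detector. The purchase amounts are hypothetical and the z-score detector is deliberately simple – real fraud systems use far richer features – but the principle is the same:

```python
import statistics

# A toy anomaly detector: flag a purchase if its amount deviates from the
# shopper's historical mean by more than 3 standard deviations (z-score).
def is_anomalous(history, amount, threshold=3.0):
    mean = statistics.mean(history)
    std = statistics.stdev(history)
    return abs(amount - mean) / std > threshold

# Hypothetical purchase history: mostly mid-priced hardware-store items.
history = [45.0, 52.0, 48.0, 60.0, 55.0, 50.0]

print(is_anomalous(history, 900.0))  # a blatant fraud attempt is flagged
print(is_anomalous(history, 58.0))   # a "similar price" purchase slips through
```

An attacker who stays close to the learned “normal” pattern is statistically invisible to this kind of detector.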

Algorithm fraud can also influence elections. The official scientific journal of the National Academy of Sciences (PNAS) has published research detailing how search-engine manipulation can impact voters. The most notable experiment was conducted with Indian voters in the midst of India’s 2014 Lok Sabha elections. The results clearly showed that biased search rankings could shift the voting preferences of undecided voters by 20% or more. Specifically, it was found that the order in which candidates appear in search results can have a significant impact on perception.

Another weakness of machine-learning algorithms is that most of them make the common assumption that the data used to train the algorithm and the data to which the algorithm is applied are generated in the same way (or what statisticians call “sampled from the same distribution”). When that assumption is violated, the algorithm can be fooled.

Recently, such an attack was carried on biometric systems. Most biometric systems allow clients’ profiles to adapt with natural changes over time – so face recognition software updates little by little as your face ages and changes. But a malicious adversary can exploit this adaptability. By presenting a sequence of fake biometric traits to the sensor, an attacker can gradually update your profile until it is fully replaced with a different one, eventually allowing the attacker to impersonate you or other targeted clients.
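A toy simulation of this poisoning attack, under the simplifying assumption that a biometric template is a single number the system nudges toward each accepted sample (real systems use high-dimensional feature vectors, but the drift mechanism is analogous):

```python
# Toy model of an adaptive biometric template: a single feature value that
# the system nudges toward each accepted sample (to track natural aging).
TOLERANCE = 2.0   # samples within this distance of the template are accepted
ADAPT_RATE = 0.5  # how far the template moves toward an accepted sample

def try_authenticate(template, sample):
    if abs(sample - template) <= TOLERANCE:
        # accepted: update the template toward the sample
        return True, template + ADAPT_RATE * (sample - template)
    return False, template

victim_template = 10.0   # the legitimate user's stored profile
attacker_trait = 30.0    # the attacker's own biometric reading

template = victim_template
steps = 0
while abs(attacker_trait - template) > TOLERANCE:
    # present a fake trait just inside the acceptance region
    fake = template + TOLERANCE
    ok, template = try_authenticate(template, fake)
    steps += 1

# after the drift, the attacker's real trait is accepted as the victim
print(steps, try_authenticate(template, attacker_trait)[0])
```

Each fake sample sits just inside the acceptance region, so no single reading looks suspicious – yet after a few dozen presentations the stored profile has drifted all the way to the attacker’s own trait.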

As we increase our usage of smart, connected devices, we’re also deploying machine-learning algorithms more and more, in all aspects of life – from cars, to phones, to credit card readers, to wearable devices, and many more. Protecting the algorithms that run these devices from “statistical hacking” or “adversarial machine learning” is consequently becoming a bigger need. Some other fun directions can also be found in this paper.

There is a lot that can be done – from building learning algorithms with multiple classifiers to the use of randomization.
At a time when artificial intelligence algorithms drive everything from public opinion to business decision-making to how many steps you take each day, it’s worth asking – how secure are those algorithms? And what can we do to make them more secure?
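The multiple-classifier and randomization ideas can be sketched together. In the illustration below – a sketch of the principle, not a production defense – each base classifier in an ensemble votes on a random subset of hypothetical spam-signal features, so an attacker who reverse-engineers and suppresses a couple of signals still trips the majority vote:

```python
import random

# A randomized multiple-classifier defense: each base classifier looks at a
# random subset of features, so an attacker who reverse-engineers one
# classifier's features cannot fool the whole ensemble.
# (Hypothetical spam features: each is 1 if a spammy signal is present.)
random.seed(0)
NUM_FEATURES = 10

def make_classifier():
    # each classifier votes "spam" if most of its sampled features fire
    subset = random.sample(range(NUM_FEATURES), 4)
    return lambda x: sum(x[i] for i in subset) >= 3

ensemble = [make_classifier() for _ in range(25)]

def predict(x):
    votes = sum(clf(x) for clf in ensemble)
    return votes > len(ensemble) / 2  # majority vote

spam = [1] * NUM_FEATURES            # all spammy signals present
evasion = [1] * NUM_FEATURES
evasion[0] = evasion[1] = 0          # attacker hides the two signals she guessed

print(predict(spam), predict(evasion))  # the evasion attempt is still caught
```

Because the attacker cannot know which feature subsets the individual classifiers use, suppressing a couple of signals only flips the few classifiers that happened to sample both of them.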

The Data Monopolists

[From my post on Harvard Business Review]

The White House recently released a report about the danger of big data in our lives. Its main focus was the same old topic of how it can hurt customer privacy. The Federal Trade Commission and National Telecommunications and Information Administration have also expressed concerns about consumer privacy, as have PwC and the Wall Street Journal. However, big data holds many other risks. Chief among these, in my mind, is the threat to free market competition.

Today, we see companies building their IP not solely on technology, but rather on proprietary data and its derivatives. As ever-increasing amounts of data are collected by businesses, new opportunities arise to build new markets and products based on this data. This is all to the good. But what happens next? Data becomes the barrier-to-entry to the market and thus prevents new competitors from entering. As a result of the established player’s access to vast amounts of proprietary data, overall industry competitiveness suffers. This hurts the economy. Federal government regulators must ask themselves: Should data that only one company owns, to the extent that it prevents others from entering the market, be considered a form of monopoly?

The search market is a perfect example of data as an unfair barrier-to-entry. Google revolutionized the search market in 1996 when it introduced a search-engine algorithm based on the concept of website importance — the famous PageRank algorithm. But search algorithms have evolved significantly since then, and today most modern search engines are based on machine learning algorithms combining thousands of factors — only one of which is the PageRank of a website. Today, the most prominent factors are historical search query logs and their corresponding search result clicks. Studies show that this historical data improves search results by up to 31%. In effect, today’s search engines cannot reach high-quality results without this historical user behavior. This creates a reality in which new players, even those with better algorithms, cannot enter the market and compete with the established players and their deep records of previous user behavior. The new entrants are almost certainly doomed to fail. This is the exact challenge Microsoft faced when it decided to enter the search market years after Google – how could it build a search technology with no past user behavior? The solution came a year later, when it formed an alliance with Yahoo search, gaining access to years of user search behavior data. But Bing still lags far behind Google. This dynamic isn’t limited only to internet search.

Given the importance of data to every industry, data-based barriers to entry can affect anything from agriculture, where equipment data is mined to help farms improve yields, to academia, where school performance and census data is mined to improve education. Even in medicine, hospitals specializing in certain diseases become the sole owners of the medical data that could be mined for a potential cure.

While data monopolies hurt both small start-ups and large, established companies, it’s the biggest corporate players who have the biggest data advantage. McKinsey calculates that in 15 out of 17 sectors in the U.S. economy, companies with more than 1,000 employees store, on average, over 235 terabytes of data — more data than is contained in the entire US Library of Congress. Data is a strategy – and we need to start thinking about it as one. It should adhere to the same competitive standards as other business strategies. Data monopolists’ ability to block competitors from entering the market is not markedly different from that of the oil monopolist Standard Oil or the railroad monopolist Northern Securities Company. Perhaps the time has come for a Sherman Antitrust Act – but for data.

Unsure where you come down on this issue? Consider this:  studies have shown that around 70% of organizations still aren’t doing much with big data. If that’s your company, you’ve probably already lost to the data monopolists.

You Need an Algorithm, Not a Data Scientist

[From my post on Harvard Business Review]

Mark Twain once said: “The past does not repeat itself, but it rhymes.” Although future events have unique circumstances, they typically follow familiar past patterns. Today, data scientists can predict everything from disease outbreaks to mortality to riots.

It’s no surprise, then, that companies trying to hear the rhymes and see the patterns in their sales conversions manually analyze their own data, hire the best data scientists, and train their managers to be more quantitative.

However, this people-centric, high-touch approach is not scalable. Markets are too dynamic, and some of the changes too imperceptible, to be realistically captured by humans.

Consider a company that sells electronic devices. Let’s say that historically it has been selling well to companies that value its fast delivery and product quality. As time passes, competition grows and a global trend toward green products arises. The profile of the company’s perfect customer slowly shifts, and the shift could go unnoticed when examining the market manually. However, those small shifts are identifiable by algorithms that continuously monitor the company’s historical sales cycle, cross-referencing it with external sources – like social media posts and newspaper articles discussing these trends – and finding correlations with the propensity to buy. Due to the size of this information base and its unstructured nature, monitoring all those delicate changes in real time is an almost impossible task for a human analyst.
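As an illustration, here is a minimal sketch (with entirely hypothetical numbers) of the kind of monitoring such an algorithm performs: tracking how the correlation between an external signal – say, mentions of “green” products around a prospect – and closed deals strengthens over time, a gradual shift a human analyst would likely miss:

```python
import statistics

# Pearson correlation between two equal-length series, from scratch.
def pearson(xs, ys):
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical quarterly data: "green" mentions near each prospect, and
# whether the deal closed (1) or not (0).
early_mentions = [0, 1, 0, 2, 1, 0, 1, 0]
early_wins     = [0, 1, 1, 0, 1, 0, 0, 1]   # wins show little relation to the trend
late_mentions  = [0, 3, 1, 4, 2, 0, 3, 1]
late_wins      = [0, 1, 0, 1, 1, 0, 1, 0]   # wins now track the trend

# an algorithm watching this correlation quarter over quarter would flag
# the strengthening link between the green trend and the propensity to buy
print(pearson(early_mentions, early_wins))
print(pearson(late_mentions, late_wins))
```

A production system would of course learn such relationships across thousands of signals at once; the point here is only that a drifting correlation is a mechanical thing to detect.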

While few companies have the luxury of data scientists with the expertise needed to develop these sophisticated algorithms, or the staff to analyze the results effectively, there is less need for them today. Data science now requires fewer experts, as more and more automated tools are being developed and used to analyze thousands of events. (Disclosure: my company, SalesPredict, is in this industry.) The more sophisticated tools require little or no human intervention, zero integration time, and almost no service to re-tune the predictive model as dynamics change.

Today, automated algorithms can identify patterns and provide insights such as:

  • Did you notice that a big portion of your customer churn comes from companies that have not used one specific feature of your product in the last three months?
  • Did you notice that the leads that converted to closed deals this month came from medium-size, high-growth companies that were searching for keywords comparing your product to your competitor’s?
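The first insight above can be sketched as a simple query over customer records. The records and dates here are hypothetical, and a real system would learn such patterns automatically rather than hard-code them:

```python
from datetime import date

# Hypothetical customer records: last use of one key product feature, and
# whether the customer churned this quarter.
today = date(2015, 6, 30)
customers = [
    {"id": 1, "last_feature_use": date(2015, 6, 10), "churned": False},
    {"id": 2, "last_feature_use": date(2015, 1, 5),  "churned": True},
    {"id": 3, "last_feature_use": date(2015, 2, 20), "churned": True},
    {"id": 4, "last_feature_use": date(2015, 6, 1),  "churned": False},
    {"id": 5, "last_feature_use": date(2015, 3, 1),  "churned": True},
    {"id": 6, "last_feature_use": date(2015, 5, 15), "churned": False},
]

# customers who have not touched the feature in over three months
stale = [c for c in customers if (today - c["last_feature_use"]).days > 90]
churned = [c for c in customers if c["churned"]]
overlap = [c for c in churned if c in stale]

# what share of churn comes from customers idle on the feature for 3+ months?
print(len(overlap) / len(churned))
```

In this toy data, all of the churn comes from feature-idle customers – exactly the kind of pattern an automated system surfaces as an actionable recommendation.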

But as your business changes, the answers will change as well, requiring more and more automation to track those changes and supply the business leader with real-time, actionable recommendations that are always relevant.

In the next few years, I believe many businesses, especially B2B, will use prediction in their business. But those who get the most from these analytics will be those that use automated algorithms – which are faster, more accurate, more scalable, and more adaptive than manually analyzed data.

In stock trading, human analysts once did the trading. Today, more and more automated machine-learning algorithms accompany their decisions, and it has become much harder to compete without such algorithms. Similarly, in the next few years, very few businesses will be able to afford not to have automated decision-making systems mining their data and suggesting the best next actions – not only in Operations, but in the Marketing, Sales, and Customer Success departments too. Keeping up with a large amount of ever-changing information will be the competitive edge.

How Victor Hugo Can Predict Ebola and Help a Business Succeed

[From my post on Harvard Business Review]

There’s no doubt that our world faces complex challenges, from a warming climate to violent uprisings to political instability to outbreaks of disease. The number of these crises currently unfolding – in combination with persistent economic uncertainty – has led many leaders to lament the rise of volatility, uncertainty, complexity, and ambiguity. Resilience and adaptability, it seems, are our only recourse.

But what if such destabilizing events could be predicted ahead of time? What actions could leaders take if early warning signs were easier to spot? Just this decade, we have finally reached the critical amount of data and computing power needed to create such tools.

“What is history? An echo of the past in the future,” wrote Victor Hugo in The Man Who Laughs. Although future events have unique circumstances, they typically follow familiar past patterns. Advances in computing, data storage, and data science algorithms allow those patterns to be seen.

A system whose development I’ve led over the past seven years harvests large-scale digital histories, encyclopedias, social and real-time media, and human web behavior to calculate real-time estimations of likelihoods of future events. Essentially, our system combines 150 years of New York Times articles, the entirety of Wikipedia, and millions of web searches and web pages to model the probability of potential outcomes against the context of specific conditions. The algorithm generalizes sequences of historical events extracted from these massive datasets, automatically trying all possible cause-effect combinations and finding statistical correlations.
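A heavily simplified sketch of this idea: given a toy corpus of (conditions, outcome) pairs standing in for extracted event chains, the context-conditional probability of an outcome can be estimated by counting matching histories. The event names below are invented for illustration and bear no relation to the system’s real features:

```python
from collections import Counter

# Toy corpus of historical event chains (hypothetical, heavily simplified):
# each tuple is (preceding conditions, observed outcome).
history = [
    (("drought", "storm", "low_gdp"), "cholera"),
    (("drought", "storm", "low_gdp"), "cholera"),
    (("drought", "storm", "high_gdp"), "no_outbreak"),
    (("storm", "low_gdp"), "no_outbreak"),
    (("drought", "storm", "low_gdp"), "no_outbreak"),
]

def outcome_probability(conditions, outcome):
    # P(outcome | conditions): the fraction of histories containing all the
    # given conditions in which the outcome was observed
    matching = [o for c, o in history if set(conditions) <= set(c)]
    if not matching:
        return 0.0
    return Counter(matching)[outcome] / len(matching)

print(outcome_probability(("drought", "storm", "low_gdp"), "cholera"))
```

The real system differs in scale and sophistication – it generalizes events into abstract classes using ontologies and searches over combinations of conditions – but the core object it estimates is this kind of context-conditional probability.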

For instance, recently my fellow data scientists and I developed algorithms that accurately predicted the first cholera outbreak in 130 years. The pattern our system inferred was that cholera outbreaks in landlocked areas are more likely to occur following storms, especially when preceded by a long drought up to two years before. The pattern only occurs in countries with low GDP that have a low concentration of water in the area. This is extremely surprising, as cholera is a water-borne disease and one would expect it to appear in areas with a high water concentration. (One possible explanation might lie in how cholera infections are treated: if prompt rehydration treatment is supplied, cholera mortality rates drop from 50% to less than 1%. Therefore, it might be that in areas with enough clean water the epidemic did not break out.)

The implication of such predictions, automatically inferred by an ever-updating statistical system, is that medical teams can be alerted as far as two years in advance that there’s a risk of a cholera epidemic in a specific location, and can send in clean water and save lives.

Other epidemics can be predicted in a similar way. Ebola is still rare enough that statistical patterns are tough to infer. Nevertheless, using human casualty knowledge mined from medical publications, in conjunction with recurring events, a prominent pattern for Ebola outbreaks does emerge.

Several publications have reported a connection between both the current and the previous Ebola outbreaks and fruit bats. But what causes the fruit bats to come into contact with humans?

The first Ebola outbreaks occurred in 1976 in Zaire and Sudan. A year before that, a volcano erupted in the area, leading many to look for gold and diamonds. Those actions caused deforestation. Our algorithm inferred, from encyclopedias and other databases, that deforestation causes animal migration – including the migration of fruit bats.

We have used the same approach to model the likelihood of outbreaks of violence. Our system predicted riots in Syria and Sudan, and their locations, by noticing that riots are more likely in non-democratic regions with growing GDPs yet low per-person income, when a previously subsidized product’s price is lifted, causing student riots and clashes with police.

The algorithm also predicted genocide, by identifying that those events happen with higher probability if leaders or prominent people in the country dehumanize the minority, specifically when they refer to minority members as pests. One such example is the genocide in Rwanda. Years before 4,000 Tutsis were murdered in Kivumu, Hutu leaders such as Kivumu mayor Gregoire Ndahimana referred to the minority Tutsis as inyenzi (cockroaches). From this and other historical data, our algorithm inferred that genocide probability almost quadruples if: a) a person or a group describes a minority group (as defined by census and UN data) as either a non-mammal or as a disease-spreading animal, such as mice; and b) the speaker does so 3-5 years in advance and is prominent enough to have been reported in the news at least a few dozen times and to have a local-language Wikipedia entry.

After an empirical analysis of thousands of events happening in the last century, we’ve observed that our system identifies 30%-60% of upcoming events with 70%-90% accuracy. That’s no crystal ball. But it’s far, far better than what humans have had before.

What would it mean to NGOs, construction companies, and health organizations to know that droughts followed by storms can lead to cholera? What would it mean to mining companies, regulators, environmental organizations, and government leaders to know that mining leads to deforestation, and that deforestation leads to fruit bat migrations, and that fruit bat migrations may increase the risk of an Ebola outbreak? And what would we all do with the information that certain linguistic choices and policy changes can result in widespread violence? How might we all start thinking about risk differently?

Yes, “big data” and sophisticated analytics do allow companies to improve their profit margins considerably. But combining the knowledge obtained from mining millions of news articles, thousands of encyclopedia articles, and countless websites to provide a coherent, cause-and-effect analysis has much more potential than just increasing sales. It can allow us to automatically anticipate heretofore unpredictable crises, think more strategically about risk, and arm humanity with insight about the future based on lessons from the relevant past. It means we can do something about the volatility, uncertainty, complexity, and ambiguity surrounding us. And it means that the next time there’s a riot or an outbreak, leaders won’t be blindsided.

Algorithmically Predicting Riots and Civil Unrest

Many governments invest heavily in trying to predict civil unrest.
Lately, I have frequently been asked how riots around the world can be predicted.
I want to share my personal view of how this can be achieved, via a real prediction example I experienced when using our prediction algorithms, which are presented in a paper I co-authored with Eric Horvitz a couple of years ago.

September 22nd, 2013 – It is my morning routine to check the new predictions made by the system I was coding during my PhD and see how the prediction algorithm can be improved. The main prediction on the screen is written in plain text: “Mass unrest and instability in Sudan”. Clicking on the prediction reveals the pattern expected to unfold: protests will start, then youth will be killed by police, leading to government instability and mass unrest. The main reason is marked on the screen, quoting a newspaper title from the last few days: “government lifted its gas subsidies.” The historical pattern inferred by the algorithm is written next to it – when the government of a country with rising GDP but high population poverty lifts subsidies on common products, student and youth protests start; if a young person is killed, mass riots follow. Clicking on the pattern reveals the past events that contributed to this hypothesis. The first news title, from January 1st, 2012, describes Nigeria lifting subsidies on oil; the next title in the pattern is “Muyideen Mustapha, 23, was reportedly the first person to be killed during the nationwide protests over the lifting of petrol subsidies.” (Twitter, @ocupynigeria), and a few days later, on January 9:

“Tens of thousands of Nigerians took to the streets in cities across the country on Monday to protest a sudden sharp rise in oil prices after the government abruptly ended fuel subsidies” (NYT) and “Nigeria fuel protests: two killed and dozens wounded as police open fire” (The Guardian). The last event in the historical pattern is “Nigeria’s oil economics fuel deadly protests” (CNN, January 11th, 2012).

While scrolling through all the past events, I notice the Egyptian revolution, which started when subsidies on bread were lifted, leading to student riots, the death of a student, and then mass riots – and much more.
On October 1st, The Economist published: “..protesting the lifting of fuel subsidies has left dozens of people dead in the capital, Khartoum, and around the country… single bullet, however, that hit a 26-year-old pharmacist in the chest during a protest in Buri… sent shock waves through the heart of Mr Bashir’s regime… The student-led protests are also expected to continue.”

The machine-learning algorithm that enabled this prediction learns about the future by generalizing sequences of historical events extracted from massive amounts of data points, including 150 years of NYT articles, millions of web searches, and dynamic webpages. Using unstructured, open ontologies (e.g., Wikipedia), the approach models the context-conditional probabilities of potential outcomes by “reading the news” and finding patterns in it.
I believe that generalizing this approach – combining it with social media reports across several languages and the many internal data sources that governments and intelligence offices collect – will help bring much higher automation and predictability to the task of predicting civil unrest across large parts of the world.

Algorithmic game theory in the Bible, or the man who had 3 wives

(Strongly inspired by Gadi Aleksandrowicz)

I always find it fascinating how many math riddles you can find in old writings like the Talmud and the Bible.
Prof. Robert Aumann, a Nobel laureate in economics, published multiple papers on game theory and its appearance in the Talmud.
One of my favorite ones is the story of the man who had 3 wives.
Unfortunately, in Hebrew vowels are not always written, and the words for “wives” and “debt collectors” are indistinguishable; nevertheless,
when the man dies, the women (or the debt collectors) claim his inheritance: the first wife wants $300, the second $200, and the last one wants $100.
From this point there is an interesting analysis of how the money should be divided, depending on how much money was left in the inheritance, and it is summarized in the following table:

Estate    Wife claiming $100    Wife claiming $200    Wife claiming $300
$100           $33 1/3               $33 1/3               $33 1/3
$200           $50                   $75                   $75
$300           $50                   $100                  $150

Looking at the table, it is extremely hard to understand the rule behind it.
The first line makes sense – you divide the inheritance equally. This prevents the unfair situation in which, because one of the debt collectors wants a big share, the other debt collectors get nothing. In other words, this solution tries to be as fair as possible to as many women as possible.
The third solution also makes sense – you divide the inheritance by the ratio of the debts claimed. But what surprised me the most was the second solution – why does the one who claims the smallest debt get so little, while the others get an equal share? What is the logic behind this?

So it turns out those rabbis knew a thing or two about game theory.
Game theory investigates many situations similar to this – in general, games with collaboration between groups of players – and the ultimate question of how to fairly divide a profit in a way that makes everyone happy. There are many ways of defining fairness, e.g., the Shapley value, the nucleolus of a game, etc. And it turns out the rabbis actually arrived at the same solution as the nucleolus of these games.
To explain the concept, I will present a simpler version of this game, which appears in another Talmud riddle: two men are holding a prayer shawl – one says all of it is his, the other says half of it is his. The solution given in the Talmud is to divide it ¾ and ¼. The intuition is that only the part in dispute should be divided in half. In this case, the one who claimed only half “conceded” that one half belongs to the other; therefore only half the shawl is in dispute, and that half is divided in two: ¼ goes to one and ½+¼=¾ goes to the other.

“The man who was married to three women” is a generalization of the problem of the “two men are holding a prayer shawl”.
Let’s look at the second case, where the inheritance is $200.
Assume the woman who claims $300 decides to take $75 and leave. We are left with $125 to divide between two women – one claims $100, the other $200. That means only $100 is in dispute, and the remaining $25 the first woman concedes belongs to the second.
By the rule of the prayer shawl, we divide only the sum in dispute, i.e. the $100: the first woman gets $50 and the other gets $50 plus the $25 that was not in dispute, a total of $75. But here we handled only two players. So assume now that the woman who claimed $100 takes her $50 and leaves. We now have $150 to divide; one claims $200 and the other $300, i.e., the entire sum is in dispute and should be divided half and half – $75 to each. The division ($50, $75, $75) is therefore self-consistent: we have reached an equilibrium.
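This procedure generalizes to any number of claimants, and Aumann and Maschler showed it coincides with a simple rule: if the estate is at most half the total claims, divide it as equally as possible while giving no one more than half her claim; otherwise, divide the total loss the same way. A short sketch of that rule (the function names are my own):

```python
def talmud_division(estate, claims):
    """Divide `estate` among creditors with the given `claims` using the
    Talmudic rule (equal to the nucleolus, per Aumann and Maschler)."""
    half = [c / 2 for c in claims]
    total = sum(claims)

    def cea(amount, caps):
        # constrained equal awards: give everyone min(cap, lam), choosing
        # lam by bisection so the awards sum to `amount`
        lo, hi = 0.0, max(caps)
        for _ in range(100):
            lam = (lo + hi) / 2
            if sum(min(c, lam) for c in caps) < amount:
                lo = lam
            else:
                hi = lam
        return [min(c, lam) for c in caps]

    if estate <= total / 2:
        # each claimant gets at most half her claim, as equally as possible
        return cea(estate, half)
    # otherwise, the losses (total - estate) are divided the same way
    losses = cea(total - estate, half)
    return [c - l for c, l in zip(claims, losses)]

# reproduce the Talmud table for claims of $100, $200, $300
for e in (100, 200, 300):
    print(e, [round(x, 2) for x in talmud_division(e, [100, 200, 300])])
```

Running this reproduces all three rows of the table above: equal division for a small estate, the puzzling (50, 75, 75) split for $200, and proportional division for $300.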

If you want to read the entire proof you can find it here: http://www.ma.huji.ac.il/~raumann/pdf/45.pdf.
Pretty amazing how mathematical some of those ancient laws are.

Perpetual Inequality on the Web

“The world is unequal in many dimensions; even life itself is unequally distributed. In the United States and other wealthy nations, only 2 to 6 children out of every 1000 die before age 1, yet there are 25 countries where more than 60 out of 1000 do so. There are 10 countries, all in Africa, where per-capita gross domestic product (GDP) is less than 10% of U.S per-capita GDP. These gaps are a legacy of the Great Divergence that began 250 years ago, in which sustained progress in health and wealth in Europe spread gradually to the rest of the world. Will such gaps continue to be an inevitable consequence of progress?” (Angus Deaton, Science 23 May 2014).

I was always intrigued by how exactly the same processes occur in the online world and its economy. Could the Web economy be going through an online great divergence?

Today, many companies focus on leveraging data to optimize their marketing efforts and eventually grow their sales. Companies collect data, such as a person’s Twitter statuses and Facebook likes, to glean as much as possible about the potential buyer. This data can be used to predict a person’s ethnicity and gender, whether the person suffers from alcoholism, or whether they are expecting a baby. Using this information, companies decide how to target different people and which business offers to present them to maximize the probability of a purchase. This inherently leads to data redlining, as Kate Crawford described it in her lectures – the process of showing different content and possibilities (such as job offers, discounts, life insurance offers, etc.) to different groups of people. Scientific American has already noted that this decade there is a different internet for the rich and the poor: a bank can show a loan offer to a rich person, while a poor person who really does need that loan won’t be shown it, because the revenue is smaller and the risk for the bank might be bigger. The remarkable thing is that in the old days, the poor could at least see physical evidence that there were richer people (‘wow, look at their houses and yachts’). In the virtual world, the poor can’t even tell that they are poor, because they get a customized experience. And on top of that, very little data about people is required to infer information about them – only a very partial view of their behavior and social graph. Of course there are always data errors, as Shutterfly discovered last week when it sent a mass email congratulating random people on their pregnancies, but the use of this data is pretty much the norm today.

The web is in the middle of a diverging phase of our online economy. Those that gain strength now will be on the better half of what we might one day call “the perpetual inequality of the Web”. Science dedicated an entire section this month to inequality in the world. What if we could have known what we know now before those inequalities began, i.e., at the divergence point? Would we have changed anything? I believe we would not have. It is part of healthy human economic behavior that has developed over centuries. It is what we are.