by David Apgar, Santa Cruz County Bank
Underlying the recent congressional interest in tech monopolies is a wrong assumption about data. Lucrative monopolies are supposedly inevitable in lines of business that depend on data because companies become more effective as they accumulate more of the stuff. Whoever has the most data wins. And it’s not just excitable members of Congress. To get the attention of investors, tech entrepreneurs learn early they must promise to pursue winner-takes-all strategies. Except it’s just not true that effective data strategies are always, or even usually, winner-takes-all. In fact, most are not.
Public interest in tech monopolies is rising partly because researchers no longer think market power like Facebook’s in targeted advertising is benign. Britain’s Competition and Markets Authority, for example, estimates digital advertising costs households $650 per year, and Congress is exploring easier ways to rein in firms that abuse monopoly power, such as reallocating a majority of their board seats to labor and public interest representatives, stripping owners of their controlling interest.
Tech leaders have consequently become more circumspect in defending the market power of their firms since Peter Thiel told a Stanford audience in 2014 that competition is for losers. Not only do network effects supposedly lead to natural monopolies that benefit consumers who flock to the provider with the most customers, but machine learning arguably does so as well because more data – in the form of examples and indicators characterizing them – let the machines that learn draw better conclusions about new examples. Whether machine learning predicts sales probabilities, smooth paths down a highway, or the best way to end a sentence, whichever company has the most data to train its machine will provide the best service. You may as well become a customer and add your data to the biggest pile. Don’t blame tech leaders for monopolies, blame data.
If there are enough situations where winning data strategies do not depend on volume, though, the argument falls apart and we should not expect tech monopolies to become inevitable or pervasive – just very sweet deals for investors. The most important examples are strategies based on data relevance rather than data volume that leave room for competitors to offer services based on data that are relevant in different ways. In businesses where data relevance counts as much as data volume, rolling up your sleeves and pursuing a competitive data strategy won’t doom your startup to the mediocrity of Peter Thiel’s losers. LinkedIn and Netflix both pursued competitive data strategies based on relevance rather than volume, for example, that nevertheless proved critical to success.
You might not think LinkedIn founder Reid Hoffman, who coauthored Blitzscaling, ever deviated from pursuing monopolies. Long before Microsoft acquired it, however, LinkedIn had a plan to build a trusted network. Like an early blockchain, the professional network would let members vouch for their contacts, connecting people who had never met through chains of trust.
However advantageous size might be to members of such a network, there was little about it that excluded rivals. Vouching for contacts was real work – few paid for the privilege. In the end, the trusted network died on its own vines, leaving a valuable recruiting tool growing out of its roots for which Microsoft was willing to pay $26 billion. Far from making LinkedIn a loser, its early competitive data strategy led to an innovative tool for deploying the data you consider relevant to advancing your own career. Size helps LinkedIn more these days, but plenty of specialized recruiting networks grow comfortably alongside.
While Netflix founder and CEO Reed Hastings did appear in one of Reid Hoffman’s Masters of Scale podcasts, he embraced competitive business models from the start. Even the introduction of its original Cinematch recommendation engine – arguably the streaming service’s stickiest feature – had little to do with discouraging Netflix users from switching to rivals. In fact, the original purpose of Cinematch was to manage the company’s inventory of physical DVDs. By recommending lesser-known films that users enjoyed, Netflix spread demand away from current hits and avoided DVD stockouts. The company actually deemphasized recommendations when it introduced streaming in 2007.
What started as an inventory management tactic nevertheless became a distinguishing feature of Netflix, leading it into the even more competitive business of developing original content. Like LinkedIn, Netflix lets users deploy information relevant to a specific challenge – in this case, finding new films you’ll like. It helps that Netflix recommendations factor in the preferences of lots of other viewers, but that’s not as important as each user’s own history with the company.
Far from dampening innovation, the early strategies of LinkedIn and Netflix that embraced competition gave innovation a push. Pursuing strategies based on data relevance rather than volume may not have made them monopolies. But by tailoring their data strategies to the problems they needed to solve, they transformed professional recruiting and online entertainment.
On their own, of course, these examples might be flukes. There’s a theoretical reason, however, to think they illustrate a general limit to the value of scale in data businesses. The foundational work of Thomas Bayes on probabilistic inference in the 1760s and Claude Shannon on communication theory in the 1940s both show the information a set of data provides about a variable of interest always depends on two quantifiable things: the size of the data set and how strongly outcomes of the variable determine outcomes in the data set. As it turns out, this second thing – how strongly outcomes of an unknown variable of interest determine the outcomes of a data set – gives a precise measure of the relevance of the data to the variable. Relevance and volume thus jointly fix the value of a company’s data resources.
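To make the joint role of relevance and volume concrete, here is a minimal sketch of a toy model that is not in the article. It assumes the variable of interest is a fair coin and that each data point reports it correctly except with probability `error_rate`, so a lower error rate stands in for higher relevance. The function name `information_bits` and all the numbers are illustrative assumptions.

```python
from math import comb, log2

def binom_pmf(n, p):
    """Probability of k 'ones' among n independent observations, k = 0..n."""
    return [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]

def entropy(dist):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(q * log2(q) for q in dist if q > 0)

def information_bits(n, error_rate):
    """Mutual information between a fair-coin variable X and n noisy readings.

    Each reading equals X, flipped with probability error_rate. The count of
    ones is a sufficient statistic, so I(X; count) equals I(X; all readings).
    """
    given_one = binom_pmf(n, 1 - error_rate)   # count distribution if X = 1
    given_zero = binom_pmf(n, error_rate)      # count distribution if X = 0
    marginal = [0.5 * a + 0.5 * b for a, b in zip(given_one, given_zero)]
    conditional = 0.5 * entropy(given_one) + 0.5 * entropy(given_zero)
    return entropy(marginal) - conditional

# Volume helps: more readings of the same quality carry more information.
print(information_bits(5, 0.3), information_bits(10, 0.3))

# Relevance can outweigh volume: one sharp reading vs. fifty blurry ones.
print(information_bits(1, 0.10), information_bits(50, 0.45))
```

In this toy setting a single reading with a 10% error rate carries more information about the variable than fifty readings with a 45% error rate, which is the sense in which relevance and volume trade off against each other.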
Strategies based on data relevance that embrace competition thus always have the potential to challenge winner-takes-all strategies based on data volume – a heresy against faith in tech monopolies that used to be confined to data-science classrooms. COVID-19, however, has changed that because lots of worried parents and health workers have suddenly taken a crash course in the difference between viral tests and antibody tests.
A major use of antibody tests is determining whether health workers have immunity to a disease before sending them into wards where they would otherwise run a high risk of catching it. These tests need to avoid false positives that might lead a doctor to think she had immune protection she actually lacked. Epidemiologists say tests that successfully avoid false positives have high specificity, never confusing a common-cold antibody, for example, with one for the novel coronavirus.
The principal use of viral tests, in contrast, is to help health workers contain outbreaks. These tests must avoid false negatives that might lead a team to miss a major source of contagion. They have to be highly sensitive to the bug in question. Indeed, sensitivity is the term epidemiologists use for the ability of a test to avoid false negatives. In general, different tests are sensitive to different viruses.
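The two terms can be pinned down with a small sketch. The counts below are hypothetical and chosen only for illustration; the definitions themselves are the standard ones for diagnostic tests.

```python
def sensitivity(true_pos, false_neg):
    """Share of real cases the test catches; high when false negatives are rare."""
    return true_pos / (true_pos + false_neg)

def specificity(true_neg, false_pos):
    """Share of non-cases the test clears; high when false positives are rare."""
    return true_neg / (true_neg + false_pos)

# Hypothetical trial: 100 infected people, 1,000 uninfected people.
print(sensitivity(true_pos=90, false_neg=10))    # 0.9
print(specificity(true_neg=950, false_pos=50))   # 0.95
```

A viral test used to contain an outbreak would be judged mainly on the first number, and an antibody test used to clear a health worker for a high-risk ward mainly on the second.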
This trips up apologists for tech monopolies because there’s a close parallel between software systems that analyze data and epidemiological tests. Software systems that must make fine distinctions, like antibody tests, gain their specificity through large data sets. In both cases, diagnostic systems backed by more data are better, while oversensitivity can be a danger. Software systems that must detect faint signals, like viral tests, rely instead on data strictly determined by those signals rather than on large amounts of data. For them, the specificity that winner-takes-all strategies can achieve is beside the point. What matters is the sensitivity of the test and the relevance of the data behind it.
Insisting big data sets solve everything better is like saying we need only antibody tests in a pandemic. It ignores the role of sensitive tests that avoid false negatives, like those for detecting individual viruses, where big data sets are superfluous and there are no winner-takes-all strategies.
COVID-19 has given us one other reason to doubt whether more data is always better. There are practical tradeoffs between the specificity and sensitivity of the health tests we can construct. In fact, the same is true of all diagnostic systems. Big data sets – like high-specificity tests – will generally sacrifice sensitivity in practice. The intuition here is that software systems able to make fine distinctions backed by a lot of data avoid mixing up situations that are only similar to one another. To do that, they can’t be oversensitive to situations that resemble one another in ways that may be essential.
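The tradeoff for a single test can be sketched with a decision threshold. The scores below are made up, and the example illustrates only the familiar threshold effect in one diagnostic system, not the broader claim about large data sets.

```python
# Hypothetical test scores; higher means the test leans toward "positive".
positives = [0.35, 0.55, 0.65, 0.80, 0.90]   # people who really have the condition
negatives = [0.10, 0.20, 0.40, 0.60, 0.70]   # people who really do not

def rates(threshold):
    """Return (sensitivity, specificity) when scores >= threshold count as positive."""
    sens = sum(s >= threshold for s in positives) / len(positives)
    spec = sum(s < threshold for s in negatives) / len(negatives)
    return sens, spec

for threshold in (0.3, 0.5, 0.75):
    print(threshold, rates(threshold))
```

Raising the threshold from 0.3 to 0.75 lifts specificity from 0.4 to 1.0 in this example while sensitivity falls from 1.0 to 0.4: the stricter the test is about false positives, the more real cases it misses.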
For example, imagine your online sales system uses a massive database to customize product offers based on exactly where customers click and in what order. And let’s say it successfully discriminates among dozens of types of customers – high specificity. The trouble is a key customer may get entirely different offers if she visits the site twice. A system sensitive to key customers won’t make that mistake.
In short, applications backed by lots of data that can avoid false positives will probably generate false negatives. For plenty of commercial and social purposes, however, false negatives are the problem. LinkedIn users want to avoid the false negative of a recruiter failing to see they have the perfect skills for a job, for example. And Netflix users hope the streaming service won’t fail to find their future favorite film. In cases like these, data strategies need not be winner-takes-all – in fact, better if they’re not.
Most effective data strategies are not winner-takes-all because data does not add up in a simple way to insights. Even so, investors will always have an incentive to push data entrepreneurs to build monopolies. To be true to the data challenges they tackle, the next generation of entrepreneurs will often just have to say no.
[The author wishes to thank Stephen Beitzel for contributing to this article, and especially to the discussion of LinkedIn and Netflix.]