Machine Intelligence and Training Data

16 November 2016

The frontier at which we notice our daily interactions with machine intelligence is in constant motion. As each new capability emerges, we initially register it as a profound new interaction, then become so accustomed to the technology that its novelty is soon dispelled.

This sliding interface has been propelled by advances in computing power, ever-improving deep learning algorithms and access to training data. The task for any company building cutting-edge machine intelligence solutions involves combining all three. It is, however, the last of these, access to training data, which serves as both the greatest competitive moat for incumbents and the most significant challenge for new entrants.

More training data generally leads to better machine intelligence performance. Early-adopter customers are drawn to the best solution and supply it with more and more training data. The additional data fine-tunes performance, making the best solutions even better. Commercially, this characterises a ‘winner-takes-most’ environment and is labelled a data network effect.

For an illustration of a data network effect, examine the growing capabilities of Amazon’s voice-activated assistant, Alexa. Alexa’s open, extensible API is available to third-party developers, who can embed sophisticated natural language processing technology within their own applications. Initially attracted by a superior off-the-shelf capability, these third-party developers funnel more and more training data through the platform, teaching Alexa new tasks every day. The new tasks are then made available across the entire platform, attracting even more new developers and expanding the platform’s capability.
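
The feedback loop behind a data network effect can be sketched as a toy simulation. This is an illustrative model only, not a description of how Alexa or any real platform behaves: the function name, the parameters and the rule that customers pick a provider in proportion to its data volume raised to a `preference` exponent are all assumptions made for the example.

```python
def simulate_data_network_effect(initial_data, rounds=20,
                                 new_data_per_round=1000.0, preference=3.0):
    """Toy model of a data network effect (all quantities hypothetical).

    initial_data: starting volume of training data held by each provider.
    Returns a list with one entry per round; each entry holds every
    provider's share of all training data after that round.
    """
    data = list(initial_data)
    history = []
    for _ in range(rounds):
        # Assumption: solution quality tracks data volume, and customers
        # favour the better solution. An exponent above 1 means the leader
        # attracts a more-than-proportional share of new customers.
        weights = [d ** preference for d in data]
        total_weight = sum(weights)
        # New customers bring new data, split according to their choices.
        data = [d + new_data_per_round * w / total_weight
                for d, w in zip(data, weights)]
        total = sum(data)
        history.append([d / total for d in data])
    return history


if __name__ == "__main__":
    # Provider 0 starts with a modest lead over provider 1.
    shares = simulate_data_network_effect([1200.0, 1000.0])
    for round_number, (leader, follower) in enumerate(shares, start=1):
        print(f"round {round_number:2d}: leader {leader:.3f}, follower {follower:.3f}")
```

In this sketch a small initial lead compounds round after round, which is the ‘winner-takes-most’ dynamic described above. With `preference` set to 1.0, new data is split in proportion to existing data and the lead stops compounding, so the exponent is the assumption doing most of the work.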

How to gather training data?

Functionally, training data is required to build cutting-edge solutions. Commercially, data network effects are necessary to establish defensibility within the marketplace. But how do companies acquire datasets in the first place?

In practice, we observe companies employing two distinct strategic approaches.

Strategy 1:

Provide a strategic data acquisition product that gives real-world utility to customers, who readily exchange their data in return. The data acquisition play is independent of the long-term machine intelligence solution, which will be trained on the newly acquired dataset. Data acquisition projects are most commonly introduced as APIs, SDKs or consumer-facing applications. Using training data acquired from different users in a different context, the machine intelligence can be trained to deliver a premium solution without consuming the premium customer’s data or time. This superior off-the-shelf performance triggers valuable data network effects.

Strategy 2:

Build a very narrow application on top of a non-proprietary training dataset that has been manually augmented with human classifications. Such datasets are acquired by many imaginative means, but commonly draw on web crawlers or open-source datasets. The salient characteristic of this approach is that the company builds something narrow enough in scope that the human-augmented, non-proprietary dataset can deliver performance satisfactory to the customer before any uplift from data network effects.

Early-stage companies often explore the feasibility of both strategies in tandem before prioritising one over the other. In terms of emergent patterns governing this decision, we have observed the following:

The strategic approach is generally determined by the minimum level of reliability a customer demands before they are willing to use machine intelligence (and exchange their data!).

Where the consequences of poor performance are significant, that is, where the customer will refuse to interact with an initially sub-par solution, the strategy is a close variation of Strategy 1: building a highly refined training dataset and solution before any user interaction. Examples of such zero-margin-for-error applications include autonomous vehicles, healthcare diagnostics, cybersecurity and many consumer-facing fintech applications.

Where the consequences of poor performance are initially insignificant and users will persevere for a period, Strategy 2 and its initial narrow target application is the common approach. Example technologies where early errors can be tolerable to users include recommendation engines and predictive analytics, speech-to-text machine translations, and search. In these cases inaccuracies either diminish quickly with user interaction or can be worked around with a conventional GUI, which the user can easily toggle to and from in order to correct errors with minimal effort.