One of the biggest challenges for mid- and top-level executives in today’s business world is correctly estimating the data needs of new AI projects, both at inception and over time.

Three main questions arise here: what kind of data do we need, how should we prepare (clean) that data, and how do we teach employees who are unfamiliar with machine learning and artificial intelligence concepts to manipulate and curate it?

Most clients in this field come in two flavors: those with massive amounts of data who do not know what to do with it, and those who have never bothered to collect data and don’t know where to begin their journey.

Data: The Haves and the Have-Nots

If your goal is to predict future outcomes in a project and you already have a large amount of data available, the key is to clearly understand the nature and quality of that data. A lot of companies claim to employ “AI” when they are actually just giving their existing data a once-over. AI projects need high-quality, reliable training data sets, and the people in your organization who have hitherto managed data collection may not be the best qualified to assure the usability of those training sets (especially for voice, computer vision, and natural language processing).

In particular, for projects involving localization of large data sets (i.e., rollout in multiple languages), it is important to make sure that the basic data sets and procedures can be reliably employed throughout the entire application space.

For organizations that have not yet collected data, the challenge is to find out exactly which data must be collected and the most efficient way to collect it. Mission creep in the planning phase, business silos, traditional hierarchical thinking, and what I like to call ‘project envy’ between departments often lead to situations where too much data is collected at too many touchpoints, resulting in both dirty data and dissatisfied customers. Medical institutions such as hospitals that deploy AI solutions are particularly vulnerable, as each specialty and department wants complete control over the data.

Domain Specificity is both Key and Curse

Whereas domain-generic projects can be excused for leaving the user with imprecise results, in many instances domain specificity (currently in finance, insurance, and medicine in particular) is the absolute key to getting good results and defining the kind of data needed.

The categories in your domain-specific data may work in one language but be completely irrelevant in another, causing problems as you try to scale up. Very often you need domain specialists (from biologists to comptrollers) to shed light on the exact data needs. However, every time you add another domain expert, the mountain of data that is supposedly needed grows.

Finding expertise is one of the tougher challenges in this area, and not just for startups. The training images for an AI system that analyzes shoulder X-rays should be viewed, classified, and analyzed by a large number of orthopedic surgeons, but getting medical professionals to participate in any kind of extracurricular activity is either impossible or hugely expensive. In other industries, experts are bound by contracts or NDAs and are unavailable to help you improve your project.

Even if your data entry module is perfectly designed, humans have a tendency to mess things up. In a recent project, we identified twelve different ways physicians in one hospital describe exactly the same condition.
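To make the cleaning step concrete, here is a minimal sketch of how free-text variants of one condition might be folded into a single label before training. The synonym table, the function names, and the sample entries are all hypothetical and chosen for illustration; they are not taken from the hospital project mentioned above, and a real mapping would have to be built and reviewed by domain experts.

```python
import re
from collections import Counter

# Hypothetical synonym table: normalized variant -> canonical label.
# Illustrative only; a real table would be curated by domain experts.
CANONICAL = {
    "mi": "myocardial infarction",
    "acute mi": "myocardial infarction",
    "heart attack": "myocardial infarction",
    "myocardial infarction": "myocardial infarction",
}


def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = re.sub(r"[^a-z0-9 ]+", " ", text.lower())
    return " ".join(text.split())


def canonicalize(entry: str) -> str:
    """Map a free-text entry to its canonical label, flagging unknown variants."""
    key = normalize(entry)
    return CANONICAL.get(key, "UNMAPPED: " + key)


# Made-up sample entries standing in for what clinicians might type.
entries = ["MI", "Heart attack", "myocardial  infarction.", "ACUTE MI", "cardiac event"]
counts = Counter(canonicalize(e) for e in entries)

for label, n in counts.items():
    print(f"{label}: {n}")
```

Entries that do not match the table are flagged rather than guessed at, so that a human expert can decide whether they belong to an existing category or need a new one.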

Estimating by Comparing

One way for managers to estimate data needs is to look for comparable projects or join existing project pools. It is not necessary to re-invent the wheel every time you launch another AI project. Despite what you and your customers may think, your needs are probably not unique.

In NLP in particular, someone has probably already done what you are trying to do: cleaned the data and trained the model. Managers must keep their eyes open for news about similar projects in their domain. At this point, many startups are simply translating ideas from one domain, country, or jurisdiction to another, adding little to the aggregate value of AI development, but at least aiming to solve specific problems with toy-application-type solutions. ROI doesn’t come into it yet.

Executives Need to Understand the Data Needs

Most projects fail because executives simply follow the hype, have the data or the budget to collect new data, and want to add an AI project to their list of achievements. But just because you have data does not mean you need that data, let alone that you have an AI use case for it.

Executives must understand that at the core of our current efforts are human-machine interaction and the augmentation of human capabilities. The first step in analyzing your data needs should, therefore, be to ask two questions: does the project make interacting with machines easier, and does it increase the capacity of the human to solve problems or complete workflows? All estimation of data needs should be guided by these two principles, and all data collection and data cleaning should help make those goals clearer.

The key decision factor (especially when working with consultants and third-party providers) should not be price, but the quality of the data. If an executive doesn’t understand the harm poor-quality data can do to a project, they should not be in charge of such a project.
