From detecting fraud to monitoring agricultural crops, a new wave of technology startups has emerged, all armed with the belief that their use of AI will meet the challenges presented by the modern world.
However, as the AI landscape matures, a growing concern comes to light: the models at the heart of many AI companies are rapidly becoming commoditized. The lack of significant differences between these models raises questions about how sustainable any competitive advantage built on them really is.
As models become a commodity component, a paradigm shift is underway: the true value proposition of an AI company lies less in the model itself and more in the underlying datasets. The quality, breadth and depth of those datasets are what enable a model to outperform its competitors.
Yet in a crowded market, many AI-driven companies, including those entering the promising field of biotechnology, launch without a purpose-built technology stack for generating the data that robust machine learning requires. That omission has significant implications for the longevity of their AI initiatives.
As experienced venture capitalists (VCs) know well, it is not enough to scrutinize the surface-level appeal of an AI model. Instead, a comprehensive evaluation of a company’s tech stack is essential to gauge its fitness for purpose. The absence of a carefully crafted infrastructure for data acquisition and processing can signal the downfall of an otherwise promising venture from the outset.
In this article, I offer a practical framework drawn from the experiences of both CEOs and CTOs of machine learning-enabled startups. While by no means exhaustive, these principles aim to provide an additional resource for those with the difficult task of evaluating companies’ data processes and resulting data quality and, ultimately, determining whether they are set up for success.
From inconsistent datasets to noisy inputs, what could go wrong?
Before diving into the framework, let's first look at the basic factors that come into play when assessing data quality and, crucially, what can go wrong if the data isn't up to scratch.
Relevance
First, let’s consider the relevance of the dataset. The data must be closely aligned with the problem the AI model is trying to solve. For example, AI models developed to predict house prices require data covering economic indicators, interest rates, real income and demographic changes.
Similarly, in the context of drug discovery, it is critical that experimental data represent the most likely predictors for effects in patients, requiring expert consideration of the most relevant assays, cell lines, model organisms, and more.
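A quick, rough test of relevance is to ask how strongly each candidate feature actually relates to the outcome you care about. The sketch below is illustrative only; the file path and column names are hypothetical placeholders for a house-price dataset, and correlation is just one simple first-pass signal, not a full relevance analysis.

```python
# Minimal sketch: a first-pass look at feature relevance for a
# house-price model. Paths and column names are placeholders.
import pandas as pd

df = pd.read_csv("housing.csv")  # hypothetical dataset
target = "sale_price"
candidates = ["interest_rate", "median_income", "population_change"]

# Rank candidate features by absolute correlation with the target.
# Low correlation alone doesn't prove irrelevance, but it flags
# features worth questioning before they reach the model.
relevance = (
    df[candidates]
    .corrwith(df[target])
    .abs()
    .sort_values(ascending=False)
)
print(relevance)
```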
Accuracy
Second, the data must be accurate. Even a small amount of incorrect data can have a significant impact on the performance of an AI model. This is particularly critical in medical diagnostics, where small errors in data can lead to misdiagnosis and potentially life-threatening consequences.
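In practice, accuracy checks often start with simple plausibility rules that flag records for human review before training. The sketch below assumes a hypothetical lab-results file; the field names and ranges are illustrative placeholders, not clinical thresholds.

```python
# Minimal sketch: flag records that fail basic plausibility checks.
# Field names and ranges are illustrative, not clinical guidance.
import pandas as pd

df = pd.read_csv("lab_results.csv")  # hypothetical dataset

rules = {
    "body_temp_c": (30.0, 45.0),
    "heart_rate_bpm": (20, 250),
}

# Rows falling outside these ranges are likely entry or instrument
# errors and should be reviewed before they reach a training set.
suspect = pd.Series(False, index=df.index)
for column, (low, high) in rules.items():
    suspect |= ~df[column].between(low, high)

print(f"{suspect.sum()} of {len(df)} records flagged for review")
```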
Coverage
Third, data coverage is also essential. If important information is missing from the data, the AI model will not be able to learn as effectively. For example, if an AI model is used to translate a particular language, it is important that the training data includes that language's different dialects.
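Coverage gaps are often easiest to spot by simply looking at how the data is distributed across the groups that matter. The sketch below assumes a hypothetical translation corpus with a "dialect" column; the column name and the 5% floor are arbitrary choices for illustration.

```python
# Minimal sketch: check how well a translation corpus covers dialects.
# The "dialect" column and the threshold are illustrative placeholders.
import pandas as pd

df = pd.read_csv("translation_corpus.csv")  # hypothetical dataset

counts = df["dialect"].value_counts(normalize=True)
threshold = 0.05  # arbitrary coverage floor for this sketch

underrepresented = counts[counts < threshold]
if not underrepresented.empty:
    print("Dialects below the coverage floor:")
    print(underrepresented)
```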