Models are what they eat.
AI models trained on large-scale datasets have demonstrated jaw-dropping abilities and have the power to transform every aspect of our daily lives, from work to play. This massive leap in capabilities has largely been driven by corresponding increases in the amount of data we train models on, shifting from millions of data points several years ago to billions or trillions of data points today. As a result, these models are a reflection of the data on which they’re trained — models are what they eat.
This observation has inspired statements like “data is the new oil” and concerns about what we will do when we run out of data, and it has pushed much of the field to focus on data quantity and scale. The question of data quality, however, has been comparatively under-addressed. This is a critical oversight, because not all data are created equal. Training models on the right data in the right way can have a dramatic impact on the resulting model. And it’s not just about performance. Improving training data means improving:
The efficiency of your training process, enabling you to train models to the same or better performance much more quickly, saving compute costs and making your ML team far more productive
The performance of your model, not just in aggregate, but also on the long tail of queries that are uncommon in your dataset yet absolutely critical to your business
The size of your model: better data means smaller, more portable models that cost dramatically less to serve and are equally performant
Identifying the right data to train on and the right way to present those data, especially when faced with petabytes of unlabeled data, is an incredibly challenging and expensive problem that requires specialized expertise. But the upside of solving it is colossal, and it’s arguably one of the most important topics in AI research today.
Our vision at DatologyAI is to make the data side of AI easy, efficient, and automatic, reducing the barriers to model training and enabling everyone to make use of this transformative technology on their own data.
High-quality data are all you need
Companies setting the standard for leveraging AI need to train their own models on their own proprietary data. Many of these companies have petabytes or more of unlabeled, largely unstructured data: far more than they could train on even if they wanted to, since doing so quickly becomes cost-prohibitive (assuming you can even access enough compute!). As a result, standard practice is simply to select a random subset of the data. Unlike most other areas of deep learning, this practice has seen comparatively little innovation. That is a problem, because training on a random subset of the data has serious drawbacks:
Models waste compute on redundant data, slowing down training and increasing cost.
Some data are misleading and actually harm performance. For example, training a code-gen model on code that doesn’t compile will lead to a worse model overall.
Slower training leads to worse performance for the same compute budget.
Datasets are imbalanced and have long tails, harming both performance and fairness.
The bottom line: training on the wrong data leads to worse models that are more expensive to train. And yet it remains standard practice.
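To make that last point concrete, here’s a toy sketch in Python (with made-up numbers, purely for illustration): uniformly subsampling a long-tailed corpus keeps the skew intact, so the most common concepts dominate the subset while thousands of rare concepts are left with too few examples to learn from.

```python
# Toy illustration: uniform random subsampling preserves a long-tailed skew.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical corpus: 1M examples spread over 10k "concepts" whose
# frequencies follow a power law (a long tail).
num_examples, num_concepts = 1_000_000, 10_000
probs = np.arange(1, num_concepts + 1, dtype=float) ** -1.1
probs /= probs.sum()
concepts = rng.choice(num_concepts, size=num_examples, p=probs)

# Standard practice: train on a uniform random 10% subset.
subset = rng.choice(concepts, size=num_examples // 10, replace=False)

full_counts = np.bincount(concepts, minlength=num_concepts)
subset_counts = np.bincount(subset, minlength=num_concepts)
print("concepts with >= 10 examples in the full corpus:", int((full_counts >= 10).sum()))
print("concepts with >= 10 examples in the random 10%:", int((subset_counts >= 10).sum()))
print("share of the subset taken by the 10 most common concepts:",
      round(subset_counts[:10].sum() / len(subset), 2))
```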
Data curation as a service
This is where DatologyAI comes in: we leverage and perform state-of-the-art research to manage the entire process, from data in blob storage to the dataloader used by your training code. We deploy to your own infrastructure, either on-premises or in your VPC, so that your data are never at risk. We automatically optimize every step of this process, including:
Which data are most informative for training? This is a dynamic problem whose answer depends not only on individual data points but also on the composition of the whole dataset.
How much redundancy is necessary for each concept? Concepts differ in complexity and therefore need different amounts of repetition. Solving this requires automatically identifying those concepts, how complex they are, and how much redundancy each actually needs.
How do we balance datasets? Most data are long-tailed; we need to rebalance so that the model learns the entire distribution, not just the modes.
How do we augment the data? Data augmentation, often using other models or synthetic data, is incredibly powerful, but must be done in a careful, targeted fashion.
How should we order and construct batches from these data? Though seemingly simple, the way you order and batch data can have a dramatic impact on learning speed!
All of this is done automatically, scales to petabytes of data, and supports any data modality, whether your data are text, images, video, audio, tabular records, or more exotic modalities like genomic or geospatial data.
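To give a flavor of what one of these steps involves, here is a minimal, illustrative sketch of embedding-based near-duplicate pruning (a simplification for exposition, not our production pipeline): given one embedding per example from any off-the-shelf encoder, it greedily drops examples that are too similar to ones already kept.

```python
# Minimal sketch of embedding-based near-duplicate pruning (illustrative only).
import numpy as np

def prune_near_duplicates(embeddings: np.ndarray, threshold: float = 0.95) -> np.ndarray:
    """Greedily keep examples whose cosine similarity to every previously
    kept example is below `threshold`; return the indices of kept examples."""
    # Normalize so that dot products are cosine similarities.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    kept: list[int] = []
    for i, vec in enumerate(normed):
        # Drop example i if it is nearly identical to something we already kept.
        if not kept or np.max(normed[kept] @ vec) < threshold:
            kept.append(i)
    return np.array(kept)

# Toy usage: 2k random embeddings plus 200 injected near-duplicates.
rng = np.random.default_rng(0)
base = rng.normal(size=(2_000, 256))
near_dupes = base[:200] + 0.01 * rng.normal(size=(200, 256))
corpus = np.vstack([base, near_dupes])
print("kept", len(prune_near_duplicates(corpus)), "of", len(corpus), "examples")
```

A production-scale version would need to run at petabyte scale with approximate nearest-neighbor search rather than a quadratic loop, and would need to decide how much redundancy each concept actually requires rather than applying one global threshold; that is where much of the difficulty lies.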
DatologyAI: from frontier research to usable product for everyone
Data curation is an extremely challenging frontier research problem that pushes the bounds of our current knowledge — we know because we’ve conducted much of this research ourselves. Our team has pioneered the field of data research, including a landmark paper showing the power of data curation to beat scaling laws.
Effective use of data is what sets the best foundation models apart, yet much of this research is closed and is therefore only available to the largest foundation model labs. Furthermore, combining these disparate techniques into a single pipeline is incredibly nuanced and challenging.
Our goal with DatologyAI is to democratize this absolutely critical part of the AI infrastructure stack so that every company can easily train their own custom model on the right data without needing to invest massive resources into solving this challenging problem themselves.
What’s next for DatologyAI?
We’re just getting started in executing our mission of making model training more accessible for all through better data. We’re currently working with a select set of customers in anticipation of a larger general release later this year.
We’re grateful to be joined on this adventure by an incredible team, along with the support of fantastic investors and angels. We’re excited to announce that we’ve raised an $11.65M Seed round led by Sarah Catanzaro and Mike Dauber from Amplify Partners, with participation from Rob Toews at Radical Ventures, along with Conviction Capital, Outset Capital, and Quiet Capital. We’re also fortunate to have the trust and support of a remarkable set of angels, including Jeff Dean, Geoff Hinton, Yann LeCun, Adam D’Angelo, Aidan Gomez, Ivan Zhang, Douwe Kiela, Naveen Rao, Jascha Sohl-Dickstein, Barry McCardel, and Jonathan Frankle.
We are just beginning our journey and there is still so much to be done. Our team includes veterans with deep ML expertise from FAIR@MetaAI, MosaicML, Google DeepMind, Apple, Twitter, Snorkel AI, Cruise, and Amazon. We are actively hiring, so if you’re excited about data, please apply here!