Note: This post is a short, high-level summary of our contrastive image-text data curation results. We encourage interested readers to check out our technical deep-dive companion piece for more scientific depth and fun.
Models are what they eat. But a large portion of training compute is wasted on data the model has already learned or that are irrelevant to its downstream application. Over the past few months, we’ve been building a state-of-the-art data curation pipeline (based on our research) and applying it to image-text pairs. Today we’re excited to announce that we’ve obtained substantial improvements in image-text multimodal (CLIP) model quality, training speed, and inference efficiency.
Our reference point is DataComp, with datasets of up to 1B image-text pairs and training budgets of up to 5B image-text pairs seen. We trained models on data that were minimally curated (raw baseline), data curated using industry-standard techniques (sophisticated baseline: exact deduplication and CLIP-score filtering, for the aficionados), and data curated using DatologyAI’s pipeline. By using our pipeline to curate training data, we are able to:
Train Faster
Save up to ~98% on compute (43x training speedup) to reach the same accuracy on retrieval tasks as the raw baseline, and up to 96% (28x speedup) compared to the sophisticated baseline. Save up to ~92% on compute (13x training speedup) to reach the same accuracy on classification tasks as the raw baseline, and up to 67% (3x speedup) compared to the sophisticated baseline. And don’t worry, we’re not cherry-picking: the worst improvement we saw across all our experiments was a 55% compute savings (2.2x speedup), when training on a small dataset and evaluating on classification tasks.
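If you’re wondering how the compute-savings and speedup figures relate, they are two views of the same measurement: a k-fold speedup to reach a fixed accuracy saves a 1 − 1/k fraction of compute. A quick sanity check in Python (the small differences from the headline figures above come from rounding of the speedup values):

```python
def compute_savings(speedup: float) -> float:
    """Fraction of compute saved when reaching the same accuracy `speedup` times faster."""
    return 1 - 1 / speedup

print(f"{compute_savings(43):.1%}")   # 43x retrieval speedup      -> ~97.7% savings
print(f"{compute_savings(13):.1%}")   # 13x classification speedup -> ~92.3% savings
print(f"{compute_savings(2.2):.1%}")  # 2.2x worst-case speedup    -> ~54.5% savings
```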
Train Better
Improve model quality by up to ~13 absolute percentage points for the same training cost compared to a model trained without DatologyAI.
Train Smaller
Improve inference efficiency by training better, smaller models: training on curated data reduces the cost per query by 2.6x and yields a model that is up to 13 percentage points better than one trained on the same amount of raw baseline data.
Before we dive into details, we’d like to express how exciting this is for our team. Many of us have devoted ourselves to data curation research in academia and research labs. And many of us have dedicated our careers to building data tools and infrastructure for the pre-deep learning era of tech. But proving that an idea works in a research lab is a far cry from building a working product out of it, let alone a scalable, generalizable, effective product!
This is especially true for data curation: it’s a young, frontier research area in which experiments are costly to run at scale. And building a curation pipeline that can handle the scale of modern foundation model training datasets requires solving challenges across domains including data infrastructure, platform engineering, and security. It wasn’t easy, but by leveraging our collective expertise, working hard, and staying laser-focused, we did it. And boy does it work!
We’re extremely proud of what we’ve done, but quite frankly, this is the worst we’ll ever be. We’re expecting significant improvements to these figures over the next few months, plus a follow-up release for text curation and language models.
Data curation for large-scale deep learning is hard. It’s a frontier research and engineering problem, and practically infeasible for companies to do efficiently at scale today. But with DatologyAI, you can. Is your company interested in training multimodal models faster, better, or smaller? Get in touch!
Train Models Faster
Reduce the time and cost required to reach baseline accuracy by up to 97.7% on retrieval tasks and up to 92.4% on classification tasks.
DatologyAI's data curation allows you to train models faster, and faster training means better models in the hands of your customers faster. We developed two curation recipes, one optimized for retrieval and another optimized for classification, and applied them to DataComp.
We then trained CLIP-ViT-B/32 models on data that were minimally curated (raw baseline), data curated using industry-standard techniques (sophisticated baseline), and data curated using DatologyAI’s pipeline, and measured how long it takes the DatologyAI models to reach the same accuracy as the raw and sophisticated baselines:
Retrieval Optimization
Our pipeline reduces compute requirements by up to 98% (43x training speedup) compared to the raw baseline and up to 97% (29x training speedup) compared to the sophisticated baseline.
Classification Optimization
Our pipeline decreases compute usage by up to 92% (13x training speedup) compared to the raw baseline and up to 67% (3x training speedup) compared to the sophisticated baseline.
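For concreteness, here is a minimal sketch of how a “compute needed to reach baseline accuracy” speedup like the ones above can be computed from training curves. The curve values, the linear interpolation, and the function name are our own illustrative choices, not DatologyAI’s evaluation harness:

```python
import numpy as np

def samples_to_reach(curve, target_acc):
    """Return the (linearly interpolated) number of training samples at which a
    training curve first reaches target_acc, or None if it never does.
    `curve` is a list of (samples_seen, accuracy) pairs sorted by samples_seen."""
    samples, accs = (np.asarray(v, dtype=float) for v in zip(*curve))
    above = np.flatnonzero(accs >= target_acc)
    if above.size == 0:
        return None
    i = int(above[0])
    if i == 0:
        return float(samples[0])
    # Linear interpolation between the last point below and the first point above the target.
    frac = (target_acc - accs[i - 1]) / (accs[i] - accs[i - 1])
    return float(samples[i - 1] + frac * (samples[i] - samples[i - 1]))

# Hypothetical training curves: (samples seen, eval accuracy). For a fixed model,
# samples seen is proportional to training compute.
baseline_curve = [(128e6, 0.30), (512e6, 0.42), (2.56e9, 0.50), (5.12e9, 0.52)]
curated_curve  = [(32e6, 0.35), (128e6, 0.50), (512e6, 0.58)]

target = baseline_curve[-1][1]  # final accuracy of the baseline run
speedup = samples_to_reach(baseline_curve, target) / samples_to_reach(curated_curve, target)
print(f"{speedup:.1f}x speedup ({1 - 1 / speedup:.0%} compute savings)")
```

In practice the curves come from your training logs; the numbers here exist only to make the snippet runnable.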
When you can train your models faster, you reduce iteration time and get to the right, impactful model more quickly. You can also just get more done in the same amount of time, and reduce the amount of resources required to get to a productive model. All of which means a better product in the hands of your users, faster.
Train Better Models
DatologyAI’s data curation doesn’t just help you train models faster; it helps you train better models. Better models mean better products and better customer experiences.
Retrieval Optimization
Our pipeline delivers an 11 percentage point accuracy improvement over the raw baseline and a 9 percentage point improvement over the sophisticated baseline when training on the same quantity of data.
Classification Optimization
Our pipeline achieves a 9 percentage point increase over the raw baseline and a 3 percentage point increase over the sophisticated baseline when training on the same quantity of data.
Trying to replicate these accuracy gains when training on uncurated data is almost impossible to do cost efficiently. The following plot illustrates the relative cost per marginal accuracy point when training models with and without DatologyAI.
We show the marginal accuracy improvement (y-axis) as a function of the compute cost multiple (x-axis), relative to the raw baseline trained to reach 52% accuracy (for retrieval) or 19% accuracy (for classification). Accuracy gains are much cheaper when using curated data than with either baseline. On retrieval tasks, for example, using 5x as much compute yields approximately a 3 percentage point improvement when training on raw baseline data, while it yields over a 10 percentage point improvement when training on DatologyAI-curated data.
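Written as a formula (our paraphrase of the axes described above; the published plot’s exact construction may differ slightly), each curve shows

```latex
\Delta\mathrm{acc}(m) \;=\; \mathrm{acc}\!\left(m \cdot C_0\right) - \mathrm{acc}_{\mathrm{ref}},
\qquad
\mathrm{acc}_{\mathrm{ref}} =
\begin{cases}
  52\% & \text{(retrieval)} \\
  19\% & \text{(classification)}
\end{cases}
```

where C0 is the compute at which the raw baseline reaches the reference accuracy, m is the compute multiple on the x-axis, and the accuracy is evaluated for a model trained on the curve’s own data (raw, sophisticated, or DatologyAI-curated).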
We attempted to match the performance of our retrieval-optimized data by training the model longer on raw baseline data. However, even after training for 32 times longer on the raw data, the performance was still nearly 7 percentage points worse than that of the model trained on our retrieval-optimized dataset. This highlights how much latent value good data curation can unlock in your data.
More accurate models aren’t just good for modeling, they’re good for business – improving your product quality and giving your users a better experience. And you’ll also see fewer model responses that don’t meet the mark.
Train Better, Smaller Models
Large models are all the rage, but what if you could get the same accuracy out of a smaller one?
Save on inference by training a smaller, better model using curated data. We calculate inference speedup using relative FLOPs for a single 224x224 image, as in OpenCLIP.
The main determinant of inference cost is model size. One major benefit of data curation is the ability to train smaller models to higher quality, so that they match or exceed the performance of larger models trained without curated data. DatologyAI’s data curation allows you to save on inference costs by training better, smaller models.
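As a rough sketch of how a relative-FLOPs comparison like this can be computed, here’s one way to do it with open_clip model configs and fvcore’s FLOP counter. The choice of fvcore, the vision-tower-only convention, and the model configs are our assumptions for illustration, so the resulting ratio won’t necessarily reproduce the 2.6x figure exactly:

```python
import torch
import open_clip
from fvcore.nn import FlopCountAnalysis

def image_encode_flops(model_name: str) -> int:
    """Approximate FLOPs for encoding one 224x224 image with the vision tower of an
    open_clip model. Random weights are fine: FLOPs depend only on the architecture."""
    model, _, _ = open_clip.create_model_and_transforms(model_name)
    model.eval()
    dummy_image = torch.randn(1, 3, 224, 224)
    return FlopCountAnalysis(model.visual, dummy_image).total()

small = image_encode_flops("ViT-S-32")  # ~63M total parameters (vision + text towers)
base = image_encode_flops("ViT-B-32")   # ~151M total parameters (vision + text towers)
print(f"relative inference cost, ViT-B/32 vs ViT-S/32: {base / small:.1f}x")
```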
Retrieval Optimization
A CLIP-ViT-S/32 (63M parameters) trained on 512M samples of DatologyAI’s retrieval-optimized DataComp yields a 13 percentage point improvement over a CLIP-ViT-B/32 (151M parameters) trained using the same amount of compute on raw baseline data—that translates to an inference cost reduction of 2.6x, for a substantially better model.
Classification Optimization
The smaller model trained on DatologyAI’s classification-optimized data yields up to a 9 percentage point improvement over a model trained using the same amount of compute on raw baseline data, which again comes along with an inference cost reduction of up to 2.6x.
With smaller models, you can serve more demand at the same cost and lower your latency for a snappier experience. And you can unlock new use cases that only smaller models can serve, like on-device computation or real-time operations.
Meet Our Pipeline: Behind the Scenes
Our data curation pipeline is the result of extensive research and engineering to optimally compose numerous algorithm families.
We’re pretty excited about it. It’s the culmination of many months of effort applying sound science and clever engineering to build something truly unique—a scalable pipeline that integrates a suite of bleeding-edge algorithms to curate data in the quantity necessary for modern foundation model training. Our suite of algorithms comprises the following algorithm families:
Exact Deduplication: You probably don’t need 7831 copies of the same placeholder clip-art in your dataset
Model-based Filtering: Use purpose-built models to sort and filter data
Embedding-based Curation: Embed your data and then leverage the geometric relationships between samples to find redundant, high-value, and extreme samples
Synthetic Data: Use generative models to enhance the relevance and diversity of your data
Target Distribution Matching: Make your pre-training data more similar to a high-quality dataset or task data
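To make two of these families concrete, here’s a deliberately simplified sketch of a CLIP-score-style model-based filter followed by an embedding-based near-duplicate pass, operating on precomputed, L2-normalized image and text embeddings. It illustrates the general techniques, not DatologyAI’s actual implementation; the synthetic data, array names, and thresholds are all made up:

```python
import numpy as np

def l2_normalize(x: np.ndarray) -> np.ndarray:
    return x / np.linalg.norm(x, axis=1, keepdims=True)

# Tiny synthetic stand-in for precomputed CLIP embeddings of N image-text pairs.
rng = np.random.default_rng(0)
N, D = 1_000, 64
image_embs = rng.standard_normal((N, D)).astype(np.float32)
image_embs[500:600] = image_embs[0]  # plant some near-duplicate images
text_embs = image_embs + 0.5 * rng.standard_normal((N, D)).astype(np.float32)  # captions roughly match images
image_embs, text_embs = l2_normalize(image_embs), l2_normalize(text_embs)

# 1) Model-based filtering (CLIP-score style): drop pairs whose image and caption
#    embeddings disagree, i.e. have low cosine similarity.
clip_scores = np.sum(image_embs * text_embs, axis=1)
keep = clip_scores > 0.5  # illustrative threshold; real pipelines tune this or keep a top fraction

# 2) Embedding-based curation (near-duplicate removal): greedily drop samples whose
#    image embedding is nearly identical to one that has already been kept.
DUP_THRESHOLD = 0.95
kept: list[int] = []
for i in np.flatnonzero(keep):
    if kept and (image_embs[kept] @ image_embs[i]).max() > DUP_THRESHOLD:
        continue  # near-duplicate of an already-kept sample
    kept.append(int(i))

print(f"kept {len(kept)} of {N} pairs after filtering and near-duplicate removal")
```

At the billion-sample scale discussed in this post, the quadratic pairwise pass above would be replaced by approximate nearest-neighbor search or clustering over the embedding space, but the underlying geometric idea is the same.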
We designed our pipeline from the ground up to be scalable, portable, performant, and secure. It can seamlessly curate billions of samples and runs inside your VPC, so your data never leaves your system.
We’re Data-Obsessed and We’re Just Getting Started
We built a state-of-the-art data curation pipeline for image-text data and used it to train models faster, train better models, and save on inference costs by training smaller models. And our results will only get better. We’re starting to work with early customers: if you’re an enterprise AI company interested in training multimodal and/or text models faster, better, or smaller, get in touch!
If you’re interested in pushing the bounds of what’s possible with data curation, we’re also looking for talented Members of Technical Staff who have experience doing data research, building research tooling, translating science into products, and building scalable data products.
Looking for more scientific details? Check out our technical deep-dive companion piece!