DatologyAI Team

Train LLMs Faster, Better, and Smaller with DatologyAI’s Data Curation

Two quick notes:

  1. This post is a short, high-level summary of our initial text curation results for training LLMs. We encourage interested readers to check out our technical deep-dive companion piece for more scientific depth and fun.

  2. The themes of this post are similar to our Image-Text Data Curation announcement, but here we apply our data curation pipeline to text data, so our models, baselines, evaluations, and results all change to tell the story of how we improve LLMs.

Models are what they eat. But a large portion of training compute is wasted on data the model has already learned or that is irrelevant to its downstream application. Data curation—selecting the right data to train on—has the potential to massively improve training efficiency, allowing us to break down the compute walls that threaten to impede continued model improvements.


Over the past few months, we’ve been building a state-of-the-art data curation pipeline (based on our research) and applying it to text data, and today we’re excited to announce that we’ve obtained substantial improvements in LLM quality, training speed, and inference efficiency compared to existing public high-quality benchmark datasets.


We curate RedPajama v1 (RPJv1), a publicly available pretraining dataset that attempts to replicate the Llama training data. We chose RPJv1 because it’s well-established, contains diverse content across a number of domains, and already has a moderate degree of curation applied to it. We then train models up to 2.7B parameters for token budgets up to 180B on our curated RPJv1 and on other large-scale public pretraining corpora. We evaluate these models on a suite of 15 industry-standard language model evaluation datasets (see Appendix: Evaluations). By using our pipeline to curate training data, we are able to:


Train Faster

Save 86.9% on compute (a 7.7x training speedup) to reach the same average 5-shot accuracy as training on the base RPJv1. Compared to DCLM, an aggressively curated public dataset known to be extremely high quality, we save 64.9% on compute (a 2.8x training speedup) to reach the same average 5-shot accuracy.


Train Better

Improve model quality by 8.5 percentage points on average 5-shot accuracy for the same training cost compared to training on RPJv1 (RPJv1: 52.0%; DatologyAI: 60.5%), and by 4.4 percentage points compared to DCLM (56.1%). Even training for 10x longer on RPJv1 still leaves models 2.0 percentage points behind those trained on our curated data.


Train Smaller

Improve inference efficiency by training better, smaller models. Training on curated data reduces the cost per query by up to 2.1x: our 1.3B parameter model is 5.7 percentage points better on average 5-shot accuracy than 2.7B parameter models trained using the same amount of compute on RPJv1 data (RPJv1, 2.7B params: 50.8%; DatologyAI, 1.3B params: 56.5%), and 1.9 percentage points better than DCLM (2.7B params: 54.6%).


Before we dive into details, we’d like to express how exciting this is for our team. Many of us have devoted ourselves to data curation research in academia and research labs. And many of us have dedicated our careers to building data tools and infrastructure for the pre-deep learning era of tech. But proving that an idea works in a research lab is a far cry from building a working product out of it, let alone a scalable, generalizable, effective product.


This is especially true for data curation: it’s a young, frontier research area in which experiments are costly to run at scale, and building a curation pipeline that can handle the scale of modern foundation model training datasets requires solving challenges across domains including data infrastructure, platform engineering, and security. It wasn’t easy, but by leveraging our collective expertise, working hard, and staying laser-focused, we did it. And boy does it work!


Data curation for large-scale deep learning is hard. It’s a frontier research and engineering problem, and it’s essentially infeasible for most companies to do efficiently at scale today. But with DatologyAI, you can. Is your company interested in training LLMs faster, better, or smaller? Get in touch!


We’re extremely proud of what we’ve done, but this is just the start. We’ve already shown great results applying our curation to image-text data, and now we’ve proven that our pipeline can successfully improve text data and LLMs. But we’re still young—stay tuned for significant improvements to these figures over the coming months.


Train Models Faster


Reduce the time and cost required to reach baseline accuracy by 86.9% compared to RedPajama v1.


DatologyAI's data curation allows you to train models faster, and faster training means better models in your customers' hands sooner.


Compared to 2.7B parameter baselines trained on 180B tokens, our pipeline reduces compute requirements to reach baseline 5-shot accuracy by 86.9% (7.7x training speedup) relative to RPJv1, by 78.2% relative to FineWeb-Edu, another public dataset known to be of extremely high quality (4.6x training speedup), and by 69.4% relative to DCLM (3.3x training speedup).
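A quick note on how the compute-savings and speedup figures relate: they are two views of the same measurement. As a simplifying illustration (our assumption for this sketch, not a description of how the numbers above were computed), if curation lets you reach the target accuracy with a fraction f of the baseline compute, then the savings are 1 - f and the speedup is 1 / f:

```python
# Toy illustration (assumption): if reaching the target accuracy needs only a
# fraction `f` of the baseline compute, then savings = 1 - f and speedup = 1 / f.
# Savings figures are taken from the text; the implied speedups land near the
# reported 7.7x, 4.6x, and 3.3x (small gaps come from rounding of the
# underlying measurements).
for name, savings in [("RPJv1", 0.869), ("FineWeb-Edu", 0.782), ("DCLM", 0.694)]:
    speedup = 1.0 / (1.0 - savings)
    print(f"vs. {name}: {savings:.1%} savings -> ~{speedup:.1f}x speedup")
```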


When you train your models faster, you reduce iteration time, enabling you to reach the right, impactful model more quickly. You can also just get more done in the same amount of time, and reduce the amount of resources required to get to a productive model. All of which means a better product in the hands of your users, faster.


Train Better Models

DatologyAI’s data curation doesn’t just help you train models faster; it helps you train better models. Better models mean better products and better customer experiences.


When training for a budget of 180B tokens, our DatologyAI-curated RPJv1 reaches 60.5% 5-shot accuracy, improving over the base RPJv1 (52.0% accuracy) by 8.5 percentage points, over FineWeb-Edu (54.4% accuracy) by 6.1 percentage points, and over DCLM (56.1% accuracy) by 4.4 percentage points. Our curated data also performs substantially above the base RPJv1 on every evaluation task, and performs at or above the level of the other datasets on 11 of the 15 evaluation tasks, highlighting the strength and generality of our curation pipeline.


Trying to replicate these accuracy gains when training on other datasets is almost impossible to do cost effectively. The following plot illustrates the relative cost per marginal accuracy point when training models with and without DatologyAI.



We show the marginal accuracy improvement (y-axis) as a function of compute cost multiple (x-axis) relative to training on RPJv1 to reach 41.8% accuracy. Accuracy gains are much cheaper when using curated data than with the baselines. For example, using 5x as much compute yields an improvement of approximately 4.6 percentage points when training on the base RedPajama-v1 dataset, but an improvement of 10.7 percentage points when training on DatologyAI’s curation of RedPajama-v1.


We attempted to match the performance of a 1.3B parameter model trained on our DatologyAI-RPJv1 for 60B tokens (53.1% mean 5-shot accuracy) by training a model longer on the base RPJv1. However, even after training for 10x longer (600B tokens) on the base RPJv1, performance was still 2.1 percentage points worse than that of the model trained on our DatologyAI-RPJv1. This highlights how much latent value good data curation can unlock in your data.


More accurate models aren’t just good for modeling, they’re good for business: they improve your product quality and give your users a better experience. You’ll also see fewer model responses that miss the mark.


Train Better, Smaller Models

Large models are all the rage, but what if you could get the same accuracy out of a smaller one?


Save on inference by training a smaller, better model using curated data.


The main determinant of inference cost is model size. One major benefit of data curation is the ability to train smaller models to higher quality, so that they match or exceed the performance of larger models trained without curated data. DatologyAI’s data curation allows you to save on inference costs by training better, smaller models.


A 1.3B parameter model trained on 180B tokens of DatologyAI-curated RPJv1 reaches 56.5% mean 5-shot accuracy, a 5.7 percentage point improvement over a 2.7B parameter model trained using the same amount of compute on the base RPJv1 data, and 1.9 percentage points better than a 2.7B parameter model trained using the same amount of compute on DCLM. That translates to an inference cost reduction of up to 2.1x, for a substantially better model.
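As a rough sanity check on where a figure like 2.1x comes from: a standard back-of-the-envelope approximation (our assumption here, not DatologyAI's accounting) is that per-token inference compute for a dense transformer scales roughly linearly with parameter count, so the cost ratio between two models is approximately the ratio of their sizes:

```python
# Back-of-the-envelope sketch (assumption): per-token inference FLOPs for a
# dense transformer are roughly ~2 * num_parameters, so the relative cost per
# query is approximately the ratio of model sizes.
baseline_params = 2.7e9  # 2.7B parameter baseline model
curated_params = 1.3e9   # 1.3B parameter model trained on curated data
cost_ratio = (2 * baseline_params) / (2 * curated_params)
print(f"Approximate per-query cost reduction: {cost_ratio:.1f}x")  # ~2.1x
```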


Meet Our Pipeline: Behind the Scenes



Our data curation pipeline is the result of extensive research and engineering to optimally compose numerous algorithm families.


We’re pretty excited about it. It’s the culmination of many months of effort applying sound science and clever engineering to build something truly unique—a scalable pipeline that integrates a suite of bleeding-edge algorithms to curate data in the quantity necessary for modern foundation model training. Our suite of algorithms comprises the following algorithm families:


  • Lexical Deduplication: You probably don’t need 731 copies of the same chiropractic ad copy in your dataset

  • Heuristic Filtering: Intuitive rules like “discard documents that are mostly whitespace” (see the toy sketch after this list)

  • Model-based Filtering: Use purpose-built models to sort and filter data

  • Embedding-based Curation: Embed your data and then leverage the geometric relationships between samples to find redundant, high-value, and extreme samples

  • Synthetic Data: Use generative models to enhance the relevance and diversity of your data

  • Target Distribution Matching: Make your pre-training data more similar to a high quality dataset or task data

  • Source Mixing: Determine the optimal way to mix data from different sources
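
To make a couple of these families concrete, here is a deliberately tiny, self-contained sketch of heuristic filtering plus exact lexical deduplication using only the Python standard library. It illustrates the ideas, not DatologyAI's implementation, which catches near-duplicates, composes many more signals, and runs at the scale of billions of documents:

```python
# Toy sketch of two curation steps: heuristic filtering and exact lexical
# deduplication. Not DatologyAI's pipeline -- an illustration only.
import hashlib
import re


def passes_heuristics(doc: str, max_whitespace_ratio: float = 0.5) -> bool:
    """Heuristic rule from the list above: drop documents that are mostly whitespace."""
    if not doc:
        return False
    whitespace = sum(ch.isspace() for ch in doc)
    return whitespace / len(doc) <= max_whitespace_ratio


def normalize(doc: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies hash identically."""
    return re.sub(r"\s+", " ", doc.lower()).strip()


def curate(docs):
    """Apply heuristic filtering, then drop exact lexical duplicates."""
    seen_hashes = set()
    for doc in docs:
        if not passes_heuristics(doc):
            continue
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue  # an effectively identical document was already kept
        seen_hashes.add(digest)
        yield doc


if __name__ == "__main__":
    corpus = [
        "Visit our chiropractic clinic today!",
        "Visit   our chiropractic clinic TODAY!",  # trivial near-duplicate
        "   \n\t  ",                               # mostly whitespace
        "A genuinely informative document about transformers.",
    ]
    print(list(curate(corpus)))
```

In practice, lexical deduplication typically relies on approximate methods such as MinHash so that near-duplicates, not just exact copies, are caught.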


We designed our pipeline from the ground up to be scalable, portable, performant, and secure. It can seamlessly curate billions of samples and runs inside your VPC, so your data never leaves your system. And much of it is generic across modalities (see our Image-Text Data Curation announcement).


We’re Data-Obsessed and We’re Just Getting Started


We built a state-of-the-art data curation pipeline for text and used it to train models faster, train better models, and save on inference costs by training smaller models. And our results will only get better. We’re starting to work with early customers: if you’re an enterprise AI company interested in training multimodal and/or text models faster, better, or smaller, get in touch!


If you’re interested in pushing the bounds of what’s possible with data curation, we’re also looking for talented Members of Technical Staff who have experience doing data research, building research tooling, translating science into products, and building scalable data products.


Looking for more scientific details? Check out our technical deep-dive companion piece!


Appendix

Evaluations

We evaluate on a suite of 15 in-context learning (ICL) datasets.
