Two quick notes:
This post is a technical deep-dive into our initial text data curation results. For a brief, high-level summary of this work, we direct readers to our short companion piece.
The themes and some of the content of this post are similar to our Image-Text Curation Deep-Dive, but here we show that we can also obtain substantial gains applying our data curation pipeline to LLMs.
tl;dr
We share the results of applying our data curation pipeline to text data. Our pipeline is scalable, productionized, and integrates a suite of approaches including model-based filtering, embedding-based filtering, synthetic data, and more. We applied our curation pipeline to RedPajama V1 (RPJv1) and obtained substantial improvements in large language model (LLM) quality, training speed, and inference efficiency for model sizes of up to 2.7B parameters and training budgets of up to 180B tokens.
We trained standard MPT-style transformer models on our curated data (which we refer to as DAIT for brevity), on RPJv1, and on other publicly-available high quality pretraining corpora such as RefinedWeb, FineWeb, FineWeb-Edu, and DataComp-LM (DCLM). We compared their performance across 15 benchmark language model evaluation datasets. By using our pipeline to curate training data, we are able to:
Train Better: Improve model quality by 8.4 absolute percentage points (pp) on average 5-shot accuracy (16.2% relative improvement) when training a 2.7B parameter model for 180B tokens compared to RPJv1 (60.5% vs. 52.0%). Compared to DCLM, an aggressively-curated public dataset known to be extremely high quality, DAIT improves average 5-shot accuracy by 4.4pp (7.8% relative improvement; DCLM: 56.1% mean accuracy). We also tie or outcompete even the strongest baseline, DCLM, on two thirds or more of the evaluation datasets, and match or outperform the other baselines on nearly all evaluations.
Train Faster: Save 86.9% on compute (7.7x training speedup) when training a 2.7B parameter model to reach the same average 5-shot accuracy compared to training on RPJv1 for 180B tokens, and save 70.1% on compute (3.4x training speedup) to reach the same average 5-shot accuracy as DCLM.
Train Smaller: Improve inference efficiency by training better, smaller models. Training a 1.3B parameter model on 180B tokens of DAIT reduces the cost per query by up to 2.1x while netting a 4.5pp average 5-shot accuracy improvement over training a 2.7B parameter model on 180B tokens of RPJv1, and yields better 5-shot performance than every 2.7B model we trained on publicly curated datasets.
Our curation pipeline is able to take RedPajama v1, one of the consistently-lowest performing of the six public datasets that we baselined against, and transform it into a dataset that substantially outperforms the best publicly-available pretraining datasets.
While the results are strong, they did not arise easily; packaging our diverse set of cutting-edge data curation algorithms into a scalable, productionized pipeline was a significant challenge.
We’re extremely proud of what we’ve done, but this is just the start. We’re expecting to continually improve these numbers every few months. To do so, we need to push the bounds of what’s possible across research and engineering areas like embedding models, synthetic data, data infrastructure, and platform engineering—and we’re very excited to do it!
Are you a deep learning fanatic, data engineer, or otherwise data-obsessed person who wants to push the bounds of what’s possible with data curation? Check out our jobs page!
Is your company interested in training multimodal models faster, better, or smaller? Sign up for our customer waitlist!
Follow us on twitter for insights (and memes) about data!
Table of Contents
1. Introduction
Note: Much of the Introduction mirrors what we wrote in our Multimodal Deep-Dive. Feel free to fast-forward straight to the text-specific content.
1.1. Why is Data Curation Important?
Contemporary deep learning model capabilities require training large models on enormous quantities of training data (Dubey et al., 2024; Gemini Team et al., 2023; NVIDIA et al., 2024; Mistral AI Team, 2024). Each additional data point requires additional compute to train, and each additional parameter requires additional compute to train and deploy. Training and deploying cutting-edge models won’t be accessible if it requires access to a nuclear power plant. Reducing the costs required to train and deploy the best-quality models could thus facilitate broader access to the benefits of machine learning.
It has been shown that a large portion of training compute is wasted training on data that are already learned (Mindermann et al., 2022) or irrelevant (Sorscher et al., 2022) for the model’s downstream application. Some data can even be misleading and can actively harm model quality (Maini et al., 2023). This means that selecting the right data can substantially improve training and/or inference efficiency by reducing the cost of training a model to a given quality, improving the quality of models trained for a given compute cost, and enabling the training of smaller models to higher quality. All we have to do is select the right data to train on!
1.2. Why is Data Curation Hard?
Unfortunately, data curation for large-scale deep learning is hard—it’s a frontier research problem. It’s a comparatively new field, experiments are costly to run at scale (Feldman and Zhang, 2020; Choe et al., 2024), and small-scale results often aren’t predictive of large-scale outcomes (Sorscher et al., 2022; Goyal et al., 2024). Furthermore, even if we understand how a single curation algorithm works, it’s unclear how different algorithms work together. This is critical, because there’s likely no single algorithm to rule them all. As with training efficiency algorithms more generally (Leavitt et al., 2022; Bartoldson et al., 2023; Kaddour et al., 2023), different data curation algorithms impart their benefits through different mechanisms (Hu et al., 2023; Wang et al., 2024), so understanding how to effectively combine them is necessary for achieving the best outcomes.
Data curation is also a frontier engineering problem—there’s no established playbook for implementing a curation pipeline that can scale up to the billions of images or trillions of tokens that constitute modern foundation model training datasets. While numerous open-source projects offer a viable starting point for curation (e.g. DataTrove, NeMo-Curator, fastdup), they are often rigid and demand deep expertise, significant time, and resources to customize effectively, and don’t contain the cutting-edge research innovations necessary for training the best models. Integrating cutting-edge research into large-scale dataset curation tools presents unique challenges across domains like data infrastructure, platform engineering, reliability, and security. Off-the-shelf solutions typically lack the flexibility and adaptability needed to meet specific requirements and push the boundaries of model performance, making tailored solutions essential.
These research and engineering challenges are why cutting-edge data curation has primarily been restricted to the big players who can afford to hire large in-house data teams (we counted 69 authors across the different data-related teams on OpenAI’s GPT-4 Technical Report).
1.3. Making Data Curation More Accessible
At DatologyAI, we believe that intelligent data curation is too beneficial to remain locked away in closed research labs. We believe there is tremendous societal value to enabling the training of foundation models outside a small set of large, extremely well-resourced companies. We believe that data curation is one of the most promising ways to reduce the cost of training and deploying models. We believe everyone should be empowered to train their own models on their own data. And this is why we’re excited to announce the first set of results from our state-of-the art data curation pipeline.
1.3.1. What We Did and How We Did It
We built a scalable data curation pipeline that comprises a suite of curation algorithms and applied it to pools of Red Pajama v1 (RPJv1; Together Computer, 2023) of up to 600B tokens to generate curated datasets of up to 180B tokens. We then trained standard transformer models of up to 2.7B parameters on our curated data (DAIT), RPJv1, and a number of other publicly-available, high quality pretraining corpora, and compared their performance across 15 benchmark evaluation datasets.
Our curation can reduce the compute needed to reach baseline performance by up to 86.9% (a 7.7x training speedup), improve model quality by up to 8.4 percentage points compared to compute-matched baselines, and reduce inference costs by up to 2.1x by training smaller models to higher quality. For example, a 1.3B parameter model trained on DAIT (DatologyAI Text v1) for 180B tokens has 0.4pp higher mean 5-shot accuracy than a 2.7B parameter model trained for the same number of tokens on DCLM, an aggressively-curated public dataset that’s known to be extremely high quality.
1.3.2. A Bird’s Eye View of this Post
This post starts with an overview of the algorithms that constitute our curation pipeline, and presents the data we use and how we curate it. We then jump into results, describing the quality improvements, training speedups, and inference benefits of training on our curated data. Additional results, an in-depth tour of our curation algorithms, and methodological details (including evaluations and training) are contained in the Appendix.
2. Methods
2.1. Curation Algorithms
Figure 1: Data Curation Algorithm Families. Our pipeline integrates a number of families of curation algorithms.
Our algorithmic curation pipeline comprises a suite of algorithms, which we group into “families” based on their known or hypothesized mechanism(s) of action, effects, and computational requirements (see Figure 1). We use the following algorithm families:
Lexical deduplication: We remove exact duplicate documents across the entire data pool by computing a SHA-512 hash of the UTF-8 encoded full text of each document. We also deduplicate strings within and across documents by detecting and removing exact n-gram matches across the complete dataset.
Heuristic Filtering: Intuitive rules like “discard documents that are mostly whitespace”, similar to Raffel et al., 2019, Rae et al., 2022, and Sharma et al., 2024. We found that most heuristic filters had little effect on RPJv1 due to the amount of curation that has already been applied to it, highlighting the challenge of curating on top of datasets that have already had some curation applied.
Model-based filtering: Similar to Marion et al., 2023, Li et al., 2024, and Ankner et al., 2024, we trained filtering models on high-quality reference datasets known to be in-domain with respect to our evaluation tasks. We then used these filtering models to score pretraining data samples and filtered out low-scoring examples. We found that the composition of the data that the filtering model was trained on strongly affected the quality of the model trained on the filtered data. Put another way: curation models need well-curated data. It’s data curation all the way down!
Embedding-based curation: We embed samples and then leverage the geometric relations between samples in embedding space to find high- and low-value data points, such as semantic duplicates, similar to Abbas et al., 2023, Tirumala et al., 2023, and Abbas et al., 2024. We improve on existing approaches by reducing the need for manual hyperparameter selection and enhancing the scalability and efficiency of the algorithms.
Target distribution matching: Like Gururangan et al., 2020 and Xie et al., 2023, we leveraged auxiliary data (for example from other high quality datasets and/or end tasks) to retrieve relevant data points from a larger corpus and upsample them during training. We found that the efficacy and efficiency of this approach is sensitive to a number of design elements, including the choice of target dataset(s), similarity and ranking algorithm, accounting for the density of the target distribution, and the proportion of retrieved data used in the final data mix. One risk of this approach is overfitting to the target datasets such that performance improves for benchmark evaluations but the model is worse in practice. We were careful to avoid this by restricting the amount of upsampling and ensuring lack of test set contamination.
Synthetic data: We used pretrained LLMs to rephrase data samples, similarly to Maini et al., 2024. We identified a number of factors that substantially impacted the quality of model trained on rephrased data, including the diversity of prompts for rephrasing, stylistic diversity of the rephrased samples, relevance of prompts and generated styles to the end-tasks, and the proportion of rephrased samples in the pretraining dataset.
Source Mixing: RPJv1 consists of data from multiple different sources (e.g. CommonCrawl, arXiv, GitHub, etc.), the relative proportions of which were not determined systematically. As in Liu et al., 2024, Ye et al., 2024, and Ge et al., 2024, we found that better proportions for larger models can be determined via a systematic sweep of relative proportions using smaller-scale models.
Please see Appendix: Meet the Family for a more detailed discussion of each algorithm family.
2.2. Data
2.2.1. Our Testbed: RPJv1
We selected Red Pajama v1 (RPJv1; Together Computer, 2023), a ~1.2T token dataset intended to replicate Llama’s (Touvron et al., 2023) pretraining data, as our testbed. RedPajama v1 consists of the following subsets, as described in Together’s announcement blog post:
CommonCrawl (878B tokens): Five dumps of CommonCrawl, processed using the CCNet pipeline, and filtered via several quality filters including a linear classifier that selects for Wikipedia-like pages.
C4 (175B tokens): Standard C4 dataset.
GitHub (59B tokens): GitHub data, filtered by licenses and quality.
arXiv (28B tokens): Scientific articles removing boilerplate.
Books (26B tokens): A corpus of open books, deduplicated by content similarity. Note that this subset has since been removed due to reported copyright infringement.
Wikipedia (24B tokens): A subset of Wikipedia pages, removing boilerplate.
StackExchange (20B tokens): A subset of popular websites under StackExchange, removing boilerplate.
2.2.2. Why RPJv1?
We selected RPJv1 as our testbed for multiple reasons.
It’s well-established: it’s been available since April 2023, providing ample time for comprehensive training experiments and to establish itself as a mature and widely-used dataset.
RPJv1’s diverse content spans multiple domains, enabling us to evaluate our curation pipeline across a breadth of text types.
It has some curation already applied to it. In particular, the CommonCrawl subset has model-based quality filtering applied to it, C4 is a well-known, high-quality curation of CommonCrawl, and the remaining subsets all have some degree of text cleaning (e.g. boilerplate removal). Thus any improvements we achieve with our curation pipeline are additive, building upon these baseline quality measures rather than duplicating them. This cumulative effect allows us to evaluate the marginal benefit of our methods and provides a more stringent test of our curation's effectiveness.
Furthermore, the latter two factors—diverse, multi-source content and some amount of pre-existing curation—are representative of the practices of our target customers.
2.2.3. Curating RPJv1 into DatologyAI Text (DAIT)
We applied exact deduplication to the full RPJv1 dataset, which reduces the dataset from ~1.2T tokens to ~774B tokens. We use this exact-deduplicated version of RPJv1 as a stronger baseline for comparison and the starting point for further curation. Subsequent uses of “RPJv1” refer to the exact-deduplicated version.
All of the curation results that we present in this work are from applying our curation pipeline to RPJv1. We refer to our DatologyAI-curated version of RPJv1 as DAIT (DatologyAI Text).
In order to demonstrate the effectiveness of our curation pipeline across a breadth of dataset and compute scales, we applied our curation pipeline to different subsets of RPJv1, which we refer to as pool sizes, ranging from 60B tokens to 600B tokens, to capture the different starting dataset sizes that customers may have. We then trained for budgets ranging from 20B tokens to 180B tokens. We notate the pool size and training budget for a DatologyAI-curated dataset as DAIT {Pool_Size}→{Train_Budget}, e.g. DAIT 540B→180B. The pool sizes and training budgets for different model sizes are shown in Table 1.
Table 1: Model Sizes, Pool Sizes, and Training Budgets. We start with different pool sizes of exact-deduplicated RPJv1 (~774B tokens) and curate them down to achieve a training dataset of a desired size. For the 20B and 60B training budgets, the pool sizes correspond to 3x, 6x, and 10x the training budget. The size of exact-deduplicated RPJv1 limited us to a 3x pool size for the 180B training budget. We notate the pool size and train budget for a DatologyAI-curated dataset as DAIT {PoolSize}→{Train_Budget}, e.g. DAIT 540B→180B.
When curating from a given pool size, we aimed to generate a curated dataset as close in size to the training budget as possible. We considered multi-epoch training (e.g. Muennighoff et al., 2023) to be out of the scope of the present work, but could potentially provide further gains due to the increased information density of curated data.
We note that one of our baseline datasets, C4 (Raffel et al. 2020; see Comparing to Other High-Quality Public Datasets), is 175B tokens, and so our longest training budget, 180B tokens, results in 1.03 epochs of training on C4, though we consider this a negligible amount of repetition.
We note that our curation recipe is fixed and agnostic to pool size and training budget. We do this to show that our pipeline is robust to scaling, despite recent work showing that tuning curation to the pool size and training duration is necessary for maximizing model quality (Goyal et al., 2024).
2.2.4. Comparing to Other High-Quality Public Datasets
We also compare DAIT to a number of publicly-available high quality pretraining corpora. Note that all datasets are curated starting from Common Crawl.
C4 (“Colossal Clean Crawled Corpus”; Raffel et al. 2020) is a 175B token dataset curated using several heuristic and NSFW safety filters, deduplication of repeated three-sentence spans, and document-level English filtering using langdetect.
RefinedWeb (Penedo et al., 2023) is a 600B token dataset. It's curated using NSFW URL filtering, text extraction with trafilatura (Barbaresi et al., 2021), document and line-level heuristic filters, and three types of deduplication: exact (n-gram), fuzzy (MinHash), and temporal (across different dumps).
FineWeb (Penedo et al., 2024) is a 15T token dataset, curated using URL filtering, independent MinHash deduplication per dump, and a selection of heuristic filters, including many of those used by C4. We highlight the methodical approach used by the creators of FineWeb to determine the optimal configurations for heuristic filtering and deduplication, which yielded a very high quality dataset.
FineWeb-Edu (Penedo et al. 2024) is a 1.3T token dataset derived from FineWeb. The authors provide a complex prompt to Llama-3-70B-Instruct to score 500k FineWeb samples based on their educational value, then use these sample-score pairs to train a classifier. They then use the classifier to filter all of FineWeb down to the documents with high educational content, removing 92% of FineWeb.
DCLM (Li et al., 2024), short for DataComp for Language Models, extracts a standardized corpus of 240T tokens from Common Crawl dumps, which is then curated down to a 3.8T token dataset. It’s curated using the same heuristic filtering as RefinedWeb, a Bloom filter and MinHash for deduplication, and a fastText classifier trained on OpenHermes2.5 (Teknium, 2023) and Reddit r/explainlikeimfive (ELI5) datasets (Fan et al., 2019) to keep the top 10% of documents. We note the strength of this dataset, which is likely due to the extensive experimentation performed by the authors.
2.3. Evaluations, Training, and Pipeline
Details on how we evaluate, our models and training setup, and the technical rundown on our curation pipeline can be found in the Appendix.
3. Results
3.1. How We Present Our Results
Data curation with DatologyAI’s methods significantly enhances training efficiency by reducing the resources required to achieve a given level of quality or by improving model quality for a given computational cost. Because models are not typically trained to convergence (nor do we do so here), the relationship between training speed and accuracy can be adjusted to prioritize one over the other. We report results with a focus on maximizing either speed or quality independently, demonstrating that our curation provides substantial savings in computational resources while delivering high-performance outcomes in training and inference.
We first present our results on quality improvements, move on to training speedups, and finish with the inference benefits of training on curated datasets. We compare to six high quality public benchmark datasets (RPJv1, C4, RefinedWeb, FineWeb, FineWeb-Edu, and DCLM; see Methods: Comparing to Other Datasets for details), primarily focusing on training at the largest scale (2.7B parameter, 180B token training budget), and evaluate 0- and 5-shot performance on a suite of 15 evaluation datasets (see Appendix: How We Evaluate). We report accuracies ± standard error of the mean when replicates are available (see Models, Training, and Infrastructure).
We note that because the focus of this work is on the effects of data curation, we did not attempt to optimize model quality via any means other than data curation. All models of a given size used identical training configurations in every way except the dataset. We also chose to use standard models and training procedures (see Appendix: Models, Training, and Infrastructure).
As this is a blog post rather than an academic paper, we present our results in an interactive and browsable format, and we encourage readers to engage with and explore the results dynamically. For readers who wish to go really deep, we also include a plot in the Appendix that contains the full set of results for every task for every model we trained. Perhaps you will find an interesting result that we missed—if so, please let us know!
3.2. Data Curation Improves Model Quality
Figure 2: Data Curation Substantially Improves Model Quality. We plot mean accuracy across a set of 15 evaluation datasets (see Appendix: How We Evaluate) on the y-axis as a function of training tokens on the x-axis for a set of baseline datasets and one of our curated versions of RPJv1, DAIT 540B→180B (540B→180B indicates we started with a pool of 540B RPJv1 tokens and curated it down to 180B tokens; see Methods: Curating RPJv1 into DAIT for details). You can select between 0-shot and 5-shot evaluation and view individual and groups of evaluation datasets using the dropdown menus at the top of the plot. We also include a more detailed plot in the Appendix that contains the full set of results for every model we trained.
3.2.1. Dataset Alchemy: Substantial Improvements from Curating RPJv1
Our data curation unlocks substantial improvements in model quality. We trained MPT-3B transformer models (approximately 2.7B parameters; see Appendix: Models, Training, and Infrastructure) for 180B tokens on RPJv1, our curation of a pool of 540B RPJv1 tokens down to 180B curated tokens (DAIT 540B→180B), and multiple other public benchmark datasets: C4 (Raffel et al. 2020), RefinedWeb (Penedo et al., 2023), FineWeb (Penedo et al., 2024), FineWeb-Edu (Penedo et al., 2024), and DCLM (Li et al., 2024).
Our DAIT reaches 60.5% (±0.1 S.E.M.—Appendix: Models, Training, and Infrastructure) mean 5-shot accuracy on our suite of 15 in-context learning benchmarks (see Appendix: How We Evaluate), which improves over the base RPJv1 (52.0% accuracy) by 8.4 percentage points (pp; 16.2% relative improvement; see Figure 2 and Table 2).
DAIT also imparts substantial improvements on 0-shot evaluation: training a 2.7B parameter model on 180B DAIT tokens results in 57.0% (±0.3) mean 0-shot accuracy, a 7.3pp improvement (14.76% relative improvement) over training on 180B tokens of RPJv1 (49.7% mean 0-shot accuracy; see Figure 2 and Table 2).
Table 2: Data Curation Substantially Improves Model Quality at the 2.7B parameter, 180B token scale. This table summarizes the results shown in the previous plot. ± denotes standard error of the mean across replicates when replicates were trained (see Models, Training, and Infrastructure). All models are 2.7B parameter, MPT-style transformer models trained for 180B tokens on the indicated dataset. DAIT 540B→180B (540B→180B indicates we started with a pool of 540B RPJv1 tokens and curated it down to 180B tokens; see Methods: Curating RPJv1 into DAIT for details). “DatologyAI Improvement” is the difference between a given model performance and the performance of DAIT 540B→180B. 0-shot and 5-shot performance are averaged across 15 evaluation datasets (see Appendix: How We Evaluate).
3.2.2. DAIT Outperforms All Our Public Baselines
We also compare against five additional high quality public benchmark datasets: C4, RefinedWeb, FineWeb, FineWeb-Edu, and DCLM (see Methods: Comparing to Other Datasets for details). Our experiments show that training on our curated data yields models that substantially outperform models trained on any of the public benchmark datasets we trained on.
For example, we found that DCLM consistently yields the highest quality models of any non-DatologyAI dataset, but it still underperforms DAIT by 4.4pp on average across 5-shot evaluations (average 5-shot performance: DCLM: 56.1%±0.1; DAIT: 60.5%) and 3.0pp on average across 0-shot evaluations (average 0-shot performance: DCLM: 54.0%±0.1; DAIT: 57.0%±0.3) for 2.7B models trained for 180B tokens (Figure 2 and Table 2).
DAIT also substantially outperforms all other datasets at smaller model and/or training scales, except in the 1.3B parameter, 20B token setting, where FineWeb-Edu, DCLM, and DAIT all have similar performance. We direct readers to Figure A1 to explore the comprehensive set of results.
3.2.3. Consistent Curation Improvements Across Evaluation Datasets
Models trained on our curated datasets substantially outperform models trained on any other public dataset on average 0- and 5-shot evaluations, but it’s possible that our large average improvements could come disproportionately from a small set of evaluation datasets. Indeed, the improvements of FineWeb-Edu over other datasets come primarily from large gains on ARC, MMLU, and OpenBookQA (Penedo et al., 2024).
This is not the case for our curation. Our curation shows broad, consistent improvements across evaluation datasets. At the 2.7B parameter, 180B token scale, DAIT has 5-shot accuracy equal to or better than all other datasets on 11 out of the 15 evaluation datasets, and 0-shot accuracy equal to or better than all other datasets on 10 out of 15 evaluation datasets. We direct readers to Figure 2 and Table A1 for individual evaluation dataset comparisons for 2.7B parameter models trained for 180B tokens.
Our curated RPJv1 yields substantial, consistent gains over every publicly-available dataset we compared to. Furthermore, we tie or outcompete even the strongest baseline, DCLM, on two thirds or more of the evaluation tasks, and match or outperform the other baselines on nearly all tasks. This demonstrates the strength and generality of our curation pipeline.
3.2.4. More Shots More Gains
Our curation provides substantial gains on both 0-shot and 5-shot evaluation. If the benefit of our curation came primarily from putting task knowledge or capabilities into the model, or if we had overfit to the evaluation datasets, we would expect 0-shot capabilities to improve more than 5-shot capabilities. Interestingly, we find the inverse to be true: examining Table 2 reveals that the curation gains are larger for 5-shot than for 0-shot evaluation. DAIT 540B→180B improves over the external baselines by 7.1pp on average for 5-shot evaluation, whereas it improves over the external baselines by 5.8pp on average for 0-shot evaluation.
In contrast, DCLM improves over other baselines by 3.3pp on average for 5-shot evaluation and 3.4pp on average for 0-shot, and FineWeb-Edu improves over other baselines by 2.0pp on average for 5-shot evaluation and 2.9pp on average for 0-shot. This indicates that our curation doesn’t just improve task learning, but also improves in-context learning more generally.
3.2.5. Olympic-Size Curation: Larger Pool Sizes Yield Better Curated Datasets
We observed in our previous work on data curation for image-text multimodal datasets (DatologyAI, 2024) that expanding the initial pool size can enhance the quality of the curated dataset by providing a broader range of high-quality samples for selection. We perform a similar analysis here, with one modification: we keep the target size of the curated dataset fixed. We find that for a fixed target dataset size, larger pool sizes lead to higher quality datasets (Table 3). When curating pools of RPJv1 of 600B, 360B, or 180B tokens to a target dataset size and training budget of 60B tokens, we observe in both 2.7B and 1.3B models that training on datasets curated from larger pool sizes yields higher quality models than training on datasets curated from smaller pool sizes. It is possible, however, that the effect saturates, as our 600B→60B and 360B→60B results are similar for 2.7B models.
Table 3: Larger Pool Sizes Yield Higher Quality Datasets. We compare models trained on variants of DAIT curated from different-sized pools of RPJv1 data down to 60B tokens. All models were trained for 60B tokens. ± denotes standard error of the mean across replicates when replicates were trained (see Appendix: Models, Training, and Infrastructure).
We note that in addition to meticulous heuristic filtering, DCLM and FineWeb-Edu both perform a model-based filtering step (see Comparing to Other Datasets) that removes 90% and 92% of the data, respectively. These translate to pool sizes that are 10x (DCLM) and 11.5x (FineWeb-Edu) the curated dataset size, whereas our 180B token dataset is curated from a pool size of 540B tokens, which is only 3x the curated dataset size. While our pool size was limited by the size of RPJv1 after exact deduplication (774B tokens), we believe that the quality of the results we obtained using a much smaller pool size and lower quality base dataset speak to the quality of our curation. We also expect the benefits of our curation to grow with the pool size.
3.2.6. The Diminishing Returns of Training on Minimally-Curated Data
We showed that our curation substantially improves the quality of models trained for a fixed compute budget. We now show that attempting to use less-curated data to match the accuracy improvements obtained when training on better-curated data—especially DatologyAI’s curated data—is very expensive, and in some cases impossible.
We attempted to match the performance of a 1.3B parameter model trained on our DAIT for 60B tokens (53.1% mean 5-shot accuracy, 50.4% mean 0-shot accuracy) by training a 1.3B parameter model longer on the base RPJv1. However, even after training for 600B tokens (10x longer) on RPJv1, the mean 5-shot and 0-shot performance were 51.1% and 48.1%, respectively, which are 2.0pp worse (5-shot) and 2.3pp worse (0-shot) than that of the model trained on DAIT (Table 4). Curation raises the ceiling on achievable gains, even when allowing for drastically larger compute.
Table 4: Shattering the Quality Ceiling with Data Curation. Training for 60B tokens on DAIT 600B→60B (one of our RPJv1 curations; see Methods: Curating RPJv1 into DAIT for details) substantially outperforms training for 600B tokens on base RPJv1 on 5-shot and 0-shot evaluations (see Appendix: How We Evaluate). All models are 1.3B parameters. ± denotes standard error of the mean across replicates when replicates were trained (see Appendix: Models, Training, and Infrastructure).
3.2.7. How Much Does One Additional Point of Accuracy Cost?
The diminishing returns from extended model training on poorly curated data result in a high marginal cost for model quality improvements. To quantify this, we calculate the marginal accuracy point cost by establishing a reference accuracy. This reference point is a 1.3B parameter model’s performance at the first checkpoint of training on RPJv1 (41.8% mean 5-shot accuracy at ~6.5B tokens). We then define 1x as the training budget needed to reach this reference accuracy.
Figure 3: Data Curation Reduces the Marginal Cost of Accuracy Improvements. We show the marginal 5-shot accuracy improvement (y-axis) as a function of compute cost multiple (x-axis) relative to training on RedPajama-v1 to reach 41.8% accuracy. Accuracy gains are much cheaper when using curated data compared to the baselines. For example, using 5x as much compute yields approximately a 4.6pp improvement when training on RPJv1, while it yields a 10.7pp improvement when training on DatologyAI’s curated RPJv1. All models are 1.3B parameters.
Figure 3 shows the relative cost of each marginal accuracy point when training models with and without DatologyAI curation. Using 5-shot performance as an example, a 4.6pp mean accuracy improvement from 41.8% to 46.4% requires approximately 5x the compute of reaching 41.8% mean 5-shot accuracy when training on the base RPJv1 data. In contrast, investing the same 5x compute when training on DAIT 600B→60B yields an improvement of 10.7pp in model quality over the baseline, an improvement that’s unachievable when training on the base RPJv1. These findings underscore the substantial efficiency gains achieved by using DatologyAI's curated data compared to raw datasets.
Curation not only significantly reduces the cost per unit of accuracy but also unlocks performance levels unattainable with raw data alone—highlighting the pivotal role of targeted dataset curation in advancing model quality.
3.3. Data Curation Accelerates Model Training
Thus far, we have focused on demonstrating how our curation improves model quality. We now shift our attention to how our curation saves compute by accelerating training. We measure compute savings by first training a baseline model for some fixed number of tokens, then comparing the number of DatologyAI-curated tokens required to reach the baseline model’s accuracy.
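As a worked illustration of this accounting (an arithmetic sketch, not code from our pipeline), the token-matched training speedup and the corresponding compute savings are two views of the same ratio:

```python
# Worked example of the speedup/savings arithmetic used throughout this section.
def speedup_and_savings(baseline_tokens: float, curated_tokens: float) -> tuple[float, float]:
    """Token-matched training speedup and the equivalent fractional compute savings."""
    speedup = baseline_tokens / curated_tokens
    savings = 1.0 - curated_tokens / baseline_tokens
    return speedup, savings

# If a baseline needs 180B tokens to reach an accuracy that curated data reaches in
# ~23.4B tokens (an illustrative number consistent with the ~7.7x speedup reported below),
# that corresponds to roughly 87% less compute.
speedup, savings = speedup_and_savings(180e9, 23.4e9)
print(f"{speedup:.1f}x speedup, {savings:.1%} compute saved")
```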
Figure 4: Data Curation Accelerates Model Training. We plot mean accuracy across a set of 15 evaluation datasets (see Appendix: How We Evaluate) on the y-axis as a function of training tokens on the x-axis for a set of baseline datasets and one of our curated versions of RPJv1, DAIT 540B→180B (540B→180B indicates we started with a pool of 540B RPJv1 tokens and curated it down to 180B tokens; see Methods: Curating RPJv1 into DAIT for details). You can select between 0-shot and 5-shot evaluation using the dropdown menu at the top of the plot. We also include a more detailed plot in the Appendix that contains the full set of results for every model we trained.
Compared to baselines trained on 180B tokens, our pipeline reduces compute requirements to reach baseline average 5-shot accuracy by 86.9% (7.7x training speedup) relative to RPJv1, by 78.2% relative to FineWeb-Edu (4.6x training speedup), and by 70.6% relative to DCLM (3.4x training speedup). We direct readers to Figure 4 for a complete comparison of speedups for 2.7B models trained on 180B token datasets.
For 0-shot evaluation, our pipeline reduces compute requirements to reach baseline average 0-shot accuracy by 88.9% (9.1x training speedup) relative to RPJv1, by 70.6% relative to FineWeb-Edu (3.4x training speedup), and by 64.9% relative to DCLM (2.8x training speedup).
We observe a similar range of training speedups for smaller models and/or training budgets, and direct the reader to Figure A1 for comprehensive results for every model, training budget, and task.
These reductions in computational cost underscore the value of curated data for rapid iteration in large-scale model development.
3.4. Datology vs. Goliath: Less-Large Language Models via Data Curation
Most published scaling laws focus solely on estimating changes in model quality as a function of parameter count and training data and ignore the cost of inference. However, Sardana et al., 2024 introduce inference-aware scaling laws that reveal that the optimal model size can be significantly smaller than what is predicted by conventional, inference-agnostic approaches. This is particularly relevant to deployed industrial settings, where inference cost can dominate the economics of the model lifecycle. Thankfully, the improved training efficiency unlocked by data curation can also be applied to improve the quality of smaller models, making them viable replacements for larger models in deployment and yielding substantial inference savings.
We measure inference efficiency using theoretical model FLOPs, which we multiply by 3 when discussing training results to account for the backward pass. We note the shortcomings of relying solely on FLOPs for quantifying efficiency (Dehghani et al., 2021), and consider these FLOPs-based estimates an approximate upper bound.
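For readers who want to reproduce the ballpark numbers, the sketch below uses the common approximation of roughly 2 FLOPs per parameter per token for a forward pass, with the x3 multiplier mentioned above for training. This is a standard rule of thumb rather than necessarily the exact accounting used in our analysis:

```python
# Rough FLOPs accounting using the common ~2 * params FLOPs-per-token forward-pass
# approximation; training multiplies by 3 to include the backward pass, as described above.
def inference_flops_per_token(n_params: float) -> float:
    return 2 * n_params

def training_flops(n_params: float, n_tokens: float) -> float:
    return 3 * inference_flops_per_token(n_params) * n_tokens

# Per-query inference cost of a 2.7B model relative to a 1.3B model: ~2.1x.
print(inference_flops_per_token(2.7e9) / inference_flops_per_token(1.3e9))

# Approximate training FLOPs for a 2.7B model on 180B tokens: ~2.9e21.
print(training_flops(2.7e9, 180e9))
```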
Figure 5: Reduce Inference Costs and Improve Performance by Training Smaller Models. We plot mean 5-shot accuracy (y-axis) as a function of model parameters (x-axis) for 2.7B parameter models trained using a set of baseline datasets, and a 1.3B parameter model trained on one of our curated versions of RPJv1, DAIT 540B→180B (540B→180B indicates we started with a pool of 540B RPJv1 tokens and curated it down to 180B tokens; see Methods: Curating RPJv1 into DAIT for details). All models are trained for 180B tokens.
3.4.1. Check My Math: 1.3 > 2.7?
We trained a 1.3B parameter model on 180B tokens of DAIT 540B→180B and compared it to the performance of 2.7B parameter models trained for an equivalent FLOPs budget on other baseline datasets. The 1.3B model trained on DAIT reaches 56.5% (±0.2) mean 5-shot accuracy, outperforming all baselines, and 1.9pp above the best-performing 2.7B model, trained on DCLM (54.6%). The 1.3B model trained on DAIT also outperforms all other baselines on FLOPs-matched mean 0-shot accuracy (52.9%±0.3), and is 0.6pp above DCLM (52.3%±0.1). The 1.3B model trained on DAIT even reaches higher mean 5-shot accuracy than all 2.7B baselines when compared per training token (DAIT 1.3B: 56.5% ±0.2; DCLM 2.7B: 56.1%±0.1; see Figure 5 for token-matched comparison, see Figure A2 for FLOPs-matched comparison).
These results demonstrate the promise of high quality data curation for shrinking models: smaller models can satisfy more demand at the same cost, reduce latency, and unlock new use cases such as on-device computation or real-time operation.
4. Wrapping Up
In this work, we demonstrated that DatologyAI's state-of-the-art data curation pipeline dramatically improves the efficiency of LLM training across multiple dimensions:
Better Models: When training 2.7B parameter MPT-style transformers for the largest compute budget (180B tokens), training on our curated version of RedPajama v1, DAIT, improves over the base RedPajama v1 by 8.4pp on average 5-shot accuracy across 15 evaluation tasks. We also compared against a suite of high-quality public pretraining datasets and found that DAIT outperforms all of them, including DCLM by 4.4pp and FineWeb-Edu by 6.1pp. We also tie or outcompete even the strongest baseline, DCLM, on two thirds or more of the evaluation datasets, and match or outperform the other baselines on nearly all evaluations.
Faster Training: Our curation enables substantial reductions in training compute requirements: Training a 2.7B parameter model on DAIT saves 86.9% on compute (7.7x training speedup) to reach the accuracy of a 2.7B model trained on RedPajama v1 for 180B tokens, and saves 69.4% on compute (3.3x training speedup) to reach the accuracy of a 2.7B model trained on DCLM.
Smaller, Better Models: Our curation enables training of smaller models that are both better and more efficient: We train 1.3B parameter models on our curated data, which leads to models that reduce the cost per query by up to 2.1x and have better training FLOPs-matched accuracy than every 2.7B model we trained on public datasets.
Our curation pipeline takes RedPajama v1, one of the consistently lowest-performing of the six public datasets we evaluated, and transforms it into a dataset that substantially outperforms the best publicly-available pretraining datasets.
4.1. What’s Next
These results extend the success of our multimodal image-text curation to text data, further emphasizing that data curation can be a powerful tool for improving the efficiency of foundation model training and deployment.
However, this work is just the start—expect us to keep improving these numbers, broadening our scope, and exploring the nitty-gritty of data curation.
4.2. Get in Touch!
If you’re interested in pushing the bounds of what’s possible with data curation, we’re looking for talented Members of Technical Staff who have experience doing data research, building research tooling, translating science into products, and building scalable data products.
We’re starting to work with early customers. If you’re an enterprise AI company interested in training multimodal and/or text models faster, better, or smaller, sign up for our customer waitlist!
Follow us on twitter for insights (and memes) about data!
Contributors
Core Contributors
Aldo Carranza
Alvin Deng
Pratyush Maini
Muhammed Razzak
Jack Urbanek
Contributors
Amro Abbas
Paul Burstein
Ning Cao
Priya Goyal
Josh McGrath
Fan Pan
Josh Wills
Haoli Yin
Interns
Vineeth Kada
Muhammed Razzak
Vishwa Shah
Vishruth Veerendranath
Leadership and Advising
Matthew Leavitt
Bogdan Gaza
Ari Morcos
For attribution in academic contexts, please cite this work as
Carranza et al., "DatologyAI Technical Deep-Dive: Curating Our Way to a State-of-the-Art Text Dataset", 2024.
BibTeX citation
@techreport{carranza_datologyai_2024,
title = {{DatologyAI} {Technical} {Deep}-{Dive}: {Curating} {Our} {Way} to a {State}-of-the-{Art} {Text} {Dataset}},
url = {https://www.datologyai.com/post/technical-deep-dive-curating-our-way-to-a-state-of-the-art-text-dataset},
institution = {DatologyAI},
author = {Carranza, Aldo and Deng, Alvin and Maini, Pratyush and Razzak, Muhammed and Urbanek, Jack and Abbas, Amro and Burstein, Paul and Cao, Ning and Goyal, Priya and McGrath, Joshua and Pan, Fan and Wills, Josh and Yin, Haoli and Kada, Vineeth and Shah, Vishwa and Veerendranath, Vishruth and Gaza, Bogdan and Morcos, Ari and Leavitt, Matthew},
month = nov,
year = {2024},
}
A. Appendix
A.1. Supplementary Results
A.1.1. The Big Plot
Figure A1: Comprehensive results plot (aka The Big Plot). Here we enable viewing the results for all evaluations, model sizes, pool sizes, and training datasets, each of which you can control using the dropdown menus at the top of the plot.
A.1.2. The Big Table
Table A1: The Big Table. We show average 0-shot and 5-shot results for every evaluation dataset for 2.7B models trained for 180B tokens (see Appendix: How we Evaluate). ± denotes standard error of the mean across replicates when replicates were trained (see Appendix: Models, Training, and Infrastructure).
A.1.3. Training FLOPs Comparison of 1.3B DAIT vs. 2.7B Baselines
Figure A2: Data Curation Enables Training Better, Smaller Models. We plot model accuracy (y-axis) as a function of theoretical training FLOPs (x-axis) for 2.7B parameter models trained using a set of baseline datasets, and a 1.3B parameter model trained on one of our curated versions of RPJv1, DAIT 540B→180B (540B→180B indicates we started with a pool of 540B RPJv1 tokens and curated it down to 180B tokens). Training on DAIT outperforms all baselines on a per-FLOP basis.
A.2. Meet the Family: A Deeper Dive on our Curation Algorithms
A.2.1. Lexical Deduplication
Lexical deduplication is a common and straightforward data curation practice, though the impact of training on duplicates varies based on the quality of the duplicated data. Many works have demonstrated that duplicate training examples increase training time and can hurt performance (Lee et al., 2021), and can also lead to memorization (Carlini et al., 2022). In contrast, FineWeb (Penedo et al., 2024) found that deduplication within Common Crawl dumps was effective, but deduplication across dumps was not. Likewise, DataComp-LM’s (Li et al., 2024) highly-filtered DCLM-baseline dataset contains roughly 4 trillion tokens, and the authors found no benefit from exact deduplication even though they identified roughly 50% of those tokens as exact duplicates. We pursue exact deduplication both to reduce the costs of downstream curation steps and to have more direct control over repetition in our dataset.
Common methods to detect text duplicates include comparing full hashes of the raw documents, finding exact n-gram matches, and using approximate hashes like MinHash (Broder, 1997) or Bloom filters (Bloom, 1970). We employ a full-document SHA-512 hash alongside intra-document n-gram removal in the exact deduplication stage.
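To make the exact-deduplication step concrete, here is a heavily simplified sketch that hashes the UTF-8 encoded full text of each document with SHA-512 and keeps only the first occurrence; the n-gram removal described above is a separate pass and is omitted here:

```python
import hashlib

def exact_dedup(documents):
    """Keep the first occurrence of each exact-duplicate document,
    keyed by a SHA-512 hash of the UTF-8 encoded full text."""
    seen = set()
    for doc in documents:
        digest = hashlib.sha512(doc["text"].encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            yield doc

docs = [{"text": "the same page"}, {"text": "the same page"}, {"text": "a new page"}]
print(len(list(exact_dedup(docs))))  # 2
```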
A.2.2. Heuristic Filtering
There are many well-established heuristic filtering methods commonly used in the curation of web-scale text data (Rae et al., 2021; Albalak et al., 2024; Sharma et al., 2024). These methods typically remove low-quality or uninformative text by discarding documents that exhibit low-quality linguistic or lexical characteristics. For example, they remove documents that are too short or too long, have unusual average word lengths, contain excessive symbols like hashtags and ellipses, or lack common English stop words. Such heuristics effectively filter out content like automatically generated text and incoherent social media posts.
Another important heuristic filter involves repetition removal, which targets documents with excessive repetition of words or phrases, a common indicator of low-quality content. By calculating the proportion of repeated lines, paragraphs, and/or n-grams—and filtering out documents exceeding specific thresholds (e.g., a duplicate line fraction over 30%)—we enhance dataset quality and mitigate the tendency of language models to produce repetitive text.
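As an illustration, a minimal version of the duplicate-line-fraction filter described above might look like the following; the 30% threshold is the example value from the text above, not necessarily a production setting:

```python
def duplicate_line_fraction(text: str) -> float:
    """Fraction of (non-empty) lines that are repeats of an earlier line."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines:
        return 0.0
    return 1.0 - len(set(lines)) / len(lines)

def passes_repetition_filter(text: str, max_dup_line_frac: float = 0.30) -> bool:
    """Keep documents whose duplicate line fraction is at or below the threshold."""
    return duplicate_line_fraction(text) <= max_dup_line_frac

print(passes_repetition_filter("a\nb\nc\na"))        # True  (25% of lines duplicated)
print(passes_repetition_filter("spam\nspam\nspam"))  # False (~67% of lines duplicated)
```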
Although these heuristic filtering methods are commonly used as the first line of text curation, when we applied a wide range of heuristic filters to the RedPajama dataset, we found they had little effect, likely due to the baseline level of curation already applied to the dataset.
A.2.3. Model-Based Filtering
Heuristic filters, which are developed and selected through intuitive reasoning, can yield powerful insights when applied effectively, but recent results have shown that humans are often worse than random at guessing which examples a model will learn well from (Li et al., 2024). Moreover, hand-crafted filtering rules are a brittle foundation upon which to build a tool that works across diverse domains. Privacy concerns may also prevent accessing customer data directly, making it even more difficult to implement filtering methods that rely on manual inspection and trial-and-error. Using an existing model to filter data for training subsequent models is a more reliable approach, and is standard practice for identifying high-quality data points, for both multimodal (Gadre et al., 2024; Schuhmann et al., 2022; Zhu et al., 2024; Awadalla et al., 2024; Fang et al., 2023; Xu et al., 2023; Maini et al., 2023; Mahmoud et al., 2023) and text (Wenzek et al., 2019; Marion et al., 2023; Li et al., 2024) datasets. Models used for filtering are typically trained on high-quality datasets or datasets aligned with specific downstream tasks, such that the filtering model retains data more relevant to the target distribution.
Model-based filtering is typically implemented in one of a few ways. Classifier-based approaches (Li et al., 2024; Soldaini et al., 2024; Li et al., 2024; the NSFW filters described in the Llama 3.2 multimodal section) utilize pre-trained or fine-tuned classification models to directly assess each data sample. These models output a score or probability indicating whether a sample meets certain criteria, effectively filtering data based on discrete predictions. Alternatively, contrastive approaches (Fang et al., 2023; Schuhmann et al., 2022; Lai et al., 2023) are trained with a contrastive loss function, which encourages the model to map similar data points closer together and dissimilar ones further apart in the embedding space. This technique is particularly common in image-caption alignment filtering for multimodal data (Schuhmann et al., 2022; Lin et al., 2025), where the goal is to measure the semantic similarity between different modalities. However, model-based score generation for filtering can be computationally expensive at scale. Recent methods have proposed using a larger model to generate training data and distilling it into a smaller, more efficient filtering model (Penedo et al., 2024; Dubey et al., 2024), or employing active learning to selectively route samples to different model sizes and improve efficiency (Zhang et al., 2024).
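To ground the classifier-based approach, here is a hedged, DCLM-style sketch that trains a fastText classifier to separate a high-quality reference corpus from random web text and keeps the highest-scoring documents. The training file, label names, and top-decile cutoff are illustrative assumptions, not our actual filtering model:

```python
# Illustrative only: a DCLM-style quality classifier, not our production filtering model.
# Requires the `fasttext` package; labels follow fastText's "__label__" convention.
import fasttext

# train.txt contains lines like:
#   __label__hq <text drawn from a high-quality reference corpus>
#   __label__lq <text drawn from random web documents>
model = fasttext.train_supervised(input="train.txt")

def quality_score(text: str) -> float:
    """Probability that a document looks like the high-quality reference data."""
    labels, probs = model.predict(text.replace("\n", " "), k=2)
    return dict(zip(labels, probs)).get("__label__hq", 0.0)

# Rank documents by score and keep roughly the top decile (as DCLM does).
docs = ["first candidate document ...", "second candidate document ..."]
ranked = sorted(docs, key=quality_score, reverse=True)
kept = ranked[: max(1, len(ranked) // 10)]
```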
A.2.4. Embedding-based Curation
Embedding-based data curation has emerged as a widely used method for improving the quality of training data (Sorscher et al., 2022; Abbas et al., 2023; Tirumala et al., 2023; Abbas et al., 2024; Vo et al., 2024). These methods transform the data into an embedding space using a pre-trained encoder model and apply different curation algorithms on the embeddings.
For example, SemDeDup (Abbas et al., 2023) exploits the high-level semantic information captured by the embedding model to identify and remove semantic duplicates. A series of works (Sorscher et al., 2022; Tirumala et al., 2023; Abbas et al., 2024) use various clustering-based metrics computed in the embedding space to determine sample difficulty or relevance. Vo et al., 2024 employ a hierarchical clustering algorithm to balance concept distributions in pretraining data, resulting in more efficient training for both text and image datasets.
Embedding-based curation provides two key advantages. The first is that it can leverage the relations between samples, instead of curating samples in isolation as many algorithms do. The second is that it reduces the need to implement modality-specific curation algorithms: once the data are embedded, curation becomes modality-agnostic. While it still requires an embedding model, the existence of many high quality, open-sourced, pretrained embedding models makes it easy to extend embedding-based curation methods to novel domains and modalities.
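A simplified SemDeDup-style pass might look like the sketch below: embed documents, cluster the embeddings, and greedily drop near-duplicates within each cluster. The cluster count and cosine-similarity threshold are illustrative placeholders rather than our production settings:

```python
# Simplified SemDeDup-style semantic deduplication (after Abbas et al., 2023).
import numpy as np
from sklearn.cluster import KMeans

def semantic_dedup(embeddings: np.ndarray, n_clusters: int = 100, threshold: float = 0.95):
    """Return indices of documents to keep after removing near-duplicates."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(normed)
    keep = []
    for c in range(n_clusters):
        idx = np.where(labels == c)[0]
        sims = normed[idx] @ normed[idx].T  # pairwise cosine similarity within the cluster
        dropped = set()
        for i in range(len(idx)):
            if i in dropped:
                continue
            keep.append(idx[i])
            # Drop later cluster members that are near-duplicates of the kept one.
            dropped.update(j for j in range(i + 1, len(idx)) if sims[i, j] >= threshold)
    return np.array(sorted(keep))
```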
A.2.5. Target Distribution-Matching
The distributional similarity between pretraining data and target data is generally recognized as a predictor of performance on the target task (Kirchenbauer et al., 2024). However, the distribution of web-crawled text data is often different from a given target task distribution. If one has access to auxiliary data drawn from a distribution similar to the target task, it can be leveraged to select data points from the pretraining corpus in a way that aligns the resulting training data distribution with the target distribution (Dai et al., 2019; Aharoni et al., 2020; Gururangan et al., 2020). This target distribution-matching approach typically operates by using an embedding model to obtain embeddings for the samples in the uncurated dataset and the samples in the target dataset(s), computing a similarity score between the embeddings of the uncurated and target data, and selecting and ranking samples from the uncurated dataset that have high similarity to the target data in a way that matches the target distribution. These retrieved samples are then upsampled during training or used to construct an entirely new dataset.
This approach is generally quite effective, although it requires having a high-quality target dataset and/or some a priori knowledge of the test distribution. The latter scenario is the norm in industrial settings, where models are trained intentionally for known tasks and applications. We found that the efficacy and efficiency of this algorithm is sensitive to a number of design elements, including the choice of target datasets, the similarity search & ranking algorithm, accounting for the density of the target distribution, and the proportion of retrieved data used in the final data mix.
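A minimal sketch of the retrieval step, under simplifying assumptions (cosine similarity to the nearest target embedding as the ranking score and a fixed retrieval fraction), looks like the following; the density correction and ranking refinements described above are omitted:

```python
# Minimal target distribution-matching sketch: rank corpus documents by similarity
# to the target data and upsample the top-ranked fraction. Sizes and the 10%
# retrieval fraction are illustrative placeholders.
import numpy as np

def rank_by_target_similarity(corpus_emb: np.ndarray, target_emb: np.ndarray) -> np.ndarray:
    """Return corpus indices sorted from most to least similar to the target data."""
    c = corpus_emb / np.linalg.norm(corpus_emb, axis=1, keepdims=True)
    t = target_emb / np.linalg.norm(target_emb, axis=1, keepdims=True)
    nearest_target_sim = (c @ t.T).max(axis=1)  # similarity to each doc's closest target sample
    return np.argsort(-nearest_target_sim)

corpus_emb = np.random.randn(1000, 384)
target_emb = np.random.randn(50, 384)
ranking = rank_by_target_similarity(corpus_emb, target_emb)
to_upsample = ranking[: len(ranking) // 10]  # e.g. upsample the closest 10% of the corpus
```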
A.2.6. Synthetic Data
Synthetic data generation, a data curation approach in which language models are used to supplement or enhance training data, has shown promise for addressing the dual challenges of data scarcity and training efficiency. In the present work we are primarily concerned with the latter. We roughly divide efficiency-oriented synthetic data approaches into two categories: generation and augmentation.
The generation approach to synthetic data creation, exemplified by projects such as Tiny Stories (Eldan and Li, 2023), the Phi series of models (Gunasekar et al., 2023; Li et al., 2023; Abdin et al., 2024), and Cosmopedia (Ben Allal et al., 2024a; Ben Allal et al., 2024b), represents a significant innovation in efficient model training. This approach leverages large language models (LLMs) as knowledge generators to create training data de novo, allowing fine-grained control over the generated content. The approach has demonstrated remarkable success, with Phi-1 serving as a notable example, achieving performance comparable to LLaMA-7B while requiring 80 times less computational resources.
However, the generation approach faces inherent limitations. The quality and scope of the synthetic data are inherently bounded by the knowledge and capabilities of the generating model. This creates a natural ceiling for knowledge transfer and can lead to faster saturation of learning. There is also evidence that models trained on these kinds of data can be brittle and lack generalization capabilities (Zhang et al., 2024).
The augmentation approach, typified by Web Rephrase Augmented Pre-training (WRAP; Maini et al., 2024), presents an alternative methodology that bridges the gap between synthetic and natural data. Unlike pure generation, this method sources its knowledge directly from the training data, augmenting it by rephrasing with different prompts such as “like wikipedia”, or “in a question and answer format appropriate for a 5th grader”.
The primary advantage of this approach lies in its ability to combine the knowledge present in the data to be rephrased with the capabilities of the model doing the rephrasing. The results from Maini et al. (2024) demonstrate a 3x reduction in compute while maintaining comprehensive knowledge coverage.
Our approach extends the augmentation paradigm, focusing on enhancing existing knowledge rather than generating it anew. This method prevents the knowledge saturation observed in generator-driven approaches while maintaining efficiency gains in training.
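To make the augmentation recipe concrete, here is an illustrative WRAP-style rephrasing loop; the generator model and prompt wording are placeholders rather than the actual model or prompts we use:

```python
# Illustrative WRAP-style rephrasing (after Maini et al., 2024). Model name and
# prompts are placeholder assumptions, not our production generator or prompt set.
from transformers import pipeline

generator = pipeline("text-generation", model="mistralai/Mistral-7B-Instruct-v0.2")

STYLES = [
    "Rewrite the following passage in the style of a Wikipedia article:",
    "Rewrite the following passage as questions and answers a 5th grader could follow:",
]

def rephrase(document: str, style: str) -> str:
    prompt = f"{style}\n\n{document}\n\nRewritten passage:"
    out = generator(prompt, max_new_tokens=512, do_sample=True, temperature=0.7)
    # The pipeline returns the prompt plus the continuation; keep only the continuation.
    return out[0]["generated_text"][len(prompt):].strip()

rephrased = [rephrase(doc, style) for doc in ["Some web document ..."] for style in STYLES]
```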
Key Insight: The choice between these paradigms represents a tradeoff between knowledge breadth and generation control. Generator-driven approaches offer more controlled outputs but face knowledge limitations, while internet-augmented methods provide broader coverage but require careful quality management.
A.2.7. Source Mixing
Optimizing the composition of training data is pivotal in the pretraining of large language models, as it directly impacts their generalization capabilities and robustness (Anil et al., 2023; Dubey et al., 2024). Source mixing, or data mixing, refers to the strategic combination of diverse datasets to achieve a balanced representation of knowledge across various domains (Soldaini et al., 2024). Recent advancements in state-of-the-art LLMs (Rae et al., 2022; Dubey et al., 2024) have underscored the importance of employing balanced mixtures of data sources—including web text, literary works, scientific publications, multilingual corpora, and code—to ensure comprehensive language understanding and generation.
Traditionally, fixed heuristic-based weighting schemes have been utilized to balance different data sources, often derived from limited experiments on smaller-scale models (Rae et al., 2022). While these methods offer a foundational approach, they may not generalize effectively to larger models due to scaling disparities. To address this limitation, recent research has introduced dynamic and theoretically grounded techniques for more effective data mixture optimization. One prominent approach involves identifying optimal mixture proportions by training smaller proxy models under the assumption of rank invariance across scales, thereby predicting performance on larger models (Xie et al., 2023; Liu et al., 2024). Another strategy focuses on developing predictive data mixing laws that model the relationship between model size, domain proportions, and validation loss, enabling performance estimation without extensive pretraining (Ye et al., 2024; Ge et al., 2024). These methods can be computationally intensive, as they require training multiple proxy models. In practice, we find that they remain effective at optimizing the pretraining data mix even when the proxy models are multiple orders of magnitude smaller than the LLM to be trained.
Alternatively, dynamic data mixing strategies (Albalak et al., 2023; Jiang et al., 2024) adapt sampling proportions across domains in real-time during training. This adaptive methodology allows the optimal data mixture to be learned within a single training run, enhancing data efficiency and reducing computational overhead.
We found that predictive data mixing laws successfully improved model quality when multi-source pretraining data were available.
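As an illustration of how a predictive mixing law can be used, the sketch below fits a simple loss-versus-mixture model of the form loss(r) ≈ c + k * exp(t · r), one of the functional forms explored in the mixing-law literature, to validation losses from small proxy runs, then selects the candidate mixture with the lowest predicted loss. This is a toy sketch under those assumptions, not our production procedure.

```python
import numpy as np
from scipy.optimize import minimize

def fit_mixing_law(proxy_mixtures: np.ndarray, proxy_losses: np.ndarray) -> np.ndarray:
    """Fit loss(r) ~ c + k * exp(t . r) to proxy-run results.

    proxy_mixtures: (n_runs, n_domains) domain proportions used in each proxy run.
    proxy_losses:   (n_runs,) validation losses observed for those runs.
    """
    n_domains = proxy_mixtures.shape[1]

    def predict(params: np.ndarray, r: np.ndarray) -> np.ndarray:
        c, k, t = params[0], params[1], params[2:]
        return c + k * np.exp(r @ t)

    def mse(params: np.ndarray) -> float:
        return float(np.mean((predict(params, proxy_mixtures) - proxy_losses) ** 2))

    init = np.concatenate([[proxy_losses.min(), 1.0], np.zeros(n_domains)])
    return minimize(mse, init, method="Nelder-Mead").x

def best_mixture(params: np.ndarray, candidates: np.ndarray) -> np.ndarray:
    # Choose the candidate mixture with the lowest predicted validation loss.
    c, k, t = params[0], params[1], params[2:]
    predicted = c + k * np.exp(candidates @ t)
    return candidates[int(np.argmin(predicted))]
```

In this setup, the fitted law plays the role of the proxy-model experiments described above: a few cheap runs parameterize the curve, and the expensive LLM is then trained only on the predicted-best mixture.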
A.3. How we Evaluate
A.3.1. Evaluation Datasets
We evaluate using LLM Foundry’s Eval Gauntlet on a suite of 15 evaluation datasets, comprising the 8 datasets used by FineWeb plus additional well-established datasets that demonstrate above-chance performance and monotonic improvement during training (the same inclusion criteria as FineWeb).
ARC-Challenge (Clark et al., 2018): ARC-Challenge is a dataset of 2,590 genuine grade-school level, multiple-choice science questions, designed to promote research in advanced question-answering. It challenges AI capabilities by requiring reasoning and comprehension beyond standard retrieval methods, featuring a Challenge Set of difficult questions and an Easy Set (description from the unitxt documentation).
ARC-Easy (Clark et al., 2018): Similar to ARC-Challenge, but consists of the 5,197 easier questions from the same corpus.
BoolQ (Clark et al., 2019): BoolQ is a question answering dataset for yes/no questions containing 15,942 examples. These questions are naturally occurring and generated in unprompted and unconstrained settings. Each example is a triplet of (question, passage, answer), with the goal to assess reading comprehension and reasoning abilities of language models (description from the tensorflow documentation).
CommonsenseQA (Talmor et al., 2018): CommonsenseQA is a multiple-choice question answering dataset that contains 12,102 questions requiring different types of commonsense knowledge to predict the correct answers. It is designed to evaluate the understanding of commonsense reasoning in models, utilizing various contextual cues to differentiate between valid and invalid responses through intricate question structures (description from Talmor et al., 2018).
COPA (Roemmele et al., 2011): The COPA dataset is designed to assess commonsense causal reasoning through 1,000 questions, where each question consists of a premise and two alternatives. The task involves selecting the alternative that has the more plausible causal relationship with the premise (description from PapersWithCode).
HellaSwag (Zellers et al., 2019): HellaSwag is a benchmark dataset for commonsense natural language inference (NLI) that challenges state-of-the-art models with context and endings that are easy for humans to understand but difficult for machines, demonstrating the limitations of current AI in commonsense reasoning. The dataset utilizes Adversarial Filtering to select difficult, machine-generated incorrect answers, providing a complex test for model performance (description from LessWrong).
LAMBADA (Paperno et al., 2016): The LAMBADA dataset evaluates the capabilities of computational models for text understanding by means of a word prediction task. It consists of narrative passages where human subjects can guess the last word if they see the entire passage but not just the last sentence prior to the target word. Thus, the dataset tests the ability of models to manage long-term context and coherence in text (description from HuggingFace).
MMLU (Hendrycks et al., 2021): The MMLU (Massive Multitask Language Understanding) benchmark is designed to evaluate the capabilities of large language models through multiple-choice questions across 57 subjects, including mathematics, history, and law. It assesses the models' performance in zero-shot and few-shot scenarios to measure their world knowledge and problem-solving skills (description from datatunnel).
OpenBookQA (Mihaylov et al., 2018): OpenBookQA is a question-answering dataset modeled after open-book exams for assessing human understanding. It includes 5,957 multiple-choice questions designed to probe the understanding of core science facts and their applications in novel situations. The dataset encourages research in advanced question answering by requiring a combination of common knowledge and fact retrieval (description from HuggingFace).
PIQA (Bisk et al., 2020): The PIQA (Physical Interaction: Question Answering) dataset evaluates models' ability to reason about physical commonsense through everyday scenarios, testing their understanding of object interactions and their physical properties. It contains about 20,000 question-answer pairs designed to challenge language models in recognizing the appropriateness of actions based on physical knowledge, focusing on both typical and atypical uses of objects (description from TensorFlow Datasets).
RACE-High (Lai et al., 2017): RACE-High is the high school subset of the RACE dataset, containing 69,395 questions from 19,527 passages collected from English examinations for Chinese high school students. It emphasizes complex reasoning and understanding of advanced texts, providing a challenging benchmark for large language models in reading comprehension (description from Lai et al., 2017).
RACE-Middle (Lai et al., 2017): RACE-Middle is the middle school subset of the RACE dataset, consisting of 28,293 questions from 8,718 passages targeted at middle school students. The dataset assesses reading comprehension and reasoning abilities through questions designed by experts, making it suitable for evaluating the understanding of intermediate-level texts by language models (description from Papers with Code).
SciQ (Welbl et al., 2017): The SciQ dataset contains 13,679 crowdsourced science exam questions covering topics such as physics, chemistry, and biology. The questions are in multiple-choice format with four options each, and most include a supporting paragraph with evidence for the correct answer. The dataset aims to improve AI systems' performance on science question answering by providing high-quality, diverse questions (description from Welbl et al., 2017).
SIQA (Sap et al., 2019): SIQA (Social IQA) is a question-answering benchmark for testing social commonsense intelligence. It contains over 38,000 multiple-choice questions about everyday social interactions, focusing on reasoning about people's motivations, reactions, and emotional states. The dataset requires models to understand and predict social dynamics and implications, making it a valuable resource for evaluating AI understanding of social situations (description from Papers with Code).
WinoGrande (Sakaguchi et al., 2020): WinoGrande is a large-scale dataset containing 44,000 problems inspired by the Winograd Schema Challenge, designed to evaluate commonsense reasoning in AI systems. It addresses biases in previous datasets by employing an adversarial filtering algorithm and crowdsourcing methods to enhance difficulty and reduce annotation artifacts. Each problem requires selecting the correct referent in an ambiguous sentence, testing a model's ability to understand nuanced context and commonsense knowledge (description from HuggingFace).
A.3.2. Evaluation Schema
We evaluate using 0-shot and 5-shot prompting for all evaluation datasets, and present the results for each n-shot separately.
We evaluate multiple choice question (MCQ) tasks using a relative scoring method, as described in the HuggingFace investigation of the OpenLLM leaderboard. Briefly, we compare the probabilities the model assigns to each of the valid responses. For example, if the dataset allows four possible responses, “A”, “B”, “C”, and “D”, we select whichever of these four options the model assigns the highest probability, rather than choosing from the model’s full output distribution.
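As a concrete sketch of this relative scoring scheme, the snippet below scores each valid answer choice by the total log-probability a causal LM assigns to its tokens given the prompt, and picks the highest-scoring choice. It assumes a HuggingFace-style model and tokenizer and ignores tokenizer boundary effects; it illustrates the scheme rather than the LLM Foundry implementation.

```python
import torch
import torch.nn.functional as F

def choice_logprob(model, tokenizer, prompt: str, choice: str) -> float:
    """Sum of log-probs the model assigns to the tokens of `choice`, conditioned on `prompt`.

    Assumes a HuggingFace-style causal LM; ignores edge cases where tokenizing
    prompt + choice splits the boundary differently than tokenizing the prompt alone.
    """
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + choice, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits              # (1, seq_len, vocab)
    log_probs = F.log_softmax(logits[0, :-1], dim=-1)  # position t predicts token t + 1
    return sum(
        log_probs[pos, full_ids[0, pos + 1]].item()
        for pos in range(prompt_len - 1, full_ids.shape[1] - 1)
    )

def pick_answer(model, tokenizer, prompt: str, choices: list[str]) -> int:
    # Restrict attention to the valid responses and take the most probable one;
    # the rest of the model's output distribution is ignored.
    scores = [choice_logprob(model, tokenizer, prompt, c) for c in choices]
    return max(range(len(choices)), key=scores.__getitem__)
```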
A.4. Models, Training, and Infrastructure
While numerous advancements in modeling and other components of training have emerged since the introduction of transformer models, the focus of this work is on the effects of data curation, so we did not attempt to optimize model quality by any means other than data curation; instead, we used standard models and training procedures.
We trained standard MPT-style 1.3B and 2.7B transformer models for our experiments. We use a custom fork of LLM Foundry for all our training and evaluation, with the MPT-1B and MPT-3B training configurations. Note that the MPT-1B model is approximately 1.3B parameters, and the MPT-3B model is approximately 2.7B parameters. We list the most relevant hyperparameters here (with a brief summary sketch after the list):
Tokenizer: EleutherAI/gpt-neox-20b
Context window: 2048
Optimizer: Decoupled AdamW (Loshchilov and Hutter, 2017)
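For quick reference, the settings above are summarized below as a plain Python dict. This is a hypothetical summary for readability only, not the actual LLM Foundry configuration format; the authoritative values are in the linked configuration files.

```python
# Hypothetical summary of the training settings stated above; the authoritative
# values live in the linked LLM Foundry configuration files.
TRAIN_SETTINGS = {
    "tokenizer": "EleutherAI/gpt-neox-20b",
    "context_window": 2048,
    "optimizer": "decoupled AdamW (Loshchilov and Hutter, 2017)",
    # Model architecture, learning-rate schedule, batch size, etc. follow the
    # MPT-1B / MPT-3B configurations referenced above.
}
```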
We direct readers to the previously linked configuration files for more detail, and note the following about our training runs:
All training runs were replicated across at least two random seeds, except for the following, which were only run once:
1.3B parameter model trained on RPJv1 for 600B tokens
1.3B parameter model trained on C4 for 180B tokens
2.7B parameter models trained on C4, RPJv1, and RefinedWeb for 180B tokens
We report accuracies ± standard error of the mean when replicates are available.
All our model training and evaluation was conducted on an AWS SageMaker HyperPod cluster of H100 SXM nodes (p5.48xlarge instances).
A.5. Building a Robust, Scalable Data Curation Pipeline
Note: This section also appears in our Technical Deep-Dive on Image-Text Curation.
While many excellent open-source projects offer a solid foundation for basic curation on small, single-machine text datasets, scaling those efforts to apply cutting-edge research to multimodal datasets presents a unique set of challenges. First, the sheer size of the datasets means that we have to run our curation algorithms using a cluster of machines working together, with all of the challenges around network IO, fault tolerance, and data consistency that emerge when we are working with distributed systems. Optimal curation performance requires us to use all of computer science, from how we lay data out on disk and object storage, to how we perform batch inference on GPUs, to how we organize and track assets and curation decisions as data flows through our systems. These different components can interact in subtle and often unexpected ways, which is precisely why we feel that a tight integration between research and engineering for data curation systems is so necessary to ensure that state-of-the-art techniques translate into practical, scalable, and efficient solutions.
Our data curation platform evolved from the requirements of our initial customers and our internal research team, focusing on scalability, portability, performance, and security. At its core, the platform is designed for Bring Your Own Cloud (BYOC) deployments, enabling seamless integration with our customers' existing infrastructure while remaining cloud-agnostic.
As we developed this stack we focused on the following high-level requirements:
Scalable: our pipeline needs to curate datasets at the scale required to train modern foundation models (billions of images / trillions of tokens).
Performant: the cost and speed of curation need to yield a substantial net savings on model training and deployment.
Portable: our data curation platform must be cloud-agnostic and must not depend on cloud resources available from only one cloud provider.
Secure: our product runs in the customer’s environment, so their data must be handled securely, with all the proper controls in place.
A.5.1. Core Infrastructure
We chose Kubernetes as our primary container orchestration platform, providing a consistent foundation across different cloud environments. This decision has proven crucial for maintaining operational consistency and enabling portable deployments across diverse customer environments.
A.5.2. Data Processing Engine
Our primary data processing framework is Spark on Kubernetes using the Spark Operator, which gives us fine-grained control over performance tuning. Specifically, it has enabled us to:
Optimize resource allocation more precisely
Fine-tune job configurations for maximum throughput
Maintain greater control over our processing infrastructure
We store our data primarily in Parquet format, optimizing for both storage efficiency and query performance. Looking ahead, we're evaluating Ray for inference workloads, which promises to better serve our machine learning deployment needs.
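To give a flavor of what a single curation pass looks like on this stack, here is a minimal PySpark sketch that reads a Parquet dataset, applies a quality-score filter, and writes the result back to Parquet. The paths, column name, and threshold are hypothetical placeholders; the real pipeline composes many such stages.

```python
from pyspark.sql import SparkSession, functions as F

# Minimal sketch of one filtering stage on the Spark + Parquet stack described above.
# The paths, column names, and threshold are hypothetical placeholders.
spark = SparkSession.builder.appName("curation-filter-stage").getOrCreate()

docs = spark.read.parquet("s3://bucket/corpus/raw/")      # hypothetical input location
kept = docs.filter(F.col("quality_score") > 0.8)          # e.g. a model-based quality score
kept.write.mode("overwrite").parquet("s3://bucket/corpus/filtered/")

print(f"kept {kept.count()} of {docs.count()} documents")
spark.stop()
```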
A.5.3. Workflow Orchestration with Flyte
A key differentiator in our stack is Flyte, which we leverage for complex workflow orchestration. Flyte has transformed how we manage our data curation pipelines by providing:
Dynamic pipeline composition, allowing us to reorder and recombine curation strategies
Strong type checking capabilities, ensuring reliable data flow between pipeline stages
Robust error handling and recovery mechanisms
The type system has proven particularly valuable, helping us catch potential issues early in the development cycle and ensuring smooth integration between different pipeline components.
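As a small illustration of what this buys us, the sketch below wires two hypothetical curation stages into a typed Flyte workflow using flytekit's task and workflow decorators. The stage names and bodies are placeholders, not our actual pipeline.

```python
from flytekit import task, workflow

@task
def heuristic_filter(input_path: str) -> str:
    # Placeholder stage: apply rule-based filters and write a new dataset,
    # returning its location.
    return input_path + "-heuristic-filtered"

@task
def model_based_filter(input_path: str, threshold: float) -> str:
    # Placeholder stage: keep documents whose classifier score exceeds `threshold`.
    return input_path + f"-model-filtered-{threshold}"

@workflow
def curation_pipeline(raw_path: str, threshold: float = 0.9) -> str:
    # Inputs and outputs are typed, so mismatches between stages are caught before
    # execution, and stages can be reordered or recombined per curation strategy.
    staged = heuristic_filter(input_path=raw_path)
    return model_based_filter(input_path=staged, threshold=threshold)
```

Because each stage declares typed inputs and outputs, recomposing the pipeline (for example, swapping the order of stages or inserting a new one) is checked before any expensive compute runs, which is exactly the property we rely on when iterating on curation strategies.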