Pre-training Data Infrastructure

Train smarter, not bigger.

Physics-filtered pre-training data that delivers 17% lower loss than standard web corpora at an identical compute budget. We use information theory to separate signal from noise mathematically, so your model trains faster on less data.


§1 The Problem

Most LLM training corpora are built by scraping the internet and applying heuristic filters. The result: billions of tokens of SEO spam, boilerplate, duplicated content, and low-information text that your model must process before it learns anything useful.

You are burning GPU hours teaching your model to memorize cookie banners.

Palladium Data takes a different approach. We treat data filtration as a physics problem — using information theory to quantify the information density of every document. The result is a corpus where every token carries measurably more signal.


§2 Results

We trained Qwen 2.5 (1.5B) on three datasets under identical conditions: the same base model, the same hyperparameters, the same token budget (5M tokens), and the same hardware (NVIDIA A5000). Only the training data changed.

17% lower loss vs. FineWeb
4.3% lower loss vs. FineWeb-Edu
~1M curated documents
2.32× compression ratio

2.1 Training Loss

Dataset             Final Loss   Time (hrs)   Tokens/sec
Palladium-1M        2.206        0.62         2,244
FineWeb-Edu         2.306        0.64         2,184
FineWeb (baseline)  2.654        0.64         2,173

Table 1. Training results on Qwen 2.5 (1.5B) with 5M token budget. Palladium achieves 17% lower final loss than FineWeb and 4.3% lower than FineWeb-Edu under identical compute constraints.
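The headline percentages follow directly from the final-loss column of Table 1; a quick arithmetic check:

```python
# Final losses from Table 1.
final_loss = {"Palladium-1M": 2.206, "FineWeb-Edu": 2.306, "FineWeb": 2.654}

def improvement(ours: float, baseline: float) -> float:
    """Relative loss reduction versus a baseline, in percent."""
    return 100.0 * (baseline - ours) / baseline

print(round(improvement(final_loss["Palladium-1M"], final_loss["FineWeb"]), 1))      # 16.9 (~17%)
print(round(improvement(final_loss["Palladium-1M"], final_loss["FineWeb-Edu"]), 1))  # 4.3
```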

[Figure: Training loss curves for Palladium, FineWeb, and FineWeb-Edu]

Figure 1. Training loss over 77 steps. Palladium (gold) maintains consistently lower loss throughout training, with visibly smoother convergence than both baselines. The separation is immediate and sustained.

2.2 Downstream Evaluation

We evaluated all models on five standard benchmarks. At the 5M token scale, downstream performance remains stable across all datasets — no degradation from the curated corpus. This is expected: continued pre-training at this scale primarily affects loss and perplexity, while downstream benchmark shifts require orders of magnitude more data.

Task Base FineWeb FineWeb-Edu Palladium
MMLU 59.69 59.75 59.72 59.68
ARC-Challenge 41.38 40.87 41.47 41.13
HellaSwag 50.22 50.26 50.39 50.29
Winogrande 63.22 63.38 62.83 64.09
PIQA 75.52 75.63 75.46 75.57

Table 2. Downstream benchmark accuracy (%). All scores within standard variance of the base model. Palladium training preserves capabilities while delivering significantly lower training loss.

The key result is training efficiency: identical downstream performance with 17% lower loss means your model is extracting more information per token. At larger token budgets, this efficiency advantage compounds.

§3 Methodology

Traditional data filtering relies on LLM-based "quality classifiers" — asking one neural network to judge text for another. This is circular, computationally expensive, and blind to information density. Palladium takes a first-principles approach grounded in information theory.

Step 1: Entropy Measurement

We compress every document using ZSTD and measure the ratio of raw size to compressed size — a direct proxy for Shannon entropy. Highly compressible text indicates repetition, boilerplate, or formulaic content. We discard it.
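As a rough sketch of this step, the ratio can be computed with any general-purpose compressor. The production pipeline uses ZSTD; zlib stands in here so the example needs only the Python standard library:

```python
import zlib

def compression_ratio(text: str) -> float:
    """Raw size over compressed size: a proxy for (inverse) Shannon entropy.
    High ratios mean the text compresses well, i.e. it is repetitive or
    formulaic. zlib is a stand-in for the ZSTD compressor used in production.
    """
    raw = text.encode("utf-8")
    if not raw:
        return 0.0
    return len(raw) / len(zlib.compress(raw, level=9))

boilerplate = "This site uses cookies. Accept all cookies to continue. " * 30
prose = ("Thermodynamic entropy and Shannon entropy share a logarithmic form, "
         "yet one quantifies disorder in physical microstates and the other "
         "uncertainty in a stream of symbols.")

print(compression_ratio(boilerplate))  # high: repetitive, little new information
print(compression_ratio(prose))        # low: each byte carries more novelty
```

Repetitive boilerplate compresses dramatically, while information-dense prose resists compression; the filter exploits exactly that gap.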

Step 2: Sophistication Scoring

Entropy alone is insufficient — random noise has high entropy but zero informational value. We cross-reference with linguistic sophistication: vocabulary diversity, sentence complexity, and information yield per token. Documents must clear both thresholds.
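A toy version of the sophistication check might combine a type-token ratio (vocabulary diversity) with a sentence-length proxy (complexity). The real feature set and weights are not specified here, so treat every choice below as an illustrative assumption:

```python
import re

def sophistication(text: str) -> float:
    """Toy sophistication score in [0, 1]: vocabulary diversity scaled by a
    capped sentence-length factor. Purely illustrative; the production
    features and weights are assumptions, not the actual scorer.
    """
    tokens = re.findall(r"[A-Za-z']+", text.lower())
    if not tokens:
        return 0.0
    ttr = len(set(tokens)) / len(tokens)                 # vocabulary diversity
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    mean_len = len(tokens) / max(len(sentences), 1)      # complexity proxy
    return ttr * min(mean_len / 20.0, 1.0)               # cap length's influence

junk = "xk qj zz " * 20
prose = ("Entropy measures uncertainty in a symbol stream; sophistication asks "
         "whether that uncertainty encodes meaning a reader can extract.")

print(sophistication(junk))   # low: tiny vocabulary, no real structure
print(sophistication(prose))  # high: diverse vocabulary, full sentences
```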

Step 3: The Goldilocks Filter

The intersection of high entropy and high sophistication defines the "Goldilocks Zone" — roughly the top 10% of the open web by information density. Everything outside this zone is noise. The survivors constitute the Palladium corpus.
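Combining the two signals, the filter reduces to a band test on the compression ratio plus a floor on sophistication. The thresholds below are hypothetical placeholders; the real cutoffs are empirical and not published:

```python
# Hypothetical thresholds, chosen only to illustrate the mechanism.
RHO_LOW, RHO_HIGH = 1.5, 4.0   # admissible band of raw/compressed ratio
SOPH_MIN = 0.3                 # minimum sophistication score

def in_goldilocks_zone(rho: float, soph: float) -> bool:
    """A document survives only if it sits inside the entropy band
    (neither incompressible noise nor redundant boilerplate) AND
    clears the sophistication floor."""
    return RHO_LOW < rho < RHO_HIGH and soph >= SOPH_MIN

corpus = [
    ("seo-spam",   8.1, 0.40),  # too compressible: boilerplate, discarded
    ("rand-noise", 1.1, 0.05),  # high entropy but meaningless, discarded
    ("textbook",   2.3, 0.70),  # Goldilocks Zone: kept
]
kept = [name for name, rho, soph in corpus if in_goldilocks_zone(rho, soph)]
print(kept)  # ['textbook']
```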

Definition 1 (Information Density)

For a document d, the information density ρ(d) is defined as the ratio |d|_raw / |d|_compressed. Documents whose ρ falls inside an empirically determined band (τ_low < ρ < τ_high) are neither incompressible noise nor redundant boilerplate: they carry sufficient novelty per token to justify inclusion in the training corpus. The Palladium corpus has mean ρ = 2.32.


§4 Use Cases

Palladium is designed for teams where compute efficiency is a binding constraint.

Pre-training

Small Model Teams

Training 1B–7B parameter models where every token matters. Match larger-dataset performance with a fraction of the data.

Fine-tuning

Domain Specialization

Building models for finance, legal, science, or engineering. Start from a high-quality general foundation, not raw web crawl.

Retrieval

RAG & Knowledge Bases

Need a clean, dense knowledge base without boilerplate? Palladium documents are pre-filtered for information density.


See the data for yourself.

Browse 10,000 sample documents on HuggingFace. No signup required.