Physics-filtered pre-training data that delivers 17% lower loss than standard web corpora at identical compute budget. We use information theory to mathematically separate signal from noise — so your model trains faster on less data.
Most LLM training corpora are built by scraping the internet and applying heuristic filters. The result: billions of tokens of SEO spam, boilerplate, duplicated content, and low-information text that your model must process before it learns anything useful.
Palladium Data takes a different approach. We treat data filtration as a physics problem — using information theory to quantify the information density of every document. The result is a corpus where every token carries measurably more signal.
We trained Qwen 2.5 (1.5B) on three datasets under identical conditions: same base model, same hyperparameters, same token budget (5M tokens), same hardware (NVIDIA A5000). Only the training data differed.
| Dataset | Final Loss | Time (hrs) | Tokens/sec |
|---|---|---|---|
| Palladium-1M | 2.206 | 0.62 | 2,244 |
| FineWeb-Edu | 2.306 | 0.64 | 2,184 |
| FineWeb (baseline) | 2.654 | 0.64 | 2,173 |
Table 1. Training results on Qwen 2.5 (1.5B) with 5M token budget. Palladium achieves 17% lower final loss than FineWeb and 4.3% lower than FineWeb-Edu under identical compute constraints.
Figure 1. Training loss over 77 steps. Palladium (gold) maintains consistently lower loss throughout training, with visibly smoother convergence than both baselines. The separation is immediate and sustained.
We evaluated all models on five standard benchmarks. At the 5M token scale, downstream performance remains stable across all datasets — no degradation from the curated corpus. This is expected: continued pre-training at this scale primarily affects loss and perplexity, while downstream benchmark shifts require orders of magnitude more data.
| Task | Base | FineWeb | FineWeb-Edu | Palladium |
|---|---|---|---|---|
| MMLU | 59.69 | 59.75 | 59.72 | 59.68 |
| ARC-Challenge | 41.38 | 40.87 | 41.47 | 41.13 |
| HellaSwag | 50.22 | 50.26 | 50.39 | 50.29 |
| Winogrande | 63.22 | 63.38 | 62.83 | 64.09 |
| PIQA | 75.52 | 75.63 | 75.46 | 75.57 |
Table 2. Downstream benchmark accuracy (%). All scores fall within normal run-to-run variance of the base model. Palladium training preserves capabilities while delivering significantly lower training loss.
Traditional data filtering relies on LLM-based "quality classifiers" — asking one neural network to judge text for another. This is circular, computationally expensive, and blind to information density. Palladium takes a first-principles approach grounded in information theory.
We compress every document using ZSTD and measure the ratio of raw size to compressed size. This ratio is an inverse proxy for Shannon entropy: the more compressible the text, the more repetition, boilerplate, or formulaic content it contains. We discard highly compressible documents.
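The idea can be sketched in a few lines. This is a minimal illustration, not our production pipeline: it uses Python's standard-library `zlib` as a stand-in for ZSTD, and the sample strings are invented for demonstration.

```python
import zlib

def compression_ratio(text: str) -> float:
    """Raw-to-compressed size ratio; a higher ratio means more redundant text."""
    raw = text.encode("utf-8")
    compressed = zlib.compress(raw, 9)  # zlib as a stdlib stand-in for ZSTD
    return len(raw) / len(compressed)

# Repetitive boilerplate compresses dramatically; varied prose barely shrinks.
boilerplate = "Click here to subscribe to our free newsletter! " * 40
prose = ("Entropy coding exploits statistical redundancy: the fewer "
         "repeated patterns a document contains, the less it shrinks.")
```

Running `compression_ratio` on the two samples shows the gap clearly: the boilerplate compresses many times over, while the prose stays close to its raw size.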
Entropy alone is insufficient — random noise has high entropy but zero informational value. We cross-reference with linguistic sophistication: vocabulary diversity, sentence complexity, and information yield per token. Documents must clear both thresholds.
The intersection of high entropy and high sophistication defines the "Goldilocks Zone" — roughly the top 10% of the open web by information density. Everything outside this zone is noise. The survivors constitute the Palladium corpus.
For a document d, the compression ratio ρ(d) is defined as |d|raw / |d|compressed. Because ρ is an inverse proxy for entropy, documents with ρ below an empirically determined threshold τ contain sufficient novelty per token to justify inclusion in the training corpus. The Palladium corpus has mean ρ = 2.32.
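Putting the two filters together, the Goldilocks Zone test looks roughly like the sketch below. The threshold value, word list, and sample documents are all illustrative assumptions (Palladium's real τ is empirically tuned and undisclosed), and `zlib` again stands in for ZSTD.

```python
import re
import zlib

TAU = 3.0  # hypothetical entropy ceiling: reject documents compressing 3x or more
FUNCTION_WORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "that", "for"}

def rho(text: str) -> float:
    """Compression ratio: lower rho means higher entropy per byte."""
    raw = text.encode("utf-8")
    return len(raw) / len(zlib.compress(raw, 9))

def is_fluent(text: str) -> bool:
    """Crude linguistic check: real prose contains function words; noise does not."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return any(t in FUNCTION_WORDS for t in tokens)

def in_goldilocks_zone(text: str) -> bool:
    # Keep only documents that are both hard to compress AND linguistically structured.
    return rho(text) < TAU and is_fluent(text)

boilerplate = "Subscribe now for free updates! " * 50  # low entropy: rejected
noise = "qzx wvk jpl rtm bnq"                          # high entropy, no structure: rejected
prose = ("Compression ratio falls as redundancy disappears, which is "
         "why dense technical prose survives the filter.")
```

Only the prose sample clears both thresholds; the boilerplate fails the entropy test and the noise fails the fluency test.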
Palladium is designed for teams where compute efficiency is a binding constraint.
Training 1B–7B parameter models where every token matters. Match larger-dataset performance with a fraction of the data.
Building models for finance, legal, science, or engineering. Start from a high-quality general foundation, not raw web crawl.
Need a clean, dense knowledge base without boilerplate? Palladium documents are pre-filtered for information density.
Browse 10,000 sample documents on HuggingFace. No signup required.