Custom Serialization for Parameter-Golf
I recently got nerd-sniped by OpenAI's parameter-golf challenge: train the best language model that fits in a 16MB artifact and trains in 10 minutes on an 8xH100 machine. I came into this not knowing much about how models are trained or how their weights are stored, so working through it was a fun learning exercise. However, after experimenting with a few simple ideas (e.g. adding a small weight decay to the embedding table, increasing the learning rate) on a toy model that ran on my MacBook, I quickly found that approaches that worked well locally did not necessarily translate to improvements in model performance on an 8xH100 machine. Without much confidence that my ideas would pan out, I quickly burned through the competition's free $25 in compute credits and didn't want to invest my own money.
So instead, I pivoted to improving the serialization / compression layer. A smaller artifact doesn't directly improve BPB, but better lossless compression frees up artifact space for things that do (e.g. a wider MLP). And since the compression is lossless, it's a strictly better technique for trimming artifact size than lossy approaches like pruning: anyone working on a model that's slightly over the 16MB limit could drop in custom serialization to "buy back" space for free. The main advantage of this pivot, though, was that I could iterate locally without needing access to expensive GPUs.
What I built
I replaced torch.save + zstd-22 (the standard approach used by most submissions) with a custom binary format using Asymmetric Numeral Systems (ANS) entropy coding. The result: a 2.34% reduction in compressed size (~363KB saved), with zero loss in model accuracy.
| Method | Compressed bytes | vs Baseline |
|---|---|---|
| Baseline (torch.save + zstd-22) | 15,513,031 | — |
| Custom Serialization | 15,150,085 | -362,946 (-2.34%) |
Why this works
torch.save + generic compressors like zstd treat the serialized blob as an opaque byte stream. Since we know the model format, we can do better:
- Known value alphabets: Int6 weights use only 64 possible symbols ([-32, 31]), int8 embeddings use 256. ANS encodes directly against the true symbol distribution, coming within a few bits of the entropy floor, whereas generic compressors must discover this structure implicitly through LZ77 pattern matching.
- Row-level distribution structure: Rows within the same layer type share similar value distributions. K-means clustering (K=16) on row frequency histograms produces shared ANS probability models that adapt to different weight patterns.
- Dtype-aware stream separation: Splitting int8, fp16, and fp32 into independent streams and applying dtype-specific transforms (zigzag encoding for signed integers, byte-shuffling to group fp16 exponent bytes) makes each stream more compressible than the interleaved pickle format.
- No pickle overhead: torch.save uses pickle with per-tensor framing, ZIP containers, and metadata (~100KB). Our format stores a compact LZMA-compressed JSON header followed by length-prefixed compressed streams.
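To make the "known value alphabets" point concrete, here is a toy range-ANS (rANS) coder for a small signed alphabet. This is an illustrative sketch, not the submission's code: it uses Python's arbitrary-precision integers to sidestep renormalization, which a real implementation handles with a bounded state and byte-wise output.

```python
def build_cum(freqs):
    """Cumulative frequency table; every coded symbol needs freqs[s] > 0."""
    cum, total = {}, 0
    for s, f in freqs.items():
        cum[s] = total
        total += f
    return cum, total  # total == M, the sum of all frequencies

def rans_encode(symbols, freqs):
    """Fold symbols into one big integer; rarer symbols cost more bits.
    Encoding runs in reverse so decoding pops symbols in forward order."""
    cum, M = build_cum(freqs)
    x = 1
    for s in reversed(symbols):
        f = freqs[s]
        x = (x // f) * M + cum[s] + (x % f)
    return x

def rans_decode(x, n, freqs):
    cum, M = build_cum(freqs)
    # slot -> symbol lookup; fine for small alphabets like int6's 64 symbols
    slot_to_sym = [s for s, f in freqs.items() for _ in range(f)]
    out = []
    for _ in range(n):
        slot = x % M
        s = slot_to_sym[slot]
        x = freqs[s] * (x // M) + slot - cum[s]
        out.append(s)
    return out
```

With a skewed distribution, the encoded integer's bit length tracks the message's entropy rather than a fixed bits-per-symbol rate.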
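A sketch of the row-clustering step, assuming plain Lloyd's k-means over normalized value histograms; the distance metric and initialization here are my guesses for illustration, not necessarily what the format uses.

```python
import numpy as np

def cluster_rows(rows, k=16, iters=20, seed=0):
    """Group weight rows by value histogram so each cluster can share one
    ANS probability model (K=16 per the text)."""
    rows = [np.asarray(r, dtype=np.int64) for r in rows]
    lo = min(int(r.min()) for r in rows)
    nbins = max(int(r.max()) for r in rows) - lo + 1
    # normalized per-row frequency histograms as clustering features
    H = np.stack([np.bincount(r - lo, minlength=nbins) / r.size for r in rows])
    rng = np.random.default_rng(seed)
    centers = H[rng.choice(len(H), size=k, replace=False)]
    for _ in range(iters):
        # assign each row to its nearest center (squared Euclidean distance)
        dists = ((H[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = H[labels == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    return labels
```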
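Both dtype-specific transforms are standard tricks and easy to show directly; this is a sketch rather than the submission's code.

```python
def zigzag(v):
    """Map signed ints to unsigned so small magnitudes get small codes:
    0, -1, 1, -2, 2 -> 0, 1, 2, 3, 4."""
    return (v << 1) ^ (v >> 63)

def unzigzag(u):
    return (u >> 1) ^ -(u & 1)

def shuffle_fp16(buf):
    """Split little-endian fp16 bytes into a low-byte stream followed by a
    high-byte stream; the high bytes (sign + exponent) are far more repetitive."""
    return bytes(buf[0::2]) + bytes(buf[1::2])

def unshuffle_fp16(buf):
    half = len(buf) // 2
    out = bytearray(len(buf))
    out[0::2] = buf[:half]
    out[1::2] = buf[half:]
    return bytes(out)
```

Zigzag maps the int6 range [-32, 31] onto exactly [0, 63], giving the entropy coder a compact unsigned alphabet.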
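A minimal sketch of such a container; the PGLF magic and 4-byte little-endian length fields are invented for illustration, and the real format's constants and header schema may differ.

```python
import json
import lzma
import struct

MAGIC = b"PGLF"  # hypothetical magic number

def write_artifact(path, header, streams):
    """header: dict of tensor names/shapes/dtypes/codec ids;
    streams: list of compressed payload bytes, one per stream."""
    hdr = lzma.compress(json.dumps(header).encode())
    with open(path, "wb") as f:
        f.write(MAGIC)
        f.write(struct.pack("<I", len(hdr)))   # header length prefix
        f.write(hdr)
        for s in streams:
            f.write(struct.pack("<I", len(s)))  # per-stream length prefix
            f.write(s)

def read_artifact(path):
    with open(path, "rb") as f:
        assert f.read(4) == MAGIC
        (n,) = struct.unpack("<I", f.read(4))
        header = json.loads(lzma.decompress(f.read(n)))
        streams = []
        while chunk := f.read(4):
            (m,) = struct.unpack("<I", chunk)
            streams.append(f.read(m))
    return header, streams
```

The fixed framing costs 8 bytes plus 4 bytes per stream, versus the ~100KB of pickle/ZIP overhead in the torch.save path.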
Methodology
This format was developed through 62 sequential experiments, each testing a single isolated change in an automated loop:
- Read prior results and notes
- Design one change, edit serialize.py
- Run python test_serialize.py (roundtrip correctness + size benchmark against the real H100 artifact)
- Log results to results.tsv, update notes.md with hypothesis/result/insights
- Keep if compressed size decreased with zero roundtrip error, otherwise revert
- Repeat
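The loop above amounts to a greedy accept/revert search, sketched below; the evaluate, apply_change, and revert callbacks are stand-ins for the real harness (test_serialize.py plus an edit to serialize.py), not actual code from it.

```python
def greedy_search(evaluate, apply_change, revert, n_experiments=62):
    """Greedy accept/revert loop. `evaluate` returns (roundtrip_ok, size)
    for the current serialize.py; a change is kept only if it is lossless
    and strictly shrinks the artifact."""
    ok, best = evaluate()
    assert ok
    history = [best]
    for i in range(n_experiments):
        apply_change(i)          # one isolated edit
        ok, size = evaluate()
        if ok and size < best:
            best = size          # keep the change
        else:
            revert()             # discard it
        history.append(best)
    return best, history
```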
The full experiment history (62 custom-format experiments + 25 additional torch.save fork experiments) is in the playground repo. I also attempted to fork and alter the C implementation of torch.save directly, but the custom binary format proved superior.
The code is open sourced on GitHub, and the submission PR is here.