Custom Serialization for Parameter-Golf
I recently got nerd-sniped by OpenAI's parameter-golf challenge: train the best language model that fits in a 16MB artifact and trains in 10 minutes on an 8xH100 machine. I came into this not knowing much about how models are trained or how their weights are stored, so working through it was a fun learning exercise. However, after experimenting with a few simple ideas (e.g. adding a small weight decay to the embedding table, increasing the learning rate) on a toy model that ran on my MacBook, I quickly found that approaches that worked well locally did not necessarily translate to improvements in model performance on an 8xH100 machine. Without much confidence that my ideas would pan out, I quickly burned through the competition's free $25 in compute credits and didn't want to invest my own money.
So instead, I pivoted to improving the serialization / compression layer. A smaller artifact doesn't directly improve BPB, but better lossless compression frees up artifact space for things that do (e.g. a wider MLP). And since the compression is lossless, it's a strictly better technique for trimming artifact size than lossy approaches like pruning: anyone working on a model that's slightly over the 16MB limit could drop in custom serialization to "buy back" space for free. The main advantage of this pivot, though, was that I could iterate locally without needing access to expensive GPUs.
What I built
I replaced torch.save + zstd-22 (the standard approach used by most submissions) with a custom binary format using Asymmetric Numeral Systems (ANS) entropy coding. The result: a 2.34% reduction in compressed size (~363KB saved), with zero loss in model accuracy.
| Method | Compressed bytes | vs Baseline |
|---|---|---|
| Baseline (torch.save + zstd-22) | 15,513,031 | — |
| Custom Serialization | 15,150,085 | -362,946 (-2.34%) |
Why this works
torch.save + generic compressors like zstd treat the serialized blob as an opaque byte stream. Since we know the model format, we can do better:
- Known value alphabets: Int6 weights use only 64 possible symbols ([-32, 31]), int8 embeddings use 256. ANS encodes directly against the true symbol distribution, coming within a few bits of the entropy floor, whereas generic compressors must discover this structure implicitly through LZ77 pattern matching.
- Row-level distribution structure: Rows within the same layer type share similar value distributions. K-means clustering (K=16) on row frequency histograms produces shared ANS probability models that adapt to different weight patterns.
- Dtype-aware stream separation: Splitting int8, fp16, and fp32 into independent streams and applying dtype-specific transforms (zigzag encoding for signed integers, byte-shuffling to group fp16 exponent bytes) makes each stream more compressible than the interleaved pickle format.
- No pickle overhead: torch.save uses pickle with per-tensor framing, ZIP containers, and metadata (~100KB). Our format stores a compact LZMA-compressed JSON header followed by length-prefixed compressed streams.
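To make the "known value alphabets" point concrete, here is a toy range-ANS (rANS) coder for a small signed alphabet. This is an illustrative sketch, not the submission's code: it uses Python's arbitrary-precision integers to sidestep renormalization, which a real implementation handles with a bounded state and byte-wise output.

```python
def build_cum(freqs):
    """Cumulative frequency table; every coded symbol needs freqs[s] > 0."""
    cum, total = {}, 0
    for s, f in freqs.items():
        cum[s] = total
        total += f
    return cum, total  # total == M, the sum of all frequencies

def rans_encode(symbols, freqs):
    """Fold symbols into one big integer; rarer symbols cost more bits.
    Encoding runs in reverse so decoding pops symbols in forward order."""
    cum, M = build_cum(freqs)
    x = 1
    for s in reversed(symbols):
        f = freqs[s]
        x = (x // f) * M + cum[s] + (x % f)
    return x

def rans_decode(x, n, freqs):
    cum, M = build_cum(freqs)
    # slot -> symbol lookup; fine for small alphabets like int6's 64 symbols
    slot_to_sym = [s for s, f in freqs.items() for _ in range(f)]
    out = []
    for _ in range(n):
        slot = x % M
        s = slot_to_sym[slot]
        x = freqs[s] * (x // M) + slot - cum[s]
        out.append(s)
    return out
```

With a skewed distribution, the encoded integer's bit length tracks the message's entropy rather than a fixed bits-per-symbol rate.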
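A sketch of the row-clustering step, assuming plain Lloyd's k-means over normalized value histograms; the distance metric and initialization here are my guesses for illustration, not necessarily what the format uses.

```python
import numpy as np

def cluster_rows(rows, k=16, iters=20, seed=0):
    """Group weight rows by value histogram so each cluster can share one
    ANS probability model (K=16 per the text)."""
    rows = [np.asarray(r, dtype=np.int64) for r in rows]
    lo = min(int(r.min()) for r in rows)
    nbins = max(int(r.max()) for r in rows) - lo + 1
    # normalized per-row frequency histograms as clustering features
    H = np.stack([np.bincount(r - lo, minlength=nbins) / r.size for r in rows])
    rng = np.random.default_rng(seed)
    centers = H[rng.choice(len(H), size=k, replace=False)]
    for _ in range(iters):
        # assign each row to its nearest center (squared Euclidean distance)
        dists = ((H[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = H[labels == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    return labels
```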
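Both dtype-specific transforms are standard tricks and easy to show directly; this is a sketch rather than the submission's code.

```python
def zigzag(v):
    """Map signed ints to unsigned so small magnitudes get small codes:
    0, -1, 1, -2, 2 -> 0, 1, 2, 3, 4."""
    return (v << 1) ^ (v >> 63)

def unzigzag(u):
    return (u >> 1) ^ -(u & 1)

def shuffle_fp16(buf):
    """Split little-endian fp16 bytes into a low-byte stream followed by a
    high-byte stream; the high bytes (sign + exponent) are far more repetitive."""
    return bytes(buf[0::2]) + bytes(buf[1::2])

def unshuffle_fp16(buf):
    half = len(buf) // 2
    out = bytearray(len(buf))
    out[0::2] = buf[:half]
    out[1::2] = buf[half:]
    return bytes(out)
```

Zigzag maps the int6 range [-32, 31] onto exactly [0, 63], giving the entropy coder a compact unsigned alphabet.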
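A minimal sketch of such a container; the PGLF magic and 4-byte little-endian length fields are invented for illustration, and the real format's constants and header schema may differ.

```python
import json
import lzma
import struct

MAGIC = b"PGLF"  # hypothetical magic number

def write_artifact(path, header, streams):
    """header: dict of tensor names/shapes/dtypes/codec ids;
    streams: list of compressed payload bytes, one per stream."""
    hdr = lzma.compress(json.dumps(header).encode())
    with open(path, "wb") as f:
        f.write(MAGIC)
        f.write(struct.pack("<I", len(hdr)))   # header length prefix
        f.write(hdr)
        for s in streams:
            f.write(struct.pack("<I", len(s)))  # per-stream length prefix
            f.write(s)

def read_artifact(path):
    with open(path, "rb") as f:
        assert f.read(4) == MAGIC
        (n,) = struct.unpack("<I", f.read(4))
        header = json.loads(lzma.decompress(f.read(n)))
        streams = []
        while chunk := f.read(4):
            (m,) = struct.unpack("<I", chunk)
            streams.append(f.read(m))
    return header, streams
```

The fixed framing costs 8 bytes plus 4 bytes per stream, versus the ~100KB of pickle/ZIP overhead in the torch.save path.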
Methodology
This format was developed through 62 sequential experiments, each testing a single isolated change in an automated loop:
- Read prior results and notes
- Design one change, edit serialize.py
- Run python test_serialize.py (roundtrip correctness + size benchmark against the real H100 artifact)
- Log results to results.tsv, update notes.md with hypothesis/result/insights
- Keep if compressed size decreased with zero roundtrip error, otherwise revert
- Repeat
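The loop above amounts to a greedy accept/revert search, sketched below; the evaluate, apply_change, and revert callbacks are stand-ins for the real harness (test_serialize.py plus an edit to serialize.py), not actual code from it.

```python
def greedy_search(evaluate, apply_change, revert, n_experiments=62):
    """Greedy accept/revert loop. `evaluate` returns (roundtrip_ok, size)
    for the current serialize.py; a change is kept only if it is lossless
    and strictly shrinks the artifact."""
    ok, best = evaluate()
    assert ok
    history = [best]
    for i in range(n_experiments):
        apply_change(i)          # one isolated edit
        ok, size = evaluate()
        if ok and size < best:
            best = size          # keep the change
        else:
            revert()             # discard it
        history.append(best)
    return best, history
```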
The full experiment history (62 custom-format experiments + 25 additional torch.save fork experiments) is in the playground repo. I also attempted to fork and alter the C implementation of torch.save directly, but the custom binary format proved superior.
The code is open sourced on GitHub, and the submission PR is here.