Custom Serialization for Parameter-Golf

I recently got nerd-sniped by OpenAI's parameter-golf challenge: train the best language model you can inside a 16MB artifact, with 10 minutes of training on an 8xH100 machine. I came in not knowing much about how models are trained or how their weights are stored, so working through this was a fun learning exercise. But after experimenting with a few simple ideas (e.g., adding a small weight decay to the embedding table, increasing the learning rate) on a toy model running on my MacBook, I quickly found that improvements on my MacBook didn't necessarily translate to better model performance on an 8xH100 machine. Without much confidence that my ideas would hold up, I burned through the competition's free $25 in compute credits and didn't want to invest my own money.

So instead, I pivoted to the serialization / compression layer. A smaller artifact doesn't directly improve BPB (bits per byte, the challenge's evaluation metric), but better lossless compression frees up artifact space that can be spent on things that do improve BPB (e.g., a wider MLP). And because the compression is lossless, it's a strictly better technique for trimming artifact size than lossy approaches like pruning: anyone with a model slightly over the 16MB limit could drop in custom serialization to "buy back" space for free. The main advantage of this pivot, though, was that I could iterate locally without needing access to expensive GPUs.

What I built

I replaced torch.save + zstd-22 (the standard approach used by most submissions) with a custom binary format using Asymmetric Numeral Systems (ANS) entropy coding. The result: a 2.34% reduction in compressed size (~363KB saved), with zero loss in model accuracy.

| Method | Compressed bytes | vs Baseline |
| --- | --- | --- |
| Baseline (torch.save + zstd-22) | 15,513,031 | |
| Custom serialization | 15,150,085 | -362,946 (-2.34%) |
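To give a feel for the entropy-coding stage, here is a minimal byte-wise rANS (range ANS) coder in pure Python. This is only an illustration of the technique, not the submission's actual format: the function names, the 12-bit frequency quantization, and the naive frequency-rounding rule are all my own choices for the sketch.

```python
from collections import Counter

PROB_BITS = 12     # symbol frequencies are quantized to sum to 2**12
RANS_L = 1 << 23   # lower bound of the normalized coder state

def build_tables(data):
    """Quantize symbol frequencies to a power-of-two total (naive version)."""
    counts = Counter(data)
    freq = {s: max(1, c * (1 << PROB_BITS) // len(data)) for s, c in counts.items()}
    # Dump any rounding slack onto the most common symbol so the total is exact.
    freq[max(freq, key=freq.get)] += (1 << PROB_BITS) - sum(freq.values())
    cum, acc = {}, 0
    for s in sorted(freq):
        cum[s], acc = acc, acc + freq[s]
    return freq, cum

def rans_encode(data, freq, cum):
    x, out = RANS_L, bytearray()
    for s in reversed(data):                    # rANS encodes in reverse order
        f = freq[s]
        while x >= ((RANS_L >> PROB_BITS) << 8) * f:
            out.append(x & 0xFF)                # renormalize: stream out low bytes
            x >>= 8
        x = (x // f << PROB_BITS) + x % f + cum[s]
    return bytes(out) + x.to_bytes(4, "little")  # final state goes last

def rans_decode(blob, freq, cum, n):
    buf = bytearray(blob)
    x = int.from_bytes(buf[-4:], "little")
    del buf[-4:]
    slot2sym = {k: s for s, c in cum.items() for k in range(c, c + freq[s])}
    out = []
    for _ in range(n):
        slot = x & ((1 << PROB_BITS) - 1)
        s = slot2sym[slot]
        out.append(s)
        x = freq[s] * (x >> PROB_BITS) + slot - cum[s]
        while x < RANS_L:                       # renormalize: stream bytes back in (LIFO)
            x = (x << 8) | buf.pop()
    return bytes(out)

data = b"the quick brown fox " * 64
freq, cum = build_tables(data)
blob = rans_encode(data, freq, cum)
assert rans_decode(blob, freq, cum, len(data)) == data
print(f"{len(data)} -> {len(blob)} bytes")
```

The appeal of ANS over arithmetic coding is that encode and decode are just integer shifts, divisions, and table lookups over a single machine-word state, which is why it shows up in modern compressors like zstd itself.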

Why this works

torch.save plus a generic compressor like zstd treats the serialized blob as an opaque byte stream. Because we know the exact structure of the model (tensor shapes, dtypes, and how trained weight values are distributed), a format-aware coder can do better.
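A small, self-contained demonstration of the principle, using only the standard library (zlib stands in for zstd here, and the synthetic Gaussian weights stand in for a real checkpoint): splitting fp16 weights into byte planes lets an order-0 entropy coder exploit the heavily skewed sign/exponent bytes, which an interleaved opaque stream hides.

```python
import random
import struct
import zlib

random.seed(0)
# Simulate trained fp16 weights: small, roughly Gaussian values.
raw = b"".join(struct.pack("<e", random.gauss(0.0, 0.02)) for _ in range(50_000))

# Opaque view: compress the interleaved byte stream, the way a generic
# compressor sees torch.save output.
opaque = len(zlib.compress(raw, 9))

# Format-aware view: little-endian fp16 is [low mantissa byte][sign/exponent
# byte]. For trained weights the high bytes cluster around a few exponents
# (low entropy) while the low mantissa bytes are near-random. Compressing the
# two planes separately lets the entropy coder exploit that skew.
lo, hi = raw[0::2], raw[1::2]
planes = len(zlib.compress(lo, 9)) + len(zlib.compress(hi, 9))

print(f"interleaved: {opaque} bytes, byte planes: {planes} bytes")
```

The same idea generalizes: any structure you know about (per-tensor value ranges, repeated metadata, dtype layouts) is information a generic compressor has to rediscover, or can't use at all.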

Methodology

This format was developed through 62 sequential experiments, each testing a single isolated change in an automated loop:

  1. Read prior results and notes
  2. Design one change, edit serialize.py
  3. Run python test_serialize.py (roundtrip correctness + size benchmark against the real H100 artifact)
  4. Log results to results.tsv, update notes.md with hypothesis/result/insights
  5. Keep if compressed size decreased with zero roundtrip error, otherwise revert
  6. Repeat
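The accept/revert rule at the heart of the loop above can be captured in a few lines. This is a hypothetical sketch of the decision logic only; the real harness shells out to test_serialize.py, and `should_keep` / `log_line` are names I've made up for illustration.

```python
def should_keep(new_size: int, best_size: int, roundtrip_ok: bool) -> bool:
    """Keep an experiment only if it strictly shrinks the artifact AND the
    weights round-trip bit-for-bit (zero roundtrip error)."""
    return roundtrip_ok and new_size < best_size

def log_line(exp_id: int, desc: str, size: int, kept: bool) -> str:
    """One results.tsv row: experiment id, description, size, verdict."""
    return f"{exp_id}\t{desc}\t{size}\t{'keep' if kept else 'revert'}"
```

Making "zero roundtrip error" a hard gate is what keeps the whole approach lossless: a change that saves bytes but perturbs even one weight is rejected outright rather than traded off.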

The full experiment history (62 custom-format experiments + 25 additional torch.save fork experiments) is in the playground repo. I also attempted to fork and alter the C implementation of torch.save directly, but the custom binary format proved superior.

The code is open-sourced on GitHub, and the submission PR is here.