The Electronic DJ Model aka E.D.M.
The EDM predicts the next track in a DJ mix. It is trained on The DJ Mix Dataset: ~96,000 training tokens against a vocabulary of 63,040 unique tracks. Try it out yourself at www.electronicdjmodel.com!
DJ mixes share many properties with written sentences: each has a beginning and an end, and the order of the words (or tracks) is significant.
So could we build a tokenizer that treats individual music tracks as tokens, and train a model on it to create new, palatable DJ playlists?
Track-Token, a new tokenizer
Modern language models lean on BPE, which splits words into fragments that recur millions of times. We took the opposite approach with EDMTokenizer: each whole track ID is one atomic token, a plain dictionary lookup. The vocabulary is the 63,040 unique tracks observed in the dataset, a None token for the ~20% of slots that are unidentified, plus a start marker (BOS). A mix becomes [BOS, trackA, trackB, None, trackC, …]. None stays visible as context, but its training targets are remapped to ignore_index=-1, so the model is never rewarded for predicting "unknown." Even at this early stage it was clear that, with many tracks appearing only once in the training dataset, training the model sufficiently was going to be a challenge.
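The idea can be illustrated with a minimal sketch. This is not the EDMTokenizer source: the class name `TrackTokenizer`, the special-token IDs (0 for BOS, 1 for None) and the `training_pair` helper are assumptions made for the example.

```python
from typing import Dict, List, Optional, Sequence, Tuple

IGNORE_INDEX = -1  # target value the loss is told to skip


class TrackTokenizer:
    """Toy whole-track tokenizer: one track ID maps to exactly one token."""

    def __init__(self, track_ids: Sequence[str]) -> None:
        # Assumed layout: two special tokens first, then one ID per unique track.
        self.bos_id = 0
        self.none_id = 1
        self.track_to_id: Dict[str, int] = {}
        for track in track_ids:
            if track not in self.track_to_id:
                self.track_to_id[track] = len(self.track_to_id) + 2

    @property
    def vocab_size(self) -> int:
        return len(self.track_to_id) + 2

    def encode(self, mix: Sequence[Optional[str]]) -> List[int]:
        # Unidentified slots (None) keep their position as a visible None token.
        ids = [self.bos_id]
        for track in mix:
            ids.append(self.none_id if track is None else self.track_to_id[track])
        return ids

    def training_pair(self, mix: Sequence[Optional[str]]) -> Tuple[List[int], List[int]]:
        # Inputs keep the None token as context; targets that would be
        # None are remapped so the loss never rewards predicting "unknown".
        ids = self.encode(mix)
        inputs, targets = ids[:-1], ids[1:]
        targets = [IGNORE_INDEX if t == self.none_id else t for t in targets]
        return inputs, targets


tok = TrackTokenizer(["trackA", "trackB", "trackC"])
inputs, targets = tok.training_pair(["trackA", "trackB", None, "trackC"])
print(inputs)   # [0, 2, 3, 1]
print(targets)  # [2, 3, -1, 4]
```

In this sketch a track that was never seen at construction time raises a `KeyError`; a real tokenizer would need a policy for that case.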
Starting small
We began by adapting microgpt (Karpathy's pure-Python, dependency-free GPT with scalar autograd) to our new track tokenizer. We then explored Nanochat in conjunction with the same tokenizer.
In search of the optimal model size
Nanochat's compute-optimal target is ~8 tokens per scaling parameter. With 96K tokens, even our smallest model sits ~40× below that; the biggest one (d6, 34.8M scaling params) sits ~2,700× below, at a token-to-parameter ratio of ~0.003.
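The arithmetic behind that comparison can be sketched as follows. The function names are illustrative, and the inputs are the rounded figures quoted above, so the result lands in the same range as the quoted shortfall rather than reproducing it exactly.

```python
def tokens_per_param(train_tokens: float, scaling_params: float) -> float:
    """Training tokens available per scaling parameter."""
    return train_tokens / scaling_params


def shortfall_factor(train_tokens: float, scaling_params: float,
                     target_ratio: float = 8.0) -> float:
    """How many times below the target tokens-per-parameter ratio a run sits."""
    return target_ratio / tokens_per_param(train_tokens, scaling_params)


# Rounded figures from the text: ~96K tokens, 34.8M parameters for d6.
ratio = tokens_per_param(96_000, 34.8e6)
factor = shortfall_factor(96_000, 34.8e6)
print(f"{ratio:.4f} tokens per parameter, {factor:,.0f}x below target")
```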
We swept model size across three configurations: d6 (131M params), small (~8M, dim 64), and tiny (~4M, dim 32).
The d6 model had the capacity to memorize: 87.5% of its generations were exact copies of training pairs. The tiny model over-shrank into near-randomness. The small model had enough capacity to learn patterns but too little to memorize: it won top-1000 (2.64% vs d6's 2.01%) while generating 87% novel transitions.
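One way to measure novelty is sketched below. It assumes novelty means the share of generated A→B transitions that never occur as adjacent tracks in any training mix; the exact definition used for the reported figures is not spelled out, so treat this as an illustration.

```python
from typing import Hashable, Iterable, Sequence, Set, Tuple


def novelty_rate(generated_mixes: Iterable[Sequence[Hashable]],
                 training_mixes: Iterable[Sequence[Hashable]]) -> float:
    """Fraction of generated adjacent pairs that are absent from training."""
    train_pairs: Set[Tuple[Hashable, Hashable]] = set()
    for mix in training_mixes:
        train_pairs.update(zip(mix, mix[1:]))

    total = 0
    novel = 0
    for mix in generated_mixes:
        for pair in zip(mix, mix[1:]):
            total += 1
            if pair not in train_pairs:
                novel += 1
    return novel / total if total else 0.0


training = [["a", "b", "c"], ["c", "d"]]
generated = [["a", "b", "d"]]  # (a, b) is copied, (b, d) is new
print(novelty_rate(generated, training))  # 0.5
```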
Next-track prediction (all positions; key metric: top-1000)
| Model | Params | Top-100 | Top-500 | Top-1000 | Med. rank | Novelty |
|---|---|---|---|---|---|---|
| d6 | 131M | 1.19% | 1.76% | 2.01% | 49,462 | 12.5% |
| small | ~8M | 1.37% | 2.01% | 2.64% | 38,453 | 87.0% |
| tiny | ~4M | 1.28% | 1.95% | 2.58% | 33,230 | 97.7% |
| Bigram | — | 0.88% | 0.88% | 0.88% | 63,040 | — |
| Random | — | ~0.16% | ~0.79% | ~1.59% | ~31,500 | 99.8% |
All three models beat the bigram baseline (0.88%), so they learn context beyond raw A→B co-occurrence.
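The table's columns can be computed along these lines. This is a sketch, not the evaluation script behind the numbers: it assumes each position yields one score per vocabulary entry, skips positions whose target is the ignore value -1, and breaks ties optimistically.

```python
from statistics import median
from typing import Dict, List, Sequence, Tuple


def rank_of_target(scores: Sequence[float], target: int) -> int:
    """1-based rank of the target when tokens are sorted by descending score."""
    target_score = scores[target]
    return 1 + sum(1 for s in scores if s > target_score)


def evaluate(score_rows: Sequence[Sequence[float]], targets: Sequence[int],
             cutoffs: Sequence[int] = (100, 500, 1000),
             ignore_index: int = -1) -> Tuple[Dict[int, float], float]:
    """Return top-k hit rates and the median rank of the true next track."""
    ranks: List[int] = [
        rank_of_target(row, t)
        for row, t in zip(score_rows, targets)
        if t != ignore_index
    ]
    top_k = {k: sum(r <= k for r in ranks) / len(ranks) for k in cutoffs}
    return top_k, median(ranks)


# Tiny example with a 3-track vocabulary and small cutoffs.
rows = [[0.1, 0.7, 0.2], [0.5, 0.3, 0.2], [0.9, 0.05, 0.05]]
targets = [1, 2, -1]  # the last position is ignored
print(evaluate(rows, targets, cutoffs=(1, 2)))  # ({1: 0.5, 2: 0.5}, 2.0)
```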
Considering what 'good' looks like for DJ mix next-track prediction
In language there is usually a very small number of appropriate next words, so top-1 accuracy is a fair metric. In a DJ set, we speculate that perhaps 50-100 tracks could each mix in satisfactorily, yet the data records only the one path the DJ took.
We made the headline metric few-shot top-1000: is the true track in a plausible shortlist of a thousand? Random top-n accuracy is just n/63,040 (1.59% at n = 1,000), so this cutoff leaves room to separate models without swallowing the vocabulary.
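The chance baseline is simple enough to check directly. The short sketch below reproduces the n/63,040 figures; the function names are illustrative.

```python
VOCAB_SIZE = 63_040  # unique tracks in the vocabulary


def random_top_n(n: int, vocab_size: int = VOCAB_SIZE) -> float:
    """Chance that a uniformly random shortlist of n tracks contains the target."""
    return min(n, vocab_size) / vocab_size


def lift_over_chance(accuracy: float, n: int, vocab_size: int = VOCAB_SIZE) -> float:
    """How many times better than chance a top-n accuracy is."""
    return accuracy / random_top_n(n, vocab_size)


for n in (100, 500, 1000):
    print(f"top-{n}: {random_top_n(n):.2%}")
# top-100: 0.16%
# top-500: 0.79%
# top-1000: 1.59%
```

Read this way, a top-1000 accuracy of 2.64% is roughly 1.7× chance.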
Handing off to the machine, applying Autoresearch for exploration
Using Autoresearch, we ran 18 experiments that brought the val_loss metric down from 11.05 to 9.30. With so little training data, the process showed that a softmax lookup was the best approach:
Deep learning tuning:
- Smaller batch, more steps → 11.46
- Tied embeddings + no logit softcap → 18.91
- Dropout 0.3 everywhere → 11.81
- Label smoothing 0.1 → 11.29
- Unigram bias + label smoothing → 12.64
- Pure bigram, depth 0 → 17.51
- Frozen prior + training (wd=10) → 11.10
- Train only frequent tokens (mask rare targets) → 11.91
Statistical priors:
- Unigram mixture (α=0.09), no training → 10.78
- Power-law smoothing (γ=1.27, k=40), no training → 10.76
- Log-linear prior with a "seen" penalty (a=1.07, k=3, b=−4.1) → 9.35
- 3-regime prior (separate unseen/singleton/multi-seen penalties) → 9.31
- 3-regime + gentle training (lr=1e-5, wd=50) → 9.30 (best)
- 4-regime + training → 9.31 (no real gain, reverted)
- Harder training on the prior (lr=3e-5, wd=100) → 9.31 (overshot, reverted)
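To make the idea of a regime-based prior concrete, here is a minimal sketch of a context-free prior that scores each track by how often it appeared in training, with separate treatment for unseen, singleton and multi-seen tracks. The scoring rule and all parameter values (`unseen`, `singleton`, `slope`) are hypothetical placeholders, not the settings Autoresearch found.

```python
import math
from collections import Counter
from typing import List, Sequence


def three_regime_log_probs(train_tokens: Sequence[int], vocab_size: int,
                           unseen: float = -4.0, singleton: float = -1.0,
                           slope: float = 1.0) -> List[float]:
    """Log-probabilities from a count-based score with three regimes."""
    counts = Counter(train_tokens)
    scores = []
    for token in range(vocab_size):
        c = counts.get(token, 0)
        if c == 0:
            scores.append(unseen)                # never seen in training
        elif c == 1:
            scores.append(singleton)             # seen exactly once
        else:
            scores.append(slope * math.log(c))   # seen several times
    # Log-softmax over the whole vocabulary, shifted for numerical stability.
    top = max(scores)
    log_z = top + math.log(sum(math.exp(s - top) for s in scores))
    return [s - log_z for s in scores]


def mean_nll(log_probs: Sequence[float], val_tokens: Sequence[int]) -> float:
    """Average negative log-likelihood (in nats) of the validation tokens."""
    return -sum(log_probs[t] for t in val_tokens) / len(val_tokens)


log_probs = three_regime_log_probs([0, 0, 1], vocab_size=4)
print(mean_nll(log_probs, [0, 1, 2]))
```

With no training tokens every track falls in the unseen regime, the distribution is uniform, and the loss reduces to ln(vocab_size), which is about 11.05 for 63,040 tracks.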
Human evaluation - try the model yourself!
- The EDM-small model, trained for 300 steps, is deployed at www.electronicdjmodel.com, where you can select a song from the dataset to either start the mix or inspire it.
- Each generated track plays its YouTube video, and playback advances automatically until the mix ends
- Press thumbs-up to give positive feedback
- Press thumbs-down to replace with a new track
- Preferences are anonymously captured for DPO improvement to the model
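For readers unfamiliar with DPO, the sketch below shows the shape of the data and the per-pair loss. How the site actually turns thumbs into pairs is not described here, so the `PreferencePair` record and the pairing rule (a thumbs-down track as "rejected", a later thumbs-up replacement in the same slot as "chosen") are assumptions; only the loss formula is standard.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class PreferencePair:
    """Hypothetical record: same mix context, one liked and one disliked track."""
    context: List[int]  # token IDs of the mix so far
    chosen: int         # track that received a thumbs-up
    rejected: int       # track that received a thumbs-down


def dpo_loss(policy_chosen_lp: float, policy_rejected_lp: float,
             ref_chosen_lp: float, ref_rejected_lp: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one pair: -log(sigmoid(beta * margin))."""
    margin = beta * ((policy_chosen_lp - ref_chosen_lp)
                     - (policy_rejected_lp - ref_rejected_lp))
    # Numerically stable form of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


pair = PreferencePair(context=[0, 2, 3], chosen=4, rejected=5)
# Log-probabilities would come from the trained and the frozen reference model.
print(dpo_loss(-4.0, -6.0, -5.0, -5.0))
```

The loss is ln 2 when the policy and reference model agree, and it falls as the policy shifts probability toward the chosen track relative to the reference.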
Conclusion & Future Work
- The novel tokenizer was an interesting exercise, but there was too little data to train a transformer effectively, which limited what this project could achieve.
- A larger dataset with a higher token-to-parameter ratio would enable further exploration.
- Despite these limitations, I have enjoyed listening to the generated DJ playlists.