The Electronic DJ Model aka E.D.M.

19 June 2026

The EDM predicts the next track in a DJ mix. It is trained on The DJ Mix Dataset: ~96,000 training tokens against a vocabulary of 63,040 unique tracks. Try it out yourself at www.electronicdjmodel.com!

DJ mixes share many properties with a written sentence: there is a beginning and an end, and the order of the words (or tracks) is significant.

So could we create a tokenizer that treats individual music tracks as tokens, and train a model on it to generate new, palatable DJ playlists?

Track-Token, a new tokenizer

Modern language models lean on BPE, which splits words into fragments that recur millions of times. We wrote EDMTokenizer instead: each whole track ID is one atomic token, found by a plain dictionary lookup. The vocabulary is the 63,040 unique tracks observed in the dataset, plus a None token for the ~20% of slots that are unidentified and a start marker (BOS). A mix becomes [BOS, trackA, trackB, None, trackC, …]. None stays visible as context, but its training targets are remapped to ignore_index=-1, so the model is never rewarded for predicting "unknown." Even at this early stage it was clear that, with many tracks appearing only once in the training dataset, training the model sufficiently was going to be a challenge.
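To make the tokenization concrete, here is a minimal sketch of a whole-track tokenizer. It is not the actual EDMTokenizer: the class name, the method names and the choice of IDs 0 and 1 for BOS and None are assumptions made for illustration. Only the overall scheme (one token per track, None kept as input, None targets remapped to -1) comes from the description above.

```python
IGNORE_INDEX = -1  # targets with this value are meant to be skipped by the loss


class TrackTokenizerSketch:
    """Hypothetical whole-track tokenizer: one track ID maps to one token."""

    def __init__(self, mixes):
        # Assumed layout: 0 = BOS, 1 = None (unidentified slot), 2+ = tracks.
        self.bos_id = 0
        self.none_id = 1
        track_ids = sorted({t for mix in mixes for t in mix if t is not None})
        self.track_to_token = {t: i + 2 for i, t in enumerate(track_ids)}

    @property
    def vocab_size(self):
        return len(self.track_to_token) + 2

    def encode(self, mix):
        # Plain dictionary lookup; unidentified slots become the None token.
        tokens = [self.bos_id]
        for t in mix:
            tokens.append(self.none_id if t is None else self.track_to_token[t])
        return tokens

    def inputs_and_targets(self, mix):
        # Shift by one for next-track prediction. None stays in the inputs,
        # but is never a target the model is trained to predict.
        tokens = self.encode(mix)
        inputs = tokens[:-1]
        targets = [IGNORE_INDEX if t == self.none_id else t for t in tokens[1:]]
        return inputs, targets


mixes = [["trackA", "trackB", None, "trackC"], ["trackB", "trackC"]]
tok = TrackTokenizerSketch(mixes)
inputs, targets = tok.inputs_and_targets(mixes[0])
print(inputs)   # [0, 2, 3, 1]
print(targets)  # [2, 3, -1, 4]
```

In a PyTorch training loop, the -1 targets would be skipped by passing ignore_index=-1 to the cross-entropy loss.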

Starting small

We began by adapting microgpt (Karpathy's pure-Python, dependency-free GPT, built on scalar autograd) to our new track tokenizer. We then explored Nanochat with the same tokenizer.

In search of the optimal model size

Nanochat's compute-optimal target is ~8 tokens per scaling parameter. With 96K tokens, even our smallest model sits ~40× below that target; the biggest one (d6, 34.8M scaling params) has a token-to-parameter ratio of about 0.003, ~2,700× below it.
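The shortfall is simple arithmetic, sketched below with the rounded figures quoted above. The constant and function names are illustrative, and the 34.8M figure is taken to be d6's scaling-parameter count.

```python
TARGET_TOKENS_PER_PARAM = 8.0   # compute-optimal target quoted above
TRAIN_TOKENS = 96_000           # approximate size of the training set
D6_SCALING_PARAMS = 34_800_000  # assumed to be d6's scaling-parameter count


def shortfall(tokens, params, target=TARGET_TOKENS_PER_PARAM):
    """Return (tokens per parameter, how many times below the target)."""
    ratio = tokens / params
    return ratio, target / ratio


ratio, gap = shortfall(TRAIN_TOKENS, D6_SCALING_PARAMS)
print(f"{ratio:.4f} tokens per parameter, {gap:,.0f}x below target")

# Parameter budget that 96K tokens would support at the target ratio.
print(f"{TRAIN_TOKENS / TARGET_TOKENS_PER_PARAM:,.0f} parameters")
```

With these rounded inputs the gap comes out at about 2,900×; rounding the ratio to 0.003 first gives 8 / 0.003 ≈ 2,700×, the figure quoted above. Either way, 96K tokens would only support a model of roughly 12,000 scaling parameters at the compute-optimal ratio.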

We swept depth across d6 (131M params), small (~8M, dim 64), and tiny (~4M, dim 32).

Next-track prediction (all positions; key metric: top-1000)

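| Model  | Params | Top-100 | Top-500 | Top-1000 | Med. rank | Novelty |
|--------|--------|---------|---------|----------|-----------|---------|
| d6     | 131M   | 1.19%   | 1.76%   | 2.01%    | 49,462    | 12.5%   |
| small  | ~8M    | 1.37%   | 2.01%   | 2.64%    | 38,453    | 87.0%   |
| tiny   | ~4M    | 1.28%   | 1.95%   | 2.58%    | 33,230    | 97.7%   |
| Bigram | -      | 0.88%   | 0.88%   | 0.88%    | 63,040    | -       |
| Random | -      | ~0.16%  | ~0.79%  | ~1.59%   | ~31,500   | 99.8%   |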

All three models beat the bigram baseline (0.88%), so they learn context beyond raw A→B co-occurrence.
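For reference, a bigram baseline of this kind can be sketched in a few lines. This is an illustration, not the implementation used for the table: the function names are hypothetical, and giving the worst rank to any successor never seen after the previous track is an assumption.

```python
from collections import Counter, defaultdict


def fit_bigram(mixes):
    """Count how often each track directly follows another."""
    counts = defaultdict(Counter)
    for mix in mixes:
        for prev, nxt in zip(mix, mix[1:]):
            counts[prev][nxt] += 1
    return counts


def bigram_rank(counts, prev, true_next, vocab_size):
    """1-based rank of true_next among the successors seen after prev.

    Assumption: a successor never seen after prev gets the worst rank.
    """
    seen = [track for track, _ in counts.get(prev, Counter()).most_common()]
    if true_next in seen:
        return seen.index(true_next) + 1
    return vocab_size


train = [["a", "b", "c"], ["a", "b", "d"], ["a", "c"]]
counts = fit_bigram(train)
print(bigram_rank(counts, "a", "b", vocab_size=5))  # most common successor of "a"
print(bigram_rank(counts, "a", "d", vocab_size=5))  # never seen after "a"
```

A baseline like this can only rank pairs it has already seen, so when most tracks appear once it has nothing to offer for the majority of positions.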

Considering what 'good' looks like for DJ mix next-track prediction

In language there's usually a very small number of appropriate next words, so top-1 accuracy is a fair metric. In a DJ set, we speculate that perhaps 50-100 tracks could each mix in satisfactorily, and the data records only the one path a DJ took.

We made the headline metric few-shot top-1000: is the true track in a plausible shortlist of a thousand? Since random accuracy at top-n is just n/63,040 (1.59% at top-1000), a shortlist of that size leaves room to separate models without swallowing the vocabulary.
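The random baseline and the top-k metric can be written down directly. The sketch below assumes each scored position has already been reduced to the 1-based rank of the true next track; the function names are illustrative.

```python
VOCAB_SIZE = 63_040  # unique tracks in the dataset


def random_top_k_accuracy(k, vocab_size=VOCAB_SIZE):
    """Chance that a uniformly random shortlist of k tracks holds the true one."""
    return k / vocab_size


def top_k_accuracy(ranks, k):
    """Fraction of positions whose true track is ranked within the top k.

    ranks: 1-based rank of the true next track at each scored position.
    """
    return sum(r <= k for r in ranks) / len(ranks)


for k in (100, 500, 1000):
    print(f"top-{k}: {random_top_k_accuracy(k):.2%}")

# Toy example: four positions, two of them inside a top-1000 shortlist.
print(top_k_accuracy([1, 500, 1001, 63_040], k=1000))  # 0.5
```

The loop reproduces the Random row of the table: 0.16%, 0.79% and 1.59%.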

Handing off to the machine, applying Autoresearch for exploration

Using Autoresearch, we ran 18 experiments to optimise val_loss, which fell from 11.05 to 9.30. Given the dearth of training data, the process showed that a softmax lookup was the best approach:

Deep learning tuning:

Statistical priors:

Human evaluation - try the model yourself!

Conclusion & Future Work
