arXiv:2604.17852

LLM-Codec
Neural Audio Codec Meets
Language Model Objectives

Training neural audio codecs with language-model-facing objectives to produce tokens that are both reconstructable and predictable under autoregressive modeling.

61.6% SALMon Accuracy
35× Perplexity Reduction
5.0% Mel Distance Improvement

Bridging Reconstruction & Prediction

LLM-Codec: Neural Audio Codec Meets Language Model Objectives studies the mismatch between reconstruction-trained neural audio codecs and prediction-trained spoken language models. The paper introduces language-model-facing codec training objectives and releases the training code, a model checkpoint, and the Codec-SUPERB reconstruction examples on this page.

Neural audio codecs are widely used as tokenizers for spoken language models, but they are optimized for waveform reconstruction rather than autoregressive prediction. This mismatch injects acoustically driven uncertainty into the discrete token space and increases language-model perplexity.


We propose LLM-Codec, which trains the codec encoder with language-model-facing objectives while keeping both codec and LLM architectures unchanged. LLM-Codec introduces (i) future token prediction with Medusa-style multi-step heads to encourage multi-step predictability, and (ii) semantic alignment that matches audio and text representations via a memory-bank contrastive loss. A differentiable Gumbel bridge enables end-to-end gradients from these objectives to the codec encoder.

How LLM-Codec Works

Three key components work together to make codec tokens LLM-friendly, without modifying model architectures.

LLM-Codec architecture diagram showing future token prediction and semantic alignment objectives

Future Token Prediction

K Medusa-style prediction heads with inverse-distance weighting capture multi-step dependencies, encouraging tokens that are locally predictable beyond a single step.
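The inverse-distance weighting can be sketched as a weighted cross-entropy over K look-ahead heads. This is a minimal NumPy sketch, not the paper's implementation: the function name `multi_step_loss` and the exact 1/k weighting scheme are assumptions for illustration.

```python
import numpy as np

def multi_step_loss(head_logits, targets):
    """Cross-entropy averaged over K look-ahead heads, weighted by 1/k
    so nearer future tokens matter more (hypothetical inverse-distance
    weighting; head k predicts the token at offset t+k)."""
    total, weight_sum = 0.0, 0.0
    for k, (logits, tgt) in enumerate(zip(head_logits, targets), start=1):
        # numerically stable log-softmax cross-entropy for head k
        z = logits - logits.max(-1, keepdims=True)
        logp = z - np.log(np.exp(z).sum(-1, keepdims=True))
        ce = -logp[np.arange(len(tgt)), tgt].mean()
        w = 1.0 / k          # inverse-distance weight (assumed form)
        total += w * ce
        weight_sum += w
    return total / weight_sum
```

Heads at larger offsets are harder to predict, so down-weighting them keeps the loss dominated by near-future structure while still rewarding multi-step predictability.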

Semantic Alignment

Layer-wise cosine alignment and memory-bank contrastive loss ensure audio and text representations match inside the LLM, preserving linguistic content.
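A memory-bank contrastive loss of this kind is typically an InfoNCE objective where each audio feature must match its paired text feature against negatives drawn from a bank of past text features. The sketch below is a generic NumPy illustration under that assumption; the function name `memory_bank_infonce` and the temperature value are hypothetical, not taken from the paper.

```python
import numpy as np

def memory_bank_infonce(audio, text, bank, tau=0.07):
    """InfoNCE with a memory bank: each audio vector should score its
    paired text vector higher than every negative stored in the bank."""
    def norm(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)
    a, t, b = norm(audio), norm(text), norm(bank)
    pos = (a * t).sum(-1, keepdims=True) / tau   # (N, 1) paired similarity
    neg = a @ b.T / tau                          # (N, M) bank negatives
    logits = np.concatenate([pos, neg], axis=1)
    # stable log-softmax; the positive sits in column 0
    z = logits - logits.max(-1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(-1, keepdims=True))
    return -logp[:, 0].mean()
```

The bank supplies many negatives without enlarging the batch, which is the usual motivation for this design.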

Gumbel Bridge

Hard Gumbel-Softmax keeps discrete tokens in the forward pass while providing smooth gradients backward, enabling end-to-end codec encoder optimization.
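The hard Gumbel-Softmax trick can be sketched in a few lines. This NumPy version shows only the forward pass; the straight-through gradient (forward = one-hot, backward = soft probabilities) is noted in the comment, since NumPy has no autograd. The function name is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def gumbel_softmax_hard(logits, tau=1.0):
    """Hard Gumbel-Softmax forward pass: sample Gumbel noise, take a
    tempered softmax, then snap to a one-hot token."""
    # Gumbel(0,1) noise via the inverse-CDF of -log(-log(U))
    g = -np.log(-np.log(rng.uniform(size=logits.shape) + 1e-10) + 1e-10)
    y = (logits + g) / tau
    soft = np.exp(y - y.max(-1, keepdims=True))
    soft = soft / soft.sum(-1, keepdims=True)
    hard = np.eye(logits.shape[-1])[soft.argmax(-1)]
    # In an autograd framework the straight-through estimator would be
    # `hard - soft.detach() + soft`: discrete forward, smooth backward.
    return hard
```

This is what lets the language-model-facing losses backpropagate through the quantizer into the codec encoder while the LLM still consumes discrete tokens.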

Reconstruction Examples

Codec-SUPERB-tiny examples synthesized with LLM-Codec and the baseline codecs AUV, BigCodec, UniCodec, and WavTokenizer-L. Each domain contains five 4-second samples; for every sample the page plays the ground truth alongside all five codec reconstructions.

Speech: Sample 1 ravdess (s1), Samples 2, 3, 5 voxceleb1 (s2, s3, s5), Sample 4 crema_d (s4)
Music: Sample 1 nsynth (m1), Sample 2 m4singer (m2), Sample 3 opensinger (m3), Samples 4, 5 gtzan (m4, m5)
Audio: Samples 1–5 esc50 (e1–e5)

Quantitative Evaluation

Reconstruction quality on Codec-SUPERB-tiny and speech coherence on the SALMon benchmark.

Speech Reconstruction

Model            Mel ↓   STFT ↓   PESQ ↑   STOI ↑
BigCodec         0.810   1.718    2.208    0.877
UniCodec         0.830   1.824    2.022    0.851
WavTokenizer-M   0.904   1.846    1.843    0.823
AUV (base)       0.762   1.648    2.094    0.850
LLM-Codec        0.683   1.507    2.147    0.858

SALMon Speech Coherence (3 Epochs)

Model              Speaker   Gender   RIR    BG-All   Average
WavTokenizer-L     49.0      53.0     39.0   53.0     49.5
BigCodec           49.5      49.0     47.0   47.0     49.5
AUV (3 ep)         53.0      54.5     40.5   46.5     48.4
LLM-Codec (3 ep)   72.0      71.5     64.0   57.0     62.3

BibTeX

@article{chung2026llm,
  title={LLM-Codec: Neural Audio Codec Meets Language Model Objectives},
  author={Chung, Ho-Lam and Chen, Yiming and Lee, Hung-yi},
  journal={arXiv preprint arXiv:2604.17852},
  year={2026}
}