We train neural audio codecs with language-model-facing objectives so that their tokens are both reconstructable and predictable under autoregressive modeling.
Neural audio codecs are widely used as tokenizers for spoken language models, but they are optimized for waveform reconstruction rather than autoregressive prediction. This mismatch injects acoustically driven uncertainty into the discrete token space and increases language-model perplexity.
We propose LLM-Codec, which trains the codec encoder with language-model-facing objectives while keeping both codec and LLM architectures unchanged. LLM-Codec introduces (i) future token prediction with Medusa-style multi-step heads to encourage multi-step predictability, and (ii) semantic alignment that matches audio and text representations via a memory-bank contrastive loss. A differentiable Gumbel bridge enables end-to-end gradients from these objectives to the codec encoder.
Three key components work together to make codec tokens LLM-friendly, without modifying model architectures.
K Medusa-style prediction heads with inverse-distance weighting capture multi-step dependencies, encouraging tokens that are locally predictable beyond a single step.
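A minimal PyTorch sketch of an inverse-distance-weighted multi-step loss over K prediction heads. The function name and tensor layout are hypothetical; we assume head k predicts the token k+1 steps ahead, and that nearer steps receive larger weights 1/(k+1):

```python
import torch
import torch.nn.functional as F

def multi_step_loss(head_logits, targets):
    """Hypothetical sketch: weighted sum of per-head cross-entropies.

    head_logits: list of K tensors, each (batch, seq, vocab); head k is
                 assumed to predict the token k+1 steps ahead.
    targets:     (batch, seq) ground-truth token ids.
    """
    total, weight_sum = 0.0, 0.0
    for k, logits in enumerate(head_logits):
        # Head k predicts position t + (k + 1): drop the last k+1 logit
        # positions and the first k+1 targets so the pairs line up.
        aligned_logits = logits[:, : -(k + 1), :]
        aligned_labels = targets[:, k + 1 :]
        ce = F.cross_entropy(
            aligned_logits.reshape(-1, aligned_logits.size(-1)),
            aligned_labels.reshape(-1),
        )
        w = 1.0 / (k + 1)  # inverse-distance weighting: nearer steps count more
        total = total + w * ce
        weight_sum += w
    return total / weight_sum
```

Normalizing by the weight sum keeps the loss scale comparable as K varies, which simplifies balancing it against the reconstruction objective.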
Layer-wise cosine alignment and memory-bank contrastive loss ensure audio and text representations match inside the LLM, preserving linguistic content.
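The contrastive part can be sketched as InfoNCE where each audio representation is pulled toward its paired text representation, with a memory bank of past text embeddings serving as extra negatives. All names and the temperature value below are illustrative assumptions, not the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def memory_bank_infonce(audio_emb, text_emb, bank, temperature=0.07):
    """Hypothetical sketch: InfoNCE with memory-bank negatives.

    audio_emb, text_emb: (batch, dim) paired representations.
    bank:                (M, dim) queue of past text embeddings (negatives).
    """
    audio_emb = F.normalize(audio_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    bank = F.normalize(bank, dim=-1)
    pos = (audio_emb * text_emb).sum(-1, keepdim=True)        # (B, 1) cosine sims
    neg = audio_emb @ bank.t()                                # (B, M) cosine sims
    logits = torch.cat([pos, neg], dim=1) / temperature
    labels = torch.zeros(audio_emb.size(0), dtype=torch.long) # positive at index 0
    return F.cross_entropy(logits, labels)
```

The bank lets the loss see many more negatives than a single batch provides, which is the usual motivation for memory-bank contrastive training.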
Hard Gumbel-Softmax keeps discrete tokens in the forward pass while providing smooth gradients backward, enabling end-to-end codec encoder optimization.
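The straight-through behavior is available directly in PyTorch via `torch.nn.functional.gumbel_softmax` with `hard=True`: the forward pass emits an exact one-hot selection, while the backward pass uses the soft relaxation. The codebook shapes below are illustrative:

```python
import torch
import torch.nn.functional as F

# Hard Gumbel-Softmax: discrete one-hot in the forward pass,
# smooth gradients in the backward pass (straight-through estimator).
logits = torch.randn(2, 8, requires_grad=True)   # (batch, codebook_size)
one_hot = F.gumbel_softmax(logits, tau=1.0, hard=True)

codebook = torch.randn(8, 16)                    # (codebook_size, code_dim)
quantized = one_hot @ codebook                   # pick codes via the hard one-hot
quantized.sum().backward()                       # gradients reach the encoder logits

assert torch.allclose(one_hot.sum(dim=-1), torch.ones(2))  # one code per row
assert logits.grad is not None
```

Because gradients flow through the soft distribution, the codec encoder can be optimized end-to-end against the LM-facing losses without changing the discrete interface the LLM consumes.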
Codec-SUPERB-tiny examples resynthesized with LLM-Codec and the paper's baselines: AUV, BigCodec, UniCodec, and WavTokenizer-L. Each domain contains five examples.
Reconstruction quality on Codec-SUPERB-tiny (top) and speech coherence on the SALMon benchmark (bottom).
| Model | Mel ↓ | STFT ↓ | PESQ ↑ | STOI ↑ |
|---|---|---|---|---|
| BigCodec | 0.810 | 1.718 | 2.208 | 0.877 |
| UniCodec | 0.830 | 1.824 | 2.022 | 0.851 |
| WavTokenizer-M | 0.904 | 1.846 | 1.843 | 0.823 |
| AUV (base) | 0.762 | 1.648 | 2.094 | 0.850 |
| LLM-Codec | 0.683 | 1.507 | 2.147 | 0.858 |
| Model | Speaker ↑ | Gender ↑ | RIR ↑ | BG-All ↑ | Average ↑ |
|---|---|---|---|---|---|
| WavTokenizer-L | 49.0 | 53.0 | 39.0 | 53.0 | 49.5 |
| BigCodec | 49.5 | 49.0 | 47.0 | 47.0 | 49.5 |
| AUV (3 ep) | 53.0 | 54.5 | 40.5 | 46.5 | 48.4 |
| LLM-Codec (3 ep) | 72.0 | 71.5 | 64.0 | 57.0 | 62.3 |
@article{chung2026llm,
title={LLM-Codec: Neural Audio Codec Meets Language Model Objectives},
author={Chung, Ho-Lam and Chen, Yiming and Lee, Hung-yi},
journal={arXiv preprint arXiv:2604.17852},
year={2026}
}