Multimodal LLMs on top of LLaMA 3

Research works based on LLaMA 3 (Dubey et al., 2024) have been popping up since its release in July. I have been focusing on multimodal papers built on LLaMA 3 for my project. One of the primary papers I am studying right now is LLaMA-Omni (Fang et al., 2024). It adds the speech modality on top of the text already in the Llama-3.1-8B Instruct model. Their work fills the void left by GPT-4o's (OpenAI et al., 2024) end-to-end speech interaction in the sphere of open-source LLMs. In the simplest terms, they wanted the model to accept speech as input and generate speech as output.

The authors use a novel technique to achieve this goal. First, the speech \(X^s\) passes through a frozen speech encoder, in this case Whisper V3 (Radford et al., 2023). Whisper takes 30s frames of log-mel spectrogram and outputs a hidden representation \(\vec{H} = [\vec{h}_1, \ldots, \vec{h}_N]\). This \(\vec{H}\) is downsampled by concatenating every \(k\) consecutive column vectors, which gives \(\vec{H}'\) with length \(\frac{N}{k}\) instead of \(N\). Finally, this downsampled representation goes through a speech adaptor (a typical feed-forward neural network) that maps it into the LLM's embedding space. The text and speech instructions are combined for the LLM, where the text is fixed for all samples and only the speech varies. You can see the template below.

You are a helpful language and speech assistant. You are able to understand the speech content that the user provides, and assist the user with a variety of tasks using natural language.
<speech>
Please answer the questions in the user’s input speech.
LLaMA-Omni architecture (Fang et al., 2024)
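
To make the downsampling and the adaptor step concrete, here is a minimal PyTorch sketch. The sizes (`d_whisper=1280`, `d_llm=4096`) and the factor `k=5` are my own illustrative assumptions, not values taken from the LLaMA-Omni code.

```python
import torch
import torch.nn as nn

class SpeechAdaptor(nn.Module):
    """Sketch of the downsample-then-project step: concatenate every k
    Whisper frames, then map them to the LLM embedding space with a
    plain feed-forward network. Sizes here are illustrative."""

    def __init__(self, d_whisper=1280, d_llm=4096, k=5, d_hidden=2048):
        super().__init__()
        self.k = k
        self.mlp = nn.Sequential(
            nn.Linear(d_whisper * k, d_hidden),
            nn.ReLU(),
            nn.Linear(d_hidden, d_llm),
        )

    def forward(self, h):            # h: (batch, N, d_whisper) from the Whisper encoder
        b, n, d = h.shape
        n = (n // self.k) * self.k   # drop leftover frames so N is divisible by k
        h = h[:, :n, :]
        # Concatenate every k consecutive frames -> (batch, N/k, k * d_whisper)
        h = h.reshape(b, n // self.k, self.k * d)
        return self.mlp(h)           # (batch, N/k, d_llm), ready to splice into the prompt

# Example: ~1500 Whisper frames for a 30 s clip become ~300 speech "tokens".
feats = torch.randn(1, 1500, 1280)
print(SpeechAdaptor()(feats).shape)  # torch.Size([1, 300, 4096])
```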

The <speech> placeholder is filled with \(\vec{H}'\). The LLM auto-regressively generates the output text, as all LLaMA models do, and is trained with cross-entropy. To generate speech directly, the output hidden states of the LLM are fed to a speech decoder as each token is predicted. Note that even though the output hidden states pass through the speech decoder sequentially, one token after another, no hidden state is shared between positions, making this a non-autoregressive technique. This makes generating the speech almost as fast as the text generation, differentiating this approach from a cascaded technique. The speech decoder produces discrete speech units, integers in the \([1, K]\) range, which a vocoder uses to produce audio. During inference, the system waits for a few speech units and then generates a partial audio chunk from them, thus producing streaming audio.
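
The streaming part is easier to see in code. Below is a highly simplified sketch with made-up names (`unit_head`, `CHUNK`, `vocoder`); the paper's actual speech decoder is a non-autoregressive transformer over upsampled hidden states, and one hidden state can map to several units. The core point is just that each position is classified independently and audio is emitted chunk by chunk.

```python
import torch
import torch.nn as nn

K = 1000     # size of the discrete speech-unit vocabulary (illustrative)
CHUNK = 40   # how many units to buffer before synthesising a partial waveform

# Maps one LLM hidden state to unit logits; a stand-in for the real decoder.
unit_head = nn.Linear(4096, K)

def stream_speech(llm_hidden_states, vocoder):
    """Yield partial waveforms while the LLM is still generating text tokens."""
    buffer = []
    for h_t in llm_hidden_states:               # h_t: (4096,), one per generated token
        # Each position is classified independently -- no state is carried over,
        # which is what makes the speech decoder non-autoregressive.
        unit = unit_head(h_t).argmax(-1).item()
        buffer.append(unit)
        if len(buffer) >= CHUNK:
            yield vocoder(torch.tensor(buffer[:CHUNK]))   # partial audio chunk
            buffer = buffer[CHUNK:]
    if buffer:                                   # flush whatever is left at the end
        yield vocoder(torch.tensor(buffer))
```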

LLaSM (Shu et al., 2023) is another model that understands speech + text and produces text-only output. The core of this paper revolves around aligning speech embeddings and text embeddings, which it does through a separate pre-training phase for a speech adaptor. Whisper (Radford et al., 2023) encodes the audio to produce speech embeddings; how the text embeddings are produced is not mentioned in the paper, but that seems like a trivial point in 2024.

LLaSM architecture (Shu et al., 2023)

During the pre-training, the encoder and the LLM are frozen. The adaptor is trained with an ASR task where the instruction is text, for example, “Transcribe the following speech into text”. There are many similar instructions in Chinese and English. The adaptor takes audio embeddings from Whisper, and its outputs are interleaved with the text embeddings before being fed to the LLM, as in the LLaMA-Omni paper; see the sketch below. This technique of interleaving is getting quite popular. The LLM and the adaptor are trained again in the next phase, cross-modal instruction fine-tuning, which is essentially multi-task training with cross-entropy. One of the primary achievements of this paper is creating the cross-modal instruction fine-tuning dataset.
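
As a rough sketch of what “interleaving” means here: the speech embeddings from the adaptor are spliced into the sequence of text embeddings where the speech placeholder sits, so the LLM sees one continuous sequence. The helper name and the reserved placeholder id below are my own; neither paper specifies them this way.

```python
import torch

def splice_speech_embeddings(prompt_ids, speech_embeds, embed_tokens, speech_token_id):
    """Build the LLM input by interleaving text and speech embeddings.

    prompt_ids:      (T,) token ids of the text template with one placeholder token
    speech_embeds:   (S, d_llm) outputs of the speech adaptor
    embed_tokens:    the LLM's input embedding layer
    speech_token_id: id reserved for the placeholder (a hypothetical choice here)
    """
    text_embeds = embed_tokens(prompt_ids)                     # (T, d_llm)
    pos = (prompt_ids == speech_token_id).nonzero()[0].item()  # locate the placeholder
    # Text before the placeholder, then the speech embeddings, then the rest.
    return torch.cat([text_embeds[:pos], speech_embeds, text_embeds[pos + 1:]], dim=0)
```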

Similarities

I am noticing that recent works are moving away from the cascaded approach of first transcribing the speech with an ASR system and then applying TTS to produce audio. Interleaved embeddings are the next hot topic in this field.

References

2024

  1. The Llama 3 Herd of Models
    Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, and 532 more authors
    Aug 2024
    arXiv:2407.21783
  2. LLaMA-Omni: Seamless Speech Interaction with Large Language Models
    Qingkai Fang, Shoutao Guo, Yan Zhou, and 3 more authors
    Sep 2024
    arXiv:2409.06666
  3. GPT-4o System Card
    OpenAI, Aaron Hurst, Adam Lerer, and 416 more authors
    Oct 2024
    arXiv:2410.21276

2023

  1. Robust speech recognition via large-scale weak supervision
    Alec Radford, Jong Wook Kim, Tao Xu, and 3 more authors
    In Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA, Oct 2023
  2. LLaSM: Large Language and Speech Model
    Yu Shu, Siwei Dong, Guangyao Chen, and 5 more authors
    Sep 2023
    arXiv:2308.15930


