Reading Note: "Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming"
Xie et al. "Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming". arXiv preprint arXiv:2408.16725 (2024).
Motivations:
Human-computer interaction calls for models that reason directly in the audio modality and produce output in a streaming fashion.
Previous works typically rely on external TTS systems for speech synthesis, which introduces latency.
Contributions:
Mini-Omni: The first audio-based end-to-end open-source conversational model for real-time speech interaction.
A text-instructed speech generation method, along with batch-parallel strategies during inference (see the sketch after this list).
A VoiceAssistant-400K dataset generated by GPT-4o for speech SFT (most QA datasets contain mixed code or overly lengthy text, rendering them unsuitable for speech models).
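A conceptual sketch of the batch-parallel idea (shapes and the forward interface are assumed here, not taken from the released code): the same audio question is decoded twice in one batch, item 0 producing the text-only answer and item 1 the audio answer; at each step the text token chosen by item 0 is shared with item 1's text layer, so audio generation is conditioned on the stronger text-based reasoning.

```python
import torch

@torch.no_grad()
def batch_parallel_step(model, text_ids, audio_ids):
    """One decoding step of the batch-parallel strategy (conceptual sketch).

    text_ids:  (2, t)     text-layer tokens; row 0 = text-only item, row 1 = audio item
    audio_ids: (2, 7, t)  seven SNAC audio layers (only row 1 is actually used)
    Returns the next shared text token and the next 7 audio tokens.
    """
    # Hypothetical forward interface returning last-step logits per layer.
    text_logits, audio_logits = model(text_ids, audio_ids)  # (2, V_txt), (2, 7, V_aud)

    next_text = text_logits[0].argmax(-1)    # text is taken from the text-only item
    next_audio = audio_logits[1].argmax(-1)  # audio is taken from the audio item, shape (7,)
    return next_text, next_audio
```

In the decoding loop, `next_text` would be appended to both rows of `text_ids` and `next_audio` to row 1 of `audio_ids` before the next step, which is how the audio stream keeps following the text-only item's reasoning.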
Challenges:
Complexity of Audio Reasoning: Direct training for audio modality reasoning often results in incoherent outputs from the model.
Model Complexity: Incorporating additional modules for speech input and output.
Difficulty in Modality Alignment: Reasoning ability developed for text is difficult to transfer to audio.
Resource Consumption: Adapting model capability from text to speech modality requires converting all data labels into audio and retraining.
Characteristics:
Adopting existing methods for discretizing speech tokens and the simplest model architecture, making follow-up research easier.
Using only a 0.5B model and a limited amount of synthesized audio data.
Including various other audio-text functionalities such as ASR and TTS.
Core Idea: The primary modality alignment tasks are handled during adapter training, thus the original model’s capabilities are maximally preserved.
Stage 1: Modality Alignment
Goal: Enhancing the text model’s ability to understand (i.e., speech recognition) and generate (i.e., speech synthesis) speech, aligning the speech modality with the text model’s input.
Data: From speech recognition and speech synthesis tasks.
Stage 2: Adaption Training
Goal: Training the model’s text capabilities when given audio inputs, as audio output is simply synthesized from text.
Data: From speech recognition, spoken question answering, and text response tasks.
Stage 3: Multi-Modal Fine-Tuning
Goal: Fine-tuning the entire model.
Data: Comprehensive data.
(Additional Final Stage: Annealing and fine-tuning with Voice QA.)
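A minimal PyTorch-style sketch of the freeze/unfreeze schedule across the three stages above; the module names (llm, asr_adapter, tts_adapter) and the assumption that the adapters stay frozen during stage 2 are illustrative, not taken from the released code.

```python
import torch.nn as nn

def set_trainable(module: nn.Module, trainable: bool) -> None:
    for p in module.parameters():
        p.requires_grad = trainable

def configure_stage(model: nn.Module, stage: int) -> None:
    if stage == 1:
        # Modality alignment: only the speech adapters learn; the language
        # model stays frozen so its text capabilities are preserved.
        set_trainable(model.llm, False)
        set_trainable(model.asr_adapter, True)
        set_trainable(model.tts_adapter, True)
    elif stage == 2:
        # Adaption training: the language model learns to answer in text when
        # given audio inputs (adapters assumed frozen here).
        set_trainable(model.llm, True)
        set_trainable(model.asr_adapter, False)
        set_trainable(model.tts_adapter, False)
    elif stage == 3:
        # Multi-modal fine-tuning: the entire model is trained.
        set_trainable(model.llm, True)
        set_trainable(model.asr_adapter, True)
        set_trainable(model.tts_adapter, True)
```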
Given the eight parallel output sequences (one text layer plus seven SNAC audio-token layers), the input must also consist of eight sequences.
The special token <answer> is placed at different positions to guide the model toward the desired output modality.
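As an illustration of the eight-sequence layout, here is a toy sketch; the token IDs, the per-layer stagger, and the exact <answer> placement are placeholders, not the real vocabulary or delay pattern.

```python
import torch

PAD, BOS, ANSWER = 0, 1, 2   # placeholder special-token IDs
N_AUDIO_LAYERS = 7           # SNAC: 1 + 2 + 4 codes per frame, flattened into 7 layers

def build_parallel_input(text_prompt: torch.Tensor, total_len: int) -> torch.Tensor:
    """Return an (8, total_len) token matrix: row 0 is the text layer,
    rows 1..7 are the audio layers (illustrative layout only)."""
    seq = torch.full((1 + N_AUDIO_LAYERS, total_len), PAD, dtype=torch.long)
    t = text_prompt.numel()
    seq[0, :t] = text_prompt
    seq[0, t] = ANSWER                 # <answer> marks where the response begins
    for layer in range(N_AUDIO_LAYERS):
        # Stagger the audio layers so each starts one step after the previous,
        # keeping audio prediction behind the already-generated text.
        seq[1 + layer, t + 1 + layer] = BOS
    return seq

# Example: an assumed 5-token text question laid out on a 32-step canvas.
layout = build_parallel_input(torch.tensor([11, 12, 13, 14, 15]), total_len=32)
print(layout.shape)   # torch.Size([8, 32])
```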