Reading Note: "Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming"

Xie et al. "Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming". arXiv preprint arXiv:2408.16725 (2024).

Intro

Motivations:

  • Human-computer interaction requires the model to reason directly over the audio modality and to generate output in a streaming fashion.

  • Previous works typically rely on separate TTS systems for speech synthesis, which introduces extra latency.

Contributions:

  • Mini-Omni: The first audio-based end-to-end open-source conversational model for real-time speech interaction.

  • A text-instructed speech generation method, along with batch-parallel strategies during inference.

  • A VoiceAssistant-400K dataset generated by GPT-4o for speech SFT (most QA datasets contain mixed code or overly lengthy text, rendering them unsuitable for speech models).

Challenges:

  • Complexity of Audio Reasoning: Direct training for audio modality reasoning often results in incoherent outputs from the model.

  • Model Complexity: Incorporating additional modules for speech input and output increases the overall complexity of the model.

  • Difficulty in Modality Alignment: Reasoning ability developed for text is difficult to transfer to audio.

  • Resource Consumption: Adapting model capability from text to speech modality requires converting all data labels into audio and retraining.

Characteristics:

  • Adapting currently available methods for discretizing speech tokens and employing the simplest model architecture for easier follow-up research.

  • Using only a 0.5B model and a limited amount of synthesized audio data.

  • Including various other audio-text functionalities such as ASR and TTS.

"Any Model Can Talk": Three-Stage Training

Core Idea: The primary modality-alignment work is handled during adapter training, so the original model's capabilities are maximally preserved.

Stage 1: Modality Alignment

  • Goal: Enhancing the text model's ability to understand speech (i.e., speech recognition) and generate speech (i.e., speech synthesis), aligning the speech modality with the text model's input.

  • Data: From speech recognition and speech synthesis tasks.

Stage 2: Adaptation Training

  • Goal: Training the model's text-response capability when given audio inputs, since the audio output is simply synthesized from the text response.

  • Data: From speech recognition, spoken question answering, and text response tasks.

Stage 3: Multi-Modal Fine-Tuning

  • Goal: Fine-tuning the entire model.

  • Data: Comprehensive data covering all of the above tasks.

(Additional Final Stage: Annealing and fine-tuning with Voice QA.)
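
A minimal sketch of how this stage-wise schedule could be expressed as a freezing policy in PyTorch. The module names (`audio_adapter`, `audio_head`, `core_lm`) and the exact set of parameters unfrozen in each stage are assumptions based on my reading of the paper, not the released implementation.

```python
import torch.nn as nn

def set_trainable(module: nn.Module, trainable: bool) -> None:
    """Freeze or unfreeze every parameter of a module."""
    for p in module.parameters():
        p.requires_grad = trainable

def configure_stage(model: nn.Module, stage: int) -> None:
    """Stage-wise freezing schedule for the "Any Model Can Talk" recipe.

    Assumed (hypothetical) submodules:
      model.audio_adapter -- maps discretized speech tokens into the LM input space
      model.audio_head    -- predicts speech tokens from LM hidden states
      model.core_lm       -- the pretrained 0.5B text language model
    """
    if stage == 1:
        # Modality alignment: train only the adapters on ASR/TTS data,
        # keeping the core language model frozen.
        set_trainable(model, False)
        set_trainable(model.audio_adapter, True)
        set_trainable(model.audio_head, True)
    elif stage == 2:
        # Adaptation training: train the text-response path on audio inputs.
        # (Which parts stay frozen in this stage is an assumption.)
        set_trainable(model, False)
        set_trainable(model.core_lm, True)
    else:
        # Stage 3: multi-modal fine-tuning of the entire model.
        set_trainable(model, True)
```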

Batch Parallel Decoding
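
As I understand the paper, batch-parallel decoding runs a batch of two inference streams for the same audio query: one stream generates both text and audio responses, while the other generates a text-only response, and at each step the text token from the text-only stream is embedded into the text positions of the audio stream, so audio generation is conditioned on the model's stronger text reasoning. The sketch below is a rough illustration under that reading; every model call (`init_batch`, `decode_step`, `advance`) is a hypothetical placeholder, not the released decoding code.

```python
def batch_parallel_decode(model, audio_query, max_steps=512):
    """Toy illustration of batch-parallel decoding (batch size 2).

    Stream 0: generates text + audio tokens (the audio actually streamed out).
    Stream 1: generates text only (the stronger text-reasoning path).
    At each step, stream 1's text token overwrites stream 0's text token,
    so the audio layers are conditioned on the higher-quality text.
    """
    states = model.init_batch([audio_query, audio_query])  # hypothetical: batch of two identical queries
    text_out, audio_out = [], []
    for _ in range(max_steps):
        step = model.decode_step(states)        # hypothetical: one parallel decoding step
        text_token = step.text_tokens[1]        # take the text token from the text-only stream
        step.text_tokens[0] = text_token        # embed it into the audio stream's text layer
        audio_out.append(step.audio_tokens[0])  # stream audio tokens from stream 0
        text_out.append(text_token)
        states = model.advance(states, step)    # hypothetical: feed the chosen tokens back
        if text_token == model.eos_token_id:
            break
    return text_out, audio_out
```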

Model Input

  • Given that the model produces eight parallel output sequences, the input must also be organized into eight parallel sequences.

  • The special token <answer> is placed at different positions in these sequences to guide the model's multi-modal output (see the sketch below).
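
A toy sketch of how eight aligned input sequences might be assembled, assuming seven audio-token layers plus one text-token layer; the layer count, the special-token ids, the padding scheme, and the exact placement of <answer> are all assumptions for illustration, not the paper's actual preprocessing code.

```python
from typing import List

PAD_ID, ANSWER_ID = 0, 1     # hypothetical special-token ids
N_AUDIO_LAYERS = 7           # assumed: seven audio-token layers + one text layer = eight sequences

def build_parallel_input(text_ids: List[int],
                         audio_layers: List[List[int]],
                         answer_in_text: bool) -> List[List[int]]:
    """Assemble eight aligned sequences: audio layers 0..6 plus the text layer.

    `answer_in_text` chooses where the <answer> token goes, which (as noted
    above) guides the modality of the model's response. Padding and alignment
    here are purely illustrative.
    """
    assert len(audio_layers) == N_AUDIO_LAYERS
    length = 1 + max(len(text_ids), *(len(seq) for seq in audio_layers))
    sequences = []
    for seq in audio_layers:
        seq = list(seq)
        if not answer_in_text:
            seq.append(ANSWER_ID)            # request an audio-side answer
        sequences.append(seq + [PAD_ID] * (length - len(seq)))
    text_seq = list(text_ids)
    if answer_in_text:
        text_seq.append(ANSWER_ID)           # request a text-side answer
    sequences.append(text_seq + [PAD_ID] * (length - len(text_seq)))
    return sequences

# Example: an audio question expecting a spoken (audio) answer.
dummy_audio = [[10, 11, 12] for _ in range(N_AUDIO_LAYERS)]
batch = build_parallel_input(text_ids=[], audio_layers=dummy_audio, answer_in_text=False)
assert len(batch) == 8 and all(len(s) == len(batch[0]) for s in batch)
```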

Dataset