# Reading Note: "Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming"

<figure><img src="/files/9gL1vYBtFov6W2Gv0AHr" alt="" width="563"><figcaption></figcaption></figure>

## Intro

***Motivations***:

* Natural human-computer interaction requires the model to <mark style="color:orange;">**reason directly over the audio modality**</mark> and generate its output in a streaming fashion.
* Previous works typically depend on <mark style="color:orange;">**extra TTS systems**</mark> for speech synthesis, resulting in <mark style="color:orange;">**latency**</mark>.

***Contributions***:

* Mini-Omni: The <mark style="color:green;">**first audio-based end-to-end open-source**</mark> conversational model for <mark style="color:green;">**real-time**</mark> speech interaction.
* A <mark style="color:blue;">text-instructed speech generation</mark> method, along with <mark style="color:blue;">batch-parallel</mark> strategies during inference.
* A <mark style="color:blue;">VoiceAssistant-400K dataset</mark> generated by GPT-4o for speech SFT (most QA datasets contain <mark style="background-color:yellow;">mixed code or overly lengthy text</mark>, rendering them <mark style="background-color:yellow;">unsuitable for speech</mark> models).

***Challenges***:

* *<mark style="color:blue;">Complexity of Audio Reasoning</mark>*: <mark style="color:orange;">**Direct training for audio modality**</mark> <mark style="color:orange;">**reasoning**</mark> often results in <mark style="color:orange;">**incoherent outputs**</mark> from the model.
* *<mark style="color:blue;">Model Complexity</mark>*: <mark style="color:orange;">**Incorporating additional modules**</mark> for speech input and output.
* *<mark style="color:blue;">Difficulty in Modality Alignment</mark>*: Reasoning ability developed for text is difficult to <mark style="color:orange;">**transfer**</mark> to audio.
* *<mark style="color:blue;">Resource Consumption</mark>*: Adapting model capability from the text to the speech modality requires <mark style="color:orange;">**converting all data labels into audio**</mark> and retraining.

***Characteristics***:

* Adopting <mark style="background-color:yellow;">currently available methods</mark> for <mark style="color:blue;">discretizing speech tokens</mark> and employing the <mark style="background-color:yellow;">simplest model architecture</mark> for easier follow-up research.
* Using only a <mark style="color:orange;">**0.5B model**</mark> and a limited amount of synthesized audio data.
* Including various other audio-text functionalities such as <mark style="color:blue;">ASR</mark> and <mark style="color:blue;">TTS</mark>.

## "Any Model Can Talk": Three-Stage Training

***Core Idea***: The <mark style="color:orange;">**primary modality alignment**</mark> tasks are handled during <mark style="color:orange;">**adapter training**</mark>, thus the <mark style="color:green;">**original model’s capabilities are maximally preserved**</mark>.

<figure><img src="/files/ZtbpWbyx8S3IAOPh6Kbv" alt="" width="563"><figcaption></figcaption></figure>

**Stage 1: Modality Alignment**

* ***Goal***: Enhancing the <mark style="background-color:yellow;">text model’s ability</mark> to understand (i.e., <mark style="color:orange;">**speech recognition**</mark>) and generate (i.e., <mark style="color:orange;">**speech synthesis**</mark>) speech, aligning the speech modality with the text model’s input.
* ***Data***: From <mark style="color:blue;">speech recognition</mark> and <mark style="color:blue;">speech synthesis</mark> tasks.

**Stage 2: Adaption Training**

* ***Goal***: Training the model’s <mark style="color:orange;">**text capabilities**</mark> when <mark style="color:orange;">**given audio inputs**</mark>, as <mark style="background-color:yellow;">audio output is simply synthesized from text</mark>.
* ***Data***: From <mark style="color:blue;">speech recognition</mark>, <mark style="color:blue;">spoken question answering</mark>, and <mark style="color:blue;">text response</mark> tasks.

**Stage 3: Multi-Modal Fine-Tuning**

* ***Goal***: Fine-tuning the <mark style="color:orange;">**entire model**</mark>.
* ***Data***: <mark style="color:orange;">**Comprehensive**</mark> data.

(Additional Final Stage: <mark style="color:orange;">**Annealing**</mark> and fine-tuning with <mark style="color:blue;">Voice QA</mark>.)
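
The three stages above can be summarized as a small schedule of "what is trainable" and "which tasks supply data". The sketch below is only an illustration under stated assumptions: the component names (`audio_adapter`, `speech_head`, `language_model`) and task labels are hypothetical, and the choice to freeze the adapters in Stage 2 is my reading of the stage goals rather than something these notes spell out.

```python
from dataclasses import dataclass

# Hypothetical component names: the notes only distinguish "adapters"
# from the "original model", so this split is an assumption.
COMPONENTS = ("audio_adapter", "language_model", "speech_head")


@dataclass(frozen=True)
class Stage:
    name: str
    trainable: frozenset  # components that receive gradients in this stage
    tasks: tuple          # task families that supply training data


STAGES = (
    # Stage 1: only the adapters learn, so the text model stays intact.
    Stage("modality_alignment",
          frozenset({"audio_adapter", "speech_head"}),
          ("speech_recognition", "speech_synthesis")),
    # Stage 2 (assumed): adapters frozen, text ability trained on audio input.
    Stage("adaption_training",
          frozenset({"language_model"}),
          ("speech_recognition", "spoken_qa", "text_response")),
    # Stage 3: everything is fine-tuned on the comprehensive data mix.
    Stage("multimodal_finetuning",
          frozenset(COMPONENTS),
          ("comprehensive",)),
)


def frozen_components(stage: Stage) -> list:
    """Return the components that stay frozen during the given stage."""
    return sorted(set(COMPONENTS) - stage.trainable)


if __name__ == "__main__":
    for stage in STAGES:
        print(stage.name, "frozen:", frozen_components(stage))
```

In a real training loop, the `trainable` set would decide which parameter groups have gradients enabled before each stage starts.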

## Batch Parallel Decoding

<figure><img src="/files/MKQ1a2k55G3VCfu0hSmB" alt=""><figcaption></figcaption></figure>

## Model Input

* Given the eight parallel output sequences, the input <mark style="color:orange;">**also requires eight sequences**</mark>.
* Placing the special token <mark style="color:green;">**\<answer>**</mark> in different positions to <mark style="color:orange;">**guide the model for multi-modal output**</mark>.
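
To make the eight-sequence input more concrete, here is a minimal sketch of how such an input could be laid out. It assumes the eight sequences are seven audio-codebook rows plus one text row, uses string placeholders instead of real token ids, and pads unused rows with a hypothetical `<pad>` token; none of these details are confirmed by the notes above, so treat it as an illustration of the idea rather than the paper's exact format.

```python
NUM_AUDIO_LAYERS = 7   # assumption: 7 audio rows + 1 text row = 8 sequences
PAD = "<pad>"          # hypothetical padding token
ANSWER = "<answer>"    # special token that marks where output should start


def build_rows(audio_prompt=None, text_prompt=None, audio_out=True, text_out=True):
    """Build the eight parallel input rows for one prompt.

    audio_prompt: list of frames, each a list of NUM_AUDIO_LAYERS tokens.
    text_prompt:  list of text tokens.
    Exactly one of the two prompts must be given in this simplified sketch.
    """
    if (audio_prompt is None) == (text_prompt is None):
        raise ValueError("provide exactly one of audio_prompt or text_prompt")

    length = len(audio_prompt) if audio_prompt is not None else len(text_prompt)
    rows = []
    for layer in range(NUM_AUDIO_LAYERS):
        if audio_prompt is not None:
            row = [frame[layer] for frame in audio_prompt]
        else:
            row = [PAD] * length
        # <answer> in the audio rows asks the model for speech output.
        row.append(ANSWER if audio_out else PAD)
        rows.append(row)

    text_row = list(text_prompt) if text_prompt is not None else [PAD] * length
    # <answer> in the text row asks the model for text output.
    text_row.append(ANSWER if text_out else PAD)
    rows.append(text_row)
    return rows


if __name__ == "__main__":
    # Text question, answer requested in both speech and text.
    for row in build_rows(text_prompt=["hi", "there"]):
        print(row)
```

Under this layout, a speech-recognition style request would pass an `audio_prompt` with `audio_out=False`, so that `<answer>` appears only in the text row.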

<figure><img src="/files/6SPoPdWPqsiDXUnDHuSH" alt=""><figcaption></figcaption></figure>

## Dataset

<figure><img src="/files/cu9d7eWgihxqN1loExkM" alt="" width="563"><figcaption></figcaption></figure>

