# xtalk
**Repository Path**: jin1258804025/xtalk
## Basic Information
- **Project Name**: xtalk
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: Apache-2.0
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No
## Statistics
- **Stars**: 0
- **Forks**: 0
- **Created**: 2026-01-11
- **Last Updated**: 2026-01-11
## Categories & Tags
**Categories**: Uncategorized
**Tags**: None
## README
# X-Talk
[Homepage](https://xtalk.sjtuxlance.com/)
[Paper](https://arxiv.org/abs/2512.18706)
[License: Apache-2.0](https://opensource.org/licenses/Apache-2.0)
> ⚠️ X-Talk is in active prototyping. Interfaces and functions are subject to change. We will try to keep interfaces stable.
X-Talk is an open-source full-duplex cascaded spoken dialogue system framework featuring:
- ⚡ **Low-Latency, Interruptible, Human-Like Speech Interaction**
  - Speech flow is optimized to deliver **impressively low latency**
  - Enables **natural user interruption** during interaction
  - **Paralinguistic information** (e.g. environment noise, emotion) is encoded in parallel to support in-depth understanding and empathy
- 🧪 **Researcher Friendly**
  - **New models and relevant logic** can be added [within one Python script](#introduce-a-new-model) and seamlessly integrated with the default pipeline.
- 🧩 **Super Lightweight**
  - The framework backend is **pure Python**; nothing to build or install beyond `pip install`.
- 🏭 **Production Ready**
  - **Concurrency** is ensured through an asynchronous backend
  - WebSocket-based implementation enables deployment **from web browsers to edge devices**.
## 📚 Contents
- [Demo](#demo)
- [Installation](#installation)
- [Quickstart](#quickstart)
- [Tutorial](#tutorial)
  - [Start the Server](#start-the-server)
    - [Text Embedding](#text-embedding)
    - [Tool Use](#tool-use)
  - [Config the Server](#config-the-server)
    - [Sample Config for Fully Local Deployment](#sample-config-for-fully-local-deployment)
  - [Introduce a New Model](#introduce-a-new-model)
    - [Recipe](#recipe)
  - [Customize the Service](#customize-the-service)
- [Design Philosophy](#design-philosophy)
- [Supported Models](#supported-models)
- [Contributing](#contributing)
- [Acknowledgements](#acknowledgements)
- [License](#license)
## 🎬 Demo
### Online Demo
[Demo Link](https://xtalk.sjtuxlance.com/)
This demo runs on a 4090 cluster with 8-bit quantized *SenseVoice* as the speech recognizer, *IndexTTS 1.5* as the speech generator, and 4-bit quantized *Qwen3-30B-A3B* as the language model. Though intelligence is limited by the relatively small language model, it demonstrates the system's low latency.
### Demo Videos
The tour guiding demos are conducted with *Qwen3-Next-80B-A3B-Instruct* as language model, and the other eight demos are aligned with the online demo setting. Larger language models are more intelligent at the cost of latency.
## 🛠️ Installation
```bash
pip install git+https://github.com/xcc-zach/xtalk.git@main
```
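To confirm the installation, here is a quick sanity check (a sketch; it simply imports the `Xtalk` entry point used throughout this README):

```python
# Sanity check: the top-level entry point used in the tutorial below should import cleanly
from xtalk import Xtalk

print(Xtalk)
```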
## 🚀 Quickstart
We will use APIs from AliCloud to demonstrate the basic capability of **X-Talk**.
First, install dependencies for AliCloud and server script:
```bash
pip install "xtalk[ali] @ git+https://github.com/xcc-zach/xtalk.git@main"
pip install jinja2 'uvicorn[standard]'
```
Then, obtain an API key from the [AliCloud Bailian Platform](https://bailian.console.aliyun.com/?tab=model#/api-key). We will be using the free-tier service from AliCloud.
> Online services may be unstable and have high latency. We recommend using locally deployed models for a better user experience. See the [server config tutorial](#config-the-server) and [supported models](#supported-models) for details.
After that, create a JSON config specifying the models to use, and fill in the empty `api_key` fields with the key you obtained:
```json
{
  "asr": {
    "type": "Qwen3ASRFlashRealtime",
    "params": {
      "api_key": ""
    }
  },
  "llm_agent": {
    "type": "DefaultAgent",
    "params": {
      "model": {
        "api_key": "",
        "model": "qwen-plus-2025-12-01",
        "base_url": "https://dashscope.aliyuncs.com/compatible-mode/v1"
      }
    }
  },
  "tts": {
    "type": "CosyVoice",
    "params": {
      "api_key": ""
    }
  }
}
```
> If you find *Qwen3ASRFlashRealtime* not working properly, you can use `"asr": "SenseVoiceSmallLocal",` instead, which is a ~1GB local model. You can also try the local speech generation model *IndexTTS* ([setup tutorial](https://github.com/Ksuriuri/index-tts-vllm)):
> ```json
> "tts": {
>   "type": "IndexTTS",
>   "params": {
>     "port": 6006
>   }
> },
> ```
> If you want all models deployed locally, see [here](#sample-config-for-fully-local-deployment).
The next step is to compose the startup script. Since we also need to link the frontend webpage and scripts to get the demo working, a ready-made startup script is provided at `examples/sample_app/configurable_server.py`. We simply start the server with the config file (replace `.json` in the command below with the path to the config file we just created) and a custom port:
```bash
git clone https://github.com/xcc-zach/xtalk.git
cd xtalk
python examples/sample_app/configurable_server.py --port 7635 --config .json
```
Finally, our demo is ready at `http://localhost:7635`. View it in the browser!
## 📖 Tutorial
### Start the Server
> [!NOTE]
> See `examples/sample_app/configurable_server.py`, `frontend/src` and `examples/sample_app/templates` for details.
X-Talk keeps most models and execution on the server side; the client is responsible for interacting with the microphone, transmitting audio and WebSocket messages, and handling lightweight operations like Voice Activity Detection.
For the client side, you can start with the snippet in `examples/sample_app/templates/index.html` and track where `convo` is used to see how the frontend API is called.
The client-side API mainly comes from `frontend/src/js/index.js`; if you are interested, you can check the core code to see how different WebSocket messages are handled:
```javascript
switch (json.action) {
case 'queue_status': {...}
case 'queue_granted': {...}
...
}
```
We plan to improve the client-side API in the near future.
For the server side, the core logic is to connect an X-Talk instance to a WebSocket endpoint of a FastAPI instance:
```python
from fastapi import FastAPI, WebSocket
from xtalk import Xtalk

app = FastAPI(title="Xtalk Server")
xtalk_instance = Xtalk.from_config("path/to/config.json")

@app.websocket("/ws")
async def websocket_endpoint(websocket: WebSocket):
    await xtalk_instance.connect(websocket)
```
Then you can check `examples/sample_app/configurable_server.py` for how to mount client-side scripts and pages.
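If you are wiring this up yourself rather than using the ready-made script, a minimal way to serve the endpoint above is to run the app with `uvicorn` (a sketch only; it assumes the snippet above is saved as `server.py` and that `uvicorn[standard]` is installed as in the Quickstart):

```python
# run_server.py -- serve the FastAPI app defined above (file/module names are illustrative)
import uvicorn

if __name__ == "__main__":
    # "server:app" assumes the snippet above lives in server.py and exposes `app`
    uvicorn.run("server:app", host="0.0.0.0", port=7635)
```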
#### Text Embedding
> [!NOTE]
> See `examples/sample_app/configurable_server.py` and `frontend/src/js/index.js` for details.
X-Talk can understand uploaded documents through embedding search. To enable embedding, you need a `langchain_openai.OpenAIEmbeddings` entry in the config:
```json
"embeddings": {
"type": "OpenAIEmbeddings",
"params": {
"api_key": "",
"base_url": "",
"model": ""
}
},
```
Then you can fetch `text` and `session_id` from the client side and notify the X-Talk instance through `embed_text`:
```python
from fastapi import File, Form, HTTPException, UploadFile

@app.post("/api/upload")
async def upload_file(
    session_id: str = Form(...),
    file: UploadFile = File(...),
):
    # Check file type
    content_type = (file.content_type or "").lower()
    filename = (file.filename or "").lower()
    is_text = content_type.startswith("text/") if content_type else False
    if content_type and not is_text:
        raise HTTPException(status_code=400, detail="Only text files are supported.")
    # Read file content and embed
    text = (await file.read()).decode("utf-8", errors="ignore")
    await xtalk_instance.embed_text(session_id=session_id, text=text)
    return {"status": "ok"}
```
Note that the client side should save `session_id` and send it in the request. Search for `'session_info'` and `uploadFile` in `frontend/src/js/index.js` to see how `session_id` is saved and used.
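For a quick test of this endpoint without the web frontend, you can post a text file directly (a sketch using `requests`; the port follows the Quickstart, and the `session_id` value must match the id of a live session):

```python
# Sketch: exercise the /api/upload endpoint defined above
import requests

with open("notes.txt", "rb") as f:  # any plain-text file
    resp = requests.post(
        "http://localhost:7635/api/upload",
        data={"session_id": "your-session-id"},        # id issued to a live session
        files={"file": ("notes.txt", f, "text/plain")},
    )
print(resp.json())  # expected: {"status": "ok"}
```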
#### Tool Use
> [!NOTE]
> See `examples/sample_app/mental_consultant_server.py` for details.
X-Talk supports textual tool customization through `add_agent_tools`:
```python
xtalk_instance.add_agent_tools([build_mental_questionnaire_tool])
```
Here the tool should be a [LangChain tool](https://docs.langchain.com/oss/python/langchain/tools):
```python
from langchain.tools import tool

@tool
def search_database(query: str, limit: int = 10) -> str:
    """Search the customer database for records matching the query.

    Args:
        query: Search terms to look for
        limit: Maximum number of results to return
    """
    return f"Found {limit} results for '{query}'"
```
To maintain separate state for a tool in each agent, you can also use a tool factory that keeps internal state (see `build_mental_questionnaire_tool` in `examples/sample_app/mental_consultant_server.py`), as sketched below.
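As a rough illustration of that pattern (not the actual `build_mental_questionnaire_tool`), a factory can close over per-agent state and return a fresh tool on each call:

```python
from langchain.tools import tool

def build_counter_tool():
    """Illustrative tool factory: each call returns a tool with its own state."""
    state = {"count": 0}  # internal state, isolated per agent clone

    @tool
    def count_questions(question: str) -> str:
        """Record a question and report how many have been asked so far."""
        state["count"] += 1
        return f"Question #{state['count']} recorded: {question}"

    return count_questions

# Pass the factory itself so each agent gets an isolated tool instance
xtalk_instance.add_agent_tools([build_counter_tool])
```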
##### Built-in Tools
> [!NOTE]
> See source code under `src/xtalk/llm_agent/tools` for all built-in tools.
Built-in tools include agent-scope ones like `web_search` and `get_time`, and pipeline-control ones that adjust the emotion, timbre, and speed of speech. `DefaultAgent` registers the built-in tools by default.
> [!NOTE]
> In order to enable `web_search` tool, `SERPER_API_KEY` needs to be set. See [SerpAPI](https://serpapi.com/).
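One way to provide the key is to set the environment variable in your startup script before the X-Talk instance is created (a sketch; exporting it in your shell works equally well):

```python
import os

# Set before starting the server so the built-in web_search tool can authenticate
os.environ["SERPER_API_KEY"] = "your-serper-api-key"
```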
### Config the Server
As mentioned [before](#start-the-server), an X-Talk instance can be created from a JSON config, which customizes the models used and controls concurrency behavior.
For the model config, each entry should match the model's Python class name and init args. For example, the definition of `DefaultAgent` lies in `src/xtalk/llm_agent/default.py`:
```python
class DefaultAgent(Agent):
    def __init__(
        self,
        model: BaseChatModel | dict,
        system_prompt: str = _BASE_PROMPT,
        voice_names: Optional[List[str]] = None,
        emotions: Optional[List[str]] = None,
        tools: Optional[List[Union[BaseTool, Callable[[], BaseTool]]]] = None,
    ):
        ...
```
To match these init args, the config item should look like:
```json
"llm_agent": {
  "type": "DefaultAgent",
  "params": {
    "model": {
      "api_key": "none",
      "base_url": "http://127.0.0.1:8000/v1",
      "model": "cpatonn/Qwen3-30B-A3B-Instruct-2507-AWQ-4bit"
    },
    "voice_names": [
      "Man",
      "Woman",
      "Child"
    ],
    "emotions": [
      "happy",
      "angry",
      "sad",
      "fear",
      "disgust",
      "depressed",
      "surprised",
      "calm",
      "normal"
    ]
  }
},
```
Optional keys like `voice_names`, `emotions`, and `tools` (not supported in config yet) can be omitted.
See [Supported Models](#supported-models) below for the full list of model types (`slot`), their optional dependencies, and where their adapters live in the source code.
> [!NOTE]
> Most model implementations are client-side adaptors. You may need to start the model instance following the corresponding instructions.
Also, you can restrict concurrency through:
```json
"max_connections": 1
```
#### Sample Config for Fully Local Deployment
Below is an example config file for X-Talk when you want to have all models hosted locally. `SherpaOnnxASR` is used for speech recognition, and you can see [here](https://k2-fsa.github.io/sherpa/onnx/sense-voice/python-api.html#websocket-server-and-client-example) to set up the server. For LLM agent and embeddings, any model adhering to OpenAI protocol is fine. You should provide `api_key`, `base_url` and `model`. `IndexTTS` is used for speech generation, and see [here](https://github.com/Ksuriuri/index-tts-vllm) for server setup. Reference voices can be downloaded [here](https://github.com/xcc-zach/xtalk/releases/tag/audio-v0.1). The captioner is hard to set up, but you can refer to the tutorial [here](https://huggingface.co/Qwen/Qwen3-Omni-30B-A3B-Captioner). **Finally, remember to look into each model type in [Supported Models](#supported-models) for how to install the optional dependencies of X-Talk for that model**.
```json
{
  "asr": {
    "type": "SherpaOnnxASR",
    "params": {
      "port": 6006,
      "mode": "offline"
    }
  },
  "llm_agent": {
    "type": "DefaultAgent",
    "params": {
      "model": {
        "api_key": "none",
        "base_url": "http://127.0.0.1:8000/v1",
        "model": "cpatonn/Qwen3-30B-A3B-Instruct-2507-AWQ-4bit"
      },
      "voice_names": [
        "Man",
        "Woman",
        "Child"
      ],
      "emotions": [
        "happy",
        "angry",
        "sad",
        "fear",
        "disgust",
        "depressed",
        "surprised",
        "calm",
        "normal"
      ]
    }
  },
  "embeddings": {
    "type": "OpenAIEmbeddings",
    "params": {
      "api_key": "none",
      "base_url": "http://127.0.0.1:8002/v1",
      "model": "Qwen/Qwen3-Embedding-0.6B"
    }
  },
  "tts": {
    "type": "IndexTTS",
    "params": {
      "port": 11996,
      "voices": [
        {
          "name": "Man",
          "path": "ReferenceVoice/Man"
        },
        {
          "name": "Woman",
          "path": "ReferenceVoice/Woman"
        },
        {
          "name": "Child",
          "path": "ReferenceVoice/Child"
        }
      ]
    }
  },
  "speaker_encoder": "PyannoteSpeakerEncoder",
  "captioner": {
    "type": "Qwen3OmniCaptioner",
    "params": {
      "base_url": "http://localhost:8901/v1",
      "api_key": "none"
    }
  },
  "caption_rewriter": {
    "type": "DefaultCaptionRewriter",
    "params": {
      "model": {
        "api_key": "none",
        "model": "cpatonn/Qwen3-30B-A3B-Instruct-2507-AWQ-4bit",
        "base_url": "http://127.0.0.1:8000/v1"
      }
    }
  },
  "thought_rewriter": {
    "type": "DefaultThoughtRewriter",
    "params": {
      "model": {
        "api_key": "none",
        "model": "cpatonn/Qwen3-30B-A3B-Instruct-2507-AWQ-4bit",
        "base_url": "http://127.0.0.1:8000/v1"
      }
    }
  },
  "speech_speed_controller": "RubberbandSpeedController"
}
```
### Introduce a New Model
> [!NOTE]
> See `examples/sample_app/custom_model.py` and `examples/sample_app/echo_agent.py` for details.
> [!NOTE]
> See [Recipe](#recipe) for adding a model of existing types.
You may want to introduce a new model of an existing type (e.g. text-to-speech), or add a model of a new type (e.g. a model that handles backchannel). This can be achieved with `register_model_search_spec` before an `xtalk_instance` is created from the config:
```python
from pathlib import Path

from xtalk import Xtalk

Xtalk.register_model_search_spec(
    slot="llm_agent",
    spec=Path(__file__).parent / "echo_agent.py",
)
xtalk_instance = Xtalk.from_config(args.config)
```
Here `slot` matches the name of the corresponding init arg in `Pipeline`. You can check `Xtalk.MODEL_REGISTRY` for existing slots, or use a new slot to represent a new type of model (see `examples/sample_app/custom_service.py`, where `llm_output_refactor_model` is the new slot).
`spec` is the path to the model implementation; an example implementation in `echo_agent.py` looks like this:
```python
from xtalk.model_types import Agent

class EchoAgent(Agent):
    """A simple agent that echoes user input."""

    def generate(self, input) -> str:
        if isinstance(input, dict):
            return input["content"]
        return input

    def clone(self) -> "EchoAgent":
        return EchoAgent()
```
Then you can use the custom model in the config file:
```json
{
  "asr": {
    "type": "Qwen3ASRFlashRealtime",
    "params": {
      "api_key": ""
    }
  },
  "llm_agent": "EchoAgent",
  "tts": {
    "type": "CosyVoice",
    "params": {
      "api_key": ""
    }
  }
}
```
#### Recipe
Recipes for major model customization are listed below. You can read the source code for the interfaces of other model types. We will update these interfaces from time to time.
> [!NOTE]
> See `src/xtalk/model_types.py` for all available model types.
> [!IMPORTANT]
> X-Talk provides default asynchronous implementations for the sync versions, usually via `run_in_executor` (e.g. `async_recognize` for `recognize` in ASR). However, to achieve the best concurrency in production, we recommend implementing these async versions yourself.
##### New ASR (automatic speech recognition) Model
Your ASR class must inherit from `xtalk.speech.interfaces.ASR` and implement the following methods:
* **`recognize(audio: bytes) -> str`**
  * Recognize audio in a single pass.
* **`reset() -> None`**
  * Reset internal recognition state.
* **`clone() -> ASR`**
  * Return a new instance for use in new or concurrent sessions.
  * Sharing weights/connections (e.g., `_shared_model`) is allowed, but mutable state must not be shared.

The methods below are optional:

* **`recognize_stream(audio: bytes, *, is_final: bool = False) -> str`**
  * Interface for streaming incremental recognition.
  * Returns the current cumulative recognition result up to this point.
* **`async_recognize(audio: bytes)`**
* **`async def async_recognize_stream(self, audio: bytes, *, is_final: bool = False)`**
> [!IMPORTANT]
> Input for `recognize` and `recognize_stream` is PCM 16-bit mono 16 kHz raw bytes. You may need to perform the conversion yourself.
> [!NOTE]
> X-Talk has a default implementation of `recognize_stream` via a `MockStreamRecognizer`, so non-streaming ASR models require no extra work here.
> [!NOTE]
> You can refer to existing implementations (e.g., `src/xtalk/speech/asr/zipformer_local.py`) when building your own ASR class. We recommend deploying ASR as a separate service and invoking it via API calls within the ASR class, referencing the implementation of `src/xtalk/speech/asr/sherpa_onnx_asr.py`.
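Putting the required methods together, here is a minimal non-streaming ASR skeleton (a sketch only; the HTTP endpoint, class name, and use of `requests` are illustrative, not part of X-Talk):

```python
import requests  # illustrative transport; any client to your ASR service works

from xtalk.speech.interfaces import ASR

class MyHttpASR(ASR):
    """Illustrative ASR adapter that forwards audio to an external recognition service."""

    def __init__(self, url: str = "http://localhost:9000/recognize"):  # hypothetical endpoint
        self.url = url

    def recognize(self, audio: bytes) -> str:
        # `audio` is PCM 16-bit mono 16 kHz raw bytes (see the note above);
        # forward it to the service and return the transcript it produces.
        resp = requests.post(self.url, data=audio)
        return resp.text

    def reset(self) -> None:
        # Nothing to clear in this minimal, stateless example
        pass

    def clone(self) -> "MyHttpASR":
        # Fresh instance per session; the URL (read-only) can safely be shared
        return MyHttpASR(url=self.url)
```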
##### New TTS (text-to-speech) Model
Your new TTS class must inherit from `xtalk.speech.interfaces.TTS` and implement the following methods:
- **`synthesize(self, text: str) -> bytes`**
  - Input: the text to synthesize.
  - Output: raw audio bytes in PCM 16-bit, mono, 48000 Hz.
- **`clone(self) -> TTS`**
  - Return a new TTS instance. It should have isolated runtime state to avoid cross-session interference; it may share read-only resources if your backend supports that.

**Optional methods**

- **`synthesize_stream(self, text: str, **kwargs) -> Iterable[bytes]`**
  - If your backend supports streaming synthesis, you can override this method.
- **`set_voice(self, voice_names: list[str])`**
  - Works with the `TTSVoiceChange` event in `TTSManager` to switch voices via language model tool calls.
  - Usually `voice_names` contains a single element, which is the current behavior for tool call results; some TTS models may support mixing multiple reference voices, which is why `voice_names` is a list.
- **`set_emotion(self, emotion: str | list[float])`**
  - Works with the `TTSEmotionChange` event in `TTSManager` to switch emotions via language model tool calls.
  - The current tool call result only carries `emotion` as a `str`, but you may also want to accept a `list[float]` emotion vector for future use.
- **`async def async_synthesize(self, text: str, **kwargs: Any)`**
- **`async def async_synthesize_stream(self, text: str, **kwargs: Any)`**
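Correspondingly, a minimal TTS skeleton could look like the following (a sketch; the "synthesis" here is a placeholder returning silence, just to show the required output format and method shapes):

```python
from xtalk.speech.interfaces import TTS

class SilenceTTS(TTS):
    """Illustrative TTS adapter; a real one would call an actual synthesis backend."""

    SAMPLE_RATE = 48000  # required output: PCM 16-bit, mono, 48 kHz

    def synthesize(self, text: str) -> bytes:
        # Placeholder: one second of silence (2 bytes per 16-bit mono sample).
        # Replace this with a call to your backend that returns PCM bytes.
        return b"\x00\x00" * self.SAMPLE_RATE

    def clone(self) -> "SilenceTTS":
        # Fresh instance per session; this toy model holds no mutable state
        return SilenceTTS()
```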
### Customize the Service
> [!NOTE]
> See `examples/sample_app/custom_service.py` for details. A dummy `LLMOutputRefactorModel` is added to X-Talk to prepend `Assistant response: ` before model response text.
If you want to add new functionality, you can follow the procedures below.
First, you may want to define a new model. Here is a model that prepends some text to its input:
```python
# Define a custom model
class LLMOutputRefactorModel:
    def refactor(self, llm_output: str) -> str:
        # Custom logic to refactor LLM output
        return "Assistant response: " + llm_output

    # If the custom model has internal state, implement clone with concrete state
    def clone(self):
        return LLMOutputRefactorModel()
```
Note that `clone` is necessary when your model has internal state that should be distinct across user sessions, like the recognition cache of a streaming speech recognition model.
If you define a new model, or want to add some new function to `Pipeline`, the second step is to define a custom `Pipeline`:
```python
@dataclass(init=False)
class CustomPipeline(DefaultPipeline):
    llm_output_refactor_model: Optional["LLMOutputRefactorModel"] = field(
        default=None,
        metadata={"init_key": "llm_output_refactor_model", "clone": True},
    )

    def __init__(
        self,
        asr: ASR,
        llm_agent: Agent,
        tts: TTS,
        captioner: Optional[Captioner] = None,
        punt_restorer_model: Optional[PuntRestorer] = None,
        caption_rewriter: Optional[Rewriter | BaseChatModel] = None,
        thought_rewriter: Optional[Rewriter | BaseChatModel] = None,
        vad: Optional[VAD] = None,
        speech_enhancer: Optional[SpeechEnhancer] = None,
        speaker_encoder: Optional[SpeakerEncoder] = None,
        speech_speed_controller: Optional[SpeechSpeedController] = None,
        embeddings: Optional[Embeddings] = None,
        llm_output_refactor_model: Optional["LLMOutputRefactorModel"] = None,
        **kwargs,
    ):
        super().__init__(
            asr=asr,
            llm_agent=llm_agent,
            tts=tts,
            captioner=captioner,
            punt_restorer_model=punt_restorer_model,
            caption_rewriter=caption_rewriter,
            thought_rewriter=thought_rewriter,
            vad=vad,
            speech_enhancer=speech_enhancer,
            speaker_encoder=speaker_encoder,
            speech_speed_controller=speech_speed_controller,
            embeddings=embeddings,
            **kwargs,
        )
        self.llm_output_refactor_model = llm_output_refactor_model

    def get_llm_output_refactor_model(
        self,
    ) -> Optional["LLMOutputRefactorModel"]:
        return self.llm_output_refactor_model
```
Note that `**kwargs` is necessary in `__init__` to swallow shadowed parameters from `DefaultPipeline`. If you add a new arg to `__init__`, you also need to register it as a `field`, specifying its `clone` behavior (`True`/`False`).
Based on X-Talk's event-bus mechanism, you can then add a new `Manager` that subscribes to an existing `Event` and implements the custom functionality you need. You can also create a new `Event` if needed.
For example:
```python
LLMOutputRefactoredFinal = create_event_class(
    name="LLMOutputRefactoredFinal",
    fields={"text": "", "turn_id": 0},  # key: default_value
)

class LLMOutputRefactorManager(Manager):
    def __init__(
        self,
        event_bus: EventBus,
        session_id: str,
        pipeline: Pipeline,
        config: dict[str, Any],
    ):
        self.event_bus = event_bus
        self.pipeline = pipeline

    @Manager.event_handler(LLMAgentResponseFinish)
    async def handle_llm_response_finish(self, event: LLMAgentResponseFinish):
        refactor_model = self.pipeline.get_llm_output_refactor_model()
        if refactor_model:
            refactored_output = refactor_model.refactor(event.text)
            new_event = LLMOutputRefactoredFinal(
                session_id=event.session_id,
                text=refactored_output,
                turn_id=event.turn_id,
            )
            await self.event_bus.publish(new_event)

    async def shutdown(self):
        pass

custom_service = DefaultService(pipeline=pipeline)
custom_service.register_manager(LLMOutputRefactorManager)
```
Then you can optionally use `unsubscribe_event` and `subscribe_event` to switch other components (such as `OutputGateway`) from subscribing to the old event to the new one. For the new event, you need to implement the handling method:
```python
custom_service.unsubscribe_event(
    event_listener_cls=OutputGateway, event_type=LLMAgentResponseFinish
)

async def output_gateway_llm_output_refactored_final_handler(
    self: OutputGateway,
    event,
):
    await self.send_signal(
        {
            "action": "finish_resp",  # you can find "finish_resp" in frontend/src/js/index.js
            "data": {"text": event.text, "turn_id": event.turn_id},
        }
    )

custom_service.subscribe_event(
    event_listener_cls=OutputGateway,
    event_type=LLMOutputRefactoredFinal,
    method_or_handler=output_gateway_llm_output_refactored_final_handler,
)
```
## 🧩 Design Philosophy

*Prospective Data Flow of X-Talk*

*Architecture of X-Talk*
X-Talk follows a **modular, stage-wise functional flow**, progressing from noisy speech input, through frontend speech interaction, speech understanding, and an LLM-driven conversational agent, to speech generation. This logical pipeline is realized through a **layered, event-driven, and loosely-coupled architecture**, which forms the core of the system.
This design systematically addresses the key challenges of real-time speech-to-speech dialogue systems:
- Controlling **sub-second end-to-end latency**
- Orchestrating **multiple heterogeneous components**
- Enabling **flexible integration and swapping of backend models and services**
The entire system is built around a **centralized event bus**. All layers communicate asynchronously through event publishing and subscribing, enabling efficient management of complex conversational state and data flow.
### Frontend Layer
The Frontend Layer serves as the user-facing interface and directly handles browser-based interaction. It is responsible for:
- Rendering the conversational user interface
- Performing client-side Voice Activity Detection (VAD)
- Applying audio denoising and enhancement
- Displaying real-time latency metrics to the user
This layer packages audio streams, VAD markers, and contextual information for transmission to the backend.
### Event Center Layer
The Event Center Layer acts as the system’s **communication hub and network boundary**, unifying event routing and protocol translation. It consists of two tightly integrated components:
- **Gateways**
- The *Input Gateway* converts frontend streams into typed internal events
- The *Output Gateway* delivers processed events back to the frontend
- **Event Bus**
- Provides the asynchronous messaging fabric
- Routes events between all components in the system
Together, these components decouple all other layers by handling protocol adaptation, event distribution, and lifecycle isolation, forming the **extensible backbone** of the architecture.
### Managers Layer
The Managers Layer orchestrates the core conversational workflow through **specialized, capability-specific managers**. Each manager:
- Subscribes to relevant events
- Executes its dedicated logic (e.g., ASR, LLM inference, TTS)
- Publishes new events to drive the dialogue forward
This event-driven orchestration enables fine-grained control over execution order, concurrency, and latency.
### Agents Layer
The Agents Layer functions as the system’s **task-planning and execution engine**. It integrates structured inputs from upstream models—such as ASR outputs, voice captions, and contextual signals—into a coherent speech understanding.
Based on this understanding, the agent orchestrates tool usage, including:
- Web search
- Local retrieval
- Audio control
- External API calls
Finally, it synthesizes retrieved or processed information into a **context-aware natural language response**.
### Models Layer
The Models Layer provides a **unified, interface-driven abstraction** for core speech-to-speech dialogue capabilities, including:
- Speech understanding
- LLM-based conversational agents
- Speech generation
By defining stable and modular contracts for each capability, this layer allows compliant implementations to be seamlessly integrated, swapped, or scaled without impacting other system components.
## ✅ Supported Models
### Speech Recognition
**Slot**: `asr`
[SherpaOnnx](https://github.com/k2-fsa/sherpa-onnx) is recommended for its wide support of models and optimized inference performance.
**SherpaOnnx**
**Dependency:** `pip install "xtalk[sherpa-onnx-asr] @ git+https://github.com/xcc-zach/xtalk.git@main"`
**Path:** `src/xtalk/speech/asr/sherpa_onnx_asr.py`
A high-performance speech recognition framework and beyond.
[Repo](https://github.com/k2-fsa/sherpa-onnx)
[Models](https://github.com/k2-fsa/sherpa-onnx/releases/tag/asr-models)
[Tutorial to start speech recognition server](https://k2-fsa.github.io/sherpa/onnx/sense-voice/python-api.html#websocket-server-and-client-example)
**Qwen3ASRFlashRealtime**
**Dependency:** `pip install "xtalk[ali] @ git+https://github.com/xcc-zach/xtalk.git@main"`
**Path:** `src/xtalk/speech/asr/qwen3_asr_flash_realtime.py`
[Details](https://bailian.console.aliyun.com/?tab=model#/model-market/detail/qwen3-asr-flash)
**Zipformer**
**Dependency:** `pip install "xtalk[zipformer-local] @ git+https://github.com/xcc-zach/xtalk.git@main"`
**Path:** `src/xtalk/speech/asr/zipformer_local.py`
[Details](https://www.modelscope.cn/models/yhdai666/xtalk_zipformer_onnx/summary)
**ElevenLabs**
**Dependency:** `pip install "xtalk[elevenlabs] @ git+https://github.com/xcc-zach/xtalk.git@main"`
**Path:** `src/xtalk/speech/asr/elevenlabs.py`
[API Reference](https://elevenlabs.io/docs/api-reference/speech-to-text/convert)
### Text to Speech
**Slot**: `tts`
**IndexTTS**
**Dependency:** `pip install "xtalk[index-tts] @ git+https://github.com/xcc-zach/xtalk.git@main"`
**Path:**
- `src/xtalk/speech/tts/index_tts.py`
- `src/xtalk/speech/tts/index_tts2.py`
[Repo](https://github.com/index-tts/index-tts)
[Installation (vllm boost)](https://github.com/Ksuriuri/index-tts-vllm)
**GPT-SoVITS**
**Dependency:** `pip install "xtalk[gpt-sovits] @ git+https://github.com/xcc-zach/xtalk.git@main"`
**Path:** `src/xtalk/speech/tts/gpt_sovits.py`
[Repo](https://github.com/RVC-Boss/GPT-SoVITS)
**CosyVoice**
**Dependency:** `pip install "xtalk[ali] @ git+https://github.com/xcc-zach/xtalk.git@main"`
**Path:** `src/xtalk/speech/tts/cosyvoice.py`
[Details](https://bailian.console.aliyun.com/?tab=model#/model-market/detail/cosyvoice-v3-flash)
**ElevenLabs**
**Dependency:** `pip install "xtalk[elevenlabs] @ git+https://github.com/xcc-zach/xtalk.git@main"`
**Path:** `src/xtalk/speech/tts/elevenlabs.py`
[API Reference](https://elevenlabs.io/docs/api-reference/text-to-speech/convert)
### Voice Activity Detection
**Slot**: `vad`
X-Talk runs VAD on the client side, so you may not need one here.
**Silero VAD**
**Dependency:** `pip install "xtalk[silero-vad] @ git+https://github.com/xcc-zach/xtalk.git@main"`
**Path:** `src/xtalk/speech/vad/silero_vad.py`
[Model Details](https://github.com/snakers4/silero-vad)
[VAD-Web](https://github.com/ricky0123/vad)
### Speech Enhancement
**Slot**: `speech_enhancer`
**FastEnhancer**
**Dependency:** `pip install onnxruntime`
**Path:** `src/xtalk/speech/speech_enhancer/speech_enhancer.py`
[Model Details](https://github.com/aask1357/fastenhancer)
### Speaker Recognition
**Slot**: `speaker_encoder`
**Wespeaker-Voxceleb-Resnet34-LM**
**Dependency:** `pip install "xtalk[pyannote] @ git+https://github.com/xcc-zach/xtalk.git@main"`
**Path:** `src/xtalk/speech/speaker_encoder/pyannote_embedding.py`
[Wespeaker](https://github.com/wenet-e2e/wespeaker)
[Model Details](https://huggingface.co/pyannote/wespeaker-voxceleb-resnet34-LM)
### Captioner
**Slot**: `captioner`
Captioners give you a description of an audio clip.
**Qwen3-Omni-30B-A3B-Captioner**
**Dependency:** None
**Path:** `src/xtalk/speech/captioner/qwen3_omni_captioner.py`
[HuggingFace](https://huggingface.co/Qwen/Qwen3-Omni-30B-A3B-Captioner)
[ModelScope](https://www.modelscope.cn/models/Qwen/Qwen3-Omni-30B-A3B-Captioner)
## Contributing
See [Contribution Guide](CONTRIBUTING.md)
## Acknowledgements
We express sincere gratitude for:
- [Langchain](https://www.langchain.com/) as the backbone of LLM agents
- [vllm](https://github.com/vllm-project/vllm) for deployment of most models
- All model providers mentioned in [Supported Models](#supported-models)
All of you provide the solid foundation of X-Talk!
## License
This project is licensed under the Apache License 2.0 as long as you do not install optional dependencies; some optional dependencies may be under incompatible licenses.