Real-time C++ WebSocket Speech Recognition Server Based on sherpa-onnx and SenseVoice
With the growing popularity of voice interaction in smart devices and office automation, there is an increasing demand for real-time, efficient, and easy-to-deploy speech recognition services. This article introduces how to build a high-performance C++ WebSocket streaming ASR server based on sherpa-onnx and SenseVoice, supporting multilingual recognition, VAD, containerized deployment, and compatibility with various clients.
GitHub Project: mawwalker/stt-server. Stars and feedback are welcome!
1. Background and Goals
Traditional speech recognition services are often complex to deploy and limited in performance, making it difficult to meet the needs of high concurrency and low latency for multiple clients. This project is implemented in C++, combining the sherpa-onnx inference engine and SenseVoice multilingual models, providing:
- Real-time streaming ASR
- OneShot ASR (utterance-level recognition, with emotion and language detection)
- Standard WebSocket API, easy to integrate
- Multilingual support (Chinese, English, Japanese, Korean, Cantonese)
- Built-in VAD (Voice Activity Detection), automatic segmentation
- High concurrency, low latency, production-ready
- One-click deployment for both local and container environments, unified configuration
2. Architecture and Core Components
Overall architecture:
- WebSocketASRServer: Manages WebSocket connections and message routing
- ASREngine: Encapsulates sherpa-onnx inference and VAD, supports multithreading
- ASRSession: Manages each client's audio stream and recognition state
- Unified Configuration System: Auto-adapts to local and Docker environments
- Logger: Unified logging for debugging and operations
Architecture diagram:
```
┌──────────────┐      ┌────────────────────┐      ┌───────────────┐
│  WebSocket   │      │   WebSocket ASR    │      │  sherpa-onnx  │
│   Client     │◄──►  │      Server        │◄──►  │    Engine     │
└──────────────┘      └────────────────────┘      └───────────────┘
```
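To make these roles concrete, here is a minimal Python sketch (not the server's actual C++ code; the class and field names are illustrative) of the per-client bookkeeping that ASRSession performs: buffering incoming PCM audio and tagging each finished segment with an utterance index, in the same JSON shape the API returns.

```python
import json

class DemoSession:
    """Illustrative stand-in for ASRSession: buffers PCM audio from one
    client and tracks an utterance index for each recognized segment."""

    def __init__(self, sample_rate=16000):
        self.sample_rate = sample_rate
        self.buffer = bytearray()   # raw 16-bit PCM received from the client
        self.utterance_idx = 0      # incremented after each finished segment

    def feed(self, pcm_chunk: bytes):
        # In the real server this audio is pushed through VAD + sherpa-onnx.
        self.buffer.extend(pcm_chunk)

    def emit_result(self, text: str, finished: bool, lang: str = "zh") -> str:
        # Shape mirrors the streaming JSON result shown later in the article.
        msg = json.dumps({
            "text": text,
            "finished": finished,
            "idx": self.utterance_idx,
            "lang": lang,
        })
        if finished:
            self.utterance_idx += 1
            self.buffer.clear()  # segment consumed; start buffering the next one
        return msg

session = DemoSession()
session.feed(b"\x00\x00" * 1600)  # 100 ms of silence at 16 kHz, 16-bit mono
print(session.emit_result("hello", finished=True))
```

The real C++ server holds one such session per WebSocket connection, which is what lets multiple clients stream concurrently without interfering with each other's state.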
3. Key Features
- Real-time streaming ASR: Get results as you speak, suitable for voice assistants, meeting transcription, etc.
- OneShot utterance recognition: Full audio post-processing, automatic language and emotion detection
- Multilingual support: SenseVoice model covers Chinese, English, Japanese, Korean, Cantonese
- Built-in VAD: Automatically detects speech segments, improves accuracy
- High concurrency, low latency: C++ implementation, low resource usage, supports multiple clients
- Unified configuration: .env file for seamless local/Docker switching, environment auto-adaptation
- Container deployment: Dockerfile and docker-compose supported, production-friendly
- Rich API: Standard WebSocket, compatible with Python/JS clients
4. Configuration and Deployment
1. Install Dependencies
- System dependencies (Ubuntu example):

```bash
sudo apt-get install -y cmake build-essential libjsoncpp-dev libwebsocketpp-dev libasio-dev pkg-config
```

- Install sherpa-onnx (the provided script is recommended):

```bash
./install_sherpa_onnx.sh
source ./setup_env.sh
```
2. Model Preparation

Ensure the assets/ directory contains the SenseVoice and Silero VAD models:

```bash
# Download the SenseVoice multilingual model
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17.tar.bz2
tar xvf sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17.tar.bz2
mv sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17 assets/

# Download the VAD model
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/silero_vad.onnx
mkdir -p assets/silero_vad
mv silero_vad.onnx assets/silero_vad/
```
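A quick sanity check that the files landed in the right place can save a failed startup later. Below is a small hedged Python snippet; the paths are taken from the download steps above, though the exact files inside the model directory may vary by release:

```python
from pathlib import Path

# Paths as laid out by the download steps above
ASSETS = Path("assets")
MODEL_DIR = ASSETS / "sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17"
VAD_MODEL = ASSETS / "silero_vad" / "silero_vad.onnx"

def check_assets() -> list:
    """Return a list of missing paths; an empty list means the layout looks right."""
    missing = []
    for p in (MODEL_DIR, VAD_MODEL):
        if not p.exists():
            missing.append(str(p))
    return missing

if __name__ == "__main__":
    missing = check_assets()
    if missing:
        print("Missing model assets:", ", ".join(missing))
    else:
        print("All model assets found.")
```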
3. One-click Startup

- Local startup (recommended):

```bash
chmod +x run.sh
./run.sh local
```

- Docker startup:

```bash
./run.sh docker
# or
docker compose up -d
```

- Makefile targets:

```bash
make run         # Local
make run-docker  # Docker
```

- Check service status:

```bash
./run.sh status
# or
make status
```
4. Unified Configuration

Maintain a single .env file; it adapts automatically to both local and Docker environments. Main parameters:

| Parameter | Description | Default |
|---|---|---|
| SERVER_PORT | Server port | 8000 |
| ASR_MODEL_NAME | ASR model name | sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17 |
| VAD_THRESHOLD | VAD threshold | 0.5 |
| ASR_POOL_SIZE | ASR thread pool size | 2 |

Priority: command line > environment variable > default value.
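That resolution order can be illustrated with a small Python sketch (the function and argument names are illustrative, not the server's actual C++ configuration code):

```python
import os

def resolve(name: str, cli_args: dict, default):
    """Resolve one config value: command line > environment variable > default."""
    if name in cli_args:                # highest priority: explicit CLI flag
        return cli_args[name]
    env_val = os.environ.get(name)      # next: environment (e.g. loaded from .env)
    if env_val is not None:
        return env_val
    return default                      # lowest: built-in default

# SERVER_PORT with no CLI flag and no environment variable falls back to 8000
print(resolve("SERVER_PORT", cli_args={}, default=8000))
```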
5. WebSocket API and Usage Examples
1. Endpoints

- Streaming ASR: `ws://localhost:8000/sttRealtime?samplerate=16000`
- OneShot ASR: `ws://localhost:8000/oneshot`

2. Protocol

- Streaming: send the PCM audio stream and receive JSON results in real time
- OneShot: supports start/stop control messages and returns the full result, including language, emotion, etc.
Streaming result example:

```json
{
  "text": "Recognized text",
  "finished": false,
  "idx": 0,
  "lang": "zh"
}
```

OneShot result example:

```json
{
  "type": "result",
  "text": "Recognized text",
  "finished": true,
  "lang": "zh",
  "emotion": "neutral",
  "timestamps": [...]
}
```
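A client typically dispatches on these fields. Here is a hedged Python sketch that consumes streaming messages, keeps the latest hypothesis, and stops at the final one (field names follow the JSON examples above; the helper itself is illustrative):

```python
import json

def handle_messages(messages):
    """Collect streaming results until a message with finished=True arrives.
    Returns (final_text, language)."""
    text, lang = "", None
    for raw in messages:
        result = json.loads(raw)
        text = result.get("text", text)   # each message carries the latest hypothesis
        lang = result.get("lang", lang)
        if result.get("finished"):
            break
    return text, lang

# Simulated stream: one partial result followed by the final one
stream = [
    '{"text": "hel", "finished": false, "idx": 0, "lang": "zh"}',
    '{"text": "hello", "finished": true, "idx": 0, "lang": "zh"}',
]
print(handle_messages(stream))  # → ('hello', 'zh')
```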
3. Python Client Example

```python
import asyncio
import websockets
import wave
import json

async def test_asr():
    uri = "ws://localhost:8000/sttRealtime?samplerate=16000"
    async with websockets.connect(uri) as websocket:
        # Stream the WAV payload in small chunks
        with wave.open("test.wav", "rb") as wav_file:
            data = wav_file.readframes(1024)
            while data:
                await websocket.send(data)
                data = wav_file.readframes(1024)
        # Read recognition results until the final one arrives
        async for message in websocket:
            result = json.loads(message)
            print(f"Result: {result['text']}")
            if result['finished']:
                break

asyncio.run(test_asr())
```
4. JavaScript Client Example

```javascript
const socket = new WebSocket('ws://localhost:8000/sttRealtime?samplerate=16000');

socket.onopen = function() { /* send audio data */ };

socket.onmessage = function(event) {
  const result = JSON.parse(event.data);
  console.log('Result:', result.text);
};
```
6. Performance Tuning and Troubleshooting
- Build optimization:

```bash
cmake .. -DCMAKE_BUILD_TYPE=Release -DCMAKE_CXX_FLAGS="-O3 -march=native"
```

- Multithreading:

```bash
./build/websocket_asr_server --threads 8
```

- System tuning: raise the open-file limit with `ulimit -n 65536` and adjust kernel parameters for high concurrency
- Common issues: missing models, port conflicts, firewall rules; see the README's troubleshooting section
7. Development and Extension
- Core structure: ASREngine (inference), ASRSession (session), WebSocketASRServer (server)
- C++17 standard, Google style guide, Doxygen comments
- Supports custom models, VAD parameters, authentication, etc.
- Testing and debugging: Valgrind, GDB, logging system
8. Conclusion
This project, built with C++ and powered by sherpa-onnx and SenseVoice, delivers a high-performance, easy-to-use, and extensible real-time speech recognition service. Whether for local development or production deployment, you can get started with one click and integrate quickly. Feedback and contributions are welcome!
For questions or suggestions, please open an issue or discuss on GitHub.
Quick Links: Install Dependencies | Docker Deployment | API Docs | Troubleshooting | Development