Real-time C++ WebSocket Speech Recognition Server Based on sherpa-onnx and SenseVoice
With the growing popularity of voice interaction in smart devices and office automation, there is an increasing demand for real-time, efficient, and easy-to-deploy speech recognition services. This article introduces how to build a high-performance C++ WebSocket streaming ASR server based on sherpa-onnx and SenseVoice, supporting multilingual recognition, VAD, containerized deployment, and compatibility with various clients.
GitHub Project: mawwalker/stt-server. Stars and feedback are welcome!
1. Background and Goals
Traditional speech recognition services are often complex to deploy and limited in performance, making it difficult to meet the needs of high concurrency and low latency for multiple clients. This project is implemented in C++, combining the sherpa-onnx inference engine and SenseVoice multilingual models, providing:
- Real-time streaming ASR
- OneShot ASR (utterance-level recognition, with emotion and language detection)
- Standard WebSocket API, easy to integrate
- Multilingual support (Chinese, English, Japanese, Korean, Cantonese)
- Built-in VAD (Voice Activity Detection), automatic segmentation
- High concurrency, low latency, production-ready
- One-click deployment for both local and container environments, unified configuration
2. Architecture and Core Components
Overall architecture:
- WebSocketASRServer: Manages WebSocket connections and message routing
- ASREngine: Encapsulates sherpa-onnx inference and VAD, supports multithreading
- ASRSession: Manages each client's audio stream and recognition state
- Unified Configuration System: Auto-adapts to local and Docker environments
- Logger: Unified logging for debugging and operations
Architecture diagram:
```
┌──────────────┐      ┌────────────────────┐      ┌───────────────┐
│  WebSocket   │      │   WebSocket ASR    │      │  sherpa-onnx  │
│   Client     │◄──►  │      Server        │◄──►  │    Engine     │
└──────────────┘      └────────────────────┘      └───────────────┘
```
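To make these roles concrete, here is a minimal Python sketch (not the server's actual C++ code; the class and field names are illustrative) of the per-client bookkeeping that ASRSession performs: buffering incoming PCM audio and tagging each finished segment with an utterance index, in the same JSON shape the API returns.

```python
import json

class DemoSession:
    """Illustrative stand-in for ASRSession: buffers PCM audio from one
    client and tracks an utterance index for each recognized segment."""

    def __init__(self, sample_rate=16000):
        self.sample_rate = sample_rate
        self.buffer = bytearray()   # raw 16-bit PCM received from the client
        self.utterance_idx = 0      # incremented after each finished segment

    def feed(self, pcm_chunk: bytes):
        # In the real server this audio is pushed through VAD + sherpa-onnx.
        self.buffer.extend(pcm_chunk)

    def emit_result(self, text: str, finished: bool, lang: str = "zh") -> str:
        # Shape mirrors the streaming JSON result shown later in the article.
        msg = json.dumps({
            "text": text,
            "finished": finished,
            "idx": self.utterance_idx,
            "lang": lang,
        })
        if finished:
            self.utterance_idx += 1
            self.buffer.clear()  # segment consumed; start buffering the next one
        return msg

session = DemoSession()
session.feed(b"\x00\x00" * 1600)  # 100 ms of silence at 16 kHz, 16-bit mono
print(session.emit_result("hello", finished=True))
```

The real C++ server holds one such session per WebSocket connection, which is what lets multiple clients stream concurrently without interfering with each other's state.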
3. Key Features
- Real-time streaming ASR: Get results as you speak, suitable for voice assistants, meeting transcription, etc.
- OneShot utterance recognition: Full audio post-processing, automatic language and emotion detection
- Multilingual support: SenseVoice model covers Chinese, English, Japanese, Korean, Cantonese
- Built-in VAD: Automatically detects speech segments, improves accuracy
- High concurrency, low latency: C++ implementation, low resource usage, supports multiple clients
- Unified configuration: .env file for seamless local/Docker switching, environment auto-adaptation
- Container deployment: Dockerfile and docker-compose supported, production-friendly
- Rich API: Standard WebSocket, compatible with Python/JS clients
4. Configuration and Deployment
1. Install Dependencies
- System dependencies (Ubuntu example):

```bash
sudo apt-get install -y cmake build-essential libjsoncpp-dev libwebsocketpp-dev libasio-dev pkg-config
```

- Install sherpa-onnx (the provided script is recommended):

```bash
./install_sherpa_onnx.sh
source ./setup_env.sh
```
2. Model Preparation

Ensure the assets/ directory contains the SenseVoice and Silero VAD models:

```bash
# Download the SenseVoice multilingual model
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17.tar.bz2
tar xvf sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17.tar.bz2
mv sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17 assets/

# Download the VAD model
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/silero_vad.onnx
mkdir -p assets/silero_vad
mv silero_vad.onnx assets/silero_vad/
```
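A quick sanity check that the files landed in the right place can save a failed startup later. Below is a small hedged Python snippet; the paths are taken from the download steps above, though the exact files inside the model directory may vary by release:

```python
from pathlib import Path

# Paths as laid out by the download steps above
ASSETS = Path("assets")
MODEL_DIR = ASSETS / "sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17"
VAD_MODEL = ASSETS / "silero_vad" / "silero_vad.onnx"

def check_assets() -> list:
    """Return a list of missing paths; an empty list means the layout looks right."""
    missing = []
    for p in (MODEL_DIR, VAD_MODEL):
        if not p.exists():
            missing.append(str(p))
    return missing

if __name__ == "__main__":
    missing = check_assets()
    if missing:
        print("Missing model assets:", ", ".join(missing))
    else:
        print("All model assets found.")
```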
3. One-click Startup

- Local startup (recommended):

```bash
chmod +x run.sh
./run.sh local
```

- Docker startup:

```bash
./run.sh docker
# or
docker compose up -d
```

- Makefile targets:

```bash
make run         # Local
make run-docker  # Docker
```

- Check service status:

```bash
./run.sh status
# or
make status
```
4. Unified Configuration

Maintain a single .env file; it adapts automatically to both local and Docker environments. Main parameters:

| Parameter | Description | Default |
|---|---|---|
| SERVER_PORT | Server port | 8000 |
| ASR_MODEL_NAME | ASR model name | sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17 |
| VAD_THRESHOLD | VAD threshold | 0.5 |
| ASR_POOL_SIZE | ASR thread pool size | 2 |

Priority: command line > environment variable > default value.
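That resolution order can be illustrated with a small Python sketch (the function and argument names are illustrative, not the server's actual C++ configuration code):

```python
import os

def resolve(name: str, cli_args: dict, default):
    """Resolve one config value: command line > environment variable > default."""
    if name in cli_args:                # highest priority: explicit CLI flag
        return cli_args[name]
    env_val = os.environ.get(name)      # next: environment (e.g. loaded from .env)
    if env_val is not None:
        return env_val
    return default                      # lowest: built-in default

# SERVER_PORT with no CLI flag and no environment variable falls back to 8000
print(resolve("SERVER_PORT", cli_args={}, default=8000))
```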
5. WebSocket API and Usage Examples
1. Endpoints

- Streaming ASR: `ws://localhost:8000/sttRealtime?samplerate=16000`
- OneShot ASR: `ws://localhost:8000/oneshot`

2. Protocol

- Streaming: send the PCM audio stream and receive JSON results in real time
- OneShot: supports start/stop control messages and returns the full result, including language, emotion, etc.
Streaming result example:

```json
{
  "text": "Recognized text",
  "finished": false,
  "idx": 0,
  "lang": "zh"
}
```

OneShot result example:

```json
{
  "type": "result",
  "text": "Recognized text",
  "finished": true,
  "lang": "zh",
  "emotion": "neutral",
  "timestamps": [...]
}
```
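A client typically dispatches on these fields. Here is a hedged Python sketch that consumes streaming messages, keeps the latest hypothesis, and stops at the final one (field names follow the JSON examples above; the helper itself is illustrative):

```python
import json

def handle_messages(messages):
    """Collect streaming results until a message with finished=True arrives.
    Returns (final_text, language)."""
    text, lang = "", None
    for raw in messages:
        result = json.loads(raw)
        text = result.get("text", text)   # each message carries the latest hypothesis
        lang = result.get("lang", lang)
        if result.get("finished"):
            break
    return text, lang

# Simulated stream: one partial result followed by the final one
stream = [
    '{"text": "hel", "finished": false, "idx": 0, "lang": "zh"}',
    '{"text": "hello", "finished": true, "idx": 0, "lang": "zh"}',
]
print(handle_messages(stream))  # → ('hello', 'zh')
```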
3. Python Client Example

```python
import asyncio
import websockets
import wave
import json

async def test_asr():
    uri = "ws://localhost:8000/sttRealtime?samplerate=16000"
    async with websockets.connect(uri) as websocket:
        # Stream the WAV payload in small chunks
        with wave.open("test.wav", "rb") as wav_file:
            data = wav_file.readframes(1024)
            while data:
                await websocket.send(data)
                data = wav_file.readframes(1024)
        # Read recognition results until the final one arrives
        async for message in websocket:
            result = json.loads(message)
            print(f"Result: {result['text']}")
            if result['finished']:
                break

asyncio.run(test_asr())
```
4. JavaScript Client Example

```javascript
const socket = new WebSocket('ws://localhost:8000/sttRealtime?samplerate=16000');

socket.onopen = function() { /* send audio data */ };

socket.onmessage = function(event) {
  const result = JSON.parse(event.data);
  console.log('Result:', result.text);
};
```
6. Performance Tuning and Troubleshooting
- Build optimization:

```bash
cmake .. -DCMAKE_BUILD_TYPE=Release -DCMAKE_CXX_FLAGS="-O3 -march=native"
```

- Multithreading:

```bash
./build/websocket_asr_server --threads 8
```

- System tuning: raise the open-file limit with `ulimit -n 65536` and adjust kernel parameters for high concurrency
- Common issues: missing models, port conflicts, firewall rules; see the README's troubleshooting section
7. Development and Extension
- Core structure: ASREngine (inference), ASRSession (session), WebSocketASRServer (server)
- C++17 standard, Google style guide, Doxygen comments
- Supports custom models, VAD parameters, authentication, etc.
- Testing and debugging: Valgrind, GDB, logging system
8. Conclusion
This project, built with C++ and powered by sherpa-onnx and SenseVoice, delivers a high-performance, easy-to-use, and extensible real-time speech recognition service. Whether for local development or production deployment, you can get started with one click and integrate quickly. Feedback and contributions are welcome!
For questions or suggestions, please open an issue or discuss on GitHub.
Quick Links: Install Dependencies | Docker Deployment | API Docs | Troubleshooting | Development