modelscope/FunASR

(简体中文|English|日本語|한국어)

FunASR

Industrial speech recognition. Up to 170x realtime on GPU, 13x faster than Whisper-large-v3. 50+ languages.
Speaker diarization · Emotion detection · Streaming · One API call

Quick Start · Benchmark · Models · Agent Integration · Docs · Contribute


Quick Start

pip install funasr

from funasr import AutoModel

model = AutoModel(model="iic/SenseVoiceSmall", vad_model="fsmn-vad", spk_model="cam++", device="cuda")
result = model.generate(input="meeting.wav")

Output — structured text with speaker labels, timestamps, and punctuation:

[00:00.4 → 00:03.8] Speaker 0: Let's discuss the Q3 plan.
[00:04.2 → 00:07.1] Speaker 1: Sounds good. I have three points.
[00:07.5 → 00:12.3] Speaker 0: Go ahead. We have 30 minutes.

That's it. One model, one call — VAD segmentation, speech recognition, punctuation, speaker diarization all happen automatically.
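
The formatted lines above can be produced from the raw result with a few lines of Python. This is a minimal sketch under assumptions this README does not confirm: each sentence is taken to be a dict with `start` and `end` in milliseconds, a `spk` index, and a `text` field, typically found under something like `result[0]["sentence_info"]`. Check these key names against the output of your installed version.

```python
def format_timestamp(ms: int) -> str:
    """Render milliseconds as MM:SS.t (tenths of a second, truncated)."""
    tenths = ms // 100
    minutes, rest = divmod(tenths, 600)
    return f"{minutes:02d}:{rest // 10:02d}.{rest % 10}"


def format_sentences(sentences):
    """Turn sentence dicts into '[start → end] Speaker N: text' lines."""
    lines = []
    for s in sentences:
        span = f"[{format_timestamp(s['start'])} → {format_timestamp(s['end'])}]"
        lines.append(f"{span} Speaker {s['spk']}: {s['text']}")
    return lines


# Hypothetical sample data shaped like the assumed sentence list.
sample = [
    {"start": 400, "end": 3800, "spk": 0, "text": "Let's discuss the Q3 plan."},
    {"start": 4200, "end": 7100, "spk": 1, "text": "Sounds good. I have three points."},
]
for line in format_sentences(sample):
    print(line)
```

Keeping the formatting in a pure function means it can be checked without loading a model.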

Deploy as API server: funasr-server --device cuda → OpenAI-compatible endpoint at localhost:8000

Use with AI agents: MCP Server for Claude/Cursor · OpenAI API for LangChain/Dify/AutoGen

Why FunASR?

| | FunASR | Whisper | Cloud APIs |
|---|---|---|---|
| Speed | 170x realtime | 13x realtime | ~1x realtime |
| Speaker ID | ✅ Built-in | ❌ Needs pyannote | ✅ Extra cost |
| Emotion | ✅ Happy/Sad/Angry | | |
| Languages | 50+ | 57 | Varies |
| Streaming | ✅ WebSocket | | |
| vLLM Acceleration | ✅ 2-3x faster | | N/A |
| Self-hosted | ✅ MIT license | ✅ MIT license | ❌ Cloud only |
| Cost | Free | Free | $0.006/min+ |
| CPU viable | ✅ 17x realtime | ❌ Too slow | N/A |

Benchmark

184 long-form audio files (192 min). Full report →

| Model | GPU Speed | CPU Speed | vs Whisper-large-v3 |
|---|---|---|---|
| SenseVoice-Small | 170x realtime | 17x realtime | 🚀 13x faster |
| Paraformer-Large | 120x realtime | 15x realtime | 🚀 9x faster |
| Whisper-large-v3-turbo | 46x realtime | | 3.4x faster |
| Fun-ASR-Nano | 17x realtime | 3.6x realtime | 1.3x faster |
| Whisper-large-v3 | 13x realtime | | baseline |

Key takeaway: SenseVoice-Small and Paraformer-Large run faster on CPU (17x and 15x realtime) than Whisper-large-v3 runs on GPU (13x realtime).
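
To make the speed figures concrete, the sketch below converts a speed factor into wall-clock time for the 192-minute test set. It assumes "Nx realtime" means audio duration divided by processing time, which is the usual reading but is not spelled out in this README. The function names are illustrative.

```python
def processing_seconds(audio_seconds: float, speed_factor: float) -> float:
    """Wall-clock time to transcribe audio at a given 'Nx realtime' speed."""
    return audio_seconds / speed_factor


def relative_speedup(speed_factor: float, baseline_factor: float) -> float:
    """How many times faster one model is than a baseline model."""
    return speed_factor / baseline_factor


AUDIO_SECONDS = 192 * 60  # benchmark set: 192 minutes of audio

# GPU speed factors taken from the benchmark table.
for name, factor in [("SenseVoice-Small", 170), ("Whisper-large-v3", 13)]:
    seconds = processing_seconds(AUDIO_SECONDS, factor)
    print(f"{name}: {seconds:.0f} s for the full set")

print(f"SenseVoice-Small vs Whisper-large-v3: {relative_speedup(170, 13):.1f}x")
```

Under that assumption, 170x realtime works out to roughly 68 seconds for the whole set, versus about 15 minutes at 13x realtime.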


What's new

  • 2026/05/24: vLLM Inference Engine — 2-3x faster LLM decoding for Fun-ASR-Nano. Streaming WebSocket service with VAD + Speaker Diarization. Guide →
  • 2026/05/24: Dynamic VAD — adaptive silence threshold (default on). Short sentences stay intact, long segments get auto-split. Details →
  • 2026/05/24: v1.3.3 adds the funasr-server CLI, an OpenAI-compatible API, and an MCP Server for AI agents. Upgrade with pip install --upgrade funasr
  • 2026/05/20: Added Qwen3-ASR (0.6B/1.7B) — 52 languages, auto detection. usage
  • 2026/05/20: Added GLM-ASR-Nano (1.5B) — 17 languages, dialect support. usage
  • 2026/05/19: Fun-ASR-Nano and SenseVoice now support speaker diarization.
  • 2025/12/15: Fun-ASR-Nano-2512 released: 31 languages, trained on tens of millions of hours of data.
Older
  • 2024/10/10: Whisper-large-v3-turbo support added.
  • 2024/07/04: SenseVoice — ASR + emotion + audio events.
  • 2024/01/30: FunASR 1.0 released.

Installation

pip install funasr
From source / Requirements
git clone https://github.com/modelscope/FunASR.git && cd FunASR
pip install -e ./

Requirements: Python ≥ 3.8, PyTorch ≥ 1.13, torchaudio
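
If an install misbehaves, a quick environment check against the requirements above can narrow things down. The following is a rough sketch using only the standard library. It assumes the PyPI distribution names `torch` and `torchaudio`, and it compares only the major and minor version numbers.

```python
import re
import sys
from importlib import metadata


def parse_major_minor(text: str):
    """Extract (major, minor) from a version string such as '2.1.0+cu118'."""
    match = re.match(r"(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    return int(match.group(1)), int(match.group(2))


def check_requirements():
    """Return a list of human-readable problems; empty means all checks passed."""
    problems = []
    if sys.version_info < (3, 8):
        problems.append("Python >= 3.8 is required")
    # torchaudio has no minimum version stated, so only its presence is checked.
    for name, minimum in (("torch", (1, 13)), ("torchaudio", None)):
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            problems.append(f"{name} is not installed")
            continue
        if minimum is not None and parse_major_minor(installed) < minimum:
            problems.append(f"{name} {installed} is older than {minimum[0]}.{minimum[1]}")
    return problems


for problem in check_requirements():
    print("requirement not met:", problem)
```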


Model Zoo

| Model | Task | Languages | Params | Links |
|---|---|---|---|---|
| Fun-ASR-Nano | ASR + timestamps | 31 languages | 800M | 🤗 |
| SenseVoiceSmall | ASR + emotion + events | zh/en/ja/ko/yue | 234M | 🤗 |
| Paraformer-zh | ASR + timestamps | zh/en | 220M | 🤗 |
| Paraformer-zh-streaming | Streaming ASR | zh/en | 220M | 🤗 |
| Qwen3-ASR | ASR, 52 languages | multilingual | 1.7B | usage |
| GLM-ASR-Nano | ASR, 17 languages | multilingual | 1.5B | usage |
| Whisper-large-v3 | ASR + translation | multilingual | 1550M | usage |
| Whisper-large-v3-turbo | ASR + translation | multilingual | 809M | usage |
| ct-punc | Punctuation | zh/en | 290M | 🤗 |
| fsmn-vad | VAD | zh/en | 0.4M | 🤗 |
| cam++ | Speaker diarization | | 7.2M | 🤗 |
| emotion2vec+large | Emotion recognition | | 300M | 🤗 |

Usage

Full examples with parameter docs: Tutorial →

from funasr import AutoModel

# Chinese production (VAD + ASR + punctuation + speaker)
model = AutoModel(model="paraformer-zh", vad_model="fsmn-vad", punc_model="ct-punc", spk_model="cam++", device="cuda")
result = model.generate(input="meeting.wav", hotword="关键词 20")

# 31 languages with timestamps
model = AutoModel(model="FunAudioLLM/Fun-ASR-Nano-2512", hub="hf", trust_remote_code=True,
                  vad_model="fsmn-vad", vad_kwargs={"max_single_segment_time": 30000}, device="cuda")
result = model.generate(input="audio.wav", batch_size=1)

# Streaming real-time
model = AutoModel(model="paraformer-zh-streaming", device="cuda")
result = model.generate(input="chunk.wav", cache={}, chunk_size=[0, 10, 5])

# Emotion recognition
model = AutoModel(model="emotion2vec_plus_large", device="cuda")
result = model.generate(input="audio.wav", granularity="utterance")

Deploy

# OpenAI-compatible API (recommended)
pip install funasr fastapi uvicorn python-multipart
funasr-server --model sensevoice --device cuda
# → POST /v1/audio/transcriptions at localhost:8000

Verify it with a public sample:

curl -L https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/BAC009S0764W0121.wav -o sample.wav
curl http://localhost:8000/v1/audio/transcriptions \
  -F file=@sample.wav \
  -F model=sensevoice \
  -F response_format=verbose_json
# Docker streaming service
docker pull registry.cn-hangzhou.aliyuncs.com/funasr_repo/funasr:funasr-runtime-sdk-online-cpu-0.1.12
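
The curl verification request shown earlier can also be sent from Python. Here is a minimal sketch using only the standard library: it reuses the endpoint, form fields, and `sensevoice` model name from the curl command, and makes no assumption about the shape of the JSON response. The helper names `build_multipart` and `transcribe` are illustrative.

```python
import json
import os
import urllib.request
import uuid


def build_multipart(fields, file_field, filename, file_bytes):
    """Encode text fields plus one file as multipart/form-data."""
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"'
            f"\r\n\r\n{value}\r\n".encode("utf-8")
        )
    parts.append(
        f'--{boundary}\r\nContent-Disposition: form-data; name="{file_field}"; '
        f'filename="{filename}"\r\nContent-Type: application/octet-stream'
        "\r\n\r\n".encode("utf-8") + file_bytes + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def transcribe(path, url="http://localhost:8000/v1/audio/transcriptions",
               model="sensevoice"):
    """POST an audio file to the running server and return the parsed JSON."""
    with open(path, "rb") as handle:
        audio = handle.read()
    body, content_type = build_multipart(
        {"model": model, "response_format": "verbose_json"},
        "file", os.path.basename(path), audio,
    )
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


# Usage, with the server from the step above running:
#   print(transcribe("sample.wav"))
```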

OpenAI API example → · Deployment docs → · Agent integration →


Community

📖 Documentation 🐛 Issues
💬 Discussions 🤗 HuggingFace
🤝 Contributing 📈 20k growth plan

License

MIT License

Citations

@inproceedings{gao2023funasr,
  author={Zhifu Gao and others},
  title={FunASR: A Fundamental End-to-End Speech Recognition Toolkit},
  booktitle={INTERSPEECH},
  year={2023}
}