ChindaTTS

Igniting Hope in Thai AI Speech Technology

Advanced bilingual text-to-speech synthesis powered by Neural Audio Codec and Transformer Architecture

1-3+
Speaker Voices
10.56%
CER Score
1.59
RTF Score
24kHz
Audio Quality

Choose Your Model

From open-weight previews to premium commercial solutions, find the perfect TTS model for your needs.

Available

ChindaTTS Open Preview

Open-Weight • Free for Commercial Use

Initial open-weight TTS model preview with baseline Thai-English synthesis capabilities.

  • 1 Speaker Voice
  • Open-weight model
  • Thai-English bilingual support
  • Available on HuggingFace
Late October - Early November 2025
Coming Soon

ChindaTTS Open

Open-Weight • Free for Commercial Use

Enhanced version with improved tone accuracy, better quality, and expanded voice variations.

  • 2 Speaker Voices
  • Improved tone accuracy
  • Enhanced audio quality
  • Simple emotional control
Late November - Early December 2025
Coming Soon

ChindaTTS Prime

Commercial

Premium commercial version with top-tier quality, advanced features, and enterprise support.

  • 3+ Speaker Voices
  • Professional-grade quality
  • More emotional control
  • Enterprise SLA support
Mid December 2025
Coming Soon

Generated Speech Samples

Listen to ChindaTTS in action with Thai, English, and mixed-language samples

Thai Sample 1

Thai

สวัสดีครับ ยินดีต้อนรับสู่ ChindaTTS (Hello, welcome to ChindaTTS)

English Sample 1

English

Welcome to ChindaTTS, a Thai-English text-to-speech model

Mixed Language Sample

Mixed

ChindaTTS คือโมเดลสังเคราะห์เสียงพูดที่รองรับทั้งภาษาไทยและ English (ChindaTTS is a speech synthesis model that supports both Thai and English)

Thai Sample 2

Thai

เทคโนโลยีปัญญาประดิษฐ์สำหรับการสังเคราะห์เสียงพูดภาษาไทย (Artificial intelligence technology for Thai speech synthesis)

English Sample 2

English

High-quality speech synthesis with exceptional intelligibility

Mixed Sample 2

Mixed

ระบบ TTS ที่ออกแบบมาเพื่อ Thai and English languages (A TTS system designed for the Thai and English languages)

Audio samples generated by ChindaTTS Open Preview.

Performance Metrics

Comprehensive evaluation using ASR-as-a-Judge methodology

Approaching human-level performance

Character Error Rate
10.56%

Measures intelligibility and pronunciation accuracy on Thai-English mixed speech
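As a concrete illustration, CER is the Levenshtein edit distance between the ASR transcript and the reference text, divided by the reference length. A minimal Python sketch (the sample strings are illustrative, not taken from the evaluation set):

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: Levenshtein edit distance over reference length."""
    m, n = len(reference), len(hypothesis)
    # dp[j] holds the edit distance between reference[:i] and hypothesis[:j]
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            prev, dp[j] = dp[j], min(dp[j] + 1,      # deletion
                                     dp[j - 1] + 1,  # insertion
                                     prev + cost)    # substitution
        # `prev` walks the previous row so only one row is stored
    return dp[n] / m if m else 0.0

# One substituted character in a 10-character reference -> 10% CER
print(f"{cer('chindatts!', 'chindatts?'):.2%}")  # 10.00%
```

The same distance can be computed with libraries such as jiwer, but the arithmetic above is all that the metric involves.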

Real-Time Factor
1.59

Processing performance on standard GPUs; optimization is ongoing in this research preview
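RTF is simply synthesis wall-clock time divided by the duration of the generated audio; values below 1.0 mean faster-than-real-time synthesis. A one-line sketch, with illustrative timings chosen to reproduce the headline 1.59 figure:

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = wall-clock synthesis time / duration of the generated audio.
    RTF < 1.0 means the system generates speech faster than real time."""
    return synthesis_seconds / audio_seconds

# e.g. 7.95 s of compute for a 5 s clip gives an RTF of 1.59
print(round(real_time_factor(7.95, 5.0), 2))  # 1.59
```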

ASR-as-a-Judge Evaluation

ASR Model
Whisper Large V3
Test Dataset
Common Voice 17 Thai
Metric
Character Error Rate

Generated speech is transcribed using ASR and compared with reference text.
Lower CER indicates better intelligibility and pronunciation accuracy.
* Evaluated on 1,000 randomly selected samples from the test set.

Training Process

Two-stage approach to achieve exceptional quality and natural Thai-English speech

Stage 1

Continuous Pre-training

Building foundational understanding of speech patterns and features across diverse corpora

Teaching the AI to understand how speech works in general

Large-scale diverse speech data
General speech feature learning
1,696 hours of audio data
Stage 2

Fine-tuning

Refining speech quality, tonal accuracy, and naturalness with curated Thai-English datasets

Polishing the AI to sound natural and get Thai tones exactly right

High-quality Thai-English data
Optimized pronunciation & tones
65 hours of audio data
Natural High-Quality Speech

System Architecture

Modern LLM-based framework treating speech synthesis as sequential generation

Text Encoder

LLaMA-based transformer processes Thai and English text into contextual embeddings

Converts your text into AI-understandable format

Acoustic Decoder

Autoregressively generates SNAC tokens frame by frame, producing speech features directly in the time domain

AI plans how to say your text naturally
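SNAC's codebooks run at different temporal rates, so decoders of this kind typically interleave coarse and fine codes into a fixed number of tokens per frame for the transformer to generate. The sketch below assumes a hypothetical 1 + 2 + 4 layout (7 tokens per frame); the actual ChindaTTS token layout is not documented here:

```python
# Hypothetical layout: three SNAC codebook levels at different rates,
# so one coarse frame carries 1 + 2 + 4 = 7 codes (an assumption for
# illustration, not the confirmed ChindaTTS configuration).
def interleave_frames(levels: list[list[int]]) -> list[int]:
    """Flatten multi-rate SNAC codes into a per-frame token stream
    that a transformer decoder can generate autoregressively."""
    coarse, mid, fine = levels
    n_frames = len(coarse)
    assert len(mid) == 2 * n_frames and len(fine) == 4 * n_frames
    stream = []
    for t in range(n_frames):
        stream.append(coarse[t])               # 1 coarse code
        stream.extend(mid[2 * t: 2 * t + 2])   # 2 mid codes
        stream.extend(fine[4 * t: 4 * t + 4])  # 4 fine codes
    return stream

# Two frames of toy codes -> 14 tokens, 7 per frame
codes = [[10, 11], [20, 21, 22, 23], [30, 31, 32, 33, 34, 35, 36, 37]]
print(interleave_frames(codes))
# [10, 20, 21, 30, 31, 32, 33, 11, 22, 23, 34, 35, 36, 37]
```

Grouping all codes for one frame together is what lets the model emit audio with low, roughly constant latency as generation proceeds.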

SNAC Codec

Multi-scale neural audio codec with RVQ for high-fidelity audio reconstruction

Smart audio blocks that keep voice quality high
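The "RVQ" in the codec stands for residual vector quantization: each codebook quantizes the residual left over by the previous one, so reconstruction quality improves with every stage. A toy one-dimensional version (real RVQ quantizes vectors with learned codebooks; these scalar codebooks are invented purely for illustration):

```python
# Toy scalar RVQ: each stage quantizes the residual of the previous
# stage. The codebook values below are made up for illustration.
CODEBOOKS = [
    [-1.0, 0.0, 1.0],  # stage 1: coarse values
    [-0.3, 0.0, 0.3],  # stage 2: refines the residual
    [-0.1, 0.0, 0.1],  # stage 3: fine detail
]

def rvq_encode(x: float) -> list[int]:
    """Return one codebook index per stage; the residual shrinks each stage."""
    indices, residual = [], x
    for book in CODEBOOKS:
        i = min(range(len(book)), key=lambda k: abs(book[k] - residual))
        indices.append(i)
        residual -= book[i]
    return indices

def rvq_decode(indices: list[int]) -> float:
    """Reconstruction is just the sum of the chosen codebook entries."""
    return sum(book[i] for book, i in zip(CODEBOOKS, indices))

x = 0.72
idx = rvq_encode(x)
print(idx, round(rvq_decode(idx), 2))  # [2, 0, 1] 0.7
```

Stacking codebooks this way gives high fidelity at low bitrate, which is why residual quantization is a common design in neural audio codecs such as SNAC.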

Audio Output

24kHz waveform with ~200ms latency, preserving temporal consistency

Natural-sounding speech output

Key Advantages

End-to-End Efficiency

Joint learning of text-to-audio without intermediate transformations

One AI system does everything, no complex pipelines

End-to-End Architecture

Direct audio generation without separate vocoder inference

Simpler processing pipeline

Higher Robustness

Greater consistency, reducing artifacts such as frame popping

Works reliably even with unusual text

24kHz
Sample Rate
LLaMA
Base Model
SNAC
Audio Codec