Open Preview

ChindaTTS

Igniting Hope in Thai AI Speech Technology

Advanced bilingual text-to-speech synthesis powered by Neural Audio Codec and Transformer Architecture

1-3+

Speaker Voices

10.56%

CER Score

1.59

RTF Score

24kHz

Audio Quality

Choose Your Model

From open-weight previews to premium commercial solutions, find the perfect TTS model for your needs.

Available

ChindaTTS Open Preview

Open-Weight • Commercial Free

Initial open-weight TTS model preview with baseline Thai-English synthesis capabilities.

1 Speaker Voice
Open-weight model
Thai-English bilingual support
Available on HuggingFace

Late October - Early November 2025

Coming Soon

ChindaTTS Open

Open-Weight • Commercial Free

Enhanced version with improved tone accuracy, better quality, and expanded voice variations.

2 Speaker Voices
Improved tone accuracy
Enhanced audio quality
Simple emotional control

Late November - Early December 2025

Coming Soon

ChindaTTS Prime

Commercial

Premium commercial version with top-tier quality, advanced features, and enterprise support.

3+ Speaker Voices
Professional-grade quality
More emotional control
Enterprise SLA support

Mid December 2025

Coming Soon

Generated Speech Samples

Listen to ChindaTTS in action with Thai, English, and mixed language samples

Thai Sample 1

Thai

“สวัสดีครับ ยินดีต้อนรับสู่ ChindaTTS” (Hello, welcome to ChindaTTS)

English Sample 1

English

“Welcome to ChindaTTS, a Thai-English text-to-speech model”

Mixed Language Sample 1

Mixed

“ChindaTTS คือโมเดลสังเคราะห์เสียงพูดที่รองรับทั้งภาษาไทยและ English” (ChindaTTS is a speech synthesis model that supports both Thai and English)

Thai Sample 2

Thai

“เทคโนโลยีปัญญาประดิษฐ์สำหรับการสังเคราะห์เสียงพูดภาษาไทย” (Artificial intelligence technology for Thai speech synthesis)

English Sample 2

English

“High-quality speech synthesis with exceptional intelligibility”

Mixed Language Sample 2

Mixed

“ระบบ TTS ที่ออกแบบมาเพื่อ Thai and English languages” (A TTS system designed for Thai and English languages)

Audio samples generated by ChindaTTS Open Preview.

Performance Metrics

Comprehensive evaluation using ASR-as-a-Judge methodology

Approaching human-level performance

Character Error Rate

10.56%

Measures intelligibility and pronunciation accuracy on Thai-English mixed speech

Real-Time Factor

1.59

Processing speed on standard GPUs. Research preview; optimization is ongoing.
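Real-Time Factor is conventionally the ratio of processing time to the duration of the audio produced, so a value above 1 means synthesis runs slower than playback. The page does not state its exact formula, so the sketch below assumes that conventional definition. The function name and the example timings are illustrative only.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Return processing time divided by the duration of the generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


# Hypothetical timings: 15.9 s of compute to produce a 10 s clip.
rtf = real_time_factor(15.9, 10.0)
print(f"RTF = {rtf:.2f}")
```

Under this definition, an RTF of 1.59 corresponds to roughly 1.59 seconds of compute for each second of generated speech.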

ASR-as-a-Judge Evaluation

ASR Model

Whisper Large V3

Test Dataset

Common Voice 17 Thai

Metric

Character Error Rate

Generated speech is transcribed using ASR and compared with reference text.
Lower CER indicates better intelligibility and pronunciation accuracy.
* Evaluated on 1,000 randomly selected samples from the test set.
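The evaluation loop described above can be sketched as follows: sample reference texts, synthesize each one, transcribe the audio with an ASR model, and compare the transcript against the reference at the character level. This is a minimal sketch under stated assumptions. `synthesize` and `transcribe` are hypothetical callables standing in for the TTS model and the ASR model, and the corpus-level aggregation (total edits over total reference characters) is one common convention, not necessarily the one used for the published score. Any text normalization applied before scoring is also not specified on this page.

```python
import random
from typing import Callable, Sequence, Tuple


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def corpus_cer(pairs: Sequence[Tuple[str, str]]) -> float:
    """Total character edits divided by total reference characters."""
    errors = sum(edit_distance(ref, hyp) for ref, hyp in pairs)
    chars = sum(len(ref) for ref, _ in pairs)
    if chars == 0:
        raise ValueError("reference text is empty")
    return errors / chars


def asr_as_judge(texts: Sequence[str],
                 synthesize: Callable[[str], object],
                 transcribe: Callable[[object], str],
                 n_samples: int = 1000,
                 seed: int = 0) -> float:
    """Score a TTS system by transcribing its output and computing CER."""
    rng = random.Random(seed)
    sample = rng.sample(list(texts), min(n_samples, len(texts)))
    pairs = [(text, transcribe(synthesize(text))) for text in sample]
    return corpus_cer(pairs)


# Toy stand-ins so the sketch runs: the "audio" is just the text itself,
# and the fake ASR drops the final character of every utterance.
score = asr_as_judge(["hello", "world"],
                     synthesize=lambda text: text,
                     transcribe=lambda audio: audio[:-1])
print(f"CER = {score:.2%}")
```

Python strings are compared per Unicode code point here, so Thai vowel and tone marks each count as separate characters. A real evaluation would need to fix that convention explicitly.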

Training Process

Two-stage approach to achieve exceptional quality and natural Thai-English speech

Stage 1

Continued Pre-training

Building foundational understanding of speech patterns and features across diverse corpora

Teaching the AI to understand how speech works in general

Large-scale diverse speech data

General speech feature learning

1,696 hours of audio data

Stage 2

Fine-tuning

Refining speech quality, tonal accuracy, and naturalness with curated Thai-English datasets

Polishing the AI to sound natural and get Thai tones exactly right

High-quality Thai-English data

Optimized pronunciation & tones

65 hours of audio data

Natural High-Quality Speech
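To put the two stages in proportion, the short calculation below uses only the hour counts stated above. The variable names are illustrative.

```python
pretrain_hours = 1696   # Stage 1, as stated above
finetune_hours = 65     # Stage 2, as stated above

total_hours = pretrain_hours + finetune_hours
finetune_share = finetune_hours / total_hours

print(f"Total audio: {total_hours} hours")
print(f"Fine-tuning share: {finetune_share:.1%}")
```

The curated fine-tuning set is a small fraction of the total audio, which matches the page's description of Stage 2 as refinement rather than bulk learning.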

System Architecture

Modern LLM-based framework treating speech synthesis as sequential generation

Text Encoder

LLaMA-based transformer processes Thai and English text into contextual embeddings

Converts your text into AI-understandable format

↓

Acoustic Decoder

Generates SNAC tokens frame by frame, producing speech features directly in time order

AI plans how to say your text naturally

↓

SNAC Codec

Multi-scale neural audio codec with residual vector quantization (RVQ) for high-fidelity audio reconstruction

Smart audio blocks that keep voice quality high

↓

Audio Output

24kHz waveform with ~200ms latency, preserving temporal consistency

Natural-sounding speech output
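The four stages above can be read as a chain of three functions ending in a waveform. The sketch below wires toy stand-ins together to show the data flow only. Every name in it (`TTSPipeline`, `toy_encode`, `toy_generate`, `toy_decode`, `SAMPLES_PER_TOKEN`) is hypothetical. Contextual embeddings are simplified to integer ids, and none of the stand-ins reflects the real LLaMA-based model or the SNAC codec.

```python
from dataclasses import dataclass
from typing import Callable, List

SAMPLE_RATE_HZ = 24_000  # output sample rate stated on this page
SAMPLES_PER_TOKEN = 240  # arbitrary toy value, not a property of SNAC


@dataclass
class TTSPipeline:
    encode_text: Callable[[str], List[int]]           # text -> text token ids
    generate_codes: Callable[[List[int]], List[int]]  # text tokens -> codec tokens
    decode_audio: Callable[[List[int]], List[float]]  # codec tokens -> samples

    def synthesize(self, text: str) -> List[float]:
        text_tokens = self.encode_text(text)
        codec_tokens = self.generate_codes(text_tokens)
        return self.decode_audio(codec_tokens)


# Toy stand-ins so the sketch runs; they are NOT the real components.
def toy_encode(text: str) -> List[int]:
    return [ord(ch) for ch in text]


def toy_generate(text_tokens: List[int]) -> List[int]:
    # Pretend each text token yields two codec tokens.
    return [tok % 4096 for tok in text_tokens for _ in range(2)]


def toy_decode(codec_tokens: List[int]) -> List[float]:
    # Emit silence of the right length instead of real audio.
    return [0.0] * (len(codec_tokens) * SAMPLES_PER_TOKEN)


pipeline = TTSPipeline(toy_encode, toy_generate, toy_decode)
waveform = pipeline.synthesize("Hello")
duration_s = len(waveform) / SAMPLE_RATE_HZ
print(f"{len(waveform)} samples, {duration_s:.2f} s at {SAMPLE_RATE_HZ} Hz")
```

Keeping the stages as plain callables makes the point of the design visible: the codec decoder is the last step, so no separate vocoder appears in the chain.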

Key Advantages

End-to-End Efficiency

Joint learning of text-to-audio without intermediate transformations

One AI system does everything, no complex pipelines

End-to-End Architecture

Direct audio generation without separate vocoder inference

Simpler processing pipeline

Higher Robustness

Better consistency, reducing artifacts such as frame popping

Works reliably even with unusual text

24kHz

Sample Rate

LLaMA

Base Model

SNAC

Audio Codec
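The SNAC entry in the architecture mentions RVQ (residual vector quantization). As a rough illustration of the general idea, not of SNAC's actual implementation, the sketch below quantizes a vector in stages, with each stage encoding the residual left by the previous one. The codebooks and the input vector are made-up toy values, and the sketch does not model the multi-scale aspect of the codec.

```python
from typing import List, Sequence

Vector = List[float]
Codebook = Sequence[Vector]


def nearest(codebook: Codebook, x: Vector) -> int:
    """Index of the codebook entry closest to x (squared Euclidean distance)."""
    def sq_dist(entry: Vector) -> float:
        return sum((a - b) ** 2 for a, b in zip(entry, x))
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))


def rvq_encode(x: Vector, codebooks: Sequence[Codebook]) -> List[int]:
    """Quantize x stage by stage; each stage encodes the remaining residual."""
    residual = list(x)
    indices = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        indices.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return indices


def rvq_decode(indices: Sequence[int], codebooks: Sequence[Codebook]) -> Vector:
    """Reconstruct by summing the selected entry from every stage."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(indices, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Toy codebooks: a coarse stage followed by a finer correction stage.
coarse = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]
fine = [[0.0, 0.0], [0.25, 0.0], [0.0, -0.25]]

codes = rvq_encode([1.2, 1.0], [coarse, fine])
approx = rvq_decode(codes, [coarse, fine])
print(codes, approx)
```

Each additional stage shrinks the reconstruction error, which is why a short list of small integers per frame can stand in for a detailed audio feature.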