Difference between revisions of "Dhvani"
(Added section "Installation" - Rajeesh K Nambiar)
Revision as of 17:26, 31 March 2008
Dhvani Indian Language Text to speech Engine
The Dhvani project is a FOSS India Award 2008 winner (http://www.efytimes.com/efytimes/24867/news.htm).
Introduction
Dhvani is a Text To Speech System specially designed for Indian languages. The project was started in 2000 by the Simputer Trust, headed by Dr. Ramesh Hariharan, Indian Institute of Science, Bangalore. It uses a diphone concatenation algorithm. Currently it has Hindi, Malayalam, Kannada, Bengali, Oriya, Panjabi, Gujarati and Telugu modules. It can serve as a back end for speech synthesisers in Indian languages, in conjunction with a language-specific text-to-phonetics module. This speech engine makes no attempt to do prosody on the output. It simply concatenates basic sound units at pitch periods and plays them out. Adding prosody is a task for the future.
A platform-independent Java port of Dhvani is under development. It will include an API for applications that use Dhvani as their TTS system. Minimal support for SSML will also be present.
Sound Database
The database has the following structure. All sound files stored in the database are GSM-compressed .gsm files (see the gsm directory, which contains an open source distribution of the GSM standard by the Communications and Operating Systems Research Group (KBS) at the Technische Universitaet Berlin), recorded at 16 kHz as 16-bit signed linear samples. The following sound units are stored in the database (the numbers refer to the vowel and consonant numbering given under Phonetic Script).
- CV pairs: 1..33 * 2 4 6 8 9 10 12 13 14 15
- VC pairs: 2 4 6 8 9 10 12 13 14 15 * 1..34
- V: 1..14
- 0C: 33 sounds, all consonants except an
- Halfs: ky kr kl kll kv ksh khy khr khl khv gy gr gl gv gn ghy ghr ghv ghn chy chr chv jy jv ty tr tv thy thr dy dr dv dhy dhr dhv ny nr nv tty ttr ttv ddy ddr ddv py pr pl pll fr fl by br bl bhy bhr bhl my mr vy vr vl
The total size of the database is currently around 1MB, though we can possibly work to get it down to about half the size by storing only parts of vowels and extending them on the fly. We are using gsm compression which gives about a factor of 10 compression. There are programs with better compression ratios available but they do not seem to be open source.
Sound playback uses ALSA (Advanced Linux Sound Architecture).
Architecture
The files are laid out as follows within database/:
- CV files are in the cv/ directory
- VC files are in the vc/ directory
- V files are in the v/ directory
- Halfs files are in the halfs/ directory
- 0C files are in the c/ directory
The files are named as follows:
- CV files are named x.y.gsm where x is the consonant number and y is the vowel number.
- VC files are named x.y.gsm where x is the vowel number and y is the consonant number.
- V files are named x.gsm where x is the vowel number.
- Halfs files are named x.y.gsm where x,y are the two consonants involved.
- 0C files are named x.gsm where x is the consonant number.
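The naming scheme above can be sketched as a small path builder. This is only an illustration: the function name unit_path and the "database" root default are assumptions, and the text does not say whether Halfs files use consonant numbers or mnemonics, so the identifiers are passed through unchanged.

```python
from pathlib import PurePosixPath

# Directory for each unit type, as listed in the text.
UNIT_DIRS = {"CV": "cv", "VC": "vc", "V": "v", "H": "halfs", "0C": "c"}


def unit_path(kind, *ids, root="database"):
    """Return the .gsm path of a sound unit (hypothetical helper).

    `ids` holds one identifier for V and 0C units and two for CV, VC and
    Halfs units, in the order given by the naming rules.
    """
    if kind not in UNIT_DIRS:
        raise ValueError(f"unknown unit type: {kind!r}")
    expected = 1 if kind in ("V", "0C") else 2
    if len(ids) != expected:
        raise ValueError(f"{kind} units take {expected} identifier(s)")
    name = ".".join(str(i) for i in ids) + ".gsm"
    return str(PurePosixPath(root) / UNIT_DIRS[kind] / name)


print(unit_path("CV", 2, 2))   # consonant 2 followed by vowel 2
print(unit_path("0C", 19))     # consonant 19 alone
```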
All files other than the 0C files have been pitch marked and the marks appear in the corresponding .marks files, one mark per byte as an unsigned char.
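Reading a .marks file could look like the sketch below. It relies only on what the text states (one mark per byte, each an unsigned char); what each byte value measures is not described here, so the code returns the raw values without interpreting them. The function names are hypothetical.

```python
def parse_marks(data):
    """Turn the raw bytes of a .marks file into a list of ints (0-255).

    Each byte is one pitch mark stored as an unsigned char.
    """
    return list(data)


def read_marks(path):
    """Read a .marks file from disk and return its marks."""
    with open(path, "rb") as fh:
        return parse_marks(fh.read())


print(parse_marks(bytes([0, 40, 80, 255])))
```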
In addition to the sound files, there are four files in database/, namely cvoffsets, vcoffsets, voffsets and hoffsets, which store various attributes of the sound files.
cvoffsets
CV fields:
- start: start of the CV
- diphst: diphone start position (default: halfway to ctov from start)
- ctov: consonant-to-vowel change position
- longvowlen: length of long vowel (currently not really used)
- shortvowlen: length of short vowel
- diphend: end of diphone for long vowel (the short one is obtained from the long)
- diphshortfactor: factor for getting the short diphone from the long
- halfst: place where this CV is cut to connect to the previous half
vcoffsets
VC fields:
- end: end of the VC
- diphend: diphone end position (default: halfway from vtoc to end)
- vtoc: vowel-to-consonant change position
- longvowlen: length of long vowel (currently not really used)
- shortvowlen: length of short vowel
- diphst: start of diphone for long vowel (the short one is obtained from the long)
voffsets
V fields: length (length to be played starting from 0)
hoffsets
Halfs fields: start (start of half) end (place where this half is cut and appended to the next sound)
Several attributes in the above files are given as xxx, meaning that the synthesis program can set default values for these attributes.
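As an illustration of the CV field list and the xxx convention, here is a minimal sketch of a record parser. The on-disk layout of cvoffsets is not described here, so the whitespace-separated text format, the class name CVOffsets and the function parse_cv_record are all assumptions; an xxx value is mapped to None so that the caller can substitute its own default.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class CVOffsets:
    # Field names and order follow the CV field list; None means "xxx".
    start: Optional[int]
    diphst: Optional[int]
    ctov: Optional[int]
    longvowlen: Optional[int]
    shortvowlen: Optional[int]
    diphend: Optional[int]
    diphshortfactor: Optional[float]
    halfst: Optional[int]


def parse_cv_record(line):
    """Parse one whitespace-separated record (assumed format)."""
    names = [f.name for f in fields(CVOffsets)]
    values = line.split()
    if len(values) != len(names):
        raise ValueError(f"expected {len(names)} fields, got {len(values)}")
    parsed = {}
    for name, raw in zip(names, values):
        if raw == "xxx":
            parsed[name] = None          # left for the synthesiser to default
        elif name == "diphshortfactor":
            parsed[name] = float(raw)    # a scaling factor, not a position
        else:
            parsed[name] = int(raw)
    return CVOffsets(**parsed)


record = parse_cv_record("0 xxx 1200 xxx 900 2400 0.5 xxx")
print(record)
```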
Phonetic Script
The phonetic description is syllable based. Eight kinds of sounds are allowed (C stands for consonant, V for Vowel, H for a half consonant). The text to be spoken out must be expressed in terms of these eight types of sound units.
- V: a plain vowel
- CV: a consonant followed by a vowel
- VC: a vowel followed by a consonant
- CVC: a consonant followed by a vowel followed by a consonant
- HCV: a half consonant, followed by a CV
- HCVC: a half consonant, followed by a CVC
- 0C: a consonant alone
- G[0-9]*: a silence gap of the specified length (typical gaps between words would be between G1500 and G3000, depending upon the speed required; the maximum allowed is G15000; larger gaps can be given by repeating G15000 as many times as required)
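The rule for long gaps (repeat G15000 as often as needed) can be sketched as follows. The function name gap_tokens is hypothetical; only the G15000 limit comes from the text.

```python
MAX_GAP = 15000  # largest gap a single G token may carry


def gap_tokens(length):
    """Split a silence of `length` units into G tokens of at most G15000."""
    if length < 0:
        raise ValueError("gap length must not be negative")
    tokens = []
    while length > MAX_GAP:
        tokens.append(f"G{MAX_GAP}")
        length -= MAX_GAP
    if length > 0:
        tokens.append(f"G{length}")
    return tokens


print(gap_tokens(1500))    # a typical gap between words
print(gap_tokens(40000))   # needs more than one token
```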
Before giving examples of the above, we need to enumerate the
consonants and vowels we allow.
Vowels
The vowels allowed are:
- a as in pun
- aa as in the Hindi word saal (meaning year)
- i as in pin
- ii as in keen
- u as in pull
- uu as in pool
- e as in met
- ee as in mate
- ae as in mat
- ai as in height
- o as in the Tamil word ponni (meaning gold)
- oo as in court
- au as in call
- ow as in cow
- tamil-u : as in the Tamil word aanddu (meaning year)
The phonetic description uses the numbers 1-15 instead of the mnemonics given above.
Consonants
k kh g gh ch chh j jh t th d dh n tt tth dd ddh nna p f b bh m y r l ll v sh s h zh z an
Most of the above are self-explanatory for those who know an Indian language. The only ones which may need explanation are:
- ll as in the Tamil word vellam (meaning water, not jaggery)
- zh as in the Tamil word vazhi (meaning way)
- z as in the Urdu word roz (meaning daily)
- an as in the Hindi word kahaan (meaning where)
These consonants are numbered 1..34. The phonetic description, however, uses the mnemonics above. Within the program and in the database nomenclature, the numbers are used.
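The two numbering schemes can be written down as lookup tables. The sketch below assumes that numbers are assigned in the order in which the vowels and consonants are listed, which agrees with the worked examples (for instance aa is 2 and au is 13); the helper names are hypothetical.

```python
VOWELS = ["a", "aa", "i", "ii", "u", "uu", "e", "ee", "ae", "ai",
          "o", "oo", "au", "ow", "tamil-u"]
CONSONANTS = ("k kh g gh ch chh j jh t th d dh n tt tth dd ddh nna "
              "p f b bh m y r l ll v sh s h zh z an").split()

# Numbers start at 1 and follow the listing order (assumption).
VOWEL_NUMBER = {name: i for i, name in enumerate(VOWELS, start=1)}
CONSONANT_NUMBER = {name: i for i, name in enumerate(CONSONANTS, start=1)}


def cv_token(consonant, vowel):
    """Phonetic-script CV token: consonant mnemonic plus vowel number."""
    if consonant not in CONSONANT_NUMBER:
        raise KeyError(consonant)
    return f"{consonant}{VOWEL_NUMBER[vowel]}"


def cv_numbers(consonant, vowel):
    """Numeric pair used inside the program and the database."""
    return CONSONANT_NUMBER[consonant], VOWEL_NUMBER[vowel]


print(cv_token("kh", "aa"), cv_numbers("kh", "aa"))
```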
Examples
- khana (food in Hindi) kh2 n2 (CV CV)
- maun (silence in Hindi) m13n (CVC)
- kahaan (where in Hindi) k1 h2an (CV CVC)
- pratibha (talent in Hindi) pHr1 t3 bh2 (HCV CV CV)
- sankalp (resolution in Hindi) s1n k1l 0p (CVC CVC 0C)
- chandramaa (the moon in Hindi) ch1n dHr1 m2 (CVC HCV CV)
- praan (life in Hindi) pHr2n (HCVC)
- mysore (as pronounced in Kannada) m10 s6 r5 (CV CV CV)
- rashtr (nation in Hindi) r2sh 0tt 0r (CVC 0C 0C)
- aadesh (instruction in Hindi) 2 d8sh (V CVC)
- andaaz (style in Urdu) 1n d2z (VC CVC)
- ahimsa (nonviolence) 1 h3n s2 (V CVC CV)
- vazhapazham (banana in Tamil) v2 zh1 p1 zh1m (CV CV CV CVC)
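To check that a token is one of the eight allowed unit types, a classifier along the following lines could be used. This is a sketch built from the notation in the examples (consonant mnemonics, vowel numbers 1-15, H after a half consonant, a leading 0 for a lone consonant); it is not taken from the Dhvani sources, and the name classify is hypothetical.

```python
import re

CONSONANTS = ("k kh g gh ch chh j jh t th d dh n tt tth dd ddh nna "
              "p f b bh m y r l ll v sh s h zh z an").split()

# Longest mnemonics first, so "kh" is not read as "k" followed by "h".
_C = "(?:" + "|".join(sorted(CONSONANTS, key=len, reverse=True)) + ")"
_V = r"(?:1[0-5]|[1-9])"  # vowel numbers 1..15

PATTERNS = [
    ("G", re.compile(r"G[0-9]*")),
    ("0C", re.compile("0" + _C)),
    ("HCVC", re.compile(_C + "H" + _C + _V + _C)),
    ("HCV", re.compile(_C + "H" + _C + _V)),
    ("CVC", re.compile(_C + _V + _C)),
    ("CV", re.compile(_C + _V)),
    ("VC", re.compile(_V + _C)),
    ("V", re.compile(_V)),
]


def classify(token):
    """Return the unit type of a phonetic-script token."""
    for kind, pattern in PATTERNS:
        if pattern.fullmatch(token):
            return kind
    raise ValueError(f"not a valid sound unit: {token!r}")


for tok in "pHr1 t3 bh2".split():   # pratibha
    print(tok, classify(tok))
```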
A note on Half Characters
Only the following half sounds are allowed.
ky kr kl kll kv ksh khy khr khl khv gy gr gl gv gn ghy ghr ghv ghn chy chr chv jy jv ty tr tv thy thr dy dr dv dhy dhr dhv ny nr nv tty ttr ttv ddy ddr ddv py pr pl pll fr fl by br bl bhy bhr bhl my mr vy vr vl
If you want to use a half sound which is not in this list, you must use 0C instead. For example,
srushtti would be 0s r5sh tt3
hrithik would be 0h r3 t3k
but
dhyan is dhHy2n
khyaati is khHy2 t3
Modules
Hindi Module
Developed by Rileen Sinha, IISc Bangalore
- Replace the input UTF text with the corresponding phonetic symbols in our database. This is easily achieved by a careful mapping of the UTF symbols for Hindi onto the phonetic symbols in our database. Simultaneously, each symbol is tagged as a Consonant (C), Vowel (V), or Halant (H). All this is implemented in the functions replace (for words) and replacenum (for numbers).
- Now we must parse the phonetic strings thus obtained to produce speakable tokens - but this is not as easy as it sounds - in fact, it's quite involved. The main challenge lies in a peculiarity of the Hindi language - the occasional presence of an implicit 'a' (as in the English word 'pun') in a consonant sound. This implicit vowel obviously alters the pronunciation of a word quite drastically.
The challenge, therefore, is to come up with an algorithm that can accommodate this peculiarity and still produce the desired pronunciation, i.e. the desired phonetic output. The algorithm that we have implemented seems to work for all simple words. It may occasionally produce erroneous pronunciations for compound words, i.e. words made up of two or more simpler words.
The Algorithm for Parsing a Hindi Word into Speakable Tokens
The basic idea is to parse a given Hindi word and produce speakable sounds, which must be of the form:
- V: a plain vowel
- CV: a consonant followed by a vowel
- VC: a vowel followed by a consonant
- CVC: a consonant followed by a vowel followed by a consonant
- HCV: a half consonant, followed by a CV
- HCVC: a half consonant, followed by a CVC
- 0C: a consonant alone
The input is a string of {C,V,Ch,""} where Ch stands for a consonant followed by a halant, and "" stands for a blank.
Let the alphabet being considered at any given time be denoted by c(n), the previous (counting from left to right) as c(n-1), etc
Parsing is done from right to left, as follows:
(1) First of all, if the word ends in a C, make it Ch. This is done in order to make the pronunciation consistent with conventional spoken Hindi.
(2) Now, parse the word recursively as follows:
If c(n) is:
(a) A "C":
c(n-1)    output
""        C1
V         CVC, if c(n-2) is a C; else CV
C         C1C
Ch        C1C
(b) A "Ch":
c(n-1)    output
""        0C
V         CVC, if c(n-2) is a C; else VC
C         C1C
Ch        CHCV if c(n) & c(n+1) are a CV pair and the corresponding 'half' sound is available; else 0C
(c) A "V":
c(n-1)    output
""        V
V         V
C         CV
Ch        CV
Examples :
Consider the word "Samaaroh" (Function), for example. In our phonetic symbols, this becomes "sm2r12h", and the desired phonetic output of the converter should be "s1 m2 r12h".
Now let's see how our algorithm processes this input. In terms of consonants and vowels, we can write this as "CCVCVC". First of all, since it ends in a C, we add a halant, so we get "CCVCVCh". The length of the input is 6. Going from right to left, the algorithm works as follows -
c(6) is a Ch, and c(4) plus c(5) make a CV pair, thus we get a CVC - therefore we have "CVC" ie "r12h" in the output.
Next, c(2) and c(3) make a CV pair, and so we get "CV", or "m2", in the output. Lastly, c(1) is a C all by itself so we output "C1", ie "s1" (that takes care of the implicit vowel :-) !!)
Thus we've got "r12h","m2","s1" and the final output is these in the reverse order, ie "s1 m2 r12h" as desired.
As another example, let's take the word "Naujawaanon"(Youngsters), which doesn't end in a "C" - in our phonetic symbols, this is "n13jv2n13", or "CVCCVCV", input of length 7. The desired output is "n13j v2 n13"
The algorithm works as follows :
- c(6) and c(7) make a CV pair, so the output is "n13".
- So do c(4) and c(5), so the output is "v2".
- Finally, c(1), c(2) and c(3) give a CVC, thus "n13j".
Thus, we get "n13j v2 n13" (after reversal), as desired. (Actually, the last part is n13n, where 13n is to give the appropriate Hindi pronunciation of the vowel, just like 2n for kahaan.)
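The right-to-left walk in the two examples above can be sketched in a few lines. This is a simplified illustration, not the code in hindiphoserv.c: it accepts only symbols tagged C or V (so halants inside a word and half sounds are left out), and where a coda consonant follows a vowel that has no consonant before it, the sketch emits a VC unit, which is an assumption about the corresponding table row. The name parse_word is hypothetical.

```python
def parse_word(symbols):
    """Parse [(text, kind), ...] with kind "C" or "V" into speakable tokens."""
    texts = [text for text, _ in symbols]
    kinds = [kind for _, kind in symbols]
    if any(kind not in ("C", "V") for kind in kinds):
        raise ValueError("this sketch only handles symbols tagged C or V")
    if not kinds:
        return []
    # Step (1): a word-final bare consonant is treated as consonant + halant.
    if kinds[-1] == "C":
        kinds[-1] = "Ch"

    tokens = []
    n = len(kinds) - 1                      # index of c(n), moving leftwards
    while n >= 0:
        cur = kinds[n]
        prev = kinds[n - 1] if n >= 1 else ""
        prev2 = kinds[n - 2] if n >= 2 else ""
        if cur == "V":
            if prev == "C":                 # CV
                tokens.append(texts[n - 1] + texts[n])
                n -= 2
            else:                           # plain V
                tokens.append(texts[n])
                n -= 1
        elif prev == "V":
            if prev2 == "C":                # CVC
                tokens.append(texts[n - 2] + texts[n - 1] + texts[n])
                n -= 3
            else:                           # VC (assumed reading of the table)
                tokens.append(texts[n - 1] + texts[n])
                n -= 2
        elif prev == "C":                   # C1C: implicit 'a' after c(n-1)
            tokens.append(texts[n - 1] + "1" + texts[n])
            n -= 2
        else:                               # nothing to the left
            tokens.append(texts[n] + "1" if cur == "C" else "0" + texts[n])
            n -= 1
    tokens.reverse()                        # units were collected right to left
    return tokens


samaaroh = [("s", "C"), ("m", "C"), ("2", "V"), ("r", "C"), ("12", "V"), ("h", "C")]
print(" ".join(parse_word(samaaroh)))
```

Collecting the units from the right and reversing them at the end mirrors the description above, where the output is assembled in reverse order.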
CAVEAT :
The algorithm can fail for certain compound words. For example, consider the word "Sabhaapati" (Chairman), i.e. "sbh2pt3"; the desired output is "s1 bh2 p1 t3". The input is "CCVCCV", of length 6. Now let's see how the algorithm works:
- First, c(5) and c(6) make a CV pair, so we get "t3".
- Next, c(2), c(3) and c(4) make a CVC, so we get "bh2p" (oops!!).
- Lastly, c(1) is a solitary "C", so we get "s1".
Thus, we get "s1 bh2p t3" - not quite what we wanted.
This is an example of how our algorithm can fail for certain compound words - evidently, this will happen when the "join" is such that the end of one "subword" gets combined with the beginning of the next "subword", e.g. the "bh2" from "sbh2" and the "p" from "pt3" in "sbh2pt3".
An obvious (brute force!!) workaround is to have a small dictionary of such "problem" words, and check whether a given word matches any of them. If so, break it up into the corresponding subwords and parse them separately - this works satisfactorily, and we've implemented this with a few words (refer to the function checkspecial() and the corresponding parts of process() in hindiphoserv.c).
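The dictionary workaround might be sketched like this. It is not the checkspecial() implementation from hindiphoserv.c: the table SPECIAL_WORDS, its single entry and both function names are hypothetical, and the word parser is passed in as a parameter so that the sketch stays independent of any particular parsing code.

```python
# Hypothetical table of compound words and the subwords they split into.
SPECIAL_WORDS = {
    "sbh2pt3": ["sbh2", "pt3"],   # Sabhaapati = sabhaa + pati
}


def split_special(word):
    """Return the subwords of a known problem word, else the word itself."""
    return SPECIAL_WORDS.get(word, [word])


def parse_with_exceptions(word, parse):
    """Parse each subword separately and concatenate the resulting tokens.

    `parse` is any function mapping a phonetic string to a list of tokens.
    """
    tokens = []
    for part in split_special(word):
        tokens.extend(parse(part))
    return tokens


print(split_special("sbh2pt3"))
```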
It must be stressed, however, that even if the algorithm makes a mistake of the above kind, the program isn't going to crash/segfault etc - one merely gets an unexpected pronunciation :-) .
Any constructive criticism/suggestion(s) are most welcome.
Kannada Module
The Kannada module was written by Ravi Masalthi, IISc Bangalore.
Gujarati Module
The CVS contains a draft working version of this language module. Developers who know this language are needed to refine it.
Bengali Module
The CVS contains a draft working version of this language module. Developers who know this language are needed to refine it.
Oriya Module
The CVS contains a draft working version of this language module. Developers who know this language are needed to refine it.
Panjabi Module
The CVS contains a draft working version of this language module. Developers who know this language are needed to refine it.
Telugu Module
The CVS contains a draft working version of this language module. Developers who know this language are needed to refine it.
Malayalam Module
The Malayalam module was written by Santhosh Thottingal. It supports Malayalam numbers, English numbers, decimal places, English abbreviations and special characters. It consists of a Unicode parser which reads Malayalam Unicode-encoded text. Words are identified by '.', '!', ';', ',', '-' etc. The Unicode text is then converted to Dhvani phoneme script. The number reading logic is implemented using pattern-exception rules.
Dhvani Front End
To be developed. Integration of Dhvani with OpenOffice, Epiphany etc. is planned.
Text to Ogg file Conversion
To convert a UTF-8 text file to an Ogg file, follow these steps:
a) Convert the text to a raw sound file
dhvani -o outputfile.wav textfile
b) Convert the sound file to Ogg
oggenc -B 16 -C 1 -R 16000 outputfile.wav
The Ogg file will be created with the name outputfile.ogg
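Some sample speech files
- Hindi: http://santhosh00.googlepages.com/hindi.ogg
- Malayalam: http://santhosh00.googlepages.com/magic_cat-0.ogg
Developers
- Ramesh Hariharan (http://144.16.67.13/~ramesh)
- Santhosh Thottingal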
How to create Audio books using Dhvani
One of the important features of Dhvani is that it can be used to create audio books from UTF-8 formatted texts in the supported languages. To create an audiobook, follow these steps:
- dhvani -o audiobook.wav textfile
- oggenc -B 16 -C 1 -R 16000 audiobook.wav
Now you have a file called audiobook.ogg. If you prefer Ogg, then your audiobook is ready. If you want the file in mp3 format:
- oggdec audiobook.ogg (This will create a file named audiobook.ogg.wav )
- lame --preset 192 -ms -h audiobook.ogg.wav (install lame, http://lame.sourceforge.net/, using your package manager if it is not present)
Now your mp3 file is ready. Transfer it to your music player and enjoy!
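For batch use, the same steps can be scripted. The sketch below only assembles and runs the commands exactly as given above; the function names and the basename parameter are assumptions, and the intermediate file names (including audiobook.ogg.wav) are taken from the steps above rather than checked against the tools.

```python
import subprocess


def audiobook_commands(textfile, basename="audiobook", mp3=False):
    """Return the command lines for the conversion steps, in order."""
    wav = f"{basename}.wav"
    commands = [
        ["dhvani", "-o", wav, textfile],
        ["oggenc", "-B", "16", "-C", "1", "-R", "16000", wav],
    ]
    if mp3:
        ogg = f"{basename}.ogg"
        commands.append(["oggdec", ogg])
        commands.append(["lame", "--preset", "192", "-ms", "-h", f"{ogg}.wav"])
    return commands


def run_all(commands):
    """Run each command in turn, stopping at the first failure."""
    for argv in commands:
        subprocess.run(argv, check=True)


for argv in audiobook_commands("textfile", mp3=True):
    print(" ".join(argv))
```

Calling run_all(audiobook_commands("textfile", mp3=True)) would execute the steps, provided dhvani, oggenc, oggdec and lame are installed.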
Download
Dhvani can be downloaded from the SourceForge project page (http://sourceforge.net/projects/dhvani).
The latest source code is available from the CVS repository of the project (http://sourceforge.net/cvs/?group_id=35339).
Installation
a) Debian/Ubuntu package: http://download.savannah.nongnu.org/releases/smc/Dhvani
b) Fedora 8 onwards uses PulseAudio as the default sound server, and has issues with the pulseaudio-alsa plugin. You will have to either:
- Remove pulseaudio-alsa plugin, or
- Comment out all the lines in /etc/alsa/pulse-default.conf
License
Dhvani is licensed under GPL version 2 or later