Phonemic Speech Synthesis (3rd installment, 7 Nov 2006) - Page 2 — Parallax Forums

Phonemic Speech Synthesis (3rd installment, 7 Nov 2006)


Comments

  • Phil Pilgrim (PhiPi) Posts: 23,514
    edited 2006-11-08 09:03
    Chip,

    Thanks! You're definitely right about context, and I think your word-based approach will yield some vital clues. I've still got some troublesome consonants and even some vowels that would benefit from a more contextual approach. Right now, the k sound is algorithmic, determined by the following vowel. That's a step up from context-free, but it hasn't solved the problem. Better would be a separate k for each diphone: ka, ke, koo, etc. This could be accomplished easily in a table-driven system, where matches are performed on the longest patterns first, working down the line to single letters. Whole words could be accommodated this way, too, for any that are truly exceptional. In such a system, for example, I wouldn't have to spell "beer" with three e's, "beeer", i.e. "b ee er", since "eer" would have its own rule set. Also, to save space, phonemic macros would be useful for phoneme groups that get used in more than one sound sequence.
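
    Just to make the longest-first matching concrete, the scanner might look something like the sketch below. This is purely hypothetical: find_rule stands in for whatever dictionary lookup I end up with, and the real rule entries would carry more than a single pointer.

    PUB next_rule(textptr) : ruleptr | len
      ' Longest-match-first: try the four-character pattern, then three,
      ' then two, then the single letter, which is always in the table.
      ' find_rule(addr, len) is a hypothetical lookup returning a pointer
      ' to the rule data for the first 'len' characters at 'addr', or 0.
      repeat len from 4 to 1
        ruleptr := find_rule(textptr, len)
        if ruleptr
          return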

    Also, my inflections are frame-based, rather than phoneme-based. Oddly enough, this sounds better than it should, given that only the first frame in a compound phoneme will get inflected. Pitch is a tricky thing. There's an awfully fine line between monotone and sing-song, and I certainly haven't mastered it. At first I thought your 1/4-semitone resolution would be too coarse for speech. But inflections do cover a wider range than that. The key, I think, is the blend, which makes the spoken word sound less like individual musical notes and more like a continuum. I just need to figure out a way to make the blend occur over a wider context. This will likely entail pre-buffering groups of frames and modifying them en masse before passing them on to the vocal tract. Notation-wise, this will likely involve brackets, braces, or maybe just spaces to delineate the inflected units.
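
    As a rough sketch of what I mean by modifying a buffered group en masse (the frame layout here is made up; the real VocalTract frames carry more than a single pitch word):

    PUB blend_pitch(frameptr, count, startpitch, endpitch) | i
      ' Spread an inflection linearly across a pre-buffered group of frames,
      ' so the pitch moves as a continuum instead of jumping on the first
      ' frame of a compound phoneme.  Assumes each frame exposes its pitch
      ' as one word at 'frameptr' -- a hypothetical layout for illustration.
      if count < 2
        return
      repeat i from 0 to count - 1
        word[frameptr][i] := startpitch + (endpitch - startpitch) * i / (count - 1)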

    Tempo is another tough nut. Stressed syllables are often drawn out, as well as being inflected. And certain vowels have different durations, depending on where in a phonemic group they appear. It would be nice to find some rules for the latter, since the notational burden gets cumbersome otherwise. (I'm not sold on the "%nnn" notation, either. It's too wordy.) Also, in songs, where tempo needs to be strict, there has to be a way to make a phonemic group fit a particular time slot. This usually involves stretching or compressing a single vowel to make the group fit. But, again, there's some notational baggage that needs to be optimized.

    There's still a lot to do...

    -Phil
  • Cliff L. Biffle Posts: 206
    edited 2006-11-08 16:54
    Most of the phonemic TTS systems I've worked with have a "hint" database, which includes fine-tuned word-level pronunciations for hard words (like SCSI, in a technical context). For any word that's not in the database, they apply some basic grammatical rules and cook up a phoneme string. The difference is usually pretty obvious.

    For those of you on Macs, the Vicki voice is an example of how good such a system can sound -- but if you throw it a curve ball, like a Spanish text, the quality breaks down. (You should hear the voices from Leopard. smile.gif )
  • Graham Stabler Posts: 2,507
    edited 2006-11-08 17:39
    All of this is so cool. The Propeller is both liberating and limiting, and I reckon that will be the mother of invention! Plus, the R&D "team" on this forum is pretty tasty, really.

    I'm struggling to catch up, I really want to help!

    Graham
  • cgracey Posts: 14,133
    edited 2006-11-08 18:34
    Cliff L. Biffle said...

    For those of you on Macs, the Vicki voice is an example of how good such a system can sound -- but if you throw it a curve ball, like a Spanish text, the quality breaks down. (You should hear the voices from Leopard. smile.gif )
    I was intrigued by what you said here about this "Vicki" voice, so I Googled it and found out that it takes ~25MB!!! It had better sound good. A Propeller target would be more like ~3KB.

    http://developer.apple.com/releasenotes/Carbon/Speech.html


    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔


    Chip Gracey
    Parallax, Inc.
  • Ym2413a Posts: 630
    edited 2006-11-08 19:01
    This sort of reminds me of the Voder for some odd reason.
    The Voder was an old voice synthesizer from the late 1930s that you controlled by hand.

    It had buttons and switches to control the synthesis parameters.

    I bet it was a real pain to learn and use!

    www.obsolete.com/120_years/machines/vocoder/
  • Cliff L. Biffle Posts: 206
    edited 2006-11-08 19:19
    Chip,

    Yes, the Vicki voice sounds good. It was the best realtime TTS I'd heard until Apple demoed their next-gen voices (coming next year) -- which sound better, but of course take even more space.


    As for the Voder, Chip's synthesis code is basically the digital equivalent, sans keyswitches. Anyone want to interface some? smile.gif
  • Kevin Wood Posts: 1,266
    edited 2006-11-08 19:33
    So who will be the first person to create a "Funkytown" object?
  • Ym2413a Posts: 630
    edited 2006-11-08 19:40
    Cliff L. Biffle said...
    Chip,
    As for the Voder, Chip's synthesis code is basically the digital equivalent, sans keyswitches. Anyone want to interface some? smile.gif

    Oh darn, Cliff! That gives me an idea!
    I'm a pianist and composer as well, and you just gave me an idea for a new instrument design. (lol)
    The Prop-Voder! *laughs*

    Either way, you could get some cool sounds out of it!
    smilewinkgrin.gif
  • Phil Pilgrim (PhiPi) Posts: 23,514
    edited 2006-11-08 21:45
    Chip,

    I've been thinking about how I'd make the synth table-driven. It would be nice to have data structures that look something like this:

    DAT
    
    table   BYTE "eer", 0, "ee", 0, "er", 0, 0
            BYTE "eel", 0, "ee", 0, "el", 0, 0
            ..
            BYTE "ee", 0, 0, F, 310 / 19, 2020 / 19, 2960 / 19, 3500 / 19, GA, 30, 0, 10, 20, 10
            etc.

    The idea is that the start routine would scan the table and create an array of "dictionary" entries, each indexed by one of the one- to four-letter patterns and pointing to the rest of the string. All well and good so far. But then I'd have to create my own lookdown routine, since Spin's lookdown doesn't accept an array address, but only a fixed expression list. Written in Spin, such a routine would be too slow, and I don't want to waste an assembly cog on just a dictionary search function.

    Okay, I could do something like the following, but it's rather awkward (and would have to be quite long):

    address := lookdown(pattern: d[0], d[1], d[2], d[3], d[4], d[5], ... , d[n])

    Spin's built-in lookdown and case constructs are plenty fast for this sort of thing when the parameters are static. But their speed would be hard to duplicate when simulated in Spin from dynamic data. The only other option I can think of would be a hash function. Properly constructed in Spin, that might eliminate a linear search and be fast enough. This may be the route I have to take, unless I've overlooked some Spin feature I'm not yet familiar with...
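
    For the record, here's the sort of hash scheme I have in mind. It's only a sketch: the table size, the mixing arithmetic, and the word-sized data offsets are all placeholders, and the table would be filled once by the start routine.

    CON
      SLOTS = 64                        ' power of two, comfortably more than the pattern count

    VAR
      long  hkey[SLOTS]                 ' packed 1- to 4-character pattern, 0 = empty slot
      word  hdat[SLOTS]                 ' offset of that pattern's rule data

    PRI hash(key) : h
      ' Cheap mixing; anything works as long as insert and locate agree.
      h := ((key ^ (key >> 15)) * 31) & (SLOTS - 1)

    PUB insert(key, dataoffset) | i
      ' Called by the start routine while the dictionary is being built.
      i := hash(key)
      repeat while hkey[i]              ' open addressing with linear probing
        i := (i + 1) & (SLOTS - 1)
      hkey[i] := key
      hdat[i] := dataoffset

    PUB locate(key) : dataoffset | i
      ' Returns the stored offset, or 0 if the pattern isn't in the table.
      i := hash(key)
      repeat while hkey[i]
        if hkey[i] == key
          return hdat[i]
        i := (i + 1) & (SLOTS - 1)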

    -Phil
  • william chan Posts: 1,326
    edited 2006-11-09 02:19
    Help!

    I can't compile or download talk_demo.spin!
    I downloaded the latest update. (Why is there no .zip extension?)

    I tried to compile, but it gives an error, "Expected a DAT symbol", at this line

    call #sine

    in the VocalTract.spin file.

    Why is the 1st version (zip file) much larger than the 2nd or 3rd versions?

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    www.fd.com.my
    www.mercedes.com.my
  • Phil Pilgrim (PhiPi) Posts: 23,514
    edited 2006-11-09 02:35
    William,

    There is a zip extension in the file name. I just rechecked. The reason the first zip is so much larger is that it includes the latest (v1.02) Propeller IDE. The others don't. If you haven't installed this version, that may be the reason you're having trouble getting the package to compile. So download the first zip, and extract the .exe into your Propeller program directory. Then try compiling the newest talk_demo again.

    -Phil
  • cgracey Posts: 14,133
    edited 2006-11-09 06:47
    Phil Pilgrim (PhiPi) said...
    Chip,

    I've been thinking about how I'd make the synth table-driven.
    Phil,

    As you probably know, the Spin interpreter has two built-in functions which could aid in this: STRSIZE(@zstring) and STRCOMP(@zstring1, @zstring2). Spending memory making a hash table may not be necessary. I mean, you've got under 100 strings you want to compare to, right? If you made a DAT list of all the targets in z-string form, you could use STRSIZE and STRCOMP to navigate through it and do the comparisons pretty rapidly. You could get the partial benefit of a hash table just by having several smart starting positions within the target list. The targets could each contain a z-string and a pointer to their respective data sets using 'WORD @dataset'. I think this all applies to what you were asking about. BTW, STRSIZE and STRCOMP are very fast. They'd take as much time to execute as it would to handle a single-character comparison discretely in Spin.
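
    Something along these lines, for instance. (Just a sketch: the target list is a stand-in, and a parallel WORD table, indexed the same way, would carry the @dataset offsets mentioned above.)

    DAT
    targets byte    "eer", 0, "ee", 0, "er", 0, 0     ' z-strings, double-zero terminated

    PUB find(patternptr) : index | p
      ' Walk the z-string list with STRSIZE/STRCOMP and return the zero-based
      ' index of the match, or -1 if there is none.  The index can then select
      ' the matching entry from a parallel WORD table of data-set pointers.
      p := @targets
      repeat while byte[p]
        if strcomp(p, patternptr)
          return
        p += strsize(p) + 1
        index++
      index := -1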

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔


    Chip Gracey
    Parallax, Inc.
  • Phil Pilgrim (PhiPi) Posts: 23,514
    edited 2006-11-09 07:37
    Chip,

    Actually, I don't think there will be any strings to compare that are longer than four characters, so I could just do a compare on appropriately constructed longs. I was more concerned about the loop overhead, which virtually disappears when using lookdown or case on static data or program structures. But now that I think about it, there's no reason I couldn't just sort the dictionary and do a binary search. That'd be plenty fast!

    The drill would be to keep a four-byte (long variable) shift register of the incoming string data. If the first byte is a lower-case letter, then the entire four-byte value is looked up in the table. If there's no exact match, the table position prior to where the match would've been will hold the correct pointer. (There will always be 26 single-letter entries, some with null pointers, so "fob." isn't going to match "eer."; it will match "f..." first.) Then as many characters as there were non-zero bytes in the found long can be shifted in for the next match, and so forth.
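
    Here's a sketch of the search itself, assuming the patterns are packed with the first character in the most significant byte (so numeric order matches alphabetical order) and stored as a sorted LONG table with a parallel WORD table of pointers. ENTRIES and patkeys are made-up names:

    PUB match(key) : index | lo, hi, mid
      ' Binary search of the sorted LONG table for the last entry that is
      ' =< key: the exact match if there is one, otherwise the shorter
      ' entry just before it.  ASCII stays below $80, so the packed keys
      ' are positive and Spin's signed compares preserve the ordering.
      ' Entry 0 is the lone "a" pattern, so it is always =< a lower-case key.
      lo := 0
      hi := ENTRIES - 1
      repeat while lo < hi
        mid := (lo + hi + 1) >> 1
        if long[@patkeys][mid] =< key
          lo := mid
        else
          hi := mid - 1
      index := lo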

    Dang! I'd put this stuff out of my mind for the night and closed up my shop. Now I'm inspired to go back out there and work into the wee hours — again! smile.gif

    -Phil
  • LoopyByteloose Posts: 12,537
    edited 2006-11-09 09:39
    I am quite amazed by all of this. The whole study of phonology is based on the physical limits of the oral cavity, nasal cavity, and larynx in producing sound. Since I teach ESL, I have to deal with it on a daily basis.

    I would have simply sampled speech from an appropriate source and used that rather than getting involved in the physics. After all, there is even a tonal register for gender, and another for culture.

    By the way, British phonology tends to have more phonemes than American phonology. With more phonemes come more permutations, and more software overhead.

    Of course, if you want the complete phoneme set, the IPA, or International Phonetic Alphabet, will provide you with an inventory. But it really is quite unwieldy.

    In sum, whatever voice you give a robotic device is going to give it a personality or the lack of one. Might I suggest that you use sampling to get the personality factor right? It really isn't just an expedient.

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    "If you want more fiber, eat the package. Not enough? Eat the manual."
    Tropical regards, G. Herzog [黃鶴] in Taiwan
  • william chan Posts: 1,326
    edited 2006-11-10 03:11
    Phil,

    Sorry, it must be my new Firefox 2.0 browser that removed the .zip extension.
    Anyway, I got the v1.02 Propeller Tool and it just works! Congratulations!

    Thanks.

    P.S. Why isn't the 1.02 IDE posted on the Parallax website yet?

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    www.fd.com.my
    www.mercedes.com.my
  • Paul Baker Posts: 6,351
    edited 2006-11-10 06:08
    Because it was an ad hoc revision Chip and Jeff put together to incorporate the ability to use a RES as a return label point, as required by Chip's vocal tract object. Since this is the only revision over v1.0 and it hasn't undergone the normal verification and testing process, it is technically a beta version and not an official Parallax release.

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    Paul Baker
    Propeller Applications Engineer

    Parallax, Inc.
  • kelvin james Posts: 531
    edited 2006-11-10 08:25
    Phil

    Here is a very simple thing called "can you do it?". I added some timing from Chip's version; it seems to transition better. Just a couple of extra set-timing parameters, a little smoother than the set tempo for sustain. Not perfect yet, but slowly making some headway. I have been trying to add some personality; a lot of trial and error here. I think you have all the basics there, it is just a matter of experimenting. The two modified files are attached.

    kelvin
  • Phil Pilgrim (PhiPi) Posts: 23,514
    edited 2006-11-11 03:36
    Kelvin,

    Hmmm, that sounds pretty good! I hadn't thought about adding a sustenudo, but it's a really good idea! Look for one in the next release.

    Thanks!
    Phil
  • yerpa58 Posts: 25
    edited 2006-11-11 21:55
    Any chance of an MP3 or WAV for us uninitiates? Kudos on all the nice work so far. I'm up to my ears in projects right now, but I look forward to using the Propeller chip.
  • Phil Pilgrim (PhiPi) Posts: 23,514
    edited 2006-11-12 01:19
    I've tried making a recording using my PC from some of the speech output, so I could convert it to MP3 and post it. But for some reason, the PC just isn't getting an adequate signal level. It may be a cable issue. At any rate, it would be worthwhile for people to hear the speech without any visual clues to help decipher it.

    I was brought back down to ground a couple days ago when a friend stopped by my shop. The conversation went something like this:

    Me: "Hey, you wanna hear this thing talk?"
    Friend: "Wow! It can talk? Sure I'd like to hear it!"
    Me: <starts demo>
    Friend: <grimaces, looks quizzical, grimaces some more>
    Demo: <finishes with a flourish>
    Friend: "Now there's a voice only a mother could understand!"
    Me: <PSSSssss! (balloon deflating)>

    But, what the hey. Self-delusion is part of what keeps us going, right? And when reality comes knocking, it only makes us try harder! smile.gif

    -Phil
  • kelvin james Posts: 531
    edited 2006-11-12 03:58
    Phil

    Thanks, but it's not my idea; this is from Chip's programming, I was just adding it. Not to worry about other people's opinions; this is something new, and it will take some time to please everyone. Your efforts on whatever you do are well appreciated.

    kelvin
  • kelvin james Posts: 531
    edited 2006-11-13 05:09
    Here is an MP3 of canyoudoit. The audio out from the demo board is not really designed for a line-in to the sound card, so it is a little on the noisy side.

    kelvin
  • Joel Rosenzweig Posts: 52
    edited 2006-11-15 04:18
    Phil, Chip,

    I've been following the thread for a while, and tonight I finally had a few moments to give the demo a try. You both did an outstanding job with your respective pieces. The speech demo sounded even better than what I was anticipating. I agree that it's hard to understand some of the words, but it appears that this can be resolved by tweaking the phonemes more than anything else. I experimented with the demo by adding a few words of my own, and the speech sounded quite good. I recall having to make the same types of tweaks to my SP0256 speech synthesizer projects.

    I certainly look forward to your next set of enhancements. This is really neat. I was going to use a nice backlit LCD for the user interface on my Propeller project. Maybe I'll have to reconsider and add the speech synthesizer instead. smile.gif

    Thanks to both of you for your work on this. I appreciate it. This is really good stuff!

    Joel-
  • cgracey Posts: 14,133
    edited 2006-11-15 06:30
    So, Phil,

    What are you working on today? Are you off in large-model land?
    Phil Pilgrim (PhiPi) said...
    I've tried making a recording using my PC from some of the speech output, so I could convert it to MP3 and post it. But for some reason, the PC just isn't getting an adequate signal level.
    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔


    Chip Gracey
    Parallax, Inc.
  • Phil Pilgrim (PhiPi) Posts: 23,514
    edited 2006-11-15 07:05
    Hey Chip,

    Right now, I'm laying out a new daughterboard (i.e. Real Work). 'Can't help staying checked into the forum, though.

    But don't worry: I haven't given up on the speech stuff! I'd like to find a way to record a word and convert it programmatically to the proper VocalTract settings. The sound quality would be much more natural that way. I don't remember: can your frequency analyzer program output a file with frequency domain data?

    -Phil
  • rokicki Posts: 1,000
    edited 2006-11-15 08:33
    Actually, I'm very intrigued by that sort of thing myself. I'm quite deaf, so I'm hoping the Propeller can take some of the load. smile.gif Phoneme recognition would be amazing!
  • william chan Posts: 1,326
    edited 2007-03-28 02:43
    Phil,

    Any fourth installment coming soon?

    How do you make the speaker sound female?

    Can the Propeller be run at 5 MHz x 8 = 40 MHz, or lower, to save current consumption and still get the same voice quality?

    Can I embed the Propeller with a CR2032 coin battery into a voice greeting card?
    How long will the battery last?


    Thanks.

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    www.fd.com.my
    www.mercedes.com.my
  • RytonMike Posts: 12
    edited 2007-03-31 17:49
    Phil Pilgrim (PhiPi) said...
    But don't worry: I haven't given up on the speech stuff! I'd like to find a way to record a word and convert it programmatically to the proper VocalTract settings. The sound quality would be much more natural that way.

    Phil,

    Back in the seventies I worked on the first generation of speech technology. We were developing direct waveform synthesis, adding damped sinusoids to make formants and filtered noise for sibilants. We only had TTL SSI and a bit of MSI technology in those days.
    One of the things I developed, which I was quite proud of at the time, was a set of tools for extracting synthesiser parameters from real speech. I'll pass on the key things I learned that are not obvious; this was not published at the time, and it may be useful to you.

    1. Pitch-synchronous analysis is best, using data from the glottal-closed part of the waveform. Detect glottal closure by selecting the locally highest peak in the waveform and tracking back to the previous zero crossing. Repeat, moving the analysis window forward through the recorded waveform by the parameter update rate.

    2. Use a strange window function which is half a Hanning window, starting at 1 and tapering to 0 over a period corresponding to about 66%-75% of the pitch period. This does not need to be adaptive. In theory, if we assume that the glottal excitation is impulsive, there is little energy before the start, so a step at this side of the window has little effect on the output, and we get maximum information from the glottal-closed, free-response part of the signal.

    3. I think we sampled at 10 kHz; the highest third formant we used was never more than 3 kHz. We used an FFT analysis window of 512 samples, filling unused samples with zeros.

    4. If you plot the resulting spectrum it is, of course, very smooth. The real frequency resolution is quite coarse because of the relatively short sampling window; however,

    5. we are only going to select and use the spectral peak values, so we are getting a good interpolation effect. Select the peaks in the pitch-synchronous spectra, coding amplitude as a number and frequency as position in the array, and setting them one after the other.

    6. If you look at the resulting patterns you will see a beautifully clear spectrogram with very well-defined formants in the voiced segments. We used an operator-assisted tool to then extract the parameters.

    7. Pitch extraction is a separate process. We used a relatively simple time-domain peak-picking process.

    8. There are some very interesting possibilities for using this approach to bootstrap speech-by-rule algorithms, and in speech recognition using elastic matching.

    Hope this gives you some ideas. I really like your synthesiser; it brought back many memories.

    By the way, our stuff required a computer the size of a medium-sized van. Things move on!

    Mike
  • cgracey Posts: 14,133
    edited 2007-03-31 21:32
    RytonMike said...

    Back in the seventies I worked on the first generation of speech technology. We were developing direct waveform synthesis, adding damped sinusoids to make formants and filtered noise for sibilants. We only had TTL SSI and a bit of MSI technology in those days.
    One of the things I developed, which I was quite proud of at the time, was a set of tools for extracting synthesiser parameters from real speech. I'll pass on the key things I learned that are not obvious; this was not published at the time, and it may be useful to you.

    Mike,

    These are real gems!!! This is the kind of know-how that is all but lost these days, but remains as applicable and relevant as ever. These kinds of algorithms just amaze me, and the discipline of their development fascinates me more than any other aspect of computing.

    I've thought for a long time that if we could get reliably distilled formant analyses, speech recognition could become pretty straightforward. This is great! I don't think it would take more than a few kilobytes of code to get the recognizer core running. I probably won't have time to do this for a while, but I'm really happy knowing HOW to think about it now.

    Thank you very much for sharing these ideas. They are invaluable.

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔


    Chip Gracey
    Parallax, Inc.
  • Phil Pilgrim (PhiPi) Posts: 23,514
    edited 2007-04-01 00:04
    Mike,

    The knowledge density within those few paragraphs is mind-blowing. That the results could be applied to coax voice from SSI and MSI circuitry is even more astounding. I'm anxious to apply your principles — perhaps in a couple of months, when I get a breather from work.

    Thanks for taking the time to impart your wisdom!

    -Phil