Phonemic Speech Synthesis (3rd installment, 7 Nov 2006)

Phil Pilgrim (PhiPi) Posts: 23,514
edited 2009-05-21 16:37 in Propeller 1
Attached is a very crude attempt at speech synthesis using Chip's recently posted VocalTract object. The "talk" object is quite rough around the edges, and to say that some of my phonemes are barely intelligible gives them way too much credit. But maybe with input from the community and some fine tuning (okay, coarse tuning), the quality can be improved over time. Chip's marvelously compact object has everything that's needed for intelligible speech. But like any tool of its utility and complexity, it needs to be mastered; and that takes time.

I've relied heavily on this paper for the formant values used in the program. The internet has many other valuable resources for synthesized speech, some dating back decades. This can be a problem, too, since much of the seminal work on the subject was done before the internet existed, and the resulting papers have likely never been converted to machine-readable form and posted.

Much of what is done here via individual argument lists might more efficiently be accomplished by table-driven methods. But in its current form, it's somewhat more readable, which is important for development and debugging. Plus it makes playing with the settings a little easier.
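
A table-driven version might look something like the sketch below -- one DAT row per phoneme, fetched by index. The field layout and the numbers are invented for illustration; they are not the format the talk object actually uses:

    CON
      #0, PH_AH, PH_EE, PH_S                    ' phoneme indices

    DAT                                         ' F1, F2, F3 (Hz), duration (ms)
    phonemes    word    730, 1090, 2440, 120    ' "ah"
                word    270, 2290, 3010, 110    ' "ee"
                word      0, 5000, 7000,  90    ' "s" (frication band, illustrative)

    PUB phoneme_field(ph, n) : v
      ' Fetch field n (0..3) of phoneme ph from the table.
      v := word[@phonemes][ph * 4 + n]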

The attached archive includes the latest (v1.02) IDE/compiler exe. If you haven't already installed that version, copy the exe from the ZIP over the existing copy in your Propeller IDE directory.

Anyway, for what it's worth, enjoy!

-Phil

Update (2006.11.04): Attached is a somewhat improved version. Some of the consonants are better, there are more demos, and I've added whispering and a spell procedure. 'Still some extraneous popping and hissing to cure.

Update (2006.11.07): Added inflections, rolled r's, better musical notation, on-the-fly tempo adjustments, multiple speakers.

Post Edited (Phil Pilgrim (PhiPi)) : 11/8/2006 6:26:51 AM GMT

Comments

  • cgracey Posts: 14,133
    edited 2006-11-01 13:01
    Phil,

    Wow! I didn't imagine anyone would accomplish so much, so soon. You've made a phoneme layer for the VocalTract in about 300 lines of code.

    Interested Propeller programmers could glean a lot from looking at your talk.spin object, as it shows a flow for feeding the VocalTract. As you said, a table-driven implementation would be more compact, but what you've made is very readable and understandable -- and it's a functional general-purpose speech synthesizer!

    You could make different formant sets for "man", "woman", and "child" tracts, as well as corresponding pitch ranges... Well, I'm sure you've thought of all that. What you have actually works quite well already. As you said, the enunciation is crude compared to what's possible, but it is synthesizing speech, all right. It sounds like the Votrax SC-01A chip.

    Good job!

    BTW, if you go to the stereo spatializer thread, the VocalTract in that demo is v1.1. It behaves more sensibly during frame gaps. In fact, I'll just attach it here...

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔


    Chip Gracey
    Parallax, Inc.
  • Cobalt Posts: 31
    edited 2006-11-01 16:29
    WOW... just wow, this is simply awesome. First Chip made the monks/seven demo, and now this... I think I'm going to be losing some sleep tonight!

    Also, I seem to be getting some popping and such when I play back the sentences; it looks like the "~" is causing most of them. Any idea why?
  • Paul Sr. Posts: 435
    edited 2006-11-01 18:02
    Cobalt said...
    WOW... just wow, this is simply awesome. First Chip made the monks/seven demo, and now this... I think I'm going to be losing some sleep tonight!

    Also, I seem to be getting some popping and such when I play back the sentences; it looks like the "~" is causing most of them. Any idea why?

    Have YOU ever tried to say "~" ??

    I agree - this stuff is quite impressive..
  • Phil Pilgrim (PhiPi) Posts: 23,514
    edited 2006-11-01 19:06
    Some of my transitions between frames are pretty rough. The popping you hear may be coming from changes that are too abrupt, or it might be from bad gain settings leading to overflow. I'm just not sure which. I added the "~" to give emphasis to terminal consonants -- sort of a Lawrence Welk effect, though not nearly so protracted. The reason is that some of them seemed to get swallowed without the added vocalization.
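
    One way to tame popping from overly abrupt changes is to glide each parameter toward its target in small steps instead of jumping. A minimal Spin sketch of the idea -- set_formant is a hypothetical stand-in for whatever call actually writes the tract parameter, and the step count and timing are guesses to tune:

        PUB ramp_formant(f_start, f_end, steps, ms_per_step) | i
          ' Glide a resonator frequency from f_start to f_end in
          ' small increments so the output has no discontinuity.
          repeat i from 1 to steps
            set_formant(f_start + (f_end - f_start) * i / steps)
            waitcnt(clkfreq / 1_000 * ms_per_step + cnt)

        PRI set_formant(f)
          ' Placeholder: forward f to the synthesizer here.
          current_formant := f

        VAR
          long  current_formant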

    Another thing I need to add is a dynamic tempo modifier. The optimum duration of a vowel is context-dependent: sometimes you want to extend it for emphasis, particularly with long vowels; other times, shortening it almost to the point of inaudibility works better.

    In addition, I haven't really figured out the stress thing. The glottal pitch modifier works fine for songs; but when it's applied to stressed syllables, it sounds totally fake.

    Hopefully, people will feel free to experiment with the settings and offer improvements as they discover them. In particular, some of the consonants are virtually unintelligible and need a lot of help.

    -Phil
  • cgracey Posts: 14,133
    edited 2006-11-01 19:31
    Phil Pilgrim (PhiPi) said...


    I haven't really figured out the stress thing. The glottal pitch modifier works fine for songs; but when it's applied to stressed syllables, it sounds totally fake.
    Phil,

    Maybe stress could be better conveyed through a combination of timing, glottal amplitude, and perhaps some subtle formant tweaks, as well as glottal pitch.
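
    To make that concrete, here's one guess at a stress recipe in Spin: scale all the cues together rather than bending pitch alone. The percentages are assumptions to tune by ear, not measured values:

        VAR
          long  dur, amp, pitch         ' current frame parameters

        PUB apply_stress
          ' Stress a syllable with a blend of cues, not pitch alone.
          dur   := dur   * 130 / 100    ' +30% duration
          amp   := amp   * 115 / 100    ' +15% glottal amplitude
          pitch := pitch * 105 / 100    ' +5% glottal pitch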

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔


    Chip Gracey
    Parallax, Inc.
  • Graham Stabler Posts: 2,507
    edited 2006-11-02 02:20
    Try this:

        
           t.say(string("+7he-loa ever+i-won. -doa-nt ++yoo, th+ink -tha-t dher proa+pel-er is +soa, --cooool"))
           t.say(string("+9Videe-oakild-her ++raid--ee--oa star. +4Videe-oakild-her ++raid--ee--oa star"))
           t.say(string("+5in ++mae mae-n-d, an-d ++in -mae, car"))
           t.say(string("+5wee ++cahnt ree-wae-nd wee-v ++gon -too, far"))
           t.say(string("+8oa, +we, -oa. yoo +wer-dher +ferst -won"))
           t.say(string("+8oa, +we, -oa. yoo +wer-dher +last -won"))
           t.say(string("+9Videe-oakild-her ++raid--ee--oa star. +4Videe-oakild-her ++raid--ee--oa star"))
    
    
    



    This is fun, but I must go to bed. Thanks for the fun, Phil.

    Graham
  • Phil Pilgrim (PhiPi) Posts: 23,514
    edited 2006-11-02 05:05
    Chip,

    Thanks for your comments and suggestions, but mainly for VocalTract.spin! This is too much fun!

    I think you're right about the stress thing. It's got to be a combination of all those factors. I need to nail down some consonants first, though. k and g are particularly nettlesome. And I may need different phonemes for leading and trailing zs: 'can't seem to get one to work in both places.

    -Phil
  • davee Posts: 35
    edited 2006-11-02 05:46
    Fantastic! I was wanting to build a talking altimeter for my HPR rockets.

    Behold, the world's most interesting speech processing microchip..

    Really, I was just going to connect my digital altimeter to my bluetooth PDA, but now I can do what I wanted in the first place. Thank you!

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    Dave Evartt

    People don't hate computers, they just hate lousy programmers.

    http://wehali.com
  • cgracey Posts: 14,133
    edited 2006-11-02 06:32
    Phil Pilgrim (PhiPi) said...
    Chip,

    Thanks for your comments and suggestions, but mainly for VocalTract.spin! This is too much fun!

    I think you're right about the stress thing. It's got to be a combination of all those factors. I need to nail down some consonants first, though. k and g are particularly nettlesome. And I may need different phonemes for leading and trailing zs: 'can't seem to get one to work in both places.

    -Phil
    If I recall, a leading "k" is made by a short white noise burst between the following vowel's F2 and F3 positions; then F2 and F3 rapidly head to their vowel positions from the "k" center, with an aspiration turning to voiced excitation. For a trailing "k", the leading vowel's F2 and F3 converge onto their average as they fade, then there's a silent pause, followed by the white noise burst at the point where F2 and F3 converged, then an unvoiced (aspirated) "uhhh" sound starting at the same point. It's necessary to use the surrounding vowels like this. The "k" in "hike" is audibly higher than in "hook".
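
    Here is that leading-"k" recipe reduced to a sketch. Only the burst-center arithmetic is real code; the sequencing lives in the comments, and the millisecond figures are assumptions rather than values from VocalTract:

        CON
          BURST_MS = 25                 ' assumed burst length
          GLIDE_MS = 50                 ' assumed F2/F3 transition time

        PUB leading_k(vowel_f2, vowel_f3) : center
          ' The noise burst sits between the following vowel's F2 and
          ' F3; both formants then glide outward from that center.
          center := (vowel_f2 + vowel_f3) / 2
          ' 1) white-noise burst at 'center' for BURST_MS
          ' 2) glide F2: center -> vowel_f2 and F3: center -> vowel_f3
          '    over GLIDE_MS
          ' 3) cross-fade aspiration into voiced excitation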

    ...Oh, and "g" is a voiced "k", just as "d" is a voiced "t", and "b" is a voiced "p", and "v" is a voiced "f", and "zh" is a voice "sh", and "z" is a voiced "s". All these symmetries, and you realize the human speech aparatus has a rather limited set of basic sounds it can make.

    What I think we need is a motion model of the mouth, where we have only two or three bytes' worth of data which define its position. The raw 13 parameters (perhaps requisite at the bottom level) can define nearly innumerable configurations, 99.99% of which are physiological impossibilities. The real range of mouth movement and behavior is relatively constrained. How to qualify this is tough, though. We need to reduce the complexity somehow and make it very intuitive to configure using mouth-movement type data. For example, formants are significant mainly in relation to each other. Rather than specify exact resonator frequencies, we need a model whereby they find their places based on overall tract formation within the confines of a base tract model's geometries (the male/female/kid/baby differentiator). From those constraints the fricative, plosive, affricate, etc. qualities could be inferred. This could be done right on top of VocalTract. The real magic would come from some lava-lamp-like morphing of the formants in response to mouth movement. This would mean moving formants in the way that the speech apparatus would have to, which is often not a straight line between point-A and point-B.
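
    As a toy illustration of steering several resonators with a single byte of "mouth position": blend between two canonical vowel shapes. The formant targets below are the classic Peterson-Barney averages for "ee" and "ah"; everything else is made up for the sketch:

        DAT
        eeF     word    270, 2290, 3010         ' F1..F3 (Hz) for "ee"
        ahF     word    730, 1090, 2440         ' F1..F3 (Hz) for "ah"

        PUB blend_formant(n, pos) : f
          ' n = formant index 0..2, pos = mouth position 0..255
          ' (0 = "ee" shape, 255 = "ah" shape). One byte moves all
          ' three resonators together along a plausible path, instead
          ' of setting raw parameters independently.
          f := word[@eeF][n] + (word[@ahF][n] - word[@eeF][n]) * pos / 255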

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔


    Chip Gracey
    Parallax, Inc.

    Post Edited (Chip Gracey (Parallax)) : 11/2/2006 7:39:35 AM GMT
  • Phil Pilgrim (PhiPi) Posts: 23,514
    edited 2006-11-02 07:31
    Thanks, Chip. That may explain the difficulty I've been having. I was trying to keep things context-independent as much as possible. But it looks like I'll need a bit of look-ahead when processing things like k.
    Chip Gracey said...
    ... with an aspiration turning to voiced excitation.
    Whooh! You should write a textbook. That'd be enough to keep any undergrad riveted! :)
    Chip Gracey said...
    What I think we need is a motion model of the mouth, where we have only two or three bytes' worth of data which define its position. The raw 13 parameters (perhaps requisite at the bottom level) can define nearly innumerable configurations, 99.99% of which are physiological impossibilities. ...
    So basically, all the trajectories and interpolation would be done in a smaller-dimensional space, from which the raw parameters could be derived at any given point in time. That makes sense. It would certainly keep memory requirements to a minimum.

    Going one step further still, and to yield the most natural-sounding speech (giving encodings like ADPCM a run for their money), there needs to be a way to go the other way, too: from natural speech (recordings) to the parameters that can produce a reasonable facsimile. That's going to be extremely hard. It'd be an interesting task to train a neural net on.

    -Phil
  • cgracey Posts: 14,133
    edited 2006-11-02 07:53
    Phil Pilgrim (PhiPi) said...


    Going one step further still, and to yield the most natural-sounding speech (giving encodings like ADPCM a run for the money), there needs to be a way to go the other way, too: from natural speech (recordings) to the parameters that can produce a reasonable facsimile. That's going to be extremely hard. It'd be an interesting task to train a neural net on.

    -Phil
    In some old speech processing book I have from 1978, they mention a formant-based vocoder system that squished speech down to 600 bits per second, and they said you could recognize a person's voice through it. Imagine that -- 600 bps, without compression. It could probably be compressed to less than half that in real time, maybe even a tenth with a bit of loss over a longer recording.
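
    For scale: at a typical 50 frames per second, 600 bps is only 12 bits per frame -- roughly enough for a handful of coarsely quantized values, say pitch plus two or three formant positions at 3-4 bits each. The frame rate is an assumption here, but it shows how little data a formant representation needs.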

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔


    Chip Gracey
    Parallax, Inc.
  • [Deleted User] Posts: 0
    edited 2006-11-02 13:06
    Hi,
    I'm having VocalTract.spin problems; I've redownloaded Chip's new one from above. Here's the line of code that I'm getting an error on:

        mov     t1,vr                   'vibrato rate
        shr     t1,#10
        add     vphase,t1
        mov     t1,vp                   'vibrato pitch
        mov     t2,vphase
        call    #sine                   '<-- "expected DAT symbol" error here

    Thanks, Brian
  • davee Posts: 35
    edited 2006-11-02 13:13
    I think maybe the 1.0.3 version of the tool should be plugged into the Propeller download page. My guess is that truckwiz is using the old tool.

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    Dave Evartt

    People don't hate computers, they just hate lousy programmers.

    http://wehali.com
  • Graham Stabler Posts: 2,507
    edited 2006-11-02 14:03
    Phil, if there are any papers you want, I have access to an academic library and a scanner; I'd be happy to PDF anything you think is really seminal.

    Graham
  • [Deleted User] Posts: 0
    edited 2006-11-02 15:31
    Dave,
    that was the problem, thanks (sounds awesome)

    Brian
  • Phil Pilgrim (PhiPi) Posts: 23,514
    edited 2006-11-02 17:40
    Graham,

    Thanks for the offer. Unfortunately, I don't have a particular title in mind. I just remember there being a lot of ferment in the area back in the 70's. I even took a college course that included speech synthesis, but didn't keep any of my notes or instructional materials. Now I wish I had. One of the great things about the internet is that one can live in a backwater town, like I do, and still have access to a world of resources. But if the resources you need are pre-90s, you're often out of luck.

    Chip,

    Do you recall where you read about the parameters for the k sound? That's the kind of info that would come in handy. The source I cited is pretty sketchy on consonants.

    Thanks,
    Phil
  • cgracey Posts: 14,133
    edited 2006-11-02 18:13
    Phil Pilgrim (PhiPi) said...

    Do you recall where you read about the parameters for the k sound? That's the kind of info that would come in handy. The source I cited is pretty sketchy on consonants.
    It can be found here:

    http://web.inter.nl.net/hcc/davies/esp7cpt.html

    Scroll 80% of the way down and you'll see all the consonant recipes. This is the most straightforward description I've found. It took me about a half hour to read this documentation, but afterwards I felt like I was on very solid ground. Here's the picture of interest:

    [attached image: consonant recipe chart (600 x 600)]

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔


    Chip Gracey
    Parallax, Inc.
  • sharpie Posts: 150
    edited 2006-11-03 03:15
    For what it is worth to those interested in a good starting point.. (to understand some of what these guys are talking about) =)

    http://en.wikipedia.org/wiki/Speech_synthesis
  • Phil Pilgrim (PhiPi) Posts: 23,514
    edited 2006-11-04 07:09
    Chip,

    A belated thanks for the reference: it contains some good insights. I'm still trying to get that k sound right, with some success, but I'm still not satisfied. In the process, I managed to kill the glottal amplitude completely once. Oddly enough, I got a whisper, and it was intelligible! And that got me thinking: if a g is just a voiced k, and a z a voiced s, how are we able to make them sound different when whispering? By trying it, I realized that the tongue positions are a little different. In the unvoiced consonants, it's flattened against the teeth or the palate more than with their voiced brethren. This may explain why z has been so danged elusive. Mine sounds like a buzzy s, but there's more to it than that. The investigation continues...

    -Phil

    Post Edited (Phil Pilgrim (PhiPi)) : 11/4/2006 7:16:09 AM GMT
  • cgracey Posts: 14,133
    edited 2006-11-04 08:16
    Phil Pilgrim (PhiPi) said...
    Chip,

    A belated thanks for the reference: it contains some good insights. I'm still trying to get that k sound right, with some success, but I'm still not satisfied. In the process, I managed to kill the glottal amplitude completely once. Oddly enough, I got a whisper, and it was intelligible! And that got me thinking: if a g is just a voiced k, and a z a voiced s, how are we able to make them sound different when whispering? By trying it, I realized that the tongue positions are a little different. In the unvoiced consonants, it's flattened against the teeth or the palate more than with their voiced brethren. This may explain why z has been so danged elusive. Mine sounds like a buzzy s, but there's more to it than that. The investigation continues...

    -Phil
    I think the difference between a whispered "k" and a whispered "g" is the onset rate of the aspiration after the white burst; or, looked at differently, how close F2 and F3 start from the burst, and how long they are delayed -- "g" would be close and "k" would be farther. I also suspect that F1 in "g" starts out lower. Just whisper "kai" and "guy" to yourself.

    About "z", you might try turning on the nasal anti-resonator to simulate near-closure of the mouth. There is one issue that might need revisiting in the VocalTract, and that is an option to modulate the frication with the glottal pulse so that "z", "zh", "v", and "th"en buzz slightly. Think "zzzzzzap" and "vvvvvvvvvoice". I'm hoping it's not required, but you may find that it is. It wouldn't be too hard to do, but for simplicity's sake, I hope it's not necessary.

    Are you using that spectrograph I posted to see the spectrum over time? You need some tool like that to see what's going on with real sounds. Thanks for the updates.

    BTW, as you found, formants are still intelligible with only aspiration excitation. Any speech you make should sound whispered if you keep the aspiration amplitude proportional to what the glottal amplitude was.
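
    That whisper mode reduces to swapping one excitation for the other. A sketch, with an assumed gain constant rather than a known-good value:

        CON
          WHISPER_GAIN = 180            ' assumed scale, tune by ear

        PUB whisper_frame(glottal_amp) : asp_amp
          ' Derive the aspiration amplitude from what the glottal
          ' amplitude would have been; voicing itself stays off.
          asp_amp := glottal_amp * WHISPER_GAIN / 256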

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔


    Chip Gracey
    Parallax, Inc.

    Post Edited (Chip Gracey (Parallax)) : 11/4/2006 8:21:41 AM GMT
  • Ym2413a Posts: 630
    edited 2006-11-04 16:42
    Hey this is really cool, I just now found the time to compile this and run it.
    It doesn't sound that bad at all. With a little tuning this thing could be a really useful object.
    :)
  • Phil Pilgrim (PhiPi) Posts: 23,514
    edited 2006-11-05 03:30
    Hi all,

    I've attached a slightly improved version to the thread's top post.

    -Phil

    P.S. Chip: BTW, your spectrum analyzer has helped a lot! Thanks!

    Post Edited (Phil Pilgrim (PhiPi)) : 11/5/2006 3:53:05 AM GMT
  • cgracey Posts: 14,133
    edited 2006-11-05 05:02
    Phil,

    That is, indeed, sounding better now. You are on the verge of being able to make a numbers+units talker (i.e., "one point seven microamps").

    Have you discovered any limitations of the model? Is there anything that needs tuning, changing, or adding?

    Keep up the great work!
    Phil Pilgrim (PhiPi) said...
    Hi all,

    I've attached a slightly improved version to the thread's top post.

    -Phil

    P.S. Chip: BTW, your spectrum analyzer has helped a lot! Thanks!
    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔


    Chip Gracey
    Parallax, Inc.
  • Phil Pilgrim (PhiPi) Posts: 23,514
    edited 2006-11-05 08:48
    Chip,

    Thanks! There's still a ways to go, and I really need to start thinking about converting to a table-driven program at some point. So far the only limitations I've encountered have been my own: time and understanding.

    I still haven't quite nailed the k. It sounds better before some vowels than others. And the z works poorly as a final consonant. But I've discovered that substituting an s there (i.e. "cars", instead of "carz") works well enough. And there are still some pops, clicks, and hisses that need eradicating...
    Graham Stabler said...
    I'm hoping to analyse my own voice a bit and get Phil's speech object sounding more like me.
    Graham, I can't wait to hear it: a Propeller with a British accent! (Or is that without an American accent?) :)

    -Phil
  • Ym2413a Posts: 630
    edited 2006-11-05 19:54
    I'm surprised at how well it speaks for being just released.
    I like the fact that it can sing too.
  • Tim-M Posts: 522
    edited 2006-11-05 22:52
    Would any of you be willing to post an example audio file or two for those of us who don't have Propellers yet? I know this is a large favor to ask, but I'd love to hear what you guys have been working on. Chip's examples of the 'Singing Monks' and 'Seven' are so amazing that I can hardly imagine what may be next. Thanks to Chip and all of you for your hard work.

    Tim
  • Kevin Wood Posts: 1,266
    edited 2006-11-06 02:36
    For anybody interested in doing some more in-depth language analysis and research, I suggest checking out SIL International. SIL International is affiliated with Wycliffe Bible Translators, and they have some freeware versions of their linguistics field research tools and references.

    One word of warning - linguistics is a fascinating subject, and you may get hooked on it! Furthermore, studying linguistics will make you a better programmer, since programming languages are really designed for us, not for computers. That's one of the interesting aspects of Perl - Larry Wall is a linguist, and he designed Perl from that perspective.

    Although I've been following this thread, it's been a while since I really focused on linguistics (having been a language major in college). I'm not too certain what type of data structures Spin allows, but it shouldn't be too difficult to build a phone model that uses mostly on-off representations of individual properties. So if "voiced" is a property, then 1 = voiced, 0 = devoiced. By setting the relevant properties on/off, you should be able to define your phones. From this, you could build the phonological models for individual languages. In C++, I could see using a struct for the basic phone representation, and classes for the phonological environments.
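
    In Spin, which has no structs, the same idea fits in constant bit masks -- a phone is just a long with one bit per property. A minimal sketch with an illustrative (far from complete) feature set:

        CON
          VOICED    = %0001             ' one bit per property
          NASAL     = %0010
          FRICATIVE = %0100
          PLOSIVE   = %1000

          P_S = FRICATIVE               ' "s"
          P_Z = FRICATIVE | VOICED      ' "z" = voiced "s"
          P_T = PLOSIVE                 ' "t"
          P_D = PLOSIVE | VOICED        ' "d" = voiced "t"

        PUB is_voiced(phone) : yes
          yes := (phone & VOICED) <> 0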

    For anybody interested in a bit of basic linguistics, here is a page with some on-line lecture notes:
    www.ling.upenn.edu/courses/Summer_2004/ling001/schedule.html

    The second lecture would most closely relate to this thread.
  • Phil Pilgrim (PhiPi) Posts: 23,514
    edited 2006-11-06 04:44
    Kevin,

    That's a really good link that should help a lot of people get their feet wet with this stuff. Combined with Chip's CompuTalker reference and the Wikipedia article cited earlier in this thread, it should be possible to get a grasp on the basic concepts discussed here. Thanks!

    Chip,

    Kevin's link also led me here: www.haskins.yale.edu/facilities/asy.html, a website devoted to "articulatory synthesis", which synthesizes speech by modeling the mechanics of the vocal tract.

    -Phil
  • Phil Pilgrim (PhiPi) Posts: 23,514
    edited 2006-11-08 06:25
    I've attached a new update to the lead post in this thread.

    -Phil
  • cgracey Posts: 14,133
    edited 2006-11-08 07:17
    Phil,

    It just keeps getting better sounding. I like the syntax it uses for speech sounds, and all the little controls within strings to change pitch up and down. That's pretty fancy, but easy to set up. The nice thing is, you can "read" it almost as fast as if it were typed English. I bet your brain is pretty acclimated now to typing phonetics in.

    Like you mentioned earlier, going to a table-based approach would make things way smaller. What you have now is a more flexible test bed, though.

    I think the sound quality is now approaching that of the phoneme-based synthesizers that have been around. I'm wondering, how can you make a quantum improvement over them? Maybe phonemes, as they are often realized, are not the way to get there. My gut feeling is that they are too limited in context. BTW, your interpolation between phonemes has helped a lot -- many of the pops are gone, which is great.

    What I was going to do, myself, was work on individual words via the VocalTract directly, and make them sound as perfect as I could, so that the limitations of the synthesizer were defining the quality limits. Then, I figured that after I had done a good number of words, I should be able to make some inferences about what kind of abstraction (i.e., phonemes, or something quite different) would allow that quality to be preserved while making the repertoire infinite. I don't know when I'll get to it, but it's an area open for someone to explore.

    What you've got working is a totally adequate synthesizer. When/if you cap its functionality and performance, and make a compact table-based implementation, I bet it could be a Propeller mainstay for the next 10 years.

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔


    Chip Gracey
    Parallax, Inc.