Phonemic Speech Synthesis (3rd installment, 7 Nov 2006)
Phil Pilgrim (PhiPi)
Attached is a very crude attempt at speech synthesis using Chip's recently posted VocalTract object. The "talk" object is quite rough around the edges, and to say that some of my phonemes are barely intelligible gives them way too much credit. But maybe with input from the community and some fine tuning (okay, coarse tuning), the quality can be improved over time. Chip's marvelously compact object has everything that's needed for intelligible speech. But like any tool of its utility and complexity, it needs to be mastered; and that takes time.
I've relied heavily on this paper for the formant values used in the program. The internet has many other valuable resources for synthesized speech, some dating back decades. This can be a problem, too, since much of the seminal work on the subject was done before the internet existed, and the resulting papers have likely never been converted to machine-readable form and posted.
Much of what is done here via individual argument lists might more efficiently be accomplished by table-driven methods. But in its current form, it's somewhat more readable, which is important for development and debugging. Plus it makes playing with the settings a little easier.
The attached archive includes the latest (v1.02) IDE/compiler exe. If you haven't already installed that version, copy the exe from the ZIP over the existing copy in your Propeller IDE directory.
Anyway, for what it's worth, enjoy!
-Phil
Update (2006.11.04): Attached is a somewhat improved version. Some of the consonants are better, there are more demos, and I've added whispering and a spell procedure. 'Still some extraneous popping and hissing to cure.
Update (2006.11.07): Added inflections, rolled r's, better musical notation, on-the-fly tempo adjustments, multiple speakers.
Post Edited (Phil Pilgrim (PhiPi)) : 11/8/2006 6:26:51 AM GMT
Comments
Wow! I didn't imagine anyone would accomplish so much, so soon. You've made a phoneme layer for the VocalTract in about 300 lines of code.
Interested Propeller programmers could glean a lot from looking at your talk.spin object, as it shows a flow for feeding the VocalTract. As you said, a table-driven implementation would be more compact, but what you've made is very readable and understandable -- and it's a functional general-purpose speech synthesizer!
You could make different formant sets for "man", "woman", and "child" tracts, as well as corresponding pitch ranges... Well, I'm sure you've thought of all that. What you have actually works quite well already. As you said, the enunciation is crude compared to what's possible, but it is synthesizing speech, all right. It sounds like the Votrax SC-01A chip.
Good job!
BTW, if you go to the stereo spatializer thread, the VocalTract in that demo is v1.1. It behaves more sensibly during frame gaps. In fact, I'll just attach it here...
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Chip Gracey
Parallax, Inc.
Also, I seem to be getting some popping and such when I play back the sentences. It looks like the "~" is causing most of them. Any idea why?
Have YOU ever tried to say "~" ??
I agree - this stuff is quite impressive..
Another thing I need to add is a dynamic tempo modifier. The optimum duration of a vowel is context-dependent. Sometimes you want to extend them for emphasis, particularly long vowels; other times shortening them almost to the point of inaudibility works better.
In addition, I haven't really figured out the stress thing. The glottal pitch modifier works fine for songs; but when it's applied to stressed syllables, it sounds totally fake.
Hopefully, people will feel free to experiment with the settings and offer improvements as they discover them. In particular, some of the consonants are virtually unintelligible and need a lot of help.
-Phil
Maybe stress could be better conveyed through a combination of timing, glottal amplitude, perhaps some subtle formant tweaks, as well as glottal pitch.
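As a rough illustration of that combination, here's a hypothetical sketch in Python. The (duration, amplitude, pitch) parameterization and all the scale factors are my own guesses for illustration, not values from talk.spin or the VocalTract:

```python
# Hypothetical sketch (not from talk.spin): stress realized as a blend of
# longer duration, higher glottal amplitude, and only a subtle pitch rise,
# instead of pitch alone. All scale factors are invented.

def apply_stress(duration_ms, amplitude, pitch_hz, stressed):
    """Return adjusted (duration_ms, amplitude, pitch_hz) for one syllable."""
    if not stressed:
        return duration_ms, amplitude, pitch_hz
    return (int(duration_ms * 1.3),      # lengthen the syllable ~30%
            min(1.0, amplitude * 1.2),   # raise glottal amplitude, clamped
            pitch_hz * 1.08)             # modest pitch bump, not a leap

print(apply_stress(120, 0.8, 110, True))
print(apply_stress(120, 0.8, 110, False))
```

The point of the sketch is only that no single cue does all the work; each contributes a little.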
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Chip Gracey
Parallax, Inc.
This is fun but I must go to bed. Thanks for the fun Phil
Graham
Thanks for your comments and suggestions, but mainly for VocalTract.spin! This is too much fun!
I think you're right about the stress thing. It's got to be a combination of all those factors. I need to nail down some consonants first, though. k and g are particularly nettlesome. And I may need different phonemes for leading and trailing zs: 'can't seem to get one to work in both places.
-Phil
Behold, the world's most interesting speech-processing microchip...
Really, I was just going to connect my digital altimeter to my bluetooth PDA, but now I can do what I wanted in the first place. Thank you!
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Dave Evartt
People don't hate computers, they just hate lousy programmers.
http://wehali.com
...Oh, and "g" is a voiced "k", just as "d" is a voiced "t", and "b" is a voiced "p", and "v" is a voiced "f", and "zh" is a voiced "sh", and "z" is a voiced "s". With all these symmetries, you realize the human speech apparatus has a rather limited set of basic sounds it can make.
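Those pairings amount to a simple lookup. A minimal sketch in Python; the pairs are taken from the post above, while the table and function names are invented:

```python
# Unvoiced -> voiced consonant pairs, as listed in the post above.
VOICED_OF = {"k": "g", "t": "d", "p": "b", "f": "v", "sh": "zh", "s": "z"}

def voice(phoneme):
    """Return the voiced counterpart of an unvoiced consonant,
    or the phoneme unchanged if it has no listed counterpart."""
    return VOICED_OF.get(phoneme, phoneme)

print(voice("k"))   # g
print(voice("sh"))  # zh
```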
What I think we need is a motion model of the mouth, where we have only two or three bytes' worth of data which define its position. The raw 13 parameters (perhaps requisite at the bottom level) can define nearly innumerable configurations, 99.99% of which are physiological impossibilities. The real range of mouth movement and behavior is relatively constrained. How to quantify this is tough, though. We need to reduce the complexity somehow and make it very intuitive to configure using mouth-movement type data. For example, formants are significant mainly in relation to each other. Rather than specify exact resonator frequencies, we need a model whereby they find their places based on overall tract formation within the confines of a base tract model's geometries (the male/female/kid/baby differentiator). From those constraints the fricative, plosive, affricate, etc. qualities could be inferred. This could be done right on top of VocalTract. The real magic would come from some lava-lamp-like morphing of the formants in response to mouth movement. This would mean moving formants in a way that the speech apparatus would have to, which is often not a straight line between point A and point B.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Chip Gracey
Parallax, Inc.
Post Edited (Chip Gracey (Parallax)) : 11/2/2006 7:39:35 AM GMT
Whooh! You should write a textbook. That'd be enough to keep any undergrad riveted!
So basically, all the trajectories and interpolation would be done in a smaller-dimensional space, from which the raw parameters could be derived at any given point in time. That makes sense. It would certainly keep memory requirements to a minimum.
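An illustrative sketch of that idea: interpolate in a tiny "mouth state" space and derive raw formant targets from each interpolated point. The three-value state, the linear mapping, and every constant below are invented for illustration; the real VocalTract has many more parameters, and a realistic mapping would be nonlinear:

```python
# Interpolate in a small articulatory space, then map each interpolated
# point to raw formant targets. All values here are invented.

def lerp(a, b, t):
    return a + (b - a) * t

def interp_mouth(m0, m1, t):
    """Interpolate two low-dimensional mouth states (tuples), 0 <= t <= 1."""
    return tuple(lerp(a, b, t) for a, b in zip(m0, m1))

def mouth_to_formants(mouth):
    """Toy linear mapping from (jaw, tongue, lips) in [0,1] to (F1, F2, F3) in Hz."""
    jaw, tongue, lips = mouth
    return (200 + 600 * jaw, 800 + 1400 * tongue, 2300 + 400 * lips)

# Sweep from an "ah"-like state to an "ee"-like state in five steps:
ah, ee = (0.9, 0.3, 0.5), (0.2, 0.9, 0.3)
for i in range(5):
    print(mouth_to_formants(interp_mouth(ah, ee, i / 4)))
```

Only the two endpoint states need storing; everything in between is derived, which is what keeps the memory requirements down.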
Going one step further still, and to yield the most natural-sounding speech (giving encodings like ADPCM a run for their money), there needs to be a way to go the other way, too: from natural speech (recordings) to the parameters that can produce a reasonable facsimile. That's going to be extremely hard. It'd be an interesting task to train a neural net on.
-Phil
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Chip Gracey
Parallax, Inc.
I'm having VocalTract.spin problems. I've redownloaded Chip's new one from above. Here's the line of code that I'm getting an error on:
        mov     t1,vr                   'vibrato rate
        shr     t1,#10
        add     vphase,t1
        mov     t1,vp                   'vibrato pitch
        mov     t2,vphase
        call    #sine                   '<-- error: "expected DAT symbol"
Thanks, Brian
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Dave Evartt
People don't hate computers, they just hate lousy programmers.
http://wehali.com
Graham
That was the problem, thanks. (Sounds awesome!)
Brian
Thanks for the offer. Unfortunately, I don't have a particular title in mind. I just remember there being a lot of ferment in the area back in the 70's. I even took a college course that included speech synthesis, but didn't keep any of my notes or instructional materials. Now I wish I had. One of the great things about the internet is that one can live in a backwater town, like I do, and still have access to a world of resources. But if the resources you need are pre-90s, you're often out of luck.
Chip,
Do you recall where you read about the parameters for the k sound? That's the kind of info that would come in handy. The source I cited is pretty sketchy on consonants.
Thanks,
Phil
http://web.inter.nl.net/hcc/davies/esp7cpt.html
Scroll 80% of the way down and you'll see all the consonant recipes. This is the most straightforward description I've found. It took me about half an hour to read this documentation, but afterwards, I felt like I was on very solid ground. Here's the picture of interest:
[attached image]
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Chip Gracey
Parallax, Inc.
http://en.wikipedia.org/wiki/Speech_synthesis
A belated thanks for the reference: it contains some good insights. I'm still trying to get that k sound right, with some success, but I'm still not satisfied. In the process, I managed to kill the glottal amplitude completely once. Oddly enough, I got a whisper, and it was intelligible! And that got me thinking: if a g is just a voiced k, and a z a voiced s, how are we able to make them sound different when whispering? By trying it, I realized that the tongue positions are a little different. In the unvoiced consonants, it's flattened against the teeth or the palate more than with their voiced brethren. This may explain why z has been so danged elusive. Mine sounds like a buzzy s, but there's more to it than that. The investigation continues...
-Phil
Post Edited (Phil Pilgrim (PhiPi)) : 11/4/2006 7:16:09 AM GMT
About "z", you might try turning on the nasal anti-resonator to simulate near-closure of the mouth. There is one issue that might need revisiting in the VocalTract, and that is an option to modulate the frication with the glottal pulse so that "z", "zh", "v", and the "th" in "then" buzz slightly. Think "zzzzzzap" and "vvvvvvvvvoice". I'm hoping it's not required, but you may find that it is. It wouldn't be too hard to do, but for simplicity's sake, I hope it's not necessary.
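A sketch of what that modulation might look like, reduced to plain Python. The sample rate, glottal frequency, and modulation depth are illustrative choices, not VocalTract's actual values:

```python
# Amplitude-modulate frication noise with the glottal cycle so a voiced
# fricative buzzes. No audio I/O; this just builds a sample buffer.
import math
import random

SAMPLE_RATE = 8000  # illustrative only

def voiced_frication(duration_s, glottal_hz=120, depth=0.7, seed=1):
    """Return noise samples whose envelope follows the glottal cycle.
    depth=0 gives plain unvoiced frication; depth=1 fully gates the noise."""
    rng = random.Random(seed)
    samples = []
    for i in range(int(duration_s * SAMPLE_RATE)):
        phase = 2 * math.pi * glottal_hz * i / SAMPLE_RATE
        # Raised-cosine envelope at the glottal rate, mixed with a floor:
        envelope = (1 - depth) + depth * 0.5 * (1 + math.cos(phase))
        samples.append(envelope * rng.uniform(-1, 1))
    return samples

buzz = voiced_frication(0.05)
print(len(buzz))  # 400 samples for 50 ms at 8 kHz
```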
Are you using that spectrograph I posted to see the spectrum over time? You need some tool like that to see what's going on with real sounds. Thanks for the updates.
BTW, as you found, formants are still intelligible with only aspiration excitation. You can make any speech sound whispered by keeping the aspiration amplitude proportional to what the glottal amplitude was.
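A sketch of that whispering rule. The frame layout (glottal amplitude, aspiration amplitude, then formant values) and the 0.6 carry factor are my own assumptions for illustration, not VocalTract's actual interface:

```python
# Whispering: zero the glottal source and carry a fraction of its old
# amplitude on the aspiration channel instead. Frame layout is assumed.

def whisper(frames, carry=0.6):
    """Convert voiced frames to whispered ones."""
    out = []
    for glottal, aspiration, *formants in frames:
        out.append((0.0, max(aspiration, carry * glottal), *formants))
    return out

voiced = [(0.8, 0.1, 730, 1090), (0.0, 0.5, 270, 2290)]
print(whisper(voiced))
```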
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Chip Gracey
Parallax, Inc.
Post Edited (Chip Gracey (Parallax)) : 11/4/2006 8:21:41 AM GMT
It doesn't sound that bad at all. With a little tuning this thing could be a really useful object.
I've attached a slightly improved version to the thread's top post.
-Phil
P.S. Chip: BTW, your spectrum analyzer has helped a lot! Thanks!
Post Edited (Phil Pilgrim (PhiPi)) : 11/5/2006 3:53:05 AM GMT
That is, indeed, sounding better now. You are on the verge of being able to make a numbers-and-units talker (i.e., "one point seven microamps").
Have you discovered any limitations of the model? Is there anything that needs tuning, changing, or to be added?
Keep up the great work!
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Chip Gracey
Parallax, Inc.
Thanks! There's still a ways to go, and I really need to start thinking about converting to a table-driven program at some point. So far the only limitations I've encountered have been my own: time and understanding.
I still haven't quite nailed the k. It sounds better before some vowels than others. And the z works poorly as a final consonant. But I've discovered that substituting an s there (i.e. "cars", instead of "carz") works well enough. And there are still some pops, clicks, and hisses that need eradicating...
Graham, I can't wait to hear it: a Propeller with a British accent! (Or is that without an American accent?)
-Phil
I like the fact that it can sing too.
Tim
One word of warning: linguistics is a fascinating subject, and you may get hooked on it! Furthermore, studying linguistics will make you a better programmer, since programming languages are really designed for us, not for computers. That's one of the interesting aspects of Perl - Larry Wall is a linguist, and he designed Perl from that perspective.
Although I've been following this thread, it's been a while since I really focused on linguistics (having been a language major in college). I'm not too certain what type of data structures Spin allows, but it shouldn't be too difficult to build a phone model that uses mostly on-off representations of individual properties. So if "voiced" is a property, 1 = voiced, 0 = unvoiced. By setting the relevant properties on or off, you should be able to define your phones. From this, you could build the phonological models for individual languages. In C++, I could see using a struct for the basic phone representation, and classes for the phonological environments.
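A sketch of that suggestion using bit flags in Python; Spin could hold the same flags in a byte per phone. The feature set and the phone entries are illustrative, not a complete phonology:

```python
# Each phone is a word of on/off feature bits, as suggested above.
VOICED    = 1 << 0
NASAL     = 1 << 1
FRICATIVE = 1 << 2
PLOSIVE   = 1 << 3

PHONES = {
    "s": FRICATIVE,
    "z": FRICATIVE | VOICED,
    "t": PLOSIVE,
    "d": PLOSIVE | VOICED,
    "m": NASAL | VOICED,
}

def has(phone, feature):
    """True if the phone's flag word includes the given feature bit."""
    return bool(PHONES[phone] & feature)

print(has("z", VOICED))   # True
print(has("s", VOICED))   # False
```

One nice property of this encoding: the voiced/unvoiced symmetry discussed earlier in the thread becomes a single-bit difference ("s" vs. "z", "t" vs. "d").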
For anybody interested in a bit of basic linguistics, here is a page with some on-line lecture notes:
www.ling.upenn.edu/courses/Summer_2004/ling001/schedule.html
The second lecture would most closely relate to this thread.
That's a really good link that should help a lot of people get their feet wet with this stuff. Combined with Chip's CompuTalker reference and the Wikipedia article cited earlier in this thread, it should be possible to get a grasp on the basic concepts discussed here. Thanks!
Chip,
Kevin's link also led me here: www.haskins.yale.edu/facilities/asy.html, a website devoted to "articulatory synthesis", which synthesizes speech by modeling the mechanics of the vocal tract.
-Phil
It just keeps getting better sounding. I like that syntax it uses for speech sounds, and all the little controls within strings to change pitch up and down. That's pretty fancy, but easy to set up. The nice thing is, you can "read" it almost as fast as if it were typed English. I bet your brain is pretty acclimated now to typing phonetics in.
Like you mentioned earlier, going to a table-based approach would make things way smaller. What you have now is a more flexible test bed, though.
I think the sound quality is now approaching that of the phoneme-based synthesizers that have been around. I'm wondering, how can you make a quantum improvement over them? Maybe phonemes, as they are often realized, are not the way to get there. My gut feeling is that they are too limited in context. BTW, your interpolation between phonemes has helped a lot -- many of the pops are gone, which is great.
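An illustrative sketch of that interpolation: cross-fade formant targets linearly between two phonemes so the transition doesn't click. The (F1, F2, F3) values are textbook-style vowel targets, not numbers from talk.spin:

```python
# Linear cross-fade between two phonemes' formant targets.

def interpolate_frames(frame_a, frame_b, steps):
    """Yield `steps` frames sliding linearly from frame_a to frame_b."""
    for i in range(steps):
        t = i / (steps - 1) if steps > 1 else 0.0
        yield tuple(a + (b - a) * t for a, b in zip(frame_a, frame_b))

# "ah"-like to "ee"-like targets, five intermediate frames:
frames = list(interpolate_frames((730, 1090, 2440), (270, 2290, 3010), 5))
print(frames[0])   # (730.0, 1090.0, 2440.0)
print(frames[-1])  # (270.0, 2290.0, 3010.0)
```

As Chip notes above, moving formants along straight lines is a simplification; a real vocal tract often takes a curved path between configurations.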
What I was going to do, myself, was work on individual words via the VocalTract, directly, and make them sound as perfect as I could, so that the limitations of the synthesizer were defining the quality limits. Then, I figured that after I had done a good number of words, I should be able to make some inferences about what kind of abstraction (i.e., phonemes, or something quite different) would allow that quality to be preserved while making the repertoire infinite. I don't know when I'll get to it, but it's an area open for someone to explore.
What you've got working is a totally adequate synthesizer. When/if you cap its functionality and performance, and make a compact table-based implementation, I bet it could be a Propeller mainstay for the next 10 years.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Chip Gracey
Parallax, Inc.