PDA

View Full Version : Phonemic Speech Synthesis (3rd installment, 7 Nov 2006)



Phil Pilgrim (PhiPi)
11-01-2006, 04:44 PM
Attached is a very crude attempt at speech synthesis using Chip's recently posted VocalTract object. The "talk" object is quite rough around the edges, and to say that some of my phonemes are barely intelligible gives them way too much credit. But maybe with input from the community and some fine tuning (okay, coarse tuning), the quality can be improved over time. Chip's marvelously compact object has everything that's needed for intelligible speech. But like any tool of its utility and complexity, it needs to be mastered; and that takes time.

I've relied heavily on this paper (http://www.ling.ohio-state.edu/courses/materials/825/klsyn-dos/klsynman.pdf) for the formant values used in the program. The internet has many other valuable resources for synthesized speech, some dating back decades. This can be a problem, too, since much of the seminal work on the subject was done before the internet existed, and the resulting papers have likely never been converted to machine-readable form and posted.

Much of what is done here via individual argument lists might more efficiently be accomplished by table-driven methods. But in its current form, it's somewhat more readable, which is important for development and debugging. Plus it makes playing with the settings a little easier.

The attached archive includes the latest (v1.02) IDE/compiler exe. If you haven't already installed that version, copy the exe from the ZIP over the existing copy in your Propeller IDE directory.

Anyway, for what it's worth, enjoy!

-Phil

Update (2006.11.04): Attached is a somewhat improved version. Some of the consonants are better, there are more demos, and I've added whispering and a spell procedure. 'Still some extraneous popping and hissing to cure.

Update (2006.11.07): Added inflections, rolled r's, better musical notation, on-the-fly tempo adjustments, multiple speakers.

Post Edited (Phil Pilgrim (PhiPi)) : 11/8/2006 6:26:51 AM GMT

cgracey
11-01-2006, 09:01 PM
Phil,

Wow! I didn't imagine anyone would accomplish so much, so soon. You've made a phoneme layer for the VocalTract in about 300 lines of code.

Interested Propeller programmers could glean a lot from looking at your talk.spin object, as it shows a flow for feeding the VocalTract. As you said, a table-driven implementation would be more compact, but what you've made is very readable and understandable -- and it's a functional general-purpose speech synthesizer!

You could make different formant sets for "man", "woman", and "child" tracts, as well as corresponding pitch ranges... Well, I'm sure you've thought of all that. What you have actually works quite well, already. As you said, the annunciation is crude compared to what's possible, but it is synthesizing speech, all right. It sounds like the Votrax SC-01A chip.

Good job!

BTW, if you go to the stereo spatializer thread, the VocalTract in that demo is v1.1. It behaves more sensibly during frame gaps. In fact, I'll just attach it here...

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔


Chip Gracey
Parallax, Inc.

Cobalt
11-02-2006, 12:29 AM
WOW... just wow, this is simply awesome. First Chip made the monks/seven demo, and now this... I think I'm going to be losing some sleep tonight!

Also, I seem to be getting some popping and such when I play back the sentences. It looks like the "~" is causing most of them; any idea why?

Paul Sr.
11-02-2006, 02:02 AM
Cobalt said...
WOW... just wow, this is simply awesome. First Chip made the monks/seven demo, and now this... I think I'm going to be losing some sleep tonight!

Also, I seem to be getting some popping and such when I play back the sentences. It looks like the "~" is causing most of them; any idea why?


Have YOU ever tried to say "~" ??

I agree - this stuff is quite impressive..

Phil Pilgrim (PhiPi)
11-02-2006, 03:06 AM
Some of my transitions between frames are pretty rough. The popping that you hear may be coming from overly abrupt changes, or it might be from bad gain settings leading to overflow; I'm just not sure which. I added the "~" to give emphasis to terminal consonants -- sort of a Lawrence Welk effect, though not nearly so protracted. The reason is that some of them seemed to get swallowed without the added vocalization.

Another thing I need to add is a dynamic tempo modifier. The optimum duration of a vowel is context-dependent. Sometimes you want to extend them for emphasis, particularly long vowels; other times shortening them almost to the point of inaudibility works better.

In addition, I haven't really figured out the stress thing. The glottal pitch modifier works fine for songs; but when it's applied to stressed syllables, it sounds totally fake.

Hopefully, people will feel free to experiment with the settings and offer improvements as they discover them. In particular, some of the consonants are virtually unintelligible and need a lot of help.

-Phil

cgracey
11-02-2006, 03:31 AM
Phil Pilgrim (PhiPi) said...


I haven't really figured out the stress thing. The glottal pitch modifier works fine for songs; but when it's applied to stressed syllables, it sounds totally fake.

Phil,

Maybe stress could be better conveyed through a combination of timing, glottal amplitude, perhaps some subtle formant tweaks, as well as glottal pitch.
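
In sketch form (Python for illustration; the frame fields and scale factors below are invented, not anything from talk.spin), combining those cues rather than using pitch alone might look like:

```python
# Illustrative sketch: stressing a phoneme frame by combining duration,
# pitch, and glottal amplitude. All field names and factors are guesses.

def apply_stress(frame, stress):
    """Return a stressed copy of a (duration_ms, pitch, amplitude) frame.

    stress ranges from 0.0 (unstressed) to 1.0 (fully stressed).
    """
    duration, pitch, amplitude = frame
    return (
        duration * (1.0 + 0.30 * stress),   # stressed syllables run longer
        pitch + 2.0 * stress,               # modest pitch rise, not a leap
        amplitude * (1.0 + 0.25 * stress),  # louder glottal excitation
    )

stressed = apply_stress((120, 60.0, 100.0), 1.0)
```

The point of the sketch is that no single cue carries the stress by itself, so each factor can stay subtle.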


Graham Stabler
11-02-2006, 10:20 AM
Try this:





t.say(string("+7he-loa ever+i-won. -doa-nt ++yoo, th+ink -tha-t dher proa+pel-er is +soa, --cooool"))
t.say(string("+9Videe-oakild-her ++raid--ee--oa star. +4Videe-oakild-her ++raid--ee--oa star"))
t.say(string("+5in ++mae mae-n-d, an-d ++in -mae, car"))
t.say(string("+5wee ++cahnt ree-wae-nd wee-v ++gon -too, far"))
t.say(string("+8oa, +we, -oa. yoo +wer-dher +ferst -won"))
t.say(string("+8oa, +we, -oa. yoo +wer-dher +last -won"))
t.say(string("+9Videe-oakild-her ++raid--ee--oa star. +4Videe-oakild-her ++raid--ee--oa star"))





This is fun but I must go to bed. Thanks for the fun Phil

Graham

Phil Pilgrim (PhiPi)
11-02-2006, 01:05 PM
Chip,

Thanks for your comments and suggestions, but mainly for VocalTract.spin! This is too much fun!

I think you're right about the stress thing. It's got to be a combination of all those factors. I need to nail down some consonants first, though. k and g are particularly nettlesome. And I may need different phonemes for leading and trailing zs: 'can't seem to get one to work in both places.

-Phil

davee
11-02-2006, 01:46 PM
Fantastic! I was wanting to build a talking altimeter for my HPR rockets.

Behold, the world's most interesting speech processing microchip..

Really, I was just going to connect my digital altimeter to my bluetooth PDA, but now I can do what I wanted in the first place. Thank you!

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Dave Evartt

People don't hate computers, they just hate lousy programmers.

http://wehali.com

cgracey
11-02-2006, 02:32 PM
Phil Pilgrim (PhiPi) said...
Chip,

Thanks for your comments and suggestions, but mainly for VocalTract.spin! This is too much fun!

I think you're right about the stress thing. It's got to be a combination of all those factors. I need to nail down some consonants first, though. k and g are particularly nettlesome. And I may need different phonemes for leading and trailing zs: 'can't seem to get one to work in both places.

-Phil
If I recall, a leading "k" is made by a short white-noise burst between the following vowel's F2 and F3 positions; then F2 and F3 rapidly head to their vowel positions from the "k" center, with aspiration turning to voiced excitation. For a trailing "k", the leading vowel's F2 and F3 converge onto their average as they fade; then there's a silent pause, followed by the white-noise burst where F2 and F3 converged, then an unvoiced (aspirated) "uhhh" sound starting at the same point. It's necessary to use the surrounding vowels like this: the "k" in "hike" is audibly higher than in "hook".
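
That leading-"k" recipe could be roughed out as a frame list like this (Python for illustration; every timing, field name, and amplitude below is a guess, not a VocalTract parameter):

```python
# Sketch of a leading "k": a brief noise burst between the following
# vowel's F2/F3, then the formants glide to the vowel targets while
# aspiration gives way to voicing. Values are illustrative only.

def leading_k_frames(vowel_f2, vowel_f3):
    burst_center = (vowel_f2 + vowel_f3) / 2    # burst sits between F2 and F3
    return [
        # (ms, F2, F3, noise_amp, aspiration_amp, voicing_amp)
        (15, burst_center, burst_center, 1.0, 0.0, 0.0),      # white-noise burst
        (30, (burst_center + vowel_f2) / 2,
             (burst_center + vowel_f3) / 2, 0.0, 0.8, 0.2),   # glide toward vowel
        (40, vowel_f2, vowel_f3, 0.0, 0.2, 0.9),              # voiced vowel onset
    ]
```

Because the burst position is computed from the vowel's own F2/F3, the "hike" vs. "hook" difference falls out automatically.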

...Oh, and "g" is a voiced "k", just as "d" is a voiced "t", "b" is a voiced "p", "v" is a voiced "f", "zh" is a voiced "sh", and "z" is a voiced "s". All these symmetries, and you realize the human speech apparatus has a rather limited set of basic sounds it can make.

What I think we need is a motion model of the mouth, where we have only two or three bytes' worth of data which define its position. The raw 13 parameters (perhaps requisite at the bottom level) can define nearly innumerable configurations, 99.99% of which are physiological impossibilities. The real range of mouth movement and behavior is relatively constrained. How to qualify this is tough, though. We need to reduce the complexity somehow and make it very intuitive to configure using mouth-movement-type data. For example, formants are significant mainly in relation to each other. Rather than specify exact resonator frequencies, we need a model whereby they find their places based on overall tract formation, within the confines of a base tract model's geometries (the male/female/kid/baby differentiator). From those constraints the fricative, plosive, affricate, etc. qualities could be inferred. This could be done right on top of VocalTract. The real magic would come from some lava-lamp-like morphing of the formants in response to mouth movement. This would mean moving formants in a way that the speech apparatus would have to, which is often not a straight line between point A and point B.


Post Edited (Chip Gracey (Parallax)) : 11/2/2006 7:39:35 AM GMT

Phil Pilgrim (PhiPi)
11-02-2006, 03:31 PM
Thanks, Chip. That may explain the difficulty I've been having. I was trying to keep things context-independent as much as possible. But it looks like I'll need a bit of look-ahead when processing things like k.


Chip Gracey said...
... with an aspiration turning to voiced excitation.

Whooh! You should write a textbook. That'd be enough to keep any undergrad riveted! http://forums.parallax.com/images/smilies/smile.gif


Chip Gracey said...
What I think we need is a motion model of the mouth, where we have only two or three bytes' worth of data which define its position. The raw 13 parameters (perhaps requisite at the bottom level) can define nearly innumerable configurations, 99.99% of which are physiological impossibilities. ...

So basically, all the trajectories and interpolation would be done in a smaller-dimensional space, from which the raw parameters could be derived at any given point in time. That makes sense. It would certainly keep memory requirements to a minimum.
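
A toy version of that idea, sketched in Python (the two-dimensional "mouth space" and the linear mapping are stand-ins, not a real articulatory model):

```python
# Sketch: interpolate trajectories in a small "mouth position" space,
# then derive raw formant parameters at each time step. The jaw/tongue
# coordinates and the linear mapping are hypothetical.

def lerp(a, b, t):
    return a + (b - a) * t

def mouth_to_formants(jaw, tongue):
    # Hypothetical mapping: jaw openness drives F1, tongue advancement F2.
    f1 = 250 + 550 * jaw          # roughly 250..800 Hz
    f2 = 900 + 1600 * tongue      # roughly 900..2500 Hz
    return (f1, f2)

def trajectory(pos_a, pos_b, steps):
    """Interpolate between two mouth positions; emit formants per step."""
    out = []
    for i in range(steps):
        t = i / (steps - 1)
        jaw = lerp(pos_a[0], pos_b[0], t)
        tongue = lerp(pos_a[1], pos_b[1], t)
        out.append(mouth_to_formants(jaw, tongue))
    return out
```

Only the two mouth coordinates per keyframe would need storing; the raw parameters are recomputed on the fly.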

Going one step further still, and to yield the most natural-sounding speech (giving encodings like ADPCM a run for the money), there needs to be a way to go the other way, too: from natural speech (recordings) to the parameters that can produce a reasonable facsimile. That's going to be extremely hard. It'd be an interesting task to train a neural net on.

-Phil

cgracey
11-02-2006, 03:53 PM
Phil Pilgrim (PhiPi) said...


Going one step further still, and to yield the most natural-sounding speech (giving encodings like ADPCM a run for the money), there needs to be a way to go the other way, too: from natural speech (recordings) to the parameters that can produce a reasonable facsimile. That's going to be extremely hard. It'd be an interesting task to train a neural net on.

-Phil
In some old speech-processing book I have from 1978, they mention a formant-based vocoder system that squished speech down to 600 bits per second, and they said you could recognize a person's voice through it. Imagine that -- 600 bps, without compression. It could probably be compressed to less than half that in real time, maybe even a tenth with a bit of loss over a longer recording.
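
The arithmetic checks out with plausible field widths (the frame layout below is a guess for illustration, not the book's actual format):

```python
# Back-of-envelope check on the 600 bit/s figure: a modest formant
# parameter frame at a modest frame rate lands right in that range.

frame_rate = 20            # frames per second
bits_per_frame = (
    6 +   # F1 (64 quantization steps)
    6 +   # F2
    6 +   # F3
    6 +   # glottal pitch
    3 +   # voiced amplitude
    3     # noise amplitude
)
bits_per_second = frame_rate * bits_per_frame
print(bits_per_second)   # 600
```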


11-02-2006, 09:06 PM
Hi,
I'm having VocalTract.spin problems. I've re-downloaded Chip's new one from above. Here's the code where I'm getting an error:

        mov     t1,vr                   'vibrato rate
        shr     t1,#10
        add     vphase,t1
        mov     t1,vp                   'vibrato pitch
        mov     t2,vphase
        call    #sine                   '*expected DAT symbol

Thanks, Brian

davee
11-02-2006, 09:13 PM
I think maybe the 1.0.3 version of the tool should be plugged into the Propeller download page. My guess is that truckwiz is using the old tool.


Graham Stabler
11-02-2006, 10:03 PM
Phil, if there are any papers you want, I have access to an academic library and a scanner. I'd be happy to PDF anything you think is really seminal.

Graham

11-02-2006, 11:31 PM
Dave,
that was the problem, thanks (sounds awesome)

Brian

Phil Pilgrim (PhiPi)
11-03-2006, 01:40 AM
Graham,

Thanks for the offer. Unfortunately, I don't have a particular title in mind. I just remember there being a lot of ferment in the area back in the 70's. I even took a college course that included speech synthesis, but didn't keep any of my notes or instructional materials. Now I wish I had. One of the great things about the internet is that one can live in a backwater town, like I do, and still have access to a world of resources. But if the resources you need are pre-90s, you're often out of luck.

Chip,

Do you recall where you read about the parameters for the k sound? That's the kind of info that would come in handy. The source I cited is pretty sketchy on consonants.

Thanks,
Phil

cgracey
11-03-2006, 02:13 AM
Phil Pilgrim (PhiPi) said...

Do you recall where you read about the parameters for the k sound? That's the kind of info that would come in handy. The source I cited is pretty sketchy on consonants.

It can be found here:

http://web.inter.nl.net/hcc/davies/esp7cpt.html

Scroll 80% of the way down and you'll see all the consonant recipes. This is the most straightforward description I've found. It took me about half an hour to read this documentation, but afterwards I felt like I was on very solid ground. Here's the picture of interest:

http://forums.parallax.com/attachment.php?attachmentid=43969


sharpie
11-03-2006, 11:15 AM
For what it is worth to those interested in a good starting point.. (to understand some of what these guys are talking about) =)

http://en.wikipedia.org/wiki/Speech_synthesis

Phil Pilgrim (PhiPi)
11-04-2006, 03:09 PM
Chip,

A belated thanks for the reference: it contains some good insights. I'm still trying to get that k sound right, with some success, but I'm still not satisfied. In the process, I managed to kill the glottal amplitude completely once. Oddly enough, I got a whisper, and it was intelligible! And that got me thinking: if a g is just a voiced k, and a z a voiced s, how are we able to make them sound different when whispering? By trying it, I realized that the tongue positions are a little different. In the unvoiced consonants, it's flattened against the teeth or the palate more than with their voiced brethren. This may explain why z has been so danged elusive. Mine sounds like a buzzy s, but there's more to it than that. The investigation continues...

-Phil

Post Edited (Phil Pilgrim (PhiPi)) : 11/4/2006 7:16:09 AM GMT

cgracey
11-04-2006, 04:16 PM
Phil Pilgrim (PhiPi) said...
Chip,

A belated thanks for the reference: it contains some good insights. I'm still trying to get that k sound right, with some success, but I'm still not satisfied. In the process, I managed to kill the glottal amplitude completely once. Oddly enough, I got a whisper, and it was intelligible! And that got me thinking: if a g is just a voiced k, and a z a voiced s, how are we able to make them sound different when whispering? By trying it, I realized that the tongue positions are a little different. In the unvoiced consonants, it's flattened against the teeth or the palate more than with their voiced brethren. This may explain why z has been so danged elusive. Mine sounds like a buzzy s, but there's more to it than that. The investigation continues...

-Phil
I think the difference between a whispered "k" and a whispered "g" is the onset rate of the aspiration after the white burst; or, looked at differently, how close F2 and F3 start from the burst and how long they are delayed -- "g" would be close and "k" would be farther. I also suspect that F1 in "g" starts out lower. Just whisper "kai" and "guy" to yourself.

About "z", you might try turning on the nasal anti-resonator to simulate near-closure of the mouth. There is one issue that might need revisiting in the VocalTract, and that is an option to modulate the frication with the glottal pulse so that "z", "zh", "v", and "th" buzz slightly. Think "zzzzzzap" and "vvvvvvvvvoice". I'm hoping it's not required, but you may find that it is. It wouldn't be too hard to do, but for simplicity's sake, I hope it's not necessary.
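
A quick sketch of what that modulation would do (Python, toy waveforms; this is not VocalTract's internal structure):

```python
# Sketch: modulate frication (noise) amplitude with the glottal pulse so
# voiced fricatives buzz. depth=0 is plain frication; depth=1 is fully
# pulse-modulated. All parameters here are illustrative.

import math
import random

def voiced_frication(n_samples, pitch_hz, sample_rate=8000, depth=0.5):
    random.seed(1)                       # deterministic for demonstration
    out = []
    for i in range(n_samples):
        # A raised sinusoid stands in for the glottal pulse train.
        glottal = 0.5 * (1 + math.sin(2 * math.pi * pitch_hz * i / sample_rate))
        noise = random.uniform(-1, 1)
        out.append(noise * ((1 - depth) + depth * glottal))
    return out
```

At depth 0 this degenerates to an "s"-like hiss; raising depth adds the "z"-like buzz at the glottal rate.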

Are you using that spectrograph I posted to see the spectrum over time? You need some tool like that to see what's going on with real sounds. Thanks for the updates.

BTW, as you found, formants are still intelligible with only aspiration excitation. Any speech you make should sound whispered by keeping the aspiration amplitude proportional to what the glottal amplitude was.
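
In sketch form (hypothetical frame fields, not the actual VocalTract interface), the whispering conversion is just a reassignment of excitation:

```python
# Sketch: keep the formant track but replace glottal excitation with
# aspiration at a proportional amplitude. Frame dicts are invented.

def whisper(frames, ratio=0.7):
    out = []
    for f in frames:
        g = dict(f)
        g["aspiration"] = f["aspiration"] + ratio * f["glottal"]
        g["glottal"] = 0                 # silence the voiced source entirely
        out.append(g)
    return out

spoken = [{"glottal": 100, "aspiration": 0, "f1": 310, "f2": 2020}]
whispered = whisper(spoken)
```

The formant values pass through untouched, which is why the whisper stays intelligible.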


Post Edited (Chip Gracey (Parallax)) : 11/4/2006 8:21:41 AM GMT

Ym2413a
11-05-2006, 12:42 AM
Hey this is really cool, I just now found the time to compile this and run it.
It doesn't sound that bad at all. With a little tuning this thing could be a really useful object.
http://forums.parallax.com/images/smilies/smile.gif

Phil Pilgrim (PhiPi)
11-05-2006, 11:30 AM
Hi all,

I've attached a slightly improved version to the thread's top post.

-Phil

P.S. Chip: BTW, your spectrum analyzer has helped a lot! Thanks!

Post Edited (Phil Pilgrim (PhiPi)) : 11/5/2006 3:53:05 AM GMT

cgracey
11-05-2006, 01:02 PM
Phil,

That is, indeed, sounding better now. You are on the verge of being able to make a numbers-plus-units talker (i.e. "one point seven microamps").

Have you discovered any limitations of the model? Is there anything that needs tuning, changing, or to be added?

Keep up the great work!


Phil Pilgrim (PhiPi) said...
Hi all,

I've attached a slightly improved version to the thread's top post.

-Phil

P.S. Chip: BTW, your spectrum analyzer has helped a lot! Thanks!


Phil Pilgrim (PhiPi)
11-05-2006, 04:48 PM
Chip,

Thanks! There's still a ways to go, and I really need to start thinking about converting to a table-driven program at some point. So far the only limitations I've encountered have been my own: time and understanding.

I still haven't quite nailed the k. It sounds better before some vowels than others. And the z works poorly as a final consonant. But I've discovered that substituting an s there (i.e. "cars", instead of "carz") works well enough. And there are still some pops, clicks, and hisses that need eradicating...


Graham Stabler said...
I'm hoping to analyse my own voice a bit and get Phil's speech object sounding more like me.

Graham, I can't wait to hear it: a Propeller with a British accent! (Or is that without an American accent?) http://forums.parallax.com/images/smilies/smile.gif

-Phil

Ym2413a
11-06-2006, 03:54 AM
I'm surprised at how well it speaks for being just released.
I like the fact that it can sing too.

Tim-M
11-06-2006, 06:52 AM
Would any of you be willing to post an example audio file or two for those of us who don't have Propellers yet? I know this is a large favor to ask, but I'd love to hear what you guys have been working on. Chip's examples of the 'Singing Monks' and 'Seven' are so amazing that I can hardly imagine what may be next. Thanks to Chip and all of you for your hard work.

Tim

Kevin Wood
11-06-2006, 10:36 AM
For anybody interested in doing some more in-depth language analysis and research, I suggest checking out SIL International. SIL International is affiliated with Wycliffe Bible Translators, and they have some freeware versions of their linguistics field research tools and references.

One word of warning - linguistics is a fascinating subject, and you may get hooked on it! Furthermore, studying linguistics will make you a better programmer, since programming languages are really designed for us, not for computers. That's one of the interesting aspects of Perl - Larry Wall is a linguist, and he designed Perl from that perspective.

Although I've been following this thread, it's been a while since I really focused on linguistics (having been a language major in college). I'm not too certain what type of data structures Spin allows, but it shouldn't be too difficult to build a phone model that uses mostly on/off representations of individual properties. So if "voiced" is a property, 1 = voiced, 0 = devoiced. By setting the relevant properties on or off, you should be able to define your phones. From this, you could build the phonological models for individual languages. In C++, I could see using a struct for the basic phone representation and classes for the phonological environments.
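
A minimal sketch of that on/off encoding (Python for illustration; the feature set and encodings are made up, though the voiced/devoiced pairs follow the symmetries Chip pointed out earlier):

```python
# Sketch: each phone is a bitmask of binary features, so "g is a voiced
# k" is literally one bit apart. Feature assignments are illustrative.

VOICED    = 1 << 0
PLOSIVE   = 1 << 1
FRICATIVE = 1 << 2
VELAR     = 1 << 3
ALVEOLAR  = 1 << 4

PHONES = {
    "k": PLOSIVE | VELAR,
    "g": PLOSIVE | VELAR | VOICED,
    "s": FRICATIVE | ALVEOLAR,
    "z": FRICATIVE | ALVEOLAR | VOICED,
}

def devoice(phone_bits):
    """Clear the voicing feature, e.g. for whispering."""
    return phone_bits & ~VOICED

assert devoice(PHONES["g"]) == PHONES["k"]
```

In Spin, the same idea would pack comfortably into one byte or word per phone in a DAT table.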

For anybody interested in a bit of basic linguistics, here is a page with some on-line lecture notes:
www.ling.upenn.edu/courses/Summer_2004/ling001/schedule.html (http://www.ling.upenn.edu/courses/Summer_2004/ling001/schedule.html)

The second lecture would most closely relate to this thread.

Phil Pilgrim (PhiPi)
11-06-2006, 12:44 PM
Kevin,

That's a really good link that should help a lot of people get their feet wet with this stuff. Combined with Chip's CompuTalker reference and the Wikipedia article cited earlier in this thread, it should be possible to get a grasp on the basic concepts discussed here. Thanks!

Chip,

Kevin's link also led me here: www.haskins.yale.edu/facilities/asy.html (http://www.haskins.yale.edu/facilities/asy.html), a website devoted to "articulatory synthesis", which synthesizes speech by modeling the mechanics of the vocal tract.

-Phil

Phil Pilgrim (PhiPi)
11-08-2006, 02:25 PM
I've attached a new update to the lead post in this thread.

-Phil

cgracey
11-08-2006, 03:17 PM
Phil,

It just keeps getting better sounding. I like that syntax it uses for speech sounds, and all the little controls within strings to change pitch up and down. That's pretty fancy, but easy to set up. The nice thing is, you can "read" it almost as fast as if it were typed English. I bet your brain is pretty acclimated now to typing phonetics in.

Like you mentioned earlier, going to a table-based approach would make things way smaller. What you have now is a more flexible test bed, though.

I think the sound quality is now approaching that of the phoneme-based synthesizers that have been around. I'm wondering, how can you make a quantum improvement over them? Maybe phonemes, as they are often realized, are not the way to get there. My gut feeling is that they are too limited in context. BTW, your interpolation between phonemes has helped a lot -- many of the pops are gone, which is great.

What I was going to do, myself, was work on individual words via the VocalTract, directly, and make them sound as perfect as I could, so that the limitations of the synthesizer were defining the quality limits. Then, I figured that after I had done a good number of words, I should be able to make some inferences about what kind of abstraction (ie phonemes, or something quite different) would allow that quality to be preserved while making the repertoire infinite. I don't know when I'll get to it, but it's an area open for someone to explore.

What you've got working is a totally adequate synthesizer. When/if you cap its functionality and performance, and make a compact table-based implementation, I bet it could be a Propeller mainstay for the next 10 years.


Phil Pilgrim (PhiPi)
11-08-2006, 05:03 PM
Chip,

Thanks! You're definitely right about context, and I think your word-based approach will yield some vital clues. I've still got some troublesome consonants and even some vowels that would benefit from a more contextual approach. Right now, the k sound is algorithmic, determined by the following vowel. That's a step up from context-free, but hasn't solved the problem. Better would be a separate k for each diphone: ka, ke, koo, etc. This could be accomplished easily in a table-driven system, where matches are performed on the longest patterns first, then working down the line to single letters. Whole words could be accommodated this way, too, for any that are truly exceptional. In such a system, for example, I wouldn't have to spell "beer" with three e's, "beeer", i.e. "b ee er", since "eer" would have its own rule set. Also, to save space, phonemic macros would be useful for phoneme groups that get used in more than one sound sequence.

Also, my inflections are frame-based, rather than phoneme-based. Oddly enough, this sounds better than it should, given that only the first frame in a compound phoneme will get inflected. Pitch is a tricky thing. There's an awfully fine line between monotone and sing-song, and I certainly haven't mastered it. At first I thought your 1/4 semitone resolution would be too coarse for speech. But inflections do cover a wider range than that. The key, I think, is the blend, which makes the spoken word sound less like individual musical notes and more like a continuum. I just need to figure a way to make the blend occur over a wider context. This will likely entail pre-buffering groups of frames and modifying them en masse before passing them on to the vocal tract. Notation-wise, this will likely involve brackets, braces, or maybe just spaces to delineate the inflected units.
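
A sketch of that en-masse blending (Python for illustration; the buffered pitch track is a stand-in for the real frame data):

```python
# Sketch: smooth a buffered group of per-frame pitch values with a
# centered moving average, so inflection steps become a continuum
# instead of discrete musical notes. Window size is a guess.

def blend_pitch(pitches, window=3):
    """Smooth a list of per-frame pitch values with a centered average."""
    half = window // 2
    out = []
    for i in range(len(pitches)):
        lo = max(0, i - half)
        hi = min(len(pitches), i + half + 1)
        chunk = pitches[lo:hi]
        out.append(sum(chunk) / len(chunk))
    return out
```

A step from pitch 60 to 66 becomes a glide through intermediate values, which is exactly the "blend over a wider context" effect described above.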

Tempo is another tough nut. Stressed syllables are often drawn out, as well as being inflected. And certain vowels have different durations, depending where in a phonemic group they appear. It would be nice to find some rules for the latter, since the notational burden gets to be cumbersome otherwise. (I'm not sold on the "%nnn" notation, either. It's too wordy.) Also, in songs, where tempo needs to be strict, there has to be a way to make a phonemic group fit a particular time slot. This usually involves stretching or compressing a single vowel to make the group fit. But, again, there's some notational baggage that needs to be optimized.
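
A sketch of the slot-fitting idea (Python for illustration; the frame representation, with one flag marking the stretchable vowel, is invented):

```python
# Sketch: fit a phonemic group into a fixed time slot by stretching or
# squeezing only its designated vowel, leaving consonant durations alone.

def fit_to_slot(frames, slot_ms):
    """frames: list of (duration_ms, is_stretchable_vowel) tuples."""
    fixed = sum(d for d, stretch in frames if not stretch)
    vowel = sum(d for d, stretch in frames if stretch)
    scale = (slot_ms - fixed) / vowel    # assumes slot_ms exceeds fixed time
    return [(d * scale if stretch else d, stretch) for d, stretch in frames]
```

For a strict-tempo song, each note's slot length comes from the tempo, and only the vowel absorbs the difference.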

There's still a lot to do...

-Phil

Cliff L. Biffle
11-09-2006, 12:54 AM
Most of the phonemic TTS systems I've worked with have a "hint" database, which includes fine-tuned word-level pronunciations for hard words (like SCSI, in a technical context). For any word that's not in the database, they apply some basic grammatical rules and cook up a phoneme string. The difference is usually pretty obvious.

For those of you on Macs, the Vicki voice is an example of how good such a system can sound -- but if you throw it a curve ball, like a Spanish text, the quality breaks down. (You should hear the voices from Leopard. http://forums.parallax.com/images/smilies/smile.gif )

Graham Stabler
11-09-2006, 01:39 AM
All of this is so cool and the propeller is both liberating and limiting, and I reckon that will be the mother of invention! Plus the R&D "team" on this forum is pretty tasty really.

I'm struggling to catch up, I really want to help!

Graham

cgracey
11-09-2006, 02:34 AM
Cliff L. Biffle said...

For those of you on Macs, the Vicki voice is an example of how good such a system can sound -- but if you throw it a curve ball, like a Spanish text, the quality breaks down. (You should hear the voices from Leopard. http://forums.parallax.com/images/smilies/smile.gif )

I was intrigued by what you said here about this "Vicki" voice, so I Googled it and found out that it takes ~25MB!!! It had better sound good. A Propeller target would be more like ~3KB.

http://developer.apple.com/releasenotes/Carbon/Speech.html



Ym2413a
11-09-2006, 03:01 AM
This sort of reminds me of the Voder for some odd reason.
The Voder was an old voice synthesizer from the late 1930s that you controlled by hand.

It had buttons and switches to control the synthesis parameters.
http://www.acoustics.hut.fi/publications/files/theses/lemmetty_mst/image50.gif

I bet it was a real pain to learn and use!

www.obsolete.com/120_years/machines/vocoder/ (http://www.obsolete.com/120_years/machines/vocoder/)

Cliff L. Biffle
11-09-2006, 03:19 AM
Chip,

Yes, the Vicki voice sounds good. It was the best realtime TTS I'd heard until Apple demoed their next-gen voices (coming next year) -- which sound better, but of course take even more space.


As for the Voder, Chip's synthesis code is basically the digital equivalent, sans keyswitches. Anyone want to interface some? http://forums.parallax.com/images/smilies/smile.gif

Kevin Wood
11-09-2006, 03:33 AM
So who will be the first person to create a "Funkytown" object?

Ym2413a
11-09-2006, 03:40 AM
Cliff L. Biffle said...
Chip,
As for the Voder, Chip's synthesis code is basically the digital equivalent, sans keyswitches. Anyone want to interface some? http://forums.parallax.com/images/smilies/smile.gif


Oh darn Cliff! That gives me an idea!
I'm a pianist and composer as well. Cliff you just gave me an idea for a new instrument design. (lol)
The Prop-Voder! *laughs*

Either way, you could get some cool sounds out of it!
http://forums.parallax.com/images/smilies/smilewinkgrin.gif

Phil Pilgrim (PhiPi)
11-09-2006, 05:45 AM
Chip,

I've been thinking about how I'd make the synth table-driven. It would be nice to have data structures that look something like this:




DAT

table   BYTE    "eer", 0, "ee", 0, "er", 0, 0
        BYTE    "eel", 0, "ee", 0, "el", 0, 0
        ..
        BYTE    "ee", 0, 0, F, 310 / 19, 2020 / 19, 2960 / 19, 3500 / 19, GA, 30, 0, 10, 20, 10
etc.




The idea is that the start routine would scan the table and create an array of "dictionary" entries, each indexed by one of the one- to four-letter patterns and pointing to the rest of the string. All well and good so far. But then I'd have to create my own lookdown routine, since Spin's lookdown doesn't accept an array address, but only a fixed expression list. Written in Spin, such a routine would be too slow, and I don't want to waste an assembly cog on just a dictionary search function.

Okay, I could do something like the following, but it's rather awkward (and would have to be quite long):




address := lookdown(pattern: d[ 0], d[ 1], d[ 2], d[ 3], d[ 4], d[ 5], ... , d[n])




Spin's built-in lookdown and case constructs are plenty fast for this sort of thing when the parameters are static. But their speed would be hard to duplicate when simulated in Spin from dynamic data. The only other option I can think of would be a hash function. Properly constructed in Spin, that might eliminate a linear search and be fast enough. This may be the route I have to take, unless I've overlooked some Spin feature I'm not yet familiar with...
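
A sketch of the hash route (Python for illustration; the hash and the patterns are invented, chosen for simplicity rather than for distribution quality):

```python
# Sketch: bucket the one- to four-letter patterns by a cheap hash so
# each lookup scans only a short chain instead of the whole table --
# the sort of loop that stays tolerable even when coded in Spin.

def tiny_hash(s, buckets=16):
    h = 0
    for ch in s:
        h = (h * 31 + ord(ch)) % buckets
    return h

def build_table(patterns, buckets=16):
    table = [[] for _ in range(buckets)]
    for p in patterns:
        table[tiny_hash(p, buckets)].append(p)
    return table

def lookup(table, s, buckets=16):
    """True if the pattern is in the dictionary."""
    return s in table[tiny_hash(s, buckets)]
```

With a hundred-odd patterns in 16 buckets, the average chain is only a handful of entries long.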

-Phil

william chan
11-09-2006, 10:19 AM
Help !

I can't compile or download the talk_demo.spin!
I downloaded the latest update. (Why is there no .zip extension?)

I tried to compile, but it gives an error "Expected a DAT symbol" at this line:

call #sine

in the VocalTract.spin file.

Why is the 1st version (zip file) so much larger than the 2nd or 3rd versions?

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
www.fd.com.my
www.mercedes.com.my

Phil Pilgrim (PhiPi)
11-09-2006, 10:35 AM
William,

There is a zip extension in the file name. I just rechecked. The reason the first zip is so much larger is that it includes the latest (v1.02) Propeller IDE. The others don't. If you haven't installed this version, that may be the reason you're having trouble getting the package to compile. So download the first zip and extract the .exe into your Propeller program directory. Then try compiling the newest talk_demo again.

-Phil

cgracey
11-09-2006, 02:47 PM
Phil Pilgrim (PhiPi) said...
Chip,

I've been thinking about how I'd make the synth table-driven.
Phil,

As you probably know, the Spin interpreter has two built-in functions which could aid in this: STRSIZE(@zstring) and STRCOMP(@zstring1, @zstring2). Spending memory making a hash table may not be necessary. I mean, you've got under 100 strings you want to compare to, right? If you made a DAT list of all the targets in z-string form, you could use STRSIZE and STRCOMP to navigate through it and do the comparisons pretty rapidly. You could get the partial benefit of a hash table just by having several smart starting positions within the target list. The targets could each contain a z-string and a pointer to their respective data sets using 'WORD @dataset'. I think this all applies to what you were asking about. BTW, STRSIZE and STRCOMP are very fast. They'd take as much time to execute as it would to handle a single-character comparison discretely in Spin.
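To illustrate the layout Chip describes, here is a rough Python model (not Spin) of a flat table of zero-terminated target strings, each followed by a word-sized data pointer, walked with the equivalents of STRSIZE and STRCOMP; all of the table contents are invented:

```python
# Python model of the suggested Spin DAT layout: z-string, 2-byte pointer,
# z-string, 2-byte pointer, ..., terminated by an empty string. The entries
# and pointer values are placeholders.
import struct

def build_table(entries):
    blob = bytearray()
    for text, ptr in entries:
        blob += text.encode("ascii") + b"\x00" + struct.pack("<H", ptr)
    return bytes(blob) + b"\x00"          # empty string terminates the table

def find(blob, target):
    i = 0
    while blob[i] != 0:                   # a zero-length entry ends the list
        end = blob.index(0, i)            # STRSIZE: locate the terminator
        if blob[i:end] == target.encode("ascii"):   # STRCOMP
            return struct.unpack_from("<H", blob, end + 1)[0]
        i = end + 3                       # skip terminator + word pointer
    return None
```

The "smart starting positions" Chip mentions would just be precomputed values of `i` for each first letter, so the walk starts partway through the list instead of at the top.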

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔


Chip Gracey
Parallax, Inc.

Phil Pilgrim (PhiPi)
11-09-2006, 03:37 PM
Chip,

Actually, I don't think there will be any strings to compare that're longer than four characters, so I could just do a compare on appropriately-constructed longs. I was more concerned about the loop overhead, which virtually disappears when using lookdown or case on static data or program structures. But now that I think about it, there's no reason I couldn't just sort the dictionary and do a binary search. That'd be plenty fast!

The drill would be to keep a four-byte (long variable) shift register of the incoming string data. If the first byte is a lower-case letter, then the entire four-byte value is looked up in the table. If there's no exact match, the table position prior to where the match would've been will hold the correct pointer. (There will always be 26 single-letter entries, some with null pointers, so "fob." isn't going to match "eer."; it will match "f..." first.) Then as many characters as there were non-zero bytes in the found long can be shifted in for the next match, and so forth.
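In outline (a Python sketch only; Spin would compare raw longs, and the toy dictionary below is invented), the sorted-table match would go something like this:

```python
# Sketch of the sorted-table scheme: pack up to four letters into one
# big-endian integer, binary-search a sorted key list, and on a miss fall
# back toward the entry before the insertion point, which holds the longest
# matching prefix. A real table would contain all 26 single-letter entries
# so a fallback always exists; this toy set is just enough to demonstrate.
import bisect

def pack(pattern):
    padded = (pattern + "\x00" * 4)[:4]   # pad short patterns with zero bytes
    return int.from_bytes(padded.encode("latin-1"), "big")

patterns = sorted(["e", "ee", "eer", "f"], key=pack)
keys = [pack(p) for p in patterns]

def match(text):
    """Return the longest dictionary pattern that prefixes `text`."""
    i = bisect.bisect_right(keys, pack(text[:4])) - 1   # entry at or before
    while not text.startswith(patterns[i]):             # back up past misses
        i -= 1
    return patterns[i]
```

The caller would then shift in as many new characters as the matched pattern consumed and repeat for the next chunk of input.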

Dang! I'd put this stuff out of my mind for the night and closed up my shop. Now I'm inspired to go back out there and work into the wee hours — again! http://forums.parallax.com/images/smilies/smile.gif

-Phil

Loopy Byteloose
11-09-2006, 05:39 PM
I am quite amazed by all of this. The whole study of phonology is based on the physical limits of the oral cavity, nasal cavity, and larynx to produce sound. Since I teach ESL, I have to deal with it on a daily basis.

I would have simply sampled speech from an appropriate source and used that rather than get involved in the physics. After all, there is even a tonal register for gender. And another tonal register for culture.

By the way, British phonology tends to have more phonemes than American phonology. More phonemes mean more permutations, and more software overhead.

Of course, if you want the complete phoneme set, the IPA, or International Phonetic Alphabet, will provide you with an inventory. But it really is quite unwieldy.

In sum, whatever voice you give a robotic device is going to give it a personality or the lack of one. Might I suggest that you use sampling to get the personality factor right? It really isn't just an expedient.

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
"If you want more fiber, eat the package. Not enough? Eat the manual."


Tropical regards, G. Herzog [黃鶴] in Taiwan

william chan
11-10-2006, 11:11 AM
Phil,

Sorry, it must be my new Firefox 2.0 browser that removed the .zip extension.
Anyway, I got the v1.02 Propeller Tool and it just works! Congratulations!

Thanks.

P.S. Why isn't the v1.02 IDE posted on the Parallax website yet?

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
www.fd.com.my
www.mercedes.com.my

Paul Baker
11-10-2006, 02:08 PM
Because it was an ad-hoc revision Chip and Jeff put together to incorporate the ability to use a RES as a return label point as required by Chip's vocal tract object. Since this is the only revision over v1.0 and it hasn't undergone the normal verification and testing process, it is technically a beta version and not an official Parallax release.

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Paul Baker (mailto:pbaker@parallax.com)
Propeller Applications Engineer
Parallax, Inc. (http://www.parallax.com)

kelvin james
11-10-2006, 04:25 PM
Phil

Here is a very simple thing called "can you do it?". I added some timing from Chip's version; it seems to transition better. Just a couple of extra set-timing parameters, a little smoother than the set tempo for sustain. Not perfect yet, but slowly making some headway. I have been trying to add some personality, with a lot of trial and error here. I think you have all the basics there; it is just a matter of experimenting. The two modified files are attached.

kelvin

Phil Pilgrim (PhiPi)
11-11-2006, 11:36 AM
Kelvin,

Hmmm, that sounds pretty good! I hadn't thought about adding a sustenudo, but it's a really good idea! Look for one in the next release.

Thanks!
Phil

yerpa58
11-12-2006, 05:55 AM
Any chance of an MP3 or WAV for us uninitiated? Kudos on all the nice work so far. I'm up to my ears in projects right now, but I look forward to using the Propeller chip.

Phil Pilgrim (PhiPi)
11-12-2006, 09:19 AM
I've tried making a recording using my PC from some of the speech output, so I could convert it to MP3 and post it. But for some reason, the PC just isn't getting an adequate signal level. It may be a cable issue. At any rate, it would be worthwhile for people to hear the speech without any visual clues to help decipher it.

I was brought back down to ground a couple days ago when a friend stopped by my shop. The conversation went something like this:

Me: "Hey, you wanna hear this thing talk?"
Friend: "Wow! It can talk? Sure I'd like to hear it!"
Me: <starts demo>
Friend: <grimaces, looks quizzical, grimaces some more>
Demo: <finishes with a flourish>
Friend: "Now there's a voice only a mother could understand!"
Me: <PSSSssss! (balloon deflating)>

But, what the hey. Self-delusion is part of what keeps us going, right? And when reality comes knocking, it only makes us try harder! http://forums.parallax.com/images/smilies/smile.gif

-Phil

kelvin james
11-12-2006, 11:58 AM
Phil

Thanks, but it's not my idea; this is from Chip's programming, and I was just adding it. Don't worry about other people's opinions; this is something new and will take some time to please everyone. Your efforts on whatever you do are well appreciated.

kelvin

kelvin james
11-13-2006, 01:09 PM
Here is an MP3 of canyoudoit. The audio out from the demo board is not really designed for the line-in on a sound card, so it is a little on the noisy side.

kelvin

Joel Rosenzweig
11-15-2006, 12:18 PM
Phil, Chip,

I've been following the thread for a while, and tonight I finally had a few moments to give the demo a try. You both did an outstanding job with your respective pieces. The speech demo sounded even better than what I was anticipating. I agree that it's hard to understand some of the words, but it appears this can be resolved by tweaking the phonemes more than anything else. I experimented with the demo by adding a few words of my own, and the speech sounded quite good. I recall having to make the same types of tweaks to my SP0256-based speech synthesizer projects.

I certainly look forward to your next set of enhancements. This is really neat. I was going to use a nice backlit LCD for the user interface on my Propeller project. Maybe I'll have to reconsider and add the speech synthesizer instead. http://forums.parallax.com/images/smilies/smile.gif

Thanks to both of you for your work on this. I appreciate it. This is really good stuff!

Joel-

cgracey
11-15-2006, 02:30 PM
So, Phil,

What are you working on today? Are you off in large-model land?


Phil Pilgrim (PhiPi) said...
I've tried making a recording using my PC from some of the speech output, so I could convert it to MP3 and post it. But for some reason, the PC just isn't getting an adequate signal level. It may be a cable issue. At any rate, it would be worthwhile for people to hear the speech without any visual clues to help decipher it.


▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔


Chip Gracey
Parallax, Inc.

Phil Pilgrim (PhiPi)
11-15-2006, 03:05 PM
Hey Chip,

Right now, I'm laying out a new daughterboard (i.e. Real Work). 'Can't help staying checked into the forum, though.

But don't worry: I haven't given up on the speech stuff! I'd like to find a way to record a word and convert it programmatically to the proper VocalTract settings. The sound quality would be much more natural that way. I don't remember: can your frequency analyzer program output a file with frequency domain data?

-Phil

rokicki
11-15-2006, 04:33 PM
Actually, I'm very intrigued by that sort of thing myself. I'm quite deaf, so I'm hoping the Propeller can take some of the load. http://forums.parallax.com/images/smilies/smile.gif Phoneme recognition would be amazing!

william chan
03-28-2007, 10:43 AM
Phil,

Any fourth installment coming soon?

How to make the speaker sound like a female?

Can the Propeller be run at 5 MHz x 8 = 40 MHz or lower to save current consumption and still get the same voice quality?

Can I embed the Propeller with a CR2032 coin battery into a voice greeting card?
How long will the battery last?


Thanks.

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
www.fd.com.my
www.mercedes.com.my

RytonMike
04-01-2007, 01:49 AM
Phil Pilgrim (PhiPi) said...
But don't worry: I haven't given up on the speech stuff! I'd like to find a way to record a word and convert it programmatically to the proper VocalTract settings. The sound quality would be much more natural that way. I don't remember: can your frequency analyzer program output a file with frequency domain data?

-Phil


Phil,

Back in the seventies I worked on the first generation of speech technology. We were developing direct waveform synthesis, adding damped sinusoids to make formants and filtered noise for sibilants. We only had TTL SSI and a bit of MSI technology in those days.
One of the things I developed, and was quite proud of at the time, was a set of tools for extracting synthesiser parameters from real speech. Here are the key things I learned which are not obvious; it was not published at the time, and it may be useful to you.

1. Pitch synchronous analysis is best, using data from the glottal closed part of the waveform. Detect glottal closure by selecting the locally highest peak in the waveform and tracking back to the previous zero crossing. Repeat moving the analysis window forward by the parameter update rate through the recorded waveform.

2. Use a strange window function: half a Hanning window, starting at 1 and tapering to 0 over a period corresponding to about 66%-75% of the pitch period. This does not need to be adaptive. In theory, if we assume that the glottal excitation is impulsive, there is little energy before the start, so a step at this side of the window has little effect on the output, and we get maximum information from the glottal-closed free-response part of the signal.

3. I think we sampled at 10 kHz; the highest third formant we used was never more than 3 kHz. We used an FFT analysis window of 512 points, filling unused samples with zeros.

4. If you plot the resulting spectrum, it is, of course, very smooth. The real frequency resolution is quite coarse because of the relatively short sampling window; however,

5. we are only going to select and use the spectral peak values, so we get a good interpolation effect. Select the peaks in the pitch-synchronous spectra, coding amplitude as a number and frequency as position in the array, setting them one after the other.

6. If you look at the resulting patterns you will see a beautifully clear spectrogram with very well defined formants in the voiced segments. We used an operator assisted tool to then extract parameters.

7. Pitch extraction is a separate process. We used a relatively simple time domain peak picking process.

8. There are some very interesting possibilities for using this approach to bootstrap speech by rule algorithms and in speech recognition using elastic matching.
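Steps 2 through 5 above can be sketched in a few lines of Python; this is only a toy illustration with assumed parameters (10 kHz sampling, a 512-point zero-padded transform, a naive DFT in place of a real FFT, and a made-up test signal), not Mike's original tooling:

```python
# Sketch of the half-Hanning window, zero-padded spectrum, and peak picking
# described in steps 2-5. Parameters and signals are assumed for illustration.
import cmath, math

def half_hanning(n):
    """Window starting at 1 and tapering to 0 over n samples (step 2)."""
    return [0.5 * (1 + math.cos(math.pi * i / (n - 1))) for i in range(n)]

def spectrum(frame, nfft=512):
    """Magnitude spectrum of a zero-padded frame (step 3), via a naive DFT."""
    x = frame + [0.0] * (nfft - len(frame))
    return [abs(sum(x[t] * cmath.exp(-2j * math.pi * k * t / nfft)
                    for t in range(nfft))) for k in range(nfft // 2)]

def peaks(mag):
    """Local maxima of the smooth spectrum as (bin, amplitude) pairs (step 5)."""
    return [(k, mag[k]) for k in range(1, len(mag) - 1)
            if mag[k] > mag[k - 1] and mag[k] > mag[k + 1]]
```

Because the frame is short and zero-padded, the spectrum is smooth, and the peak bins interpolate the formant frequencies well, which is the effect Mike describes in steps 4 and 5.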

Hope this gives some ideas, I really like your synthesiser which brought back many memories.

By the way, our stuff required a computer the size of a medium sized van, things move on!

Mike

cgracey
04-01-2007, 05:32 AM
RytonMike said...

Back in the seventies I worked on the first generation of speech technology. We were developing direct waveform synthesis, adding damped sinusoids to make formants and filtered noise for sibilants. We only had TTL SSI and a bit of MSI technology in those days.
One of the things I developed, and was quite proud of at the time, was a set of tools for extracting synthesiser parameters from real speech. Here are the key things I learned which are not obvious; it was not published at the time, and it may be useful to you.


Mike,

These are real gems!!! This is the kind of know-how that is all but lost these days, but remains as applicable and relevant as ever. These kinds of algorithms just amaze me, and the discipline of their development fascinates me more than any other aspect of computing.

I've thought for a long time that if we could get reliably distilled formant analyses, speech recognition could become pretty straightforward. This is great! I don't think it would take more than a few kilobytes of code to get the recognizer core running. I probably won't have time to do this for a while, but I'm really happy knowing HOW to think about it now.

Thank you very much for sharing these ideas. They are invaluable.

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔


Chip Gracey
Parallax, Inc.

Phil Pilgrim (PhiPi)
04-01-2007, 08:04 AM
Mike,

The knowledge density within those few paragraphs is mind-blowing. That the results could be applied to coax voice from SSI and MSI circuitry is even more astounding. I'm anxious to apply your principles — perhaps in a couple months when I get a breath from work.

Thanks for taking the time to impart your wisdom!

-Phil

kelvin james
08-06-2007, 01:48 PM
This was just for fun, a lot of trial and error; I figured most had forgotten about this. It would be nice to add Chip's spatial program in with this for some effects, as the sound is pretty flat.

Rayman
05-22-2009, 12:37 AM
I just found this thread, after Phil's latest post mentioned it... It's actually very interesting. Phil definitely has time on his hands!

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
My Prop Info&Apps: ·http://www.rayslogic.com/propeller/propeller.htm