Thread: Phonemic Speech Synthesis (3rd installment, 7 Nov 2006)

  1. #1

    Attached is a very crude attempt at speech synthesis using Chip's recently posted VocalTract object. The "talk" object is quite rough around the edges, and to say that some of my phonemes are barely intelligible gives them way too much credit. But maybe with input from the community and some fine tuning (okay, coarse tuning), the quality can be improved over time. Chip's marvelously compact object has everything that's needed for intelligible speech. But like any tool of its utility and complexity, it needs to be mastered; and that takes time.

    I've relied heavily on this paper for the formant values used in the program. The internet has many other valuable resources for synthesized speech, some dating back decades. This can be a problem, too, since much of the seminal work on the subject was done before the internet existed, and the resulting papers have likely never been converted to machine-readable form and posted.

    Much of what is done here via individual argument lists might more efficiently be accomplished by table-driven methods. But in its current form, it's somewhat more readable, which is important for development and debugging. Plus it makes playing with the settings a little easier.

    The attached archive includes the latest (v1.02) IDE/compiler exe. If you haven't already installed that version, copy the exe from the ZIP over the existing copy in your Propeller IDE directory.

    Anyway, for what it's worth, enjoy!

    -Phil

    Update (2006.11.04): Attached is a somewhat improved version. Some of the consonants are better, there are more demos, and I've added whispering and a spell procedure. 'Still some extraneous popping and hissing to cure.

    Update (2006.11.07): Added inflections, rolled r's, better musical notation, on-the-fly tempo adjustments, multiple speakers.

    Post Edited (Phil Pilgrim (PhiPi)) : 11/8/2006 6:26:51 AM GMT
    Last edited by ForumTools; 09-30-2010 at 05:40 AM. Reason: Forum Migration

  2. #2

    Phil,

    Wow! I didn't imagine anyone would accomplish so much, so soon. You've made a phoneme layer for the VocalTract in about 300 lines of code.

    Interested Propeller programmers could glean a lot from looking at your talk.spin object, as it shows a flow for feeding the VocalTract. As you said, a table-driven implementation would be more compact, but what you've made is very readable and understandable -- and it's a functional general-purpose speech synthesizer!

    You could make different formant sets for "man", "woman", and "child" tracts, as well as corresponding pitch ranges... Well, I'm sure you've thought of all that. What you have actually works quite well, already. As you said, the enunciation is crude compared to what's possible, but it is synthesizing speech, all right. It sounds like the Votrax SC-01A chip.

    Good job!

    BTW, if you go to the stereo spatializer thread, the VocalTract in that demo is v1.1. It behaves more sensibly during frame gaps. In fact, I'll just attach it here...

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔


    Chip Gracey
    Parallax, Inc.

  3. #3

    WOW... just wow, this is simply awesome. First Chip made the monks/seven demo, and now this... I think I'm going to be losing some sleep tonight!

    Also, I seem to be getting some popping and such when I play back the sentences. It looks like the "~" is causing most of it. Any idea why?

  4. #4

    Cobalt said...
    WOW... just wow, this is simply awesome. First Chip made the monks/seven demo, and now this... I think I'm going to be losing some sleep tonight!

    Also, I seem to be getting some popping and such when I play back the sentences. It looks like the "~" is causing most of it. Any idea why?
    Have YOU ever tried to say "~" ??

    I agree - this stuff is quite impressive.

  5. #5

    Some of my transitions between frames are pretty rough. The popping that you hear may be coming from changes that are too abrupt, or it might be from bad gain settings leading to overflow. I'm just not sure which. I added the "~" to give emphasis to terminal consonants -- sort of a Lawrence Welk effect, though not nearly so protracted. The reason is that some of them seemed to get swallowed without the added vocalization.

    Another thing I need to add is a dynamic tempo modifier. The optimum duration of a vowel is context-dependent. Sometimes you want to extend them for emphasis, particularly long vowels; other times shortening them almost to the point of inaudibility works better.

    In addition, I haven't really figured out the stress thing. The glottal pitch modifier works fine for songs; but when it's applied to stressed syllables, it sounds totally fake.

    Hopefully, people will feel free to experiment with the settings and offer improvements as they discover them. In particular, some of the consonants are virtually unintelligible and need a lot of help.

    -Phil

  6. #6

    Phil Pilgrim (PhiPi) said...


    I haven't really figured out the stress thing. The glottal pitch modifier works fine for songs; but when it's applied to stressed syllables, it sounds totally fake.
    Phil,

    Maybe stress could be better conveyed through a combination of timing, glottal amplitude, perhaps some subtle formant tweaks, as well as glottal pitch.

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔


    Chip Gracey
    Parallax, Inc.

  7. #7

    Try this:

    Code:
        
           t.say(string("+7he-loa ever+i-won. -doa-nt ++yoo, th+ink -tha-t dher proa+pel-er is +soa, --cooool"))
           t.say(string("+9Videe-oakild-her ++raid--ee--oa star. +4Videe-oakild-her ++raid--ee--oa star"))
           t.say(string("+5in ++mae mae-n-d, an-d ++in -mae, car"))
           t.say(string("+5wee ++cahnt ree-wae-nd wee-v ++gon -too, far"))
           t.say(string("+8oa, +we, -oa. yoo +wer-dher +ferst -won"))
           t.say(string("+8oa, +we, -oa. yoo +wer-dher +last -won"))
           t.say(string("+9Videe-oakild-her ++raid--ee--oa star. +4Videe-oakild-her ++raid--ee--oa star"))



    This is fun, but I must go to bed. Thanks for the fun, Phil.

    Graham

  8. #8

    Chip,

    Thanks for your comments and suggestions, but mainly for VocalTract.spin! This is too much fun!

    I think you're right about the stress thing. It's got to be a combination of all those factors. I need to nail down some consonants first, though. k and g are particularly nettlesome. And I may need different phonemes for leading and trailing zs: 'can't seem to get one to work in both places.

    -Phil

  9. #9

    Fantastic! I was wanting to build a talking altimeter for my HPR rockets.

    Behold, the world's most interesting speech processing microchip.

    Really, I was just going to connect my digital altimeter to my bluetooth PDA, but now I can do what I wanted in the first place. Thank you!

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    Dave Evartt

    People don't hate computers, they just hate lousy programmers.

    http://wehali.com

  10. #10

    Phil Pilgrim (PhiPi) said...
    Chip,

    Thanks for your comments and suggestions, but mainly for VocalTract.spin! This is too much fun!

    I think you're right about the stress thing. It's got to be a combination of all those factors. I need to nail down some consonants first, though. k and g are particularly nettlesome. And I may need different phonemes for leading and trailing zs: 'can't seem to get one to work in both places.

    -Phil
    If I recall, a leading "k" is made by a short white noise burst between the following vowel's F2 and F3 positions, then F2 and F3 rapidly head to their vowel positions from the "k" center, with an aspiration turning to voiced excitation. For a trailing "k" the leading vowel's F2 and F3 converge onto their average as they fade, then there's a silent pause, followed by the white noise burst at where F2 and F3 converged, then an unvoiced (aspirated) "uhhh" sound starting at the same point. It's necessary to use the surrounding vowels like this. The "k" in "hike" is audibly higher than in "hook".
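
    The leading "k" recipe above can be sketched as a short list of frames. This is only a rough illustration in Python rather than Spin: the frame fields (f2, f3, noise, voiced), the step count, and the choice of the exact midpoint for the burst are all assumptions made for the sketch, not VocalTract parameters or anything taken from talk.spin.

```python
def leading_k_frames(vowel_f2, vowel_f3, steps=4):
    """Return frames for a leading "k" gliding into a vowel.

    Hypothetical frame format: dicts with F2/F3 in Hz plus noise and
    voicing levels in 0.0..1.0. This is NOT the VocalTract parameter set.
    """
    # Assumption: place the burst at the midpoint between F2 and F3.
    center = (vowel_f2 + vowel_f3) / 2.0
    # Frame 0: short white-noise burst at the "k" center, no voicing.
    frames = [{"f2": center, "f3": center, "noise": 1.0, "voiced": 0.0}]
    # Later frames: F2 and F3 move from the center to the vowel's
    # positions while the aspiration fades and voicing takes over.
    for i in range(1, steps + 1):
        t = i / steps
        frames.append({
            "f2": center + (vowel_f2 - center) * t,
            "f3": center + (vowel_f3 - center) * t,
            "noise": 1.0 - t,
            "voiced": t,
        })
    return frames
```

    Because the burst center is derived from the following vowel, a vowel with higher F2 and F3 automatically gets a higher "k", which is the "hike" versus "hook" difference described above.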

    ...Oh, and "g" is a voiced "k", just as "d" is a voiced "t", and "b" is a voiced "p", and "v" is a voiced "f", and "zh" is a voiced "sh", and "z" is a voiced "s". All these symmetries, and you realize the human speech apparatus has a rather limited set of basic sounds it can make.
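
    One way to exploit these symmetries, as a first approximation, is to store a single recipe per pair and carry a voicing flag alongside it. The sketch below is Python, and the table and function names are made up for illustration; only the six pairs themselves come from the post.

```python
# Voiced consonant -> unvoiced counterpart, as listed in the post above.
VOICED_TO_UNVOICED = {
    "g": "k", "d": "t", "b": "p",
    "v": "f", "zh": "sh", "z": "s",
}

def consonant_recipe(phoneme):
    """Return (base_phoneme, voiced) so both members of a pair share a recipe."""
    if phoneme in VOICED_TO_UNVOICED:
        return VOICED_TO_UNVOICED[phoneme], True
    return phoneme, False
```

    For example, consonant_recipe("g") yields ("k", True), so the synthesizer would run the "k" recipe with voiced excitation switched on.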

    What I think we need is a motion model of the mouth, where we have only two or three bytes worth of data which define its position. The raw 13 parameters (perhaps requisite at the bottom level) can define nearly innumerable configurations, 99.99% of which are physiological impossibilities. The real range of mouth movement and behavior is relatively constrained. How to qualify this is tough, though. We need to reduce the complexity somehow and make it very intuitive to configure using mouth-movement type data. For example, formants are significant mainly in relation to each other. Rather than specify exact resonator frequencies, we need a model whereby they find their places based on overall tract formation within the confines of a base tract model's geometries (the male/female/kid/baby differentiator). From those constraints the fricative, plosive, affricate, etc. qualities could be inferred. This could be done right on top of VocalTract. The real magic would come from some lava-lamp like morphing of the formants in response to mouth movement. This would mean moving formants in a way that the speech apparatus would have to, which is often not a straight line between point A and point B.

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔


    Chip Gracey
    Parallax, Inc.

    Post Edited (Chip Gracey (Parallax)) : 11/2/2006 7:39:35 AM GMT

  11. #11

    Thanks, Chip. That may explain the difficulty I've been having. I was trying to keep things context-independent as much as possible. But it looks like I'll need a bit of look-ahead when processing things like k.

    Chip Gracey said...
    ... with an aspiration turning to voiced excitation.
    Whooh! You should write a textbook. That'd be enough to keep any undergrad riveted!

    Chip Gracey said...
    What I think we need is a motion model of the mouth, where we have only two or three bytes worth of data which define its position. The raw 13 parameters (perhaps requisite at the bottom level) can define nearly innumerable configurations, 99.99% of which are physiological impossibilities. ...
    So basically, all the trajectories and interpolation would be done in a smaller-dimensional space, from which the raw parameters could be derived at any given point in time. That makes sense. It would certainly keep memory requirements to a minimum.
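
    A minimal sketch of that idea, in Python rather than Spin: a two-number mouth position (height and backness, each 0.0 to 1.0) is interpolated over time, and the formants are derived from it at each frame. The corner-vowel frequencies are rough illustrative values, the function names are hypothetical, and only F1 and F2 are modeled; none of this comes from talk.spin or VocalTract.

```python
# Illustrative corner vowels: (F1, F2) in Hz. Rough values for the sketch only.
CORNERS = {
    ("high", "front"): (270.0, 2290.0),   # roughly "ee"
    ("high", "back"):  (300.0, 870.0),    # roughly "oo"
    ("low", "front"):  (660.0, 1720.0),   # roughly "a" as in "cat"
    ("low", "back"):   (730.0, 1090.0),   # roughly "ah"
}

def lerp(a, b, t):
    """Linear interpolation from a (t=0) to b (t=1)."""
    return a + (b - a) * t

def formants(height, backness):
    """Map a 2-D mouth position to (F1, F2) by bilinear interpolation."""
    out = []
    for k in range(2):  # k=0 -> F1, k=1 -> F2
        high = lerp(CORNERS[("high", "front")][k], CORNERS[("high", "back")][k], backness)
        low = lerp(CORNERS[("low", "front")][k], CORNERS[("low", "back")][k], backness)
        out.append(lerp(low, high, height))
    return tuple(out)

def glide(start, end, steps):
    """Interpolate in the 2-D mouth space, deriving formants per frame."""
    frames = []
    for i in range(steps + 1):
        t = i / steps
        h = lerp(start[0], end[0], t)
        b = lerp(start[1], end[1], t)
        frames.append(formants(h, b))
    return frames
```

    Calling glide((1.0, 0.0), (0.0, 1.0), 4) moves from the "ee" corner to the "ah" corner in five frames. Only two numbers per target need to be stored, and because the mapping is not affine, a straight path in mouth space generally traces a curved path in formant space.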

    Going one step further still, and to yield the most natural-sounding speech (giving encodings like ADPCM a run for the money), there needs to be a way to go the other way, too: from natural speech (recordings) to the parameters that can produce a reasonable facsimile. That's going to be extremely hard. It'd be an interesting task to train a neural net on.

    -Phil

  12. #12

    Phil Pilgrim (PhiPi) said...


    Going one step further still, and to yield the most natural-sounding speech (giving encodings like ADPCM a run for the money), there needs to be a way to go the other way, too: from natural speech (recordings) to the parameters that can produce a reasonable facsimile. That's going to be extremely hard. It'd be an interesting task to train a neural net on.

    -Phil
    In some old speech processing book I have from 1978, they mention a formant-based vocoder system that squished speech down to 600 bits per second, and they said you could recognize a person's voice through it. Imagine that -- 600 bps, without compression. It could probably be compressed to less than half that in real-time, maybe even a tenth with a bit of loss over a longer recording.
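
    To put those rates in storage terms, here is a small arithmetic sketch in Python. The 600 bps figure is the one quoted above; the 8 kHz, 8-bit PCM rate is an assumed comparison point that does not come from the thread, and the helper names are made up.

```python
VOCODER_BPS = 600          # figure quoted from the 1978 book
PCM_BPS = 8000 * 8         # assumed comparison point: 8 kHz, 8-bit PCM

def bytes_per_minute(bits_per_second):
    """Storage needed for one minute of audio at the given bit rate."""
    return bits_per_second * 60 / 8

def reduction_vs_pcm(bits_per_second):
    """How many times smaller the stream is than the assumed PCM rate."""
    return PCM_BPS / bits_per_second
```

    At 600 bps, a minute of speech takes 4500 bytes; at half that rate it takes 2250 bytes, and at a tenth it takes 450 bytes.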

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔


    Chip Gracey
    Parallax, Inc.

  13. #13

    Hi,
    I'm having VocalTract.spin problems. I've re-downloaded Chip's new one from above. Here's the code that I'm getting an error on:

                            mov     t1,vr                   'vibrato rate
                            shr     t1,#10
                            add     vphase,t1
                            mov     t1,vp                   'vibrato pitch
                            mov     t2,vphase
                            call    #sine                   '<-- error here: "expected DAT symbol"

    Thanks, Brian

  14. #14

    I think maybe the 1.0.3 version of the tool should be plugged into the propeller download page. My guess is that truckwiz is using the old tool.

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    Dave Evartt

    People don't hate computers, they just hate lousy programmers.

    http://wehali.com

  15. #15

    Phil, if there are any papers you want, I have access to an academic library and a scanner. I'd be happy to PDF anything you think is really seminal.

    Graham

  16. #16

    Dave,
    That was the problem, thanks (sounds awesome).

    Brian

  17. #17

    Graham,

    Thanks for the offer. Unfortunately, I don't have a particular title in mind. I just remember there being a lot of ferment in the area back in the 70's. I even took a college course that included speech synthesis, but didn't keep any of my notes or instructional materials. Now I wish I had. One of the great things about the internet is that one can live in a backwater town, like I do, and still have access to a world of resources. But if the resources you need are pre-90s, you're often out of luck.

    Chip,

    Do you recall where you read about the parameters for the k sound? That's the kind of info that would come in handy. The source I cited is pretty sketchy on consonants.

    Thanks,
    Phil

  18. #18

    Phil Pilgrim (PhiPi) said...

    Do you recall where you read about the parameters for the k sound? That's the kind of info that would come in handy. The source I cited is pretty sketchy on consonants.
    It can be found here:

    http://web.inter.nl.net/hcc/davies/esp7cpt.html

    Scroll 80% of the way down and you'll see all the consonant recipes. This is the most straightforward description I've found. It took me about 1/2 hour to read this documentation, but afterwards, I felt like I was on very solid ground. Here's the picture of interest:

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔


    Chip Gracey
    Parallax, Inc.
    [Attached image: espct1p.gif]

  19. #19

    For what it is worth to those interested in a good starting point (to understand some of what these guys are talking about) =)

    http://en.wikipedia.org/wiki/Speech_synthesis

  20. #20

    Chip,

    A belated thanks for the reference: it contains some good insights. I'm still trying to get that k sound right, with some success, but I'm still not satisfied. In the process, I managed to kill the glottal amplitude completely once. Oddly enough, I got a whisper, and it was intelligible! And that got me thinking: if a g is just a voiced k, and a z a voiced s, how are we able to make them sound different when whispering? By trying it, I realized that the tongue positions are a little different. In the unvoiced consonants, it's flattened against the teeth or the palate more than with their voiced brethren. This may explain why z has been so danged elusive. Mine sounds like a buzzy s, but there's more to it than that. The investigation continues...

    -Phil

    Post Edited (Phil Pilgrim (PhiPi)) : 11/4/2006 7:16:09 AM GMT
