
Phonemic Speech Synthesis (3rd installment, 7 Nov 2006) - Page 2


Comments

  • Phil Pilgrim (PhiPi) Posts: 23,514
    edited 2006-11-08 09:03
    Chip,

    Thanks! You're definitely right about context, and I think your word-based approach will yield some vital clues. I've still got some troublesome consonants and even some vowels that would benefit from a more contextual approach. Right now, the k sound is algorithmic, determined by the following vowel. That's a step up from context-free, but it hasn't solved the problem. Better would be a separate k for each diphone: ka, ke, koo, etc. This could be accomplished easily in a table-driven system, where matches are performed on the longest patterns first, then working down the line to single letters. Whole words could be accommodated this way, too, for any that are truly exceptional. In such a system, for example, I wouldn't have to spell "beer" with three e's, "beeer", i.e. "b ee er", since "eer" would have its own rule set. Also, to save space, phonemic macros would be useful for phoneme groups that get used in more than one sound sequence.
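    A minimal Python sketch of that longest-match-first lookup (the rule table and phoneme codes are made up just to show the idea; the real system would be table-driven Spin data):

    # Illustrative longest-match-first letter-to-phoneme rules.
    RULES = {
        "eer": ["ee", "er"],   # "beer" -> b + ee + er, no "beeer" spelling needed
        "ee":  ["ee"],
        "er":  ["er"],
        "b":   ["b"],
        "e":   ["e"],
        "r":   ["r"],
    }
    MAX_PATTERN = max(len(p) for p in RULES)

    def to_phonemes(word):
        """Scan the word, always trying the longest table pattern first."""
        out, i = [], 0
        while i < len(word):
            for n in range(min(MAX_PATTERN, len(word) - i), 0, -1):
                chunk = word[i:i + n]
                if chunk in RULES:
                    out.extend(RULES[chunk])
                    i += n
                    break
            else:
                i += 1          # unknown letter: skip it
        return out

    print(to_phonemes("beer"))  # ['b', 'ee', 'er']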

    Also, my inflections are frame-based, rather than phoneme-based. Oddly enough, this sounds better than it should, given that only the first frame in a compound phoneme will get inflected. Pitch is a tricky thing. There's an awfully fine line between monotone and sing-song, and I certainly haven't mastered it. At first I thought your 1/4 semitone resolution would be too coarse for speech. But inflections do cover a wider range than that. The key, I think, is the blend, which makes the spoken word sound less like individual musical notes and more like a continuum. I just need to figure out a way to make the blend occur over a wider context. This will likely entail pre-buffering groups of frames and modifying them en masse before passing them on to the vocal tract. Notation-wise, this will likely involve brackets, braces, or maybe just spaces to delineate the inflected units.
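    Here's a rough sketch of the wider-context blend: buffer a group of frames, then spread one pitch glide smoothly across all of them, quantized to 1/4-semitone steps (Python only; the frames are reduced to made-up (pitch, duration) pairs for illustration):

    def inflect_group(frames, semitone_rise):
        """Spread a linear pitch glide, in 1/4-semitone steps, over a buffered group."""
        n = len(frames)
        out = []
        for i, (pitch, dur) in enumerate(frames):
            bend = semitone_rise * i / max(n - 1, 1)
            bend = round(bend * 4) / 4               # quantize to 1/4 semitone
            out.append((pitch * 2 ** (bend / 12), dur))
        return out

    group = [(110.0, 80), (110.0, 120), (110.0, 100)]   # three frames at 110 Hz
    print(inflect_group(group, semitone_rise=2))        # glides up two semitones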

    Tempo is another tough nut. Stressed syllables are often drawn out, as well as being inflected. And certain vowels have different durations, depending on where in a phonemic group they appear. It would be nice to find some rules for the latter, since the notational burden gets to be cumbersome otherwise. (I'm not sold on the "%nnn" notation, either. It's too wordy.) Also, in songs, where tempo needs to be strict, there has to be a way to make a phonemic group fit a particular time slot. This usually involves stretching or compressing a single vowel to make the group fit. But, again, there's some notational baggage that needs to be optimized.
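    For the song case, the simplest version of that idea looks something like the sketch below: hold every phoneme's duration fixed except one designated vowel, and stretch or compress that vowel so the whole group lands exactly on the time slot (Python, with made-up millisecond values):

    def fit_to_slot(durations_ms, vowel_index, slot_ms, min_ms=20):
        """Stretch or compress one vowel so the phoneme group fills slot_ms."""
        fixed = sum(d for i, d in enumerate(durations_ms) if i != vowel_index)
        out = list(durations_ms)
        out[vowel_index] = max(slot_ms - fixed, min_ms)   # never below a floor
        return out

    # "beer" as b-ee-er, stretched so the syllable lasts exactly 500 ms:
    print(fit_to_slot([40, 120, 90], vowel_index=1, slot_ms=500))  # [40, 370, 90]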

    There's still a lot to do...

    -Phil
  • Cliff L. Biffle Posts: 206
    edited 2006-11-08 16:54
    Most of the phonemic TTS systems I've worked with have a "hint" database, which includes fine-tuned word-level pronunciations for hard words (like SCSI, in a technical context). For any word that's not in the database, they apply some basic grammatical rules and cook up a phoneme string. The difference is usually pretty obvious.
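    In sketch form, the idea is just a word-level dictionary lookup with a rule-based fallback (a Python toy; the entries and the fallback rule are placeholders, not any real TTS front end):

    HINTS = {
        "scsi": "s k uh z ee",   # hard technical words get hand-tuned entries
        "beer": "b ee er",
    }

    def letter_to_sound(word):
        """Stand-in for the rule-based fallback (here: one phoneme per letter)."""
        return " ".join(word)

    def pronounce(word):
        return HINTS.get(word.lower(), letter_to_sound(word.lower()))

    print(pronounce("SCSI"))   # hand-tuned:  s k uh z ee
    print(pronounce("cat"))    # rule-based:  c a t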

    For those of you on Macs, the Vicki voice is an example of how good such a system can sound -- but if you throw it a curve ball, like a Spanish text, the quality breaks down. (You should hear the voices from Leopard. :) )
  • Graham Stabler Posts: 2,510
    edited 2006-11-08 17:39
    All of this is so cool. The Propeller is both liberating and limiting, and I reckon that limitation will be the mother of invention! Plus, the R&D "team" on this forum is pretty tasty, really.

    I'm struggling to catch up, I really want to help!

    Graham
  • cgracey Posts: 14,155
    edited 2006-11-08 18:34
    Cliff L. Biffle said...

    For those of you on Macs, the Vicki voice is an example of how good such a system can sound -- but if you throw it a curve ball, like a Spanish text, the quality breaks down. (You should hear the voices from Leopard. :) )
    I was intrigued by what you said here about this "Vicki" voice, so I Googled it and found out that it takes ~25MB!!! It had better sound good. A Propeller target would be more like ~3KB.

    http://developer.apple.com/releasenotes/Carbon/Speech.html


    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔


    Chip Gracey
    Parallax, Inc.
  • Ym2413a Posts: 630
    edited 2006-11-08 19:01
    This sort of reminds me of the Voder for some odd reason.
    The Voder was an old voice synthesizer from the late 1930s that you controlled by hand.

    It had buttons and switches to control the synthesis parameters.

    I bet it was a real pain to learn and use!

    www.obsolete.com/120_years/machines/vocoder/
  • Cliff L. Biffle Posts: 206
    edited 2006-11-08 19:19
    Chip,

    Yes, the Vicki voice sounds good. It was the best realtime TTS I'd heard until Apple demoed their next-gen voices (coming next year) -- which sound better, but of course take even more space.


    As for the Voder, Chip's synthesis code is basically the digital equivalent, sans keyswitches. Anyone want to interface some? :)
  • Kevin Wood Posts: 1,266
    edited 2006-11-08 19:33
    So who will be the first person to create a "Funkytown" object?
  • Ym2413a Posts: 630
    edited 2006-11-08 19:40
    Cliff L. Biffle said...
    Chip,
    As for the Voder, Chip's synthesis code is basically the digital equivalent, sans keyswitches. Anyone want to interface some? :)

    Oh darn, Cliff! That gives me an idea!
    I'm a pianist and composer as well, and you just gave me an idea for a new instrument design. (lol)
    The Prop-Voder! *laughs*

    Either way, you could get some cool sounds out of it!
    ;)
  • Phil Pilgrim (PhiPi) Posts: 23,514
    edited 2006-11-08 21:45
    Chip,

    I've been thinking about how I'd make the synth table-driven. It would be nice to have data structures that look something like this:

    DAT
    
    table   BYTE "eer", 0, "ee", 0, "er", 0, 0
            BYTE "eel", 0, "ee", 0, "el", 0, 0
            ..
            BYTE "ee", 0, 0, F, 310 / 19, 2020 / 19, 2960 / 19, 3500 / 19, GA, 30, 0, 10, 20, 10
            etc.


    The idea is that the start routine would scan the table and create an array of "dictionary" entries, each indexed by one of the one- to four-letter patterns and pointing to the rest of the string. All well and good so far. But then I'd have to create my own lookdown routine, since Spin's lookdown doesn't accept an array address, but only a fixed expression list. Written in Spin, such a routine would be too slow, and I don't want to waste an assembly cog on just a dictionary search function.

    Okay, I could do something like the following, but it's rather awkward (and would have to be quite long):

    address := lookdown(pattern: d[0], d[1], d[2], d[3], d[4], d[5], ... , d[n])


    Spin's built-in lookdown and case constructs are plenty fast for this sort of thing when the parameters are static. But their speed would be hard to duplicate when simulated in Spin from dynamic data. The only other option I can think of would be a hash function. Properly constructed in Spin, that might eliminate a linear search and be fast enough. This may be the route I have to take, unless I've overlooked some Spin feature I'm not yet familiar with...
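    For illustration, the hash approach might look like this in Python (not Spin; the table size and the pattern list are arbitrary): pack each 1- to 4-letter pattern into a 32-bit value, hash it into a small table, and chain on collisions.

    TABLE_SIZE = 64   # power of two so the modulo is a cheap AND

    def pack(pattern):
        """Pack up to four ASCII characters into one 32-bit value."""
        value = 0
        for ch in pattern[:4]:
            value = (value << 8) | ord(ch)
        return value

    def make_table(patterns):
        table = [[] for _ in range(TABLE_SIZE)]
        for i, pat in enumerate(patterns):
            key = pack(pat)
            table[key & (TABLE_SIZE - 1)].append((key, i))   # i stands in for a data pointer
        return table

    def lookup(table, pattern):
        key = pack(pattern)
        for k, index in table[key & (TABLE_SIZE - 1)]:
            if k == key:
                return index
        return -1

    tbl = make_table(["eer", "eel", "ee", "er", "k", "b"])
    print(lookup(tbl, "eer"), lookup(tbl, "xyz"))   # 0 -1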

    -Phil
  • william chan Posts: 1,326
    edited 2006-11-09 02:19
    Help!

    I can't compile or download the talk_demo.spin!
    I downloaded the latest update. (Why is there no .zip extension?)

    I tried to compile, but it gives the error "Expected a DAT symbol" at this line

    call #sine

    in the VocalTract.spin file.

    Why is the 1st version (zip file) much larger than the 2nd or 3rd versions?

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    www.fd.com.my
    www.mercedes.com.my
  • Phil Pilgrim (PhiPi) Posts: 23,514
    edited 2006-11-09 02:35
    William,

    There is a zip extension in the file name; I just rechecked. The reason the first zip is so much larger is that it includes the latest (v1.02) Propeller IDE. The others don't. If you haven't installed this version, that may be the reason you're having trouble getting the package to compile. So download the first zip, and extract the .exe into your Propeller program directory. Then try compiling the newest talk_demo again.

    -Phil
  • cgracey Posts: 14,155
    edited 2006-11-09 06:47
    Phil Pilgrim (PhiPi) said...
    Chip,

    I've been thinking about how I'd make the synth table-driven.
    Phil,

    As you probably know, the Spin interpreter has two built-in functions which could aid in this: STRSIZE(@zstring) and STRCOMP(@zstring1, @zstring2). Spending memory making a hash table may not be necessary. I mean, you've got under 100 strings you want to compare to, right? If you made a DAT list of all the targets in z-string form, you could use STRSIZE and STRCOMP to navigate through it and do the comparisons pretty rapidly. You could get the partial benefit of a hash table just by having several smart starting positions within the target list. The targets could each contain a z-string and a pointer to their respective data sets using 'WORD @dataset'. I think this all applies to what you were asking about. BTW, STRSIZE and STRCOMP are very fast. They'd take as much time to execute as it would to handle a single-character comparison discretely in Spin.
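    A Python sketch of that walk (the real thing would be Spin using STRSIZE and STRCOMP over a DAT table; the layout here, each entry a zero-terminated string followed by a 2-byte pointer, and the example offsets are assumptions for illustration):

    def build(entries):                      # entries: list of (pattern, data_offset)
        buf = bytearray()
        for pat, offset in entries:
            buf += pat.encode("ascii") + b"\x00" + offset.to_bytes(2, "little")
        return bytes(buf) + b"\x00"          # an extra zero byte ends the table

    def find(table, target):
        """Walk the table entry by entry, STRCOMP-style; return the data offset or -1."""
        i = 0
        while table[i] != 0:
            end = table.index(0, i)          # STRSIZE-style: find the terminating zero
            if table[i:end] == target.encode("ascii"):
                return int.from_bytes(table[end + 1:end + 3], "little")
            i = end + 3                      # skip the zero and the 2-byte pointer
        return -1

    tbl = build([("eer", 0x0100), ("ee", 0x0120), ("er", 0x0140)])
    print(hex(find(tbl, "ee")), find(tbl, "zz"))   # 0x120 -1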

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔


    Chip Gracey
    Parallax, Inc.
  • Phil Pilgrim (PhiPi) Posts: 23,514
    edited 2006-11-09 07:37
    Chip,

    Actually, I don't think there will be any strings to compare that're longer than four characters, so I could just do a compare on appropriately-constructed longs. I was more concerned about the loop overhead, which virtually disappears when using lookdown or case on static data or program structures. But now that I think about it, there's no reason I couldn't just sort the dictionary and do a binary search. That'd be plenty fast!

    The drill would be to keep a four-byte (long variable) shift register of the incoming string data. If the first byte is a lower-case letter, then the entire four-byte value is looked up in the table. If there's no exact match, the table position prior to where the match would've been will hold the correct pointer. (There will always be 26 single-letter entries, some with null pointers, so "fob." isn't going to match "eer."; it will match "f..." first.) Then as many characters as there were non-zero bytes in the found long can be shifted in for the next match, and so forth.
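    A Python sketch of that scheme (the packing, the example table, and the input windows are illustrative; the real version would be Spin operating on longs):

    import bisect, string

    def pack(pat):
        """Left-justify up to four letters in a 32-bit value, zero-padded."""
        return int.from_bytes(pat.encode("ascii")[:4].ljust(4, b"\x00"), "big")

    def make_table(patterns):
        return sorted((pack(p), p) for p in patterns)

    def match(table, window):
        """Return (pattern, chars_consumed) for a 4-character input window."""
        keys = [k for k, _ in table]
        i = bisect.bisect_right(keys, pack(window)) - 1   # entry at or just below
        while not window.startswith(table[i][1]):         # a window like "eeqs" lands on "eel",
            i -= 1                                        # so fall back to "ee", then "e"
        pat = table[i][1]
        return pat, len(pat)                              # shift in this many new chars

    patterns = list(string.ascii_lowercase) + ["eer", "eel", "ee", "er"]
    tbl = make_table(patterns)
    print(match(tbl, "eers"))   # ('eer', 3)
    print(match(tbl, "fob."))   # ('f', 1)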

    Dang! I'd put this stuff out of my mind for the night and closed up my shop. Now I'm inspired to go back out there and work into the wee hours — again! :)

    -Phil
  • LoopyByteloose Posts: 12,537
    edited 2006-11-09 09:39
    I am quite amazed by all of this. The whole study of phonology is based on the physical limits of the oral cavity, nasal cavity, and larynx to produce sound. Since I teach ESL, I have to deal with it on a daily basis.

    I would have simply sampled speech from an appropriate source and used that rather than get involved in the physics. After all, there is even a tonal register for gender. And another tonal register for culture.

    By the way, British phonology tends to have more phonemes than American phonology. With more phonemes come more permutations, and more software overhead.

    Of course, if you want the complete phoneme set, the IPA, or International Phonetic Alphabet, will provide you with an inventory. But it really is quite unwieldy.

    In sum, whatever voice you give a robotic device is going to give it a personality or the lack of one. Might I suggest that you use sampling to get the personality factor right? It really isn't just an expedient.

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    "If you want more fiber, eat the package.· Not enough?· Eat the manual."········
    ···················· Tropical regards,····· G. Herzog [noparse][[/noparse]·黃鶴 ]·in Taiwan
  • william chan Posts: 1,326
    edited 2006-11-10 03:11
    Phil,

    Sorry, it must be my new Firefox 2.0 browser that removed the .zip extension.
    Anyway, I got the v1.02 Propeller Tool and it just works! Congratulations!

    Thanks.

    P.S. Why isn't the 1.02 IDE posted on the Parallax website yet?

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    www.fd.com.my
    www.mercedes.com.my
  • Paul Baker Posts: 6,351
    edited 2006-11-10 06:08
    Because it was an ad-hoc revision Chip and Jeff put together to incorporate the ability to use a RES as a return label point, as required by Chip's vocal tract object. Since this is the only revision over v1.0 and it hasn't undergone the normal verification and testing process, it is technically a beta version and not an official Parallax release.

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    Paul Baker
    Propeller Applications Engineer

    Parallax, Inc.
  • kelvin james Posts: 531
    edited 2006-11-10 08:25
    Phil

    Here is a very simple thing called "can you do it?". I added some timing from Chip's version; it seems to transition better. Just a couple of extra set-timing parameters, a little smoother than the set tempo for sustain. Not perfect yet, but slowly making some headway. I have been trying to add some personality, a lot of trial and error here. I think you have all the basics there; it is just a matter of experimenting. The two modified files are attached.

    kelvin
  • Phil Pilgrim (PhiPi) Posts: 23,514
    edited 2006-11-11 03:36
    Kelvin,

    Hmmm, that sounds pretty good! I hadn't thought about adding a sustenudo, but it's a really good idea! Look for one in the next release.

    Thanks!
    Phil
  • yerpa58 Posts: 25
    edited 2006-11-11 21:55
    Any chance of an MP3 or WAV for us uninitiated? Kudos on all the nice work so far. I'm up to my ears in projects right now, but I look forward to using the Propeller chip.
  • Phil Pilgrim (PhiPi) Posts: 23,514
    edited 2006-11-12 01:19
    I've tried making a recording using my PC from some of the speech output, so I could convert it to MP3 and post it. But for some reason, the PC just isn't getting an adequate signal level. It may be a cable issue. At any rate, it would be worthwhile for people to hear the speech without any visual clues to help decipher it.

    I was brought back down to ground a couple days ago when a friend stopped by my shop. The conversation went something like this:

    Me: "Hey, you wanna hear this thing talk?"
    Friend: "Wow! It can talk? Sure I'd like to hear it!"
    Me: <starts demo>
    Friend: <grimaces, looks quizzical, grimaces some more>
    Demo: <finishes with a flourish>
    Friend: "Now there's a voice only a mother could understand!"
    Me: <PSSSssss! (balloon deflating)>

    But, what the hey. Self-delusion is part of what keeps us going, right? And when reality comes knocking, it only makes us try harder! :)

    -Phil
  • kelvin james Posts: 531
    edited 2006-11-12 03:58
    Phil

    Thanks, but it's not my idea; this is from Chip's programming, I was just adding it. Not to worry about other people's opinions; this is something new, and it will take some time to please everyone. Your efforts on whatever you do are well appreciated.

    kelvin
  • kelvin james Posts: 531
    edited 2006-11-13 05:09
    Here is an MP3 of canyoudoit. The audio out from the demo board is not really designed for a line-in to the sound card, so it is a little on the noisy side.

    kelvin
  • Joel Rosenzweig Posts: 52
    edited 2006-11-15 04:18
    Phil, Chip,

    I've been following the thread for a while and tonight, I finally had a few moments to give the demo a try. You both did an outstanding job with your respective pieces. The speech demo sounded even better than what I was anticipating. I agree that it's hard to understand some of the words, but it appears that this can be resolved by tweaking the phonemes you're using more than anything else. I experimented with the demo by adding a few words of my own, and the speech sounded quite good. I recall having to make the same types of tweaks to my SP0-256 speech synthesizer based projects.

    I certainly look forward to your next set of enhancements. This is really neat. I was going to use a nice backlit LCD for the user interface on my Propeller project. Maybe I'll have to reconsider and add the speech synthesizer instead. :)

    Thanks to both of you for your work on this. I appreciate it. This is really good stuff!

    Joel-
  • cgracey Posts: 14,155
    edited 2006-11-15 06:30
    So, Phil,

    What are you working on today? Are you off in large-model land?
    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔


    Chip Gracey
    Parallax, Inc.
  • Phil Pilgrim (PhiPi) Posts: 23,514
    edited 2006-11-15 07:05
    Hey Chip,

    Right now, I'm laying out a new daughterboard (i.e. Real Work). 'Can't help staying checked into the forum, though.

    But don't worry: I haven't given up on the speech stuff! I'd like to find a way to record a word and convert it programmatically to the proper VocalTract settings. The sound quality would be much more natural that way. I don't remember: can your frequency analyzer program output a file with frequency domain data?

    -Phil
  • rokicki Posts: 1,000
    edited 2006-11-15 08:33
    Actually, I'm very intrigued by that sort of thing myself. I'm quite deaf, so I'm hoping the Propeller can take some of the load. :) Phoneme recognition would be amazing!
  • william chan Posts: 1,326
    edited 2007-03-28 02:43
    Phil,

    Any fourth installment coming soon?

    How do you make the speaker sound like a female?

    Can the Propeller be run at 5MHz x 8 = 40MHz or lower to save current consumption and still get the same voice quality?

    Can I embed the Propeller with a CR2032 coin battery into a voice greeting card?
    How long will the battery last?


    Thanks.

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    www.fd.com.my
    www.mercedes.com.my
  • RytonMike Posts: 12
    edited 2007-03-31 17:49

    Phil,

    Back in the seventies I worked on the first generation of speech technology. We were developing direct waveform synthesis, adding damped sinusoids to make formants and filtered noise for sibilants. We only had TTL SSI and a bit of MSI technology in those days.
    One of the things I developed, which I was quite proud of at the time, was a set of tools for extracting synthesiser parameters from real speech. I'll pass on the key things I learned which are not obvious; it was not published at the time and may be useful to you.

    1. Pitch synchronous analysis is best, using data from the glottal closed part of the waveform. Detect glottal closure by selecting the locally highest peak in the waveform and tracking back to the previous zero crossing. Repeat moving the analysis window forward by the parameter update rate through the recorded waveform.

    2. Use a strange window function which is half a Hanning, starting at 1 and tapering to 0 over a period corresponding to about 66% to 75% of the pitch period. This does not need to be adaptive. In theory, if we assume that the glottal excitation is impulsive, there is little energy before the start, so a step at this side of the window has little effect on the output, and we get maximum information from the glottal-closed, free-response part of the signal.

    3. I think we sampled at 10 kHz; the highest third formant we used was never more than 3 kHz. We used an FFT analysis window of 512 samples, filling unused samples with zeros.

    4. If you plot the resulting spectrum it is, of course, very smooth. The real frequency resolution is quite coarse because of the relatively short analysis window; however,

    5. we are only going to select and use the spectral peak values, so we are getting a good interpolation effect. Select the peaks in the pitch-synchronous spectra, coding amplitude as a number and frequency as position in the array, and set them out one after the other.

    6. If you look at the resulting patterns you will see a beautifully clear spectrogram with very well defined formants in the voiced segments. We used an operator-assisted tool to then extract the parameters (a rough sketch of steps 1-6 follows this list).

    7. Pitch extraction is a separate process. We used a relatively simple time domain peak picking process.

    8. There are some very interesting possibilities for using this approach to bootstrap speech by rule algorithms and in speech recognition using elastic matching.
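    A rough numpy sketch of steps 1 to 6 (Python; the window fraction, the frame, the pitch period, and the synthetic test signal are all assumptions, and a real tool would track glottal closures through the whole recording):

    import numpy as np

    FS = 10_000          # 10 kHz sampling, as in step 3
    NFFT = 512

    def glottal_closure(frame):
        """Step 1: take the biggest peak, then back up to the previous zero crossing."""
        peak = int(np.argmax(np.abs(frame)))
        s = np.signbit(frame[:peak])
        zc = np.nonzero(s[1:] != s[:-1])[0]
        return int(zc[-1]) + 1 if len(zc) else 0

    def half_hanning(n):
        """Step 2: half a Hanning window, starting near 1 and tapering to 0."""
        return np.hanning(2 * n)[n:]

    def formant_peaks(frame, pitch_period, n_peaks=3):
        start = glottal_closure(frame)
        n = int(0.7 * pitch_period)                  # ~66%-75% of the pitch period
        seg = frame[start:start + n]
        seg = seg * half_hanning(len(seg))           # step 2 window
        spectrum = np.abs(np.fft.rfft(seg, NFFT))    # zero-padded FFT (step 3)
        limit = int(3000 * NFFT / FS)                # only look below ~3 kHz
        peaks = [k for k in range(1, limit)          # step 5: local maxima
                 if spectrum[k] > spectrum[k - 1] and spectrum[k] > spectrum[k + 1]]
        peaks = sorted(peaks, key=lambda k: spectrum[k], reverse=True)[:n_peaks]
        return sorted(k * FS / NFFT for k in peaks)  # formant estimates in Hz

    # Synthetic "vowel" frame: two damped resonances near 700 Hz and 1200 Hz
    t = np.arange(512) / FS
    frame = np.exp(-300 * t) * (np.sin(2 * np.pi * 700 * t) + 0.5 * np.sin(2 * np.pi * 1200 * t))
    print(formant_peaks(frame, pitch_period=100))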

    Hope this gives you some ideas. I really like your synthesiser; it brought back many memories.

    By the way, our stuff required a computer the size of a medium-sized van. Things move on!

    Mike
  • cgracey Posts: 14,155
    edited 2007-03-31 21:32

    Mike,

    These are real gems!!! This is the kind of know-how that is all but lost these days, but remains as applicable and relevant as ever. These kinds of algorithms just amaze me, and the discipline of their development fascinates me more than any other aspect of computing.

    I've thought for a long time that if we could get reliably distilled formant analyses, speech recognition could become pretty straightforward. This is great! I don't think it would take more than a few kilobytes of code to get the recognizer core running. I probably won't have time to do this for a while, but I'm really happy knowing HOW to think about it now.

    Thank you very much for sharing these ideas. They are invaluable.

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔


    Chip Gracey
    Parallax, Inc.
  • Phil Pilgrim (PhiPi) Posts: 23,514
    edited 2007-04-01 00:04
    Mike,

    The knowledge density within those few paragraphs is mind-blowing. That the results could be applied to coax voice from SSI and MSI circuitry is even more astounding. I'm anxious to apply your principles — perhaps in a couple of months when I get a breath from work.

    Thanks for taking the time to impart your wisdom!

    -Phil