New speech synthesizer for the Propeller
Jack Buffington
I have mentioned this in a couple of other threads, but I am creating this one as the official thread for it. I have been working on a new speech synthesizer for the Propeller for the past couple of months. A few days ago I decided I was just a few days away from releasing it, so I let my wife have a listen. Unfortunately, she could only pick up a word here and there. I can understand it pretty clearly, but that is probably because I have been working with it almost daily and have become used to its 'accent'. I have recorded a few samples of it speaking to get some second opinions from the users of this forum, choosing material that people are generally familiar with. If you find that you can't understand something, it would be great to get feedback on why you had a hard time (if you know).
I'll be releasing it with an MIT license and will put it in the OBEX should it become understandable enough.
(Hopefully) for your listening pleasure, here are three selections:
Numbers zero through twenty: (480K)
http://www.buffingtonfx.com/temp/numbers_July_12_2011.wav
Gettysburg Address: (4.5M)
http://www.buffingtonfx.com/temp/Gettysburgh_address_July_12_2011.wav
Psalm 23: (1.5M)
http://www.buffingtonfx.com/temp/psalm 23 July 12 2011.wav
Comments
Graham
The numbers were quite easy, maybe because I made a countdown for my son's astronaut costume, so I'm used to the accent too.
The other two: no chance.
I remember someone mentioning in another speech-synthesis thread that the waveforms can be adjusted in software by comparing the synthesized sound with the waveform of a real voice, then changing some parameters and comparing again until a certain level of matching between them is reached.
Have you ever thought about optimising it that way?
Keep the questions coming.
Best regards,
Stefan
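Stefan's idea is essentially analysis-by-synthesis: render a clip, measure how far its spectrum is from a real recording, nudge the parameters, and repeat. A very rough Python sketch of such a loop, purely illustrative and unrelated to the actual Spin code; the synth callable, the parameter names, and all numbers are invented placeholders:

import numpy as np
from scipy.signal import spectrogram

def spectral_distance(a, b, fs=11025):
    # Compare two mono clips by the mean squared difference of their log spectrograms.
    _, _, Sa = spectrogram(a, fs, nperseg=256)
    _, _, Sb = spectrogram(b, fs, nperseg=256)
    n = min(Sa.shape[1], Sb.shape[1])           # trim to the shorter clip
    return np.mean((np.log1p(Sa[:, :n]) - np.log1p(Sb[:, :n])) ** 2)

def tune(synth, params, reference, passes=20, step=0.05):
    # Crude coordinate search: nudge each parameter up and down and keep any
    # change that brings the synthesized clip closer to the real recording.
    best = spectral_distance(synth(params), reference)
    for _ in range(passes):
        for key in list(params):
            for delta in (+step, -step):
                trial = {**params, key: params[key] + delta}
                d = spectral_distance(synth(trial), reference)
                if d < best:
                    best, params = d, trial
    return params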
-Phil
John Abshier
Overall quite amazing, but there is work that needs to be done. Some quick critiques:
I think one of the simpler overall changes may be just to add more of a pause or spacing between words.
It seems to add a "w" in five: "fwive" is what it sounds like.
"K"s sound like hard "T"s.
The word "you" seems to missing the "y" (or there's just not enough of it)
It sounds like an audible CAPTCHA system I use on one of my websites, though that one was purposely designed to be difficult so computers couldn't understand it. This, presumably, has the opposite goal.
You've chosen a huge task and have already got a working framework. The numbers were just fine.
There's an awful lot of material and theory out there on this subject, and some known 'short cuts' to maximise intelligibility.
Would you care to share some information with us about the approach you are taking?
Best regards,
T o n y
All that said, however, this is an impressive achievement. It just needs some more work.
-Steve
Pretty much the approach I took was to create beginning, middle, and ending cases for each phoneme. I compiled a list of a few words that had the phonemes in those positions and then recorded them. In an audio editor I checked the timing for each phoneme. For most of them I have programmed a transitional period followed by a period of the phoneme by itself. In end cases, I ramp them down. Some phonemes, like 'r', never really fully present themselves and mostly blend with the vowel sound next to them. Others are sort of a hybrid. For example, a hard 'I' is made of 'ah' and 'ee', but if I use 'ahee' in my text it sounds strange, so I created an 'i' sound that is faster.

As far as frication goes, Chip's documentation is misleading about frication and the formants. He gives the proper amount that the frequency changes for a given value, but then implies that there is a specific range for them. Most likely what he has listed is what he determined to be the common range for the formants and what he thought was the proper range for the frication. In reality, the formants can go up to 5 kHz and frication can go up to 10 kHz.
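To make the table idea concrete, the per-phoneme data described above might be organized roughly like this (a Python sketch for illustration only, not the actual Spin tables; every duration and formant value is invented):

# One entry per phoneme and position (beginning / middle / end).
# Durations in milliseconds, formant targets in Hz -- all numbers invented.
PHONEMES = {
    ("ah", "middle"): {"transition_ms": 40, "steady_ms": 90,
                       "formants": (700, 1220, 2600)},
    ("ee", "middle"): {"transition_ms": 50, "steady_ms": 80,
                       "formants": (270, 2300, 3000)},
    ("i",  "middle"): {"transition_ms": 30, "steady_ms": 60,   # the faster 'ah'->'ee' hybrid
                       "formants": (600, 1800, 2700)},
    ("t",  "end"):    {"transition_ms": 20, "steady_ms": 30,
                       "formants": (400, 1800, 2600), "ramp_down": True},
}

def frames_for(phoneme, position, frame_ms=10):
    # Expand one table entry into per-frame formant targets: a transition
    # period first, then the phoneme held by itself; end cases fade out.
    entry = PHONEMES[(phoneme, position)]
    n_frames = (entry["transition_ms"] + entry["steady_ms"]) // frame_ms
    frames = [entry["formants"]] * n_frames
    if entry.get("ramp_down"):
        frames += [entry["formants"]] * 2     # stand-in for a volume ramp to silence
    return frames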
Once I figured out the real ranges, some of the fricative phonemes sounded a lot better, but the frication as he programmed it is still a pretty narrow band of frequencies. When you look at the spectrum of a fricative sound, it covers a pretty broad range. I partially worked around that by cranking up the volume of the aspiration parameter and setting the formants within the range of the fricative. This, of course, has some drawbacks because the vocal tract tries to interpolate from the regular formants to the higher fricative formants. I tried to keep the transitions as fast as possible, but in some situations the vocal tract can go haywire for about 15 to 20 milliseconds with a high-pitched resonance if you try to transition too quickly.
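The ringing problem described above amounts to needing a minimum glide time between formant targets. A small illustration of that constraint (Python, with invented numbers; not the real interpolation code):

MIN_TRANSITION_MS = 20   # below roughly this the tract can ring at a high pitch (invented figure)
FRAME_MS = 10

def glide(start, target, requested_ms):
    # Linearly interpolate formant targets, but never faster than the minimum
    # transition time, so jumps toward fricative-range formants don't ring.
    ms = max(requested_ms, MIN_TRANSITION_MS)
    steps = max(1, ms // FRAME_MS)
    return [tuple(s + (t - s) * k / steps for s, t in zip(start, target))
            for k in range(1, steps + 1)]

# Example: jumping from vowel formants up toward a fricative's range.
print(glide((500, 1500, 2500), (1800, 4000, 5000), requested_ms=5))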
One other issue with the vocal tract that cropped up is that you can't set the formants too close to each other. If they got too close, I tended to get numerical overflow. This has forced some of the phonemes to have odd transitions because I may have had to stick formant 2 up where formant 3 is supposed to be so that I didn't get overflow errors.
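The overflow constraint boils down to keeping a minimum gap between neighbouring formants. One way to picture that guard rail (Python; the 200 Hz figure is made up, and the real limit depends on the tract arithmetic):

MIN_GAP_HZ = 200   # invented threshold

def enforce_spacing(f1, f2, f3, min_gap=MIN_GAP_HZ):
    # Push formants apart from the bottom up so no two targets sit close
    # enough together to overflow the vocal-tract math.
    f2 = max(f2, f1 + min_gap)
    f3 = max(f3, f2 + min_gap)
    return f1, f2, f3

print(enforce_spacing(700, 750, 800))   # -> (700, 900, 1100)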
I may later choose to write my own vocal tract using a completely different strategy than Chip did but as of now, I suspect that the problem with understandability isn't with the vocal tract but with what I am doing in my own program.
Bobb,
Thanks for the suggestions. I have increased the length of the pauses between words. I used to have it a lot slower but sped it up to the spacing I was using in my own speech; a longer pause will probably help, though. I tracked down the issue with 'five': it was actually saying 'frive'. The 'f' in 'five' wasn't setting formants, since 'f' is one of the cases where pure frication seems to work OK, so it was interpolating the 'r' from 'four' into the 'i' in 'five'. I have made all of the fricative sounds indicate to the glottal sounds that they should set formants before becoming guttural. I haven't made all of the glottal sounds recognize the indicator yet, but at least 'five' sounds good now.
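The 'five' fix described above is essentially a per-phoneme flag that the next sound checks. Sketched in Python with hypothetical names (the real mechanism lives in the Spin code):

# A fricative that relies on pure frication (like 'f') leaves the formants
# wherever the previous phoneme put them, so the following glottal sound
# needs to know to set its own targets instead of interpolating.
PHONEME_FLAGS = {
    "f": {"pure_frication": True},
    "s": {"pure_frication": True},
    "r": {"pure_frication": False},
}

def needs_formant_reset(prev_phoneme):
    # True when the previous phoneme never set formants of its own.
    return PHONEME_FLAGS.get(prev_phoneme, {}).get("pure_frication", False)

print(needs_formant_reset("f"))   # True: set formants before voicing starts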
StefanL38,
I haven't really looked at the waveforms of the sounds I have been creating; I have been relying pretty heavily on spectral analysis instead. Of course, I did use the waveforms to help adjust the relative volumes of the different phonemes. At first I had some of them too loud and others too quiet. I think they are about right now, though I am wondering if I should change things so that glottal volume is handled by the syllable instead of by the phoneme. In the end, the waveform would be pretty similar, so I don't know if that would be it or not.
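For the spectral-analysis side, one crude way to compare a recorded phoneme against a synthesized one is to average their spectrograms and look at where the energy peaks sit (a Python/SciPy sketch; the file names are placeholders, and peak-picking like this is only a rough stand-in for reading formants off a spectrogram):

import numpy as np
from scipy.io import wavfile
from scipy.signal import spectrogram

def strongest_bins(path, n_peaks=3):
    # Average the spectrogram of a clip and report the strongest frequency
    # bins -- a very rough proxy for where the formant energy is sitting.
    fs, data = wavfile.read(path)
    if data.ndim > 1:
        data = data.mean(axis=1)          # mix stereo down to mono
    freqs, _, S = spectrogram(data.astype(float), fs, nperseg=512)
    avg = S.mean(axis=1)
    return sorted(freqs[np.argsort(avg)[-n_peaks:]])

print(strongest_bins("recorded_ah.wav"))   # compare against...
print(strongest_bins("synth_ah.wav"))      # ...the synthesized version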
Phil,
Your synthesizer adjusts pitch. Did you find that doing so increased how easy it was for you to personally understand it? Pitch is one of the things that I have thought about changing to increase how understandable things are.
Leon and others,
Even negative comments are great. Thanks! Whenever I post the next revision, if you guys can understand things then I'll know I'm going in the right direction.
I don't know that varying pitch helped much with comprehension. But if my synthesis were better, I think it would have made it more pleasant to listen to.
My most frustrating (and still-unsolved) problem was the leading "K" sound, which others have commented on here. Hopefully, you'll be able to master it where I failed. :-)
-Phil
I could understand it! Difficult though.
A quick breakdown of what I heard on the intro, "four score and seven years ago":
It sounds like this: fourscoreandsevenyearsago. The "next" word signals are pretty much not there. Right now, it appears as though you have used a short pause. Barring some improvement in vocalization to signal that to the listener, increase that pause considerably, so that the words are distinct. IMHO, that one change would improve it for a large fraction of listeners, at the expense of natural sound. Understandable comes before natural or pleasant, in my humble opinion.
The word "four" came out as "fooor" The "f" is actually pretty good, IMHO. Rather than just use one sound for the "ou", break into distinct sounds. "Fower" would work better, as it would help to add some texture to this word. I used to use this trick with the older speech synths. Another example of this would be "Teresa", input and vocalized as, "Tereeza", which helped to enunciate the word. More texture helps with understanding. Just so you know, good vocalists maintain a set of "professional" pronunciations that are different from ordinary speech, for the purpose of emphasis and continuity when singing. This software has a similar problem, with a similar solution, if that makes any sense. By varying things this way, you can get more done with fewer sounds. (vocal music vs ordinary talking is a similar problem)
I liked the word "score", but for a greater pause. Would be nice to have some pitch variation, but a better choice on delays would make this one pop more. You might try, "Scower" too. To get what I am trying to communicate here, go listen to a man say, "door" or "listen", and then have a woman say it. The man will keep the number of distinct sounds down, using volume and emphasis to convey the word. The woman will vary pitch, but she will also add texture to the word. "Door" becomes two syllables, "Do-wer", "listen" becomes, "list-ten". I would explore that some, if it were me, perhaps at a higher overall base pitch, so that the added syllables are more acceptable, and if you do add them, be sure and increase the pause between words. That's another female trait well worth paying some good attention to. They speak more slowly, in that their breaks between words are often longer, even though they may actually vocalize the word more quickly. Gender in the vocalization here isn't important, just the elements you can steal to make it work better. Knowing those elements are there gives you options, that's all I am trying to convey here.
"and" is funky. Right now, I hear "alnd", where there is a bizzare transition between the "aa" sound, and "nd. The "nd" actually works well! Maybe extend the "aa", and do some work to get rid of that transition some" "aand", one syllable, little texture.
"Seven" more or less works. Nicely done. It's a bit heavy on "shaaven" but not bad at all! Like it.
"Years" has the same transitional problem "and" does.
All in all, I think this is a very solid effort. Do not take my post the wrong way. I have a good ear for vocalizations and thought I would share detailed impressions, which you asked for. (Way too many years doing vocal music and theater.)
A friend and I were discussing my speech synthesizer last night. We hit upon the idea of him spelling out exactly what he hears. His English-to-English translation will help me a lot in improving how understandable the synthesizer is. From our conversation last night, it seems that he isn't hearing the pauses between words either. It also sounds like he isn't hearing the 'd' sound at all. For him, "The Lord is my shepherd" was heard as "De lor esheuli setter". It will be interesting to see how things improve once I get hold of his 'translation'.
I wasn't sure if I understood it because I have those texts memorized, or if I understood it because I heard it.
You might try some unfamiliar passages of text next time to get true feedback.
My hat's off to you for picking this amazing project!
OBC
Great idea. Very interested.
Where do you get your absolute phoneme references?
As far as what it runs on, I am currently developing on the Professional Development Board. I'm just using the audio out that is built into the board.
I see what you're doing now. So is your goal to make good sounding code for all the words?
Here is the state of things now though:
west silver borrowed cheat fashion live potter epileptic breathing
worst signal, auro, chi, fashion, milk, bottle, perpetrate, breezing
was several follow she have little problem verti verti freezy
best sailru borrow s#@% pesan little bathroom perpiloted breathing
The first line is what I passed to my speech program, except that it was phonetically spelled. The following three lines are what people heard. I asked people to report how long it took them to 'translate', and they were coming up with between five and eight minutes. The original clip, which had a few sentences before these isolated words, was only 14 seconds long, so presumably they were trying pretty hard to understand what was being said.
Despite the abysmal level of understanding, it is possible to figure out which sounds or combinations of sounds people are getting. Overall, they are now mostly hearing the spaces between words, so it seems I have mostly fixed that. I'll notch it up a bit more in the next pass. Hopefully after a few rounds of doing this with the Mechanical Turk I'll have some passable speech. It will at least speed up my development effort and will ensure that the people doing my testing don't get used to its accent.
Now I am trying to implement a bandpass filter that I will pass noise through to get what I want. The going is slow, though. I recently switched from freelancing to a 9-5 job, which is putting a crimp in my fun projects like this one. I am still working on it, though.
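For reference, the "pass noise through a bandpass filter" idea looks roughly like this in a high-level language (a Python/SciPy sketch, purely illustrative and unrelated to the Propeller implementation; the sample rate and cutoffs are invented):

import numpy as np
from scipy.signal import butter, lfilter

FS = 22050   # sample rate for the sketch

def fricative_noise(low_hz, high_hz, duration_s=0.15, order=4):
    # White noise pushed through a Butterworth bandpass, giving the broad
    # band of frequencies a real fricative shows on a spectrogram.
    noise = np.random.uniform(-1.0, 1.0, int(FS * duration_s))
    b, a = butter(order, [low_hz / (FS / 2), high_hz / (FS / 2)], btype="bandpass")
    return lfilter(b, a, noise)

# Example: an 's'-like hiss concentrated well above the normal formant range.
s_hiss = fricative_noise(4000, 9000)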
If you need a bandpass filter, perhaps the info in this thread can be of some help.
-Phil
Addit: It might be due to that saying, "English is a language of 200 rules (and 2000 exceptions)."