New speech synthesizer for the Propeller
Jack Buffington
I have mentioned this in a couple of other threads, but I am creating this one as the official thread for it. I have been working on a new speech synthesizer for the Propeller for the past couple of months. A few days ago I decided I was just a few days away from releasing it, so I let my wife have a listen. Unfortunately, she could only pick up a word here and there. I can understand it pretty clearly, but that is probably because I have been working with it almost daily and have become used to its 'accent'. I have recorded a few samples of it speaking to get some second opinions from the users of this forum, choosing material that people are generally familiar with. If you find that you can't understand something, it would be great to get feedback on why you had a hard time (if you know).
I'll be releasing it with an MIT license and will put it in the OBEX should it become understandable enough.
(Hopefully) for your listening pleasure, here are three selections:
Numbers zero through twenty: (480K)
http://www.buffingtonfx.com/temp/numbers_July_12_2011.wav
Gettysburg Address: (4.5M)
http://www.buffingtonfx.com/temp/Gettysburgh_address_July_12_2011.wav
Psalm 23: (1.5M)
http://www.buffingtonfx.com/temp/psalm 23 July 12 2011.wav
Comments
Graham
The numbers were quite easy, maybe because I made a countdown for my son's astronaut costume, so I'm used to the accent too.
The other two: no chance.
I remember someone mentioning in another speech-synthesis thread that the waveforms can be adjusted in software by comparing the synthesized sound with the waveform of a real voice, then changing some parameters and comparing again until a certain level of matching between them is reached.
Have you ever thought about optimising it that way?
Keep the questions coming.
Best regards,
Stefan
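Stefan's idea is essentially analysis-by-synthesis: render a clip, measure how far its spectrum is from a real recording, nudge the parameters, and repeat. A very rough Python sketch of such a loop, purely illustrative and unrelated to the actual Spin code; the synth callable, the parameter names, and all numbers are invented placeholders:

import numpy as np
from scipy.signal import spectrogram

def spectral_distance(a, b, fs=11025):
    # Compare two mono clips by the mean squared difference of their log spectrograms.
    _, _, Sa = spectrogram(a, fs, nperseg=256)
    _, _, Sb = spectrogram(b, fs, nperseg=256)
    n = min(Sa.shape[1], Sb.shape[1])           # trim to the shorter clip
    return np.mean((np.log1p(Sa[:, :n]) - np.log1p(Sb[:, :n])) ** 2)

def tune(synth, params, reference, passes=20, step=0.05):
    # Crude coordinate search: nudge each parameter up and down and keep any
    # change that brings the synthesized clip closer to the real recording.
    best = spectral_distance(synth(params), reference)
    for _ in range(passes):
        for key in list(params):
            for delta in (+step, -step):
                trial = {**params, key: params[key] + delta}
                d = spectral_distance(synth(trial), reference)
                if d < best:
                    best, params = d, trial
    return params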
-Phil
John Abshier
Overall quite amazing, but there is work that needs to be done. Some quick critiques:
I think one of the simpler overall changes may be just to add more of a pause or spacing between words.
It seems to add a "w" in five: "fwive" is what it sounds like.
"K"s sound like hard "T"s.
The word "you" seems to missing the "y" (or there's just not enough of it)
It sounds like an audible CAPTCHA system I use on one of my websites, though that one was purposely designed to be difficult so computers couldn't understand it. This, presumably, has the opposite goal.
You've chosen a huge task and have already got a working framework. The numbers were just fine.
There's an awful lot of material and theory out there on this subject, and some known 'short cuts' to maximise intelligibility.
Would you care to share some information with us about the approach you are taking?
Best regards,
T o n y
All that said, however, this is an impressive achievement. It just needs some more work.
-Steve
Pretty much the approach I took was to create beginning, middle, and ending cases for each phoneme. I compiled a list of a few words that had the phonemes in those positions and then recorded them. In an audio editor I checked the timing for each phoneme. For most of them I have programmed a transitional period followed by a period of the phoneme by itself. In end cases, I ramp them down. Some phonemes, like 'r', never really fully present themselves and mostly blend with the vowel sound next to them. Others are sort of a hybrid. For example, a hard 'I' is made of 'ah' and 'ee', but if I use 'ahee' in my text it sounds strange, so I created an 'i' sound that is faster.

As far as frication goes, Chip's documentation is misleading about frication and the formants. He gives the proper amount that the frequency changes for a given value, but then implies that there is a specific range for them. Most likely what he has listed is what he determined to be the common range for the formants and what he thought was the proper range for the frication. In reality, the formants can go up to 5 kHz and frication can go up to 10 kHz.
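To make the table idea concrete, the per-phoneme data described above might be organized roughly like this (a Python sketch for illustration only, not the actual Spin tables; every duration and formant value is invented):

# One entry per phoneme and position (beginning / middle / end).
# Durations in milliseconds, formant targets in Hz -- all numbers invented.
PHONEMES = {
    ("ah", "middle"): {"transition_ms": 40, "steady_ms": 90,
                       "formants": (700, 1220, 2600)},
    ("ee", "middle"): {"transition_ms": 50, "steady_ms": 80,
                       "formants": (270, 2300, 3000)},
    ("i",  "middle"): {"transition_ms": 30, "steady_ms": 60,   # the faster 'ah'->'ee' hybrid
                       "formants": (600, 1800, 2700)},
    ("t",  "end"):    {"transition_ms": 20, "steady_ms": 30,
                       "formants": (400, 1800, 2600), "ramp_down": True},
}

def frames_for(phoneme, position, frame_ms=10):
    # Expand one table entry into per-frame formant targets: a transition
    # period first, then the phoneme held by itself; end cases fade out.
    entry = PHONEMES[(phoneme, position)]
    n_frames = (entry["transition_ms"] + entry["steady_ms"]) // frame_ms
    frames = [entry["formants"]] * n_frames
    if entry.get("ramp_down"):
        frames += [entry["formants"]] * 2     # stand-in for a volume ramp to silence
    return frames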
Once I figured out the real ranges, some of the fricative phonemes sounded a lot better, but the frication as he programmed it is still a pretty narrow band of frequencies. When you look at the spectrum of a fricative sound, it covers a pretty broad range. I partially worked around that by cranking up the volume of the aspiration parameter and setting the formants within the range of the fricative. This, of course, has some drawbacks because the vocal tract tries to interpolate from the regular formants to the higher fricative formants. I tried to keep the transitions as fast as possible, but in some situations the vocal tract can go haywire for about 15 to 20 milliseconds with a high-pitched resonance if you try to transition too quickly.
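The ringing problem described above amounts to needing a minimum glide time between formant targets. A small illustration of that constraint (Python, with invented numbers; not the real interpolation code):

MIN_TRANSITION_MS = 20   # below roughly this the tract can ring at a high pitch (invented figure)
FRAME_MS = 10

def glide(start, target, requested_ms):
    # Linearly interpolate formant targets, but never faster than the minimum
    # transition time, so jumps toward fricative-range formants don't ring.
    ms = max(requested_ms, MIN_TRANSITION_MS)
    steps = max(1, ms // FRAME_MS)
    return [tuple(s + (t - s) * k / steps for s, t in zip(start, target))
            for k in range(1, steps + 1)]

# Example: jumping from vowel formants up toward a fricative's range.
print(glide((500, 1500, 2500), (1800, 4000, 5000), requested_ms=5))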
One other issue with the vocal tract that cropped up is that you can't set the formants too close to each other. If they got too close, I tended to get numerical overflow. This has forced some of the phonemes to have odd transitions because I may have had to stick formant 2 up where formant 3 is supposed to be so that I didn't get overflow errors.
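The overflow constraint boils down to keeping a minimum gap between neighbouring formants. One way to picture that guard rail (Python; the 200 Hz figure is made up, and the real limit depends on the tract arithmetic):

MIN_GAP_HZ = 200   # invented threshold

def enforce_spacing(f1, f2, f3, min_gap=MIN_GAP_HZ):
    # Push formants apart from the bottom up so no two targets sit close
    # enough together to overflow the vocal-tract math.
    f2 = max(f2, f1 + min_gap)
    f3 = max(f3, f2 + min_gap)
    return f1, f2, f3

print(enforce_spacing(700, 750, 800))   # -> (700, 900, 1100)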
I may later choose to write my own vocal tract using a completely different strategy than Chip did but as of now, I suspect that the problem with understandability isn't with the vocal tract but with what I am doing in my own program.
Bobb,
Thanks for the suggestions. I have increased the length of the pauses between words. I used to have it a lot slower but sped it up to the spacing I was using in my own speech; a longer pause will probably help, though. I tracked down the issue with 'five': it was actually saying 'frive'. The 'f' in 'five' wasn't setting formants, since 'f' is one of the cases where pure frication seems to work OK, so it was interpolating the 'r' from 'four' into the 'i' in 'five'. I have made all of the fricative sounds indicate to the glottal sounds that they should set formants before becoming guttural. I haven't made all of the glottal sounds recognize the indicator yet, but at least 'five' sounds good now.
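The 'five' fix described above is essentially a per-phoneme flag that the next sound checks. Sketched in Python with hypothetical names (the real mechanism lives in the Spin code):

# A fricative that relies on pure frication (like 'f') leaves the formants
# wherever the previous phoneme put them, so the following glottal sound
# needs to know to set its own targets instead of interpolating.
PHONEME_FLAGS = {
    "f": {"pure_frication": True},
    "s": {"pure_frication": True},
    "r": {"pure_frication": False},
}

def needs_formant_reset(prev_phoneme):
    # True when the previous phoneme never set formants of its own.
    return PHONEME_FLAGS.get(prev_phoneme, {}).get("pure_frication", False)

print(needs_formant_reset("f"))   # True: set formants before voicing starts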
StefanL38,
I haven't really looked at the waveforms of the sounds I have been creating; I have been relying pretty heavily on spectral analysis instead. Of course, I did use the waveforms to help adjust the relative volumes of the different phonemes. At first I had some of them too loud and others too quiet. I think they are about right now, though I am wondering if I should change things so that glottal volume is handled by the syllable instead of by the phoneme. In the end, the waveform would be pretty similar, so I don't know if that would be it or not.
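For the spectral-analysis side, one crude way to compare a recorded phoneme against a synthesized one is to average their spectrograms and look at where the energy peaks sit (a Python/SciPy sketch; the file names are placeholders, and peak-picking like this is only a rough stand-in for reading formants off a spectrogram):

import numpy as np
from scipy.io import wavfile
from scipy.signal import spectrogram

def strongest_bins(path, n_peaks=3):
    # Average the spectrogram of a clip and report the strongest frequency
    # bins -- a very rough proxy for where the formant energy is sitting.
    fs, data = wavfile.read(path)
    if data.ndim > 1:
        data = data.mean(axis=1)          # mix stereo down to mono
    freqs, _, S = spectrogram(data.astype(float), fs, nperseg=512)
    avg = S.mean(axis=1)
    return sorted(freqs[np.argsort(avg)[-n_peaks:]])

print(strongest_bins("recorded_ah.wav"))   # compare against...
print(strongest_bins("synth_ah.wav"))      # ...the synthesized version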
Phil,
Your synthesizer adjusts pitch. Did you find that doing so increased how easy it was for you to personally understand it? Pitch is one of the things that I have thought about changing to increase how understandable things are.
Leon and others,
Even negative comments are great. Thanks! Whenever I post the next revision, if you guys can understand things then I'll know I'm going in the right direction.
I don't know that varying pitch helped much with comprehension. But if my synthesis were better, I think it would have made it more pleasant to listen to.
My most frustrating (and still-unsolved) problem was the leading "K" sound, which others have commented on here. Hopefully, you'll be able to master it where I failed. :-)
-Phil
I could understand it! Difficult though.
A quick breakdown of what I heard on the intro, "four score and seven years ago":
It sounds like this: fourscoreandsevenyearsago. The "next" word signals are pretty much not there. Right now, it appears as though you have used a short pause. Barring some improvement in vocalization to signal that to the listener, increase that pause considerably, so that the words are distinct. IMHO, that one change would improve it for a large fraction of listeners, at the expense of natural sound. Understandable comes before natural or pleasant, in my humble opinion.
The word "four" came out as "fooor" The "f" is actually pretty good, IMHO. Rather than just use one sound for the "ou", break into distinct sounds. "Fower" would work better, as it would help to add some texture to this word. I used to use this trick with the older speech synths. Another example of this would be "Teresa", input and vocalized as, "Tereeza", which helped to enunciate the word. More texture helps with understanding. Just so you know, good vocalists maintain a set of "professional" pronunciations that are different from ordinary speech, for the purpose of emphasis and continuity when singing. This software has a similar problem, with a similar solution, if that makes any sense. By varying things this way, you can get more done with fewer sounds. (vocal music vs ordinary talking is a similar problem)
I liked the word "score", but for a greater pause. Would be nice to have some pitch variation, but a better choice on delays would make this one pop more. You might try, "Scower" too. To get what I am trying to communicate here, go listen to a man say, "door" or "listen", and then have a woman say it. The man will keep the number of distinct sounds down, using volume and emphasis to convey the word. The woman will vary pitch, but she will also add texture to the word. "Door" becomes two syllables, "Do-wer", "listen" becomes, "list-ten". I would explore that some, if it were me, perhaps at a higher overall base pitch, so that the added syllables are more acceptable, and if you do add them, be sure and increase the pause between words. That's another female trait well worth paying some good attention to. They speak more slowly, in that their breaks between words are often longer, even though they may actually vocalize the word more quickly. Gender in the vocalization here isn't important, just the elements you can steal to make it work better. Knowing those elements are there gives you options, that's all I am trying to convey here.
"and" is funky. Right now, I hear "alnd", where there is a bizzare transition between the "aa" sound, and "nd. The "nd" actually works well! Maybe extend the "aa", and do some work to get rid of that transition some" "aand", one syllable, little texture.
"Seven" more or less works. Nicely done. It's a bit heavy on "shaaven" but not bad at all! Like it.
"Years" has the same transitional problem "and" does.
All in all, I think this is a very solid effort. Do not take my post the wrong way. I have a good ear for vocalizations and thought I would share detailed impressions, which you asked for. (Way too many years doing vocal music and theater.)
A friend and I were discussing my speech synthesizer last night. We hit upon the idea of him spelling out exactly what he hears. His English-to-English translation will help me a lot in improving how understandable the synthesizer is. From our conversation last night, it seems that he isn't hearing the pauses between words either. It also sounds like he isn't hearing the 'd' sound at all. For him, "The Lord is my shepherd" was heard as "De lor esheuli setter". It will be interesting to see how things improve once I get hold of his 'translation'.
I wasn't sure if I understood it because I have those texts memorized, or if I understood it because I heard it.
You might try some unfamiliar passages of text next time to get true feedback.
My hat's off to you for picking this amazing project!
OBC
Great idea. Very interested.
Where do you get your absolute phoneme references?
As far as what it runs on, I am currently developing on the Professional Development Board. I'm just using the audio out that is built into the board.
I see what you're doing now. So is your goal to make good sounding code for all the words?
Here is the state of things now though:
west silver borrowed cheat fashion live potter epileptic breathing
worst signal, auro, chi, fashion, milk, bottle, perpetrate, breezing
was several follow she have little problem verti verti freezy
best sailru borrow s#@% pesan little bathroom perpiloted breathing
The first line is what I passed to my speech program, except that it was phonetically spelled. The following three lines are what people heard. I asked people to report how long it took them to 'translate', and they were coming up with between five and eight minutes. The original clip, which had a few sentences before these isolated words, was only 14 seconds long, so presumably they were trying pretty hard to understand what was being said.
Despite the abysmal level of understanding, it is possible to figure out which sounds or combinations of sounds people are getting. Overall, they are now mostly hearing the spaces between words, so it seems I have mostly fixed that. I'll notch it up a bit more in the next pass. Hopefully after a few rounds of doing this with the Mechanical Turk I'll have some passable speech. It will at least speed up my development effort and will ensure that the people doing my testing don't get used to its accent.
Now I am trying to implement a bandpass filter that I will pass noise through to get what I want. The going is slow, though. I recently switched from freelancing to a 9-5 job, which is putting a crimp in my fun projects like this one. I am still working on it, though.
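For reference, the "pass noise through a bandpass filter" idea looks roughly like this in a high-level language (a Python/SciPy sketch, purely illustrative and unrelated to the Propeller implementation; the sample rate and cutoffs are invented):

import numpy as np
from scipy.signal import butter, lfilter

FS = 22050   # sample rate for the sketch

def fricative_noise(low_hz, high_hz, duration_s=0.15, order=4):
    # White noise pushed through a Butterworth bandpass, giving the broad
    # band of frequencies a real fricative shows on a spectrogram.
    noise = np.random.uniform(-1.0, 1.0, int(FS * duration_s))
    b, a = butter(order, [low_hz / (FS / 2), high_hz / (FS / 2)], btype="bandpass")
    return lfilter(b, a, noise)

# Example: an 's'-like hiss concentrated well above the normal formant range.
s_hiss = fricative_noise(4000, 9000)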
If you need a bandpass filter, perhaps the info in this thread can be of some help.
-Phil
Addit: It might be due to that saying, "English is a language of 200 rules (and 2000 exceptions)."