Phoneme Extraction with Propeller - is it possible
marcwolf
Posts: 38
Hi Folks.
I am looking for a way to build a very crude speech-to-phoneme extraction system. I'm not talking about going as far as speech recognition, but more about controlling an animatronic mask that has lip servos in it.
The idea being that when the actor talks the mask can give an approximate response with the servo and make it look more lifelike than a simple open and close.
Most of the code I have seen goes the other way, i.e. taking text -> phoneme -> speech, which is fairly easy.
Any suggestions/advice appreciated.
Many thanks
Dave
Comments
Just as a thought experiment, suppose you had such a device that would recognize phonemes perfectly and control the facial "muscles". Wouldn't the mask's facial movements, then, lag the vocalization by a noticeable amount? For example, a plosive such as a leading "P" sound would not be recognized until after the vocal energy had been released, even though the original speaker had already pressed his lips together noiselessly in anticipation of pronouncing it.
I suppose, as a corrective action, one could delay the sound output to give the servos a chance to stay in sync.
-Phil
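A minimal sketch of the delay idea above, assuming the actor's voice is picked up by a microphone and replayed through a speaker in the suit. It is written in Python for desktop prototyping, not as Propeller code; the class name DelayLine and the helper delay_samples_for are hypothetical, and the 8 kHz / 100 ms figures are placeholders, not measurements.

```python
from collections import deque


def delay_samples_for(latency_ms, sample_rate_hz):
    """Number of samples needed to delay audio by latency_ms."""
    return round(latency_ms * sample_rate_hz / 1000)


class DelayLine:
    """Fixed-length FIFO: each pushed sample comes back out delay_samples later."""

    def __init__(self, delay_samples, fill=0):
        if delay_samples < 1:
            raise ValueError("delay_samples must be at least 1")
        # Pre-fill with silence so the first outputs are defined.
        self._buf = deque([fill] * delay_samples)

    def push(self, sample):
        """Store the newest sample and return the oldest one."""
        self._buf.append(sample)
        return self._buf.popleft()


# Example: 100 ms of assumed recognition-plus-servo latency at 8 kHz.
line = DelayLine(delay_samples_for(100, 8000))
```

Each incoming microphone sample would be passed to push() and the returned value sent to the speaker, so the sound leaves the suit the same amount of time after the speech as the servo movement does.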
Firstly the speed that the Propeller can process and isolate the phoneme
Secondly the speed and distance that the servo has to move. The servos I am planning to use are micro servos, directly coupled to the linkages (not cable-driven like some systems), so even a small movement of the servo should be noticeable.
What I am trying to achieve is not something that is almost lip-readable but to give an approximation that the mouth movements are 'real'. I could write a small routine that will wave the lips around when I am talking but that in no way would match anything that I am saying.
Also I would be using a small subset of the phonemes, so there would be about 5 lip sequences in total.
Many thanks for replying Phil.
Dave
-Phil
I can put the tongue between the teeth and move it forward and back, up and down.
I can raise the upper and lower lips apart and close them again (each independently).
I can widen and narrow the mouth
I can open and close the jaw.
Using a combination of these movements I should be able to do the 'oo' and 'ee'; 'f' would be more difficult, and I can do the 'th'.
If I have access to other phonemes I can make small movements of the lips to give the indication that the words are being formed and articulated rather than just a static display.
The costume is of a werewolf with a canid muzzle so I am not duplicating the full human range of motion, but I want something more than just no movement at all, or random meaningless movement.
Many thanks for your interest and questions.
Take Care
Dave
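One way to picture the movements listed above is as a small table of mouth poses, one entry per lip sequence. This is only a sketch: the axis names, the pose values, and the 1000 to 2000 microsecond pulse range (a common nominal range for hobby servos) are all assumptions that would need tuning against the real mask. It is Python for prototyping, not Propeller code.

```python
# Hypothetical servo axes, each position normalised to 0.0 .. 1.0.
AXES = ("jaw", "upper_lip", "lower_lip", "width", "tongue")

# Illustrative poses only; the numbers are guesses, not measured values.
POSES = {
    "rest": {"jaw": 0.0, "upper_lip": 0.0, "lower_lip": 0.0, "width": 0.5, "tongue": 0.0},
    "open": {"jaw": 0.8, "upper_lip": 0.5, "lower_lip": 0.5, "width": 0.5, "tongue": 0.0},
    "oo":   {"jaw": 0.3, "upper_lip": 0.4, "lower_lip": 0.4, "width": 0.0, "tongue": 0.0},
    "ee":   {"jaw": 0.2, "upper_lip": 0.6, "lower_lip": 0.6, "width": 1.0, "tongue": 0.0},
    "th":   {"jaw": 0.3, "upper_lip": 0.5, "lower_lip": 0.3, "width": 0.5, "tongue": 1.0},
}


def to_pulse_us(position, min_us=1000, max_us=2000):
    """Convert a 0..1 position to a servo pulse width in microseconds."""
    position = min(1.0, max(0.0, position))  # clamp out-of-range requests
    return round(min_us + position * (max_us - min_us))


def pose_pulses(name):
    """Pulse width per axis for a named pose; unknown names fall back to rest."""
    pose = POSES.get(name, POSES["rest"])
    return {axis: to_pulse_us(pose[axis]) for axis in AXES}
```

Falling back to the rest pose for anything unrecognised means a missed phoneme closes the mouth rather than leaving the servos in an arbitrary position.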
-Phil
This is part of a longer term system I am working on for one of my own suits (for which the Prop Backpack is an essential part too)
With many professional-style costumes, like the Predators or the Underworld characters, it takes a small army of people to control the costume. You have the actor in the suit and then several people off stage controlling the animatronics.
The idea I am working on for my own suit (on a hobbyist level) is that everything is self-contained. The suit can be preprogrammed to run through several emotional scenarios like "Angry", "Happy", "Curious", and a general one where the ears etc. are kept moving to give the illusion that it is a living entity and not a static display.
As the actor will be interacting directly with people, i.e. walking through a convention or people coming up to him and talking to him, I'd like to also add in the lip and mouth movements when he talks, to add to the realism.
As no one has chatted to a werewolf, we don't know how they would enunciate their words, but we can speculate that the approximate lip and tongue movements would be the same. So there is plenty of room for variation etc.
Since it will be difficult for the actor to see in this costume, there will be 2 video cameras set behind the character's eyes, and these would be fed into a set of stereo video glasses. One feed would go through the Prop Backpack and be overlaid with information about the suit such as battery health, currently running commands, heat of wearer and of suit, etc. Another advantage is that the cameras are sensitive to IR, so with an LDR and some IR LEDs one can make the suit see in the dark :>
Anyway - as mentioned I do special effects as a hobby.
Many thanks for the questions Phil.
Take Care
Dave
-Phil
> Hi Phil
> Many thanks for your input re the Phoneme Recognition project. I
> have been looking at your code that you used for the Goertzel
> function
>
> I just need to clarify what the overall process does.
>
> 1. Accept in some speech
> 2. Create a template which I take is a number(s) that described the
> structure of the spoken word
> 3. Compares the template against a previously stored template to
> check for matches
> 4. If a match then return word number.
>
> What I am curious about is what is stored in the template. Is it a
> number, range of numbers, and how are the numbers derived
> (Frequency vs Time). I am purchasing another Propeller to load the
> code into and have a play but I am trying to get some insight into
> what the Goertzel will return to me.
>
> Many thanks for any help.
> Dave
Dave,
The templates and samples are vectors of responses from the Goertzel
frequency bands. The samples are dilated to fit the templates. The
comparison to each template is done via a Pearson correlation test to
find the one with the highest correlation.
-Phil
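The matching step described above can be sketched roughly as follows. This is not Phil's code: it is a Python illustration that assumes each sample and template is a flat list of numbers (for example, one Goertzel band's response over time), uses linear interpolation for the "dilation", and picks the template with the highest Pearson correlation. The function names are hypothetical.

```python
import math


def dilate(seq, length):
    """Stretch or compress seq to `length` points by linear interpolation."""
    if length == 1 or len(seq) == 1:
        return [seq[0]] * length
    scale = (len(seq) - 1) / (length - 1)
    out = []
    for i in range(length):
        pos = i * scale
        lo = int(pos)
        hi = min(lo + 1, len(seq) - 1)
        frac = pos - lo
        out.append(seq[lo] * (1 - frac) + seq[hi] * frac)
    return out


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a flat signal has no defined correlation
    return sxy / math.sqrt(sxx * syy)


def best_match(sample, templates):
    """Return (name, correlation) of the template that best fits sample."""
    best_name, best_r = None, -2.0
    for name, tmpl in templates.items():
        r = pearson(dilate(sample, len(tmpl)), tmpl)
        if r > best_r:
            best_name, best_r = name, r
    return best_name, best_r
```

Because Pearson correlation ignores overall level and offset, a quietly spoken sample can still match a template recorded more loudly, as long as the shape of the response over time is similar.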
________________________________
So does the Goertzel function produce one number per word, or is it a stream of numbers representing the analysis of the word?
I'm not sure of the format that the template is in. (I'm waiting for my new Prop chip so I can run the code with Viewpoint.)
Sadly, higher maths is not one of my fortes.
Many thanks
Dave
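For readers with the same question, here is a generic sketch of how a Goertzel filter is commonly used. It is not taken from Phil's code, and the band frequencies, sample rate, and frame length are assumptions. In this arrangement each short frame of audio yields one number per frequency band, so a spoken word becomes a stream of small vectors rather than a single number.

```python
import math


def goertzel_power(samples, freq_hz, sample_rate_hz):
    """Squared magnitude of the freq_hz component in one block of samples."""
    coeff = 2.0 * math.cos(2.0 * math.pi * freq_hz / sample_rate_hz)
    s1 = s2 = 0.0
    for x in samples:
        s0 = x + coeff * s1 - s2
        s2, s1 = s1, s0
    return s1 * s1 + s2 * s2 - coeff * s1 * s2


def band_stream(samples, bands_hz, sample_rate_hz, frame_len):
    """Split samples into frames; return one band-power vector per frame."""
    frames = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        frames.append([goertzel_power(frame, f, sample_rate_hz) for f in bands_hz])
    return frames


# Example: a 1 kHz test tone, 8 kHz sampling, 20 ms frames (all assumed values).
RATE = 8000
tone = [math.sin(2 * math.pi * 1000 * n / RATE) for n in range(800)]
vectors = band_stream(tone, [500, 1000, 2000], RATE, 160)
```

Here vectors holds five frames of three numbers each, and in every frame the middle (1 kHz) band is the largest.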
-Phil
I'll just wait until I get my Prop kit and start to see the numbers that I get. Interestingly I was planning to look at using a MSGEQ7 to extract the frequency spectrum however it covers a much larger spectrum range than the spoken voice.
Many thanks
Dave
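To put rough numbers on the MSGEQ7 point: the chip's seven band centres are 63 Hz, 160 Hz, 400 Hz, 1 kHz, 2.5 kHz, 6.25 kHz and 16 kHz. The short Python check below counts how many fall inside an assumed speech band of 300 to 3400 Hz (the traditional telephone band, chosen here only as an illustration).

```python
# MSGEQ7 band centre frequencies in Hz.
MSGEQ7_CENTRES_HZ = (63, 160, 400, 1000, 2500, 6250, 16000)


def bands_in_range(centres_hz, low_hz, high_hz):
    """Return the band centres that lie within [low_hz, high_hz]."""
    return [f for f in centres_hz if low_hz <= f <= high_hz]


# Assumed speech band: 300 Hz to 3400 Hz.
speech_bands = bands_in_range(MSGEQ7_CENTRES_HZ, 300, 3400)
```

Under that assumption only three of the seven bands land in the speech range, which leaves fairly coarse information for telling mouth shapes apart.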
Since you are trying to have the mask mimic the facial and tongue movements of the person in the suit, would it not be simpler to add sensors that measure those facial movements and use the resulting signals to control the mask? Finding the appropriate sensors would be a bit of work, but there are strain gauges, resistive elastomer compounds, and other devices to do this. Tongue position would be a little trickier.
With my designs the wearer generally has a small laptop fan blowing into the face to help with the cooling. From experience, things that need to stick to the skin quickly start to detach in those conditions.
Many thanks for your comments
Dave
I can see a suit like that would not be comfortable, and I was not thinking of sticking anything to the skin. My idea was more along the lines of several narrow elastic bands across the face with sensors to measure the stretching. It would look something like a dog muzzle, but with much thinner and narrower sections.
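A rough sketch of how a stretch-sensor reading like this might be turned into a servo position, assuming each band gives a raw ADC value that is calibrated at the relaxed and fully stretched positions. The class name, the calibration numbers, and the smoothing factor are hypothetical, and the code is Python for prototyping rather than Propeller code.

```python
class StretchChannel:
    """Maps one stretch sensor's raw reading to a smoothed 0..1 position."""

    def __init__(self, raw_relaxed, raw_stretched, smoothing=0.3):
        if raw_relaxed == raw_stretched:
            raise ValueError("calibration points must differ")
        self.raw_relaxed = raw_relaxed
        self.raw_stretched = raw_stretched
        self.smoothing = smoothing  # 0..1, higher follows the sensor faster
        self.value = 0.0

    def update(self, raw):
        """Feed one raw reading; return the new smoothed position."""
        span = self.raw_stretched - self.raw_relaxed
        target = (raw - self.raw_relaxed) / span
        target = min(1.0, max(0.0, target))  # clamp readings outside calibration
        # Exponential smoothing to keep sensor jitter off the servo.
        self.value += self.smoothing * (target - self.value)
        return self.value


# Example with made-up calibration values for a jaw band.
jaw = StretchChannel(raw_relaxed=200, raw_stretched=800, smoothing=0.5)
```

The smoothed 0..1 value could then be scaled to whatever pulse range the jaw or lip servo expects.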