Phoneme Extraction with Propeller - is it possible — Parallax Forums

marcwolf Posts: 38
edited 2011-03-20 17:03 in Propeller 1
Hi Folks.
I am looking for a way to build a very crude speech-to-phoneme extraction system. I don't mean full speech recognition, just enough to control an animatronic mask that has lip servos in it.

The idea is that when the actor talks, the mask gives an approximate response with the servos, which looks more lifelike than a simple open and close.

Most of the code I have seen goes the other way, taking text -> phoneme -> speech, which is fairly easy.

Any suggestions/advice appreciated.

Many thanks
Dave

Comments

  • Phil Pilgrim (PhiPi) Posts: 23,514
    edited 2011-03-17 07:59
    Dave,

    Just as a thought experiment, suppose you had such a device that would recognize phonemes perfectly and control the facial "muscles". Wouldn't the mask's facial movements, then, lag the vocalization by a noticeable amount? For example, a plosive such as a leading "P" sound would not be recognized until after the vocal energy had been released, even though the original speaker had already pressed his lips together noiselessly in anticipation of pronouncing it.

    I suppose, as a corrective action, one could delay the sound output to give the servos a chance to stay in sync.

    -Phil
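
A minimal sketch of the "delay the sound output" idea, written in Python for readability rather than Spin. The sample rate and the 100 ms latency figure are illustrative assumptions, and `DelayLine` and `delay_in_samples` are hypothetical names, not part of any Propeller object.

```python
from collections import deque


def delay_in_samples(latency_s, sample_rate_hz):
    """Convert a recognition-plus-servo latency (seconds) to whole samples."""
    return round(latency_s * sample_rate_hz)


class DelayLine:
    """Fixed-length FIFO: every sample pushed in comes out delay_samples later."""

    def __init__(self, delay_samples, fill=0):
        if delay_samples < 0:
            raise ValueError("delay_samples must be non-negative")
        # Pre-fill with silence so the first outputs are defined.
        self._buf = deque([fill] * delay_samples)

    def push(self, sample):
        """Store the newest sample and return the oldest one."""
        self._buf.append(sample)
        return self._buf.popleft()


# Assumed figures: 8 kHz audio and 100 ms of total lag to hide.
line = DelayLine(delay_in_samples(0.1, 8000))
```

The audio path would call `push()` once per sample and play the returned value, while the undelayed samples go to the analysis stage, so the servos get a head start equal to the buffer length.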
  • marcwolf Posts: 38
    edited 2011-03-17 15:02
    Possibly. The lag would depend on several things.

    Firstly, the speed at which the Propeller can process and isolate the phoneme.

    Secondly, the speed and distance that the servo has to move. The servos I am planning to use are micro servos and would be directly coupled to the linkages (not cable-driven like some systems), so only a small movement of the servo should produce a noticeable movement.

    What I am trying to achieve is not something that is almost lip-readable, but an approximation that makes the mouth movements look 'real'. I could write a small routine that waves the lips around when I am talking, but that would in no way match anything that I am saying.

    Also, I would be using a small subset of the phonemes, so there would be about 5 lip sequences in total.

    Many thanks for replying Phil.

    Dave
  • Phil Pilgrim (PhiPi) Posts: 23,514
    edited 2011-03-17 15:45
    How much control do you have over the shape of the mouth opening? Is it possible to make it do both "oo" and "ee" shapes? How about "f" with the upper teeth resting on the lower lip? Or "th" with the tongue between the teeth?

    -Phil
  • marcwolf Posts: 38
    edited 2011-03-17 16:40
    Hi Phil
    I can put the tongue between the teeth and move it forward and back, up and down.
    I can move the upper and lower lips apart and close them again (each independently).
    I can widen and narrow the mouth.
    I can open and close the jaw.

    Using a combination of these movements I should be able to do the 'oo' and 'ee'; 'f' would be more difficult, and I can do the 'th'.

    If I have access to other phonemes I can make small movements of the lips to give the indication that the words are being formed and articulated rather than just a static display.

    The costume is of a werewolf with a canid muzzle so I am not duplicating the full human range of motion, but I want something more than just no movement at all, or random meaningless movement.

    Many thanks for your interest and questions.
    Take Care
    Dave
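
One way to organise the movements listed above is a small table of mouth poses, one per lip sequence, with a blend function so the servos ease between shapes. This is only a sketch: the axis names, the five pose labels and every number in the table are placeholders that would need tuning on the real muzzle.

```python
from dataclasses import dataclass, astuple


@dataclass(frozen=True)
class MouthPose:
    """Normalised actuator targets, each from 0.0 to 1.0 (assumed scale)."""
    jaw: float        # 0.0 closed .. 1.0 fully open
    width: float      # 0.0 narrow .. 1.0 wide
    upper_lip: float  # 0.0 down   .. 1.0 raised
    lower_lip: float  # 0.0 up     .. 1.0 lowered
    tongue: float     # 0.0 back   .. 1.0 between the teeth


# Placeholder values for five hypothetical lip sequences.
POSES = {
    "rest": MouthPose(0.0, 0.5, 0.0, 0.0, 0.0),
    "oo":   MouthPose(0.3, 0.0, 0.4, 0.4, 0.0),
    "ee":   MouthPose(0.2, 1.0, 0.5, 0.5, 0.0),
    "th":   MouthPose(0.2, 0.5, 0.3, 0.3, 1.0),
    "ah":   MouthPose(1.0, 0.5, 0.6, 0.6, 0.0),
}


def blend(a, b, t):
    """Linearly interpolate from pose a to pose b; t is clamped to 0..1."""
    t = max(0.0, min(1.0, t))
    return MouthPose(*(x + (y - x) * t for x, y in zip(astuple(a), astuple(b))))
```

Calling `blend(POSES["rest"], POSES["ah"], t)` with `t` stepped from 0 to 1 over a few servo frames would open the mouth smoothly instead of snapping to the new shape.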
  • Phil Pilgrim (PhiPi) Posts: 23,514
    edited 2011-03-17 16:57
    Are the wolf's vocalizations scripted or extemporaneous? The reason I ask is that, in the former case, a separate command track paralleling the audio track would be another -- perhaps easier -- route to take.

    -Phil
  • marcwolf Posts: 38
    edited 2011-03-17 17:39
    Sadly, real time, or as close to real time as possible.

    This is part of a longer-term system I am working on for one of my own suits (for which the Prop Backpack is an essential part too).

    Many professional-style costumes, like the Predator or the Underworld characters, require a small army of people to control them. You have the actor in the suit and then several people off stage controlling the animatronics.

    The idea I am working on for my own suit (on a hobbyist level) is that everything is self-contained. The suit can be preprogrammed to run through several emotional scenarios like "Angry", "Happy", "Curious", and a general one where the ears etc. are kept moving to give the illusion that it is a living entity and not a static display.

    As the actor will be interacting directly with people, i.e. walking through a convention or having people come up and talk to him, I'd also like to add lip and mouth movements when he talks, to add to the realism.

    As no one has chatted to a werewolf, we don't know how they would enunciate their words, but we can speculate that the approximate lip and tongue movements would be the same. So there is plenty of room for variation etc.

    Since it will be difficult for the actor to see in this costume, there will be 2 video cameras set behind the character's eyes, and these would be fed into a set of stereo video glasses. One feed would go through the Prop Backpack and be overlaid with information about the suit such as battery health, currently running commands, heat of wearer and of suit, etc. Another advantage is that the cameras are sensitive to IR, so with an LDR and some IR LEDs one can make the suit see in the dark :>

    Anyway, as mentioned, I do special effects as a hobby.

    Many thanks for the questions Phil.
    Take Care
    Dave
  • Phil Pilgrim (PhiPi) Posts: 23,514
    edited 2011-03-17 21:31
    Wow! Fascinating project! Thanks for sharing the big picture.

    -Phil
  • marcwolf Posts: 38
    edited 2011-03-19 19:58
    I sent an email to Phil and am reposting his reply here for others to use.

    > Hi Phil
    > Many thanks for your input re the Phoneme Recognition project. I
    > have been looking at the code that you used for the Goertzel
    > function.
    >
    > I just need to clarify what the overall process does.
    >
    > 1. Accept some speech.
    > 2. Create a template, which I take to be a number or numbers
    > describing the structure of the spoken word.
    > 3. Compare the template against a previously stored template to
    > check for matches.
    > 4. If there is a match, return the word number.
    >
    > What I am curious about is what is stored in the template. Is it a
    > number or a range of numbers, and how are the numbers derived
    > (frequency vs. time)? I am purchasing another Propeller to load the
    > code into and have a play, but I am trying to get some insight into
    > what the Goertzel will return to me.
    >
    > Many thanks for any help.
    > Dave

    Dave,

    The templates and samples are vectors of responses from the Goertzel
    frequency bands. The samples are dilated to fit the templates. The
    comparison to each template is done via a Pearson correlation test to
    find the one with the highest correlation.

    -Phil
    ________________________________
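
For readers who have not met the Goertzel algorithm, here is a textbook version in Python that turns one block of audio samples into a vector of band responses. It is not Phil's Propeller code; the function names are hypothetical, and the block length, sample rate and band frequencies in the usage note are assumptions chosen only to make the example concrete.

```python
import math


def goertzel_power(samples, sample_rate_hz, target_hz):
    """Squared magnitude of `samples` at the DFT bin nearest target_hz."""
    n = len(samples)
    if n == 0:
        raise ValueError("need at least one sample")
    k = round(n * target_hz / sample_rate_hz)  # nearest bin index
    coeff = 2.0 * math.cos(2.0 * math.pi * k / n)
    s_prev = s_prev2 = 0.0
    for x in samples:
        # Second-order resonator tuned to the chosen bin.
        s = x + coeff * s_prev - s_prev2
        s_prev2 = s_prev
        s_prev = s
    return s_prev * s_prev + s_prev2 * s_prev2 - coeff * s_prev * s_prev2


def band_vector(samples, sample_rate_hz, bands_hz):
    """One response per band: the 'vector' stored for a single time slice."""
    return [goertzel_power(samples, sample_rate_hz, f) for f in bands_hz]
```

With an assumed 8 kHz sample rate and 200-sample blocks, `band_vector(block, 8000, [500, 1000, 2000])` would give three numbers per 25 ms slice; a sequence of such vectors over a whole word is the kind of data a template would hold.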

  • marcwolf Posts: 38
    edited 2011-03-19 20:02
    Hi Phil

    So does the Goertzel function produce one number per word, or is it a stream of numbers representing the analysis of the word?

    I'm not sure what format the template is in. (I am waiting for my new Prop chip so I can run the code with Viewpoint.)

    Sadly, higher maths is not my forte.

    Many thanks
    Dave
  • Phil Pilgrim (PhiPi) Posts: 23,514
    edited 2011-03-20 01:57
    Each time sample contains eight numbers, corresponding to the amplitudes at eight different frequencies. The entire word sample contains multiple time samples. The time samples are dilated by copying in order to fit the template length, so there is a one-to-one correspondence between numbers in the sample and numbers in the template. That way each number in the sample can be paired with a number in the template. These pairs are subjected to a Pearson correlation coefficient analysis to determine which template best matches the sample.

    -Phil
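
The matching step described above can be sketched in a few lines of Python. This is an illustration of the idea (dilate by copying, pair the numbers, take the Pearson correlation, keep the best template), not the actual Propeller implementation; the function names are hypothetical and the nearest-index copying rule in `dilate` is an assumption about how the stretching is done.

```python
import math


def dilate(frames, target_len):
    """Stretch a list of time slices to target_len by repeating entries."""
    n = len(frames)
    if n == 0 or target_len <= 0:
        raise ValueError("frames and target_len must be non-empty/positive")
    return [frames[i * n // target_len] for i in range(target_len)]


def flatten(frames):
    """Lay the per-slice band values end to end as one long vector."""
    return [v for frame in frames for v in frame]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n == 0 or n != len(ys):
        raise ValueError("sequences must be non-empty and equal in length")
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0  # a flat vector carries no shape to correlate
    return sxy / math.sqrt(sxx * syy)


def best_match(sample_frames, templates):
    """Return (name, r) for the template that correlates best with the sample."""
    best = None
    for name, tmpl in templates.items():
        stretched = dilate(sample_frames, len(tmpl))
        r = pearson(flatten(stretched), flatten(tmpl))
        if best is None or r > best[1]:
            best = (name, r)
    return best
```

Because correlation ignores overall scale and offset, a word spoken louder or softer than the template should still match on the shape of its band responses over time.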
  • marcwolf Posts: 38
    edited 2011-03-20 02:19
    Hi Phil, thanks for that. I can start to visualise what is happening now.

    I'll just wait until I get my Prop kit and start to see the numbers that I get. Interestingly, I was planning to use an MSGEQ7 to extract the frequency spectrum; however, it covers a much wider frequency range than the spoken voice.

    Many thanks
    Dave
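
To put a rough number on that, the snippet below lists the MSGEQ7's seven band centres as given in its datasheet and filters them against an assumed speech band. The 300 to 3400 Hz limits are the classic telephone voice band, used here only as a stand-in for "the spoken voice"; `bands_within` is a hypothetical helper.

```python
# Band centre frequencies from the MSGEQ7 datasheet, in Hz.
MSGEQ7_BANDS_HZ = (63, 160, 400, 1000, 2500, 6250, 16000)


def bands_within(bands_hz, low_hz, high_hz):
    """Return the band centres that fall inside [low_hz, high_hz]."""
    return [f for f in bands_hz if low_hz <= f <= high_hz]


# Assumption: treat the telephone voice band as the range of interest.
speech_bands = bands_within(MSGEQ7_BANDS_HZ, 300, 3400)
print(speech_bands)
```

Under that assumption only three of the seven bands land inside the speech range, which is coarser than the eight Goertzel bands Phil describes.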
  • kwinn Posts: 8,697
    edited 2011-03-20 09:40
    Since you are trying to have the mask mimic the facial and tongue movements of the person in the suit, would it not be simpler to add sensors that measure those facial movements and use the resulting signals to control the mask? Finding the appropriate sensors would be a bit of work, but there are strain gauges, resistive elastomer compounds, and other devices to do this. Tongue position would be a little trickier.
  • marcwolf Posts: 38
    edited 2011-03-20 15:17
    That might work; however, wearing one of those suits is not a comfortable affair. It's hot, you're sweating, and you need to keep cool any way possible.

    With my designs the wearer generally has a small laptop fan blowing into the face to help with the cooling. From experience, things that need to stick to the skin quickly start to detach in those conditions.

    Many thanks for your comments
    Dave
  • kwinn Posts: 8,697
    edited 2011-03-20 17:03
    I can see a suit like that would not be comfortable, and I was not thinking of sticking anything to the skin. My idea was more along the lines of several narrow elastic bands across the face with sensors to measure the stretching. It would look something like a dog muzzle, but with much thinner and narrower sections.
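
If a stretch sensor on such a band were read through an ADC, turning its reading into a servo command could be as simple as a clamped linear map. The sketch below assumes a hobby servo driven by 1000 to 2000 microsecond pulses and uses made-up calibration readings; `map_sensor_to_pulse_us` is a hypothetical helper, not part of any existing library.

```python
def map_sensor_to_pulse_us(reading, relaxed, stretched, min_us=1000, max_us=2000):
    """Map a stretch reading onto a servo pulse width, clamped to the limits.

    `relaxed` and `stretched` are calibration readings taken with the face
    at rest and at full extension (assumed to be measured per wearer).
    """
    if stretched == relaxed:
        raise ValueError("calibration readings must differ")
    t = (reading - relaxed) / (stretched - relaxed)
    t = max(0.0, min(1.0, t))  # ignore readings outside the calibrated range
    return round(min_us + t * (max_us - min_us))


# Example calibration values (placeholders): 200 at rest, 600 fully stretched.
pulse = map_sensor_to_pulse_us(400, relaxed=200, stretched=600)
```

Clamping matters here because a band that slips or is tugged would otherwise drive the servo past its mechanical stops.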