If you do develop a speech recognition algorithm, you'll probably want to do some filtering, so understanding FIR filters is probably a good idea (lots of stuff on the web).
Also, the main chip company for speech recognition chips is Sensory Inc (www.sensoryinc.com). Their 4X series are for recognition. The chips are cheap, but the software is what you pay for. Looking at their chip's block diagram might give you an idea as to how to break up the cog functionality. Of course, they have a digital filtering block.
Now, the question I never got a chance to settle when filtering audio was whether it is possible to filter in real time with arbitrary sample lengths. Let me explain.
My application where I used that FIR algorithm I mentioned earlier just used 32 samples at a time....and I filtered in blocks.
So, I'd sample 32 samples (s0 through s31). Then I was applying a FIR filter of the form a*s0 + b*s1 + c*s2, and the next one would be a*s1 + b*s2 + c*s3, and so on....
So, I'm using 3 samples, shifting one sample at a time. Now, the better way for real-time work would be to do the filtering as the samples are coming in. So, you are sampling into one circular buffer. Then, maybe another cog is calculating the filter output from that buffer data and writing the results into another buffer. There may be more than one filter going at once. All you gotta do is make sure that there are enough samples in the buffers so that the filters can access the number of samples they need, and that the filters keep up with the circular buffer so the sampling doesn't overwrite a sample that is still needed by the calculations.
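For anyone who wants to try the sample-at-a-time idea, here is a minimal sketch in Python (not Spin or PASM) of a 3-tap FIR fed from a circular buffer. The class name `StreamingFIR` and the coefficient values are made up for illustration, and it runs in a single thread, so it does not model the two-cog hand-off described above.

```python
class StreamingFIR:
    """N-tap FIR filter fed one sample at a time from a circular buffer."""

    def __init__(self, coeffs):
        self.coeffs = list(coeffs)          # e.g. [a, b, c], oldest sample first
        self.buf = [0] * len(self.coeffs)   # circular buffer of recent samples
        self.head = 0                       # slot the next sample is written to
        self.count = 0                      # number of samples seen so far

    def push(self, sample):
        """Store one sample; return a filtered value once the window is full."""
        n = len(self.coeffs)
        self.buf[self.head] = sample
        self.head = (self.head + 1) % n
        self.count += 1
        if self.count < n:
            return None                     # not enough history yet
        # After the write, head points at the oldest sample in the window.
        return sum(self.coeffs[k] * self.buf[(self.head + k) % n]
                   for k in range(n))


fir = StreamingFIR([1, 2, 3])               # hypothetical a, b, c
outputs = [fir.push(s) for s in [1, 2, 3, 4]]
```

The first two calls return `None` because the filter does not yet have three samples; after that each new sample produces one output, which is the "shift one sample at a time" behavior.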
You also may want to just calculate the relative strength of the signal in a certain frequency range. This way you don't have to store the filtered values, just use them in some ongoing figure-of-merit calculation. That's what I was doing with my app. I think I did an add and a shift to a running constant or something like that (I can't remember off the top of my head).
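The exact add-and-shift formula isn't given above, so the following is only one plausible form of it: a leaky running average of the filtered magnitude, written in Python with integers. The function name `update_merit` and the default shift of 3 are assumptions.

```python
def update_merit(merit, filtered, shift=3):
    """Leaky average of |filtered| using only add, subtract and shift.

    Each call moves `merit` 1/2**shift of the way toward |filtered|,
    so no filtered samples need to be stored.
    """
    return merit + ((abs(filtered) - merit) >> shift)


merit = 0
for y in [80, 80, -80]:        # hypothetical filter outputs
    merit = update_merit(merit, y)
```

A larger shift gives a smoother but slower-reacting figure of merit.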
Another application here is lie detection. I was just watching Discovery and they were showing how speech can show if someone is lying (this was different from stress analysis; it looked at things like droops in pronoun pronunciation).
Better still is natural language understanding. That's a big feature desired in the world at the moment. We are nearing the time of the Star Trek computer where we just tell it what to do. Typing is so passé....
I just hope that instead of the blue screen my computer doesn't say "illogical" and start emitting smoke.
-Donald
Phil and webmasterpdx have touched on some key aspects that I mention and have tried to explain.
Phil - ...the time normalization step...
webmasterpdx - ...relative strength of the signal...
While there is merit in looking at specific frequencies or formants within speech, I feel that there is too much focus on the formant frequencies. Specific formants are generally characteristic of an individual person or small group of people and would be considered part of the noise mentioned in my document. Think of it in reverse with regard to Chip's speech synthesis program. By adjusting the formants F1, F2, F3, and F4 you end up creating a particular sound, but the underlying 'recipe' or pattern used is the same; only the formants you specify change. By identifying the pattern of speech, or the "relative strength of the 'entire' signal" and not just the strength at a specific frequency, we can ignore the frequency as noise and focus more on the underlying pattern.
Suppose that for a particular spoken word your sampled data comes in with a length of 250ms, and the same word already in storage has a length of 230ms because it was said just a little bit faster. Time normalization means that you make adjustments to the data by either stretching the 230ms to 250ms or compressing the 250ms down to 230ms. This step is critical to make sure that the pattern that you are looking for will align properly to the data received. If this step is skipped, you would have to say the word at EXACTLY the same rate that you sampled it at, otherwise the detection would be missed.
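As a rough illustration of the stretch/compress step, here is a Python sketch that resamples a sequence to a target length by linear interpolation. Linear interpolation is my choice for the example, not something the post specifies, and `resample_linear` is a made-up name.

```python
def resample_linear(samples, target_len):
    """Stretch or compress `samples` to `target_len` points by linear interpolation."""
    n = len(samples)
    if n == 0 or target_len <= 0:
        return []
    if n == 1 or target_len == 1:
        return [float(samples[0])] * target_len
    scale = (n - 1) / (target_len - 1)      # source step per output point
    out = []
    for i in range(target_len):
        pos = i * scale                     # fractional position in the source
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out


# A stored word of 23 frames (think 230ms) stretched to 25 frames (250ms).
stored = list(range(23))
stretched = resample_linear(stored, 25)
```

Once both versions have the same length they can be compared point by point.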
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔ Beau Schwabe
IC Layout Engineer
Parallax, Inc.
Post Edited (Beau Schwabe (Parallax)) : 8/31/2009 5:03:00 AM GMT
You're right about the time normalization (which I prefer to call "dilation" to contrast it with amplitude normalization, which I also do). The tricky part is determining where each utterance begins and ends. It's not enough just to set a threshold, apparently. What I've had to do is set a low threshold to start capturing, then measure the highest amplitude during the 1.28 sec capture interval. The span of the utterance is then taken to be the smallest interval required to contain all amplitudes (in eight channels) that are at least 6% of max. These are then expanded linearly over the entire 32-sample array.
My biggest concern initially was with compression: i.e. do you throw stuff out or average neighboring samples? But by always expanding (i.e. duplicating), you don't have to worry about information getting lost, either by removal or by mushing its neighbors.
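The endpoint-and-expand procedure above could be sketched in Python like this. It assumes each captured time step is a list of channel amplitudes and that the trimmed utterance is no longer than the output array, so frames are only ever duplicated; the name `trim_and_expand` is hypothetical.

```python
def trim_and_expand(frames, out_len=32, frac=0.06):
    """Trim to the span holding every amplitude >= frac * peak, then expand.

    frames: list of time steps, each a list of channel amplitudes.
    Returns out_len frames made by duplicating frames of the trimmed span.
    """
    if not frames:
        return []
    peak = max(max(f) for f in frames)
    cutoff = frac * peak
    active = [i for i, f in enumerate(frames) if any(a >= cutoff for a in f)]
    first, last = active[0], active[-1]
    span = frames[first:last + 1]
    # Each output slot copies the source frame at the same relative position.
    return [span[i * len(span) // out_len] for i in range(out_len)]


capture = [[0, 0], [1, 0], [50, 10], [100, 20], [2, 1], [0, 0]]
expanded = trim_and_expand(capture, out_len=4)
```

In the toy capture only the two middle frames reach 6% of the peak in any channel, so they alone are spread across the output.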
I've found that the Goertzel algo is helpful to distinguish words that have similar amplitude envelopes but that differ in their vowel content. I'm using eight frequency bands at present. I tried four, but it seems not to be enough. I'm also preemphasizing each channel by the fourth root of its center frequency. This seems to help with words like "three" that have high-frequency formants.
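For reference, a textbook Goertzel power calculation looks like the Python sketch below. The sample rate, block length and band centers are hypothetical, and `band_amplitudes` applies the fourth-root weighting to the band amplitude, which is only my reading of the pre-emphasis described above.

```python
import math


def goertzel_power(samples, sample_rate, freq):
    """Squared magnitude of the DFT bin nearest `freq` (Goertzel recurrence)."""
    n = len(samples)
    k = round(n * freq / sample_rate)       # nearest DFT bin
    coeff = 2 * math.cos(2 * math.pi * k / n)
    s1 = s2 = 0.0
    for x in samples:
        s0 = x + coeff * s1 - s2
        s2, s1 = s1, s0
    return s1 * s1 + s2 * s2 - coeff * s1 * s2


def band_amplitudes(samples, sample_rate, centers):
    """Amplitude per band, weighted by the fourth root of its center frequency."""
    result = []
    for f in centers:
        power = max(goertzel_power(samples, sample_rate, f), 0.0)
        result.append(math.sqrt(power) * f ** 0.25)
    return result


# A 1000 Hz test tone, 200 samples at 8000 Hz (hypothetical values).
tone = [math.sin(2 * math.pi * 1000 * i / 8000) for i in range(200)]
amps = band_amplitudes(tone, 8000, [500, 1000, 2000])
```

Goertzel needs only one multiply and two adds per sample per band, which is why it suits a small number of bands better than a full FFT.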
Also, rather than Cartesian distance, I'm using a Pearson correlation coefficient to rank each utterance against the trained templates (which makes me extra happy that the Prop can do 32-bit arithmetic).
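Here is a small Python sketch of ranking an utterance against templates with the Pearson correlation coefficient. The template names and values are invented, and a real Prop version would use integer arithmetic rather than floats.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0                          # a flat sequence has no correlation
    return sxy / math.sqrt(sxx * syy)


def best_match(utterance, templates):
    """Return the name of the template with the highest correlation."""
    return max(templates, key=lambda name: pearson(utterance, templates[name]))


templates = {"one": [2, 6, 4, 10], "two": [5, 2, 3, 1]}   # hypothetical
word = best_match([1, 3, 2, 5], templates)
```

Because the coefficient subtracts the mean and divides by the spread, it ignores overall loudness and offset, which plain distance does not.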
There has been some work done with nonlinear time dilation to get a better fit between each utterance and the various candidate word templates. But that seems to have given way to the hidden Markov model approach.
-Phil
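The post doesn't say which nonlinear method was meant. One well-known form of nonlinear time alignment is dynamic time warping, sketched below in Python on one-dimensional sequences; the function name `dtw_distance` is made up.

```python
def dtw_distance(a, b):
    """Dynamic time warping cost between two sequences (absolute difference)."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = best alignment cost of a[:i] against b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # a advances alone
                                 cost[i][j - 1],      # b advances alone
                                 cost[i - 1][j - 1])  # both advance
    return cost[n][m]


same_word_slower = dtw_distance([1, 2, 3], [1, 2, 2, 3])
```

Here the second sequence just lingers on the middle value, so the warped cost is zero even though the lengths differ. The table costs n*m steps and memory, which is worth weighing on a small chip.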
"My biggest concern initially was with compression: i.e. do you throw stuff out or average neighboring samples?" - yes absolutely, you average neighboring samples.
It does seem that the hidden Markov model is the most commonly used, but I'm not convinced it's the best. I think doing fast correlation between some "signature" for the word and one captured in real time is the way to go. How to do that best is the trick. By getting the envelope of the signal and storing the change times and how much it changed, you might be able to capture consonants, and the vowels by doing something with the formants.
Lots of room for experimentation and invention. Become famous, invent a new fast way that works for any voice (without training the software for a particular voice) and that is accurate. This is still not a reality at low cost. Sure you can get it to turn on an LED when you say "Computer" or something like that, but you cannot just speak and have it record the phonemes accurately in something the price of a Propeller.....for anyone's voice.
This is the first step in the "Star Trek" computer phase, which is eventually the way we'll go. The phase after that is natural language understanding, but that's beyond the scope of this problem. First get the speech recognition working...
-D
I haven't been ignoring this project, but I have been extremely busy this week and I have not gotten more than 6 hours of work in on it.
I am in a current state of confusion, but I won't let it die. I just want you to know that I will be on vacation starting Saturday morning at 2am and will not be available unless I post on my Palm late at night. I won't be able to work on my programs while I am gone, but I was just letting you know that I am not giving up. I will see you all next week!
--Micro
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔ Computers are microcontrolled.
Robots are microcontrolled. I am microcontrolled.
But you can call me micro.
If it's not Parallax then don't even bother.
I have changed my avatar so that I will no longer be confused with others who use generic avatars (and I'm more of a Prop head than a BS2 nut, anyway)
Yes, enjoy your time off! I have a feeling this challenge will still be around when you return, so relax and enjoy yourself. Wonderful how much discussion we've had due to this project- thanks OBC and Microcontrolled!
Hanno
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔ Download a free trial of ViewPort- the premier visual debugger for the Propeller
Includes full debugger, simulated instruments, fuzzy logic, and OpenCV for computer vision. Now a Parallax Product!
One part of me hopes that you will NOT succeed <grin> We already have too many computers answering the phones in large organisations! Can you imagine everyone installing a $20-50 Prop board on their phones? Or worse yet, you knock at the household door and the Prop asks you who you are, who you want to see and what you want, and then politely tells you he/she is not in (for you)!!!!
Now, on the other hand, imagine telling the robot vacuum cleaner (Judy from the Jetsons fame) that you want it/he/she to vacuum the study today.
This sounds exciting, but I have other things to do, so I wish you every success. After all, they mapped the human genome by looking at the algorithms differently.
Is Microcontrolled back yet? This challenge hasn't been claimed yet! There's still a free license to ViewPort Ultimate to the first person to demonstrate speech recognition on the Propeller... (see requirements above)
Hanno
I've been busy with other stuff but I am still working on the speech recognition. I've been gone for a week, and have had to catch up in school, so I have not had much Prop time. I am also currently busy building a security system for a friend, so I am juggling 2 projects at once.
I've had no success getting the SD card working, but because it is outside Hanno's requirements, that's ok. The one main problem is writing the voice samples to the EEPROM and then unloading and "compressing" them. I have given up on that idea, so now I have just decided to record them to internal RAM on startup and compress them to fit. Does anyone know how to compress files?
Microcontrolled- any progress? Or anyone else? My challenge, with the free ViewPort Ultimate license ends in 3 days! I'll start a new challenge then....
Hanno
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Co-author of the official Propeller Guide- available at Amazon
Developer of ViewPort, the premier visual debugger for the Propeller (read the review here), 12Blocks, the block-based programming environment
and PropScope, the multi-function USB oscilloscope/function generator/logic analyzer
I've discovered, to my great chagrin, that one is credited with the projects he finishes, not with the ones he starts. (If only it were otherwise! I'd be living it up on stock dividends by now.)
-Phil
You are correcting everyone's spelling errors except mine !
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
· Propeller Object Exchange (last Publications / Updates)
Why would assume you've made any spelling errors?
Enjoy some time off... :)
OBC
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
New to the Propeller?
Visit the: The Propeller Pages @ Warranty Void.