@OBC: Chuckle and walk away. OK, we've done that. The question is, do you keep walking when you hear them peddling their bridge to the next guy behind you on the sidewalk?
square brackets -- to hell with memory expansions, we need forum emphasis standardization.
Square brackets are standard for forums. They allow all unsafe HTML to be removed easily.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
propmod_us and propmod_1x1 are in stock. Only $30. PCB available for $5
Want to make projects and have Gadget Gangster sell them for you? propmod-us_ps_sd and propmod-1x1 are now available for use in your Gadget Gangster Projects.
Need to upload large images or movies for use in the forum? You can do so at uploader.propmodule.com for free.
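For what it's worth, the square-bracket argument above can be sketched in a few lines. This is only an illustration, not how any particular forum package does it: the function name `render_post` and the three-tag whitelist are made up for the example. The idea is to escape every bit of HTML first and only then turn known bracket tags into markup.

```python
import html
import re

# Hypothetical whitelist of emphasis tags; real forum software supports more.
ALLOWED_TAGS = ("b", "i", "u")

def render_post(raw: str) -> str:
    """Escape all HTML, then convert whitelisted [tag]...[/tag] pairs."""
    safe = html.escape(raw)  # "<script>" becomes "&lt;script&gt;"
    for tag in ALLOWED_TAGS:
        pattern = re.compile(
            r"\[%s\](.*?)\[/%s\]" % (tag, tag), re.DOTALL | re.IGNORECASE
        )
        safe = pattern.sub(r"<%s>\1</%s>" % (tag, tag), safe)
    return safe

print(render_post("[b]bold[/b] and <script>alert(1)</script>"))
```

Because nothing typed by the user can reach the output as a raw angle bracket, the only markup that survives is what the converter itself emits.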
I had the fortune to talk to a speech recognition expert a few years ago. I'm curious to see what Jim comes up with, but given the over-simplification of the problem space, I am not holding my breath.
@mctrivia -- not all forums use square brackets. Scoop sites (including dailykos) use gt/lt, as does the custom platform at metafilter.
@Nick -- reading that article it almost seems that "Dr." Jim does not know about Fourier transforms, or that the very first thing that happens when you convert to frequency domain is you trivially filter out the carrier and normalize pitch. Or that it's been known for ages that the human ear starts by converting to frequency domain. Or that nearly all serious voice recognition software does this too.
Post Edited (localroger) : 8/21/2009 5:40:49 PM GMT
I do like the observation that whispered speech carries the same important content as voiced speech. As to how much that can simplify the info, the jury is still out. And of course the matching function has yet to be implemented. That being said, it would be great if you got this to work.
Jonathan
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
lonesock
Piranha are people too.
> ... and anybody that uses Wikipedia as a source, is ready to buy that bridge.
Now that's a sick argument!
I posted the WIKI-link, to show that the modulation doesn't carry enough information (compare "a" and "o" in the second link) and you come up with that lame old ... WIKI-links-are-always-false.
Nick
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Never use force, just go for a bigger hammer!
The DIY Digital-Readout for mills, lathes etc.: YADRO
Dr. Jim's blog said...
The vocal cords provide a carrier wave for the spoken word. But when you whisper, that carrier wave is eliminated, but you can still understand speech. All you have left is merely the rush of air through the oral cavity.
Dr. Jim's blog said...
Would that not be step one in the creation of speaker-independent voice recognition, the removal of the carrier wave and the analysis of only the modulations?
This is contrasted with the current technology of voice recognition where the carrier wave is always considered. This needlessly and exponentially increases the complexity and computation power necessary for voice recognition, not to mention speaker-independence.
I'm not sure which journals Dr. Jim has been reading, but it's been well-known for decades that much of the information content in speech comes from the formants, which are independent of vocalization. For example, in the following screen shot from Chip's FFT program, I uttered "testing one two three" three times: once with a low pitch, once normally, and once with a high pitch. In this plot, the X-axis is time; the Y-axis, frequency; the relative blackness, the energy at the given frequency:
You can see the effects of pitch in the spacing of the frequency peaks, which are simply the pitch's harmonics. The formants are the dark patterned areas, whose shape over time varies independently of my pitch, even though they're interrupted by the harmonic peaks and valleys.
Unless I'm missing something, I don't see anything new or unique in Dr. Jim's analysis of — or approach to — the problem. As a consequence, I don't see anything wrong with it either. The devil, of course, is in the implementation details.
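For anyone without the FFT program handy, here is a minimal sketch of the pitch-versus-formant point, under loud assumptions: the "voice" is just an impulse train fed through a single two-pole resonator, and the sample rate, the 700 Hz formant and its 100 Hz bandwidth are invented for illustration. The DFT is a naive one so the example needs nothing beyond the standard library.

```python
import cmath
import math

FS = 8000           # sample rate in Hz (assumed)
N = 800             # 100 ms window, so DFT bins are 10 Hz apart
FORMANT_HZ = 700    # hypothetical single formant
BANDWIDTH_HZ = 100  # hypothetical formant bandwidth

def voiced(f0):
    """Impulse train at pitch f0 (must divide FS) shaped by one resonator."""
    r = math.exp(-math.pi * BANDWIDTH_HZ / FS)
    a1 = 2 * r * math.cos(2 * math.pi * FORMANT_HZ / FS)
    a2 = -r * r
    period = FS // f0
    out = []
    y1 = y2 = 0.0
    for n in range(N):
        x = 1.0 if n % period == 0 else 0.0
        y0 = x + a1 * y1 + a2 * y2
        out.append(y0)
        y2, y1 = y1, y0
    return out

def magnitude_spectrum(signal):
    """Naive DFT magnitudes for the bins below Nyquist (slow but simple)."""
    n = len(signal)
    twiddle = [cmath.exp(-2j * math.pi * m / n) for m in range(n)]
    return [
        abs(sum(s * twiddle[(k * i) % n] for i, s in enumerate(signal)))
        for k in range(n // 2)
    ]

def strongest_frequency(f0):
    """Frequency in Hz of the largest spectral peak for a given pitch."""
    spectrum = magnitude_spectrum(voiced(f0))
    peak_bin = max(range(len(spectrum)), key=spectrum.__getitem__)
    return peak_bin * FS / N

for pitch in (100, 200):
    print(pitch, "Hz pitch -> strongest component near", strongest_frequency(pitch), "Hz")
```

The harmonics land on multiples of whatever pitch is chosen, but the strongest one stays in the neighbourhood of the resonator frequency, which is the behaviour described above for the dark patterned areas.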
> As a consequence, I don't see anything wrong with it either. The devil, of course, is in the implementation details.
Unless I misunderstood Jim, he is wrong.
He is talking about modulation. I read that as "demodulated signal", the envelope of the signal, the amplitude.
And then, he is wrong. This doesn't carry enough information. What you showed is a 3D-plot of:
Frequency (y), time (x), and amplitude of a frequency component (shade of black).
After filtering (as described and repeated over and over again by Jim), he only has amplitude and time.
=> 2 components missing => FAIL!
Nick
By "demodulated signal" he means the signal stripped of the vocalization, leaving only the vowel formants and, I assume, consonant cues. His use of the term "modulation" is a little misleading. Typically, when one is referring to a voice, "modulation" pertains to the vocalization (e.g. a "well-modulated" voice), which he calls the "carrier". If you lowpass filter my diagram along the Y-axis, you can remove the vestiges of vocalization:
I believe that this is what he's referring to, but it would be nice if he provided more details so we could be sure.
-Phil
Post Edited (Phil Pilgrim (PhiPi)) : 8/21/2009 10:09:33 PM GMT
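A rough sketch of what "lowpass filtering along the Y-axis" could look like on a single spectrogram column. Everything here is assumed for illustration: the column is synthetic (a formant bump at 700 Hz multiplied by a harmonic comb for a 130 Hz pitch), the bins are 10 Hz wide, and the lowpass is a plain moving average one pitch period wide. It is not Phil's actual processing, just the simplest thing that fits the description.

```python
import math

FORMANT_HZ = 700   # hypothetical formant centre
PITCH_HZ = 130     # hypothetical pitch, i.e. the harmonic spacing
BIN_HZ = 10        # width of one frequency bin

def column(pitch_hz=PITCH_HZ):
    """One spectrogram column: formant bump times a harmonic comb."""
    values = []
    for k in range(400):  # 0 .. 3990 Hz
        f = k * BIN_HZ
        formant = 1.0 / (1.0 + ((f - FORMANT_HZ) / 150.0) ** 2)
        comb = 0.5 * (1.0 + math.cos(2 * math.pi * f / pitch_hz))
        values.append(formant * comb)
    return values

def smooth_along_frequency(values, width):
    """Centred moving average over `width` bins (shorter at the edges)."""
    half = width // 2
    out = []
    for k in range(len(values)):
        window = values[max(0, k - half):k + half + 1]
        out.append(sum(window) / len(window))
    return out

def count_peaks(values):
    """Number of strict local maxima."""
    return sum(1 for a, b, c in zip(values, values[1:], values[2:]) if a < b > c)

raw = column()
smoothed = smooth_along_frequency(raw, PITCH_HZ // BIN_HZ)
peak_hz = smoothed.index(max(smoothed)) * BIN_HZ
print("peaks before:", count_peaks(raw), "after:", count_peaks(smoothed))
print("smoothed maximum near", peak_hz, "Hz")
```

In the raw column the maximum sits on whichever harmonic happens to be nearest the formant; after smoothing, most of the harmonic ripple is gone and what remains is the broad formant shape.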
Phil, the words "carrier" and "modulation" have fairly precise meanings; it's hard to use them in "a little misleading" way if you know what the hell they mean. It seems fairly clear from the blog post that he isn't converting to frequency domain; bear in mind he's doing this on a propeller and while it's possible to do FFT on a prop it's not easy and there are a lot of different words the good Dr. would have almost certainly used had he been going that route. In particular it really sounds like he is collecting a point, not a vector, for each time sample. So he is probably doing some kind of baseline noise removal followed by looking at amplitude over time, which we have known for oh fifty years or so is totally inadequate to do speech recognition.
localroger said...
Phil, the words "carrier" and "modulation" have fairly precise meanings; it's hard to use them in "a little misleading" way if you know what the hell they mean.
I don't know about "precise". There are many different kinds of modulation; but, typically, when one signal modulates another, the higher frequency signal is considered to be the carrier, with the lower frequency signal being the modulation. This assumption pretty much goes out the door, though, when one or both signals are rich in harmonics, as is the case with human vocalization. In other words, who's the modulator, and who's the modulatee? I was only pointing out that Dr. Jim may be confusing the issue by using terms that contradict their usual vernacular meanings.
Believe me, I'm not giving Dr. Jim a pass for his lack of clarity or dearth of detail. I'm just trying to read between the lines a little to get an idea of what he's talking about. Again, as far as I can tell, it's neither novel nor completely wrong. While his empirical observations may seem groundbreaking to him, such apparent freshness may be born of nothing more than a lack of familiarity with the state of the art. But we can only wait and see, since covertness appears, at this point, to be more deliberate than accidental.
With these two quotes from Dr. Jim it's quite clear what he intends to do:
"Step 1: We have to design a circuit to filter out the carrier wave component of speech, regardless of what that is, whether the vibration of vocal cords in normal speech, or the rush of air in a whisper. This should be done before digitization of the audio input and should leave nothing but the modulation waveforms."
"Step 2: Now we digitize the raw modulation information. This is a relatively low frequency component. A one to two KSPS (thousand samples per second) digitization should be more than sufficient to yield a good tracking of the modulation information."
With this, there is no spectral information over time. And if you compare the second link I gave you (simple "modulation", or amplitude) with the spectral information, you'll see that he'll have a hard time understanding anything useful. But maybe the AI that processes the single words will fix that because it understands complete sentences and can fill in missing information.
Nick
PS: For me, it's no wonder that people make so many jokes about the M.I.T-gang. If THEY would be serious, WE would be even more serious.
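To make the objection concrete, here is a minimal sketch of the two quoted steps under one particular reading of them: a single envelope detector (rectify, then smooth) followed by sampling at 1 kSPS. That reading is an assumption, since the blog does not say what the circuit is. The 8 kHz input rate, the 10 ms moving average standing in for an analog lowpass, and the `burst` test signal are all invented for the example.

```python
import math

FS = 8000        # sample rate before decimation, in Hz (assumed)
OUT_RATE = 1000  # "one to two KSPS" from the quoted Step 2
WINDOW = 80      # 10 ms moving average standing in for the analog lowpass

def envelope(signal):
    """Full-wave rectify, smooth, then keep every (FS // OUT_RATE)-th sample."""
    rectified = [abs(s) for s in signal]
    smoothed = []
    running = 0.0
    for n, value in enumerate(rectified):
        running += value
        if n >= WINDOW:
            running -= rectified[n - WINDOW]
        smoothed.append(running / WINDOW)
    return smoothed[::FS // OUT_RATE]

def burst(tone_hz, seconds=0.2):
    """A tone whose loudness rises and falls once, like a single syllable."""
    count = int(FS * seconds)
    return [
        math.sin(math.pi * n / count) * math.sin(2 * math.pi * tone_hz * n / FS)
        for n in range(count)
    ]

low = envelope(burst(500))
high = envelope(burst(1500))
largest_gap = max(abs(a - b) for a, b in zip(low, high))
print("samples per burst:", len(low))
print("largest difference between the two envelopes:", round(largest_gap, 4))
```

Two bursts that sound nothing alike come out of this chain as almost the same list of numbers, because only amplitude over time survives. Whether that is really what Dr. Jim has in mind is exactly what the thread is arguing about.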
Nick Mueller said...
With this, there is no spectral information over time.
Dr. Jim hasn't provided enough detail to make such an assertion, Nick. His "filtering" could well consist of a bank of bandpass filters, each of which would smooth the vocalization harmonics or white noise within its passband, leaving only the average energy over time. This is "spectral information over time" (of a crude form, which is the whole point), from which the formant shapes and intensities might be inferred. But I doubt he's even interested in "formants" per se, choosing rather to let some sort of neural net extract its own "meaning" from the filtered data.
One filter with multiple outputs? Who can really cut through Dr. Jim's vagueness without more detail? I'm just trying to make sense of what he said in terms of something that could work. But neither of us can know for sure what he means without summoning ESP at this point.
He stressed the fact that he reinvented the wheel and ignores the carrier. Any carrier! Be it white noise or a single frequency (which is what he thinks the ... ummm ... forgot the word ... "strings" vibrate at). He completely forgot, or ignores, the fact that the mouth, lips etc. filter and amplify frequency ranges, and thus the "carrier" is neither white noise nor a single frequency ("carrier" as in AM). If you re-read the complete blog entry, he's just analyzing the modulation. And he says that several times.
Anyhow, we *WON'T* see the result! But he'll take the chance to sell a "special" microphone board for $99.95 and announce preliminary voice-recognition software to come along with the board to get some sales.
Next blog announcement will be about image-recognition ... I bet 2 DIP-40 propellers for 1 SMT-Prop (shipped worldwide).
Maybe you could get along with 8 bands in the telephone-transmission-range (300 Hz ... 3.4 kHz).
Nick
Nick Mueller said...
Maybe you could get along with 8 bands in the telephone-transmission-range (300 Hz ... 3.4 kHz).
That's pretty much what I'm thinking, too. I'm actually planning to write a set of filters for the Prop to try it out, after I get some docs done for a new product.
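As a starting point for that experiment, here is a desktop sketch of the 8-band idea in Python rather than Spin. The assumptions are mine: eight equal-width bands between 300 Hz and 3.4 kHz, one second-order bandpass per band (the common audio-cookbook biquad with unity gain at the band centre), an 8 kHz sample rate, and mean-square energy per 10 ms frame. A real front end might well want different band spacing or steeper filters.

```python
import math

FS = 8000   # sample rate in Hz (assumed)
FRAME = 80  # 10 ms frames, i.e. 100 energy vectors per second
BAND_EDGES = [300 + i * (3400 - 300) / 8 for i in range(9)]  # 8 equal bands

def bandpass(signal, low, high):
    """Second-order bandpass with unity gain at the band centre."""
    centre = (low + high) / 2
    w0 = 2 * math.pi * centre / FS
    q = centre / (high - low)
    alpha = math.sin(w0) / (2 * q)
    b0, b2 = alpha, -alpha                      # b1 is zero for this form
    a0, a1, a2 = 1 + alpha, -2 * math.cos(w0), 1 - alpha
    out = []
    x1 = x2 = y1 = y2 = 0.0
    for x in signal:
        y = (b0 * x + b2 * x2 - a1 * y1 - a2 * y2) / a0
        out.append(y)
        x2, x1 = x1, x
        y2, y1 = y1, y
    return out

def band_energies(signal):
    """Return one list of 8 mean-square band energies per 10 ms frame."""
    filtered = [
        bandpass(signal, lo, hi) for lo, hi in zip(BAND_EDGES, BAND_EDGES[1:])
    ]
    frames = []
    for start in range(0, len(signal) - FRAME + 1, FRAME):
        frames.append(
            [sum(v * v for v in band[start:start + FRAME]) / FRAME for band in filtered]
        )
    return frames

tone = [math.sin(2 * math.pi * 1000 * n / FS) for n in range(800)]
last = band_energies(tone)[-1]
print("band with the most energy for a 1 kHz tone:", last.index(max(last)))
```

Each frame is a vector of eight numbers rather than a single amplitude, which is the crude "spectral information over time" discussed above and the kind of data a matching stage or neural net could then be fed.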
Comments
Dr. Jim talks about voice recognition: <http://machineinteltech.com/blog/blog1.php>
Nick
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Computers are microcontrolled.
Robots are microcontrolled.
I am microcontrolled.
But you can call me micro.
If it's not Parallax then don't even bother.
I have changed my avatar so that I will no longer be confused with others who use generic avatars (and I'm more of a Prop head than a BS2 nut, anyway)
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Ummm ... no carrier when you whisper. Only modulation?
You've got to think about that!
en.wikipedia.org/wiki/Voice
And here, a spectrogram. Look how little information the modulation carries. And look at the spectrum. Much richer.
de.wikipedia.org/w/index.php?title=Datei:Spectrogram_-_mot%C3%A1ngo_mwa_basod%C3%A1.png&filetimestamp=20070423120729
Nick
Post Edited (Nick Mueller) : 8/21/2009 6:53:14 PM GMT
PG
Jim drops a little info, and the forum educates him by discussing it with more knowledge?
I would prefer it if Jim just said, "I'm a newbie and have a lot of questions about this ..."
Read my quotes again! He never uses the plural of "filter", "modulation", "digitization", or "information".
But maybe he's as lousy at writing descriptions ... er ... blob ... er ... blog entries as he is in his videos.
Nick