The problem I had was running out of bits. The heater_FFT uses 32 bit fixed point arithmetic with 12 bits right of the binary point. For large sample sets you end up adding together a lot of samples and overflowing the number range. Hence I only do an FFT over 1024 samples. Or maybe I was just looking at the problem wrongly.
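To put rough numbers on the "running out of bits" problem, here is a minimal sketch of the worst-case headroom. It assumes an unscaled FFT (no divide-by-two per stage), where an N-point transform can grow a value by a factor of up to N, and it assumes input samples whose magnitude is below `2**sample_bits`. The function name and the `sample_bits` parameter are hypothetical and are not taken from heater_fft; only the 32-bit word and the 12 fractional bits come from the post above.

```python
def max_unscaled_fft_length(sample_bits, total_bits=32, frac_bits=12):
    """Largest power-of-two FFT length whose worst-case sums still fit.

    Assumes signed fixed point: 1 sign bit, frac_bits to the right of the
    binary point, and the rest as integer magnitude bits. An unscaled
    N-point FFT can grow a value by a factor of N in the worst case.
    """
    int_bits = total_bits - 1 - frac_bits  # 19 integer bits for 32-bit words with 12 fraction bits
    if sample_bits > int_bits:
        return 0  # even a single sample does not fit
    return 2 ** (int_bits - sample_bits)


if __name__ == "__main__":
    for bits in (8, 9, 12, 16):
        print(bits, "bit samples ->", max_unscaled_fft_length(bits), "points")
```

Under these assumptions, wider input samples leave room for fewer points. A common way around the limit is to scale down by half at each butterfly stage, at the cost of some precision.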

So you are saying that P2, which uses 32 bits, will have to break up the 1 sec cough into 10 separate pieces of 100 ms each?
This could be a problem for the matching later.
It would be easier if the entire 1 sec cough can be processed in one FFT pass.

BTW, in your heater_fft assembly code, you used "mul" (which seems to be a reserved word in Propeller Tool) as a label, which prevents Propeller Tool from compiling it. How about making a separate heater_fft file for Propeller Tool?

mul is an assembly mnemonic whose opcode (%000100) was not implemented in the original P1. Other reserved mnemonics for unimplemented opcodes are muls (%000101), enc (%000110), and ones (%000111).

I can't help but wonder - I've seen a lot of references to Convolutional Neural Networks. I know that Convolution can be accelerated with an FFT, so I wonder if using the CORDIC to do an FFT would accelerate a CNN effectively?
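For anyone who has not seen the convolution/FFT connection spelled out, here is a minimal sketch of the convolution theorem: multiplying two spectra bin by bin and transforming back gives the same result as circular convolution in the time domain. A naive O(N²) DFT stands in for a real FFT here, and the sample values are made up for illustration. It says nothing about how fast a CORDIC-based FFT would be or whether it would help a CNN.

```python
import cmath
import math


def dft(x, inverse=False):
    """Naive O(N^2) discrete Fourier transform (stand-in for an FFT)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [
        sum(x[t] * cmath.exp(sign * 2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]
    return [v / n for v in out] if inverse else out


def circular_convolve_direct(a, b):
    """Circular convolution computed straight from the definition."""
    n = len(a)
    return [sum(a[j] * b[(i - j) % n] for j in range(n)) for i in range(n)]


def circular_convolve_dft(a, b):
    """Same result via the frequency domain: multiply spectra, transform back."""
    spectrum = [p * q for p, q in zip(dft(a), dft(b))]
    return [v.real for v in dft(spectrum, inverse=True)]


# Zero padding to length 8 makes the circular result equal the linear one.
signal = [1.0, 2.0, 3.0, 4.0, 0.0, 0.0, 0.0, 0.0]
kernel = [0.5, 0.25, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

if __name__ == "__main__":
    print(circular_convolve_direct(signal, kernel))
    print([round(v, 6) for v in circular_convolve_dft(signal, kernel)])
```

With a real FFT in place of the naive DFT, the frequency-domain route costs on the order of N log N operations instead of N², which is where the speed-up for long kernels comes from.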

I think spectral data (FFT output) might just make more sense for a neural network to process, because it's lower-frequency and is a distilled version of what the ear hears. We hear spectral components changing over time, not sampled waveforms.

Erna,
I think you might be confusing Fast Fourier Transform with Artificial Neural Network. I would think an FFT will always be the first stage in building a compact signature.

Certainly I'm confused, if I see how deals are interpreted, but I'm certainly not confusing. The Fourier transform transforms one period of an infinite periodic signal, which is not the case in speech. Even with a limited vocabulary and limited use of that vocabulary, you cannot avoid confusing the sound of "would" with "wouldn't". If you repeat the same phrase over and over, like "the speed of light is limited", the FFT assumes a period, and if you select "light is limited" to be the period, this gives a different spectrum from "limited is light". I believe Chip is on the right path, as speech is the controlled activation of biological structures that in a natural way allow sounds to be concatenated, which we fill with sense or nonsense, if we like.

You are wrong. FFT is a practical solution and there are many variations too.
The Fourier transform calculates the spectrum of a signal under the assumption that the signal is periodic. If you calculate over the interval from minus to plus infinity, the signal must go to zero fast enough that the integrated square of the signal is finite. If the interval is limited, the first and the last values should be close. If the signal has a limited frequency range, so that you expect a voice to contain no frequency higher than 10 kHz, all contributions in the spectrum above 10 kHz are artefacts.
The amplitude spectrum normally used doesn't give any information about where in time each frequency exists, so "so la mi do" gives the same spectrum as "do mi la so", but obviously the information is quite different. So the spectrum of "no tariffs are great" cannot be discriminated from "tariffs are not great". That's why we should be careful. Or did I just confuse "fast fury transformer" with "fast fourier transform"?

Right. It's not specific to an FFT. We see it in the case where we take reality TV for reality.
But see Chip's approach: formants. Formants are sequentially arranged to form speech. Formants result in specific spectra. These have to be identified in sound and transformed back into sequences and so words, sentences, novels... That is what ANNs do. But we call it deep learning when we don't know how learning takes place.

ErNa, I can't figure out if you're arguing for the FFT or against it. The FFT (Fast Fourier Transform) is an efficient way of implementing the DFT (Discrete Fourier Transform). The DFT works on a block of discrete samples of a continuous signal. The analog signal must be sampled at greater than twice the highest frequency to be able to reconstruct the analog signal from the sampled data.

For a single block of samples to represent the entire analog signal, the signal must be periodic and contain only frequencies that are multiples of the fundamental frequency represented by the block length. Of course, this is not the case with real-world signals, such as speech. However, the FFT can be used effectively in determining the spectrum by using small block sizes where the signal is fairly constant. The edge effects caused by blocking are reduced by applying a weighted window to the block, where the window weights are zero at the edges.
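As a small illustration of the weighted window mentioned above, here is a sketch using a Hann window, which is one common choice whose weights are zero at the block edges. The post does not say which window is meant, so the Hann shape and the function names here are assumptions for illustration only.

```python
import math


def hann_window(n):
    """Symmetric Hann window: zero at both edges, one in the middle."""
    if n < 2:
        return [1.0] * n
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]


def apply_window(block, window):
    """Weight each sample of the block by the matching window coefficient."""
    return [s * w for s, w in zip(block, window)]


# A constant block makes the taper easy to see: the edges are pulled to zero.
window = hann_window(9)
block = [1.0] * 9
windowed = apply_window(block, window)

if __name__ == "__main__":
    print([round(v, 3) for v in windowed])
```

Because the windowed block starts and ends at zero, the jump between the end of one assumed period and the start of the next disappears, which is what reduces the edge effects.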

Hello Dave, I'm arguing against using the FFT in voice recognition. The subtleties of the spectrum of a chunk of voice signal contain the information, and as the spectrum depends on many free parameters, there is no single interpretation. A simple example: a 1 Hz signal at a 1 kHz sample rate. 1024 samples are a little more than 1 s of sampling time. You will get a base frequency of a little less than 1 Hz and many harmonics. Even if you know that the signal is 1 Hz, the FFT will give 0.99x Hz. You can increase the sample rate a little and watch the harmonics disappear. That will allow you to determine the frequency exactly. But it is not realistic. The FFT makes sense for analysing an unknown signal to see the presence of a carrier or so; then you can adjust a filter to suppress or select the carrier. I believe using parametrized IRF or similar to test for the presence of structures originating from formants is more successful. Who here on the forum has personally used the FFT to solve a problem?

I've used the FFT on images to do noise reduction and inversion of point spread functions. Three decades ago I developed an audio codec based on the FFT. It used slightly overlapping blocks of data, with a trapezoidal window. I think the blocks were about 10 or 20 msec long. The audio codec was used in a video conferencing system. The FFT is also commonly used in spectrum analyzers.

In your example of 1 Hz sinewave, a 1024 block would produce harmonics. Applying a weighted window function would help reduce the harmonics. However, the resulting spectrum would not show a single defined peak at 1Hz, since the fundamental frequency with 1024 samples would be 0.977 Hz. Instead of a single peak the spectrum would be spread out over many frequencies. This would be improved by using a larger block size, such as 8192.
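The numbers in this example are easy to check with a minimal sketch. It computes only the lowest few DFT bins directly (a naive sum, not an FFT) for a 1 Hz sine sampled at 1 kHz, once with a 1024-sample block and once with a 1000-sample block that holds exactly one period. The helper names are made up for illustration.

```python
import cmath
import math


def dft_bin(x, k):
    """Single DFT bin computed straight from the definition."""
    n = len(x)
    return sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))


def sine_block(freq_hz, sample_rate_hz, n):
    """n samples of a unit-amplitude sine wave."""
    return [math.sin(2.0 * math.pi * freq_hz * t / sample_rate_hz) for t in range(n)]


def low_bin_magnitudes(freq_hz, sample_rate_hz, n, bins=6):
    """Magnitudes of the first few bins of an n-sample block."""
    x = sine_block(freq_hz, sample_rate_hz, n)
    return [abs(dft_bin(x, k)) for k in range(bins)]


SAMPLE_RATE = 1000.0
leaky = low_bin_magnitudes(1.0, SAMPLE_RATE, 1024)  # 1.024 periods in the block
clean = low_bin_magnitudes(1.0, SAMPLE_RATE, 1000)  # exactly one period

if __name__ == "__main__":
    print("bin spacing for 1024 samples:", SAMPLE_RATE / 1024, "Hz")
    print("1024 samples:", [round(m, 2) for m in leaky])
    print("1000 samples:", [round(m, 2) for m in clean])
```

With 1024 samples the bins sit at multiples of about 0.977 Hz, so the 1 Hz tone falls between bins and its energy spreads into the neighbours. With 1000 samples the tone lands exactly on bin 1 and the other bins are essentially zero, which is the "watch the harmonics disappear" effect described in the earlier post.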

OK, improvement is possible, but it doesn't solve the problem in principle. I know that the FFT can be useful, mainly when a signal from a limited source is looked at. For example, it makes perfect sense to Fourier transform a CT, because tomography folds the object and the FT unfolds the information again. Or an FTIR (Fourier transform infrared) spectrometer measures all frequencies in one take and then the FFT gives you the frequency distribution. Filtering pictures also makes sense, but today wavelets give better results. In the end, identifying a signal is always a comparison between something known and something unknown. If they correlate, we say the unknown is known. If the signal is an array of equidistant numbers, it can be seen as a vector, and representing this vector in an appropriate basis, like the orthonormal harmonic functions, is the way to go. For speech recognition, harmonics are the last choice. For noise cancellation or compression they may be useful. Physics without the Fourier transform is as unthinkable as electing a wolf to be the boss in a chicken farm. I have no better picture ;-)

For speech recognition, harmonics are the last choice.

Hi Erna,

Only harmonics will tell us the difference between a piano C and a guitar C.
So if harmonics can differentiate between a piano and a guitar, it's only logical that it can differentiate between two humans who say "hi".
Only FFT can capture and present the harmonics.

Voice recognition, speech recognition, or cough recognition. Which is it?
Whose voice is it, or what are they saying, or spraying.
There is a lot of interesting information and opinions in this thread but I think we have been misled by the thread title.

No. Piano and guitar differ in sustain, vibrato, percussion, .... And harmonics are perfectly represented by the time signal, because you are able to hear them, while looking at a spectrum you see peaks, not a sound.

I once used the FFT to analyse the sound of clay roofing tiles. The sound lasted less than 1 second. The tiles were hit with a wooden stick (selected material) and produced different sounds when they had hidden cracks, were too thick or too thin, or were burned too hot or not hot enough. The resonance frequency changed continuously due to wear in the press forms, which have to be replaced periodically.
At that time we sampled the signal at 20 kHz with 7 bit resolution, calculated a waterfall FFT and searched for the spectral components of highest energy and characteristic decay. After we found a pattern of harmonics that characterized the sound, we built an analog filter bank where the frequencies were locked and tracked, so the decay rate could be determined in real time and bad tiles could be ejected.
So I can tell: the FFT is a perfect tool to do research, to build theories in physics, to compute CT and MRT or ultrasound imaging, but very likely not for efficient voice recognition. In terms of algebra, a Fourier transform changes the basis of a vector space, where the signal is a vector, from the time basis to the harmonics basis. Tchebychev orthogonal vectors or Hermite functions could also be used. In the end, the analysis of frequencies can be used to trim a filter bank, and the characteristics of the filtered signals, that is, the data of the bank itself and the signal output, must be analyzed, e.g. by an ANN.

Physics without fourier transform is as unthinkable as...

I'm glad you said that because I was about to go off on a long spiel about how significant the Fourier Transform is to all of science and technology. Everything from my humble checking of amplifier and filter frequency responses to quantum mechanics.

I'd go as far as to say that the Fast Fourier Transform, and friends, are the most significant algorithms we ever came up with.

I'm arguing against FFT to use in voice recognition.

For speech recognition, harmonics are the last choice.

As it happens Mother Nature disagrees with you. The best, most efficient, voice recognizers we have are human, or at least mammal, so perhaps we should have a look at how that works.

Every school child learns something about how our sense of hearing works in biology class.

Sound waves from the voice or whatever we are to recognize enter the ear and vibrate the eardrum. Then there is some weird transfer of sound through some little bones, eventually entering the Cochlea. The Cochlea is a long coiled tube full of fluid. Inside are thousands of "hair" cells which vibrate with the fluid doing the actual sound sensing and feeding signals to neurons.

Turns out these hair cells are tuned to different frequencies along the length of the cochlea by virtue of their mechanical construction. The signals they input to neurons are frequency specific.

Ergo, our hearing system uses a Fourier Transform right at the input stage. Not only that it does it mechanically!

What the neurons get, our brains, is a Fourier Transform of the input sound. With that processing done up front by the cochlea and hair cells, the relatively slow processing of the neurons, your brain, can take its time to categorize that into the sounds we recognize.

Now, it seems to me that this is all pretty smart. Neurons are really slow. If they were fed with raw sound pressure amplitude signal they could never keep up with it fast enough to make anything out. Except perhaps gross amplitude estimates. Having the Cochlea do a "fast" fourier transform up front makes a lot of sense.

Now, you are right about other features of sounds that help us recognize them, attack time and so on. They of course can be measured on the frequency domain side of things just as well. Even better because in the frequency domain we can detect different tones and harmonics rising and falling simultaneously and make comparisons.

The amplitude spectrum normally used doesn't give any information about where which frequency exist in time, so "so la mi do" gives the same spectrum as "do mi la so", but obviously the information is quite different.

That is perhaps true. From a complete sample set of "so la mi do" and "do mi la so" we could expect the spectrum to look the same.

But that is not what we do. If we chop that time period up into smaller chunks we might have four separate spectra measured over "so", "la", "mi", "do" and another four for "do", "mi", "la", "so".

We now have the same four spectra but in different time order the second time. There is your information extraction.
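Here is a minimal sketch of that chopping idea. Two test signals are built from the same four tones in opposite order, each signal is split into four blocks, and the strongest DFT bin of each block is reported. The tone bin numbers are arbitrary placeholders, not real note frequencies, and the helper names are made up for illustration.

```python
import cmath
import math


def dominant_bin(block):
    """Index of the strongest bin (1 .. N/2-1) in a naive DFT of the block."""
    n = len(block)
    magnitudes = []
    for k in range(1, n // 2):
        acc = sum(block[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        magnitudes.append((abs(acc), k))
    return max(magnitudes)[1]


def tone_sequence(bins, block_len):
    """Concatenate one block per tone; each tone sits exactly on a DFT bin."""
    signal = []
    for k in bins:
        signal.extend(math.sin(2.0 * math.pi * k * t / block_len) for t in range(block_len))
    return signal


def block_spectra_order(signal, block_len):
    """Dominant bin of each consecutive block, in time order."""
    return [
        dominant_bin(signal[i:i + block_len])
        for i in range(0, len(signal), block_len)
    ]


BLOCK = 64
SO, LA, MI, DO = 12, 14, 10, 8  # placeholder bin numbers, not real pitches

first = tone_sequence([SO, LA, MI, DO], BLOCK)
second = tone_sequence([DO, MI, LA, SO], BLOCK)

if __name__ == "__main__":
    print(block_spectra_order(first, BLOCK))
    print(block_spectra_order(second, BLOCK))
```

Both signals contain the same set of tones, but the per-block results come out in a different time order, so the ordering information survives once the signal is analysed in short blocks.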

Ergo, our hearing system uses a Fourier Transform right at the input stage. Not only that it does it mechanically!

Yes and no. Our hearing system is more likely a system of damped resonators arranged in an ascending array, or descending, if you like. This system determines the amplitude of certain frequencies at a rate proportional to frequency and with a resolution inverse to frequency. A Fourier transform takes a chunk of a signal, takes some time and gives a result in one moment with a resolution of 1/number of samples.
Fourier found long ago that every periodic mathematical function can be synthesized by adding harmonics of the frequency whose period is the period of the signal.
Max Planck was very aware of this when he found: if I take a quantum of energy e0 and add multiples of energy quanta en with energy en = e0 * n, then I can describe the black body radiation. He found that this energy distribution is independent of the actual size of the value e0.

## Comments

I have almost never used the Propeller Tool so I never met that problem.

Is it so that no one else has done so either?

You know how to fix it. I'm going to leave as is.

fastspin compiles to native P2 HUB code. So does the p2gcc set of scripts for compiling C on P2.

Of course Wikipedia has a better introduction to all of this: https://en.wikipedia.org/wiki/Cochlea

https://www.embedded.com/electronics-blogs/max-unleashed-and-unfettered/4458616/XMOS---Setem-could-be-a-game-changer-for-embedded-speech