Will P2 be able to do professional Voice Identification?
william chan
So now the P2 will have ADC pins, so it will be super easy to connect microphones to any pin.
My question: is the P2 powerful enough to identify a person based on his/her voice?
Comments
However, you don't need an ADC to interface digital MEMS microphones, just a clock and a data line. I would have thought an ARM chip more suitable for this sort of thing, though, since it may need quite a lot of memory.
Not that I have any knowledge on best methods but I don't see why a recognition signature would need to be bigger than 1 KB. It only needs a true/false decision made.
Would it be better to store the voice sample in PDM or PCM for fast identification processing later?
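For what it's worth, turning PDM into PCM before storage is cheap enough that the choice may not matter much. A minimal sketch, assuming a crude 64:1 boxcar decimation (a real front end would use a CIC or FIR decimator, and the function name here is made up):

```c
// Minimal sketch (not P2-specific): crude PDM-to-PCM decimation by
// counting set bits over a 64-bit window. A real design would use a
// CIC or FIR decimator for better quality.
#include <stdint.h>

// Convert one 64-bit PDM word (64 one-bit samples) into a single
// signed 16-bit PCM sample centred on zero.
int16_t pdm64_to_pcm(uint64_t pdm_bits)
{
    int ones = 0;
    for (int i = 0; i < 64; i++)
        ones += (pdm_bits >> i) & 1u;

    // 0..64 ones maps to roughly -32736..+32736
    return (int16_t)((ones - 32) * 1023);
}
```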
I think our Alexa uploads sounds to the cloud, right?
Maybe that's a better solution needing less MPU power?
Still, Phil Pilgrim seems to have made one that worked OK even on the P1.
So, maybe easy to improve with P2.
I think someone here will find a way to do it.
It seems, though, that the recent growth of such capabilities is enabled by neural networks, which are "trained" to perform decision-making tasks, instead of "programmed" to. They don't work in an exact manner, like how a program executes. At the bottom-most level, each neural node makes sense, but nobody can make any high-level determination from that level, because the purposeful behavior emerges from too many such sub-parts to get one's head around.
Neural nets have made lots of neat things possible, but they trade exactness (which is hard to program, maybe too discriminatory) for probability (a child could train it, but it will be as errant as a child).
If you could figure out what it is about people's voices that differentiate them from each other, you could write an algorithm on the P2 to perform the discrimination, and then weigh out your determinations to arrive at a result.
Neural nets are being applied to a lot of things these days, like spell checking and predicting the end of what someone is typing into a text window. I see these things making obvious mistakes, like capitalizing letters on words because they've seen those words appear in their training data sets as parts of proper nouns. My point is, they know nothing about English, exactly, just statistical likelihoods of one thing following another.
Still, good hobby level fun, I would think.
I just wanted to add something about the neural nets. Big, complicated ones and their training are resource intensive, but once they are created, they can be pruned down some. And sometimes pruned down much more than one would think.
A while back, I was reading a paper on the numeric fidelity a net needs in order to perform. A couple of cases were shown where a robust net, say 8 bits per node, was scaled down to as little as one bit per node while retaining useful functionality!
This is being done in the mobile space. When possible, and to reduce latency and the need to exchange data with servers, parts of features are slimmed down and then optimized to run right on the mobile device.
I was in the EU a while back and got stuck with some terrible 100 MB data plan. That didn't last long.
When using Maps, I got a notice that I was running with reduced functionality. Prior to my trip, I had downloaded a cache of the region I would be in.
The reduced version omitted some place names and did not route as well as the full, big-brother edition of Maps would, but it was more than enough. Enough that I got the feeling most of the value is in the data, both cached at the data center and the real-time, or near-real-time, data collected by other users in my region.
It may be we can build nets with some of the cool tools out there now, then bit-reduce them to run just fine on a P2, perhaps stashing the data in a big external RAM or maybe even an SD card.
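That bit-reduce step can start as simple post-training quantization. A minimal sketch, assuming a single per-layer scale factor and int8 targets (array sizes and names are illustrative only); the same idea extends, with more care, toward the one-bit case mentioned above:

```c
// Minimal sketch of the "bit reduce" idea: post-training quantization
// of a trained layer's float weights down to int8, with the scale
// needed to map them back to real values.
#include <math.h>
#include <stdint.h>

// Quantize n float weights into int8 using one per-layer scale.
// Returns the scale, so that: real weight ~= q[i] * scale.
float quantize_weights_int8(const float *w, int8_t *q, int n)
{
    float max_abs = 0.0f;
    for (int i = 0; i < n; i++)
        if (fabsf(w[i]) > max_abs)
            max_abs = fabsf(w[i]);

    float scale = (max_abs > 0.0f) ? max_abs / 127.0f : 1.0f;
    for (int i = 0; i < n; i++)
        q[i] = (int8_t)lrintf(w[i] / scale);   // round to nearest int8 step

    return scale;
}
```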
a. Enroll a person's coughs by storing the FFT peaks at the loudest point and link to his user ID.
b. Use a proximity sensor to detect when a person is in front of the cough reader and when he has left.
c. Each time a user coughs but does not match, the mismatched FFT signature is stored in temporary storage.
d. Once a cough matches, all the previous mismatched FFT signatures will be enrolled as that user's additional cough templates.
e. However, if any new template happens to match another user's template, the new template will be discarded.
f. If the proximity sensor detects that a person has left without a successful match, the temporary templates will be deleted.
This allows the cough reader to learn all the possible coughs of a person.
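For illustration, here is a minimal sketch in C of the enrollment flow in steps (a) through (f). The signature layout, table sizes and the match() test are assumptions, not a finished design:

```c
// Sketch of the cough enrollment flow described in steps (a)-(f).
#include <stdbool.h>

#define PEAKS        8     /* FFT peaks kept per cough            */
#define MAX_TEMPL    16    /* templates stored per user           */
#define MAX_PENDING  8     /* unmatched coughs held while present */

typedef struct { float peak_freq[PEAKS]; } Signature;

typedef struct {
    int       user_id;
    int       count;
    Signature templ[MAX_TEMPL];
} User;

static Signature pending[MAX_PENDING];
static int       pending_count = 0;

// Assumed helper: true if two signatures are "close enough".
bool match(const Signature *a, const Signature *b);

// Called for every detected cough while the proximity sensor is active.
// Returns the matched user_id, or -1 if no user has matched yet.
int on_cough(const Signature *sig, User *users, int n_users)
{
    for (int u = 0; u < n_users; u++) {
        for (int t = 0; t < users[u].count; t++) {
            if (!match(sig, &users[u].templ[t]))
                continue;

            // (d) a match: enroll all pending signatures for this user,
            // (e) unless a pending one also matches another user.
            for (int p = 0; p < pending_count; p++) {
                bool clash = false;
                for (int v = 0; v < n_users && !clash; v++)
                    if (v != u)
                        for (int s = 0; s < users[v].count; s++)
                            if (match(&pending[p], &users[v].templ[s]))
                                clash = true;
                if (!clash && users[u].count < MAX_TEMPL)
                    users[u].templ[users[u].count++] = pending[p];
            }
            pending_count = 0;
            return users[u].user_id;
        }
    }

    // (c) no match: park this signature until the person matches or leaves.
    if (pending_count < MAX_PENDING)
        pending[pending_count++] = *sig;
    return -1;
}

// (f) Called when the proximity sensor reports the person has left.
void on_person_left(void) { pending_count = 0; }
```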
It seems your topic should have been "cough identification" rather than "Voice Identification". FFTs are well suited to processors with FPUs, such as M4 or M7 ARM chips, whereas the P2 can do FFTs, but not as well.
So let's say that to recognize a person from the cough sound, we need to find the amplitude of frequencies from 0 Hz to 20 kHz, divided into 40 discrete frequency bins of 500 Hz each (is 500 Hz too wide?).
So we prepare 40 accumulators, each accumulating the product of the sine and cosine of its frequency with the actual sample amplitudes at every 90 degrees of phase.
Finally, we sort the accumulators from highest to lowest value to get the bio-signature.
Sounds simple for the P2, right?
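For illustration, a minimal sketch of that 40-bin idea in C, assuming a 44.1 kHz sample rate and correlating against every sample rather than only the 90-degree phase points; a real P2 version would use fixed point rather than floats:

```c
// Sketch: correlate the samples against sine and cosine at each of
// 40 centre frequencies (500 Hz steps), then rank the bins by energy.
#include <stdint.h>
#include <stdlib.h>
#include <math.h>

#define BINS      40
#define BIN_HZ    500.0
#define SAMPLE_HZ 44100.0

static const double TWO_PI = 6.283185307179586;

typedef struct { int bin; double energy; } BinEnergy;

static int by_energy_desc(const void *a, const void *b)
{
    double d = ((const BinEnergy *)b)->energy - ((const BinEnergy *)a)->energy;
    return (d > 0) - (d < 0);
}

// Fill sig[] with the 40 bin indices ordered loudest-first.
void make_signature(const int16_t *samples, int n, int sig[BINS])
{
    BinEnergy e[BINS];

    for (int k = 0; k < BINS; k++) {
        double f = (k + 1) * BIN_HZ;        /* 500 Hz .. 20 kHz */
        double sin_acc = 0.0, cos_acc = 0.0;

        for (int i = 0; i < n; i++) {
            double phase = TWO_PI * f * i / SAMPLE_HZ;
            sin_acc += samples[i] * sin(phase);
            cos_acc += samples[i] * cos(phase);
        }
        e[k].bin    = k;
        e[k].energy = sin_acc * sin_acc + cos_acc * cos_acc;
    }

    qsort(e, BINS, sizeof e[0], by_energy_desc);
    for (int k = 0; k < BINS; k++)
        sig[k] = e[k].bin;
}
```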
This thread
https://forums.parallax.com/discussion/166671/p2-interpreter-comparison
already has some FFT benchmarks
My FFT comes in Spin and C versions and runs fine on the P1; google "heater_fft" and "fft_bench". The guts of the Spin version are written in assembler, but I include the PASM as part of the Spin since it's all integrated into the same source file.
The C version of heater_fft can spread the workload over 2 or 4 COGs automatically for extra performance.
If you want to do an FFT covering 0 Hz to 20 kHz you will need 40 thousand samples, and hence that many frequency bins. That's OK, the P2 has the space. But for speech, 8 or 10 K samples per second would do. And of course you need not use a whole second's worth of samples at a time. Perhaps only 100 ms worth would do.
The problem I had was running out of bits. The heater_FFT uses 32 bit fixed point arithmetic with 12 bits right of the binary point. For large sample sets you end up adding together a lot of samples and overflowing the number range. Hence I only do an FFT over 1024 samples. Or maybe I was just looking at the problem wrongly.
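One common workaround for that overflow, for what it's worth, is to scale each butterfly result down by one bit per FFT pass, so an N-point transform only grows the word length by one bit per stage. A sketch of one scaled radix-2 butterfly, assuming the same 32-bit format with 12 fractional bits (this is a generic illustration, not how heater_fft is actually written):

```c
// One radix-2 butterfly on 32-bit fixed-point values (Q19.12),
// with a 1-bit downscale applied to both outputs ("block scaling").
#include <stdint.h>

static inline void butterfly_scaled(int32_t *a_re, int32_t *a_im,
                                    int32_t *b_re, int32_t *b_im,
                                    int32_t w_re, int32_t w_im)
{
    // Twiddle multiply (b * w) in Q19.12; products taken in 64 bits.
    int32_t t_re = (int32_t)(((int64_t)*b_re * w_re - (int64_t)*b_im * w_im) >> 12);
    int32_t t_im = (int32_t)(((int64_t)*b_re * w_im + (int64_t)*b_im * w_re) >> 12);

    int32_t sum_re  = *a_re + t_re,  sum_im  = *a_im + t_im;
    int32_t diff_re = *a_re - t_re,  diff_im = *a_im - t_im;

    // The >> 1 is the per-stage scaling that keeps log2(N) passes
    // from overflowing the 32-bit accumulators.
    *a_re = sum_re  >> 1;  *a_im = sum_im  >> 1;
    *b_re = diff_re >> 1;  *b_im = diff_im >> 1;
}
```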
I'm not convinced that taking just the FFT is enough to recognize a voice.
How would one do the signature matching?
1) Take the FFT.
2) Normalize it so that the total area under the curve is 1.
3) Calculate the root-mean-squared difference between it and a normalized FFT of the reference voice.
See here for solutions like that, and others, https://www.researchgate.net/post/How_can_I_compare_the_shape_of_two_curves
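A minimal sketch of steps 1 to 3 in C (the bin count here is just an assumption):

```c
// Normalise each magnitude spectrum so it sums to 1, then take the
// RMS difference between the candidate and a stored reference.
#include <math.h>

#define N_BINS 512

// Scale the spectrum in place so the total "area" (sum of bins) is 1.
void normalize(float spec[N_BINS])
{
    float sum = 0.0f;
    for (int i = 0; i < N_BINS; i++)
        sum += spec[i];
    if (sum > 0.0f)
        for (int i = 0; i < N_BINS; i++)
            spec[i] /= sum;
}

// Root-mean-square difference between two normalised spectra;
// smaller means a closer match to the reference voice.
float rms_difference(const float a[N_BINS], const float b[N_BINS])
{
    float acc = 0.0f;
    for (int i = 0; i < N_BINS; i++) {
        float d = a[i] - b[i];
        acc += d * d;
    }
    return sqrtf(acc / N_BINS);
}
```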
However, it doesn't need training to recognize what you are saying.
I think you might be confusing the Fast Fourier Transform with an Artificial Neural Network. I would think an FFT will always be the first stage in building a compact signature.
Oddly enough all the examples in the benchmarks are interpreted at runtime from some kind of non-native instructions.
I thought I had seen a native (compiled to PASM) benchmark go past, but cannot find it easily...
Is there anything that compiles to native P2 HUB code? I don't recall seeing it go by.
Hmm... actually, the FFT above can be compiled with fcache, which means its inner loops are compiled to native P1 code that runs directly in the COG, even if the rest of the code is LMM.
and to the Discrete Cosine Transform for image compression [2].
[1] https://scialert.net/fulltext/?doi=jas.2013.465.471
[2] https://spectrum.library.concordia.ca/976081/
If you happen to have the code to either of those, in C, Spin or PASM, that would be great.
I discovered the DTT yesterday!